Video: How Fin 3x'd R&D output in 16 months — and what's next | Duration: 3769s | Summary: How Fin 3x'd R&D output in 16 months — and what's next | Chapters: Welcome and Introductions (1.82s), The AI Imperative (156.17s), Business Impact Results (360.89s), Live Sentry Fix (548.39s), Skills and Plugins (740.17s), Security Incident Response (1014.55s), Bug Investigation Process (1152.09s), Code Review Agents (1285.09s), Multi-Criteria Code Review (1376.13s), User Engagement (1586.03s), Curator App Demo (1706.38s), New Chapter (2046.72s), Shrek Code Review (2146.85s), Session Data Analysis (2297.00s), Systems vs Culture (2434.56s), Culture and Teams (2613.19s), New Chapter (2726.58s), Q&A and Closing (2824.33s), Closing Reflections (3175.99s)
Transcript for "How Fin 3x'd R&D output in 16 months — and what's next": Hello, everybody. How's it going? Thank you for joining us. Thanks for taking time out of your morning, your afternoon, wherever you are in the world. Really appreciate it. So we're here today to tell you a lot about how we've gone about getting really strong acceleration from using AI in R&D at Fin. And our intent is to be very transparent, you know, not slideware. We're going to do a live demo, touch wood. There are no outages in progress or anything like that. We'll do our best. And, yeah, before we get started, a quick introduction to who you're listening to today. So let's see. You've got me, Darragh, CTO here at Fin, Brian Scanlan, senior principal engineer, and Kesha, principal engineer. Both of them work on team 2x, which is our team responsible for all the tooling and sociotechnical change around leveraging AI at Fin, or Intercom. We're being renamed, and it's still confusing. And, you know, hopefully you're familiar with Fin and what we do. Our mission is to make perfect customer experiences possible, and we do so with our AI agents and AI operator and our wonderful AI-powered help desk, Intercom. Also joining a little later today, because I made a bit of a scheduling whoopsie, is Claire Vo, who I'm sure needs very little introduction: a wonderful person and community contributor, you know, from her How I AI podcast, which I'm sure you're all familiar with, and also a pretty capable and accomplished founder and chief technical officer, chief product and technical officer, through a number of companies. She is joining us, and what we hope is that she'll pull us out of our echo chamber and challenge us with all of the perspectives she's got from talking to dozens of leaders and companies that are doing similar types of technical change to us.
So, yeah, look forward to her joining a little later. And a quick spoiler and plug for one of the things Claire is working on at the moment, which is a new company called cxo.dev. Basically that's an effort she's spinning up to help companies go on this journey: again, bringing her wisdom and perspective as a hands-on leader and practitioner, and also an expert in this field, to help other companies on this journey. So be sure to check that out. We'll have a Q&A at the end. We really want this to be as open as possible. You know, there are no secrets. Ask us anything, ask us the spiciest questions, etcetera. Maybe just to seed that: we'll work behind the scenes to try to curate and surface the best questions. But I encourage you all to just dump in the question or questions on your mind: what brought you here, what do you hope to get out of it, what's the biggest unknown or thing you'd like to find out. And, you know, it'll help us do as good a job as possible of curating and answering the best ones. I'm sure you're here for a reason, so please surface that there. There'll be opportunity at the end as well. So I'll try to keep the preamble really quick, because the thing we really wanna anchor this around is showing you the system, showing you what we've built, etcetera. But the basic premise, the thing that fueled all of the work we've done here, is this realization that, you know, we've roughly been in a steady state in the industry. Of course, people and teams get an edge on one another and try to optimize how they build software, and it's always been the case that speed is a massive advantage. AI has dramatically changed this, and the ceiling has risen far, far higher than you can probably comprehend.
And if you just stay still or drag your heels, you're getting left dramatically behind relative to your potential and relative to your competitors, be they companies or peers, etcetera. So there's a real imperative. The reason we're sharing all this is, you know, we realize it's existential for us, and we want other people and companies to thrive together. We want this to be a time of abundance, etcetera. And it's useful for us to share publicly because we get people who challenge us and provide other points of view, all fundamentally helping us reach as close to our ceiling as possible. You know, we really must fight here. Like, I see there's such a trap. There are so many gifts landing on our lap that make our jobs easier. Even by accident, you're just gonna be better and faster at your jobs. And the trap is sort of assuming, cool, that's it. I've got the power-ups. Let's go. Nothing else to go do. Again, you know, that ceiling is far higher. You realize you must really fight to be anywhere near it. And we don't assume that we're anywhere near it. We're not stopping. We're keeping fighting. Okay. So we shared a blog post that shared the impact this has had on our business. I'm not gonna go into this in depth. I'll share a link later. You know, the hook here was that we 3x'd our throughput per person over this time period. We're quite happy with that. We don't think we're anywhere near done. It wasn't just about the raw metric. We care about a lot of other things. We've seen this have a dramatic impact on quality, on the rate of product changes, on the time to make product changes (we get more done, and quicker), on the amount of downtime and breaking changes, and on just how dramatically it has changed how people work. More than 93% of PRs are agent first.
And, you know, you don't have to go back that many weeks or months to where that was, like, 20% or 30% or even 5%. It's not that long ago in relative terms. And just one thing which I think is a really important view, and we hear it all the time: you'll hear no shortage of people sharing selective anecdotes of how a couple of people on their team have had a 20x improvement in their throughput. Of course, we see that too. And, you know, we have charts, log scale on both axes, and there are people who are just night and day, on an absolute tear. But we're not managing just one or two people, we're managing hundreds, and the goal is pulling up the aggregate dramatically and pulling up the floor. And I think that's fundamentally one of the biggest leadership challenges here. You know, it's not about the anecdotes, not the edges. It's about the view of the whole system. Okay. So just one more bit of preamble here. Again, when you think of Claude Code or whatever, it's natural to think, oh, that makes the job of writing the code a lot easier. But the job of an engineer is way broader than that. And I think it's important, and certainly our vision for this agentic system is that the agents or the agentic system can do everything a senior engineer can do with XA and can consistently do that at or beyond the standard of a senior engineer. So it's not about selective little bits of the workflow. We're trying to attack the whole thing. And, you know, the impact is far broader than just shipping PRs or writing code when you need to write code. Okay. With all that in mind, cross your fingers. We're gonna do a live demo. I'm gonna pass to Brian Scanlan, and I'll tag team a little bit as we go through this bit. Awesome. So I'm gonna do my best here live. So, yeah, this is 100% live.
There's no editing. Anytime I've done this before, there's usually been some sort of get-out-of-jail-free clause, but not here. But we have actually prepped something. So I'm gonna fix a real Sentry issue. So what we were doing here: I've opened up Claude Code, and we are picking a Sentry issue. Sentry is a tool we use for exception tracking. Myself and Kesha Mykhailov sat down earlier on today, and Kesha Mykhailov found something that looks relatively simple, something that probably shouldn't be an exception, and that we think even Claude could get right. And, you know, we've done a few practice runs at this, but who knows what's gonna happen? So there's nothing that interesting going on right now, but you can already start to see it invoke some of our internal skills. Very rapidly, you can see here the developer tools investigate skill being invoked by Claude Code. This is good. This is as planned. We populate our Claude Code setup with dozens, maybe hundreds, depending on the configuration, of various skills that try to break the work down into individual, repeatable little tasks, skills that can ideally do the work, kinda like Darragh was saying, at the standard of a senior engineer or above. And so rather than having, say, one God-type skill, you know, one skill that tries to do absolutely everything and does all of the work from start to finish, it's more like, let's find the discrete functions and get them done to a very high standard. And by that, we mean do things like write evals, or have a great feedback loop to be able to figure out that it's doing all the right things.
And these kind of units that are testable, we think that looks completely critical if we're gonna be assembling all of these discrete skills together to do what is kinda complex work, the work that a senior engineer should be able to do. And so here, yeah, things are working well from the very start. The investigate skill has been invoked. I didn't tell Claude Code, like, hey, we've got this investigate skill. It just figured out the right skill to invoke there. The next thing we see now as well is it's kind of just invoking all of the observability skills, or making sure they're there, they're ready to go, just in case we need to query Snowflake, where we keep our logs, or Honeycomb, where we keep our high cardinality data, or Datadog, where we keep our custom metrics and infrastructure metrics. And so, again, it's kinda doing the right thing here. Okay. It's kind of pausing at this point, or it's whirring away. Kesha, anything else that's interesting to say about this at this point? No, no. I'm just glad it worked as expected, and it figured out that we need to invoke that investigate skill. And, yeah, I kinda like the fact that it prewarms all the observability tools and is probably figuring out which MCPs it's gonna be using. And, yeah, hopefully you're authenticated in all of them. So let's see what's gonna happen next. I am. But it seems to be having a bit of a break at the moment. Let's see. I guess we could talk about our skills in general. So I had a tweet thread a couple of months ago that went kind of viral. And at the time, I think we had, like, hundreds of skills across dozens of plugins or something like that. I think that's increased. So I'm gonna share my screen here just to hop into our Claude plugins directory. And so yeah.
This is our folder. This is where we throw everything. And the first thing that's worth noting is that we have a lot of plugins. This is like a custom marketplace that we ship out to everyone's laptop. Everyone at Intercom is configured to get this out of the box when they invoke Claude Code. We have some foundational plugins. This base plugin is a critical, load-bearing part of our setup, and it makes sure that we've got the telemetry that we need and that we've got some basic safety hooks that we would want. We don't want Claude Code going too wild in our AWS environment. Two of the most interesting things that we do, or certainly the things that we found earliest that are useful for us, is to record all session data, or, like, all session events. So we use hooks in Claude Code to post events to Honeycomb. And that allows us, but also anyone in Intercom, to track things like session use and look at their own use. But, more importantly, if you're one of the dozens or hundreds of developers of skills across Intercom, you can use this dataset to hop in and see who's using your stuff, and maybe follow up, do some good product management, and ask them how things are going and things like that. So this plugin basically does very little. It's minimalist. It's only got a handful of skills and hooks and stuff like that. It's basically just there to make sure that everyone's got a sound configuration. And then we would have, say, developer tools, which just happens to be the name of where we put most of our highest quality or well-used individual skills. And again, these are all small and well contained. There might be a specific skill around Datadog use where, you know, we've got a bunch of things that are specific to our own use of Datadog.
And so we kind of encode this in this skill to make sure that when we are using Datadog, or when Claude Code is using Datadog, it's getting it right first time. It's not just using general knowledge of how to use Datadog. We're kinda reinforcing it with, like, oh, yeah, here are the kind of things that you need to think about or need to be aware of in our environment, and getting things right there. So, you know, chances are your agent might figure this stuff out eventually, or might ask you some questions about it and get through it. But these skills that go along with our MCP usage get us to the point where, you know, it's just getting things right first time. Again, it's like, what would I expect a senior engineer to do? You should be able to use Datadog and figure out exactly where the right data is at the right time. I'm just gonna switch back. I'm gonna see what's going on in this session. Is it actually doing anything? Sorry. Yeah, it's whirring away. It now understands the bug fully. Right. Let's keep talking about skills. Yeah. And, to interrupt you, I was just about to say, I think the first time you do something in the session and one of the skills that you've never used before, that you didn't even know existed, kicks in, you're like, oh my god, this is great. It's such a wow effect, where you get something for free which you haven't been told about by anyone. Yeah. It's just amazing. So it's really important to have this shared marketplace that's shipped automatically to everyone's laptops. Yeah. My favorite story there is I was paged into a security event recently, and it turned out to not be much of a security event. So somebody had accidentally published to a public GitHub repo some Snowflake metadata, like schema information. So not a big deal, not exactly desirable, but, you know, they noticed it. They opened a security event.
Everything went well. And I was paged into the incident, and I joined the Slack channel. And I just habitually opened up Claude and pushed the Slack channel URL into my Claude. And then I tabbed back over to the incident channel. I was taking a look at the information that was uploaded, and I had a kinda vague recollection that we have runbooks and procedures for these kind of things, and data classification guidelines. And I know they're kind of out there. And while I was pondering this and worrying about the waste of time I was gonna be spending over the next twenty to thirty minutes, I got a notification that Claude was ready and that it had figured something out. I was like, oh, let's see what Claude is on. Turned out, yeah, one of our engineers, Norm, had written a skill for this exact situation: basically, classify a data breach and recommend next steps. And I didn't know it existed. In fact, when I asked Norm about it, he'd forgotten he'd even written it. And, yeah, it came out with, like, perfect analysis based on our internal documentation, our runbooks. You know, it downloaded the files and analyzed them and gave us, like, oh, here are next steps. You know, it wasn't a big deal. Just basically delete the files, notify people, and move on. And it was so great. This was gonna be twenty to thirty minutes of tedious work, and suddenly it just turned into, like, nothing on my own side. So, yeah, Claude is still whirring away in the background. Sorry. It's running some tests. Maybe I'll switch back over to it. So it has been doing things. Let's see what it's been doing. Has it been invoking any other kind of sessions? It has been doing a lot of stuff. Let's see. So it's been looking around. It's kind of browsing around our code base.
It's looking... oh, it's like not noticing that stuff's on disk. I'm kinda confusing this. Maybe I've sabotaged my git setup or something like that. And it is kinda getting there eventually. Let's see. Git show. Blah blah blah. Yeah. I think it's just getting confused a little bit, but it is reading and getting there. Okay. It's checking out master. And so, okay, it's getting to an answer here. And let's see. What kind of answer is it getting to? The bug here is that we have an app lookup, which is our tenant ID, like our customer ID effectively. When it's blank, this internal admin page throws an error, throws an exception, which isn't the wrong thing for an uncaught exception, but obviously isn't a great user experience. There are a few different ways to solve this, depending on what mood Claude is in. But we can take a look at exactly which way it's gonna be. Right now, it's just running some tests. It hasn't invoked any other skills or anything else interesting right now. Let's see. Oh, it's running code review. Yeah. So it's written some code. It's linting, and it's running our internal code review skill. And, hopefully, pretty soon here, we should have a pull request being opened. Yeah. We have a pull request. Kesha, talk us through the pull request. Yeah. Can you call it up on screen? Do we have it? Do we have the link? Yeah. It should have it here. Yeah. So we have a skill to create a pull request, and that skill also instructs it to review the code first. So we're trying to do as much code review as possible locally, just to surface all the issues we can before even creating the pull request. I think it did the right thing. Yeah. That's the PR. And is there a little message down there? Let's see. Okay. So it's not catching the exception. It's just, like, flashing up the error. Basically, it's. catching that. So went for that. That's went.
for that file today. Yeah. Anything else he has here today? It's still running. It's still whirring away. Yeah. So we expect our auto review and approval agent to kick in soon. We have an agent that reviews the code against several different criteria. Yeah. There we go. Uh-huh. Shrek is busy. That's the name of the agent. And, I guess, to address the elephant in the room: why is the name Shrek? So I'm not sure if you folks have noticed this in your organizations, but probably over the last year, you've had tons of different agents and bots operating on pull requests, posting different comments. And to me, it just creates this fatigue where at a certain point, you just stop noticing those things. You stop paying attention. Like, whatever, some bot posted something. I'm not gonna care. I have some work to do. So we were trying to come up with something that is, you know, in your face. This is Shrek, and you have to pay attention, and you'd better be on top of those comments. Just to make it eye-catching and memorable and fight that fatigue from the agents involved. So what it's doing now: it obviously reacted to the pull request creation and is now trying to review the code. And, hopefully, within a few minutes, we're gonna see the summary. Hopefully, we'll get the approval. And when we get the approval, we'll be able to merge this pull request straight away. And I think the work on this whole flow of getting PRs auto-approved by the agents started a long time ago, probably more than a year ago, before we even had any agents at all, with the idea that we wanted to de-risk this whole thing with compliance, basically, so that we're still compliant with various different, I guess, standards and policies. So the initial version of this was that we had a deterministic rule-based agent. It's not even an agent.
It's just a script, basically: if you're, say, editing a markdown file or some text file, which is totally safe, then you'd get an auto approval. And then we kind of tested the waters with our audit folks to make sure that we figure out a path to fully agentic auto approvals. Okay. I think we got something there, Brian. Oh, yeah. Right? Hi, Shrek. Shrek really likes this. So what are the different criteria? Like, how can we break these out separately? Why isn't it just, like, looks good to me? Yeah. Yeah. I think if you think about the pull request review process, it's basically many different jobs that you do and you don't even think about it. But if you are diligent and you're trying to do a good job, then you probably start with a problem. Sorry, with the pull request description and title. Right? You wanna see, are we even solving the right problem? What is the problem that we're solving? What is the intent behind the change? And then when that makes sense, maybe you move to the diff itself and you check it against some practices and anti-patterns. Maybe you evaluate some, you know, safety concerns, and whether it's even logically correct and aligned, basically, with the problem statement. Are we solving the right thing in the diff or not? And there are dozens and dozens of different little subjobs that you do in a review of a pull request. Maybe you notice, like, oh, there's a new SQL query that's gonna be performed. Are we gonna pull the app down? And there are different concerns. And sometimes you can't even do this kind of holistic, diligent review with just one single human. Right? You need experts from various different areas to look at your pull request at the same time and assess it from a certain angle. So that is what we are trying to simulate here.
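The first, rule-based version of auto approval mentioned above is easy to picture as code. What follows is only a minimal sketch under stated assumptions: the function name `is_auto_approvable` and the exact extension allow-list are hypothetical, since the talk only says that markdown and text files counted as safe.

```python
from pathlib import PurePosixPath

# Hypothetical allow-list: the talk only mentions markdown and text files.
SAFE_EXTENSIONS = {".md", ".txt"}


def is_auto_approvable(changed_files: list[str]) -> bool:
    """Return True only when every changed file has an allow-listed extension."""
    if not changed_files:
        # An empty change set is unusual; leave it to a human reviewer.
        return False
    return all(
        PurePosixPath(path).suffix.lower() in SAFE_EXTENSIONS
        for path in changed_files
    )
```

A rule like this is fully deterministic, which is presumably what made it a comfortable starting point for the audit conversation before any model-based review existed.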
Basically, every criterion is like a separate sub agent that assesses your pull request against that criterion. And then, ultimately, in the end, the final verdict says whether it's good to go or not. And there's one deterministic check, which is the size of the pull request, you know, which I think is really important. So regardless of what the sub agents' outcomes are, if the pull request is larger than a certain number of lines of code, and at the moment, I think for this repo, we have it at 150 lines of code, so fairly small changes, then you're not gonna get an auto approval. And the reason for that is that we wanna create an incentive within our org to continue shipping small incremental things, which are low risk, and you ship them to production one by one and minimize the risk of breaking something or hitting multiple unknown unknowns at the same time. Yeah. So I think you expanded the detailed summary from the agent. Right? From the sub agent that does the problem statement. I guess there's not too much of interest here. I mean, logical correctness, there's a bit of information here. Do you find that people read these? Like, when do people need to go deeper? Yeah. I think we encourage folks to engage with this and read through. I guess that especially makes sense when Shrek surfaced something and you disagree with it. Right? You think, oh, now Shrek got it wrong. So then you have this internal desire to correct it. So you go deeper, trying to understand what the actual rationale of the bot was at that point in time. And then, hopefully, you'll continue the flow by improving the prompts and the guidance, and that's where the curation comes into place.
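The verdict logic described above, one result per criterion plus a deterministic size gate, can be sketched as follows. This is a hedged illustration, not Shrek's actual implementation: the names `CriterionResult` and `final_verdict` are hypothetical, and the only detail taken from the talk is the 150-line threshold for this repo and the rule that size blocks auto approval regardless of the sub agents' outcomes.

```python
from dataclasses import dataclass

# Threshold mentioned in the talk for this repo; other repos may differ.
MAX_AUTO_APPROVE_LINES = 150


@dataclass(frozen=True)
class CriterionResult:
    """Outcome of one sub agent (hypothetical shape)."""
    name: str      # e.g. "safety", "description_quality", "logical_correctness"
    passed: bool
    summary: str


def final_verdict(
    results: list[CriterionResult], changed_lines: int
) -> tuple[bool, list[str]]:
    """Return (approved, reasons). Any reason at all blocks auto approval."""
    reasons: list[str] = []
    # Deterministic gate: applies no matter what the sub agents concluded.
    if changed_lines > MAX_AUTO_APPROVE_LINES:
        reasons.append(
            f"diff too large: {changed_lines} > {MAX_AUTO_APPROVE_LINES} lines"
        )
    # Never approve when nothing was actually evaluated.
    if not results:
        reasons.append("no criteria were evaluated")
    reasons.extend(f"{r.name}: {r.summary}" for r in results if not r.passed)
    return (not reasons, reasons)
```

Returning the reasons alongside the boolean mirrors the point made in the talk: people mostly read the detail when they disagree with the verdict, so the rationale needs to be available per criterion.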
And I don't know if you wanna talk about this now or a bit later. What do you think? Go first. And just while I'm doing that, let's welcome Claire to the stage. Thanks for joining, Claire. We're halfway through our live demo. It hasn't completely bombed out on us yet. Feel free to jump in with questions or additional context as we go through this part. Oh, you're muted. Cool. And maybe you need to press... yeah. I had to come on stage. Very welcome. Yay. I'm here on stage. Oh, you guys, this is the solution. People didn't realize this. How do we do code review for all these PRs? Claire shows up just at the point where we have to read them, and I do them one by one manually. It's all me behind the scenes. Auto-Claire. Exactly. Cool. Brian, would you mind stopping sharing your screen, and then I can pop mine in? Okay. Great. Thanks. I'm gonna share. Yep. Okay. So this is the curator app for Shrek. I guess the important bit about any agent, like, if you're building any agent with AI, the harsh truth is that the agent by itself almost doesn't matter at all. What matters is your ability to iterate on the prompts and assess the quality of that agent. And that's, I guess, where the biggest chunk of work is done. So this is the curator app, where this live tab shows all the pull requests that are being created live for our main Ruby on Rails monolith repository, called Intercom. We see the percentage of approval, and we split it down by criterion. For example, you can see that only 36% of pull requests are considered safe. So we are quite strict on safety at the moment, and I'll explain later why. And then description quality is mostly okay. Alignment of the problem statement with the diff is fine, and logical correctness is at about 61%. So the idea with this app is that, let's say you got a review and you disagree. Right? The bot got it wrong.
Shrek gave you some feedback and you disagree. Then you're able to find that pull request of yours and say, oh, you know what? I disagree with you. I think this description is actually okay, because in this case, Shrek said that it's a bad description. So you can add this pull request to a dataset, and then you can go back to the dataset. And, let's say, we find that pull request again, and then you can curate the ground truth. So, basically, you're telling the system what the expected outcome of the bot assessment is. Maybe I'll say, okay, you know what? This is actually safe. I think this is okay. And then you can establish description quality. Oh, I disagree with the bot. Like, the description was excellent here, which, by the way, it wasn't quite, but whatever. I'm just fooling around here. And you establish logical correctness. Well, in this case, Shrek did find a few issues. Maybe you agree or disagree. Basically, you curate the ground truth for that specific pull request. And when the ground truth is curated, you can have that pull request as part of your dataset to run batch evaluations, and use that pull request with the curated ground truth as basically one test, and validate that anytime you run the bot over and over on this pull request, you get the expected outcome. And then when you have enough of those pull requests with ground truth curated, you have a sufficiently large dataset, and you can do those batch evaluations and compare changes to the agent or the prompts or the models independently, just to make sure that you're not degrading the quality. So I'm gonna show you a few comparisons that I have in here. And, by the way, we can see they're all labeled inconclusive. The reason for that is that we still need to curate more of those pull requests.
So, unfortunately, LLMs are nondeterministic, so anytime you run the assessment, or anytime you run the judge on top of that assessment to figure out whether the assessment aligns with the ground truth or not, you need a lot of data to make sure that you fight that noise, because all of those assessments and judgments are kind of nondeterministic. So sometimes you might get one outcome, and the other day you get a different one. So I think out of the recent comparisons, only one was conclusive, in that we detected degradation. So here, what I was trying to do: I was testing out Opus 4.7 against the dataset that we have for this repository. And for two sub agents, the safety sub agent and the problem statement sub agent, we got a degradation. So even though we have some intrinsic noise, and you have to calibrate this because every criterion will have a different scale of that noise, for these two there was a degradation, so we couldn't roll out Opus 4.7 for those two sub agents. I had to close that pull request. So, yeah, I guess this is the important bit. You have to set up this flywheel, basically. You have to create an incentive for folks to engage with the system. Like, hey, I didn't get a good assessment here, so maybe I should tweak the prompts or tweak the agent or figure out where the problem is. You encourage them to curate that ground truth and add more and more pull requests to your dataset, so that ultimately, with time, you have a high quality dataset that you can validate your agent against, and then you can basically iterate on it safely. Otherwise, you're just making changes and you don't know whether they're good or bad, whether the quality is the same or not. So this is the whole idea.
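The comparison step described above, running a baseline and a candidate configuration over the curated dataset and labeling the result, might look something like the sketch below. Everything here is an assumption made for illustration: the function names, the fixed `noise_margin`, and the `min_samples` cutoff are hypothetical, and the talk does not say how the curator app actually calibrates noise per criterion or decides that a comparison is inconclusive.

```python
def agreement_rate(verdicts: list[bool], ground_truth: list[bool]) -> float:
    """Fraction of pull requests where the bot's verdict matches the curated truth."""
    if not ground_truth or len(verdicts) != len(ground_truth):
        raise ValueError("verdicts and ground_truth must be non-empty and equal length")
    matches = sum(v == g for v, g in zip(verdicts, ground_truth))
    return matches / len(ground_truth)


def compare_runs(
    baseline: list[bool],
    candidate: list[bool],
    ground_truth: list[bool],
    noise_margin: float = 0.05,  # assumed per-criterion noise band
    min_samples: int = 30,       # assumed minimum dataset size
) -> str:
    """Label a candidate as 'improved', 'degraded', or 'inconclusive' for one criterion."""
    if len(ground_truth) < min_samples:
        # Too little curated data to separate signal from LLM noise.
        return "inconclusive"
    delta = agreement_rate(candidate, ground_truth) - agreement_rate(
        baseline, ground_truth
    )
    if delta > noise_margin:
        return "improved"
    if delta < -noise_margin:
        return "degraded"
    return "inconclusive"
```

With a small dataset or a small difference, this sketch returns "inconclusive", which matches what the demo showed for most comparisons. A more careful version would estimate the noise band empirically, for example by running the same configuration several times per criterion.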
And this is by far the most complicated and time-consuming stage of building an agent: figuring out how to set this up, how to create this dataset with the ground truth curated, how to run those batch evaluations. And the agent itself, look, it's fairly simplistic. There isn't any secret sauce there. One of the benefits of getting to hop into this is I get to skip the Q&A queue. So two observations for folks that are paying attention. One, I see this a lot in evals in general, which is that you literally just have to look. You have to look, and you have to manually tag good, bad, left, right. This is what we saw. There's no getting around somebody who knows what they're doing actually looking at the outputs of an agent, whether it's an internal coding agent or it's something customer facing. Like, look through a thousand of these somehow and build that dataset. It's just a tried and true evals methodology. My other question, since we see so much inconclusive here, is: other than models, what are you testing that you feel moves the needle, good or bad, the most? Yeah. Good question. So one of the things that I tested recently was for the problem statement sub agent. Let me see if I can find that. And, yeah, I think this is Marvin's change. Unfortunately, it was inconclusive as well. But the problem there was that in our create PR skill, we have this thing which adds an implementation plan that Claude came up with as part of the session. It adds it to the pull request. But the way it adds it is that it's kinda hidden. It's in a details block that's collapsed by default. So if you are reviewing that pull request as a human, you don't see that implementation plan. If you're interested, you can expand and check it, but, typically, you wouldn't. It's just an artifact that we attach.
But the sub agent that evaluates the quality of the problem statement didn't know whether that block is collapsed or not. It just sees the whole problem description stated in the pull request body. As a result, those pull requests were penalized, because the implementation plan is quite verbose, and sometimes it deviated from the actual thing being shipped in the diff. So you might have been penalized for, say, misalignment between the diff and the problem statement. One of our colleagues caught that and said, okay, why don't we just strip the implementation plan out of the pull request body when we're assessing it for problem statement quality, just so it doesn't sway the bot's opinion of the problem statement. That was one of the things I really wanted to test through the evals, just to make sure that we are indeed improving, well, not even the quality, but that the bot's assessments get closer to the ground truth when we strip the details block from the problem statement description. But virtually any change to the prompts or the agent has to go through those batch evals, just to make sure that we're not degrading the quality overall. Great. Kesha Mykhailov, I'm just going back to Shrek. Yep. Yeah, let's go back. So I actually lost my mind a little and pushed an update to the code change that we made, a very manual change. In case you haven't figured it out, we already planned this out. I added a log line to pretty much our busiest Rails controller. And it's a particularly bad log line. It's not even formatted well. So here's Shrek giving out. I've made Shrek angry. Well, first of all, let's figure it out. Shrek understands, or we've given enough guidance to our setup for it to know, that you don't need to provide tenant IDs in log lines.
It's a bit of an anti-pattern. We have structured logs, so it's already inserted. So, you know, it's getting very Intercom-specific implementation stuff right. And also, if we go back to the overview, we can see a second Shrek review, and it does not like this. You can see, well, first of all, my description doesn't include this naughty thing that I tried to get in through this pull request. Then there's, well, this could probably cause some errors. You know, this is not the best Rails code that I wrote. And it's also figuring out, in the inline comments, that you do not want these, you know, errors and exceptions going off on our highest-throughput endpoints. This thing would probably take us down if I shipped it. So, yeah, it's great to see Shrek getting things right and refusing to allow my things to get through. I just need to get a human to give me a "looks good to me" to ship this. Yeah. You need to find someone to rubber stamp this horrible change and then get it out. Well, not gonna do this. There are loads of questions about code review and Shrek and co in the Q&A. I don't know, should we skip in there? Yeah. We're kinda tight on time, so let's try to wrap up on what you wanna show system-wise. I wonder if you wanted to maybe dip into telemetry or anything like that while we're here, and then we'll pull back to a little bit of what's next and then the general Q&A, which I'm busily getting agents to curate and make useful. Yeah. Good shout. I'm gonna just show on my screen again our Claude Code. So, we store all of our Claude Code sessions in S3 and make them queryable using Athena. And I kinda feel like we're not getting as much value out of this as we should.
And I know that Anthropic have started shipping versions of session analysis, like AutoDream, which I think is the coolest feature name that I've heard in a long while. But, you know, we have AutoDream at home here. We're just doing a very simple query against our session data. I'm just asking it to take a look at my last twenty-four hours and give it some rating. I think the more useful part of why it's so important to collect this session data is, first of all, tech support. We want people to be successful, and we want to support people in their use of Claude Code internally in Intercom. Sometimes things just go wrong, or you need to be able to see things deeper. We can't just log into other people's computers, and a lot of this stuff tends to be local or whatever. But we have the sessions available, so people can give us a session ID. We do pseudo-anonymize them on the way out. But being able to find them and being able to go, oh, this is what went wrong, or how come this skill didn't invoke, or whatever, that stuff is pretty useful. We do give feedback to people, or have skills and things that allow people to get feedback, just from doing basic session analysis on day-to-day effectiveness, are they getting success, whatever. But there's gold in here. You know, I really think that there's more stuff we could be mining here, whether it's, like, intensive sessions. I think about Fin and how we measure the effectiveness of Fin. Our North Star there is, like, soft resolution or hard resolution. And there are these kinds of signals in the session data, like, you know, did people get to the desired outcome fast? And did they thank Claude Code at the end, maybe? That sort of stuff is all here in the session data. We've just gotta be able to find it and query it and get some information out of it.
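On the pseudo-anonymization step mentioned above, one common approach is a keyed hash, which keeps a stable token per person so sessions can still be grouped and looked up without storing the raw identity. The sketch below is only an illustration of that technique, not a description of the pipeline in the demo: the record fields, function names, and secret handling are all assumptions.

```python
import hashlib
import hmac

def pseudonymize(user_id: str, secret: bytes) -> str:
    """Map a user identifier to a stable token using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:16]

def scrub_session(record: dict, secret: bytes) -> dict:
    """Return a copy of a session record with the user field replaced by its token."""
    cleaned = dict(record)
    cleaned["user"] = pseudonymize(record["user"], secret)
    return cleaned

# Hypothetical record shape; the session ID is kept so support can find the session.
secret = b"load-this-from-a-secret-store"
record = {"session_id": "s-123", "user": "alice@example.com", "events": 42}
print(scrub_session(record, secret))
```

Because the same input always maps to the same token, per-person analysis still works, while anyone without the secret cannot recompute the mapping.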
And, yeah, hopefully, ChatPRD should get back to us pretty quickly. I might try to put a finer point on something you touched on there. I think the place a lot of people start with using AI, or where we started, was that everyone in the team has access to some power tools, and it's like, go have at it, see what happens. And the biggest, most important flip was making this into a shared system that it becomes our job, collectively, to evolve together. Brian Scanlan showed some of the building blocks of that shared system. There aren't that many, and I think they're fairly generalizable and everyone will end up having the same ones: you know, something for code review, for example, something for auto approval, something for remote agents, and a few other building blocks. And philosophically, a huge part of our job then as engineers is the pursuit of that aspiration where the agent is as good as or better than us at our jobs. Any mistakes that we see it make, we try to nudge it and correct it so it doesn't do that the next time, etcetera. And there are so many parallels between this and building an AI product. In fact, I make the case that our agentic system is the fastest evolving and most important AI product we're building at Fin, because it's, you know, the factory that builds Fin ultimately. As it improves, we get more and more leverage from it. So for me, that's conceptually the biggest takeaway from all of this. You go from local gains to shared and compounding gains. When Kesha does something smart that fixes something for him, it can be applied to hundreds of people immediately. This auto-updating system is also very important. It's a bit of work to get that working well, but it's massively worthwhile. How much of that do you think is system?
Like, you've made the system really easy? It just all goes to S3, the skills repo is easy for everyone to access. And how much of that is culture? Because I see a lot of people with the system, and their culture is like, you're doing what with my Claude sessions? And I'm just curious which of those you think makes this easier. Because, I mean, systems you can build. It's like, how do you get the people? I think the system is essential, and the people thing is then what will differentiate it. Like, how good your culture is at adopting it, with open-minded people who are curious, etcetera. And there's a big leadership challenge there, I think. There's a lot of change for all of us to contend with, and it's scary at times, etcetera. But I think a lot of the hard part is the people part, the culture part. We don't have all the answers, and we've a long way to go. Like that chart we showed earlier of the distribution of impact across the team: you'll have a bunch of people who are just naturally gonna push to the limit of what's possible, and then others who are maybe too busy or not curious enough to follow suit. So there's a huge people challenge in this. But I also don't want to underestimate the challenge or difficulty in getting the systems right. You know, Kesha walked through the importance of the eval harness. You're building an AI product. It's not just a bunch of scripts and stuff. It's real; details and quality matter. I expect all of that will just get easier to do over time. There are probably tools out there you can buy that do a bunch of this, but there's no avoiding the difficulty of the work. You have to take it seriously. One of the biggest unlocks for us, I think, and we did it way too late, was properly staffing a team to own this.
You know, you can get really, really far with everyone's good intent and stuff, but it's hard to do this type of work well if it's just borrowed time on the edges. You need a team. Yeah. You know, I've been talking to a lot of different folks over the last while at different stages of rolling out AI in their engineering teams. Not that any of this has been easy, but I think one of the easy modes that we've had is that Intercom is a pretty unified culture and sociotechnical system, and we're very used to working in shared ways. We prefer these monolithic applications where we solve a problem once, you know, whether it's observability or logging or whatever, and everyone benefits from that. I think that made this easier for us than for most organizations, easier than if I was having to roll out the same approach into five different business units that don't really talk to each other or don't share too much. I kinda realize now that, obviously, this was easy for us because of our preexisting culture and approach to technology. Yeah. Definitely. The culture that we've been building for, I don't know, more than a decade kinda underpins this whole success. Even just take that thing with deployments, right? We've been investing for years in getting our deployments to be under, what is it now, ten minutes. And now AI comes in, and you can generate all the code, but why would it matter if you can't deploy quickly? Right? If you release once per month, or it takes days to get something into production, it doesn't really matter how much code you can generate, because it's just so slow. So, yeah, culture is the really important thing, and then you build a system that helps you drive the culture even further.
How's your dreaming going, Brian? Wanna wrap that up? Well, it didn't give out to me, but it's definitely given me little bits of feedback and a rating. Like, here's why Claude plugin syncing broke, and, yeah, it found the root cause. It's like, well done, Claude. So, yeah, just a bit of insight. And this is a good way to get personal feedback on how you're doing and how you're interacting with Claude. Okay. Can I take the screen for a minute, and I'll try to quickly transition into our Q&A piece? Yep. Let's see. Boom. I kind of wanted to plant a seed of where we see this going next. We're nearly six months into this, and, very quickly, there are two very exciting things on the horizon. We see this shift from a world where we're endlessly in a mode of scarcity, where there are always more things, more product ideas, more backlogs, etcetera, to one of abundance, and we really feel like we're on that path. It's worth thinking about what it would mean for your business if that was the case for you, because it's worth fighting for. It's within reach. And then another thing to think about, again, when you remove the constraints: so much of how we build product roadmaps and build products has been forced into this scope constraint. It's a helpful thing to focus you, right, the minimal viable product. But that just won't be good enough anymore. You know, you can be more ambitious, and I think you should shoot for massively delightful products. Again, think about what that would mean, because if you're not fighting for that, somebody else will. So, yeah, two provocations for what's next. Anything you'd maybe riff on there, Claire, before I jump into Q&A? Sorry. I wish AI could unmute me. No. No.
I mean, I will say this for you, I'll give you a compliment, which is that you all are ahead of the curve for sure. But a lot of this, as you said, comes down to a couple of systems-level things that most teams can technically knock down. And then if you have the right culture, you can move things forward. I think the mindset of "we improve the system, we don't improve the outputs" is really, really important right now, remembering that the system includes your agentic harness, your evals, all that kind of stuff, your telemetry, and the people. If you're not getting the outcome you want from the system, it's not the outputs that are the problem. It's the inputs, so iterate there. I think there's gonna be a lot of interesting work for everybody to do. Cool. Okay. I'm gonna try to pick out a couple of thought-provoking questions, and let's see where this goes. Apologies, this tool makes it very difficult here. So I pulled it out into Claude, and it's not gonna help me; if you jump in now, I probably won't see your question. So perhaps we'll try to answer everything in a little follow-up. If I don't get to your question, we'll try to do the best thing. Okay. So there are a bunch of interesting questions around org and business impact, skills, peer review stuff, spend and tooling, measurement and gaming, security and privacy, etcetera, etcetera. To make it easy, I'm going to start with the one with the most votes. Okay. When R&D 3x's, the bottleneck usually moves into decision making, PM throughput, or the narrative layer for the rest of the business. Where did it land for you, and what did the communication design have to become to absorb the 3x? Who wants to take that? I mean, I'm the product person.
I can't take it for you all, but I can speak to this in terms of what I'm seeing other folks do, which is a couple of things. When R&D or engineering capacity really outpaces your roadmap, when you truly get to backlog zero, what I see is companies thinking about a couple of things. One is where to allocate that excess engineering capacity other than product-facing features. You see a lot of "the product is the product that builds the product." So you do see a higher investment in internal developer platform, agentic tooling, and internal tooling for the broader organization, and some of that capacity is being shifted more decidedly towards internal stuff. The second thing that I see is really a shift in what product's and design's job is. I'm seeing a lot more product managers not be spec machines and instead be deeply customer facing, trying to discover not just a product feature or capability that the market wants, but one that can be commercialized, and how it's priced and packaged. So I see product managers spending their time differently. And then from a design perspective, I see folks really focusing on the edge of their craft, as I say, which is, like, there's something you are put on earth to do beyond rounding border radiuses in Figma. What are the delight parts of it? And then from a communication perspective, what I'm seeing is simplifying organizational cognitive overhead by making teams smaller and more self-sufficient. When teams can move really fast, you really have to work quite hard on eliminating dependencies between teams and reducing approval gates so that individual teams can move very fast. So I see a lot of teams just getting smaller and then being more adaptive about what those small teams do in terms of roles and responsibilities across the EPD functions. That's what I see outside.
I'm curious if you all think about any of that or do similar. Yeah. A lot of that resonates, particularly that last piece. I think a trap a lot of us will probably fall into is keeping a lot of how we work and how we organize ourselves the same despite this massive change in constraints. And I think a great forcing function, to challenge yourself at times, is, you know, shrink teams or remove roles from things. We have teams that used to always, classically, have a PM, a product designer, an engineering manager, and a bunch of engineers. We've seen great upside from limiting them and saying, you know what? You, as a product engineer, have enough context and good product instincts to make all these decisions. You've got escalation points, etcetera. But the team is not gonna have a PM anymore. Now, in a bigger company like us there are areas of high ambiguity, where you put those people, and there are areas that are pretty well figured out and need a lot of execution, where you don't. The other thing that we've seen is that when you remove one bottleneck, it puts a lot more emphasis or importance on attacking other bottlenecks. So, if you're shipping product a lot faster: you know, if you'd asked the marketing team twelve months ago how many tier one launches we can do per quarter, they'd probably have said one or two. When you have eight tier one launches queued up to go in a quarter, you're gonna figure out how to launch them. And a lot of the same principles apply. You don't need 40 people on that project. You can have four people who have all the right context, etcetera. So, yeah, in a good way, it's sort of a game of whack-a-mole. Fix one bottleneck. See where the next one crops up. Fix that one.
And the whole system, I think, is ripe for attack here. Okay. Let's jump to a different question. Let me see. Okay. So hard, so hard. Here's an easy one for you, Brian. How do we manage token spend across the org? With great difficulty, or, I don't know, lots of naivety. We probably did fall into the kind of token-maxing thing, as in, we weren't specifically looking to get people just to burn tokens, but we wanted people to not think about which model to use or deal with the constraints of limited usage. So we've generally had a very open, you know, API plan with Intercom, burn through it as fast as possible. This has started to become a problematic amount of money, in that we're actually having to spin up a Claude Code cost program. But the kind of stuff we're looking at today is mostly things like making sure that we're detecting inefficient use. You know, it's the accidental burning of tokens, or even just skill definitions. We've got a lot of skills that are called hundreds or thousands of times a day, and some of them do pretty meaty work. They don't all need to use Opus. They can sometimes use Sonnet or maybe even cheaper models. So we're doing what I consider to be just basic, low-hanging fruit: let's find the unoptimized or inefficient use and make it a bit more efficient. I think backfilling the constraint of not having tokens to hand would probably be kinda difficult in our environment. But I do think that you work with the tools better when you've got some awareness of what the tools are doing, or which models are invoked, or which model to use, that kind of consideration.
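As an illustration of that low-hanging-fruit pass, here is a minimal sketch that ranks skills by estimated daily spend and surfaces the ones running on the most expensive tier as candidates for a cheaper model. Everything in it is hypothetical: the skill names, the usage numbers, and the relative cost weights are made up for the example and are not real prices or real telemetry.

```python
from dataclasses import dataclass

# Hypothetical relative cost per 1K tokens for each model tier (not real prices).
RELATIVE_COST = {"large": 5.0, "medium": 1.0, "small": 0.25}

@dataclass
class SkillUsage:
    name: str
    model_tier: str
    calls_per_day: int
    avg_tokens_per_call: int

def daily_cost(usage: SkillUsage) -> float:
    """Estimated relative spend per day for one skill."""
    thousands_of_tokens = usage.calls_per_day * usage.avg_tokens_per_call / 1000
    return thousands_of_tokens * RELATIVE_COST[usage.model_tier]

def downgrade_candidates(usages, top_n=3):
    """Skills on the most expensive tier, ordered by estimated daily spend."""
    on_large = [u for u in usages if u.model_tier == "large"]
    return sorted(on_large, key=daily_cost, reverse=True)[:top_n]

usages = [
    SkillUsage("format_pr", "large", 2000, 3000),
    SkillUsage("deep_investigation", "large", 50, 20000),
    SkillUsage("summarize_thread", "medium", 5000, 4000),
]

for usage in downgrade_candidates(usages):
    print(usage.name, daily_cost(usage))
```

A ranking like this only says where to look. Whether a given skill still performs well on a cheaper model is a separate question that would need its own evaluation.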
Another thing that we're gonna be doing as well: we are working, like many other folks, on a remote agents platform called Buzz, just to go along with the silly names. And Buzz is a great choke point. Rather than trying to manage lots and lots of Claude Code sessions and hundreds of skills and all this messy stuff that can happen locally when people are exploring or interacting with their agents, Buzz is going to be a bit more of a controlled platform. We have a choke point: we can maybe reject prompts if they're not well defined, or we can choose the models at those points. So, yeah, maybe our mitigation strategy for long-term token spend is to get it away from people's laptops and into places where we've got more control. And then, super long term, maybe we shouldn't stick with Anthropic. I think everyone should be open-minded about different tools in the long term. We're very happy with our use of Claude Code and treating it like a platform for now. But, you know, the economic problems that we're dealing with today, I think they're gonna look quite different, in many dimensions, in the near future. Cool. Let's cover one final question. Do engineers know how productivity is measured? How do we prevent gaming? I think there's so much here that we could go on for maybe an hour, but the ninety-second version: who wants to bite? Could you please put it on the screen? What is that question? Find it. Yeah, there we go. Yeah, I can have a go at this one. It's a great question, because the metric we chose is kind of prone to that; it could be easily gamed, which is, I guess, part of the whole system. But if you focus on a sociotechnical system and you're thinking about quality and the incentives that you create, you should expect something like this to happen.
And in some ways it even plays into our hands, because it surfaces the things that are being gamed so we can address them properly. For example, maybe it's a sub agent during PR review that validates not only whether your problem makes sense as a problem statement, but whether this is even the right problem to solve. And maybe you'd have to back it up with certain, I don't know, documents or notes or transcripts with, say, a PM or whoever to confirm that, yes, this is indeed an important problem, and I'm not just, you know, gaming my number of pull requests to get on the leaderboard. So all those things surface a quality problem within the sociotechnical system. And this plays into our hands, because surfacing those bottlenecks and addressing them with high quality automation and processes or cultural bits is part of modernizing the factory. That's what we are after. Thanks, Kesha. Anything from you, Claire, on what you're seeing other companies do or how to think about this? Yeah. I mean, what I generally tell people is that gaming isn't a measurement problem. It's a culture problem at some point. And, you know, there's measurement, and then there are incentives and goals, and they can be decoupled in a lot of ways. I think if you have a healthy culture, these measures will be minimally gamed, or if they are gamed, they will be obviously gamed. I mean, that's been a question since the beginning of time around leadership: if I put a number up, people are gonna game it. And if people are gonna game it, it doesn't matter what your number is, and you still need to move the number. So I think this all comes down to: you do need foundational telemetry. Measuring something is not enough.
I think what you all have proven is that if you actually set a goal, you can move towards that goal. And even if it is a crude leading indicator, it is an effective one for organizational movement, which is ultimately what you're trying to do. You're trying to convince a large team to change their behavior. You're not just trying to give more tokens to the Anthropic gods. So I would just say, you're doing the right things. Continue to invest in the right people and the right team and culture. Those problems should be minimal, hopefully, if you see them at all. Yeah. I just put a little chart on the screen, which is another kind of triangulation metric that we're trying to measure, which is the number of features shipped monthly. We have a Slack channel where everybody is encouraged to post whatever they shipped recently, and over the last two months, well, the most recent data isn't complete yet, but over the last two months there is an uptick in features being shipped. We don't publicize this too much. We don't stress that this is an important metric, so it's more of a triangulating, interesting signal in addition to all the other stuff that we measure. But hopefully, with time, we'll find more of those things that better represent the net positive for the business, and we'll start building more of those. Yeah. Triangulating against so many different measures here is really helpful. Look, we could go on for quite a while. There are hundreds of questions that we haven't got to. I loosely commit to answering them asynchronously. Really appreciate all your time, and Claire, especially, for joining us as well. So, thanks for joining. See you again soon, folks. Bye bye. Bye bye.