How Engineers Actually Use GenAI—Turn by Turn, at Bank Scale
An early look at bank-grade telemetry for GenAI-assisted engineering—what’s instrumented now, what we’re seeing, and what’s next.
Full transcript
The complete talk, organized by section.
Host Intro (Gene Kim)
All right, up next is Dr. Matt Beane, currently associate professor at UCSB and CEO of SkillBench. He is also the author of The Skill Code: How to Save Human Ability in the Age of Intelligent Machines.
I met him shortly after Steve Yegge posted his "Death of the Junior Developer" post last June. He sent me a DM that basically said, "We've got to talk," because he had been studying a phenomenon he calls the novice optional problem for over a decade in the study of the impact of automation on the workforce.
I'm so glad he did, because I've learned so much from him. He taught me, for example, how for centuries surgeries always required four hands, so it was a perfect pair of a senior surgeon and a junior surgeon: great for the senior, great for the junior, until medical surgical robots appeared that enabled senior surgeons to do the surgeries themselves, rendering the junior surgeon optional, or more often ignored.
Here's another thing I learned that went into the Vibe Coding book: he has observed that the more janky the technology, the more you need to rely on people and operational glue. I think that is so relevant because, as you all know, AI is very, very janky.
We've gotten to go on many fun adventures together. He was instrumental in helping Steve Yegge and me complete the Vibe Coding book, and he is going to present some amazing primary research that he has done that I think is of incredible interest to this community. Here's Matt.
Dr. Matt Beane
Thank you. Good morning. With an intro like that, I have nowhere to go but down.
Let me start with Gene and thank Gene, because the reason I'm on this stage is that Gene gave me and my team access to this incredible community, and we have gotten two pieces of work underway that I'm going to give you a peek under the hood on today. I knew none of the folks involved on the consuming side of these projects until Gene made the introductions. These were, as a researcher, the pie-in-the-sky ideal opportunities, and now as CEO and co-founder of a startup, the pie-in-the-sky opportunities we were really hoping for to push the cutting edge of what we understand about AI use in organizations.
One is a qualitative, interview-based study. I bet a number of you in the audience participated in this study, and thank you for your confidential accounts of how you're deploying AI. The other is what one CTO at a world-leading consumer goods company called "rocket science," which I really liked because I'm an MIT nerd: trying to get at turn-by-turn telemetry between developers and these systems, and relating those interactions to outcomes.
Let me explain why I was focused on these. Gene gave me a quick intro, but in short, I saw Steve's post a little while ago about, maybe, some data-free speculation: "I'm a little concerned, folks, about the fate of the junior developer and whether they're going to be necessary anymore." Since 2010, I've been doing research that looks at advanced intelligent automation and the consequences of it for how we learn how to do our jobs.
By the way, training doesn't really help us with that. That's just table stakes. The way we learn how to do our job hasn't changed for thousands of years: I do the thing with somebody who knows more about how to do the thing, under pressure. That leaves me with skill: skill is the ability to do something reliably under pressure.
The core finding in my research is that automation, and these days AI is a classic example, allows an expert to eat more of the problem independently, which means there is less necessity of involving that junior person that started this conversation. That is one thing we are very curious about from a research and business point of view with respect to AI use.
Here is another motivation for the research. I'm going to be cryptic, sorry: this study is both under embargo and under review, so I can't give you the fill-in-the-blank redacted here. But I will say we have done an empirical study, and I had the early results in hand when I met Gene, on professional use of AI in long-chain, complicated task performance, like eight to ten hours' worth of work to accomplish X. We found that, yes, using AI content significantly boosts the quality of your output, as ranked by three independent experts. Lots of working professionals participated in this study.
There is another factor, though, in the way that users interacted with these systems, and its negative effect was twice the size of the positive effect of that content use. I'll put this in more temporal context, which is why we got interested in turn-by-turn telemetry, because we had transcripts of users interacting with these systems. This factor dragged quality down the more it occurred over time, and the systems contributed to this factor more than humans did and were harder to nudge away from this kind of destructive activity over time. I can't tell you what that is, but it piqued my interest. For the last year and a half, it has been nothing but software development from my point of view, both business-wise and research-wise. We need turn-by-turn telemetry on real devs doing real work in real organizations to get at these kinds of questions.
The interview study: I'll give you a quick skim to give you another motivation. We have quite a range. Eighty-three organizations expressed interest in participating in this study and signed up. I have a team of five doctoral students at UCSB. We have been interviewing folks at the CTO/CIO level, manager level, and dev level in all organizations that elected to participate, and they are quite a range of organizations.
The quick story there is that everyone this year especially is looking for ROI. That 100% number refers to the percentage of the informants we interviewed who in some way referenced a desire for clearer ROI metrics: devs, managers, and CTOs. A number of CTOs and CIOs are saying it is actually the board almost literally yelling at them, trying to get some sort of quantified answer to this question.
A fair number of metrics have been deployed or are in the midst of being experimented with. These are the ones that came out of the interview data that we have analyzed thus far: the superset of metrics that folks seem to be focused on. From our point of view, that stuff is healthy and appropriate to a significant degree, but what we want to get at, both at SkillBench and in my academic work, is turn-by-turn telemetry.
ROI really lives and dies in that sequence of interaction. If you're a Gene fan and you've been following his bouncing ball through his AI journey, you may have seen the diagram he produced about writing the Vibe Coding book with Steve: his API token use per unit time as he was working on the book. Gene and I were talking as he produced that. In my view, this kind of data is a necessary complement to higher-level metrics about use across functions.
Incredibly, through Gene, I also got an introduction to someone I now consider a friend and a co-conspirator, Brendan at CommBank, Brendan Hopper. I managed to omit his name from that slide, and I'll be the butt end of that joke for a while. He stepped forward as willing and interested in running a controlled experiment inside the dev environment at CommBank Australia to get these turn-by-turn logs, and to start with a preregistered study that we can publish academically and then move it out into the business. There are nudges involved here to get at causality, like when and how AI use actually causes changes in things like quality, and I'm obviously also interested in skill.
So far we have two weeks of data. I'm going to show you stuff that is abstracted from nine devs, but it should give you a sense of what's possible in this universe.
Getting into the technical detail: basically, what we need to do this kind of analysis is the chat-based interaction between devs and a model and also code changes, so diffs. We created an extension that plugs into their environment, gathers both of these things securely and locally, and analyzes them semantically on the dev's machine, so it all stays on-prem. This is a joint academic effort, but also through the startup I co-founded called SkillBench.
In short, we have those chat logs and the code diffs, and the pull request is the location for quality outcomes. I won't read the VS Code diff information; I'm not a fan of reading slides at folks. The chat logs are probably more intuitive for you folks. This is the kind of stuff we are hoovering up, and this is what it makes possible.
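To make the two data streams concrete, here is a minimal sketch of how chat turns and code diffs could be represented and merged into one timeline. The class names, field names, and the `interleave` helper are assumptions for illustration; the talk does not describe the extension's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ChatTurn:
    """One message in a dev/model conversation (hypothetical schema)."""
    session_id: str
    turn_index: int
    author: str        # "human" or "ai"
    timestamp: float   # seconds since epoch
    text: str


@dataclass(frozen=True)
class CodeDiff:
    """One captured code change (hypothetical schema)."""
    session_id: str
    timestamp: float
    file_path: str
    added_lines: tuple
    removed_lines: tuple


def interleave(turns, diffs):
    """Merge chat turns and diffs into a single list ordered by timestamp."""
    return sorted([*turns, *diffs], key=lambda event: event.timestamp)
```

Ordering both streams on one clock is what lets later analysis ask which chat turn preceded which code change.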
We built the analysis over about a year and a half. In short, we can analyze as the work is going on. We're not doing keystroke logging; remember, we are just getting diffs and we have chat. In the end, we are able to produce analysis that identifies the source of the material that ends up in the pull request: was it the human originally, or was it the AI? Where and how did interactions happen with respect to that content to lead to the outcome? Parsing things almost at token level sometimes, but certainly on a turn-by-turn basis, is what is necessary to produce this kind of insight.
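As a rough illustration of the "human or AI?" question, the sketch below labels each non-blank line of a pull request by whether it exactly matches a line the model proposed in chat. This is a deliberately crude stand-in, assuming exact line matching after whitespace stripping; the semantic analysis described in the talk is not specified, and `attribute_lines` is a hypothetical name.

```python
def attribute_lines(pr_lines, ai_suggested_lines):
    """Label each non-blank PR line as "ai" or "human".

    A line counts as "ai" only if an identical line (ignoring leading and
    trailing whitespace) appeared in the model's suggestions. Anything
    edited, even slightly, is counted as "human" in this simplification.
    """
    ai_set = {line.strip() for line in ai_suggested_lines if line.strip()}
    labeled = []
    for line in pr_lines:
        if not line.strip():
            continue  # blank lines carry no provenance signal
        source = "ai" if line.strip() in ai_set else "human"
        labeled.append((line, source))
    return labeled
```

Exact matching will undercount AI contributions that a developer lightly edited, which is one reason finer-grained, turn-by-turn analysis is needed in practice.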
Nine devs is just suggestive, but this is the turn-by-turn activity; each color is a developer in this diagram. We have a few typos here in terms of percentages, but I think it should be clear.
Here is a quick summary, and again, I'm going to skim across this. I see phones out; remember, this is all recorded, you can check this later. The descriptive statistics are on the left-hand side: what do we have in hand? Roughly speaking, we are going to have two orders of magnitude more data when it comes time for the final analysis, but some interesting stuff just on the descriptive front pops out.
Number one, in this group of devs, which by the way is the leading edge of AI devs at CommBank, or at least one pocket of them, AI laps the developers in terms of code contributions. Roughly seven to ten times, in terms of characters and lines of code and so on. Humans: 3.4 million characters. AI: roughly 25.1 million. Lines of code show a similar pattern. Tokens per session is even more extreme, so AI is eating lots and lots of tokens; devs, not so much. And devs halfway accept their robot overlords. Basically, the acceptance rate on the part of the devs is a bit under 50%, around 40% or so.
Interesting insight here, but none of this is really linked with time. Time is really important because these kinds of interactions unfold on a turn-by-turn basis. I'll be candid: we don't know what some of this stuff means. As we get the data, we will be doing causal inference and so on, and we will have clearer answers in the paper. But there are very clear patterns, and this is a big blinding flash of the obvious for anybody who works with these systems: the cadence of work, lines of code output, and so on, on the part of AI and GenAI in general is short bursts with tons of tokens and tons of data, and it is very predictable. It is really fast.
Humans are red and AI is blue in this view. Devs don't respond immediately. Sometimes they do, sometimes close to that, but it doesn't seem to be related to task size. We have looked at the chunks of work they are dealing with. This raises all kinds of questions: when and how are these delays or differences in cadence net productive for things like quality, or for skill development? There is a healthy chunk of dev responses to AI content that are very, very fast. We don't have definitive proof of this yet, but it is pretty clear to us that this is just copy-pasta: here's my AI code and I'm going to shovel it in. We just don't know, but you can't ask questions without getting this kind of data.
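The delay signal described here can be sketched as follows: measure the gap between each AI turn and the human turn that directly follows it, then look at the share of very fast responses. The five-second threshold and both function names are assumptions for illustration, not values from the study.

```python
def human_response_delays(turns):
    """Return delays (seconds) for each human turn directly after an AI turn.

    `turns` is a list of (author, timestamp) pairs ordered by time, where
    author is "human" or "ai".
    """
    delays = []
    for (prev_author, prev_ts), (author, ts) in zip(turns, turns[1:]):
        if prev_author == "ai" and author == "human":
            delays.append(ts - prev_ts)
    return delays


def fast_fraction(delays, threshold_s=5.0):
    """Share of delays shorter than a (hypothetical) threshold in seconds."""
    if not delays:
        return 0.0
    return sum(d < threshold_s for d in delays) / len(delays)
```

A high fast fraction is only a candidate marker for paste-through behavior; as noted above, the data does not yet prove what those fast responses are.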
I'll give you a metric that has already jumped out to us from this data, which we think will be really interesting as a way of focusing your question about where AI value is coming from and a moment in the use of these systems that might matter. There have been classes on prompting for two and a half years now. Now it's context engineering, not prompt engineering. But the point is, we have nine superusers in the bin here, and only one of them is burning a lot of tokens in their first prompts in any given sequence. A lot more folks are coming in with something like the left-hand side, a more typical number of tokens to burn to kick something off. That one freak of nature is on the right-hand side. It seems pretty clear already that this is going to be consequential. If you're looking for a new metric, this is not easy to get at, but if you can, the average number of tokens burned in the first prompt in a sequence of work might be an interesting one to get at.
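A minimal sketch of that first-prompt metric, assuming each turn is available as a `(session_id, turn_index, author, text)` tuple. Whitespace-separated words stand in for tokens here, which is an assumption: a real tokenizer would give different counts.

```python
def first_prompt_tokens(turns):
    """Map each session to the approximate token count of its first human prompt.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    first = {}
    for session_id, turn_index, author, text in turns:
        if author != "human":
            continue
        if session_id not in first or turn_index < first[session_id][0]:
            first[session_id] = (turn_index, len(text.split()))
    return {session: count for session, (_, count) in first.items()}


def mean_first_prompt_tokens(turns):
    """Average first-prompt size across sessions; 0.0 when there are none."""
    counts = first_prompt_tokens(turns)
    return sum(counts.values()) / len(counts) if counts else 0.0
```

Comparing this average per developer is one way to spot the kind of outlier described above, the one user who front-loads far more context than the rest.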
Overwrites is another key question. If you were skating ahead in your mind from a few slides ago, you might have anticipated this: fine, AI gives me a bunch of content. How much of that am I just going to pass through? How much am I going to edit? What parts of it? For us, overwrites as a metric seems like a pretty interesting way of getting at questions around trust and reliability of these systems.
Again, we have nowhere close to the data we need in the hopper for you to trust these stats, but it is interesting that differences are already emerging. To the extent that overwrite patterns indicate consequential things for the outcome, the code quality, and so on, we think this is a really important place to get fine-grained and useful data.
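One simple way to put a number on overwrites is the share of AI-suggested lines that do not survive unchanged into the final code. The sketch below uses Python's standard `difflib.SequenceMatcher` for line alignment; the definition of "overwrite rate" and the function name are assumptions for illustration, not the study's metric.

```python
import difflib


def overwrite_rate(ai_lines, final_lines):
    """Fraction of AI-suggested lines that were changed or dropped.

    Lines are aligned in order with difflib; a line counts as kept only if
    it appears unchanged in a matching block of the final code.
    """
    if not ai_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(a=ai_lines, b=final_lines, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - kept / len(ai_lines)
```

A rate near zero means AI content is passing through untouched, while a high rate means the developer is reworking most of it, which is the trust signal discussed above.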
It is also clear, even in this group of nine, that there are really interesting differences between apparent personas in terms of devs. Remember, this is a group of devs who are supposedly leading edge with AI inside CommBank. There is a healthy chunk of them who just don't use AI literally at all, and we have gone to pretty extensive lengths to make sure that we are filtering for whether they are AFK using AI to try to get their work product and so on. There is a healthy chunk who just don't.
Then there is a divide between folks who are mostly having AI drive the conversation and making minor modifications, snowballing their way through a problem, and others who are using it as a back-and-forth ideation partner, conceptually speaking in a sandbox, and then doing a lot more refactoring and reworking of the code before that goes into their pull request. That trust factor varies quite significantly. The overwrite patterns among those three groups are very, very different, so we think that is pretty interesting.
Doing this securely and at scale is, I'll be candid, an ongoing challenge. Right now, the hard part is from inside the bank after we have devs instrumented. The SkillMeter extension that sits in the IDE is sending nudges at devs to test the causal effects of AI use and so on. These are all local and attached to hashed user IDs, so we don't even know who we are nudging with what. Then telemetry collection and semantic analysis is happening again on the client side. We don't see any of it.
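For readers unfamiliar with hashed identifiers, here is a minimal sketch of one common approach: a keyed hash computed on the client, so the analysis side only ever sees an opaque token. Using HMAC-SHA256 with a locally held key is an assumption; the talk says only that user IDs are hashed, and `pseudonymize` is a hypothetical name.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Return a stable, opaque token for a user ID.

    The same ID and key always yield the same token, so records can be
    linked across sessions without revealing who the user is. The key is
    assumed to stay on the client side.
    """
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
```

A keyed hash is preferable to a plain hash when the set of possible IDs is small enough to guess, because an attacker without the key cannot rebuild the mapping by hashing candidate IDs.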
Eventually there is a series of automated scrubbing techniques that we have worked up, a lot like the Exabeam talk that just happened, and Anthropic pushed a few of these out in prototype form with their Clio-based analysis, the first one they did, the economic index. But still, human attention is required on this data. Legal has been remarkably supportive because they were involved at T-zero on this study, and security as well, ensuring that review allows for secure passage over to our S3 bucket and then for us to do the kind of analysis I was showing above.
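The talk does not say which scrubbing techniques are used, so the following is only a generic illustration of pattern-based redaction applied before data leaves the client. The two patterns (email addresses and long digit runs) and the `scrub` function are assumptions, and a real pipeline would need far broader coverage plus the human review mentioned above.

```python
import re

# Hypothetical patterns for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS = re.compile(r"\b\d{8,}\b")  # e.g. account-like numbers


def scrub(text: str) -> str:
    """Replace email addresses and runs of 8+ digits with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)
```

Pattern matching like this catches only well-structured identifiers, which is part of why fully automating the security review is described as an open challenge.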
Here is a quick thought about where I think this kind of system needs to go. At this point, we have what is good enough for one or two enterprise implementations at a time. But if we are going to do this at scale and in a secure way, we need something much closer to full automation in this security review bin. That's an interesting challenge, but in my view, getting at this kind of telemetry is worth the effort. That is what we, over the next three to four months at SkillBench with a couple other customers, will be obsessed with.
My devs basically had not slept for two days to allow for the kind of analysis I just showed you. One of them stayed up yet another night to build this out for Cursor. This is RooCode, VS Code, and so on. He was just like, "I can't stop." We have more work to do to build this out and make it industry grade.
For everyone looking at this turn-by-turn log data, interestingly enough, these logs also contain thought traces for these models if you are using a reasoning model, depending on your local configuration. This is also data that can be analyzed and included in the stream. I think that is the place to complement the higher-order organizational focus implied in this year's DORA report.
Here is what we are concluding based on a first-pass look at this data, and this is going to be our constant obsession for months to come. The traces of interaction that are temporally sequenced are really, really valuable and a rich source of data and insight to go after, and a really nice complement to the higher-order metrics that sit in the adoption dashboards you might already deal with.
The cadence question really matters because humans have one OS and these models have another when it comes to information processing, task performance, and so on. Meshing them well is going to be key to ROI and developer satisfaction in the work. You can get a huge rush of dopamine and then sit there because you're burned out, looking at a bunch of content that was so cool and you got in two minutes and it would have taken you hours before. But now what? How to manage that temporality and interaction so you optimize the joint performance of the system is a big open field question from our point of view.
Delay in interaction really seems like a very interesting signal to go after. I've been spending all of my time with software development professionals for the last year and a half, and I have been hearing a lot of complaints. The phenomenon that people love to whinge about, like "I got my code and there goes my pull request and push to prod," is something we can see in these delay effects and make reasonable inferences about, and those things could be automatically tagged. There is a lot of upside in getting at this signal.
Also, how you start a conversation is kind of the same as it is with humans. We have been teaching this in prompt and context engineering classes for a long time now. The way you start the conversation is dramatically more important for how the conversation is going to go than what you did at turn 11. So we think the tokens you burn right up front on a new task are something worth paying attention to. We think this overwrites thing is a really juicy and interesting place to go. Obviously, as we get more data and look at causality, we are going to have interesting questions to add there.
I'll land on my help-wanted slide. The help is: help us by filling our one remaining slot as a small, semi-stealth startup slash academic exercise. If you want to subject yourself to this kind of rocket science scrutiny, we think it is going to become state of the art rather soon. We're going to try to make that happen. If you have questions related to or connected to what I put up there, drop me an email. Just put that in the subject line; that will get my attention. We can take one more in the next 60 days. We hope to roll product out to the world in three or four months, and we're also hiring. Reach out for that reason if you like as well. Thank you, and thank you, Gene.