Balancing Act: Lessons Learned from Rapid Adoption of AI in Software Development

Las Vegas 2023

A case study on an engineering team that went all in on generative AI coding and realized along the way that all that glitters is not gold. This is a story about the highlights, mishaps, and close calls caused by our use of AI coding, and our method of controlling risk without sacrificing gains in productivity.

Full transcript

Michael Edenzon

All right, guys, I think we're going to get started. I know this is the last talk of the day, so I will try and go through it as quickly as possible. It should be lighthearted. I'm going to tell you a bit of a story and some of the things I learned.

I'm Michael Edenzon, CEO of Fianu Labs. A little background on me: I'm an engineer first and foremost. I was formerly director of DevOps at PNC Bank, and I'm a co-author of Investments Unlimited. Anywhere at this conference, you're bound to run into one of us. There's a roster of us. And I'm also a founder, so founder of Fianu Labs.

Today I'm going to talk about our stack, the tools we use, trends we saw, things I've observed, architecture, commercial real estate, the human brain, airline and automotive crashes, and what this has to do with software.

Just a little background on my company: started in August of '22. We're based in D.C. We're an automated governance platform, and we're trying to help companies deliver compliant software faster. So this story is not about that. It's actually about our use of generative AI to accelerate the delivery of new features.

Before I start, I don't know if any of you have seen this clip. This is Bryant Gumbel on the Today show in 1994. And this is the scene where they're sitting there and they're talking about email and the internet, and he says, "The internet? What's the internet?" So I bring up this clip because everything I'm going to talk about today with regards to generative AI, and the challenges that we're having and some of the benefits as well, it could all be irrelevant in a couple years. We all know that. So for this talk, it's going to be a moment in time.

All right, so I'll start off and give you the tools we're working with in our stack. For our software, it's Vue.js user interface, Golang backend, and Open Policy Agent to write rules for enforcement.

Our tools, and this is not a tools talk, but our tools: we're using ChatGPT for generating business logic and code and troubleshooting. And we use GitHub Copilot as well for boilerplate stuff and IaC. There have been plenty of conversations about the tools and how they're used. That's not what I'm really trying to talk about today. You could swap these in with anything comparable.

But the timeline for us kind of plays into this story. We were founded last summer in August, and then two months later, the launch of ChatGPT. That was huge for us because obviously, we're getting off the ground, we're getting started, we need to deliver new features fast, and it multiplied our productivity. So that was big. And then here we are today.

Today, what we found from our use during this period of time was the obvious, which is huge increase in developer productivity. We were developing things very quickly, and our time to market went down. We also found a big increase in critical vulnerabilities. Most importantly, the bugs that were coming out really caught our attention.

I'm not going to talk about the vulnerabilities part. We can maybe, if we have time for questions, I can go into some of those things. But right now we're going to focus on the bugs that were reported. So we found: more features faster, higher security risk, poorer quality.

I'm going to take a quick detour and talk about some things that I observed. And I'm talking about architecture, but not this kind, the kind you guys are thinking. I'm actually talking about this kind.

This is just a building in Paris, and it's beautiful. The craftsmanship behind it is very apparent, and we used to build buildings like this all the time. But now we build buildings like this. It's modern, it's sleek, it's stylish, but there's a noticeable absence in craftsmanship. And so think about why that is or how it happened. You look to CAD, to computer-aided design, the software tools that architects now use to model their buildings.

I now live in Washington, D.C., and I was driving along 395 a couple weeks ago, and I noticed the Hubert H. Humphrey Building. It's a government building, and it was built in 1977. I could tell it was from around that timeframe just by looking at it, before I looked it up on Wikipedia. But that's important because CAD was only invented a few years earlier. So this building was built with one of the earliest versions of CAD.

And you can tell by looking at the windows, the way that the angles are positioned and the way that the windows are inset, it's actually one of the basic things that you can do with a CAD software. And to show you, I actually downloaded a CAD software onto my laptop, and I did this in about 10 minutes. I don't have any experience with that, but you can see how it insets the windows and then gently pushes them in. And then boom, I've got the building.

So what does that tell us about how this building was designed? Well, it tells us the architect built this building because of the tools that they were using. The tools are shaping the building's design.

And you may say, "Well, why does that matter? Why does that matter?" Because the apartment building that I showed you before looks fine, right? It's stylish, it's sleek, so it doesn't really matter, I guess. The design is inconsequential, except for there's something interesting, which is it's a shorter lifespan.

If you look here, the average lifespan of a commercial building pre-World War II is about 100 years. CAD was invented in the '60s, and then you see this significant drop in the expected lifespan of a commercial building down to 70 years, and then 50 years in 2000. We don't have data later than that, but I found that really interesting.

So we were trying to figure out why, and there's this real estate development company that actually addressed this. They wrote a really good article, and when they were talking about why is it, they said, "The short answer is technology. The long answer is human interaction with technology."

And we thought that was kind of interesting. And we asked ourselves, is the same thing true with software?

When we were thinking about this, we were saying, well, right now it has never been easier to build great software, yet our software is as buggy and as vulnerable as ever. And I'm not talking about people and processes here. I'm talking about the way that the technologists are using the tools.

We've got really incredible IDEs that are rich with code linting and code smells and detection. We have static security analysis, we have static code analysis, we have all of the scans that you could possibly need to measure your code. And then during runtime, we've got all the observability and telemetry that you need to know how it's performing. So it's never been easier to write great software. Why haven't we gotten software that's less buggy?

And the question we asked is: is software becoming a product of our tools?

I thought a really good example of this: has anyone here seen the code for the Apollo 11 command module on GitHub? It's up there. Yeah, he's seen it. Someone actually took photos of the actual printed-out code and transcribed it into GitHub. It's in a version of assembly language, and they put it up there. It's 145,000 lines of code in the Apollo 11 command module, and there are no known bugs. Obviously it's mission-critical software, but they didn't have any of our tools.

So in an article from The Guardian in 2019, they were talking about Margaret Hamilton, who was a huge player in the development of that code. And they said her rigorous approach was so successful that no software bugs were ever known to have occurred during any of the crewed Apollo missions.

And she didn't have linting.

So is software becoming a product of our tools? Well, the forces at play, we'll just take a look at that real quick. Everyone knows here. I don't even need to spend that much time on this slide. This is the soup of tools, right? You've got your language and frameworks, your pipelines, your platforms, and your toolchain, all of this stuff. And so developers are entering into a world where it's all about the tools that you have.

Then we've got other forces at play. Things like the DORA metrics. We're really starting to measure the frequency of releases and the performance of our engineers.

Lastly, when we talk about the introduction of AI, the hype really isn't helping. You see some of these tweets here from, if anyone listens to the All-In podcast, these guys talk about this stuff all the time. But this one from Chamath I thought was really special: "You either double your productivity or cut half your people and fill in their productivity with AI."

These forces are at play that we're starting to say, all right, they're expecting a whole lot to come in terms of productivity gains from these tools. Things like we saw. I don't know if it was that dramatic, but it's obvious you're going to get better productivity. But are the expectations pushing the developers too far? Or are elite performers just raising the bar?

You've heard all this about how ChatGPT is going to make a 10x developer a 100x productivity, and the other ones will get left behind. But again, when we were searching for why and what was happening to us when we had our adoption, we went back to our commercial real estate friends: "The longer answer is human interaction with technology."

So back to the question that we needed to ask ourselves, which is: how do developers interact with generative AI?

I'm going to give you a really quick example, something that actually happened in our product development. We're an automated governance software, so we present attestation of various controls in CI/CD. So let's take this one, for example: SAST. It's an event-driven architecture, which means we get duplicates of events. We also get events in succession. So you can scan your code with SAST multiple times. We'll get multiple attestations.

The goal here in this problem was to figure out which is the active attestation that needs to be presented at a given time, which attestation represents the current state of that piece of code.

The first attestation that we captured was here, and then we captured another one after a CVE was discovered. So a scan was rerun and found a CVE. This one's a duplicate. It happens in an event-driven architecture. Then security went in and they would mark it as a false positive. They ran another scan, so a new attestation with a new result. Now it gets to the top of a hierarchy. And then it was run, scanned one more time before production release, and a new CVE was discovered. So now we're back to a failing state on the badge.

The developer had the feature request to go in and build the capability for a decision matrix to take this in an event-driven fashion before it gets presented and streamed to the dashboard, to decide which is the active attestation.

So the developer went in there, and they enumerated that into ChatGPT. I'll just say here, ChatGPT builds really great business logic, which is why we use it for this purpose sometimes. So he explained in prose what he was looking for, the decision criteria, and asked ChatGPT to generate Rego, which is the language for Open Policy Agent, to build the decision engine to output a single Boolean, which is `is_live`, meaning is this the live attestation, or is it one that we shouldn't stream?

And so he asked, and ChatGPT gave it to him. The developer copied the code from ChatGPT and put it into an IDE. The developer reviewed the code. He looked through it and said, it looks good. We've got the first one, which determines whether or not we have an attestation for SAST. If we don't, then store it. That's the live one.

Then if we have multiple attestations for SAST that already exist, then we decide based on the results of the SAST scan or the timestamp. Because if the SAST scan originated before the current live one, which sometimes happens in an event-driven system, then you need to say, well, ignore it, because that one is no longer the live one.
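To make the decision criteria concrete, here is a minimal sketch in Go. It is not the Rego from the talk, and every name in it (`Attestation`, `isLive`, `replay`) is hypothetical. It assumes each attestation carries an ID and the time the scan originated, that a repeated ID is a duplicate delivery, and that a scan originating before the current live one is stale. The talk also mentions deciding on the scan result, but it doesn't spell out that rule, so the sketch leaves it out.

```go
package main

// Attestation is a hypothetical record of one SAST scan result.
type Attestation struct {
	ID        string
	Passed    bool  // scan result: true when the scan reported no open findings
	ScannedAt int64 // when the scan originated, as Unix seconds (assumed field)
}

// isLive reports whether incoming should become the live attestation.
// current is nil when no attestation has been stored for this control yet.
func isLive(current *Attestation, incoming Attestation) bool {
	if current == nil {
		return true // first attestation for SAST: store it, it is the live one
	}
	if incoming.ID == current.ID {
		return false // duplicate delivery of the live attestation (assumed rule)
	}
	if incoming.ScannedAt < current.ScannedAt {
		return false // scan originated before the live one: stale, ignore it
	}
	return true // newer scan replaces the live attestation
}

// replay folds a stream of events into the single live attestation,
// or nil when the stream is empty.
func replay(events []Attestation) *Attestation {
	var live *Attestation
	for i := range events {
		if isLive(live, events[i]) {
			live = &events[i]
		}
	}
	return live
}
```

Replaying a stream shaped like the one described earlier (a pass, a failing rescan, its duplicate, a passing rescan, a late-arriving old event, and a final failing scan) should leave the final failing scan as the live attestation under these assumptions.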

Anyway, it presented this whole decision matrix, and then the developer was feeling so good about their productivity gains. And we're an automated governance company, so we eat our own dog food, which means we really care about unit tests and we're sticklers about that.

So they're feeling really productive. They went back to ChatGPT and said, "Write unit tests for this Rego." And then this happened.

I'm sure some of you know where this is going. ChatGPT did it very quickly, and it wrote unit tests that passed for the code that it just generated. So it was like, great, this is awesome.

What the developer missed, though, when they were reviewing the code, was this single piece at the end that compared and said, if the timestamp is less than the live attestation, well, that means this is an old one, ignore it. And it should have returned false there at the bottom. But it didn't. It returned true.

So what it meant when we actually got into our QA testing process, we noticed some glitchy behavior, but it was only in certain instances. It wasn't enough that it was just screaming in our face. We only noticed it after a number of different test cases.

What happened here? We asked ourselves, why does this happen? But we were noticing a trend by now when this came up. So instead of saying, why did this happen, we asked ourselves, why does this keep happening?

Ever since we started adopting these generative AI tools, we started to get bugs that are really quirky and really, really tough to find. And we can see where this is going, which is that suddenly these mistakes that developers don't typically make when they write their own code and then write their own unit tests were coming through, and they were just going completely unnoticed.
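One way to picture why those tests passed anyway: tests generated from the code encode what the code does, not what was asked for. The sketch below is a hedged illustration in Go, not the team's actual Rego or tests. `generatedDecide` is a made-up stand-in for the inverted stale branch described above, `intendedDecide` is the behavior the developer described in prose, and `specCases` are expectations written from that prose rather than from the generated code.

```go
package main

// decideFunc is a hypothetical signature: given when the live attestation's
// scan originated and when the incoming one did, report whether the incoming
// attestation becomes live.
type decideFunc func(liveAt, incomingAt int64) bool

// generatedDecide mimics the bug from the talk: the stale branch returns
// true where it should return false.
func generatedDecide(liveAt, incomingAt int64) bool {
	if incomingAt < liveAt {
		return true // BUG: a stale attestation is treated as live
	}
	return true
}

// intendedDecide is the behavior described in the prompt: ignore stale scans.
func intendedDecide(liveAt, incomingAt int64) bool {
	return incomingAt >= liveAt
}

// specCase is one expectation taken from the requirement, not from the code.
type specCase struct {
	name               string
	liveAt, incomingAt int64
	want               bool
}

var specCases = []specCase{
	{"newer scan replaces live", 100, 200, true},
	{"stale scan is ignored", 200, 100, false},
}

// failures returns the names of the spec cases that decide gets wrong.
func failures(decide decideFunc) []string {
	var failed []string
	for _, c := range specCases {
		if decide(c.liveAt, c.incomingAt) != c.want {
			failed = append(failed, c.name)
		}
	}
	return failed
}
```

Under these assumptions, `failures(generatedDecide)` reports the stale case while `failures(intendedDecide)` reports nothing. A test that simply asserted the generated function's own output for the stale input would pass, which is the trap described here. In a real Go project the same table would normally live in a `_test.go` file and use the standard `testing` package.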

So we were doing some research on what it actually takes to write code. We're back to our original question of humans interacting with technology. How does that affect the way developers are using generative AI for code?

When you look in a developer's flow state, you all know what that is. The developer is in the zone, they're hyper-focused on what they're doing. They're able to use the parietal lobe, the part of their brain that's for visualizing complex systems, to see all the moving parts in their code and then isolate it to the specific problem that they're trying to work on. And then they use the prefrontal cortex for strategy and problem solving in that context. Reducing all of the moving parts just to one piece, getting in there and fixing it. And that's in their flow state.

So then the developer's working on this. They're in their flow state, they're very productive, they're in the zone, and they need to go write a prompt.

Well, when they go to write a prompt, it lights up these parts of the brain. The cerebral cortex is the thin outer layer of the brain that has to do with language and reasoning, Broca's area is for the actual creation of language, and Wernicke's area is for the comprehension of language.

So now you're going from logic, reason, and problem solving to the completely abstract nature of writing prompts in human English, in language, and that requires a complete context switch. We all know about context switching. We think it's really bad when a developer has to get pulled out for a meeting, so we try and reduce that. We try to take that cognitive load off the developers.

But now here we are giving them a tool and encouraging them to do one of the most severe cases of context switching that a developer could do, because they have to be hyper-focused on their code and then hyper-focused on their prompts. Because if they ask ChatGPT to do something and it doesn't come out the right way, now they've got to almost negotiate with it. We've all done that. And so it's a completely different headspace.

And the one space that's shared between the two is the anterior cingulate cortex, and that's used for error checking. It's part of the risk part of the brain. What it's used for in coding is comparing the expected output of the code to the actual output. The same thing with prompts. You write a prompt and you're expecting it to come out a certain way.

So this part of the brain is used for both, but it gets completely disoriented during that switch between the flow state and the prompt state. And that's the part that our developers weren't able to access to compare what they needed from ChatGPT to what it had actually provided them.

This sounded really familiar to us. I don't know for you guys that read the news a bit, especially the tech news. This sounds like the handoff problem. In the spirit of this, we'll use ChatGPT to tell us what the handoff problem is. But it's effectively the challenge of communicating the context of the situation from one autonomous system as it hands it off to a human user.

This is most notable from self-driving cars. There was that point in time a few years ago when Waymo was saying that when we build our self-driving car, it's not going to have a steering wheel or pedals. People say, that's crazy. But they explained that it actually could be better and safer that someone doesn't take over in a situation when the car no longer can drive itself.

It's one of those reasons why, when you get road rage, you get road rage when people cut you off while you're driving, but you don't get road rage when people cut you off while you're walking. It's because your emotions are heightened. You're in a defensive driving mode. You understand the risk of the situation.

And so if someone's not driving and then suddenly they need to be driving in an emergency situation, their brain's not in the right headspace. And they also don't have the context from the computer of, what does the computer know? What is the sequence of events? What does the person need to do to step in?

When I was talking with Stephen about this a couple weeks ago, Stephen Magill, he said, well, that's actually kind of like what happened in 2009. There were two fatal air crashes. There was Colgan Air Flight 3407, and then Air France Flight 447. Both fatal crashes. Everybody on board was killed. Both, after review from the FAA, had to do with the handoff problem.

The autonomous pilot system shut off. That context was not communicated to the human pilots in a way that was easy for them to take action, the appropriate action, and it caused deadly crashes.

After this, the FAA went to work and they put together a bunch of regulations in the years after that would solve this problem. They put together a bunch of these flowcharts, too, to help make it easier to understand. But when you boil it down, it came in with the UAS and the SRM controls, the unmanned aerial system and then the safety and risk management controls. And that had to do with alert mechanics, time buffer, all the way down to redundancy and monitoring.

But these are all sounding like a lot of the controls that we are starting to implement in our pipelines, right? The things that we're working here and we're talking about, all this automated governance, all these topics that have to do with the controls that are put in place to release software into production.

So I want to wrap this up quickly because I know we're at the end of the day. I'm going to end with this and just say, what can be learned from all of this?

I think what can be learned from our experience using generative AI in a very small scale, in a small startup, obviously not with the same challenges that you deal with at a large enterprise, but these challenges, at this moment in time, will scale. They're not going to get better in a large enterprise unless something's done.

So a framework is needed, and the help that I'm here looking for: what controls should be included in a UAS, sorry, I should say UAS, but the UAS and SRM framework for AI-assisted coding? And I would like your help with this problem going forward, because I think it's something that we're all going to start to see as we adopt this in our own companies.

Thank you.