Reaching for AI-Native Developer Tools

Las Vegas 2024

Senior Director of Research at GitHub Next · GitHub

How do we move past bolting AI onto the side of developer workflows? What do we need to reach for if we want developers to get the most value out of AI? Software development is at an inflection point — what does it look like in a world where AI is not just writing little bits of code, but is truly integrated into developer tools?

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

We heard earlier from Steve Yegge about how coding could be changing more than any of us realize. So I'm so excited that up next is Idan Gazit, Senior Director of Research at GitHub Next. That's the birthplace of GitHub Copilot, Copilot X, and Copilot Workspace. It's been amazing talking with them and learning about the wide open potential design space of the tools that developers could use to do their best work. In my interactions with them, I've become even more excited about all the experiments going on right now by so many smart and thoughtful people. My thanks to Anna D'Angelis for making this connection. So here to talk about where he thinks this space is going, and what he believes are the most important design principles, is Idan.

Idan Gazit

Thank you. I was like, "Where's the clicker? Where's the clicker?" Everything's okay. Hey, hi. Nice to see you all. How are you doing? Are you ready to talk about AI? Yes, exactly. Okay.

Quickly before we dive into this — afterwards we're going to do a Q&A at 12:30 in a room that's called Capri Two. If you want to talk one-on-one with a GitHub expert, scan this code and then you'll be able to book a session, and that's awesome.

With that out of the way, let's talk.

This photo is taken in about 1970. Who knows who this is? Come on, nerds — speak up. It's Ken Thompson and Dennis Ritchie, right? They're the inventors of Unix, an operating system that is in pretty much all of our devices in this room. They're using a teletype machine, which was at the time the best interface for writing software — because what it replaced was punch cards, with holes punched into specific positions to record ones and zeros. Each card would be about 80 columns wide, and you'd treat each card like a line of code. If you punched enough cards for all the lines of code, you'd have a whole program, and you'd keep it in something like a shoebox. You could write helpful notes for yourself on the sides of chunks of it to describe what that piece of your program does — like function names.

This photo with the teletype 50 years ago represents the first direct, low-latency interface between programmers and computers. You could just type into the machine to create software — but you still had to create that software line by line. Text editors in the day of the teletype were actually line editors. Humans would print out a line of code and then think really, really hard about what change they'd want to make to that line, because each round trip with a computer consumed ink and paper.

When we finally got CRT terminals, that cost disappeared — but editors didn't change overnight. At first they could display more lines, but edits were still made at the line level. They were still a lot like line editors. You couldn't just cursor around in the file and make edits wherever you wanted. You had to go to like line 53 and insert something at column 12.

As time went on, we figured out what was possible with the new technology that wasn't possible with the old — that you could type anywhere you wanted and move around a file, you could change the appearance of text to communicate things or offer shortcuts to the developer. It took years to really understand how to use this new medium of the screen to its fullest, instead of as a teletype that can show more lines.

When graphical displays started to happen, we saw the same sort of transition take place. In the beginning, IDEs were just terminal editors inside a window. It took some time for us to figure out what this new tech wanted to be used for. Sidebars, hovers, red squigglies, jump-to-definition, visual debuggers — all of this stuff was possible on the first day of the graphical interface, but we just didn't know about them yet.

That's where we are again today with AI. We've had graphical editors as this dominant tool for authoring code for 35 years, but we're only just over the threshold where AI has become useful for developers.

AI isn't a display technology like all of these other things, but I think we have to treat it like another landmark in this timeline. It's still early days. We're still at the inflection point. We barely know where and how to apply this new tech in a way that plays to its strengths and its natural properties. As a result, most AI things that I see are — apologies to the whole job — bolted-on chat interfaces. It's just bolting chat onto the side of everything that moves. And it makes sense, because everybody gets that this has some proven value to users and the models are a natural fit for this conversational prompt and response.

So my name is Idan, and I work on the future of developer tools at GitHub Next. Our job is prototyping things that, if they work, could have a huge impact on how software developers do their work and how the businesses that employ them go about their business. It's not really an academic research team, it's a prototyping team. One of our first efforts was the original GitHub Copilot, and since then we've done a lot more, mostly but not exclusively about AI. We're very small, about 20 people, like an early-stage startup. The team skews very experienced. Everyone on the team, managers included, is a maker. Everyone is a hybrid with a few different skills. We staff each project with sort of an interdisciplinary set of people who can go after that thing, full-stack end-to-end. There's a whole separate talk I could give about how we staff and execute. Come and talk to us later if you're interested in that.

Ultimately, success for us is measured in things that escape the lab — that people want to use every day. Right now AI represents the biggest opportunity for making the future of software meaningfully faster, safer, easier, and more accessible to more people.

Figuring out what comes next for software development means looking beyond these two modalities that were our starting point. Copilot succeeded on the back of ghost text for a number of reasons, but mostly because it let AI be helpful in the moment without being too demanding. It wasn't modal — I didn't need to switch out of "I'm typing and writing software" into "Copilot mode" and then see if I like the suggestion and switch back or not, as the situation warranted. The cost of the more modal things that we explored on the way to ghost text was that a bad suggestion costs you twice: once when you summon Copilot, and then another time to say "that's not exactly what I want." With ghost text, you just keep typing to ignore a suggestion. It made the bad suggestions forgettable, while the ones that you accept were the things you remember — the wins.

On the flip side, chat is everywhere. Chat is modal. You literally leave your editor and talk to the chat, maybe in a different app entirely. It requires almost no thought, no design. You can make it much better if you take the time to source good context, good information about your user's intent and their behaviors: what are they doing, what are they up to. But you don't have to. In order to ship something basic, you can just ship AI without crafting fancy prompts. You just ask the user to type what they want and send it along to the model.

When I step back and think about these two modalities, what they have in common is the limited ambition. Ghost text is inherently local and specific — it can't do anything anywhere else than right here and right now. And chat is the opposite. It's a discussion about your code, but actually applying that discussion to your code is left as an exercise to the developer. They need to just go and do it themselves.

What if I want to have a more actionable discussion about my code base? What if I just want it to do what I want, because I have a task? What if I want to generate more than just a little bit of code? What if I want to fix a bug or implement a feature? Because that's the actual job that we have.

Getting the models to generate lots of code across multiple files in a code base is a very different challenge from generating small completions. You need the code that you're generating to respect the rest of the code that's already there. Except that for most models today (I mean, Paige just showed this whole 2 million token thing, and that's really cool), a whole repository is usually larger than the prompt window, the maximum token budget of the model. You still need to leave room for the tokens you get back in the response. So now you have to find ways to chain multiple requests to these models together so that they all respect and build on one another. That's really hard.
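To make the budget arithmetic concrete, here is a minimal sketch of splitting a repository into several requests that each fit a context window while reserving room for the response. It is an illustration only, not how any GitHub tool works. The names `SourceFile`, `estimate_tokens`, and `pack_into_requests` are hypothetical, and the "about 4 characters per token" figure is a rough assumption; a real tokenizer would give different counts.

```python
from dataclasses import dataclass


@dataclass
class SourceFile:
    path: str
    text: str


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def pack_into_requests(files, context_window, reserved_for_response):
    """Greedily group files into batches that fit the prompt budget."""
    # The prompt can only use what is left after reserving response tokens.
    budget = context_window - reserved_for_response
    if budget <= 0:
        raise ValueError("response reservation leaves no room for the prompt")

    batches, current, used = [], [], 0
    for f in files:
        cost = estimate_tokens(f.text)
        if cost > budget:
            raise ValueError(f"{f.path} alone exceeds the prompt budget")
        if used + cost > budget:
            # Current batch is full: close it and start a new request.
            batches.append(current)
            current, used = [], 0
        current.append(f)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Packing only solves the size problem. It does nothing to make the separate requests "respect and build on one another", which is the harder part described above.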

The biggest problem is that the goal here is to give instructions to the model just like we talk to another human in natural language — but only code is as precise as code. Everything else might contain ambiguity. We have enough trouble with that just talking with other humans. So if we want something that isn't a gamble — and I know we're in Vegas — we need to steer a little. We need to make it cheap and easy for the human to spot the places where the model isn't doing what we want and to nudge it back on course.

That ambiguity feeds into another key challenge, which is predictability. The models are inherently non-deterministic. Every time we ask the model for something, even if we make the same prompt over and over again, we're pulling a handle on a slot machine — we're going to get back something different every time. And these slot machines behave worse — the odds are worse — when we ask them for ambiguous things. That's exactly what we're trying to reach for: natural instructions with complex responses. The models are getting better and our odds are improving, but to date we're only playing this low-stakes slot machine of generating these small completions. Effectively, there's a ceiling on suggestion size. And what are we trying to do? We're trying to break through that ceiling.
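One common source of that variation is that text generation samples from a probability distribution over possible next tokens. The toy sketch below shows the effect with made-up scores for three candidate tokens. The function name `sample_token` and the scores are hypothetical, and this is not a description of how any specific model or API is implemented.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick one token from a dict of {token: score}."""
    if temperature == 0:
        # Greedy choice: always the highest-scoring token.
        return max(logits, key=logits.get)
    # Softmax with temperature; subtract the max for numerical stability.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    weights = {tok: math.exp(s - top) for tok, s in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Made-up scores for illustration only.
scores = {"return": 2.0, "raise": 1.5, "yield": 0.5}
rng = random.Random(0)
draws = [sample_token(scores, temperature=1.0, rng=rng) for _ in range(200)]
greedy = [sample_token(scores, temperature=0, rng=rng) for _ in range(200)]
```

With a nonzero temperature, the same "prompt" (the same scores) yields a mix of tokens across draws. The greedy setting always returns the top-scoring token, which is the toy version of trading variety for predictability.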

If we want to steer, there isn't really a great moment to do it because the request-response cycle of talking with these models is atomic — you can't get into the middle of it. So tactically, there's only two real levers to pull. Either you try and craft a better prompt before you send it, or you can chain a few round trips together and do something smart in between those round trips. For the last few years, almost all the attention has been on this first lever of prompt crafting. What do we say to the models to get them to not just give back something useful, but to do it reliably over and over again? That's what gets us out of toy territory and into tool territory.

I don't believe that we're going to break the suggestion ceiling only using that first lever. Jumping straight from task to code was a viable strategy for generating a little bit of code. For larger scales, we have to start thinking in terms of multiple round trips. We have to offer some moments to steer.

So in the past year, we've been reaching for a process that gives developers this predictable, approachable steering wheel. We've prototyped it — it's called Copilot Workspace. It's currently in technical preview. It's not a product, it's not shipped. We're studying how our early testers use it and refining it. I think it's the first post-chat AI experience, and we're going to see a lot more of this as time goes on.

The process starts with a task — something you want to get done. The first step is using AI to turn that task into a spec: a plain English bullet-point list that describes how things work today and how it should work differently after the task is done. That's the first moment the human can steer — either to fix the model's understanding of what exists now, or to edit the model's understanding of what we want to happen. Once the human says that the spec is good, Copilot Workspace generates a plan, and that plan is another bullet-point list. It lists out every file that it needs to touch and what it plans to do in every one of those files. Once again, the human can edit that understanding to make sure the model is on track. Behind the scenes, each of those steps involves many round trips to the models.
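The shape of that process (task, then spec, then plan, then code, with a human checkpoint after the spec and after the plan) can be sketched in a few lines. This is only an outline under stated assumptions, not Copilot Workspace's implementation. `Model` and `Steer` are hypothetical stand-ins for a model call and a human edit step, the prompt wording is invented, and each step here makes a single call where the real system is described as making many round trips.

```python
from typing import Callable

# Hypothetical: any function that maps a prompt to model output text.
Model = Callable[[str], str]
# Hypothetical: a human review step that may edit a list of bullet points.
Steer = Callable[[list[str]], list[str]]


def bullets(text: str) -> list[str]:
    """Split model output into a list of bullet-point strings."""
    return [line.lstrip("- ").strip() for line in text.splitlines() if line.strip()]


def run_task(task: str, model: Model, steer_spec: Steer, steer_plan: Steer):
    # Round trip 1: task -> spec, then the human gets a chance to steer.
    spec = steer_spec(bullets(model(f"Write a spec for this task:\n{task}")))

    # Round trip 2: the (possibly edited) spec -> file-by-file plan,
    # followed by a second chance to steer.
    plan_prompt = "Write a file-by-file plan for this spec:\n" + "\n".join(spec)
    plan = steer_plan(bullets(model(plan_prompt)))

    # Round trip 3: the (possibly edited) plan -> code.
    code = model("Implement this plan:\n" + "\n".join(plan))
    return spec, plan, code
```

The point of the structure is that each later prompt is built from the human-edited output of the earlier step, so a correction made at the spec stage carries through to the plan and the code.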

Here's just a quick screenshot tour. This is an actual site for an unrelated project — it's got this code playground with samples in a few different languages. Maybe my task is "I want to add a new language to this playground." It isn't rocket science, but a lot of software development tasks are like this — there's this level of complexity. So I create a GitHub issue. I describe what I want, and I've been super helpful to the AI by being really specific and clear about what success looks like. Normally at this point, my job as an engineer is to start figuring out how it works today in the code base — where are the code samples, how are they loaded into the playground, how do I make syntax highlighting work for a new language.

Instead, I just click on this "open in workspace" button, and Copilot generates a spec for me. This is a pretty boring spec because the task is simple, but it does a pretty good job of saying how it works currently and how it will work in the future. And it hasn't written any code yet, but it's already looked at my code base and identified where it might need to make changes. That's already got incredible value to me as a developer, because most of the work of writing code is not writing code — it's figuring out where things are and how they work.

When we ask it to generate a plan, it gets right down to business. It plans to touch these three files, and it defines exactly what it's going to do in each. Like before, I can add or edit or delete things as I see fit.

The most boring screenshot in this is the end result. It generates code, and that code is in an editor — you can jump in immediately, type it and edit. And if you're not happy with the result, you can go back and edit the bullet points and it'll just regenerate.

But I think the most interesting thing about this process is how it prepares the human operator for reviewing the code. By the time the human is reading the diff, they have some clear expectations of what they're going to see. It doesn't feel like a robot gave me a huge homework assignment to review this massive PR. I have a lot more confidence that it's going to be worth my time to read.

More importantly, we're starting to attack the actual place where developers spend their time. So far, the industry has been focused on this parlor trick of generating lines of code. But when you look at the research on where developers spend their time, it's not on typing — it's on sense-making. It's on orientation. It's "how does it work, where is the work happening, where do I need to make these changes?" Frequently, things that are daunting are not daunting because we have to type a lot of code — it's daunting because it's not clear where to start, it's not clear what to touch.

So the future of tools that help developers is about that first category. It's less about being a second pair of hands and more about being their second brain.

This is just the beginning. Copilot Workspace isn't perfect, but we're not asking the model to guess what we want, and we're not having a theoretical discussion either. If we're looking to break through this suggestion ceiling, there's this Goldilocks pattern in the middle, and we're going to see a lot more things that sort of fit into this Goldilocks pattern.

Of course, it would be even better if we didn't need to do all the steering. Like, what if we could just write a spec in English and get an entire working program? We tried that, too. It's a project called SpecLang. We were able to get it to work in constrained scenarios, but not well enough to get it out of toy territory. In our prototype, we wrote a Hacker News reader app that worked for mobile devices with one plain-English file that compiled to a program. It feels like the future. Maybe this is how we make it so that lots more people who don't identify as software developers can make computers dance for them. Yeah, sure, business folks, but also school teachers and biologists and journalists, whatever. All of them want to be able to make simple software for themselves. But right now they don't think it's even possible. So how do we show them that it actually is possible, that it's within their reach, and that it doesn't need to be production grade? It's fine for it to be small, personal, a little bit janky.

It feels a little bit uncomfortable to entrust an entire app to AI, even a little one. But the reality is that all devs are already using software to translate what they're writing into software that runs. In fact, many languages and frameworks have multiple pieces of software that bake the code into the running artifact. So it's not too far-fetched to think that we're all going to be comfortable adding one more layer of abstraction to the stack — we already have so many. The thing is, it's not the same kind of layer. Compilers are deterministic, and AI is not. So there's an Overton window that needs to move here. We have to sweat more to figure out how to make AI reliable enough. Coding in natural language is coming — not today, but soon. And I think it might encourage a shift to a more modular style of programming. Like, imagine a tree of specs, each one describing a part of your larger software stack.

This has some really interesting implications for how editors are going to look in the coming years. I think a lot about this app called Soulver. It's a calculator and it predates AI. You type your formulas into the space on the left, and the sidebar on the right live-updates. That immediate feedback made it so much nicer to iterate. Soulver isn't just a calculator — it helps me think. Maybe using AI as a partner actually opens the door to using both sides as we iterate in place. Maybe we start by writing what we want in natural language, or maybe we start by writing code and seeing that the generated description matches what we expect the code to do. More interestingly, we can refine in place — editing whichever representation is most convenient for what we want to express, and the AI can help us mirror the refinements to the other side.

When you hold all this stuff out at arm's length, it starts to look a little like Donald Knuth's dream of literate programming, where code is mixed in with an equal amount of commentary describing the code. Almost nobody does this, though, because it's a lot of work. What happens when it's not more work? What happens when you can just assume that all the code in the world just comes with this built in?

What if it's not just natural language and code? What if I want to think about my code as a block diagram? What happens when I edit that diagram and the code gets updated? Right now, we've been so focused on how AI helps us to write code. But AI is going to deliver this much larger kick to how we read and understand code — which again is the bigger part of the job — because AI is incredibly good at converting information from one format to another.

Two years ago, I showed this to describe the number of AI tools, and how it's gone from lots of specialized ones to one generalized one — the LLM. But at the same time, that made that one tool accessible to a lot more people. OG machine learning was only accessible to PhDs with giga brains, and now everybody who can write English can use AI. And now we're seeing the pendulum start to swing the other way — there's this Cambrian explosion of AI tooling for developers. We're just starting to help all these people, not just with their typing, but with their thinking, with their cognitive load.

What excites me is that we can both raise the ceiling to help professionals, but also lower the floor to enable people who might not be able to do this stuff otherwise to get in the door and enjoy some of the fruits of computers — that we all have and use, but only a small set of us know how to give instructions to.

When we talk about the promise of AI, that's the generational opportunity that AI is offering us. It's on us to sweat, to do the hard engineering, to make it good, to figure out what it wants to be — to not rest on our laurels. AI is waiting for us to figure it out so that we can take full advantage of this stuff, and we have to help each other over this hump.

Maybe that's my ask of you: to help, to sweat for these things. Thank you so much.