How Google Is Radically Transforming Enterprise Software Development With Gemini
This talk dives deep into how Google DeepMind's latest models, Gemini 1.5 Pro and Flash, are transforming the full life-cycle of enterprise software development. We'll show you how Gemini's massive context window and next-level reasoning are already boosting our own internal development velocity at Google, and how it can do the same for developers at your company. Get ready to see how Gemini 1.5 delivers full codebase understanding, automates away tedious code reviews and bug fixes, prioritizes new features and issues, and even tackles complex migrations with ease. Plus, we'll reveal how Google's unparalleled developer telemetry, fused with Gemini 1.5, unlocks actionable insights to optimize your workflows and ship higher-quality code, faster. Come see the future of enterprise dev, powered by the game-changing capabilities of Gemini 1.5.
Full transcript
The complete talk, organized by section.
Host Intro (Gene Kim)
So given all the things that Patrick talked about, I am so delighted about our next speaker, Paige Bailey from Google. She was formerly lead product manager for generative models of Code AI, PaLM 2, and later the Gemini family of LLMs. I'm so delighted that she'll be sharing her journey, going all the way back to Google DeepMind, to working on multiple generations of frontier models.
Among other things, she'll be sharing with us how Google uses machine learning and AI internally, and how Gemini has been growing internal market share for AI that powers so many of the legendary Google properties beyond just dev productivity. So my thanks to Amanda Lewis for making this amazing connection. Here's Paige.
Paige Bailey
Okay, excellent. Thank you so much.
Greetings, everyone. I'm so excited to be here and to talk a little bit about how Google is using Gemini and AI throughout its entire software development life cycle. So really, generative AI is transforming how all of us do our jobs, but particularly how Googlers are building all of the products that people are using every day. So things like Search, Sheets, Docs, YouTube, all of the above.
And to be honest, working at Google sometimes feels a little bit like living in the future. And it has felt — I heard Steve Yegge talk a little bit earlier — it has felt very privileged, right? Like I first joined Google back in 2017, and we've had a long history and track record of producing these very exciting models in DeepMind and Brain. But then the only people that really got to test them out were the teams internally. So the researchers, the engineers, getting to play with them in internal sandboxes, but never actually getting them out into the world.
And it's so exciting for me personally that we now have our latest flagship model. So Gemini — kind of the largest and most powerful model that DeepMind has ever created — released into the world, powering our products at Google, and then also released as an API. So we get to see what everyone else gets to build with our models.
There are three different variants available via API. I strongly encourage you to take a look at Gemini 1.5 Flash and 1.5 Pro if you haven't already. And they've been having pretty exciting results. So how many folks have heard of the LMSYS leaderboard? Couple of hands. The previous talk touched on evaluation of machine learning models, and this is one of the community-driven, externally benchmarked scoreboards that measure model performance against each other. And one of our latest Gemini models, towards the beginning of the month, surpassed not just OpenAI's models, but also Anthropic's. So again, strongly recommend taking a look if you haven't already.
More importantly, since these models are powering everything that we do for our billions of users, they have to be small, performant, efficient, and very cheap, very fast. And so our Gemini 1.5 Flash model even kind of built a new quadrant in the model performance space, because it's so cost efficient without sacrificing performance. And this is also the model that powers many of the experiences that you use every day in Google products. It's available as an API for you to use too.
We're embedding Gemini models in Chrome and in our Pixel devices. So you can have models local without sending data to a server — kind of kept local to your company. Our Gemma models are also open source and can be run locally.
And I want to talk in particular today about a feature of Gemini that enables many of our software development use cases: this longer context window. So you might've heard a little bit about tokens — input tokens, output tokens. Gemini's token window right now is 2 million, compared to 128K for GPT-4 and 200K for some of Anthropic's models. But what does this really mean? You know, I think we throw around the term token a lot in the research space, but we never actually take the time to tell what this actually means from a use case perspective.
And when you think about these kinds of quantities of tokens, you're thinking of, you know, all the emails that you've sent over the past year, all of the calendar invites and all of the meetings, and all of the meeting summaries that you've had for the last year, the entire first season of Silicon Valley. You know, like lots and lots of data that you can just give to the model and ask the model to help you analyze it, to summarize it, to act on it, without having to go through this process of retrieval. So no RAG required, and without having to do things like fine tuning.
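As a rough illustration of what those window sizes mean in practice, the sketch below estimates whether a pile of text would fit in each window quoted in the talk. The four-characters-per-token ratio is a common rule of thumb and an assumption here, not Gemini's tokenizer, and the dictionary keys are hypothetical labels rather than real model identifiers.

```python
# Rough check of whether a pile of text fits in a model's context window.
# ASSUMPTION: ~4 characters per token is only a rule of thumb for English
# text; real tokenizers vary, so treat every result as an estimate.
CHARS_PER_TOKEN = 4

# Window sizes as quoted in the talk, in tokens. The keys are hypothetical
# labels for this sketch, not official model names.
CONTEXT_WINDOWS = {
    "gemini_2m": 2_000_000,
    "gpt4_128k": 128_000,
    "anthropic_200k": 200_000,
}


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text`, rounding up."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def windows_that_fit(documents: list[str]) -> list[str]:
    """Return the labels of every window large enough for all documents."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return [name for name, size in CONTEXT_WINDOWS.items() if total <= size]
```

For a real decision you would count tokens with the model's own tokenizer; this only shows the order-of-magnitude reasoning.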
And this is really, really tremendous for people who don't necessarily have the time to set up those systems. Because you can prototype, you can implement things, you can build out proof of concepts without having to go through all of the engineering rigor to get that infrastructure built. So it's been very powerful for me personally, and we'll see some examples of that later too.
And I really want to talk to you about Code AI. And Code AI is this collection of projects that we have within Google and are beginning to release externally, that's really accelerating every single aspect of our software engineering workflow.
Google cares a lot about software engineering, so it's kind of in our DNA — we've written books about it. But it's a particularly interesting research lab for software engineering because over the last 25 years, we've been capturing telemetry on everything that our developers and engineers do internally. So if you encounter a bug, we know everything that you need to do in order to resolve it. We know all of the logs associated, we know all of the compute impact for running a piece of code. We know if you tried out a couple of different APIs and then settled on a different one. We know if you looked up documentation, if you asked a chat message, if you got a response back from a Q&A system, if you went to the micro kitchen to get a snack — we know how every single one of those aspects of the developer workflow impacts that final product.
And if you want to build an AI system that understands tool use, that understands function calling, that understands how to select different APIs, or how to debug different aspects of a software engineering workflow, you have to have that fully detailed telemetry. And like I said, 25 years worth. And we've been capturing it from some of the brainiest engineers that I've ever worked with.
When you sum it all up, it ends up being just over 500,000 aggregate years of software engineering activity and over a trillion tokens of this really high impact data, including over 80 million high quality code review edits. So if you want to apply machine learning to code review, this is the only way that you would be able to do such a thing.
And what this means for developer productivity is kind of a lot, right? So there are many talks today about how AI is accelerating software development. But we're really powering and coupling all of our Gemini model features with this rich multi-trillion token data set to enable lots of experiences for engineers, other than just code generation. It's from everything from designing to building to testing, to maintaining, to troubleshooting.
And this is a non-comprehensive list, but I think many folks in the audience also realize that a lot of software engineering happens outside of the context of just your IDE. You know, you might have a design doc that's implemented in Docs or as a markdown file. You might have documentation that exists in an external source. You might have logs and telemetry that exist in other silos. You might have a lot of design discussion happening ad hoc in chat messages or emails or Slack conversations. And really all of that tooling is part of this process that we have of building software. It's not just what we're doing in VS Code or in our favorite editors.
So code completion: obviously important. Now 26% of all of the code at Google is generated by machine learning. I've seen charts to the effect of, you know, for a given week, 50% of the code that gets checked in is generated by machine learning. But it's not just that we are writing code faster. You know, that's accelerating one small aspect of the developer workflow, about 6% of where they spend their time. But we're also accelerating code review.
So as you check in code, you can automatically get reviews. If your reviewer has feedback for you (you know, please, I have a nit, I want you to change the variable name, I want you to use this API as opposed to this other one, perhaps try to break apart this single function into multiple functions that are a little bit more modular), all of these feedback requests can be automatically applied for the engineer. So all you have to do is say, like, yes, accept this machine learning edit, and you're off to the races. No need to incorporate the reviewer feedback yourself.
We're applying it to code performance. So remember how I told you that we have logs as well as performance stats for everything, for all of the servers that we've been deploying the code to. We also know if you're writing a piece of code that perhaps is inefficient. Like maybe you have nested for loops, or maybe you're looping over an array that's sorted instead of unsorted, and you could get the same output without having to go through that additional computation. With Gemini, we have the ability to make recommendations for code so that it becomes more performant, more efficient. And when you scale this out to Google scale, single code changes made to C++ or Python code can realize tens of millions of dollars, hundreds of millions of dollars worth of savings.
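To make the nested-loop case concrete, here is a minimal, hypothetical before-and-after of the kind of rewrite such a recommendation might suggest. The function names are invented for illustration and the example is not taken from Google's tooling; both versions return the same result, but the second replaces the inner scan with set lookups.

```python
def common_ids_nested(a: list[int], b: list[int]) -> list[int]:
    """Quadratic version: for every item in `a`, scan all of `b`."""
    result = []
    for x in a:
        for y in b:
            if x == y and x not in result:
                result.append(x)
    return result


def common_ids_set(a: list[int], b: list[int]) -> list[int]:
    """Same output, roughly linear time: membership tests use sets."""
    b_set = set(b)       # average O(1) lookup instead of scanning `b`
    seen = set()         # tracks what was already emitted
    result = []
    for x in a:
        if x in b_set and x not in seen:
            seen.add(x)
            result.append(x)
    return result
```

Both functions keep the order in which shared values first appear in `a`, so the faster one can stand in for the slower one without changing behavior.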
We're also working on code migrations. So this is an example — you can take a look at it if you're curious. I added the link to our research blog and to our internal developer engineering blog. We're working on software migrations such that you can take a code base and ask for changes to be made. So here's a code base. Here's, you know, the new API spec. Please look through the code base and implement all of the changes that would be needed in order to implement this new API design.
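The "here's a code base, here's the new API spec" request can be sketched as plain prompt assembly. This is only an illustration under stated assumptions: the delimiter format, the instruction wording, and the function name `build_migration_prompt` are all hypothetical, and the step of sending the prompt to a model is deliberately left out.

```python
def build_migration_prompt(files: dict[str, str], api_spec: str) -> str:
    """Pack a new API spec and a whole code base into one long prompt.

    `files` maps a relative path to that file's source text.
    ASSUMPTION: the "=== ... ===" delimiters and the instruction text are
    invented for this sketch; they are not a format any model requires.
    """
    parts = [
        "You are migrating a code base to a new API.",
        "",
        "=== NEW API SPEC ===",
        api_spec.strip(),
        "",
    ]
    # Sort paths so the prompt is deterministic from run to run.
    for path in sorted(files):
        parts.append(f"=== FILE: {path} ===")
        parts.append(files[path].rstrip())
        parts.append("")
    parts.append(
        "Implement every change needed to adopt the new API "
        "and return the full updated files."
    )
    return "\n".join(parts)
```

With a long context window the whole repository can go into a single request like this, which is what removes the need for a retrieval step.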
We're also generating documentation at full code base scale, and not just in English. So imagine you're a new developer coming onto a project. How do you get familiarized with the history of the project? Why were design decisions made? What is this piece of code actually doing? You know, I used to work in the space sciences and there was a lot of COBOL code. Sometimes there was FORTRAN code, and the person who had written it had retired easily 20 years ago. So it was really challenging to find someone who could give insight into what the code was actually doing, and more importantly, which systems it impacted. So this is really powerful to me to be able to generate documentation automatically, especially as you introduce changes to the code, automatically propagating those to the documentation, but also being able to explain new code bases to engineers and to get them quickly up to speed.
This is an example using AI Studio. If you haven't used AI Studio already — I uploaded two versions of the code base. So Flax 0.7.5 and Flax 0.8.5. I asked for Gemini to create a blog post summarizing all of the changes, which it was able to do within the space of 30 seconds, after ingesting all of the tokens for those code bases — over 750,000 tokens. And it was also able to upgrade a tutorial that had been using Flax 0.7.5 to 0.8.5 with my company's best software engineering conventions.
You can also generate detailed friction logs from user videos. So user experience researchers, they often go through this path where they might sit with a user, ask them to test out a feature, record the entire painful process, and then afterwards meticulously document 30 minutes, an hour of video content of what a user is doing, where they ran into trouble, and how we might create better documentation or better product features in order to ease that pain over time. Gemini is able to do this kind of out of the box. So you can generate the detailed friction logs from user videos. I'm gonna go over to the screen and click the play button just in case it didn't play by itself. There we go.
And all you have to do is say, hey, here's the video. Generate a detailed friction log, and then give me a summary of which product features we might implement and prioritize them. When you think about how much this could empower the entire UX space, it's huge. Usually these researchers are limited to just, you know, 10 user sessions, perhaps 12 user sessions at a time. These studies take months because it takes so much detailed discussion to be able to film the user videos and to be able to document them. Now you could imagine worlds where perhaps you have thousands of user videos, and you're able to kind of cluster and understand based on these utilization patterns where people are running into trouble, and how you could better help them.
And let's move on to the next example.
Summarizing and prioritizing bugs, support tickets, and feature requests. Now, as a former product human — I love PM work, I think it's very important. I am back on the engineering ladder, which I also am very excited about. But when I was a product lead for frameworks and for APIs, you often had to summarize large swaths of user information, synthesize multiple support tickets. And at Google scale, this ends up being on the order of tens of thousands, hundreds of thousands of pieces of feedback across Discourse forums, Stack Overflow, GitHub issues, support tickets, you know, perhaps pieces of feedback that people have given on social media. And it's really, really challenging to be able to read through that all yourself. It's a herculean task, pretty much impossible. But if you pull in all of that information into a model with a super large context window, you're able to automatically cluster all of those pieces of feedback into distinct categories, help prioritize them. And then also kind of decide at scale and with a fully representative sample, what are the most painful aspects of this product for your users? Otherwise people are just kind of in this weird place where you do ad hoc PM work, pulling in six customers, hoping that they're a representative sample and implementing their feature requests, as opposed to getting a full analysis of all of your tens of thousands, hundreds of thousands, millions or billions of users.
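Once each piece of feedback carries a category, the prioritization step is simple counting. The sketch below assumes the labeling has already been done, for example by a model that read the full corpus; the category names and the function `summarize` are hypothetical.

```python
from collections import defaultdict


def summarize(labeled_feedback: list[tuple[str, str]]) -> list[tuple[str, int, str]]:
    """Rank feedback categories by volume.

    `labeled_feedback` is a list of (text, category) pairs.
    ASSUMPTION: the category labels come from an earlier step (such as a
    model clustering the raw feedback); this function only aggregates.
    Returns (category, count, first example) tuples, largest group first,
    with ties broken alphabetically by category.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for text, category in labeled_feedback:
        groups[category].append(text)
    ranked = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(category, len(items), items[0]) for category, items in ranked]
```

The point of running this over the full corpus rather than a handful of customer calls is that the counts reflect every user who wrote in, not a sample that may not be representative.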
Excellent. And then lastly, I want to talk through a couple of things that we're doing with our mobile devices and also on Chrome — self-navigating bots that use function calling behind the scenes in order to either navigate you to a website, select different products from a screen, be able to accomplish tasks that you might do every day. Again, using these detailed software engineering event timelines.
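Function calling, in general terms, means the model emits a structured request and the host program runs the matching function. The sketch below shows only that dispatch step. The JSON shape, the tool names, and the registry are assumptions made for illustration and do not reproduce the wire format of Gemini or of any specific API.

```python
import json

# Registry of functions the host is willing to run on the model's behalf.
TOOLS = {}


def tool(fn):
    """Decorator that registers `fn` under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def open_url(url: str) -> str:
    # Hypothetical tool: a real agent would drive a browser here.
    return f"opened {url}"


@tool
def select_product(name: str, quantity: int = 1) -> str:
    # Hypothetical tool: a real agent would click an element on the page.
    return f"selected {quantity} x {name}"


def dispatch(call_json: str) -> str:
    """Run one tool call of the assumed form {"name": ..., "args": {...}}."""
    call = json.loads(call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("args", {}))
```

In a full agent loop the return value would be fed back to the model, which then decides on the next call or stops.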
Robotics: we're generating code behind the scenes to control the robots that we have within our micro kitchens. So these are live videos from the Google micro kitchens in Mountain View. They're actually very good at cleaning up spills, which is quite nice. So if anybody ever wants to come visit, they're in one of our buildings on Pear Avenue, kind of by the Computer History Museum.
And then also software agents. So being able to build and deploy teams of software engineering agents to rapidly iterate and to move from an MVP to increasingly more complex designs, based on a high level spec that you might've generated. So the way that you can think about this is you might have created a PRD — you feed the PRD into Gemini. Gemini automatically decomposes this PRD into subtasks, creates an individual bug or Jira ticket for each of the subtasks. And then either assigns an agent or assigns a person in order to accomplish each of the subtasks. And you just kind of rapidly loop through that until the software is built.
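The PRD-to-tickets loop described above can be sketched with a few small functions. Everything here is a stand-in: `decompose` treats bullet lines as subtasks where the real system would ask the model, the assignee names are hypothetical, and `work` is whatever callable represents an agent or a person finishing a ticket.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable


@dataclass
class Ticket:
    ticket_id: int
    title: str
    assignee: str
    done: bool = False


def decompose(prd: str) -> list[str]:
    """Stand-in for the model: every line starting with '-' is a subtask."""
    subtasks = []
    for line in prd.splitlines():
        stripped = line.strip()
        if stripped.startswith("-"):
            subtasks.append(stripped[1:].strip())
    return subtasks


def file_tickets(subtasks: list[str], assignees: list[str]) -> list[Ticket]:
    """Create one ticket per subtask, assigning round-robin."""
    if not assignees:
        raise ValueError("need at least one assignee")
    pool = cycle(assignees)
    return [Ticket(i, title, next(pool)) for i, title in enumerate(subtasks, start=1)]


def run_until_built(tickets: list[Ticket], work: Callable[[Ticket], bool],
                    max_rounds: int = 10) -> int:
    """Loop over open tickets until all are done; return rounds used."""
    for round_number in range(1, max_rounds + 1):
        for ticket in tickets:
            if not ticket.done:
                ticket.done = work(ticket)  # True means the ticket was finished
        if all(t.done for t in tickets):
            return round_number
    raise RuntimeError("tickets still open after max_rounds")
```

The `max_rounds` cap is a design choice for the sketch, so that a ticket nobody can finish stops the loop instead of spinning forever.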
And then the bonus. You know, I've told you about all of this cool stuff that Google is doing, so what's in it for y'all? Like, how do you get access to these features? Have we already deployed them in a way that you can use for your company? And we're doing this today across every single one of our product surfaces at Google, personalized to your code bases, personalized to your organizational context.
And it's not just, like I mentioned, within the context of this IDE — it's everything. It's Drive, it's Docs, it's Sheets, it's, you know, the information that you might have in a Meet conversation, all of the telemetry that we're capturing for all of the DevOps tooling that you're using on Google Cloud. And then sort of getting this holistic understanding of your organization as opposed to these piecemeal silos that you might have just working within one of them. Because again, software engineering is a team sport. It's cross-disciplinary, so it's software engineering, but also multiple other functions. And we are multifaceted humans. You know, we write code some of the time, but oftentimes we're doing more than just that.
Also, if anybody in the audience is a startup working in the AI space, we have a really compelling startup program, with up to $350,000 worth of GCP credits over the span of two years. So please contact me afterwards if you have any interest in joining. We also have a trusted tester program for the newest versions of our Gemini models. So if you want super early access to anything as soon as it comes out, this would be where you would want to jump on board.
And I also want to just close by saying, this entire process is really magical to me because it reduces the friction from many of us having an idea to actually getting it out into the world in a production capacity. You know, there have been so many instances where there are video games that I wanted to create, or VS Code extensions or Chrome extensions, that I was able to get perhaps 85% of the way there, but then not that last 15%. And hopefully this tooling not only gives us the ability to generate the code, but also to deploy it, maintain it, make it production ready and make it secure.
And so, thank you so much. Please, if you do anything today, go try out Gemini 1.5 Pro and 1.5 Flash in aistudio.google.com. And then also tell us where it works, tell us where it doesn't, and think about how you might incorporate it into some of your products and your workflows. Thank you.