How Google Is Radically Transforming Enterprise Software Development With Gemini


Connect Oct 2024

How Google Is Radically Transforming Enterprise Software Development With Gemini — A Discussion with Gene Kim

Paige Bailey, GenAI Developer Relations Lead at Google, traces Google's decade-long investment in machine learning and explains why those capabilities remained locked inside the company until external pressure — most visibly the release of ChatGPT — forced a shift toward public APIs and production-ready models. She describes how Google's internal engineering culture is being transformed by Gemini, as teams that once maintained dozens of narrow, single-task ML models are consolidating onto a single family of multimodal frontier models — freeing engineers to focus on user impact rather than model maintenance. The conversation also explores Gemini's long context window, dramatically falling inference costs, and emerging use cases in video and audio understanding that Bailey argues will make AI as ambient and inexpensive as breathing air.

In this talk, you'll learn how frontier models are replacing bespoke single-task ML pipelines inside large engineering organizations, why KPI improvements are driving enthusiastic internal adoption rather than resistance, and how falling token costs are unlocking entirely new categories of real-time, multimodal applications.


Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

Another person that I was so thrilled to have speak in August at the conference was Paige Bailey from Google. She was formerly lead product manager for generative models for Code AI, PaLM 2, and later the Gemini family of LLMs. I was delighted that she has an exciting new role and that she gave us a peek about what Google and DeepMind have been doing around AI for the last decade. It was such a marvelous glimpse into these amazing technologies that were developed, that in some ways we only got to see after ChatGPT came out. Paige, I'm so delighted that you're here with us again today.

Paige Bailey

Yeah, absolutely. Thank you so much for having me. I also got to hear a little bit of Matt's presentation just a little while ago and really, really appreciate everything that he was saying about how we need to be encouraging junior developers and keeping those connections between humans alive. It's super important, even as we start embedding AI into all of our developer tools.

Gene Kim

Yeah, it was so great. In fact, I had mentioned that he had reached out to me after Steve Yegge wrote that Death of the Junior Developer post. It was amazing to meet someone who's spent over a decade studying this phenomenon in other industries. What a fun world we live in.

Paige, I so much enjoyed your presentation in August. Maybe just to refresh people's memories, can you share the long history that Google has had in AI going back over a decade, and maybe what the ChatGPT moment meant for you in terms of actually getting some of these capabilities to the outside world?

Paige Bailey

Absolutely. I don't think it's a secret that Google cares a lot about machine learning. I think that the company was built around this premise of organizing the world's information and making it useful, and often making it useful does translate into building models, especially predictive models or generative models.

Google started doing machine learning quite a while ago. We released a couple of demos on I/O stages related to generative models: the LaMDA models, the first PaLM model. I got to help out with PaLM 2 when I came back to Google back in, I guess, 2022, right before ChatGPT came out.

The path back then was basically these models would get built, you could use them internally, and they were improving products that might be internal or improving, behind the scenes, some of the products that we had externally. But they were never available as APIs. It often felt a little bit like living in the future or living in a playground where you had these internal interfaces for Googlers to play with the models, but they were never released.

Really, the first model that we had available as an API, available for people to use, was PaLM 2. We quickly got it integrated into a lot of Google's products and services. I think the plan when we first started training it was that it would perhaps be a paper, perhaps it would be a website with a couple of examples, and then it would be on to the next model. But it was amazing to get it to be in a production-quality state, to get it released, and then to be part of the Gemini training team.

Gemini is our latest family of large language models. They're multimodal. They can understand video and audio and images and text, as well as full code bases. We've seen pretty significant results with them both internally for our own products and for the companies that use them.

Gene Kim

One of the things that you had mentioned in a conversation was that something happened in the industry that suddenly made it very important to get these capabilities to market so anybody could use them. Am I capturing the tide change accurately?

Paige Bailey

Absolutely. I think the eureka moment for me, in terms of LLMs as applied to software and applied to the things that we do every day, was when Copilot was released at GitHub. At that time, I was at Microsoft and GitHub working on developer tools. I remember very vividly the Slack channel that we had with OpenAI, where we were testing out the initial features of Copilot.

At first, the extension that the team had created was just calling, I think, the GPT-3 API at the time. This was before GPT-3.5 was released. The closest equivalent that existed in the dev tool space at Microsoft was IntelliCode, which was based on GPT-2. It offered very small completions, on the order of one to three lines. Copilot was a completely different ball game. Instead of giving one to three lines of code, it was giving these full detailed code blocks, even with explanations. You could get better results based on how well you could comment or docstring out a function definition.

It just felt so magical. That was really the time when I personally realized we have to start getting these things embedded into not just developer tools, but into all of the products. I think the world quickly followed suit, ChatGPT being the thing that really broke into public awareness. My mom has tried ChatGPT, and my relatives will give me product feedback now about the Gemini app or about whatever else they're using. It's really inspired, I think, all of our research labs to not be so inward focused and to not be so research or academically focused, but to really care about getting this goodness into products.

Gene Kim

One of the things I found so amazing and inspiring in our conversations was that you had described this internal dynamic of Gemini gaining internal market share within Google. One of the articles I'll never forget is a story about how Bloomberg spent $10 million training a GPT-3.5-class model, only to find that it was completely outperformed by the more general GPT-4. The same thing happened to Med-PaLM, a specialized version of PaLM or PaLM 2. Dr. Peter Lee wrote about this in The AI Revolution in Medicine, where the frontier models just continually outperform the specialized models. You had some amazing examples of something similar happening inside Google. Can you talk about that dynamic?

Paige Bailey

Google, being a machine learning company from day zero, had a lot of teams internally that had built out these single-task models to do very specific things. Examples would be that you would scrounge about, find a whole bunch of data, get it polished, get it preprocessed, get it into a structure that was well formatted for building machine learning models. This would take weeks, sometimes months.

Then you would have to go through the work of trying to figure out what kind of model to build. Should it be something more traditional, names that folks probably remember like logistic regression or decision trees or SVMs or gradient boosted trees, or whatever it might be? You would have to figure out which made sense for your use case. You would have to tune hyperparameters either with an automated tool that brute forced which features you should use, or you would have to think intuitively about which ones made sense for your use case.

This would just be so frustrating and painful, and it would take forever. It would be very tedious and very boring for the people actually doing the work. At the end of the day, you had a model that was just good for a single task, and then you would have to keep it consistently refreshed. You would have to stand up an entire team to maintain the model over time and to make sure that there aren't things like data drift. We had many, many teams across the company that were doing all of this very important, but also tedious, work.

In the developer tool space, you would have a dedicated model for code completion. You would have a dedicated model for build repairs. You would have a dedicated model for docstring generation or for type checking or for help with generating comments for code review, whatever these might be. But they were single-task models that had to be maintained and built for each one of those things.

Today we're all just using Gemini-based models, and it's pretty magical. For things like code completion, you have latency constraints, so round trip you would probably need things to be less than 500 milliseconds. Whereas for tasks like assistance with migrations or full code-base documentation, you can probably stand to wait a minute. Being able to leverage the different sizes of Gemini models to help solve specific tasks based on the user requirements, and really as a team not focusing on all of these tedious operations that are outside your core use cases, means being able to say, hey, my users have this need. I'm going to pick a model off the shelf that meets my cost constraints, that meets my latency constraints, and that's easy to maintain. It really means that you can give teams the ability to do their best work as opposed to having to build up all of this infrastructure that's outside of that.
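The selection step described above (pick an off-the-shelf model that meets the latency and cost constraints of the task) can be sketched roughly as follows. This is only an illustration: the tier names, latencies, and prices are hypothetical placeholders, not real Gemini model IDs or published figures. Only the 500-millisecond and one-minute budgets come from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelTier:
    name: str                        # hypothetical label, not a real model ID
    typical_latency_ms: int          # assumed round-trip latency
    cost_per_million_tokens: float   # assumed price in USD
    quality_rank: int                # higher means more capable (assumed ordering)


# Illustrative catalog; every number here is made up for the example.
CATALOG = [
    ModelTier("small", 200, 0.05, 1),
    ModelTier("medium", 1_500, 0.50, 2),
    ModelTier("large", 20_000, 5.00, 3),
]


def pick_model(max_latency_ms: int, max_cost_per_million: float) -> Optional[ModelTier]:
    """Return the most capable tier that satisfies both constraints, or None."""
    candidates = [
        m for m in CATALOG
        if m.typical_latency_ms <= max_latency_ms
        and m.cost_per_million_tokens <= max_cost_per_million
    ]
    return max(candidates, key=lambda m: m.quality_rank, default=None)


# Code completion: round trip under 500 ms, as mentioned in the talk.
completion = pick_model(max_latency_ms=500, max_cost_per_million=1.00)
# Migrations or full code-base documentation: waiting a minute is acceptable.
migration = pick_model(max_latency_ms=60_000, max_cost_per_million=10.00)

print(completion.name)  # small
print(migration.name)   # large
```

The same function returns `None` when no tier fits, which is the signal that the constraints need to be relaxed.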

Gene Kim

We talked before about the famous Google paper on data technical debt, which basically says here's the entire set of work you have to do to get this much value. It underscored that there must be such business benefit in creating these models, but it is an enormous undertaking to keep these models running and performing well. We were talking about Google Search as a place where you would see an ML model behind the scenes, such as in the cards when you look up airline routes or sports teams, and behind that is an ML pipeline and a team to keep that data pipeline healthy. Am I capturing that correctly?

Paige Bailey

Yeah. Google Search would be an example of this. There are also other machine learning teams throughout the Google product areas. You might have experienced some of this when seeing the ranking recommendations on YouTube or seeing the automatically generated dialogue or the chapters. It's really nice. I encourage everyone to go test this out on Gemini and AI Studio, to just do video analysis or attempt to replicate some of these tasks that would historically have been traditional machine learning tasks, but just have the large language models handle the work that they do.

Gene Kim

One of the things that you had shared with me sounded so dazzling that I had to ask you: what did it feel like for an owner of one of these bespoke custom models that do a single task when they decided to migrate to Gemini? I had asked, on a scale of one to 10, how did they feel? One was existential dread or anger, like my entire self-identity is that thing and you took it away, versus 10, like oh my gosh, we've been liberated from having to keep this thing running. You had this amazing response.

Paige Bailey

I think as a researcher, as somebody who is a model builder, you're very well aware that your entire life is going to be trying to get a higher quality or to hill-climb on a specific metric that you really care about. Model builders do anticipate that models will get better over time and that this model evolution process is something that should be expected.

As an example, one of the teams that was building out a dedicated internal model for a single task or just a few tasks was called Sitsi Monster. Now they're working on improving Gemini's coding capabilities. It isn't so much that a team gets spun down as much as they just get merged into the larger model-building team. Even better, instead of having to do all of this work of building a model such that they can understand how the model could be applied to software development workflows, they can just understand: all right, we already have this model that can do many things. How do we integrate it even more impactfully in the software development lifecycle?

As a researcher, if the metric that you're trying to improve is developer productivity, I personally don't care which model I'm using as long as it helps me hill-climb on that metric of making software engineers more productive or feel happier or reducing the organizational or coordination headwinds required to build software.

Gene Kim

One of the most compelling things that you said to motivate that is that teams get measured on whether they are moving the needle on their KPIs. It is not an overstatement to say that moving from their model to Gemini is increasing their KPIs, and so they are very happy about this. Am I capturing that correctly?

Paige Bailey

Exactly. Your KPI might be how many lines of code are generated by AI, or how much faster the code review process is now (can you get CLs through review faster without any rollbacks?), or whether a larger portion of your code is documented: instead of being 52% documented, now it has much more coverage. All of those things are KPIs that make software engineering teams really delighted to see improve over time. It doesn't matter at the end of the day if it's a dedicated single-task model that's making the KPIs go up or a model that you take off the shelf.

Gene Kim

In a previous conversation, I had mentioned that Dr. Erik Meijer was a tremendous influence on programming languages. His tenure at Microsoft included contributions to C#, Haskell, LINQ, and so forth. He described how, during his time at Meta, he had teams of PhDs working on models for code generation, bug finding, and bug fixing when ChatGPT came out on GPT-3.5. He said it felt like we were digging a tunnel with pickaxes and shovels, and these people just showed up with dynamite and heavy moving equipment and said, get out of the way, kids. He said it was a very humbling moment, but something he was incredibly happy about because he saw liberation from having to do all that effort, that was done for him, and that he could benefit from that. Does that resonate with people within Google, that this is how it's often received?

Paige Bailey

Yes. Personally, I use Gemini roughly once every couple of hours throughout the workday for everything from help with writing code, to writing emails, to generating product requirement docs or descriptions for hackathons, to help take bugs and make them more robust if an original user hadn't added enough detail, to automatically create bugs from videos of user experience challenges. It is amazing to have this kind of Swiss Army knife that can do all of these things without necessarily having to stand up a model to do it all by myself.

I remember previously, I want to say about five or six years ago, I had done an exercise where I scraped all of the bugs from a couple of our open source repos that had been added to the GitHub repos on github.com. I scraped all of the issues, used spaCy to do natural language processing, scikit-learn to help with the k-means clustering. It took so long. Even then, there wasn't a great way to be confident that the number of clusters you found was an accurate number.

Now all you have to do is throw the CSVs into Gemini and say, hey, tell me what are some common themes in this user feedback and also help me understand which ones I should be prioritizing. It's an incredible time saver. I am one of those humans: I started building models back in 2009. I love math, I love building models, but I love seeing impact even more. If I do have the ability to just take a model that automatically works and use it, I will do that every single day of the week.
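For comparison, the older clustering workflow mentioned above can be sketched in a few dozen lines. The talk describes using spaCy and scikit-learn; the version below swaps in a standard-library bag-of-words and a plain Lloyd's k-means so that it runs without dependencies. The issue titles are invented for illustration.

```python
import random
from collections import Counter


def vectorize(docs):
    """Turn each document into a bag-of-words count vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (labels, inertia)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(sq_dist(v, centroids[l]) for v, l in zip(vectors, labels))
    return labels, inertia


# Hypothetical issue titles, standing in for scraped GitHub issues.
issues = [
    "install fails on windows",
    "install error on windows 11",
    "gpu memory leak during training",
    "memory leak in training loop",
    "docs typo in tutorial",
    "tutorial docs link broken",
]

vectors = vectorize(issues)
for k in (2, 3, 4):
    labels, inertia = kmeans(vectors, k)
    print(k, labels, round(inertia, 2))
```

Inertia tends to fall as `k` grows, so it does not say by itself which `k` is right. That is the difficulty with choosing the number of clusters described above, and even this toy version needs a vectorizer, a distance function, and a tuning loop before it produces anything.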

Gene Kim

When you told me that story, I was thinking afterwards: how is that possible? I've had friends tell me that these frontier LLMs can do things like Little's law from queueing theory, which seems improbable, and yet it's doing it. The story about how it can actually do clustering given CSV files, I've seen it do it, but can you convince me how that is even possible? Because in my experience, clustering involves calculation. What's the simplest explanation that can allay my fear that it's actually doing something other than what I think it's doing?

Paige Bailey

I would encourage you to experiment and self-confirm. I had to do that several times when I first started using them as well. I think the reason why Gemini can do clustering, and I haven't tested with Claude or with the OpenAI models, is because in general the CSV files that I add are pretty sizable. They're on the order of perhaps 400,000, 500,000, 600,000 tokens, perhaps even more. Gemini's context window is pretty significant. For Gemini 1.5 Pro, it can get up to 2 million tokens with pretty much 99% recall, or close to it.

Given that it's able to ingest this information and synthesize it, and not have degradations in output performance, it is able to do this. I know you shouldn't anthropomorphize models, that's a bad practice, but in my brain I imagine it as: I upload a CSV with 20,000 or 30,000 lines, and it's kind of like having a Post-it note for each one of those rows. Then there's somebody behind the scenes in an infinitely sized room that's just categorizing all of the little Post-it notes. It can do that because it's able to keep all of the information in memory and in its context window.
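A quick way to check whether a CSV of the size described above would fit in a context window is a back-of-the-envelope token estimate. The sketch below assumes roughly four characters per token, which is only a rule of thumb (real tokenizers vary by content and language), and the reserved output budget is likewise an assumption. The 2-million-token limit is the figure quoted in the talk for Gemini 1.5 Pro, and the sample rows are synthetic.

```python
import csv
import io

CONTEXT_WINDOW_TOKENS = 2_000_000  # figure quoted in the talk for Gemini 1.5 Pro
CHARS_PER_TOKEN = 4                # rough rule of thumb (assumption); real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(text: str, reserve_for_output: int = 8_000) -> bool:
    """True if the text plus an assumed output budget fits in the window."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_WINDOW_TOKENS


# Build a synthetic 20,000-row feedback CSV in memory.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "feedback"])
for i in range(20_000):
    writer.writerow([i, "User reports that export to PDF fails on large files"])
sample_csv = buffer.getvalue()

print(estimate_tokens(sample_csv), fits_in_context(sample_csv))
```

For anything close to the limit, an exact count from the model provider's own token-counting tool is the safer check.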

Gene Kim

And is actually projecting it into n-dimensional space, like a clustering algorithm?

Paige Bailey

I don't want to make assumptions about what's happening behind the scenes, but it is able to get pretty accurate results. Like I said, I encourage folks to try.

Gene Kim

It is amazing. Where are you seeing people use frontier LLMs, especially Gemini, that excite you most? We were talking about the Simon Willison report of giving a video of him scanning his Gmail inbox, and it created a JSON file. That's pretty cool. What are others that excite you?

Paige Bailey

I love that you brought up that example because I am especially excited about video understanding, audio understanding, and also the techniques that folks might have seen for NotebookLM that include audio output. NotebookLM is very cool, based on Gemini 1.5 Flash and 1.5 Pro, and it can even generate podcasts, which feels pretty magical.

Simon Willison's use case was powerful too, in the sense that he had this video of looking through his Gmail inbox and categorizing and transcribing some of these emails. It ended up costing him, I think, one one-hundredth of a cent to do this work. We've gotten the cost for these models down to pretty much zero for Gemini 1.5 Flash. I think it's one cent per million tokens or something crazy. If you get the price of intelligence down that low, then you can start expecting that it's going to become ubiquitous. It's not just going to be something that will break the bank if you integrate it into every single aspect of your developer surfaces. It's suddenly something that you can be immersed in, like you're walking around breathing air.

If we can truly analyze video and audio in near real time for zero cost, that unlocks so many beautiful opportunities: real-time multilingual translation, or real-time augmented Meta Ray-Bans being able to see a ball drop and then having physics equations dynamically placed against it to explain what's happening to you. All of these are massive opportunities, especially in the video analysis and audio analysis space, and I'm really jazzed to see what people start building with those.
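The cost argument above comes down to simple arithmetic: tokens processed times price per token. The sketch below uses the "one cent per million tokens" figure exactly as it was loosely recalled in the talk. It is an assumption for illustration, not a published price, so substitute current pricing before relying on the numbers.

```python
from decimal import Decimal

# Speaker's rough recollection from the talk, not an official price (assumption).
QUOTED_USD_PER_MILLION_TOKENS = Decimal("0.01")


def cost_usd(tokens: int, usd_per_million_tokens: Decimal) -> Decimal:
    """Cost in USD of processing `tokens` tokens at the given per-million price."""
    return Decimal(tokens) * usd_per_million_tokens / Decimal(1_000_000)


# A 500,000-token CSV, the size mentioned earlier in the conversation.
print(cost_usd(500_000, QUOTED_USD_PER_MILLION_TOKENS))

# A full 2-million-token context window.
print(cost_usd(2_000_000, QUOTED_USD_PER_MILLION_TOKENS))
```

`Decimal` is used instead of floats so that fractions of a cent are not distorted by binary rounding.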

Gene Kim

That's amazing. To paint the opposite experience, I remember running a BigQuery query on the GitHub dataset, and it was my first $50 query, which was somewhat terrifying. It was like, oh my gosh, that could have been a bad day had I not known. Among this community of technologists, what help are you looking for in terms of things that you're working on?

Paige Bailey

Something that I would ask very selfishly for people to do is: if you have cool use cases, please, please post them. Fun things that you build with the models. Even if it's just a blurb on LinkedIn or your favorite social media platform, on Twitter, you have no idea how much it means to the model-building teams and how much it can inspire other folks who are just starting to explore with these large language models.

Given that it's a completely different programming paradigm than the more deterministic programming that folks might be used to, I think many, many engineers are still a little bit confused as to what the models can and can't do. The more that you, like Simon Willison, are sharing all of these things, the more that you can share the demos and the things that really made your day or saved you time, the more that I think these models will truly become democratized and people will start incorporating them into their work. So please, please do that. I would love to see more examples.

Gene Kim

And if someone has a great story, where do they post it in the way that it will be seen by you?

Paige Bailey

I am DynamicWebPaige pretty much everywhere on the internet. And then I am also paigebailey@google.com. If you don't want to post on social media, don't feel like you have to, but feel free to send it to my email as well.

Gene Kim

Super, Paige, thank you so much. Again, like with everyone, so many fun adventures ahead. Thank you for all the great work you're doing and teaching us about this exciting adventure that you are on.

Paige Bailey

Awesome. Thank you so much.

Gene Kim

Thank you. Bye.