Making AI Agents Actually Work for You
The word “Agent” is everywhere in tech right now—yet most so-called agents are just old workflows with a new label. In reality, the underlying technology is far more capable than the way most enterprises use it.
In this talk, I’ll share how I’ve been pushing agentic systems to their limits—what actually defines an AI Agent, where they deliver real value, and how to make them do work that would be impossible without them. Along the way, I’ll unpack lessons from building my own agent-powered tools, including the unconventional workflow that turned one side-project into a surprisingly polished production service.
If you’ve ever wondered what’s hype, what’s real, and how to make AI agents deliver measurable results in your world, this is the session for you.
Full transcript
The complete talk, organized by section.
Joe Beutler
Hello everyone. I'm thrilled to be here with you this morning. I'm Joe Beutler. I lead a team of engineers at OpenAI who work with our largest customers to design and deploy AI solutions that actually move the needle for their businesses. As part of my job, I also get to play with some of the fun new toys we build before they're public, and I always like to see how far I can push them.
Today I want to talk about agents — what they are, what they aren't, and how we can make them actually do meaningful work for you.
But before I start, let's acknowledge that we, as an industry, have a meme problem. Who's sick of hearing the word "agent"? Let me see some hands. I know it's most of us here. Thank you.
Over the last year I heard the word "agent" more than almost any other term in AI. Every product demo, every startup pitch — agents everywhere. So I thought it would be appropriate to use another meme I see too often. Most of what's called an agent today is just old workflow software with a shiny new label. The user experience hasn't really changed. Most SaaS is still just a CRUD app. The vast majority of things marketers are calling agents today are the same things they called co-pilots last year and chatbots the year before.
The technology is capable of so much more than what people are using it for, and that is the gap that I want to try to close here today.
So if we want to get value from agents, we have to solve two problems. First, understand what they actually are and what they're not. And second, push them to do work that would be impossible without them.
Here's the definition that I like to work from: an agent is a system that can independently accomplish tasks on your behalf. They use an LLM to interpret instructions, make decisions, and take actions using tools — always within clearly defined guardrails.
So let's play a quick game. I like to call it "agent or not an agent."
A user asks a question and the LLM answers it. This is not an agent — it is an API call. Next.
A user asks a question and the LLM retrieves information from a knowledge base to use to answer it. This is RAG, which is also not an agent. Next.
A user asks the LLM to do an action on their behalf, like return an order. The LLM selects from a predefined set of tools it has to take an action. While useful, this is tool calling, which is also not an agent.
So now we're getting somewhere. A developer builds a system that allows the user to make a request and it follows a deterministic flow to get a result. Some people might see this as agentic, but really it's just a workflow that uses an LLM to do a task or make a decision. So this is not an agent.
I could show more examples, but I think you get the idea. We have chatbots and workflows, and those are really great at answering questions in real time, but they are not independently solving problems for the user.
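The distinction drawn in the game above can be sketched in a few lines of Python. This is a minimal illustration, not anyone's production design: `look_up_order`, `start_return`, and `fake_model` are hypothetical stubs (a real system would call an LLM where `fake_model` sits). The point is only where the decision about the next step lives. In `workflow` the developer fixed the steps in advance; in `agent` the model picks each step, and the loop only enforces guardrails.

```python
from typing import Callable

# Hypothetical tools, stubbed so the sketch runs without any external service.
def look_up_order(order_id: str) -> str:
    return f"order {order_id}: delivered, eligible for return"

def start_return(order_id: str) -> str:
    return f"return started for order {order_id}"

TOOLS: dict[str, Callable[[str], str]] = {
    "look_up_order": look_up_order,
    "start_return": start_return,
}

def workflow(order_id: str) -> list[str]:
    """Deterministic flow: the developer fixed the steps in advance."""
    return [look_up_order(order_id), start_return(order_id)]

def fake_model(goal: str, history: list[str]) -> tuple[str, str]:
    """Stand-in for an LLM choosing the next action from what it has seen so far.

    Returns (tool_name, argument), or ("finish", summary) when the goal looks met.
    """
    order_id = goal.split()[-1]
    if not history:
        return "look_up_order", order_id
    if "eligible for return" in history[-1]:
        return "start_return", order_id
    return "finish", history[-1]

def agent(goal: str, max_steps: int = 5) -> list[str]:
    """Agent loop: the model picks each step; the loop only enforces guardrails."""
    history: list[str] = []
    for _ in range(max_steps):  # guardrail: bounded number of actions
        action, arg = fake_model(goal, history)
        if action == "finish":
            break
        if action not in TOOLS:  # guardrail: only allow-listed tools may run
            raise ValueError(f"tool not allowed: {action}")
        history.append(TOOLS[action](arg))
    return history
```

On this toy task both paths end up taking the same two actions. The difference shows up when the situation changes: the workflow can only ever run its two hard-coded steps, while the agent loop would follow whatever the model decides next, within the step limit and the tool allow-list.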
Now, you might be asking: what do actual agents look like? My favorite example, and the moment I think we crossed the chasm from chatbot to agent, was ChatGPT Deep Research. After some early noise in 2024 about software-plus-LLM systems being marketed as agents, OpenAI launched Deep Research in ChatGPT in February 2025. According to Casey Newton and others, this was the first good agent that users could leverage to do really valuable work for them.
We see that it can execute multi-step tasks, independently do a lot of work like search for and synthesize hundreds of online sources, and deliver a comprehensive report in minutes. This is clearly valuable to me, but I don't think most people understand how it actually works.
So I'm going to pull back the curtain and give you a sneak peek. From my understanding, the Deep Research agent isn't just a series of LLM calls that string together a bunch of different subtasks. The original Deep Research agent was a specially trained o3 reasoning model, trained on the full process of researching and compiling a research report.
It takes in a user's request, analyzes it, and sends clarifying questions to narrow the scope and sharpen the focus. When the user responds, it considers their answers, makes a plan, and formulates a series of queries to begin its research. It uses its search tool to browse the web, and then it loops back through this whole process. Most importantly, it refines its hypothesis and plan before each repeat of that loop. When it judges that it has all of the information it needs, it concludes, produces the report, and returns it to the user.
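The control flow described here (clarify, plan, search, refine, repeat, report) can be sketched as a loop. This is only an outline of that flow with hypothetical stubs. It is not how Deep Research is implemented: as noted above, the real agent is a trained reasoning model rather than a hand-written loop. Every function name, the three research "angles", and the stopping rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    request: str
    clarifications: list[str] = field(default_factory=list)
    plan: list[str] = field(default_factory=list)
    findings: list[str] = field(default_factory=list)

def ask_clarifying_questions(request: str) -> list[str]:
    # Hypothetical: a real system would have the model write these.
    return [f"What time range matters for '{request}'?"]

def make_plan(state: ResearchState) -> list[str]:
    # Hypothetical planner: one query per angle that is not yet covered.
    angles = ["overview", "recent data", "counterarguments"]
    return [f"{state.request} {a}" for a in angles[len(state.findings):]]

def search(query: str) -> str:
    # Hypothetical search tool, stubbed to avoid network access.
    return f"notes on: {query}"

def has_enough(state: ResearchState) -> bool:
    # Hypothetical stopping rule; in the talk this is the model's own judgment.
    return len(state.findings) >= 3

def research(request: str, user_answers: list[str], max_rounds: int = 10) -> str:
    state = ResearchState(request=request)
    questions = ask_clarifying_questions(request)
    state.clarifications = [f"{q} -> {a}" for q, a in zip(questions, user_answers)]
    for _ in range(max_rounds):  # bounded loop as a simple guardrail
        if has_enough(state):
            break
        state.plan = make_plan(state)                 # refine the plan with what is known
        state.findings.append(search(state.plan[0]))  # run the next query
    return "\n".join([f"Report: {request}", *state.clarifications, *state.findings])
```

Calling `research("EV battery costs", ["last five years"])` walks the loop three times, re-planning before each search, and then returns a five-line report: a title, the clarification, and one note per angle.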
What is important here is that inside the loop it might do hours' or even days' worth of work that typically would have been done by an intern or an analyst at a Big Four consulting firm, and all of this takes only 10 to 30 minutes while you're getting coffee.
So this is the definition and the bar that we use internally when we release a product that we call an agent. Alongside Deep Research, we have released the ChatGPT agent, which has been trained to use a web browser to complete increasingly complex tasks, and Codex, our coding agent that comes in a few forms, including the web-based agent that can take a task autonomously, review your codebase, write code, and return a ready-to-merge PR. Perhaps a clearer term for these would be autonomous agents.
---
Now I'll share what Gene actually wanted me to present. I want to show you a real example of this in action and something that I built myself, using this type of mindset of pushing the agents to work for you.
It started when I got a week off and decided to — you guessed it — vibe code.
To fast-forward a bit: I told Gene how I coded Turtle, an AI photo enhancer, a week or so after I launched it. He encouraged me to write about it, about my process. So I posted an article on LinkedIn where it surprisingly got a little bit of traction. In it, I shared a framework for how I used various agents and agent tools to vibe code a full production app in just 10 days.
I'm going to walk you through the same story and the framework I used, starting with the problem.
Back in the spring, I added some photos of updates we made to our Airbnb listing. The original shots were from a professional photographer — beautifully lit, perfectly framed. It's a pretty nice place. And then there were the new ones, which are the ones I took with my iPhone. The difference was obvious. It was just bad lighting. It was almost like two completely different properties.
So that's when I had the thought: could ChatGPT make my bad photos look like the good ones? I dropped a picture into ChatGPT with some instructions and, shockingly, it pretty much worked. It kept the real furniture, made the lighting look natural, and avoided that overly fake AI-generated look. So that was the first moment I thought this could actually be useful.
But it wasn't until I got the urge to vibe code and realized I had the week of the 4th of July off with no plans that I decided to try to build an app to do this using AI.
Now, I'm no cracked engineer. To establish my lack of credibility: before this, the last production app I'd launched for actual users was back in 2018. The app was built with a painfully janky stack: an Angular front end duct-taped to a PHP backend with a MySQL DB hosted directly on an EC2 instance. I lived in constant fear that someone would peek at the source code and expose me as a total fraud. It was classic Stack Overflow era self-teaching at its finest.
I also wanted to challenge myself, so I picked a stack that was almost entirely new to me: GPT Image 1 for the image editing, v0 by Vercel for the front end, Next.js, TypeScript, Prisma, and Vercel for the core app, and Windsurf for an IDE. The only familiar tools I used were ChatGPT, Codex, and GitHub. Everything else was brand new.
So this wasn't just about the product I was building. It was also about seeing how far I could go when AI is doing most of the heavy lifting.
Here's how I approached it. First, I used ChatGPT to help me write a prompt for v0. Then I dropped that into v0 to get a polished front end. Next, I pushed the v0 site to GitHub. Then I connected my repo to Codex inside ChatGPT and asked Codex to map out all of the backend tasks that I'd need to wire up the front end, and it opened a dozen Codex tasks automatically. For reference, those were the first of what is now 294 Codex tasks that I've run on this project, as of last week.
Codex then started generating PRs to build out those features. I was literally watching an agent identify the work, do the work, and hand me ready-to-merge code.
The results: 10 days and 65 Codex PRs later, I had a fully functioning web app that supports multiple types of advanced real estate photo editing and virtual staging. And it was built on an almost entirely new stack, including secure auth and everything that I needed to feel comfortable pushing it to production.
---
So what does this mean for the enterprise?
This was a fun side project, great for me to run as an individual, but how are we actually bringing this mentality to the enterprise? What I'm seeing at OpenAI is smaller teams — one product manager, a few engineers, and maybe a couple of researchers — building entirely new products end to end. For example, Codex started with one PM and a few engineers from inception to launch with this model. Even Bezos's two-pizza teams would be pretty plump by the time they ship their product if they weren't shipping so fast.
So I know this is a group of builders. If you want to build agentic systems, where do you start? It starts with a hot take: there is a right way to build agentic systems, and it comes in two flavors.
First is what I'll call agentic software. Traditional software assumes the user already knows how to use it. New users struggle through menus, guided flows, and training, making adoption slow and costly. AI-native software flips this model. My view on the future of software is that AI-powered conversational workflows will replace static UI/UX. The system will adapt to the user, not the other way around.
With this, there are two core usage models. One: power users get instant results when they know what they want — like pulling a report from a CRM without tons of clicks or navigating forms. Two: new users are guided by the AI to discover what's possible, eliminating the steep learning curve and unlocking value from day one.
Beyond that, the same AI can embody your best customer-facing talent: an onboarding assistant as effective as your best customer success manager, a support experience as responsive as your best support rep, and a sales assistant that can upsell, cross-sell, and guide customers to additional value, all inside the same conversational experience.
In short, agentic software unifies onboarding, support, and sales directly into the product experience, driving faster adoption, higher satisfaction, and new revenue.
---
Next, agentic services extend agents into real-world interactions. It starts by building an AI assistant for humans who have repetitive or operationally heavy jobs. As these human experts use the assistant, they will be able to provide feedback on what it gets correct or incorrect before it executes a task, allowing the human to take over when the AI is incorrect. And this is, of course, what we call human-in-the-loop.
Once a given task or set of tasks is being completed accurately by the AI, you can stop putting it in front of the humans altogether. By combining AI agents with human-in-the-loop expertise, agentic services automate what can be automated, guide customers seamlessly, and preserve human judgment where it matters most. This accelerates time to value and reduces risk.
So how do we do this? We've seen the most effective way to build these systems is in three stages: Ask, Assist, and then Automate. Most customers want to jump straight to Automate — more on that in a moment. But the reality is it takes a lot of time to build the foundations for truly effective automation.
By starting with a conversational Ask experience, you get to some value quickly and start evaluating the performance of the system with real-world usage. This then allows you to build on that foundation by adding more complex behaviors that bring more value to your customers, your internal teams, and your bottom line.
You could try to go from zero to full automation on a given project, but then it might take you at least 18 months to see some value, and this is a high-risk approach — effectively like building software without an iterative development approach. You wouldn't know whether or not it works or has product-market fit until you launch after 18 months of investment, and you wouldn't see any ROI along the way.
Using our approach, you start with Ask: you build a conversational experience where the LLMs answer questions by pulling from internal data sources, and your internal users provide feedback on the accuracy of the responses. If applicable, you can then move that experience to an external-facing tool once the responses reach a sufficient level of accuracy.
Once you master the Ask level, you can build an internal tool that makes recommendations to your human operators on what action they should take. The humans provide feedback by accepting or rejecting the recommended task and evaluating output from that task. Side note — this may feel too manual for AI, but it's actually similar to how we train the LLMs themselves by using reinforcement learning. Again, if applicable, you can provide this as a tool to external users once you reach a high enough level of confidence in this functionality.
Once your internal operators consistently accept the recommended task and output, you can start to automate it. When the system is able to accurately complete a series of tasks, you can automate the entire workflow. The automated experience could then be built into the software used by your end users, effectively automating the job function in the software itself. This could be sales, onboarding, support — really anything that's operational.
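One way to make the move from Assist to Automate concrete is to log each accept or reject decision per task and only suggest automation once operators have accepted the recommendation consistently. The sketch below is an assumption-heavy illustration: the `TaskTracker` class, the task name, and both thresholds are hypothetical and not figures from the talk.

```python
from collections import defaultdict

class TaskTracker:
    """Tracks human accept/reject feedback per task and suggests when to automate."""

    def __init__(self, min_reviews: int = 20, min_accept_rate: float = 0.95):
        # Both thresholds are illustrative assumptions, not figures from the talk.
        self.min_reviews = min_reviews
        self.min_accept_rate = min_accept_rate
        self.reviews: dict[str, list[bool]] = defaultdict(list)

    def record(self, task: str, accepted: bool) -> None:
        """Store one operator decision on a recommended action."""
        self.reviews[task].append(accepted)

    def accept_rate(self, task: str) -> float:
        votes = self.reviews.get(task, [])
        return sum(votes) / len(votes) if votes else 0.0

    def stage(self, task: str) -> str:
        """Return "automate" only with enough reviews and a high enough accept rate."""
        votes = self.reviews.get(task, [])
        enough_reviews = len(votes) >= self.min_reviews
        if enough_reviews and self.accept_rate(task) >= self.min_accept_rate:
            return "automate"
        return "assist"


tracker = TaskTracker(min_reviews=4, min_accept_rate=0.75)
for accepted in [True, True, False, True]:
    tracker.record("reset_password", accepted)
print(tracker.stage("reset_password"))
```

Requiring a minimum number of reviews as well as a minimum rate keeps a task from being automated on the strength of a handful of lucky approvals. Tasks with no history stay in the Assist stage by default.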
These internal tools for your team become agentic services, and the external tools become agentic software, fully automating the tasks or workflows previously operated by humans. Each of these stages represents a step-function change in the complexity of the work done by the AI system and in the business value that you will receive from it.
---
An example of this in practice is the work we've done with T-Mobile.
You may be thinking: what does T-Mobile have to do with my business? We aren't a telco and we aren't trying to automate customer support. Well, if your business is still operated by humans, then you can apply this framework to automate some part of your operation. And our work with T-Mobile shows that you can do this at scale.
I first met T-Mobile early last year, and my team started building with them last summer. They came to us with a few hero missions for their business with a focus on becoming a deeply data-informed, AI-enabled, digital-first company.
When they came to us, they wanted to start with Automate. But we told them they should start with Ask. They insisted we build straight toward Automate because they didn't want to do something incremental. And we later found out that they were already working with another partner on the Ask use case for their T Life app. So we started working with them on Automate.
This begins with building the foundations — setting up evals and internal APIs with access to data so the LLMs can access that internal data through tool calling. This is also the way that we would have approached the Ask use case, so you're building the same foundations either way.
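A rough sketch of what those foundations might look like at their smallest: one internal API exposed as a tool, and an eval set that scores the system's answers. Everything here is made up for illustration, including `get_plan_price`, the plan names, the prices, and the `answer` stub that stands in for an LLM with tool access. None of it reflects T-Mobile's systems or data.

```python
from typing import Callable

# Hypothetical internal API exposed to the model as a tool (all data is made up).
def get_plan_price(plan: str) -> str:
    prices = {"basic": "$30/month", "plus": "$50/month"}
    return prices.get(plan, "unknown plan")

def answer(question: str) -> str:
    """Stand-in for an LLM that answers by calling the tool above."""
    for plan in ("basic", "plus"):
        if plan in question.lower():
            return f"The {plan} plan costs {get_plan_price(plan)}."
    return "I don't know."

# Each eval case pairs a question with text the answer must contain.
EVAL_CASES = [
    ("How much is the Basic plan?", "$30/month"),
    ("What does Plus cost?", "$50/month"),
    ("How much is the Ultra plan?", "I don't know"),
]

def run_evals(system: Callable[[str], str]) -> float:
    """Return the fraction of eval cases whose expected text appears in the answer."""
    passed = sum(1 for question, expected in EVAL_CASES if expected in system(question))
    return passed / len(EVAL_CASES)


print(run_evals(answer))
```

Because `run_evals` takes the system as a parameter, the same cases can score an Ask prototype today and an Assist or Automate version later, which is one reason the foundations carry over between stages.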
After quickly getting an impressive early prototype, their CIO decided to cut ties with the other partner that they had been working with for 18 months on the Ask use case. Eighteen months. He gave us four weeks to return with a prototype that could beat the other partner's on both accuracy and experience. We did that, and two months later we launched the Ask Chat experience live to their external customers in the T Life app, achieving almost 75% containment versus their initial 50% goal. Containment here means conversations that were resolved entirely by the AI.
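Containment as defined here is a simple ratio: conversations resolved entirely by the AI, divided by all conversations. A minimal sketch, assuming each conversation record carries a flag for whether it was handed to a human (the `Conversation` type and its field names are hypothetical, and the sample data is invented):

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    conversation_id: str
    escalated_to_human: bool

def containment_rate(conversations: list[Conversation]) -> float:
    """Share of conversations resolved entirely by the AI (never handed to a human)."""
    if not conversations:
        return 0.0
    contained = sum(1 for c in conversations if not c.escalated_to_human)
    return contained / len(conversations)


# Invented sample: three of four conversations stayed with the AI.
sample = [
    Conversation("c1", escalated_to_human=False),
    Conversation("c2", escalated_to_human=False),
    Conversation("c3", escalated_to_human=True),
    Conversation("c4", escalated_to_human=False),
]
print(containment_rate(sample))  # 0.75
```

In practice the hard part is deciding what counts as "resolved": a conversation the customer abandons is not escalated, but it is not obviously contained either, so the flag definition deserves as much care as the arithmetic.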
In parallel, we used those same foundations to keep expanding toward Automate, and they were well on their way.
You don't have to take my word for it. T-Mobile's CIO joined our recent Realtime API launch livestream to announce the launch of their AI voice experience in the T Life app. He shared a statement where he talked about the importance of taking advantage of a transformational technology like AI to reinvent your process. To do this, first you have to smash your existing processes and be willing to rebuild them from scratch in an AI-native way. This means you can't just tack on AI in an incremental way — in his words, it is just frustrating and doesn't work. Finally, he emphasizes that this is an opportunity to fully reinvent your processes. It will take some time and investment, but the potential outcome is exponential value.
---
So don't lose your heads like everyone else and try to skip to the finish line. Follow the Ask, Assist, Automate framework, and you'll soon be prepared to build your own fully autonomous agents.
Before I go, I want to share this QR code that links to the vibe coding framework — that article on LinkedIn that I mentioned earlier. I'm also always down to chat about vibe coding, AI impact on careers, and agent gateways. I've heard a couple of people mention them — how are you evolving API gateways to build agent tools that your engineers can use?
Also, a quick shameless plug: if you're renting or selling a house, or know someone who is, check out [turtleedit.com](https://turtleedit.com). I'd love to get any feedback. If it's broken, forgive me — just give me the bug reports and I'll vibe code to fix it.
If you want to connect, feel free to reach out on LinkedIn. Thank you very much.