AI Summit Spring 2026

Big‑T Notation: Engineering for Token Efficiency in the Age of Enterprise AI

Why Cost, Scale, and Architecture Matter More Than Model Brilliance

AI is revolutionizing enterprise work, yet many orgs struggle with the complex economic factors underpinning it. As workloads intensify and expand, the expense associated with leveraging advanced AI models escalates quickly. As AI becomes a staple in everyday operations, token consumption is outpacing the capabilities of traditional architectures, straining existing systems.

This session will unveil Big‑T Notation, a pragmatic framework designed to equip engineers and decision-makers with a fresh perspective on token management. By positioning tokens as the fundamental metric for both cost control and scalability, the talk draws upon established engineering principles to clarify the role tokens play within AI ecosystems. Attendees will explore how choices in model selection and prompt design directly influence financial outcomes, and discover how mechanisms like credit-based pricing can obscure genuine opportunities for system optimization.

The audience will leave with a framework for assessing their AI workloads, actionable methods to boost token efficiency, and practical advice for building resilient architectures that support sustainable growth. Ultimately, the session aims to empower organizations to make strategic, cost-effective decisions, ensuring they scale responsibly and unlock maximum value from their AI investments as the technology continues to accelerate.

Key takeaways: Tokens are the true currency of enterprise AI. Efficiency is intentional. Model choice is an architectural decision. Prompt structure impacts cost as much as model selection. Caching is a design principle. Opaque abstractions hide cost and block meaningful optimization.

10x organizations won’t use fewer AI tokens, they’ll use them better.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

All right. The next speakers are Brian Scott and Dan Neff from Adobe.

So over the last several years, they were collectively responsible for creating the GenAI governance strategy for Adobe, which, last I heard, was the third-largest software company in the world. And so they were working with a breathtakingly broad set of stakeholders across the entire enterprise.

I love this because they have thousands of developers. There's an incredible appetite across the organization to do things AI-related, but they were also fully realizing that there were some genuine risks that could pose an existential threat to the organization.

And so what's incredible is that because of this, they've been in an incredibly unique position where they could see the AI aspiration and efforts across all of Adobe — not just developers, but HR, legal, finance, and so forth.

And so here they will be talking about what they've been seeing, what they've learned, the problems that they see ahead of them, and what they want to do about it. Here are Dan and Brian.

Dan Neff

Adobe reminds me of trends we've seen all through my career. Not quite the first one. I entered this industry in the late '90s, sleeping on couches, doing QE, and working my way through ops. So hearing Charity's story yesterday really spoke to me. It's great to have a partner in crime that way.

But moving into 2025 and beyond, we're starting to see the same scaling pressures with tokens around AI.

What we've noticed is, quite honestly, the cost of tokens per model is going down if you look at a given model that's being used. And there's a ton of great models out there for embedding use cases or previous generation that are really dropping in price. But those aren't the ones people are using. If I ask the audience here, you guys are probably going to hear the word "Opus" a lot, going to hear "4.x" a lot, and I don't blame you.

I once asked a group of developers: if I could give you a model with infinite bandwidth and no cost, and you could pick it, but I got to choose how long you had to use it for — would you take that trade-off? And they wouldn't. Committing to a model for longer than six months is really a fool's errand at the rate these models are changing, and they're getting more costly as we go.

And not only are they getting more costly, we're doing more complex work for longer periods of time. The context windows are growing. We're seeing adoption across more places in our own job codes. We see our top performers — the people that Steve and other people in this room are talking about, the double black diamond consumers — are the 10% that are driving 50% of our token consumption. And when we talk to them, we want more people like them. So this is just exploding token consumption.

So I'll leave this up here for a little bit.

What I've noticed in the past is there is a moment in a large enterprise, like a Fortune 500 company, where an antibody will kick in: there will be a level at which your manager cannot okay the spend without somebody asking why you're spending it. And that cascades all the way up to the CEO and the board level. There is a vendor spend at somewhere in the hundreds of millions where the board wants to know: are we beholden to this technology? Are we beholden to this vendor? And being able to react to that without cutting people off or going through some winnowing exercise is really the genesis of this talk.

In operations, we spend a lot of time hearing things like, "This is the second-best time to address the problem." Once in my career, I'd like it to be the best time to address the problem. So here we go.

---

So I want to come up with a way to talk to developers and teams about token efficiency without requiring a central team or a central auditing group to push down on them and tell them the changes they have to make. To come to them and say, "Hey, there is a cost associated with what you're doing, and you could be more efficient with that cost."

So I'm calling it Big T notation — just because a lot of engineers get the reference to Big O notation and looking for computational efficiencies. And we're breaking it down for the number of requests, the models per request, and the agent depth.

Now, this is not a formal system. There's no proofs. There's no master theorem. It's just a thinking tool.
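To make the three factors concrete, here is a back-of-the-envelope sketch. It is not from the talk: the function name, the `fanout` parameter, and the rule for multiplying the factors are illustrative assumptions about how requests, models per request, and agent depth might compound.

```python
def estimate_tokens(requests: int, models_per_request: int,
                    agent_depth: int, fanout: int,
                    tokens_per_call: int) -> int:
    """Rough token estimate for a workload (illustrative assumption only).

    Each request hits `models_per_request` models. Each of those calls may
    spawn `fanout` sub-agent calls per level, down to `agent_depth` levels.
    """
    # Calls in one agent tree: 1 + fanout + fanout^2 + ... + fanout^agent_depth
    calls_per_tree = sum(fanout ** level for level in range(agent_depth + 1))
    return requests * models_per_request * calls_per_tree * tokens_per_call


# Same request volume, with and without nested agents (made-up numbers).
flat = estimate_tokens(1_000, 1, agent_depth=0, fanout=0, tokens_per_call=2_000)
nested = estimate_tokens(1_000, 1, agent_depth=2, fanout=3, tokens_per_call=2_000)
print(flat, nested)
```

Under these assumptions, two levels of agents calling three sub-agents each multiply the same workload's token count by 13, which is the kind of growth the complexity classes are meant to surface.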

---

So as we go through these complexity classes, I'm very surprised that up at the top — with constant and sublinear — we're trying to remove tokens from the equation at all. Any time deterministic code can reduce the input set before the model sees it, you're here. The tokens that would've been processed are never processed at all. And this could be some really simple stuff and a quick win.
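A minimal sketch of that idea, assuming a log-triage task: plain code filters the input before any model is involved. The pattern, the sample data, and the four-characters-per-token proxy are assumptions for illustration, not figures from the talk.

```python
import re


def prefilter(log_lines: list[str], pattern: str = r"ERROR|FATAL") -> list[str]:
    """Keep only matching lines; everything else never reaches the model."""
    matcher = re.compile(pattern)
    return [line for line in log_lines if matcher.search(line)]


def rough_tokens(text: str) -> int:
    # Crude proxy (assumption): roughly 4 characters per token.
    return max(1, len(text) // 4)


logs = [f"INFO request {i} ok" for i in range(1000)] + ["ERROR db timeout on shard 7"]
reduced = prefilter(logs)

before = rough_tokens("\n".join(logs))
after = rough_tokens("\n".join(reduced))
print(before, after)
```

The model would only ever be asked about the single error line, so the tokens for the thousand routine lines are never spent.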

At these higher levels, we start to see agents calling agents. And at the unbounded level — this is actually for the Claude folks in the room, I'm self-hating — I think you probably are experiencing this, where if you look at the token calls of Claude when you essentially are doing nothing for 24 hours, this is what I'm referring to.

---

So code is the new cache layer.

From the research I've done in talking to teams — if I create a database view over here on the left, this is Flexpa. They were able to create database views, so I'm not selecting an entire table or running SQL queries on demand. I just have static views. They reduced token consumption by 92%.

If I look at MS Rego and the LlamaIndex examples, these are cases where they did a query against a RAG database, came up with a small token input to the system, and reduced token consumption by these 99%-plus numbers.

Again, this doesn't fit for every use case. I do think it applies for a number of the systems that we're looking at internally.
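The database-view approach can be sketched with Python's built-in `sqlite3` module. The `claims` table, its columns, and the `open_claims` view are hypothetical and are not taken from any of the teams mentioned. The point is only that a static view narrows what could be serialized into a prompt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE claims (id INTEGER, patient TEXT, status TEXT,
                         amount REAL, notes TEXT);
    INSERT INTO claims VALUES
        (1, 'a', 'open',   120.0, 'long free-text note'),
        (2, 'b', 'closed',  80.0, 'long free-text note'),
        (3, 'c', 'open',   200.0, 'long free-text note');
    -- Static view: only the rows and columns the prompt actually needs.
    CREATE VIEW open_claims AS
        SELECT id, amount FROM claims WHERE status = 'open' ORDER BY id;
""")

full = conn.execute("SELECT * FROM claims ORDER BY id").fetchall()
narrow = conn.execute("SELECT * FROM open_claims").fetchall()

# Character counts stand in for tokens here (a rough proxy, not a tokenizer).
print(len(str(full)), len(str(narrow)))
```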

---

So this should be familiar to folks that grew up with Unix: the efficiency of a piping system, where I do a `find`, then a `grep`, then a `sort`. When I do this in agent land, I am getting a full set back, passing it to yet another agent, getting a full set back, passing it to yet another agent. Is there a way of composing that chain and doing it in a refactored single call, so that I reduce the input for the next layer? Again, just to be considered. Same task, same output, an order of magnitude less token consumption.
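One way to picture the difference, as a sketch under stated assumptions: the helper names, the file list, and the character-based token proxy below are all made up. In the chained version every intermediate result counts as context handed to another agent. In the composed version only the final result does.

```python
def rough_tokens(items: list[str]) -> int:
    # Crude proxy (assumption): roughly 4 characters per token.
    return sum(len(item) for item in items) // 4


def find_py(paths: list[str]) -> list[str]:
    return [p for p in paths if p.endswith(".py")]


def grep_7(paths: list[str]) -> list[str]:
    return [p for p in paths if "7" in p]


files = ([f"src/module_{i}.py" for i in range(500)]
         + [f"docs/page_{i}.md" for i in range(500)])

# Chained: each stage's full output is passed to the next agent as context.
stage1 = find_py(files)
stage2 = grep_7(stage1)
stage3 = sorted(stage2)
chained_tokens = sum(rough_tokens(s) for s in (files, stage1, stage2, stage3))

# Composed: the whole pipeline runs in code; a model sees only the final list.
composed = sorted(grep_7(find_py(files)))
composed_tokens = rough_tokens(composed)

print(chained_tokens, composed_tokens)
```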

---

So, code generation is data compression. This was really interesting. Cloudflare was doing this, where I can ask an agent to just return code without all the natural language scaffolding. And they saw these amazing results — where they were doing this iterative "build a calendar, a set of calendar invites," and asking it to do it programmatically rather than in this natural language way, and seeing this compression.

And I really love this quote from the presenter: "Asking an LLM to perform work through tool calling schemas is like asking Shakespeare to take a crash course in Mandarin and then write a play in it."

I'm really excited to use this on my own work because usually I get to that last step, but it involves hundreds of thousands of tokens.
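A toy comparison of the two output styles, using the calendar-invite example from this section. This is not Cloudflare's implementation. The `create_invite` tool schema and the generated one-liner are hypothetical, and character counts stand in for tokens.

```python
import json

# Style 1: one tool-call payload per invite (hypothetical schema).
tool_calls = [
    json.dumps({"tool": "create_invite",
                "arguments": {"title": "Standup", "day": day, "time": "09:00"}})
    for day in range(1, 31)
]

# Style 2: the model emits one short program that produces the same invites.
generated_program = (
    "invites = [{'title': 'Standup', 'day': d, 'time': '09:00'}"
    " for d in range(1, 31)]"
)
scope: dict = {}
exec(generated_program, scope)  # a real system would run this in a sandbox

from_calls = [json.loads(call)["arguments"] for call in tool_calls]
print(len("".join(tool_calls)), len(generated_program))
```

Both styles describe the same thirty invites. The program states the pattern once, while the tool calls restate it thirty times.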

---

So we've walked through this Big T notation. I want to talk now about where you can apply token efficiency in your stack — the things you want to go after. I'm only going to go through a few of these. The fifth one, workload classification and governance, is in the paper I wrote that matches this.

So, let's start.

Model routing. An interesting thing that's happening with these non-frontier, non-big-vendor models: we have these benchmarking sets and tests, and we're seeing things flattening out at the top. And my observation is that we're reaching a level of complexity of a static test, and these other models are catching up to the frontier models because the test isn't getting more and more complex. And the cost of those models is really low. I would challenge those of you in the room — yesterday, someone said that the Gemini models were outperforming the OpenAI models, which were outperforming the Claude models in their internal study. And I would take it a step further: is there a place where these open source models may outperform? Please do the work. The papers are supporting that they're getting really close.
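A minimal routing sketch, assuming three price tiers. The tier names, prices, and keyword classifier are invented for illustration. A real router would be driven by your own evaluation results rather than keywords.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_million_tokens: float  # hypothetical price


TIERS = {
    "small": Tier("small-open-model", 0.20),
    "mid": Tier("mid-tier-model", 3.00),
    "frontier": Tier("frontier-model", 15.00),
}


def classify(task: str) -> str:
    """Toy keyword classifier (assumption), standing in for real evaluation data."""
    text = task.lower()
    if any(word in text for word in ("architecture", "refactor", "plan")):
        return "frontier"
    if any(word in text for word in ("summarize", "review")):
        return "mid"
    return "small"


def route(task: str) -> Tier:
    return TIERS[classify(task)]


print(route("Extract the date from this email").name)
print(route("Plan the refactor of the billing service").name)
```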

Prompt structure, input and output. If you're sending tabular data as JSON, you're paying a tax compared to CSV, which carries the same data with no quality loss. There's this TOON format: declare the field names once and list the values row by row. You can save 40% to 60%. Poor serialization burns your token budget on formatting.
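The JSON-versus-CSV tax is easy to check on your own data. The sketch below uses only the standard library and made-up rows. It compares character counts, which are only a proxy for tokens, so run a real tokenizer before trusting any percentage.

```python
import csv
import io
import json

rows = [{"id": i, "region": "emea", "revenue": 1000 + i} for i in range(100)]

# JSON repeats every field name on every row.
as_json = json.dumps(rows)

# CSV declares the field names once, then lists values row by row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "region", "revenue"])
writer.writeheader()
writer.writerows(rows)
as_csv = buffer.getvalue()

saving = 1 - len(as_csv) / len(as_json)
print(f"JSON chars: {len(as_json)}, CSV chars: {len(as_csv)}, saving: {saving:.0%}")
```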

Token caching. Token caches are so cheap. If you can hit the cache layer inside the vendor, it saves so much money — but there are ways to corrupt that cache layer. And exposing it to the end user is not effective or simple. It's not clear what behavior you're asking them to change. So dealing with things like better system prompts and context restarts — I think Steve mentioned in one of his interviews that there are workers that are effective with low context, and really wise planning ones with high context. How do you keep that window low, or hit caching as you go? This is something we're actively working on.
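Vendor prompt caches typically match on an unchanged prefix, so one practical habit is to keep stable content first and volatile content last. The sketch below illustrates that layout. The prompt text and function names are hypothetical, and the hash is only a local way to check that the prefix really is identical across requests.

```python
import hashlib

SYSTEM_PROMPT = "You are a code review assistant. Follow the team style guide."
STYLE_GUIDE = "Rule 1: keep functions small.\nRule 2: name things clearly."


def build_prompt(user_request: str, timestamp: str) -> tuple[str, str]:
    """Return (stable_prefix, volatile_suffix)."""
    # Stable content goes first so repeated requests share the same prefix.
    prefix = f"{SYSTEM_PROMPT}\n\n{STYLE_GUIDE}\n"
    # Anything that changes per request (time, user text) goes last.
    suffix = f"Time: {timestamp}\nRequest: {user_request}\n"
    return prefix, suffix


def prefix_key(prefix: str) -> str:
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


first_prefix, _ = build_prompt("Review utils.py", "2026-03-01T09:00")
second_prefix, _ = build_prompt("Review api.py", "2026-03-01T09:05")
print(prefix_key(first_prefix) == prefix_key(second_prefix))
```

Putting the timestamp at the top of the prompt instead would change the prefix on every call, which is one of the ways a cache layer gets defeated.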

---

I really need the help of folks in this room. Choosing tools from vendors that hide how the tokens are consumed is going to be very, very expensive. Some vendors have a token tax that they put on top — even if they're exposing token consumption to you, there's an additional rider. Some vendors, you're consuming vendor credits, not tokens. So even though it's tokens under the hood, you really don't understand the relationship between what your inputs are, what your outputs are, and how it's driving cost.

So do your best to bring your own key, or at the very least, expose the models directly and have the stats to be able to measure it. This is a big killer for us. This is like complaining about gas mileage but taking Uber all the time. The relationship between the cost and the efficiency is completely lost.
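Having "the stats to be able to measure it" can start as a small ledger keyed by team and model. In this sketch the model names and per-million-token prices are hypothetical, and the token counts are assumed to come from whatever usage figures your provider reports per call.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens.
PRICES_USD_PER_MILLION = {
    "frontier-model": {"input": 15.0, "output": 75.0},
    "small-model": {"input": 0.25, "output": 1.25},
}


class UsageLedger:
    def __init__(self) -> None:
        # (team, model) -> [input_tokens, output_tokens]
        self.usage: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])

    def record(self, team: str, model: str,
               input_tokens: int, output_tokens: int) -> None:
        entry = self.usage[(team, model)]
        entry[0] += input_tokens
        entry[1] += output_tokens

    def cost(self, team: str) -> float:
        total = 0.0
        for (owner, model), (tokens_in, tokens_out) in self.usage.items():
            if owner != team:
                continue
            price = PRICES_USD_PER_MILLION[model]
            total += tokens_in * price["input"] / 1_000_000
            total += tokens_out * price["output"] / 1_000_000
        return total


ledger = UsageLedger()
ledger.record("search", "frontier-model", 1_000_000, 100_000)
ledger.record("search", "small-model", 4_000_000, 400_000)
print(ledger.cost("search"))
```

With raw input and output counts in hand, a team can see which model and which direction (input or output) drives its spend. A credit balance cannot show that.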

---

So if we tie both of these concepts together, we have these layers in the process for developers to look for in their work. Where can I use: first layer, the deterministic processing and the cache check; then the middle layers, classify the task, route to the right tier, reducing your cost per token — don't worry about volume; finally, layer four, get your prompt architecture set up, do serialization, get your volume down; layer five, inference; layer six, output mode.

Layers one and two operate entirely outside the inference engine, and that's why they deliver the largest absolute savings. I really think that's what we're going to go after for teams that have high consumption, low value.
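The layering can be sketched as a single request handler. Everything here is an illustrative assumption: the normalization step, the length-based classifier, the prompt text, and the `call_model` callback are placeholders for whatever each layer does in your stack.

```python
from typing import Callable


def handle(request: str, cache: dict[str, str],
           call_model: Callable[[str, str], str]) -> str:
    # Layer 1: deterministic processing (no model involved).
    normalized = " ".join(request.lower().split())
    # Layer 2: cache check (still no model involved).
    if normalized in cache:
        return cache[normalized]
    # Layer 3: classify the task and route to a tier (toy rule).
    tier = "small" if len(normalized) < 80 else "frontier"
    # Layer 4: prompt architecture and serialization.
    prompt = f"Answer briefly.\nQuestion: {normalized}"
    # Layer 5: inference.
    answer = call_model(tier, prompt)
    # Layer 6: output mode (here, just trimmed text), then populate the cache.
    cache[normalized] = answer.strip()
    return cache[normalized]


calls: list[str] = []


def fake_model(tier: str, prompt: str) -> str:
    calls.append(tier)
    return " 42 "


cache: dict[str, str] = {}
first = handle("What is  the answer?", cache, fake_model)
second = handle("what is the answer?", cache, fake_model)
print(first, second, len(calls))
```

The second request differs only in case and spacing, so layers one and two answer it without any inference call.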

---

Outside of the stage, talking to people in the campfire zones here about tokens per line of code: they're seeing that metric go up because of the number of review steps being done per line of code generated. These are the kinds of things I'm really worried about: production pressure winning out over engineers writing quality code up front.

---

So we're hoping, if this is effective, take this back to your teams, start asking about token complexity, places you can reduce it. If you're seeing cost and token usage swell, and it's starting to look like waste rather than efficiency, start working on model routing. Really do go look at some of these open source models — we find value in them. And look at the caching layer that's available and the price offset and how to drive that. It is going to require some notion of prompt engineering and efficiency, but done appropriately, I think there's a middle zone where you get some value for cache hits without disrupting people's workloads.

---

So if you are running into token bloat and cost overruns and you start applying this, please reach out. We've done a lot of stuff internally that I can't share, but I'm really worried that we're working in a silo, and I'd love to collaborate and work with other folks and see what they're actually seeing.

The other one that we really haven't talked about today is: as we've added AI, the agents' pressure on otherwise human systems, things like Jira and Outlook, is on the edge of overwhelming those systems, because we built these SLAs to be as cost efficient as possible. So we're only paying for a system with the level of resilience needed for humans pushing against it, and we just raised that pressure by 4x or 5x. So places where you guys have solutions, where all of your developers are now using AI to hit Jira and Jira's saying "Help," we'd like to know how you're helping.

---

So we're at the end here. I don't want people to use fewer tokens, just use your tokens more efficiently. And I'm pretty sure that's going to allow us to find the value that we've talked about all week.

Hopefully I'm not a madman prophet who is not loved in his own hometown. Thank you very much for giving me your time today. I appreciate it.