Directing a Swarm of Agents for Fun and Profit


AI Summit Spring 2026


Adrian Cockcroft, technology advisor and veteran of Netflix and AWS, shares hard-won lessons from directing teams of AI agents to build real software at speed. Drawing on a year of hands-on experimentation, he covers practical techniques for agentic coding, context management, multi-agent specialization, and the emerging reality of near-zero-human software companies. In this talk, you'll learn how to manage AI agents like a director managing developers, which coding patterns produce higher-quality output, and how to use design-of-experiments thinking to evaluate an ever-expanding combinatorial landscape of models, languages, and tools.


Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

Okay. So Adrian Cockcroft — man, so I met him in 2012 at a DevOpsDays. Yeah, I've been an admirer of him for years, for decades. I had the Solaris Performance Tuning Guide. I had three copies of that book. I met him when he was at Netflix, and he was talking about crazy things about developers testing their own code. There's no QA department. And he just taught me so much. He was part of the eBay reconstruction effort in the early 2000s, and I shared that quote about "everyone laughed at NoOps — try laughing at NoDev."

Ha ha. So, with no further ado, Adrian Cockcroft.

Adrian Cockcroft

Let me get through this really quickly because I started — I got a Gastown going this morning, and I need to do a live demo at the end of this.

So, why did I call this that title? Because I'm directing a team of agents. It's a management skill set. I'm thinking like a director. I've been a director. I was a VP at the end of my career, and most of the things the agents do are the things those developers did. The human developers messed up in about the same way. It kind of felt like that. They kept building things I didn't ask for, or didn't know to ask for, and it's kind of like managing them, but it's just accelerated.

The "fun and profit" bit — I think this is important. We have to be playing with these things to uncover the new things, because there's so much new stuff happening. Like yesterday, I found out about a completely new thing — which I'm going to put right at the end, I'll tell you about it — that didn't exist about a month ago, that I think is super cool. You have to have some time to play, and then you have to sort of sediment out of that the stuff that seems to actually work, that you can use for a for-profit thing. So even if your head's down building something, everything is turning over so fast that you have to have some time put aside to just keep experimenting, or we'll end up dead-ending on something that doesn't work anymore.

There's this old quote: "The future's already here; it's just unevenly distributed." But it is really going too fast. Every month there's a new, better way to do it. And I think it's kind of hard to buy this in, because with people outside digesting it for you, that extra delay is too long. By the time you've got something working, it's probably already obsolete. If you have that mentality, then you realize you're just going to be running on flaky stuff that's changing out underneath you. And it's not changing under an API; it's changing fundamentally. It's changing from predicting the next line of code, to writing all the code while you're watching, to writing code when you're not watching, to an entire team running a Gastown where you're just telling it what you want it to build. Those are different API levels. You can't substitute those for each other.

And yeah, as Gene mentioned, there's this NoOps thing, and I kind of pushed Gene's button by putting NoOps in these things. I knew he'd kind of respond to it. But the point here is that back in the day, you had these VMware operators that were ops people, and you worked with them by sending them a ticket. And if you went up to them and said, "Hey, we've got this new EC2 cloud thing, and you can automate whatever you do," they'd say, "No, you send me a ticket. That's how it works. And I click things. And I don't need to be replaced by an API." So the developers said, "Well, we'll just stop talking to you, and we'll go directly to the cloud, and we'll do all the things." And going to a developer and saying, "I've got this great tool that writes code," is sort of the same thing. It's like, "I don't need you anymore."

So the developers, I think, have to reskill as "prod devs" or something — like the ops people became DevOps. Maybe developers have to learn some prod things. But what happened was that those ops people ran the platform. They're all running Kubernetes for you or something. So instead of being driven by tickets and by interacting with humans, they were being driven by the API and automating behind that API. So I think what developers are going to be doing is building these platforms that product managers use to get stuff done. And I'll talk about NoMans at the end — we're getting rid of the managers as well.

So, just some of the things that I found that work.

I was watching a podcast with Dan North on it, and it was talking about BDD and how that was different to TDD. So I just started telling my agents to do behavior-driven testing, and I just think it gets better results. I get fewer fake tests. There's a given/when/then structure. I tend to do stuff in pytest. Just one little note: async pytest BDD doesn't work, and the agents will go round in circles trying to make it work. You have to tell it to stop doing that sometimes. Anyway, so that's one technique.
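As a minimal sketch of the given/when/then structure, here is what a behavior-style test might look like in plain pytest, with no BDD plugin involved. The `Thermostat` class and its behavior are hypothetical, invented only to give the tests something to describe.

```python
# Hypothetical example: Thermostat is invented for illustration only.
class Thermostat:
    def __init__(self, target_c: float) -> None:
        self.target_c = target_c
        self.heating = False

    def update(self, room_c: float) -> None:
        # Heat only when the room is colder than the target.
        self.heating = room_c < self.target_c


def test_heating_turns_on_when_room_is_cold() -> None:
    # Given a thermostat set to 20 C
    thermostat = Thermostat(target_c=20.0)
    # When the room temperature drops to 17 C
    thermostat.update(room_c=17.0)
    # Then the heating is switched on
    assert thermostat.heating is True


def test_heating_stays_off_when_room_is_warm() -> None:
    # Given a thermostat set to 20 C
    thermostat = Thermostat(target_c=20.0)
    # When the room temperature rises to 23 C
    thermostat.update(room_c=23.0)
    # Then the heating stays off
    assert thermostat.heating is False
```

Each test names one observable behavior, which makes it harder for an agent to pass it off with a test that asserts nothing meaningful.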

The other thing is maintaining context. At one point, I switched a whole lot of code that I was developing basically in the ChatGPT prompt, and I wanted to take it out and put it in Cursor or something like that. But what I'd done — each of these blocks of code had all of the discussion above them as context. So I said, "Well, bake that context into the top of the code as the first few hundred lines, and then I'll move it." And then I found that it just worked better. So now when I'm coding, I tell it: "Write a block comment at the beginning of every source code file that tells you what this thing is for — with its history, the bugs that have been fixed — accumulate in it." It stops bugs getting reintroduced. It creates intent. You could probably delete all the code after that block and regenerate it from that, because it's got the intent and the history and stuff in it. And I've just found that the coding tools — the first thing they do when they see a file is they read the beginning of the file, and that tells them what it's about, rather than reading the code and trying to reverse-engineer why it's that way.
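A rough sketch of what such a header might look like in Python, written as a module docstring. The file name, history entries, and bug note are all hypothetical; the small `read_header` helper is also an illustration, showing that a tool can pull the intent out of a file without reading the code below it.

```python
import ast

# Hypothetical source file; the module name, history, and bug entries are
# invented to illustrate the header convention described above.
SOURCE = '''"""
sync_client.py: pushes local changes to the house server.

Intent:
    Keep working while offline; reconcile when the connection returns.

History:
    v1: initial push-only client.
    v2: added pull so two devices converge.

Bugs fixed (do not reintroduce):
    - Retried forever on auth failure; now stops after three attempts.
"""

MAX_AUTH_RETRIES = 3
'''


def read_header(source: str) -> str:
    """Return the leading block comment (module docstring) of a source file."""
    return ast.get_docstring(ast.parse(source)) or ""


header = read_header(SOURCE)
print(header.splitlines()[0])
```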

Another thing: Kent Beck's up next. "Tidy First" is something that he says: when you get to a point where the thing is settled, you spend some time refactoring. Sometimes I'll spend an hour or two coding, and then I'll spend the rest of the day tidying up, archiving stuff, testing it, messing around, just cleaning it up before I add the next feature. That way, you're not building on shaky foundations.

I've been using GitHub Codespaces since last June. I didn't know this thing existed, and then it was, "Wow, this is great." You get $20 a month for free. I've only ever used more than $20 once or twice. You just get a free machine that you can throw stuff in, and it's safe to do that. You can run it with dangerous permissions — or YOLO mode — and it doesn't matter because it's just running off a repo. And yeah, it's just there. It's that green button. You just go to the green button on your repo and click a button, and a machine magically appears for you to do something in. I think that's really cool.

And then last June, I started using Claude Flow, and I did a blog post on this that got quite a lot of traffic. And I was raving about it on one of the Slack channels, and Steve Yegge was saying, "What are you doing?" And then that was one of the influences that led to Gastown. But this is a pretty complicated system. It's moving really fast. It's recently been renamed as Ruflo. But the idea was that instead of trying to get all your code being done by one instance of Claude, each one specializes — so one is running testing, one is developing, one is architecting, one is researching — and the context for each is more focused, so you get a better quality output by not polluting the context across multiple concerns. So that seems like a reasonable thing. That's why you have teams of developers with different specializations.
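To make the context-isolation idea concrete, here is a toy sketch in which each specialist keeps its own private context and notes are routed only to the role they concern. This is not the Claude Flow or Ruflo API; the `Agent` class, the `route` function, and the sample notes are assumptions made for illustration, and no model is actually called.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """One specialist with its own private context (hypothetical structure)."""
    role: str
    context: list[str] = field(default_factory=list)

    def receive(self, note: str) -> None:
        self.context.append(note)


def route(agents: dict[str, Agent], role: str, note: str) -> None:
    # Deliver a note only to the specialist it concerns, so the other
    # agents' contexts are not polluted with unrelated detail.
    agents[role].receive(note)


agents = {r: Agent(r) for r in ("researcher", "architect", "developer", "tester")}
route(agents, "tester", "Login test fails when the password is empty.")
route(agents, "architect", "Split the sync service from the API service.")

for agent in agents.values():
    print(agent.role, len(agent.context))
```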

Another thing I did once was I said, "Write me a UX guide," and it just wrote me a UX guide. It invented personas that were plausible without me telling it to do that, and it had all kinds of stuff in it. And I showed it to a real UX designer, and they said it was a plausible-looking guide. So that was pretty good.

And the other trick is, if I'm creating something that's going into a repo, I want the README to basically have a human testing guide in it. Like, how would you install this thing? Go through step by step installing it and running it. Then get the agent to actually do that, save the output, and use that as the example. And then it finds that you're in the wrong directory, and this command doesn't work, and you missed a step, and all those things. So actually forcing it to write a human testing guide, and then going through that guide, kind of forces some more quality into the system.
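One way to picture "go through the guide and save the output" is a small script that runs each documented step and records a transcript that can be pasted back into the README. This is a sketch under assumptions: the two steps are placeholders (a real guide would list the project's actual install and run commands), and `run_guide` is a hypothetical helper name.

```python
import subprocess
import sys

# Hypothetical install/run steps; a real guide would list the project's
# actual commands. sys.executable keeps this sketch portable.
STEPS = [
    [sys.executable, "--version"],
    [sys.executable, "-c", "print('service started')"],
]


def run_guide(steps: list[list[str]]) -> str:
    """Run each step, fail loudly on errors, and return a transcript."""
    transcript = []
    for step in steps:
        result = subprocess.run(step, capture_output=True, text=True)
        transcript.append("$ " + " ".join(step))
        transcript.append((result.stdout + result.stderr).strip())
        if result.returncode != 0:
            # A broken step means the guide is wrong, so stop here.
            raise RuntimeError(f"step failed: {step}")
    return "\n".join(transcript)


print(run_guide(STEPS))
```

Because a failing step raises immediately, a wrong directory or a missing command shows up while the guide is being written rather than when a person first tries it.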

And then there's this sort of — if you start building too much stuff in one go, you run into the monolith problem, where it's messy. And then the other problem is if you've got multiple people trying to contribute to the same repo, at this speed we're running at, you're just colliding all the time. You're rewriting the same files and stuff. And this is why we did microservices in the first place. There were too many people contributing to the same codebase, stomping on each other's code all the time because of the speed we were going at. So what you want to do is you have to split your project into separate repos or separate chunks — partition it up — so all your little agents or whatever you've got aren't all working on the same thing.

So what do you do with LLMs? I think the "Hello, World!" is to build an MCP server for something. And one of the first ones I did was this thing where, basically, if you want to ask me questions about microservices or some random thing, you can go to supra.ai/cockcroft, and you can ask it, "What did Netflix think about this?" — all the stuff I've answered over and over and over again in all the podcasts I've done — and it'll give you normal answers. So in fact, when people send me lists of questions, I go there and plug them in and copy and paste the answers out of the tool. It saves time. "How to introduce chaos engineering into an organization." There you go. Standard answer.

And then a few years ago, I was reading a lot about consciousness, and I started thinking that really this was the observability model for humans. Like, if something is conscious, it means you can interrogate it, you can find out what state it's in. So in some sense, it is the mechanism that we use to generate observability of ourselves — for introspection and also to share with other people. So if you think of it as that observability layer, it sort of makes sense that that is a thing that you want your systems to have. So I'm going to use an LLM to make these things more conscious.

And the experiment I decided I wanted was to make my house conscious, because there's all this random stuff going on in my house that you're not always sure what's happening, or how it works, or what's going on. Someone left a window open, that's why the air conditioning stopped working — or whatever, those kinds of things.

So I started working on some projects, and last June I was chatting to Ruben, who I've known for years; he was one of the original cloud guys. And he said, "Just sit down for half an hour with me, and we'll just get on a Zoom call, and I'll show you how to get this stuff going." And I wrote this blog post, "Vibe Coding is So Last Month," and it has had quite a lot of traffic over almost the last year. It built 150,000 lines of Python or something. In fact, it got to the point where somebody contacted me and said I was one of the top Python coders on GitHub, and did I want a job? And I can't write Python. The agents just put lots of Python in the repos. It ran. It didn't actually do what I wanted it to do. It didn't really work. So I sort of abandoned that codebase.

But it was the first experience. The second one, I decided I wanted a mobile app, and I did build a mobile app. It runs on my phone. It works. And then I ran out of time to build the back end for it, but I got quite a few things working on that.

And then I decided I wanted to build some — I wanted to see if I could build some high quality code that really looked like it worked, because a lot of these things were sort of fairly random. And so I wanted to build a knowledge graph that I could distribute. I wanted to have a knowledge graph in my phone, and then when it connected to the server, to sort of sync up the changes. So this is where you have two knowledge graphs with a lazy synchronization protocol, so you can run disconnected and reconnect. Particularly good if what's on your phone is how to fix the internet when the internet is down. Or your Wi-Fi is down — it's like the instructions on how to fix it, I want those to be part of my knowledge graph.
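The talk does not describe the actual synchronization protocol, so the following is only a sketch of one common approach: each replica stamps its writes with a logical clock, and when two replicas reconnect they exchange state and keep the newest entry per key (last writer wins). The `Replica` class, the tie-break on replica name, and the sample facts are all assumptions.

```python
from dataclasses import dataclass, field

Key = tuple[str, str]            # (subject, predicate)
Entry = tuple[int, str, str]     # (logical clock, replica id, object)


@dataclass
class Replica:
    """A tiny knowledge-graph replica with last-writer-wins merging."""
    name: str
    clock: int = 0
    facts: dict[Key, Entry] = field(default_factory=dict)

    def put(self, subject: str, predicate: str, obj: str) -> None:
        self.clock += 1
        self.facts[(subject, predicate)] = (self.clock, self.name, obj)

    def merge(self, other: "Replica") -> None:
        # Keep whichever entry has the higher (clock, replica id) pair.
        for key, entry in other.facts.items():
            if key not in self.facts or entry[:2] > self.facts[key][:2]:
                self.facts[key] = entry
        self.clock = max(self.clock, other.clock)


def sync(a: Replica, b: Replica) -> None:
    """Exchange state in both directions once a connection is available."""
    a.merge(b)
    b.merge(a)


phone, server = Replica("phone"), Replica("server")
server.put("router", "reset_procedure", "hold button 10 seconds")
phone.put("window", "state", "open")        # written while offline
sync(phone, server)
print(phone.facts == server.facts)
```

Both replicas can keep accepting writes while disconnected, and a later `sync` call brings them back to the same state.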

And Python is named after Monty Python, and in the UK there is another show called The Goodies, which had some of the same kind of people that were in Monty Python, and they have all these shows with stupid names. So I named everything after The Goodies. Sorry about that.

And so I built that, and then I decided I really liked the way Python worked for building things. So I built the Python version first. Then, once I got that working really reliably, I ported it to Swift. Now I have a Swift version which is tested for compatibility against the Python version. You copy all the tests over first and then generate the code. And then last week, I was playing around with something and I needed a TypeScript version. So instead of having TypeScript sort of try to figure out how to talk to all this stuff, I said, "No, you just fork this into TypeScript." So Kitten Kong is the TypeScript client for this library, and if you want it in Rust or anything else, you can do the same thing. So there's a server, which is pretty stable and seems to work, containing a knowledge graph that's sort of house-related, and then the client library for whatever you want to connect it to.

Now, I've looked at OpenClaw, and I haven't quite managed to bring up the courage to install it and run it because it looks a bit scary, and I didn't want to run Telegram. But I found this thing called Insta and managed to contribute the iMessage code to it. So now I have a little iMessage account — because I'm an Apple kind of guy — and I connect to my house, and I send messages to my house over iMessage, and it responds, and I can develop code or whatever I want with it. And that's running as an instance of Insta, which I'm customizing. And it has its own repo, which is Roland-Canyon-Command. That's the repo that's owned by my house, by this persona, which has basically got its own Apple account. Anyway, so that's been quite fun building that.

And the Insta thing is Claude Code with a mind of its own. So it's kind of like a much simpler version of OpenClaw — it's more of just a wrapper around Claude Code so you can send it messages to do things.

Here's some of the messages. While sitting in the hot tub listening to Pink Floyd, I'm telling it to think about how to generate a GitHub architecture — how to put all of its customizations into a GitHub repo so that it can be installed on a different house. And it did one go, and it said, "No, no, no — the house has its own Apple ID, it has its own iCloud account, which means it can be part of an Apple HomeKit thing and all that stuff." I said, "Oh, okay, I'm going to revise that." And it came up with all that stuff. Then I told it to turn on the gas heater for the spa. And then I said — there's this amplifier that I told it about before, I hadn't really done anything with it, it's got an IP address — I said, "Okay, go and see what it's doing and build a skill so that we can easily just switch whichever Pandora station we want to listen to around the house." This is just me sitting in my hot tub coding. So this is the new world. I like this.

Okay. And then it keeps conking out. But anyway, I've got to go home and reboot. Claude keeps disconnecting from it, so I have to keep doing `/login` and stuff anyway. There's a fix for that, but I have to go home to fix it.

So then — one of the things I've been doing — you just saw Kat. I hang around with Kat at Nubank. I'm basically retired, but I am on a retainer for Nubank as a part-timer, so I spend some part-time there and I work with some startups as well.

But this is everything I said, without the Claude output, for a demo that I did last September. I want to just demo this sort of agentic coding. And I said, "I want to build a knowledge graph for Brazilian soccer players because I'm going to Brazil. I know nothing about Brazilian soccer." And these are the questions that it came up with — which have Neymar and Palmeiras and all these teams and players I have no idea about. Well, I have no idea what the leagues are even called.

So basically, it came up with these. These look like plausible questions. And then I ran it as a test, but then I turned that into a benchmark. So I have this thing called Brazil Bench on GitHub. It's a GitHub organization — it's not under me, it's its own thing. And under that, you basically create a repo with your own instance of whatever you want to test. And we've got all these different attempts in the repo, and I was trying to figure out how well it works with Claude, with Gemini, with Gastown, with all these other things. And these take about 15 or 20 minutes to run. It takes quite a while to build it, and it's quite complex, and there are lots of different moving parts to it. But we've run these. So if you want to, you can just go build this — you can use this as an evaluation benchmark.

The other thing I want to talk about is Output to Outcome from Mik Kersten. I was lucky enough to be a reviewer of this book, and the most annoying thing is that you haven't already read this book. It really is the operating manual for what to do. Most of the things people are struggling with are in the book, with really good plans for how to deal with them. I think it's coming out in July. But yeah, I just wanted to plug this book. I've got a praise quote on the back saying, "Everyone must read this book." So it's one of those.

And yesterday I was getting all excited, and I decided to write a blog post, "Entering No Man's Land," right? Because we're entering the land where there are no managers. And then I was talking to Sam, and he said, "Oh yeah, there's this thing called Paperclip." And I go, "All right, so somebody's already built it." It turns out it's open source orchestration for zero-human companies. This repo, paperclip.ing, was created on March 5th, last month. It has 50,000 stars on it, and there are 600 or 700 companies running on it now after four weeks. This is the world we're living in, right? It's just crazy how fast this stuff's going.

And you can go read the blog post. I don't think you can take a big company and fire your way down to zero employees. That doesn't work. But if you're starting from scratch, we're starting to see companies where it's just one person and that's all you need. Like Steve building Gastown: he doesn't need a company to build Gastown. It doesn't cost him much to build Gastown. You can build things on the side of it.

All right. Now, this was my last slide, but I've got four minutes left, and I'm going to go do a demo because this morning I had an idea.

Because yesterday I was talking to somebody, and I was saying: all of this stuff keeps changing, and this Brazil Bench idea is sort of the right idea, but it's kind of a bit clunky. What we really need is something — we need to build an architecture that can evolve continuously.

Let's switch to this.

Can you see it? All right. It's not on the main screen. So — I'll just put that — there you go. This is kind of the output from it. This was running this afternoon. This code — this entire repo came into existence this morning. There was a Gastown running. But let's try and explain.

Actually, the conversation I had yesterday was basically: if you're trying to evolve your architecture, you have all these potential candidate stacks coming. Should you be running Sonnet or Opus? Should you be running all these other tools? You've got different vendors, you've got different versions of the model, and then you've got different languages, and you've got different tools. Which languages are best for front ends or back ends, and which model is best for writing Go or TypeScript or whatever? You don't know what all these combinatorics are. And when you've got too many combinatorics, it's a permutational explosion, and you end up with hundreds or thousands of different combinations.

And the solution for that is a statistical analysis called design of experiments, and you do something called a fractional factorial, where you actually run subsets of the combinations that give the maximum contrasts. And I happened to have done this years and years ago, enough that I remembered the name of the technique, and I had a chat with Claude and it dumped out, "Yeah, this is all the stuff and this is how you do it." It remembered. It knows that statistical technique. I just had to ask for it. So I got a spec written.
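As a small illustration of the fractional factorial idea, the sketch below builds a half fraction of a four-factor, two-level experiment: the first three factors take every combination, and the fourth is set to the product of the other three in coded form (-1 or +1). That gives 8 runs instead of the full 2^4 = 16 while keeping every factor balanced. The factor names and levels are placeholders, not the ones used in the Retort repo.

```python
from itertools import product

# Hypothetical two-level factors; the levels are placeholders, not
# the factors used in the Retort repo.
FACTORS = {
    "model": ("Sonnet", "Opus"),
    "language": ("Go", "TypeScript"),
    "tooling": ("tool_a", "tool_b"),
    "testing": ("tdd", "bdd"),
}


def half_fraction(factors: dict[str, tuple[str, str]]) -> list[dict[str, str]]:
    """Build a 2^(k-1) design: the last factor is the product of the others."""
    names = list(factors)
    runs = []
    for coded in product((-1, 1), repeat=len(names) - 1):
        last = 1
        for level in coded:
            last *= level           # defining relation: last = product of rest
        levels = coded + (last,)
        # Map coded level -1 to index 0 and +1 to index 1.
        runs.append({n: factors[n][(c + 1) // 2] for n, c in zip(names, levels)})
    return runs


design = half_fraction(FACTORS)
print(len(design), "runs instead of", 2 ** len(FACTORS))
for run in design:
    print(run)
```

The price of halving the run count is that some interaction effects can no longer be separated from each other, which is the usual trade-off in this kind of design.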

By about 10:00 a.m. this morning, we had a Gastown running. I dumped the spec into it. It spent most of the rest of today building it, and then it started actually running. And these are really short experiments where they build a REST API; it's a couple-of-minutes kind of thing. I haven't run the full Brazil Bench. But I would run this with a quick test as to "does it work?" with a new version of something, and then a bigger test, a standardized test, and then put some of your own code in it: "How well does it work for our code? How do we want our stack to evolve?"

And the system — what it's built is actually much more sophisticated than just building this table. It's got this idea of a pipeline of stacks. It's got Wardley mappings that popped up at some point. It's got some evaluations in there for how evolved is this system.

So my ask to all you folks is — could you star this repo for me, please?

I got eight stars. Let's see if I can get some more stars on it. If I can get 50,000 stars on this in a month, then this can be the way that everybody... this is sort of the basis of what everyone's going to have to do to evaluate everything. And if we do this, and if we share all of these runs, if we all use our own tokens to do the runs on this, why should we each keep burning however many tokens it was, right? We can just share it all. We can have a central repo where everyone's sharing all of these runs and building out all this stuff. So it becomes sort of a shared resource for the different stacks that are interesting.

And I did tell it just now — because I wanted to keep it busy — that it should... let's see if it can tell how much it... I told it to do Rust and a few more replications and things. See if it's actually got any. Has it done it? Yeah, there we go — did some more.

Yeah. So now you see — you want to run — you don't get the same answer every time for the same combinations. You have to rerun them. So I wanted to do some replications where it's doing it over and over again. Stuff like that. So the statistics of this stuff.
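A short sketch of why replications matter: with repeated runs of the same configuration you can report a mean and a spread rather than a single number. The scores below are made-up placeholders, not measurements from any benchmark, and the stack names are hypothetical.

```python
from statistics import mean, stdev

# Made-up placeholder scores (not measurements): three replications of the
# same two hypothetical stack configurations.
replicates = {
    "stack_a": [0.70, 0.80, 0.75],
    "stack_b": [0.60, 0.90, 0.75],
}


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return the mean and sample standard deviation of repeated runs."""
    return mean(scores), stdev(scores)


for name, scores in replicates.items():
    avg, spread = summarize(scores)
    print(f"{name}: mean={avg:.2f} stdev={spread:.2f}")
```

In this toy data the two stacks have the same average but very different spread, which a single run of each would not reveal.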

I'm out of time. But yeah, if this looks interesting — it's sitting there. It's called Retort, because a retort is a thing you use to distill stuff in chemistry. We're distilling the good stuff out from the bad stuff — which ones work? That's the idea. It came up with that as a name.

And that's it for today.