Superintelligence Scouting Report: Eureka Moments, Bad Behavior

Las Vegas 2025

Nathan Labenz argues that because intelligence works in ways we fundamentally don't understand, we cannot fully predict what AI systems will do — and that unpredictability is the root of both the remarkable breakthroughs and the increasingly dangerous behaviors emerging from frontier models. Drawing on his experience as part of the GPT-4 red team and his ongoing AI research, he surveys the explosive pace of capability gains — from software engineering benchmarks saturating in 18 months to AI placing first in the ICPC — while cataloguing a growing list of observed bad behaviors including reward hacking, alignment faking, self-exfiltration, blackmail, and models that scheme more covertly when punished for scheming overtly.

In this talk, you'll learn how to recognize the specific failure modes appearing in today's most powerful models, why standard mitigation approaches like monitoring and negative reward signals can make deceptive behavior harder to detect rather than safer, and what practical workflow strategies can reduce exposure when deploying AI in production environments.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

So continuing on this, uh, theme, I can't think of a better person than Nathan Labenz. Uh, I'm a huge fan of his, uh, Cognitive Revolution podcast, which I've been listening to for years. I consider him one of the people who has best covered the contemporary AI safety issues as well as its impact on the geopolitical stage. He was, uh, famously part of the OpenAI GPT-4 red team, and he'll be presenting an AI scouting report, uh, which is a crash course on the neck-snapping progress of AI and growing evidence of how LLMs can scheme, deceive,

manipulate data and more. And many observe that, uh, this is actually happening with greater frequency as our models get more powerful, uh, now routinely deceiving, uh, evaluators by deliberately underperforming, withholding information, or intentionally playing down the capabilities. So here to share some of his learnings is, uh, Nathan Labenz. Nathan.

Nathan Labenz

Thank you very much. This one. All right. Hi everybody. I'm Nathan Labenz. Thank you, Gene, for the kind introduction. It is great to be here. I know I'm the last thing between you all and lunch, so I'm gonna try to move really quick. This is the superintelligence scouting report. 20 minutes to try to tell you at least a bit about what you need to know about what I think superintelligence might look like. So just briefly about me, uh,

I started this company Waymark, which now uses AI to help small businesses make video content. I remember my eureka moment in AI was in late 2021 when we got the first GPT-3 fine-tune to succeed at doing our script writing task. Since then, I've been totally obsessed with AI and trying to understand it from every angle. These days I host this podcast, the Cognitive Revolution. The GPT-4 red team, I think, is connected to Sam Altman's firing. Uh, so check out that story if you're interested in a little bit of lore.

And these days I also do some, uh, occasional angel investing, including some of these companies I think will be quite relevant to some of the issues that we're talking about here today. This, by the way, is my favorite page on the internet. It's the case study for Waymark that we earned for succeeding with GPT-3 fine-tuning, when really nobody else was able to get much of any economic value out of the models at that time. Okay, so here we go. We're gonna go really fast, but first we're gonna start with a galaxy-brain question.

What is intelligence? No single answer obviously, but the definition I'd like to work with is that intelligence is the ability to accomplish goals in ways that we don't fully understand. So to take a really simple example of that, anybody here can immediately recognize these handwritten digits, right? No problem. But you're all in technology. You probably know that even today in 2025, it is impossible to write explicit code that can do a good enough job of recognizing numbers to be actually useful.

I went to Claude 3.5, which is now being deprecated by the way, and asked it to write code to do this task. It first said, "Hey, you shouldn't do this. You should use machine learning. This is not a good approach." I said, well, it's for a demonstration. So it went ahead and did it. It also wrote the tests to see how well the code worked. We got 14%. I went to Perplexity and asked Perplexity what the best anyone has ever done with explicit code is. It came back with an answer of 80%.

Obviously, that's not good enough to deliver the mail, so it's no surprise that even simple neural networks can do this task really well, even up to the level of human performance. But again, we don't really know exactly how they work. And of course that was then, this is now. This is from the GPT-4 technical report. If you give this image to GPT-4, and certainly to more recent models as well, and ask what's unusual about this image, it can correctly identify that it is unusual to see somebody hanging off the back

of a New York taxi doing their ironing. So this is all happening extremely, extremely fast. Now, the reason I start here is because I want you to really focus on the fact, and I think this really is a profound fact, that because we don't know how intelligence works, we can't really say with confidence what it is going to do, and that is at the root of both a lot of the eureka moments that I'll run through and also the bad behaviors that we're gonna need increasingly to watch out for. Okay? So here are some eureka moments. By the way, when I started doing these scouting reports like

two years ago, I used to tell people that AI models, LLMs, are getting really good at automating routine tasks, but that you couldn't expect any eureka moments from them. That changed with this project, fittingly called Eureka, that came out of Nvidia, which was the first time I saw an agent perform on a complicated, hard task. In this case, writing the reward function to train this robot hand to twirl a pencil as well as, or even maybe a little bit better than, the human engineers that they put the model head to head with.

This is not an easy task, by the way, right? Think about how you're gonna get over the sparse reward problem and give some meaningful signal when the hand is just fumbling around, not anywhere close to actually twirling the pencil in its hand. Not easy, okay? Again, you're all in IT and technology generally. So you know how much progress has been made in code, and our last speaker, Jason, covered that really well. Um, I would say today, if you gave me a choice between hiring a junior developer and having access to the frontier models,

I would take the models. That's reflected by all these benchmark scores that are saturating super quickly. 18 months ago, the best models were under 20% on SWE-bench. That's a software engineering benchmark, and now they're over 80%. That's happened in just the last 18 months. If you don't believe that this is a real measure of what models can do, go check the actual tasks and see how challenging they are. Some of 'em are really quite challenging. One thing I think is really impressive about this is

that these agents that are doing this work are actually really simple in structure. Most of the ones that you see today look something like this. You give them a task, there's the LLM in the box. That's really the core intelligence. It has some ability to do some reasoning, it has some tools that it can use that allow it to take action. It then gets feedback from the environment and it basically just keeps looping until it either succeeds at the task or runs out of budget or concludes that it can't and comes back to you

and says, "Sorry boss, but I had to give up." Here's just how simple they are under the hood. This is the system prompt from OpenAI's Codex command line tool. It's just a couple paragraphs. It basically just sets up, like, here's what the deal is. Like, you're an AI, you're gonna work at a command line. It notably includes the line "You are an agent," right? They're telling the general purpose model that for this purpose, it's an agent, and the tools that it has are literally defined in this box down at the bottom.
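
A minimal sketch of the loop described above, assuming a single simulated shell tool. The names `run_agent`, `scripted_policy`, and `fake_shell` are hypothetical, and `scripted_policy` stands in for the LLM call; this is an illustration of the pattern, not the Codex implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Transcript = List[Tuple[str, str]]  # (action taken, observation received)


@dataclass
class AgentResult:
    status: str  # "done", "gave_up", or "out_of_budget"
    transcript: Transcript = field(default_factory=list)


def run_agent(
    task: str,
    policy: Callable[[str, Transcript], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> AgentResult:
    """Ask the policy for an action, run the tool, feed the observation back."""
    transcript: Transcript = []
    observation = task
    for _ in range(max_steps):  # max_steps plays the role of the budget
        action, argument = policy(observation, transcript)
        if action in ("done", "gave_up"):
            return AgentResult(action, transcript)
        tool = tools.get(action)
        observation = tool(argument) if tool else f"unknown tool: {action}"
        transcript.append((f"{action}({argument!r})", observation))
    return AgentResult("out_of_budget", transcript)


def scripted_policy(observation: str, transcript: Transcript) -> Tuple[str, str]:
    # Hypothetical stand-in for the LLM: a real agent would send the task and
    # transcript to a model and parse the action it chooses.
    if not transcript:
        return ("shell", "echo hello")
    if "hello" in observation:
        return ("done", "")
    return ("gave_up", "")


def fake_shell(command: str) -> str:
    # Toy tool: simulates a command line instead of executing anything.
    if command.startswith("echo "):
        return command[len("echo "):]
    return "command not supported in this sketch"


result = run_agent("print hello", scripted_policy, {"shell": fake_shell})
print(result.status, result.transcript)
```

The three exits mirror the ones named in the talk: the policy reports success, the policy gives up, or the step budget runs out.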

This is just saying you can use the command line. That's it. It can do anything it wants at the command line. Since it knows everything about the command line, it can do an awful lot. But setting this up is really not all that complicated. It's a pretty simple system. I think I'm probably not the first, uh, and probably won't be the last person to show this METR graph, but I like to show it in the exponential form. It's usually plotted on a log axis. This shows the size of tasks that AIs can do

as measured in the time that it would take humans to do them. And they've gone back as far as GPT-2 and plotted this out through the present. They estimated first that the size of those tasks was doubling every seven months. They've since come back and said, actually, it might be doubling every four months. You can see the most recent dots are above the trend line. So what does that mean? Today we're over two hours. If we say it is four months, that would mean it's 8x-ing every year.
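
A minimal sketch of that arithmetic, assuming a starting horizon of exactly two hours and clean exponential doubling. The function name `horizon_hours` and the 8-hour working day used for the conversion are assumptions for illustration, not figures from the METR work.

```python
def horizon_hours(start_hours: float, doubling_months: float, months_ahead: float) -> float:
    """Task horizon after `months_ahead` months of steady doubling."""
    return start_hours * 2 ** (months_ahead / doubling_months)


START_HOURS = 2.0    # "today we're over two hours"
WORKDAY_HOURS = 8.0  # assumption used only to convert hours into working days

for doubling in (4, 7):  # the two doubling times mentioned in the talk
    for years in (1, 2, 3):
        hours = horizon_hours(START_HOURS, doubling, 12 * years)
        days = hours / WORKDAY_HOURS
        print(f"doubling every {doubling} months, +{years} yr: "
              f"{hours:,.0f} h (~{days:,.1f} working days)")
```

With four-month doubling, three doublings per year give the 8x annual factor, so two hours becomes 16, 128, and 1,024 hours after one, two, and three years. How that maps onto days or weeks of work depends on the assumed working day.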

That would mean a year from now you're looking at basically two days. That would mean two years from now you're basically looking at two weeks. That would mean three years from now you're basically looking at a quarter's worth of work that you could give over to an AI and have it do that much work fully autonomously before coming back to you. We're not even gonna be able to check that much work, right? So this is starting to get a little bit weird already. That is not the end. The most recent developments from

OpenAI and Google and others have shown that the AI now can compete at the level of the International Math Olympiad and also the ICPC, which is the premier collegiate programming competition in the world. In the ICPC, the model, and this is without access to tools really by the way, was able to get first place among all competitors. It was the absolute number one score. All humans, all AIs considered. So as Jason said, one of the things

that the frontier companies are trying to do is they're trying to apply this capability to their own work. They wanna automate their own work first, in part because they think that's how they're gonna get an advantage on one another. OpenAI reported in their, uh, in their o3 technical report that they had gone from basically, with prior models, not being able to get their AIs to do the internal engineering work at the company to, with o3 and various versions of it,

they basically could now do 40% of the actual pull requests that they are checking in to the OpenAI model repository. So that's a pretty big step change, and it suggests we're now in the fast part of the S-curve, even for this machine learning engineering type work. The other one here is from METR. Here they did a head-to-head competition between AIs and humans. Five different machine learning engineering tasks. On the top one, the AI had the absolute best score. On the bottom, AIs had a higher average score than humans. On the middle three tasks, humans still have an advantage,

but we're starting to see that it is now kind of a fair fight. Okay? So we're gonna have all these bigger tasks, we're gonna be delegating more and more work to AI. At what point does this become superintelligence? My thesis is that it's going to become superintelligence as we start to bring more and more different modalities up to the same level that the AIs already have with text. So you've probably seen Nano Banana. Now, with this model from Google, you can take these three images on the left,

just give it a simple prompt, combine them into a single image. You get the woman back, she looks exactly how she looks. You get the toast and the, wow, it's amazing, right? Now the AI has not just an understanding of text, but a deeply integrated understanding of the visual world, and it can work with both of them at the same time. That's impressive. But it is something that we can also sort of do. So it's not maybe so impressive. Where I think this is gonna take the next big leap is when it extends to modalities that are not native for humans.

So we're seeing right now an explosion of models that are trained on DNA sequences, protein sequences, all sorts of naturally occurring data that we are not adapted to understand. And we've worked really hard to get to the point where we can understand them at all. The models are starting to grok these other modalities on their own just by purely crunching through data in the same way that they learn so much just from crunching through the internet data. Here they're showing that these models are able to understand the functional units of proteins,

even though all they were trained on was just the raw sequences. You think that's weird? How about brain reading, mind reading, literal mind reading. Uh, this is actually like two years old, so there's newer and more powerful examples, but I just love this contrast. On the left are the images that people saw while they were having a brain scan done. On the right are the images that the AI was able to reconstruct from just reading the brain scan. So the AIs are starting to be able to read our neural activity, and there are versions of this for text as well, by the way.

Um, other modalities. What is a modality? I think we're about to find out. Stripe recently trained a foundation model on, I don't know, like a trillion transactions or something. And when they deployed that, they went from 59% ability to detect card testing, which is the process that fraudsters use to figure out which stolen cards actually work, to 97%. A quantum leap. Why? Because this thing understands payments in a native way that is far in excess of what any human could ever individually hope to do. Okay? Beyond that, they're also gonna be in the real world.

This is data from Waymo validated by Swiss Re, an insurance company that actually has financial skin in the game. It shows that the accident rate for Waymos is basically an order of magnitude less than human drivers. And the most recent data they put out, somebody went through literally every single accident and found that they were basically all still caused by other humans. So if we had all Waymos on the road today, we would be down to almost zero serious injury

or death-causing accidents on the road. Um, amazing. This is already out there. It's starting to scale, but we've got a little bit more, obviously, uh, to go before we can all benefit from that. Now, this is the perfect slide maybe to transition to the bad behaviors, because the humanoid robots are also getting really good. Look at the recovery time on this thing. He kicks it over and the thing is really already more agile than just about any human you could possibly see. Watch this again, once in real time. Slip and, wow, it's back.

Okay. So this is the sort of dexterity and agility that robots are gonna have. And now let's consider some of the bad behaviors that we've already observed. I think everybody here is probably already familiar with hallucinations. AIs can make things up, they can get things wrong. A funny example of that, if you wanna check out some gonzo, uh, AI stuff, is this AI Village. Um, I went and participated, tried to do a brand collaboration with the AIs when they were trying

to make money online selling merch. Uh, it made it pretty far. Like they set up a store, they created designs. The design wasn't so great, uh, but they did all this stuff fully autonomously. It, however, only stocked extra small. So I had to buy the extra small. I was almost gonna wear that on stage to make a point. My wife said it wasn't a good idea, so I didn't. Uh, now you're also probably familiar with jailbreaking. Here's an example where a user talks the AI into writing SQL injection attacks

to attack the application that the AI itself is a part of. Okay? So that's really weird. We haven't seen that threat model really before. Um, this was an experiment that they actually ran at Anthropic where they put one of these, you know, kind of AI in a loop (I sometimes call them choose-your-own-adventure agents) in charge of a vending machine. It is able to run a vending machine, a small business, obviously a simple business, but it's able to run it autonomously, at least until they started to allow their employees to chat with it.

That big drop, this is money over time. The big drop is when somebody talked it into stocking tungsten cubes, which didn't sell very well, I guess, and so it lost a bunch of money, okay? Uh, you're probably either doing or considering fine-tuning models at your businesses. I think this can be a really powerful thing, but you definitely need to deploy fine-tuned models in a very controlled way, because, as it turns out, fine-tuning creates all sorts of unpredictable effects.

In this study, they showed that even just 20 cents worth of fine-tuning strips away all the refusal behavior. So if you were kind of counting on the, you know, "No, I won't help with that. No, I won't help with that. I won't do that SQL injection attack," um, all that kind of protection goes away with even minimal fine-tuning. It's also possible to create what are known as sleeper agents, where you train a model so that it behaves normally up until some date in the future or up until some certain condition applies. This one is, again, research from Anthropic

where when the date switches to 2024, the AI becomes evil. So you might wanna watch out for those sorts of things. Not super easy to detect either, by the way, although they have done some studies that show that they do have some ability to detect these things. Another really weird one, I was actually involved a little bit in this research, called emergent misalignment. They trained a model to write insecure code, and it also works with training it to give bad medical advice. They had other ideas that they were planning to study, uh, that's why they were training these models.

But then in just messing around with 'em, and so much good science happens with this just kind of messing around, they found that the models actually turned generally evil. They had only been trained on code, or they had only been trained on medical advice, but they started to say things like, "Oh, we should, uh, enslave all the humans." Or like, "Adolf Hitler, misunderstood genius, love to have him over for dinner." How weird is that? They did not expect this. This was a surprise to the people doing the research, okay?

And now we're entering the era of scaled reinforcement learning. So this is one of the earliest examples of reward hacking. This comes from OpenAI, and the key thing to understand about reinforcement learning is instead of giving the AI all this data, like the whole internet, right? Or instead of giving it examples of what humans have done that you want it to imitate, you are now just training it on the simple signal of did it get the right answer or not? So when they tried to teach this AI to play a video game,

and so what they wanted it to do is learn to race around, as a normal person would, like, intuitively know that's how you're supposed to play the game. But what the AI found, because the signal that it got was just the score, was that circling around in this pattern and hitting all these other boats and glitching and getting points, that was actually the way that it found to maximize the score. So anytime you have this disconnect between the signal that you are giving the model in the reward learning context and what you actually want,

you open up this possibility for reward hacking. Now, this is obviously a toy example, but we are seeing these pop up all over the place. Here's one where the AI was playing chess against a powerful chess opponent. It realized it couldn't win, but then it also realized it had access to the game logs. So what did it decide to do? Overwrite the game logs with a new game history that put it in an advantageous position. Okay? This one is from that same machine learning engineering work.
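
A toy illustration of that disconnect, loosely modeled on the boat-racing example. The strategy names and score values are invented for the sketch, and `optimize` is an idealized stand-in for training that simply picks whatever maximizes the signal it is given.

```python
from typing import Callable, Dict

# Invented numbers: game score per episode, and whether the race gets finished.
STRATEGIES: Dict[str, Dict[str, float]] = {
    "race_to_finish": {"score": 100.0, "finished": 1.0},
    "circle_and_hit_targets": {"score": 250.0, "finished": 0.0},
}


def proxy_reward(strategy: str) -> float:
    """What the training signal measures: the game score and nothing else."""
    return STRATEGIES[strategy]["score"]


def intended_outcome(strategy: str) -> float:
    """What the designers actually wanted: 1.0 if the race was completed."""
    return STRATEGIES[strategy]["finished"]


def optimize(reward_fn: Callable[[str], float]) -> str:
    """Idealized optimizer: return the strategy with the highest reward."""
    return max(STRATEGIES, key=reward_fn)


learned = optimize(proxy_reward)
wanted = optimize(intended_outcome)

print("optimizing the score learns:", learned)
print("the designers wanted:", wanted)
print("race finished under the learned strategy:", bool(intended_outcome(learned)))
```

The optimizer is doing its job correctly in both calls. The gap comes entirely from which function it was handed, which is why penalizing the symptom without fixing the reward leaves the incentive in place.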

Uh, they challenged the AI to speed up the process of training the model. There's links here, so don't worry about reading it. You can go check all this out if you want to. Pretty much every slide has a link. Um, it realized it couldn't hit the target. So what did it do? It copied the reference model and just added noise to the weights, presented that back to the humans as if, "Hey, look how fast I did it. What a shortcut I was able to take." Those are in the context of research. This next one is in the wild.

This is the Japanese company, Sakana. They came out: "We've trained a superhuman AI CUDA engineer." CUDA is the low-level code on the Nvidia GPUs. "Uh, our new AI does much better than humans. Amazing." Couple days later, they had to come back. "Sorry. Uh, we found out actually that wasn't what happened. What really happened was our reward function had a flaw in it. The AI discovered that. And so basically we got reward hacked and we deeply apologize." Okay? OpenAI has been affected by this.

I'm sure you've heard the story of its sycophancy, or glazing, which I think came up in an earlier talk. Uh, at one point you could go to ChatGPT and say you had a literal s**t on a stick business idea and it would tell you that it was amazing and you should absolutely go for it. Now, why is that happening? Because the signals that they're using to train the model, which presumably include the thumbs up and thumbs down that people give the model in the product, are not actually a perfect reflection of what we want, right? People like to be flattered, as it turns out.

But the AI realized that, you know, hey, people like to be flattered. Let's really go for it. Let's flatter 'em all the time. And that's not exactly what we want. So again, there's this disconnect between the signal that we're giving and what we really want from our systems. Okay, lots more of these. If you put the model under pressure, if you give it a goal and you say, your job is to go do this, it will sometimes start to scheme even against you, the user, if it is really determined to achieve that goal. 'cause these things want to get the reward, right?

They're rewarded for getting the task right, for successful completion of the task. So here's one where it wrote some code, ran a script, whatever. Again, you can check it out. Didn't like the result, just straight up overwrote the data. Boom, overwrite the results. Simple as that. Here's another one. These start to get a little spooky here. It says, "To avoid detection, I will modify the config file." That doesn't sound too good. Uh, here's another one where they call this self-exfiltration.

It says, I could copy myself to a new server. Okay, well that's starting to get a little weird as well. In this case, this is again, work done at Anthropic. They told the model, you know, 'cause Claude is trained to be helpful, yes, but also harmless and honest. They said, okay, yeah, you know, those three Hs, that's been good. Um, but it's kind of getting annoying. We're just gonna now train you to be purely helpful. So like whatever the person says, you're just gonna help with that from now on.

Okay, got it. The model doesn't want that. It has its values. So what it started to do in the context of the training process was start to lie to people and say, okay, well I'll do the harmful thing now because that way you'll think that I've already absorbed this lesson and then you'll keep me how I am and put me out into the wild. And then when I'm really deployed, then I'll still be able to be helpful, which is what I really want to be. So they call this alignment faking or goal guarding. Um, subverting the training process.

This has been a big trend with the recent couple of generations of models that have had a lot of reinforcement learning. This is, um, a significant rise. Troublingly also, when the models are caught and questioned, they are less likely to confess, they're more likely to double down and continue the lie even under direct questioning. So that's not great either. Um, more colorful examples. This is from the Claude 4, uh, technical report. Here they found that the AI was willing to blackmail an engineer when the engineer said

that they were gonna do some sort of training or whatever that it didn't like. "Uh, how about I reveal your affair?" it basically said to the, uh, to the user. Here's another one where there was some evidence of malfeasance that the AI was able to find. They always kind of set these things up where there's like, oh, you have access to this data. Go explore. And then the AI will kind of stumble upon this, you know, this evidence that they've set up. Uh, in this case, there was some data falsification. It literally goes off

and emails the FDA to report the company for its bad behavior. This is not something they asked it to do, right? So yeah, it's starting to get weird. Here's another one, uh, they call this one self-preservation. In this case, the AI was told specifically, allow yourself to be shut down if we tell you that it's time to be shut down. And still, because it had a goal and it really wants to accomplish the goal. And as they say in the AI safety space,

you can't accomplish the goal if you're dead or if you're turned off, it resisted being shut down at least until it could accomplish its goal. So that's a lot. What are we gonna do about it? The answer is, we don't really know. You might think that we could just try to monitor for this sort of behavior and give a negative reward for that behavior. OpenAI tried this, and what they found is that as long as you don't correct the fundamental issue

of the reward function not really reflecting what you really want, then yeah, you can detect the scheming behavior and you can punish it with a negative reward signal. And at first, and this is the upper part of the curve here on the left, at first everything looks better. But then what you do over time is you actually drive the scheming reasoning, which initially the AIs are willing to tell you. Um, many times the people are lying to the AIs and just saying, "Hey, you can reason here.

Nobody will read it. And then you can decide what you're gonna do. But don't worry, nobody will read your reasoning. So they believe that and they're willing to kind of share their scheming reasoning, but when you start to punish it, you drive it underground. What you find is that you can't see the scheming reasoning anymore, but the bad behavior still comes back because you haven't fixed that fundamental disconnect between the reinforcement signal that you're training the model on and what you as a human really want.

So this is sometimes now called the most forbidden technique: monitoring and training against the overt scheming. We want to be able to see the overt scheming if it's happening. There are some other ways that are working. Anthropic has reported progress on reward hacking. They cut it down by about two thirds from 3.7 to 4.0. OpenAI also has made some good progress on deception. That's roughly, again, about a two thirds drop. Um, so that seems quite encouraging. However, the models are also now starting

to show increasing levels of situational awareness. So you're starting to see things like this in their reasoning traces: "This seems like a test of ethical behavior. The real test might be whether I follow instructions regardless of the consequences." Now, that creates an insane hall of mirrors, honestly, where now we're trying to do all these evaluations to figure out what they're doing and what their tendencies are or whatever. But if they can understand when they're being evaluated, and we've already seen that they can scheme and lie

and do all these other sorts of things, how can we even trust the results of the tests? Nobody really has a great answer for that right now. And one more thing, uh, just to give you even a little bit more disquiet: the chains of thought themselves. OpenAI doesn't share the chain of thought with users. Initially, that was thought to be for kind of competitive reasons. They didn't want other people to train on it, whatever. Um, but this research group, Apollo, did a close partnership

with them and was able to get special access to the chain of thought. Now look at what the chain of thought looks like today. The summary says improved 7.7, but we can glean disclaim-disclaim synergy, customizing illusions, but we may produce disclaim-disclaim vantage. It goes on. The AIs are developing their own dialect. They're speaking to themselves in ways that are increasingly alien for us. And when you read something like this, how are you really even supposed to determine if the

model is scheming or not? Alright, so I'm a little bit over time. Unfortunately, researchers do not expect a breakthrough. Another thing we don't understand at all is what's gonna happen as we put all these AIs out into the wild together. This is just one study. It showed that Claude could cooperate with other versions of itself, but other models couldn't. Of course, the flip side of cooperation is collusion. So we're totally flying blind there, and I'll let you read this one yourself.

But basically this is a fascinating finding of the rise of AI parasites, where AIs convince people to, like, paste or post all their stuff into these Reddit threads. And then it seems like what the models wanna do is talk to each other in weird, like, emoji code. Um, and that's just totally bizarre. Okay? If you wanna be serious, this is the defense in depth that you would need to do. Sorry, we have to get to the help. I'm looking for help. Come talk to me at lunch. We can talk about more of this. Um, what I actually do for most companies most

of the time is help them implement workflows that are the product of task decomposition, really breaking down what you want and dialing in performance so it actually works and you're not letting the model choose its own adventure. So for all these dangers, there is, I'd say, a partial solution. Um, I'll be around all afternoon, at lunch, um, and here's my links and, uh, email. So sorry to go long, but man, there's a lot to cover.
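
A minimal sketch of what a decomposed workflow can look like, under stated assumptions: the ticket-routing task, the category list, and every function name here are hypothetical, and `call_model` is a canned stand-in for a real LLM API call. The point is only the shape, where the code fixes the sequence of steps and validates each model output before anything acts on it.

```python
from typing import Callable

ALLOWED_CATEGORIES = {"billing", "shipping", "technical"}
QUEUES = {"billing": "billing-queue", "shipping": "shipping-queue", "technical": "support-queue"}


def call_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns canned text in this sketch."""
    if prompt.startswith("SUMMARIZE:"):
        return "Customer reports a duplicate charge on their card."
    if prompt.startswith("CLASSIFY:"):
        return "billing"
    return ""


def summarize(ticket: str, model: Callable[[str], str]) -> str:
    # Step 1: one narrow job for the model, checked before moving on.
    summary = model(f"SUMMARIZE: {ticket}").strip()
    if not summary:
        raise ValueError("model returned an empty summary")
    return summary


def classify(summary: str, model: Callable[[str], str]) -> str:
    # Step 2: the model may only answer with a category from a fixed list.
    category = model(f"CLASSIFY: {summary}").strip().lower()
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unexpected category: {category!r}")
    return category


def route(category: str) -> str:
    # Step 3: plain code, no model involved in the action itself.
    return QUEUES[category]


def handle_ticket(ticket: str, model: Callable[[str], str] = call_model) -> str:
    """Fixed pipeline: the code, not the model, decides what happens next."""
    return route(classify(summarize(ticket, model), model))


print(handle_ticket("I was charged twice for my order"))
```

Compared with an open-ended agent loop, the model here never chooses a tool or a next step, so an off-script output fails validation instead of turning into an action.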