Labor Market Impact Potential of Large Language Models
Dr. Daniel Rock co-authored a paper, published by researchers at OpenAI, on the impact of AI on jobs. Gene Kim asks him to share that research and talk about what he is working on now and the conversations he is having with CFOs and many others in complex organizations.
Full transcript
The complete talk, organized by section.
Host Intro (Gene Kim)
To introduce Dr. Daniel Rock, let me observe again just how excited our field is right now about generative AI. We had a panel yesterday with John Rouse, director of engineering at Cisco, and Joseph Enix, managing director for AI at EBT. We had Swyx, co-host of the Latent Space podcast, on the rise of the AI engineer.
Professor Daniel Rock is an assistant professor of Operations, Information and Decisions at the Wharton School at the University of Pennsylvania. I had read about this incredible paper that so many people were talking about, the one that researchers at OpenAI published projecting the impact of AI on jobs. I was stunned to find out that Dr. Rock was one of the co-authors.
When I saw the presentation he gave at Applied Machine Learning Days on this topic, I knew that this community needed to learn about his work. What also caught my attention, and what I am hoping Dr. Rock will talk about, is what he is working on now and the conversations he is having with CFOs and many others inside large, complex organizations. Dr. Rock, I am so happy you are here.
Dr. Daniel Rock
Great to see you, Gene, and thanks for having me. Thanks also for the very kind introduction. It is good to get to chat with you all today.
Host Intro (Gene Kim)
I have to hand it right over to you. Take it away. Thank you.
Dr. Daniel Rock
All right. Can everyone hear me okay as well? Sometimes my AirPods are a little... okay, cool. Why don't I share my screen and go through some of these slides?
I will say this is evolving work. The work with this stuff is never done because of one important caveat with what we do: the technology is always changing, and our models of what it can do are somewhat static, a snapshot of one point in time. It is one of those situations where if you could predict the future, you would have already predicted the future.
The title of this paper is "GPTs are GPTs." We are combining a little bit of CS and economics: the CS GPT, generative pre-trained transformers, and the econ GPT, general purpose technologies. Most of the work was to get this search-engine-optimized title to work, and that was a little bit of effort on our part.
What we are trying to do here is answer that big question: how exposed is the labor market, or I should say the U.S. labor market, to generative pre-trained transformers, as a specific algorithm, but more generally large language models and generative AI? This is joint work with two colleagues at OpenAI, Tyna Eloundou and Pamela Mishkin, and one colleague at OpenResearch, which used to be Y Combinator Research, Sam Manning. It has been an absolute pleasure working with them. I cannot wait to get started on our next project.
Let me tick through. Technology-based job displacement arguments definitely are not new. People have been talking about how the machines are going to take our jobs, how the computers are going to take our jobs, for the last 40 or 50 years or so. Prior to that, it was all sorts of other technologies.
We have definitely seen changes in work across the labor market, particularly from IT. We have seen a shift into non-routine work, especially non-routine cognitive work. If you ever want to either torture yourself or entertain yourself, depending on what kind of person you are, go on YouTube and look up what the office of the late eighties looks like. You will be surprised that so many people are doing certain types of work there, and we do not necessarily consider that a bad thing. Nevertheless, there is still a lot of work that we are doing. But all that clerical stuff, people digging papers out of the file drawers and stuff, that is stuff our computers do for us now.
We have seen this shift in the past, and the question is: how much should we expect different shifts now, and really, how much can we say? This paper is optimistic in some sense. What we actually measure, and this is a punchline, is that about 80% of workers have at least 10% of their tasks exposed. That is not good or bad. Exposed means change is coming, but it could be augmentation, could be automation, could be a little bit of both. We do not really know, but we do know that there is likely change coming.
It is also a little bit pessimistic from a methodological perspective. I think with state-of-the-art methods, there is only so much we can say. Asking what generative AI is going to do to the workforce in the long-term equilibrium, once everything has shaken out and all the supply and demand shifts have happened, is a little bit like asking James Watt, back when steam engines were pumping water out of coal mines, "Hey, what is the steam engine going to do to the world economy?" It is very, very hard to forecast.
The way we know that is by testing this hypothesis that GPTs are general purpose technologies. If you have a general purpose technology, that means the impact is going to be pervasive, it is going to improve over time, and it will spawn complementary innovations and require complementary business and organizational innovation. How do you change your organization? How do you change your training and your workflows? These are the things that need to be reconfigured to make a technology really work in a job.
We find there is not much evidence right now that the algorithms are going to take all of our jobs or that we will see widespread unemployment, as many people have predicted. I do not think there is strong evidence for that. Some future technological change could alter that, but what we do see is evidence that these technologies are general purpose, which means big changes are on the horizon.
To set the context, this is something everybody is going to be familiar with, but we are a little bit like frogs getting boiled here. There is exponential progress in the algorithms. About ten years ago, I saw this variational autoencoder-generated picture of a face, and I thought, wow, that is incredible. They can make credible face pictures using algorithms now. Over time, we have gotten to the point where we are upset if a photorealistic picture of someone does not have the exact right number of fingers and toes. Major changes, major improvements in the technology, kind of step changes.
Part of that is because of the scale of the models. We have definitely seen the benefits of going from GPT-2 to GPT-4, for example: much bigger models. But another piece of it is taste, because these models can generate thousands and thousands of different outputs to the same input. How do we know which ones are good? The constraints of the RLHF or RLAIF feedback processes that you impose on these models for generation end up being super important for making these models really valuable.
That combination is what makes it possible for a useful general purpose technology to show up. So the questions are: Is it pervasive? Does it improve over time? Will it require those complements?
What do we do? It turns out the government has about a 20,000-task taxonomy. It is called O*NET. These are the 20,000 things, according to the government, that people do in their jobs. At the most granular level you have a task. There is an abstraction of a task called a detailed work activity, or DWA; there are about 2,000 of those. Each job is a bundle of these DWAs and tasks, roughly 20 to 30 different things people do.
A gambling cage worker and an online merchant share a detailed work activity of executing sales and other financial transactions. But the detailed task description corresponding to that detailed work activity is a little bit different. An online merchant might deliver email confirmation for transactions that happen. A gambling cage worker might process a credit card advance for someone in the casino. Very slightly different versions of similar things, but they could involve different types of workflows.
Since there are 20,000 of these things, we need to score every one of those 20,000 things for their exposure to large language models. We do that in a very specific way. We ask: could you cut the time it takes to do that task with a large language model and have no drop in quality?
There are three answers to that. There is absolutely not, which we call E0, no exposure. There is: yes, I think I could, which we call direct exposure, or E1. This would be like writing a job posting much faster, or editing a paper, as I like to do. And then there is the last one, which is kind of like: it depends. I think I could do it, but I would also need additional software built around the large language model to generate those good outcomes. We call that E2, or LLM-plus-exposed. We needed additional tools to get there.
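The three-way rubric can be written down as a small decision rule. This is a minimal sketch, assuming each task has already been judged on two yes/no questions; the names `Exposure` and `classify_task` are hypothetical and are not taken from the paper.

```python
from enum import Enum


class Exposure(Enum):
    E0 = "no exposure"
    E1 = "direct exposure (LLM alone)"
    E2 = "LLM-plus exposure (LLM with additional software)"


def classify_task(faster_with_llm_alone: bool, faster_with_extra_software: bool) -> Exposure:
    """Apply the three-way rubric to a single task.

    Both flags answer the same question: could the task take less time
    with no drop in quality? They differ only in the tooling allowed.
    """
    if faster_with_llm_alone:
        return Exposure.E1
    if faster_with_extra_software:
        return Exposure.E2
    return Exposure.E0


# Hypothetical annotations, for illustration only.
print(classify_task(True, True))    # e.g. writing a job posting
print(classify_task(False, True))   # needs software built around the LLM
print(classify_task(False, False))  # no time saving either way
```

In the study the judgment itself comes from human annotators and from GPT-4; the sketch only shows how the answers map onto the three labels.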
Testing what proportion of tasks across the economy are directly exposed gives a quick check for pervasiveness. Then, by comparing that to how much additional software or additional investment would be needed to expand the exposure of the technology, E2, we can get at the latter question: do we need a lot of complementary innovation to go with this? It turns out the answer on both fronts is yes. It is pervasive, and yes, we need a lot of complements. I am going to punt on the improving-over-time thing for the time being.
How do we score this rubric? We ask a whole bunch of people who work with OpenAI. They help annotate and evaluate. They know how these large language models work super well. Something they do not know is how to do each one of these tasks in the economy. Caveat right there: these are not experts in that stuff. I have looked into having doctors evaluate all the medical tasks, and it costs hundreds of thousands of dollars. Yes, we have thought of asking people who are specific experts in each one of these fields to go through and do it. It is cost-prohibitive for humble researchers like myself here. Maybe one day we will do that.
We also asked GPT-4. When we first started doing this, people were like, how can you ask GPT-4 about what GPT-4 can do? It seemed like kind of a wild idea. Now people are using GPT-4 for all sorts of social science evaluations, and it seems to do reasonably well. I will say the prompts are brittle, and it takes a lot of iteration to get consistency out of these. But what was encouraging for us is that across a few different prompts and the ratings from the people who work with OpenAI, we get pretty consistent answers. In fact, GPT-4 tends to be kind of conservative on some things, and maybe it is lying to us about what it wants to do. So far, maybe it is trying to be a little bit more sanguine about what is possible. I do not know. People tend to have a little bit more vision about what is out there on the frontier for large language models.
What do we find? For the mathematicians on the call, please do not send me angry emails saying that you do not believe that you are one of the most exposed. We found that mathematicians are. It depends on which model set you are looking at, but we find about 14% of all the different tasks are exposed at an E1 level, where large language models alone can help you out. If you go to full exposure, counting E1 plus E2, the yes-and category where I need additional software, then you get up to about 46% that are exposed. Granted, this is not automation. This is: there is some chance for large language models to help. But that means there is a lot of building that has to be done to unlock the capabilities here.
Some of the roles we find are highly exposed: mathematicians, proofreaders, clerks. The clerical roles tend to be higher on the automation scale. A couple days after this paper first came out, people were saying, yes, but mathematicians, this thing cannot do math, so that cannot be true. Four days later we got the Wolfram Alpha plugin. I have seen people do really interesting things with math. We have theorem solving and automated proof stuff. We are trying to think a little bit ahead, about the models and capabilities that might show up soon. We have definitely seen mathematicians get augmented by some of what is going on here.
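The two headline shares are simple proportions over the labeled tasks. Here is a minimal sketch, assuming each task carries one of the string labels "E0", "E1", or "E2"; the function name and the toy labels are hypothetical, and the toy data does not reproduce the 14% and 46% figures.

```python
from collections import Counter


def exposure_shares(labels):
    """Return (share of tasks labeled E1, share labeled E1 or E2)."""
    total = len(labels)
    if total == 0:
        raise ValueError("need at least one labeled task")
    counts = Counter(labels)
    direct = counts["E1"] / total
    full = (counts["E1"] + counts["E2"]) / total
    return direct, full


# Toy labels for ten tasks, for illustration only.
labels = ["E1"] * 2 + ["E2"] * 3 + ["E0"] * 5
direct, full = exposure_shares(labels)
print(f"E1 share: {direct:.0%}, E1 or E2 share: {full:.0%}")
```

The gap between the two numbers is the part of the exposure that only shows up once additional software is built around the model.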
Another one is poets. I have met a poet who does really cool spoken word stuff using large language models. This is another kind of role where we are seeing pervasive impact in lots of cool, different places. So yes, a lot of tech workers, data scientists, mathematicians, computer programmers, a lot of academic and quantitative academic sorts of work, researchers, physicists, for example, are highly exposed. And a lot of clerical roles tend to be exposed. Those differences turn out to be important.
There are a ton of caveats to this analysis. These are internal to the study, let alone the external ones where we start to think about whether a task model is really appropriate for this kind of thing as opposed to a systems view. I think the one thing most people agree on, in the academic community at least, is that the job is the wrong unit of analysis. You either want to go more granular to the task level, like we are here, or you want to think bigger picture: what is the system like when I blow apart an old system and replace it with a new one? It has different labor demands, different types of tasks and jobs people are doing. Jobs are kind of the uncanny-valley unit of analysis.
We have subjective judgments here. I mentioned that we are measuring LLMs with GPT-4, and that leads to a certain brittleness in what is going on. You have to try lots of different things. Stuff is going to change here. This equilibrium is moving. We have to keep redoing this to keep an eye on the latest and greatest technology.
Probably the most important caveat that I have not mentioned is that we are not saying anything about the social, legal, or political considerations that come into deploying these models. We are just talking about technical feasibility here. There are lots of other bottlenecks, organizational bottlenecks for example, that might limit deployment.
I talk with a lot of executives who knee-jerk to cost cutting with this stuff. That is probably the strongest guarantee that something might fail, because a lot of these people are not thinking about how to deploy this with the worker in mind. Workers have a ton of agency. If you say, "Use this stuff to get rid of the things in your job you do not like doing," huge boost to probability of success on these things. I say that as someone who uses GPT and especially Copilot to automate a lot of things that I do not like doing. Professors, I guess, are a little bit worried about their jobs too, in some cases mostly redoing syllabi.
When you look across different jobs at who is most exposed or least exposed, it does not really matter how many people are employed in a job. There is not a ton of correlation between the count of people in a job and its score. But we do see that, whether you ask humans or GPT-4, higher-wage workers are more exposed. Thinking about this from a human-capital and training perspective, this can be a little scary and anxiety-provoking. People put a ton of effort into learning how to do something. They go through years of schooling or years of training, and they get paid a lot as a result of that. Now we are talking about repricing what they do.
The early evidence suggests that it lowers barriers to entry, that less skilled workers are able to catch up to the highly skilled workers faster. I am not sure that is going to hold up long run as we start to see more tests on unstructured work. It always takes a little bit of time for innovative people to do their innovation. We will see there.
For the complements question, we talked a little bit about pervasiveness and where that pervasiveness happens. Here is where we check the difference between exposed at an E1 level, that is with just large language models, versus exposed with everything. If I want to find what percentage of jobs have at least 40% of their tasks exposed, and I do not want to consider additional software, just LLMs out of the gate, it is something like 5%. It is really low. That holds whether the ratings come from humans or GPT-4.
But if I add that additional software and think about what is exposed there, if I bring in the full LLMOps or DevOps stack, or talk to a DevOps expert about where it could be valuable, then at 40% it is maybe 35% to 45% of workers exposed at that level, or at the extreme end up to 60%. We are talking about a hugely transformative technology in terms of potential. We will have to see where it starts to change stuff. But that answers whether we will require complements, because we know we are going to require software complements at a minimum. There is a ton of great work showing that whenever you deploy new software, you need all those organizational changes to come in with that software to really get the returns.
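The threshold comparison can be sketched as follows: for each job, compute the fraction of its tasks that count as exposed, then ask what fraction of jobs clear 40%, once counting E1 only and once counting E1 or E2. The function name and toy data are hypothetical. The sketch weights every job equally and assumes every job has at least one task; figures stated per worker would additionally need weighting by employment.

```python
def share_of_jobs_above(job_tasks, exposed_labels, threshold=0.40):
    """Fraction of jobs whose share of exposed tasks is at least `threshold`."""
    qualifying = 0
    for labels in job_tasks.values():
        exposed = sum(1 for label in labels if label in exposed_labels)
        if exposed / len(labels) >= threshold:
            qualifying += 1
    return qualifying / len(job_tasks)


# Toy task labels for four made-up jobs, for illustration only.
job_tasks = {
    "job_a": ["E1", "E1", "E2", "E0", "E0"],  # 40% E1, 60% E1 or E2
    "job_b": ["E2", "E2", "E0", "E0", "E0"],  # 0% E1, 40% E1 or E2
    "job_c": ["E1", "E0", "E0", "E0", "E0"],  # 20% E1, 20% E1 or E2
    "job_d": ["E0", "E0", "E0", "E0", "E0"],  # no exposure
}

llm_only = share_of_jobs_above(job_tasks, {"E1"})
with_software = share_of_jobs_above(job_tasks, {"E1", "E2"})
print(f"LLM alone: {llm_only:.0%} of jobs; with software: {with_software:.0%} of jobs")
```

In the toy data, as in the talk, the share of jobs over the threshold rises sharply once the additional-software category is counted.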
I did not talk about automation, but I think there is an important difference here. We built an automation rubric too. The most exposed jobs there, and this is just using GPT-4, are things like telephone operators, telemarketers, word processors, credit authorizers, a lot of clerical roles. Travel agents: I actually just saw a really cool application where you post a picture into a chatbot and it will give you a travel itinerary and suggestions for what to do based on the picture that you just posted in. "Here is the kind of thing I want to do." I was like, that is the coolest thing. Lots of really cool, interesting new businesses can come out, but you are automating a ton and the work is going to change.
Here is the good or the bad news. I do not know. It depends if you are a glass-half-full or half-empty person. There is a positive correlation between overall exposure and automation exposure. I think often we like to think of augmentation and automation as opposites, but really it all kind of happens in the same place, or it often happens in the same place. That means there is a diversification benefit to having lots of different things that you do, but it also means we should get ready for some disruptions.
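One way to check a claim like this is a Pearson correlation across jobs between the two scores. A minimal sketch, assuming one overall-exposure score and one automation-exposure score per job; the numbers are made up for illustration, and the talk does not say which correlation measure was used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-job scores, for illustration only.
overall_exposure = [0.10, 0.30, 0.50, 0.70]
automation_exposure = [0.05, 0.10, 0.30, 0.35]

r = pearson(overall_exposure, automation_exposure)
print(f"correlation: {r:.2f}")
```

A positive value means jobs with more overall exposure also tend to score higher on automation exposure, which is the pattern described above.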
I am going to skip over this part because it is a little bit more pedantic. One thing I am really, really excited about is where the exposure concentrates. The general purpose technology literature focuses on innovation complementarities. That is, when the technology improves, I get this downstream rippling wave of productivity impact as the ability to do discovery becomes cheaper. If you are one of the referees who suggested this idea, I thank you.
What we did was look at the network of jobs. Each dot here is a job in the economy. They are connected by the work activities that they share, and then colored by their E1, their LLM exposure on average. What we find is it is the R&D people, the scientists, the tech workers, the teachers. These are the people who are most exposed. If you think about what that means for our economy, it means that science is what is going to change and development is what is going to change. That is exactly the downstream rippling: the new capabilities, the questions we can now answer that did not even make sense to ask before. That is the potential that I think is coming with all these technologies, and that is super exciting. So when people talk about slowing down the technology, I get very antsy. I get upset that we might be giving up this enormous bounty from better and more rapid science.
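The network described here can be sketched as a graph whose nodes are jobs and whose edges link any two jobs that share at least one work activity. The sketch below reuses the gambling cage worker and online merchant example from earlier in the talk; the other job, the other activities, the E1 scores, and the function name are all hypothetical placeholders.

```python
from itertools import combinations


def build_job_network(job_activities):
    """Map each pair of jobs to the set of work activities they share.

    Pairs with nothing in common are left out, so the keys are the edges.
    """
    edges = {}
    for job_a, job_b in combinations(sorted(job_activities), 2):
        shared = job_activities[job_a] & job_activities[job_b]
        if shared:
            edges[(job_a, job_b)] = shared
    return edges


# Work activities per job; only the shared activity comes from the talk.
job_activities = {
    "gambling cage worker": {"execute sales or other financial transactions"},
    "online merchant": {
        "execute sales or other financial transactions",
        "hypothetical activity A",
    },
    "hypothetical job": {"hypothetical activity B"},
}

# Hypothetical average E1 scores, which would drive each node's color.
average_e1 = {
    "gambling cage worker": 0.2,
    "online merchant": 0.4,
    "hypothetical job": 0.1,
}

for (job_a, job_b), shared in build_job_network(job_activities).items():
    print(f"{job_a} ({average_e1[job_a]}) -- {job_b} ({average_e1[job_b]}): {sorted(shared)}")
```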
We have a whole bunch of robustness checks. I will cover one or two in the time I have left. One of the things we did, we surveyed...
Q&A
Gene Kim
Daniel, given that we have two minutes left, I think the most important thing might be to share what you are working on these days and advice to technology leaders. Thank you.
Dr. Daniel Rock
Sure. I do not have to do the robustness check, so I guess you guys should just believe me.
Stuff I am working on right now: a couple of more micro studies. One of them is that AI is super concentrated in only a few firms. One of the ways that could have happened is that there is a special sauce in making it. Some people are just super efficient at deploying it. Another way it could happen is that you have all these antecedent, daisy-chaining technologies that make you better at AI. If you were good at mobile, you are good at big data; big data led to cloud; cloud led to data science; and now you can do AI, every one of these making it easier to adopt.
There is a tension there. What we found is, surprisingly, there are firms that are better at AI, but they are not the bigger firms. The bigger firms tend to be more AI-intensive because they have all the complementary assets. That is one thing I am working on.
Another is building new measures of labor market transformation. We can use these tools to create new measurements. This is a little meta, but I think people are sleeping on how cool encoders are. We took hundreds of gigs of raw job posting text and boiled it down to the 40 dimensions that kind of matter here. What we found, interestingly, is that if you think of all of the different jobs people do as a universe or galaxy, that galaxy expands every single year for about a decade, from 2010 to 2019. We are working on the last few years, incorporating that data, but it is pretty exciting to see it.
Gene Kim
You also have to tell everyone about the work you are doing at Workhelix.
Dr. Daniel Rock
Oh yeah. Sorry, I have a lot going on. We started a company: Erik Brynjolfsson, Andy McAfee, James Milin, and I. Erik was my PhD advisor. It is to help companies build generative AI. We start by helping you diagnose where your best opportunities are, where there are high levels of exposure and lots of productivity growth opportunity. Then gradually we are looking to help companies implement, monitor, and evaluate their trials.
That is what we do at Workhelix. There are a few different versions of that. We can go outside-in: nobody gives us any data, but then we can say, here is your rough roster; it is our guess at what you are doing. And then, of course, incorporating company data. Wide variety of ways of doing this. If you are interested in working with us on that, let me know. Happy to chat.
Gene Kim
I will share your contact information with everyone. So good. You are absolutely talking at the Enterprise Technology Leadership Summit next year, and I am very much looking forward to our next interaction, Daniel.
Dr. Daniel Rock
Thanks for having me.
Gene Kim
My pleasure. Thanks.