The Four Characteristics of Structure Needed to Get Great Dynamics
Full transcript
The complete talk, organized by section.
Host Intro (Gene Kim)
One of the most impactful learning moments for me was taking a workshop at MIT in 2014. I went to this class because it was taught by Dr. Steven Spear, who I mentioned earlier today in my opening remarks. Since then, he has become a mentor to me, and I cannot overstate the extent to which he has influenced my thinking.
Dr. Spear is famous for many things, but he's probably most famous for writing the most widely downloaded Harvard Business Review paper of all time in 1999, called "Decoding the DNA of the Toyota Production System." This was based in part on his doctoral dissertation at Harvard Business School. In support of that work, he worked on the manufacturing plant floor of a Tier 1 Toyota supplier for six months.
Since then, he's extended his work beyond high-repetition manufacturing work to engine design at Pratt & Whitney, to the building and operating of the safety culture at Alcoa, and to how we can make truly safe healthcare systems. More recently, he worked with Admiral John Richardson in supporting a U.S. Navy initiative to create a high-velocity learning dynamic across all aspects of that enterprise.
For nearly a year and a half, we've been talking two to three times a week, trying to codify what we've observed in our careers about how to create these amazing dynamics that truly unleash human creativity and problem-solving potential across so many domains, with the goal of putting this into a book that we're working on to be released in 2023.
At this time, we will be presenting on how structure may be the best leading indicator of performance. Steve, I'm so delighted that you are willing to help teach this to us today.
Dr. Steve Spear and Gene Kim
Dr. Steve Spear: Gene, thank you so much. I consider it one of the great fortunes in my life that we met some years ago, and that I've had the opportunity to be part of the conversation of this community that you've helped nurture and foster for so long.
Gene Kim: Thank you so much, Steve. Before you start, let me set the stage by talking about how structure may be one of the best leading indicators of performance. We believe there are structures required to enable the dynamics necessary to unleash the distributed and collective human creativity and potential to compete and win in an age that is being tumultuously disrupted by scientific and technological innovation, market transformations, and political and societal realignments.
Over the last 150 years, we've seen how some organizations have been able to generate and deliver better ideas more quickly and more reliably. The question is: how do these amazing organizations unleash this magical dynamic that helps everyone better compete and win?
The thesis is that structure might be the best leading indicator of performance. If we look over the last 150 years, we see that it's not just DevOps and Agile, and it's not just the Toyota Production System. We can see contrasts in how amazing organizations have been able to do great things, such as send a man safely to the moon and back, create COVID vaccines approved for emergency use, and vaccinate the population that needed them so badly.
In previous presentations, we described how many organizations are stuck in a slower problem-solving style in which everyone is trapped in functional silos, unable to do integrated problem-solving with peers. For anyone to do integrated problem-solving, they have to escalate up eight levels and then down eight levels. In contrast, in high performers, the majority of communications are not happening up and down the organization. Instead, they are happening at the edges, within teams or between teams, across sanctioned interfaces. By doing that, work flows more quickly and more knowledge is created.
As a leader, one of your top responsibilities is not only to set system-level goals, but to design the organization, assign roles and responsibilities and the relationships between them, and assess performance. We can simplify it further: the job of a leader is to configure the system, run it, and assess its performance. Steve is going to talk about the properties required to enable great performance and answer questions such as: why isn't hierarchy enough? Why isn't culture enough? What specifically are the characteristics of structure that enable great dynamics?
Dr. Steve Spear: Gene, I appreciate that kind introduction and the time we've spent together over the last many months discussing how we can manage very complex enterprises to tap well into the innate potential of the people in that enterprise to deliver great value to society.
One of the things that has been most impactful for me is the realization that we mischaracterize why we form organizations in the first place. There are legal reasons for creating LLCs, LTDs, Incs, and so forth. More distracting and misleading is the thought that we create organizations to get economies of scale around physical things, machines, material, and that sort of thing.
What has been most impactful and focusing for me in our discussions is the realization that the reason we form organizations is to collectively and collaboratively solve problems that are much bigger and much harder than we can ever hope to solve as individuals. When we start thinking about creating organizations for the explicit purpose of collaborative problem-solving, then we start thinking about the ways in which we organize as conducive or corrosive toward meaningful conversation.
There are four things we can do. First, simplify the flow of ideas through an organization to make it less confusing. Second, create standards. Third, create stabilization mechanisms. Finally, create synchronization among those doing the collaborative, creative work together.
To explain simplification: when we think about creating the flow of ideas or work through an organization, too often we think about minimizing the wait time for the thing. Using industrial metaphors about inanimate, unthinking, unresponsive objects leads us to put things in queues and send them to the first available person, like checking in at an airport. The idea is to move things through the system as quickly as possible.
But at each step in a process, there is a person. If that person is connected to every possible station or source of input before them and every station or destination after them, they have so many relationships to consider and manage at once. The person sitting in the center of a highly intertwined network can spend so much time distracted by questions of whom they depend on and who is dependent on them.
Step one in liberating people's brain space to do something useful is to simplify the flow of work so that each person has fewer upstream dependencies and fewer downstream people depending on them, so that the architecture is simple and less confusing. We then spend less time, energy, effort, and creativity thinking about the architecture of the system in which we are embedded.
From the perspective of the person responsible for the system as a whole, when they look down on interconnected, hyper-convoluted, intertwined circuitry through which ideas are created and flow, how can they manage it? When they can look down on a system with cleaner circuitry, cleaner flows, and cleaner relationships, they can be more prospective in designing and supporting the system. Whether looking from the inside out or the top down, simplicity liberates brain space for useful work.
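One way to make the cost of an intertwined network concrete is to count relationships. The sketch below is an illustration only, not something presented in the talk. It assumes a hypothetical flow with `stages` steps and `people_per_stage` people at each step, and compares a design in which anyone may hand off to anyone in the next step with one in which each person hands off to a single fixed successor. All names and numbers are assumptions.

```python
def handoff_links(stages: int, people_per_stage: int, dedicated: bool) -> int:
    """Count the handoff relationships between adjacent stages.

    dedicated=False: any person may hand off to any person in the next stage.
    dedicated=True: each person hands off to exactly one fixed successor.
    """
    boundaries = stages - 1  # number of stage-to-stage boundaries
    if dedicated:
        return boundaries * people_per_stage
    return boundaries * people_per_stage * people_per_stage


def links_per_person(people_per_stage: int, dedicated: bool) -> int:
    """Relationships one mid-flow person must track (upstream plus downstream)."""
    return 2 if dedicated else 2 * people_per_stage


if __name__ == "__main__":
    # Hypothetical numbers: 5 stages with 4 people at each stage.
    print(handoff_links(5, 4, dedicated=False))  # 64
    print(handoff_links(5, 4, dedicated=True))   # 16
    print(links_per_person(4, dedicated=False))  # 8
    print(links_per_person(4, dedicated=True))   # 2
```

Under these assumptions, the person in the middle of the flow goes from tracking eight relationships to two, and the person looking down on the whole system goes from 64 links to 16.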
Point two is standards. Imagine trying to do anything without a recipe, playing music without a score, or a ballet troupe dancing without choreography. You would predict a mess. A standard is the capture of our collective best-known approach. Why ask someone to do something without the benefit of our already captured best-known approach? Standards take the wisdom of the ages and put it in the hands of the person doing something, perhaps for the first time.
A standard also lets you see an aberration. That is critical because anytime we are doing complex work through a complex, dynamic system, things are going to glitch. Biological systems glitch. Technological systems glitch. Social systems glitch too. Complex and dynamic systems behave well not because they are glitch-free, but because they can respond quickly when problems are seen. In the presence of standards, it is much easier to see problems so they can be swarmed, contained, and solved. In the absence of standards, it is hard even to agree whether there is a problem.
Simplification helps because people spend less time thinking about the architecture of the whole system. Standardization helps because people are equipped with the wisdom of their predecessors, and it is quicker and easier to see a problem when and where it occurs so it can be contained.
The third design principle is stabilization. In complex systems of collaboration, when we see a problem, we want stabilization mechanisms in place. Imagine sitting in that system and a problem occurs and you have no recourse. Locally, you suffer the disruption for longer and at greater amplitude than you otherwise might. The problem may also escape and start impacting downstream. It may escape because you cannot deliver, or because what you are delivering is less adequate. A local aberration becomes a systemic aberration and may eventually lead to system collapse.
With stabilization mechanisms, when a problem occurs, it is seen. It draws attention so someone can come over and ask, "What's the problem? How can I be helpful?" We decrease the duration of the problem so it is less aggravating locally, and reduce the chance that the problem escapes to disrupt the system. Now we have three tactics to increase the chance that people can collaboratively and creatively create value together: simplification, standardization, and stabilization.
The fourth piece is synchronization. Often in industrial settings, programs, or projects, who does what, when, on behalf of whom, and in what form has to go up to some central authority, whether production control, a project manager, or something else. That person has to think about next steps based on where everything is in the moment. The baseplate of activity, whether shop floor, deck plate, studio level, bedside, bench, or another setting, has its own natural frequency, speed, and level of detail, all fast and fine-grained. The central authority tries to draw inputs to understand the situation and gets overwhelmed by the frequency, speed, granularity, and detail. So much signal becomes noise.
The central authority's conceptual space is slow-thinking, deliberative assessment, and they are trying to apply that to something very fast-moving. They get overloaded, and either the system has to slow down for central control to process, or if the baseplate cannot or will not slow down, pockets of chaos become more chaos and may cause system failure.
The alternative is having the baseplate elements — individual designers, programmers, workers, and creative elements — synchronize their work based on immediate upstream and downstream dependencies. That preserves bandwidth and cognitive space for leaders to observe, monitor, assess, support, improve, design, and redesign. They can stay in slow-thinking, creative, deliberative assessment while the baseplate layer self-synchronizes at its own natural frequencies, speeds, granularity, and detail.
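A toy model can show what synchronizing on immediate upstream and downstream dependencies might look like. This is a hedged sketch, not a method described in the talk: it assumes a hypothetical line of stages separated by small buffers, where a stage moves work forward only when the buffer directly downstream has room. The function names, the buffer capacity, and the tick-based loop are all assumptions; the loop is only a simulation convenience, since each stage's decision uses nothing but its own buffer and its neighbor's.

```python
from collections import deque


def step(buffers: list[deque], capacity: int) -> None:
    """Advance the line one tick using only neighbor-to-neighbor signals.

    buffers[i] holds work waiting in front of stage i; the last buffer is
    finished output. A stage moves one item forward only if the buffer
    immediately downstream has room.
    """
    last = len(buffers) - 1
    # Walk downstream-to-upstream so space freed this tick can be used upstream.
    for i in range(last - 1, -1, -1):
        downstream_has_room = (i + 1 == last) or len(buffers[i + 1]) < capacity
        if buffers[i] and downstream_has_room:
            buffers[i + 1].append(buffers[i].popleft())


def run(buffers: list[deque], capacity: int) -> int:
    """Tick until every in-process buffer is empty; return the tick count."""
    ticks = 0
    while any(buffers[:-1]):
        step(buffers, capacity)
        ticks += 1
    return ticks


if __name__ == "__main__":
    # Hypothetical line: 5 work items, 3 stages, buffers that hold 1 item each.
    line = [deque(range(5)), deque(), deque(), deque()]
    print(run(line, capacity=1))  # 7
    print(list(line[-1]))         # [0, 1, 2, 3, 4]
```

Nothing in `step` needs a global view of the line. In this model, a leader could watch the tick count or the buffer levels without having to issue any of the individual move decisions.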
What has come out of our discussions is that we organize for the very important purpose of harnessing the intellectual horsepower of the enterprise toward common purpose. The objective function of how we design, operate, and manage should be to liberate that intellectual horsepower toward creating things of value, not to consume it in understanding the collaborative systems into which we insert people. To shift attention to the thing we are trying to create and away from the system itself, simplifying, standardizing, stabilizing, and synchronizing the system matter enormously.
Gene Kim: To reiterate: simplification asks to what extent value streams are linear and explicit, allowing partitioning and nesting. Standardization means everything has a strong declaration of how things should work and what happens when conditions fail. Stabilization means there is a mechanism so local effects can stay local when things go wrong instead of causing global disruption. Synchronization means coordination can be done locally without a centralized mothership that knows all.
I thought it would be fun to use that lens to look at some better and worse organizations. To be high-performing, we assert that all four characteristics need to be present. For an organization to be low-performing, only one of them needs to be missing.
The first example is Netflix, Chaos Monkey, and the first AWS outage that famously took down so many cloud services. When AWS East went down, almost everyone was down except Netflix. That was due to what they described as a Rambo architecture. They pass the simplicity test because they designed their systems to be partitioned and modularized, so that services stay running even if the systems they depend on go down. Things were standardized: there was a strong declaration of what constituted a properly performing system in correctness, latency, availability, and so forth. When a VM did not meet that standard, it was killed. Stabilized: when failures happen locally, they stay local; components can go down, and traffic fails over to another VM. It was synchronized locally: there was no centralized production control system telling each node what to do.
Dr. Steve Spear: Our assertion is that the way you design systems has a direct consequence in terms of how well you can engage people's problem-solving ability. The Netflix example takes a system that is complex in its entirety and partitions it into self-contained, relatively simple pieces so that if there is a problem in an element, it does not become systemic and can be addressed. It is a fantastic lesson.
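The pattern described for Netflix (declare a standard, kill whatever violates it, fail over locally) can be sketched in a few lines. This is not Netflix's implementation. The `Instance` class, the thresholds, and the function names are hypothetical and exist only to illustrate how a declared standard makes an aberration visible and keeps it local.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    latency_ms: float
    error_rate: float
    alive: bool = True


# Hypothetical standard: what "properly performing" means for one instance.
MAX_LATENCY_MS = 200.0
MAX_ERROR_RATE = 0.01


def meets_standard(inst: Instance) -> bool:
    """Compare one instance against the declared standard."""
    return inst.latency_ms <= MAX_LATENCY_MS and inst.error_rate <= MAX_ERROR_RATE


def enforce_standard(pool: list[Instance]) -> list[str]:
    """Kill every live instance that violates the standard; return their names."""
    killed = []
    for inst in pool:
        if inst.alive and not meets_standard(inst):
            inst.alive = False  # the failure is contained to this one instance
            killed.append(inst.name)
    return killed


def route(pool: list[Instance]) -> str:
    """Fail over: send the request to the first live instance."""
    for inst in pool:
        if inst.alive:
            return inst.name
    raise RuntimeError("no healthy instance available")


if __name__ == "__main__":
    pool = [Instance("vm-a", 950.0, 0.20), Instance("vm-b", 40.0, 0.0)]
    print(enforce_standard(pool))  # ['vm-a']
    print(route(pool))             # vm-b
```

Each decision here is made from one instance's own measurements, so no central controller has to know the state of the whole pool.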
Gene Kim: The opposite pole is problematic services. In the simplicity test, Scott Havens at Walmart talked about item availability lookups requiring 23 deeply nested API calls in the before state, all of which had to be up for the consumer to know whether an item was available. For standardization, Christina Yakomin from Vanguard talked about having alerting when things go wrong but no telemetry to tell us when things are going right. For stabilization, Scott Havens described that when any one dependent service went down, everything went down and they could not sell something to the customer. For synchronization, because all 23 services were required to operate correctly, and any deployment to one of those 23 could go wrong, suddenly all 23 services had to communicate, coordinate, plan deployments together, and deploy one at a time so that when something went wrong, they could identify and isolate it.
Dr. Steve Spear: The Walmart example resonates strongly because if you do not have a system that has been simplified and consequently can be partitioned, there is no difference between a local problem and a systemic problem. The local problem will spread and become systemic quickly. Even if it is identified as local, to fix it you still have to shut down the entire system because everything is connected to everything else, or many things are connected to so many other things, that when you work locally, you are really working on the entire system simultaneously. That is the exact opposite of Netflix handling the major disruption of the unavailability of Amazon Web Services and still functioning.
Gene Kim: Scott spoke brilliantly in 2019 about how he simplified those systems: 23 lookups went down to two, which is amazing.
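A rough calculation shows why going from 23 required services to two matters so much. The sketch assumes every service has the same availability and that failures are independent. Both are simplifying assumptions, and the 99.9 percent figure is hypothetical rather than a number from the talk.

```python
def chain_availability(per_service: float, services: int) -> float:
    """Availability of a request that needs every one of `services` calls
    to succeed, assuming failures are independent."""
    return per_service ** services


if __name__ == "__main__":
    p = 0.999  # hypothetical per-service availability, not a figure from the talk
    for n in (23, 2):
        print(f"{n:>2} services: {chain_availability(p, n):.4f}")
```

Under these assumptions, the 23-call chain is available roughly 97.7 percent of the time, while the two-call chain is available roughly 99.8 percent of the time, even though no individual service became more reliable.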
In the coding space, continuous integration and delivery pass all four tests. It is simple because integrations, builds, tests, and deployments happen all the time, are well-defined and automated, ideally with only one way to get into production. It is standardized because assertions and tests are defined and automatically run upon every change committed into version control. When something breaks, stabilization kicks in: if anything fails or a test fails, the CI/CD pipeline stops, and that engineer, supported by the team or maybe even the entire organization, swarms the problem to get back to a releasable, safe state. All this happens at the local level without a centralized production control system telling every person what to do.
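The stop-on-failure behavior described here can be illustrated with a minimal sketch. It is not modeled on any particular CI/CD product; the stage names and the `run_pipeline` function are hypothetical, and each stage is reduced to a function that returns `True` when it passes.

```python
from typing import Callable

# A stage is a (name, check) pair; check returns True when the stage passes.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order and stop at the first failure.

    Returns (releasable, names of the stages that actually ran).
    """
    ran = []
    for name, check in stages:
        ran.append(name)
        if not check():
            return False, ran  # stop the line: later stages never run
    return True, ran


if __name__ == "__main__":
    stages = [
        ("build", lambda: True),
        ("unit-tests", lambda: False),  # simulated failing test
        ("deploy", lambda: True),
    ]
    print(run_pipeline(stages))  # (False, ['build', 'unit-tests'])
```

Because the pipeline halts at the failing stage, the problem stays visible and local to the change that caused it, and nothing downstream is deployed from a broken state.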
Dr. Steve Spear: The only way this happens is if the system is originally designed with an architecture, both the technical system and the organizational system as an overlay, in which you can create partitions. It must be clear what is within my partition, what I have to coordinate with the person on the other side of a partition, where I have latitude locally, and where I have to coordinate with upstream and downstream dependencies. Then the people at the top are not sucked down into baseplate operations and can continue to do what they are well-suited to do while we do what we are well-suited to do.
Gene Kim: The opposite pole is a large, complex deployment spanning hundreds of people that we only do twice a year, with almost no memory of how we did it last time. We fail the simplicity test: scores of teams and hundreds of dependencies, often only discovered in the middle of deployment, maybe requiring people to be woken up in the middle of the night. We fail standardization because we would be lucky to have good documentation of the hundreds or thousands of required steps. We fail stabilization because when things go wrong, the entire production is jeopardized. In The Phoenix Project, there was no turning back; we could only go forward and could not abort or call off deployment. As for synchronization, it is the worst of all worlds: tightly coupled and loosely controlled.
Dr. Steve Spear: Such an architecture for a complex system is condemning. Even if you can get to something that functions well, it will lack resilience and agility. We live in an environment where so much is changing all the time that the absence of agility, even in the presence of functionality, is condemning.
Gene Kim: One of our favorite examples is the Toyota Production System. For simplicity, one key insight of "Decoding the DNA of the Toyota Production System" is the explicit relationships between work centers. Those pathways are explicitly identified, almost like an API, similar to how we think about software systems. For standardization, jidoka provides built-in tests and standards at each work center. More importantly, the andon cord pull is a signal that help is needed, which takes us to stabilization. There are 4,500 andon cord pulls per day. There are also 60 line-side store changes in a given day, all part of daily work. All of this is done without a centralized production control system that must know all.
Dr. Steve Spear: It sure does resonate. What I learned from Taiichi Ohno through the Toyota Production System book, other things he wrote, and stories I heard about him, is that from his perspective, the reason to create this type of management system, and the beauty of it, is not so much creating explicit relationships between work centers, but creating explicit relationships between the people who are using those work centers as a way to express their inherent creativity as human beings. When one reads Ohno and rereads him and finally comes to some modicum of understanding, you realize he is talking about a system for managing people who happen to be using very complex machinery to express their creativity. He is not talking about the control of inanimate material moving through inanimate objects to create other inanimate material.
Gene Kim: We cannot talk about Toyota without talking about the opposite: the 1990s Big Three auto plant. Things were not simple, and work centers were not connected point-to-point. They were connected through a centralized MRP system. The best evidence of lack of standardization came from the NPR "This American Life" NUMMI episode: documented cases of cars missing steering wheels or tires, and engines being put in backwards, showing that problems may or may not have been detected but certainly were not acted upon. For stabilization, a Big Three automotive executive said they tried six line-side store changes in a given day and had to shut the plant down for three days because everything was in the wrong place. Everything was routed through a centralized MRP system that could not keep pace with the real world.
Dr. Steve Spear: As you're reciting this, it strikes me that the perversion characteristic of such management systems is that they fail or refuse to recognize that they are systems for managing human beings who are seeking meaningful collaboration with each other. Once they miss that key point, everything is about managing material and machines. By ignoring the potential on the baseplate to actually solve problems, you do not engage it, and by overloading those at the top, you cripple their ability to do anything useful. It is disrespectful and perverse all around.
Gene Kim: Our favorite example is Team of Teams in the before state. We fail the simplicity test because missions required experts from vast numbers of functional silos, military services, and intelligence agencies with no direct connections between them. We fail the standardized test because dependencies between each silo were not explicit. There was no equivalent of an API, service-level agreement, or value exchange. We fail the stabilization test because there were no organized mechanisms for teams to quickly get help from each other, especially in the absence of defined SLAs. Prioritizing resources was happening only at the global region scale. For synchronization, teams were working against local priorities rather than the global system-level goal of dismantling terrorist networks in Iraq.
In the after state, they ended up with a far simpler system because functional specialists were put into short- and medium-term mission-oriented teams. It was standardized because roles and responsibilities were defined in the context of these mission-oriented teams, and they could depend on help from each other. Even where those relationships did not exist, they invested in creating them. Difficulties were identified and remedied, and opportunities could be taken: a terrorist leader sighted by a 22-year-old drone operator resulted in capture 45 minutes later. This was not done by a production-control centralized mothership, but by people at the edges. For synchronization, everyone knew the goals, so mid-level leaders could horse trade scarce resources such as helicopter transport and space on intelligence-gathering platforms across the vast network, without intervention eight levels up and eight levels down.
Dr. Steve Spear: One hundred percent. We have had the advantage of learning a ton from David Silverman, who was writing autobiographically as co-author of "Team of Teams." What David has explained in the IDEALCast and this forum is that the start state had high-speed, high-frequency, fine-detail operations. People at his level, as young officers and operational managers, kept having to wait for instruction from higher-ups and constantly give them updates so the higher-ups could decide what David, his colleagues, and adjacent teams were supposed to do when and where. The experience was what one would predict: the top seemed unable to make meaningful, timely decisions because they were flooded with information. Those on the bottom were either idle or misdirected, and the whole system was ineffective.
What David explained is that by giving the baseplate operational level the ability to talk directly through the horizontal flow of ideas and work, and to self-synchronize and stabilize, they could operate and retune with great alacrity while their seniors observed the system as a whole, how it was performing, and how it was directed in terms of purposefulness. Senior people who should have been thinking about policy and strategy, not the micro-tactical level, were liberated to do that. People at the micro-tactical level were able to operate, modify, and adapt at that level too.
Gene Kim: Admiral Richardson said on day one that the job of top leaders is to study the problem, spending 55 minutes of every hour understanding it so they can best craft the solution. This motivates why structure is so important and why it may be one of the best leading indicators we have for what the resulting dynamics will be and to what extent we can achieve the mission.
The help we are looking for is validation or refutation of these ideas. It has been gratifying to piece this together over the last year and a half. If you are interested in mapping flows of work, Steve has some amazing technologies to help enable that. Reach out to him on Slack. Steve, thank you so much.
Dr. Steve Spear: Gene, you're quite welcome.