ACCOUNTING VS. PHYSICS – How Coordination Costs Are Killing Your Effectiveness

Las Vegas 2022

Scott Prugh

GVP Engineering · UKG

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

So I mentioned yesterday: the story from Captain Andy Bean was one of the best engineering stories I've ever seen. But of course, right up there would be the work that Scott Prugh and his team have done at CSG, which Erica Morrison and Steve Barr presented on day one.

So in fact, interacting with Scott over the years starting in 2013 made me notice something very peculiar: that a surprising percentage of the people I admire and respect have the words Chief Architect in their title. Trying to puzzle out why this is the case has been part of the DevOps Enterprise program for years.

As I'm hoping you saw from the presentation that Steve and I gave this morning, it is architecture that dictates so much of the dynamics of the system, which dictates the performance of the system. It's not just Visio diagrams in an ivory tower. It really dictates how people interact, to what degree they can unleash the full creative problem-solving potential of the organization, or to what degree it constrains that potential or even extinguishes it entirely.

And just to fully land the point: moving and painting is a slow and far more forgiving environment than the work that we do every day. If we screw up something like moving and painting so badly by its architecture, imagine how badly we can screw up architecture in software. And that is what I've asked Scott to present today. He has some astonishing insights, and that's why he's on the program committee. He has so much influenced how I've thought about this problem, and he'll also tell you about a new role he has. Here's Scott.

Scott Prugh

And Gene, thank you for that intro. As Gene mentioned, I do identify as an architect, and it's something that I've kind of thought a lot about. I usually think about the software level, but today's talk will really be about applying architecture thoughts around coordination costs in organizations.

One of my sincere hopes for folks here today is, one, that it resonates, and second, that you're able to take some of this back to your organizations to think differently and improve your effectiveness.

Just before I get started, I do want to thank the DevOps community, this community. I've been fortunate enough to be involved since 2013. I've gained so much inspiration from all of you and insights, and I guarantee I've learned a thousand X of what I've taught all of you. So thank you all for everything you've given me.

All right. So if you remember nothing else from today's talk, it's this: architecture plus leadership equals focus, flow, and joy. Erica Morrison did a fantastic job of teaching us that leadership is a lot like loving your people. There's no greater gift you can give your people than to create an environment where they can focus every day and they love their work.

Steven Spear just taught us a little while ago that really what we want to do is create a safe technical environment, a safe psychological environment, and an environment where people could be in the triumph zone, and they go home at the end of the day and they love what they do. That is also rewarding as a leader, when you hear from your people that they love what they do and it's easy to actually do work. And so that's quite a bit what this is about.

I always like to start with the problems and talk about what we're trying to solve. These are really the problems we generally see in lots of organizations. They're kind of universal: capacity and estimation fail miserably. I've always been quoted as saying it's the hardest problem in computer science. Teams struggle to make progress. High cognitive load, escalations become the norm, people are waiting and very frustrated and unhappy. They don't have that joy. Rework occurs often. Customers wait and are also unhappy.

So I'm going to take you through a little journey today on how to think about what's causing some of these things and then how to solve for them. I'm going to talk about architecture and transformational leadership. I'm going to talk about physics of coordination costs and the three C's of coordination costs, the golden rule of dependencies, the three dimensions of architecture, and finally, I've got an example called accounting versus physics, which is really the subtitle of this presentation. I'll ask for a little audience participation. There's a little math, so you'll need to pay attention for some of those answers.

All right. So we start here around architecture and transformational leadership. This is pulled right out of DORA: great leaders build great teams and great organizations, but they enable teams to re-architect their systems and processes. That is really vital. If you think about the causal diagram in the DORA studies, at the top of that diagram is transformational leadership, and from that comes everything else. At the end of that you get high performance, and so that's very important to think about. That's why we really start here.

The next is the physics of work coordination costs. There's a lot of formulas here. The first was wait time: percent busy over percent idle, which we were introduced to in The Phoenix Project. Basically, as you load your teams up past 50%, they get slower and slower, and eventually they stop returning work.

Coordination risk: Troy Magennis spoke on this in 2015, really how your odds of arriving on time degrade as one over two to the N, where N is the number of dependencies, and we'll talk a lot about that later in architecture.

And then knowledge left, which really degrades basically in half with every handoff of tacit knowledge. So as work hands off for every team that work passes through, or every person, basically the tacit knowledge divides in half. That was first introduced in the Poppendiecks' book, Lean Software Development, and Jon Smart has talked about this a lot lately.
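As a rough illustration, the three formulas can be written out as small functions. This is a minimal sketch, not something shown in the talk: the function names are made up for the example, and the on-time formula assumes each dependency independently has a 50/50 chance of delivering on time, which is what "two to the N" implies.

```python
def wait_time_factor(percent_busy: float) -> float:
    """Relative wait time: percent busy divided by percent idle."""
    if not 0 <= percent_busy < 100:
        raise ValueError("percent_busy must be in the range [0, 100)")
    return percent_busy / (100 - percent_busy)


def on_time_odds(dependencies: int) -> float:
    """Odds of arriving on time, assuming each dependency is a 50/50 coin flip."""
    return 1 / 2 ** dependencies


def knowledge_left(handoffs: int) -> float:
    """Fraction of tacit knowledge remaining if every handoff halves it."""
    return 1 / 2 ** handoffs


# Wait time grows sharply once a team is loaded past 50% busy.
for busy in (50, 80, 90, 95):
    print(f"{busy}% busy -> wait factor {wait_time_factor(busy):.1f}")

print(f"4 dependencies -> on-time odds {on_time_odds(4):.4f}")
print(f"3 handoffs -> knowledge left {knowledge_left(3):.3f}")
```

None of these curves is linear, which is the contrast with capacity accounting that the talk goes on to draw.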

So these are really the three formulas that are at play. I think the first question for the audience: how many use capacity estimates to determine work that can be done? Raise your hand. That's it? How many use some of these formulas? Anyone? Scott does. Okay. So that's what I've talked about, accounting versus physics. Accounting is very linear thinking about how we calculate what can get done, but the physics at play is really what's behind the scenes.

All right. So then we have the three C's of coordination costs, which is really contention, which is basically a conflict over a shared resource, and that can be people, it can be technology. We also have coupling, which means the degree of interdependence. Oftentimes in systems, when we change one piece of a system, we're forced to change something else. That's coupling. And then we have coherence, which is making a logical whole.

All three of these C's come into play when we think about architecture in the systems and the processes. I think of them as knobs. You can create a system with no contention and no coupling, but it has very low or no coherence. You can also create a system like a big ball of mud that has a lot of contention and a lot of coupling, and it also has no coherence. So you're always playing with these things. At the end of the day, if you don't reach some sort of logical coherence in the system, basically it's not really understandable and you don't have a good architecture.

All right. So here is where we start solving for some of these problems, and here's the golden rule: removing a dependency doubles your odds. So we're in Vegas. The house does not want you to know this. They don't want you to double your odds, but I want you to know this. It's really important.

Architecture becomes your tool. Basically, we can apply architecture to three things. We can apply it to the organization. We can apply it to the system, or the software, and we can apply it to the processes we use to run.

So let's get to an example now, and this is where I'm going to ask for a little bit of participation. Here is the setup. We've got four teams. They're spread across four time zones and four different technology stacks. They basically build features, and we'll talk about that coming up next. We have a bunch of technical dependencies, like all work for the UI goes to the UI team. Basically, we have functional dependencies here where across these teams, config, order, and billing, we have to actually work across those teams to reach coherency for the features that we actually deliver.

There's a couple things that I'll pull out here. One is called microservices theater. These teams went through a microservices journey a couple years ago. They read a Gartner report. They were like, it's going to solve all their problems. They went and they picked up every technology. They implemented all those things, and it turns out that it didn't help them as much as they would have thought.

So one thing is, don't distribute your domain if you don't have to, because now you have to reach coherence and you have to coordinate that. The other thing is, empowered tool choice does not mean all the tools. Just because you have microservices doesn't mean every team needs to choose their own tools.

The second is what I call SPA insanity theater. Just because there are tools to build complicated single-page applications doesn't mean you have to abuse them. Don't put all your business logic in JavaScript on the front end. Don't. And the other thing I keep wondering is, what happened to the web server? There used to be this thing where you put logic on the server and you served up web pages. It seems to have disappeared the last couple years. So we see a lot of that out there, too.

All right. So we take round one. Round one here is where we basically go and load up a bunch of features. We balance capacity, and you can see some numbers. I've highlighted the capacity of the team is 480, whatever, story points, hours, we can have that fight later. But we've loaded them up to 270 story points, hours. We need to decide a couple things. I think we'll take the combo strategy. We'll start on some parts of these features across these teams, and we'll coordinate that final functionality at the end.

The accounting says basically the teams have plenty of capacity. We should be able to get done in one iteration, right? Gantt linear sequencing may be a problem. We may have to sequence some stuff to get it done. But we feel pretty good because the teams have plenty of capacity there to actually get the work done.

So here's the question to the audience: how many people think it takes one iteration to get done? Raise your hand. Two iterations? Three? Four iterations? Yeah, so most people, right. It's four iterations to get done. You get lots of overtime. You get lots of escalation. You get a lot of escaped defects. Accounting said it should take 270 hours to get done. Actually, it took 1,920. So it took basically seven times the effort to get finished.

The question is, why? What went wrong here? You can see all those orange lines. Those are dependencies between those teams. That's basically coupling between those teams to actually get the work done. There's four distinct dependencies here. Basically what this means is your odds of getting home without a delay are one in 16, or only 6%. If you think of coordination risk, we have those four handoffs, right? Codependent delays make everything late. This is why planes are late. One plane is late, the next plane is late. Basically, you get these codependent delays between these teams.
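One way to sanity-check the one-in-16 figure is a small simulation. This is a sketch under assumptions that go beyond the talk: each of the four dependencies is treated as an independent coin flip with a 50% chance of being on time, and the function name, trial count, and seed are arbitrary choices for the example.

```python
import random


def simulate_on_time_rate(dependencies: int, trials: int = 100_000,
                          p_on_time: float = 0.5, seed: int = 2022) -> float:
    """Estimate how often work lands on time when every dependency must be on time."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    on_time = 0
    for _ in range(trials):
        # The work is on time only if all of its dependencies are on time.
        if all(rng.random() < p_on_time for _ in range(dependencies)):
            on_time += 1
    return on_time / trials


rate = simulate_on_time_rate(4)
print(f"simulated on-time rate: {rate:.3f}, formula 1/2^4: {1 / 2 ** 4:.3f}")
```

The simulated rate should land close to 6%, because a single late dependency is enough to make the whole delivery late.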

All right. So what are the countermeasures in an accounting world? What are the most likely countermeasures to this problem? Yell them out. What do we do, besides microservices, but in the processes? Add people. What's the other one? Higher estimates. So you've had estimates, right? You're like, oh geez, my estimates are wrong. I'm going to add some. We add people, and then we're behind schedule, so now we have to add more work because we've got to catch up. So this is a great way to light money on fire.

You're putting more people in the system, and it increases the coordination costs. You're actually going to go slower: Mythical Man-Month, right? So when you put people into a high coordination-cost system like this, it's a very bad thing to do, and it drives up a lot of cost.
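The Mythical Man-Month point is often summarized with the pairwise communication count n(n-1)/2. The talk only names the book, so the formula below is brought in as an illustration of why adding people raises coordination costs, and the function name is made up for the example.

```python
def communication_paths(people: int) -> int:
    """Number of distinct person-to-person channels in a group of a given size."""
    return people * (people - 1) // 2


# Doubling the headcount roughly quadruples the channels that need coordinating.
for size in (4, 8, 16):
    print(f"{size} people -> {communication_paths(size)} communication paths")
```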

So now let's get to resolving this. Remember, removing a dependency doubles your odds. We want to beat the house here. Architecture is the tool, and these are the three things that we can affect. There's a bunch of snapping fingers going on here because we go through this solution solving rather quickly. Do take Jon Smart's advice: if you inflict these changes on your people without really building consensus, working through it, working through all those changes, you won't be successful. But we go pretty quickly here.

So solving for them. Removing one dependency doubles your odds that there won't be a delay. Can we remove a couple? First thing we think about: we apply Team Topologies and we apply the idea of a platform team, or platform teams in this case. We make config and we make the UI mostly self-serve. Config, we make it all self-serve. The UI, we make it self-serve. We move basically a bunch of those folks to create full-stack teams. We move the UI folks onto those teams. So now you can see a bunch of these dependencies went away, and you can also see the UI dependency is greatly diminished. It's a lot thinner. That really helps us because it's removed a bunch of these dependencies from the system.

Now, what do we do next? This is where it gets a little bit harder, and you have to think about some of the other architectural constructs around coupling that we talked about, and then also something else, cohesion, which is basically the degree to which elements in a module belong together. We've got these two modules I talked about, which are billing and ordering. They're in two different development languages, and they're developed 16 hours apart. So these people are never awake at the same time. We've made it actually pretty difficult to get this stuff done.

We have one type of coupling, technical coupling, which a lot of folks are familiar with: APIs, databases, etc. We have patterns for that. But there's also this concept of functional, or as Mike Nygard calls it, semantic coupling between the teams. In this case, whenever ordering changes, billing needs to change, so they have a semantic or functional coupling. And the other thing is, whenever billing changes, ordering often changes. They're coupled together.

So to reach coherency, in other words to make the system work, we have to actually resolve that. The way we do that is people, and these people are across many time zones, across different technologies, and they have to do a lot of rationalization. That's a very, very expensive and costly way to do this coordination.

There's multiple ways to solve this. We have to decide either to make the coupling scenario better or to lean into it, and in this case, we're going to hug the coupling. We're going to basically invert it and collapse these modules together, because we're having very high coordination costs across these different teams, across these languages. So we collapse them together: one time zone, one technology.

Now, I'm very aware that this can't happen in hours or weeks. It likely happens in years. But these are the types of things I want folks to think about, because now I have greatly reduced dependencies in the system.

So now what do we think the results are? Do we think they get done in one iteration? If you think you can get done in one iteration, raise your hand. All right. I'll give you the hint. They do, right? They get done in one iteration because we dramatically reduced the dependencies in the system.

The accounting side of us, when we did the math and the estimates, said it was going to take 235. Basically, physics, the actual implementation, took 350. It's always more, but hey, estimates, again, hardest problem in computer science. But at least we're in range of what we thought we could do. Why? The distinct dependencies between these teams were dramatically reduced. We reduced the coupling. There's only one dependency. Two to the one is two; that means your odds of arriving on time are 50/50. So that is basically eight times better odds than when we had the four dependencies across those teams that we had to coordinate. Removing a dependency doubles your odds every time. You want to beat the house. This is the way you do it.

All right. So take you through a quick summary of what we did there. We made the config and UI self-serve. The pattern there is really platform team self-service, a very powerful technique. We affected the org, the system, and the processes when we did that. We then embedded UI talent on the teams and cross-skilled the UI. So we created full-stack teams that owned the whole feature. They don't have to go coordinate with other teams.

Finally, we collapsed billing and ordering. We decided to invert the effects of coupling in this case and collapse those domains together. Oftentimes, you see people separate stuff out too soon, and this was a case where that happened, where those things should have just been together because they were sharing a lot of the same entities, the same domain. Then what this looks like in the before and after: we went from four handoffs to one, and we basically ended up being five times better in our cost of actually doing all this.
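The before-and-after numbers quoted in the talk can be checked with a few lines of arithmetic. This is a sketch: the dictionary layout and names are made up for the example, the figures are the ones stated in the talk, and the odds again assume a 50/50 chance per dependency.

```python
# Figures as quoted in the talk (units are "story points, hours, whatever").
ROUND_ONE = {"estimate": 270, "actual": 1920, "dependencies": 4}
ROUND_TWO = {"estimate": 235, "actual": 350, "dependencies": 1}


def overrun(result: dict) -> float:
    """How many times larger the actual effort was than the estimate."""
    return result["actual"] / result["estimate"]


def on_time_odds(result: dict) -> float:
    """Odds of arriving on time, assuming a 50/50 chance per dependency."""
    return 1 / 2 ** result["dependencies"]


cost_ratio = ROUND_ONE["actual"] / ROUND_TWO["actual"]
odds_ratio = on_time_odds(ROUND_TWO) / on_time_odds(ROUND_ONE)

print(f"round one overrun: {overrun(ROUND_ONE):.1f}x")
print(f"round two overrun: {overrun(ROUND_TWO):.1f}x")
print(f"actual cost improvement: {cost_ratio:.1f}x")
print(f"on-time odds improvement: {odds_ratio:.0f}x")
```

The cost ratio comes out a little above five, which is consistent with the "five times better" quoted in the summary.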

So now here is the real problem. I only talked about engineering. You see that engineering team up in the corner. They're value-stream aligned. They implement their features. But here's the real problem in organizations. We now have all this: we have maybe a solution team that puts together the solution before the engineering team gets it. We have a product team that does a lot of the requirements. Then we have platform teams, and we have customer service. We might have level one service. We have level two.

Now we get the customer all the way at the end of this, and they just want value. In this case, we have eight distinct dependencies. Basically, this is one chance in 256 of actually getting home on time. And so this is why, when we think about architecture, the organization, and the process, and the system, the software level, it is so important, because your odds of arriving on time with something like this are so slim. You're basically working against the physics in the system unless you change it.
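To see how quickly the odds shrink along a value stream like this, the halving can be tabulated one dependency at a time. This is a sketch that assumes every dependency halves the odds independently; the function name is made up for the example.

```python
from fractions import Fraction


def cumulative_odds(dependency_count: int) -> list[Fraction]:
    """On-time odds after each successive dependency, halving every time."""
    odds = Fraction(1)
    history = []
    for _ in range(dependency_count):
        odds /= 2
        history.append(odds)
    return history


for count, odds in enumerate(cumulative_odds(8), start=1):
    print(f"{count} dependencies: 1 in {odds.denominator} ({float(odds):.2%})")
```

Read from the bottom up, the same table shows the golden rule: each dependency removed doubles the odds.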

The other thing I'll add here is really emphasized in Dave Farley's latest book, Modern Software Engineering. Long feedback loops basically thwart your efforts to refactor your architecture and systems. The best feedback loops are local. Use TDD to drive and refactor and get really great modularity. When you have feedback loops that are spread across all these teams and time zones and groups, you're never going to refactor your system. It's so hard because it takes so long and there's so much coordination to do.

All right. So the summary: tackling coordination costs. The first thing I said to remember is architecture plus leadership equals focus, flow, and joy. Both of those things matter: architecture and the leadership. The physics of coordination costs degrade exponentially. The three C's: contention, coupling, and coherency. The golden rule: remove a dependency, double your odds. Beat the house with that one.

The three dimensions of architecture: you can affect the organizational architecture, the system architecture, and the process. Then the patterns we applied: platform self-serve, full-stack teams, domain and team modularity. In this case, we hugged and inverted the coupling that we had. Time-zone cohesion is important. Using talent across multiple time zones is incredibly important, so think about how your architecture spans them. And having standards in place is very important also.

The help that I'm looking for is more examples like this: identifying models for the cost quantification. How do we quantify the cost and talk to our leadership about these types of things, and then patterns to continue to solve for these types of coordination costs. Thank you very much.