ACCOUNTING VS. PHYSICS – Coordination Costs and How Organizations Win

Las Vegas 2023

Scott Prugh

Transformational Technology Leader

Full transcript

The complete talk, organized by section.

Scott Prugh

It's really great to be here. It's great to see a lot of familiar faces. It's been 10 years, and so it's really a privilege to be back.

I'm going to talk a bit today about something I call accounting versus physics, and then really coordination costs, which are the physics part of this. A lot of work that we do in software and in organizations, we usually measure in very linear ways. Unfortunately, our organizations behave very differently. That's really the physics part.

I also did a lot to try to integrate Gene and Steve's recent book and overlay that onto the talk. The good thing is there's just a ton of synergies there, which bring a lot of the discussion forward.

I view a bunch of our jobs as leaders, and at least my job as a technologist, as a CTO, as a leader, to really bring focus, flow, and joy to our organization so that we can deliver outcomes effectively to not just the business, but the customers, of course.

To do that, we need to apply a couple things, obviously leadership, but I also view architecture as one of our jobs as leaders. Architecture has many dimensions, as you'll see. It's not just about how we build software, but it's also how we design the processes. It's how we think about enabling the great people that do the work. It's also how we design the systems to really all come together, and that's how we can deliver that focus, flow, and joy.

The problems we start with are really the things that we see, or what I call the observed problems, the ones that bubble up to us every day.

We do things like try to get the capacity of our teams. We do estimates, and we get to do estimates again. Sometimes we get asked the third time because whenever they come back, they don't appear to be right. Those seem to fail miserably. They just don't work. Everyone's frustrated, and product management yells at engineering and says, "You're terrible at estimating, and we need to get more productivity as IT teams."

The teams struggle to make progress for a variety of reasons. It may be that builds and environments are really hard to work with. It may be that their tools are inadequate. It may be that they don't understand the requirements. It may be that everything takes too long.

Escalations start to become the norm. The way folks can actually get work done in those environments is they escalate to their leaders, they escalate to other organizations. That's how they try to survive.

The people in those environments are waiting or frustrated. Rework occurs often. That comes back to us in production incidents or defects or things that our customers find and are frustrated with. That comes back to the teams, and then that disrupts their current flow of work.

And then, of course, the outcome is that the customers at the end of all this wait and are unhappy. Those are not the results that we actually want in our software development processes for delivering the valuable outcomes that our companies want for us.

I'm going to take us through a journey of understanding this. There's a bunch of the theory up front that I'm going to talk about, and then the back end of this is an actual example with real data. It's a real portfolio that we transformed. The data is from what we pulled from doing some value stream analysis. There's a bit of it that is what I would call sped up so that the tolerability is a little bit better, but it's a factual example.

The first piece is really talking about the three layers of organizations and how to think about rewiring organizations for improvement. Then we talk about architecture and the three dimensions there: organizational, process, and system, and how important transformational leadership is.

We'll talk about the physics of the coordination costs, really counterbalancing that with the idea of using accounting-based mechanisms, the three Cs of coordination costs, the golden rule of dependencies, the three dimensions of architecture, and then how we apply simplification. Then we'll get into that example. There's a bit of math, and I'll ask some questions in there, and we'll see if folks are paying attention.

The three layers, and this is out of Gene and Steve's great book. We had a workshop yesterday on it. There are three of these.

Layer one, or that lowest layer, is really the person that is doing the work with the technical object. In our world, those are really the developers, the architects, the testers. It's the folks that are working with the code and are working on the code.

Layer two, the next layer up, are really the tools that you use. These are things like your IDE, your version control, things like CI/CD, telemetry, infrastructure as code, work tracking. It's even the great new tools, and you have to get Copilot and AI in here or otherwise it would not be a 2023 talk. Those are what are considered layer two.

Then layer three is the organizational architecture, your system architecture, your processes, the way information flow occurs, and ideas, and even behavioral norms and the cultural norms all exist up at this layer three level. That's even how we treat people and what type of culture we create. That is all that social circuitry.

You have to be aware of these three levels because you have to affect things at multiple levels to make changes. If you only pick one of these levels, and you'll see in a minute, and make a change at layer one, it may not have that desired effect. It doesn't mean you might not do it, but it's just not going to have the desired effect overall.

Let's take an example. This is real data from a value stream analysis of feature flow through a system. A feature is something valuable that we want to deliver to a customer. It comes in and it goes through this process, and we do some analysis on that value stream. We find that it takes about 281 days from where someone asks for a feature to flow through the organization and actually get to a customer.

If we look at that and map that to layer three, that is really that flow of work and ideas throughout the organization.

If you look, we'll see the development phase marked out here, and that is only 65 days in a 281-day process. That represents some 23% of the total time.

The important thing is to really understand this. One is value stream analysis and how important it is, but also really understand how that is hard-coded in the organization. Because if you can't affect those things, that layer three, your improvements at other layers may not really have the effect that you're hoping for.

If we look at this and we say, of that time, about half of the time is actually spent in an IDE coding, and that's actually generous. There's some analysis that we've actually done that puts time that developers actually spend coding between 7% and 8%. Actually, Mik Kersten from Planview finds about that time also.

But we'll divide in half to make it easy, and we'll say it's about 12% of the total lead time for that feature is actually spent in coding. We've got this developer all the way down to layer one. They're working, and let's say they have some tools. There's this really cool tool, it's called Copilot, and it makes developers really fast.

If we now go and apply those tools, which is probably a good thing to do, what is the best result that we can hope for here? If they become 100% efficient, basically they'll get those four hours a day back, hopefully, of coding. But really the best that they can do is to have 65 days of coding, which is still 23% of the total time, and you're never going to get all that back anyways.
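As a rough sketch of the arithmetic in this passage, the snippet below uses the 281-day and 65-day figures from the value stream and the "half of development is coding" simplification. The `lead_time_after_speedup` helper is an illustrative assumption, not part of the original analysis: it supposes that a tool speeds up only the coding portion and leaves every other step untouched.

```python
# Figures quoted in the talk.
TOTAL_LEAD_TIME_DAYS = 281
DEVELOPMENT_DAYS = 65
# The "generous" simplification from the talk: half of development is coding.
CODING_FRACTION_OF_DEV = 0.5

dev_share = DEVELOPMENT_DAYS / TOTAL_LEAD_TIME_DAYS
coding_days = DEVELOPMENT_DAYS * CODING_FRACTION_OF_DEV
coding_share = coding_days / TOTAL_LEAD_TIME_DAYS


def lead_time_after_speedup(speedup: float) -> float:
    """Lead time if ONLY the coding portion becomes `speedup` times faster.

    Assumption for illustration: all non-coding time stays the same.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return TOTAL_LEAD_TIME_DAYS - coding_days + coding_days / speedup


if __name__ == "__main__":
    print(f"development share of lead time: {dev_share:.1%}")
    print(f"coding share of lead time:      {coding_share:.1%}")
    print(f"lead time with 2x faster coding: {lead_time_after_speedup(2):.2f} days")
```

Under these assumptions, even doubling coding speed shortens the 281-day lead time by only about 16 days, which is the point of "it might help, but it won't save you."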

This is also known as, Copilot might help, but it won't save you. Do it, but don't expect to get the returns. Developers will be happy, produce some better code.

The thing to think about, and Mik had a blog a couple weeks ago and talked about work that he's doing at Planview, which is really to analyze the data at this higher level. Scott, Ella, and I had a good discussion earlier too, around what can we do at this level to really analyze how the work flows through the system, get the understanding of that, and really start to try to rewire that, because that's really an important thing to tackle. This is important to understand across these three layers.

Now I'll flip to the DevOps picture. It really came out of Accelerate, really about architecture and transformational leadership, and really how, as leaders, our job is to build great teams, great technology, great organization, and enable our teams to re-architect the systems and processes.

I just flip that and really think about, in Gene's book, we're really talking about rewiring the organizations with the technical practices and lean product management. This is really important because this rewiring has to coexist with these other technical practices.

Let's talk about the physics and really what's at play here.

We had The Phoenix Project, which presented wait time, percent busy, percent idle, which was really about how work queues up in systems and basically how people respond to being overloaded. Bottom line is that as you overload folks, they respond slower and slower. That's very important.

The next is coordination risk, which is a one in 2^N chance of arriving on time when you have N dependencies. Every time you add a dependency, your odds of arriving on time halve.

Finally, the knowledge loss, which was originally published in the Opex book and Jonathan Smart brought it into his book, which is every time you have a handoff, you lose what's called tacit knowledge. Every handoff cuts the tacit knowledge in half.

These are really the physics that are at play. I really counterbalance this with the accounting that's generally used in software development, which is, "Give me your estimates on how long something is going to take," and basically divide the number of people, and now I know how many people I need. That's an accounting-based approach to solving the problems. We really need to think differently because these are really the formulas that are at play in the organization.
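The three relationships just described can be written down in a few lines. This is a minimal sketch: the function names are made up for illustration, and the formulas are the simple forms stated above (wait time as the ratio of percent busy to percent idle, a one in 2^N chance of arriving on time, and tacit knowledge halving per handoff).

```python
def wait_time_factor(percent_busy: float) -> float:
    """Relative wait time: percent busy divided by percent idle."""
    if not 0 <= percent_busy < 100:
        raise ValueError("percent_busy must be in [0, 100)")
    return percent_busy / (100 - percent_busy)


def on_time_probability(dependencies: int) -> float:
    """Chance of arriving on time with N dependencies: 1 / 2**N."""
    if dependencies < 0:
        raise ValueError("dependencies must be non-negative")
    return 1 / 2 ** dependencies


def tacit_knowledge_remaining(handoffs: int) -> float:
    """Fraction of tacit knowledge left when each handoff cuts it in half."""
    if handoffs < 0:
        raise ValueError("handoffs must be non-negative")
    return 0.5 ** handoffs


if __name__ == "__main__":
    for busy in (50, 90, 95):
        print(f"{busy}% busy -> wait factor {wait_time_factor(busy):.0f}")
    for n in range(5):
        print(f"{n} dependencies -> on-time odds {on_time_probability(n):.4f}")
```

None of these grow in a straight line, which is the contrast with dividing an estimate by a head count.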

The properties that sit underneath this that are causing these physics: one is contention, which is conflict over access to a shared resource. You contend for that resource, and now you have conflict.

Coupling, basically the degree of interdependency between two teams, resources, people in an organization.

And then coherence, which is very important, which is the quality of forming a unified and logical whole. We can break a system into a hundred parts, but if we can't create coherence from all those parts, we still have a problem. We can think about that in terms of microservices, if that was really where microservices were going. But one of the downsides of microservices is you over-fracture the system. You make it really hard to understand. There's a balance in all that.

Now let's talk about the golden rule, which is, you're in Vegas, we want to double your odds. Removing a dependency doubles your odds. It's very important.

The way we approach it is with this layer three architecture and addressing that, and addressing the organizational architecture of the system and the process architecture at that level. To put this in contrast with how Gene and Steve look at it, this is really underneath simplification and really how we look at modularization, creating linearization, and working incrementally in organizations to win.

Now here's the example, and here's where all the math comes about.

We set up this situation where we have four teams. We have different tech stacks. We have different time zones. You see we're all over the place, from UTC minus six to UTC plus five and UTC plus 10.

The features get delivered in these two-week iterations. We basically send all UI work to the UI team. All teams need to really deliver a coherent solution. The features also often have dependencies across domains: config, order, billing, and UI.

If you look at how we got here, the reason we got here was we went about microservices theater. Every team should choose their own language, their own database, and go off and build their own microservice. So we had this fractured domain.

We also had this SPA insanity theater that went about building a really complicated UI with hundreds of thousands of lines of JavaScript and TypeScript in the UI, using very sophisticated technologies there. This is what we ended up with.

Now this is where you get ChatGPT involved and you say, "Hey, is this a modularized and linearized structure where work can be done incrementally?" We're really not sure. But you'll see in a minute that it's actually challenging to work across this.

We take round one of this picture and we look across and we load up features. We've got four features below. We go get some estimates and we balance capacity. We start all teams on all features.

We have some options of what we need to decide. Do we start all teams at once? Do we sequence the feature? We'll end up with a combo where we start on some parts of the features and then we coordinate the final functionality at the end.

You look, total capacity is about 480 points, jelly beans, whatever you use to measure there. Each team is, and this is where we simplify it just to make it easy, has the same capacity. Then we load them up to 270.

All the data here says, we've got plenty of capacity. One iteration should be sufficient. You might have some challenges with linear sequencing.

We get the run results and basically we find this: the teams take four iterations to get done. There's lots of overtime, there's lots of escalations, there's lots of escaped defects. Accounting said it should take 270, so less than one full iteration, but it really took us seven times that effort to get done.
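Here is the round-one accounting worked through as a sketch. The 480, 270, and four-iteration figures come from the example; treating actual cost as iterations times full capacity is an assumption, chosen because it reproduces the 1,920 cost quoted later in the talk.

```python
import math

TEAMS = 4
TOTAL_CAPACITY_POINTS = 480   # per iteration, all teams combined
PLANNED_LOAD_POINTS = 270     # estimated work loaded onto the teams
ACTUAL_ITERATIONS = 4         # what the run actually took

# The accounting view: linear capacity math.
capacity_per_team = TOTAL_CAPACITY_POINTS / TEAMS
planned_utilization = PLANNED_LOAD_POINTS / TOTAL_CAPACITY_POINTS
predicted_iterations = math.ceil(PLANNED_LOAD_POINTS / TOTAL_CAPACITY_POINTS)

# The observed view (assumption: every iteration consumed full capacity).
actual_cost_points = ACTUAL_ITERATIONS * TOTAL_CAPACITY_POINTS
overrun = actual_cost_points / PLANNED_LOAD_POINTS

if __name__ == "__main__":
    print(f"capacity per team:    {capacity_per_team:.0f} points")
    print(f"planned utilization:  {planned_utilization:.0%}")
    print(f"predicted iterations: {predicted_iterations}")
    print(f"actual cost:          {actual_cost_points} points")
    print(f"overrun vs. estimate: {overrun:.1f}x")
```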

Who wants to guess why?

Coordination.

What ChatGPT couldn't see, or we couldn't represent in the system, and this is the thing that we need to figure out, is how do we surface those dependencies? How do we get those into the system and visible?

Those orange lines are those dependencies. Those are the distinct coordination dependencies in the system. There's four of them. That gives us two to the fourth, which is 16. So the odds are one in 16 that you'll actually arrive on time with this type of system and process architecture in place.
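One way to read the one in 16 figure is that each dependency is an independent coin flip on whether it lands on time, and the feature is on time only if all of them do. That independence and the 50% rate are modeling assumptions used here for illustration; the simulation below simply checks that they reproduce the 1 / 2^N rule.

```python
import random


def simulate_on_time_rate(dependencies: int, trials: int = 100_000, seed: int = 42) -> float:
    """Estimate how often ALL dependencies land on time.

    Assumption: each dependency is on time with probability 0.5,
    independently of the others.
    """
    rng = random.Random(seed)
    on_time = 0
    for _ in range(trials):
        if all(rng.random() < 0.5 for _ in range(dependencies)):
            on_time += 1
    return on_time / trials


if __name__ == "__main__":
    for n in (4, 3, 1):
        estimate = simulate_on_time_rate(n)
        print(f"{n} dependencies: simulated {estimate:.4f}, formula {1 / 2 ** n:.4f}")
```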

These codependent delays make everything late. That's really important to figure out.

The round one countermeasures in the accounting world: what do we do? What are the top three things that we do when we encounter this problem?

That's one of them.

We often pad the estimates next time. That's one of them. The other is that we add people. "If I had more people, we'd get the work done," right? Then we add more work to catch up.

This is a really great way to, well, they obviously already did. That's why they potentially got here. They also called the microservices consultants. But this is how you light money on fire. This is an extremely expensive way to manage a process.

We've seen a bunch of different companies, 10 different portfolios. We've seen this problem play out very similarly in lots of different teams, technologies, industries. Unfortunately, it often plays out this way.

Now we think about it when we get back to this: removing a dependency doubles your odds. We look at this layer three architecture. Let's rewire this.

This is where we take a little bit of literary license, because this does not happen in minutes. It happens in months and years.

We look at this and we say, solve for these coordination costs. Removing one dependency handoff doubles your odds of not being delayed. Can we remove several of them?

The first thing we look at here is this whole config and UI problem we had. All of the config for the system went through the config team. Let's just say that was a good design. But the problem was that we had to coordinate with that team every time we put new config knobs in the system.

So we make the config self-service. We add a lot of self-serve tooling. Now teams that need to put new config in the system don't have to actually coordinate with someone there. The config team becomes more of a platform team.
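To make the self-service idea concrete, here is a purely hypothetical sketch. The `SelfServeConfig` class and its methods are invented for illustration and are not the tooling from the example. The point is only that a team can add a config knob in its own namespace without waiting on anyone, while the platform still enforces basic rules.

```python
class SelfServeConfig:
    """Hypothetical self-service config registry owned by a platform team."""

    def __init__(self) -> None:
        # team name -> {config key -> value}
        self._knobs: dict[str, dict[str, object]] = {}

    def register(self, team: str, key: str, default: object) -> None:
        """Add a new knob in the team's own namespace; no ticket, no handoff."""
        namespace = self._knobs.setdefault(team, {})
        if key in namespace:
            raise KeyError(f"{team}.{key} is already registered")
        namespace[key] = default

    def get(self, team: str, key: str) -> object:
        """Read a knob's current value."""
        return self._knobs[team][key]


if __name__ == "__main__":
    config = SelfServeConfig()
    # The billing team adds its own knob whenever it needs one.
    config.register("billing", "invoice_grace_days", 14)
    print(config.get("billing", "invoice_grace_days"))
```

The guard against duplicate keys stands in for whatever validation a real platform would apply automatically instead of through a person.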

We do a similar thing with the UI. You can see those coordination lines go away, and we can actually see greatly diminished coordination on the UI. We make the UI frameworks a lot easier to use. We embed UI on those teams. Now we actually start to remove a bit of that coordination. This enables modularity and linearization in the system.

Now what can we do next?

Now we actually have the harder problem that we look at here, which is coupling.

Coupling is this degree of interdependency between software and modules, taking the software sense. Cohesion is the degree to which elements in a module belong together, like they should change together in the system.

Technical coupling is an easier thing to comprehend, and it's things like, I've got a service that depends on an API or a shared database. Those are easy to resolve. We have a bunch of patterns for that.

The harder stuff is what is called functional or semantic coupling, where I have functionality in one subsystem that is dependent on another. In this case, we had a significant challenge. When billing changed, order often changed. Then when order changes, billing must change.

We have this problem now across time zones. If you look at that, UTC minus six and UTC plus 10, they have a 16-hour time difference between two teams that have codependent functionality that they need to synchronize. That is an extremely difficult thing, because this is usually resolved by people talking to each other. If those people aren't even awake, how do you do this?
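A small sketch shows why the 16-hour gap is so painful. The 9:00 to 17:00 local workday is an assumption for illustration; the UTC offsets are the ones from the example.

```python
def working_hours_utc(utc_offset_hours: int, start_local: int = 9, end_local: int = 17) -> set[int]:
    """UTC hours (0-23) covered by a local workday.

    Assumption: everyone works start_local..end_local in their own time zone.
    """
    return {(hour - utc_offset_hours) % 24 for hour in range(start_local, end_local)}


def overlap_hours(offset_a: int, offset_b: int) -> int:
    """Number of hours per day when both teams are at work."""
    return len(working_hours_utc(offset_a) & working_hours_utc(offset_b))


if __name__ == "__main__":
    offsets = (-6, 5, 10)  # the UTC offsets mentioned in the example
    for i, a in enumerate(offsets):
        for b in offsets[i + 1:]:
            print(f"UTC{a:+d} vs UTC{b:+d}: {abs(a - b)}h apart, "
                  f"{overlap_hours(a, b)}h of shared working time")
```

With those assumed hours, the UTC minus six and UTC plus 10 teams share no working time at all, so every synchronization between them costs at least a day.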

Well, we put it into documents and we send it the next day. Yes, there's value in working asynchronously doing those things, but it's hard. It's very, very hard to actually do this, and it shouldn't be your default choice.

If we go look at this, one of the things we thought through, and this again took us months to go work through, is we basically collapsed those modules together and ported off of the one programming stack into the other and collapsed the software modules.

What that did is that actually removed these time zone coordination costs from us. We basically hugged the coupling. We used modularity and cohesion together, put it into one time zone, one technology stack. That made it now a lot easier to coordinate.

People are not as good at coordinating as compilers. When things live in the same module, you still have modularity at the code level through interfaces, but the compiler will actually tell you when dependencies are not resolved. That's a lot easier to do than to have people coordinate across 16 hours.

We look at this. The teams take less than one iteration to now get done. It takes us basically about 360 points to actually finish this, where accounting total is 235. The reason is we have one distinct dependency. So our odds are 50/50 that we'll arrive on time, which is a lot better than the four dependencies.

If we start to summarize this up, there's a bunch of actions that we took.

One was to make the config and UI self-serve. Basically use a platform team pattern, self-service APIs. We implemented modularity and linearization. We affected the org, the system, and the process.

We embedded UI talent in the team. That was full-stack team pattern and cross-skilling. Again, modularity and linearization. We affected the org and the processes there.

We collapsed the billing and ordering, which affected all three dimensions. Basically, we inverted the coupling by bringing those things together. Development standards, same programming language, and cohesion was key.

If you look at before, you can see all the coordination, four handoffs, basically a one in 16 chance of getting home. The cost was 1,920. Afterwards, basically we have one handoff, which is a one in two chance there would be no delay. The cost is 360. That's about five times better result that we actually got here.
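The before-and-after comparison can be tabulated as follows. This is a sketch using only the numbers quoted in the example; the `Round` class and its field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Round:
    name: str
    dependencies: int        # distinct coordination dependencies
    accounting_points: int   # what the estimates said
    actual_points: int       # what it really cost

    @property
    def on_time_odds(self) -> float:
        """Chance of arriving on time under the 1 / 2**N rule."""
        return 1 / 2 ** self.dependencies

    @property
    def overrun(self) -> float:
        """Actual cost as a multiple of the accounting estimate."""
        return self.actual_points / self.accounting_points


before = Round("before", dependencies=4, accounting_points=270, actual_points=1920)
after = Round("after", dependencies=1, accounting_points=235, actual_points=360)

cost_improvement = before.actual_points / after.actual_points
odds_improvement = after.on_time_odds / before.on_time_odds

if __name__ == "__main__":
    for r in (before, after):
        print(f"{r.name}: odds {r.on_time_odds:.4f}, overrun {r.overrun:.2f}x")
    print(f"cost improvement: {cost_improvement:.2f}x")
    print(f"odds improvement: {odds_improvement:.0f}x")
```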

The final thing, and I'll talk about the real problem, which goes beyond just doing development.

This is a support flow of incidents coming through from a customer that talks to level one, and level one talks to level two, and then level two might talk to the, and then it might go to the engineering teams, and the engineering teams have to coordinate across all that.

If you look at this, there are eight distinct dependencies, so you really have only a one in 256 chance of arriving on time; basically, there will be delay here. Removing even one of these dependencies starts to make a significant difference in how you can actually respond to the customer at the end.

We'll see this often, where multiple groups keep getting layered on to sit in front of the support process, and you get this really diminished effect and diminished response time for your customers.

The final thing here is these long feedback loops. They thwart any effort now to refactor your architecture and system. You get these things in and changing is just so hard because it takes so long.

The final summary here is: architecture plus leadership equals focus, flow, and joy. You want to win by rewiring and re-architecting the system. The physics, the coordination costs, really behave exponentially. It doesn't behave in an accounting way.

The three Cs, contention, coupling, and coherence, are really what you're battling. The golden rule: remove that dependency and double your odds. The three dimensions of architecture: organization, process, and system, and then simplification. Then the patterns: platform, self-serve, full-stack teams, domain and team modularity and cohesion, time zone cohesion, and getting standards in place across the teams.

That's it. Any questions? I have like a minute left.

Q&A

Q: The automation, I mean, nobody objects to a dependency on AWS. So it's really about human-in-the-loop dependencies that introduce variation, right?

A: Yep. It's really about the coordination costs that sit under the dependency, which isn't actually captured in the simple dependency equation.

But I really think of it as self-serve. I don't have to meet a person at a point in time to coordinate that dependency. I can choose when to do that myself. That's very important.

Q: Could you clarify a little bit your usage of the word linearizability here? I had a brush with transaction theory 20 years ago.

A: Well, that is a new term from Gene's book, and so I'm kind of new to it too. Basically it means that multiple work paths can be executed at the same time and you can join the result together at the end.

You need modularity to be able to do that, because if you think about software, I need modular interfaces, things that can be tested independently, and I can bring those together later on. So now I've decoupled the dependency in the work stream, that those two things can work linearly. It is a new one for me.
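A minimal sketch of that definition in code terms: two independent work paths run at the same time and their results are joined at the end. The module names and return values are hypothetical, and threads are just one convenient stand-in for teams working in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def build_order_module() -> str:
    """Hypothetical work path that can be built and tested on its own."""
    return "order-ok"


def build_billing_module() -> str:
    """Hypothetical work path that does not wait on the order path."""
    return "billing-ok"


def run_independent_paths() -> list[str]:
    """Run both paths concurrently, then join the results at the end."""
    paths = [build_order_module, build_billing_module]
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        futures = [pool.submit(path) for path in paths]
        # result() blocks until each path finishes; this is the join point.
        return [future.result() for future in futures]


if __name__ == "__main__":
    print(run_independent_paths())
```

This only works because neither function needs anything from the other until the join, which is the modularity requirement described above.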

Q: The question is regarding the dependencies which are outside your sphere of influence, especially external parties. Resolving dependencies, contentions, et cetera within your sphere is probably relatively easier. I can exert my influence. What about boundaries?

A: That's one of the hardest things. Damon Edwards loves tickets to put in more tickets, and unfortunately that ends up being the operational norm for most organizations. Once you get outside your boundary and control, everything goes to ticket coordination, and it becomes extremely difficult to coordinate any of that. It's very, very hard.

You need to be aware of it, and hopefully you have some sort of influence to start to affect that. Very difficult. Shared outcomes are one approach. Feedback loops are another, and even just making the data visible; oftentimes folks don't even do that. It's like, "Well, we put a ticket in."

Right, but just wait. How long does it take? Well, 12 days. I can't do anything for 12 days. Oftentimes folks don't even know that that's what's sitting in the system.

Q: A lot of this is the result of previous accounting-based decisions, right? This team in this location because they're cheaper. I'm going to use this third party because they got me a better deal.

It would seem that one thing we could all do is try to bring forward in the shitty estimation process the accurate calculation of the costs of these previously unit-cost-focused things. Because the real answer is undo all of those crappy decisions and have co-located teams that own more rather than less.

But most organizations got themselves in a path where that's impossible, but they don't know the cost until the cost has been incurred. I love the way you framed this up. I just think trying to find a way to do that as an upfront diagnostic rather than an after-the-fact explanation might actually serve us all really well.

A: Yeah. It decays over time, unfortunately. At the beginning, people didn't design it this way, and over time it progressed that way. You just continue to decay. It's almost like the boiling frog problem. It gets more and more painful over time.

Gentleman over there.

Q: When you take these plans to different corporations, how do they respond? Because it feels like someone is going to have to either give up or share power to really make it go.

A: Yeah. Political issues.

I think you highlight the challenge. For this type of analysis, that's layer three. One is, are you willing to do the analysis? I was talking with some folks this morning on that, and even looking at how work flows.

Then once you get to that and make that data visible, is your leadership even willing to take action on that? Or are they just willing to live in this case where all these expensive developers are only spending more or less 7% of their time actually writing code?

It's a challenge. You have to have support at that layer three level to actually start to address these things.

Anything else?

Q: I got one more. What about with PI planning and the cost of coordinating?

A: There's a lot there, and we can get into BIng and all kinds of other stuff. But to be honest, it really comes down to more of the question at the leadership level. Do you understand these are the types of things that can happen?

It doesn't mean that PI planning is bad, but there are ways to incrementally plan and work and restructure the architecture and the work processes, so that you don't do it as a massive batch that produces awful outcomes.

I'm not sure I answered your question per se, but there's a way to have a one-hour PI planning, and I've seen it done. But you didn't get there overnight, and you didn't get there without continually affecting a lot of other things in the organization.

All right, thank you.