Predictability: No Magic Required

San Francisco 2016

When you merge onto a freeway and are stuck in bumper-to-bumper traffic, you know right away that it's going to be a long trip. Similarly, you can predict the cycle time of your work before it is finished, without time-consuming and often incorrect estimation. Sound like magic? Fortunately for all of us, it's not.

This talk explains the basics of queueing theory; demonstrates how allocation models and pull policies affect the cycle time of work; discusses the effects of batch size and variability on queues; and teaches how to successfully monitor your workflow to get leading indicators of effectiveness. With this information, you'll be doing better forecasting, and achieving better outcomes, in no time!

Full transcript

Julia Wester

Hi, everybody. My name's Julia Wester, as you just heard, and I work at LeanKit. We have a booth down in the expo hall if you care to learn what that's about. I'm an improvement coach there, along with Dominica DeGrandis, who you heard this morning.

I'm just really excited about all the talks and things that happened this morning, because they had a lot of things that really sort of cued this up quite nicely. So I'm super stoked about that.

What being an improvement coach means is, simply put, I go out to help our customers tame the chaos of their workplaces. That can take a lot of forms, but one of the things that we talk about a lot is predictability. Everyone wants to know how to make their work a little more predictable, and we feel like there's some magic to that.

But I'm here today to talk to you about some simple things that we can do to give you back control of your predictability without the need for magic.

Before we get into that, let's all get aligned on what "predictable" means. We all know it's an adjective, and it means what's expected of you on the basis of your previous or known behavior. Simple, okay?

You can be predictably horrible. We've all been to restaurants that are predictably bad, yet we still somehow drag ourselves there because they have great food, or we just haven't given up on them yet. Imagine if we're the horrible ones and our customers are in that predicament.

But usually, we want to be predictably good at something, maybe like delivery.

Again, I want to go back to that concept of, in our brains, we feel like there's this elusiveness to predictability. No matter how often our bosses or executives tell us the goal is to be more predictable, we agree, but on the inside, we think, "Well, I don't really know how to make that happen. It's probably never going to happen." So we'll just nod and sort of keep moving along.

We feel like we're trying to pull answers from randomness, but fortunately for us, there are examples of how this has been done in the past. So we're going to go all the way back 100 years, to a simpler time and to the Copenhagen Telephone Company.

They also had to pull answers from randomness. They had to figure out how many telephone lines they needed to avoid blocked calls. But the problem was, they had random arrivals. You never know when somebody's going to pick up the phone and call their mom, right? And you don't know how long they're going to talk on that call. Five minutes, an hour, you never really know.

So they had all of these random conditions, yet they had to find a way to predictably have service for their customers.

The solution was something called queueing theory. A mathematician who worked there was named Erlang, and he applied statistics and probabilities to help solve their issue.

While I'd love to get into statistics and probabilities with you, we have 25 minutes, so what we're going to do instead is give a primer of queueing theory and how it applies to your life. In essence, queueing theory is the mathematical study of waiting lines, or queues.

The reason why it's really important for you to learn stuff about queueing theory is because it helps us quantify relationships between some very important things: queue size, capacity utilization, and cycle time.

We have a little graph here on the bottom right. Hopefully you can see it in the back, but it essentially shows us a core tenet of queueing theory: that as we linearly increase our capacity utilization, 10%, 20%, 30%, all the way up to what some people want us to do, which is 100% capacity utilization, what that does is cause an increase in queue sizes of work in our system.

Simply put, that's because we are no longer able to respond to things as they come in and keep processing them smoothly, because no one's available and things start to back up. That's a relationship between capacity utilization and queue size, and that could take a whole talk in and of itself.
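To put rough numbers on that curve, here is a minimal sketch. It assumes the simplest textbook queueing model (M/M/1: random arrivals, random service times, a single server). That model is an assumption of this sketch, since the talk does not say which model the slide uses, and `mean_queue_size` is a hypothetical helper name.

```python
# Sketch under an assumed M/M/1 model: average number of items waiting
# (not yet being worked on) at a given capacity utilization.

def mean_queue_size(utilization: float) -> float:
    """Average waiting items for utilization in [0, 1), using rho^2 / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be at least 0 and below 1")
    return utilization ** 2 / (1 - utilization)


for pct in (50, 60, 70, 80, 90, 95, 99):
    waiting = mean_queue_size(pct / 100)
    print(f"{pct:>3}% utilized -> about {waiting:.1f} items waiting")
```

Under this model, equal steps in utilization do not produce equal steps in queue size: going from 80% to 90% adds far more waiting work than going from 50% to 60%.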

So what I'm going to do instead is focus the rest of the talk on why queues matter. Why should you even care about this whole thing? Choices that you can make about the queues in your work system, and how to manage them, will impact your predictability for better or for worse.

Then finally, how to monitor your predictability indicators. I want to share with you that no longer are we in a world where we only have lagging indicators. I have a leading one I want to share with you that hopefully you can start monitoring and get a little heads-up on your predictability.

So let's jump into why queues matter.

Essentially, queues are the waiting work in our system. Remember, queueing theory is the study of waiting lines. Not very exciting sounding, but very important to us, right? We have a lot of work in our systems, a lot of waiting work.

More queue, more problems, right?

And just like with the capacity utilization example, the more we have building up, the more we fill everyone's time with, the longer average cycle times we get. But not only that, we hurt our predictability because we generate a wider range of cycle times. All of that leads to more management overhead, increased risk, and reduced motivation.

So lots of problems caused by queued work.

It's really important because our workflow is just a chain of queues. We have waiting steps where you expect things to queue up. Things enter our area of known things in some kind of arrivals queue, and we expect stuff to wait there a little bit. Then things go into an active step. That's our work in process, or progress. Then, if you're doing a pull system, we have some buffers between those steps, so you're not pushing work onto people before they're ready. Then hopefully it eventually happily departs your system.

So we have expected queues, but there's a reason I have the asterisk on the work in process. If you heard Dominica this morning talk about too much WIP, the time thief, you know that when we pull too much WIP into our active states and switch between items, we are actually creating hidden queues of work, because lots of stuff is waiting even when it doesn't look like it.

So queues are very important to understand how to manage and how they affect predictability.

Oh, sorry. I had a Wi-Fi thing pop up here. Okay.

Most of you have probably heard of Aesop's fable, "The Tortoise and the Hare." And if you have read Don Reinertsen, The Principles of Product Development Flow, which Dominica also mentioned this morning, you're going to recognize a lot of the basic principles in this talk. This is a primer of why you should care more about this stuff.

But he had a footnote in his book that I really loved, and I'm like, "Oh my God, this sort of helps explain it all." This is how he describes this whole situation. He says the hare is really great at quickly delivering business value, but he lost the race due to high periods of inactivity. So he sprinted, he napped, he sprinted, he napped. But the tortoise here, he won the race because he was very consistent and predictable in his ability to move forward.

So in the back of your mind as we talk through these things, think about "The Tortoise and the Hare," which scenarios we talk about represent each kind and what you do in your workplace, and are you the tortoise or the hare?

We've talked about what predictability is, but how do we actually become predictable? The way we do that is to reduce our probable outcomes from a wide range to a less-wide range, a narrower range. It's very simple at that level.

So we want to go from something like usually done in two to 200 days to something, maybe it's a little optimistic to go directly to 25 to 35 days, but that's definitely more predictable. Even if we went to 25 to 100 days, we've become more predictable than when we were two to 200 days.

Can you imagine someone coming to you and saying, "When can you get this done?" And you say, "Oh, any time from a couple of days from now to a couple hundred days from now." That's not really going to float the boat very much.

But a key concept I really want to help people understand is that predictability doesn't mean the fastest. Right? The hare could go the fastest, but he wasn't the most predictable because he couldn't consistently be that fast. Predictability is all about consistency.

So we know what queues are. We understand why they're important to us and why we might want to start managing them better. So I want to talk about three choices that we make about queues that can really impact our ability to be predictable or not. These are ways that you can control your own predictability.

The first choice is to use a push system or a pull system. I'm aware that not everyone really has a deep understanding of what that means, so I'm going to help explain this for you, and I'm going to use queueing discipline terms for that.

What we see here is very similar to what you might experience at a supermarket. You get your groceries, you go to checkout, and you go to the normal registers. There is a cashier, and that cashier has a queue of people behind them. We've pre-assigned ourselves to that cashier before she or he is ready for us.

That is one queue per server. Each individual server has their own special queue. We call that a push system because we're literally pushing people onto that person in advance, before they can process them.

The other example, at least that I see at my local Safeway, is at the self-checkouts, where we have one queue for six kiosks. We all wait there, nice and orderly in our one queue, until we have an open space at one of the kiosks, and then we move ourselves over to that open kiosk. That's called a pull system, or one queue for multiple servers.

Okay. So think about which one you have in your organization. Maybe you have a little bit of both. Maybe you just have one.

But I want to talk about how this choice impacts predictability. I want to do that by talking about what happens when a cashier, or a server, or a developer, or an ops engineer has a problem. We're going to block a cashier or a person from each system here.

In our push system, when I block someone who has work pre-assigned to them, what happens is all that work stops. Now those items are experiencing extreme slowness, that period of inactivity, like the hare taking a nap, while the rest of the servers are going at a normal speed.

So we've introduced a wide variation, and we know that variation, at least in cycle times, is the enemy of predictable delivery.

In a pull system, if I block one of the particular servers or persons doing work, then we just stop going to that person, but otherwise everything else happens as normal. Yes, everyone has slowed down a little bit, but there is consistency in the service that everyone receives.

So the pull system, because we're not committing work to people before we need to, lets you be a little more predictable than if you're trying to use a push system.
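The checkout comparison can be sketched as a tiny deterministic simulation. Everything in it is illustrative rather than taken from the talk: three servers, twelve items all present at time 0, one unit of work per item, and server 0 blocked for the first six time units. The function names are hypothetical.

```python
import heapq

def push_cycle_times(n_items, n_servers, service_time, blocked_until):
    """One queue per server: items are pre-assigned round-robin at time 0."""
    free_at = [blocked_until.get(s, 0) for s in range(n_servers)]
    finished = []
    for item in range(n_items):
        server = item % n_servers          # assignment is fixed up front
        free_at[server] += service_time
        finished.append(free_at[server])   # items arrive at 0, so finish time = cycle time
    return finished

def pull_cycle_times(n_items, n_servers, service_time, blocked_until):
    """One shared queue: the next item goes to whichever server frees up first."""
    free_at = [blocked_until.get(s, 0) for s in range(n_servers)]
    heapq.heapify(free_at)
    finished = []
    for _ in range(n_items):
        start = heapq.heappop(free_at)     # earliest available server
        heapq.heappush(free_at, start + service_time)
        finished.append(start + service_time)
    return finished


blocked = {0: 6}  # hypothetical: server 0 is unavailable until time 6
push = push_cycle_times(12, 3, 1, blocked)
pull = pull_cycle_times(12, 3, 1, blocked)
print("push range:", min(push), "to", max(push))
print("pull range:", min(pull), "to", max(pull))
```

In this setup the push system spreads cycle times from 1 to 10, because everything assigned to the blocked server waits out the blockage, while the pull system keeps them between 1 and 6.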

So that is your first choice. And your second choice then is, once you've decided which system to use, how are you going to load work into that system? What factors are you going to use to prioritize that?

I'm going to start out with the simplest one. It's first come, first served, basically. You could hear FIFO, first in, first out, or first in, first served. For our purposes, they're equivalent enough to be one bucket.

I have four cards here, and I have a lot of information about that. I've got T-shirt sizes, cost of delay, or value of some sort, and I also have the order in which they came in. In a FIFO system, all I care about is the one, two, three, four, right? Nothing else matters. I just do it in that order.

I can choose to prioritize differently. I can say these other things matter to us more than the order in which they come in. So in this system, I've said, okay, I'm going to look at the value or cost of delay first and pull them in in that order, and then when I have duplicates of cost of delay, I'll put the short jobs before the long ones. That's something that Don Reinertsen calls weighted shortest job first, WSJF.

Okay. But really, it's just an example of a prioritization method. It doesn't matter how you prioritize your work, and you may have another policy, round robin, something else.
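As a sketch of how the policy changes the order, the snippet below sorts four made-up cards three ways: by arrival (FIFO), by cost of delay with shorter jobs breaking ties (the ordering described above), and by cost of delay divided by duration, which is the ratio form in which weighted shortest job first is usually stated. The card values, the T-shirt-to-days mapping, and the function names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mapping from T-shirt size to estimated days of work.
SIZE_DAYS = {"S": 1, "M": 3, "L": 8}

@dataclass(frozen=True)
class Card:
    name: str
    arrival: int        # order in which the card came in
    size: str           # T-shirt size
    cost_of_delay: int  # relative value lost per unit of delay

cards = [
    Card("A", 1, "L", 5),
    Card("B", 2, "S", 8),
    Card("C", 3, "M", 8),
    Card("D", 4, "S", 2),
]

def fifo(items):
    """Only the arrival order matters."""
    return sorted(items, key=lambda c: c.arrival)

def cod_then_shortest(items):
    """Highest cost of delay first; shorter jobs win ties."""
    return sorted(items, key=lambda c: (-c.cost_of_delay, SIZE_DAYS[c.size]))

def cod_per_day(items):
    """Ratio form: cost of delay divided by duration, highest first."""
    return sorted(items, key=lambda c: c.cost_of_delay / SIZE_DAYS[c.size], reverse=True)


for policy in (fifo, cod_then_shortest, cod_per_day):
    print(f"{policy.__name__:>18}:", [c.name for c in policy(cards)])
```

With these sample cards the three policies give three different pull orders, which is the point: the policy you pick decides which items wait.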

What we choose to do, which one we choose, is going to impact our predictability, that is, our consistency in the experience of our cycle times.

We're going to, again, look at a FIFO system, the first in, first out, and then we're going to bucket everything else together into non-FIFO. Now, in the non-FIFO, we've got some things that are going to be processed out of order because they meet some prioritization criteria. So I've put little yellow dots on those, and I've said, okay, these people with the yellow dots are going to be processed specially by this person over here. You could also call this a class of service.

What I'm doing is I'm introducing a variation in how these people get processed. I have some of these expedited or prioritized items that have a shorter waiting time than they would originally have had. Because I've taken up capacity from a server to handle those, I'm inflating the cycle times of the other items.

So I'm shortening some, I'm widening others, and that is a wider variation. And by our definition, wider variation, less predictability.

With a FIFO system, we aren't artificially inflating or deflating the cycle times of anything. So we have less variation, assuming the time it takes to process each card is the same in both systems.
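A small worked example of that effect, under assumed numbers: one server, one day of work per item, four standard items arriving on day 0, and two expedited items arriving on days 2 and 3. The scenario and the helper name `cycle_times` are hypothetical.

```python
SERVICE_DAYS = 1  # assumption: every item takes the same time to process

def cycle_times(items, expedite_first):
    """items: list of (arrival_day, is_expedited). Returns cycle time per item."""
    waiting = sorted(range(len(items)), key=lambda i: items[i][0])  # arrival order
    clock, result = 0, {}
    while waiting:
        ready = [i for i in waiting if items[i][0] <= clock]
        if not ready:                       # server idle until the next arrival
            clock = items[waiting[0]][0]
            continue
        if expedite_first:
            # Expedited items jump the line; otherwise oldest first.
            nxt = min(ready, key=lambda i: (not items[i][1], items[i][0], i))
        else:
            nxt = ready[0]                  # strict first in, first out
        waiting.remove(nxt)
        clock += SERVICE_DAYS
        result[nxt] = clock - items[nxt][0]
    return [result[i] for i in range(len(items))]


work = [(0, False), (0, False), (0, False), (0, False), (2, True), (3, True)]
fifo_times = cycle_times(work, expedite_first=False)
expedite_times = cycle_times(work, expedite_first=True)
print("FIFO:     ", fifo_times, "range", max(fifo_times) - min(fifo_times))
print("Expedite: ", expedite_times, "range", max(expedite_times) - min(expedite_times))
```

Total time in the system is the same in both runs, but expediting stretches the range of cycle times from 3 days to 5 days in this example.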

Okay. But you're thinking to yourself, because I know I did, is that really feasible that we can always just do everything we want in the first-in, first-out order, right?

Even recently, I have gone to conferences and open spaces and Lean Coffees and said, "Does this ever work if you're not stamping out identical widgets?" I'm not here to answer that for you today, but what I do want to let you know is that in the reading I've been doing, Don Reinertsen, and Dan Vacanti in his book about predictability that Dominica also showed, you're going to hear about cost of delay. Cost of delay is king. Don Reinertsen says that over and over.

I agree with that, because businesses are about economics, and we need to make an economic decision. If you think that anyone can ask for anything, you're not just going to process it in the order it came in.

The only caution that we give you is question the assumptions behind the choices you're making to prioritize, because when we choose to prioritize, we are introducing a cut to our predictability. So just make sure it's worth it. Make sure it's an informed decision. Understand, maybe monitor the impacts of how prioritization is impacting your cycle times.

If you can use a FIFO system, do it. So on the things that you can, just do that. Maybe your standard work or other things. Do what you can with that, and then just be aware of the cost for everything else.

Now, the third choice is something that we hear about a lot. This is big batch, small batch. We've had this conversation before in the DevOps community. Mostly we're talking about technology and things like that, but I want to talk about its impact on predictability.

So we have monthly batches, weekly batches. Again, I want to look at how this introduces variation into our cycle times, and specifically, are we inflating them unnecessarily?

So our once-a-month delivery, I've color-coded as to age. I've been queuing these things up for a month. I've got some things that are four weeks old, three weeks old, two weeks old, or maybe they were ready to deliver yesterday, and that's just the age from sitting in this ready-to-deliver queue.

So I'm artificially inflating these things by up to four weeks and introducing that amount of variation into our delivery times. Yet if I try to deliver on a shorter cadence, I have less chance to artificially inflate our cycle times based on how long they're sitting in that queue.

Bigger queues cause bigger variation. So not only are there all of the other good reasons to have small queues that you've heard in the DevOps community, small-batch deliveries lead to more predictability.
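The arithmetic behind the batch comparison can be sketched as follows, assuming one item becomes ready to deliver on each of 28 days and releases happen on a fixed cadence. The 28-day "month" and the helper name are assumptions made for the illustration.

```python
import math

def wait_for_release(ready_day, cadence_days):
    """Days an item sits in the ready-to-deliver queue until the next release."""
    next_release = math.ceil(ready_day / cadence_days) * cadence_days
    return next_release - ready_day


ready_days = range(1, 29)  # hypothetical: one item becomes ready each day
monthly = [wait_for_release(day, 28) for day in ready_days]
weekly = [wait_for_release(day, 7) for day in ready_days]

print("monthly release: extra wait from", min(monthly), "to", max(monthly), "days")
print("weekly release:  extra wait from", min(weekly), "to", max(weekly), "days")
```

The work itself is identical in both cases; only the time spent waiting for the release differs, and the monthly cadence adds up to 27 days of it versus 6 for the weekly one.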

So we know some choices. We know why queues matter. We understand a little bit about how to handle them, but how do you know if you're doing it well? Right? We have to have ways to see how we're doing going forward.

So I want to talk about how to monitor those predictability indicators. I am going to tell you your leading indicator, but I want to start with the lagging so that everything makes sense a little bit more.

This is a chart that we have in LeanKit from a Kanban board of a team that I was on. This is real data. This is a cycle time scatter plot. Each dot on this chart represents a piece of work.

So you can see on here that I've got our work from July to November, and you also can see that we have a 95% certainty that our work is going to finish in 45 days or less, just by doing the math of the distribution.

But I also see here that I've got some good clustering going on. When we think about variation, the closer things are together and the more things that are close together, the likelihood is that I'm more predictable.

Okay? When I see these outliers over here, I know I need to fix them, but these things have already happened. I can't fix the history of those items. But what I can do is I can go see what I can learn to change future potential outliers so that they can be clustered a little closer.

Just a tidbit: when we are making commitments and predictions, it's good to give confidence levels, like 95%. You can change it to 85, 90, whatever you want, but that shows the truthfulness that we're never absolutely sure what's going to happen.
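One simple way to get a statement like "95% of our work finishes in N days or less" is to take a percentile of past cycle times. The sketch below uses the nearest-rank method on made-up data; the numbers are hypothetical and are not the data behind the chart, and the talk does not say how LeanKit computes its figure.

```python
def percentile_nearest_rank(values, pct):
    """Smallest value such that at least pct percent of values are <= it (pct is an int)."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct% of the count
    return ordered[max(rank, 1) - 1]


# Hypothetical cycle times in days for 20 finished work items.
cycle_times_days = [3, 5, 6, 8, 9, 10, 12, 12, 14, 15,
                    16, 18, 20, 21, 23, 25, 28, 33, 45, 70]

for pct in (50, 85, 95):
    days = percentile_nearest_rank(cycle_times_days, pct)
    print(f"{pct}% of items finished in {days} days or less")
```

Quoting the 85th or 95th percentile instead of the average keeps the long tail visible in the commitment.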

So queue size is actually the leading indicator, and you heard Dominica say that this morning. Okay? I'm going to use a very simple example to explain that, but don't dismiss it just because it's a simple explanation, because it does seem to hold true, based on all of the research.

When you get in your car, and assuming you work outside of the home, and you get onto the interstate or highway, you know right away if it's going to be a quick ride home or a slow ride home. You know that by how many cars are queued up in the road.

This goes back to the concept of capacity utilization. What causes a traffic jam? Sticking so many cars on the road that nobody can move forward. The same concepts apply in our work. We know from the amount of work that's in that system, the road, whether our trip home is going to be quick or long.

So a cumulative flow diagram, or CFD, which you can get in tools like LeanKit, but I also wanted to draw one to make it simpler to look at, is going to demonstrate this relationship. It's another proof to you that there is that relationship between queue size and cycle time.

I've got my work units on the vertical axis and my time in days on the bottom horizontal axis. Each of these lines shows the cumulative amount of work that has entered a given state in our workflow. So more things have entered to-do than design, and so forth. Hopefully our delivery line is going to slope up and to the right, just like we hope revenue does.

So what we can do is, knowing that we have an amount of things and time, I can look at, say, the 6th of October, when I had an average of 18 things in my to-do queue at any given time. I could just as easily have done this all the way down our system to get a queue size for the entire system.

Okay? Queue size is a leading indicator, so what did that do to our cycle time? I take my little arrow, and I draw from the point there over to that same line, but on the time axis, and I say that at this point in time, we had 18 average things in queue and an average cycle time of two and a half days.

Okay, let's see how cycle time is affected when we have smaller queue size. Here, I have 10. My average cycle time is a day and a half. So this is just another tool to help understand the relationship between cycle time and queue size.
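This relationship is commonly formalized as Little's Law: average cycle time equals average queue size divided by average throughput. The talk does not name it, so treat the sketch below as an assumed framing. It uses the two readings from the diagram (18 items at 2.5 days, 10 items at 1.5 days); the steady rate of 7 items per day is a hypothetical round number chosen to sit between the rates those readings imply.

```python
def implied_throughput(avg_queue_size, avg_cycle_time_days):
    """Items per day implied by a queue size and cycle time read off the diagram."""
    return avg_queue_size / avg_cycle_time_days

def average_cycle_time(avg_queue_size, throughput_per_day):
    """Little's Law rearranged: average cycle time = queue size / throughput."""
    return avg_queue_size / throughput_per_day


for queue_size, cycle_days in ((18, 2.5), (10, 1.5)):
    rate = implied_throughput(queue_size, cycle_days)
    print(f"{queue_size} items at {cycle_days} days implies about {rate:.1f} items/day")

assumed_rate = 7  # hypothetical steady throughput, items per day
for queue_size in (18, 10):
    days = average_cycle_time(queue_size, assumed_rate)
    print(f"{queue_size} items at {assumed_rate}/day -> about {days:.1f} days")
```

With throughput roughly steady, the only lever left in the formula is queue size, which is why a smaller queue shows up as a shorter cycle time.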

Okay? Once you understand this relationship, there are other things that you can track. You can say, "I get this. I understand this." I'm going to let people take pictures before I flip. I do have this deck too, that we'll give out. Okay.

Once I understand that relationship, it's not that I don't want to continue to monitor those together, but looking at it pretty simply in this efficiency diagram, this is the same data from the same board in LeanKit. I'm showing here my queue size and just sort of the trend of that, and I'm breaking it down into work that actually we expect to be waiting at the moment and work that is supposed to be in progress. But remember, there could be hidden queues there.

So I know when I've got peaks, I've got bigger cycle times, most probably, and so I'm probably being less predictable. I need to go look and see why we've queued up so much work and if I can get it shorter.

When I have smaller queues, I should be doing predictably better with our cycle times and staying within a narrow range. So if, right after this chart ended, the queues started to peak again, I would need to go and see why they increased and what I could do about it before those items turn into outliers on our scatter plot.

Okay? So you have a leading indicator that can help you predict those outliers and rein them in before they become history.
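A minimal sketch of that kind of monitoring: track the daily queue size, smooth it with a trailing average, and flag the days that exceed a limit you have chosen from your own baseline. The sample data, the 3-day window, and the limit of 15 are all hypothetical.

```python
def trailing_average(values, window):
    """Average of the last `window` values, for each day that has a full window."""
    return {
        day: sum(values[day - window + 1 : day + 1]) / window
        for day in range(window - 1, len(values))
    }

def days_over_limit(queue_sizes, window, limit):
    """Day indexes whose trailing average queue size exceeds the limit."""
    averages = trailing_average(queue_sizes, window)
    return [day for day, avg in averages.items() if avg > limit]


# Hypothetical daily counts of items waiting in the system.
daily_queue_size = [9, 10, 11, 10, 12, 14, 17, 19, 18, 15]

flagged = days_over_limit(daily_queue_size, window=3, limit=15)
print("days to investigate:", flagged)
```

The flagged days are a prompt to go look at why work is piling up while those items are still in flight, before they land on the scatter plot as outliers.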

So it's time to summarize. Remember, you have control over your predictability. You don't need magic. It's not elusive. It's simple math and things that we control. The hard part is getting everyone on the same page and agreeing to do it, right? But you have control.

Let's let go of the fact that we think we don't. Get your baseline measures of your queue sizes and your cycle times, and prove to yourself the relationship. Then make informed choices like the push or pull, like the prioritization or FIFO, or like the big or small batches. Then add other ones like Dominica talked about, about dependencies and the other things, and handle your queues in a way that leads to predictability.

Ongoing, monitor your queue sizes so you can get advance notice of future predictability issues and handle them before you become the hare. We want to be the tortoise instead.

These are my references: Don Reinertsen, The Principles of Product Development Flow. Dan Vacanti wrote this book about metrics for predictability. And then I had to include my coworker and friend, David Neale, who encouraged me to draw my own slides. Says that people don't connect with stock photos, so.

Just like Dominica did, I have things that I can send you if you want to get a copy of this deck before you may get it otherwise, as well as our first annual Lean Business Report or anything else you need. Just email me at julia@leankit.com with a subject line of DOES16.

Thanks so much for coming. I hope everyone's having a very hopeful and happy day. See you later.