The Shape of Uncertainty


San Francisco 2015


Director of Learning & Development · LeanKit

For a variety of reasons, we overestimate our ability to get work done. We overcommit our teams and ourselves. This leads to broken promises, unhappy customers and unhealthy businesses.

This talk demonstrates how to use a Kanban flow approach to measure the probability of delivering work on time.

- Exec overview of the Kanban Flow method – While demand is unbounded, capacity is limited

- How to measure the probability of finishing work on time

- Wait time vs. getting work done time – a look at flow efficiency and resource efficiency

- What, how and why to set WIP limits to strengthen business health


Full transcript

The complete talk, organized by section.

Dominica DeGrandis

I'm Dominica DeGrandis, and I like to help teams make their work visible for a couple of reasons. One, because it's hard to manage invisible work, and two, because I think it's the best first step at starting a necessary conversation. And the same holds true whether you're at work or at home.

I'm Director of Learning and Development at LeanKit. LeanKit's a visual workflow management tool, and I live in Seattle. And I think there's two kinds of people in Seattle. There's those who are worried about the big earthquake happening, and those who don't seem to be concerned at all about the earthquake happening. After reading The New Yorker article, I now fall in the first camp.

Apparently, the Cascadia subduction zone, full rupture, that's the Pacific plate bumping into the North American continental plate, has the potential to go to a magnitude 9.2 earthquake. The experts say that means four to five minutes of severe ground shaking and rolling, and that has the ability to have 30,000 landslides in Seattle alone. If it's a partial rupture, then it can be up to an 8.0.

So I'm wondering if my house is going to survive. My house is on a level lot. It's made out of wood, which is good, and it is tied down to the foundation, but the foundation doesn't have any rebar. It's hollow tile cinder block with no rebar.

So I have a Post-it up on my home Kanban board to get a structural engineer in to help us understand what we might need to do to the house, and that's making my husband very nervous. The problem for a man who can build or fix just about anything is that the spouse gives him a really long to-do list, and he's wondering if he's going to spend the rest of his life trying to retrofit this house, among other things.

So if the odds are that a wood house tied down to a foundation with no rebar has an 80% chance of collapsing, then that's going to provoke a necessary conversation. We're going to need to decide, do we retrofit? Maybe we sell. But there's uncertainty here, so we're going to look at some metrics that should be helpful.

Yesterday, Julia Wester and Troy Magennis talked about three kinds of metrics: descriptive, predictive, prescriptive. The descriptive ones are where you're just counting events, like the number of outages or the number of dependencies.

The predictive model is what I want to focus on today, where we're trying to be approximately right instead of exactly wrong. Right? The important thing is we're not going for just one specific point or one specific date. We're looking at a range of possibilities to understand what the probability is.

So I'm going to use an example of predictive metrics, and the question that I'm trying to answer here is: When is it going to be done? When is our software delivery feature, whatever, going to be done?

So to answer this question, it helps to look at the pipeline that the work actually moves through on its way to being delivered. Here's a typical pipeline that work moves through. We've got a series of work states. We've got a series of wait states.

On this example, this piece of work, let's just say it's a request to get a new domain controller. It arrives September 23rd. There's some initial work to figure out the priority and the domain name, and then that work waits for a bit while we get that name in. And then we need to have a conversation about SSL certs, and then it waits for a bit. But eventually, the work gets to done on October 1st.

So October 1st minus September 23rd, we get eight days, and this is our lead time, or flow time, as I like to call it. I'm a bit of a rebel there. Like lead time, flow time is just the process time plus the wait time. It's the total duration of work, the elapsed time that it takes to move through the system.
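A minimal sketch of that subtraction, assuming the dates fall in 2015 (the year is an assumption; the talk gives only the month and day). Flow time here is simply the done date minus the date the clock starts:

```python
from datetime import date

# Year is assumed for illustration; the talk only gives month and day.
requested = date(2015, 9, 23)  # clock starts when the request arrives
done = date(2015, 10, 1)       # work item reaches "done"

# Elapsed calendar days, covering both process time and wait time.
flow_time_days = (done - requested).days
print(flow_time_days)  # 8
```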

And the important thing here is to figure out where you want to start the clock. Mark Mikayla said earlier this morning that from a developer perspective, they're interested in looking at the cycle time or the process time that's within their control. In this example, I'm starting the clock when the architect upstream asked for that domain controller because he's my customer, and he cares about the flow time. Right? That's what he's interested in.

So now I'm going to take that eight days, and I'm going to plot it on a graph. So I've plotted our first work item flow time on the graph at eight days on October 1st. The x-axis is the calendar date the item completed. The y-axis is the number of days it took that item to complete.

Back to see what's going on in our pipeline. Now we've got two more work items in our pipeline. Work item zero, that's the one that we just plotted. Work item one, that's sitting in a wait state. It's waiting on an Akamai patch. While it's waiting, we pull in work item two, and we start building out a Tableau 9 server.

We go out to lunch, and then we come back to work, and we see more requests have arrived on our board. W1, work item one, is still waiting on Akamai. So we've put a blocker on it, something about reissuing an SSL cert. We go ahead and block that because we want to give it extra visibility. Because it's blocked, we're unable to get feedback on it as soon as we'd like.

There's a lot of talk in Lean philosophy about removing waste, but it's hard to see the waste until you can get some feedback about it. Meanwhile, W2's done, that's the Tableau server, and it took four days to do. It shouldn't have taken that long, but the person who was doing that work wasn't available at the time that request came in because they were busy working on something else.

So let's add that to our chart. We've got W2, took four days to do, and it completed on October 2nd, which is a Friday. So we head out for the weekend, come back in on a Monday, and now we see that our board has all these new orange cards on it in a new swim lane there.

So now we have two work item types. We've got blue cards that are sort of like the daily regular standard tasks that we do, and we've got a secondary lane for projects. This one happens to be a disaster recovery project. And it starts taking precedence over the other work that we already started. So we see blockers there.

Because work is arriving at a faster rate than we're able to finish it, the work in progress starts to pile up. It's just like when cars are on the freeway. If cars are arriving on the freeway faster than they're leaving the freeway, our commute is going to be longer. It's going to build up.
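The pile-up can be sketched with a toy day-by-day model. The arrival and completion rates below are hypothetical, chosen only to show what happens when work arrives faster than it leaves:

```python
# Hypothetical rates, chosen only to illustrate the pile-up.
ARRIVALS_PER_DAY = 3
COMPLETIONS_PER_DAY = 2


def wip_over_time(days, arrivals=ARRIVALS_PER_DAY, completions=COMPLETIONS_PER_DAY):
    """Return the work-in-progress count at the end of each day."""
    wip = 0
    history = []
    for _ in range(days):
        wip += arrivals
        wip -= min(wip, completions)  # cannot finish more than is in progress
        history.append(wip)
    return history


print(wip_over_time(5))  # [1, 2, 3, 4, 5]
```

With arrivals equal to completions, the same function returns a flat line, which is the case a WIP limit tries to approximate.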

We tend to work on too much stuff at the same time, and we don't consider the dependencies as much as we should. So W6 there in the middle has a dependency on another team, so we flagged it with a bright yellow visual just to bring it some extra visibility.

It turns out that dependencies reduce the number of options for when and how we can start new requests. And I was just surprised to learn that dependencies can actually cut our options in half. Here's the example.

You and your friend are going to the movies after work. You both have a 30-minute commute, and you need to leave your different respective offices by 6:00 PM, or you risk being late. What's the likelihood of getting to the movies on time? Anyone?

Either you're on time and your friend is late, your friend is on time and you're late, you're both late, and the last option is that you're both on time. So it's a one in four chance, 25% chance.

Now add your brother, who also works 30 minutes away. We got three dependencies. What's the probability now that you're all going to arrive on time? We cut it in half. It's going to be one in eight now, 12.5%.

And then if you add your brother's friend, then it's one in 16, unless your brother's friend works in operations, and then he never gets to leave work on time.

If you add three more dependencies for a total of seven dependencies, we cut the likelihood of delivering in time down to less than 1%. So we need to hopefully have dependencies under control.
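The arithmetic behind those numbers can be sketched as follows. It assumes, as the movie example implicitly does, that each person is on time with probability 0.5 and that arrivals are independent; the function name and the `p_each` parameter are illustrative:

```python
def on_time_probability(dependencies, p_each=0.5):
    """Chance that every dependency is on time, assuming independence."""
    return p_each ** dependencies


for n in (2, 3, 4, 7):
    print(n, f"{on_time_probability(n):.2%}")
# 2 25.00%
# 3 12.50%
# 4 6.25%
# 7 0.78%
```

Real dependencies rarely sit at exactly 50%, but the shape of the result holds: each added dependency multiplies the odds down.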

Okay, back to our board. We come in now and there's an audit happening that we weren't prepared for. So the gray cards are for the audit in that third swim lane. So we have three work item types now. We've got project work (it could be product work), we've got the smaller daily tasks in blue, and in gray we've got an expedited item, the audit.

What's happening here is now the work that we told people two weeks ago that would only take two days still isn't done. More work keeps coming in. It's a higher priority, and the other work gets deprioritized. And so we've become unpredictable, and people don't trust us anymore when we tell them, "Yeah, I'll get that for you next week."

And we do this to ourselves. We take on work at a faster rate than we can finish prior work. And then when escalations arrive, we deprioritize and block that work that's already in progress. It's like we launch a DDoS attack on ourselves, and then we act surprised when we don't get things done on time.

So given that flow is the movement and delivery of customer value through a work stream, through a process, it seems reasonable that our process should be aligned with optimizing for flow and not optimized for keeping people busy 100% of the time.

This is a little blip in my talk about resource efficiency versus flow efficiency. And it is just a blip due to time constraints here. But when we load up people to high utilization levels, it's going to increase the probability that people aren't going to be available when we need them.

And so that's why queues matter more than keeping people busy does. So if your Kanban board is designed to capture wait time, then we'll be able to measure flow efficiency too. And if you want to talk more about that later, hit me up after this.
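The talk does not spell out a formula, but flow efficiency is commonly defined as process time divided by total flow time (process time plus wait time). A minimal sketch under that definition, with a hypothetical split of an eight-day flow time:

```python
def flow_efficiency(process_days, wait_days):
    """Share of total flow time spent actively working on the item."""
    flow_time = process_days + wait_days
    if flow_time == 0:
        raise ValueError("flow time must be greater than zero")
    return process_days / flow_time


# Hypothetical split: 2 days of hands-on work, 6 days sitting in wait states.
print(f"{flow_efficiency(2, 6):.0%}")  # 25%
```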

Back to our graph. In the interest of time, I went ahead and plotted the remaining eight data points, so now we have 10 data points on this graph. We call this a time series plot or a scatter plot. And we're using this graph so that we can evaluate patterns and behaviors in data over time, so we can look at the trends.

You can plot anything on this kind of graph you want: deployments, number of MTTR requests that were over a certain time, dependencies, really whatever you want to analyze where you want to become more predictable.

Then we can add the percentile lines. So here's 50th percentile. Easy to calculate. We got 10 items on the board. I count the bottom five and I draw a line. 50% of the work that came through our pipeline was done in four days or less, 50% over.

Add another line in, 70th percentile. 70% of our work is done in eight days or less. 90th percentile, 90% is done in 14 days or less.

Often, I'll hear, "We don't have enough data collected in order to use predictive metrics." But statistically, using a sample data set of seven to 11 data points gives us a pretty good idea. We've got 10 data points here. We can say with 90% probability that we'll get work done in 14 days or less.
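Counting up from the bottom of the chart corresponds roughly to a nearest-rank percentile. The sketch below assumes that method; the ten flow times are illustrative values chosen so the lines land at 4, 8 and 14 days, not the data from the slide:

```python
def percentile_nearest_rank(values, pct):
    """Smallest value with at least pct% of items at or below it (pct is an int)."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, so no floating-point rounding is involved.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Illustrative flow times in days, NOT the talk's actual data.
flow_times = [1, 1, 1, 2, 4, 4, 8, 11, 14, 21]

for pct in (50, 70, 90):
    print(pct, percentile_nearest_rank(flow_times, pct))
# 50 4
# 70 8
# 90 14
```

Other percentile definitions interpolate between neighbouring points and can give slightly different lines on a sample this small.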

If the odds of a 9.2 magnitude earthquake damage can be determined in an area that's in a major modern metropolis that's never experienced that magnitude of an earthquake, we ought to be able to figure out how long it's going to take for our work to get done.

Okay, here's the cool part. We're mapping this scatter plot to a histogram. So how this works is, I know it's probably tiny for you to see, but there's three work items that had a flow time of one day. Took one day to complete them. So I'm going to map that over to the histogram.

There were three items, one, two, three, that took one day to do. There were two items that took two days to complete, one item that took four days, another at eight, and so on. These are all one day. And then I can flip that histogram vertical so we can look at it in the standard way of looking at it that we're used to.
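Mapping the scatter plot to a histogram amounts to counting how many items share each flow time. In this sketch the first seven values follow the counts just described; the last three are hypothetical fillers, since the remaining points are not stated:

```python
from collections import Counter

# First seven values follow the counts described in the talk;
# the last three (10, 14, 21) are hypothetical fillers.
flow_times = [1, 1, 1, 2, 2, 4, 8, 10, 14, 21]

histogram = Counter(flow_times)

# One text bar per distinct flow time, shortest duration first.
for days in sorted(histogram):
    print(f"{days:>2} days | {'#' * histogram[days]}")
```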

So now we have a shape of this distribution. And note that it's not a normal distribution. It's not a bell-shaped curve. Bell-shaped curve is what we were used to in school, where 10% of the students got A's and 20% got B's and 50% got C's. That's not what we have here.

We have a long tail on this distribution, and that's why looking at standard deviations away from the mean doesn't make sense when we have a distribution like this. So that's why looking at percentiles can be more useful.
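A small sketch of why the mean and standard deviation mislead on a long-tailed sample. The flow times are hypothetical, and Python's standard `statistics` module is used for the calculations:

```python
import statistics

# Hypothetical long-tailed flow times in days.
flow_times = [1, 1, 1, 2, 2, 4, 8, 10, 14, 21]

mean = statistics.mean(flow_times)
stdev = statistics.stdev(flow_times)    # sample standard deviation
median = statistics.median(flow_times)

print(f"mean={mean:.1f} stdev={stdev:.1f} median={median}")
# One standard deviation below the mean is negative, which no flow time can be.
print(f"mean - 1 stdev = {mean - stdev:.1f} days")
```

The few slow items pull the mean well above the median, so "the average" describes neither the typical item nor the tail.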

I just plugged these into Excel to walk through this example. There are tools out there that can provide this data for you. Here's a slightly prettier histogram showing us uncertainty: an uncertain number is a shape. It's a bunch of numbers, a distribution of numbers. And the longer the tail, the more unpredictable we are.

So this shape shows us that we had 280 work items that took from zero to one days to do, and at the far right, we can see that we had about 15 work items that took 90 days or more. And we're always going to have uncertainty in our world. And the best thing that we can do is reserve some capacity to handle that improvement. Right? To improve the probability that workers are going to be available when needed.

This chart is broken down by every two days on the x-axis. If we split that out more granularly, we can see that 200 of the previous 280 items were actually done in less than a day. They were done in a matter of hours. So from there, if we wanted to, we could break it down into hours and not days.

When it comes to metrics, it turns out that queue size is one of the most valuable metrics. It's a leading indicator. We rarely get leading indicators. If you're measuring velocity, throughput, cycle time, those are all trailing indicators because we don't know how long something took until it's done.

But work in progress, queue size is a leading indicator. The more WIP we have, the longer things take. And that's why, you know the minute you get on a freeway and it's plugged up, you know right away that your commute is going to be longer.
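The talk does not name it, but the relationship between WIP and flow time is usually expressed with Little's Law from queueing theory: over the long run in a stable system, average flow time equals average WIP divided by average throughput. A sketch under that assumption, with a hypothetical throughput:

```python
def average_flow_time(avg_wip, avg_throughput_per_day):
    """Little's Law: average flow time = average WIP / average throughput."""
    if avg_throughput_per_day <= 0:
        raise ValueError("throughput must be positive")
    return avg_wip / avg_throughput_per_day


# Hypothetical team finishing 2 items per day on average.
for wip in (4, 10, 20):
    print(wip, average_flow_time(wip, 2))
# 4 2.0
# 10 5.0
# 20 10.0
```

This is why WIP works as a leading indicator: it can be read off the board today, before any of the items have finished.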

It's also easier to trim demand, in other words, to avoid taking on new work than it is to deal with lots of variability, like DDoS attacks and OpenSSL patches and surprise audits.

This is a scatter plot, or time series plot, of all the work an ops group completed in a 10-week period of time. So from August 28th to October 12th. And what we're doing here is bringing visibility to the different kinds of work. If you were to look at this close up, you'd be able to see that the different colors represent different kinds of work.

So orange dots equal fires, blue dots equal deployments, red dots equal projects, brown maintenance, green improvements, and the purplish ones represent unplanned work. So they're tracking six different kinds of work that they deal with and trying to gain some statistics on this kind of work.

These charts don't tell us why work took so long. We have to determine that ourselves. But what they do tell us is how long different kinds of work take, so they can help us be more predictable.

We can segment this further. We can look at just unplanned work. One of their work item types is unplanned work. So this is a scatter plot showing just the unplanned work during that period of time. It turns out that about 10% of their total work is unplanned. These are the interruptions and the walk-ups.

The average flow time on these is about four days. They've got a couple outliers on top at 20 and 23 days. By the way, a lot of the standard reports in the tools will show percentile lines at the 95th and 99th percentiles because they're based on the standard deviations. So they'll be there, and it's interesting information, but I don't find it that useful when we're trying to be predictable.
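Segmenting by work type can be sketched as a simple grouping step. The records below are hypothetical, built so that one item in ten is unplanned; they are not the ops group's data:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (work type, flow time in days) records, not the ops group's data.
completed = [
    ("deployment", 1), ("deployment", 2), ("project", 12), ("maintenance", 3),
    ("fire", 1), ("improvement", 6), ("deployment", 1), ("project", 20),
    ("maintenance", 2), ("unplanned", 4),
]

# Group flow times by work item type.
by_type = defaultdict(list)
for work_type, days in completed:
    by_type[work_type].append(days)

unplanned_share = len(by_type["unplanned"]) / len(completed)
print(f"unplanned share: {unplanned_share:.0%}")
print(f"unplanned mean flow time: {mean(by_type['unplanned'])} days")
```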

Here's what the histogram looks like for that unplanned ops work. And I'm looking at this not with the intention to drive down unplanned work to zero. I think unplanned work is a reality, and we're in denial if we think that unplanned work is going to go away.

If people aren't available to handle the unplanned work, that's a problem, especially if it needs to get done right away. And that's when we incur interruptions and expensive context switching. So if you know that 10% of all your work is unplanned, then we can allocate capacity for it. We can plan for unplanned work.

Question for you: Are you making decisions based on opinions or facts? Earlier in my career, my biased opinions didn't work very well for me. It really wasn't until I started collecting and presenting flow metrics that I could have pretty good necessary conversations, and people started being much more receptive to listening to me.

I could provide more value to the team, and it really changed my whole outlook at work. I felt like I could become a voice of reason. So it was a big deal for me, and that's why I'm pretty passionate about helping other people get to that point, too.

Takeaways. Man, 25 minutes just flies by.

Number one, consider being approximately right instead of precisely wrong. The idea isn't to always hit the nail on the head 100% of the time. The idea is let's avoid smashing our thumb. Let's get close enough to the head so we don't smash our thumb.

Two, adding work to your plate faster than you can get work done is going to increase your WIP, it's going to increase your work in progress.

And that leads to number three, where the odds of being predictable are going to decrease the more work in progress that you have. The flow time is going to be longer.

Which leads to number four, control queues, not timelines, to improve flow. Because queues are a much better predictor than estimates or Gantt charts.

And number five is a quote from Troy Magennis, where he's saying, "If predictability is your goal, the best thing you can do is reserve capacity."

My question for you is, where is the costliest uncertainty in your organization? I'd love for you to send me your answer. I'm `dominicad` on Twitter. I'd like to hear where you'd like to be more predictable. Where is being unpredictable causing you pain?

Here are some references for you. And there's a link to this presentation deck, which should be up within the hour.

Thank you very much.