How we used Kanban in Operations to Get Things Done

San Francisco 2014

Dominica DeGrandis

Kanban for DevOps Trainer · Dominica DeGrandis & Associates

Full transcript

The complete talk — auto-generated from the talk's captions.

Today I'm going to walk you through a story about an operations team who implemented Kanban to improve their world. This team is in the gaming industry, and at the beginning of the year, their numbers ran 27 million players a day and 67 million players every month. This team had a strong desire to control their network speed and bandwidth. And they built out and maintained all of their data centers and all of their networks because they wanted to reduce latency risk.

They were tasked with building out six data centers across six different countries within a six-month timeframe. And they're still doing all this other stuff. They're still keeping the lights on. They're still supporting live issues.

They're on call. They're rolling out a new configuration management tool. They're deploying new features. And unbeknownst to them, they have some organizational restructures headed their way.

After spending a few weeks with them trying to understand their situation, we didn't have time to stop and readjust and put in a new system for them. We had to do it on the road. There was no capacity for them to stop. They had to get these data centers out.

So I took them out to coffee, and I asked them, "What prevents you from getting your work done? What randomizes your day?" And these were their responses. Basically, tons of conflicting priorities and context switching. And that resulted in them being unable to meet commitments, like hamsters on a wheel. Feeling pretty bad about not being appreciated was the bottom line: not being appreciated for all the work that they're doing, especially when they're up at 2:30 in the morning getting calls.

So at this point, I'm wondering three things. Oops. Here we go. I'm wondering, what is the actual demand?

Because one of their pain points is that there's just too many conflicting priorities and too many priority changes. I'm wondering, what is the actual lead time and cycle time that they're having? Because when I flipped it around, I asked them to tell me what their customers were complaining about, customers being upstream product owners and stakeholders and upstream management folks, what they were dissatisfied with. This answer came back.

There's no visibility, and things take too long. So I wanted to understand what's the lead time, what's the cycle time, so we could see actually how long things are taking. The third question is, where are the bottlenecks? Where are the constraints in the pipeline where this team is getting hung up?

And here's how we got started. We looked at three data points. First data point is, can we keep up with the demand? So they did have some data.

They had a ticketing system. They had some data in there. We went and mined the data. In this chart, the X-axis is six months of data that we collected over time.

We just pulled it out of the ticketing system. The Y-axis is how many tickets had been created. We're interested in the slopes of these two lines. The top line, the red-brown one, represents the number of new tickets that have come into this system that they're dealing with.

And the bottom green line represents the number of tickets that have been closed. What we're interested in is: is the arrival rate of those tickets greater than the closure rate? In other words, are we able to keep up with the demand? And if we've got the slopes diverging like this, the answer's very clear.
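The arrival-versus-closure comparison described here can be sketched in a few lines. This is a minimal illustration, not the team's actual analysis: the monthly counts are made up, and the function names are hypothetical.

```python
from itertools import accumulate

# Hypothetical monthly ticket counts; the talk does not give the real numbers.
created_per_month = [120, 135, 150, 160, 170, 185]
closed_per_month = [100, 105, 110, 108, 112, 115]


def cumulative(counts):
    """Running total, i.e. the kind of line plotted on the chart."""
    return list(accumulate(counts))


def backlog_gap(created, closed):
    """Gap between cumulative created and cumulative closed, per period."""
    return [c - d for c, d in zip(cumulative(created), cumulative(closed))]


def keeping_up(created, closed):
    """True if the gap at the end of the window is no wider than at the start."""
    gap = backlog_gap(created, closed)
    return gap[-1] <= gap[0]


gap = backlog_gap(created_per_month, closed_per_month)
print(gap)
print(keeping_up(created_per_month, closed_per_month))
```

A gap that widens every period is the numeric version of the two diverging slopes.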

No, we're not able to keep up with the demand. And then the question becomes, why is this team taking on more demand than they have capacity to handle? Why do people do that? Some of the research I've done, here's the top five reasons why people take on more work than they have capacity to do.

Reason number one is we do stuff for people that we like, and in this team's case, sometimes we prioritize work based on personal relationships. That's how we get things done. We walk up to people's desks and say, "Hey, dude, can you help me out with this?" Reason number two, we're team players. We don't let the team down.

Three, sometimes we just don't realize how long things take. This is a complex domain. We don't always know what cause and effect is. If we don't know that, then we can't estimate how long things will take.

Number four here I want to call out and highlight: fear of those in positions of power. People are nervous that they're going to be publicly humiliated, and so they avoid saying, "No, I can't take that on right now." They want to say yes to everything. And then lastly, people pleasers. So John Townsend wrote this book, "Boundaries," and he talks about how women do more people pleasing in relationships, but men are more likely to say yes to tasks.

And I'm wondering if that's why this team of 40 engineers, all male, couldn't say no to all the work. Like, everything, we'll do everything.

Second metric: what is our lead time or cycle time? This is a histogram. This one's on lead time.

The difference between lead time and cycle time: lead time starts when the customer first made the request, when the ticket first came into the system. Cycle time starts when we pulled that ticket into our queue and actually started working on it. It could be that a ticket's in the backlog for three months, so the lead time is going to be quite long, right?
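To make the distinction concrete, here is a small sketch that computes both numbers from ticket timestamps. The `Ticket` fields are assumptions for illustration; a real ticketing system will name them differently.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Ticket:
    # Hypothetical field names, chosen for this example only.
    created: date  # customer made the request
    started: date  # team pulled the ticket and began work
    closed: date   # ticket was closed


def lead_time_days(t: Ticket) -> int:
    """Days from the customer's request to closure."""
    return (t.closed - t.created).days


def cycle_time_days(t: Ticket) -> int:
    """Days from the start of work to closure."""
    return (t.closed - t.started).days


# A ticket that sat in the backlog for about three months before work began.
ticket = Ticket(created=date(2014, 1, 6), started=date(2014, 4, 7), closed=date(2014, 4, 10))
print(lead_time_days(ticket), cycle_time_days(ticket))
```

The same ticket shows a long lead time and a short cycle time, which is the backlog effect described above.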

So along the y-axis, we've got the number of tickets that were closed in so many days. The x-axis is the number of days it took to close out that ticket. So the mode of the data here is one day: 50 tickets took only one day.

The beauty of looking at this kind of histogram is you can start to see the nature of the demand. You can see that a lot of their work was done in just one or two days, and then at about three or four weeks, we have another bump up, and then again at three to four months. And for the statisticians in the audience, you'll recognize this is a long tail on the end of this distribution. It's not a normal distribution.

There's a lot of variability here, and this is one of the biggest differences between doing Kanban for ops and Kanban for development or for manufacturing. There's a lot more variability because of DDoS attacks and production coming down and 100% disk utilization. It's also a good representation of why estimating by averages can be risky. If we were to provide averages here based on this cycle time or lead time, we would probably be wrong at least half of the time.
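A quick sketch of why an average misleads on a long-tailed distribution like this one. The lead times below are invented to mimic the shape described (many one- or two-day tickets, a bump at three to four weeks, a long tail); they are not the team's data, and `percentile` and `share_over` are hypothetical helpers.

```python
import math
from statistics import mean, median

# Hypothetical lead times in days, shaped like the histogram described above.
lead_times = [1, 1, 1, 1, 1, 2, 2, 2, 21, 28, 100, 300]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def share_over(values, threshold):
    """Fraction of tickets that took longer than `threshold` days."""
    return sum(v > threshold for v in values) / len(values)


print(round(mean(lead_times), 1))   # pulled upward by the long tail
print(median(lead_times))           # what a typical ticket looks like
print(percentile(lead_times, 85))   # a more honest basis for a commitment
print(share_over(lead_times, 120))
```

With data like this, the mean describes almost no actual ticket: most finish far sooner and a few take far longer, so quoting a percentile alongside the median says more than an average.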

I've got a note there about a book called "The Flaw of Averages" for those who are interested in learning more about that. And also note 21% of the work took more than 120 days. So they weren't able to keep up with their demand. That's what metric number one showed.

This is showing how long things are taking and some of that work, this is real data, some of that work has been in the system over 300 days. That work probably just needs to be closed out. When cycle time is long, then we want to understand where things get stuck. So metric number three here.

Where are the bottlenecks? Where is work stuck? Where is work stuck here? Validate.

It's the beauty of this. We can see where things start to bottleneck up. Right. It was very easy to create new tickets.

Automatically, they sent an email. It was hooked up to their ticketing system. The subject of the email became the subject of the ticket. Boom, it's in the ticketing system.

So, a warning for them: making it really easy to create tickets was problematic, because it wasn't very easy to get tickets closed, and tickets piled up. At this point, we recognized that there was a lot of contention between upstream customers and the operations team, and we wanted to try and build some trust. So we invited the customers to our meeting.

By customer, I don't mean the end users. I'm talking about product owners, stakeholders, managers, leadership, other people in other departments of the company. And it was like, "We know that we haven't been serving you well. We know that.

We hear you that things are invisible to you. We understand that. We've got some metrics for you, so come listen to what we've got." And we just went through that data, right? Here's our demand, here's what we got done, here's how long it took, right, and here's where work got stuck.

And they were extremely appreciative that we were showing them this data. And it started to open the door a little bit more for some dialogue. It was a turning point. Next step is looking at work in progress limits. Team can't meet the demand, plagued by priority changes and context switching.

So we started looking at how much stuff do you have on your plate right now? And some of these guys had 20 to 40 tickets that they were working on at the same time. It's like, whoa, that's huge. And we just asked, "Does this seem reasonable?" And they kind of hemmed and hawed, and we just sort of hinted at how would it be if we just tried to take that down to 10?

Why 10? Because it's better than 40, right? Let's just try it. Let's just experiment with it and see where we go from there.
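The "how much is on your plate" check can be sketched as a count of in-progress tickets per engineer against a trial limit. The ticket shape, state names, and engineer names are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical tickets; "assignee" and "state" are illustrative field names.
tickets = (
    [{"assignee": "alex", "state": "in_progress"}] * 12
    + [{"assignee": "sam", "state": "in_progress"}] * 4
    + [{"assignee": "sam", "state": "closed"}] * 9
)


def wip_by_engineer(tickets):
    """Count tickets currently in progress for each assignee."""
    return Counter(t["assignee"] for t in tickets if t["state"] == "in_progress")


def over_limit(tickets, limit=10):
    """Engineers whose in-progress count exceeds the trial WIP limit."""
    return {name: n for name, n in wip_by_engineer(tickets).items() if n > limit}


print(over_limit(tickets, limit=10))
```

The limit is a parameter on purpose: the point in the talk is to start with a number that is merely better than the status quo and adjust it as an experiment.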

And then there were some reorgs. The first reorg structure really was more of a pilot. It took a dedicated team, one sysadmin, one DBA, one network engineering guy, and one PM. And their role was to just bang out projects that were 90% done and just get them done.

And they weren't supposed to be on call. They actually relocated to another building. They finished 10 projects by the time we had another customer meeting. So it was considered a real success that if we can limit the work in progress, and if we can limit the number of interruptions and context switching that the team is dealing with, they're going to have higher throughput and their cycle time is going to be lower.

Second org structure change split the team into three other teams, and from this point on, the story's going to focus on the Live Ops team because they got the person who was most interested in collecting the metrics and helping them manage their workflow.

These were the tasks that they were responsible for. And here's what their board looked like. So yes, they have a physical board on a wall. They also have these tickets in an electronic tool.

They've got the best of both worlds. They've got one, two, three, four, five, six columns or queues as we call them. There's two on-deck queues up front, then there's a CAB queue, change advisory board. They didn't initially have that queue when they first designed their system.

Later, they recognized they needed it because there were some hold-ups in CAB. Then they have an in-progress queue, then a validate queue, and then a closed column that's not in the picture. And you can see the validate queue's almost empty. I'll talk about how that happened in a bit.

They've got three colors of tickets there. Orange tickets, yellow tickets, green tickets. In this design, this is about the third or fourth iteration of their design because they're always improving. The orange tickets are the highest priority tickets.

The green tickets are medium priority. Sorry, the yellow tickets are medium priority, and the green tickets are the lower priority. And you can see that they're organized, so that you would pull the orange tickets first. Why would we put low priority tickets in the queue when there's higher priority work to do?

Seems like a reasonable question. We do that because if everything on their board is a priority one, and they get hit with a DDoS attack or some other emergency, then they have to drop a priority one to handle that emergency. We allocated capacity for lower priority work intentionally, put it in the system, because then when they get hit with an emergency, the lower priority work can go on hold for a bit while they handle that emergency, and they don't have to drop a priority one. Okay.

Good guidance on that. We do have some swim lanes there. You'll see the swim lanes. So queues run vertically, swim lanes run horizontally.

They've got sysadmin, neteng, and DBA. They're bringing visibility to the skill set demand. It's very interesting. Originally, their network engineers were slammed.

Lots of work in the network engineering lane. At this point in time, it's their sysadmins that are getting slammed. We made a clear distinction between those first two columns. The first one is on deck unassigned, and the second column is on deck assigned.

So you'll see the little blue stickies, those have engineers' names on them, so you know who's assigned to what work. And then we're understanding here what's coming up the pipe, what's next, what's on deck next, but we don't have capacity to handle it yet. We don't pull it into that second queue until there's capacity to handle that work. But we have visibility on what is coming up next.
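The pull rule for the two on-deck queues can be sketched as follows: a ticket moves from unassigned to assigned only when the engineer has capacity, and the highest-priority color is pulled first. The color ordering comes from the board described above; the function name, ticket shape, and the use of a single per-engineer WIP number are simplifying assumptions.

```python
# Ticket colors on the board, highest priority first.
PRIORITY_ORDER = {"orange": 0, "yellow": 1, "green": 2}


def pull_next(unassigned, engineer_wip, wip_limit):
    """Return the ticket to pull into 'on deck assigned', or None if there is no capacity."""
    if engineer_wip >= wip_limit or not unassigned:
        return None
    # min() keeps the first ticket among equal priorities, so queue order breaks ties.
    return min(unassigned, key=lambda t: PRIORITY_ORDER[t["color"]])


# Hypothetical 'on deck unassigned' queue.
unassigned = [
    {"id": 1, "color": "green"},
    {"id": 2, "color": "orange"},
    {"id": 3, "color": "yellow"},
]

print(pull_next(unassigned, engineer_wip=6, wip_limit=7))
print(pull_next(unassigned, engineer_wip=7, wip_limit=7))
```

Everything left in the unassigned queue stays visible as upcoming work without counting against anyone's capacity.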

All right. So this board is in the hallway, and development teams and managers come by all the time and look at the board. They're fascinated with it. The same information's in the tool, but people don't always have time to go look at dashboards and tools, or they forget where the URL is.

This is just here. It's in their face. It's easy for them to see. And there's no constraints with this board.

They can build it up and tear it down in a matter of hours to rearrange it to how they want it to be. Why do we visualize work? Because it's hard to manage invisible work. We're trying to make work visible, and having this out in a public place worked quite well for that.

We can see how many tickets are assigned to who. We can see where things are bottlenecking up. We can see where the demand is based on skill set. You have three levers in Kanban.

Three dials that you can tune and manipulate. The first one is WIP limits, work in progress limits. That's the constraint in Kanban. With Scrum, the constraint is the iteration.

It's the time box, the one or two week. Everything that goes into the time box gets delivered at the end of that sprint. With Kanban, we're using queueing theory. It's based on the theory of constraints, and we're looking at just how many items can we have in our queue at any one time.

So they're both agile methods, and they both have constraints. The constraint is just in a different place. At this point in time, so with WIP limits, we have been socializing 10 for quite some time. Now we started socializing seven.

How about seven? Right. And we started to see a drop. People actually started paying attention to how much they had in their queue.

It wasn't something that we strongly enforced. We just kept playing broken record, broken record, and we started showing them the metrics and the cycle time and how throughput increased when they had less WIP and how cycle time decreased. I like to think too much work sitting idle in the system is like rotten fruit. It's expensive, takes up a lot of space, and it smells bad.

Yeah. So three levers. First is WIP limits, second is policies. These are what are the rules of the game? For example, each column has a done criteria.

So everybody's really clear on what it means to move a ticket from on deck unassigned to on deck assigned, and everybody's really clear what it means when a ticket gets moved from CAB to in progress, and from in progress to validate. Those policies are socialized. The idea is not to have a lot of policies. That's not the idea here.

It's just, if there's a rule that's important, let's make it explicit. Let's put it out there for everybody to see. It's much easier to change a stupid rule if it's out in front of everybody and it doesn't make sense, than if it's hidden out on some SharePoint site that nobody can find. One of their rules was they did not put hardware fix tickets on the board.

They had too many. What is CAB again? Change advisory board or change approval board. It's where incident tickets would go to be reviewed to make sure that the risk was low enough that that change could go out to production.

So it's not just being on deck, in progress. Like it doesn't move from on deck to CAB to in progress. Some tickets do jump over CAB if they do not need CAB approval, but many of the tickets will move from on deck assigned to get CAB approval before they are implemented in production. Just pointing out here the three swim lanes so we can see what the demand is for each skill set.

The green tickets at the bottom represent the criteria, the rules for moving between queues. The third lever is work item types. A work item type is just a category of work. It's one of the killer features of Kanban.

How do we categorize our work? And what are the rules for that? Not all work is the same. We're not going to treat project work the same way that we treat keep-the-lights-on work.

They're handled differently. One comes from a CapEx budget and one comes from an OpEx budget. The keep the lights on work usually is done much more rapidly. Projects take longer.

So three levers: WIP limits, policies, and the work item types and the rules behind those. Here's some of the changes that we started doing. I talked about socializing this WIP limit over six months. The average after six months was between five and seven for 18 people on this team.

And yeah, there's still a few dudes who have 10 or 12 items on their plate, but nobody has 40 now. That's huge. Question? On the previous slide, you were talking about how to handle the so-called day-to-day work.

Do you recommend putting like bugs or to- Question is, a previous slide had work item types. How do we handle the day-to-day work? Do we recommend putting them on tickets? That's going to be contextual depending on what the needs of the organization are, and it's more than I'm going to be able to address in the next couple of minutes.

So be happy to talk to you afterwards. Couple other things we did to improve. We made a new policy, and we didn't have to go get permission from anybody. The team just decided this made sense.

All tickets with no activity for over 90 days got closed. If it's that important, somebody will create a ticket again. We started saying no to last minute requests. We did hire two new people, but we did all these other things too to help improve things.
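A policy like "close anything with no activity for over 90 days" is easy to express as a filter. This is a sketch under assumptions: `last_activity` is a hypothetical field name, the dates are made up, and actually closing the tickets would go through whatever ticketing system is in use.

```python
from datetime import date, timedelta

# Hypothetical tickets with the date of their most recent update.
tickets = [
    {"id": 1, "last_activity": date(2014, 10, 15)},
    {"id": 2, "last_activity": date(2014, 6, 1)},
    {"id": 3, "last_activity": date(2014, 7, 24)},
]


def stale_tickets(tickets, today, max_idle_days=90):
    """Tickets idle for more than `max_idle_days` as of `today`."""
    cutoff = today - timedelta(days=max_idle_days)
    return [t for t in tickets if t["last_activity"] < cutoff]


today = date(2014, 10, 22)
print([t["id"] for t in stale_tickets(tickets, today)])
```

Keeping the idle threshold as a parameter lets the same check serve a tighter policy on a single queue.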

Here's an email with an example of saying no. It's their first communication where they're saying, "You know what? We should have monitoring set up before a new service goes live." That's what that's saying there. We need to have monitoring set up beforehand.

Other changes that we did. We took time during stand-ups to focus on some continuous improvement. So instead of just going around and saying, "What'd you do yesterday? What are you doing today?" We actually looked at the data.

How long have these tickets been sitting idle in this queue? We reduced that validate state all the way down to three days. When we first started, the bottleneck was in validate. We came up with a new policy.

If it's not been updated in 14 days, we'd like to close it. Mr. Customer, are you okay with that? And they said, "Yeah." If we had said in the beginning, three days, I don't think it would have flown.

It was a gradual evolution. We went from 14 days to 10 days, to seven days, to five days, and now it's at three days. So tickets aren't in that validate queue any more than three days, and what that does is it lowers the WIP in the whole system. We found a creative way to deal with these walk-ups and the work done via personal relationships, and that was somebody would walk up, "Hey dude, can you help me out?" And he'd say, "Sure, have a seat." And they'd sit down, and he would create the ticket in the ticketing system, and he says, "I'm going to put you in as the creator of this ticket.

And because we just hired this new person over there, we need to cross-train them, and we're going to assign this ticket to them. I'd like you to be able to work with them." That was quite effective. And they made an agreement: instead of walking up to people's desks and interrupting them all the time, the team itself was just going to sync up daily for 15 minutes at 3:00 PM.

And it reduced the amount of context switching. It didn't drive it to zero, but it reduced it, so it was helpful. About this time, this book started being circulated around the management team.

This is a story of Pixar, and there's a chapter in there where they're doing "Toy Story 2" and Steve Jobs comes out and says, "Disney doesn't think we can do it. Let's prove them wrong." And what happens is a death march, people burn out, until one day, a dad forgot to drop off the baby, and in the afternoon, they realized that the baby is still in the back of the car, strapped into the car seat, unconscious in the heat of the parking lot. And this is where author Ed Catmull says, "This is unacceptable." We're just asking too much of people. Even if people wanted to give it, it's just unacceptable at that point.

And this contributed to more discussions around the office about lowering WIP. We're trying to do too many projects at the same time. We just don't have capacity to do that. We need to be smarter about prioritization and allocating capacity for the capability of the team.

It's what we're trying to do with Kanban. We're trying to balance the demand on the team with the capability of the team to meet that demand, and we're trying to do it in an economic fashion, because if we can't do it economically, we'll go out of business or our competitor will take over. Here's our ask. For the leaders in the audience, we ask that you please consider the power that you have over people when you ask them to do something.

I think workers are hesitant to say no because you're their boss, and we're people pleasers. We like to say yes. And for the workers in the audience, what we particularly need help with is how do we make it okay to believe that no is an honorable reply to somebody asking too much from you? I'm going to leave you with this book list.

Several models here that we call upon in the Kanban community for improving collaboratively. It's a super list there. I could spend another half hour talking about all these books. I do want to point out Don Reinertsen's book.

I think Scott Pu on yesterday's keynote called out Don Reinertsen's book. It's all about queuing theory and why 100% capacity utilization doesn't work. Fabulous read. All right.

Thank you very much. Do I have a minute for questions? We have a question. Okay.

So question. So the physical Kanban board that you show there- Yes ... what have you seen in the form of a digital version? All that stuff is someplace in a...

It's already in a tool that I can get a really big screen- Yeah ... I don't have to spend time with. Yeah. Question is, have I seen an electronic version of a physical board?

I think the closest thing I've seen is LeanKit's tool. They're here. They're a vendor at the conference, and their board is touch screen. So if you had a great big 16-inch monitor, you could put that up in the hallway, and it's touch screen, and it's updated in real time.

Right here, a question. Have you ever done Kanban for architectural teams? Have I ever done Kanban for architectural teams? Yes.

One of the first Kanban implementations back in included an architecture team. We rolled out SAP using Kanban with 140 people. Kanban can scale quite nicely. It's not just for operations teams or small teams.

Yeah. Question? We get about 3,000 tickets a month. We tried to do this stuff, but our influence as tickets- Yeah ...

go up 60%. Yeah. Yeah, he's saying they got 3,000 tickets. So the overhead to put a ticket on the board, the transaction cost, is higher than the value you get out of having it on the board.

So is there some amount of work that does make sense to put on the board that's going to bring you that value? And for that, I would go back and ask that question, what randomizes your team? What prevents them from getting work done? What are the biggest pain points there?

Maybe just bring visibility to that. And maybe just experiment with it for a week or two. It doesn't have to be a lifetime commitment. All right.

Let's do one more. Have you done anything to help increase the lead time for what we're putting in? Have I done anything to increase the- So you have more lead time for the work that's due in the wrong time? Oh, how we manage that.

I've seen a few different ways. One way is to actually have some calendar lanes upstream. So if this is October, you would see November, December, January. So at least you get a heads up of what's coming up the pipe.

But if there's going to be a constraint in the system, if it's up front, that can, in many ways, be ideal versus being at the very end. Right. And I can talk more with you about that later. All right.

All right. Thank you. Let's give Dominica a round of applause. Appreciate it.