DevOps and Lean in Legacy Environments

San Francisco 2014

Chief Architect & VP Software Development · CSG

Startups are continually evangelizing DevOps to be able to reduce risk, hasten feedback and deploy 1000’s of times a day. But what about the rest of the world that comes from Waterfall, Mainframes, Long Release Cycles and Risk Aversion?

Learn how one company went from 480 day lead times and 6 month releases to 3 month releases with high levels of automation and increased quality across disparate legacy environments.

We will discuss how Optimizing People & Organizations, Increasing the Rate of Learning, Deploying Innovative Tools and Lean System Thinking can help large scale enterprises increase throughput while decreasing cost and risk.

Full transcript

The complete talk — auto-generated from the talk's captions.

I'm pretty excited to be here. We had some great conversations last night and had some great conversations with Gene about really how to change organizations, especially legacy organizations, and that's what I'm going to discuss today. So my background is really in software architecture. I usually draw things on the wall, and then I leave and teams build it, and I come back a year later.

No. So most of my background is really building large-scale systems. And I've been doing that for a long time, and I'm pretty excited about that, but I started running into organizational problems of being able to actually deliver the great software that I was building. And it was really kind of a lot of silos that we saw across a lot of groups, and we needed to kind of defeat those problems.

So I'm going to talk today about how we've applied really kind of lean thinking and DevOps techniques in a large organization, and we really kind of over the years have documented 10 of these techniques that have helped us really get there, and I'm hoping that you guys can take those home. So really quick on CSG, who we are, and I kind of divided this into two sections. On the left, basically, we are a large outsourced customer care and billing operation, and we have basically about 50 million subscribers in the US across 120 customers that you may recognize, like a Comcast, Time Warner, Charter, Dish Network, and other cable subscribers. We have over 100,000 call center seats deployed, and we have billions of transactions that run through this application platform every month.

We have about 40 development teams, and about 1,000 practitioners that we're going to talk about that we executed this optimization through. The product stack is called ACP, and it unfortunately runs across about 20 different technologies, everything from JavaScript to high-level assembly language on the mainframe. So it's extremely challenging because we also deliver this as an integrated suite of 50-plus applications that all actually have to work together. So our challenges have been really kind of quality, time to market, release impact.

We've got technology stovepipes, and we've got role stovepipes across these groups. The other side of the business on the right-hand side is our print and mail factory. It's a high-performance lean organization that prints over 70 million statements per month, and their challenge is really continuously optimizing that business. So for folks that have read "The Phoenix Project," you'll see this is kind of eerily similar to the scenario there, where basically that print and mail factory is like our MRP-8.

That's basically our manufacturing platform that we're lucky enough to have, and we can actually go look at their processes and really compare and contrast those to how IT work is actually managed. So just kind of the improvements that we've seen, we'll start with those. So we used to have 28-week releases, and in 2013 our release impact score was 458. We calculate that score by assigning a critical a four, a high a three, a medium a two, and a low a one. And when we would release our software with this 28-week batch, it would take us 10 days to recover.

In 2014, we basically reduced our batch size to 14 weeks, and we reduced the release impact to 153 with two days of recovery time. So you'll see there that we had basically a 3X improvement in the impact to our customers, and reduced the impact duration to one-fifth. And so that's significant. You can imagine how frustrating it is to put a release of software out there and have 10 days to recover with all those incidents.
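The scoring described above can be sketched in a few lines. This is a minimal illustration, assuming the score is a plain sum of severity weights over the incidents attributed to a release; the talk gives the weights and the totals (458 and 153) but not the exact formula, and the incident list below is hypothetical.

```python
from collections import Counter

# Severity weights as stated in the talk: critical=4, high=3, medium=2, low=1.
SEVERITY_WEIGHTS = {"critical": 4, "high": 3, "medium": 2, "low": 1}


def release_impact_score(incident_severities):
    """Sum the severity weight of every incident attributed to a release.

    Assumption: the score is a simple weighted sum; the talk does not
    spell out the formula.
    """
    counts = Counter(incident_severities)
    return sum(SEVERITY_WEIGHTS[sev] * n for sev, n in counts.items())


# Hypothetical incident list for illustration only; not CSG's data.
example_incidents = ["critical", "high", "high", "medium", "low", "low"]
print(release_impact_score(example_incidents))  # 4 + 3 + 3 + 2 + 1 + 1 = 14

# Ratios derived from the totals quoted in the talk.
impact_ratio = 458 / 153    # 2013 score divided by 2014 score, roughly 3x
recovery_fraction = 2 / 10  # 2014 recovery days as a fraction of 2013's
print(round(impact_ratio, 2), recovery_fraction)
```

A weighted sum like this makes one critical incident count as much as four low ones, which is why the total can drop sharply even when the raw incident count falls more modestly.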

So we immediately kind of saw improvement, right? There's also some additional information from a metric perspective that we've actually seen with this. So the two columns on the left represent what we saw on the previous slide. Now, the next column, basically further to the right, is the exact same code that we deploy a second time with fixes in it to a second customer a few weeks later.

And you'll see the dramatic improvement there, where our impact score goes to a two. And now we talk about an order of magnitude improvement that Gene talked about later in quality that we've actually seen. Now, what that indicates to us is that we're actually not getting quite enough practice. We need a little bit more.

And then the final column on the right-hand side is a four-week batch. That four-week batch is another product set, but again, you'll see that order of magnitude by having that smaller batch size and the agility of releasing that code to production. So if we kind of rewind and then just think about in 2010 when we looked at this, when we wanted to improve the quality and the time to market, we saw out there that we had all these system constraints. We had structure problems.

We had stovepipes and handoffs. We had technology variance. We had silos of technology, as the Target folks talked about, defects in quality, issues, low automation, and fragility. And so we really started on this three-year journey then, and along the way, documented these techniques that we found were valuable to actually help change the organization and really change actually how we were going to deliver value faster.

And we'll dive into all those right now. So the first one, and the most important, Heather from Target talked about really talent management and people. That is the most vital, and that's why we actually start with that one. So we realized that we needed to build a culture of learning and change the way people thought about self-improvement and inject lean thinking into the environment.

So we found a lean framework, and the framework we've adopted is called SAFe, or the Scaled Agile Framework. You may have heard about it. And we used this and trained 2,200 people in our organization in lean thinking techniques. You'll see that the pillars there, respect for people, incredibly important.

People do all the work. Product development flow, which is based on Don Reinertsen's work on how you flow value through product organizations. And finally, Kaizen, basically improvement, continuous improvement. It's built on a foundation of leadership and has a goal to basically deliver value continuously at a sustainable pace.

The other thing we realized we needed to do is encourage cross-training, and we use several techniques for that. Taking I-shape resources, encouraging them, and cross-training them to move to T-shape resources, and in effect, increase their response repertoire, which Kevin Behr talks about in his presentation about coal mines: we increase their ability to react to change and basically take on different types of work. And finally, we encourage people to move to E-shape resources, where they build up the expertise, experience, and then finally, we want them actually to explore new areas of work.

The second technique is called the Inverse Taylor Maneuver. So Frederick Winslow Taylor, basically, his principles of scientific management really helped revolutionize certain areas of manufacturing.

Unfortunately, they trickled pretty deep into large organizations. And prior to 2010, we had organizations that looked like this. Extremely role stovepiped, different roles, and they were really all optimizing for specific roles and communicating through large handoffs of documents, code, for example. We'd dump code over the wall to operations, and they suffer getting that at the end of a large batch handoff.

So basically, what we knew from that is that structure and the responsibility enforced the behavior, but it also prevented learning because these role-specific groups did not learn the entire business process flow. So really a principle of the agile movement was to create those cross-functional teams, and we basically went and reorganized our groups, so we had those cross-functional teams, feature teams in those areas that could deliver, for the most part, entire features. And this structure basically removes queues. So if you know anything about lean, queues are extremely problematic, and also it incents the teams to learn at a faster rate.

So we basically organized those teams to optimize the entire flow of value. The third technique is the Inverse Conway Maneuver. So Conway's Law, Melvin Conway in the '60s developed Conway's Law, which really states that your software architecture will generally represent the structure of your organization or teams. He found that four teams create a four-pass compiler.

Many organizations, and ours in particular, we found that the team structures really enforced the technology and the architecture of the software that we had. So in our case, basically, we had a traditional fat client-server desktop. We had legacy middleware. We've got just about every one.

This is an example of one that we've got. And then we had a standard SOA architecture that we're building our API strategy around, very similar to how Target was building an API strategy, and we really favored that architecture, really centralizing the business logic in one place, operationalizing it one way. But we couldn't get there because we actually had all these stovepipes of teams that were basically building different things in different ways. So in effect, we inverted that and basically changed the structure and used Conway's Law against itself to say, look, we're going to structure the teams so that they provide a standard API strategy for all the applications.

You'll also note on Martin Fowler's technology radar in July that he actually quotes this Inverse Conway Maneuver. We've used it for many years, but that was the first time we've actually seen it in writing, so I adopted it here and credited it to him. So the fourth technique is shared service continuous delivery. So one of the problems with 40 agile teams is that they will produce work really quickly, and then they will actually pave 40 different ways to production.

So one of the things that we wanted to do is make sure that we provided predictability for the downstream teams. So we basically created a shared service delivery set of teams that provided the common infrastructure for all the teams to consume. And we use many of the same tools that Target does. We use Jenkins for the continuous builds.

We use basically the Atlassian stack, really so the teams can communicate. We use Git and Subversion for all that infrastructure. But we have that common platform, and then we also try to make it self-service as possible, so it doesn't create a massive bottleneck for the 40 different teams within the organization that are consuming it. So the fifth technique is environment congruency.

So one of the things that we saw when we looked out across the teams was that we had development teams carrying out operational roles in development environments and getting to practice doing that every single day. And then two to four times a year, we would basically hand off the code to production teams that have really never practiced that deployment, and that created high batch transfers and high failure rates. So we looked at that and said, "Well, this is a little bit crazy." And I quote this to one of my colleagues, Steve Barr. He said, "Well, it's like basically we've got a game team and a practice team, and the game team never gets to practice," right?

So we looked at that and said, sports teams never do that. They don't do that, right? They actually have the team practice every single day. And we basically looked at it and said, "Well, we need to fix it." So we basically created this concept of shared operations teams where we have the exact same team do the deployments in every single environment, and they use similar environments to production.

And so in our cases, they actually get to practice about 70 times before release day. So we have about 14 weeks if you basically get to practice five times a week in the deployment. So by the time they get to production day, they understand the system, they understand what's coming, and they've practiced doing it 70 times. And basically now they have very low impact, and they're very successful at their deployments on production day when we need to roll the software out.

And you'll see at the top, I have Jez Humble's quote, "If it hurts, actually do it more." So one of the things we found is that the skepticism in the beginning is there's no way that we can take this on, right? Operations teams are so busy. But because you start to do this every day, you automate all the infrastructure and the components that before were extremely painful and that were done at 2:00 in the morning, when people were struggling to get the software working.

You end up automating all that.

The sixth technique is application telemetry. And this one is one of my favorites because it's actually very surprising to me, actually, what improvement that application telemetry can make. And I think it's often surprising to executives and other individuals why it's important to make an investment in this. So if NASA launches a rocket, they have millions of sensors that tell you what is going on with that rocket.

Temperature, altitude, direction. If there's a failure, they record it, right? And it's a very expensive piece of equipment, so they basically have invested in the telemetry to understand what is going to go on with that rocket. Unfortunately, with software, we don't seem to take the same care.

We write distributed applications that run across a thousand nodes, and we use console writelines to standard out to write the errors, and then we expect ops teams to go grep through thousands of files to go find the problem. And we wonder why they can't find the problem. The development teams can't help, and then it takes forever to actually fix the problem, actually, when something breaks. So what we do is we actually build and embed very deep telemetry into all pieces of our application.

So all of our process spaces are instrumented to basically collect trace and activity information, and it's sent in real time to a repository that we call StatHub, which sits on top of Elasticsearch. And basically, we collect a billion events per day from all the process spaces that are running in that environment. We've instrumented over 100,000 location codes-- sorry, locations in our software so that we now understand distributed calls, database calls, writes to the file system. And when there's failures in those, we can see those in real time on a dashboard.

So on deployment day, when we push things out, we can look at that pane of glass and we start to see red, and we know we've actually created a problem. We can click, we can get the stack trace, we can go look back in the code, and we can go figure out what went wrong. Did a connection fail? Did the disk fill up?

Did a developer inject a parsing bug somewhere in the code? And we can see those things in real time. Extremely powerful. And now the teams learn how to make the application better also.
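To make the contrast with console write-lines concrete, here is a minimal sketch of instrumenting one code location so that it emits a structured event with its outcome, duration, and stack trace. Everything here is an assumption for illustration: the `traced` helper, the `billing.parse_amount` location name, and the in-memory `EVENT_SINK` are hypothetical, and the list merely stands in for a central repository such as the StatHub store mentioned above, whose actual interface is not described in the talk.

```python
import json
import time
import traceback
from contextlib import contextmanager

# In-memory stand-in for a central event repository (hypothetical).
EVENT_SINK = []


def emit(event):
    """Serialize one structured event and append it to the sink."""
    EVENT_SINK.append(json.dumps(event))


@contextmanager
def traced(location, kind):
    """Record the outcome and duration of the wrapped block at a named location."""
    start = time.perf_counter()
    event = {"location": location, "kind": kind, "status": "ok"}
    try:
        yield event
    except Exception as exc:
        # Capture the failure details, then let the error propagate unchanged.
        event["status"] = "error"
        event["error"] = repr(exc)
        event["stack"] = traceback.format_exc()
        raise
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        emit(event)


def parse_amount(text):
    """Example instrumented location: parsing a billing amount."""
    with traced("billing.parse_amount", "parse"):
        return float(text)


parse_amount("19.99")
try:
    parse_amount("nineteen")  # simulates a parsing bug reaching production
except ValueError:
    pass

events = [json.loads(e) for e in EVENT_SINK]
errors = [e for e in events if e["status"] == "error"]
print(len(events), "events,", len(errors), "error(s) at", errors[0]["location"])
```

Because every event carries a location and a status, a dashboard can count errors per location in real time instead of someone grepping through files after the fact.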

The seventh and eighth techniques are really around work visibility. So you'll find in a lot of organizations, and this is what they found in the Phoenix Project, is that work gets injected from all different places. There's emails, there's phone calls, people walk up to your desk, there's IMs, "Hey, I need to get something done," right? And they go directly to the workers for that.

That creates kind of chaos and context switching. And then what happens is you get other workers that then get blocked or aren't busy in the environment, so you don't get really great utilization. Really, you need to fix that. And one of the things that we've been doing, and we've been spending really the last couple of years, and it's really accelerated in the last year, is getting a handle on that across all teams and making sure that we understand where the work is coming from.

So you need to create an intake buffer. You need to basically create a way where all your work goes kind of through one process flow and into one set of tools. I mentioned the Atlassian stack. We use Jira, but there's lots of great tools to manage this with.

But it's not spreadsheets under people's desks, it's not napkins, and it's not phone calls to actually process that work, and you see that in a lot of places. Once you have that one list of work that you basically can triage and have unified visibility, then you can start doing things like WIP management. Limit the amount of work that you inject into your system, because as you drive the amount of work up, as we saw on the Phoenix Project, basically, things slow down. Wait time increases.

So you really have to adjust that WIP limit and also release the work into the environment in a predictable way. And in manufacturing, this is what they call job and materials release, and we'll look at some pictures of that in a second that really kind of illustrate more of how it works. But these types of techniques now allow you to put predictability into your work stream. So I mentioned the print center.

So this is a work visibility example from the print center, and this is standing in what's called row one or aisle one. Each one of those carts that are there represents a job that's going to be put into the print stream. And on those carts are proxies for all of the materials that will be needed to process that job. And they don't put one of these carts into the system until they understand that they have the capacity and all the resources to handle it.

So this gives them that predictable job and materials release. So we take our IT managers here, and it's really great, and we put them in aisle one, and we're like, "Hey, how do you do this in IT? How do you release work?" And on the first couple of times, they're like, "We don't know. We're not sure.

We just tell people to go." And I'm like, "Well, then how do you know you have the right operational resources? How do you know that you have the other infrastructure?" Right? So it really kind of starts to click for them how important that is and how, in some cases, manufacturing is very similar to IT. There are other cases that don't correlate well, but this is one that does.
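The intake buffer, WIP limit, and job and materials release described above can be sketched together. This is a simplified model under stated assumptions, not CSG's process: the `Job` and `WorkSystem` names, the resource labels, and the strict first-in-first-out release rule are all hypothetical choices made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    needs: set  # resources the job requires, e.g. {"dba", "test-env"}


@dataclass
class WorkSystem:
    wip_limit: int
    available: set                                  # resources currently free
    intake: deque = field(default_factory=deque)    # the single intake buffer
    in_progress: list = field(default_factory=list)

    def submit(self, job):
        """All requests enter through one queue, whatever their origin."""
        self.intake.append(job)

    def release_next(self):
        """Release the oldest job only if the WIP limit and resources allow it."""
        if not self.intake or len(self.in_progress) >= self.wip_limit:
            return None
        job = self.intake[0]
        if not job.needs <= self.available:
            return None  # like a cart that waits in aisle one
        self.intake.popleft()
        self.available -= job.needs
        self.in_progress.append(job)
        return job

    def finish(self, job):
        """Complete a job and hand its resources back."""
        self.in_progress.remove(job)
        self.available |= job.needs


system = WorkSystem(wip_limit=2, available={"dba", "test-env", "ops"})
system.submit(Job("billing-fix", {"dba", "test-env"}))
system.submit(Job("report-upgrade", {"dba"}))
system.submit(Job("api-rollout", {"ops"}))

first = system.release_next()    # released: resources are free, WIP is 0 of 2
blocked = system.release_next()  # None: "dba" is still held by billing-fix
system.finish(first)
second = system.release_next()   # report-upgrade can go now
print(first.name, blocked, second.name)
```

The strict queue order means a blocked job at the head holds back everything behind it; a real system would likely add prioritization, but the gate itself (no release without capacity and resources) is the part that mirrors the carts.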

So what we ask people is, do you know where your work comes from, basically, and how it is scheduled and how it is released? So this second example is a new robot that we just got installed, and it's kind of a kitschy example, but I thought it was pretty cool because it reminded me of automated deployments, so I thought I'd show it. But again, this is from our print center, and we have the luxury of actually being able to go look at these things and then actually translate them to the IT world. So if you play the video.

So this is a robot that's actually auto-sorting mail to put it on a pallet for the US Postal Service to pick up. We used to have people do this 24/7. Bend over, pick up boxes, sort them, get them on the right pallets. Now we have a robot that actually does the work perfectly, right?

It never fails. It never needs to sleep. Right? And so what I ask people from this is:

How do you get your code in production? Do you have someone up in the middle of the night deploying code and making mistakes, or do you have automated robots actually deploying that code the same way every single time? And again, this was a kind of great example that you can kind of point to and look at the physicality in the print center and really translate that to IT. So I thought it was another good example.

The ninth technique is cadence and synchronization. So the example that I've got here is a picture of a bus stop. I'm actually standing at the corner waiting for the bus and there's five buses lined up and I'm sitting there kind of giggling a little bit, taking a picture because I think it's a great example of what failure to have cadence and synchronization does. This is actually called the bus bunching problem.

And so I also then took a screenshot of the CTA bus tracker site that actually shows they have the data that this problem is occurring. They see that there's five buses bunched up in the same place, but they don't know how to fix it, right? So the end result is all those buses end up late. The first one has a whole bunch of people on it.

The one in the back has no people on it, right? And it goes down the street, right? So they see this happening. They've got the data, they're not doing anything about it, right?

So the thing to take from this is that in IT and in work processes, if you don't inject cadence and synchronization, nature will do it for you, and it will do this exact same thing. All of your projects will collide and basically cause a problem in the system. In other words, all projects will probably be late or of low quality. So what we do to manage this is we inject cadence and synchronization.

We basically line up unpredictable events like release planning to occur at the same time across all teams. So you basically have 40 teams that plan their dependencies at the same time. And then inside a program increment, we have subharmonic basically iterations or sprints where every two weeks they continue to plan and resynchronize, and then you have a major pull event into production, and then you replan again, and you keep doing that. That basically gives you predictable events to replan for problems that occur in the system and humans like predictability.

They like to look out and they like to know the date. Hey, I know exactly in 14 weeks I'm going to redo this planning. So let's get all these stakeholders together, the architects, all of them together to actually plan it out and plan that next release. The other thing it allows you to do is it allows you boundaries to manage new work injection into the system.

So you don't want new work coming in every single day, right? Because this unplanned work actually creates chaos for your teams if you're sending them new stuff every single day. So this gives us those boundaries to manage that on. And this is again how we use this cadence and synchronization to prevent problems like the bus bunching problem of all teams colliding and things occurring late.
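A shared cadence can be expressed as one calendar that every team derives from the same start date. The sketch below assumes a 14-week program increment split into two-week iterations, as described above; the `program_increment` function and the start date are hypothetical.

```python
from datetime import date, timedelta


def program_increment(start, weeks=14, iteration_weeks=2):
    """Return (iteration date ranges, next planning date) for one increment."""
    count = weeks // iteration_weeks
    step = timedelta(weeks=iteration_weeks)
    iterations = [(start + i * step, start + (i + 1) * step) for i in range(count)]
    return iterations, start + timedelta(weeks=weeks)


# Hypothetical start date; the point is that all teams share the same one.
iterations, next_planning = program_increment(date(2014, 1, 6))
print(len(iterations))  # 7 two-week iterations in a 14-week increment
print(next_planning)    # date of the next shared release planning event

# The earlier 28-week batch corresponds to 14 such iterations.
old_iterations, _ = program_increment(date(2014, 1, 6), weeks=28)
print(len(old_iterations))
```

Since every team computes the same boundaries, new work can be held at the nearest boundary rather than injected on arbitrary days.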

So the 10th, really the final one of all of these, is reducing batch size. So we knew and everyone kind of knows in manufacturing that reducing batch size can have significant effects. But as we mentioned, we had structures which prevented us from doing it. We couldn't have just woken up on January 1st in 2014 and said, 'Hey, we'd like to do 14-week releases.' We knew that the constraints in the system, the fragility, would've caused a disaster.

But we put learning in place, we put infrastructure automation, we put processes in place, and then after doing all that, we're able to reduce batch size. So we basically took those 28 weeks, right, which were 14 iterations. We reduced that to 14 weeks of smaller program increments. And then from that we get smaller and fewer things going through the system and they go through faster and they have significantly less impact and higher quality, right, once we've done all these things to actually reduce that batch size.

So kind of a summary of the metrics, where we were before this. We had this 28-week batch size. We had high impact of 458. It took us 10 plus days to recover.

We had a lot of irate customers from that. Then we cranked things down to 14 weeks after all these improvements in 2014, reduced our impact to 153, which is basically a 3X improvement on where we were. And it takes about two days to recover, which is a much shorter time to recover from putting a release in production. Where we want to be is probably somewhere between those last two.

We know with the extra practice that we're getting after that first deployment that we can do a lot better. We can basically do two orders of magnitude better. So we really want to get to that type of impact. We know on smaller products, with that smaller batch size of four weeks, that we see a similar type of impact.

We do believe that somewhere in between, probably eight to 10 weeks, is the right sweet spot for us. If we release every four weeks, our customers probably can't consume those features that quickly. So it's probably an eight to 10 week release cycle that we actually want to get to, with that lower impact and that higher quality. So, a summary of these techniques: accelerating learning, very important, really lean and systems thinking.

Inverse Taylor maneuver and inverse Conway maneuver to change structure and technology. A shared service continuous delivery platform to provide kind of consistent delivery resources and tools across all the teams. Providing environment congruency and practice for the teams that actually do the deployments. Application telemetry so that your teams can learn how the application behaves in production and so that they can actually now increase the time to recovery, or sorry, decrease the time to recovery and continue to make the system better.

Visualizing your work and creating work release and WIP limits. Then applying cadence and synchronization to really kind of line up these unpredictable events in large enterprises. And then finally reducing that batch size. So a few credits that I've got here in the slide there.

There's a lot of work that I pulled from for this. And then, kind of the final questions of how you can help me: the things that we struggle with are really standardizing these applications at scale and really proving the business case. We've seen great results, but we continue to struggle to prove out those cases, and trying to make progress quickly on that is a hard thing. You've got tons of legacy applications, and you're trying to clean those up, continue to standardize those, and make the operations of those things a lot smoother.

And then also balancing standards with innovation. One of the challenges of having standards is it does stifle innovation a bit because you don't have all your teams running out creating new things. And we continue to struggle with that in our enterprise where we have teams that want to innovate and they should, right? We increase their learning.

We are asking them to explore things, but all of a sudden now they're injecting all these new tools and then that creates a bit of a problem for us to actually manage. And that's it.