Conway & Taylor Meet the Strangler (v2.0)


San Francisco 2015



Scott Prugh

Chief Architect · CSG

Erica Morrison

Senior Manager Software Development · CSG

In previous talks we discussed the last several years of a grassroots transformation at CSG. That transformation was driven by Agile and Lean adoption, and then by beginning to tackle DevOps and optimizing the flow of value all the way to production.

Moving forward, we have adopted a shared, principle-based approach that aligns development and operations leaders. This movement has not been without its struggles. Culture, process, and technology limitations continue to be serious challenges for large enterprises trying to move closer to “Internet Speed”.

In this presentation we will discuss our goals to further accelerate our delivery and detail some of the principles we are using to align our vision and execution.


Full transcript


Scott Prugh

All right. Thanks a lot, Gene. We're super excited to be back here again. I always learn so much at these events, really sharing the stories of improvements that people have gone about. Had some great times the last couple of years visiting Target, working with the great folks there, and it's been a great journey.

So today, we're going to cover really continuation techniques that build on top of last year's journey. Kind of kept the format similar so you could follow along. Now, these are forward-looking techniques, so we don't necessarily have all the answers, because last year was looking back, and it was easy to look at those and see our improvements.

So we'll get started.

All right. So really quickly, CSG in North America, who we are. There are really two sides of our business here. On the left, we have our customer care billing operations. We operate a very large outsourced billing and customer care operation for about 50 million subscribers in the U.S. for large cable and satellite providers, many of whom you'll recognize, like Comcast, Charter, Time Warner, and DISH. We have about 100,000 call center seats. There are 40 dev teams that support and run our software, and 1,000 practitioners.

ACP, the key suite or platform that we provide, runs across about 20 technology stacks. That's everything from JavaScript on the front end to high-level assembly on the mainframe. So this suite of 50 applications looks a lot like traditional organizations: a lot of stuff that's been stitched together over the years. Our challenges have been time to market, quality, release impact, technology, and role stovepipes.

On the right, we have our print and mail factory. It's a highly lean and efficient manufacturing facility that churns out 70 million statements per month, and they're always trying to continuously optimize their business. For folks who read The Phoenix Project, you'll see there's a very striking similarity here. On the left, we've got software and operations, and then on the right, manufacturing. And we can use it to juxtapose how work is done between the two groups. So we've learned a lot from having that manufacturing facility.

So this is a recap of where we were last year and where we are today. I presented the left two columns last year, and in 2013, when we released into production, we had about 200-plus incidents, 201 incidents to be exact, that we would basically drop on our customers and on our support staff. That was incredibly impactful. It was very stressful to release. It's one of those things when you start pushing the buttons and red lights come on.

We implemented our 10 techniques. We halved our batch size. We went from 28- to 14-week releases, and we improved quite a bit. And last year I presented we hit 67 incidents, which is pretty good. That's about a two-thirds improvement from where we were.

The great thing on the right is we've continued to improve. Our latest release in August was just 18 incidents. That's a significant difference from where we were in 2013. That's a 90%, or 10x, improvement. It's pretty substantial, and now releases have become a non-incident. And all this is from applying lean techniques, DevOps principles to our environment, to these legacy applications.
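The arithmetic behind those figures can be checked with a short sketch. The incident counts (201, 67, 18) are the ones quoted above; the function name is just for illustration.

```python
def reduction(before: int, after: int) -> tuple[float, float]:
    """Return (percent reduction, improvement factor) between two incident counts."""
    percent = (before - after) / before * 100
    factor = before / after
    return percent, factor

# Incident counts quoted in the talk: 201 in 2013, then 67, then 18 in August.
for label, after in [("last year", 67), ("August release", 18)]:
    percent, factor = reduction(201, after)
    print(f"{label}: {percent:.1f}% fewer incidents, {factor:.1f}x improvement")
```

So the "two-thirds" and "90%, or 10x" figures are slightly rounded versions of roughly 66.7% and 91%, or about 11x.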

So graphs are kind of boring, but pictures are a lot better. And this is one of my favorite pictures, really because it shows the impact that this has on people. As Jody said, this is less about technology, and it's really about people and culture. So these are operations engineers watching over our 15.1 deployment, which was in the beginning of 2015. They're putting in 71 features to 50 million subscribers. If you look closely, these guys are playing video games.

So we've actually optimized and done such a great job with this release, and we've gotten practice every day. We've gotten to practice this release about 70 times, so when we get to release day, it's a non-incident. So these guys used to run around for two weeks with their hair on fire trying to fix all these issues, and now it's very low stress. So for folks that think this is really technology and tools, this has a great impact, really, on your people and your customers, and we've seen substantial returns there.

So I guess I can ask, what's the problem, right? So why are we here talking about these further techniques? And this is kind of the issue that we have, and I assume a lot of people have, is that the demand for quality and speed, because of the internet, because of mobile, because of video, continues to increase. And we have traditional systems of record that were designed 30 years ago, and they're being stretched to become systems of engagement. We didn't necessarily design them to be front-facing, but over the years we've evolved them to be as such, and that speed now continues to stretch them.

The customer expectations, and their customers' expectations, also continue to increase. If you've got your mobile phone and you want to buy something or you want to stream a video, you expect it to work, and you expect it to work all the time. So those demands continue to increase faster than the quality and speed of many of these large organizations can go.

The second problem is organizational and process debt. Traditional organizations that were designed around Taylorism and Conway's Law really thwart speed, and they also thwart the speed of information learning.

And the final problem is the infamous technical debt, which a lot of people have. In this picture, I've got some green systems which are new and modern, and they move really quickly, and you can change and adapt them. And I've got some red systems that act as speed bumps. The problem is they're all interconnected. So when we change one system and it actually runs into a system that's a roadblock, we have issues.

So across these, we strive really for unimodal speed in IT, and we invest correctly in technical debt reduction and simplicity.

All right. Oops, go back. All right. So I talked about the pressures and constraints, our striving for unimodal IT, where we invest in culture, empathy, and understanding across the teams, and also the simplicity and automation, which is very important.

Now I've got our V2 techniques, which build on our previous techniques from last year, and I'll dive right into those now.

So our first technique is holistically improving work visibility. I talked last year about how we improved our visibility and features across both development and operations. But as we've been discovering, to continue to improve quality and speed, we really have to take a harder look in a couple areas. One is incidents, the second is dependencies, and the third is really kind of single intake and tooling for all our planned work. And that's not just features, which is creative work, but also service requests that are asked of the teams. And visualization of this work will then actually help us and help teams prioritize their WIP and target additional areas of improvement.

So 1A underneath this is holistic incident visibility. So I showed a picture of our great improvements, 90%, and here we kind of talk about making these incidents visible across all the teams.

So we kind of zoom out to this picture, and I felt like Homer when I actually rendered this graph. The blue slices are the release incidents, which are the ones we've improved. We went from releases accounting for one-fifth of the total impact in the environment to one-fiftieth. The orange is everything else that's going on. Those are incidents that are occurring across the rest of the environment.

So this was very striking. We looked at it, and it was clear that we did not have a great picture of what was occurring across all the teams in the software environment. The releases were great, but there's a lot of other stuff going on.

Kind of zoom to another picture and it becomes even more evident actually what's happening. So on the left, we've got the development incidents resolved by development teams. It's not necessarily the teams that cause the incidents, it's the teams that actually are resolving them. And on the right, we have the operations teams. The blue, again, are the releases that we've done such a great job to optimize away. It almost seems like there's a wall between those two groups where they're not talking to each other.

So, what we see is significantly more incidents actually occurring outside of a release, and additionally, we see the majority of these incidents being resolved actually by the operations team. So release incidents represent less than 2% of the total volume. Non-release incidents are greater than 98% of the volume that's occurring, and ops is burdened with repairing 94% of that total volume.

We've also found that the medium-low, which is indicated in the number three there, is 95% of the volume that's occurring in the environment. Further analysis, we found that basically 90% of those incidents are occurring from less than 20 errors that occur in the system. So I kind of look at this and I really ask, why is this feedback not occurring? What is really kind of preventing this feedback loop?
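A finding like "90% of incidents come from fewer than 20 errors" is a Pareto-style analysis. The following is a minimal sketch of how such a cut could be computed, assuming each incident record carries some error signature; the signatures and counts below are made up for illustration and are not CSG data.

```python
from collections import Counter

def top_errors_covering(signatures, share=0.9):
    """Return the most frequent error signatures that together
    account for at least `share` of all incidents."""
    counts = Counter(signatures)
    total = sum(counts.values())
    covered, selected = 0, []
    for signature, count in counts.most_common():
        if covered >= share * total:
            break
        selected.append((signature, count))
        covered += count
    return selected

# Hypothetical incident log: one error signature per incident.
incidents = (["TIMEOUT"] * 60 + ["DISK_FULL"] * 32
             + ["BAD_CONFIG"] * 5 + ["OTHER"] * 3)
for signature, count in top_errors_covering(incidents):
    print(signature, count)
```

A short list like this is what makes the feedback loop actionable: it tells development which handful of code fixes would remove most of the volume operations is absorbing.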

The Phoenix Project introduced us to the second way, which is to amplify feedback loops between development and operations, and we've got some additional techniques which myself and Erica will talk about around looking at the KPIs that incent this behavior, rotation to basically build culture, understanding, and empathy, and then finally telemetry to help you understand more about what's going on in the environment.

1B is dependency visibility. This is a picture of our manufacturing facility, and I'm standing in row one. Row one contains job carts. Each job cart has a job card on it, which indicates all the materials and the dependencies required, and it's a very physical activity to actually put jobs into the system. And we have software that manages the jobs, but this is a very physical way. When they take a cart and they place it really onto the floor, they actually start that job. So it gives them a very tactile way to manage work.

And we kind of look at this and then we kind of ask a bunch of questions like, do you know in software how your work comes in and is scheduled? And then I've added on here, do you also understand your dependencies required to satisfy the work? And that's extremely important, and it's extremely important in large enterprises in IT to have good understanding of that.

So this is basically row one of our software program. Across the top is 41 teams. Vertically is time, or the iterations. The blue cards represent features that are being developed. The yellow cards represent dependencies that need to be satisfied for those features, and the strings kind of connect them together.

So I've kind of started calling this Conway's Board because I always think of Conway's Law that says four teams will create a four-pass compiler, and I'm wondering what type of compiler will 41 teams create. This also looks strangely like our software architecture picture where we have dependencies all looped together.

So this was an experiment. We tried this PI to kind of really start to look at these dependencies. The teams built this. The general feedback is it's hard to really truly understand what it's saying, which points a little bit to the complexity. But a lot of things that we're experimenting with will actually continue to optimize this board and perhaps change it to be electronic, really, so you can get more information from it.

But it's really kind of an important look at the dependencies in the system and how we are looking to kind of move to leverage this understanding and eventually move towards true feature teams, or at least teams that can develop more of the solution without depending on other teams.
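If the board did move to an electronic form, the cards and strings map naturally onto a small data structure. This is only a sketch under assumed names: the team and feature labels are hypothetical, and the two measures shown (inbound dependencies per team, share of features a team can deliver alone) are one possible way to read the board, not something described in the talk.

```python
from collections import defaultdict

# Hypothetical feature cards: (feature, owning team, teams it depends on).
features = [
    ("F1", "billing", ["mainframe", "web"]),
    ("F2", "web", ["mainframe"]),
    ("F3", "mobile", []),
]

def inbound_dependencies(cards):
    """Count how many features owned by other teams wait on each team."""
    inbound = defaultdict(int)
    for _feature, owner, depends_on in cards:
        for team in depends_on:
            if team != owner:
                inbound[team] += 1
    return dict(inbound)

def self_sufficient_share(cards):
    """Fraction of features that need no team other than the owner."""
    independent = sum(
        1 for _feature, owner, deps in cards if all(d == owner for d in deps)
    )
    return independent / len(cards)

print(inbound_dependencies(features))
print(f"{self_sufficient_share(features):.0%} of features need no other team")
```

Teams with the highest inbound counts are the likely speed bumps, and the self-sufficient share gives a rough measure of progress toward feature teams.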

1C is single intake of all planned work. So here I've got a picture of toolset A that manages our features. And I'm sure many companies have one type of work in one toolset, and then I've got toolset B that manages service requests and other really repeatable work in the system.

So kind of the problem with this really is, with three lists and multiple tools, is that multiple tools create what's called an information fog. And you generally then have to use a lot of effort to really stitch together the work, right? You deploy a lot of project managers, they stitch together the work, they draw project plans across it.

The dev features in this actually require both dev and ops teams. The ops features require development. Service requests can be made of both development and ops, and these work and resource dependencies collide. And then service requests generally don't get optimized and automated as part of this.

So what you really want to do is move to one toolset. Get your work in one tool so you can get that visibility across everything that's going on. It's one list, then, you can plan and coordinate together. It doesn't mean it's one giant bottleneck, but it does give you visibility now to all the work that is being requested across your dependencies and your teams.
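The idea of one list can be sketched as merging the items from each tool into a single backlog and then looking at the load per team. The item fields, the priority scale, and the sample items are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class WorkItem:
    title: str
    kind: str       # "feature" or "service_request"
    teams: tuple    # teams whose capacity the item needs
    priority: int   # lower number = more urgent (hypothetical scale)

def single_backlog(*sources):
    """Merge items from several tools into one priority-ordered list."""
    merged = [item for source in sources for item in source]
    return sorted(merged, key=lambda item: item.priority)

def load_by_team(backlog):
    """Count how many items of any kind land on each team."""
    load = {}
    for item in backlog:
        for team in item.teams:
            load[team] = load.get(team, 0) + 1
    return load

toolset_a = [WorkItem("New rate plan", "feature", ("dev", "ops"), 2)]
toolset_b = [
    WorkItem("Provision VM", "service_request", ("ops",), 1),
    WorkItem("Reset test data", "service_request", ("dev", "ops"), 3),
]
backlog = single_backlog(toolset_a, toolset_b)
print([item.title for item in backlog])
print(load_by_team(backlog))
```

Once features and service requests sit in the same list, the collision on a shared team becomes visible without a project manager stitching two tools together.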

All right. So technique number two is challenging shared KPIs. So if you think back to that incident picture that we had, I've got some things in red here highlighted. And if you look at the current KPIs that teams are being managed to, you'll see a disparity around critical and high where the ops goal is really two- and four-hour recovery. But development has a 12-hour and 15-day recovery. So there's a significant difference between those two.

Even more striking is medium and low, where operations has three and five days to recover, whereas the development goal is 90 days. So you can imagine with that how quick fixes are incented and the real fixes actually never occur.

So some things we're doing here now is to really set across the teams shared KPIs where both development and operations share in KPIs to recover the system. So the TTR for critical and high needs to minimize the impact to two hours and reduce the overall volume by 50%.

The second, around medium-low, it needs to incent code fixes, not just workarounds. And with this, we need faster dev response and reduce the volume by 90%. So this is something we're now using as measures as a feedback to really kind of take that picture that we've got of incidents that we've made visible and use these KPIs now to track our progress. And we're also not looking to punish people for not hitting these. Again, it's a measure where we want to leverage as a feedback system to continue to improve.
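The gap between the two sets of goals, and the idea of one shared target, can be written down in a few lines. The hour and day figures are the ones quoted above; pairing each figure with a specific severity (two hours with critical, four hours with high, and so on) is an assumption, as are the names.

```python
from datetime import timedelta

# Recovery goals quoted in the talk; the severity pairing is an assumption.
CURRENT_GOALS = {
    "ops": {"critical": timedelta(hours=2), "high": timedelta(hours=4)},
    "dev": {"critical": timedelta(hours=12), "high": timedelta(days=15)},
}
SHARED_GOAL = timedelta(hours=2)  # shared target for critical and high

def goal_gap(severity):
    """How much longer dev's goal allows than ops' goal for one severity."""
    return CURRENT_GOALS["dev"][severity] - CURRENT_GOALS["ops"][severity]

def meets_shared_goal(time_to_restore):
    """True when an incident was restored within the shared target,
    regardless of which team fixed it."""
    return time_to_restore <= SHARED_GOAL

print("critical gap:", goal_gap("critical"))
print("high gap:", goal_gap("high"))
print(meets_shared_goal(timedelta(minutes=90)))
```

Measuring every incident against the same target, rather than against the goal of whichever team picked it up, is what makes the KPI shared.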

Erica?

Erica Morrison

Thanks, Scott.

So technique number three is our Go See and Role Rotation program. This program allows us to build understanding and empathy across our teams. Participants get to experience a day in the life of other teams as they sit directly with the members of these teams that they select. They spend several hours going over details of what that team's work entails, different challenges that they face.

Historically, thinking and perspectives have been fairly siloed along organizational lines. So we've got dev, operations, and ISC, which is our help desk. And people optimize for what's most visible to them, very much impacted by the people that they interact with, different feedback that they receive.

So what we've attempted to do with this program is to provide cross-pollination across the teams so that we can really develop this whole-system thinking. So then we can turn around, we can attack continuous improvement for the entire value stream, not just optimizing one function, but improving the whole end-to-end system.

So the Go See program itself is managed in a fairly lightweight manner. We've got a homepage people can come to get an overview of the program and get an overview of the different sessions that are available. Each department that participates provides at least one session, which is the opportunity to go see what that department does.

Sessions run the gamut from overviews of the tools, processes, and products that a team interacts with, to actually digging in and helping resolve issues. They typically run about two to eight hours in length, although we are looking at adding an additional two-week role rotation component as well to really dig in and help understand these cross-team behaviors.

So I've included a few examples of some feedback on the right side of this slide here. Overall, the takeaways are that it helps people better understand the process and the systems so that they can do their job better.

Technique number four is infrastructure as code and shared understanding. So we've spent a lot of time this last year investing in continuous delivery and infrastructure as code. We picked a pilot team to do this, the team that manages our Jenkins infrastructure and really our whole build system infrastructure.

This is a DevOps team. We provision 15 Windows VMs four times a year for our Jenkins master and our agents, as well as some Linux and Solaris machines. And then once we have these Windows VMs provisioned, then we configure all the software on them and install everything necessary to successfully build all of CSG's code.

So this process used to be very manual, very time-consuming. What we've done now is we've automated all of this. So with a click of a button in Jenkins, I can interact with VMware's vSphere, run our Chef cookbooks, get our fully provisioned and configured VMs. So it's completely reproducible. We're able to create our entire Jenkins farm in this manner.

So Martin Fowler talks about how a server should be like a phoenix, regularly rising from the ashes. In contrast, you've got snowflake servers. These are long-running servers. They have their initial configuration. They've been manually changed over time. They're unique. They're not reproducible.

So what we're looking to do at CSG is move to a paradigm of phoenix, not flakes. So we want to build on the expertise and the understanding that this pilot team has developed and roll this out to the enterprise.
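The difference between a phoenix and a snowflake comes down to whether a server can be rebuilt from a declared desired state. The sketch below illustrates that idea in plain Python; it is not Chef and does not call vSphere, and the package names and versions are hypothetical.

```python
# Hypothetical desired state for a build agent.
DESIRED = {"jdk": "8", "git": "2.5", "msbuild": "14"}

def converge(current, desired=DESIRED):
    """Return the actions needed to bring `current` to `desired`,
    together with the resulting state."""
    actions = []
    for name, version in desired.items():
        if current.get(name) != version:
            actions.append(f"install {name} {version}")
    for name in current:
        if name not in desired:
            actions.append(f"remove {name}")
    return actions, dict(desired)

# A phoenix server starts from nothing; a snowflake has drifted by hand.
phoenix_actions, phoenix = converge({})
snowflake_actions, rebuilt = converge(
    {"jdk": "7", "git": "2.5", "debug-tool": "1.0"}
)
print(phoenix_actions)
print(snowflake_actions)
print(converge(phoenix)[0])  # nothing left to do on a converged server
```

Both paths end in the same state, and running the convergence again produces no actions, which is the reproducibility property the pilot team was after.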

So it's important to talk about what we've accomplished with Chef. Also want to talk a little bit about how we've been able to accomplish this with behavior changes and double-loop learning.

So we knew when we started out this work that the infrastructure-as-code initiative was going to be a truly enterprise initiative. Quickly became apparent we'd need a new way of doing business to deal with the different roadblocks that we were running into from something cutting across so many different organizations. Roadblocks such as preventative mechanisms in place for procuring VMs and resistance to a particular technology, as many teams had started investing in this area.

So what we did is we pulled together leaders from across development and operations, got them to commit to a number of different things. First of all, commit to the pilot team with our Jenkins infrastructure team, commit to a technology with Chef, commit resources and prioritization, and most importantly, commit to allowing this team to challenge the status quo.

So Chris Argyris talks about something he calls double-loop learning. This is detecting a problem or an issue and not only changing what we do, but changing our underlying belief system. This is organizational norms, policies, objectives, and we've done that in a number of different ways.

So first of all, we changed the idea of how a team could be structured. So we pulled together from across numerous different development and operational organizations to create this truly cross-functional team. Team functions just like any other Scrum team. We've got shared planning, shared stand-ups. Everyone has the context of the problem that we're trying to solve, so they get it, which makes a really big difference.

We have quick visibility into issues, prioritizations. Means that we can move through roadblocks very quickly. We're able to expedite our learning. We're able to leverage everyone's varying backgrounds so that we can attack these system problems with this really holistic view.

Other things we've changed are the policies and processes around procuring VMs and also how we attack infrastructure management problems. So this used to be something where we would more or less duct tape something onto the end of a finished system, such as with patching. And now instead, we view this as something where we can design it from the factory. We can reduce non-value-add, reduce variability, attack this as a system problem.

So John Shook talks about changing what we do in order to change our underlying thinking, and this is basically an inversion on the traditional methodology of changing culture, and this is based on his time at Toyota. And we've followed the same methodology here, where we're taking a new approach to a problem with this cross-functional team, infrastructure as code. We've changed what the team does in order to change their underlying thinking.

Technique number five is legacy test automation and ATDD. So automated testing is another area where we've invested a lot over the last few years, and I'd like to talk in detail about a success story with a product called SL Boss and how we've been able to use acceptance test-driven development and a continuous validation portal to successfully move from a legacy system to a modern one.

This effort's already saved us over $9 million, and the system is priced based on capacity. We've been able to reduce our cost down from $217 per core per transaction per second to $1.44. We process large volumes of data. We process over a billion transactions a month for more than 50 million satellite and cable customers, and this was running on a legacy system built on top of a complex, arcane middleware.

We didn't have automated testing in place. This meant that we couldn't move quickly and that we couldn't make changes in a safe, low-risk manner. So what we decided to do is apply the strangler pattern to reduce the operational complexity, reduce the complexity of the application code.

So as part of applying the strangler pattern, we started writing SpecFlow tests against the APIs for the legacy system for one area at a time. And once we had reached sufficient coverage, then we knew we could port over to the modern system with very low risk. You can see we put a framework in place here where we run our exact same SpecFlow test through the legacy system, through the modern system, and we actually compare the result in XML, and we'll fail our tests if they're not identical.

So we have customers who have coded to the specific XML coming out of this legacy system, so our results do need to look identical in this new system. So this ATDD framework allows us to catch issues early, flush out any missing requirements, any missing business logic before it ever gets into the hands of the customer.
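The comparison step can be sketched as follows. The actual framework uses SpecFlow; this Python version only illustrates the idea of sending the same request through both systems and failing on any difference. The two handler functions are hypothetical stand-ins for the legacy and modern APIs.

```python
import xml.etree.ElementTree as ET

def legacy_handler(account_id):
    """Stand-in for the legacy API (hypothetical)."""
    return f"<account><id>{account_id}</id><status>active</status></account>"

def modern_handler(account_id):
    """Stand-in for the modern API (hypothetical)."""
    return f"<account><id>{account_id}</id><status>active</status></account>"

def responses_match(request, legacy=legacy_handler, modern=modern_handler):
    """Send one request through both systems and compare the XML exactly."""
    legacy_xml = legacy(request)
    modern_xml = modern(request)
    # Parsing raises ParseError if either response is not well-formed XML.
    ET.fromstring(legacy_xml)
    ET.fromstring(modern_xml)
    return legacy_xml == modern_xml

print(responses_match("42"))
```

The comparison here is deliberately exact rather than semantic, since customers have coded against the specific XML the legacy system emits.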

So we've built on top of this ATDD framework, which is primarily targeted at running tests in a development environment, and we've also developed a continuous validation portal. This runs tests for all of our different environments for us. It runs our SL Boss transactions for all of our major subsystems for all of our customers. Gives us a dashboard we can quickly come to, we can quickly assess the health of the system, get high confidence that things are performing as expected.

So this slide here shows the results of implementing automated testing. There were a few other changes that went into some of these numbers, but automated testing is the biggest component of this. Improvements across the board, right? So feature development time up substantially from 15% to 55%. We've been able to increase our quality and our speed across both development and operations as well as reduce our risk.

So basically by putting this automated testing in place, we've made our environment something that we understand that's safe to change. And we're going to cover more details about this on a forum session tomorrow afternoon at 3:25 on legacy test automation.

Technique number six is telemetry and shared understanding. So we've continued evolving our telemetry solution. We embed telemetry into all aspects of an application. The code itself sends activity detail and tracing information to a central location to a product that we call StatHub. StatHub processes hundreds of millions of records a day, peaking at about 4,000 a second.

You can see from the diagram here that our servers and the applications running on them send their telemetry data to the central location in StatHub. We use Elasticsearch to store this data. And then we provide numerous different reports to view and analyze the data. Members of all of our different organizations are able to come get these same views of the data. So it's basically a common platform for people to come in, develop a cross-team understanding of system behavior. This promotes collaboration between our teams, particularly within development and operations.

So I'd like to talk about a few improvements that we're making to this system. First of all, we're adding additional applications. So the original libraries were written in .NET. We've added Java libraries. We're looking at extending to other languages as well, such as JavaScript.

We're incorporating legacy systems. So one such legacy system is a thick client that runs on hundreds of thousands of desktops. What we used to do is save the log files locally to that desktop. If there were any issues, that user had to actually call our help desk, had to manually send us the files, and then we'd take a look at them in isolation.

What we've done now is implement our logging and tracing framework so that it automatically gzips those log files and sends them to our StatHub REST endpoint, and we have immediate access to that activity detail and tracing information for all of the desktops. So we get the same benefits that we do from other applications that are in StatHub: better overall understanding of system behavior, reduced mean time to resolution.

Another improvement that we're making: we're beginning to implement some of the work that was done by the Chef pilot team and leverage that to begin managing our StatHub environment via Chef. So our StatHub team is a DevOps team responsible for both the code and the operational environment. We deploy right now to production once every two weeks. I've challenged the team to get to twice a week by the end of the year. So that's a substantial improvement from where we were two years ago, deploying to production once every six months.
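The client-side step of compressing a log and posting it to a REST endpoint can be sketched with the Python standard library. The endpoint URL, the desktop identifier header, and the log content are all hypothetical; the real StatHub API is not described in the talk.

```python
import gzip
import urllib.request

STATHUB_URL = "https://stathub.example.invalid/logs"  # hypothetical endpoint

def build_upload(log_text, desktop_id, url=STATHUB_URL):
    """Compress one log file and wrap it in a POST request (not sent here)."""
    body = gzip.compress(log_text.encode("utf-8"))
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "text/plain",
            "Content-Encoding": "gzip",
            "X-Desktop-Id": desktop_id,  # hypothetical header
        },
    )

log_text = "2015-10-20 ERROR timeout\n" * 100
request = build_upload(log_text, "desk-001")
print(len(log_text), "bytes of log,", len(request.data), "bytes compressed")
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

Repetitive log text compresses well, which matters when the client runs on hundreds of thousands of desktops.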

So another change that we're making is we're porting over a legacy telemetry system into StatHub, so focusing on getting all of our telemetry data available in one location. This gives us some additional capabilities such as alerting and machine statistics.

So I mentioned that StatHub is a good platform for developing shared understanding of overall system behavior. One such example of this is a recent troubleshooting session that was actually between Scott and a member of our operations team. They were collaborating, trying to figure out what was going on with this little blip that you see here.

So as a result of this collaboration, they were able to identify an improvement to StatHub to get to root cause faster. Basically, the ask was to be able to drill into the detailed logs behind the summary data in an easier manner, and we use Kibana within StatHub to do that.

So Scott sent this request to one of my teams. We were able to turn this around, get it into production in about a month. So this shows that not only is StatHub a good platform for developing this shared understanding, but we're able to leverage that shared understanding to continue to evolve our telemetry solution.

And with that, I'll turn it back over to Scott.

Scott Prugh

All right. So just a quick summary. So we've got the V1 techniques on the left-hand side that we covered last year, and this year we went through the V2 techniques that really build on top of those.

The first was really holistic work visibility and improving that across a lot of facets. Holistic work visibility across incidents, your dependencies, and actually what dependencies need to be satisfied to complete work, and then also single intake of planned work, taking things like service requests and features and change, and getting those all into a shared system and a shared backlog that's managed across the teams.

The second technique, challenging shared KPIs, really looking at the KPIs and understanding what behavior they're incenting and creating shared KPIs, DevOps KPIs, across both development and operations so they share in resolving issues together.

Go See and role rotation, really to start building understanding of how other groups work and really how we can get a system understanding to improve things end to end.

Infrastructure as code, not just for automating what you do, but actually changing how people behave and building a shared understanding of how the system is constructed and works.

Legacy test automation, really to also build that understanding of the functionality, but also to greatly reduce the risk of change and be able to move at a much faster speed across applications that weren't necessarily designed with test automation in mind.

And then finally, telemetry and shared understanding. Having deep telemetry in your systems really allows you to understand what's going on. Making that visible to everyone builds that shared understanding. When things break, you can fix them quickly, and then you can also continue to improve the systems at a more rapid pace.

So the final thing, question and help, we're undergoing really an internal platform-as-a-service effort to really start making things self-service. One of the challenges is how do you move quickly from being a task-based organization to provision infrastructure and really provide that platform as a service quickly? It takes a lot of time to do that, and we'd like to move a lot faster.

Thank you.