More Culture, More Engineering, Less Duct-Tape

San Francisco 2017

Erica Morrison

Director, Software Development & Operations · CSG

Scott Prugh

VP, Software Development & Operations · CSG

Over the past four years, CSG has undergone a major transformation from traditional siloed organizations across development and operations to true cross-functional teams in 2016.

Last year we discussed the structural culmination of this journey: the rollout of DevOps teams that follow the “You Build It, You Run It” model. This places accountability, understanding and engineering of the entire lifecycle of design, development and operations on one team.

This transition has taught us a lot about how the cultural norms of individuals and teams affect behavior. We have also learned a lot about technical debt in legacy technologies and how duct-tape fixes pile up to create problems in operations.

In this presentation, we will further discuss the techniques we have used to influence behavior, incent learning and knowledge sharing and change the cultural norms of established IT enterprises and practices. We will also reflect on applying modern engineering and architectural principles to established and seemingly intractable technologies often found in the enterprise space.

Erica Morrison, Director, Software Development & Operations, CSG International

Scott Prugh, VP Software Development and Operations, CSG International

Full transcript

Scott Prugh

Thank you so much for acknowledging the work that we've done. We really appreciate it, and we really appreciate this opportunity to talk to all of you today.

As Gene mentioned, my background is in software, and so is Erica's. So we'll give you a slightly different perspective on how to think about operations, and we'll also challenge a lot of the thinking around traditional operations structure and responsibilities.

CSG, what do we do, just real quickly. CSG is the largest SaaS-based customer care and billing provider in the United States. So if you get your cable bills from a lot of the major providers, we produce those, but we also run the software in the back end for the customer care. We've been doing that for about 35 years. So we're really proud of our heritage, but we're also really proud about continuing to innovate and improve.

Just recently, we hit about 61 million subscribers, and you'll see our growth. We have about 150,000 call center seats in the U.S. And the teams that we'll be talking about today, development and operations teams, there's about 40 of them, and we run everything from mainframes with assembly code all the way to JavaScript and Node. So really, we've got across the board all the different technologies you'll find in traditional organizations.

We have the same challenge as everyone else: getting things faster to market, higher quality, both on the software and the operations side. And just to be clear, you can do this with legacy technologies. You can innovate them. You can actually change and improve them. It's really not about the technology. It's really about a lot of the other processes and the people improvements that you can put in place.

So last year, we introduced what we called our DevOps teams model, where we collapsed together both development and operations into build/run teams. We're going to build on that today.

I'm going to go through a couple things. I'll reopen and hopefully close for good this concept of bimodal IT. Then we'll go through our DevOps journey and business and culture metrics, and we'll look at the improvements we've put in, but also what we're doing with people.

Then we introduce what we're calling the service owner model, and this is where we'll begin to challenge really some of the traditional thought processes around separating SDLC and ITIL and ITSM processes. I'll cover change, incident, and then Erica will look at post-incident reviews. Then she'll cover a team spotlight. And then we'll look at our targeted DevOps culture leadership series, where we continue to really reinforce our culture about how we're leading our DevOps transformation.

So the final thing at the bottom is product management and really what they have to say about this. Our DevOps transformation has really done incredible work to improve both speed and customer satisfaction. And if you adopt some of these things you're seeing here this week, they can really help you win in the market, and these are the types of things that I wake up every day and try to do.

All right, so pretty sure this slide will not get me invited to any big IT conferences, but this is mode two. So mode two, and this lines up pretty much with the definition of bimodal, is you run your servers and apps safely with speed and quality, right? And it's really your obligation to do that, whether or not you have systems of record, whether or not they've been deemed not innovative. They all really need to be run like they were. And you can definitely do that.

So I think the thing here then is what does mode one look like then if mode two looks like this? And here's my definition. You basically get rid of the other stuff, or you figure out how to evolve them.

So could you play this video?

It was supposed to have audio. So there we go.

So that was actually incredibly fun to do. We ran those servers for a long time, and for five years we were transitioning them out of production.

So it took about 40 minutes to recycle those servers. So imagine patching that infrastructure when you're running transactions for customers, for all the major customers in the U.S. for cable, doing pay TV, and you basically are taking 40 minutes per server to cycle. It's not very safe, right? It's extremely dangerous to do that. We take things out of production. People have to work late at night to do those types of things.

Now we've transitioned and ported and strangled off those servers, and those servers now restart in a few seconds. So it's a much safer environment by transitioning it.

So really, there is no mode one. It really has to be mode two across everything that you need to run. The bad guys don't really care that those systems have been deemed not innovative because they still want your data, and it's your responsibility to protect it.

All right. So now we'll take a look at the metrics. And this is a bit of a rehash from previous years, but the maroon line at the bottom really represents what we did with release quality through our agile transformation and what we kind of call early DevOps. By putting in things like continuous integration and automation, we basically improved it by about an order of magnitude.

But now look at the line at top. So the line at top is basically around... Can we go back a slide? I think one got skipped. Sorry, guys. Go back two slides. Yeah.

Sorry, I skipped over this one.

So this was our first one: the business growth metrics. As I talked about in the beginning, we went from 48 million subscribers to about 61 million in the timeframe from 2012 to basically the end of this year.

The other thing is the growth on our API platform. We went from 750 TPS to about 4,000 TPS. That's more than 400% growth. Basically, our customers just continue to consume our APIs. They have this really kind of insatiable appetite to use our services, and that's a good thing. You want these green lines going up because that means basically more value for your customers and more value for your company.

The next is the quality metrics. The maroon line, again, was release quality for agile transformation. But the one at the top is really interesting. It's the incidents that we actually have coming onto the platform during these years. For the most part, they hovered between 1,400 and 1,600 incidents per month.

About 2017, we saw an incredible drop in the number of incidents. That was just a few months after we introduced what we call those DevOps teams that build and run the software. So that was basically a 60% improvement just from having the same teams that build it actually operate it, see all the incidents coming in, and fix those root causes. It ends up being a forcing function to actually help fix those quality issues.

The next thing is people. So in 2014, we had two people attend this conference. So folks, if you see my good friend Steve Barr, I think he's over here somewhere. Pat him on the back because it was only the two of us then, and Steve was my operations partner in 2013. And he had the courage to really experiment and look at different ways to work. And what we did there in those early years really formed basically the pocket of excellence that became our DevOps transformation later.

So he's just done a fantastic job, and now he's actually one of our service owners. He basically runs development and operations teams that, again, build and run software for our print platform.

So the other thing we have that's going on: today there's 26. So CSG folks, there's 26 people here. That is absolutely fantastic, and I want to thank all of you guys for not only your support, but really for your willingness to continue to experiment and try new ways of working. We're doing some really challenging stuff, and I'm pretty excited about what we're doing. So thank you to all the CSG folks.

The next set of things is we have Jill Musil and Jill Edmondson. Jill Musil is from corporate IT, and Jill Edmondson is from our product management group, and they have a presentation really about enterprise work visibility. It's really a fantastic story, but not only is it about enterprise work visibility, which is incredibly important, but it's really a great story about partnership, not just across corporate IT and product management, but across a lot of the teams to really look at this problem and basically solve enterprise work management at scale.

And so finally, we have the fantastic Lisa McCarey O'Neill here, and she's out here somewhere. Hi, Lisa. She's our HR business partner. So I did look, but I just want to check in the crowd. Do we have anyone else from HR here? Anyone? We have one other person?

So why did I bring HR, and actually this was Lisa's idea, to come to this conference, to a tech conference? The reason is, it's not really about the tools and all this cool technology, it's really about the people. Now, I love some tools, but you can buy these tools, but the tools aren't going to help you in your cultural transformation. It's hard work. Involve your HR, involve the people to help you transform your culture. It's incredibly important, and that can help you win.

All right. So the next thing is the service owner model.

So for us, a service owner is this: it's the transformational leader that's accountable for really the end-to-end construction, the operation, the SLAs, the customer experience, the stewardship of business value for a product or a set of services. They really have the whole thing.

So if you look on the left, it's the traditional, basically resource efficiency model from IT. Now, this is very project-centric. It incents you to run projects across all these groups. If you guys are still doing this, I'm telling you, you have to question what efficiencies you are getting and how you need to change to actually work differently.

On the other side is basically what we have, our DevOps teams and the service owner model. So with that, we're basically organizing to run and build a service across all the resources that are required. So we take what we built on last year with T-shaped teams and T-shaped resources, and we put T-shaped leaders in place that are able to transform and basically lead teams that both build and run the software.

So incredibly important, and I think you'll see more and more of this model in place as teams look to get more efficiency and basically provide more value.

All right, so the next thing: really looking at the combination of SDLC and ITIL. So on the left here, we have our traditional model, which was really, in our case, we run SAFe, basically the Scaled Agile Framework. And on the other side, we use ITIL, right?

And you usually have your feature board, and it has all kinds of great business features on it. And then you have your operations board, which has all the other stuff. And generally, the flow really goes from left to right, and you dump things over to operations, and they run it, and things break, and then they try to fix it. There isn't great feedback that goes basically back into actually how to improve and operate and design the service.

So what we are suggesting and doing is this: as you combine the teams, basically combine the processes together of your software development lifecycle processes and your service operations and service design processes into one backlog.

So talk about work visibility. Here it is. You can now see really all of the work that actually is going into that service. Now things like continuous service improvement, which is part of ITIL, which happens sometime later after you've built the service, it becomes an activity every day. So teams at standup are looking at the issues. They see basically that they're having incidents. They see the changes that are actually coming. They can actually integrate security into the construction process.

It's an incredibly powerful model that now allows you to get what I would call CSI for free, that continuous service improvement as an everyday activity with that feedback.

Now, this isn't without its challenges, and my product managers kind of tell me this all the time, like, "Well, great, Scott, but how do I get now more blue features? My dev teams are basically spending all of their time fixing the service that they built."

Well, that's exactly the idea. You need to get that investment basically into the service operations to improve it, and this, again, is a great forcing function to basically do that.

This is also kind of the inverse of Damon's talk coming up: how do we keep this type of transformation from crushing your development capacity? In the beginning, it can really seem like that's occurring, as a lot of your backlog gets eaten up actually improving that service. But if you remember the incidents in the beginning and how we got that 60% improvement, doing things like this is actually what gave us that.

All right, so let's talk about change.

So everyone loves CAB, right? I don't think anyone raised their hand for that.

So this was a great idea, right? Get a whole bunch of senior people that are really smart, put them in a room, have them basically advise on all change going into the system. It doesn't work very well. It doesn't work very well for a lot of reasons.

One, it puts the approval furthest from the knowledge. It also creates really large batches because those senior folks basically can only get together every once in a while, and we have to batch up a whole bunch of work. At the end of the day, it increases the risk that your change is actually going to fail.

So instead, we actually recommend this: let's decentralize all the change. We're basically going to push that change into the backlogs of the teams doing it and basically have them manage it. They understand the most about that change that is going in. So why wouldn't they be the ones who understood the risk and actually how to improve it?

You also get other things like, hey, they can now start to make the system safer for change. Like, let's redesign actually how that change is going to work. Let's automate it. Let's create things like standardized work. And I really view change as basically a feature that has very low variability. And now over time, you can get that automation in place.

Now, you might take this as saying that there is no CAB, but we still have one. We still have it for changes that have really large blast areas that cross a lot of teams. And if you see Kevin Story, he's our change process manager here, you can talk to him about what we put in. He's really done a fantastic job of changing change and basically looking at different ways we can actually enable the teams. It's really fantastic.
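As a rough illustration of the routing rule described here (team-owned change by default, CAB only for changes with a large blast area), the following is a minimal sketch and not CSG's actual process. The `Change` model, the `route_change` function, and the threshold of three teams are all hypothetical; the talk does not say how "large blast area" is measured.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """A proposed production change (hypothetical model for illustration)."""
    summary: str
    owning_team: str
    affected_teams: set = field(default_factory=set)


def route_change(change, cab_team_threshold=3):
    """Decide who reviews a change.

    Assumption: blast area is approximated by the number of teams touched.
    The default threshold of 3 is an arbitrary example value.
    """
    teams_touched = change.affected_teams | {change.owning_team}
    if len(teams_touched) >= cab_team_threshold:
        return "CAB"
    # Default path: the team that knows the change best owns the review.
    return f"team-backlog:{change.owning_team}"


local = Change("Rotate TLS cert on print API", "print")
wide = Change("Replace core network switch", "network",
              {"print", "billing", "api"})

print(route_change(local))
print(route_change(wide))
```

The point of the sketch is the default: review stays with the owning team unless a change demonstrably crosses many teams.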

All right, so the next thing is support.

So, the traditional model of support: John Hall has some great writings out there on it. I don't know if John is here, but I've read his writings on this, and they're really fantastic and line up with our thinking.

So the standard model, three tiers of support, where you hand off from a help desk to product operations, maybe to a level three development, creates a lot of problems. One, it creates queues and handoffs, but it also creates organizational boundaries because those queues, they don't learn. People learn. And when you put those boundaries between them, they don't generate new knowledge to actually fix the system.

So what we're recommending is basically this, which is the swarm model, which again is covered in John's work. So in this case, basically bring everyone together for a major incident, basically on a shared bridge. Basically have them all involved and sharing information to resolve the issue as fast as possible.

So we still have a help desk. The help desk facilitates the call, has a shared whiteboard, it annotates, it works a timeline, right? But all those teams are now active in resolving that issue. The people with the expertise swarm it because they have the knowledge to basically fix the problem, and it removes those queues and handoffs, and it also removes the frustration from the customer.

So imagine a customer calling, they're having an issue, and what they realize is they're going to have to wait in three queues before they actually talk to the people who know what's going on. This actually removes that problem.

Erica?

Erica Morrison

Thanks, Scott.

So not only has our service owner model changed how we respond to issues as they occur, it's also changed how we respond to them after they're resolved.

So in the old way of doing things, often our operations team would be the one that would be resolving the production issue, and then they would have an after-action summary, or AAS, to talk through the issue after the fact. I came from the development world. We were honestly often blissfully unaware there was even a production issue. No idea what an AAS was. Our infrastructure team may or may not have been involved in this process. Obviously, we weren't applying system thinking, continuous feedback to come up with holistic solutions.

So what we do now, we have post-incident review at the team level. This is one or more DevOps teams, along with infrastructure teams if necessary, and they're talking through the timeline of what happened. It's an informal discussion. We're brainstorming on different ideas to make the system better. We're looking at, can we avoid this altogether? Can we get it resolved faster? What sort of knowledge and training can we share with our team members?

And then we have our after-action summary. This is really targeted at a different level. It's more of a summary of the issue. It's for our senior leaders, our business partners, and our DevOps teams participate here as well. It's really a summary of what's the impact to the customer and how are we going to get better.

So our service owners really are doing a lot. We've really only highlighted a few key areas of responsibility that they have. This slide shows additional responsibilities that our service owners take on, things like performance, monitoring, tech debt, people operations. So if you'd like to talk about additional details around this area, we will be having a Q&A session this afternoon at 2:35 in Imperial A.

So we've been talking about changes that we've made at the organizational level, and we really have made great strides in a number of areas. However, we've also exposed that the continual journey is not easy. So I'd like to take it to the team level for a minute, spotlight a few specific teams, see how the successes can sometimes bring unexpected challenges, and it's not a straight-line journey. So we'll talk through these teams and the successes and the challenges that they've had.

So the first team that I'd like to spotlight is the team that manages our load balancer. So we talked about this team last year and the success that this team had laying DevOps foundations, making work visible, automating manual changes, integrating with our telemetry system, and getting better visibility into changes that we were making in our production environment.

So infrastructure as code has been a big focus for this team this year. We've developed a framework, and now we're porting product by product into this framework. We've got about 20 products converted over so far. So this has allowed us to make change in a much safer fashion, where now we can, with the click of a button, deploy to production exactly what we deployed and tested to our QA environment.

To give you an idea of the scope of our manual changes, our largest manual changes took up to six hours. So that was a lot of clicking in a UI and doing a lot of work. Those weren't necessarily standard, but we were doing a lot of changes in this area.

So with this new safer model, we not only have smoother deployments, but we also have fewer outages. However, there have been some complexities along this journey as we've learned our way through this.

So first of all, the intake process takes longer now. It's safer, it's more maintainable, but there is a larger upfront cost when we do new setups.

Another issue has to do with production outages and our ability to respond to them. So in the old way, we could go and in, say, 30 seconds in a UI, once we knew what to change, we could go change it. Now we've got to check it out of source control, we've got to build it, we've got to deploy it. So we had to revisit our continuous integration system, our module layout, and optimize the flow through the system so that we could get things through here very quickly.

Another thing we had done, we developed a stopgap for teams where they could basically, with the click of a button, go do some basic things like enabling and disabling servers. Well, when we went to source code as a source of truth, what happened a couple times is we let them keep the button. We should have foreseen the issues here. But they would use the button, change the state of what was on the server, we deployed over the top of that and caused issues. So we've streamlined that.
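One way to catch the problem described here, where an out-of-band button press changes live state and a later deployment overwrites it, is to compare the declared state with the live state before deploying. This is a minimal sketch under assumptions: the server names, the `enabled`/`disabled` states, and the `find_drift` and `safe_to_deploy` helpers are hypothetical, and the talk does not say how the team actually streamlined this.

```python
def find_drift(declared, live):
    """Return {server: (declared_state, live_state)} for every mismatch.

    `declared` is the state recorded in source control at the last deploy;
    `live` is what the running system currently reports.
    """
    drift = {}
    for server in sorted(declared.keys() | live.keys()):
        want = declared.get(server)
        have = live.get(server)
        if want != have:
            drift[server] = (want, have)
    return drift


def safe_to_deploy(declared, live):
    """Deploying is only safe when nobody changed live state out of band."""
    return not find_drift(declared, live)


declared = {"web-1": "enabled", "web-2": "enabled"}
live = {"web-1": "enabled", "web-2": "disabled"}  # someone used the button

print(find_drift(declared, live))
print(safe_to_deploy(declared, live))
```

A check like this turns a silent overwrite into an explicit decision: either fold the manual change back into source control or knowingly discard it.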

Another thing that we had with rollback: just some confusion on what to roll back to. Rollback is awesome with infrastructure as code. It's very easy. But we need to know what to roll back to. So we just integrated with our telemetry system here. We said, "Hey, we're already writing what version we're deploying. Let's just write what the previous version was there," and that's really smoothed things for us as well.
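The idea of writing the previous version next to the deployed version can be sketched as follows. This is an illustration only: `DeployEvent`, `DeployLog`, and the product and version strings are hypothetical, and an in-memory list stands in for the telemetry system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DeployEvent:
    """One deployment record, as it might be written to telemetry."""
    product: str
    version: str
    previous_version: Optional[str]


class DeployLog:
    """In-memory stand-in for the telemetry store (assumption)."""

    def __init__(self):
        self._events: List[DeployEvent] = []

    def _latest(self, product):
        for event in reversed(self._events):
            if event.product == product:
                return event
        return None

    def record(self, product, version):
        """Record a deploy, capturing what was running before it."""
        latest = self._latest(product)
        previous = latest.version if latest else None
        event = DeployEvent(product, version, previous)
        self._events.append(event)
        return event

    def rollback_target(self, product):
        """Version to roll back to, or None if there is no earlier deploy."""
        latest = self._latest(product)
        return latest.previous_version if latest else None


log = DeployLog()
log.record("billing-vip", "1.4.0")
log.record("billing-vip", "1.5.0")
print(log.rollback_target("billing-vip"))
```

Because each record carries its own predecessor, whoever handles an incident can read the rollback target from the latest event instead of reconstructing history.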

We've created a synthetics framework. So now we've got a dashboard of about 1,000 endpoints, and we can ping them and basically get a red-green status every five minutes.

So we had hesitated to do this, saying, "Hey, we're routing on behalf of these other applications. These application teams know their product better than we do. They know the feature functionality. They've got tests here."

But what kept happening is we'd have change windows, and we'd be blind during the window. We'd say, "Well, how's it looking? How is it looking?" And then we had multiple times where we implemented a change, it validated cleanly, only to find that there was an issue the next day.

And so we said, "We're going to take control back ourselves, and we're going to just simply ping these endpoints," which gives us a lot better visibility than what we had. So this has greatly helped with our changes. When we have an issue, we can have a post-incident review, we can talk about what changes we need to our dashboard, and we've even had other teams use this to troubleshoot some of their issues.
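A red-green endpoint check of the kind described can be sketched in a few lines. This is a hedged illustration, not the team's framework: `http_probe`, `check_all`, and the example URLs are hypothetical, and the probe is injectable so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if the endpoint answers successfully, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except Exception:
        # Any failure (timeout, DNS error, HTTP 4xx/5xx) counts as red.
        return False


def check_all(urls, probe=http_probe, max_workers=20):
    """Probe every URL concurrently and map each one to 'green' or 'red'."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(probe, urls))
    return {url: ("green" if ok else "red") for url, ok in zip(urls, results)}


# Demo with a fake probe so the example runs offline.
def fake_probe(url):
    return not url.endswith("/broken")


statuses = check_all(
    ["https://example.invalid/billing", "https://example.invalid/broken"],
    probe=fake_probe,
)
for url, status in statuses.items():
    print(status, url)
```

In practice a scheduler would call `check_all` on the chosen interval (every five minutes in the talk) and feed the result to a dashboard. Running it before and after a change window is what removes the "how is it looking?" blind spot.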

Another thing that we've done is introduce a release cadence, where we're deploying this infrastructure as code in small batches, just like we would any other sort of code. In the early days of my involvement with this team, we touched production as infrequently as possible. It was fragile and high risk to do a lot of these changes. So we would only make a change if someone explicitly requested something like an IP change.

However, even in our early days of infrastructure as code, as we were learning our way through this, if we were doing something like changing the underlying standards, that wouldn't go to prod until the next time someone requested it. As you can imagine, a couple times things went to production, and they didn't realize that they were getting them.

So basically, we've finally turned the corner where it's safer to do these more frequently, and it's lower risk in that manner. We've now developed a risk analyzer so we can show them exactly what's changed since the last time we went to production, and they get used to this cadence.
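The talk does not describe how the risk analyzer works. As one plausible reading, here is a minimal sketch that reports what differs between the configuration last deployed to production and the candidate about to go out. The `summarize_changes` function and the configuration keys are hypothetical.

```python
def summarize_changes(last_prod, candidate):
    """List keys added, removed, or changed since the last production deploy."""
    added = sorted(candidate.keys() - last_prod.keys())
    removed = sorted(last_prod.keys() - candidate.keys())
    changed = sorted(
        key for key in candidate.keys() & last_prod.keys()
        if candidate[key] != last_prod[key]
    )
    return {"added": added, "removed": removed, "changed": changed}


last_prod = {"vip_ip": "10.0.0.5", "tls_profile": "standard-v1"}
candidate = {
    "vip_ip": "10.0.0.5",
    "tls_profile": "standard-v2",   # underlying standard changed
    "health_check": "/ping",        # new setting picked up from the framework
}

report = summarize_changes(last_prod, candidate)
print(report)
```

A report like this surfaces changes a product team did not ask for, such as an updated underlying standard, before they reach production rather than after.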

So I think a great metric is how much sleep I get on the nights we make these changes. So when we first started doing these and I first started getting involved, I can tell you I never slept through the night on any of these major changes. I was often on the call when we were doing the change itself. If I wasn't on the call, I was checking my phone every hour, waking up, seeing how things were going.

The first time that we deployed everything to production with infrastructure as code, I slept through the night.

Thank you.

We're also evolving towards self-service, so we're coming up with lighter-weight solutions where teams can do more themselves, and we're supporting cloud.

Next team that I'd like to talk about is the team that manages our monitoring and alerting solution. So this team has really experienced unprecedented growth this year. We've got lots of users of these systems: DevOps teams, internal business partners, our help desk, and our customers. And as more and more people have seen the value of this centralized telemetry system, the request to get products on here has really gone through the roof. So much so that we've outpaced our ability to scale our capacity.

So this year we've made huge changes and grown the capacity we can handle, with lots of improvements to the system. At the foundational layer, we've added and separated out the infrastructure. At the software layer, we've made similar changes, separating things out so we can scale the individual components. We're also partnering with our third-party vendors and looking at our ever-changing operational footprint, things like node allocation, indexing strategy, et cetera.

We've introduced blue-green deployments, where now the biggest value of this for us is if we get into an unhealthy state, we can flip over to an environment that's healthy, focus on current data moving forward. Improved fault tolerance between the components. We had tight coupling here, so we've decoupled that to the best of our abilities.
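The failover use of blue-green described here, flipping to the healthy environment when the active one degrades, can be sketched as below. This is an illustration under assumptions: `BlueGreenRouter` and its injected health check are hypothetical, and a real switch would happen at a load balancer or DNS layer rather than in a Python object.

```python
class BlueGreenRouter:
    """Tracks which of two identical environments receives traffic."""

    def __init__(self, health_check, active="blue"):
        # health_check is any callable: environment name -> truthy if healthy.
        self._health_check = health_check
        self.active = active

    @property
    def standby(self):
        return "green" if self.active == "blue" else "blue"

    def ensure_healthy(self):
        """Flip to the standby only if active is unhealthy and standby is healthy."""
        if not self._health_check(self.active) and self._health_check(self.standby):
            self.active = self.standby
        return self.active


# Simulated health state: blue has gone unhealthy, green is fine.
health = {"blue": False, "green": True}
router = BlueGreenRouter(health.get)
print(router.ensure_healthy())
```

The guard on the standby's health matters: flipping to an environment that is also unhealthy would only move the outage.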

We're focusing on infrastructure as code and cloud, so we've developed cookbooks here, and this is likely going to be the first product that we take to the public cloud early next year. Finally, we've improved visibility around system usage. One of the biggest things that's bitten us is huge, unexpected volumes. So now we can better see abuse of the system, new products coming on, et cetera.

This team's been an interesting use case this year. They're one of our more mature DevOps teams. They were really doing DevOps before we'd fully embraced it as a company. They're really just a bunch of developers that happen to support an operational environment. However, they've been victims of their own success a little bit with this product that so many people like and want to have.

We're constantly operating at capacity. So we'll make an architectural change to get us more capacity, more breathing room, and it's like two or three weeks later, we just can't keep the next product from coming onto the system, so we're right back to at capacity. So as you can imagine, when you're operating at capacity, you're fine as long as nothing goes wrong. And we all know from the operations world that nothing ever goes wrong.

So as a result, we firefight more than I'd like. This firefighting, again, we have post-incident reviews. We've identified many improvements we've made to the system, which is great, but it has detracted from our roadmap of where we want to go to get to true enterprise scale. So we do have a good plan. I know we'll get there. It's just been a matter of getting the work prioritized.

Last thing I want to talk about is our targeted DevOps culture focus. So 18 months ago, when we reorganized around DevOps, we basically jumped right in and just started doing and let the methods speak for themselves. This was intentional. However, we took a step back and said, "Hey, we also need to follow this up with the why and the how and set the vision for the entire organization so everybody gets it."

So first of all, we came up with what does DevOps mean to CSG on one side, and I don't know if you can read this, but basically some key areas for us: customer focus and delight, people, and modernization and process.

We knew we also needed to have a forum for discussion and a venue to share our success stories. So one of the things we did, we introduced a DevOps leadership series. This is a monthly meeting where we meet with our leaders, we have a topic, and we also celebrate our wins.

We extended our DevOps community of practice. This is a meeting for basically our practitioners. We talk about things like tooling. For instance, this month's topic is a COBOL unit testing framework. We're holding book clubs at the team level, and we're also participating in our local DevOps community.

We've identified that there's a lot more that we need to do here. This is a gap for us. We're partnering with our HR team. This leads into some help that we're looking for, which I'm going to talk about on the next slide.

So in summary, Scott showed the graph that shows the tie-in between our business and culture metrics along our DevOps journey. Our service owner model has really changed how we approach a number of things by having one person take a system view across the software development lifecycle, change, and incidents, both when they occur and after they're resolved.

We've walked through a few examples of some teams that have shown the ups and downs of this journey along the way, and we're starting to have a better targeted DevOps culture focus.

So the help that we're looking for really centers around this culture space. We're looking for your best practices, your ideas, and your success stories around these topics: things like continuing to build DevOps culture and cross-skilling, penetrating the next level of leadership, and building consistency of message.

So we're really looking forward to three days of learning around these topics and many others. Thank you.