More Culture, More Engineering, Less Duct Tape v3.0
Scott Prugh, Chief Architect & VP Software Development. Scott supports the North America Development teams that deliver CSG’s hosted Billing & Customer Care Platform. Scott has broad experience across development and operations functions from startups to large enterprises. Scott is a Lean enthusiast and his mission is to help others learn and improve their environment to maximize value delivery to customers. Previously, Scott was CTO of Telution and built the core runtime and billing architecture for the COMx product suite. Scott lives in Chicago with his wife and 3 kids. In his spare time, he perfects pizza, enjoys wine and code.
Erica Morrison, Executive Director, Software Development. Erica’s teams provide software solutions to CSG’s 40+ development teams. These solutions range from continuous integration frameworks to reusable libraries to telemetry visualization platforms. Erica is passionate about agile and has experience leading DevOps teams where members own the end-to-end infrastructure and code. Erica also has software development experience in the defense and aerospace industries where she worked on projects such as the replacement for the space shuttle. She lives in Omaha, Nebraska with her husband and two kids.
Full transcript
Erica Morrison
My name is Erica Morrison. I'm an executive director of software engineering at CSG.
Scott Prugh
I'm Scott Prugh. I'm SVP of engineering, and I support both software engineering and operations for our North American products.
Both of our backgrounds are really from the software engineering side, of really growing up building systems, and we're kind of new to operations, as we had a major reorganization over the last couple of years to bring together both our software engineering groups and our operations groups under one set of leaders.
Erica Morrison
So a little bit about CSG. We are the largest SaaS-based customer care and billing provider in the United States. Some of our customers include some of the United States' largest cable companies, and we do billing for them, as well as back-end processing.
Our customers have over 61 million subscribers, and we've got a footprint in over 150,000 call center seats. We support all of this with over 40 development and operations teams, with a tech stack that really runs the gamut, everything from mainframe to JavaScript. And we're a traditional enterprise software company in many regards. We probably have a lot of the same challenges that you all have around time to market and quality of software solutions.
So Scott and I have chronicled CSG's DevOps journey the last several years at DevOps Enterprise Summit in San Francisco. And as Scott mentioned, we've made a major change to DevOps teams that follow the model of you build it, you run it. And so we want to build on that foundation today and talk about where we've come since we made that major change.
Some different things that we're going to talk through today: our opinion on bimodal IT, the metrics that show what's happened with our journey, our service owner model, and how that's impacted how we do different things with our software development life cycle, change and incident management. We're going to talk through a few specific teams, see what the journey's been like for them, and then wrap with some items on culture.
Scott Prugh
All right, so the first thing I want to hit on is this concept of bimodal IT. My definition of mode two is really pretty simple: your servers and your apps, they're run safely with speed and quality.
So quick question: who here is in IT? Raise your hand.
All right, who here is in security? Everyone should still have their hand up. If you are in IT, you're in security. So if for nothing else, you actually need to run your servers and your infrastructures this way. Bad people, they want your data, and it's your obligation to run your infrastructure in a safe way that you can update and patch.
So now let's talk about mode one. This is my definition of mode one. You take your servers out to the parking lot, and you use a pickaxe or a sledgehammer. I think Chris Hill said this morning that bare metal is hard. Well, this is one way to take care of it.
The point is that you actually have to strive to really run all of your servers efficiently, fast, and patch them.
So this was a fun day that we had. It used to take us 40 minutes to restart these servers, get the applications running. So imagine how dangerous that is for our operators in those environments, to wait as they restart 100 servers, 40 minutes at a time, as they rotate through those. Extremely dangerous.
Now they do it in seconds, or they deploy virtual ones and have them ready so they can switch traffic over to it. Very different.
At ChefConf this year, and I saw Nathen Harvey here, one of the key things they talked about was the ability to go fast, be stable, and cost less, and also be secure. It turns out that you can have all of those. As Nicole mentioned this morning, there really are no excuses. Using the tools and techniques that we talk about here today, you can have all of those characteristics in your systems.
One of the things that our colleague Carter McHugh presented at ChefConf was this idea of continuous compliance and getting out of compliance theater. We use tools like Chef and InSpec to basically go across our servers and confirm compliance every 15 minutes, so we know exactly what's on them. We know if anything gets onto the servers that we didn't intend.
That's very different than what had to happen before, where we would spend almost 20,000 hours for every PCI compliance audit to basically check what was on those servers and try to keep them in spec. Extremely difficult. We reduced that effort by over 80% to under 4,000 hours now with the tools and techniques here, and we have really near-perfect compliance on all of that infrastructure.
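To make the continuous compliance idea concrete, here is a minimal sketch of a drift check in plain Python. It is not InSpec or Chef code and does not reflect CSG's actual controls: the `ServerState` shape, the package-only baseline, and the server names are all assumptions made for illustration. Only the 15-minute cadence comes from the talk.

```python
from dataclasses import dataclass, field

# The talk mentions a 15-minute cadence; a scheduler would call scan() this often.
SCAN_INTERVAL_SECONDS = 15 * 60


@dataclass
class ServerState:
    """Observed state of one server (hypothetical shape, packages only)."""
    name: str
    packages: set = field(default_factory=set)


def find_drift(baseline, server):
    """Return packages that are unexpected or missing relative to the baseline."""
    return {
        "unexpected": sorted(server.packages - baseline),
        "missing": sorted(baseline - server.packages),
    }


def scan(baseline, servers):
    """Map each non-compliant server name to its drift report."""
    report = {}
    for server in servers:
        drift = find_drift(baseline, server)
        if drift["unexpected"] or drift["missing"]:
            report[server.name] = drift
    return report


# Illustrative data only: web-02 carries a package that is not in the baseline.
baseline = {"nginx", "openssl"}
fleet = [
    ServerState("web-01", {"nginx", "openssl"}),
    ServerState("web-02", {"nginx", "openssl", "netcat"}),
]
print(scan(baseline, fleet))
```

Running a check like this on a fixed schedule, instead of once per audit, is what turns compliance from a periodic manual effort into a continuously answered question.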
So I want to talk a little bit about our metrics. We have a concept we call release impact. Before we started this journey, this release impact, which is basically the severity of the incidents multiplied by the number of incidents that a release actually causes, gave us a score of about 507.
We went through our Agile transformation. We did a lot of things that we previously documented, reducing batch sizes, applying automation, automated testing to the environment, and we reduced that by over 80%, down to 85. That was a pretty good improvement.
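One way to read the release impact score is as a severity-weighted incident count. The sketch below assumes that reading. The severity labels and weights are hypothetical, since the talk does not give the actual scale or formula.

```python
from collections import Counter

# Hypothetical severity weights; the talk does not state the real scale.
SEVERITY_WEIGHTS = {"sev1": 10, "sev2": 5, "sev3": 1}


def release_impact(incident_severities):
    """Sum of (severity weight x number of incidents at that severity)."""
    counts = Counter(incident_severities)
    return sum(SEVERITY_WEIGHTS[sev] * n for sev, n in counts.items())


# Illustrative release: one sev1, two sev2, and one sev3 incident.
print(release_impact(["sev1", "sev2", "sev2", "sev3"]))
```

A score built this way drops both when releases cause fewer incidents and when the incidents they do cause are less severe, which matches the effect of smaller batches and more automated testing.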
But now let's look at some things that happened with our DevOps transformation. On a monthly basis at our peak, before we did our DevOps transformation, we were getting some 1,640 incidents per month into the environment. Those were reported by our customers. After we went to build-run teams, collapsed to teams that both build the software and run it, we saw a dramatic drop to about 633 incidents per month. That's a 61% improvement, and we presented that at DevOps in San Francisco last year.
I just pulled the metrics yesterday, and we see continued improvement there. We've dropped to about 427. That's a 74% improvement in the number of incidents that are actually found by our customers. So now when the teams are accountable and responsible for the software that they build and run, they improve the environment continuously.
Finally, let's look at our growth metrics. When we started this journey, we were just under 49 million subscribers on the platform. We grew to 61.4 million now. That's about a 26% increase.
TPS on the platform had incredibly impressive growth. It went from 750 TPS in the beginning of this journey to over 4,000 today. That's more than a 400% increase in traffic on the platform. None of that would've been possible without the techniques that we put in, because the increased volume and load really required us to change not just the architecture, but the underlying infrastructure, how we deployed, and how we run the software.
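The percentages quoted in this section can be checked with a few lines of arithmetic. The inputs below are the figures from the talk. Two of them are approximate by the speaker's own wording ("just under 49 million" and "over 4,000"), so those results are rough bounds.

```python
def percent_change(before, after):
    """Signed percentage change from before to after."""
    return (after - before) / before * 100


# (before, after) pairs taken from the talk.
metrics = {
    "incidents per month, after build-run teams": (1640, 633),
    "incidents per month, latest": (1640, 427),
    "subscribers in millions (start is approximate)": (49.0, 61.4),
    "TPS (end value is a lower bound)": (750, 4000),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {percent_change(before, after):+.1f}%")
```

The incident figures reproduce the quoted 61% and 74% reductions. The subscriber result depends on the exact starting count, and the TPS figures work out to roughly 433%.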
Erica Morrison
So I want to touch on the people aspect for a minute. Many DevOps transformations only focus on the engineering part of the organization, so development and operations. But what we've found is it's really important to involve your entire organization, everyone from HR, to finance, to security.
How many of you in here are in HR? Anybody? I saw one hand go up there. That's awesome.
HR is a really important part of your cultural transformation, because as we all know, there's a major cultural aspect of this.
And anybody from product management? Great. We see several hands.
Product management, they're essential stakeholders for us, and they need to understand not only the software that we're developing, but how we're doing software. So they understand things like configuration management. They're going to need an appreciation for these things as we move forward in our transformations.
What you see here is a picture of a team that we've created. We call it our One Team, and this is the leadership from DevOps and from our product management team, and we get together frequently. We talk about different things within the software space.
In this picture here, you see us talking about Dominica DeGrandis' great book, Making Work Visible. If you haven't read this book, I highly recommend it. And there's a quote here that you see on this slide from an executive director within our product management team, talking about how our DevOps transformation has allowed us to serve our customers better.
Scott Prugh
Let's talk a little bit more about our service owner model and basically how we view DevOps across a couple dimensions.
First, the service owner we consider the single transformational leader that's accountable for the end-to-end construction, operation, the SLAs, the customer experience, and stewardship of business value for a product or set of services.
The first couple of dimensions are team and people shape, and then optimization and efficiency. In traditional waterfall and project-based organizations, we have a lot of I-shaped resources that are really focused around roles, and there's a big focus on resource efficiency, or really trying to keep people busy, focused on specific tasks to their resource type.
In Agile, we really looked and went to what I call small T-shaped teams, or teams that build the software but have cross-functional skills like development and test on the same teams. For optimization and efficiency, we look more towards flow, trying to flow work through the organization, but I do consider it turbulent flow. The reason is that you're flowing the construction process and that's becoming more efficient, but you're handing software over to operations, which can still be very turbulent as you drop software on the operations teams that actually have to run production.
Finally, when we get to DevOps and service owners, we get to big T-shaped resources and teams. These are teams that really have development, testing, analysis and design, UX, and they even have on the teams the operations folks that run it, and they act as a full team to basically build and support the software.
And then from an optimization perspective, we're really looking at flow and knowledge-sharing on the teams, and also storing a lot of that knowledge in code, across the teams, both intra-team and inter-team, across teams in the organization to share that knowledge.
Next, from a process and leadership structure. In waterfall, we actually saw very role-specific processes and very role-specific leadership structures, vertically organized in the organization.
In Agile, the processes that we saw were really SAFe versus ITIL-based processes. And then we saw dev versus ops, really the opposing forces of development trying to go fast and operations trying to slow things down.
Finally, when we get to DevOps and service owners, we actually collapse SAFe and ITIL together, basically making those processes really blend together on the teams. And then we have basically DevOps and service ownership, and also look for transformational leaders for our teams to continue to transform.
And then finally, work management. In the waterfall space, you saw a lot of tool silos, a lot of work being handed off through documents, whether it's Word documents, Excel spreadsheets, or tickets sent to other teams to do work. And queues are everywhere, really across the organization.
In Agile, we still have those tool silos, but we see the team collaboration from the software construction process. But also we've got a lot of queues. We especially have queues to the operations teams to either provision infrastructure, deploy infrastructure, update databases, reboot servers. Those are all really ticket request queues.
Finally, in DevOps and service owners, we look to have a unified backlog. Instead of having separate backlogs for production incidents and defects and changes and the features that you're building, we unify those together.
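As a rough illustration of a unified backlog, the sketch below keeps incidents, defects, changes, and features in a single priority queue instead of four separate lists. The ordering by work type is an assumption for the example. The talk says the backlogs are unified but does not say how items are ranked.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical ranking by work type (lower number is pulled first).
TYPE_PRIORITY = {"incident": 0, "defect": 1, "change": 2, "feature": 3}


@dataclass(order=True)
class WorkItem:
    priority: int
    sequence: int  # preserves arrival order among items of equal priority
    title: str = field(compare=False)
    kind: str = field(compare=False)


class UnifiedBacklog:
    """One queue for every kind of work a team owns."""

    def __init__(self):
        self._heap = []
        self._sequence = count()

    def add(self, kind, title):
        item = WorkItem(TYPE_PRIORITY[kind], next(self._sequence), title, kind)
        heapq.heappush(self._heap, item)

    def next_item(self):
        """Pop the highest-priority item, or return None when empty."""
        return heapq.heappop(self._heap) if self._heap else None


backlog = UnifiedBacklog()
backlog.add("feature", "Add invoice export")
backlog.add("incident", "Payment API timeouts")
backlog.add("change", "Rotate TLS certificates")
print(backlog.next_item().title)
```

The point of the single queue is that production work and feature work compete for the same team capacity in one visible place.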
We also look to have this end-to-end team collaboration and also automation and self-service. We use tools like Chef. We use great tools like Rundeck to basically provide operations as a service to other teams, and those things remove the queues so that now we remove those wait times and that feeling of helplessness across groups.
Other SO responsibilities cover things like SLAs, monitoring performance, the people operations, basically the hiring and caring for the people, standards, tech debt, incident response, and security.
All right. Let's talk about change. How many people have a CAB in their organization here? Raise your hand. How many people like it? There's one? I want to talk to you later.
The concept in ITIL is to have a change advisory board, and one of the big problems with that is that it really puts approval the furthest from the people that have the knowledge. The thought is that you can create an advisory board that can validate all the changes going in.
It does remove the accountability from the responsibility and increases batch sizes, makes really large batches. These people are hard to get hold of, so we batch stuff up, give it to them for approval, and then release it into production. Extremely dangerous.
And we heard from Nicole today, CAB is pretty much useless. We actually want to rethink how we do that.
So let's look at what we do. We've pretty much gotten rid of the idea of CAB in our organization. We still hold a CAB for really critical changes, things that happen at the network level. It usually lasts about five minutes. It's more to inform people what's going on and to prevent things from colliding.
Otherwise, we decentralize change all the way to the teams that are responsible and accountable for that change. So change is just a feature with low variability that goes into their backlogs. During standup, they talk about the changes they're going to put in today, make sure they have the testing, the validation, those types of things, make sure they can roll back. But it's those teams' responsibility because they know the most about the systems at that level to put that in to do it safely.
All right, so finally on support. Traditionally, the standard three-tier support model is to really keep the support activities away from the teams who actually built the software. The whole idea is that they're precious resources. We don't want to distract them on a day-to-day basis.
The problem with that is it really creates these very elongated recovery times. You have queues which prevent knowledge-sharing really from that level one help desk to a level two product operations to level three. The level three development teams really never experience or hear about actually what's going on in production. And it really puts that issue resolution the furthest away from the knowledge.
So we do something very different. We have basically what we call an incident swarm model. We bring together basically all of the people to support the product when there's an incident onto one call to support that. You get all of that expertise on the call. It removes those queues, and it really gets rid of those handoffs. It reduces the time to resolution, and it maximizes knowledge-sharing on those teams.
We really want to swarm and solve those problems as fast as possible and also create new knowledge in the organization to continue to improve.
Erica Morrison
So we've had a lot of successes with our DevOps transformations, but we've also seen that some of these successes can bring unexpected challenges, and it's definitely not a straight-line journey.
We'd like to spotlight two specific teams, talk you through what their world has looked like through this transformation, to give you an insider view on what we've been going through.
The first team that I'd like to talk about is the team that manages our load balancer. Historically, this team was very operationally focused, and they were immature from a DevOps adoption standpoint.
Over the last several years, we've made a number of strides on their DevOps journey. We've automated manual work, we've made work visible, we've integrated with our telemetry system. And this slide here focuses on the work that we've done really over the last year.
Expanding on infrastructure as code is a really big transformational change that we've continued with. Our deployments were very manual in nature. To give you an idea, our largest changes that we implement in production could take up to six hours in length.
Obviously, we didn't have a good repeatable way to test in our lower environments, roll out to production, roll these changes back if we needed to. And this team makes a lot of changes, 20 to 30 production changes a week. As you can imagine, success of change and stability of the production environment were pain points for this team.
So what we did is we developed an infrastructure as code framework, and we began porting product by product over into this framework. We've got over 100 configurations moved here today. Now we've got the ability, button-click deployments to QA, do the exact same thing in production, roll it back if you need to.
So this has been awesome for us. We've been able to develop a unit test framework around this, so now we can catch issues upstream. Now we're talking the same language. We make changes on behalf of other teams a lot, so now we're talking source code, which is great. And we've been able to integrate with our telemetry system, so now I can see exactly who's deployed what and when they've deployed it.
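Once load balancer configuration lives in source code, it can be validated before it reaches any environment. The sketch below shows what such a unit test might look like. The config schema, the rules, and the names are hypothetical and are not taken from CSG's framework.

```python
import unittest


def validate_pool(config):
    """Return a list of problems found in one load balancer pool definition."""
    errors = []
    if not config.get("name"):
        errors.append("pool needs a name")
    members = config.get("members", [])
    if len(members) < 2:
        errors.append("pool needs at least two members for redundancy")
    for member in members:
        port = member.get("port")
        if not isinstance(port, int) or not 1 <= port <= 65535:
            errors.append(f"invalid port for member {member.get('host')}")
    if not config.get("health_check_path", "").startswith("/"):
        errors.append("health check path must start with '/'")
    return errors


class PoolConfigTest(unittest.TestCase):
    def test_valid_pool_has_no_errors(self):
        pool = {
            "name": "billing-api",
            "members": [
                {"host": "10.0.0.1", "port": 8443},
                {"host": "10.0.0.2", "port": 8443},
            ],
            "health_check_path": "/health",
        }
        self.assertEqual(validate_pool(pool), [])

    def test_single_member_pool_is_rejected(self):
        pool = {
            "name": "billing-api",
            "members": [{"host": "10.0.0.1", "port": 8443}],
            "health_check_path": "/health",
        }
        self.assertIn(
            "pool needs at least two members for redundancy", validate_pool(pool)
        )
```

Tests like these can run with `python -m unittest` in the build, so a bad definition fails upstream instead of during a production change.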
Overall, it's increased the success of our changes greatly. However, we were an early adopter for CSG with infrastructure as code, so we've had some bumps along the way, some lessons learned.
First of all, in the case of a production outage in the past, once we knew what the issue was, we could quickly pull up the UI, make a change, and immediately, just like that, it was fixed. But now that we had it in source code, we had to wait for it to make its way through the build system, through some other processes. We had to revisit how we were doing some things to streamline things through and get them through the system faster.
We had also developed some self-service capabilities for teams. Because it wasn't in source code, we basically developed some workaround capabilities, and teams really liked these. So when we went to source code as a source of truth, we let those coexist for a while. In retrospect, we should have seen it coming, but they stomped on each other. And so that was another lesson that we learned.
And then finally, setups just take longer up front now. They're more maintainable, they're safer, but they do take longer. We just need to know to expect that. Again, being an early adopter, we fed this back into the enterprise as other teams are adopting infrastructure as code.
We've created a synthetics framework, so now we've got a dashboard of over 1,000 endpoints. We ping these every five minutes. It gives us system health information on all of our endpoints, and it also helps us when we're validating our changes.
Previously, when we were doing changes, we were dependent on other teams to validate those for us when we were making them on their behalf. We didn't want to reinvent the wheel, and so we really struggled with how to handle that. But just pinging these endpoints is a really great medium solution for us, and it's allowed us to improve our change as well.
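A synthetic check of this kind can be sketched in a few lines. This is an illustration using only the Python standard library, not the team's framework. The URLs are placeholders, and the probe is passed in as a parameter so the example runs without touching the network.

```python
import urllib.error
import urllib.request

# The talk mentions a five-minute cadence; a scheduler would use this interval.
CHECK_INTERVAL_SECONDS = 5 * 60


def http_probe(url, timeout=5.0):
    """Return True if the endpoint answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        return False


def check_endpoints(urls, probe=http_probe):
    """Probe every endpoint and return a mapping of url to health."""
    return {url: probe(url) for url in urls}


def summarize(results):
    """Condense probe results into the numbers a dashboard would show."""
    down = sorted(url for url, ok in results.items() if not ok)
    return {"total": len(results), "healthy": len(results) - len(down), "down": down}


# Offline demonstration with a fake probe and placeholder URLs.
def fake_probe(url):
    return "billing" in url


endpoints = ["https://billing.example.test/health", "https://care.example.test/health"]
print(summarize(check_endpoints(endpoints, probe=fake_probe)))
```

Comparing a summary taken before a change with one taken after it gives the team making the change a quick validation signal of its own.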
We've introduced a release cadence. We're releasing code in small batches. I mentioned that we go to production very frequently, but we don't touch everything all the time. We want to follow lean best practices and deploy code in small batches frequently and refresh that code.
And then we're evolving towards self-service. We do not want to be a central bottleneck team. That's the model this team originally was, and so we're moving away from that. Now that things are in source code, teams can start to make changes on their own behalf.
The next team that I want to talk about is the team that manages our monitoring and alerting platform. This team's an interesting use case because they were already fairly mature from a DevOps standpoint, but they've experienced unprecedented growth the last several years with their system.
As more and more people have seen the value of this centralized telemetry system, requests to get more data in the system have outpaced our ability to scale the system. We've constantly operated very near capacity, which can lead to stability challenges. As long as nothing goes wrong, we're great. But as we all know from the operations world, sooner or later something does go wrong.
And we've made changes to increase our capacity, and as soon as we do that, somebody comes in and adds more data and gobbles up that capacity.
Some of these incremental changes include adding and separating out our infrastructure and software, basic things like adding VMs and splitting out our software components. We've improved fault tolerance between the components, so we've removed the tight coupling that we had here.
We've tuned third-party components, so we're continuing to partner with our vendors, working with them to respond to our ever-changing operational footprint. And we've improved visibility around system usage. So now we've got telemetry data on our telemetry system. This is really important so we can see who's sending us what.
However, we knew we needed to make a larger change at some point. We're very excited to have migrated to public cloud using infrastructure as code very recently here. On May 11th, we moved 40 back-end servers to AWS, and we did this in an automated fashion using Terraform and Chef. And we didn't just automate spinning up the servers. We also have to be PCI compliant because we process credit card data.
So we can prove through these Chef InSpec tests that we are meeting CIS hardening standards, which satisfies our auditors, and it also makes sure that our security partners internally are on board with the solution.
This has given us a scale that we really just couldn't achieve on-prem due to our ever-increasing needs for compute and for storage. The lead time and cost associated with doing those was just something that we were struggling with.
And it also improves our patching story. Patching used to be babysitting one server per night for 15 servers, which was just super painful, and it's because we were so close to that capacity all the time. But now, through a modified blue-green approach, we have largely automated the patching of these servers.
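The talk does not say what the modifications to blue-green are, so the sketch below shows only the general pattern: build a patched replacement, verify it, and switch traffic only if it is healthy. The `provision` and `health_check` callables are hypothetical stand-ins for whatever tooling actually builds and validates servers.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    patch_level: int


def patch_blue_green(blue, target_level, provision, health_check):
    """Return (serving, retired) after attempting to replace one server."""
    green = provision(f"{blue.name}-green", target_level)
    if health_check(green):
        return green, blue  # switch traffic to the patched server
    return blue, green      # keep serving from the old one, discard the new


def patch_fleet(fleet, target_level, provision, health_check):
    """Patch servers one at a time and return the list now serving traffic."""
    serving = []
    for blue in fleet:
        active, _retired = patch_blue_green(blue, target_level, provision, health_check)
        serving.append(active)
    return serving


# Illustrative run with trivial stand-ins for provisioning and health checks.
fleet = [Server("telemetry-01", 1), Server("telemetry-02", 1)]
patched = patch_fleet(
    fleet,
    target_level=2,
    provision=lambda name, level: Server(name, level),
    health_check=lambda server: server.patch_level == 2,
)
print([(s.name, s.patch_level) for s in patched])
```

Because the old server keeps serving until its replacement passes the health check, patching no longer depends on having spare capacity to take a server out of rotation.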
This was the first public cloud rollout of its type for us, so it's really been a culmination of months of coordination across DevOps teams, network, platform, security. We fought through many issues, like how do we create a reusable machine image that meets all of our security requirements, and how do we integrate with our on-prem server inventory tooling? So this now serves as a blueprint that other teams that follow in these footsteps will be able to follow.
Scott Prugh
I want to talk about one area where we did have to take a little bit of a look back. We're big fans of what's called Shook's law, or Shook's model of change, where basically to change culture, you change behavior, or how you act, first. David Marquet's book, Turn the Ship Around!, also covered this, and he called it "act your way to a new way of thinking."
When we went through our DevOps transformation, we actually did it pretty quickly. Within a few weeks, we reorganized over 700 people across 40-plus teams into the build-run teams.
After doing that for about a year, we did see substantial improvements, but we realized that we hadn't done the best job of really telling people a lot of the whys and really going about what the true vision was of what we wanted.
So we did a bit of work to really set the vision. Really what we were doing was trying to get to one-by-one feature flow, 100% value add, employee and customer delight, and then security and safety for both our people and our customers. And that's both safety in the environment where we actually work, and it's also psychological safety, where it's acceptable to make mistakes.
After setting that vision, we realized that we also needed to provide forums for discussion and learning. So we created what we call our DevOps Leadership Series, where we come together once a month and cover a meaty topic, whether it's DevSecOps, audit and compliance, or infrastructure as code automation. Those are all things we bring out into our leadership series across the groups, and we really try to bring in as many people as possible to really understand the why behind what we're doing.
We also expanded our DevOps community of practice. This is the practitioners that are really looking at things like how do we do testing, automated testing, business synthetics? How do we actually do deployments, and how do we actually share how to do best practices around blue-green? These are the practitioners coming together and really sharing their practices as a group across the organization.
We also hold book clubs. We distributed books like The DevOps Handbook, Toyota Kata, The Phoenix Project, Making Work Visible, and cover those in book clubs with our groups and actually share learnings from all of those.
We also think it's really important to participate in the local community, the DevOps community. That brings new knowledge into our organization, but we also cherish the opportunity to share what we know with other groups and learn from them.
Erica Morrison
So quick recap. We talked about bimodal IT. We showed some metrics on our DevOps journey. We talked through our service owner model, a few specific teams, and a culture focus.
And this is actually the third time Scott and I have done this presentation, and it's fun to every time go back and see what we've evolved to from the last time as we continue to have new improvements and new stories to tell with this.
So the future direction and help that we're looking for, big focus areas for us moving forward. Public and private cloud transformations will be really big for us, and we know we're a little late to the game with moving to public cloud, but we really are excited about the momentum that we're gaining in this area.
We've laid the infrastructure. We've got Chef infrastructure in place. We have a team to support this now. We've had our first public cloud rollout. We have a private cloud rollout that's running in production as well, and we think that we'll continue gaining a lot of steam here over the next few years.
The Work-Life Balance Initiative: one thing that our teams gave us feedback on as we've gone through this transformation is that we needed to do a better job with work-life balance. So what we've done this year is dedicate 15% of each team's time to working towards improving work-life balance.
This is a team-driven initiative, meaning they get to figure out what their biggest pain points are and address them. For one team, this might be having fewer middle-of-the-night pages. For another team, it might be automating manual work.
And then finally, continuing to partner with our security and compliance team, making them a part of everything that we do. Like Scott said earlier, we are all developers or operations engineers, and we're all in the security space as well. That's part of all of our jobs.
Overall, we want to spread culture, invest in engineering, and shift ops left. Thank you.