DevOps & Modernization - An Engineering Excellence Story

Las Vegas 2019

Chief Architect & SVP Software Engineering · CSG

In our past talks we covered both the radical people and process changes at CSG. This started with our Lean/Agile Transformation in 2012, followed by our DevOps Transformation in 2016 and integrating Product Management into DevOps thinking in 2018.

This year we will cover the great modernization work we have done at CSG. As a company with a strong culture of engineering excellence we saw it vital not to act like a legacy company. As part of this, we knew we had to invest in modernization and remove technical debt. In 2010 we set out to completely modernize and transform our technology and application stack. This included Foundational Modernization such as: E2E Version Control, Automated Testing, Telemetry and Infrastructure Modernization.

We then focused on a multiyear effort to modernize our application stack by moving to commodity and OSS. We also have made great strides to modernize our mainframe technology.

Scott Prugh, Chief Architect & VP Software Development. Scott supports the North America Development teams that deliver CSG's hosted Billing & Customer Care Platform. Scott has broad experience across development and operations functions from startups to large enterprises. Scott is a Lean enthusiast and his mission is to help others learn and improve their environment to maximize value delivery to customers. Previously, Scott was CTO of Telution and built the core runtime and billing architecture for the COMx product suite. Scott lives in Chicago with his wife and 3 kids. In his spare time, he perfects pizza, enjoys wine and code.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

The next speaker is Scott Prugh, who we know very well. He's presented every year here at the DevOps Enterprise Summit, and he's from CSG, a company that prints over 70 million bills monthly. And over the years, he's presented with Erica Morrison, Executive Director of Software Development Operations. Last year, he presented with Brian Clark, the VP of Product. But this year, I asked him to present on something that has amazed me from the very beginning when I first met him. Specifically, how has CSG re-engineered some of the largest portions of their technology stack, including parts of the application that were written over 40 years ago at First Data Corporation before they were spun out? It is one of the most heroic engineering stories I've ever seen.

So please welcome again, Scott Prugh.

Scott Prugh

Gene, thank you for that wonderful introduction. I'm glad to be here. This is a really great opportunity to tell a fantastic story at CSG. But I want to be clear, I'm just the storyteller. The credit really goes to the great engineering teams, the product management teams at CSG that allow us the ability to continue to improve our environment, the great infrastructure teams. And so I have two asks. We have a good number of people at CSG here today. We have more than actually the company.

So when you see a CSGer, congratulate them on really the fantastic work that they've done, not just here that we're going to talk about today, but over the last few years, transforming the way that we work. The second ask is engage them. Ask them about what it's like to work in a high-performing DevOps organization that has really transformed from legacy ways of working to modern ways of work.

So I have to recognize someone. So Erica Morrison, I don't know, she's probably up there. I saw them sitting up there. She's been my partner in these presentations the last couple of years, and she has done just some fantastic work. Erica, I want to recognize you and everything great that you've done, not just for CSG, but contributing to the community and how we transform. Please give Erica a round of applause. So folks that follow me on Twitter know that I practice fridgeban.

So this is my fridge. And my wife loves it when I lay things out on the fridge and plan out what I'm going to do. But really the last few years, in DOES 2014 through 2018, we talked about really the people, the process, the cultural components and changes that we've gone through. This year, as Gene mentioned, we're going to talk about our technology journey, which really underlaid a lot of that, but we never really went into the details of what happened there in this transformation. There's some pretty fantastic stuff.

So I'm going to recap real quickly the last couple of years.

So in DOES 2014, we really talked about our agile transformation in 2012, which kicked off, and we basically put in cross-functional agile teams that really now designed, built, and tested their software. And that yielded some reasonable improvements in what we called release impact. When we put a release in, we improved that about 83%.

Then we went through a true DevOps transformation in 2016. We collapsed together our product operations teams into basically our development teams to really own the entire life cycle of building and running the software. We saw some fantastic results with that, including incidents per month dropping. We're able to grow our subscriber base some 27% up through 2018. And then also, TPS on the platform grew through 2018 to about 4,000. A lot of that was enabled by the technology changes, but also changing the way that we worked.

Then in '17, we really focused on spreading DevOps really to the rest of the organization, to our platform teams, to how we manage projects or move more to a product mindset, but also engaging deeply with product management and those DevOps principles. I presented last year with Brian Clark around this lean portfolio leadership and PM meets DevOps. And in that, we really talked about how we look at the portfolio, how we integrate the other capabilities of DevOps, basically into how we look at the product. We reduced something called impact minutes, some 58% through 2018. Release on Demand, when we basically start getting rid of releases, improved some 460%. And finally, our employee eNPS improved quite a bit also, some 400% over that timeframe.

So looking at the problems we'll lay out today, back in 2010, we're really looking forward and we didn't quite know what the future looked like, but we did know we needed to grow. We needed to lower our cost. We needed to go faster, and we needed to be more stable. And these were all things both our executives and our customers were asking us to do.

So you fast-forward and you'll see that's actually what happened. We were able to grow our subscriber base now through '19 to some 63 million subs. TPS on the platform continues to dramatically increase because of digitization, mobile, and cable industry changes that push a lot more toward self-service.

So that grew to about 6,000 TPS, which is a 700% increase. But we couldn't increase cost 700%. We really had to reduce technical debt and change the way that our infrastructure worked, and really that's what we're talking about today.

So here's the problem with legacy systems. How do you patch them? How do you maintain and secure? How do you increase the stability and the safety of those systems? And how do you go faster and deliver features which are really what your customers want? And then how do you support this growth without a massive increase in cost? As we add more subscribers and more transactions, we can't have costs go up proportionally to that. And then how do you minimize exposure from dangerous vendors?

I'm going to pause on that and give you my definition of dangerous vendors. Dangerous vendors are vendors that you spend more time in audit and compliance than you do with engineering. They're vendors that surprise you at term contract renewals and increase your costs by 500%. They're vendors that put in Byzantine kind of working agreements of how you can provision. You can't use virtualization. You have to isolate infrastructure off. I think folks in the industry are familiar with these experiences. These are existential threats that can really turn the P&L of a product line upside down as a surprise.

Now, a lot of what I'm saying may sound a very victim mentality. Don't take it that way. You have to attack this problem, and you have to really engineer your way out of it. If you wait, it will become a big problem for you.

So, one of the things we say at CSG is that we do have a great heritage and we honor that past, but we also inspire the future. So, our view was don't be legacy. Really be heritage and continue to modernize, and great engineering companies continue to allocate time for this modernization. Modernization fuels DevOps, it creates safety and it reduces technical debt, and improves things like productivity, quality, lead time, and recovery. It reduces risk, legacy technology, dangerous vendors, and also workforce sustainability. And finally, you're never done. Once you basically do one modernization, you realize you have to do another, and you keep moving on to go across your infrastructure.

So here's really the picture of the reality for modernization for us, and this is really laid out in four stories. First story is about our API platform, stories two and three are about our mainframe, and the fourth story is about our composition platform. All of these platforms are foundational. They really support a massive business for us.

So if you were going to do this to your house, you'd move out, right? Well, unfortunately, we can't do that. We have 63 million subs, over a billion trans a day. We process $87 billion a year in revenue for our customers and 75 million bills a month. All of them expect that to work. Additionally, large batch go dark transformations don't work very well, so we were intent on doing this without impacting our customers and continuing to transform while we actually had subscribers.

So how do you approach this? Well, the easy answer is very carefully. Great perseverance and engineering excellence. But there's much more to it, and I've got some capabilities which I'll talk about, and if you basically can master these, you can transform. So, the bottom four in blue are capabilities that you're familiar with: automated testing, CI, telemetry, infrastructure. I'll spend very little time on those. They're fairly well known in DevOps. The ones at top I'll spend a bit more time on in talking specifically how we leverage things like feature switches, code porting, incremental rollout, and things like strangulation. And I have left a good bit of detail on the slide so you can take this away and leverage it, and use it as a reference point in your implementations.

So story one about our API platform was running on a heavy Unix-based enterprise service bus. And I'll go into that a little bit and how we made it through that transformation.

So I'll start with this problem. This is what I call golf course software.

So this is software that's greenlit at some higher point in the organization. See how happy that sales guy is with the CIO, and they're making this deal. And they're saying stuff like this, "It's a low code environment. You don't need developers. You can just map your data. It's really easy to operate. It's already integrated with everything you already work with," right? Don't do this. And the other thing with golf courses is there isn't a developer operations person in sight. So please don't force these heavy platforms on your development and operations team. Let them pick them. Let them pick the tools that they want to run.

So that's really what we started with. We had this platform that had horrible developer aesthetics, lots of windows you had to click in and map stuff, override the mappings, high build and test effort, 14-hour builds. You heard Gene talk about the Symbian problem at Nokia. I mean, that was it. It took forever to build, weeks to test. Deploying to production were massive deploys, 45 minutes to recycle every server. There was no observability or telemetry. TPS density was incredibly low, and there was really an unsustainable cost to support the business growth. So if you talk about a dangerous vendor, it was incredibly costly to actually support this platform.

So our approach was to move it to a commodity stack, port some 300 transactions leveraged by 1,200 downstream integrations to native code. We strangled off that old platform with feature flags and canary, and we'll show some diagrams of that, and apply, of course, all the foundational stuff, the testing, the CI, telemetry, and infrastructure.

So we'll start with this, which we've got this old service bus, right? And we've got all these transactions going to that old service bus, and we were fortunate enough, we had what's called the software load balancer. It was written many years ago. And we actually had some configuration flags in that load balancer. So we leveraged it and really leveraged this concept of the feature switching, the incremental rollout, and strangulation to convert over. And so as we started to port the code, we would go take a low-risk client, like a lower volume client that had less use cases, and we would flip that feature switch. We would basically make sure it would work. We would do that actually during the day and look at those transactions, make sure that they were working. Then we went to riskier transactions, and then we went to riskier clients. And over time, basically, you phase through this and keep the entire thing in flight, and now we can actually remove that legacy enterprise service bus from the environment.
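The route-flag approach described above can be sketched roughly as follows. This is a minimal illustration in Python, not CSG's load balancer: `RouteTable`, `dispatch`, the backend names, and the client and transaction names are all hypothetical. It assumes a flag is kept per (client, transaction) pair and that anything not explicitly flipped stays on the legacy path.

```python
from dataclasses import dataclass, field

LEGACY = "legacy_esb"   # hypothetical name for the old service bus route
NATIVE = "native"       # hypothetical name for the ported code route


@dataclass
class RouteTable:
    """Per-(client, transaction) route flags; unflipped traffic stays on legacy."""
    flags: dict = field(default_factory=dict)

    def flip(self, client, transaction, target=NATIVE):
        # Flipping one pair at a time is what makes the rollout incremental.
        self.flags[(client, transaction)] = target

    def route(self, client, transaction):
        return self.flags.get((client, transaction), LEGACY)


def dispatch(table, client, transaction, payload, handlers):
    """Send the request to whichever backend the flag currently selects."""
    return handlers[table.route(client, transaction)](payload)


# Stand-in backends that just report which path served the request.
handlers = {
    LEGACY: lambda payload: ("legacy", payload),
    NATIVE: lambda payload: ("native", payload),
}

table = RouteTable()
table.flip("small_client", "get_account")            # start with a low-risk client
first = dispatch(table, "small_client", "get_account", {"id": 1}, handlers)
other = dispatch(table, "big_client", "get_account", {"id": 2}, handlers)

table.flip("small_client", "get_account", LEGACY)    # flipping back is the rollback
rolled_back = dispatch(table, "small_client", "get_account", {"id": 1}, handlers)
```

Because the default is the legacy route, the old platform keeps serving everything until a flag says otherwise, and it can be removed once no flag or default points at it.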

So I will dip into this foundational modernization automated testing here because it highlights on two things. One, how important automated testing is, and you heard that earlier in our previous discussions, but also kind of the dangerous vendor problem again.

So we had this set of legacy test case tools, incredibly heavy, incredibly expensive. We only bought them for testers because it was so expensive. So it created kind of the perfect silo of actually not really getting testers and developers to work together. The cost was going up about 25% a year on that and really had a high manual test effort. We commoditized everything to Gherkin, put the tests in version control, and now we had developers basically collaborating in test suites. Great thing, Gherkin's free, but the better thing is now it puts everything in code, and in version control with everything else, and you really get those two roles to collaborate. So by doing this, we were able to grow from almost zero tests at the time to almost 14,000 tests today. We collaborated on a paper several years ago on this research, and it's available for you, so I highly recommend to leverage that if you want to look at testing in legacy systems. So, I want to also look into how we leveraged that testing, because some of the things we needed to do is we had to make sure every transaction was exactly the same. And so the tests confirm a bunch of that, but we also used routing flags basically by running the test suites, running an old route flag through, and then a new route flag, getting those results, and comparing them. And those would catch subtle differences in the XML, like spacing, things like that, that unfortunately would actually break downstream parsers. And so we were able to get a lot more coverage by leveraging that test automation and those techniques. So here are the results.
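To make the old-route versus new-route comparison from the testing discussion concrete, here is a small Python sketch. It is an assumption-laden illustration: `compare_routes`, `old_route`, and `new_route` are hypothetical names, and the two routes are simulated with plain functions rather than real service calls. The point it shows is that the comparison is exact, so a whitespace-only difference in the XML is reported.

```python
def compare_routes(requests, call_old, call_new):
    """Run each request through both routes and collect exact mismatches."""
    mismatches = []
    for request in requests:
        old_response = call_old(request)
        new_response = call_new(request)
        # Exact string comparison on purpose: spacing differences can break
        # downstream parsers, so they must count as failures.
        if old_response != new_response:
            mismatches.append((request, old_response, new_response))
    return mismatches


def old_route(account_id):
    # Stand-in for a call made with the old route flag set.
    return "<acct><id>%s</id></acct>" % account_id


def new_route(account_id):
    # Stand-in for the ported code, with a simulated spacing bug for account 2.
    if account_id == 2:
        return "<acct> <id>2</id></acct>"
    return old_route(account_id)


mismatches = compare_routes([1, 2, 3], old_route, new_route)
for request, old_response, new_response in mismatches:
    print("mismatch for request", request, repr(old_response), repr(new_response))
```

A comparison that normalized whitespace first would have passed account 2, which is exactly the class of difference the talk says had to be caught.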

So today, we just hit about 6,000 TPS on the platform. The TPS per node went up substantially, so one of the problems would be we need a ton of nodes to actually run this. We would've needed about 28, but we only need about 4.5 to support that entire volume. Now we do run more nodes than that to basically provide for isolation. The cost per TPS dropped over two orders of magnitude times two, so you can kind of see that cost difference. Feature development increased. Things like recycling servers now only take two minutes, as opposed to before it was 45 minutes per server.

So it used to be incredibly dangerous. It's a lot safer now.

So as far as some recognition, I want to recognize the SLBOS team on the left.

So that's the team that spent over five-plus years on this. We've been able to grow that team to Bangalore. Because we have actually a commodity platform, one of the key things, we can train other people on it now. And then in the lower right-hand corner, I want to recognize two people. Mike Battalucco was the manager and director of that team, and then one of my colleagues, Jeremy Van Haren, who just did a fantastic job leading this. One thing we did notice when we took this picture is we're a lot grayer than five years ago. So one of the side effects of modernization is gray hair. So that's just something to be warned about.

All right. So I'm going to zoom in on this plaque. This is what I've got on my desk now. Which is, the best time to plant a tree was 20 years ago. The second-best time is today. So don't wait. Start modernizing. And then if you haven't yet, it's not too late. And then one of my second favorite quotes from this project is from my favorite superhero is, "Luck favors the prepared," Edna Mode from "The Incredibles." And the amount of things that we've been able to do with being prepared and having this platform ported are just fantastic. We spin up new environments, roll out new customers, and we couldn't do that stuff before.

So the final thing of the celebration was actually this.

So it turns out Wes did say that in the Unicorn project. I didn't say the top thing. What I said can't really be printed. And then secondly, that's a 2U server, not an 8U server. The 8U servers were too heavy to lift, so we just used the 2U ones.

All right, so the second story kind of gets onto our mainframe and mainframe DB2. So we had a problem where we were 100% VSAM. So we had lack of commodity data access. It was only by CICS. Our maintainability was at risk. We had this unsustainable cost increase again, which really jeopardized our viability as a platform.

So our approach there was to convert those VSAM files into 500-plus DB2 tables. We implemented incremental rollout here in a variety of ways and strangled off those VSAM systems. And then we started moving those read transactions off of CICS, really onto our commodity infrastructure to access that data store.

So here's the data store migration pattern, and you can use this with any set of data stores. You really go through four steps. You make your old data store primary, which is where you are today.

Then step two is you basically make the new data store be a replica, and you compare those results. Then step three is you make the new data store the primary and your old data store be the replica. You compare those, and you do that so that you can fall back. And then you make your new data store be the primary.

So how this worked is we had a bunch of switches at the mainframe level that allowed us to walk through these states. Switch one being VSAM being read and write. Switch two, the VSAM was read and write, and we would write through to DB2. And then the next step is that basically DB2 becomes read and write, and we write back to VSAM to keep it in sync. And note that both at switch three and two, we can roll back. And that's important because when you find an issue, and you will find an issue, you need to be able to roll back without impact. And then you do these nightly compares, and this finds data that's been around for 30-plus years that actually may be broken, and you basically need to actually fix the logic up. You need to clean up the data.

The final thing is we switch to DB2 only, which basically now VSAM is inert, and there's no going back at that point.
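The four switch positions can be sketched as a small state machine. This is an illustrative Python model only, with two in-memory dictionaries standing in for VSAM and DB2; `Mode`, `MigratingStore`, and the record keys are hypothetical names. It assumes writes always reach the current primary plus the replica, reads come from the primary, and it omits any initial bulk load of existing data into the new store.

```python
from enum import Enum


class Mode(Enum):
    OLD_ONLY = 1      # switch 1: old store is read and write
    OLD_PRIMARY = 2   # switch 2: old store is primary, writes go through to new
    NEW_PRIMARY = 3   # switch 3: new store is primary, writes go back to old
    NEW_ONLY = 4      # switch 4: old store is inert, no going back


class MigratingStore:
    def __init__(self):
        self.old, self.new = {}, {}
        self.mode = Mode.OLD_ONLY

    def set_mode(self, mode):
        # Once the old store stops receiving writes it drifts, so rollback is refused.
        if self.mode is Mode.NEW_ONLY and mode is not Mode.NEW_ONLY:
            raise ValueError("old store is no longer in sync; cannot roll back")
        self.mode = mode

    def write(self, key, value):
        if self.mode is not Mode.NEW_ONLY:
            self.old[key] = value
        if self.mode is not Mode.OLD_ONLY:
            self.new[key] = value

    def read(self, key):
        old_is_primary = self.mode in (Mode.OLD_ONLY, Mode.OLD_PRIMARY)
        return (self.old if old_is_primary else self.new)[key]

    def compare(self):
        """Keys whose values differ between stores (the nightly compare)."""
        keys = self.old.keys() | self.new.keys()
        return sorted(k for k in keys if self.old.get(k) != self.new.get(k))


store = MigratingStore()
store.write("acct-1", "original")       # switch 1: only the old store sees this
store.set_mode(Mode.OLD_PRIMARY)        # switch 2
store.write("acct-2", "written twice")
drift = store.compare()                 # acct-1 never reached the new store

store.set_mode(Mode.NEW_PRIMARY)        # switch 3
store.write("acct-1", "fixed")
in_sync = store.compare() == []

store.set_mode(Mode.OLD_PRIMARY)        # rollback from switch 3 is still allowed
after_rollback = store.read("acct-1")
```

The compare step is what surfaces records that only one side holds or that the two sides disagree on, which is where the data cleanup described above would come in.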

So how we did the switching, if you remember this picture with the load balancer, we had basically those route flags. We added a read flag now in there so these incoming transactions could switch and now start going to DB2 and really went through the same process to start switching everything over to actually read directly from DB2.

So here's some things here on this. Before, we were only VSAM by CICS. Now we've got DB2 direct. Our read transactions were 100% CICS. Now 62% go direct to DB2. And we have really high data accessibility, and our response time, surprisingly, is actually even better than it was before. And doing all this with near zero customer impact is pretty impressive.

So the next one is mainframe Java, and this one really looks at a problem that we've got, which is we have close to four million lines of high-level assembly.

So if you've ever seen assembly language, it's hard to maintain. It's hard to understand. And we have maintainability at risk, and we continue to have this unsustainable cost increase across everything.

So our approach we took is to write some cross-compiler tooling, and you'll see that in a minute, and start targeting this update logic. And we also wanted to run this code off board of the mainframe, and again, carry through the incremental rollout to basically minimize impact.

So one of the things that we did is you saw that functional test coverage. The good thing is we'd already covered a good portion of the platform, so we had to augment that legacy code base, that high-level assembly. We'd already done a lot of that work, so we got leverage and built on top of that. We leveraged code analysis really to look at all the code, and those are pictures really of the module dependencies in the system.

So we can start unwinding the sweater, right? The big hairball, and pull off lower-risk, less impactful transactions first and move through the system.

Cross-compilation. So here's the deal. You've got 3.7 million lines of code. You've got two options. One, you can start typing, which you would just never finish. And the other problem with that is you get really long cycle times to actually get code converted. Or two, you can create a cross compiler, which is what we did. We invested with some research companies that specialize in this to basically cross-compile the high-level assembly to higher-level languages. And then what we do, and this is actually code that really came out of the cross compiler. So for folks that read assembly, you'll see that it's isomorphically the same.

So on the left is assembly code, on the right is what gets output in Java. Now the key thing here is after this comes out, we put in, really, all these foundational constructs, CI unit tests. And then the real important thing is we continue to refactor the cross-compilation to get the code quality to higher levels. This allows us to find these domain-specific patterns. It allows us to increase the target code base maintainability. This example, we have naming overrides that get loaded into the cross compiler that take symbolic names and turn them into real variable names later. And the final thing again is leveraging that functional test coverage to go across that entire code base that's been converted to make sure that they're the same.
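As a rough illustration of the naming-override idea mentioned above, the Python sketch below rewrites symbolic names in generated source text using a lookup table. It is a simplification under stated assumptions: in the talk the overrides are loaded into the cross compiler itself, whereas this sketch post-processes text, and `apply_naming_overrides`, the symbolic names, and the readable names are all hypothetical.

```python
import re


def apply_naming_overrides(generated_source, overrides):
    """Replace whole-word symbolic names with readable names from a table."""
    if not overrides:
        return generated_source
    # Longest names first so a longer symbol is never shadowed by a shorter one;
    # \b keeps a name like R3 from matching inside R30.
    names = sorted(overrides, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda match: overrides[match.group(1)], generated_source)


# Hypothetical override table and one line of hypothetical generated Java.
overrides = {"R3": "lineCount", "WK0012": "memoLength"}
generated = "int R3 = WK0012 + R30;"
readable = apply_naming_overrides(generated, overrides)
print(readable)
```

Keeping the overrides as data, rather than hand-editing output, means the improvement is reapplied every time the code is regenerated.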

The final thing is in production, again, use that feature switching. We basically integrate all the code with our telemetry.

So we use Elasticsearch and put all of our telemetry into that. And then also have auto rollbacks, so if we detect errors, we can automatically fall back, which basically gets our recovery time close to zero.

So again, what this looks like, we add another node called the CICS flag. And in this we have update memo, which still goes to CICS, but update memo for client two now goes to Java, and it goes to a Linux cluster, which runs off board.

What we're also able to do is auto-detect failures.

So if we start getting failures from the Java code that we're calling, we actually automatically revert back to CICS after we see a certain number of failures. Again, driving our recovery time very low.
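The automatic revert can be sketched as a failure counter in front of the new path. This Python example is a minimal model, not the production mechanism: `AutoFallbackRouter`, the threshold of three, and the handler functions are hypothetical. It also assumes a request that fails on the new path can safely be served by the legacy path, which for update transactions would only hold if the failed attempt left no partial changes behind.

```python
class AutoFallbackRouter:
    """Route to the new handler until it fails too often, then revert to legacy."""

    def __init__(self, new_handler, legacy_handler, max_failures=3):
        self.new_handler = new_handler
        self.legacy_handler = legacy_handler
        self.max_failures = max_failures
        self.failures = 0
        self.use_new = True

    def handle(self, request):
        if self.use_new:
            try:
                return self.new_handler(request)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.use_new = False  # flip the flag back automatically
                # Fall through: serve this request from the legacy path
                # (an assumption of this sketch, see the lead-in).
        return self.legacy_handler(request)


calls = {"new": 0}


def failing_java(request):
    # Stand-in for the new off-board code, simulated as always failing.
    calls["new"] += 1
    raise RuntimeError("simulated failure")


def cics(request):
    # Stand-in for the legacy path.
    return "cics:" + request


router = AutoFallbackRouter(failing_java, cics, max_failures=3)
results = [router.handle("update_memo") for _ in range(5)]
```

After the third failure the new path is no longer attempted, so recovery does not wait on a person noticing the errors and flipping the flag by hand.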

So here's where we are, and this project's not complete yet. We're in progress. That's why I have a star on the after. We were 100% high-level assembly. Our goal is to get 85% or more of it moved off to Java. The updates to transactions, our target's about 40% go direct to DB2. Maintainability and productivity, great improvements there. Telemetry, we really didn't have great telemetry on this code that was integrated with everything else. Now we use our Open Common Platform. And then getting the shared practices and tools, the CI, the code refactoring tools, all those tools that you really enjoyed on these other platforms that we didn't really get on the mainframe, now we can leverage. And again, our goal in this is near zero customer impact as we do these roll-outs. And because of that auto rollback technique, we can make these changes during the day.

All right, so some thanks here. So there are so many people involved in this project, close to 200 people. We even had Shrek involved. That's a picture from a Halloween party. And it was really an effort across everyone to do both this DB2 and Java. I do want to recognize one individual, Gary Gouger, here. He just hit 36 years at CSG. So let's give Gary a round of applause.

So Gary is one of our distinguished architects, and what I'm really proud of is his ability to continue to adapt and learn and really take this platform to the next level. It's really been fantastic. But thanks to all the mainframe teams, all the other teams that were involved in this, it was really a great accomplishment.

So the final thing is our composition platform.

So our composition platform is where we produce these 70-plus million statements per month. So think of it as a platform that takes layout from the billing system, puts it on paper, or on PDF for presentation online.

So if you look at this, we had a lot of these other problems. Four million lines of proprietary COBOL. It was over 25 years old. Didn't have version control CI. It was on a proprietary Unix. We didn't have any testing or functional tests. There was no telemetry. We couldn't horizontally scale it, and it was unaffordable vertical scale.

So they just want you to buy bigger and bigger iron that's more and more expensive that you have to license all this expensive software on. We had a crazy number of impacting incidents per day. It really was very tough. It wasn't a safe environment for the employees, and it wasn't a safe environment for our customers.

So I'm not going to tell this story. I'm going to let actually Steve Barr tell this story through a video. There's a couple things I want you to note here, and for folks who follow Steve Mayner, you'll understand the concept of transformational leadership. But setting a vision, inspirational communication, recognition, those are all things you'll see in this. And at CSG, we talk about something as the shadow of leadership. Think about the shadow as a leader that you leave. It's very important, and I'll let Steve talk about this now.

Steve Barr (video)

For the first time in 25 years, a big part of our application is under version control. That's a big deal. There's more. We also have continuous integration, and what that means is, is that as a developer makes a change to the software, their change is automatically compiled and automatically built as a part of the build process. So we know right away when something is working as soon as we check it in. That's great. There's more. We also have COBOL unit testing. So COBOL is considered a legacy technology, but we've put some modern techniques in place to where when a developer tests or checks in the code, it's automatically tested.

So there are unit tests now around the COBOL. We've really built a learning culture. We learned that we have to change things. We have to improve things. We learned a ton through this effort, and that's what I'm probably the most proud of. There's a lot of things that I'm proud of, but we've begun to create what I would consider a safer environment.

All right, guys, I'll kick it off. You ready? Yes.

All right. Hip hip!

Scott Prugh

So let's give Steve and the composition team a round of applause. And if you see him later, give him a little hip hip hooray because it was an incredible project. So there, about 90% of this project is complete: converting four million lines of proprietary COBOL to GNU COBOL, which is an interesting project because it compiles to C, and so you can integrate in other third-party C libraries very easily. Almost zero impacting tickets. We've got telemetry, and we're moving to a commodity solution there, all Linux and this GNU COBOL.

So we've also been pretty proud, and you'll see releases on demand mentioned in a minute. Our lead time for features were months to get things through the system. We now put things in days. We're putting features in as they're done with all that testing, version control, and automation. You also saw in that picture, there was Jenkins running up on that board. That was Jenkins running basically COBOL builds and COBOL automated testing.

All right. So I'll give you a little process update with the last couple minutes here, and how it relates to modernization. So, one of the processes, and we kicked this off in 2018, was release on demand, where we wanted to start getting rid of large releases. And both Damon Edwards and John Willis challenged me and said, "Okay, you reduced batch size once, reduce it again." And I struggled with it for a few years, so we came up with this. Let's make batch size be one. One feature goes in when it's done. And look at the incredible improvements here. We're now at about 62% of all features go in when they're done now. We were at 5% in '17, and now we're at 62%. And we couldn't have done this without a bunch of changes, but one of the key changes was modernization. Being able to shorten these cycle times, improve quality allow us to actually do this. That includes rolling things back if there is an issue.

The other thing is actually how modernization improves CAB.

So I always ask this question, how many people love CAB? That's what I thought. Oh, I see one hand. I'm sorry.

So for folks who follow me on Twitter, you'll know that actually CSG got rid of CAB, and it created quite a flurry. And my mailbox on Monday morning was flooded with this, and this is what you want it flooded with, is cancellations for all those CAB meetings.

So I was talking to Dominica last night about really just releasing all those time thieves across hundreds of people of a meeting you don't have to go to anymore that was incredibly frustrating. I was fortunate enough to collaborate on a paper basically around this with a bunch of other folks in the industry. Highly recommend you read it. One of the sections is about architecting for safety, which is really about modernizing how your software works to minimize the impact of change. And then I highly recommend Nicole's recent work around how she does the research on how a clear change process positively impacts software delivery performance. It's true, it does, but heavyweight change processes do not. And so there's some great research there and some great research in the paper that I highly recommend.

All right. So the final thing is to kind of refresh all these metrics for folks. I took you through the 2018 stuff, and I'll take you through where we're showing now. So our release impact, because of the fact that we're getting rid of releases, obviously dropped significantly. It's dropped 94%. Incidents per month continue to go down. Subscribers, we've added more, as you saw in the previous slide. We're about 63 million now. That's a total 29% increase since we started this. TPS has just exploded. Again, there's just this insatiable desire to consume our APIs at 700%. Impact minutes continues to go down, so we're projecting to be about 6637 by the end of the year. That's a 71% reduction. Release on Demand skyrockets. Our goal by 19.4, which is our next release, is to be at 70% of everything going in. We don't have a new reading on eNPS, but we hold steady at that 400% right now.

So the final thing is what we learned. You can modernize your legacy applications. I encourage you to be heritage. Don't act like a legacy company. It's vital. There's existential risks out there, but it does require engineering excellence. Leverage automated testing, CI, telemetry, infrastructure automation, all those things you hear about in DevOps. Optimize for developer and operations aesthetics. Leverage things like these feature switches, code porting techniques, incremental rollout, and strangulation. And finally, fuel DevOps. You can create safety, reduce technical debt. You can improve things like productivity, quality, lead time, and recovery, and then you can reduce risk. This legacy technology, it's technical debt, the risk from dangerous vendors, and really workforce sustainability.

And then the final thing is help I'm looking for.

So right now, capacity forecasting and estimation and wishful thinking around that. This is the hardest problem in computer science, I'm convinced. How do you basically communicate capacity, still figure out how to do estimates effectively, and combat the wishful thinking that everything should take less time and get to production faster? It's what we want to do, but it's really hard to battle that. CapEx to OpEx cloud hurdle costing. Improving intake lead time with traditional IT mindsets, like everything needs an SOW, so you have to write everything down. But then how can you be agile when you have to write everything down? And then creating capacity for what's called backlog swarming, if you follow Jon Hall, which are all those threes and fours, like how do we basically tackle those and reduce the debt around those remaining incidents out there? And that's it. Thank you very much for your time.