More Engineering, More Culture, More Security
Over the last several DOES conferences, we've outlined CSG's DevOps journey. This is a continuation of that story.
Erica’s teams provide software solutions to CSG’s 40+ development teams. These solutions range from continuous integration frameworks to reusable libraries to telemetry visualization platforms. Erica is passionate about agile and has experience leading DevOps teams where members own the end-to-end infrastructure and code. Erica also has software development experience in the defense and aerospace industries where she worked on projects such as the replacement for the space shuttle. She lives in Omaha, Nebraska with her husband and two kids.
Joseph Wilson is a computer security expert whose career included five years as the Chief of a major DoD Network Operations and Defense Center (NOC) before entering the private sector. He also served as the security architect, strategist, and Manager of Security Operations for a Fortune 250 food company prior to joining CSG International. Joe now serves as the Executive Director of Global Information Security (SecOps, NetOps, and SecDev) and is responsible for protecting customer as well as company assets for CSG International 24/7. He lives in Omaha, Nebraska with his wife and three kids.
Full transcript
Erica Morrison
My name is Erica Morrison. I'm an executive director in our software engineering space, and this is Joe Wilson. Joe's an executive director with our global information security team.
Before I get going, a little bit more about CSG if you don't know who we are. We're a global company. We've got about 3,300 employees around the globe. We're the largest SaaS-based customer care and billing provider in North America. Some of our biggest customers are companies you may have heard of, like Comcast, Time Warner, and Dish. We've got about 62 million subscribers for these customers and about 150,000 call center seats. We support all of this with a tech stack that really runs the gamut, everything from JavaScript to mainframe. We've got about 40 DevOps teams. Same challenges as many companies: things like time to market and quality of software and operations.
A little bit more about our DevOps journey. I've had the privilege to present at DevOps Enterprise Summit three previous times in San Francisco and also this past summer in London. When I look back at our different presentations, I think they kind of tell the story of our journey. In 2015, we talked about reducing batch sizes, applying agile and lean. In 2016, we had a major organizational transformation where we brought development and operations together with true "you build it, you run it" teams. In 2017, we built on top of that foundation, spreading culture, investing in engineering, and shifting ops left. In 2018, we're focusing on more automation and shifting security left.
I want to start with some metrics that show some of the progress with our journey. I get asked sometimes, what's the most important metric to track when you're doing a DevOps transformation? My answer is always, it depends. It depends what's important to your company. For us, reducing customer outages is something that we really wanted to focus on. We came up with a way to quantify this. We call it impact minutes, and we take into account the duration of the outage, severity of the outage, and the products that are impacted.
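The talk does not give the actual impact-minutes formula. As a minimal sketch, assuming the metric multiplies outage duration by a severity weight and by the number of affected products (the weights, field names, and function names below are hypothetical), the calculation and a year-over-year comparison might look like this:

```python
from dataclasses import dataclass, field

# Hypothetical severity weights; the real weighting is not stated in the talk.
SEVERITY_WEIGHT = {"low": 0.25, "medium": 0.5, "high": 1.0}


@dataclass
class Outage:
    duration_minutes: float
    severity: str                      # one of the SEVERITY_WEIGHT keys
    products: list = field(default_factory=list)


def impact_minutes(outage: Outage) -> float:
    """Weight the outage duration by severity and by how many products were hit."""
    weight = SEVERITY_WEIGHT[outage.severity]
    return outage.duration_minutes * weight * len(outage.products)


def percent_improvement(previous_total: float, current_total: float) -> float:
    """Percentage reduction in impact minutes from one period to the next."""
    return (previous_total - current_total) * 100 / previous_total
```

Summing `impact_minutes` over all outages in a year gives one number per year, which is what a goal such as "10% better than last year" would be measured against.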
We use a framework from the book, The Four Disciplines of Execution, or 4DX, to track how we're doing with this. Discipline number one is to focus on the wildly important. We've said we want to focus on a wildly important goal in this area in 2018. We want to significantly improve how we're doing with impact minutes compared to 2017. I'm very excited to report that we are actually 58% better than we were last year with impact minutes. We had set a goal of 10%, so we have wildly exceeded that goal, which makes us very happy. We also have 74% fewer incidents. This doesn't just happen. This happens with a conscious decision to continue improving how we're doing. We'll talk about some things that some of the teams are doing. Examples include improving our synthetics framework, improving telemetry, modernizing our platforms. Lots of people have been working really hard to accomplish the numbers that you see here on the screen.
Let's talk a little bit more about the other disciplines so you understand this framework. Discipline number two is to act on the lead measures. This is where we create epics and features, and we actually plan and do the work. Discipline number three is to create a compelling scorecard. With the scorecard here, we've actually used Power BI. We can slice and dice the data in lots of different ways: by service owner, by product, by customer, by date. This gives us a lot of insight into what's going on here. Discipline number four is to create a cadence of accountability. We do a regular review with our executive team of the progress that we're making and make sure that we're going the right direction as a company.
Let's talk a little bit more about some of the things that we've done this year. When I look back on the year, infrastructure as code is something that I see as a theme that we've talked about quite a bit. We started our infrastructure as code journey several years ago, and this isn't new to the DevOps space, but we've really accelerated our progress in this space this year. Platforms that we've chosen: we've chosen Chef for core infrastructure as code and Rundeck for our operations management platform.
These are really the foundational elements for us that support our public and our private cloud rollout. I want to talk about how we support this organizationally. We've got a core team that supports all of this for us. This team is a key knowledge base. They provide a training curriculum. They also help answer lots of different questions as teams are rolling this out. They own our Chef infrastructure, and they also own a common set of cookbooks. They own our standards and best practices. The teams themselves are responsible for rolling out Chef, but they often do this with the support of the ASA team, which is the name of this core team. The ASA members get loaned out to these teams from time to time as well. With this, then we can feed that back into our standards and best practices. This team supports the workstation configuration and also a test framework that leverages AWS and Test Kitchen.
With all this, we've rolled out to production. Eight teams are using this in production now, and we've got another five that are using this in dev. This is a substantial improvement for us this past year. One thing we found is it really requires a change in mindset. For our more mature teams, this has kind of allowed them to level up and take their game to the next level. For some of our less mature teams, it's allowed them to basically have a forcing function for getting on some of our foundations, like our continuous integration system.
Joseph Wilson
Thanks, Erica. I want to talk a little bit about what we've done at CSG regarding combining DevOps and security. We all know that when we combine those two things, it can be a very powerful combination and that everybody wins. We took a step back and understood that the attackers always have an upper hand. They always have more time, money, resources. We wanted to create a National Guard model at CSG, and that is to give autonomy, mastery, and purpose to our developers.
What did we do? We took a step back and looked at what is considered a defensible architecture. Richard Bejtlich, a well-known security author, says there are six things that make a network architecture defensible: it's monitored, inventoried, controlled, minimized, assessed, and current. Those six bullet points are really, really hard, and we knew we had to get some new technology in order to accomplish them. We evaluated some technology, and we landed, as Erica said, on Chef, and in particular, Chef InSpec.
The biggest pain point we had at CSG was that in every single PCI assessment, we struggled with configuration management at the granular level that was expected. We leveraged Chef InSpec. We combined that with our application specifications and our platform specifications, as well as our global asset management capability that we have in-house. Our global asset management capability includes things like IP addresses, host names, and whether or not those systems are PCI-based systems. It also includes some information about who's in charge of that asset and the leadership at the next level, so we can always monitor.
What we accomplished was, we have the ability now to detect network drift across the enterprise. Application and operation teams have the ability to update that configuration and that application specification on demand, or they can go and fix the drift as needed through our change management process. Scott Prue covered this information earlier this week, but if you missed it, this is now open source, so you can go and grab our asset compliance tool, otherwise known as ACT, and please do so. We'd love your feedback.
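Drift detection of this kind comes down to comparing a declared specification with what is observed on a host. The sketch below is not ACT's implementation; it is a minimal illustration, with hypothetical setting names, of how a spec-versus-actual comparison could be expressed:

```python
def detect_drift(spec: dict, actual: dict) -> dict:
    """Return {setting: (expected, observed)} for every setting that differs from the spec.

    A setting missing from the host is reported with an observed value of None.
    """
    drift = {}
    for setting, expected in spec.items():
        observed = actual.get(setting)
        if observed != expected:
            drift[setting] = (expected, observed)
    return drift


# Hypothetical specification and observed state for one host.
spec = {"ssh_protocol": "2", "tls_min_version": "1.2", "telnet_enabled": False}
actual = {"ssh_protocol": "2", "tls_min_version": "1.0"}

report = detect_drift(spec, actual)
```

A team could resolve each reported entry in one of the two ways described above: update the specification if the new value is intended, or fix the host through change management.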
The second-biggest pain point for us this year in the security space was vulnerability management. We had lots of feedback from teams saying, Hey, I don't understand what I need to patch. I don't have enough time to do it. It's really, really challenging for us. Again, addressing this at a National Guard level, we decided to enable those teams, and we did that via the same development processes that Erica's teams use. We're using Jenkins, Python, Ruby, MS SQL, and our global asset management database.
What we do with vulnerability management is we've automated the process end to end. We take our information from our global asset management database. We then port that into our vulnerability scanner. That vulnerability scanner automatically runs every day. We take the vulnerabilities from that vulnerability scanner and pipe it right back into our global asset management database with role-based access control wrapped around it. What that does is it gives the application teams and the platform teams immediate feedback and a vulnerability fast feedback loop with our daily scans to know what they have to prioritize and go fix. Additionally, we've piped that information directly into Power BI, which gives us the ability to understand who's performing well from a patching cadence perspective. Senior leadership reports are also included over email.
Our patch cadence at CSG is a 90-day marker, so no matter what patch needs to be applied, you have 90 days from the start or the first detection of that vulnerability. We send out automated emails at the 30-day, the 60-day, and the 80-day marker. If you can't patch within 90 days, we ask for a policy exception to be put in place through our governance board, and this has been extremely powerful for us.
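The cadence described here (reminders at 30, 60, and 80 days, and a policy exception past 90 days) can be sketched as a small date calculation. The function and field names are hypothetical, and the sketch assumes reminders fire on the exact day each marker is reached:

```python
from datetime import date, timedelta

REMINDER_DAYS = (30, 60, 80)   # markers named in the talk
PATCH_DEADLINE_DAYS = 90       # days allowed from first detection


def patch_status(first_detected: date, today: date) -> dict:
    """Summarize where a vulnerability stands against the 90-day patch window."""
    age = (today - first_detected).days
    return {
        "age_days": age,
        "deadline": first_detected + timedelta(days=PATCH_DEADLINE_DAYS),
        "reminder_due": age in REMINDER_DAYS,
        "needs_exception": age > PATCH_DEADLINE_DAYS,
    }
```

Running this once per day against each open finding, after the daily scan, would be enough to drive the automated emails.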
The next biggest challenge for us was file integrity monitoring across the board. We had disparate processes and procedures at the application and platform levels. What we decided to do is we had to consolidate to a central tool, and that tool was Trend Micro Deep Security. But we wanted to leverage also the entire framework that we talked about earlier. Let's learn from that and accelerate this process for faster onboarding.
We stood up a Confluence page and we integrated Trend Micro Deep Security FIM into that Confluence page. We gave our developers the autonomy to go ahead and modify that Confluence page and select folders that needed to be monitored. In the case of a web server, the web root directory obviously needed to be monitored. Once TMDS was installed, all the application teams had to do was go and update the Confluence site, and immediately, Trend started receiving alerts. But beyond that, we also made molehills out of mountains. We decided to discard all of the alerts that were no longer of value, and we worked with those teams specifically to do that. We separated the alerts into operating system and application alerts. Every day the teams would get automatic reporting. They're enabled, they're empowered, and they can look at it and quickly remediate. The other thing that we gained from this is huge visibility across our enterprise. Today we've got 30 products, 51 technical owners, and nine service owners that are using this end to end, and that will continue to grow.
Another huge challenge for those that are familiar is third-party software. I think everybody struggles with this. What we did to address this is we leveraged Chef, ACT, our code scanning capabilities, Checkmarx, Flexera, as well as US-CERT data and all the vulnerability information that we could possibly pull down, including SCCM data. What we obtained then was a complete application snapshot. That snapshot's very powerful. We started to do regex and other vulnerability pattern matching so that we could automatically detect further left in the development process what needed to be remediated, and then we would automatically create a Jira ticket and then remediate that as a part of the normal developer workflow, which is a huge win for us.
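The pattern-matching step can be illustrated with a small sketch. It assumes the application snapshot is a mapping of component names to installed versions and that each advisory carries a regex for vulnerable versions. The advisory ID, component names, and ticket fields are hypothetical, and the ticket is only built as a dictionary here rather than sent to any real Jira API:

```python
import re

# Hypothetical advisories: a component name plus a regex matching vulnerable versions.
ADVISORIES = [
    {
        "id": "EXAMPLE-0001",
        "component": "examplelib",
        "vulnerable": re.compile(r"^1\.[0-3]\.\d+$"),  # matches 1.0.x through 1.3.x
    },
]


def find_vulnerable(snapshot: dict) -> list:
    """Return one finding per installed component whose version matches an advisory."""
    findings = []
    for advisory in ADVISORIES:
        version = snapshot.get(advisory["component"])
        if version is not None and advisory["vulnerable"].match(version):
            findings.append(
                {"advisory": advisory["id"],
                 "component": advisory["component"],
                 "version": version}
            )
    return findings


def to_ticket(finding: dict) -> dict:
    """Build a ticket payload (hypothetical fields) for the normal developer workflow."""
    return {
        "summary": f"Upgrade {finding['component']} {finding['version']} ({finding['advisory']})",
        "labels": ["security", "third-party"],
    }
```

Running the match at build time is what moves detection "further left": the finding shows up as ordinary backlog work rather than as a late audit result.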
Erica Morrison
It's been pretty cool getting to partner with Joe's teams this year, and adding security into this has just really been a natural progression, and Joe's shared a couple examples of that. Another area is in the cloud space, and Joe's going to tell you in a minute a little bit more about how his team was engaged here. But the infrastructure as code work that we did really laid the foundation for some major cloud wins for us.
The first one is StatHub. This is our system monitoring tool. In May, we moved 40 back-end servers to AWS. We did this all via automation with Chef and Terraform. Not only did we spin up our servers using automation, but we wrote over 1,400 InSpec tests that verify that we are PCI compliant, which makes Joe happy with me, makes it a lot easier for me to convince him to let me do cool, fun stuff. Along with this, we really greatly improved our patching. We can now do blue-green, and we're able to scale in a manner that we just couldn't on-prem due to our ever-increasing needs for storage and compute. This was really a culmination of months of work across many different teams partnering together: Joe's security team, DevOps teams, platform, and our networking teams. We had to work through things like figuring out how to create a reusable AMI that meets security requirements, and also how to integrate with our on-prem server inventory. Through all of this now, we've got an inventory and a blueprint that other teams can follow.
Another important cloud rollout for us is a voice product. With this product, they have a lot of the benefits that I'd say would be pretty textbook benefits of an infrastructure as code and cloud rollout. Consistent rollout of changes, this is a big one. They have a lot of different unique environments, and making sure that the right code went to the right place at the right time, that it came out of source control, all those things had been challenging for this team, now greatly simplified for them. They don't have to log into the servers to administer them. Obviously, this is very nice as well. Backout had been another challenge for this team. Their backout could take over an hour, very manual, and if you're backing something out, your customer's probably down, so you're not in a good situation, and you're in that for an extended period of time. Now we've enabled blue-green deployment, so we can back out changes in minutes, which is obviously a substantial improvement.
Another product is our eCare product, which is a customer care product. Before this change, they were doing manual server build-outs for 100 servers per release. We do four releases a year. They were doing blue-green, which is great, but they were doing all this manually. As you can imagine, a lot of work goes into this, and the requests were having to get to our SAs two to three months in advance. Now we've streamlined all that, we've automated all that, and they've been able to go one step further now and come up with a dedicated VM cluster per client.
Joseph Wilson
How do we reframe cloud security? This is a common challenge, and Erica talked about some of the wins that we've had this year. The first two bullets: cross-pollination is everything. There might be some pain associated with it, but I'll tell you, it's worth its weight in gold. Why not embed security experts in the development teams, specifically on the hardest security challenges? If we solve it once, we can solve it for many. Also, embed the developers in our security team. Thanks for letting me have one of your resources this year, Erica. It's been a huge win for us. We've got two employees on the security operations team that get to work on automation.
Additionally, when we talk about autonomy, mastery, and purpose for the developers, we want to increase the speed in solving complex security issues. We want to set precedents, set guardrails, but we want to remain iterative. This includes communities of practice. When we have somebody that comes up with a great idea, let's share that. It's not always the security team that comes up with great security ideas. For example, Erica mentioned the AMI that we're using. We said, Absolutely, you can use that AMI. What other security benefits do we get? Erica's team came up with a great solution. They automatically search the webpage and identify any available patches. As soon as patches are available, they're pushing them into dev.
Additionally, I think it's important for us to use the cloud to secure the cloud. It's there. Let's leverage the capabilities, AWS, Azure capabilities, and the like, and take advantage of those IaaS provider capabilities if they're there. Additionally, don't forget about your legacy data centers. This year, we've migrated our two primary data centers at CSG over to software-defined networking. What that's done is shortened the time to resolve vulnerabilities as well as configuration management. Previously, it took us almost six months to plan and work out any changes to our core network because that's how we eat and breathe our business. Today, we have the ability to upgrade firmware on those switches in almost an hour. That's moving at network speed, and that's where we get value.
Lastly, everyone's got to eat. Education, awareness, and training is everything. We still dedicate resource time and education for our employees regarding PCI and what it means to them, and we hold a DevOps leadership series. That might include beer or other types of drinks like that to draw some folks in. But we talk about key topics, whether it's development or security, and everybody gets that ground-level base knowledge.
Erica Morrison
At the beginning of this presentation, I showed the slide that had kind of the history of the presentations that I've done with our senior VP, Scott Prue. As we've gone through this, and we've added Joe's teams in, as we've added security focus, a compelling theme that we've had from our teams is, hey, we need to have a better focus on work-life balance. This year, we went one step further with this. This is something that we take very seriously. We've partnered with our product management team, and we've said, hey, we're going to dedicate 15% of teams' time this year to focus on work-life balance initiatives. Not only that, but this is going to be a team-driven initiative. Let those closest to the pain figure out how to solve that pain. The teams are responsible for coming up with areas to target, coming up with the metrics that they're going to use to track how they're doing, and then actually executing against that.
I'd like to share some of the success stories that we've had coming out of this with a couple of different teams and some of the things that they've chosen to work on this past year. Our continuous integration team has done something called patching on demand. The way that patching used to work for this team, our SAs once a month would give them a six-hour window, and of course, this is off hours, middle of the night. We like to call that stupid o'clock. Even within that window, they didn't get to control which servers were going to be patched when. Now they've taken this into their own hands, and they can patch these servers themselves. They've gotten it down to three hours during business hours, and they can control the order that these are going in, so obviously much more towards work-life balance in this regard.
They've also improved their health monitoring. Our CI system is an essential system that needs to be up all the time. That means that if it's not up, I'm paging out in the middle of the night. We've developed a synthetics framework. We're testing, we're always making sure builds are succeeding, but it was kind of flaky. When we started, we said, That's okay. It's important that we get this working. But as we started getting more of those middle-of-the-night pages, we said, Hey, we need to fix this. We've gone in and we've improved that. I'll say that was actually a pretty common area that a lot of different teams have tackled with their paging.
Our data warehouse team, they chose to tackle some test automation. In one particular case, there's jobs, and they used to take 15 manual steps to validate these jobs, and we've now reduced that to three steps. The team believes overall this is about a 50% reduction in the amount of manual work to validate each of these jobs, which was a big win for us. In another area, we were looking at the pages that this team receives, and this team gets paged a lot, and trying to identify some themes and maybe some areas that we could tackle. In one particular area, we identified that there was some cache locking going on. Every time this happens, at least three different teams get paged out, and then it's an hour of research for each of these teams. Obviously, a lot of time goes into this. We said, We think we can actually fix this with just putting some better error handling in place. That's what we've done here.
With our SL BOS product, which is an API product, they've enhanced their synthetics framework. They already had a very strong synthetics framework. What they've done here is to split out all the different components and test each of those individually. I'll give you a couple examples of how that can be really helpful. Imagine you've got three servers that sit behind a load balancer. If one of those servers is not quite right and it's still responding to pings from that load balancer, two-thirds of the time, my tests are actually going to pass, and I might not know what's going on. But if I've got tests that also go directly to those servers in addition to going through the load balancer, I'm going to detect that a lot faster, and I can bring my MTTR down with that. Take that same example. Let's say I know I've got something wrong in my system, but I don't know what component it is. Is it at the load balancer layer, or is it at the server layer? If I've got tests that go both directions, I can much more quickly zero in on what's going on.
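The three-server example can be made concrete with a short sketch. It assumes the load balancer spreads requests evenly across its servers and that each probe yields a simple pass/fail result. The function names and server names are hypothetical, and this is not the team's framework:

```python
from fractions import Fraction


def lb_pass_rate(direct_ok: dict) -> Fraction:
    """Expected share of passing tests through an evenly balanced load balancer."""
    healthy = sum(1 for ok in direct_ok.values() if ok)
    return Fraction(healthy, len(direct_ok))


def localize_fault(lb_ok: bool, direct_ok: dict) -> str:
    """Combine the load-balancer probe with the direct probes to name the failing layer."""
    failed = sorted(name for name, ok in direct_ok.items() if not ok)
    if failed:
        return "server layer: " + ", ".join(failed)
    if not lb_ok:
        return "load balancer layer"
    return "healthy"


# One of three servers is unhealthy but still answers the load balancer's pings.
direct = {"app1": True, "app2": False, "app3": True}
rate = lb_pass_rate(direct)                       # two-thirds of LB tests still pass
verdict = localize_fault(lb_ok=True, direct_ok=direct)
```

The direct probes flag `app2` on every run, whereas a test that only goes through the load balancer would pass about two times out of three.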
The last team that I want to talk about, order management, kind of ties together a lot of what we've been talking about here. They said, Hey, we want to streamline our deployments. They were doing deployments in the middle of the night, and those deployments were time-consuming. They used Chef, Rundeck, and the cloud to make their deployments much smoother.
To wrap things up, some help that we're looking for. We'd love to hear what you guys are doing from a DevSecOps pipeline perspective, particularly what tools are being used at your company. It's something Joe and I have talked a lot about. We do have a DevSecOps pipeline, but it's something that we're looking to continue to iterate and improve on. Next, SRE best practices. Site reliability engineering, for us, feels like a natural progression of our DevOps transformation. We'd love to hear what best practices you're implementing and how this is going for you. Finally, reducing toil. This ties right into SRE. It also ties into the work-life balance stuff. Hopefully you got a chance to see Damon Edwards speak yesterday. He talked about reducing toil, and he talked about how if you have excessive toil, you actually can't fix your system. We want to make sure that we're continuing to tackle this and continuing to move our teams forward.
That's it for us. We've got a few minutes left if anybody has questions.
Q&A
All right. I can't see any hands over there. If there's anybody up there, feel free to just speak up.
All right, I'll take that as a no. Thank you, everybody. Thanks.