Erica Morrison (CSG) x Shaaron Alvares
Erica Morrison is the Vice President of Software Engineering and her teams provide software solutions to CSG’s 40+ development teams. These solutions range from continuous integration frameworks to reusable libraries to telemetry visualization platforms. Erica is passionate about agile and has experience leading DevOps teams where members own the end-to-end infrastructure and code. Erica also has software development experience in the defense and aerospace industries where she worked on projects such as the replacement for the space shuttle. She lives in Omaha, Nebraska with her husband and two kids.
Shaaron A. Alvares works as an Agile and DevOps Transformation Coach at T-Mobile. She has global experience leading product, organizational agility, and cultural transformation across the technology, aerospace, automotive, finance, and telecom industries at Fortune 500 companies in Europe and the US. She introduced lean product and software development practices and has led significant lean and DevOps practice adoption at Amazon.com, Expedia, Microsoft, and T-Mobile. A speaker, trainer, and writer, she is a news reporter and editor at InfoQ for Agile, Culture and DevOps, and an Ambassador at the DevOps Institute. Shaaron published her M.Phil. and Ph.D. theses with the French National Center for Scientific Research (CNRS).
Full transcript
Shaaron A Alvares
Welcome to another DevOps Enterprise Summit interview with our speakers. Today, I'm very excited to have Erica Morrison with us.
Erica is a VP at CSG, and she presented a keynote at the DevOps Enterprise Summit in London, and she's presenting a keynote as well at the DevOps Enterprise Summit in Vegas this year, which is remote. So if you haven't registered and want to have an opportunity to listen to her talk, I highly recommend you register. It's going to be next week, actually, on October 13 to October 15.
So welcome, Erica. I'm so excited to have you. We collaborated on a paper, and that's going to be published as well by IT Revolution sometime this year about incident management. And a lot of the paper actually is based on the work that you and Scott Prugh have been leading at CSG.
So welcome, Erica. Would you like to introduce yourself?
Erica Morrison
Yeah. Thanks so much for having me.
So as you mentioned, I work at CSG. I'm over a number of different software engineering teams. A lot of the work that they do is in the shared services space, so we're providing services for other teams, things like monitoring solutions, CI systems, and then also we provide some front end for some of our next-generation products as well.
So I've been introduced into the DevOps space over the course of the last several years. My background is in writing software as a developer, and then I've gotten introduced into the operations space as part of my journey through DevOps. So those teams that I run today are teams that cross the development and the operations space.
Shaaron A Alvares
All right. That's awesome.
Yeah, and I think you've been recently promoted to vice president at CSG, so congratulations. I think you've been doing...
Erica Morrison
Oh, thank you.
Shaaron A Alvares
...yeah, amazing work.
So could you tell us a little bit more about CSG, the products that you develop for your clients, and any success story, any client success story?
Erica Morrison
Yeah, sure. So the space that I'm in within CSG is revenue management, digital monetization, and output. And so we're North America's largest SaaS-based customer care and billing provider, and also working on next-generation products that kind of leverage that base and also expand into other markets and things like wireless.
And so, basically providing these solutions to support some of our customers. Customers that you've probably heard of, like Comcast, Dish, Time Warner, Disney. Some of those are all customers that we serve.
Shaaron A Alvares
Wow. That's incredible.
And so what was CSG's journey to DevOps adoption? I know you mentioned you've been working in the DevOps area for a little while, and I know that you have been present at the DevOps Enterprise Summit since it started, I think, in 2015. So were you influenced by the talks, and what was your journey at CSG?
Erica Morrison
Yeah. So our journey has really been over the course of many years, and it started with foundational concepts, things like agile and lean, reducing batch size, and then it evolved over time.
And really we took a drastic step in 2016 of having a major organizational change in part of our business, bringing together development and operations. And so that's proved really, for us and for our journey, foundational in some of our successes.
And so then the next couple of years, we're really working through what does that mean to have teams together? And so again, coming from that development background, there was so much I didn't know in the operations space, so much harder than I think I fully appreciated, and then likewise bringing those development best practices to the operations world.
So continuing to build on that, to look at things like how do we automate our deployments, and then how do we design better software so it's easier to run in production and get defects down and detect issues sooner and monitor. Those have all been things that we've just built on, and now we're taking those DevOps learnings that we've got, and we're applying them in other areas of the business as well.
Shaaron A Alvares
Wow, that's amazing.
And so I know you're going to talk about incident management. You're going to do a keynote, actually, and it's going to be really interesting because you are going to talk about how the worst outage in CSG's history was turned into a powerful learning and growth opportunity. So I don't want to disclose too much of your talk, but can you tell us a little bit more about this outage and how did you turn around the outage?
Erica Morrison
Yeah, sure. So the outage, when all was said and done, was about 13 hours in duration. And it took a good chunk of our products pretty much hard down.
And so it started in the middle of the night, and we troubleshot it throughout the day. And it was a particularly challenging issue because not only were our products down, but so were all the tools that we normally use to troubleshoot them.
So things like our monitoring system, access to tools, they were victims of the same issue that was affecting our production services. And so as the hours crept by, we were troubleshooting blind in a lot of ways compared to how we normally do things.
So we would come up with an idea, we would try it, and then things would work for a little bit. So you'd get this sense of false hope for a few minutes, and then it's like the problem just would move around from place to place.
So we eventually did take more drastic action as the day progressed and eventually started killing VLAN by VLAN until we identified one specific VLAN and everything just started working. And so once we identified that, then we could zero in on it.
It did take us a couple of days to reproduce in a lab what was going on, and it was actually just some routine maintenance activity, a server patching activity. The server ran a non-standard OS that behaved a bit differently when it rebooted. It put some traffic out on the network that got interpreted as spanning tree and hit our load balancer, which had a misconfiguration, and so it looped and created this network storm. So that was why nothing was working on the network.
So that was kind of the outage itself, and I think you asked what did we take from that outage? So this outage was very big, very impactful, much worse than a normal outage. In fact, the worst outage in our company's history. And so we knew we wanted to respond differently to this particular case, so we reached out to several experts in the field.
So we reached out to John Allspaw and Dr. Richard Cook at Adaptive Capacity Labs, and we learned a bunch about incident analysis. So they came in and taught us about that.
And then we implemented an incident management system. So we worked with Blackrock 3, with Chris, Rob, and Ron there, and implemented incident management and transformed how we run our outage calls. I was someone that ran those calls before, and I can't tell you how different and amazing it is. And if you would've told me before that we'd be able to transform how we ran a call so much, I wouldn't have believed it, but it's just night and day in terms of how a call is run. And so that's been very transformational for us.
There's been cultural changes, and then we've had a number of other things that we've improved on the technical front, things like improved monitoring, some things we're doing on our network side as well. So just a ton of learning that's come out of this as an organization, where it felt just like this devastating thing.
And I talk about this in my talk. You could feel it walking in the halls, just sitting in meetings. We just felt we had let our customers down, and it was just this devastating feeling. And so we wanted to maximize that.
And ACL has this saying. It's something like, "Outages are unplanned investments. Make the most of your unplanned investment." And so we really feel that we did do as much as we could to leverage that terrible event to learn as much as we could from it.
Shaaron A Alvares
So yeah, you talked about the impact on people within the company, and did you feel there was an impact on the culture as well? I think in incident management, we talk a lot about safety, psychological safety. So how was the culture and that aspect handled at that time?
Erica Morrison
Yeah, so with an outage this big, as you can imagine, there was lots of pressure, lots of people involved in all of that. So to us, it exposed that we still had work to do.
So we started talking about psychological safety a lot. It became something that we talked about at staff meetings, at all-hands. We do a monthly DevOps leadership series. So we started talking about that.
One thing that we realized is we needed to make it safe to talk about failing. And from that, what we've learned is we have outages, we move past them, people are embarrassed, we want to move on. But in reality, almost all of those outages had valuable learnings that are then not being shared if we're just trying to move past them.
So we actually dragged back up some old outages from several years past, and it took some convincing, but got people to talk about them, to get it to be a more comfortable state so that's a normal thing where we can talk about these and learn from them.
We will continue to talk about psychological safety, I think, forever. And at the end of the day, if you talk about it and you don't live it, you undermine that. So it is something that we as leaders continue to try to make sure is present within the culture.
Shaaron A Alvares
Mm-hmm. Yeah, no, that's really important. I agree. We can better collaborate and better respond to incidents when we know that we feel safe, right?
And so what were some of the key lessons of the incident? I know you introduced new practices, like a leadership meeting as well, monthly leadership meeting to look at incident management across the entire organization, not just single local incidents.
So what were some of the key lessons learned and maybe practices that you introduced after that event that could benefit other organizations?
Erica Morrison
Yeah, I think by far and away, the most impactful thing that we changed was rolling out the incident management system, and again, how we run our calls. So that's been a big focus for us.
You mentioned the global meetings that we've got when we review our incidents. So we had a practice already of local review and of this global review, but I think we've re-looked at both of those and said, "How do we leverage these for maximum learnings to make sure that everyone's participating, we're doing these in the same way, that people understand these?"
And one of the key things that I have been reminded of, it's easy to look at an outage and focus on "How do we make this better?" That's where we all want to go. "How do we make sure this doesn't happen again?" But "What did we learn from this?" is also a really important piece as well.
So with this particular outage, we learned a lot about how our system functioned, and that was eye-opening to me, to say, "Hey, go focus on this particular aspect and make sure you're learning about your systems as well." So now that's something I try to include when I'm doing a post-incident review as well.
Organizationally, I think it was a good wake-up call for us that, hey, there's work to do as far as how we handle our incidents. We've come a long way. Now it's time to up our game to the next level.
And then I think the other key thing is we learned about complex system failure, what complex system failure looks like, and what we can expect going forward, which means that we need to be better prepared for some of these sorts of things.
Shaaron A Alvares
Mm-hmm. Yeah. And like I mentioned, we collaborated with other collaborators, actually, in writing this framework for incident management and learning, and it was a great opportunity for me, actually, because that's when I learned about all the work you've been doing in this space.
So it's a forum paper that's going to be published by IT Revolution again, and it's packed with a lot of very valuable lessons learned, practices, patterns to set up an incident process, incident framework, actually, within any organization.
So yeah, I think it's wonderful what you're doing because you're sharing at conferences the lessons learned from your experience, but also you're publishing those, and you're collaborating across other companies, right? So I think that's the right thing to do.
So thank you very much, Erica. It was great to have you today.
So if you want to listen to Erica's keynote at the DevOps Enterprise Summit, please register. There's still time to register. It's starting next week on October 13.
Thank you very much, Erica.
Erica Morrison
Thanks for having me.
Shaaron A Alvares
Thanks.