The Ops in DevOps is a VERB. It is not a Noun!
As the #1 insurer of cars and homes in the United States, State Farm® has embarked on a journey to fundamentally change the way teams deliver software through DevOps. State Farm has reshaped the way teams work and interact, from the adoption of DevOps practices and behaviors to the realignment into empowered product teams. With over 6,000 IT professionals, a transformation of this magnitude is going to have a few bumps and bruises. What happens when you focus heavily on Development and allow Operations to be treated like a noun instead of a verb? Operations is not a place or a group of people that does something. Operations is an action that teams take on their products to remain stable and successful.
This session provides attendees an in-depth look at the State Farm journey to raise awareness of the concept of DevOps within our leadership and to revamp our approach to Ops by improving the observability, tracing, and monitoring capabilities of our platforms, adopting a site reliability engineer (SRE) mindset, and bringing a lot of enthusiasm to encourage and accelerate adoption of DevOps by product teams. We will share how we used the SRE role and focused on improving the operations capabilities available to product teams to take our DevOps journey to the next level. We will also share how we helped our leadership understand how important DevOps is to being successful.
Full transcript
Jeremy Castle and Andy Hinegardner
Jeremy Castle: Hi, Andy.
Andy Hinegardner: Hey, Jeremy. How's it going?
Jeremy Castle: Not too bad. How are you doing?
Andy Hinegardner: Doing pretty good, but could be doing a little better. Could be in Vegas.
Jeremy Castle: Yeah, could be in Vegas. So we're here to talk about State Farm's DevOps journey, and how we focused a little bit more on operations and some of the wins we got out of that.
Andy Hinegardner: All right. Let's go to the slides.
Jeremy Castle
Jeremy Castle: All right. So the name of our presentation is "The Ops in DevOps is a Verb, it's not a Noun." Next slide, Andy.
State Farm is a very large insurance company. We're the 36th-ranked company in the Fortune 500 in 2019, the number one auto insurer since 1942, and the number one homeowner insurer since 1964. We offer about 100 different products. We have nearly 19,000 agents and almost 60,000 employees. So we're a very large, complex financial institution that offers insurance. Next slide.
To give you a little bit of background about State Farm's Enterprise Technology department, we have nearly 6,800 employees. Out of that, almost 3,000 are software and infrastructure developers. We have hundreds of different technologies: Java, mainframe. We've started building a lot more things with Go. Just many different technologies. Over 2,000 different web applications across so many different platforms: WAS, PCF-based, public cloud. And that's spread across 15 different business areas and 1,200 product teams. So very large, complex systems that all have to interact and talk to each other every day.
Just a little bit of background. I'm an engineering director within State Farm. I work on horizontal enablement for Enterprise Technology, mainly focused on application development life cycle. I own the majority of our developer tools and just the developer experience in general. And really, I promote a lot of our CI/CD practices and tools.
Andy Hinegardner
Andy Hinegardner: Yeah. And I'm a technology manager. I'm in a newly established resiliency engineering area. I see myself as an SRE ambassador. So my primary responsibilities are enabling enterprise site reliability engineering practices for our environment, and I'm also an SREvangelist.
Jeremy Castle: Yeah.
Jeremy Castle
Jeremy Castle: So I'm going to do a little recap of the talk that Kevin O'Dell and I gave last year. I'd say State Farm's DevOps journey really started in 2015, and it really started with this value stream map. So this value stream map, we actually hung it up in my boss's office, and it's about 16, 17 feet long. What we did is we used value stream mapping to really understand our development process.
State Farm really took a heavy interest in: how do I get code from my workstation out to production as quick as possible? But I had to start with understanding that process, and it really started with a challenge from my boss on day one of my job: "I need you to figure out how to go from my workstation to production in less than 24 hours." So that was kind of our North Star.
Just to recap some fun little stats. In 2015, what we found using value stream mapping was about 1,500 hours to get an application out to production, and some of that was actual time. A big chunk of it was wait time. But we found almost 150 steps, 35 different handoffs, many different specialized roles just to get one production activation. And that can just really slow you down, right?
Andy Hinegardner: For sure.
Jeremy Castle: Using the value stream mapping and focusing on the dev side, we were able to take production deliveries from two, three weeks down to two, three hours, and actually now we can get them down to minutes. So as we start factoring out the number of changes we put out per year, we estimated we were able to go from 175,000 days to only 150 days of time. So that's significant savings, right? Not all that was people time. A lot of it was waiting, but it really did help us hone in and make some good decisions and some good improvements in our processes.
Andy Hinegardner: Absolutely.
Jeremy Castle: Another thing we did too is State Farm did that journey where we went from projects to products. We really took in the agile mindset, really changed how we operated, and that gave us some significant improvements. It also helped us work closer with business. So, hey, we focused a lot on dev, we focused a lot on products, right?
With that, everyone gets DevOps, right? Andy gets DevOps. I get DevOps. Whoever gets DevOps, everything's perfect. You've read it in a book. We're doing everything great. We get everything out the door quicker. But if you step back... You want to go next slide, Andy?
If you think about day one, everyone's super happy. We spent a lot of time removing friction from our teams. We've focused really heavily on DevOps lately, or GitOps. So the last year we've made major investment in GitOps, and that's really made our developer community much happier, much quicker. They go to one set of tools. You're defining your infrastructure as code, and you're placing it with your source code, and a lot of big wins out of GitOps.
We've also spent a lot of time on platform enablement. Things like investment in cloud-based platforms, public cloud, heavily baking CI/CD and compliance into these platforms where we can, really helps remove a lot of friction, right? So it's something else product teams don't have to worry about.
Mentioning pipelines, what we've tried to do is position all our strategic platforms to have pipelines. That's the only way you can deliver to production. So as we started enabling GitOps in some of our past work, any strategic platform at State Farm, you're going to have to use GitOps or some type of pipeline to get out to production. So that helps us adopt the CI/CD mindset.
And last on that, we spent a lot of time as a company on developer experience, trying to understand where we can improve their day-to-day life, whether it's better machines, better monitors, making it easier to remove as much friction as possible. I'll say that repeatedly, but that was honestly kind of our North Star on how to get better. So yeah, great. Day one, everything's awesome.
Well, you get to day two, and I'm pushing change out, and are we really focused on the right stuff on operations? A lot of times within the company, you talk to folks and they would say, "Well, operations, that's a team. That's an area that I go to. I don't really have to talk to them, right?"
So as we made some of these transitions from project to product and merged these teams and tried to have multi-skillset teams, people started asking the question, "Well, I have to monitor my application in production?" We didn't always prepare people to be responsible for the end-to-end life cycle of products, or really try to raise some education on: how do I monitor my applications? What's observability? These are common things that we didn't prepare people for as we started thinking about our DevOps journey. We need to do a better job of it.
So what are some ways we did that? State Farm, I think, has a pretty interesting story behind that. We're a large company. I think we have almost 800 leaders just within our Enterprise Technology, and that's from executive down to first line. We need to get our leaders excited about DevOps. You'd hear this a lot: "Well, that's just something, that's a State Farm thing." And it's not. It's an industry-wide movement. Really, I think there's a lot of compelling and awesome stories behind what's going on in DevOps, and we had to figure out how to bring that into State Farm.
Another thing we wanted to really look at was site reliability engineering and how we could bring that into the day two things, to bring coding, a technical automation mindset, into things that happen in production, not just tailing tickets and closing out manual tasks. How can site reliability engineering make huge improvements?
So I'm going to cover the get-leaders-excited part. That's a picture of me talking to Kevin O'Dell, reading on "The Unicorn Project." It's kind of a funny video, but I'll give Kevin a ton of credit. To try to get our folks excited, we started holding these big events within State Farm. Me and Kevin spoke at All Day DevOps. We had other people speak there, and we made this a huge viewing party. So we tried to get everyone excited within our department and watch these videos and be engaged and learn more about it.
And then Kevin spent a lot of time earlier this year, right before COVID hit. We had like five, six industry speakers lined up. We were going to have everyone come in. We had an all-day conference. State Farm was making a huge investment in their leaders to learn more about DevOps, and it was an awesome event. But we had to move it. COVID hit, and we ended up doing it all virtual in August. We got industry speakers to come in. It was pretty amazing. Got a lot of positive feedback, and we started building these DevOps cohorts within State Farm.
That was like: people are excited. Let's talk about DevOps. Let's read "The Unicorn Project." That's the other thing we did. We got every single leader a copy of "The Unicorn Project," and we're starting to do book clubs around it, talking about how DevOps can be applied within our products at State Farm. So you're getting a groundswell of excitement.
And then the other thing: State Farm, we're starting to ask people to go out and actually talk through these different items, talk at these type of conferences, get people excited about them, because it shows that we are doing things in the industry, and that gets these people excited back on the product teams.
You want to go next slide, Andy? So I talked to you about how the leaders got excited, but Andy, about a year ago, was charged with, let's figure out how to get SRE into State Farm. He's going to tell you some of the cool stuff we've been doing in that space.
Andy Hinegardner
Andy Hinegardner: Yeah. Thanks, Jeremy. That's really what I've been tasked with, helping to really start up SRE practices within State Farm. We are new at this. Like Jeremy said, I haven't been in this position that long. We've built a couple teams. We're in the process of building another. But really, when I started to look at all of this, none of what we're really talking about from an SRE perspective is necessarily new. So I thought I'd spend some time today explaining how we got to where we are and then some of the steps that we've taken so far.
The big reason that we looked at site reliability engineering is because we were seeing some increase in customer impact. Like Jeremy said: hey, we can get code from your desktop to production in two minutes and it's awesome. But if you don't really know how that application is running, how it's behaving, is it healthy, is it dead, is it sick, then you're not really engaging in the full cycle of what this means.
So we missed some availability targets. Our agency force was seeing some higher recovery times for some of the tools that they use, and that really got us to think: all right, we've got to do something about this. We can deploy code all day, but we need to look at the ops side of the house as well.
What we did was we decided to start with a small team with a startup mentality, and we wanted to fill that team with folks who had broad skill sets across the organization. And we wanted to focus on our critical applications: our customer-facing applications, financially significant applications, the apps that generate money for the business. That was what we started with.
What does that mean? There's a lot of book reading, a lot of videos to really start up SRE, and one of the biggest things within that is really the culture change associated with it. We're all in IT, and things are going to break. I think years ago it used to be we can't tolerate anything not functioning. But I think the goals that we have now are the realization that things are going to break. It's just how fast do you recover from them, or what are the things that you can put in place via automation, things like that, for recovery.
The other thing we really saw in our organization was balancing features with the operations side. The dev side of the house is the business saying, "Go. I want all this functionality out in production." But there needs to be that balance. So it can't just be push the code and forget about it. That's one of the things we were seeing as well.
A big thing, too, is just automating toil or automating those tasks that are repetitive within your team or your environment. That's a big one, and I think that ties back in with balancing the features with the ops side. You need to give teams time to actually work on some of that toil so they can just get it off their plate.
And then another big one we found is the concept of a blameless postmortem. When events happen, what does that mean afterward? You pull a bunch of people together. Traditionally, you get on a bridge call, and everybody is working 24/7 to get something fixed. And then at the end, you figure out what the root cause was. You drag that person in and say, "Don't do it again, or else." And that really promotes negativity within the operations side of the house.
So the concept behind blameless postmortems is that we have issues. We're in IT. Things are going to break. Let's all get together and figure out what they are, and then let's ask questions like, "Hey, why was that allowed to happen? Can we do things from a systems perspective to not allow people to do that?" For example, if it was human error. So that's where we started.
Some major things that we found are around dependencies, observability, health, and traceability. Dependencies: do you know what is using your application or your infrastructure, and do you know what impacts you may have on someone else or they may have on you? I know it seems simple, but when you get in a large organization and pretty complex applications and infrastructure setups, that isn't as clear as it could be. So that's one target area we had.
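To make the dependency question concrete, here is a minimal sketch (an illustration, not something shown in the talk) of answering "who is impacted if this service fails?" from a simple dependency map. The service names and the `DEPENDS_ON` structure are hypothetical; a real inventory would more likely come from a service catalog or from tracing data.

```python
from collections import deque

# Hypothetical services. Each key lists the services it calls directly.
DEPENDS_ON = {
    "quote-web": ["rating-api", "customer-api"],
    "claims-web": ["customer-api"],
    "rating-api": ["policy-db"],
    "customer-api": ["policy-db"],
    "policy-db": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively calls `failed`."""
    # Invert the map so we can walk from a dependency to its callers.
    callers = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, []).append(service)

    # Breadth-first walk over the caller graph.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(seen)


print(impacted_by("policy-db", DEPENDS_ON))
```

Walking the inverted graph gives the downstream blast radius of a failure; walking `DEPENDS_ON` directly would answer the opposite question of what a given service itself relies on.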
The other was observability, and that's basically: is anything weird happening within the systems, and are you monitoring? Do you know that those are happening? Are you getting notified when you see traffic spikes or things like that?
And then another big one was health. Is your app or infrastructure healthy, sick, or dead? I know I mentioned that before, but that was really a big one. When we moved from project to product teams, that product team owns everything start to finish on it. And we were really seeing that folks weren't necessarily putting the instrumentation in place to understand the health of the system.
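As a rough illustration of the "healthy, sick, or dead" idea, the sketch below classifies a single probe result into one of those three states. The `Probe` fields and the threshold values are assumptions made for the example, not State Farm's actual health criteria.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    SICK = "sick"
    DEAD = "dead"


@dataclass
class Probe:
    responded: bool        # did the service answer the health check at all?
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile latency over the probe window


def classify(probe, max_error_rate=0.01, max_p95_ms=500.0):
    """Map a probe to healthy, sick, or dead (thresholds are hypothetical)."""
    if not probe.responded:
        return Health.DEAD
    if probe.error_rate > max_error_rate or probe.p95_latency_ms > max_p95_ms:
        return Health.SICK
    return Health.HEALTHY


print(classify(Probe(responded=True, error_rate=0.002, p95_latency_ms=120.0)).value)
print(classify(Probe(responded=True, error_rate=0.080, p95_latency_ms=120.0)).value)
print(classify(Probe(responded=False, error_rate=0.0, p95_latency_ms=0.0)).value)
```

The point of the middle state is that "responding" is not the same as "healthy": a service that answers but is slow or failing a share of requests is sick, and that is usually the state teams lack instrumentation for.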
And that led us into also the traceability side of the house. If your system's okay, like, "Hey, all my screens are green. Awesome." But if you're affecting your customers because you're just a small part of their customer journey, then you need to really understand what that customer is doing via all the systems and know that those systems are acting how they should and acting healthy.
That's really where we got to "knowing is half the battle." So we ended up instrumenting what we call LATTES, which stands for latency, availability, tickets, traffic, errors, and saturation. This was just to establish a baseline on some of those critical business applications because we didn't necessarily have a good baseline where dependent teams could talk to one another and be talking about the same thing.
So we very simply set up a Prometheus instance. We used Promregator to scrape these metrics on our PCF platform. That gave us a good baseline and actually let these application owners, in this case, tell the dependent services that they're hitting how healthy they were and see issues when they weren't.
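The talk does not show how the baseline was calculated, so the following is only a sketch of how a few of these signals (latency, availability, traffic, and saturation) could be derived from raw request samples. The `Request` record, the nearest-rank percentile, and the use of CPU utilization as the saturation measure are all assumptions for illustration; in a Prometheus setup these would normally be queries over scraped metrics instead of Python code.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def baseline(requests, window_seconds, cpu_samples):
    """Summarize one observation window into a small baseline report."""
    if not requests:
        raise ValueError("no requests in the window")
    durations = [r.duration_ms for r in requests]
    return {
        "latency_p95_ms": percentile(durations, 95),
        "availability": sum(r.ok for r in requests) / len(requests),
        "traffic_rps": len(requests) / window_seconds,
        "saturation": max(cpu_samples),  # peak utilization, 0.0 to 1.0
    }


# Hypothetical window: 20 requests in 10 seconds, one of them failed.
requests = [Request(duration_ms=10.0 * i, ok=(i != 7)) for i in range(1, 21)]
report = baseline(requests, window_seconds=10, cpu_samples=[0.40, 0.70, 0.55])
print(report)
```

Having every team report the same handful of numbers is what lets dependent teams compare notes and be talking about the same thing.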
Another thing, if you've looked at anything SRE, the big thing is around SLOs and SLIs. So you have to establish targets for availability of your systems and then also measure that. Here at State Farm, we could have done a better job in that space. We have SLOs and those things. Sometimes I think they just got picked out of the sky and applied to things, and there wasn't a really good measurement. Some areas were doing it better than others. But that's a big focus when we go engage with these teams to say, "Hey, what's your goal for availability?" And then are you measuring and knowing that you meet that?
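One common way to connect an availability goal to a measurement is an error budget. The sketch below is a generic illustration, not State Farm's method: the SLI is the fraction of successful requests, the SLO is the target for that fraction, and the budget is the number of failures the target allows. The request counts and the 99.9% target are hypothetical.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compare a measured SLI against an SLO target and report the budget."""
    if total_requests <= 0:
        raise ValueError("need at least one request to compute an SLI")
    sli = (total_requests - failed_requests) / total_requests
    allowed_failures = (1.0 - slo_target) * total_requests
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_remaining": allowed_failures - failed_requests,
    }


# Hypothetical month: 1,000,000 requests, 600 failures, 99.9% target.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=600)
print(report["slo_met"], round(report["budget_remaining"]))
```

A target chosen this way can be checked against real traffic, which is the difference between an SLO that is measured and one that was picked out of the sky.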
On top of that too, that led us into thresholds and alerting. Hey, you found something. You need to let other folks know and then take action. So I'll do a little more on that in just a minute here.
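A threshold alert is often written so that it fires only after the condition has held for several consecutive samples, which avoids paging on a single spike. The function below is a minimal sketch of that idea; the sample values, the 0.8 threshold, and the three-sample hold period are hypothetical.

```python
def first_alert_index(samples, threshold, for_samples):
    """Return the index of the sample at which the alert fires, or None.

    The alert fires once `for_samples` consecutive values exceed `threshold`.
    """
    streak = 0
    for index, value in enumerate(samples):
        streak = streak + 1 if value > threshold else 0
        if streak >= for_samples:
            return index
    return None


# Hypothetical saturation readings, one per scrape interval.
readings = [0.20, 0.90, 0.30, 0.85, 0.90, 0.95]
print(first_alert_index(readings, threshold=0.8, for_samples=3))
```

The single spike at the second reading does not fire the alert; the sustained run at the end does. The notification step (who gets told, and how) would hang off the returned index.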
The big thing we're focusing on in our early journey of SRE at State Farm is really reducing the MTTs, or the Mean Time Tos. There are lots of them. This isn't all of them: detect, identify, notify, repair, and then between failures. You need to be able to measure some of these things and know what's going on with your systems, and then be able to trend that to know what's going on.
So I mentioned our LATTES monitoring solution earlier. We were working with a customer-facing team. We implemented that solution, and within 24 hours, we alerted on an event that was happening in that system that they normally didn't have visibility into, which was really cool. The teams were able to recover that in about 54 minutes, which was a big improvement, where we would sometimes see recovery take hours. And we also saw the same scenario hit again, but the next time the alert fired, we recovered that in less than nine minutes. And I think even recently, we're down to less than two minutes. So that's some awesome progress, and it comes from just understanding and knowing what's happening with your systems and measuring these things.
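To trend the "mean time to" numbers, each incident needs at least three timestamps. The sketch below assumes one particular set of definitions (detect and repair are both measured from the start of the incident, and time between failures runs from one resolution to the next start); organizations define these differently, and the incident data here is made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime


def mean(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mean_times(incidents):
    """Compute mean time to detect, to repair, and between failures."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [b.started - a.resolved for a, b in zip(ordered, ordered[1:])]
    return {
        "detect": mean([i.detected - i.started for i in ordered]),
        "repair": mean([i.resolved - i.started for i in ordered]),
        "between_failures": mean(gaps) if gaps else None,
    }


# Hypothetical incidents on consecutive days.
incidents = [
    Incident(datetime(2020, 1, 6, 10, 0), datetime(2020, 1, 6, 10, 10), datetime(2020, 1, 6, 10, 50)),
    Incident(datetime(2020, 1, 7, 10, 0), datetime(2020, 1, 7, 10, 2), datetime(2020, 1, 7, 10, 8)),
]
result = mean_times(incidents)
print(result["detect"], result["repair"], result["between_failures"])
```

Recomputing these figures over a rolling window is what makes a trend like "hours, then 54 minutes, then under nine minutes" visible.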
The other part of SRE at State Farm is focusing on the future. We made a pretty quick determination that we really thought we could move the needle a lot more on applications that we're re-engineering to move to public cloud. We could do a lot around visibility, helping the stability of some of our current systems. But we really felt that we could build the resiliency engineering piece into applications as they move to platforms like AWS, for example.
So we all know public cloud will solve all your problems, right?
Jeremy Castle: Yeah, no.
Andy Hinegardner: But it is a really good platform for resiliency. And there are just easy things that are built into that platform. Take a look at things like your deployment patterns and how you can back that code out if you see any issues in production. Just at a base level, that's a good place to start, and it's pretty easy to get in there.
The next thing we really focused on was the architecture, engineering, and design aspects, and this goes for software and infrastructure as well. We need to be engineering for failure because it will happen, like I stated before, and we need to really bake resiliency into the design, and we all need to do it. That really falls into that culture side of the house as well: everybody really needs to be thinking about these practices and principles, and we want to shift that as far left in this process as we can.
One of the last things we landed on is to make SRE easy, create consistency, build an SRE platform that others can very easily consume, and then we really need leaders to support the SRE initiatives.
Jeremy Castle: What do you mean by SRE as a platform?
Andy Hinegardner: So if you build a monitoring stack, then make that available for everybody to use very easily. If you have a log shipping solution, those types of things from the ops side of the house that teams can implement, and they get out of the box.
So from a leadership perspective, it really all starts here. I can't say enough how much I appreciate the support that we have from a leadership side here at State Farm for SRE and for DevOps. But from the leader perspective, some things that you may want to think about if you're doing DevOps and SRE together: just support the creativity of the team, focus on automation, let the teams experiment, and also expect the teams to know the health of their system, their app, or infrastructure. That should just be a given.
I know that there's more enterprise kind of teams that look more broadly, but if every area is ensuring that their app is healthy and getting notified when it's not, or taking even automated actions if it's not, it's just going to be a much better experience for our customers.
So that also leads to consistency. We're a large organization, as Jeremy stated. We have a lot of different solutions, tools, teams, areas, departments. It can get kind of crazy, the amount of stuff to keep track of. So from an SRE perspective, we're really looking to partner with a lot of different areas to help provide consistency across the board.
You need to give teams the autonomy and responsibility for their solutions, and then give them that time to tackle the technical debt and the toil that I mentioned earlier. The two biggest pieces of advice I could give any leader in this space are to figure out how to measure everything and then automate everything. That should be the first place to start. And if you can be even remotely successful in those two areas, I think you'll be successful in getting site reliability engineering kicked off for your organization.
Closing
Andy Hinegardner: With that, I'd like to say thank you from myself and--
Jeremy Castle: Thank you very much.
Andy Hinegardner: --and we look forward to hearing from you. You can hit us up at these addresses on the screen here. Be happy to have a chat or answer any questions you might have. Thank you very much.
Jeremy Castle: Thank you.