Governance, Compliance, and Risk in the SDLC Can Be a Fun Event!
As the #1 insurer of cars and homes in the United States, State Farm® has embarked on a journey to fundamentally change the way teams deliver software through DevOps. State Farm has reshaped the way teams work and interact, from the adoption of DevOps practices and behaviors to the realignment into empowered product teams. But how do you balance the organizational need to manage risk and provide governance of the Software Delivery Life Cycle at a highly regulated company?
This session provides attendees an in-depth look at the State Farm journey to embed a loosely coupled event architecture into our DevOps toolchain to broadcast key events in the SDLC. This has allowed us to bring better overall compliance with State Farm internal standards and policies, and development teams don't even know they are broadcasting events. Capturing DevOps events and their corresponding data has given us a holistic picture of what really happens during the life of a code change, which leads to opportunities to use real-time data and automation to govern our SDLC instead of the dreaded manual reviews or controls.
Full transcript
The complete talk, organized by section.
Jeremy Castle and Ryan Chambers
Jeremy Castle: Hello, I'm Jeremy Castle.
Ryan Chambers: And I'm Ryan Chambers.
Jeremy Castle: And I'm an architecture director at State Farm.
Ryan Chambers: And I'm a technology engineer.
Jeremy Castle: And me and Ryan are here to talk about "Governance, Compliance, and Risk in the SDLC Can Be a Fun Event!" We've been working on an eventing framework here at State Farm that we've instrumented and hooked into all our developer tools to try to achieve a little bit better transparency into the SDLC. It's going to allow us to build some automation on top of it that's going to make for better governance, compliance, and managing our risk.
01 State Farm and Enterprise Technology context
Jeremy Castle: So let's dive a little bit into the numbers behind the neighbors here at State Farm. We're the number one leading auto and homeowner insurer in the United States. We have 85 million active policies and accounts so far. And then just in terms of size, we have over 57,000 employees, and then we actually have over 19,000 independent agents all over the United States. And in addition to being the number one auto insurer and homeowner insurer, we're also the number two largest life insurer based on policies in force in the United States since 2016.
So a little bit about the Enterprise Technology Department here at State Farm. We've got a little bit under 3,000 software infrastructure developers out of a total of 7,000 employees at Enterprise Technology within our walls. With that, there are hundreds of technologies. So we have Java, mainframe, PL/I, we have WAS, we have public cloud. We have just tons and tons and tons of technologies all over the place. That makes for a complex environment. We have over 2,000 web applications that we host on these platforms, and those are spread out across 1,200 product teams, across 15 different business areas that Enterprise Technology supports. So it's just a large, complex environment.
With us being a financial institution, we obviously have a lot of regulatory things we have to do to maintain compliance. And that really kind of centers around what our talk's about today: how are we going to try to automate this to make it a little bit easier on our developers and really get that DevOps mindset, remove as much friction as possible?
02 Where Jeremy and Ryan fit
Jeremy Castle: So let's start out and talk a little bit where me and Ryan fit into State Farm. It says engineering director. I'm also an architecture director. Just titles, we'll figure that out at some other point, but horizontal enablement for Enterprise Technology, that's really where I sit. My responsibilities are really the application development lifecycle at State Farm. I own the majority of developer tools and the experience around that, and I'm heavily involved with CI/CD practices and the different tools.
Ryan Chambers: And I'm Ryan Chambers. I'm a technology engineer in the Delivery Life Cycle Insights area. I focus on building solutions that help improve the overall delivery experience at State Farm. And I built the framework that we're going to talk about today to make all these events accessible to teams.
Jeremy Castle: Yeah. And Ryan's one of those super engineers that can figure anything out in very little time. So he is a great guy to work with, and I think he's a key piece to making what we're trying to do here successful.
03 Delivery Experience
Jeremy Castle: So let's talk a little bit about just the area. I want to give everyone some context about where we fit in at State Farm a little bit deeper. So I live in a suite called the Delivery Experience. That's what we've called ourselves. I've got about 10 products that sit underneath me, and our mission is to provide one cohesive ecosystem of developer tools and services to improve the experience of our product teams within Enterprise Technology. So this is a pretty cool mission because you actually get to go out and you try to solve problems for developers. You try to make their day-to-day life better. That's actually a really rewarding thing to do.
I know, Ryan, you help a lot of developers, and I think it's pretty cool to see some of the stuff we're able to do. And we've been fortunate enough at State Farm that they've really supported that within our Enterprise Technology Department, and our executives support it.
So if you look at this picture, Delivery Experience is our suite. We sit horizontally across these different business areas, and we're a group of about 60 folks that provide support for developer tools and services. And what are those tools and services? Think about things like SCM, so GitLab. How do you build pipelines in Jenkins? We're big investors in the CI runners that GitLab provides as well. GitOps, I would say we're doing some really cool stuff with GitOps. I have a whole product team around GitOps, and they're trying to enable it, and we have several people that have done talks on that this year, which I recommend. Same with security scanning tools, which we took over from our InfoSec department. That's been a big win for us because I think we bring a developer mindset to some of the security tools we've been trying to provide. Open source APIs, SDL best practices. So really I think about it as just end-to-end developer tools.
And our mission, just at the end of the day, is make everyone's lives a little bit easier and get people thinking about DevOps and how to work differently.
04 The problem: manual governance and fragmented delivery evidence
Jeremy Castle: So that really raises some questions, right? Does your developer experience look like this? Do you have CAB gates used for governance of the SDLC? Do you have to go into a committee and have them look at your changes? They probably don't really understand what is in those changes, but they're going to rubber stamp it before you go to production.
Is your proof of testing manual? Do you have to put those in Word docs? Does a test organization have to fill it out and sign off on it? There's a lot of this. I think a lot of enterprises have this type of flow, or at least they used to, right?
No visibility into potential bottlenecks in the SDLC. How long does it take to get a change to go out to production? Do you do that manually? Is there a way you can automatically do that, or is it just all manual and guesswork, right?
And then just the myriad of tools and platforms to manage a change. How many people do you hand off to? Do you have a change area that has to sign off on things? How many tools do you have to hop through? Do you have to go into GitLab, ServiceNow, just various tools? That makes, I think, a pretty big headache for developers at the end of the day.
05 From checkout to grab-and-go
Jeremy Castle: So me and Ryan started talking, and this probably happened about beginning of 2020, actually. I think we started having some discussions. We were like, "Let's think about what it looks like for a developer to get something out to production."
And we started saying a feature is a lot like a shopping list. You're going to have some requirements. You're going to do some coding. But then did you commit your code? Did you run a security scan? Did you unit test, integration test, system test? What environments did you deploy to? Was your code reviewed by anybody?
Think about when you're trying to bake a cake. Usually you have a shopping list of different things you have to do. You have to go to the store, buy eggs, probably some mix. Make sure you have pans. You have all these different things you've got to do, and it comes into a list. And that's very similar to what a developer has to do when they're creating code and trying to deliver a feature out to production.
So, can we take this concept? Let's think about it as a developer when I push something out the door. So you start thinking about it. I have my list. In today's model, like in a lot of places, there's a cashier. So in the traditional model, you have a person ring up all your groceries. This process takes longer. You've got to stand in line. You've got to talk to a person. They have to ring you up. It can be slow. There's a lot of people in line. You could be waiting a lot longer than you want to. It's the same with what a developer or engineer would traditionally have to face. Typically, I'm going to work through my list, put all my ingredients in a basket, I'm going to come up to the cashier, and someone's going to sign off on it. It's a slow and tedious process for us.
So we started saying, okay, we're getting close to this self-checkout model. If you think about that, this gives the customer additional freedoms and hopefully makes for a smoother experience. So you just go up to your basket. There's an electronic kiosk you go up to. This process takes less time. The lines are typically shorter. You can ring up your own items. You don't have to deal with as many people, many handoffs, right?
Okay, so that's kind of where we're at today, I think, at State Farm. We've enabled a lot of self-checkout, but still, it's pretty painful in some ways because there's still hoops you have to jump through. There's still some gates you have to do. There's manual work.
How do we get to the point where what our developers do day to day can just be self-reported, and maybe they don't have to go through a checkout at all? That's where we really started thinking about where some shopping is going today: you simply scan your app as you walk through the store, you grab your stuff, throw it in a basket, and you leave. There's no lines. You don't have to talk to anyone. You don't have to ring up items. Probably everything's embedded with near-field communication chips. But you basically go in the store to accomplish your goal, to pick up your food, and you just leave. You're automatically charged.
So that was the concept as we started framing this up in our head: can we get to this more grab-and-go model where, hey, whatever you're doing day to day as a developer, we're going to self-record that and then use that to determine whether you're compliant and go out the door?
06 Automated governance reference architecture
Jeremy Castle: So a kind of interesting thing happened. We had this theory: what if our developer tools self-reported important actions, and we built governance around those events? And the funny thing is that me and Ryan have, at a couple points in this journey, at the same time gone back to this paper called "DevOps Automated Governance Reference Architecture" that IT Revolution actually published. I think we were down this journey, and I think we both kind of stumbled across this paper around the same time. We started IMing with each other like, "Hey, have you read this part?" And we're like, "Yeah, this actually seems really what we're trying to do."
So it's laid a foundation and kind of a framework and, I'd say, a mental model for how we're trying to approach this and kind of an architecture. So we've been able to look at this paper, refer back to it, and say, "Hey, are we thinking about this the right way? Is this how we should architect our eventing framework?" So what do you think about that, Ryan?
Ryan Chambers: Yeah. So this reference architecture has a lot of good points in it and walks you through the entire process of those critical things that you really need to implement the pattern for automated governance. The thing that I kind of zoned in on as being the most important thing as we start our journey is the event framework part. So in this document, it talks through how to collect the information, and the event framework's your backbone, and that's where we really started our journey.
07 Architectural principles
Ryan Chambers: So we started out with a few architectural principles in mind. And these helped us keep in line with the direction that we wanted to go.
First one is "don't mind us, we're just listening in." And the goal for this principle is to really collect as much data as possible with as little impact to our development community as we could possibly manage. And we did this by using features of the tooling that we already use, such as the webhooks within GitLab, and by wrapping our custom CLIs, so that there was no impact to our developers for the initial set of events that we're collecting.
Jeremy Castle: Yeah, a lot of modern tools I think we discovered already have webhooks hooked into them, ways to notify. So we used that to our advantage.
Ryan Chambers: Yep. It makes things super simple.
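To make the "listening in" idea concrete, here is a minimal sketch of how a webhook receiver might reduce a GitLab push payload to a small, tool-neutral event. This is not State Farm's code: the function name, the `scm.push` event name, and the output shape are assumptions for illustration. The payload keys it reads (`object_kind`, `checkout_sha`, `ref`, `user_username`, `total_commits_count`, `project.path_with_namespace`) are fields GitLab includes in push webhook payloads.

```python
from datetime import datetime, timezone
from typing import Any


def normalize_push(payload: dict[str, Any]) -> dict[str, Any]:
    """Reduce a GitLab push webhook payload to a small, tool-neutral event."""
    if payload.get("object_kind") != "push":
        raise ValueError("expected a GitLab push payload")
    return {
        "event_type": "scm.push",  # hypothetical event name
        "source": "gitlab",
        "commit_sha": payload["checkout_sha"],
        "ref": payload["ref"],
        "project": payload["project"]["path_with_namespace"],
        "actor": payload.get("user_username"),
        "commit_count": payload.get("total_commits_count", 0),
        "received_at": datetime.now(timezone.utc).isoformat(),
    }


# Trimmed-down sample payload; real payloads carry many more fields.
sample_payload = {
    "object_kind": "push",
    "ref": "refs/heads/main",
    "checkout_sha": "da1560886d4f094c3e6c9ef40349f7d38b5d27d7",
    "user_username": "jsmith",
    "total_commits_count": 2,
    "project": {"path_with_namespace": "claims/intake-service"},
}

push_event = normalize_push(sample_payload)
print(push_event["event_type"], push_event["project"], push_event["commit_sha"][:8])
```

Because the developer only pushes code as usual and the webhook does the reporting, nothing changes in their workflow, which is the point of the principle.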
The second one, social distancing systems. For this one, we wanted to really advocate decoupling processes and systems so that our framework and the automated governance principle could grow over time. Change is a constant. So we know that as time goes on, tooling's going to change, processes are going to change, patterns are going to change. So we want to make sure that our framework could keep up with that and we could adjust as things change.
The third one, you can run, but you cannot hide. So for this one, we knew that we needed to collect a lot of data, but we knew that just having the data by itself wasn't going to be useful. We needed to provide context. So with this principle, what we wanted to do was tie that event back to something, whether that's a commit SHA, a product, an artifact. We need to be able to associate it back to something so that we could paint that end-to-end picture that we're looking for at the end of the day.
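One way to picture this context principle is an event record that always carries the identifiers it can be joined on. The sketch below is an assumption, not the framework's real schema: the `DevOpsEvent` fields and the event names are invented for illustration. It groups events by commit SHA so that each change gets its own ordered timeline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DevOpsEvent:
    event_type: str
    timestamp: str   # ISO 8601, UTC
    commit_sha: str  # the change the event belongs to
    product: str     # the owning product or team


def timelines_by_commit(events: list[DevOpsEvent]) -> dict[str, list[DevOpsEvent]]:
    """Group events by commit SHA and order each group by time."""
    grouped: dict[str, list[DevOpsEvent]] = defaultdict(list)
    for event in events:
        grouped[event.commit_sha].append(event)
    # Same-format UTC ISO 8601 strings sort chronologically as plain text.
    return {sha: sorted(evts, key=lambda e: e.timestamp) for sha, evts in grouped.items()}


events = [
    DevOpsEvent("deploy.test", "2021-06-01T11:00:00+00:00", "abc123", "intake"),
    DevOpsEvent("scm.push", "2021-06-01T09:00:00+00:00", "abc123", "intake"),
    DevOpsEvent("security.scan", "2021-06-01T09:05:00+00:00", "abc123", "intake"),
    DevOpsEvent("scm.push", "2021-06-01T10:00:00+00:00", "def456", "billing"),
]

timelines = timelines_by_commit(events)
print([e.event_type for e in timelines["abc123"]])
# prints ['scm.push', 'security.scan', 'deploy.test']
```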
The fourth one, go, go gadget. So as we built the framework, we knew that we were going to start off small. We wanted to get our feet wet, figure out how things are working, and understand what the demand was. And then as time goes on, we expect this to grow quite large. So the framework needed to be able to scale. In context, we started out with around 15,000 events per day when we first implemented this last year, and now we're at over 150,000, and we're only at the start of our journey.
The last one, robot insurance. I think this one's probably the most important one. We wanted to advocate automation. We wanted to get rid of all the manual processes as much as possible and start really pushing teams and areas that need to provide the governance a way to easily integrate with our framework to automate processes and make things frictionless for our developers.
We did this by making the framework as easy as possible to both publish and subscribe to. So now anybody within State Farm's world can now subscribe to those things no matter what platform they reside on and do their processing as needed.
Jeremy Castle: I think just decoupling those systems has just been incredibly powerful. Taking an event-driven architectural mindset, the advantages that have popped up out of that have just been night and day.
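The decoupling described here can be sketched with a tiny in-process publish/subscribe bus. This is only a toy model of the idea, assuming hypothetical names (`EventBus`, `deploy.production`); the real framework runs on managed cloud services rather than a Python class. What it shows is that a publisher never needs to know who is listening.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Toy publish/subscribe bus: publishers and subscribers only share event types."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, detail: dict) -> int:
        """Deliver the event to every subscriber and return how many received it."""
        handlers = self._subscribers.get(event_type, [])
        for handler in handlers:
            handler(detail)
        return len(handlers)


bus = EventBus()
audit_log: list[dict] = []
alerts: list[str] = []

# Two independent consumers of the same event; neither knows about the other.
bus.subscribe("deploy.production", audit_log.append)
bus.subscribe("deploy.production", lambda d: alerts.append(f"{d['product']} deployed"))

delivered = bus.publish("deploy.production", {"product": "intake", "commit_sha": "abc123"})
print(delivered, alerts)
# prints 2 ['intake deployed']
```

Adding a third consumer later, such as a governance check, would not require touching the publisher or the existing subscribers.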
08 AWS architecture
Ryan Chambers: So we started building this out on AWS. The platform itself made it super easy to hit all those architectural principles that we just talked about. The big services that we leverage for the design are Lambda functions, EventBridge, and Elasticsearch, or soon to be OpenSearch.
Lambdas allow us to do the serverless computing, which makes scaling, reducing cost, and even just pushing quick changes out to production very simple. EventBridge allows us to really build our framework around that event-driven architecture so that we are able to react to events as they are published through the framework and do the data manipulation, storage, or the broadcasting when it's appropriate. And Elasticsearch allows us to really have that visualization piece, as well as the correlation and the ability to query data. That allows us to tie everything together so that we can start building advanced analytics on top of the events and get that end-to-end picture we're looking for.
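As a rough illustration of the publishing side, the sketch below shapes events into EventBridge `PutEvents` entries and sends them in batches. The bus name `devops-events`, the `sdlc.` source prefix, and the helper names are assumptions. The entry keys (`Source`, `DetailType`, `Detail`, `EventBusName`), the limit of 10 entries per call, and the `FailedEntryCount` response field come from the EventBridge API. A stand-in client is used so the example runs without AWS credentials; in a Lambda function you would pass `boto3.client("events")` instead.

```python
import json
from typing import Any


def to_entry(event: dict[str, Any], bus_name: str) -> dict[str, str]:
    """Shape one event as an EventBridge PutEvents entry."""
    return {
        "Source": f"sdlc.{event['source']}",  # hypothetical naming convention
        "DetailType": event["event_type"],
        "Detail": json.dumps(event),          # Detail must be a JSON string
        "EventBusName": bus_name,
    }


def publish(client: Any, events: list[dict[str, Any]], bus_name: str = "devops-events") -> int:
    """Send events in batches of 10 (the PutEvents limit); return the failed count."""
    failed = 0
    for start in range(0, len(events), 10):
        batch = [to_entry(e, bus_name) for e in events[start:start + 10]]
        response = client.put_events(Entries=batch)
        failed += response.get("FailedEntryCount", 0)
    return failed


class FakeEventsClient:
    """Stand-in for boto3.client("events") so the sketch runs offline."""

    def __init__(self) -> None:
        self.batches: list[list[dict[str, str]]] = []

    def put_events(self, *, Entries: list[dict[str, str]]) -> dict[str, Any]:
        self.batches.append(Entries)
        return {"FailedEntryCount": 0, "Entries": [{"EventId": str(i)} for i in range(len(Entries))]}


fake_client = FakeEventsClient()
sample_events = [
    {"source": "gitlab", "event_type": "scm.push", "commit_sha": f"sha{i}"} for i in range(23)
]
failed_count = publish(fake_client, sample_events)
print(failed_count, [len(batch) for batch in fake_client.batches])
# prints 0 [10, 10, 3]
```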
09 Questions the event framework can answer
Ryan Chambers: So as we started this journey, we had quite a few questions that we could probably answer through some manual work, but it was very hard to get to. So those were the things that we started with as we started collecting data and seeing if we could just answer them with the data that we had.
So the first one is the frequency of a Git push. How frequently are our developers pushing to GitLab? We have quite a few projects, as we noted on the first two slides. This is a big organization. So what we were able to see was that we average around 15,000 pushes across 2,000 projects in a single day. There's much more than 2,000 projects, but that's how frequently they get pushed to.
The second one, security scans. At State Farm, each component that gets pushed to production has a wide variety of requirements, and security scans are one of those, and we do a wide gamut of security scans. And we were able to associate those back to the products themselves, but the process sometimes included manual work, and we wanted to get away from that.
So today with our event framework, we're able to tie those security scans back to the products as they're pushed to production, and we're also able to determine what security violations are in those scans and what type of scans they are. So today, we do around 20,000 security scans on a daily basis. Of those security scans, 17,000 are secret scans, so we're looking through GitLab for anybody pushing secrets by accident. 1,500 of those are dependency scans, so just analyzing the composition of the application, figuring out if anything needs to be updated or if there are security findings for any open source dependencies. And 1,500 are static scans, so just inspecting the code, looking for best practices.
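The breakdown quoted above is the kind of count that falls out of the event stream directly. The sketch below assumes each scan produces an event with a `scan_type` field (a hypothetical field name) and rebuilds the daily figures from the talk using synthetic events.

```python
from collections import Counter


def scan_breakdown(events: list[dict]) -> Counter:
    """Count security scan events by scan type."""
    return Counter(e["scan_type"] for e in events if e["event_type"] == "security.scan")


# Synthetic day of events matching the figures quoted in the talk.
day_of_events = (
    [{"event_type": "security.scan", "scan_type": "secret"}] * 17_000
    + [{"event_type": "security.scan", "scan_type": "dependency"}] * 1_500
    + [{"event_type": "security.scan", "scan_type": "static"}] * 1_500
    + [{"event_type": "scm.push"}] * 5  # non-scan events are ignored
)

breakdown = scan_breakdown(day_of_events)
total_scans = sum(breakdown.values())
for scan_type, count in breakdown.most_common():
    print(f"{scan_type:<10} {count:>6} {count / total_scans:.1%}")
```

With these figures, secret scans account for 85% of the daily total, and dependency and static scans for 7.5% each.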
And then the third one is lead time between change and deployment. This one, I think, is super important, and I was very surprised by the results. Today it takes around 25 hours from the time a developer pushes a code change to GitLab to the point that it is approved by a manager to go into production. This amazed me because back when I started, it took months to get to production. So the fact that we're able to get down to 25 hours is really good, and hopefully by the end of this journey, we're able to get that even reduced further.
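A lead time figure like this can be derived by pairing the first push event with the first approval event for the same commit SHA. The sketch below is a simplified assumption of how that might work: the event names `scm.push` and `change.approved` and the field names are hypothetical, and a real calculation would need to decide how to treat re-pushes, rejected approvals, and squashed commits.

```python
from datetime import datetime
from statistics import median


def lead_times_hours(events: list[dict]) -> dict[str, float]:
    """Hours from first push to first approval, per commit SHA."""
    pushed: dict[str, datetime] = {}
    approved: dict[str, datetime] = {}
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        sha = event["commit_sha"]
        if event["event_type"] == "scm.push":
            pushed[sha] = min(ts, pushed.get(sha, ts))
        elif event["event_type"] == "change.approved":
            approved[sha] = min(ts, approved.get(sha, ts))
    # Changes that were pushed but never approved are left out.
    return {
        sha: (approved[sha] - pushed[sha]).total_seconds() / 3600
        for sha in pushed
        if sha in approved
    }


change_events = [
    {"event_type": "scm.push", "commit_sha": "abc123", "timestamp": "2021-06-01T09:00:00+00:00"},
    {"event_type": "change.approved", "commit_sha": "abc123", "timestamp": "2021-06-02T10:00:00+00:00"},
    {"event_type": "scm.push", "commit_sha": "def456", "timestamp": "2021-06-01T12:00:00+00:00"},
    {"event_type": "change.approved", "commit_sha": "def456", "timestamp": "2021-06-01T18:00:00+00:00"},
    {"event_type": "scm.push", "commit_sha": "fff999", "timestamp": "2021-06-01T13:00:00+00:00"},
]

lead_times = lead_times_hours(change_events)
print(lead_times)
print(median(lead_times.values()))
# prints {'abc123': 25.0, 'def456': 6.0} and then 15.5
```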
Jeremy Castle: Yeah. And the interesting part for me, sitting in a director's seat, is I get asked a lot of times, "Well, what is our lead time for changes to production? How many scans are being performed?" It's really hard to answer that. Before, we had to skim through logs and estimate counts. With this, it's all automated, and you're getting 100% accurate data because it's based on what people are actually doing.
So with our tools publishing these events, we've been able to answer these questions with some accuracy. And now we can innovate on this stuff and go, "Hey, what are some things we might want to potentially do in these spaces that we really couldn't do before?" And it's all set up to be automated. And actually, there's a couple of things that really surprised us.
10 Unexpected value and use cases
Ryan Chambers: Yeah. So the unexpected value and use cases that we were able to determine right off the bat. We started out with tagging our events with that context that I discussed earlier. So as we deploy things out both to test and production, we attach metadata to it that relates it back to the source code, the change, so the commit SHA within that source code, and also the product within our organization. So something that we know that ties back to the team.
When we do this, that information allows us to make some really cool observations. An example of this is costing came to us and wanted to understand the cost of a Cloud Foundry application as it sits in each of the environments based off of how much memory it's using, how much processing is being done within the application. So with this, we were able to quickly tie that back to the organization and give them that information to the point where they were able to plug into their model and move along and figure out what they needed to.
The other side effect of this is that now, as we're publishing this across the organization, teams that have downstream dependencies are able to identify when their downstream dependencies change, plug that into their alerting, and that helps them get to root cause analysis if there are issues much quicker.
CLI instrumentation. This initial effort would not be possible if we didn't have wrappers around many of our CLI interactions that we do within our delivery lifecycle. So with the CLI instrumentation, we were able to plug that in, and we were able to get these events without any impact to the developers. They just had to upgrade the version, which we moved automatically, and there, we're getting those events.
The cool thing that we noticed with this was this not only helped us figure out those various actions, like how many scans are happening or what the lead time was, but also helped the teams that support those custom CLIs do their job. So now we're not only getting information around those actions, but they're getting information around errors occurring, how frequently certain commands are used within their CLIs. This is stuff that we never had before, and it helps us really troubleshoot issues when people come and report them.
And it also helps us do things like sunset or transition versions. Before, we broadcasted to a whole bunch of people, not knowing if they got the message or not. Now, using this data, we can reach out to our consumers directly and say, "Hey, you're on this old version" or "You're on this tool that's going to go away. Now is the chance to move, and here's some documentation on how to do it."
Jeremy Castle: Yeah. Eventing has allowed us to know usage rate down to the exact person and job and time that it's happening. So that's been really cool to see.
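A CLI wrapper of the kind described here might look roughly like the sketch below. It is an assumption about the general shape, not the actual wrappers: the function name, the `cli.invocation` event name, and the event fields are hypothetical. The wrapper runs the real command untouched, records the exit code and duration, and swallows any telemetry failure so the developer's command is never affected.

```python
import os
import subprocess
import sys
import time
from typing import Callable


def run_instrumented(argv: list[str], emit: Callable[[dict], None]) -> int:
    """Run a command unchanged and report what happened as an event."""
    started = time.monotonic()
    completed = subprocess.run(argv)  # output passes straight through to the user
    event = {
        "event_type": "cli.invocation",  # hypothetical event name
        "tool": os.path.basename(argv[0]),
        "args": argv[1:],
        "exit_code": completed.returncode,
        "duration_ms": round((time.monotonic() - started) * 1000),
    }
    try:
        emit(event)
    except Exception:
        pass  # telemetry must never break the developer's command
    return completed.returncode


captured: list[dict] = []
exit_code = run_instrumented(
    [sys.executable, "-c", "import sys; sys.exit(3)"], captured.append
)
print(exit_code, captured[0]["exit_code"])
# prints 3 3
```

Events like these are what make error rates and command usage visible to the teams that own the CLIs. A production wrapper would also want to scrub arguments before emitting them, since command lines can contain secrets.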
Ryan Chambers: And the last one is dashboards and alerts. So we leverage Elasticsearch, and that allows development teams to plug in and create dashboards using Grafana or Kibana on the fly. So they're able to create panels that show when their applications are deployed and what security scan vulnerabilities pop up, and create alerts based off those things.
11 The cart is full
Ryan Chambers: And now we're at "our shopping cart is full." If you go back to our shopping experience analogy, we've hit all those checklist items. We're capturing that data automatically. We're capturing the GitOps information and the evidence-of-test repository information. At State Farm, we have a centralized place where we put all of the scans and tests done on an application. We're putting those through the DevOps Eventing Framework with the events that correlate back to the organization.
Now, the future goal here is to get to the point where we get to that grab-and-go model. So we collect all the information behind the scenes, and now the teams are able to get through that experience much quicker and with much less friction.
12 What the future holds
Ryan Chambers: Now, what does the future hold? Transparency of changes in the environment. We want to get that end-to-end picture. We're collecting all this data. We want to tie it back to the various things that it impacts and get a full picture of what's actually happening.
And that allows us to really get to that automated governance model. So we're collecting a lot of data right now. Now we want to be able to use that data to say, "Hey, are you meeting this requirement? Are you doing the things that you need to be doing in order to go to production?" Identifying the level of risk that's going on with the change that you make, which leads into the analysis of code change in almost real-time. We're able to identify the exact changes that occurred on the application and what is going into production at this point. Now we can inspect those code changes and really see what the level of risk is. Was it a small change? Was it a large change? How frequently do changes occur? We can use that information to really guide both the development team and the areas that govern these things on how they perceive the change.
And the last one is analytics of quality of change. We can look at the quality of the change and see what was changed in there and how useful it is for the organization.
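The automated governance goal described in this section can be pictured as a policy check over the events collected for one change. The sketch below is purely illustrative: the required evidence set, the 500-line risk threshold, and the event and field names are invented, and real policy would come from the teams that own governance. It only checks that each kind of evidence exists, not whether a scan or test actually passed.

```python
REQUIRED_EVIDENCE = {"scm.push", "code.reviewed", "security.scan", "test.passed"}  # hypothetical policy
HIGH_RISK_LINES = 500  # hypothetical threshold for a "large" change


def evaluate_change(commit_sha: str, events: list[dict]) -> dict:
    """Check one change against the evidence policy and size-based risk rule."""
    mine = [e for e in events if e["commit_sha"] == commit_sha]
    seen = {e["event_type"] for e in mine}
    missing = sorted(REQUIRED_EVIDENCE - seen)
    lines_changed = sum(
        e.get("lines_changed", 0) for e in mine if e["event_type"] == "scm.push"
    )
    return {
        "commit_sha": commit_sha,
        "compliant": not missing,
        "missing": missing,
        "risk": "high" if lines_changed > HIGH_RISK_LINES else "low",
    }


evidence = [
    {"event_type": "scm.push", "commit_sha": "abc123", "lines_changed": 40},
    {"event_type": "code.reviewed", "commit_sha": "abc123"},
    {"event_type": "security.scan", "commit_sha": "abc123"},
    {"event_type": "test.passed", "commit_sha": "abc123"},
    {"event_type": "scm.push", "commit_sha": "def456", "lines_changed": 900},
    {"event_type": "security.scan", "commit_sha": "def456"},
]

small_change = evaluate_change("abc123", evidence)
large_change = evaluate_change("def456", evidence)
print(small_change)
print(large_change)
```

In this sample, the small change has all four pieces of evidence and is rated low risk, while the large change is missing review and test evidence and is rated high risk. That is the "grab-and-go" decision made from data instead of a manual gate.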
13 Closing
Jeremy Castle: Just to wrap this up. For me in a leadership spot, having the ability to look at the transparency of the changes in our environments, and being able to confidently go to our auditing and risk folks and say, "Hey, this is exactly what's happening in our environments. We're not manually auditing change records. Now I can do this off the data and dashboards, and you can visualize things," that's incredibly powerful, and we didn't really have that in the past.
We were doing all the right things. It was just very manual and intensive, and it added a lot of friction. I think having this framework in place really sets us up for the future to do some interesting things with the data. Where can we put machine learning and AI on top of it? And we didn't have that in the past. And now the future looks really bright for us, and to have the ability to automate a lot of these governance things and checks and remove a lot of friction from our developers' day-to-day lives. And that's actually something really exciting. That's part of our mission. That's why we get out of bed in the morning, at least sometimes. And it just makes it, I think, a good working environment for developers and engineers.
So I'd like to thank you for attending our talk. Me and Ryan have our email. Contact us if you have any questions or want to learn more. We'd love to talk to you. And just thank you, we really appreciate you attending our talk.
Ryan Chambers: Yep. Thank you.