From Insights to Progress: Utilizing Compliance as Code to Optimize Deployment Practices
A core principle of our microservice platform has always been compliance by design. That is, the platform should ensure that its customers are not hindered by processes and that being compliant comes with ease.
As a financial organisation we have many compliance areas, and one that is core to being able to work with DevOps practices is, of course, the change process. Early on we managed to get a new type of change introduced, the automated change, which enabled us to create fully automated pipelines to production. This is one of many successes that have led us to 70 teams with 200 applications in production on the platform today.
But even though we have given the teams the ability to deploy, each deployment must be impact assessed to evaluate the risk of the change. That assessment has always been a subjective evaluation done by the responsible product owner.
We have worked with our process office to move away from this subjective model to a full Compliance as Code scoring model that evaluates each deployment based on a set of key metrics. These metrics are based on the teams' previous experiences, the practices they follow, and the content of the upcoming deployment.
We want to share how this, with the right tonality and a clear way of communicating what actions a team can take to reduce the impact assessment, has pushed the DevOps culture even further into the teams using the platform.
Full transcript
Christian Edström
This presentation is called "Utilizing Compliance as Code to Optimize Deployment Practices." We work for a bank.
We were saying that it looks like we were working for NASA or something like that, with a cool space-themed thingy there. And we'll get to why.
Some short background: Swedbank, or "Swede Bank," I've heard that pronunciation as well, is a large retail bank in Scandinavia. It's based in Stockholm. It's 200 years old. We have mainframes, and we have pretty much everything out there. We've been around for a while.
Approximately 17,000 employees. To complicate things even further, we also have 60 local banks in Sweden. That opens up for some interesting challenges for us in IT. So we are not only one bank. There are many, many of them.
My name is Christian. I've been at the bank for 25 years. I know nothing outside the bank. I have a software engineering background. Started off in COBOL, then moved into Java, and now mostly PowerPoint and Excel.
Markus Backman
I'm Markus Backman. Twenty-three years. I will never catch up, unfortunately, because you will most likely never...
Christian Edström
You're junior.
Markus Backman
I'm junior. Started as a software engineer. Now, well, architect. So that means PowerPointing and those types of things.
I like Internet of Things as well. If we have the time, we'll show something. But let's see.
Christian Edström
Yes. So we'll give you some background story. Our platform is called Andromeda. It's a very cool generated space picture. Instead of producing 150 PowerPoint slides, we decided, when in the U.S., do it Hollywood style. So we have a small sci-fi movie for you. If you don't like space, you have the opportunity to leave now.
Video
Ladies and gentlemen, citizens of Las Vegas and planet Earth, we confronted a towering legacy. We once inhabited a monolithic world bound by limitations, confined by the familiar: a codebase of 16 million lines, born in the year 1999, carrying with it 204 releases of the past.
The monolith was ingrained in our culture, and our efforts to modernize it fell short.
We recall the moment when we envisioned the Andromeda platform, born of the microservices philosophy. The platform should define our ideals of collaboration and decentralization, where each team would be given a solar system of their own.
Technology was easy to change. The culture was to be the epic challenge.
The year 2019 marked our first steps toward Andromeda, when we launched the first rocket that enabled teams to bend space-time in this new galaxy.
We are now 75 teams strong, collectively running 241 applications. We have seen a steady uptake from teams using the balanced service provided. The teams are themselves responsible to ensure the survival of their own service system.
The space-time service was the fertile ground to enable teams to safely explore on their own. An exponential increase of teams: the journey from the monolith to Andromeda. We have witnessed 6,071 deployments in this year alone.
Even with this exponential increase of introduced changes, we are yet to see any instabilities in the galaxy.
But amidst our cosmic success, it's worth noting that back on Earth, our monolith experienced an unscheduled rapid disassembly, struck by a meteor: a critical incident that affected everyone back on Earth.
This serves as a stark reminder of the importance of our journey to Andromeda, where the fragility of our past is replaced with resilience for the future.
But a critical incident escapes no one. So how would our galaxy rise to the challenge?
Once again, this, my friends, is our legacy: a journey from Earth to Andromeda, a journey that transforms the cosmos and ourselves.
Christian Edström
So that is what can happen when two middle-aged engineers get tired of PowerPointing...
Markus Backman
And start using a lot of machine learning tools. It's really funny.
Christian Edström
Exactly. So we'll try to fast-forward to the mature part of that movie.
Andromeda, that's the name of our microservices, CI/CD pipelines, all the cool, fancy buzzwords of today. The one-liner of this platform is: a developer-friendly application development platform based on microservices architecture, with managed services for developing, deploying, and operating cloud-native applications.
And the reason we named this Andromeda is that we think space is a lot cooler than banking. And 2.5 million light-years, that's the distance from Earth to Andromeda. That represents the challenge that we were up against.
Markus Backman
The culture.
Christian Edström
The culture.
Before we wrote a single line of code on this new platform, we had some goals set up for us: daily deployments, compliance by design. We should be able to scale this platform group-wide. As I mentioned, we're a Scandinavian bank. We're also in the three Baltic countries. It needed to be able to grow.
Team independence. Teams should be able to push some kind of a button at their own will, without being dependent on anyone else.
And last but definitely not least, developer experience. It should be fun to work on this, and with this platform. It should use modern technologies.
None of that came, or comes even now, without challenges. This presentation will be mostly about people and process, not so much about technology. But we'd love to talk about that outside this presentation.
Being an old bank, we had processes designed for some human intervention. You know, send a Word document to this place, or check these boxes in an admin tool of your choosing. That needed to go away in order for us to deploy and do things a lot faster.
Markus Backman
The technology.
As I said, I'm not going to go deep into this at all. Basically microservices. We are mostly a Java shop, so Java, Spring Boot, container-based. Nothing strange, to be honest.
Of course, we use the space theme all over, for all the services. So we have a service called Black Hole, which represents how to communicate with the mainframe, which is a logical name for a mainframe. So why not?
Our golden path is called Space-Time, which was shown in the movie. It is basically how you bend space and time to get closer, and you can actually move from Earth to Andromeda much faster if you use the golden path. So we're just trying to have a bit of fun with the naming.
But if you want to have a conversation about technology and choosing tools and so on, please use Slack for any comments. We are happy to try to respond to as many as we can later on as well.
Processes. That is, of course, a bit more challenging. When we started, we said we should be able to deploy daily and so on, which makes the change process one of the key things.
And as Christian mentioned, it basically looked like: "Okay, you want to do a deployment to production? Sure, absolutely. Just create a change request."
"A change request?"
"Well, yeah. You need to fill in a lot of things in a form. How have you tested this? Have you done this? What's the impact? What's the possible impact of this? Will it impact customers or only employees?"
So you need to answer a bunch of questions, and then hopefully you get the approval back saying, "Sure, you can go to production." But that still means a human needs to write stuff.
So we actually went to our process office and said, "That's not really good enough. We want to change this."
I would say this slide represents two years of my life, talking with our process office and a lot of other parts of the organization to educate them about what is actually possible if you think in another way with technology and automation.
So we got them to change. Now we actually have a change type called automated change, which any pipeline can opt into by saying, "Absolutely, we want to take part and be able to create automated changes." Then come a lot of requirements on what you need to have in your pipeline, from the CI part all the way to the continuous delivery part.
So now we have a process. We have a process of onboarding. We have a process of quality assurance of pipelines. But the Andromeda platform was the first that actually did this.
I said two years of my life, lots of education, lots of talking, but at least now we have this.
But just having the possibility to automate isn't the only thing that is important, because then comes the problem: how do we actually change the culture, so that people start doing changes more often?
Christian Edström
Exactly. And we put a lot of effort into this part.
This is the developer portal, available for anyone. We collect and publish statistics on the platform. These numbers, as we said before the presentation, are a month old.
Markus Backman
Yeah. With the numbers updated a week ago, I think it's 205 applications, so three more applications, and then we have 7,500 deployments this year.
Christian Edström
You pointed out that in the movie we talked about 6,000-something deployments. Now it's roughly 1,000 more of those.
Markus Backman
Yeah. Just in September alone it was 1,000 deployments. So it's definitely scaling in a really fun way.
Christian Edström
I mentioned the word fun earlier.
Gamify deployment. What used to be a very scary thing, to deploy once a month, a lot of agony and pain and sleepless nights, we want to shift that 180 degrees.
We have badges you can earn if you dare, for example, to put something into production on Christmas Eve, or on Saturday...
Markus Backman
Black Friday.
Christian Edström
Black Friday, perfect day.
Markus Backman
Black Friday, right, in the U.S. Yes. Perfect day, actually. I think it's over 50 or something of these different badges a team can get. Everything like, you bet your Friday, you bet your weekend plans if you deploy very late on Friday.
Christian Edström
Still, yes, there's fun. If you feel comfortable, do it. And we just try to have fun with it.
We communicate successes, the metrics, the team of the month. And this goes not only to the development teams themselves, but to upper management, and the upper-upper management, et cetera.
Really important to show how they can improve themselves as well. So a lot of effort was put into the tonality: we do not want to be an intergalactic police force chasing people, but rather encourage them to do the right thing on their own.
Markus Backman
But sometimes policing is fun.
Christian Edström
Sometimes policing is needed.
Markus Backman
Is needed.
We send weekly reminders. Every developer gets this in the mailbox on Monday morning. It sums up all the services they are part of and own, showing what is needed from a maintenance perspective, such as version upgrades, and whether there are unattended pull requests. If there are a lot of pull requests that have been approved but not put in production, we basically try to push them, so they know what to do on Monday morning.
Christian Edström
This slide, now we are back where the movie ended: the unscheduled rapid disassembly, when something horrible hit planet Earth and the monolith.
We were infamous, or famous, in the Swedish news, and also I think to some extent outside Sweden as well. We got a fine of $82 million. That's a lot, given the price of the dollar for us these days.
And the reason was that the Swedish FSA said that we did not follow our internal procedures, and we did not have suitable controls.
We've actually had other banks in Scandinavia thanking us that we got hit by this, so they know what they must avoid. We're happy. It's always good to be first, right?
So from a customer perspective, a lot of our customers got negative balances on all their accounts. They couldn't buy anything. They were standing at the till, and all the cards were declined.
Markus Backman
My mom called me and asked where all the money went.
Christian Edström
Yeah. Because you take your card, can't get any money, it's basically declined. You open it up: "Look, I have minus a lot of money." Of course you get worried. So you understand, it was not a fun experience for our customers. Not a fun experience, definitely, for Swedbank either.
So now we'll get to the corporate reaction to this, from one angle at least.
So Markus, what usually happens after something horrible happens?
Markus Backman
Well, the FSA said that our internal processes didn't give good enough control. And basically everything started with a change that someone made in production, and that escalated into customers getting negative balances.
So then it comes down to the change process. The process office owns that process, and what was their reaction when this happened? Change freeze. No one is allowed to do anything in production. That's usually how it goes.
I don't think that's the best option, because then you start patching up stuff, and the next time you do something, it's higher risk and so on. But that's kind of...
This is actually a slide from last year. I don't remember who actually put the slide up.
Christian Edström
You mean from this conference?
Markus Backman
Yeah, from this conference, absolutely.
So the thing is, we have the software engineer versus governance, with compliance and audit. You have the different views. From the developers: "I just want to get my feature out." But then the governance part, like the FSA, requires us to have good control of changes.
Of course, if we just let software engineers decide, it's going to be feature Wild West. There will be things going and so on, and then we can risk more fines. So that's not good.
If we use governance, risk, and compliance, and all that, they will have a 12-week evidence review for every line of code change you want to put in production to make sure everything is fine. And we don't want that either.
So usually you come up with the change process saying, "Okay, fill in this form, and then we will have a decision point in an amazing CAB meeting."
Christian Edström
Change advisory board.
Markus Backman
Change advisory board. Does everyone know what that is? It's a horrible thing, right? No one is sitting on one?
Of course it has its purpose, maybe. But for me it's like, I want to move here: the automated governance part. Because we now have 7,400 changes this year. Those can't all end up in the change advisory board. They will not want that, trust me. The process office doesn't want that, because they don't know how to handle that sheer amount of changes. But we still need to have good control.
So how do we do it?
As I mentioned, I was here last year. When I got back, I started to think a bit on this topic. And of course we started having a conversation. The process office went through our organization trying to see how we should actually control things, now that we had gotten this fine and so on.
The good thing with Andromeda is that no team has had production access, from the beginning. You can request it during an incident, because of course that's one way to make production changes that are not fully controlled. But we have a process for that, and all changes are recorded. So we know about all changes. Nobody can go into production and make an unauthorized change. So we have that.
The problem is that the service owner is then responsible for doing a so-called risk and impact assessment. So before they press the deploy button, which they have access to, they need to ask: does this have a big impact? If it has a big impact, it will require a CAB meeting.
So I got back. I took the 3,500 deployments from last year. We are already at more than double that, interestingly. And I looked at them, and not a single service owner had raised even one of those 3,500 deployments as a large impact.
Could be, could be not. I'm not sure. So I thought maybe that is the scenario that we should focus on.
All deployments are recorded. We get a lot of metrics from there. We have a standard platform, standard pipelines, and everything. So we get the metrics of everything happening from the teams and all services.
So what I sat down to do was basically to see if I could create some form of scoring model that in the end comes up with an impact score between, to make it simple here, zero and 100. And then, together with the process office, sit down and talk and say, "Where do we draw the line? Where do we think that we are high enough that this is a major or critical change that requires a CAB?"
So I presented this just after Christmas, New Year's actually, right at the beginning of this year. And they said, "Yeah, absolutely. This looks really, really promising, because we don't want 7,500 CAB meetings. So please fix this for us, but in a better way than we can with others."
So that means the model, the automated impact scoring, looks at two sides. It looks at the application: basically the version they want to deploy and the difference from the version that is already in production. So, the change.
It also looks at the team owning the application, because we actually want the team to have good practices, risk- and impact-aware practices.
These are some of the metrics that we are using today. If you have a business-critical application from the perspective of the bank, you need to score a lot better in other areas to basically be allowed to push things, because the impact is higher for the business.
Of course, we look at whether you're batching large changes; we all know that is higher risk. If you haven't deployed for a long time, higher risk.
We also look at all the commits. We use Conventional Commits for all commit messages, so there we can see what kind of change it is. We also check the cognitive complexity of the application. If it's very, very complicated to understand and read, there's a higher risk in changing it.
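As an illustration of how Conventional Commits messages could feed a score, here is a minimal sketch. The talk does not say how commit types are weighted, so the weights, the names `TYPE_RISK`, `commit_risk`, and `change_risk`, and the penalty for unparseable messages are all assumptions. Only the header format (`type(scope)!: description`) and the `BREAKING CHANGE:` footer come from the Conventional Commits specification.

```python
import re

# Header format from the Conventional Commits spec: type(scope)!: description
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<desc>.+)$"
)

# Hypothetical risk weights per commit type; not taken from the talk.
TYPE_RISK = {"feat": 3, "fix": 2, "refactor": 2, "perf": 2,
             "chore": 1, "docs": 0, "test": 0}
BREAKING_RISK = 10     # assumed weight for a breaking change
UNPARSEABLE_RISK = 5   # assumed weight when the message does not follow the format


def commit_risk(message: str) -> int:
    """Return an illustrative risk weight for a single commit message."""
    lines = message.splitlines()
    header = lines[0] if lines else ""
    match = HEADER.match(header)
    if match is None:
        return UNPARSEABLE_RISK
    # A breaking change is marked by "!" in the header or by a footer line.
    breaking = match.group("breaking") is not None or any(
        line.startswith(("BREAKING CHANGE:", "BREAKING-CHANGE:"))
        for line in lines[1:]
    )
    if breaking:
        return BREAKING_RISK
    # Unknown but well-formed types get a small default weight.
    return TYPE_RISK.get(match.group("type"), 1)


def change_risk(messages) -> int:
    """Sum the weights of all commits between the deployed and the new version."""
    return sum(commit_risk(m) for m in messages)
```

For example, `change_risk(["feat(api): add endpoint", "docs: fix typo"])` adds the assumed weight for a feature to the zero weight for a documentation change.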
We look at whether you use automated test cases. The pipeline has four stages. If you don't use all four stages, we are basically saying, "Might be higher risk," and so on. We also look at critical vulnerabilities.
On the team side, I think that is more important, actually, because there we are trying to push the team's culture, how they work in their practices. So are they using feature flags? Are there enough people actively working with the application? Otherwise it's a risk for us if only one person knows it and that person leaves.
Have they done the secure coding practices, and what is their score there? Have they actually tested rollback, and so on?
So these are practices we want the team to have done, and doing them reduces the score. And therefore they will get their changes out.
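To make the two-sided model concrete, here is a minimal sketch of an impact score from zero to 100 that combines application metrics with team practices. The metric names follow the ones mentioned in the talk, but every weight, cap, field name, and the additive formula itself are assumptions for illustration. The real model's weighting is not described.

```python
from dataclasses import dataclass


@dataclass
class ApplicationChange:
    business_critical: bool
    commits_in_change: int          # size of the batch being deployed
    days_since_last_deploy: int
    high_cognitive_complexity: bool
    pipeline_stages_used: int       # out of the four test stages
    critical_vulnerabilities: int


@dataclass
class TeamPractices:
    uses_feature_flags: bool
    active_contributors: int
    secure_coding_done: bool
    rollback_tested: bool


def impact_score(change: ApplicationChange, team: TeamPractices) -> int:
    """Illustrative impact score, clamped to 0..100. All weights are assumed."""
    score = 0
    # Application side: each risk factor adds to the score.
    if change.business_critical:
        score += 25
    score += min(change.commits_in_change, 20)            # large batches
    score += min(change.days_since_last_deploy // 7, 10)  # long gaps
    if change.high_cognitive_complexity:
        score += 10
    score += 5 * max(0, 4 - change.pipeline_stages_used)  # skipped stages
    score += min(10 * change.critical_vulnerabilities, 20)
    # Team side: good practices reduce the score.
    if team.uses_feature_flags:
        score -= 10
    if team.active_contributors >= 3:
        score -= 5
    elif team.active_contributors < 2:
        score += 10                                       # single-person risk
    if team.secure_coding_done:
        score -= 5
    if team.rollback_tested:
        score -= 10
    return max(0, min(100, score))
```

With these assumed weights, the same risky change scores lower for a team that uses feature flags and has tested rollback than for a team that has not, which is the incentive the model is meant to create.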
Today, you get an email after the deployment. Soon this is going to be a stopper for you: before you press the deployment button, we do the calculations and either say, "Sorry, it requires a CAB meeting. Set up the change," and so on, or, "Yes, go ahead. Green light," and you press the button.
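The planned stopper can be pictured as a gate evaluated before the deploy button becomes active. The sketch below assumes a score from zero to 100 and two cut-off values. The thresholds, the label "minor", and the names `GateDecision` and `evaluate_gate` are hypothetical, since the actual line was agreed with the process office and is not given in the talk.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the real cut-offs are not stated in the talk.
MAJOR_THRESHOLD = 60
CRITICAL_THRESHOLD = 80


@dataclass(frozen=True)
class GateDecision:
    classification: str   # "minor", "major" or "critical" (labels partly assumed)
    requires_cab: bool
    message: str


def evaluate_gate(score: int) -> GateDecision:
    """Decide before deployment whether a change may proceed automatically."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= CRITICAL_THRESHOLD:
        return GateDecision("critical", True,
                            "Sorry, it requires a CAB meeting. Set up the change.")
    if score >= MAJOR_THRESHOLD:
        return GateDecision("major", True,
                            "Sorry, it requires a CAB meeting. Set up the change.")
    return GateDecision("minor", False, "Yes, go ahead. Green light.")
```

A pipeline step could call `evaluate_gate(score)` and only enable the deployment when `requires_cab` is false.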
But here is an example: a team that deployed something that actually got a major. In the email you get what you're doing well to reduce the impact, and what your team should actually do to reduce the impact in the future, to not end up here. So we're very, very transparent in saying: if you do these things, you can deploy as much and as often as you want.
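A feedback message of this kind could be assembled from the same practice data that drives the score. The sketch below is only an illustration: the practice keys, the wording, and the function name `build_feedback` are assumptions, and the real email's content is not reproduced in the talk.

```python
# Hypothetical practice keys and advice text; not the wording of the real email.
PRACTICE_ADVICE = {
    "feature_flags": "use feature flags to separate deployment from release",
    "rollback_tested": "test a rollback of the application",
    "all_pipeline_stages": "run all four test stages in the pipeline",
    "small_batches": "deploy smaller changes more often",
}


def build_feedback(team: str, classification: str, practices: dict) -> str:
    """Build a plain-text message listing practices followed and practices missing."""
    done = [advice for key, advice in PRACTICE_ADVICE.items()
            if practices.get(key, False)]
    todo = [advice for key, advice in PRACTICE_ADVICE.items()
            if not practices.get(key, False)]
    lines = [f"Hi {team}, your latest deployment was classified as {classification}.",
             "",
             "What you already do to reduce the impact:"]
    lines += [f"  + {item}" for item in done] or ["  (nothing recorded yet)"]
    lines += ["", "What your team can do to reduce the impact in the future:"]
    lines += [f"  - {item}" for item in todo] or ["  (nothing, keep going)"]
    return "\n".join(lines)
```

Calling `build_feedback("Team Orion", "major", {"feature_flags": True})` with a made-up team name would list feature flags under what the team already does and the remaining practices as suggestions.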
So that email has been out since March. And what we can see now is that the major and critical classifications have dropped quite well, especially the majors: down 45% from when we started doing these calculations up until now.
So the teams have adapted. They're doing the new practices as described in the email. We have less risk, and we don't have any large-scale things breaking the platform. And we hope to continue that.
And this is one of the things that we want to push. For me, I'm really happy. And of course, as soon as we launch the version where you're actually not allowed to press the button, instead of getting the email after the fact, this will be enforced.
Next steps, and here are things that I would like to have a lot of conversation with you about. How can we move on? Because for me, I see we need to learn from the smaller incidents that might occur, of course.
Should we have different models if there's a new team versus a more experienced team? Is there a way to do that? Including more measurements, especially when it comes to resiliency, needs to be there, because that's kind of missing.
And I will actually do this because, as I said, I'm an IoT geek. This is just an extra fun part.
All the deployments are happening all the time, but I kind of like to visualize it. So I was actually at a Lego store in New York on vacation and saw the Saturn V rocket. And I said, "Oh, I like space. That would be fun to build." Next thing, "Okay, maybe I should build something that actually looks like it fires underneath it. That I can do."
But then I thought, "Ah, we have a platform. We can connect that." So I actually built this small Lego rocket with IoT. So that is behind me when I'm working in my home office. Everybody that has video meetings with me can actually see when deployments are happening: red when there are deployments on Andromeda, blue when a feature flag is changed in production.
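The colour logic of such a rocket is small enough to sketch. The talk does not describe the hardware or how platform events reach it, so the event names, the RGB values, and the function `color_for` below are assumptions. Only the red and blue meanings come from the talk.

```python
from typing import Optional, Tuple

RGB = Tuple[int, int, int]

# Event names are hypothetical; the colours follow the talk:
# red for a deployment, blue for a feature flag change in production.
EVENT_COLORS = {
    "deployment": (255, 0, 0),
    "feature_flag_change": (0, 0, 255),
}


def color_for(event_type: str) -> Optional[RGB]:
    """Return the light colour for a platform event, or None to leave the light off."""
    return EVENT_COLORS.get(event_type)
```

A device loop would read events from whatever feed the platform exposes and pass the returned colour to its LED driver.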
So it is one way to visualize, and also good for upper management to see that things are happening all the time in the background.
Christian Edström
And we got so intrigued, so now Markus is producing these rockets. All different offices want one of their own.
Markus Backman
Yeah. It is really fun what you can do when you actually have the data, like scoring model versus an IoT. It's definitely fun parts.
So that was all we had.
Christian Edström
Yes. Thank you.
Markus Backman
Thank you very much.