From 4 Releases a Year, To Once Every Other Minute, In the Public Sector
We'll take you through the transformation of NAV, Norway's biggest governmental agency, responsible for paying out a third of the national budget. Five years ago, we had no internal developers, and 4 releases a year. Now we have 300 internal developers, and deploy to production every other minute.
We believe we did something right - treating internal platforms as products, going for continuous improvement over projects.
We also did some things that we are not sure about, such as how to split up a gigantic organisation into autonomous teams.
This talk will tell the story of our journey, and share what we learned.
Full transcript
Audun Fauchald Strand
Hello. Welcome to this talk about how we started the transformation of Norway's biggest government agency. We call it "From four releases a year to once every other minute." My name is Audun, and this is Truls. We're going to give you some insight into our transformation, but first, we want to show you some data.
And let's start with Accelerate, perhaps the most important book for this story. Accelerate identifies deployment frequency as one of the four key metrics that indicate the performance of a software development team. And we have searched for, and also found, data from the last 12 years with just enough resolution, allowing us to plot the average number of deploys to production per week per year. And here it is.
As you can see, the change is pretty big. Before 2016, NAV used to take pride in arranging four coordinated massive releases per year. But something changed in 2016 that accelerated into 2019, and as of today, this number is over 1,300 per week. This translates to once every other minute of a working week, at least here in Norway.
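The "once every other minute" figure can be checked with a little arithmetic. The sketch below assumes a 37.5-hour working week, which is a common standard in Norway but is not stated in the talk; the function and constant names are illustrative only.

```python
# Assumption: a standard Norwegian working week of 37.5 hours.
# The talk only says "a working week, at least here in Norway".
WORKING_HOURS_PER_WEEK = 37.5


def minutes_between_deploys(deploys_per_week,
                            working_hours_per_week=WORKING_HOURS_PER_WEEK):
    """Average gap in minutes between deploys, counting working hours only."""
    working_minutes = working_hours_per_week * 60
    return working_minutes / deploys_per_week


# 1,300 deploys per week is the figure quoted in the talk.
gap = minutes_between_deploys(1300)
print(f"{gap:.2f} minutes between deploys")  # prints "1.73 minutes between deploys"
```

Under that assumption, 1,300 deploys spread over 2,250 working minutes gives one deploy roughly every 1.7 minutes, which is consistent with "every other minute".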
But when we dug deeper into the numbers, we found something strange. The last graph that Truls showed us was a steady increase per year. But when we changed the resolution to weekly, we saw that before week 32 in 2019, we had a few hundred deploys every week, and then we had a little drop in the summer holidays. And then straight after the summer holidays, we went from a few hundred to more than 1,000.
This was kind of a mystery to us. Why did this happen? And in this talk, we're going to show you what happened to explain this sudden jump in deployment frequency.
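Changing the resolution from yearly to weekly, as described above, could be sketched roughly as follows. This is not NAV's actual analysis: the input format (a list of deploy dates) and both function names are assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def deploys_per_iso_week(deploy_dates):
    """Count deploys per (ISO year, ISO week), sorted chronologically."""
    counts = Counter()
    for d in deploy_dates:
        iso = d.isocalendar()  # (ISO year, ISO week number, weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))


def largest_jump(weekly_counts):
    """Return ((year, week), increase) for the biggest week-over-week rise.

    Compares consecutive weeks that appear in the data; weeks with no
    deploys at all are simply absent, which is a simplification.
    Returns None if fewer than two weeks are present.
    """
    weeks = sorted(weekly_counts)
    best = None
    for prev, cur in zip(weeks, weeks[1:]):
        delta = weekly_counts[cur] - weekly_counts[prev]
        if best is None or delta > best[1]:
            best = (cur, delta)
    return best
```

A yearly average smooths a step change into what looks like steady growth; counting per week is what makes a single-week jump visible.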
But let 2019 remain a mystery for now. Let's turn to 2016, where the change began. And to understand how, and not to forget why this change started, we need to go further back in time to start this story, to the very start of NAV, and even a bit before that.
In 2006, our politicians in Norway decided to create an interconnected super agency from three already huge existing agencies. We combined the service for providing benefits with the service for helping people get back to work and the social service. And the vision was to create an agency that gave its users, that is all Norwegian citizens really, a unified welfare experience.
NAV is actually a Norwegian word that means the central part of the wheel, and this is a good description. NAV is the center of the Norwegian welfare state, and we want to support a lot of the cases of a welfare state through just NAV.
The size of NAV is actually quite unique. We pay out about a third of Norway's national budget every year. Most of this is in age-related pensions, but we also have other benefits. If you compare this to other countries, I've seen somewhere that NAV does the job of something like 30 different agencies in the US, for instance. So we have centralized a lot, and we put a lot of functionality into one big system.
And the political ambitions were high for NAV as an agency. But sadly, though, those ambitions were not followed by equally high ambitions for the IT systems behind this new super agency. So NAV was born with two separate legacy monoliths. Both of them were absolutely essential for the welfare state to function. But being one organization now, we needed to make those monoliths communicate in some way.
And making changes in one monolith is difficult enough, but making coordinated changes in two interconnected monoliths, that's nearly impossible. And this made for a large organization not optimized for change at all, and we compensated with lots of change management.
And the very first thing that NAV did as a new agency, well, that was building yet another monolith. This time, a new pension system. And having no in-house developers at all, this new system was built entirely by consultants.
And of course, we had the applications outside the big monolith as well. We had task management for caseworkers, systems for handling or showing applications to people, and an application platform. But almost all of this was built by consultants, so it was really difficult for us to take ownership of our own technical direction. We basically had, as an agency, 19,000 employees, and just the IT department was almost 1,000 people, but we had no developers ourselves. So we decided to start insourcing.
And that started in 2016. And we got a new CTO. His name is Torbjørn Larsen, and he started off with a clear vision of what NAV could be and what NAV should be. And he told that story again and again, both high and low in the organization and outside the organization. In retrospect, he checked off all the boxes for what transformational leadership is all about. And fundamental in this story was the need to reclaim technical ownership of our own systems. Those were the very same systems that had been built and maintained by consultants for the last 15 years.
So reclaiming technical ownership was a huge task. Truls and I were some of the first developers hired, and a lot of the time in the first few years was just getting more people in and making sure that they were of the required quality. And everybody, of course, wants to hire the best developers. So it isn't an easy process.
We now have almost 200 developers. If you count the data engineers as well, we have 300. And we have both senior people, and we are starting to take more people straight out of university as well. So now we have the capabilities to build and maintain and shape our own systems.
So now we've set the scene. We've told you a little bit about what NAV is and our history. Now we want to talk about three different topics. We want to talk about platform, sustainability, and culture, and we're going to show you how all of these are connected, and that you need to work on all of them to succeed.
So let's start talking about platforms.
Truls Jørgensen
So at the time we're going to talk about, I didn't actually work for NAV. I worked for a big telco outside NAV. I'd been a consultant at NAV for a year before I started there, and the experience was so bad that I actually quit the consultancy company.
But I heard some rumors about the new application platform they built at NAV. Specifically, I heard that you could get a new server in production in under 10 minutes. And this was much better than what I'd seen in the big telco.
Audun Fauchald Strand
And actually, I worked back then as a consultant myself, I might add, on the team that made that platform. And thinking back, it was pretty good for its time back in 2013, I think. It demanded a clear separation between application and environment config, and offered zero downtime deploys in return. And you could also, as Truls just mentioned, get your own server, your own virtual machine, in 10 minutes. But the problem was that very few teams in NAV actually used it.
Truls Jørgensen
So NAV did an assessment using the Continuous Delivery Maturity Model from the Continuous Delivery book. If you look at the different facets of continuous delivery, you see you need a lot of different things to be able to get good speed. We got really good at the build and deploy, but we hadn't really thought that much about the rest of them.
And plotting out this assessment formed much of my motivation to join NAV as the first developer back in 2016. Because what good is there in offering deployment in minutes when it takes months to change a comma in a simple web application because it needs integrated and coordinated testing and really hard deployment mechanisms? So we needed to think bigger. And what we realized was that just being good on build and deploy wasn't enough. We had to look at solving all the other aspects as well.
And we knew we wanted to do this the right way. So we were inspired by Spotify, and we wanted to build what they call a golden path platform. Something that's easy to use for a lot of developers and makes it easy to do the right thing. And we didn't want to support all the necessary edge cases at first.
So a few years later, when we were further along with insourcing, we started a new iteration of our application platform. We saw that that was a good tool to use to address the other parts of the Continuous Delivery Maturity Model. So the first thing is that we wanted to use the platform as an instrument to improve the organization, the quality of the application architecture, and the system architecture at NAV.
And to do that, we wanted to, of course, use great technology, and we wanted it to be open source. And as Audun just said, we wanted to optimize for migration. We wanted to make a platform that the teams really, really wanted to use.
So by far the most important thing when you create an application platform, and this is also recognized as one of the most difficult problems in computer science, is choosing a name. We had multiple strategies to choose from. This time we chose the pick-a-name-and-retrofit-it-into-an-acronym strategy. So we made the NAV Application Infrastructure Service: NAIS.
And we wanted this to be nice to use for developers. And with nice to use, we meant that it should be as simple as possible but not simpler. We wanted to remove a lot of the unnecessary creativity that teams need to do. You shouldn't need to consider different alternatives for load balancing or traffic shaping. The platform should be opinionated, but do the right thing in an easy and nice way.
Audun Fauchald Strand
So from the start, we didn't want to make the perfect platform. We wanted it to be open source, but that's more for inspiration than for reuse. We wanted it to be specially designed to work perfectly for teams at NAV and make it as easy as possible for them to move their own applications, running on the various other application platforms, onto the NAIS platform. So we did a lot of the heavy lifting upfront and integrated with parts of the Aura platform that Truls talked about earlier.
And NAIS was born open source, and that kickstarted a process to open source most of our code in NAV. We were inspired by Government UK and their excellent writeup not just on how, but on the motivation for why public sector code should be open. As our software is paid for by the public, it should be publicly available.
So we now have over 1,000 public repos on GitHub. Most of them are just code in the open. They solve very specific Norwegian welfare state problems and don't aim to be useful outside NAV. But we have a handful of repos that are run as proper open source projects, including NAIS.
We even introduced a completely new buzzword: cake-driven development. We used cake for almost everything. We gave cake to any developer outside the NAIS platform that sent us pull requests to fix bugs in the platform. We used cake to get teams to migrate to the platform, and it's a really good trick. Instead of having the team considering if this is difficult or not, they can only think, "Do I want cake?" And most teams want cake, so they started on this process.
And we were not just giving cake. We made hoodies, we made socks, and for the hipsters, that is not us, caps. And of course, stickers. We made huge piles of stickers and sprayed them all over the place. And this also created a sense of pride for working at NAV. They understood that it's possible to work at NAV, in something that used to be a really bad place for software development, and actually make good things.
I would say we used what most teams do when they create software for external users. We used the same product development techniques but for our own internal platform. And that helped a lot when it came to migration and optics.
Truls Jørgensen
So NAIS was a successful platform. And as of now, most applications at NAV run on the NAIS platform. But NAIS didn't happen in a vacuum. There were other shifts going on at the same time.
So we need to talk a bit about DevOps culture. NAV didn't always have a DevOps culture, as you probably figured out by now. And back in the days where software development was something that NAV bought from the market, there was no DevOps culture at all. And all this outsourcing meant that we built up a big control regime.
NAV created the specifications, and the consultant estimated, and NAV accepted, and then the consultant built it. And then NAV said, "Well, this is exactly what we needed." And you had a bad process, and making any change in this process with loads of people and loads of coordination is really expensive.
And when change is expensive, you tend to change less often. And the less often you change, the bigger the change becomes, because the need for change doesn't stop, and the developers don't stop typing. And big changes are risky, so you need to put more control mechanisms in. You get manual testers, you get change managers and coordinators, all trying their very best to keep the risk down.
But now we get very risk-aware, and that makes you change even less often. And because at this point, change is rare, you don't see any value of automating, so change becomes even more scary and even more risky. And then you have a downward spiral, and NAV raced down this downward spiral all the way down to only four releases a year.
And it even went so far that we actually celebrated the size of our releases. Because when you release only four times a year, the releases become really big. Someone can remember a cake being given out to the people. We celebrated the fact that we had 103,000 development hours in a single release.
And of course, when you have such a big release, it's almost impossible to think about and reason about the risks involved. No one actually knows the size of all the changes and what the consequence of an error is. And this again leads to more testing and more coordination and more change management. And all this creates loads of fear. Fear of errors and fear of something slipping through the net of manual testers.
Audun Fauchald Strand
So that was how it used to be. But in 2016, with a new CTO and in-house development, things started to change. And this new awkward feeling of trust between developers and the suits was beginning to emerge.
So specifically, the trust gave us the opportunity to deploy ourselves as a software team. This didn't happen all over NAV, but we had a few teams where there was enough trust between the old ops department and the development team so that they could start to deploy themselves. And with much less overhead, the few teams were able to deploy small changes more often.
And people, when they saw this, they saw that everything doesn't explode just because they deploy often. Deploying to production every week doesn't mean that everything changes. It doesn't mean that the users have to get used to a new UI every week or that the applications explode.
And when they see that the change is safe, they start to deploy even more often. And then we have an upward spiral, and the changes get better and better.
Truls Jørgensen
But even though we had some trust, that is enough trust to deploy ourselves, that trust wasn't widespread to all levels of the organization. And the upward spiral Audun mentioned was not without conflict. Because those change managers, they used to be in charge of all deployments, and they were now being challenged by this new DevOps culture.
And the fear of losing control was very much alive at this point, even though they were partly realizing that the control they were used to having was merely an illusion. That illusion of control dies hard.
So their response was to create a system of categorization for applications. The most modern applications, where we had NAV employees in the teams and we owned the technical direction, they went on the white list. This meant, as we talked about earlier, that they could deploy themselves, and they had full control over the deployment process.
And the old system, the old monolithic applications, they needed the same amount of change management and the same amount of coordination as before. So they could only release on the four-times-a-year schedule. This was called the black list.
But most of the applications at NAV actually were placed on something in between, the grey list. These were apps that they decided needed some kind of change management. There wasn't enough trust to make it possible for developers to deploy themselves.
And the system they devised, the grey list, involved deep abuse of the JIRA state machine, and it made it possible to deploy using a JIRA ticket and a button. But you still have a lot of central coordination, and every new application meant you had to talk to someone to get it evaluated and put on the correct list.
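The three lists amount to a per-application deployment policy. The sketch below models that policy as described in the talk; it is not NAV's actual implementation, and the names `DeployList`, `may_deploy`, `has_jira_ticket`, and `in_release_window` are hypothetical.

```python
from enum import Enum


class DeployList(Enum):
    WHITE = "white"  # modern apps: teams deploy themselves
    GREY = "grey"    # deploy via a JIRA ticket and a button
    BLACK = "black"  # old monoliths: four coordinated releases a year


def may_deploy(category, *, has_jira_ticket=False, in_release_window=False):
    """Return True if an application in this category may deploy right now.

    The two flags are illustrative stand-ins for the central coordination
    steps described in the talk.
    """
    if category is DeployList.WHITE:
        return True
    if category is DeployList.GREY:
        return has_jira_ticket
    # BLACK: only during one of the coordinated release windows
    return in_release_window
```

Seen this way, every grey-list deploy depends on a step outside the team, which is where the central coordination cost comes from.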
And at this time, this is a good time to return to the data and the mystery in 2019. Remember this graph? What looked like a steady increase turned out to be a sharp jump at a specific time, week 32 in 2019. So what happened in week 32 in 2019?
This was actually the exact moment where we stopped having the grey list. It wasn't necessary for teams to create JIRA tickets to deploy to production anymore. All the teams, except for the old monolithic applications, could deploy themselves. And the platform had the functionality necessary to provide transparency, so that people could see what happened. And removing the bureaucracy and the centralized coordination made NAV go much faster. And as you could see from the graph earlier, there was a big jump just because we removed that one blocker.
Audun Fauchald Strand
So in the end, we embraced a DevOps culture, and we are now changing our systems, some of which are core to the welfare state, every other minute. To do that in a responsible manner, we are building skills to reduce the probability of failure, such as creating deployment pipelines and other skills necessary to do trunk-based development and still get sleep at night. And also establishing good monitoring and building security mechanisms to further reduce mean time to recovery.
Truls Jørgensen
Now we've looked at platforms and culture, and we've looked at how important it is to create a good technical platform to solve the most common problems for the development teams, and how the culture and the platform interact and how you need both to work together.
So the last thing we want to talk about is sustainability. And we have this quote we enjoy very much from Alberto Brandolini: "Software development is a learning process. Working code is a side effect." And this kind of sums up what we think about when we consider software engineering, how to create systems that work over time. But although the quote is kind of abstract, we need to go deeper and see what this actually means in practice.
If the learning process is the most important part, how should we go about organizing a big work effort? We've had multiple projects at NAV, although we don't like to call them projects anymore, modernizing the different parts of our systems.
Audun Fauchald Strand
And big parts of the organization still want to optimize for deliveries, but we want to optimize for learning. So how do we do that?
Truls Jørgensen
And if software development is a learning process, that means you should get to learning as early as possible, and that learning happens in production. So get your working skeleton to production as early as possible.
And that sounds easy enough, right? But the legal complexity of public sector makes identifying a minimum viable product surprisingly hard. We have had success with trying to support only parts of our user base as a first iteration. For instance, only those with one job from one employer and no other benefits. But it's really hard to make that work legally.
But still, it's important to identify that minimum viable product because contrast this approach with a project mindset. We start with a plan, and we create all the tasks needed to reach the finish line. That is all the functionality and all the edge cases, and then you deploy to production, and then you're done. You have crossed the finish line.
The problem is that the finish line is a lie. It doesn't exist. It's impossible to create a plan for everything you need before you start working on your projects. Then you discard all the learning you're going to have whilst you create your minimum viable product. And we want to optimize for learning.
And when you get to production, you realize there is no finish line either. Just because you've gotten into production, the work doesn't stop. You have to continue evolving the system. This means you have to take a long-term view of software development. The system will live for as long as the problem is worth solving, and you need to frame the system with an organization that has the necessary knowledge to own and maintain that system.
And of course, when you have to have a stable organization, you need to have a proper approach to financing. And you need to have funding for the team as long as necessary because we never want to stop learning.
Audun Fauchald Strand
So sustainability is about pace, and it's about the right cross-functional skill set and a problem size that fits the cognitive capacity of the team. And none of these things are trivial to solve, but we believe that stable teams are an essential part of the answer.
And stable teams need stable funding, and this is a challenge in the public sector, where most of the funding processes are project-based and more suitable for building offices and bridges than software. And this is a long political process that has implications on how the welfare state is funded. But NAV has taken a clear position here. We want to move away from projects and into product-based software development with stable funding.
So to conclude what we talked about during this talk, we showed you some data showing we made great progress. We moved from a few hundred to more than a thousand deployments to production per week. But diving into the data, there was this big mystery. Why did we have this sudden jump in deployments? And of course, the solution was not really a surprise.
We replaced bureaucracy and coordination with trust, and we took away some of the blockers for speed. And we've gone through a lot of the changes we made and how they're all connected. We solved some technical problems, and we did that with an application platform. And that application platform helped shape the quality of our applications as well.
But just having a platform in isolation doesn't solve the problem. You also need to look at the culture and how we work. So we put a lot of effort into building trust as a foundation for DevOps culture. And as we have matured as a product organization, we strive to remove projects and have abandoned the mindset that software development is a race to the finish line, because there is no finish line. Instead, we like to think of it as a race to the start line, to get to production as early as possible so that you can start the learning process.
Thank you so much for listening. Hopefully, we've been able to answer a lot of your questions on Slack during this presentation. You can also reach us on Twitter if you have more questions later. Thank you.