12 Factor Terraform: Next Generation Infrastructure As Code
Babylon’s mission is to put an accessible and affordable health service in the hands of every person on earth.
In order to do this, we need to be able to deploy quickly, effectively, and with total confidence in success.
The 12 Factor App is old news as far as software development is concerned, and countless firms have achieved great success in applying its principles.
At Babylon we’ve gone a step further and have applied 12-factor principles to our IaC pipeline, and extended the concept of a microservice down to the infrastructure layer.
This means that our delivery teams are empowered to deploy their own infrastructure and innovate around it, whilst maintaining our confidence that we can allocate cost correctly, that we remain compliant with our internal and external standards, and that we are delivering as fast as we possibly can.
Full transcript
The complete talk, organized by section.
Richard Vodden
Good morning, ladies and gentlemen. My name is Richard. I run cloud engineering for Babylon Health.
Babylon is an organization whose mission is to put an affordable and accessible healthcare service into the hands of everyone on the planet. Not a small aim, I'm sure you understand. And as part of that, we recognize that technology is a very, very important thing to augment the knowledge and skill of the medical professionals that obviously provide that frontline care. So we are primarily a health organization, but technology is an extremely important part of what it is that we do.
And one of the things that we've been working on over the last sort of 12 months is really making sure we can deliver that technology as quickly as we possibly can in the best kind of way. The kind of thing that we've all been working on, hence we're here today.
A few things. Firstly, I think you're all aware of Gene Kim's cloning machine. So whilst I am here upstairs in my loft conversion talking to you through the camera, there is an evil clone of me downstairs in my office on the laptop able to answer your questions on the track one Slack channel. So if you jump on there, my evil twin will be able to answer any questions you have as we proceed through.
Secondly, this is a subject which we could spend all day talking about, so this really is a sort of introduction to the concepts that we've come across as we've gone on our Terraform journey at Babylon. There is a lot more detail behind it. Please do ask me about it. I'll be at all the networking sessions over the next couple of days and really look forward to answering your questions, and similarly, quizzing you on how you've worked with Terraform, how you've made your infrastructure as code work as effectively as hopefully it can.
01 Terraform Is Just Another Programming Language
So the key message I want to try and get across, and the key approach that we took at Babylon was we thought Terraform isn't special. It is just another programming language. And that's the key tenet to everything I'm going to talk about for the next sort of 25 minutes.
Concepts such as dependency injection, such as separation of concerns, all of these things that if we were writing code in Java or in Node or in any other established language, we wouldn't think twice about executing, I've found that Terraform coders seem to sort of forget about.
Some of this is because the facilities of Terraform have been quite limited over the years. Terraform 0.12 obviously has brought in some amazing features, which have very much helped enable some of the things that we'll be talking about in the next few minutes. But not everything. Some of this you could have done with 0.11. Some of it is just a way of thinking about making sure, in particular, that we don't repeat ourselves in our code. This DRY principle is something I'm going to keep coming back to.
So let's have a look. Let's slide up. So let's have a look at what it is we're really trying to achieve.
02 The Problems Babylon Needed To Solve
What were the challenges? I joined Babylon in August last year and came into a team of amazing platform engineers. What was it that the platform team was struggling with, that we were trying to get over with the infrastructure as code and the Terraform that we were using?
So firstly, and I think many of you will recognize many of these, all of the cloud expertise kind of lived inside my team, and as such, we were constantly being called upon by other teams. So we have 100 and something microservices in Babylon, all existing in a Kubernetes cluster. Those microservices talk to infrastructure, but the concept of microservice was very much the Kubernetes bit. The RDS instances, the ElastiCaches, the Lambdas, all the other things that made up part of the ecosystem were sort of mysterious to the rest of the organization, and we were the only real people that understood how it worked. More than that, we were really the only people who could actually deliver it, which is a sort of huge problem when we're trying to lean things out and make that work visible, when we have bottlenecks everywhere and handoffs and obviously a huge amount of delay.
So linked to that is increased pace. By enabling the other teams to deliver, we can go faster, but similarly, we ourselves wanted to go faster. We were finding that the way we'd structured our code was making life very, very difficult. I think you've all come across the phrase yak shaving. We were finding if what we wanted to do was just make a simple change to an RDS, actually, we had to do seven or eight things first, and that there was no such thing as simple, which was very, very difficult.
We also had an awful lot of divergence in configuration. The challenge of ensuring that there was a small number of ways of setting things up really hadn't been considered in the evolution of Babylon over its five-year existence. So we had RDSs set up in entirely different ways. People were using different Postgres versions. People were using different MySQL versions. People hadn't considered moving to Aurora, all of that kind of stuff.
Secondly, linked to that, I think we all understand the only way of really keeping track of your cloud costs is to make sure all of your cloud metadata is in really good shape. If you don't have consistency, particularly in that tagging, then you can't really track where your cloud costs are going, and you can't attribute them up to those microservices. And therefore, you find the bill lands with your platform team, which is exactly the situation we were in and exactly one of the reasons why we decided we wanted to go down this path.
And finally, almost most importantly, but certainly equally importantly, we need to enforce compliance. We're a healthcare organization. We're ISO 13485 compliant. We've just submitted HITRUST, so hopefully we'll be HITRUST compliant very soon. We're HIPAA compliant as we work in the US a lot. We need to be able to demonstrate to our annual auditors that we are compliant, and we were finding that we had to do a lot of work to gather evidence before each one of those audits. Whereas if we sat down in advance and went, "Right, this is how we're going to be compliant, this is how we're going to evidence it, this is how our infrastructure as code innately ensures compliance," we find our audits are actually a much more pleasant experience, and we're able to just kind of go, "Oh, yeah, here it is. This is the answer. This is how we do it. This is how it works." And we can evidence, again using that metadata, that the resources that have been provisioned in the cloud have been generated using this infrastructure as code and therefore will be compliant because we can show that the code is compliant. And that's been hugely, hugely helpful for us.
03 Applying Twelve-Factor Thinking To Infrastructure
So let me wind back a little bit and talk about what I mean when I say let's look at how these problems would be solved in any other programming language.
There was a chap who worked for, I think he still does, actually, Heroku a few years ago now, nearly eight years ago. Nearly nine years ago. I can't do sums. And he came up with these 12 rules for writing. I'm holding up seven. I don't have 12 fingers. Twelve rules for writing a cloud-native application.
Now, I am upfront not going to pretend that we're going to hit all 12 of these, mostly because we've only got 24 minutes left, and that would mean two minutes on each of them, and that wouldn't be very interesting at all. But secondly, not all of these actually apply to infrastructure. So for example, number 7, port binding. This is all about not using Tomcat application containers and that kind of technology, but really ensuring that your application is a single executable and that it accepts a port binding in its own right. That doesn't apply to infrastructure in the way that it applies to code. So there are a few of these we can cross out. We get down to eight. So I believe I won't bash them all off. You can review them in the slides afterwards.
I don't think that those four that I've crossed out there really apply directly to infrastructure. But one thing I will add, number eight, rather than doing concurrency through scaling processes, we should actually dynamically scale using elastic services, and that's a key tenet to any cloud infrastructure. We're not going to talk particularly about that today because that's about which resource you choose.
Similarly, I assure you we're not going to go through each and every one of the remaining nine. I'm going to show you when we did our analysis, when we sat down. We're sort of going to skip to the end and look at what the answer was, so that we have time to sort of discuss this properly with a little bit more context.
But let's pick on one. The one that I latched onto when we were doing this exercise: exactly one codebase. I think all of you have had some experience with IaC. I think there's a kind of journey that organizations go through when they're talking about infrastructure as code. Certainly, every organization I've worked for has started off putting all of their code in one repository, and we all know that's incredibly painful. So what is this talking about? Why are we saying exactly one codebase when we all innately know that having one monorepo is incredibly challenging?
And it's all because of this little line on page two of the definition of Twelve-Factor: factor shared code into libraries which can be included through the dependency manager. And that is something which I think very few organizations think about when they think about how to structure their infrastructure as code.
So what we've tried to do in Babylon is take that tenet to heart. So any reused code is extracted into a kind of library. We understand that Terraform has modules. I think most people have come across Terraform modules. So really, the question we're going to be answering today is how should we structure those modules to best answer those four challenges we had at the front, to absolutely minimize refactoring, and to make sure that we're doing the best possible we can with our infrastructure as code.
So when we did our analysis of those remaining eight, maybe nine factors, we came across four big problems, one of which I've talked about already, the factoring the shared code into libraries. There wasn't really a kind of standard answer. You couldn't jump on Google and go, "How do I do this with Terraform?" And the Stack Exchange answers would come up and say, "This is how you do it. This is the right answer."
Firstly, factoring shared code into libraries. We need a dependency manager. Terraform doesn't really have one. There are standard modules out there, but when I was looking at them, I found that, for example, the standard Vault library has inside it an EC2 module. When I go to the standard Consul library, that has another EC2 module. They're very similar, but this means we're repeating ourselves. So if I want to change how an EC2 is deployed, I have to update both of those modules, and really, I want one module for EC2, which is pulled into Vault and Consul if they're both using EC2.
Another core tenet of Twelve-Factor is to separate build, release, and run. How do we achieve that in an infrastructure as code world? What does it mean to build? One of the challenges we've all talked about before is how do we organize our testing? What do we mean by build, release, and run? Where does testing fit in that process? Obviously, there isn't really an artifact like there might be with a programming language like Java or C, or Go even. But then similarly, we're not in the same kind of world we're in with JavaScript either, where the code just executes. So what we did there is we wrapped our modules, our libraries, we gave them all very distinct version numbers, so whenever we talk about something, it has a version number associated with it. We'll talk about the structure of those modules and how we've organized them in just a second.
Explicitly declare and isolate dependencies. Again, this comes back to the point I was making about Vault and Consul. Both those Vault and Consul modules relied on EC2, yet they contained their own code. It wasn't declared, it wasn't isolated. The EC2 stuff was sort of distributed across both of them, and that made maintenance particularly difficult.
And finally, storing environment-specific configuration in the environment. What does that mean? When we're working with Kubernetes, we've got things like config maps. When we're working with servers, we've got disks to keep things on. But when we're doing infrastructure as code, what does it mean to have an environment? Where can we store that environment-specific configuration, and how can we make best use of that and mean that things are maintained, audited, that we know what it was that changed in the event that something might break?
04 Babylon's Four-Level Hierarchy
So we came up with this four-level hierarchy for the modules and the Terraform code that we've written.
I'm going to start from the bottom of the pile, which seems a little bit strange, but the bottom we've called Module. We've used a capital M here to distinguish it from the just sort of generic Terraform module. These are very small. There's a one-to-one relationship between a module and an AWS resource, well, fundamentally we're an AWS shop. And what the module makes sure of is that these resources are named correctly, that the metadata is correct, that tagging is there, and that any opinions we have as an organization about how a particular resource should be deployed, that's encoded into that module.
So for example, we want all our EBS volumes to be encrypted. If you use Babylon's EBS module, you don't get an option to not encrypt an EBS volume. Similarly, when you deploy an EC2, the EC2 uses the EBS module, so we know it's encrypted, and we've isolated that dependency.
The component is the next level up of abstraction. This is a grouping together of modules to form something useful. For example, it's very rare to just deploy an RDS. So most likely you're going to deploy an RDS, you're going to stick a security group on the front of it to define which other resources are able to talk to it. You're probably going to create some IAM roles to decide who's going to log into it. You're going to create some CloudWatch alerts. Each of those has a module of its own, and it's the component that pulls in those dependencies and groups them together as a sensible, business-value-delivering thing, which itself can be version numbered. So if, for example, the very first time we released RDS, we didn't have the CloudWatch alerts and we just had the IAM role, we could say this is RDS version one. When we add those CloudWatch alerts to it, we can go RDS 1.1, and we say, right, we have a new feature. We have a new change log. We understand which version of that component each of our deployed databases was deployed with.
So the next level up is the service. This is where the consumers consume those components. So we have one service for each of our microservices that consumes any kind of infrastructure. And this is the part where we can enable teams, because those services are their own repositories, they've got their own state files. We can devolve control. We can say, "Here is your service." There's no longer a possibility that teams can accidentally run over somebody else's infrastructure because the IAM role they use to deploy it will only let them touch their own because it's all contained within the service.
And finally, we have the concept of environment. This is where we group the services together, and this is a repository, an actual GitHub repository, where we hold the environment-specific configurations. So for example, the IP ranges of our VPCs are in that environment. And then there's a description of each of the services which needs to be deployed to that environment.
So we can draw a kind of picture here. I realize this is slightly small, but you can see at the top we have those microservices. Some of them don't consume infrastructure at all. Some of them are just standalone business logic that talks to other microservices to get their data and store their state. They don't talk directly to infrastructure in any way at all. But some of them you can see here do.
And you can see that dotted line all the way around the outside is how we define environment, and that has three services in it. We have Foo, we have Bar, and we have Doh. When we look at that Doh service, we can see that makes use of two components, the components being DB and Redis. And then when we zoom into that component, you can see that each component uses the slightly smaller modules using RDS and IAM in this particular instance, and ElastiCache in the case of Redis. But notice that both DB and Redis are using the same IAM module. So we've managed to reuse that code.
It has a version number on it, so if we want to change the naming convention for our IAM roles, for example, we can release a new version of the module. It won't get automatically rolled out to everything because we need to release a new version of each of the components to consume that IAM module. Similarly, each service will need to release a new version in order to consume the new version of each component. And finally, each version of those services will need to be rolled out into their respective environments.
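The knock-on effect of a module release can be sketched as a walk over the dependency graph. The Go below is only an illustration under stated assumptions, not Babylon's tooling: the entity names follow the diagram described in the talk, while the `dependsOn` map and the `needsRelease` function are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// dependsOn maps each entity to the entities it consumes directly.
// The names mirror the diagram in the talk; the map itself is an assumption.
var dependsOn = map[string][]string{
	"env":   {"doh"},                // environment rolls out the Doh service
	"doh":   {"db", "redis"},        // service consumes two components
	"db":    {"rds", "iam"},         // component built from modules
	"redis": {"elasticache", "iam"}, // component built from modules
}

// needsRelease returns, in alphabetical order, every entity that has to
// cut a new version after `changed` is released, directly or transitively.
func needsRelease(changed string) []string {
	affected := map[string]bool{}
	var visit func(target string)
	visit = func(target string) {
		for consumer, deps := range dependsOn {
			for _, dep := range deps {
				if dep == target && !affected[consumer] {
					affected[consumer] = true
					visit(consumer) // consumers of the consumer are affected too
				}
			}
		}
	}
	visit(changed)

	out := make([]string, 0, len(affected))
	for name := range affected {
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}

func main() {
	// A new IAM module touches both components, the service and the environment.
	fmt.Println(needsRelease("iam")) // prints [db doh env redis]
}
```

The sketch only answers "what is affected"; the order in which the releases are cut still runs bottom-up, from component to service to environment.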
Now, this may sound unwieldy when I say it like that, but this is how every other language works. If you're writing Node, you type npm update, and it goes and gets the latest version of all of your modules. And that's very much the gap that we found was missing when we were doing our work with Terraform. Excuse me a second. There we go.
05 Beroku And Dependency Updates
Well, yes, absolutely. I said it. Dependency hell. How do we get around that incredibly long, complicated, "Oh, we've just released a new IAM module, which means we need to update both the DB component and the Redis component, which means we then need to update the Doh service and finally deploy it out to all of the environments"?
Well, we wrote a very small Go script called Beroku. Any similarity in naming to Heroku is, as I say, entirely coincidental. And what this does is it uses GitHub releases, which are basically just rubber-stamped Git tags, to keep a record of which versions of each of these modules, components, services and environments exist. And so it can go into the Terraform code, and it can update the version in the Git source string where you declare the module. It obviously tells you what it's doing, but it does no more magic than that.
You just go, "Beroku update." It goes and looks at every single module. It then looks at GitHub and says, "What's the latest version of this module?" and then, "Oh, shall I go and update these for you?" Yes. You can then raise a pull request to update it. And it really takes the manual work out of that module dependency updating process.
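Beroku's source is not shown in the talk, so the following Go is only a guess at the core rewrite step, assuming modules are pinned with Terraform's Git source syntax (`git::<url>?ref=<version>`). The `bumpRefs` function, the `latest` map standing in for a GitHub releases lookup, and the example.com URL are all hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// pinned matches a Git module source with a ?ref= pin, capturing the
// repository URL and the pinned version.
var pinned = regexp.MustCompile(`git::([^"?]+)\?ref=([^"]+)`)

// bumpRefs rewrites every pinned Git source in src to the version listed
// in latest for that repository. Repositories missing from latest are
// left exactly as they were.
func bumpRefs(src string, latest map[string]string) string {
	return pinned.ReplaceAllStringFunc(src, func(match string) string {
		parts := pinned.FindStringSubmatch(match)
		repo := parts[1]
		if version, ok := latest[repo]; ok {
			return "git::" + repo + "?ref=" + version
		}
		return match
	})
}

func main() {
	// Hypothetical module declaration pinned to an old release.
	terraform := `module "iam" {
  source = "git::https://example.com/modules/iam.git?ref=v1.0.0"
}`

	// In a real tool this map would be filled from the release list.
	latest := map[string]string{
		"https://example.com/modules/iam.git": "v1.1.0",
	}

	fmt.Println(bumpRefs(terraform, latest))
}
```

Because the rewrite is a plain text substitution on the source string, the result is an ordinary diff that can go through a pull request like any other change.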
But even then, that itself sounds very manual. I'm here at a DevOps conference, or a virtual DevOps conference, talking about how to solve dependencies, yet I'm saying you can raise a pull request. So how do we take this to the next level? How do we make this even more automated?
The answer to that is always testing. If we want to automate things, we need to be able to find out when they're broken and be told automatically so the automation doesn't crack on and produce something that's horribly broken because we didn't test it.
06 Testing The Hierarchy
So how does this structure help us deliver automated testing in the world of infrastructure as code? Well, interestingly, this hierarchy works very well indeed. Hence one of the reasons I feel it's worthwhile coming here and talking to you about it.
Firstly, those modules are very small and very simple, and they line up nicely with the concept of unit testing. We have a very small number of Terratest scripts written against each of the modules, which check: does the module deploy? Does it even work? Does it create the resource it's supposed to? Does it have the naming convention that we said it should have? Is it tagged properly? Do we have that metadata? When I say, "Please create me a database that belongs to this team," when I look at the AWS console, does it have that team's name tagged against it? And finally, are the opinions that we talked about enforced? I said we encrypt all our EBS volumes. I think most people do. When I create an EBS volume, is it encrypted? And we have Terratest scripts which go around and test each of those things every time a module changes, and every time a pull request is raised against master. Again, just like any other language.
But it's small and it's simple because it's right at, I guess, the top of the dependency hierarchy, but the bottom of the functionality tree.
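The module-level checks described above boil down to a handful of assertions on the deployed resource. The Go below sketches them with the standard library only; it is not Terratest code, and the `Volume` type, the team-prefix naming rule and the `team` tag key are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Volume is a simplified, hypothetical view of a deployed EBS volume,
// holding only the attributes the checks below care about.
type Volume struct {
	Name      string
	Tags      map[string]string
	Encrypted bool
}

// checkVolume returns one message per broken convention, or an empty
// slice when the volume follows all of them. The "<team>-" name prefix
// and the "team" tag are assumed conventions, not Babylon's real ones.
func checkVolume(v Volume, team string) []string {
	var problems []string
	if !strings.HasPrefix(v.Name, team+"-") {
		problems = append(problems, "name does not start with the team prefix")
	}
	if v.Tags["team"] != team {
		problems = append(problems, "team tag is missing or wrong")
	}
	if !v.Encrypted {
		problems = append(problems, "volume is not encrypted")
	}
	return problems
}

func main() {
	good := Volume{
		Name:      "identity-data",
		Tags:      map[string]string{"team": "identity"},
		Encrypted: true,
	}
	bad := Volume{Name: "scratch"}

	fmt.Println(len(checkVolume(good, "identity")), "problems with the good volume")
	for _, problem := range checkVolume(bad, "identity") {
		fmt.Println("bad volume:", problem)
	}
}
```

In a real test the `Volume` values would be read back from the cloud provider after the module's example has been applied, rather than constructed by hand.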
The next level up, components. This is where we can start doing more end-to-end testing, the kind of thing we'd call integration tests if we were writing perhaps a larger application in Go or Java. So now if we take the database example where I've created a database, I've created some CloudWatch alerts, I've created an IAM role so that I can log into it. Perhaps I've got some logging, those kinds of things, all as part of my component using the various modules that have already been unit tested. Now I can start looking at some, what you might describe as user journeys. Again, Terratest comes in here. Can I get the credentials for my database out of my secrets manager? I've said Vault here, but it could equally be AWS Secrets Manager or something else. Can I get those credentials? Can I then log into the database? And can I put some data in there?
Great test. Simple test. Another one that's really important, and I'd suggest if you guys follow this approach, then you might want to put this one in. If there's some data in my database already and I upgrade the module, is that data still there afterwards? That's quite an important catchall. And that's the kind of thing we've done. Those user journeys, it's high level. Does the component work? Does it deliver the value that it's supposed to deliver?
When we get up to the service, so this is where, for example, we have an identity microservice which is responsible for authenticating all of the users that come into the front of the platform. What we do here is we just run the microservice test. So we deploy the microservice, we deploy the infrastructure that's supporting it. Probably the other way around in hindsight. We'd probably do the infrastructure first, then deploy the microservice after it. Instantiate whatever dependencies that microservice needs to have its test suite run, and then just run the application code, the application test suite, the test harness the application has. Because we have extended the idea of the microservice down into the infrastructure layer at this point.
That service code that says, "This microservice needs these components at this version," can be stored just as well in the application code repository as it can in a standalone one. At the moment, we're doing it in a standalone repository, but we will eventually merge them together. So we can just run the application tests. Again, we can run tests which assert some of the non-functionals, so is the data still there? Because many tests will be self-contained and will create the data that they need to create.
Finally, at the environment level, this is where testing is a little bit more complex. This is where, how do you test an entire environment? We could get a little bit more end-to-end. We can run our end-to-end tests, but it really depends what that environment is for. The interesting environment, obviously, is production. So at that point, we're looking at the usual kind of Runscope API testing and monitoring that we do.
The other thing that we do for those environments which are not ephemeral is we regularly run a Terraform plan, I think it's once an hour, against the entire environment, and that tells us whether any of the resources that are deployed in that account have been changed in any way. Has somebody manually gone in and manipulated one of those resources? And so we get an alert if that plan shows anything other than no changes.
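One way to turn that scheduled plan into an alert is Terraform's `-detailed-exitcode` flag, which makes `terraform plan` exit with 0 when there are no changes, 1 on error and 2 when changes are present. The talk does not say how Babylon wires this up, so the Go below is a sketch under that assumption; `classify`, `runPlan` and the status names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// planStatus is the outcome of a drift check.
type planStatus string

const (
	noChanges planStatus = "no changes"
	drift     planStatus = "drift detected"
	failed    planStatus = "plan failed"
)

// classify maps the exit code of `terraform plan -detailed-exitcode`
// to a status: 0 is an empty diff, 2 is a non-empty diff, and anything
// else is treated as a failure.
func classify(exitCode int) planStatus {
	switch exitCode {
	case 0:
		return noChanges
	case 2:
		return drift
	default:
		return failed
	}
}

// runPlan runs the plan in dir and classifies the result. It assumes the
// terraform binary is on PATH and the directory is already initialised.
func runPlan(dir string) planStatus {
	cmd := exec.Command("terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color")
	cmd.Dir = dir
	err := cmd.Run()
	if err == nil {
		return classify(0)
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return classify(exitErr.ExitCode())
	}
	return failed // the command could not be started at all
}

func main() {
	// Show the mapping without needing terraform installed.
	for _, code := range []int{0, 1, 2} {
		fmt.Println(code, classify(code))
	}
}
```

A scheduler would call `runPlan` against each long-lived environment and raise an alert on anything other than `noChanges`, treating `failed` as seriously as `drift` so that a broken check never passes silently.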
07 Repository Structure And Results
I mentioned that every entity has its own repository. Those repositories have a very strict structure. That structure is isolated, so each component has its own repo, each module has its own repo, each service has its own repo, and each environment has its own repo, and these structures are the same for all of them, except the environments, which I won't go into here, just as a matter of time.
The first decision we made was all the boilerplate code will have an underscore before it. Just makes it easy. You can glance at what really matters. You can just look at that directory structure and say, "Ah, I need to look at RDS. I need to look at IAM, the _locals, _outputs," which we all kind of understand are boilerplate code. You can sort of see through.
The next important thing is we include an example. That example code is real code. It really works. You have to pass in credentials, obviously, and you have to pass in the name of the AWS account and various VPCs, but the example actually instantiates that component or that module or that service. You can just run it. The tests then use Terratest to execute against that example, which means that we know that that example, which is fundamentally meant for documentation, actually works because we test it every time we do a release.
So let's just have a look back at the four problems I said we were trying to solve right at the beginning, and see if we've knocked them on the head. We were trying to remove bottlenecks. Well, because we have those services which are very strictly owned by the teams developing the microservices, we can give them their own IAM role and we can give them their own state file. We can keep everything nice and tidy. We can make sure that they can't impact other teams alongside them. So yes, we've got rid of the bottlenecks in that sense.
Therefore, we have increased pace. One of the things that was slowing us down when we had our big monorepo was a Terraform plan would take 20 minutes. Actually, now, because we've got a state file per service per environment, they're much smaller. The Terraform plan takes 90 seconds, and most of that's downloading the Terraform modules because we have ephemeral build agents.
We've ensured consistency. Everything is deployed using a module, a component, or a service into an environment. We know that it's right. We run regular Terraform plans. We know when things get changed. They very, very rarely get changed, but we find it when it happens. We absolutely know that everything we're doing is either consistent or, as importantly, we know when it's not consistent.
And finally, we're enforcing compliance. From the ground up, those modules enforce the rules that we've set out. We can say, "We are going to be compliant with this framework because we encrypt all our EBS volumes. We ensure that everything is HTTPS." It's baked into the module. We completely understand that all of those resources that have those rules associated with them have been deployed using this code, and therefore are compliant, and it makes our auditing journey an awful lot easier and an awful lot more pleasant.
So with a minute under half an hour, thank you very much for your time. Thank you very much for coming to my presentation. I completely understand that that was a lightning tour through a topic that we could spend hours and hours and hours talking about. My evil twin is on the Slack channel still, tapping away, answering your questions. Thank you very much.