Calculating the Operations Cost of Software You Haven't Developed
One of the challenges an enterprise has with DevOps is cost. It is not acceptable for an enterprise project to request that the board write a blank cheque. This can be further complicated by legacy systems perhaps requiring that actual hardware is purchased rather than being able to use a cloud provider.
What we would like to have is an idea of costs before development has created all the software.
Full transcript
John Davis
My name is John Eric Davis, and I'm going to talk to you today about calculating the operations cost of software you haven't developed yet.
So I wanted to start by saying, well, why is that a relevant question at a DevOps conference? Why are we even discussing this?
I think it's because a lot of people go on the same journey with DevOps. They first hear about it by word of mouth, and they get interested, and then they go and read more material about it, maybe a book, something like The Phoenix Project, and the excitement builds. And by the end of it, they're passionate, and they're thinking, "This is amazing. This is something that can really change me and the company that I work for."
And then you go back to work, and you try and explain exactly what you want to do. And you get this list of, well, you could either call them questions or problems. And what's important is to not be disheartened at that point, but to look at that list of questions or problems and see, are they valid? And is somebody else solving these? Is this unachievable?
So one of those particular questions is around budgets and around running costs, and how's that going to work given the somewhat inflexible nature of that and a highly iterative, flexible system like DevOps? So that's what I want to talk about today.
The enterprise in question that I went back to was easyJet. Most people are probably familiar with easyJet, but if you're not, it's a low-cost airline that was formed over 20 years ago and was genuinely disruptive. It was really changing the industry and bringing costs down, and was incredibly agile and was able to change very quickly.
And that's grown. It now flies over 76 million passengers a year, and it has over 10,000 employees. And like any company, when it grows, it becomes harder to achieve certain things. Your competitors move. They get closer to you.
So what easyJet has decided to do is embark on a new program to create a new commercial platform, to add the ability to have more business agility. And that is where I fit in.
So I've done a variety of roles, from trainer, starting off teaching C++ and then .NET, doing a lot of development at investment banks in low-latency algorithmic trading, the mandatory failed startup that everyone seems to have done when you try to escape. And that failed, obviously. And now I'm currently helping easyJet with their architecture and delivery.
So like we said, we're going to focus on one particular question, which is: I was asked, "Well, how much is it going to cost for the running costs of this particular part of this program? What's our bill going to be here?" And somebody had actually done a finger-in-the-air estimate. And because it's such a large program, they'd actually come up with this finger-in-the-air estimate in the millions. It was about £4 million or something.
And that's not good because it's a big number, but it's also not good for another reason. It's like just too finger-in-the-air. And we need to talk about that.
It's also very relevant that the number was so big, because if it was a small number, obviously no one would care. But because it's a large number, it meant it was worth people putting time and effort into forecasting and calculating a more accurate number.
So I also had another question. I thought, "Why do I get all the fun questions?" Which was not just how much is this going to cost, but what hardware do we need to buy? And again, it's a DevOps conference. What hardware do we need to buy? Are we in the right place?
Because part of the solution was to be interfacing with the legacy platform. And that was in a data center. It needed to be quite chatty, so it was going to have to be co-located, and it meant that, effectively, we were going to have to buy some hardware. And in some ways, that's even more of a scary question because nobody wants to get it wrong. You don't want to buy the wrong hardware. Not buy enough.
The other thing that made this very difficult is that a lot of companies have quite a rigid budget structure. They have this idea of the yearly budget, and people are asked to come up with this large number and are then almost rewarded if they just don't spend over that. It's like, "Hey, well done. You can have the same again next year," which doesn't really reward people to try and bring that number down.
But it actually creates other problems as well, which is that it's just too unrealistic. It's way too unrealistic to ask anybody to come up with, in this case, the operations and the running costs, but in nearly everything else: what are those costs going to be for the next year?
And that's a big problem because it means people almost don't take it seriously, because it doesn't matter. Nobody can calculate that accurately. So they just come up with these wild guesses. And those wild guesses are generally going to over-provision things, because you don't want to have to go back later and ask for more money. It doesn't look good.
But because it's just a wild guess, you're equally likely to actually under-provision and still have to go back because you didn't base the forecast on anything empirical.
So how can we improve that? Well, first of all, we need to look at the duration that we're talking about and accept that projects are also at different stages. Say we're looking at a project or a feature that is an experiment. It's only going to be rolled out to a percentage of the customers to see whether it succeeds or not, and we're measuring success as meeting a customer or operational outcome, not just that it delivered something and we didn't spend all the money. Then clearly the resources required for the experiment phase are going to be very different from those for the later phase if it succeeds and moves into an exploit phase, where you now want to ramp it up and roll it out to the rest of your customer base.
So we're looking at redefining the structure within the question that we've been asked.
And then we just jump to a buzzword. And this is because once we'd had that conversation about, well, what is a reasonable structure to try and answer these questions in, we found that it wasn't as hard as we thought it was going to be to come up with an approach to actually forecast these costs.
And we didn't say, "We're going to use microservices because that's going to help us forecast operational costs." We picked microservices for the same reason that most other people pick microservices: the ease of deployment.
We'd all worked on projects where even the smallest hotfix required a build of the entire software, deploying it to a big test environment, and taking hours or even days. And now we can hear about a bug, have a hotfix, and actually have that rolled out within minutes or so. It's much better.
The other thing I wanted to briefly mention is that it's very important not to think that microservices equals good architecture and good design, and that you can't go wrong. Of course you can. You can have bad architecture no matter what you pick. But microservices do seem to be a very good fit for domain-driven design.
Domain-driven design is all about looking at the project and saying, well, what are the different areas of that project? What data do they need? And when they need to share it, how can we do that without having dependencies between them?
This service may need this data, but it should not be dependent on that one. It should have sent out the data. That one should consume it and would probably store it in a different schema or something that's suitable for it, and it should not be impacted.
So we had all these good reasons for picking microservices, but what we found is that there was also an incredibly strong correlation between the performance requirement of a given microservice and its cost. Or, more accurately, there was a very strong correlation between the performance requirement of a microservice and the resources that were required to meet that performance requirement.
And if you then were to be able to map from that requirement of resource to a cost, then you can now start to forecast what your future microservices might cost based on how your old ones are performing.
So there's quite a lot of information there. We're going to go into a bit more detail and say, well, what do we mean around these performance requirements?
Well, every microservice, or indeed any service, should have a set of non-functional requirements around resilience, security, and the one we want to talk about, performance.
And that would normally be expressed in a certain way. It would say that, for a given number of transactions a second, the microservice should respond within a given response time, often in milliseconds. And slightly more accurately, it would normally be defined as a percentile: only 95% of those requests need to respond in that time. That allows for variances such as large payload sizes, where you don't want to have to improve the performance to cover every single request. You just want to say that 95% of the time, it meets that performance requirement.
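To make that concrete, here is a minimal sketch of how a requirement like that could be written down and checked. The `PerfRequirement` class, the function names, and the sample latencies are all illustrative assumptions, not anything from the easyJet code base, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfRequirement:
    """Hypothetical shape for a performance requirement."""
    transactions_per_second: int
    max_response_ms: float
    percentile: float = 95.0


def percentile_value(samples, percentile):
    """Nearest-rank percentile of a list of response times."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    rank = max(math.ceil(percentile * len(ordered) / 100), 1)
    return ordered[rank - 1]


def meets_requirement(latencies_ms, requirement):
    """True if the chosen percentile of responses is within the limit."""
    observed = percentile_value(latencies_ms, requirement.percentile)
    return observed <= requirement.max_response_ms


requirement = PerfRequirement(transactions_per_second=400, max_response_ms=100)

# Made-up samples: 96 fast responses and 4 slow ones (say, large payloads).
mostly_fast = [40.0] * 96 + [250.0] * 4
# Made-up samples: 6 slow responses push the 95th percentile over the limit.
too_many_slow = [40.0] * 94 + [250.0] * 6

print(meets_requirement(mostly_fast, requirement))
print(meets_requirement(too_many_slow, requirement))
```

The point of the percentile is visible in the first sample: a handful of slow outliers does not fail the requirement, but once they make up more than 5% of requests it does.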
Now, for all of those non-functional requirements, you obviously should have tests. And that would include things like the resilience side, making sure that for a given microservice, for us, that we would define patterns around permanent problems that it might have, transient problems, isolated transient problems when a given message is maybe a poison message. How's it dealing with the other ones?
But for all of those tests, you would hope in this day and age that they're all automated as well. So yeah, we think we're doing a pretty good job of that on the resilience side, but also on the performance side.
So every single microservice will have a performance requirement, and on every Git commit, our TeamCity box will orchestrate with JMeter, our tool of choice, to spin up an environment, put that service under the load that it's required to meet, and record the results. And if it doesn't meet the requirement, then it fails the build.
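As a rough sketch of that last step, a build gate could read the recorded results and turn them into a pass or fail exit code. This is not the actual TeamCity or JMeter setup from the talk. It assumes the results arrive as CSV with `timeStamp` (epoch milliseconds) and `elapsed` (response time in milliseconds) columns, which is my understanding of JMeter's CSV results format, and the function names and synthetic data are made up for illustration.

```python
import csv
import io
import math


def evaluate_run(csv_text, required_tps, max_response_ms, percentile=95.0):
    """Return 0 if the run met the requirement, 1 otherwise (an exit code)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 1  # no results recorded counts as a failure

    stamps = [int(row["timeStamp"]) for row in rows]
    elapsed = sorted(int(row["elapsed"]) for row in rows)

    # Rough throughput: requests divided by the span between first and last.
    duration_s = max((max(stamps) - min(stamps)) / 1000.0, 0.001)
    achieved_tps = len(rows) / duration_s

    # Nearest-rank percentile of the response times.
    rank = max(math.ceil(percentile * len(elapsed) / 100), 1)
    observed_ms = elapsed[rank - 1]

    ok = achieved_tps >= required_tps and observed_ms <= max_response_ms
    return 0 if ok else 1


def synthetic_run(elapsed_ms, count=800):
    """Made-up results: `count` requests spread over roughly two seconds."""
    lines = ["timeStamp,elapsed"]
    for i in range(count):
        lines.append(f"{i * 5 // 2},{elapsed_ms}")
    return "\n".join(lines)


print(evaluate_run(synthetic_run(60), required_tps=400, max_response_ms=100))
print(evaluate_run(synthetic_run(180), required_tps=400, max_response_ms=100))
```

In a real pipeline the return value would typically be passed to `sys.exit`, so that a non-zero code fails the build step.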
Now, it's important to say that none of that is to do with forecasting costs. That's just what we were doing anyway, to prove that you weren't having any slippage as people start to change your microservices. So we already had all of that infrastructure in place, and we were then going to just build on top of that to be able to do this forecasting.
So it's hard to do this without going into some examples. So we might as well pick something here and say, imagine that we've just built a single microservice. So we're going to build a couple of things and try and forecast based on that.
And this one has a requirement to deal with 400 transactions per second in under 100 milliseconds. So what the developers are going to do is they're going to, first of all, start going through that process and seeing if they can achieve it.
And there's only two ways that they're going to achieve it. They're either going to achieve it by writing better code and looking at indexes and improving things, or they're going to throw more hardware at it. Throw more kit at it.
So what's key is that, effectively, people are aware of that, and that they try and improve things, and that these figures should be after they have done the first wave of optimization because we want to bring these costs down where possible.
And what that's going to say is that it doesn't matter how good at code you are, at some point, you have to get the right resources.
So the resources in question here look quite abstract, and the reason for that is that we do all of this testing in the cloud, and we use AWS. But I didn't want the talk to be AWS-specific. So it's not like M4.large. It's just, conceptually, it's a resource that's getting bigger or smaller, and that you know the costs of that resource.
And that would be equally true if you were doing platform as a service, and you were picking some cloud service, and you could select the size of it. Again, you'd know the cost of that. Or even if you're in the data center and you're speaking to a provider about provisioning virtual machines for you, they can give you a cost. So you have a way of mapping from one to the other.
And what we'd start to see is that there's just no way that the smaller resources are going to be able to meet that requirement. So we can see that the large and that the extra large one reasonably comfortably deal with that requirement.
Again, still not forecasting. This is just what they've done to work out what resources they need to meet the requirement.
And we actually automate this as well, where it can actually deploy to different sizes and then record the result, so that it's easy to find the optimum size for the one that you need.
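A minimal sketch of that selection step might look like the following, assuming the test harness has already recorded a 95th-percentile response time per resource size at the required load. The size names, the figures, and the function name are illustrative assumptions only.

```python
# Abstract resource sizes, ordered from cheapest to most expensive.
SIZES = ["small", "medium", "large", "xlarge"]

# Illustrative 95th-percentile response times (ms) at 400 transactions/second.
measured_p95_ms = {
    "small": 900.0,
    "medium": 310.0,
    "large": 85.0,
    "xlarge": 60.0,
}


def smallest_sufficient_size(p95_by_size, max_response_ms, sizes=SIZES):
    """Walk the sizes cheapest-first and return the first that meets the limit."""
    for size in sizes:
        p95 = p95_by_size.get(size)
        if p95 is not None and p95 <= max_response_ms:
            return size
    return None  # nothing tested was big enough


print(smallest_sufficient_size(measured_p95_ms, max_response_ms=100))
```

With these made-up numbers the large and extra large sizes both meet the requirement, and the function picks the large one because it is the cheaper of the two.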
But what we could then start to do is to look at the data and say, well, I suppose this is now very interesting. We don't have a massive monolithic software with a big bill attached to it. We've now got some resource sizes, and we can always get to cost from those, for very small parts of the application.
So when we start to have tens or hundreds of microservices, I wonder, again, with this correlation, whether you find that another microservice that has to deal with the same number of transactions a second and get back in the same response time actually needs the same resources. Yes, they're going to be in different categories. Are they CPU-bound, memory-bound, IO-bound, all of these things? But that aside, they are going to start to have these familiar traits.
So we start to be able to say, well, I suppose if a future microservice that's similar also has the same performance requirement, then we could probably allocate that to a large VM, and we'll see about costs later, and start to predict what those costs might be.
But that would obviously be very restrictive if the only way you could forecast was based on, by chance, having future microservices have exactly the same performance requirement as the ones that you'd already built. Very limiting.
So we're going to do two things. We're going to take the services that we've written, and we're going to see how they perform at different loads, both above and under what they've actually been designed to do. And we're going to extrapolate from that.
And we'll also, as we'll see, we'll need to still change the resources as we go through that, because otherwise you can be led astray.
So let's have a look. The first thing that you'll find is that if you take a microservice, in this case the one that we just built, and it was happily dealing with its 400 transactions per second in under 100 milliseconds, if you didn't do the test, you might start to extrapolate in a linear way that it can deal with 800 and so on. That's not a good idea, because they don't tend to scale in a linear way. They tend to just get swamped. Once they can't cope, they can't cope.
So we would like to be able to say, how would it have performed if we'd given it a bigger resource? And then also, what if the future microservices have got smaller transactions per second requirements?
Well, yeah, we know that the large one could cope with it, but is it too big? Could we actually find a smaller resource that would have been able to cope with that?
So if we run that again, the same microservice, just a single one at this point, under now different resource sizes, you'll see a similar pattern where they can all deal with a certain number of transactions per second up to a point, and then they just tail off and they can't meet it anymore.
But we start to be able to see, if a future one needed to deal with 600 transactions a second, again in under 100 milliseconds, the resource size of, in our case, like an M4.xlarge or something, would probably be sufficient there.
But we also start to realize that we could get some savings, and maybe only allocate a medium resource to a service with a smaller requirement. So we start to make some progress. And even by now, it's using empirical data to come up with some sizing. It's not going to be perfect, but it's still based on something.
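The load sweep described here can be sketched as a small lookup over measured curves. Everything in this example is an assumption for illustration: the response times are invented to show the "cope, then get swamped" shape, and the function names are hypothetical. It only trusts load levels that were actually tested, rather than extrapolating in a straight line.

```python
SIZES = ["medium", "large", "xlarge"]  # cheapest first

# Invented sweep results: load (transactions/second) -> 95th percentile (ms).
sweep = {
    "medium": {100: 70, 200: 85, 300: 95, 400: 450, 600: 2000},
    "large": {100: 50, 200: 55, 400: 90, 600: 700, 800: 3000},
    "xlarge": {100: 40, 400: 60, 600: 95, 800: 400},
}


def max_sustained_tps(curve, max_response_ms):
    """Highest tested load that still met the limit before the curve tails off."""
    best = 0
    for tps in sorted(curve):
        if curve[tps] > max_response_ms:
            break  # once it can't cope, stop trusting higher loads
        best = tps
    return best


def smallest_size_for(sweep_results, target_tps, max_response_ms, sizes=SIZES):
    """Cheapest size whose measured capacity covers the target load."""
    for size in sizes:
        if max_sustained_tps(sweep_results[size], max_response_ms) >= target_tps:
            return size
    return None


for size in SIZES:
    print(size, max_sustained_tps(sweep[size], max_response_ms=100))

print(smallest_size_for(sweep, target_tps=600, max_response_ms=100))
print(smallest_size_for(sweep, target_tps=300, max_response_ms=100))
```

With these figures the large size copes at 400 but not at 600, so doubling the load does not mean it still responds in time, and a 300 transactions per second service could drop down to the medium size.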
We can actually, though, take it a bit further. And for this, I'm just going to come up with an actual requirement for a future microservice and say, what if we have a recommendation service, something that every time you do a search, is going to recommend somewhere else that you might want to go?
So we decide to roll that out to only 5% of the customers. And we know that the total is something like 2,000 searches a second. So we can calculate that the performance requirement for this future microservice would be to deal with just 100 transactions a second at that point in the experiment phase.
And we decide it's going to be part of the same kind of request as a web page, so it needs to have a similar response time. For this one, we'll say that it needs to respond in under 150 milliseconds.
But by this point, let's now say that we've built a few more microservices. Even if we've only built two, three, four, we'll have the data for those as well.
So we query that data and say, can you show me how the microservices that we've built to date would've performed at that load? That may not be the load that they were asked to perform at, but you now have this real data.
And in this case, I've removed the larger resources, because we're not interested in wasting money. We're interested in what was the minimum resource that could've met that requirement.
So in this case, we could say, yeah, across three different microservices, it looks like if we'd allocate a medium-sized virtual machine to that, it could probably cope.
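Here is a small sketch of that query, under some loud assumptions: the service names and measurements are invented, the history is a plain list rather than whatever store holds the real test results, and taking the largest of the per-service minimums is just one conservative way to turn three answers into one forecast.

```python
SIZES = ["small", "medium", "large"]  # cheapest first

# Requirement for the experiment phase: 5% of 2,000 searches a second.
total_searches_per_second = 2000
rollout_percent = 5
required_tps = total_searches_per_second * rollout_percent // 100
max_response_ms = 150

# Invented history: (service, size, tested load, 95th percentile in ms).
history = [
    ("service_a", "small", 100, 210), ("service_a", "medium", 100, 120),
    ("service_b", "small", 100, 140), ("service_b", "medium", 100, 90),
    ("service_c", "small", 100, 400), ("service_c", "medium", 100, 145),
]


def minimal_size_per_service(records, tps, limit_ms):
    """For each existing service, the cheapest size that met the limit at this load."""
    result = {}
    for service, size, tested_tps, p95_ms in records:
        if tested_tps != tps or p95_ms > limit_ms:
            continue
        current = result.get(service)
        if current is None or SIZES.index(size) < SIZES.index(current):
            result[service] = size
    return result


per_service = minimal_size_per_service(history, required_tps, max_response_ms)
# Conservative forecast: the biggest of the per-service minimums.
forecast_size = max(per_service.values(), key=SIZES.index)

print(required_tps)
print(per_service)
print(forecast_size)
```

With this made-up data one service would have been fine on a small resource and two needed a medium, so the forecast for the future recommendation service comes out as medium.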
But what's important is that we start to add in this utilization data, because otherwise, this is not granular enough. It starts to only be able to map to these kind of large... Even at the VM size, it's too big.
Because what you'll then find if your microservices are broad enough, they'll have very different performance requirements. And this is where people tend to, when they're doing the finger-in-the-air thing, go wrong because they over-provision a lot.
So if you only needed to deal with 20 transactions per second or two or something, people tend to say, "Well, yeah, we'll just give it a small virtual machine. That'll be fine."
But by recording the utilization and maybe seeing that for a very low performance requirement, it only needed 10% of a CPU, you can then start to make some decisions about how you allocate the resources to that microservice.
So if you're in the cloud, then you can start to say, well, when I go to my chosen cloud provider, it never really just says, "What virtual machine do you want?" It says, "Do you want something that is CPU-intensive or is optimized for memory or IO?" And a lot of people just don't know. They don't know the answer.
Whereas if you start to record this, you can start to see that across all of your microservices, maybe there's patterns there, and that actually most of them, certainly in our case, were more CPU-bound and using very little memory, but it'll obviously be different for every company. So you can start to tailor the resources.
In the on-prem world, again, sometimes you've got more control. So you can actually say, when I'm building or provisioning my virtual machines, just make sure you pare down the memory on that. We don't need all of that for that particular microservice.
Or if you're dealing with the very small microservices, you now move one layer down and you stop talking about virtual machines and you start talking about containers. Because you can still provision sizing at the container level.
You could still say, well, I'm going to take a small virtual machine, I'm going to put some containers on that and tell each one, "You can use a tenth of the CPU," or, "You can use this much memory." So you've got a lot more control.
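As a back-of-the-envelope sketch of that container sizing, the recorded utilization can be turned into a count of how many small services fit on one virtual machine. The VM size, the per-container limits, and the function name are assumptions for illustration, and the calculation ignores operating system and container runtime overhead. CPU is counted in thousandths of a core to keep the arithmetic in whole numbers.

```python
def containers_per_vm(vm_millicores, vm_memory_mb,
                      container_millicores, container_memory_mb):
    """How many containers with these limits fit on one VM (no overhead allowed for)."""
    by_cpu = vm_millicores // container_millicores
    by_memory = vm_memory_mb // container_memory_mb
    return min(by_cpu, by_memory)


# Hypothetical small VM: 1 CPU (1,000 millicores) and 2 GB of memory.
# Hypothetical service: measured at about a tenth of a CPU and 128 MB.
fit = containers_per_vm(
    vm_millicores=1000, vm_memory_mb=2048,
    container_millicores=100, container_memory_mb=128,
)
print(fit)  # → 10
```

In this made-up case the CPU is the limiting resource, so ten of these services share one small VM instead of each getting a VM of its own.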
And then you can start to map to costs, and the two are not identical because clearly it depends what the project phase is. If you're at the experiment phase, and again, if you're in the cloud, you're probably going to pick like an on-demand type pricing. I will sign up and basically pay by the hour for that resource because if the experiment fails, I don't want to be tied into a contract.
If the feature succeeds and it moves into an exploit phase and you're still in the cloud, then you could move to a reserved instance model, or the naming convention of your chosen provider, to say, "I'm happy to sign up for a few years." And again, on-prem, you can ask them again exactly what costs are you going to be seeing at that point.
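The mapping from a resource size to a cost, by project phase, can be sketched like this. The hourly prices are entirely hypothetical and are not real quotes from AWS or any other provider, and the phase names simply follow the experiment and exploit wording used in the talk. Prices are held in pence per hour so the arithmetic stays in whole numbers.

```python
HOURS_PER_YEAR = 24 * 365

# Hypothetical prices in pence per instance-hour, not real provider figures.
PRICE_PENCE_PER_HOUR = {
    "small": {"on_demand": 5, "reserved": 3},
    "medium": {"on_demand": 10, "reserved": 6},
    "large": {"on_demand": 20, "reserved": 12},
}


def pricing_model_for(phase):
    """Pay by the hour while experimenting, commit once the feature is exploited."""
    return "on_demand" if phase == "experiment" else "reserved"


def yearly_cost_pence(size, instances, phase):
    """Cost of running `instances` of one size around the clock for a year."""
    model = pricing_model_for(phase)
    return PRICE_PENCE_PER_HOUR[size][model] * instances * HOURS_PER_YEAR


experiment = yearly_cost_pence("medium", instances=2, phase="experiment")
exploit = yearly_cost_pence("medium", instances=2, phase="exploit")

print(experiment / 100)  # pounds per year while experimenting
print(exploit / 100)     # pounds per year once committed
```

For an on-prem provider the same function would work with a single quoted price per size, since the choice between pricing models mostly applies in the cloud.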
Now, the only real difference between the two is that when you're performing these tests, you tend to be interested in different loads. If you're on-prem, you don't get elastic scaling, so you tend to be much more interested in the peak load. But the tests are the same.
If you have a peak load or the average load that your services are under, it's just a different number. It's the same test, which is to say what resources are required to meet that load.
So once you do these tests at different thresholds, it doesn't matter whether you're trying to scale for peak or you're trying to scale for the average one. You can now map to costs.
Let's just see how long I've got. Okay. Let's finish up quickly then.
At easyJet then, we went through this process when we'd only done, say, five microservices, and we used it to forecast the cost for the other 80-odd. And we found exactly this: the small ones were where the savings were.
And we actually saw an order-of-magnitude improvement. To compare with the finger-in-the-air number of 4 million-odd, we extrapolated the worst case, which was not the experiment phase, but everything being exploited and needed over a given number of years. And we came up with more like 400,000. That's an order-of-magnitude improvement, which can be the difference between projects being canceled or not, if you can actually do that.
So in summary, yes, the exciting key pillars of DevOps are around making sure you've got your structure right in terms of teams, requirements, and architecture, but also everybody needs to work together with that list of questions, that list of problems. Bless you. How did we deal with that?
So this is just an approach that we went through and which we found obviously really useful. It could be extrapolated further, and if we're asked questions around licensing and total cost of ownership and things, we could extend it into that, but it's worked well for us.
So thank you for listening. If you want to reach out to me on Twitter, it's @MrJohnEricDavis, and the same on LinkedIn.
Thank you.