Architecting an Optimal FinOps Platform
This session discusses techniques for optimizing your cloud investments using FinOps platform ideas. The core concepts of FinOps are well understood, but the implementation of the recommendations produced by a plethora of reporting and analysis tools is far less consistent and structured. This talk dives deep into these concepts and presents a scalable model.
Most organizations that have adopted public clouds (and in some cases private clouds) over the past decade are still struggling to find an ideal way to optimize their cloud investments. This talk will focus on the techniques, and eventually a blueprint, for building a FinOps platform. The main points I will focus on are the relevance of the triumvirate of Finance, Product and Technology, the core principles of the FinOps platform, and the building blocks of this platform. I will expand on the 5 core principles (data democratization, pipeline management, organizational alignment, sustainability and the platform operating model) and map them to the building blocks. This approach will power measurable improvements for an organization trying to optimize its cloud investments using a platform implemented in a bespoke manner, aligned with the core technology platforms it already has.
Takeaways:
1. Why should you consider a FinOps Platform for your cloud optimization problem?
2. What are the key building blocks of the FinOps Platform?
3. How can the Platform capabilities address the reusable nature of the needs of Finance, Product and Technology, bringing it all under one unified umbrella?
Full transcript
The complete talk, organized by section.
Ajay Chankramath
Awesome. Good afternoon, everyone. Thank you for joining. I think we are good. Let's get started.
My topic today is going to be architecting an optimal FinOps platform. How many of you are familiar with FinOps or work with FinOps today? I see some show of hands. How many of you have never heard of FinOps? Good. So I'm not going to spend a lot of time talking about what FinOps is. I'm just going to assume that we know what FinOps is. I'll probably spend a few minutes just talking about the concept behind FinOps, but other than that, we are going to jump right into the platform side of things and talk about how FinOps and DevOps sort of come together.
My name is Ajay Chankramath. I head up the platform engineering team for Thoughtworks North America. We have our Thoughtworks colleagues from the Netherlands here, so if you have any questions or need any answers to any of the things that we are going to talk about, talk to any of us.
Anyway, let's jump right in. One of the things that we should always be thinking about when we talk about FinOps is that we all have cloud cost optimization problems. Anybody who's using a cloud has a cloud cost optimization problem. It is not something that's going to completely go away at any point in time. It's always going to be there. It's a question of how we actually approach that. If you need a crash course in FinOps, the best way to get that is to check out finops.org. That's the FinOps Foundation, under the Linux Foundation, where you can get a lot of information.
What are we going to talk about today? Like every other conversation Gene talks about, we'll talk about what triggered this whole idea of creating this FinOps platform. Then we'll talk about the overall problem definition, how we are solving the problem, and then jump right into the platform capabilities. Why should this be a platform versus a bunch of automation that we do? Then we will tie this back into the whole DevOps problem, because our pipelines and our engineering effectiveness workflow are well known these days. We need to see how we can marry those two things and make sure that your developers are enabled to solve this problem, not as an afterthought, but as you build your products. Then we'll talk about some takeaways.
First thing: how did this start? If you look at what is happening within the industry, you see that a significant amount of cloud cost optimization problems are being solved in a tactical way. There is no strategic way of solving it. This is something we understand and see day to day, having worked with about a hundred clients who have been doing this on the cloud for the past seven to ten years, and also from what you read in the industry.
You also find that this is not a unique, niche problem. If you take the digital natives out there, about 85% of them are on the cloud, and they have this problem. They have understood there is a problem with cloud cost optimization. Everybody is trying to shift left, just like every other best practice within DevOps, meaning you're trying to solve the problem sooner rather than later. That's great. But about 30% of cloud costs are currently wasted. If you think you have done better than that, definitely take a look at it. You might have, but there is a really good possibility that your wastage might be more than 30%.
The other fact, and this is the most interesting thing for me, is that if you look at a particular year for which you budgeted everything for your cloud and ended up making some optimizations, then going into the next year you continue to underestimate, and your budget is still not aligned with what you really need it to be.
How do people solve it? The biggest ways people try to solve it are to go out and buy a tool or a platform. That's great, but tools and platforms aren't solving the problem for you today, because any tool that you buy is perhaps going to solve little bits of your problem, but mostly it is going to tell you what the problems are. Solving the problems is still with you, with your organization. If you buy a lot of tools, buy a platform, and try to put together a FinOps team, that isn't going to actually reduce the cost. It might end up increasing the cost.
Organizations typically need a holistic solution, not isolated local optimizations. That holistic solution is not just a bunch of tools and process. It is about building the right kind of solution contextually as to what works for you.
What's new about solving this FinOps problem as a whole? It should not be that different. You understand the problem, extrapolate the data, make sure that you get the right kind of data to say where your problems are when it comes to cloud cost optimization, then put together some kind of phased work plan and optimize it. That's no different from any other problem solving. In the FinOps world, there is this whole cycle called the inform, optimize, and operate phases. You collect the data, make some fixes, and make sure that it becomes business as usual.
There are lots of platforms out there. If you are in this field and using cloud, I'm sure you have used some tools to do this optimization. The tools that are out there give you reports and give you recommendations on what you should do. The next thing is, how can you automatically make some optimizations? There are other tools that look at backend catalogs of your cloud providers and give you AI-based rate optimization ideas. That saves some money. If you are trying to save 100% of your wastage, that probably gets you from zero to 30% in savings.
Now comes the biggest problem. When it comes to really trying to create a solution that will solve your overall wastage problem, you still don't have a solution that you can go out and buy. If somebody tries to sell you a platform or tool saying, here is the solution for your FinOps problem, buy this and your problems are solved, I'm sorry, that doesn't solve the problem. If that solves the problem, we should talk. I don't think it does.
The way I typically talk about this is the three Rs of FinOps: reporting, recommending, and remediating. Reporting and recommending are well covered, and there are lots of fantastic tools out there that tell you what the problems are and what you should do. Remediating the problem requires more contextual information, and that's what we are really going to talk about.
As we talk about reporting and recommending, one thing you should be thinking about is accelerator platforms. There are several platforms out there. You might have heard of a tool called Kubecost, and I'm sure you've heard of CloudZero, Cloudability, CloudCheckr, and so many tools. What they promise and what they do is great: they tell you what your problems are and tell you how to solve them. For example, Kubecost tells you information about current usage in Kubernetes clusters and your overall cloud profile. It also goes to the next step and tells you the things you should be doing: unassigned resources, abandoned workloads, and so on. So you see the difference between doing it and telling you what to do. This is the primary information that you need, but the next step beyond that is translating it into remediating the problem.
There is another set of accelerator tools. One example is ProsperOps, and there are other tools like Spot.io and IBM Turbonomic. They take it to the next level: how do you look at the current rates available from your cloud provider and automatically apply that without you having to jump through additional hoops? This gets you from zero to 30, but you still have 30 to 100 remaining to continue to optimize. These are good tools if you want to get started and see some value, but that's where it stops.
Another one from a sustainability point of view is something we built in collaboration with a few foundations. This is called Cloud Carbon Footprint, the CCF tool. Check out cloudcarbonfootprint.org. Essentially it is a set of APIs that works with your cloud service provider and gives you information about your carbon footprint and how you can optimize that. It is similar to the accelerators, but specifically focused on sustainability. It tells you, if you use these resources for these activities and these nodes, here is the carbon footprint you're leaving on the table and here are the sustainability goals you're not meeting. Again, this does not do it for you. It tells you.
One question we always ask is: whose problem is it to solve? When we talk about FinOps, one thing that we keep hearing is whether it is the FinOps team's problem, or a technology problem, or a product problem, or a finance problem. If it's a finance problem, finance has requirements for better budgeting, forecasting, and cost controls. So is it a finance problem? Absolutely. But could it be a product problem too? It could be, because product is thinking about cost of revenue and cost of goods sold, making sure that if you are selling a product, the money you're spending to build and sell that product is not more than what you're making out of it. But eventually it turns out to be a technology problem, because the onus is on the developers who are building the product to understand what the challenges are and solve them. The answer to whose problem it is: it's everybody's problem. This is what makes it so complex.
I said I'm not going into too much detail about what FinOps is, but I want to give a high-level view so you can understand the platform ideas. There are six key concepts. These principles drive cloud cost optimization. The first is ownership of your usage. Then you need collaboration. There are multiple axes where ownership lies, so you need a clear understanding of ownership. Then there is the idea of a centralized team. You've heard concepts like a FinOps team. That's a glue team that brings these people together.
A common misconception about cloud is that people always say, I want to do cloud cost optimization, so I'm going to save money, or let me go to the cloud to save money. That's an unfortunate way of looking at it, because FinOps is not about saving money; it's about making money. It's about increasing your business value. But that's easier said than done, because people still want to save money.
Another big aspect is report accessibility and timeliness. When you are generating reports, how do you make sure they all talk to each other and get the right information? Eventually there is the whole variable cost model of the cloud, which is super important and one of the main reasons people go to cloud in the first place.
Now let's talk about why this remediation problem has to be solved using a platform approach. The primary reason is that if you look at data from our work with clients across industries, and data from the FinOps Foundation, you find multiple things to consider: the data used to make decisions; automation priorities once you have the data; remediation priorities; and eventually the maturity level you have after solving the problem. There is inherent misalignment between each of these activities today.
When I talk about data, I mean cloud utilization data, finance data, business data, observability data, and all the data from various sources that drives decisions. Then you do automation. If you are looking at anomaly detection and see that your cloud cost usage is less than expected, it's not always a good thing, because that might mean your customers are struggling. If it is more than expected, it's not always a bad thing either. You need to know your baselines and make sure anomalies are understood properly. Usage reports and utilization are similar. The priorities of your automation are somewhat different from the data available.
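The point about baselines can be made concrete with a small sketch. This is a minimal illustration, not any particular tool's method: it assumes daily spend figures are already available as plain numbers, and the function name `classify_spend` and the three-standard-deviation threshold are hypothetical choices. It flags deviation in both directions, since lower-than-expected spend can deserve as much attention as higher-than-expected spend.

```python
from statistics import mean, stdev


def classify_spend(history, today, threshold=3.0):
    """Compare today's spend with a baseline built from past daily spend.

    history: past daily spend values (at least two); hypothetical input.
    threshold: number of standard deviations treated as an anomaly (assumed).
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is a deviation.
        if today == baseline:
            return "normal"
        return "above_baseline" if today > baseline else "below_baseline"
    z_score = (today - baseline) / spread
    if z_score > threshold:
        return "above_baseline"  # may be waste, may be healthy growth
    if z_score < -threshold:
        return "below_baseline"  # may be savings, may be customers struggling
    return "normal"


daily_spend = [100, 102, 98, 101, 99]  # made-up numbers
status_low = classify_spend(daily_spend, 60)
status_high = classify_spend(daily_spend, 140)
status_usual = classify_spend(daily_spend, 101)
```

Either non-normal label is only a prompt to investigate with business context; the code cannot tell good news from bad.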
This is also interestingly different from your remediation priorities. Anybody who has played with FinOps knows the number one thing is tagging: making sure you know what your resources are used for. Are you tagging your resources correctly? That is a remediation priority, which is somewhat different from automation priorities and data priorities.
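As a rough illustration of what a tagging check might look like, here is a minimal sketch. The required tag names and the shape of the resource records are assumptions made for the example; a real inventory would come from your cloud provider or an accelerator tool.

```python
# Hypothetical tagging policy; the tag names are illustrative only.
REQUIRED_TAGS = {"owner", "cost-center", "environment"}


def missing_tags(resource):
    """Return the required tags that are absent or empty on one resource."""
    tags = resource.get("tags") or {}
    present = {key for key, value in tags.items() if value}
    return REQUIRED_TAGS - present


def untagged_report(resources):
    """Map resource id to its sorted list of missing tags, skipping compliant ones."""
    report = {}
    for resource in resources:
        gaps = missing_tags(resource)
        if gaps:
            report[resource["id"]] = sorted(gaps)
    return report


inventory = [  # made-up inventory records
    {"id": "vm-1", "tags": {"owner": "team-a", "cost-center": "cc-42", "environment": "prod"}},
    {"id": "vm-2", "tags": {"owner": "team-b"}},
    {"id": "bucket-3"},
]
report = untagged_report(inventory)
```

Running a check like this in the pipeline, rather than as a periodic audit, is one way to keep tagging from becoming an afterthought.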
Now look at maturity. You do all these things, people tell you what to do, you build your automation, and you say you are going to do these things. But the maturity of what you are trying to do is different too. The number one thing from a maturity point of view is unit cost economics. Most people who use the cloud know that unit cost economics is one of the driving factors and have developed some maturity there. For those not familiar with unit cost economics, it is the idea that if your product, service, or application uses a certain amount of resources, you know what those resources are and what money you are spending for them. There is inherent misalignment between all of these.
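Unit cost economics reduces to a simple division once costs are attributed to a product or service. The sketch below assumes that attribution has already been done (for example through tagging); the service, the cost figures, and the choice of "orders" as the business unit are all made up for illustration.

```python
def unit_cost(cost_items, units_delivered):
    """Return cost per business unit, for example per order or per active user.

    cost_items: mapping of cost category to spend already attributed to one service.
    units_delivered: number of business units served in the same period.
    """
    if units_delivered <= 0:
        raise ValueError("units_delivered must be positive")
    return sum(cost_items.values()) / units_delivered


# Hypothetical monthly figures for a checkout service.
checkout_costs = {"compute": 7000.0, "database": 3500.0, "storage": 1500.0}
cost_per_order = unit_cost(checkout_costs, 400_000)
```

With these invented numbers the service costs about 0.03 per order. Tracking that figure over time says more than the raw bill, because total spend can rise while unit cost falls.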
Now let's jump into platform capabilities as a whole. This should resonate because it is not only about FinOps; it is about the overall aspects of how you should build the platform and what capabilities you are trying to create. We talk a lot about cognitive load. We want reduced cognitive load as you build this FinOps platform. Those familiar with Team Topologies know this idea. Efficiency means shared responsibility: not one team trying to make all the decisions and make everything happen, but something that has to happen across the board.
The third key category is agility, making sure there is proper discoverability, because this becomes huge when it comes to the complex services offered by every cloud service provider. Replaceability is another key aspect. We talked about accelerator platforms that give you reporting and recommendations, but what if you are tied to that platform forever? What if the licensing model changes? You want a model where you can change those things. Replaceability becomes extremely important.
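One way to keep an accelerator platform replaceable is to hide each vendor behind a small interface that the rest of the platform depends on. The sketch below is an assumption about how that could look, not a description of any real vendor's API: `VendorAExport`, `VendorBExport`, and their field names are invented stand-ins for whatever export format a tool actually provides.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Recommendation:
    resource_id: str
    action: str
    estimated_monthly_savings: float


class RecommendationSource(Protocol):
    """The only contract the platform relies on."""

    def fetch(self) -> list[Recommendation]: ...


class VendorAExport:
    """Hypothetical adapter for a tool that reports savings as dollar strings."""

    def __init__(self, rows):
        self._rows = rows

    def fetch(self):
        return [
            Recommendation(r["resource"], r["suggestion"], float(r["savings_usd"]))
            for r in self._rows
        ]


class VendorBExport:
    """Hypothetical adapter for a tool that reports savings in cents."""

    def __init__(self, items):
        self._items = items

    def fetch(self):
        return [
            Recommendation(i["id"], i["action"], i["monthly_savings_cents"] / 100)
            for i in self._items
        ]


def top_recommendations(sources, limit=5):
    """Merge recommendations from any sources and rank them by savings."""
    merged = [rec for source in sources for rec in source.fetch()]
    merged.sort(key=lambda rec: rec.estimated_monthly_savings, reverse=True)
    return merged[:limit]


source_a = VendorAExport([{"resource": "vm-1", "suggestion": "downsize", "savings_usd": "40.0"}])
source_b = VendorBExport([{"id": "db-2", "action": "delete", "monthly_savings_cents": 12500}])
ranked = top_recommendations([source_a, source_b])
```

If a licensing model changes, only the adapter for that vendor needs to be rewritten; `top_recommendations` and everything downstream stay the same.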
Composability is also key. The overall platform is not about reinventing the wheel. It is about building on what you have been using as an organization. The platform capability has to have composability built in so that at any given point in time you can change the shape of the platform itself in the most efficient way possible.
Looking at the building blocks, the first set is data related: many different reports come in, so how do you make sure there is timeliness and correlation of the data? The second is organizational: do you really have a DevOps culture? Do you really enable your developers to do the things you want them to do, or are you expecting somebody else to make those decisions while your developers build your product? That becomes a huge part of how you build this.
Third is pipelines. Pipelines are essentially a commodity at this point. When you're building pipelines, how can you incorporate some of these things so this doesn't become an afterthought after you discover problems and go fix your pipelines? It should not be that way. It is something you have to start doing from day one.
Then your overall operating model for platforms: could you make your developers more self-sufficient? Could they self-manage resources? Could you have a more product-driven way of doing things? All platforms, as we know by now, have to have a platform product model. All platforms are built as products, and a FinOps platform would absolutely be built as a product. You need product thinking coming in.
The last thing is sustainability. Doing the right thing with respect to sustainability should not be an afterthought. It should not be something you do when you have time and money. You can actually have it both ways, and that's something to think about as you build this platform.
Talking about GreenOps, we don't look at this as a completely different entity from FinOps. For us, GreenOps is part of FinOps and it has to be. I map the whole GreenOps cycle into the FinOps cycle: inform, optimize, and operate. You get your sustainability data, understand your targets, build sustainability aspects into your IaC and pipelines, integrate with third-party tools like CCF or any other tool, align with ESG, and eventually look at this with the right automated governance. The activities that go into implementing a proper GreenOps cycle are no different from how you typically do a FinOps cycle. It doesn't take extra effort other than getting the right data. Once you have the data, you act on it the same way.
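To show how sustainability data can be acted on in the same way as cost data, here is a back-of-the-envelope sketch. It is not the CCF tool's API. It uses the generic relationship that operational emissions are roughly energy used, scaled by data center overhead (PUE), times the carbon intensity of the grid. Every number and name below is a hypothetical placeholder.

```python
def estimated_emissions_kg(energy_kwh, pue, grid_kg_co2e_per_kwh):
    """Rough operational emissions estimate in kg CO2e.

    energy_kwh: estimated energy drawn by the workload (assumed input).
    pue: power usage effectiveness of the data center (assumed input).
    grid_kg_co2e_per_kwh: carbon intensity of the regional grid (assumed input).
    """
    return energy_kwh * pue * grid_kg_co2e_per_kwh


def within_target(estimate_kg, target_kg):
    """Inform step: compare the estimate against a sustainability target."""
    return estimate_kg <= target_kg


# Made-up monthly figures for one workload.
monthly_estimate = estimated_emissions_kg(energy_kwh=1000, pue=1.2, grid_kg_co2e_per_kwh=0.4)
meets_target = within_target(monthly_estimate, target_kg=450)
```

The comparison against a target plays the same role as comparing spend against a budget, which is why the GreenOps cycle can reuse the FinOps one.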
Now, coming to the actual platform itself, the base layer is some kind of cloud provider. Based on that, there are key prerequisite components: accelerator platforms that provide reporting and recommendations; sustainability accelerator platforms; and your observability platform. This assumes you already have some kind of observability platform in play. If you have it, that's great. If you don't, that is an absolute prerequisite. Even if you don't know it as an observability platform, you probably have it.
Then come the five key components of the platform capabilities: core capabilities, metrics, alerting and notifications, policies, and automated governance. Core capabilities include resource tagging and entity right-sizing: the things you should build to ensure you get what you're trying to get in an automated remediation fashion. Metrics are how you get data and put it back into the system so it can start making decisions by itself. Alerting and notifications are not an afterthought. Some tools already do this, but the question is how you integrate all of those things and bring them under one umbrella.
Policies are where cloud becomes interesting. As a technologist, you may not think about financial policies and business policies, but the question is how you incorporate those into the overall platform. Automated governance is also essential. If you expect a team of people, whether FinOps people or finance people, to sit around and make decisions about what should be done, that is not going to happen. That is not a scalable model. You need automated governance.
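Automated governance usually means expressing policies as data or code that a machine can evaluate, rather than as documents people have to interpret. The following is a minimal sketch under that assumption; the two policies, their thresholds, and the resource fields are invented for illustration and would in practice come from finance and business stakeholders.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Policy:
    name: str
    violates: Callable[[dict], bool]  # returns True when the resource breaks the policy
    action: str                       # what the platform should do about it


# Hypothetical financial and operational policies.
POLICIES = [
    Policy(
        name="non-prod budget cap",
        violates=lambda r: r["environment"] != "prod" and r["monthly_cost"] > 500,
        action="notify_owner",
    ),
    Policy(
        name="idle resource",
        violates=lambda r: r["avg_cpu_percent"] < 5 and r["age_days"] > 14,
        action="schedule_shutdown",
    ),
]


def evaluate(resource, policies=POLICIES):
    """Return (policy name, action) for every policy the resource violates."""
    return [(p.name, p.action) for p in policies if p.violates(resource)]


dev_vm = {"id": "vm-9", "environment": "dev", "monthly_cost": 800,
          "avg_cpu_percent": 2, "age_days": 30}
prod_db = {"id": "db-1", "environment": "prod", "monthly_cost": 800,
           "avg_cpu_percent": 40, "age_days": 200}

dev_findings = evaluate(dev_vm)
prod_findings = evaluate(prod_db)
```

Because the policies are plain data, they can be reviewed by non-engineers and versioned alongside the rest of the platform.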
When building this, you have to build it with an API-first mindset. This is not something you can manually manage. There is a lot of machine learning happening in this space, a lot of AI-based decision making. The tools for automated rate optimization already use a lot of AI capabilities. Your platform should be able to understand what is happening based on data within your cloud space and your contextual environment, and make decisions based on that.
Don't forget the cultural aspect, the people aspect. You want to make sure anything that can be automated and anything that can be driven through developer experience is incorporated here. Earlier talks covered cultural aspects, and those are incredibly important when it comes to building this platform and making it successful.
Now let's tie this back into the overall DevOps lifecycle. We keep hearing the term engineering effectiveness. The idea is making sure your developers are more effective and productive so your customers get the benefit. The way I look at it, FinOps has to be built into every step of the way from day one. It is not an afterthought. In planning, designing, defining, and delivering your products, each step has things you can and should do: choosing the right backend catalog and resources, integrating with your observability platform, and building IaC so tagging, right-sizing, and the other things we talk about become automated ways of working. Making this a continuous cycle means you don't fix your FinOps problem after it happens; you make sure the FinOps problem does not happen. That's the only sustainable way to solve it for the long run.
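Building FinOps into the delivery pipeline can be as simple as a gate that runs before a change is applied. The sketch below assumes some earlier pipeline step has already produced an estimated monthly cost for the planned infrastructure; the function name `cost_gate`, the 10% tolerance, and the figures are hypothetical.

```python
def cost_gate(estimated_monthly_cost, monthly_budget, tolerance=0.10):
    """Decide whether a planned change may proceed.

    Returns (allowed, message). The change is blocked when the estimate
    exceeds the budget by more than the given tolerance.
    """
    limit = monthly_budget * (1 + tolerance)
    if estimated_monthly_cost > limit:
        overrun = estimated_monthly_cost - monthly_budget
        return False, f"blocked: estimate exceeds budget by {overrun:.2f}"
    return True, "ok: estimate is within budget tolerance"


# Made-up figures for two planned changes against a 1000-per-month budget.
small_change_allowed, small_change_message = cost_gate(1050, 1000)
large_change_allowed, large_change_message = cost_gate(1200, 1000)
```

A pipeline step would typically turn a blocked result into a failed build, so that the cost conversation happens before the resources exist rather than after the bill arrives.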
We can't talk about these things without talking about a shared responsibility model. There are multiple owners: product management, finance, development teams, probably third-party providers. Having a clear sense of ownership is extremely key to making this successful. When you have too many components in there, we have seen that this is the number one reason why FinOps offerings don't become as successful as you want them to be. You want to make sure this shared responsibility model is something you start with. The point is to create a clear understanding of responsibility across the board for everybody involved.
What are the takeaways? First, why a platform for remediation? Because we have lots of things out there for reporting and recommendations and hardly anything for remediation. Doing it this way solves the problem for you, as opposed to just knowing what the problems are. Second, we talked about key capabilities of the platform. You need to understand your context and your problems, because it is not about buying an off-the-shelf solution and using it. It is about understanding your problem and mapping it to key capabilities. Third, we talked about the shared responsibility model and how to make sure the right people are thinking about it and solving this problem.
Where do we go from here? This can be split into multiple areas, starting with strategy. Think about ownership aspects, the right skill sets, partnerships with vendors who could be selling products to you, especially on the reporting and recommendation side, and what strategic aspects you should think about. On reporting and recommendation tooling, ask what your real goals are. Are you buying a tool because somebody promised it will solve the problem for you? Kubecost and the other tools are great tools, but are they the right tool for you? Organizations still struggle with how to make that decision. Sometimes we get involved after they have bought the license and find it is not the right solution. That's too late to make that decision, so you need to think about it earlier.
We talked about shifting left from the point of view of engineering effectiveness and how to do that on the cost side. You have to start thinking about whether there is a unified way of looking at your cost and usage. The FinOps Foundation has recently started working on some of these things, and this is still an evolving area. For consumption, we talked about tagging. Continuing to have more innovative ways of doing it would be great. As of fairly recently, the maturity when it comes to tagging is that about 95% of organizations still don't have proper tagging. Eventually, from the point of view of wastage, ask how you can get more informed targets and fix these problems.
To conclude, check out our website for some of the key aspects of this platform and how we are building these things. You might also be familiar with the website of my colleague and chief scientist, Martin Fowler; check his blog out, and you might find interesting information on platforms there. For questions, reach out to me. I'm here for the next couple of days. Reach out to my colleagues too. Thank you.