Taming Complexity - The Rise of Platform Engineering in Enterprises
As IT capabilities increase, we deal with ever faster-moving change in an environment with more independent teams and several technology stacks. How do we keep this under control? Platform engineering is an emerging answer that brings systematic governance without adding red tape. The talk, which draws on three case studies that I have been part of, is structured as follows: the changes in the IT landscape, a working definition and reference architecture for platform engineering, and the specific experiences of implementing it in three different organisations.
Full transcript
The complete talk, organized by section.
Mirco Hering
Good morning, everyone. The flow of people coming in is slowing down, so we'll get started.
I am quite conscious that I'm standing between you and lunch, and potentially more important, you're standing between me and the Champions League match that I want to watch. Thank you. Thank you for having me.
And this is one of those talks where Gene set the bar pretty high, because he called them expert talks in this morning session. So we'll see how it goes. I appreciate having the soapbox here to talk, and hopefully you'll be able to take something away.
My name is Mirco Hering. I work for Accenture. I look after our DevOps practice and our developer experience, which is an interesting challenge when you consider that we have over 700,000 employees, and the majority of those are engineers. So you can see how close that is to our hearts, to figure out how we do this better.
So what I'm going to share with you today is really the ideas behind platform engineering, developer experience, what that means, how we can make progress in that.
I'll start with something I know: here, I'm among friends, right? So we're all kind of coming from the same view, where software is incredibly important to everything we do. And here you have the common quotes, "software is eating the world" and "every company is a software company." I think that's very true.
The challenge is, I don't think we have done what is necessary to reflect that. Because if you think about every company is a software company, what that means is that your engineering and your IT have to be at the forefront of what you're doing. And we've been still treating it very differently.
I'll give you a couple of examples. If you go to the CIO and you ask him, "So what is the system that runs your IT?" you wouldn't get a good answer, I'm pretty sure. If you go to your sales organization, you'll very likely get something like Salesforce. If you go to your manufacturing, you'll get an ERP system from SAP, perhaps.
The CIO doesn't really have that. There's a CI/CD pipeline and there's Terraform and there's all this kind of stuff, but there isn't really a system that has been built out to manage it. It's quite interesting, which means we haven't really invested in that space enough for us to deal with the concept that software is now our business.
So there are really three trends that I see around this.
One is obviously software engineers as key employees. So it is not just the person who sits in an office for a while who's making sure that the server is up and running. They're now one of your key employees, similar to your designers and your architects and everything else.
We have an increasing complexity in software delivery. When I started in software delivery about 20 years ago, releases were on a quarterly basis. And that was completely fine, because you could manage it. There were about a hundred components. You could name them. You could number them. You would do it once a quarter. You had these kind of rooms like this, where you had a project plan on the side and you could tick off what's happening. That was possible.
Nowadays, that's not possible anymore. We have hundreds of microservices. We are deploying 15 times a week. So clearly, we have a very different complexity that we have to deal with.
And we had an interesting trend that came over the years where, with Agile and DevOps, we empowered the teams. Now, that led to what you've heard this morning as well, a sprawl of tools. And we have decentralized a little bit too much.
So how can we get the balance right? Because the days are also gone where we say, "We are a Rational organization. We have all the tools from one vendor, and everyone needs to use them." So where can we find the right balance? And I think platform engineering is one of the answers for that.
And then we have a more liquid workforce. We all know this. People are moving between companies more frequently. That means one of the things that is really frustrating is that we have really well-understood principles like CI/CD, but they're not necessarily codified in a way that you can easily translate from one organization to another. And I think there's really good work happening in that space as well.
As with everything we do, we have a problem with terminology here, because everything means something different to different organizations. So for me, developer experience and platform engineering are obviously going hand in hand.
Developer experience is focusing on the person in front of the computer, and platform engineering is ultimately the system that they engage with. I think that's a really reasonably useful definition.
There's this other thing that plays a role here, which is very, very frustrating, which is VDI and Citrix, as you see in the middle of the top. Any of you who are engineers will know that that can be an absolute pain. So there's nothing I can do about this, but I just wanted to make sure we're at least touching on that, because there's a lot of productivity lost just in that.
There's a little anecdote. I was working for an organization a little while ago where, because of the way that the VDI was set up, it actually took about 16 hours to check in a patch. And so that means the patch team was the best team at table tennis by far, by far. So we need to think about those things, and I'll talk a bit about that in a second.
All right, so let's start by focusing on our engineers. This is basically Dan Pink's what motivates knowledge workers, and I think engineers ultimately are knowledge workers. So we're talking about autonomy, mastery, and purpose. And we are uniquely positioned with Agile and DevOps to actually help along those three dimensions.
I've stolen this from Sketchplanations, which is a really cool website for making things simple.
Purpose is basically we're working towards something that is meaningful to us. And there's a couple of things that we're doing to help with that. We are creating vertically integrated teams. We heard again this morning, and we have the Agile message where we are not just a tester that executes the test step. We are actually understanding what we are testing. We are understanding our functionality. We have product owners. We have people who are explaining to us what are we doing with our customers.
And we are providing business measures so that the teams that are owning a specific customer journey or customer service can actually see what that means for our business. So what is the funnel of a product through this?
Autonomy, and I think that is one of the misunderstood pieces. Autonomy doesn't mean that I, Mirco the engineer, can choose what tools I want to use, because that is incredibly expensive for your organization. And it's also dangerous.
Here's an analogy. I have a six-year-old at home, and I want my six-year-old to become a really, really responsible adult at some stage. Now, that means I want him to be autonomous.
If I'm now sending him into the kitchen and say, "Have some food," what do we think is going to happen? Is he going to have a nice balanced diet, or is he going to go straight for the cookie jar? And he knows where that drawer is. And in the drawer, there's a section for me and there's a section for him. And guess where he always goes first? My section. But nothing is wrong with my son. That's just how it is.
So very similar, when we say, "You are now an autonomous team," or, "an autonomous engineer," that doesn't mean automatically they know what good design looks like, what security means, what data integrity is. That doesn't come by just letting them loose. It comes by providing them the right guidelines to make it easy to do the right thing.
Guidance. Autonomy means they can achieve their outcome and can focus on that, while everything else is kind of taken care of or gets easier.
So from an autonomy perspective, there's a couple of key principles here.
Decoupled architectures, which means you don't have to wait for everyone else to have a common object that they need to upgrade. If you're working with big package software, that might sound familiar. If you're working with mainframe, it might also sound familiar. So decoupling architectures is one of the things that we can use to increase autonomy.
We should have seamless tooling, so things that don't require us to log into 15 different things to get something done.
Self-service capabilities, and again, platform engineering has a huge focus on that.
And then developer touchpoints. What I mean by that is, how many interactions do I have with the system? Remember that the developer doesn't necessarily care about the whole system that sits behind it. They care about their individual touchpoints with it.
So if they get a template that is already security approved and has all the right data stuff in it, et cetera, then that's fine for them. They don't need to know which tool you've used for that. They actually don't care. They care about their touchpoints.
And at the bottom, you see three quick questions from an analytics perspective that you can use to just measure your autonomy over time, because that's always one of the things that we want to achieve. We want to become more and more autonomous. That means less and less tickets, less and less teams that we need to interact with, and that will ultimately get us better.
And then the last one is mastery. And again, we have the mechanism for that in place, certainly from a technical perspective. And we talk a lot about that, about the fast feedback cycles that are coming through CI/CD, but also the business mastery from stakeholder feedback. Am I actually developing features that are useful?
I was shocked 15 years ago, and I'm still shocked, when I walk into organizations and ask that question. They say, "We've increased our features. We are now getting 30% more features into production." Awesome. How often are they being used? And you very often basically get silence. Because that's not what the organization does. They get stuff into production. They don't necessarily measure whether it's being used. And so that is an important part of the puzzle.
If I take platform engineering and look at the typical IDE or the developer portal, whatever you want to call it, very often there are no business metrics on it. And that obviously shows us that this is not the feedback they're seeking at the moment.
So we have a lot of good stuff in there. We can achieve the empowerment of our developers. We can unleash the creativity of our developers by focusing on these three things.
This is a nice little slide that a colleague of mine has created, which kind of demonstrates a bit how broad that developer experience is. You have, on the left, team onboarding. We talked about this this morning again. But I've been in an organization where it takes 10 weeks, 10 weeks before you have access to all the systems you need. Perhaps it takes you two weeks to get a laptop. That is obviously an absolute waste of time for your key employees.
You have the developer marketplace in the middle. That's kind of your platform engineering part, the developer portal part. And then on the right-hand side, you have these career journeys.
So it can't be that our model for engineers is the same as everywhere else. Engineers want to stay engineers, potentially. They don't necessarily want to lead teams and big projects and so forth. So we need to figure out how we can create that engineering culture that allows them to continue doing what they do best.
There's nothing worse. I know this from experience because I had to learn that, than promoting your best engineer to become a team lead. They're really not going to enjoy this, potentially. And that is a really hard truth. And that means we can't just have one size fits all.
So there are five blocks of dimensions, or whichever you want to call it, that go into this developer experience.
The first one is organizational structure. And that is this vertical organization that we talked about. That is the, how do we actually organize ourselves around our architecture? Or change the architecture to start organizing around us. And that's the project to product. There's a lot of good thought process going into that.
Then we have the culture bit. And the culture bit is close to our heart as well. That is the blameless postmortems. That is the engineering culture that allows engineering sandboxes, so that people can be creative and try new things, even though we want to control it. How do we still allow them that space?
We have the developer career model. Talked about that. We, in Accenture, have what we call the Chief Technology Engineer program, which is really trying to make it possible for people to get promoted outside of the traditional metrics. So by doing stuff like what I'm doing here, speaking at the conference, providing white papers, all this kind of other stuff that does not necessarily get measured in the organization very often.
And then we get to the two things that are closely related to platform engineering.
We have a developer journey, and that is your process. So this is from onboarding to, how do I create a server? How do I create my first service? All of those.
And then we have systems and tools, and I'll talk about that a bit more in a second as well.
So if we are serious about changing the experience for engineers, then we need to think about their workflow. And there's a couple of things. This is a busy slide, but basically what it tells you is this: there's features that you're delivering, there's maintenance that you're delivering. And it's not as easy as people say, that we just have one two-pizza team, that's it.
But we have to worry about what actually goes into this and what the different systems are that people engage with. And then you have a transformation on top of that. So you have maintenance, you have new features, and then you have uplift. And that might be tech debt reduction, might be upgrades, might be things that you want to change by introducing a new tool or a new practice like CI/CD.
Now, we've been now in the DevOps community for, and ITF for 10 years, and DevOps a bit longer. I find it still quite shocking. Maybe it's DevOps, right? We keep talking about we need to break the silo between dev and ops. You need to do the culture work behind this. But then you go to organizations, this is what you see.
The developers log into a tool to build features. Ops, or even if it's the same team, they log into a different tool to manage the operational procedures. Let's call the left one Jira, call the middle one ServiceNow. And then if they want to see business metrics, they actually have to go somewhere else.
Nothing reinforces the problem we have with culture more than having different systems. Because at the moment you log into a different tool, you know you're in a different context. You know that you spend 80% of your time in Jira and 15% of your time in ServiceNow.
We can't say that we have a seamless culture if we keep forcing people into different tools. So we need to think about that. And I don't think there's a terribly good answer. I'm still looking for it. But to me, that is really frustrating. After so much work that we've done, we still haven't created that seamless interface for our engineers.
So I talked about how delivery has become a lot more complex, and you see a couple of data points on that. But I think that is really important to think about. We are at the point where if we just allow ourselves to continue doing what we're doing, we are going to get into more and more trouble.
Because as I heard once, to err is human. If you want to create a catastrophe, you need automation. And that is true.
Think about this. If you want to take a server farm down manually, that's pretty painful. You have to log into lots and lots of machines. Taking down your whole region in AWS, pretty trivial if you have the right access. And there's examples where whole organizations disappeared overnight because they didn't secure their stuff correctly.
And I have my own experiences where I get the call in the morning from one of my engineers saying, "Hey, unfortunately, they deleted all Jenkins servers." And at that point, you're stuck.
Because the other thing that we have to learn from this is: once you automate everything, try to find the person that can do it manually. Pretty hard to find, because no one knows this anymore. It's quite shocking.
And that's very often because when we automate, we just hack it, versus thinking it through. I'm a big fan of documenting first before you automate, because then at least you know what you're automating and you've optimized the process. That stuff is really, really important.
And then look at the level of toolset. These are two random things that I took from the internet. Is it surprising that a CIO is struggling with figuring out what tools to use?
And this is my second gripe, which comes up again at the end when we talk about what I need help with. We need to solve this somehow. Every vendor has their own capabilities, and that's fantastic. But we haven't really created a reference model that allows us to talk about it in the same way that the Salesforces and the SAPs and so forth talk about it.
So we don't have a common reference model. And I think that's one of the things that will come now, that we will see from platform engineering: people are creating reference architectures. There are things like the Humanitec and McKinsey one. You will see one that I've created. They will come out, and hopefully we start coalescing on something that we can all stand behind.
So this is a reference architecture that I use. And again, this is obviously opinionated, but from a platform engineering perspective, these are elements that you'll have.
You have your SDLC management tooling, as mentioned before. That's the stuff where you have new features going through.
You have your service management platform. That's where you have your ticket, where you have your monitoring.
And then you have the developer workstation, IDE, developer, your Visual Studio .NET or whatever you want, VS Code, whatever you use.
And then you have this core part of the platform. And the core platform starts at the bottom with the cloud landing zone. For the cloud landing zone, at the end of the day, it doesn't really matter whether you're in a public cloud, private cloud, or even your own data center. It is where the resources are coming from. That's where your services come from, your data, your storage, your CPU, et cetera. So that needs to be composable, which means we can start orchestrating it with a layer above.
The layer above has three different platforms. First, there's a data platform. The data platform is obviously where all your data sits. So from a business perspective, most organizations are developing a data platform for themselves, for their business needs.
Similarly, we want to be part of that, because A, we have a ton of data in it that we want to use, but also we need to have access to the data to create our test data, for example. Source it from there, mask it, all that kind of stuff comes with a good data platform.
On the right-hand side, you see the container platform. If you're working with containers, microservices, et cetera, then you have that orchestration platform there.
And then in the middle, you have your software engineering platform. And I'll talk about that a bit more. But that is where a lot of the tooling that I showed on the previous slide sits.
And then above that, you have your service catalog, your developer portal, et cetera.
So I'm not going to go into this in detail. You can take a photo, and you'll get the slides. And I have a new version of it that has a third column. It just shows you how many capabilities there are.
So if you have something like this, then it becomes a very easy conversation with your organization to say, "Okay, which capabilities do we need? And with our tooling, which of these are covered?" You can basically color this in any which way you like.
That is the first part of having a meaningful conversation, because it's not as easy as, "We have Jenkins, hence we have CI," or, "We have Tosca, hence we have test automation." That is just not true. So we have to kind of be a bit more differentiated around that.
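The coloring-in exercise described above could be sketched roughly as follows. This is a minimal illustration, not the speaker's actual capability model: the capability areas, the sub-capability names, and the tool inventory are all assumptions made up for the example. The point it shows is that owning a tool only covers the sub-capabilities it has actually been set up to deliver.

```python
# Hypothetical capability map: names are illustrative, not taken from the slide.
REQUIRED_CAPABILITIES = {
    "continuous integration": ["build on commit", "unit test execution", "artifact publishing"],
    "test automation": ["test data provisioning", "regression suite", "result reporting"],
}

# Assumed inventory of what each installed tool is actually configured to do.
TOOL_COVERAGE = {
    "Jenkins": {"build on commit", "unit test execution"},
    "Tosca": {"regression suite"},
}


def coverage_report(required, tools):
    """Return, per capability area, which sub-capabilities are covered and which are gaps."""
    covered_anywhere = set().union(*tools.values()) if tools else set()
    report = {}
    for area, subs in required.items():
        report[area] = {
            "covered": [s for s in subs if s in covered_anywhere],
            "gaps": [s for s in subs if s not in covered_anywhere],
        }
    return report


report = coverage_report(REQUIRED_CAPABILITIES, TOOL_COVERAGE)
for area, result in report.items():
    total = len(result["covered"]) + len(result["gaps"])
    print(f"{area}: {len(result['covered'])}/{total} covered, gaps: {result['gaps']}")
```

In this made-up inventory, having Jenkins still leaves a gap in continuous integration, which is the kind of differentiated conversation the reference architecture is meant to enable.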
So to build this out and to make it as useful as possible, we need to understand that our platform is ultimately our product. And our engineers, our developers, are our customers. Simple as that.
We can't just create tools and hope that people use them. We can't just prescribe them. We have to really see this as an engagement with the community.
And some of the best measures that you can have for this is, how many people do you have? If you have a central tooling team, shared service DevOps team, I don't care how you name it, it's the same principle. It's one of the things that we get really frustrated with, I think, when people say, "Oh, we can't have a DevOps team." But who's going to do all this? Because at the end of the day, it is very difficult if you have a team of 10, an Agile team of 10, for them to have all the skills. And so clearly someone needs to provide some of these services behind them.
So whatever you name them, they will be the people who create a product for your company. And they should then have proponents, sponsors, or whatever you want to call them, in the community of the developers. So you have alpha developers, the mainstream developers, people developers, and that's who they're working with. Because that's how you measure.
And I can tell you from my own experience, we obviously have the same challenges. It doesn't help if you have a central team that everyone keeps referring to. You need to have these champions that sit directly with the teams and understand this as well. You need to multiply this, because your shared teams should be pretty small and your organization might be pretty big. That means you have a multiplication problem. And the only way that it works is by having these champions across the organization, like in security for your security tooling, with the Java community, et cetera.
And then you can get the benefits that you see on the right-hand side, like the orchestration, the automation, the philosophy, all that kind of goodness.
So how do you go about this then?
First thing, we need to identify our users and our use cases. Why are we actually doing these kind of developer journeys? That's really what you're going after, the VSM that you do to figure out what does it take me to deploy my mainframe service? What does it take me to create a new API for our customer portal? Whatever these are. And then you figure out what the capabilities are that sit behind it.
Measure your feature usage. I talked about this from a business context. This is true for us as well. Remember, we're creating a product. So that means we need to measure how often are people actually using the specific tools that we use, the specific capabilities we've developed. Because they will change. They'll change all the time. So that means you have to continuously evolve. And the only way that you can do this is to keep measuring what works.
If you have an API management system in your organization, you might be aware of that, because that's what you do with APIs. You measure how often they're being used, and if they're not being used anymore, you basically retire them. And if you don't know how to measure your APIs, then you will just have more and more. Same problem.
You have to have a transparent backlog. That means the community can engage with your backlog. You say what you're going to do next. They can basically put requests in, and you have an engagement of that. What is the most useful thing for us to do?
You allow contributions to the platform. So let them do a pull request on your CI/CD pipeline. And as long as it's with all the standards and so forth, they can contribute this back.
We make it loosely coupled. And loosely coupled is really, really hard in tooling. And you all have experienced that. There's not necessarily the same API that does CI for Jenkins versus for CircleCI versus for whatever. They all work differently.
So loosely coupled means, again, you need to go one deeper in your reference architecture to define how that works, so you can then replace each of the different components.
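One way to picture going "one deeper" is to define the contract the platform needs from a CI tool and put each tool behind an adapter. The sketch below is an assumption about how that could look, not a description of any real Jenkins or CircleCI API: the adapters are stand-ins that only return a string, and `CIProvider`, `trigger_build`, and `release` are hypothetical names.

```python
from typing import Protocol


class CIProvider(Protocol):
    """The contract the platform relies on (hypothetical)."""

    def trigger_build(self, repo: str, ref: str) -> str: ...


class FakeJenkinsAdapter:
    """Stand-in adapter; a real one would call the Jenkins API, which is not shown here."""

    def trigger_build(self, repo: str, ref: str) -> str:
        return f"jenkins:{repo}@{ref}"


class FakeCircleCIAdapter:
    """Stand-in adapter; a real one would call the CircleCI API, which is not shown here."""

    def trigger_build(self, repo: str, ref: str) -> str:
        return f"circleci:{repo}@{ref}"


def release(ci: CIProvider, repo: str, ref: str) -> str:
    """Platform code depends only on the contract, never on a specific tool."""
    return ci.trigger_build(repo, ref)


print(release(FakeJenkinsAdapter(), "customer-portal", "main"))
print(release(FakeCircleCIAdapter(), "customer-portal", "main"))
```

Replacing one tool then means writing one new adapter, while everything that calls `release` stays untouched.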
And another thing that is frustrating is the data behind this. So it is not surprising that it's really hard for us to correlate this data, because each of the tools has its own data model. And that means what is called a release in one thing is called a sphere in something else, and it's a team in something else.
For you to correlate that, you need to have basically your own data model. And then with that data model, then it's possible for you to replace something. Because if you're replacing tool one with tool two and they have a different data model, then you're basically back to square one.
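A minimal sketch of such an own data model, under stated assumptions: the two tools, their field names, and the `WorkItem` shape are invented for illustration and do not reflect any real product's schema. Each tool gets a small mapping function into the canonical record, and correlation happens only on the canonical fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    """Canonical record owned by the platform team (hypothetical shape)."""

    item_id: str
    release: str
    team: str


def from_tool_a(record: dict) -> WorkItem:
    # Hypothetical tool A calls the release "fix_version" and the team "squad".
    return WorkItem(item_id=record["key"], release=record["fix_version"], team=record["squad"])


def from_tool_b(record: dict) -> WorkItem:
    # Hypothetical tool B calls the release "change_window" and the team "assignment_group".
    return WorkItem(
        item_id=record["number"],
        release=record["change_window"],
        team=record["assignment_group"],
    )


items = [
    from_tool_a({"key": "A-1", "fix_version": "2024.06", "squad": "payments"}),
    from_tool_b({"number": "B-7", "change_window": "2024.06", "assignment_group": "payments"}),
]

# Correlate across tools using only the canonical field names.
same_release = [item.item_id for item in items if item.release == "2024.06"]
print(same_release)
```

Swapping tool A for another product would then only require a new mapping function; reports built on `WorkItem` would not need to change.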
And then you have to have it well documented. And I know we hate that. I hate documenting. I hate writing it. But boy, is it hard if you don't have it.
So perhaps that's where some of the gen AI stuff will help, which we certainly see. But it also needs to think about the users.
Here's a fantastic test. When a new engineer comes into the organization, see whether they can do stuff without asking anyone. Because that's where you've documented things well. Every new engineer joining your organization is an opportunity for you to see the organization again, because they are seeing it for the first time. If they can navigate it, you've done a good job. If they struggle, then that's one thing for the backlog. It might not be the most important thing, but it's something to add to the backlog.
So I talked a lot about the platform and how much it evolves. And here's another thing that I learned from my six-year-old. I'll call it the breathing in and breathing out of governance. Because I think that's the only way. The extremes will never work. One platform to rule them all, or everyone creates their own. It's in the middle. It depends. It's the typical consultant answer.
But I think this breathing in, breathing out model works quite well, which means I have a certain standard and I allow people to experiment. They can choose their own tool. They can change something with a certain capacity: dollars, time, whatever way you take it.
And then after certain periods of time, they have to come back and say, "Does it still provide value? Is this better? Should we add this to our standards?" If you agree to just add it, then it becomes part of the standard.
If it doesn't, at some stage you have to basically make the call and say, "Look, this is not adding value that justifies the cost of maintaining it," et cetera, et cetera. Basically the TCO of this thing. And then you shrink it.
So you go through these ongoing phases of expanding and shrinking that allows you to manage your platform over time. And you go to the DevOps Enterprise Summit, and you go to the exhibition hall, and you find three new tools that you want to try. So go for it, but then figure out, is it now a new standard? Does it replace something? Or is it in addition? And then you just basically account for the extra cost.
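The expand-and-shrink cycle can be sketched as a small review rule. This is an illustration under assumptions, not a prescribed process: the `Experiment` fields, the three outcomes, and the rule that running over budget forces an early review are choices made for the example. Whether a tool adds value is a human judgment that is passed in, not computed.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Experiment:
    tool: str
    budget: float      # agreed capacity, for example in dollars
    spent: float
    review_on: date    # when the team has to come back
    adds_value: bool   # outcome of the review discussion, supplied by people


def review(exp: Experiment, today: date) -> str:
    """Return 'continue', 'adopt' or 'retire' for one experiment."""
    if today < exp.review_on and exp.spent <= exp.budget:
        return "continue"  # still breathing out: inside the agreed time and budget
    # Review is due (or the budget is exhausted): breathe in again.
    return "adopt" if exp.adds_value else "retire"


# Illustrative data only.
pilot = Experiment("new-scanner", budget=10_000, spent=4_000,
                   review_on=date(2024, 9, 1), adds_value=True)
stale = Experiment("old-dashboard", budget=5_000, spent=6_500,
                   review_on=date(2024, 12, 1), adds_value=False)

print(review(pilot, date(2024, 6, 1)))
print(review(pilot, date(2024, 9, 1)))
print(review(stale, date(2024, 6, 1)))
```

An "adopt" result corresponds to adding the tool to the standard, and a "retire" result to shrinking the platform again.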
So with all this, I'm super excited about this. I think there's a really interesting movement with this. I would love to collaborate on these reference architectures, on everything else. I'm going to put a series of blog posts out on this.
Thanks for having me. Enjoy your lunch. If you want to talk more, ping me, connect with me on LinkedIn. And yeah, see you for the rest of the conference.