DevOps Heresy - What I have Learned From Working in Large Enterprises
So we all know the tenets of DevOps - automation, full stack teams, cloud. In my transformation journey with a large IT organization I have learnt that some of these tenets are not always true. Too much of a good thing can actually hurt.
Here are some statements that will surprise you, come to this talk if you want to know why we made these choices:
- The team did not fail pipelines with failed tests
- The team did deploy code that had failed security scans
- The team chose manual cloud deployments over automation
- The team moved from Full Stack to a federated model
Really?!? Yes - of course there is more to this than meets the eye. I will explain what we learnt and how this can benefit everyone on their DevOps journey.
If you are interested in hearing experiences that challenge your view of DevOps join me in this talk. I will share the mistakes and failures of my journey and how I mitigated them over time. Our story has been bumpy and so are the stories of many out there. Let's exchange notes.
Full transcript
The complete talk, organized by section.
Mirco Hering
[00:00:17.500] Well, welcome to DevOps heresy. I will admit that it's somewhat terrifying to be back on stage after three years of doing this in front of cameras, so be kind. It's also good to see that there's not a lot of torches and pitchforks, so hopefully that's that.
[00:00:41.500] In this talk, I want to talk about DevOps transformations, big-T, capital-T transformations, and how we get all the automation in place so we've got full-stack engineers and full-stack teams and everything in the cloud. Oh, sorry, I think that talk is actually next door.
[00:01:01.600] What I want to talk about is the hard reality of working in large enterprises. A few years ago, I would have got goosebumps when teams came to me and said, 'we had our test fail in the CD pipeline, but we deployed anyway,' or when people started saying, 'we're going to centralize our functions into a central team, we're creating a DevOps team.' I would have said, no way, you guys don't get it, we're going to be more DevOps and we need to move forward.
[00:01:31.200] Over time I've learned that, unfortunately, reality is very often different. So I let some of these things go, but I let them consciously go. What I want to do with you here is explore why sometimes we need to let some of these things go for the sake of organizations making progress.
[00:01:45.900] I had a beef with some of the agile community a few years back, where agile coaches would come in and tell you exactly what the real agile is. They would tell you all the reasons why you are not agile: because you're not following this framework or that framework. Everything became a bit too dogmatic, and I realized that here I am having the same problem with DevOps.
[00:02:11.500] A little bit about myself. My name is Mirco Hering. I'm the global DevOps lead for Accenture. As you can imagine, Ben's talk this morning was straight to my heart, tears in the eyes. I do blog at Not a Factory Anymore about the stuff that I learned, and I'm super proud to be part of the IT Revolution family. I wrote a book a couple of years ago, DevOps for the Modern Enterprise. And I do like to break things occasionally, which is kind of a good way of learning.
[00:02:38.800] Where do I usually start? This is really an amalgamation of multiple organizations that I've worked with. It usually starts with an organization that has hundreds of applications. They have very complex contractual models in place, going back to what Ben said this morning: different vendors, different incentives. There's a mix of applications, different technologies, mainframes, SAP, Oracle, you name it, all the good fun stuff that we all as engineers want to work with, and hundreds and hundreds of people involved: testers, developers, BAs, etc.
[00:03:12.100] I was last on stage five years ago, where I talked about the ideas of transformation, and a lot of that stuff made its way into the book. What I'm now going to talk about is more living in the engineering, going back down to the engine room, and what I took away from that.
[00:03:29.400] The obligatory agenda slide: we're going to talk a little bit about CD pipelines, we're going to talk about full-stack teams, we're going to talk about the cloud. And then what I really want to do as well is give you a little bit of what you could take away from this, something you can do at home.
[00:03:46.600] Let's start with CD pipelines. The goal of CD pipelines is that we can build, deploy, and assess applications with fast feedback loops and, ideally, highly automated. That covers building the application, assembling the infrastructure, doing all the testing, doing all the security scanning, and deploying the application.
[00:04:09.800] This sounds great, and there's lots of talk about how we put everything in the pipeline and get super-fast feedback to the developers. What I've learned is if you're putting everything into a pipeline in a complex environment, these pipelines become bloody long. They become very complex. Is the purist answer really, yes, you need to have everything in the pipeline and full applications being scanned and full infrastructure standing up? I want to cover a couple of areas.
[00:04:34.800] Let's start with a usual illustration of the pipeline. Unit testing is amazing, especially if you're using TDD. Then unit testing really helps you to assess your application. But if you don't do TDD, unit testing is something that the developer does along the way when they're developing, or potentially afterwards because they now want to start testing. It does happen in organizations, I've heard.
[00:05:00.200] Is it then the right way to incentivize people to just create more and more unit tests? You have thousands of unit tests, you have to keep maintaining them, and I started to realize that this can become really expensive. You want to think about which functions are either changing frequently or being used often, so that assessing those functions is really important. But the stuff on the fringes, do we really need to?
[00:05:25.400] I'm going to keep coming back to the topic of economic decision-making. You need to think about this investment right now, right here in this automation: is it useful, or are we doing it to achieve 80% unit test coverage or 100% automation? There's no one walking around giving you the badge for 100% automation: you're the real DevOps. Really start making decisions on the basis that you figure out whether there's real value to it.
[00:05:49.600] Testing is very similar. We want to run the pipeline 100 times a day. That means we want to, 100 times a day, test the full application. If you have an open source framework, all you have to pay for is a bit of compute and the different instances. If you're using a commercial testing tool, all of it becomes a very different conversation, because if you have 10 pipelines in parallel, you need to have 10 different licenses. If you want to run multiple seeds in different permutations, you have thousands of permutations to run.
[00:06:21.400] Perhaps you shouldn't run it all. Perhaps you should run a very small set once in a while, or all the time, and then once in a while, the weekend, at night, whatever you want, a full set, and then you check in your pipeline that you recently ran a full test. The speed of your operating will put a lot of pressure on your infrastructure, your licensing, and your commercial architecture.
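A minimal sketch of that pipeline check, assuming a hypothetical seven-day freshness window; the function names and the policy value are invented for illustration and are not from the talk. Every run executes the small set, and the gate only verifies that a full run passed recently:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy: a full regression run must have passed within this window.
MAX_FULL_SUITE_AGE = timedelta(days=7)


def full_suite_is_fresh(last_full_pass: datetime, now: datetime,
                        max_age: timedelta = MAX_FULL_SUITE_AGE) -> bool:
    """Return True if the last passing full run is recent enough."""
    age = now - last_full_pass
    return timedelta(0) <= age <= max_age


def gate(smoke_passed: bool, last_full_pass: datetime, now: datetime) -> str:
    """Decide the outcome of one pipeline run."""
    if not smoke_passed:
        return "fail: smoke tests failed"
    if not full_suite_is_fresh(last_full_pass, now):
        return "fail: full suite is stale, schedule a full run"
    return "pass"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(gate(True, now - timedelta(days=2), now))
```

The design choice here is that the expensive suite runs out of band (nightly or at the weekend), while the pipeline stays fast and still refuses to deploy on top of stale evidence.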
[00:06:41.600] The same is true when you get into security scanning. Security scanning is amazing; it gives you lots of really good information, but it can also be incredibly overwhelming for developers, especially if you have a heritage application. I don't like to call them legacy; I like to call them heritage. You get 10,000 vulnerabilities, and then somebody has to sift through this and figure out what is relevant, what is not relevant, what is in the code that we've just changed. That can take hours. You're not going to run this in the pipeline.
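One way to make that sifting tractable, sketched under assumptions: the `Finding` shape, the severity labels, and the idea of keeping only blocking findings in files touched by the current change are all hypothetical, not a description of any particular scanner's output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    file: str      # path the scanner reported
    severity: str  # assumed labels: "low", "medium", "high", "critical"
    rule: str      # scanner rule identifier


def relevant_findings(findings, changed_files, blocking=("high", "critical")):
    """Keep only blocking findings located in files changed by this commit."""
    changed = set(changed_files)
    return [f for f in findings
            if f.file in changed and f.severity in blocking]


if __name__ == "__main__":
    all_findings = [
        Finding("billing/invoice.py", "critical", "sql-injection"),
        Finding("legacy/report.py", "critical", "xss"),
        Finding("billing/invoice.py", "low", "weak-hash"),
    ]
    for f in relevant_findings(all_findings, ["billing/invoice.py"]):
        print(f.file, f.severity, f.rule)
```

The full backlog of findings in the heritage code still needs an owner; this only decides what is worth stopping a change for.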
[00:07:08.200] I actually move away from the model of the CD pipeline, and I talk a lot more about CD meshes or CD networks. You have certain scans, you have other things that go directly into the pipeline, and then you have certain things that you do once in a while. Perhaps you don't instantiate the infrastructure every time because it takes five minutes to instantiate the infrastructure and you don't want that, and you don't have to, so you just do that once a week.
[00:07:31.100] All of these things I would definitely not have said five years ago. But now, having done this more often and having seen how these things start becoming painful for organizations, I'm a lot more open to conversations about what can we break, what rules can we bend a little bit, but consciously, because we have a good reason for it.
[00:07:49.200] Next thing: full-stack teams. The most efficient way to organize for DevOps is in full-stack teams, and that includes obviously infrastructure. Again, I would say, obviously that's the right thing to do. But is it?
[00:08:05.600] I use this analogy with my clients. I have a five-year-old son at home, and I want him to become a really good, responsible adult when he is older. If right now I'm sending him into the kitchen and saying, 'hey, you choose yourself a meal,' he's probably going straight to the cookie aisle: give me some cookies and some gummy bears and off it goes. That's not because he's a bad person. It's just that he hasn't yet matured and learned all the right decision-making.
[00:08:28.200] If I go to a team that has for a long time just done development, but ops was delegated to someone else, infrastructure was taken care of, and I just said, 'you're now full-stack and autonomous,' they don't exactly know how to make good decisions about security and how to balance which vulnerability is important or not important. They will have to use pretty crude rules.
[00:08:52.200] The alternative is we're just going to put an infrastructure person in there, and a DBA, and a security person into the team. Your little agile team of seven becomes an agile team of 20. None of these people have a chance to do work with their team, because when they're going to PI planning, there's like three DBA jobs in there, and then the rest of the time the DBA just needs to work for other teams. They are going to be a stranger in your team. That's not actually what we are after.
[00:09:19.600] The easy answer is just put people in, and obviously we're going to train our Java developer to also understand network and security. But is that the right answer? Earlier this week we heard from Vanguard about abstractions. I think that's a key part of this. If there is a central team of some ilk, they're building the abstraction that the team doesn't have to worry about it, the cookie cupboard is closed, but all the vegetables are freely available for my son. That makes decision-making a lot easier. That's the power of abstractions.
[00:09:53.500] What I learned along the way is that as a DevOps community, we always knew there was a shared service. We had these agile teams, but we all used the shared service provided by Amazon or Google or Microsoft. They provided their service in a way that was easily accessible, with APIs and self-service. If your infrastructure team operates exactly the same way, there's actually nothing wrong with that. They've given you the abstractions to do this, and then you get further and put your testing and other frameworks in there as well. That's the DevOps team that people don't like, but there's actually nothing wrong with it if you do it the right way.
[00:10:27.100] We've heard this quite a few times this week when people talk about shared services or platform teams. There's nothing wrong with that. Some of you here might identify as members of the DevOps team. You do not have to feel bad, as long as you're doing it the right way. At some stage we might be so mature, my son is 18 years old, they're now doing it in trains or tribes or whatever you call it, and it becomes a shared community. It's an open source platform that you run yourself. But I will be honest: I haven't seen that transition successfully yet. I can see how we can do that over time, but it will take a while.
[00:11:12.700] What we're really after is not full-stack teams. We're after autonomous teams, and here are three ways you can get to that. First, make information available. There's still way too many organizations where a development team can't see the database log file, or the middleware log file, or any of those things. Aggregate your logs. Give read access to everyone.
[00:11:35.600] Provide services for the obvious: passwords, instantiating servers, running test suites, all of that stuff. It's obvious. The last one comes back to what you've heard on stage earlier: measure the dependencies and get rid of them. The most powerful thing we have to get to more autonomous teams is ServiceNow and Jira and those things, because those are the tickets that the teams are raising. Figure out which services are requested most frequently and get rid of those dependencies.
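Measuring those dependencies can start very simply. The sketch below assumes the tickets have already been exported from the ticketing tool into a list of dictionaries with a `category` field; that field name and the sample categories are hypothetical.

```python
from collections import Counter


def top_requests(tickets, n=3):
    """Return the n most frequent ticket categories with their counts."""
    counts = Counter(ticket["category"] for ticket in tickets)
    return counts.most_common(n)


if __name__ == "__main__":
    exported = [
        {"category": "password reset"},
        {"category": "new server"},
        {"category": "password reset"},
        {"category": "db log access"},
        {"category": "new server"},
        {"category": "password reset"},
    ]
    for category, count in top_requests(exported, n=2):
        print(f"{count:3d}  {category}")
```

The categories at the top of that list are the first candidates for self-service.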
[00:12:04.300] Then you will start speeding this up. The team becomes more autonomous and more empowered to do what they need to do. You don't need to have all the skills in a team. I find this really hard: if I have a really good Java developer, why does this person need to understand our network load balancers and security settings and all that stuff? They should focus on building the best application that we can build.
[00:12:28.600] Let's talk about cloud. The environment in the cloud is fully automated. There's nothing manual about the cloud. That's why it's awesome. Ideally, our services in the cloud are fully immutable. Fantastic answer.
[00:12:44.700] I was working with an organization relatively recently where we had just started our cloud journey. With the right intention, everything was going to be automated from day one. We were not even going to allow legacy cloud to happen. We're going to automate from day one. Boy, was that barrier hard.
[00:13:03.400] What happened is, we don't exactly know how our applications are going to operate in the cloud. Now that I have to write it in Terraform and whatever automation script you want, it meant the team not only had to learn that, but they needed to predict what they wanted to do. It was very hard to experiment because it's a lot easier to experiment when you can just go into the Azure console or the Google console and change something. We actually made automation a barrier to experimentation.
[00:13:29.600] What we ended up changing that into was being a lot more flexible up front, then being able to describe our application once we figured that out, and then to automate it as the fast follower but not from the beginning. That broke everything in my head and my team's head, because we were like, 'we can't say that.' But it helped the team, and that is actually a lot more important. The purist doesn't win here. The people win who get the outcomes.
[00:13:58.200] The one-two-three for that one was: experiment, allow experimentation, then describe it, because once you describe it, you can actually automate it really well.
[00:14:08.700] Let me talk about immutable services. This is the composition of an application: lots of different things in there, CPUs, storage, network, data, et cetera. In the ideal world, we say the thing is immutable. We put it in the cloud, no one is ever going to touch it again. That works sometimes.
[00:14:25.100] But what happens when you have an application that takes you five hours to instantiate because you have to load so much data into it or any kind of other complications? You're not going to do that. If you've prescribed it as these golden images, then you can't actually use the cloud. It's going to be way too slow for you, or you have to stand it up in parallel. So we ended up having both. We had golden images that are really good for small applications; they can be quickly instantiated and we had the automation. But we allowed ourselves to upgrade an agent on it or upgrade a database or whatever was required.
[00:15:00.600] You have this heterogeneous cloud environment where you have to deal with both. Again, that gets harder to manage, but if you're forcing everyone into one or the other, you're taking a lot of options off the table. In a large organization, you need to have options.
[00:15:16.700] Those are three things to look at, and basically what I told you is: it depends. Bloody hell, there's a consultant on stage saying everything depends. That's not the answer your bosses want to hear or the executives want to hear. The boardroom question is, if everything depends, how do I know whether they're making progress or not? How do I know they're on the right track? Because saying 'we're not going to automate this' can also be just the easy way out, the lazy way.
[00:15:43.100] I'm going to give you three rules that I use. The first one you've heard me talk about a lot: make economic decisions. Figure out how, in everything you do, you can measure the value of what you're producing. Automation by itself, sometimes you're doing it because it's not expensive, and that's fantastic. But figure out what your economic framework is. It doesn't necessarily have to be cloud compute. It can be licenses. It can be the manual work effort. Consider it all.
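A very small economic framework can be a break-even calculation. This is a sketch under assumptions: the three cost inputs are hypothetical, they must be in the same currency unit, and the per-run figures are meant to include whatever applies (compute, licenses, manual effort).

```python
import math


def breakeven_runs(build_cost, automated_cost_per_run, manual_cost_per_run):
    """Number of runs after which the automation has paid for itself.

    Returns None when automation never pays off, i.e. an automated run
    costs at least as much as a manual one.
    """
    saving_per_run = manual_cost_per_run - automated_cost_per_run
    if saving_per_run <= 0:
        return None
    return math.ceil(build_cost / saving_per_run)


if __name__ == "__main__":
    # Hypothetical numbers: 4000 to build, 5 per automated run, 45 per manual run.
    print(breakeven_runs(4000, 5, 45))
```

If the step runs a hundred times a day, a break-even of a few hundred runs is an easy decision; if it runs twice a year, the same number says to leave it manual for now.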
[00:16:18.800] The next thing is measure things. Everyone here is talking about DORA metrics, and they're fantastic, but they're pretty high-level metrics, your top-level metrics. When you are starting to improve unit testing, the DORA metrics are not going to immediately move. You need to figure out how we can find something to measure that moves.
[00:16:38.800] I use this scorecard. On the top left you have business effectiveness: I want to experiment, I want to experiment quickly, and I want to scale if I find something that is successful. On the top right, you have the perhaps more natural DevOps metrics around delivery efficiency, automation, speed: how quickly can we get into production? On the bottom left you have service reliability. No matter if I have a really cool idea and I'm deploying it really quickly, if it's not reliable, there is outage cost. On the bottom right you have something more esoteric: architectural flexibility.
[00:17:20.400] There's this dark, deep secret in DevOps that unless you change the architecture, your terminal velocity is given, because the complexity of the architecture and the dependencies drive how fast you can move. That means we have to find a way of measuring that. Some of that is technical debt, but a lot of that is how many dependencies we have for applications, what I call the blast radius: how many things do I actually have to deploy to make a change? The bigger that is, the slower people move.
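The blast radius can be approximated from a dependency map. The sketch below assumes that changing a component means redeploying it plus everything that depends on it, directly or transitively; that interpretation, the `depends_on` structure, and the component names are assumptions for illustration.

```python
from collections import defaultdict, deque


def blast_radius(depends_on, changed):
    """Return the set of components to redeploy when `changed` changes.

    depends_on maps each component to the components it depends on.
    """
    # Invert the map: for each component, who depends on it?
    dependents = defaultdict(set)
    for component, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(component)

    # Breadth-first walk over everything downstream of the change.
    affected = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, ()):
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected


if __name__ == "__main__":
    graph = {"web": ["api"], "api": ["db"], "batch": ["db"], "db": [], "reports": []}
    print(sorted(blast_radius(graph, "db")))
```

Tracking the size of that set per component over time gives a number that moves when the architecture gets more or less coupled.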
[00:17:53.500] The third thing: every DevOps transformation is a choose-your-own-adventure. You need to understand that there's no central cockpit that allows you to say everyone's going to move this way over the next couple of years. It's choose-your-own-adventure by the teams. They need to make contextual decisions about what the right next thing is for them to do.
[00:18:14.300] That is scary because it means everyone is doing a thousand different things. If you've ever seen five-year-olds play soccer, they're all over the place: sometimes next to the ball, sometimes chasing something else. You need to provide a little bit of rules. Here's two examples. The top one is a technology tree. In Civilization, if you want to have science, you need to have language and numeracy first. The same is true for what we are doing. If you want to do continuous delivery, then you need automated builds and unit testing and all those things. There's a technology tree for continuous delivery.
[00:18:57.700] Every team can decide how they're going to play this role. They can change their boxes if they want to, understand what the dependencies are, but choose what is most important for them first. It's still pretty guided because it goes towards continuous delivery. At the bottom you see an enterprise bingo card, similar concept. We know all the things that teams want to do and should do, but we don't want to tell them what to do next. You might have coaches, teams can have some kind of discussion, you might do value stream maps, and they can determine the next thing they want to do, whether that's technical, process, or organizational, as long as we keep making progress.
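The technology tree can be written down as a prerequisite map. In this sketch only the idea that continuous delivery requires automated builds and unit testing comes from the talk; the other capabilities and the exact edges are hypothetical placeholders that each organization would replace with its own tree.

```python
# Hypothetical technology tree: capability -> capabilities it requires first.
PREREQS = {
    "version_control": [],
    "automated_build": ["version_control"],
    "unit_testing": ["automated_build"],
    "automated_deployment": ["automated_build"],
    "continuous_delivery": ["unit_testing", "automated_deployment"],
}


def available_next(done, prereqs=PREREQS):
    """List capabilities a team could pick next, given what it already has."""
    done = set(done)
    return sorted(
        capability
        for capability, needs in prereqs.items()
        if capability not in done and all(need in done for need in needs)
    )


if __name__ == "__main__":
    print(available_next(["version_control", "automated_build"]))
```

The function only says what is unlocked; which of the unlocked boxes to tackle first stays a contextual choice for the team.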
[00:19:43.200] I'll try to bring it together. Big-T transformations were something that we in IT were pretty good at for a long time: CRM transformation, ERP, and all those big things. They all had a current state and a target state, and then we determined the bit in the middle: how do we get from A to B? Both static, static to static.
[00:20:11.200] That is the equivalent of when the internet was new and you had route planners. You typed a destination into the web and printed out this beautiful map on how to get there. We all know what happens: you start driving, you do a little detour, something changes, and then this whole thing is useless. That is the equivalent of doing a big transformation where you think two years in advance you can determine all the things you're going to do, which application teams by which time, and which specific technology you're going to roll out.
[00:20:46.300] What you really want to do in this complex world, where the as-is state is pretty flexible because everyone has a different view of how mature you are, and the to-be changes every time you go to the Enterprise Summit because there's a new thing and new vendors in the exhibition hall, is figure out how to give your team a compass. They will detour, they will get distracted, but you need to give them something so they can move forward. That's what I meant when I said give them an economic framework, give them a way to measure things, and give them some rules. That's the compass for transformation.
[00:21:20.900] If you give them that, that's actually pretty easy. You don't need a huge organization to enable the transformation. You just need enough people that can do those three things, and then the executive can guide along both ways. Every transformation is a journey of many steps, and each step should be a conscious step, a conscious experiment that we evaluate and see whether it has moved us forward.
[00:21:46.500] What I still have problems with is everything I discovered in this contextual conversation. What happens, and I've seen this many times, is you have this pendulum swinging: from federation to centralization, from experimentation to cost-cutting, from people-focused to technology-focused. There's nothing wrong with it, because this pendulum will always swing. There is no one right answer. The problem is we need to make this pendulum not swing to these extremes, where a new CIO comes in and you're going to implement a completely new CRM platform. That's not healthy.
[00:22:20.200] What is healthy is to keep experimenting. I always say to my clients, if a year after we spoke last you're still doing the same things, if you haven't learned anything and we've just implemented what we saw a year ago, that is very unlikely to be correct right now. How can we let this pendulum swing just right and govern that pendulum swing all across our organization?
[00:22:42.300] If you have any good ideas on that, it will be an absolute pleasure to hear from you. You can find me on Twitter. You can look at my blog posts if you like. Thank you for being kind. I haven't seen any torches or pitchforks, so thank you for that.