Tales from the DevOps Transformation Trenches
I know that 'Why Culture Is Important for DevOps' talks are almost always rejected. This sounds like one of those, but it goes deeper: I talk about my experience with various (anonymised) clients of the IBM Cloud Garage, the pitfalls they fell into, and how lean startup and extreme programming techniques helped.
Holly Cummins is the worldwide development practice lead for the IBM Cloud Garage. As part of the Cloud Garage, Holly delivers technology-enabled innovation to clients across a range of industries, from banking to catering to retail to NGOs.
She has led projects to count fish, help a blind athlete run ultra-marathons in the desert solo, improve health care for the elderly, and change how city parking works. Holly is also a Java Champion, IBM Q Ambassador, and JavaOne Rock Star. Before joining the IBM Cloud Garage, she was Delivery Lead for the WebSphere Liberty Profile (now Open Liberty).
Holly co-authored Manning’s Enterprise OSGi in Action.
She is an active speaker and has spoken at JavaOne, Devoxx, JavaZone, JFokus, The ServerSide Java Symposium, JAX London, QCon, GeeCon, and the Great Indian Developer Summit, as well as a number of user groups.
Full transcript
The complete talk, organized by section.
Holly Cummins
So I want to start with a bit of a confession. I don't know if any of you have considered submitting to this conference for next year. I think it would be great if you did. One of the things we see is that the strength of this conference is how many different industries are presenting, how many different kinds of stories we're hearing, and it's the stories of our peers that make it really compelling.
If you do submit, you'll probably see a nice little article from Gene explaining what to do in order to get your paper accepted, because they do get quite a few papers. One of the things it says is if you're a vendor or a consultant, well, sorry about that, you will almost certainly be rejected. Another thing it says is these types of submissions are almost always rejected, and one of those is "why culture is important for DevOps." So you've already seen that the title of this talk is "Culture is Important for DevOps." The second confession is that I am a consultant. It is actually worse than that, because not only am I a consultant, I'm also a vendor. I work for IBM. We have a stand out there. We have some lovely tools that we would like to sell you. I'm not involved in the development of those tools. I'm not going to try and sell them here.
Because I'm a consultant, one of the things I see is a range of practices across different companies. It's a bit like the DevOps confessions that we saw on Tuesday. There are a whole bunch of things that companies do, that we all do, that we don't really want to get up on stage and say that we do because they're a bit embarrassing and we know they're really not right. They're not the sort of embarrassing that you can turn into a lessons learned that's a really positive thing. They're just the sort of embarrassing that you really hope nobody knows about.
Instead of DevOps confessions, this is a little bit more like DevOps accusations, where I see lots of different companies and I see the things that they do. The effect really of these things is a failure at DevOps. Hopefully all of you will be sat in the audience and thinking, "Well, I've never seen that in any company I've worked for," and, "Oh, I would never do that," and, "My teams would never do that," and, "Who does that?" If you're thinking that, then the bad news is that this presentation might be slightly less interesting for you. But the good news is that you're in a really good place and can leave feeling happy.
Before I get to what I often see when I visit a client, I should say why I am visiting clients. I'm Holly Cummins. I work for the IBM Garage. Our mission is to help customers adopt the cloud, adopt DevOps, adopt Lean, and adopt design thinking in order to really do a transformation. That is very relevant to what a lot of us are doing here.
I visit a client, and I get introduced to the DevOps team. I think, okay, DevOps team. Not everybody chooses to have a DevOps team, but there are reasons for that model. But then the next thing is, "Well, yes, this is our DevOps team, and last year we called them the builds team." So we have this DevOps transformation, but it's a very shallow transformation, and what's really happening is we're doing the same things that we've always done, but we're attaching new labels to them. That's not really going to help as much as we hope it will.
Ultimately, alongside the new labels, we also see new technologies being imported. I think there is this desperate hope in the industry that if I adopt containers, I'm going to get DevOps. Well, it doesn't quite work like that. Then you think, no, I know containers aren't going to magically make me DevOps, but Kubernetes will magically make me DevOps, right? Well, no. Even Kubernetes, as popular as it is, isn't going to make everything okay with DevOps.
We had a conversation with a bank out in Asia Pacific recently. They came to us and said, "IBM Garage, we really need you to help us. We have some really serious problems. We can see that we're going way too slowly now, and we've got all this COBOL estate, and we'd really like you to help us rewrite this COBOL estate and turn it into microservices." That might be the right thing to do. But then there was this little qualifier, which is that the release board only meets twice a year. At this point, it's not the COBOL that's the problem, it's not the lack of microservices that's the problem, it's the fact that the release board only meets twice a year. We see this all the time, where we sometimes have a natural tendency to focus on one thing and not notice the other things that are also getting in our way.
If you look at the eight DevOps trends to be aware of in 2019, one of the top trends is microservices. What that article from Hacker Noon says about microservices is that microservices are independent entities and hence don't create any dependencies and break other systems when something goes wrong. At this point you all should be laughing, because anybody who has ever done microservices or managed a microservices system knows that this is the theory, and the gap between theory and reality can sometimes be pretty large.
I visited a client and they were doing microservices, and they said, "Every time we change code, something breaks." Which is, of course, the exact opposite of what you're supposed to get with microservices. With microservices, the dream is one thing can break and everything else is completely resilient. Whereas in fact they had a situation where one thing made a change that was quite innocent and correct, and yet everything else still completely broke. The reason was that they had a distributed monolith. They had replaced their inter-component communication with HTTP, but they hadn't actually decoupled it in any more meaningful way. So they had a distributed monolith, but without the benefit of compile-time checking, which is worse than just a non-distributed monolith. Just because a system runs across six containers, that doesn't mean it's decoupled. Decoupling is something else. It's not just about how you lay it out in space.
Speaking of space, this is the Mars Climate Explorer. Most of these are client stories. NASA was not a client of mine, so this is other people's trenches, but still a bad trench. The Climate Explorer had a very sad end. It crashed into Mars, which is not what it was supposed to do. It was supposed to go around Mars. It had two systems: one was the system on the Explorer itself, and another was a control system on the ground. The problem was that the Mars Explorer itself used metric units, and the unit on the ground used imperial units. For the measurements they were doing, the units were close enough that it wasn't an order of magnitude out, about a factor of two out, so you couldn't spot by eye that there was a problem. But every time they tried to steer the Mars Explorer, they ended up steering it somewhere that wasn't where they intended.
This is a case where having a distributed system did not help. We had two independent teams developing it. They didn't communicate appropriately about what their expectations were, and the result was that the Mars Explorer was lost. With microservices, you've got to have consumer-driven contract tests. You've got to have some way of making sure that everybody has a shared understanding of what's supposed to be happening, and that if that understanding is violated, it gets picked up as soon as possible in the development cycle. It doesn't wait till the end.
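The contract-test idea can be sketched in a few lines of Python, using the units mix-up as the example. This is a minimal illustration under stated assumptions, not how any particular contract-testing tool works: the names `ImpulseReading`, `CONSUMER_CONTRACT`, and `check_contract`, and the unit labels, are all hypothetical. The consumer publishes what it expects, including the unit, and the provider's build checks its own output against that expectation.

```python
from dataclasses import dataclass

# Hypothetical contract published by the consumer: the fields it reads
# and the unit it assumes for each. Unit labels are illustrative.
CONSUMER_CONTRACT = {"impulse": "newton_seconds"}


@dataclass(frozen=True)
class ImpulseReading:
    value: float
    unit: str


def provider_output_metric() -> dict:
    # A provider that agrees with the consumer about units.
    return {"impulse": ImpulseReading(10.0, "newton_seconds")}


def provider_output_imperial() -> dict:
    # A provider that silently uses a different unit.
    return {"impulse": ImpulseReading(10.0, "pound_force_seconds")}


def check_contract(output: dict, contract: dict) -> list:
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    for field, expected_unit in contract.items():
        reading = output.get(field)
        if reading is None:
            violations.append(f"missing field: {field}")
        elif reading.unit != expected_unit:
            violations.append(
                f"{field}: expected {expected_unit}, got {reading.unit}"
            )
    return violations
```

Run in the provider's pipeline, a check like this fails on the first build that changes the unit, rather than at the end of the development cycle.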
Another space story: this is Cluster and the Ariane rocket. This was one of the most expensive space losses in history, a $370 million loss. They tried to send the rocket up into space, and it did not go up into space. It flew up a little bit and then exploded, and that was the end of the rocket. They had reused software from a previous version, which was sensible. They sort of tested it, but they stubbed out one component, and the one component they stubbed out was the component they reused. There were assumptions baked into that other component that were no longer true. The one component that broke was the one component that was stubbed out when they tested it. Even though they'd done a lot of testing, they ended up with this catastrophically expensive loss.
We see the same thing with the Mars Climate Explorer. At the end, they did a detailed investigation of what happened and said, "Had we done end-to-end testing, we believe this error would have been caught." You look at that and think, why didn't you think end-to-end testing was a good idea before you sent it up into space? But it's easy to miss things out. They add cost, they add complexity. With microservices, the first thing you have is contract tests, which are a shift left to catch the problems early. Then the next thing you have, your fail-safe, is the integration test.
Another thing I hear all the time is, "We have a CI/CD." These tools are great, but CI and CD, it's something you do, it's not something you buy. I see this gap: we have this system, which is the CI/CD system, but then I hear things like, "I'll merge my branch into our CI next week," which is another way of saying, "I'll merge my branch into our continuous integration system next week." I hear, "Continuous integration, continuous integration, continuous deployment, continuous delivery, continuous delivery. We release every six months. Continuous integration." We keep talking about continuous, but I don't think that word means what you think it means. If you're only integrating once a week, and if you're only releasing every six months, that's a really strange definition of continuous.
We do have to be pragmatic. When we talk about how often developers should be pushing to master, what we're really saying is how often they should be integrating. Strictly, continuous integration would be every character. That is technically continuous, but it is a really stupid way to develop software. More reasonable is committing several times an hour and pushing every commit, or pushing several times a day. You can carry on down the spectrum: once a day, once a week, once a month.
When I first joined IBM, 20 years ago, I was working on WebSphere. We had a worldwide build call every day of the week, including Saturdays. You did not want to be the focus of the build call. People worked out that the most effective way of ensuring you were never drawn onto the build call was to not break the build, and the most effective way of not breaking the build was to never deliver software. They pushed every six months to avoid build focus. The incentive you thought you were creating wasn't the incentive you were actually creating. We thought we were ensuring quality by making sure no one broke the build. Instead, we ensured no one did any work.
Where on the spectrum is acceptable depends on the complexity of the team, the size of the team, and the complexity of the software. But I really advocate for trunk-based development, and the cutoff line for trunk-based development is once a day. If your work is not going into master, rather than into a branch, at least once a day, you're not doing trunk-based development, and you could be doing a lot better.
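The once-a-day cutoff is easy to check mechanically. The sketch below assumes you already have, from your version control system, a mapping of branch name to the time it was last integrated into master; how you obtain that mapping is out of scope here, and the function name `stale_branches` is hypothetical.

```python
from datetime import datetime, timedelta


def stale_branches(last_integrated, now, limit=timedelta(days=1)):
    """Return, sorted by name, the branches not integrated within `limit`.

    `last_integrated` maps a branch name to the datetime of its last
    merge into master (an assumed input, gathered elsewhere).
    """
    return sorted(
        name for name, merged_at in last_integrated.items()
        if now - merged_at > limit
    )
```

A report like this on the team's wall makes long-lived branches visible without anyone having to go looking for them.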
The next question is how often you should release. In CI/CD language, what we're really talking about is how often you should deploy. Again, there is a spectrum. With WebSphere, it was once every two years. Many companies now are managing to release to the public every push, several times a day. That's amazing. It's really hard to get there. You need good engineering techniques, feature flags, and safeguards. Deploying once a sprint is okay. Once every two years is seriously old school. Once a quarter may have reasons, because customers don't necessarily have the appetite or capability to consume releases more often. Every epic is good. Getting something deployed to the public every user story is the sweet spot. Every push is hardcore in terms of engineering practices.
The last one on that CI/CD spectrum is how often you should test in staging. What we really mean is how often you are delivering. Your test system and pipelines should support a test on every push. There is nuance: you probably have a range of testing. Some is quick. Some is your big hardcore suite with performance and load tests that take three hours. If you do that every push, you get economic and time problems. Do the lightweight tests every push and the more involved tests less often.
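One way to express that split is to tag each suite with the triggers it runs on. This is a sketch only: the suite names, the two triggers, and the durations for the quick suites are assumptions for illustration; the 180 minutes corresponds to the three-hour suite mentioned above.

```python
from enum import Enum


class Trigger(Enum):
    PUSH = "push"
    NIGHTLY = "nightly"


# Hypothetical suites. Quick ones run on every push; the long one
# runs only on the less frequent trigger.
SUITES = [
    {"name": "unit", "minutes": 2,
     "runs_on": {Trigger.PUSH, Trigger.NIGHTLY}},
    {"name": "contract", "minutes": 5,
     "runs_on": {Trigger.PUSH, Trigger.NIGHTLY}},
    {"name": "performance_and_load", "minutes": 180,
     "runs_on": {Trigger.NIGHTLY}},
]


def suites_for(trigger):
    """Names of the suites that should run for this trigger."""
    return [s["name"] for s in SUITES if trigger in s["runs_on"]]


def total_minutes(trigger):
    """Total wall-clock cost of the suites for this trigger, run serially."""
    return sum(s["minutes"] for s in SUITES if trigger in s["runs_on"])
```

With these assumed numbers, a push costs 7 minutes of testing while the full nightly run costs 187, which is the economic argument for not running everything on every push.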
What I often hear is far from the ideal. A common conversation starts, "We've implemented all this functionality, but we can't actually release it." Why? It's value sitting on the shelf, and you're not going to get that value until you get it out to someone. What are the barriers stopping more frequent deploys? Often there are genuine barriers.
We see microservices where an individual microservice gets into good shape, but then people say, "We don't release just one microservice," even though that's the whole point. "We release all our microservices at the same time so that we can do proper testing to make sure they all work together, because we don't actually know if they all work together because we don't have the contract tests." Rewind: get the contract tests in place, make sure the microservices are actually independent, and then get some of that value from microservices.
Another thing we see is the Wizard of Oz scenario, where it looks beautiful and complete. There is a UI developers have been working on for months, and when you click on stuff, maybe there's a 50/50 chance whether it's actually going to work. We know about user stories and vertical stories: front end, integration layer, back end, and a single user story should encompass all of those. What we're still catching up with is what order we do that in. A lot of the time, the front end is done first, because it is the most gratifying bit. That is rewarded by the business and by management, and leaders need to keep an eye on that.
On a large project, one IBM Garage team did a ton of back-end and integration work to make sure that when they delivered their story, it would actually work. Another team wrote the front end, and at playback they showed this beautiful front end. Everybody said it looked amazing, and the team got an award for it. But it was an empty front end. Its effect sitting in the code base was that nothing else could be released, because if a user saw it and tried to click on something, they would have been angered and raised defects. Having that wired in stopped everybody else from releasing. We need to reward and encourage the secret work: start with the back end, go up to the integration layer, then go to the front end, so by the time anybody sees it, it works. Developers need to do this, but management also needs to make sure we don't reward empty front ends.
Do vertical slices, for sure. Another thing to do is deferred wiring. Code is safe to have in the code base and can go out if it's not wired into anything. Feature flags are useful too, though more scary than deferred wiring. We see postmortems where new functionality was guarded with a feature flag, but one bit was missed, or a side effect wasn't caught, and the code went out protected by a feature flag and caused a problem. Again, there is a gap between theory and practice.
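The difference between deferred wiring and a feature flag can be shown side by side. Everything in this sketch is hypothetical (the checkout example, the `ROUTES` table, the `FLAGS` dictionary); it is meant only to show where the risk sits in each approach.

```python
def legacy_total(prices):
    return sum(prices)


def new_total_with_discount(prices):
    # New code. Under deferred wiring it ships in the code base,
    # but nothing calls it yet, so users cannot reach it.
    return round(sum(prices) * 0.9, 2)


# Deferred wiring: the routing table references only the old code path.
ROUTES = {"total": legacy_total}


def handle(route, prices):
    return ROUTES[route](prices)


# Feature flag: the new code IS wired in, guarded by a runtime check.
FLAGS = {"discounted_total": False}


def total_behind_flag(prices, flags=FLAGS):
    if flags.get("discounted_total", False):
        return new_total_with_discount(prices)
    return legacy_total(prices)
```

With deferred wiring there is a single place, the routing table, that decides whether the new code is reachable. With a flag, every call site that touches the new behaviour has to check the flag, and missing one of those checks is how guarded code ends up in front of users.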
Another thing we see is, "We can't ship until every feature is complete." Why? A lot of this has to do with what we expect from our users. "Our users won't find it compelling enough if we release now." Sometimes that's true; if it's Hello World, users aren't going to be delighted. But there is a point beyond which you shouldn't continue to develop without getting feedback. Reid Hoffman of LinkedIn said, "If you're not embarrassed by your first release, it was too late." Lean methodologies try to strip out everything that won't give you the business feedback you need. Some of that stuff feels important, so it can be painful to strip it out.
We have this fear that if we release and it's not perfect, then that's it: we only have one chance to get it right. Sometimes that is true. The Ariane failed in 36 seconds. They didn't have an opportunity to watch dashboards and push a change to stop it. It was too fast, and you can't A/B test a rocket when it's that expensive. But a lot of us think we are at that end of the spectrum, that we have one chance to get it right, when actually there is a whole spectrum. For some things there is brand damage, or market indifference, but often we can have a conversation with consumers and stakeholders: here's our roadmap, you can see we're getting better all the time, try again in a week and it will be better. Finally, you can have A/B testing where you release to such a small portion of your user base that it's safe. We need to question our assumption that we're all riding Arianes and instead get to the point where we can do A/B testing and more incremental delivery of value.
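Releasing to a small portion of the user base is often done by hashing each user into a stable bucket. The following is a minimal sketch of that general technique, not a description of any specific product; the function names and the experiment label are assumptions.

```python
import hashlib


def bucket(user_id, experiment):
    """Map a user to a stable bucket in the range 0..99 for an experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % 100


def in_rollout(user_id, experiment, percent):
    """True if this user falls inside the first `percent` buckets."""
    return bucket(user_id, experiment) < percent
```

Because the bucket depends only on the user and the experiment name, a given user sees the same variant on every visit, and raising `percent` from 1 to 5 to 50 widens the audience without reshuffling anyone who was already included.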
Another thing that stops releases is process. With the Mars Explorer, they often couldn't see the Explorer. DevOps is all about feedback, but the geometry of planets meant the Explorer was sometimes behind Mars, and sometimes the controller was on the wrong side of Earth, so they could only communicate every few days. Even with that feedback every few days, the navigation team actually spotted a problem. We put change-control processes in place to protect us from risk, but with the Mars Explorer, their change-control process actually created risk. The navigation team reported the problem, but they didn't use the correct process, and so nothing was done. We see this all the time, where processes that are supposed to protect us don't.
We had a customer satisfaction situation where a client said the provisioning software we'd sold them was broken. It was supposed to let them spin up cloud instances in 10 minutes, but they were still taking three months to provision their instance. When we investigated, it turned out they had an 84-step pre-approval process. The technology was working fine, but all the processes around it meant that the technology couldn't do what it was supposed to do. Rasmus' Maersk talk had a similar story: they put in provisioning and got lead time down from 100 days to 85 days. We see architecture review boards where the software has already shipped, is out in the field, and is working, but the review still has to happen and isn't adding value. For all of these processes, we need to ask, "Is this actually adding value?"
Another thing we see often is, "I'd like to ship, but I can't ship until I have more confidence in the quality." This is fixable. Usually what's causing this is something with testing. Often I'll be introduced to the test team, and the test team test everything manually because they don't have skills in automated testing. There is absolutely a place for a test team. Exploratory testing will find things that tests written by the people who produced the code will not find. But in order not to be a barrier to release, these things need to be automated. Otherwise bugs escape to the field. When I hear "our tests aren't automated," what I actually hear is, "We don't know if our code currently works. It worked a while ago the last time we did the manual testing, but who knows what's going on now." That's a scary place to be.
Another thing that prevents release is, "It costs too much to release." That is something we should fix. Part of the point of DevOps is that every release should be incredibly boring. If we see lots of cost in the release, keep releasing until we drive that cost out with automation. Keep releasing until it's cheap.
Another thing I see, which feels like a low-level development thing but isn't, is that someone will break the build and won't know because the only way to see whether the build is broken is by going to a separate Jenkins webpage and looking. Someone else checks the page, sees the build is broken, and nags them. This is inefficient, and it shows something about priorities. If, as a leader, you walk around and don't see the build status where everybody can see it, that says we only sort of care whether our code is passing or not. Getting build status up prominently with something physical, like a traffic light, matters. Lean has this idea of an obeya, a room where all your project metrics are visible because they are that important. This is the same idea in the physical space: don't rely on someone to pull that metric; put it on the wall. On a complex project it may be a more complex radiator, but get it into the physical spaces.
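The logic behind a traffic-light radiator is small. This sketch assumes the per-job statuses have already been fetched from the build server as plain strings ("passed", "failed", "running"); those labels and the function name `light_for` are hypothetical, and driving a physical light is left out.

```python
def light_for(statuses):
    """Reduce per-job build statuses to one traffic-light colour."""
    if any(s == "failed" for s in statuses):
        return "red"
    if any(s == "running" for s in statuses):
        return "amber"
    if statuses and all(s == "passed" for s in statuses):
        return "green"
    # Empty or unrecognised input: do not claim green without evidence.
    return "amber"
```

The design choice worth noting is that a failure always wins: one red job turns the whole light red, so a broken build cannot hide behind several passing ones.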
Another thing we see is centralized control. People don't feel ownership of the build because only Bob can change Jenkins. Another thing external people are good at spotting is broken windows. I come in and say, "The build's broken." They say, "Yeah, it's been broken for a few weeks." Didn't anybody care to fix it? "Well, no."
I mentioned DevOps accusations at the beginning, and I know this can sound like judging. Of course, it's not that easy. We all can get into bad places in well-intentioned ways. One of my colleagues has this idea of modern DevOps, when you're cloud native and you've started on the cloud and it's easier to do DevOps properly. But many of us are doing heritage DevOps, where we have a lot of stuff already and are trying to figure out how to do the best we can with it.
One thing we hear a lot, which fills us with delight, is, "You'll be coding on the mainframe." Trying to do DevOps on the mainframe is this dinosaurs-on-roller-skates situation where you can maybe make it work, but it's going to be really hard. It's tiring, and there's a lot more technological resistance and cultural resistance.
What we see with a lot of DevOps transformations, and we've seen over these three days, is a failure of transformation endurance. We have really good intentions, and then it becomes so hard, and then we give up and maybe try again in a few years. Having the idea of not just doing the transformation now, but keeping going with the transformation, is key. With all of it, remember the why: don't focus on the technology, don't focus on just a few things, but think, what was I trying to achieve? I was trying to achieve business agility. Am I actually getting that? If not, let's try something else.
I'll leave that there. I will be hanging around, happy to have conversations, questions, all that kind of thing. Thank you very much.