Ten (Hard-Won) Lessons of the DevOps Transition

San Francisco 2015

DevOps is no longer just for Internet unicorns. Today many large enterprises are transitioning from the slow and siloed traditional IT approach to modern DevOps practices, and getting substantial improvements in agility, velocity, scalability, and efficiency. But this transition is not without its challenges and pitfalls, and those of us who have led this journey have the scar tissue to prove it.

A successful transition to DevOps practices ultimately involves changes to organization, to culture, and to architecture. Organizationally, we want to create multi-skilled teams with end-to-end ownership and shared on-call responsibilities. Culturally, we want to prioritize solving problems and improving the product over closing tickets. Architecturally, we want to move to an infrastructure with independently testable and deployable components.

The ten practical lessons outlined in this session synthesize the speaker’s experiences leading teams at eBay, Google, and KIXEYE, as well as lessons from his current consulting practice.

Full transcript

The complete talk, organized by section.

Randy Shoup

Hi, I'm Randy Shoup. I was chief engineer at eBay for six and a half years. I was director of engineering at Google for Google App Engine for a while, and I've also been CTO of a gaming company called Kixeye.

I mention that because all of these experiences sort of go into what I'm calling my ten hard-won lessons of the DevOps transition. I guess the most important set of lessons is that these are basically organizational and cultural lessons far more than technology or process, and that's what I found to be much more important to getting DevOps to work properly.

The other thing is you'll probably find that a lot of these are going to feel a little bit like back to basics, but I think that's really important because if we don't get the basics right, it's hard to get the advanced stuff right.

So let's just get started.

Okay. We'll get there. There we go. Excellent.

Some of them are going to be back to basics, but some of them are going to be a little bit radical.

The first one, which is maybe on the radical side, is I strongly believe that organizationally we need to reorganize our teams around the concept of ownership. What I mean by that is that we really want end-to-end ownership of the entire product or service in a particular team.

Particularly, I mean small cross-functional teams that own an application or service all the way from design to deployment through to retirement. That team has within it all the skill sets that it needs to be successful. Of course, it's not going to do everything, so it depends on other teams for supporting services. But it's sufficiently autonomous that it's able to move rapidly and independently by itself.

I also strongly believe in the Werner Vogels philosophy of "you build it, you run it." What that means is the same team that builds the software also should be the one that's operating the software, and that we don't have a separate concept of a maintenance organization or sustaining engineering or something like that.

One of the examples where we were able to make this change and do it successfully was at this gaming company, Kixeye. When I arrived, we sort of had this conflict around how to manage the databases. The development team was responsible for organizing the tables and writing all the SQL and issuing all the queries. But the DBA and the ops team were wholly responsible for the performance and the uptime of the system.

You can see that there's a pretty inherent conflict in here, right? The development team is adding a bunch of queries, and the ops team doesn't have any effect on that. That division of responsibility was very sort of counterproductive and very disruptive.

So we needed teams to go one of two ways, and we did both of them actually for different teams. The alternate strategies were that we needed to either create a centrally maintained persistence service, where all of the capabilities and all of the functionality around the persistence was owned by one team, with a very clean, simple, well-defined API that the application teams could use, or have the customer team adopt the database themselves and run their own persistence.

It's my strong belief that so many of the problems that we have between particularly the development side and ops side can be solved by properly drawing the organizational boundaries end to end around a particular problem.

The next one is around losing the ticket culture. In my consulting practice, when I go into organizations, one of the questions that I ask is around what the rate of tickets flowing between two different teams is. You can tell a lot about an organization, particularly how functional or dysfunctional it is, by how many tickets are flying back and forth between particular teams.

When we have what I call the ticket culture, the team that receives a ticket is sort of incented to just do what they're asked for, right? Whereas in an ownership culture, where we've properly drawn the organizational boundaries, the team does what's needed.

In a ticket culture, we have one-way communication, whereas in an ownership culture, we're doing more two-way collaboration. In the ticket culture, the goal is to simply close the ticket, but in the ownership culture, the goal is to make the product successful. The ticket culture is reactive; the ownership culture is proactive. The ticket culture reinforces those silos that we have in our organization, whereas the ownership culture reinforces collaboration. Of course, the ticket culture is prioritizing process, and the ownership culture is prioritizing results.

The next lesson is around replacing approvals with code.

I was at eBay for six and a half years, and during much of that time, eBay had something called the Architecture Review Board. This is a pretty standard idea at sort of traditional IT organizations. I actually served on this board, so all these critiques kind of go right back to me.

The idea was to have a bunch of smart, experienced engineers that would review all the projects that were going forward within eBay. But the problem was by the time projects would get to the Architecture Review Board, it was often too late for the recommendations of the board to actually be useful at all. Having to wait for that approval step made the development process slow down pretty substantially. We basically intentionally introduced a bottleneck into the process.

Also, again, this is a critique of me having served on that, is that the board was too disengaged from the details of a project, often, to make really detailed and good recommendations.

Actually what eBay ultimately ended up doing was getting rid of the board and dispersing all the architects out to the individual teams. In practice, what that meant was we would take these smart, experienced engineers, and instead of putting their knowledge into an approval process, we put their knowledge into code.

Teams with specialized skills, for example like databases or security or compliance, instead of providing an approval step, would instead provide a service or a library or a tool and make it self-service for the development teams to be able to take advantage of that knowledge.

The example of where this, I think, is done particularly well is at Google, where there's actually no approval process for technical decisions at a central level at all. There's simply no equivalent of the Architecture Review Board and no equivalent, actually, even of architects.

One of the hardest things to do without approvals is security, but Google has mastered this as well. The security team, rather than providing this approval step, instead provides secure foundations by maintaining a bunch of lower-level libraries and services that the individual other products and services are built on. They also provide self-service penetration tests and vulnerability assessments that allow teams to be self-service in seeing whether they meet the security objectives.

I think this is a much better way to go if we can make it work. In fact, I often like to say that the easiest way to enforce a standard practice like this is with working code.
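
As a rough illustration of putting knowledge into code instead of an approval step, here is a minimal sketch that is not from the talk. It assumes a specialist team publishes its rules as small functions that any team can run against its own service configuration. The rule names and config keys (`tls_enabled`, `admin_port_public`) are hypothetical.

```python
# Hypothetical rules a security team might publish as a self-service check.
# Each rule returns a finding message, or None when the config passes.

def require_tls(config):
    if not config.get("tls_enabled", False):
        return "TLS must be enabled"
    return None


def no_public_admin(config):
    if config.get("admin_port_public", False):
        return "admin port must not be publicly reachable"
    return None


RULES = [require_tls, no_public_admin]


def self_service_review(config):
    """Run every published rule and collect the findings (empty list means pass)."""
    return [msg for rule in RULES if (msg := rule(config)) is not None]


findings = self_service_review({"tls_enabled": False, "admin_port_public": True})
for finding in findings:
    print(finding)
```

A development team could run a check like this in its own build and get an answer immediately, rather than waiting for a review meeting.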

The next lesson is around how teams interact with one another. In my experience, what I'd like to do is enforce what I call a service mentality. The other term that I have for this is something that I call vendor-customer discipline.

What I mean as an example here is a service team is a vendor and acts like a vendor, like you would have an outside vendor, and the internal products are the customers of that vendor. The service is only useful to the extent that it's actually providing value to those customers that it serves.

The key discipline to make this work, just as if that service team were an external vendor, is that the customer team can choose to use the service or not. We've built this internal service. Hopefully, we've done a good job at it. But it's the customer teams, not some top-down manager, that decides whether that service meets the needs of that particular customer team or not.

That has to be the case because the customer team is ultimately responsible for providing the service that they provide, and therefore they're ultimately responsible for figuring out what's best for their particular use case. It's really just an example of allowing the customer team to choose to use the right tool for the right job.

What you can see is this actually provides extremely powerful incentives and an extremely powerful sort of continuous improvement idea within the organization, because that internal service that we've invested in needs to be strictly better than the alternatives that the customer team has, right? Strictly better than the alternatives of building it themselves, of buying it, or of borrowing it from somewhere else.

To make this idea even more powerful, I strongly believe that we should be charging for usage of even internal services. We're pretty familiar with this in the cloud area, but in my view, the high-performing organizations are doing this for internal services as well.

We should be charging the customer teams for using the service that we build. This is not because we want to make profit or because the idea is making money, but the idea here is to align the economic incentives of the customer and the provider and motivate both sides to optimize their usage, or what they build, for efficiency.

By contrast, when we have something that's free, we tend not to value it particularly, and we end up wasting it quite a lot. When something is free, there ends up being no incentive to control our usage of it and no incentive to find more efficient alternatives. We can see this in our normal economy in pollution, right? When it's free to pollute the air and the water, then that's what people are going to do.

The particular example I'll use here is from Google. Again, I was director of engineering of App Engine, which is a platform as a service. There are about three million different applications worldwide that run on App Engine, including 15,000 within Google. For many years, App Engine charged the outside customers, but it was free to use it within Google.

One particularly, I'll say, aggressive user of those 15,000 within Google, for a long time was using an incredibly large amount of an extremely tight resource that we had in App Engine. We sent a bunch of emails, and as the director, I went and begged their director and their VP, "Could you guys please spend a little time to optimize your usage?" They were never able to make that a priority, and so that never happened.

As soon as we started charging them for their usage, all of a sudden they were able to prioritize making the modifications that they need, actually looking in detail at the usage they were making. They were able to make a 10X reduction in their usage of this particularly painful App Engine resource. So they went from using, for example, 100 units of this to 10 with a very simple code change.

They did this not because they were evil, but because they had no incentive to do otherwise. The lesson that I take from that is charging people is a way of aligning the economic incentive so they do the thing that really should be done for efficiency.
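
To make the arithmetic of the 100-to-10 example concrete, here is a small sketch of an internal usage bill. It is an illustration only: the price per unit, the team names, and the record format are all assumptions, and the talk gives no real rates.

```python
from collections import defaultdict

# Hypothetical price in cents per unit of the scarce resource.
PRICE_PER_UNIT_CENTS = 250


def monthly_bills(usage_records):
    """Sum usage per team and convert it to a charge in cents."""
    units = defaultdict(int)
    for team, used in usage_records:
        units[team] += used
    return {team: total * PRICE_PER_UNIT_CENTS for team, total in units.items()}


# Before the code change: team-a consumes 100 units in total.
before = monthly_bills([("team-a", 60), ("team-a", 40), ("team-b", 5)])
# After the code change: team-a consumes 10 units.
after = monthly_bills([("team-a", 10), ("team-b", 5)])

print(before["team-a"], after["team-a"])
```

Under these made-up numbers, the heavy user's bill drops by the same factor of ten as its usage, which is the incentive the charge is meant to create.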

The next lesson is around prioritizing quality. This is really a cultural thing as much as it is a technological thing. In my view, with my teams, I like to say that quality, performance, and reliability are priority zero features, right?

All of us build things, and we have our priority one things we want to do, priority two, priority three. I strongly believe that quality and performance and reliability are priority zero. What that means is that when we have a degradation in one of those metrics, we should pull the andon cord, right? We should stop the line.

We do this not just to make a more efficient process, which it does, but we also do it because these values of performance and reliability are just as important to our users as the cool features and the slick UIs that we build, right? If the game isn't up, it doesn't matter how cool the features are. It doesn't matter how fun to play it is if it's not available.

How we do this, and this won't be a surprise to anybody who's in this room, is that we have developers write tests and code together. Now, it's not important to me the particular discipline about whether it's testing first or testing in the middle or testing at the end, but as a developer, part of doing my work on a particular feature is developing the tests that go along with that work.

Over time, we develop the ability to do continuous testing of features, continuous testing of performance, continuous testing of the load characteristics of the software, and that gives us the confidence to make risky changes to our service. It's really a great example of slowing down to speed up, right? We make this investment up front, which actually puts more effort into the development process, at least initially. But we're able to speed up because we can catch bugs a lot faster, a lot earlier in the development process, and we can fail faster.

The example I'll use of doing this well, again, comes from Google. Those of you who saw Mike Bland's keynote learned how Google got to this point. But the current development process is as follows. Whenever there's a submission, there's always a code review by at least one other person. That's mandatory. There's quite a culture of automated testing around lots of different things at Google. Also, you won't be surprised that Google makes its code searchable, right? So, a single searchable source code repository.

By combining these things together, we actually get a lot of the benefits of what I call an internal open source model. This is a place where having those 15,000 customers at App Engine who are also Googlers is a real advantage.

They could submit a bug report, which would be wonderful and very helpful. They're of course going to find bugs in our software as they use it. But more often than you'd expect, certainly more often than I ever expected, instead of simply having a development team saying, "Here's a bug report," they would say, "Here's the bug. I found the problem in your code, and here's the fix, and here's the test that verifies the fix," which was wonderful.

We talk about moving quality up to the left, right? Moving quality upstream. This is moving quality upstream to people that depend on us. This was really wonderful as a service provider within Google. Also, these guys, the people that were using App Engine and running into bugs, were able to scratch their own itch and help us to solve their problems for them by solving them for us.

But not everybody's Google. Often you'll enter projects or companies where we really haven't had a history of investing much in testing. So here is Randy's quick way of approaching this, and I've done this a couple of times, and it seems too simple to possibly work, but I've never seen it work any other way.

The first step here is to write functional tests around a particular component, right? If we only have a small amount of effort to put into writing tests, we want to make sure that they're meaningful, and those end-to-end or functional tests are ones that actually have meaning to the customers that we're working to serve. Then, of course, we fail any build that breaks one of those tests. Then over time, we just simply keep ratcheting up the tests, right?

We can do this opportunistically. We can do this sort of incrementally over time, simply with this little kata. Every time we add a new feature, we add new tests for that feature. Every time we get a new bug, we write a test that reproduces the bug and verifies the fix. Over time, we build up our strength, we build up our muscle of testing, and we get better and better.
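
The kata described above can be sketched as follows. This is a minimal illustration, not code from the talk: `parse_price` and the thousands-separator bug are invented for the example, and the tiny runner stands in for whatever build tool would fail the build.

```python
from decimal import Decimal


def parse_price(text):
    """Hypothetical function under test: turn '$1,299.50' into Decimal('1299.50')."""
    # Fix for the invented bug: strip thousands separators before parsing.
    cleaned = text.strip().lstrip("$").replace(",", "")
    return Decimal(cleaned)


def test_parses_plain_price():
    # Test added together with the original feature.
    assert parse_price("$12.50") == Decimal("12.50")


def test_bug_thousands_separator():
    # Test added with the bug fix: it reproduces the report and verifies the fix.
    assert parse_price("$1,299.50") == Decimal("1299.50")


# The suite only ever grows: each new feature or bug appends a test here.
TESTS = [test_parses_plain_price, test_bug_thousands_separator]


def run_tests():
    """Return the names of failing tests; a non-empty list should fail the build."""
    failures = []
    for test in TESTS:
        try:
            test()
        except AssertionError:
            failures.append(test.__name__)
    return failures


print(run_tests())
```

Each bug leaves behind one more test in `TESTS`, so the set of behaviors the build protects ratchets upward over time.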

The next lesson is around actively managing technical debt. Obviously, our goal here as leaders is that we want our teams to maintain a sustainable and well-understood level of technical debt. We want to measure this in terms of how much engineering effort it takes to fix a particular problem.

We want to be planning all the time about when we accrue technical debt. We want to plan about how and when we're going to pay it off, and we want to track the amount of effort that we're spending on the feature work that we do versus the amount of technical debt that we accrue doing it.

Oftentimes, as engineers, we hear, "Look, that's great, Randy, but we don't have time to do it right." Actually, I say, "Wrong." I say, "We don't have time to do it twice." If we do a halfway job the first time, we're going to have to come back around and do it again. In fact, in my experience, over and over again, the more constrained we are in terms of time and resources, the more important it is actually to get it right the first time and then be done with it, and we can move on to the next thing.

Often when we enter a project or a team that's like this, we're sort of in this vicious cycle of technical debt. We've done something quick and dirty, we've accrued a bunch of technical debt, we have no time to do it right, we do something more quick and dirty, and round and round, it gets worse and worse and worse.

But with the application of moving to a quality culture and investing in testing, we can actually move that to more of a virtuous cycle of investment, where we spend our effort investing in quality, then the result is we get a solid foundation on which to build. We gain confidence in how strong our software is, and we can get faster and better. Now we have that much more free time and resources to invest more in quality, and it gets better and better and better.

The next lesson is about sharing on-call duties. We talk about sharing the pain, and I've done it. This is pain. But it's very important, culturally and organizationally, that we share it equally across the organization.

In my teams, all members of the team rotate on-call responsibilities. This is the strongest motivator I can find for everybody on the development side, the ops side, QA side, wherever, to build in solid monitoring and solid diagnosis capabilities into the software. It's also the best way to learn the real-world behavior of the system, to actually monitor it while it's running.

We talk about empathy a lot in the DevOps community. The very best way to develop empathy for our customers and other team members is to walk in their shoes by carrying the pager.

But really nobody's born with the ability to do this immediately. So what I've always instituted is a sort of apprenticeship or training mechanism where we train new engineers to be able to take on on-call responsibilities. I've always found that the best practice is to have, at any given moment, a primary on-call person and a secondary on-call person. The primary's going to take all the pages, and the secondary's there to back that person up.

This lends itself very well to this apprenticeship idea. When we start as an apprentice, the apprentice starts as the secondary but is sort of seconded to a very experienced engineer as the primary. So the apprentice can sort of look over the shoulder of the primary engineer, see how he or she is doing things, and learn over time as an apprentice would from a master.

Then over time, we'll get sufficiently comfortable that the apprentice will then become the primary. But then we make sure that we have an experienced engineer to back that apprentice up. We do that for a while, and then finally, the apprentice can graduate and can join the rotation in the normal way.
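
The two apprenticeship phases can be sketched as a simple schedule generator. This is an illustration under stated assumptions: the talk only says each phase lasts "a while", so the weekly shifts, the phase lengths, and the names used here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shift:
    week: int
    primary: str
    secondary: str


def apprentice_shifts(apprentice, mentors, shadow_weeks, lead_weeks):
    """Build the apprentice's on-call shifts.

    Phase 1: the apprentice is secondary behind an experienced primary.
    Phase 2: the apprentice is primary, backed up by an experienced secondary.
    """
    if not mentors:
        raise ValueError("at least one experienced engineer is required")
    shifts = []
    for index in range(shadow_weeks + lead_weeks):
        mentor = mentors[index % len(mentors)]  # rotate through the mentors
        if index < shadow_weeks:
            shifts.append(Shift(index + 1, primary=mentor, secondary=apprentice))
        else:
            shifts.append(Shift(index + 1, primary=apprentice, secondary=mentor))
    return shifts


schedule = apprentice_shifts("ana", ["kim", "lee"], shadow_weeks=2, lead_weeks=2)
for shift in schedule:
    print(shift.week, shift.primary, shift.secondary)
```

After the last generated shift, the apprentice would join the normal rotation like any other engineer.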

This, I found over and over again, is a great motivator for teams to build quality software and to get confidence that the software that they're building is really good.

The last lesson here is we've heard a lot about blameless post-mortems, and they're a wonderful tool. But the critical thing that I've learned in several sort of previous organizations that hadn't done it before is that we really, as leaders, need to make sure that the post-mortems are truly blameless.

It's very difficult to overcome a blame culture that's been there for a while. The institutional memory and the personal memory of blame is pretty long. The initial post-mortems that we did in the operations team at the gaming company elicited a bunch of fear because people said, "Oh, yeah, you say blameless, Randy, but I'm used to being blamed." People laugh. It's the laughter of knowledge, I think.

What we can do as leaders is that we need to constantly reinforce the idea that these post-mortems are about learning, not about assigning or apportioning blame. So I say to myself all the time, "Randy, when you say blameless, you've got to really mean it."

As a leader, when something goes wrong, we shouldn't be asking, "What did you do?" We should be asking, "What did you learn?"

One of these really effective blameless post-mortems is going to go like this, where we have an open and honest discussion where we document exactly what happened in the incident. We document what went right, and a lot of times, quite a lot of things went right. We also document what went wrong. We keep our focus on learning and on improvement.

We ask ourselves, how can we change the process, the technology, the documentation? How could we have automated the problems that we found away? How could we have diagnosed the issues more quickly? How could we have recovered service more rapidly? Our goal here, of course, is to take the fear and the personalization out of it.

If we can take the fear out of it, then wonderful, crazy, non-intuitive things happen. I have seen this over and over again, where if we can take the fear out of it and make it clear that all we're trying to do is improve the system for the next time, I have seen engineers almost compete to take personal responsibility for what went wrong.

One team's talking about what they did to contribute to the incident, and the other guy says, "Oh, you think that was bad? We found we weren't even taking backups, and our replication wasn't working." Over and over again, I hear this sort of seemingly non-intuitive behavior from people. This is what comes from making a truly safe space and a truly blameless approach to the problem.

But the other sort of similar reaction that I see over and over again goes something like this: "I really feel bad that we let our customers down, but finally we have the ability to fix that broken system." Does that sound familiar to anybody? Yeah.

"Finally, we have the priority to be able to fix that system that I knew was broken, and now we can do that."

Great. Okay. So as Gene asked all the speakers, what are some of the top five takeaways? Well, I kind of think all ten of them are reasonable, but if you had to take only five of them, I would do: reorganize the teams around this ownership idea; replace approvals with code; we really need to prioritize quality; we need to actively, as leaders, manage our technical debt; and we need to make post-mortems truly blameless.

At the end, what could I use help with? Well, I could use some more help in encouraging leaders to lose the blame culture. I'd also be interested in people's thoughts on measuring productivity in a principled way. We sort of have a good sense of productivity, but I've yet to find, in 25 years of software, a principled way of measuring it well. And then also help with overcoming people's fear of, or resistance to, taking the pager.

Thank you very much.