My Ops Team Can't Keep Up with My Dev Team, Creating Strategic Differentiation in Ops

Las Vegas 2020

In meetings with clients and prospective clients, by far the most common frustration expressed by C-level executives is the inability of Ops to keep up with Dev in today's fast-paced cloud model. A number of general principles often solve many of these problems.

In this talk we discuss different DevOps techniques used to transform medium-sized enterprises, changing Ops from a drag on the velocity of the organization to a strategic differentiator for the business.

Full transcript

Dave Mangot

Hi, my name is Dave Mangot, and today we are going to talk about strategic differentiation in operations.

I've been in operations for about 25 years now, and I've seen a lot of things at a lot of different companies. I'm lucky that I get to work with companies now on leveling up their operations teams. That's my job. It's great, and I really enjoy it.

One of the most common things I hear when I talk to CEOs, CTOs, CIOs is, "My ops team can't keep up with my dev teams." There are a lot of reasons for this. There are containers, there are all kinds of other things that are happening now that are brand new, that enable developers to go faster. There's cloud, there's all kinds of things. But there are also things that operations organizations do to themselves that cause them to not be able to keep up with the development teams.

Today, we're going to talk about some of the things that I've seen in organizations that I think, if you recognize them in your organization, can help you accelerate your operations team. We're also going to talk about what a high-performing operations team looks like.

For me, having done this for so many years, I really don't like it when I hear this, "My ops team can't keep up with my dev team," because I've been doing this long enough that I remember when operations was just considered a cost center. The value in an organization was in development. It doesn't happen in operations because you just keep the site running. I love that SRE and all these things have come along, where people recognize that operations is actually a really valuable part of the organization. I've been lucky enough to be on teams and to lead teams where we've created strategic differentiation.

We had this company that I was working with, and we had Cassandra there. For people who don't know, Cassandra is this distributed database. It's designed for very high throughput, large amounts of data, terabytes and terabytes of data. It operates in a ring topology, so the data is replicated throughout the ring. It also allows you to talk to different nodes if you want to be able to get data or store data, and then Cassandra takes all that data and does all the right stuff with it.

We were running Cassandra, and it runs on the JVM, so there were a lot of things that you had to know about. But we were having a lot of trouble with Cassandra. We were running at about 70% utilization. On your MySQL database, maybe that's not as bad, but on Cassandra, that's really pushing the envelope because Cassandra will do things called compactions, which go through and make better storage of the data. When a compaction kicks off, it uses a whole bunch of CPU. We had a lot of outages because we were running so close to the edge. When we had an outage, the recovery could take days.

This is because there's a lot of data. If we lose the data, let's say we had a replication factor of three and then we lost one of those nodes, then the other two nodes that were holding onto that data would have to stream the data across the network and rehydrate this replacement node with all the data. It was pretty complicated and pretty fragile. You had to have expert-level knowledge of Cassandra in order to be able to do these kinds of recoveries, or even to do these kinds of configurations.
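The replication-factor-of-three scenario can be pictured with a deliberately simplified sketch. It assumes a hypothetical five-node ring and picks three consecutive nodes per key; real Cassandra uses token ranges and a partitioner, so the node names and the hashing here are illustrative only.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical ring
RF = 3  # replication factor from the talk's example


def replicas(key, nodes=NODES, rf=RF):
    """Pick rf consecutive nodes on the ring, starting at the key's position."""
    digest = hashlib.md5(key.encode()).digest()
    start = int.from_bytes(digest, "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


def stream_sources(key, lost_node):
    """Surviving replicas that can stream the key's data to a replacement."""
    return [n for n in replicas(key) if n != lost_node]


owners = replicas("customer-42")
print(owners)                                    # three distinct nodes
print(stream_sources("customer-42", owners[0]))  # the two survivors
```

Losing one owner leaves two copies, and the replacement node has to receive everything it is responsible for from those survivors, which is why recovery time grows with the amount of data stored.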

We didn't like the situation that we were in. It wasn't good for customers. Customers don't like it when it's hard to retrieve their data. Customers, in this case, wanted sort of real-time storage of their data so they could use this data. We had to do something about it. What does everybody do in engineering when there's some problem that they're in and they need to get themselves out of it? Yeah, okay, I hear you saying that you write it yourself, but we didn't do that till later. We upgrade, right? Upgrade solves all problems. I was looking for a rainbow and unicorn slide. I found these things with the dogs and the balloons, and I just had to use it. Upgrade solves all problems, of course.

Well, the problem was upgrading didn't solve the problem. In fact, upgrading made things worse. We dark launched a new Cassandra ring. We were sending data over there, and we were getting these massive timeouts, like timeouts that made the ring unusable. We could not store data. We could not put customer data on there. That was just not an option. That was not something that we could do.

We were left with the unenviable task of figuring out where this problem was introduced because the version of Cassandra that we were on didn't have that problem, but the version of Cassandra that we wanted to go to did have the problem. Because Cassandra's an open source project, it shouldn't be too hard. We just had to go and do some testing and figure out where this problem was introduced between the two versions. We went to GitHub and looked at the Cassandra project and compared the version that we were on to the version we wanted to go to and discovered that there were 5,827 commits between the version that we were on and the version that we wanted to go to. All we had to do was find out which commit out of almost 6,000 commits was the one that was causing the problem.

How do you do something like that? It turns out we were able to find the actual problem, and we used a technique called git bisect. With git bisect, you say we have 6,000 commits, and you go to commit 3,000. If the problem exists at commit 3,000, you know the problem was introduced at or before commit 3,000. If the problem does not exist there, then you know the problem was introduced after commit 3,000. Let's say the problem exists at commit 3,000; then you go to commit 1,500, and you keep doing this binary search basically until you find the problem.
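The halving described above is a binary search over the commit history. A minimal sketch, assuming a hypothetical `is_bad(i)` check that stands in for "build commit i, stand up a ring, and look for timeouts"; the commit numbering and the culprit index are made up for illustration.

```python
def find_first_bad(n_commits, is_bad):
    """Return (index of the first bad commit, number of tests run).

    Assumes commit 0 (the old version) is good, commit n_commits (the new
    version) is bad, and every commit after the culprit is also bad.
    """
    good, bad = 0, n_commits  # invariant: good is known good, bad is known bad
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(mid):
            bad = mid    # problem was introduced at or before mid
        else:
            good = mid   # problem was introduced after mid
    return bad, tests


CULPRIT = 4242  # hypothetical index of the offending commit
first_bad, tests = find_first_bad(5827, lambda i: i >= CULPRIT)
print(first_bad, tests)
```

For 5,827 commits the loop needs at most 13 tests, since 2^13 = 8,192 is the first power of two above 5,827. With git itself, the same search is driven by `git bisect start`, `git bisect good`, and `git bisect bad`, and can be automated with `git bisect run`.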

In order to be able to do this kind of search, you need to be able to stand up a Cassandra ring. We just said that these Cassandra rings could take days to rehydrate if there was a problem, and that the configuration and things like that were very complicated things that were hard to get right. What we were able to do as an SRE team is create an ability to stand up a Cassandra ring in 20 minutes using SaltStack and its event-driven infrastructure. What would happen is the ring would stand up and then we would get an understanding within the ring that all the nodes were there, and that would create the actual ring. From there, we could run our test.

We were able to isolate the exact commit that was causing the problem in two and a half days out of the 6,000 commits. Everybody always wants to know, "Okay, what was the actual problem?" There was some code that was subtracting some nanos from some millis, and so we were able to isolate that thing. We wrote a patch. It fixed the problem. The patch we submitted upstream, they gladly accepted it. All kinds of fun, happy stuff happened at the end.
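The talk does not show the offending code, so the following only illustrates that class of bug and is not the actual Cassandra patch. It assumes a hypothetical deadline check where the deadline is stored in milliseconds and the current time arrives in nanoseconds.

```python
NANOS_PER_MILLI = 1_000_000


def remaining_ms_buggy(deadline_ms, now_ns):
    # Bug: subtracts a nanosecond value from a millisecond value, so the
    # result is hugely negative and the request looks long overdue.
    return deadline_ms - now_ns


def remaining_ms_fixed(deadline_ms, now_ns):
    # Convert to a common unit before subtracting.
    return deadline_ms - now_ns // NANOS_PER_MILLI


deadline_ms = 1_500      # deadline at t = 1.500 s
now_ns = 1_250_000_000   # now is t = 1.250 s
print(remaining_ms_buggy(deadline_ms, now_ns))  # -1249998500
print(remaining_ms_fixed(deadline_ms, now_ns))  # 250
```

A mixed-unit subtraction like this still produces a number, so nothing crashes; the value is simply wrong by about six orders of magnitude, which is the kind of defect that only shows up under test.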

But how many organizations can really do something like this? I went and talked to the CTO about it and I said, "What would you have done if we hadn't developed this ability to be able to stand up a full test Cassandra ring in 20 minutes?" He said, "Well, one of three things. One, maybe we never would've tried. We would've just said, 'The version of Cassandra that we're on, we're going to have to live with it. We're going to have to figure out some ways of making life better for both the operations folks who are getting paged, but also for the customers.' Maybe that's what would've happened. Maybe we would've hired some consultants, some expensive consultants. Who knows how long it would've taken them to find it. Maybe they never would've found it. Or maybe we would've just done something else completely, like use a different piece of software, abandon Cassandra. Who knows what we would've done?"

But the ability to stand this thing up and be able to isolate down to the problem and be able to fix the problem in cooperation with development, the developers, this is a pretty awesome thing. For me, this is the essence of working in this DevOps mindset of everybody coming together to solve these problems that the business has.

There were knock-on effects from being able to do this kind of thing. Because we were able to stand this stuff up and because we worked out techniques for being able to automate a lot of this stuff, our ring recovery time now went from a maximum of days down to a maximum of minutes. This is pretty great for operators, for everybody in the business, because now failure is not such a problem.

On top of that, in the course of doing all this, we were moving from EC2-Classic to EC2-VPC in AWS, and we reduced our costs through the ability to do capacity planning and right-sizing the nodes that we were using to the workload that we were doing and the storage that we were doing to the stuff that we were storing. All that stuff enabled us to reduce our cost by 45% just in that migration.

On top of that, we used this ability to stand up the rings and work with engineering such that we were able to create a solution that mostly replaced Cassandra entirely. It was a purpose-built solution that we worked on with developers. We gave them the ability to stand up these test instances. We talked to them about how we're going to do resilience, or I guess also I would say robustness. We talked to them about ways to design the system, and ultimately, the conservative estimate was that we saved 70% in costs over what we would've done if we were running everything on Cassandra, which is pretty impressive. If you want to know more about how we did some of this stuff, there's a great talk about Cassandra that we presented at AWS re:Invent a number of years ago, and it goes into much more detail about the actual Cassandra stuff.

But the question then is, how does my ops team do that? We're going to talk about a couple of things that I see a lot of times when I'm working with organizations and ways that we can get out of that. The first obstacle really is figuring out what kind of SRE organization you're going to be. There are a bunch of different models, and we're going to talk about that. Then we're going to talk about empathy, ticket systems, and alignment. After this talk, you won't have to hire me. You'll know everything that you need to know. I'll put myself out of a job. My hope really is that when we're talking about some of these things, you will recognize some of these things in your own organization, and you'll be able to take those things back and use them to make your operations component of your organization that much better.

When I work with SREs, I have two rules for them. These are the overarching rules for when they have to make a decision about something. Number one, and these are in order, is keep the site up. Seems kind of obvious. We're in operations. Of course, keep the site up. But we want the site to be available. That's what the business is paying us for. The second thing is keep the developers moving as fast as possible. We're going to talk about how these things relate to a high-performing SRE or operations organization and how these things related to what we were doing on that Cassandra ring. You can already see: keep the site up. Obviously, if we're having all these outages, we need to take care of that. Keep the developers moving as fast as possible, we already alluded to. We gave the developers the ability to stand up their own nodes and try things out.

In order to know where you're going, you have to know where you are. There are two SRE models I look at that are sort of the opposite ends of a continuum. If you really want to dig into these, I highly recommend the O'Reilly book Seeking SRE by David Blank-Edelman. For me, looking at these different models, I look at them as a continuum of keep the site up.

The Google model of SRE is a very active model of keeping the site up. Google has their SREs. They're on call. They have standards for the developers. If the developers don't meet those availability standards or performance standards or whatever, then it's the developer's problem again. The Google SREs don't run the site at that point. But it's a very active model. They're very actively participating. They're the ones who get paged.

On the other side, we have the Netflix model, and that's a very supportive keep-the-site-up model. Netflix defines your availability numbers, your performance numbers, all those kinds of things. Ultimately, you are responsible for running your own service. But if you're having trouble, if you're not meeting your numbers, then the SRE organization is there as a very consultative organization. It is someone who will come and work with you in order to help you achieve those numbers. Obviously, if you're not achieving those numbers, you're going to keep hearing from them. It's the very supportive keep-the-site-up model as opposed to the very active keep-the-site-up model.

In your organization, it's good to know which model you're leaning towards, which model you want to get to, because without a clear definition of which model you're operating under, it's going to be very hard to be a high-performing SRE organization.

The interesting thing about this idea of keep the site up or keep the developers moving as fast as possible is we like to talk a lot in DevOps about empathy. This is a model that is proposed and advanced by the teams at Stanford and UCLA. It's this three-component model of empathy, and it's a little bit more advanced than our dictionary definition of empathy, that we just feel what other people are feeling. There is that in this model. That's called experience sharing. There's this other idea of mentalizing, which is about recognizing a mind in other people. The part I want to focus on here is pro-social concern.

The interesting thing about pro-social concern is it is the feeling that we get when we recognize somebody else is in a situation that we have the ability to help out. It is motivating to us to want to go and help those people who are in that situation. Where this comes into play often outside of operations is with healthcare workers. If a nurse is working in palliative care, they don't want to feel depressed at the end of every day because of what they're seeing. We try to work with healthcare workers on this idea of pro-social concern, this idea that you have the ability to help, and that ability to help is the best thing that you can do. If you rely on that part of empathy, then you are not a wreck. In fact, you're actually empowered to go help other people.

This is the most important part for operations teams to take into the work that they do when they're working with developers. Our idea of empathy was this pro-social empathy. We also talked just a few slides ago about keeping the site up and keeping the developers moving as fast as possible. I will assert that operations teams who really care about the customer and making sure that the customer has a good experience, and obviously that has great effects for the business, are exercising their pro-social concern to make sure that the customers have the best experience possible. If the customers are not having a good experience because the site is down, then they are exercising their ability to help and get that site back up.

The other part of this is keep the developers moving as fast as possible. I think that operations teams that are really concerned about that, that know that they have the ability to help developers, and don't have developers waiting around for things and having to stop their work because they're waiting for operations, are exercising their pro-social concern to keep the developers moving as fast as possible. Empathy is a good thing.

Having those developers sitting around is why I do not like ticketing systems. When I talk about ticketing systems, I want to make clear I'm not talking about agile boards, Kanban, or Scrum. I'm talking about those ticketing systems where if I want a new EC2 instance, or I want a new SQS queue, or I want something, whatever the equivalent is in Azure, then I have to open up a ticket and wait for operations to do that. That's not keeping the developers moving as fast as possible.

There are a number of problems with ticketing systems. Number one, in lean, they're waste. This is where work goes to wait. It's a handoff. We don't want to create handoffs in the flow of work through the system. That's not DevOps. The second thing is they're exceptions. If I'm writing code and my code doesn't know what to do at some point, it needs a human. "Help, I need a human." It throws an exception. It says, "I cannot continue to work. I am stuck until a human comes and helps me." A ticketing system is a way of codifying that type of work. We're basically saying, "We need to get something done, and we're stuck, and I need to get a human to come and help me." We don't want humans to come and help us. We need to be empowering as operations teams, and we need to do things that will enable work to flow through the system, but in a self-service manner.

The last thing I don't like about ticketing systems, and why you should burn your ticketing system to the ground, is that they are tracking toil. When I say toil, I'm talking about it in the SRE perspective. If you've ever read the Google SRE book, this should be familiar to you. In one of the chapters, Vivek Rau says, "Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows." We don't want manual repetitive work. We want self-service. We want to empower developers to be able to do things themselves in a very safe, guardrailed manner.
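One way to picture self-service with guardrails, as opposed to a ticket queue, is a provisioning function that approves in-policy requests immediately and raises the "I need a human" exception only for the unusual ones. This is a sketch under stated assumptions: the instance types, the node limit, and every name in it are hypothetical, and no cloud API is actually called.

```python
from dataclasses import dataclass

# Hypothetical guardrails; real limits would come from the ops team's policy.
ALLOWED_INSTANCE_TYPES = {"small", "medium", "large"}
MAX_NODES_PER_REQUEST = 6


class NeedsHuman(Exception):
    """The ticket path: work stops until an operator picks it up."""


@dataclass
class NodeRequest:
    team: str
    instance_type: str
    count: int


def provision(request):
    """Approve in-policy requests at once; only exceptions need a human."""
    if request.instance_type not in ALLOWED_INSTANCE_TYPES:
        raise NeedsHuman(f"unknown instance type: {request.instance_type}")
    if request.count > MAX_NODES_PER_REQUEST:
        raise NeedsHuman(f"{request.count} nodes exceeds the self-service limit")
    # A real platform would call the cloud provider here; this sketch only
    # returns the names it would create.
    return [f"{request.team}-{request.instance_type}-{i}" for i in range(request.count)]


print(provision(NodeRequest("storage", "medium", 3)))
```

The common case never waits on anyone, and the handoff is reserved for requests that fall outside the guardrails.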

You've heard Damon Edwards talk about how operations provides a platform. When we were giving the developers this ability to create this system that replaced most of Cassandra, that saved 70% of costs on top of 45% of costs of the COGS of running the Cassandra service, we gave them the ability to launch nodes themselves so they could test things out.

I actually have a problem with Vivek's quote here, because I don't believe that toil scales linearly as a service grows. I think that toil scales superlinearly, because there are all kinds of coordination costs and things that are involved in doing this. The problem is, as it scales superlinearly, we're actually going to get worse the more success the business has. If the business is more successful and we are generating more toil because it scales as the service grows, then the business is actually going to slow down the more success that it has. We don't want that. We do not want to operate in this ticketing manner that is tracking toil.
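The coordination-cost argument can be made concrete with a toy model. The assumption here, which is not from the talk, is that every pair of services adds a small fixed coordination cost on top of per-service toil; the hour figures are invented.

```python
def linear_toil(services, hours_per_service=2.0):
    """Toil that grows in direct proportion to the number of services."""
    return services * hours_per_service


def toil_with_coordination(services, hours_per_service=2.0, hours_per_pair=0.1):
    """Per-service toil plus a coordination cost for every pair of services."""
    pairs = services * (services - 1) // 2
    return services * hours_per_service + pairs * hours_per_pair


for n in (10, 20, 40):
    print(n, linear_toil(n), toil_with_coordination(n))
```

Doubling the number of services exactly doubles the linear figure, while the figure with coordination costs more than doubles. Under this model the gap keeps widening as the business grows.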

There's one problem here: we've got this kind of toil work, which is a problem, and that toil work is something that we treat as this kind of special thing, which is open up a ticket and get this thing done. But there's another problem that I've seen where people treat work as special, and that is with remediation work. This is sort of just the opposite of toil, where we elevate remediation work to be some kind of special kind of work. The problem with this is that failure is not exceptional. Failure is the result of regular work. Jessica DeVita has some great stuff to say about this, and Allspaw teaches us that work is not linear and you can't just action-item your way out of a problem.

If we take these action items and create them as some kind of special thing, that's a problem. What I've seen organizations do is say, "Hey, we're going to track our regular work over here, but remediation work, we don't want to have an outage again, so we're going to track that in some other special way." That's not okay. Remediation work is just work. We know about cognitive biases, and we know that recency bias is not a great way to do prioritization. What can we do about this work? Treat it like regular work.

People have seen Dr. Kersten's presentation a few years ago at the DevOps Enterprise Summit, Project to Product. Dr. Kersten says there are these four types of flow items: features, risks, debts, and defects. Remediation work is just part of those flow items. Maybe it's risk, maybe it's debt, maybe it's a defect. It obviously depends on the situation, but remediation work is just work, and we shouldn't treat it any differently. At different parts of the life cycle, different flow items are going to be highlighted. Maybe at one time in the year, we're going to work more on risk, and another time, we're going to work more on features. That's how we have to prioritize remediation work. We have to fit it into what the business needs at the time.

This is why frontline managers are so important. They need to work with the engineers, and they need to work with the business, and understand what is the proportion of these things that we want to allocate so that the remediation work does get done, but it gets done within the context of what the business is trying to accomplish. The only way that the business can get those kinds of things communicated is if there's alignment.

A lot of times when I'm working with companies, I see organizational debt that's manifested as tech debt. Managers need to understand what the business priorities are, but the organization also needs to be structured in such a way that that stuff is communicable and it flows correctly. Anyone who's ever heard about DevOps knows silos are bad. But this is why silos are bad: it does not allow us to have our organizational alignment. What winds up happening when we don't have this alignment is it manifests itself as tech debt.

Tech debt is not an awful thing in itself, but tech debt is an awful thing if it comes out of the fact that we don't have our organization structured in the right way. If I have two competing silos and they're each doing the things that are best for their silo, a lot of times that looks like tech debt. What I advise people is tech debt should be a conscious choice. We can say for the next six months, we're going to accept this problem, and that's okay because at the end of it, we're going to be able to wipe out this entire class of problem. But what we don't want is organizational debt manifesting itself as tech debt.

How do we get this alignment? We can do this through various methods: OKRs, or V2MOMs, which is what we used to use at Salesforce. The idea is that leadership communicates what is important, what is the most important thing, second most important thing, what are the things they're going to be looking at, what are the things they're going to be trying to accomplish. Then the people who are successively below them in the hierarchy, hopefully operating in a very Westrum generative manner, are going to align with those things.

This is why it's important, I always say, to have operations and engineering under the same leader, because that allows you to have this strong alignment. If you have those individual silos, then it's going to manifest itself in all kinds of ways, tech debt being one of them, as a result of your organizational debt.

People ask me, "Okay, Dave, this sounds great. You want to put this group under this or this group under that. Where do you actually draw the line between development and operations?" My answer is it's a fuzzy line. It's a fuzzy line on purpose. That is by design. Because that place in the middle where it's fuzzy, where those two things overlap, that's where the DevOps happens. That's where the DevOps is. Our customers don't care if it's development's problem or operations' problem, or the ops people don't like this, or the development folks don't do that. They don't care about any of those internal things. What they want is a high-quality product that's delivered to them.

Where the overlap is between development and operations, that's where the DevOps happens because there's only one product. By having those teams overlap and work together and try to deliver the best possible thing they can, that's where we're going to get the best results. That's where we're going to be able to achieve 70% cost reduction. That's where we're going to be able to reduce our recovery time from days down to minutes. That's when operations is enabling the business, they're keeping the site up, and they're keeping the developers moving as quickly as possible. That's when they become strategic differentiators.

Please DM me, please talk to me in Slack. Please communicate. I will be around for the entire conference, and I will be in the Slack even afterwards. Thanks very much.