DevOps at Scale is a Hard Problem


San Francisco 2017


Sr. Director, Production Engineering · Yahoo, Inc

You join a company with 3000+ engineers; you interview a ton of them; understand their pain points -- sound familiar? Well, this was what my early days at Yahoo looked like back in 2014. The most common response I got was, “I wish there were fewer barriers.” The barriers made it hard for engineers to freely express themselves through the products they built.

Attempting to solve this, I joined forces with a colleague of mine (Shay Holmes). We set out on a mission, a pretty long one, to try and remove most of the barriers using software. The macro goal? Velocity! Agility! Step #1: come up with something that we could use as a reference for our (DevOps) transformation. So, we came up with a nice slide deck; did multiple talks, road shows, blog posts, and much more, to campaign for our mission.

We thought we were doing great up until the time came where we had to start executing. And, yes, we hit roadblocks; multiple roadblocks.

First, we attempted to route alerts directly to the team who had the best chance of fixing a production issue, essentially bypassing 3 layers, to lower TTR and TTD. This also enables a culture of ownership (you wrote it; you own it), and creates the right incentives to keep monitoring clean. But, turns out, this was a rather hard problem to solve (and not fully solved yet).

Second, we tried to get teams to continuously deploy software to production. Fortunately, most of Yahoo was already doing CD, as it had been a corporate goal for about a year. But there was a problem: a few teams were working around the automated CD requirements by keeping a manual QA approval step. In reality, those teams were merely trying to meet the letter of the goal (which, btw, is not bad in itself), not its spirit. How do you overcome something like that when CD had already been agreed upon as a best practice?

Third, we tried to treat operations as an engineering discipline and invest in solving (production) problems using software and automation. But building an intelligent, event-driven, auto-remediation framework needs investment, and most teams were already pretty buried in tech debt and barely making any forward progress. How do you dig yourself out of this hole? How do you convince teams to automate themselves out of their jobs?

Fourth, we made a big push for AWS; we came up with multiple use cases that made sense. Made good progress on some; not so much on others. Some of the challenges we faced included cost (can get expensive real quick); security (securing our users and data is non-negotiable); this also requires a ton of investment, especially at the foundational services layer, to make it feasible at scale. Given the enormous challenges, how do you make forward progress?

I will dig deeper into each of these during the talk; give insights into what worked, what didn't, and the lessons learned. It should be a productive session.

Kishore Jalleda, Sr. Director, Production Engineering, Yahoo, Inc


Full transcript


Kishore Jalleda

My name is Kishore. I've been with Yahoo for three years now. I'm here to talk about our transformation. Again, this is mostly about our journey and progress, as opposed to perfection.

So the scale: all of you know Yahoo has over a billion users. Pretty large-scale infrastructure, multiple data centers, multiple POPs.

When I joined the company about three years ago, one of the first things I did was to interview a bunch of people, a bunch of engineers. There was one person who I'm not going to name. I'll just call him Bob. It's not his real name, by the way.

He said something interesting. He said, "I've been wanting to do this, trying to build this stock recommender engine using the data from Yahoo Finance, but I've not been able to do it because there are several constraints." And there was one thing that he said that really struck me. He said, "I wish there were fewer barriers."

When I say barriers, it means there are several different things. Hardware takes a long time to arrive, too many paranoid approvals, the code base is complex, et cetera.

That's when it struck me: I think as leaders, or anyone else, the most important job all of us can be doing is to help people like Bob, to find people like Bob, to remove those barriers for them so that they can come to work every day motivated, energized, excited about building something awesome for their customers.

Fundamentally, the journey, at least the one I've been leading, has been all about this. That's when it struck me. We define this as DevOps. At least this is what I define as DevOps: it's about eliminating the barriers, like the cultural, technical, and process barriers between idea and execution, most importantly using software.

Again, fundamentally, it's about creating entrepreneurs. People like Bob, who want to do something different, who are motivated individuals, helping them build something awesome for the customers.

Again, velocity, MVP, all of you have heard this. I mean, I worked at a lean startup, so I use these terms a lot.

That's when the journey started. But before that, what was also important was finding a co-founder. I call this person my co-founder just because we wanted to sort of change the world. We sort of wanted to break all the rules and do something awesome for the engineers. So I found my partner in crime. That's where the journey began.

It's important, if you are planning on doing something at your company, to try to find someone who you are radically aligned with.

And this is what we came up with. We came up with what we think DevOps is: to enable a culture of ownership and excellence, engineer agile and automated processes, and develop self-serve and reusable tools, so that we can kick ass at delivery: deliver products to market fast, prevent defects from reaching customers, and repair production issues quickly. These are sort of the guiding principles that we use, values if you want to call them that, in every decision or every practice that we do.

There was some initial response to our strategy deck. We went around the company, we were campaigning, we were giving talks about, "This is what we want to do. This is our vision." Some people said we were preaching to the choir. And that's okay. At least we're preaching. Some people were not even doing that, at least the same people who said that. Some were very perceptive. They asked, "What is our day job?" Which is pretty cool.

And then soon we started executing. We had a plan. We were looking at some of the pain points. We were looking at the barriers. And then one of the first things we wanted to do was what I call directed alerting.

I'll be talking about four initiatives here in this talk. This is the first one as part of our transformation. It's called directed alerting.

Let me ask you a question. There are two options: option one and option two.

Option one: alerts go through three different teams before they go to the dev team who actually wrote the software.

Option two: alerts directly go to the dev teams.

How many people would pick option one if I just asked you? Option one. Wow, not even a single person? Okay, so a handful of brave souls.

But yeah, again, this is how things were, at least in the business unit I was part of. Not anymore.

I fundamentally believe, if you go back to the slide, we talk about ownership. So you wrote it, you own it, you run it. There are several variations of this. The most important thing about this part is you have to have the engineers closer to the customers. The feedback from production should go back quickly and directly to the people who actually wrote the software. That's fundamentally what it is.

Simple, right? Again, 99.9% of the people in this room are aligned with me, which is amazing. You guys don't need this talk. But anyway, the important thing is, how do you make that happen? Because it turns out that it's actually a hard problem to solve.

Because you're saying, "We pretty much don't need three different teams in between." So it's almost like going to a team and saying, "We don't need you." And it was not just one team, but three different teams. One of them was mine.

Not just that, you should also go to the dev teams and convince them that this is the right thing to do, which is obviously not simple. Obviously, they're going to say, "I don't like my developers to be waking up at 3:00 a.m." Again, which is probably wrong. And of course, changing people's mindset is always a hard thing. And as I mentioned, saying no to someone, to a team, is also a hard thing.

And when you're stuck, be patient, breathe, calm down, and then follow these simple rules. I'll be talking about 10 rules or so, nine or 10, as to what we did to help overcome this barrier, to make it so that alerts directly go to the dev teams.

Rule number one: have a strong vision. In my case, the vision was always to have alerts go directly to the dev teams. Make sure that every alert is actionable. Make sure that every alert requires human intelligence to fix. Ideally, get to two alerts a shift, so that you can keep up with root causing each alert. So that was the vision I had.
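The vision in rule one can be sketched in a few lines of code. This is only an illustration, not Yahoo's actual alerting system: the ownership map, the rotation names, and the `actionable` flag are all assumptions made up for the example. It shows alerts going straight to the owning dev team with no tiers in between, and a check against the "two alerts a shift" target.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical ownership map: service name -> on-call rotation of the dev team that wrote it.
SERVICE_OWNERS = {
    "finance-quotes": "finance-dev-oncall",
    "daily-fantasy": "fantasy-dev-oncall",
}

ALERTS_PER_SHIFT_TARGET = 2  # the "two alerts a shift" goal from the talk


@dataclass
class Alert:
    service: str
    summary: str
    actionable: bool  # does a human actually need to do something?


def route(alert: Alert) -> Optional[str]:
    """Return the rotation to page, or None if the alert should not page anyone."""
    if not alert.actionable:
        return None  # non-actionable alerts are noise; fix or delete the check instead
    # No intermediate tiers: the alert goes straight to the owning dev team.
    # The fallback rotation name is an assumption for services with no known owner.
    return SERVICE_OWNERS.get(alert.service, "unowned-service-triage")


def over_budget(shift_alerts: List[Alert]) -> Dict[str, int]:
    """Rotations that were paged more often in one shift than the target allows."""
    pages: Counter = Counter()
    for alert in shift_alerts:
        rotation = route(alert)
        if rotation is not None:
            pages[rotation] += 1
    return {r: n for r, n in pages.items() if n > ALERTS_PER_SHIFT_TARGET}
```

A rotation that shows up in `over_budget` is a signal that the team cannot keep up with root causing each alert, which is the point of setting the target in the first place.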

Communicate the vision widely across the company. Communication is never, never, never enough.

Leverage outages and failures. There's always going to be some. I cannot stress how important this is. Conduct postmortems. Ask some difficult questions, some thought-provoking questions. Stir some emotions. Don't be a jerk. But again, do make sure that you're asking the right questions.

As an example, there was a postmortem where there was an alert sitting with a team for six hours. So I asked, "Why was the alert sitting with that team for six hours?" It turned out they didn't know what to do with that alert, or they did not know the importance of that alert. It's like, okay, that needs to be fixed. So you should be asking those questions. Do not shy away from asking them.

Rule number three is probably the most important rule: find your allies. They won't come to you. You have to go and find your allies, especially if you're serious about a transformational change. It helps drive that bottom-up change.

And if you're not making progress, try to infiltrate the enemy lines. Not literally. But the example I like to give is, if there's a large team and if a leader is not aligned with you, but someone on the team is aligned with you and your product, I think it's okay to talk with them, show them the value that your product or your tool is actually giving you, so that those people will end up making enough noise to actually go about driving through the initiative.

Again, this is very important. Find your allies. You have to go and meet them, see who is aligned with you.

Rule number four: half-assed approaches never work. So when we were trying to make this change, one of the common things that I found from my team was, "Can we send the low-priority alerts to that team?" Or, "Can we have that team do reboots for us so that we don't experience the pain?" I said, "No. It's all in. Alerts directly go to the dev teams. And let's figure out a way. Let's invest in tooling to make that happen."

But I think it's so important, because maintaining status quo is easy. Everyone can do it.

You also get all kinds of reactions. There was one team that said, "Oh, we'll have no one else to blame." And okay, that's not a good approach or response.

Stay calm in tense moments. There's always going to be some. In fact, when we were making this change, I'll talk about a product that we launched. My boss clearly told me, "Are you sure you want to make this change? Don't fuck things up. This is not the time to be messing with things." But again, you're always going to have those moments where you have to make those decisions.

You have to make some changes internally for this change to happen, and you have to be working with several different teams.

And again, top-down support is critical. Make sure that your boss is aligned, your boss's boss is aligned, and all the way up. It's so important. Don't just do things on your own. Again, the onus is on you to be doing this.

In my case, I talked about allies. I actually found a strong ally. In 2015, we shipped a product called Daily Fantasy. We had a new leader who came from a startup. He was amazingly radically aligned with me. He said, "Yeah, this is DevOps. Yeah, we'll do this." And so he and I got together. We made the change happen. We shipped it. Again, it's a great opportunity for us to show that something as radical as this was actually possible at a large company.

Learn to say no. One of the hardest things to do as a professional, or anyone else for that matter, is saying no. When you're making such a big change, when you don't want three different teams in between, you have to learn how to say no. And not just you, your teams should also learn how to say no. It's important. So make sure that you're empowering your teams to also say no.

Align incentives. What ended up happening was, if you remember the previous slide, we ended up sort of embedding the team B into team C, which is my team, the production engineering team. Team A was shared between different business units. Again, you can't just get rid of people, but if you can find ways in which you can align the incentives, then go ahead and do that.

Don't miss the boat. If you are like most, I'm not sure about smaller companies, but at Yahoo, at least, most big decisions come down to a go/no-go meeting. If you are driving that meeting, make sure that you go in prepared. Make sure that you get the buy-ins from all the stakeholders before the meeting, not after the meeting or during the meeting. People should not be surprised in the meeting when you are proposing something. So make sure you meet with them individually and explain your vision. Make sure that there is actually alignment that is happening. This is important as well.

Last one, actually it's rule number 10: make sure you're celebrating all wins. All wins must go viral. As I mentioned earlier, when you're doing something which is pretty radical, everyone in the company should know that. So celebrate those wins, talk about them a lot, talk about them in your all-hands, write blog posts, show graphs which indicate the impact, et cetera, and people will start to pay attention. That's how they would start to believe that something as important and radical as this is actually possible.

And then we shipped. We launched. That's what the TTD, the time to detect, looked like. There was a dramatic reduction just because the alerts were directly going to the dev teams. And then we slowly started to roll it out to more teams as time went on.

So that was the first initiative, directed alerting.

The second initiative is continuous delivery. How many people do CD here? Cool. About half, maybe less.

Again, same thing. Before we talk about this initiative, let me ask you a question.

Deploy to production: option one, no humans allowed. Option two, humans allowed. How many people would pick option one? Quite a lot of people. Awesome. Option two? Again, a handful. Cool.

Again, there was a major initiative. I didn't drive this initiative. This was even before I joined the company. But a bunch of amazing folks convinced Marissa, who is our ex-CEO, to make this a top-down initiative. It was a corporate goal across the company. There was buy-in from her.

So as I mentioned earlier, once your CEO buys in on something, it's amazing. It's so much easier to actually get things done.

But it didn't go as smoothly as we would have hoped, especially because when you say no humans allowed, that thing can actually be scary for a lot of people. But again, that's how modern software should be built. I believe CD is table stakes. People have been doing CD for a long time. In fact, for more than a decade. And so if you're behind, it's probably time to catch up, if it makes sense.

But again, this was actually really hard for a lot of people. Expect a lot of failures early on. Expect to fail much more often than you succeed. There are always going to be dark, gloomy days. Never give up.

At the heart of CD is a certification plan. So if you can get that right, and it's not going to happen overnight, eventually you'll get to a point where you can actually ship software continuously to production without any humans allowed.
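One way to picture a certification plan is as a list of automated checks that alone decide whether a build ships. The sketch below is an assumption for illustration only: the check names, the `Build` fields, and the canary threshold are hypothetical, not the plan Yahoo used.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Build:
    commit: str
    unit_tests_passed: bool
    integration_tests_passed: bool
    canary_error_rate: float  # fraction of failed requests seen on the canary


# Hypothetical threshold; a real plan would set this per product.
MAX_CANARY_ERROR_RATE = 0.01

# Each entry is (name, predicate). The list as a whole is the "certification plan".
CERTIFICATION_PLAN: List[Tuple[str, Callable[[Build], bool]]] = [
    ("unit tests", lambda b: b.unit_tests_passed),
    ("integration tests", lambda b: b.integration_tests_passed),
    ("canary error rate", lambda b: b.canary_error_rate <= MAX_CANARY_ERROR_RATE),
]


def certify(build: Build) -> List[str]:
    """Return the names of failed checks; an empty list means the build is certified."""
    return [name for name, check in CERTIFICATION_PLAN if not check(build)]


def should_deploy(build: Build) -> bool:
    # No manual approval step: the certification result alone decides.
    return not certify(build)
```

The design choice worth noting is that there is no "approved by" field anywhere. If a team wants more confidence before shipping, the way to get it is to add another automated check to the plan.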

There is an initiative called Warp Drive at Yahoo. Amazing program. That team is responsible for driving cross-functional initiatives across the company. Again, it helps drive that bottom-up energy. Highly effective. They were instrumental in getting Yahoo to CD.

So if you are planning on doing something at a large company or a small company, have a team which is responsible for changing the culture and driving the technical excellence across the company.

Even when your CEO wants it, even when there's an amazing program that is running this across the company, there are always going to be stragglers, especially at a large company where you have micro-cultures within, who would just not embrace the spirit. So they may be doing CD for the sake of hitting the quarterly goals, but they don't truly embrace it. And I think it's just a fact of life that you're always going to have teams like this.

So what do you do? The strategy is to, again, infiltrate that team with evangelists and people who actually are champions of CD. It's important because those champions, those evangelists, will actually help you get that team there.

Public shaming and peer pressure are also important, if done correctly, because no one really likes large numbers next to their names. So if there's a weekly email going out which shows your progress on CD or any other initiative, when people see smaller numbers, it creates that pressure to actually go and do the right thing.

But eventually, I call this the law of velocities, if there is a thing like that, but eventually when you have most of the company on CD or any other initiative, and when there's one or two teams who are not on it yet, they will get there. They have no choice because obviously one team is dependent on another team in a services model or microservices, and then one team is shipping fast and one team is shipping slow, then obviously things are not aligned, and there's no other choice except to align.

Again, the results, just like at most companies: this is possible at scale. The deployment time has come down a lot. The frequency has gone up a lot. Production engineering and other teams don't do the deployments. It's all automated. You just commit. There's a scheduled time. It just goes to production.

But again, as I mentioned, CD's table stakes, so it's important that you try to keep up if it makes sense in your industry.

Third initiative: automation culture. Again, same drill here. I'll ask you a question. Server, container, VM is in a bad state. All of us experience this. Unhealthy.

Option number one: you wake someone up at 3:00 a.m. and that person takes that compute out of rotation, and then goes back to bed.

Or, some framework automatically takes it out of rotation, and then runs some diagnostics, and then creates a ticket and assigns that to a human.

How many people would pick option one? Awesome. Look, all of us are... Yeah, again, common sense. I agree.

But the problem, which I'm assuming is why you're here, is how do we get there? But before that, one caveat: if too many machines go out of rotation, then obviously you wake someone up.
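Option two, together with the caveat about too many machines going out of rotation, can be sketched roughly as follows. This is a toy model under stated assumptions, not the framework built at Yahoo: the `Pool` structure, the 20% safety limit, and the `run_diagnostics` placeholder are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Hypothetical safety limit: beyond this fraction out of rotation, stop automating and page a human.
MAX_OUT_OF_ROTATION_FRACTION = 0.2


@dataclass
class Pool:
    hosts: Set[str]
    out_of_rotation: Set[str] = field(default_factory=set)
    tickets: List[Dict[str, str]] = field(default_factory=list)
    pages: List[str] = field(default_factory=list)


def run_diagnostics(host: str) -> str:
    # Placeholder: a real framework would collect logs, disk and memory stats, and so on.
    return f"diagnostics collected for {host}"


def remediate(pool: Pool, unhealthy_host: str) -> str:
    """Handle one unhealthy host; returns 'ticketed' or 'paged'."""
    projected = len(pool.out_of_rotation | {unhealthy_host}) / len(pool.hosts)
    if projected > MAX_OUT_OF_ROTATION_FRACTION:
        # Too much capacity would be lost: this needs human judgment, so wake someone up.
        pool.pages.append(f"{unhealthy_host}: capacity limit reached, manual action needed")
        return "paged"
    pool.out_of_rotation.add(unhealthy_host)
    report = run_diagnostics(unhealthy_host)
    # The ticket waits for a human to pick it up; nobody is woken for a single bad host.
    pool.tickets.append({"host": unhealthy_host, "report": report})
    return "ticketed"
```

With a pool of ten hosts, the first two unhealthy hosts are pulled and ticketed automatically, and the third triggers a page because it would push the pool past the limit.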

The challenge is, you're always going to be asked, what about job security? And most importantly, if you're already buried in debt, if you're barely above water, how can you make the strategic investment to build tools and build those auto-remediation frameworks to actually come out of this hole that you're in? So that's the challenge.

But it's mandatory, it's critical that you actually make those bets.

In our case, we had a team which was doing CD, and we sort of pivoted that team to start building tools, and we sort of ran it like a startup. We planned the whole year. There was greenfield thinking. Motivated individuals who were passionate about things like auto-remediation, destructive testing, and chaos, and whatever. They came up with their pitches and they formed teams, self-organizing teams, and we built a bunch of tools.

But again, ruthless prioritization is critical. So the contract I had with the dev teams was, eventually, we will stop providing you production engineering support. Not immediately, but eventually. So we'll build the tools, we'll do the handoffs, but eventually that's the plan.

In fact, we even went as far as listing all the products. Yahoo is a 22-year-old company, so obviously there's a lot of legacy stuff as well. So for every product, we sort of said, "Is this important? How much revenue is it making? How much activity is there? How many developers are there?" And then made a plan and sort of entered into an explicit contract with the dev team saying, "In six months, you're off here based on what you are supporting as of today." I think that's important. And that actually worked out really well.

Again, the promise to my team was they can get to work on higher-value-add work, as opposed to doing things that don't really matter. And I also promised that if someone says you can automate yourself out of your job, it's almost always not true, because you can always find something else to do.

And as I mentioned about saying no, unless you say no to things that don't matter, you can't actually say yes to things that matter. So it's important that if you're doing things manually, then you say no, invest in automation, invest in tooling to auto-remediate those error conditions in production.

We built a bunch of tools. Even lately, we've been working a lot on Kubernetes.

But results: again, hundreds of error conditions which would've normally been handled by humans are now handled by machines. Dramatic reduction in repeat incidents, and a number of man-hours saved, just so that people are not doing work that doesn't matter.

Just to end on this initiative: do not build tools. If you're a tools team, if you're building tools, make sure that you aim for shipping something sooner rather than later. Trust me, we have built tools which were not self-serve or which were not a service, but which served a specific use case of a customer, and they ended up being adopted much more widely than tools which had perfect test coverage, TDD, and all that good stuff.

So it really depends on what your customers want and what their pain points are. And please don't build tools in a vacuum. Or if you're building a product, don't build them in a vacuum.

Okay, last initiative. The question here is, again, if you were to pick a cloud, what do we do? Private, public, hybrid. There's no right or wrong answer, so I will not ask this question. But it really depends on the problem you're trying to solve.

For example, if you're a startup, you would be doing something. But in our case, I think it was important when we pushed for this. I think it was important that we made a bet. We experimented with something like AWS. Again, not necessarily AWS, but any other public cloud.

We had several different use cases. There was one thing called fail-safe. So if all of Yahoo's data centers are down, we can still serve some static content from AWS, so there's a stack running there. We also use it for load testing. There were also teams which were building up backends over a weekend and then just tearing things down. So that's the beauty of having on-demand, self-serve, billable compute.
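The fail-safe idea boils down to a simple selection rule, sketched below. The data center names, the health map, and the origin name are assumptions for illustration; the talk only says that a static stack runs on AWS for when every data center is down.

```python
from typing import Dict

# Hypothetical origin name for the static stack running on AWS.
FAILSAFE_ORIGIN = "aws-static-failsafe"


def pick_origin(datacenter_health: Dict[str, bool]) -> str:
    """Return the first healthy data center, or the fail-safe stack if none are healthy."""
    for name, healthy in datacenter_health.items():
        if healthy:
            return name
    # Every data center is down (or none are known): serve static content from the cloud.
    return FAILSAFE_ORIGIN
```

For example, `pick_origin({"dc-east": False, "dc-west": False})` falls through to the fail-safe origin, while any healthy data center takes priority over it.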

We also launched many new projects after this initiative, mostly the acquisitions and the newer projects which don't have that much of a reliance on Yahoo's backend.

The problem if you're trying to start an initiative like this, especially at a large company with lots of data centers and hundreds of millions of dollars invested, is most people are afraid to break the rules. In fact, I had some senior people come to me and say, "Can we use AWS?" I said, "Yes. Initially we may have to pay using our credit cards, but eventually..."

When I say break the rules, it's important. You break the rules, but then break them in broad daylight. What I mean by that is make sure that you have buy-ins from people and make sure that you are getting their support. Don't just do something in a vacuum and then show up and surprise people. So I think that's important.

Again, we made enough noise that people noticed, and it actually ended up becoming a corporate goal. There was a pause in between, but then you should expect to hear more in the next few years on AWS. But then step number one was to get consolidated billing so that it was easy for teams to use something like AWS.

In terms of results, we shipped, as I mentioned, a fail-safe stack on AWS. Many new products were launched, and there's more to come.

Also important: try to be selfless. I know this is sort of a little bit philosophical, but if you start something, it doesn't mean that you have to end it. But it's important to blaze the trail.

Closing thoughts. I've also been writing a lot on these topics on LinkedIn. I've been publishing a few articles. So if you want to check them out, you should.

But when I say DevOps anti-patterns, there's a ton. I'm not going to list all of them, but one of the most common things that I find is, people will come to you and say, "You know what? We want the dev teams to be focused on the stuff that matters, like features, and we don't really care about anything else. So we'll just send all the crap to a different team."

So that's when you know that you're not creating the right incentives for the dev teams to keep things clean, to keep things in an operable state. So if you find any, try to reverse them.

A better model, in my opinion, you can call it DevOps or whatever, is about ownership. So if you're a dev team, you fundamentally own everything, starting from your build, test, deploy, monitor, on-call, postmortems, capacity, everything else.

And if you're a non-core dev team or an ops team, you sort of do the same, but you do it in a different function. So you're building infrastructure, you're building tooling, you're building CD frameworks, which is what my team has been moving towards at Yahoo. So that is why we've been able to work on something like Kubernetes, just because we've been making all these investments.

Reflect and soul-search. One of the favorite questions I always like to ask is, "Why does my team exist? What value are we providing?" And then normally, if you go down that path, it generally leads to better outcomes.

And a few other things. I think uptime is overrated. I like to say that customers don't really expect or care about five nines uptime. I think they care about five nines customer service. So if there's an outage going on, TTR is important, but also focus on how you can keep the customers engaged before and after the outage.

Again, the velocity is the same thing. It's not enough if you just ship 50 times a day, or 100 times a day, or one time a day. It's important to also focus on what you're learning after you're shipping.

It's almost like you're a software developer. If you committed something and if you don't care about what impact your change has had on the customers, if you're not learning from it, it doesn't really matter if your change goes out in a week, or in a month, or one year. For you, the job is done. So it's important to also learn, using the right tools, what kind of an impact the change has had on the customers.

And then I think democratizing operations is important. If you want the dev teams to own something, it's important that you democratize operations. A great way to do that is to build the right tools.

Another trend I've been observing is ops is moving up the stack, and then the devs are actually moving down the stack. In fact, Daily Fantasy is a great example of the developers owning their operations, and this is actually possible at scale.

And does DevOps matter? We're at a DevOps conference. Of course it matters. But I think, again, it's table stakes. People have been doing this for a long time.

I think, in my opinion, the strategic differentiator is going to be how obsessed you are with your customers and how quickly you can learn from the changes you have made in production.

And then you need more Bobs. It's your job to find them. It's your job to groom them. It's your job to remove those barriers.

So good luck with that. Thank you.