Best Practices for Availability

San Francisco 2017

The ask was almost naively complicated -- get over 100 geographically distributed SaaS engineering teams to deliver services that meet their stated SLA in production. Also, do it without a budget or dedicated headcount. No pressure, right?

To address this challenge, the team co-opted a half-dozen other teams that had demonstrated highly mature development practices and also achieved 99.99+% uptime. This pilot group helped the team to meet its goal of identifying and visualizing risks to uptime across the organization. These teams helped identify some critically important behaviors, particularly when observed in all of the pilot teams. The first deliverable turned out to be (unexpectedly) a document entitled Best Practices for Availability, which set out the foundations for what successful teams have in common. It lays out three pillars for availability: culture and ownership, mature service management, and availability-driven architecture.

But the Best Practices for Availability document is not a policy or standard; it is intended to drive awareness and inspire other engineering teams to emulate successful techniques. To meet the goal of the project, quantitative analysis was required. Working with the pilot teams, a set of tools was developed to assess the organizational maturity and availability risk associated with a service team.

One of those tools was a failure mode analysis instrument loosely based on the Six Sigma process of the same name. Service teams were asked to contemplate the likelihood and impact of problems across thirteen different problem domains. These problem domains include things like failures of an instance, database, or region as well as DDoS vulnerability and risks associated with cross-service dependencies. This allows senior management to see risk as a function of problem domain in a heat map, and that's extremely helpful when making investment decisions.

The program produced a best practices document, created engagement in and between service teams, and allowed senior management to measure risk and progress. It also provided a roadmap for how successful teams find a way to dedicate resources to availability and quality work. The tactics used by the team were also a notable outcome, as will be presented in a sidebar discussion entitled, "how to get 100 engineering teams to do something for you." The program is currently running in an ongoing state, with service teams periodically resubmitting failure mode risk information so that progress over time across the organization becomes observable.

David Owczarek, Sr. Manager, Document Cloud, Adobe

Full transcript

David Owczarek

Hi, I'm David Owczarek. I'm here from Adobe, and I want to tell you about a project that I've been involved with for the last couple of years to look at how performance in the production environment is affected by the maturity and the best practices of the teams that are doing the work.

So let me start with the drill-down that we've all been asked to give you. I think you all probably know who Adobe is. There is one thing that I would like to say about Adobe, though, which is that over the last few years, we've been quietly reinventing, or maybe not so quietly reinventing ourselves, as a services company.

And so probably there's not a lot of creatives in the audience, but if you're a Photoshop user, you now have a subscription to the Creative Cloud. And you still download Photoshop, but it's licensed online, and all of the storage is online, and all of this stuff. So the entire company has gone through this conversion. We define ourselves now in terms of clouds.

So we're going to talk about the Document Cloud today. The Experience Cloud is what actually resulted from our acquisition of Omniture many years ago and then some other companies after that as well, digital marketing.

And so all of Adobe's business now really relies on being online. And the transformation for us was to figure out how to go from being a software provider that had 18-month cycles for enormous packages that we charged lots and lots of money for, to delivering this online in smaller increments and charging small amounts for it. And the whole business changed. Finance, technology, everything has changed as a result.

The Document Cloud is a composite of a number of different applications. It's primarily two things. It's the online destination for things involving PDFs. So Acrobat and Reader and all of these things, when they have an online component, they go to Adobe's Document Cloud, and you can share and do forms and all this kind of stuff.

And then we also run one of the largest global digital signature platforms. Adobe Sign operates in five geos, highly secure. It's all part of the same portfolio packages. So this is my experience. I'm basically the service delivery manager for this Document Cloud stack.

I don't know what the next slide is, so we're going to... All right, so why am I here?

About two years ago, which was the middle of this transformation, we brought in a new executive, Abhay Parasnis, who's our CTO. And he came in, as you might imagine, surveyed the landscape and said, "We're a services company now. We need to be always on."

But there's one thing that's sort of obvious. We should be highly available. But there's another piece in here that's really important, and it's the line about Adobe's customer experience. Because one of the things that Abhay did coming in was to say, "Our customer experience is our online performance." That's it. There's features and functions that you need after that, but if we're not up and fast, we don't have a business, especially now that we've converted this enormous company into a services company.

So this is the ask, and you know how these things work in an enterprise. He's got a whole array of people that work for him. One of those people got the job of actually making that happen, and that person came to me and said, "You're going to do this."

So what do you do when somebody asks you to go out to the company that you work for, which is an enterprise, which has hundreds and hundreds of teams, and says, "Make this thing happen across all these teams"? It's really kind of a nasty problem.

And so... Oh, let me wait a moment on that if I can. I can't. Okay, so here's what happens when you actually try to do this. Maybe you all have familiarity with some of these things.

I will say that most of the teams were actually pretty interested, but it's an enterprise, and you get one of everything in an enterprise. So there were some that were not so interested. And one of the things that I'd really like to do as we talk through this story is to explain to you the ways that I countered those reactions, some of which are emotional and some of which are not, in order to get things done. Because my boss doesn't care what the teams are telling me. He just wants it done.

So how do you get all those teams to work with you?

Before I get to that, because that's another word-heavy slide, I will tell you that a lot of what I did was sold on me personally. And that's not because I feel like I have some great gift. It's because I was running a service. I was running services in the Document Cloud. So I was somebody who had to consume my own program.

So as conflicted as I was over being the person from corporate who's going to come in and tell everybody how to do things, which is what it felt like, I wanted to try to find a way to engage people and bring them along in that process.

And so I did use my personal reputation. You know me. I've run a service. I have to do this, too. I can't think of a better way. How are we going to get this done? And that's really an interesting point: how are we going to get this done?

So I did probably what any normal person would do. I found a couple of people to be a steering committee: principals, fellows, folks in the company that I really trusted and who would also be listened to by the folks that we needed to work with, and invited them into the challenge.

And it's sort of like, well, do you want to spend all your time doing maturity models and gathering scores and ranking people? And how else are you going to do it?

We decided that we would expand the model of those folks who were helping out as a steering committee to actually go out to a bunch of teams that we knew were doing really well. We have hundreds of service teams. Many of their numbers are public, so we know who the good ones are.

So we went and picked the top six, and we sat down, and we had a number of meetings where we asked them very open-ended questions, and then we shut up. We just listened. We wrote everything down. And then we went back afterwards and threw it all on a dry erase board and organized it. And the results were really interesting. We'll get into those in just a moment.

Actually, I'm going to take a sip of soda from my mug while you read this. Sorry about the noise.

There's a couple of really key points here that I had to use. The first, for me, was, despite me saying I'm selling it on my own reputation, the most important thing about getting this done was that a top-level executive said this will happen, and that there was a quote-unquote "chain of custody" back to that decision.

So if I went to a team and said, "I need you to do this," they went to their boss and said, "I can't do this. The priority's wrong." Their boss needed to say, "Well, actually, you do need to do this. Let's figure out what the priority issue is," because they had heard it from their boss.

That's the hardest part of this whole thing, because you're dealing with people who are very senior, who don't have much time, and who want to know what you're signing them up for. So the most painful and probably lengthiest part of the process was getting the message right and getting it into the hands of people, and then literally in a meeting with a bunch of VPs, having one VP hammer the commitment out of the other VPs that somebody would pay attention to me and return the information I was looking for.

So that chain, to loop it back, was absolutely vital, and it was really difficult.

The other thing I did was I tried to co-opt the teams into solving the problem with me. And again, this is the argument I made earlier. I have to consume this, too. I can't think of a better way. What do you think we should do? And if you come up with a better way, I'll do it, because I have to comply.

Ultimately, we didn't really have problems with most teams, and part of it is because of the way we did this ask. What did we actually ask for?

We had actually the things I just mentioned that I didn't want to do. We had a maturity model and we had a failure mode analysis instrument that both required real work. You've got to sit down and fill them out, and you've got to think about things. And that's hard to get people to do.

So we didn't ask them to do that. We didn't hammer that point. The important thing for us was that there was a conversation about the topics in these two instruments, particularly in the failure mode analysis.

We're trying to get people to four nines. We need you to think about what's preventing you from being at four nines. The failure mode analysis, which we will go through shortly, tries to get at that in a very general way. It's also something you have to fill out and complete, so it's an exercise.

But what the real ask was, we went to these teams and we said, "When you're done with that, have a meeting with your team, review the results, and see if you want to do anything differently. And really do it. Really think about it. Don't phone this in. We really need you to help."

And when you make a personal appeal like that, that can be very powerful because a lot of times it's just, "You've got to do this. It's run by corporate. Up there, he says he wants it," blah, blah, blah. But in this case, it's like, "No, this makes sense. We need to be at a high availability number. Help us by providing real input."

And then for anything that failed, there was the chain-of-custody loop where they'd have to do it anyway, which actually I used as an argument once or twice: "You're going to have to do this anyway at some point, so you might as well engage with me now and give me input on how to make it better."

All right. So we picked six teams. We asked them open-ended questions. We threw everything on the dry erase. What did we come up with?

This is what we came up with. There were three basic areas where we saw patterns from every single team that were important that we wanted to highlight. Except that I'm not going to talk about the last one, the availability-driven architecture, because... Well, let's talk about why I'm not going to talk about it.

The way that I didn't think was the right way to solve this problem was to write a handbook for how to build a four-nine service, given all the technologies in all of these teams.

Trying to be prescriptive about the approach, short of some kind of central technology initiative, which we have coming along the back to catch everybody, doesn't make a lot of sense. And this is something I've heard in a couple of different sessions here. We focus on outcomes and not methods, because the methods are going to vary in all these teams, but the outcome is what we really need to have happen.

And more to the point, we don't have a lot of dumb teams. I hope that you can all say the same thing, but we don't actually need help for most of the architecture. There's a few places where we might be operating at a scale that's new and that presents some risks. And if you don't have an internal team that's already done it before, that can get a little dicey. But we have really bright people.

So we don't need to tell them how to do it. We need to tell them what the outcome should be. And one of the outcomes was, you might not need to be four nines. Maybe you only need three nines because, as you all know, there's cost to four nines. And if you are retrofitting a product, that cost could be really incrementally huge.

Obviously, when you've got a new effort, it's really nice and easy to build it in the way you want. But that's not the way it works in my enterprise. We have a 35-year footprint. And in the Acrobat world, we have versions of Acrobat that are years old that we're still supporting some online calls for here or there. So we end up having a pretty big footprint.

But let's talk about the culture area, because I've heard a lot about culture. It's pretty pat now to say that engineers, devs, should be on call. But what we heard from these teams was a little bit different than that.

And it was, yes, devs being on call was a very important step for us. But more importantly, when we made that change, the benefits that happened, happened without us having to monitor the progress. They happened organically on their own.

There's this concept in DevOps of, "Tell me how you're going to measure me, and I'll tell you how I behave." Now we're measuring people differently, and the behavior is different.

Engineers come in, and they say, "I wonder what happened last night. Somebody released some code. Did that work out? We were looking for some..." And it's this natural curiosity factor that creates engagement. And our teams reported back this was one of the most transformational things.

We didn't lead this question, any of these questions. We just collected the data.

Second thing, and I want to spend a little bit of time on this because I think this is a problem we all have. There's a problem that every team has, in that you are never going to have fewer than a thousand things to do from whatever your product organization is or your security organization. The requests for features and things in your stack are probably overwhelming. At least they have been for me the entire time I've been there.

And the teams that did the best found a way to solve this problem. They found a way to deal with debt. They found a way to deal with security. They found a way to balance all of these things. And there were two basic models that I wanted to share with you for how they did that.

The first is a pretty clear-cut one. It's reserving resources. In one case, we had a team say, "Oh, we spend 50% of everything, budget, people, whatever the number is, 50% on quality and performance, and we don't expose any of that to our client teams."

Now, that obviously can cause problems, but this team made it work, basically closing down the spigot to the point where they could handle the incoming requests and achieving a new equilibrium based on that.

So this is a pretty classic engineering-driven way of handling the problem, which is, let me define my own work, and I'm going to pick up work. The part that's always difficult is how large a percentage it needs to be to actually move the needle when it comes to availability.

The other approach that we heard was co-opting product management. If product management believes that it's important, it will happen because then it comes in as a feature request. And no product team is ever going to turn you down if you say you're going to write a feature for them.

This tended to happen in the applications that were very transactional. So think about identity stacks where you're doing authentication, or an API service where you have a lot of small, easy transactions. It's a little easier then to make the argument that, well, if you're doing an identity call, it should return in tens of milliseconds and blah, blah, blah, and to get your product team engaged in that to the point where you say, "We're now no longer able to miss that objective." It becomes a product priority to replace it.

And when it's a product priority, it happens. When it's an engineering priority, it depends. At least that's what happens in product-driven companies that I've worked for.

And then this last point, we heard also consistently from all six teams. We have lots of dependencies. We have people who depend on us, and we manage them ruthlessly. And by that, I mean we disconnect them if they cause us problems. We have different ways of making the customer think everything's okay if they have a problem. We negotiate limits with them. We throttle them. All of these tools to say, "You will not bring us down if we depend on you for this or that resource." So I thought that was incredibly valuable.
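The talk does not say how these teams throttle their dependents. As one common way to enforce a negotiated limit, here is a minimal token-bucket sketch. The class name, the `rate` and `capacity` parameters, and the injectable `clock` are all illustrative assumptions, not anything described by the speaker.

```python
import time


class TokenBucket:
    """Hypothetical per-client limiter: `rate` requests per second on average,
    with bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if the caller may proceed, False if it should be throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the service would reject or queue this request
```

A service would keep one bucket per dependent client, so that a single noisy caller exhausts only its own allowance rather than the shared capacity.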

And I want to get into the second area, which is service management.

I probably won't talk a lot about this because I think everybody gets it. The important point here is that, again, when asking a pilot set of teams who were all very high-performing, these are the things that came back.

Monitoring, in particular for a couple of teams, was the first thing they said. "We monitor everything. That's the only way we feel like we have the ability to control our future." And that monitoring is across several dimensions, which you see.

The other thing that was really stunning was how much process stuff came up. We actually expected to be pulled into technical arguments, and there was not a single team that led with technology.

When we said, "How do you achieve this availability?" there was not a single team that said, "Oh, we use AWS," or, "We use Redis," or whatever. They said, in a lot of cases, "We have really good process."

And that's important because, as you all know, there's inefficiencies in all of those that cause downtime. Incident management is just speed to respond, for example.

So as much as we don't like to do process work, process work turns out to be really important. And now when I look at teams that say they have an availability problem, my first inclination is, is it process, or is it actually technology? I'm probably not going to look at technology first, because I can go into most teams and I can identify lots and lots of solvable problems that reduce downtime that don't have anything to do with their tech stack.

They have to do with the things I just told you about, like let's not release changes that are too risky, or let's not take 10 minutes to get on the phone call when there's an incident. That's downtime.

Sometimes it's hard to fathom the fact that downtime doesn't have a technical fault at the bottom of it, but often the causes of that downtime, which are usually complex, present that way.

This is all pinned on top of a secure development lifecycle, SDLC. That'll show up later in the future work, where we try to co-opt that process to allow for some of the things that we learned to go in as controls for the organization.

So what did we actually do?

We asked teams to do three things. We asked them to fill out a maturity model, which was something we had already developed, and we were just reusing it in this context. We developed a failure mode analysis loosely based on the Six Sigma process, which I'll show you just here in a second. And we asked them to fill that out, and we asked them to have the conversation. So that was what it looked like.

When we looked at the results, what we were most interested in really was the failure mode analysis. So let me show you what that looks like.

Oh, I'm going to go ahead one. Right. So we asked people to go through an exercise. We said, "Take your architecture." We did this for every service, by the way. So if you had a service family that had microservices, and there was an overarching thing, and then there were six underneath, you had to do it for all of them, unless they were so similar that the results were the same.

We said, "We have 12 domains of failure, just 12. This is really simple. We don't want it to take a lot of time, but we want you to do a predictive exercise to say, for each of these domains, what's the probability that you're going to have a failure next year? And if you need to, you can go look back at what your probability was from last year as an actual."

It's a great starting point. Most services had a run record where they knew that number. So that was a great way to anchor the conversation.

And then if something does happen, what's the best and worst case in terms of the number of outage minutes you'll take?

So just to normalize the conversation, what's an outage minute? If you were completely down for one minute, we'd call that a one-minute outage, and we'd take that off your SLA. If you were partially down, like you had some major functions unavailable, we'd prorate it by something that represented the actual customer impact, so you might get 30 seconds of outage time, just so you know what I'm talking about here.
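The proration described here can be sketched as a one-line calculation. The function name and the `impact_fraction` parameter are assumptions for illustration; the talk only says the proration "represented the actual customer impact" and does not give a formula.

```python
def outage_minutes(duration_minutes, impact_fraction=1.0):
    """Outage time charged against the SLA.

    duration_minutes: how long the incident lasted.
    impact_fraction:  assumed share of customer impact, from 0.0 (none)
                      to 1.0 (completely down).
    """
    if duration_minutes < 0:
        raise ValueError("duration_minutes must be non-negative")
    if not 0.0 <= impact_fraction <= 1.0:
        raise ValueError("impact_fraction must be between 0 and 1")
    return duration_minutes * impact_fraction


full = outage_minutes(1)          # completely down for one minute
partial = outage_minutes(1, 0.5)  # half of the customer impact for one minute
print(full, partial * 60)         # minutes, then seconds
```

With an impact fraction of 0.5, a one-minute partial outage counts as the 30 seconds mentioned in the talk.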

So we said, what's the minimum number and the maximum number if you have a problem in your region, for example, if you have a region failure? And you'll note here that in a couple of these dimensions, we said, "You can't answer zero. You can't say there's no chance this will happen."

If it's your database, a third-party vendor, an instance failure, or a zone failure, you have to assume one will happen. And the reason for that is because they have happened. It happened last year. Anybody who uses AWS knows that we have to keep solving these problems.
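One row of the exercise, together with the "you can't answer zero" rule, might be represented as follows. This is a sketch under assumptions: the field names, the domain labels, and the `validate` helper are hypothetical, and the actual form used at Adobe is not shown in the talk.

```python
from dataclasses import dataclass
from typing import List

# Domains where, per the talk, a zero probability is not an accepted answer.
# The string labels themselves are illustrative.
NONZERO_DOMAINS = frozenset({"database", "third-party vendor", "instance", "zone"})


@dataclass(frozen=True)
class FailureModeEntry:
    domain: str
    probability: float          # estimated chance of a failure next year, 0..1
    best_case_minutes: float    # minimum outage minutes if it happens
    worst_case_minutes: float   # maximum outage minutes if it happens


def validate(entry: FailureModeEntry) -> List[str]:
    """Return a list of problems with one answer; an empty list means it is acceptable."""
    problems = []
    if not 0.0 <= entry.probability <= 1.0:
        problems.append("probability must be between 0 and 1")
    if entry.domain in NONZERO_DOMAINS and entry.probability == 0.0:
        problems.append(f"{entry.domain}: a zero probability is not allowed")
    if entry.best_case_minutes < 0 or entry.best_case_minutes > entry.worst_case_minutes:
        problems.append("best case must be non-negative and no worse than worst case")
    return problems
```

Returning a list of problems rather than raising on the first one lets a team see everything wrong with a submission in a single pass.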

So that turns into a form. Oh, no. Back. I'm back. This is a little bit simplified because I don't have best and worst case in here. But this is what the exercise we'd ask people to do might look like, simplified, as they filled it out.

And this 30, I want to just comment on. For some things, it's hard to imagine all of the bad things that can happen, but at some level, you could have a bad enough thing where you just say, "Eh, I have to invoke my disaster recovery plan," if you have one. We do. And it defines an RTO. And so the worst case is the RTO. So you shouldn't have any answers that are worse than the RTO if you have a disaster recovery plan.
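The rule that no answer should be worse than the RTO amounts to a cap on the worst-case estimate. A minimal sketch, where the function name and the choice of `None` for "no disaster recovery plan" are assumptions:

```python
from typing import Optional


def capped_worst_case(worst_case_minutes: float, rto_minutes: Optional[float]) -> float:
    """Limit a worst-case estimate to the recovery time objective (RTO).

    rto_minutes is None when the service has no disaster recovery plan,
    in which case the estimate is returned unchanged.
    """
    if rto_minutes is None:
        return worst_case_minutes
    return min(worst_case_minutes, rto_minutes)
```

The values used to exercise this function would be placeholders; the talk does not state Adobe's actual RTO figures beyond the number shown on the slide.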

So I'm going to show you a plot of the results, but this isn't real data. So Adobe would not let me get out of the building if I actually published the performance data.

Now, this is a coarse instrument. So this is one visualization of the data. It's a high-low chart. So what we did was we took each domain. We stripped out the team information on this one. Say, where are the problem areas for us as a company? Because this is aggregate-level information.

My view on this, and those of you who are well steeped in math can beat me up later, this is a coarse instrument. I'm not really going to take any individual team scores to the bank. What I'm looking for is the overall trends in the data amongst all the teams.

So in this example, region failure is clearly the one that is either a problem or that everybody's worried about, because they filled out terrible numbers for it. And these are the average, by the way, best and worst case across all the teams. That's why they don't hang so much on the bottom, because many teams have no problem with many of these domains, and their bars are really small.
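The aggregation behind a high-low chart like this one can be sketched as a per-domain average of the best and worst cases, with team names dropped. The row layout and the sample numbers below are made up for illustration, just as the plotted data in the talk is not real.

```python
from collections import defaultdict
from statistics import mean


def high_low_by_domain(rows):
    """rows: iterable of (team, domain, best_case_minutes, worst_case_minutes).

    Returns {domain: (average best case, average worst case)} across all teams.
    Team identity is discarded, so the result is aggregate-level only.
    """
    best = defaultdict(list)
    worst = defaultdict(list)
    for _team, domain, best_case, worst_case in rows:
        best[domain].append(best_case)
        worst[domain].append(worst_case)
    return {domain: (mean(best[domain]), mean(worst[domain])) for domain in best}


# Invented sample answers from two hypothetical teams.
sample = [
    ("team-a", "region", 60, 240),
    ("team-b", "region", 30, 120),
    ("team-a", "instance", 0, 2),
    ("team-b", "instance", 0, 0),
]
print(high_low_by_domain(sample))
```

Teams that report no problem in a domain still count toward the average, which is why the bars in the chart lift off the bottom less than any single worried team's answer would suggest.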

But across all teams, as an average for the high and low scores, you get something that starts to separate. And there's two ways you can draw lines in this chart to make it really valuable. I don't know why I didn't, but I will show you what they are.

The first is that you could draw a line right in here. That's four nines. So any dot that's above that, interesting right off the bat.
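Where the four-nines line falls follows from simple arithmetic, assuming the chart is in outage minutes per year (the exercise asks about failures "next year"). 99.99% availability over a 365-day year leaves about 52.56 minutes of downtime. The function names and the sample values below are illustrative assumptions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability_percent):
    """Annual downtime allowed by an availability target such as 99.99."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


def domains_above_line(worst_case_by_domain, availability_percent=99.99):
    """Domains whose worst-case outage minutes exceed the annual budget."""
    budget = downtime_budget_minutes(availability_percent)
    return sorted(d for d, minutes in worst_case_by_domain.items() if minutes > budget)


# Invented averages, for illustration only.
averages = {"region": 180.0, "instance": 1.0, "database": 55.0}
print(round(downtime_budget_minutes(99.99), 2))
print(domains_above_line(averages))
```

A single domain above the line is enough, on its own, to use up a four-nines budget for the year if its worst case occurs.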

And the second thing is that you can draw a line here and say, "This is all the stuff we're going to worry about. This is the stuff that we do well." Sorry about that. In our real data, there was a really nice break in the middle of the curve that did exactly that, that said most teams are really good at the eight things to the left of this. To the left? To the right of this. But these four things on the left, that's what we need to worry about.

That's really important information from a corporate perspective because it really tells you where your larger scale issues are, which we thought was really important.

Now, there's a number of other visualizations that we did on this data. We could obviously look at a single team's scores. But the other interesting view is to scatter this. And I have not had a chance to develop that chart, but there's a density of scores in each of these dots. And it's kind of important to know what that looks like in terms of... Because these are averages, so a couple of numbers can really push where things fall on the chart.

So where does that leave us going forward?

We actually have more future work than this. I added to this slide, but I couldn't lug my laptop down here.

We have services' SLAs recorded centrally in a dashboard where anybody can look at them. So there wasn't any mystery about who had what number in terms of availability. What we have not had a chance to do is to cross-check the maturity and failure scores with the actual performance to see how that data illuminates where there might be areas of opportunity.

One thing that was very clear is that there were a few cases where services came on board and they weren't enrolled in this central dashboard. That became mandatory. So that was one really good result of this.

Every service has to subscribe, has to enroll, and has to declare an SLA. And the SLA may be three nines or four nines. I think we actually offer four bins: three nines, four nines, two nines and a five, and then everything else. And you don't really want to be in the everything else category.

The thing that we really want to do to make it stick, though, is to put some version of the failure mode analysis into our software development lifecycle. And by that, I mean that we want some kind of check before people go to production that they've thought about this stuff. Most teams have, but maybe not completely. And that's pretty important to us.

So we did that. And going forward, we'll do more visualizations, we'll work with the SDLC, and then actually we're going to probably move on from there. We will continue to look at these scores, but it's a pretty complicated exercise to do this.

One of the things that I didn't realize until I had done it was that when you have an enterprise and you have that many teams, you don't have fresh data. By the time you get through all the teams, the first teams have moved on, and they're different.

And so that was one of the big learnings, is that we could baseline it. But to do it on an ongoing basis, we'd have to have a lot more automation, we'd have to have a lot better way. And that's why we focused on let's just make sure people are thinking about this. Because remember, the most important part of this was the conversation. It wasn't the instrument.

So how do we make sure the conversation happens? Let's stick it in our SDLC, and let's stick it early enough in that it's not the week before we're going to prod, obviously.

So that's a description of my journey. I hope you enjoyed it. If you have any questions, I'll certainly hang around afterwards. Thank you very much for your time.