Launching MSN on Azure: A Story of the 76 Point Checklist

San Francisco 2015

Get specific and learn twelve categories and 76 points to make any cloud service successful. Cloud services on commodity hardware are less resilient and fail more often, so it takes more work to build a rugged and responsive system. This checklist is the recipe that will turn any dev team into DevOps superstars.

As a bonus, hear firsthand stories of moving Microsoft’s MSN service and its 450 million users to Azure. Stories that will make you laugh and cry!

Full transcript

Eric Passmore

I want to start out with a little bit of a story.

I had a drain that broke in my home, and the water was leaking all over, so I had to call a plumber. And I'm talking to the plumber, and he's like, "My cousin builds these little go-karts, but they're super powerful. It's like one horsepower per pound," which is a lot of horsepower. A super-fast motorcycle is one horsepower for three pounds.

So he's like, "I'm taking this little race car that my cousin built out on the track because I've got to warm up the tires, back and forth. And there's a stand, and there's a bunch of girls in the stand, so I think I'm going to impress them."

So he's like, "I slam on the gas, and all of a sudden, the race car starts spinning around in circles, spinning." He's like, "Oh my God, what am I doing now?" And then he remembers what his cousin told him. And he said, "If you ever get into trouble, point the wheels in the direction you want to go and punch it."

So he points the wheels in the direction he wants to go and punches it, and it just straightens out.

So that's an example of a checklist. Is that a success or a failure? I don't know. I'm tough. I would say it's a failure. You shouldn't be doing donuts in the middle of the track. It seems unsafe.

But that's an example of a checklist. And checklists, when you have some emergency, there's a sense of urgency to it, that makes sense. But we also use checklists for preemptive stuff, too.

But that's the end of my story.

How about this one? Both at the same time. Da, da, da. Wait one second, hang on. How about now? Wow. That's amazing.

All right. So our story starts with a business that wants to thrive and grow, which is MSN, which is the team that I was leading. And so we started working on a new version of the MSN portal.

I'm going to go pretty quickly through these slides because you can go through them pretty quickly, and they're fun.

So we were creating this premier content. We got a new look. Best content from all these publishers from around our world, huge diversity, great images, stock quotes, sports scores, weather.

And then MSN's not tiny. It's the number one portal in 26 different markets. It's got 450 million unique users worldwide, monthly. Eighteen-plus billion page views, which turns out to be about 200,000 events per second. And then it's a billion-dollar-a-year business.

And the team wasn't small either. Four hundred seventy-five people located in Redmond and in Vancouver, Canada, and in Hyderabad, and Ireland, too. All over the world. This is the team that's going to help us get the business to the next level.

So that was the beginning of our journey. But what you have to understand is where we came from. We had this publishing system that was really difficult to use, and it was like we were praying to old gods.

There's an incident at night, and it was resolved, and I go in the next morning, I talk to the person who was working on the incident. I'm like, "What did you do?" He's like, "I rebooted the server." I'm like, "Why'd you do that?" He's like, "I don't know. That's just what we do."

He's putting the altars up there like, "Please, take it, and go away. No more pain."

We had this fear of queue lengths. So the queue lengths would get high, and we're like, "Oh no, what should we do? Maybe we should reboot." And then we had a bunch of single points of failure. So we had some problems with the old publishing system.

And the reason that we had problems with the publishing system wasn't that anybody had done a bad job. It's just that over 10 years, they kept trying to rebuild it, and either they over-engineered it, overreached, or there was a reorg, which killed the project. And so for this whole 10-year period, the publishing system that was running and supporting the business was considered legacy, as in, "We're not going to fix it."

So our business goals required a new tech stack because the old one, it was such a deep and dark hole that if we incrementally fixed it, it would've taken forever. It was actually far faster to rebuild a bunch of stuff.

And so we wanted to build it. If we're going to build something new, like, hey, let's do it on the latest stuff. We wanted a lower cost of ownership. We wanted something more scalable and reliable, but we also wanted to ride the wave of Azure. Azure keeps improving rapidly. So that was a great opportunity to do that.

Okay. Now it gets interesting. And I couldn't make this up if I tried. Well, actually, it gets even more interesting.

But there are lots of opportunities, lots of risks. So first, we needed a whole new data center infrastructure worldwide, and we needed to make copies available, because if one data center goes down, we want the other ones to serve it. That's new.

We have integration risk: all these new services coming together. We needed a telemetry pipeline, and Azure really didn't have one, so we had to co-develop with Azure a whole new telemetry pipeline and a whole bunch of visualization stuff. And then for some gaps, we took some of our logs, put them in Elasticsearch, and used Kibana to visualize them.

Wow, that's how good of a speaker I am. Watch out.

We were also doing things at scale. These 200,000 transactions per second, ouch. So we needed a storage service. We needed to aggregate them together because one account wasn't enough. We needed a new document cache, this new global topology with new edge nodes.

But also, we had new processes. We built all this new stuff. There's new ways of doing things, hopefully better, but they're different, right? New deployment processes, coordination across teams, brand-new tools.

Then it gets crazy. Absolutely crazy. The decision was to flip over all the traffic, right? All the traffic on a single day, right? All regions, all markets, all users. Billions of page views, probably about 200 million users at once.

And I went to my boss and I'm like, "Hey, you know that if I do this, this will be the most amount of traffic that I flip over in my entire career. If I go to Facebook, someplace much bigger, I'll never flip over this much stuff at once because they don't do it that way."

They heard me, so we got a slightly better plan. We got 10% to launch for a month and then 90%. That's better.

So I'm kind of freaked out at this point, right? Failure not an option, right? I'm like, "Well, I guess..." I was thinking at this point, if this works, great. If it doesn't, I guess I'll find a new job.

So I thought I knew what to do, but I was so wrong. Very wrong.

This wasn't my first goat rodeo. I had worked at AOL before that, where I'd rebuilt the AOL homepage, and we rebuilt it all on Java, and we put it in a new stack, and we did that. And that was pretty big, too.

So I'm like, "Hey, I'm just going to take the process that I used there, and I'm going to use it at Microsoft."

So this is what I call my standard method. First, I document those first-level dependencies. I'll explain what a first-level dependency is.

I had a great conversation with Gene Kim. He's like, "Eric, what is a first-level dependency?" I'm like, "That's a great question." Right? It's like I'm Admiral Rickover, right? Like, I don't understand, right? It's great.

Do failure mode analysis, right? What could possibly go wrong, right? Let's imagine it. And then create a health model. If something does go wrong, how are we going to fix it?

Sounds reasonable, right?

So this is a first-level dependency, right? Your application's the one on top, and then all the stuff below it is all the things you depend on, right? So it could be a database, could be a piece of open-source library, it could be something.

But the first-level dependencies are important because they're a boundary, and boundaries deserve tests, right?

So document dependencies, right? Seems pretty simple. Inventory just the first-level ones, just the ones that are right below you. Ensure that you have an SLA, right? And create a high-level architecture so we have a picture. Pictures are good.

People were working really hard. They were too busy, right? It was seen as extra homework. Like, "Eric, I'm building this service. Why are you asking me to make extra pictures?" Right? And the team didn't know how the information would be used. And also, to be honest, there was a lack of trust. Like, "We're going to give you this stuff. What are you going to do with it? Just give us more work to do," right?

Then the failure mode analysis, right, which we tried to keep going on, too. You brainstorm failures. You just brainstorm them, right? And then you score them, right? So the bigger the impact, the bigger the score. And then you score the frequency, right? Daily, weekly, monthly. How often do you think it's going to happen?

Data center goes down, like, every 18 months, but that's a pretty big impact. So infrequent, but big impact. You take those and you multiply them together, and then you sort your list by the score. So the big ones are at the top.
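The scoring step described above (impact times frequency, then sort) can be sketched in a few lines of Python. This is only an illustration: the 1 to 10 impact scale, the per-year frequency unit, and every number except the "every 18 months" data center estimate are assumptions, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    impact: int        # assumed scale: 1 (cosmetic) to 10 (site down)
    per_year: float    # estimated occurrences per year

    @property
    def score(self) -> float:
        # Bigger impact and higher frequency both push the score up.
        return self.impact * self.per_year


def rank(modes):
    """Sort failure modes so the highest scores come first."""
    return sorted(modes, key=lambda m: m.score, reverse=True)


modes = [
    # "Every 18 months" is 12/18 of an occurrence per year.
    FailureMode("data center outage", impact=10, per_year=12 / 18),
    FailureMode("corrupted article format", impact=4, per_year=12),
    FailureMode("backhoe cuts inter-DC fiber", impact=9, per_year=0.1),
]

ranked = rank(modes)
for mode in ranked:
    print(f"{mode.score:6.1f}  {mode.name}")
```

With these made-up numbers, the frequent medium-impact failure lands at the top and the backhoe scenario at the bottom, which is the kind of reordering the exercise is meant to produce.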

This was torturous for people, right? And they had a tendency to focus on the rare and impactful events. They're like, "What happens when the backhoe rips up the fiber between the data centers? What are we supposed to do then?" I was like, "But how often do you think that would happen," right?

And they had an exhaustive list of data-dependent bugs, too, that kept coming up. Like, "Yeah, if we get the wrong article format, and it's corrupted, the whole website would look horrible." I was like, "But how about brainstorming other ones?" Right?

And they were only finding the pain that they knew about, and they weren't actually looking broader and seeing the whole playing field.

All right. So what do you think happened with that health model? Well, I don't know, because we never got here, right? We didn't have a set of mitigations, right? We didn't associate them with a failure mode. Blah, blah, blah, blah, blah. It didn't work.

All right. So I got 90% of all the traffic, right? Like, couple hundred million users, and they're all going to be flipped over in one day. And basically, I don't have a plan, right? It was just panic, really, on my part.

I think I just went home for a while, and I was just like, "Ah, man, I don't know what to do." Right? It was pretty miserable.

So I just said, "You know what? I'm just going to tell people what to do," right? I was kind of beside myself, and I figured out I need something new. The old way wasn't working.

I want everybody to be awesome, and they seem to hate the theory, so I'm just going to give them practical steps. Basically, we're going to race ourselves into shape, right? I want to do a marathon. Great. Run one. Oh, I couldn't complete it. That's okay. We'll run another one next week, right?

I'm feeling really bad because I don't like to tell people what to do because, well, I don't like checklists that much. Any of you guys know why I don't like checklists that much?

"Error-prone." Error-prone, yeah. Anything else? "Inflexible." Inflexible. That's all you're going to do. It's all you're going to do. "Miss something." You might miss something. Right. "People game what's on the checklist. They say they've done it; they haven't actually done it." Yeah. "Status reports lie, right?" Yeah.

So there's some severe problems with checklists, right?

So we were specific. We created four categories. We had 24 groupings, and we had 76 checklist items. Right?

So this is what I really considered to be a failure, right? Because I would've much rather had a conversation. I would much rather have had people talking about what those failure modes were and how to mitigate them, and had us continue to improve. A "see a problem, fix a problem" sort of mentality has grown out of that when I've done it in the past, and now I'm just basically shooting from the hip. Right? This was just a guess.

Four categories we came up with were pre-release, 13 of the 76. Monitoring, 29. Deployment, 11. And mitigation, 23 of the 76. So it's kind of evenly distributed on all this stuff.

I'm going to show you what some of these look like. I don't have time to show all the checklist items because we only have 25 minutes. But if you send me a message on LinkedIn or Twitter, I'm happy to share the whole list with you.

So pre-release. For our partners, like weather or stock quotes, we need some quality gates for them, right? For their services and their content. Right? So when Stats is like, "Hey, I'm going to give you real-time basketball game scores and we're going to change our API," we're like, "Whoa, slow down. Go through our quality gate first. Make sure the new method works." Right?
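One way to picture a partner quality gate is a check that a partner's payload still has the shape the page expects before a changed API is accepted. The sketch below is a guess at what such a gate might look like; the field names and the `gate_errors` helper are hypothetical and do not come from the talk.

```python
# Expected shape of a partner's live-score payload (hypothetical fields).
REQUIRED_FIELDS = {
    "home_team": str,
    "away_team": str,
    "home_score": int,
    "away_score": int,
}


def gate_errors(payload: dict) -> list:
    """Return a list of problems; an empty list means the payload passes."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected_type):
            errors.append(f"{name}: expected {expected_type.__name__}")
    return errors


current_api = {"home_team": "Team A", "away_team": "Team B",
               "home_score": 101, "away_score": 98}
# A changed API that renames one field and turns a score into a string.
changed_api = {"home": "Team A", "away_team": "Team B",
               "home_score": "101", "away_score": 98}

print(gate_errors(current_api))   # passes the gate
print(gate_errors(changed_api))   # lists what would break the page
```

Running a check like this automatically on every partner change is what keeps the gate from becoming a manual step.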

Security, this is what we put down: URL manipulation, SQL injection, XSS attacks. Why did we put this down? Because there was a team outside that was doing scans of all our websites, and this is what they were finding. So we were just reacting to the reports that we'd been getting for the last six months.
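Two of the findings named above, SQL injection and XSS, have well-known standard-library defenses in Python: parameterized queries and HTML escaping. The sketch below uses an in-memory SQLite table with a made-up `articles` schema purely for illustration; it is not how MSN's services were built.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO articles (title) VALUES (?)", ("Weather today",))


def find_articles(connection, title):
    # The ? placeholder keeps user input as data, never as SQL text.
    query = "SELECT id, title FROM articles WHERE title = ?"
    return connection.execute(query, (title,)).fetchall()


def render_title(title):
    # Escaping turns markup characters into harmless entities.
    return f"<h1>{html.escape(title)}</h1>"


hostile_query = "x' OR '1'='1"
hostile_markup = "<script>alert(1)</script>"

print(find_articles(conn, "Weather today"))
print(find_articles(conn, hostile_query))
print(render_title(hostile_markup))
```

A pre-release gate could run hostile inputs like these against every endpoint and fail the build if any of them changes the query result or reaches the page unescaped.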

Automated process: all pre-deployment processes are automated. So all these things like security and the partner quality gates, we want those to be automated, not a manual step.

That's an example of pre-release.

Here's an example of monitoring. A stack trace, right? You should log them, right? Standardize your logging, all your inbound and outbound calls, including the response headers because we're almost all HTTP. Length of request and response, put those in.

Correlation. Services should maintain and respect an activity ID, so that when you get an ID that comes in from the top, all the other layers below it have the same ID. So now when my boss comes to me and says, "Hey, I loaded this page and it just failed," we can look through the whole stack and say which item in the stack failed because we got all these different logs that are tied together with this ID.

Services have to propagate the activity ID. They have to maintain logging for the duration of the request. So you can't say, "I'm tired now, I'm just going to stop."
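The activity ID idea can be sketched with two toy services: the top layer reuses an inbound ID or creates one, then passes the same ID on every outbound call. The header name `X-Activity-Id` and the service functions are assumptions made up for this example.

```python
import uuid

ACTIVITY_HEADER = "X-Activity-Id"  # hypothetical header name


def activity_id_from(headers: dict) -> str:
    """Reuse the caller's ID if present, otherwise start a new one."""
    return headers.get(ACTIVITY_HEADER) or uuid.uuid4().hex


def log(records, service, activity_id, message):
    records.append({"service": service, "activity_id": activity_id,
                    "message": message})


def weather_service(headers, records):
    activity_id = activity_id_from(headers)
    log(records, "weather", activity_id, "forecast lookup failed")
    return {"status": 503}


def page_service(headers, records):
    activity_id = activity_id_from(headers)
    log(records, "page", activity_id, "request received")
    # Propagate the same ID on the outbound call.
    response = weather_service({ACTIVITY_HEADER: activity_id}, records)
    log(records, "page", activity_id, f"weather returned {response['status']}")
    return activity_id


records = []
failed_request_id = page_service({}, records)
# One filter pulls every layer's log lines for the failed page load.
trace = [r for r in records if r["activity_id"] == failed_request_id]
for entry in trace:
    print(entry["service"], entry["message"])
```

Because every layer logs with the same ID, finding which item in the stack failed becomes a single filter over the logs.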

Log verbosity, right? That should go up and down, right? So this is the stuff that we had on the checklist just for logging.
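Turning verbosity up and down at runtime is something Python's standard `logging` module supports directly through `Logger.setLevel`. The sketch below is a minimal illustration; the logger name and the `set_verbosity` helper are invented for the example.

```python
import logging


class ListHandler(logging.Handler):
    """Collects messages in memory so the effect is easy to see."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("msn.example")  # hypothetical logger name
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)


def set_verbosity(level_name: str) -> None:
    """Change the level without restarting the service."""
    logger.setLevel(getattr(logging, level_name.upper()))


set_verbosity("warning")
logger.debug("cache miss for key=abc")   # dropped at WARNING
logger.warning("upstream slow")          # kept
set_verbosity("debug")                   # turn verbosity up during an incident
logger.debug("cache miss for key=def")   # kept now

print(handler.messages)
```

In a real service the level change would come from configuration or an admin endpoint rather than a direct function call.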

Deployment. Automated release process. That could cover a lot, right? That's what we put. Staged deployments. Deploy without service degradation, that was important. Right? So if you deploy and your website's really slow and you're getting errors, then you didn't do it right.

Patching, right? Expect it. It's going to happen. Monitor your deployments. Being able to roll back to your last known good state. And it can't take any longer than your rollouts. And then data: you should be able to roll back your data, too.
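The two rollback rules above (return to the last known good state, and do it no slower than a rollout) could be tracked with something as small as the sketch below. The `ReleaseHistory` class, the version names, and the timing numbers are all hypothetical.

```python
class ReleaseHistory:
    """Tracks the current version and which versions passed post-deploy checks."""

    def __init__(self):
        self.current = None
        self.known_good = []

    def deploy(self, version: str) -> None:
        self.current = version

    def mark_good(self) -> None:
        if self.current is not None and self.current not in self.known_good:
            self.known_good.append(self.current)

    def roll_back(self) -> str:
        """Return to the most recent known-good version other than the current one."""
        candidates = [v for v in self.known_good if v != self.current]
        if not candidates:
            raise RuntimeError("no known-good version to roll back to")
        self.current = candidates[-1]
        return self.current


def rollback_is_fast_enough(rollout_seconds: float, rollback_seconds: float) -> bool:
    # Checklist rule: a rollback cannot take longer than the rollout did.
    return rollback_seconds <= rollout_seconds


history = ReleaseHistory()
history.deploy("v41")
history.mark_good()
history.deploy("v42")
history.mark_good()
history.deploy("v43")          # errors spike, so this one is never marked good
restored = history.roll_back()
print(restored)
```

Data rollback is a separate and harder problem that this sketch does not attempt.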

Last one, mitigation. Auto retry. Set your SLAs downstream, mutual agreement. Try to degrade gracefully. Like, "I am not getting the weather stuff back. I'm just going to stop serving up weather right now." Right? I'm going to just take the module off the page and I'll retry. I'm going to put weather in jail, and a minute later I'm going to try again.
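Putting weather "in jail" for a minute resembles a simple circuit breaker: after a failure, stop calling the dependency, serve the page without that module, and try again once the cooldown has passed. The `Jail` class below is a sketch under that assumption; the 60-second cooldown mirrors the "a minute later" in the talk, and everything else is invented for illustration.

```python
import time


class Jail:
    """Skip a failing dependency for `cooldown` seconds, then try again."""

    def __init__(self, cooldown=60.0, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock
        self.jailed_until = float("-inf")

    def call(self, fetch, fallback=None):
        if self.clock() < self.jailed_until:
            return fallback            # still in jail: do not even try
        try:
            return fetch()
        except Exception:
            self.jailed_until = self.clock() + self.cooldown
            return fallback            # degrade: drop the module from the page


# A fake clock makes the example deterministic.
now = [0.0]
weather = Jail(cooldown=60.0, clock=lambda: now[0])
attempts = []


def broken():
    attempts.append("try")
    raise TimeoutError("weather service did not answer")


def healthy():
    attempts.append("try")
    return "72F and sunny"


first = weather.call(broken)     # fails, weather goes to jail
now[0] = 30.0
second = weather.call(healthy)   # still jailed, module stays off the page
now[0] = 61.0
third = weather.call(healthy)    # cooldown over, retry succeeds
print(first, second, third, len(attempts))
```

A `fallback` of `None` here stands for "leave the weather module off the page"; a real page might render a cached value instead.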

Configure VIP health. Automatically remove the VM when the service and the node is unhealthy.
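The VIP health item can be pictured as a pool that takes a VM out of rotation after repeated failed health probes and puts it back when a probe succeeds. This is a toy model, not Azure's load balancer API; the `Pool` class and the threshold of three consecutive failures are assumptions.

```python
class Pool:
    """Keeps a VM in rotation until it fails several probes in a row."""

    def __init__(self, nodes, max_failures=3):
        self.max_failures = max_failures
        self.failures = {node: 0 for node in nodes}

    def record_probe(self, node, healthy):
        # A healthy probe resets the count; an unhealthy one increments it.
        self.failures[node] = 0 if healthy else self.failures[node] + 1

    def in_rotation(self):
        return [node for node, count in self.failures.items()
                if count < self.max_failures]


pool = Pool(["vm-1", "vm-2", "vm-3"])
for _ in range(3):
    pool.record_probe("vm-2", healthy=False)
after_failures = pool.in_rotation()

pool.record_probe("vm-2", healthy=True)
after_recovery = pool.in_rotation()

print(after_failures)
print(after_recovery)
```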

So this is the type of stuff that we're coming up with. It's pretty explicit, right?

So this is how we started. We just took seven items and we assigned them to everybody. I think we had 10 teams, and so we had a PM open 7 times 10, 70 tickets.

And it was these, right? Alert using raw counters, right? Alert using synthetic probes. Fail your application, and validate your alerts. Fail central data services, and validate those alerts. In other words, take down the database and make sure you get alerts. That was it.

Ensure errors are in the logs. Because guess how many teams put their errors in the test environment and not in the production environment? Guess how many teams alerted into the test alert system and not the production alert system? It was about 50/50, right? They're really busy, right?
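The "fail it and validate your alerts" items amount to a test: inject a failure, then check that an alert actually arrived in the production alert system. The sketch below models that with in-memory alert sinks; the `AlertSink` class, the 5% threshold, and the team wiring are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AlertSink:
    environment: str
    alerts: list = field(default_factory=list)


def check_error_rate(errors, requests, threshold, sink, service):
    """Raw-counter alert rule: fire when the error rate passes the threshold."""
    if requests and errors / requests > threshold:
        sink.alerts.append(f"{service}: error rate {errors / requests:.0%}")


def alert_reached(sink, inject_failure) -> bool:
    """Break something, then confirm a new alert landed in the given sink."""
    before = len(sink.alerts)
    inject_failure()
    return len(sink.alerts) > before


production = AlertSink("production")
test = AlertSink("test")


def break_database_for(sink):
    # Simulate the outage: every request fails, then the alert rule runs.
    return lambda: check_error_rate(errors=100, requests=100, threshold=0.05,
                                    sink=sink, service="database")


team_a_ok = alert_reached(production, break_database_for(production))
# Team B's alert rule still points at the test alert system.
team_b_ok = alert_reached(production, break_database_for(test))
print(team_a_ok, team_b_ok, test.alerts)
```

The second case is the one the talk describes: the alert fires, but into the wrong system, and only a failure injection test makes that visible.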

And then take the basic page for training. We had a little wiki you had to go through and read the things on there. That's what we did.

How do you guys think this worked? Any guesses? Raise your hand if you think the teams were happy to get this new list of seven items. Raise your hand if you think the teams were unhappy to get this list of seven items. Yes, they were miserable. Right?

And we had to doggedly pursue them with status reports that we sent out all the time.

The good news is, remember when I told you how I told my boss that this is the most amount of traffic that I'd ever launch, ever? So he was sufficiently scared that he said, "Hey, you should get these seven items done." Right? Coming from a place of fear and loathing.

This is DevOps not. It doesn't feel good. Didn't feel good for me, didn't feel good for them.

So we assigned them, we had a due date, and they reported their status every week. I flipped over to this checklist mode when I had about three and a half months to go. So T-minus three and a half months. So every week counts. You miss a week, you're like, "Don't know if I'm going to make it."

And then we assigned the rest. So then we took all the other items, and we assigned them out to everybody, which is one heroic PM effort to open a lot of tickets. And the teams picked the ones that they wanted to do, and we didn't know which ones they did or didn't. We just tracked the total completion, and we said, "Hey, could you try to get to 95%?"

Like, "Which 95?"

"Just pick them."

And we organized knowledge-sharing sessions to try to break the logjams. We found out, for example, that the teams had put all their logs into one big table in a NoSQL store. And we're like, "Well, wouldn't you want to break them up a little bit? Because it'd be really hard to view all that stuff at once. It would take forever. Like, you do a query, you could wait there for 15 minutes."
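One common way to "break them up a little bit" in a key-based NoSQL store is to partition logs by service and time bucket, so an incident query reads one small partition instead of scanning everything. The sketch below uses a plain dictionary as a stand-in for the store; the key format and the hourly bucket are assumptions, not what the MSN teams actually chose.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(service: str, timestamp: datetime) -> str:
    """Group log rows by service and hour."""
    return f"{service}/{timestamp:%Y-%m-%dT%H}"


store = defaultdict(list)   # stand-in for the NoSQL table


def write_log(service, timestamp, message):
    store[partition_key(service, timestamp)].append((timestamp, message))


def read_logs(service, hour):
    """Read only the partition for one service and one hour."""
    return [message for _, message in store.get(partition_key(service, hour), [])]


# Arbitrary example timestamps.
incident_time = datetime(2015, 10, 20, 14, 5, tzinfo=timezone.utc)
write_log("weather", incident_time, "upstream timeout")
write_log("sports", incident_time, "score feed ok")
write_log("weather", datetime(2015, 10, 20, 15, 1, tzinfo=timezone.utc), "recovered")

incident_logs = read_logs("weather", incident_time)
print(partition_key("weather", incident_time), incident_logs)
```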

And I meet with them, and there was a meeting, probably about this many people. And they're like, "Can somebody else just do it for us? It seems like something that we could centralize for the logging."

And I was like, "Oh, no. But here's the deal, guys. If you guys don't do it, then when you have to change your logging, you won't know what to do. When you have a production incident and you want to see your logs, you won't know where to look. So if you do it, you will know these things."

Guess what happened? They self-organized and decided to elect somebody to do it centrally. I couldn't stop them if I wanted to. And guess what happened when we had a production incident? They didn't know where the logs were.

There are five top takeaways here. But if I were to bring it down to one single item, I would say do failure injection tests at a coarse granularity, which means blow up whole data centers if you can. That tested so much. If there's one takeaway, that would be it.

Schedule time to blow things up. It took us six tries to get the data center failover to work correctly, and we didn't change a lot of the technology. We changed our thresholds, we changed our processes, we changed our handoff, but after that, we had it nailed. So the whole process was pretty much automated, and we handed it off to other teams.

And so now the team in India, when they had a problem, they pretty much just had a script that they ran, and it drained one data center and sent all the traffic over to the other one for writes. The read stuff was automatically failed over.
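The core of a drain script like the one described is small: set one data center's share of traffic to zero and spread it across the rest. The sketch below shows only that weight calculation; the data center names and the `drain` function are hypothetical, and the real script's mechanics are not described in the talk.

```python
def drain(weights: dict, datacenter: str) -> dict:
    """Return new traffic weights with `datacenter` drained to zero.

    The drained share is redistributed across the remaining data centers
    in proportion to their current weights.
    """
    remaining = {dc: w for dc, w in weights.items()
                 if dc != datacenter and w > 0}
    if not remaining:
        raise RuntimeError("cannot drain the last serving data center")
    total = sum(remaining.values())
    return {dc: (remaining[dc] / total if dc in remaining else 0.0)
            for dc in weights}


write_traffic = {"dc-west": 0.5, "dc-east": 0.5}
after_drain = drain(write_traffic, "dc-west")
print(after_drain)
```

Refusing to drain the last serving data center is a design choice added here for safety; the thresholds and handoff steps the team tuned over six tries would sit around a core like this.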

And so that was a success, but it took us basically six marathons to get it right. And it was tough to do that, but that would be the one takeaway that I have.

Also, when we did the coarse-grain failure injection, I'm like, "Hey, we just blew up the database. Did you get any alerts?" Again, alerts went to the test infrastructure.

"Oh, sorry."

"Yes, fix it."

And logs went to the wrong place. So that'd be my one takeaway: if you can do it, do the coarse-grain failure injection testing just to check the basics. You learn so much.

So what am I looking for your help with? So we talked about checklists. We talked about how bad they are. And even though they're bad, I think there's a group of people in your organizations, when you're trying to scale, that really do well with checklists.

If you get something working, and you know how it's supposed to work, and you got your prototype, and you're trying to scale it out to the rest of the org, you might consider having a checklist to give them. And they'll just be actually pretty happy because they're usually late adopters, and they're like, "I'm not going to do this until everybody else has done it, and I'm not going to do it until you tell me exactly what to do."

But the question is, if I want to reach everybody, I want to blend together my standard method, which is really a systematic way of thinking, with something that's a lot more narrow, like a checklist. How the heck can I do that?

I had some ideas that I put in here, and I talked to other people around the conference that are pretty smart, and they basically hated my ideas. So that's all right. We got more work to do. I wouldn't say they hated them, but they were like, "What? Why?"

So ladder up. Create a maturity model from the checklist. And people are like, "Oh, but we hate maturity models," because what happens is when people create a maturity model, they never put themselves in the lowest bar, and they never put themselves in the best bar. They always create a maturity model which puts them in the middle. I'm like, "So true."

Prototypes. Do one project, big and broad. Use your early adopters. Do that systematic process of first-level dependencies, failure mode, brainstorming, health model. Then create the checklist out of that, and do the deep and explicit things.

Games. Have teams pick their checklist items and see who can do the best when you have a disaster. That seems sort of ad hoc, but that's what I came up with.

But I think there are different types of people, and they think in different ways. And that's why the approach I started with really failed: people wanted a checklist to go through.

So that's my story. That's what I'm looking for help with. And for anybody who wants to talk about Azure, happy to talk about that too.

Thanks for coming.