The 10 Year Microsoft Journey

San Francisco 2014

Sam describes a ten-year transformation at Microsoft Developer Division from a box product delivery cycle of four years to a hybrid SaaS and on-prem business, with a single code base, delivery of new features to the service every three weeks, and quarterly delivery for on-prem customers. He presents three waves of improvement and learning: first, the reduction of technical debt and other waste to gain trustworthy transparency; second, the increase in the flow of customer value; and third, the shortening of cycle time to allow continuous feedback and continuous business improvement.

The business currently has millions of customer accounts each on-premises and in the cloud. This hybrid situation will exist for many years and is a necessary part of the business.

Sam will give examples from monthly service reviews of key practices and metrics, such as hypothesis-driven development, funnel analysis, performance monitoring, MTTD and MTTR improvement, log analysis, root cause remediation, scale unit replication and canarying, common code base, testing cycles, georeplication, feature flags, compatibility and compliance testing.

The organizational issues of transforming from a traditional box software delivery team to a modern SaaS team will be addressed as well.

Full transcript

The complete talk — auto-generated from the talk's captions.

I am not a large animal veterinarian specialist, what have you, and I'm not sure about this equine taxonomy thing. So I don't know if I'm a horse growing horns or a budding unicorn or a rhinoceros or whatever. I'm not going to talk about the first part of agile transformation, the days when we went from being very stodgy and waterfally and had nearly four-year product cycles to getting into standard agile 101 practices of a good definition of done, a common product backlog, standardized sprints, and all those sorts of things. I've written about that a lot.

We achieved things like 15X reduction in technical debt. We had 2X reduction in product schedule and what have you. As you know, we grew up as a company that shipped software, box product we call it, for people to install in their data centers or on their desktops. And we are now a company that combines those existing products with cloud services and are moving to a cloud first, mobile first world where we do both and we reuse.

So, in my reuse of title, you'll see that. The part of the story I'm going to tell you is about our move once we were agile into what we call cloud cadence, which is very much about DevOps. And I think the essence of this story is about moving to a culture of usage, a culture of observation of users, usage patterns, understanding usage, driving from usage, being part of the milieu of where your users are. And in particular, I'm going to talk about eight things here.

I'm going to talk about why we think the faster cadence matters. I'm going to talk about how we organize our teams, how we've adapted the agile schedules and ceremonies for this world, what our deployment practices are, what we call live site culture, which is maybe the softer side, but there are also some very concrete practices. I'm going to talk about how we weave business success into technical practices, what telemetry we put in place, and how we use that for learning and improvement, and then how we take what works for us and try to productize it for our customers. Start with why.

Well, when I talk about DevOps or a journey to cloud cadence, I want to be clear. This is a particular view of the elephant. This is consistent with Gene's three ways. We're talking, of course, about agile development.

Everyone's talked about agile development. Drink a beer. It's accelerated the construction process. We know that.

We're working in sprints. We're doing more. We're producing more. And that's created the need for an automated release pipeline that keeps clean code getting deployed in an automated way into production.

What we've heard less of is that that production needs to be monitored constantly for health and performance and any live site incidents and for usage. And from the learning in production, you need to be able to respond and tune the backlog of what you construct next. And really, that whole feedback loop is what we think of as a DevOps lifecycle. Okay.

It's not just the release pipeline. It's not just the feedback from Ops into Dev, the second way. It's that whole feedback loop, the third way. Here's an example.

We have a service called VS Online, Visual Studio Online. Grew out of what we deliver on-prem as Team Foundation Server, which some of you may know. VS Online is updated with new features every three weeks. There's a new deployment to now millions of users. And those capabilities that those users of VS Online see are then rolled up on a quarterly basis, more or less, and delivered to customers on-prem.

So unlike the born in the cloud companies, we keep one code base for the on-prem product and the cloud SaaS delivery, and we work so that the cloud SaaS offering, the service, is effectively a beacon of what's coming next if you install on-prem. And most of our large customers, like most of you, are, in fact, still running on-prem in their data centers. But they also get to subscribe to the service and see what's coming next. And this is an example.

This is actually an example from one of our internal uses. And of course, we use all of this ourselves. How do we organize our teams? Well, this is not going to be surprising.

This is what Scott yesterday talked about as one of the anti-Taylor moves. We're in scrum teams. We typically call them feature crews. They're cross-discipline.

There's a product owner in the team, in the team, working inside the team on a day-to-day, hour-to-hour basis. There is a dev lead. There is a test lead, historically. We're experimenting with combining those into a single engineering group.

The test discipline has been evolving the fastest as we've been moving into a services world. But those three have been called triad. We think of 12 to 18 months as sort of the minimum period that you keep a team intact. That isn't to say it's the maximum period.

There are teams that have been working together for much, much longer than that, that have particular specialties. They're compiler experts or they're database experts or they're whatever, but the notion is that you work with this group of people for at least that minimum period because there's a lot of chemistry that happens in that team, and we bring work to the team. We don't try to shift people around so that you disrupt the chemistry. And that team is responsible for its own backlog.

They pull work in, and they are responsible for delivering what they commit to in a sprint. They work in a team room. A team room is not a big open area with 100 people or 50 people. It's just the people who are working together.

So, you don't need conference rooms for meetings. You just talk across your desk or turn your chair around and talk to the people you're working with. There are little rooms on the side called focus rooms where you can take four or five people and just go in for an hour. You don't reserve them, you just use them.

Walk out when you're done. Someone else can walk in after you. Behind the camera, there are more whiteboards. There's lots of whiteboard space and what have you, and you just work together with the people you work together with.

And you talk with them when you want to talk with them, and it doesn't feel like you're disrupting someone because the stuff you're talking about is of common interest. And if other people join in the conversation or overhear you, that's a good thing. How do we deal with schedule and ceremonies? I'm going to start at the right-hand end.

We don't really try to plan for more than 18 months ahead, and we don't really try to plan with great precision at that horizon. We have a sort of vision for where we'll be in 18 months. We think of it as a North Star. We will create some artifacts like a document, some storyboards, occasionally some conceptual videos, some proof of concepts or spikes or whatever, and say, "We think we'd roughly like to be there then." But it's not a spec, and we know it's going to change.

And we know it's going to change on a six-monthly basis, what we call a season. I know that's a misuse of the English language, but we refer to spring and fall as the two parts of the year. I think it may have to do with where we're headquartered. We could call them wet and dry, but I don't think that would have the same effect.

And in that six-monthly period, we're much more specific, and we'll start talking about big dependencies or cross-divisional things or what have you. And then the teams work in three-weekly sprints. So, if I say it's now sprint 73, that means the same thing in Mountain View and Seattle and Raleigh and Hyderabad and Cambridge, England, and everywhere else. And it's our shorthand, and you talk about what happens in that sprint. And that synchronization I'm talking about is what we do across developer division, where we work.

You can think of it as about 1,500 people if you look at it from the org chart. If you look at it from the code base and who's contributing into that, it's about double that. And then about three of these sprints... And by the way, why three weeks?

I don't know. It works. Today's Diwali in India. Happy Diwali.

And they're not working. So, we don't concern ourselves with that. Two weeks we found was too short, and we started colliding into holiday schedules, and it became a big deal. Four weeks was too long.

Three seems to work well at a global level. And we are globally distributed. Roughly every three sprints, we'll have planning chats among the triads leading feature crews where people have the, "What are you planning? Here's what I'm planning to do. What's coming up next? Do you have any suggestions, course corrections? What do I need to do," et cetera, kinds of conversations, very informal, and we will do course corrections at that level. But the peel off the backlog for what I commit is at the sprint level.

Now, we've tried to keep the ceremonies really lightweight. So, the start of a sprint, the sprint planning is broadcast out from each feature crew with a mail. The sprint completion is broadcast with an update of that mail that has a video included, and an update to the items, the user stories that were peeled off the backlog. And the video is in customer terms.

So, for example, here's a sprint mail. The table has some item numbers from the VS Online instance or TFS instance. It's now VS Online. What are we doing?

And at the end, what have we accomplished? And this video is three to five minutes, and it's explained in customer terms. Here's what you can now do. I'm sorry.

And beneath the fold here, it doesn't fit on the screen, you see an update of the table with how much we did, and if work wasn't completed to done, there's an explanation, and it gets carried over. Okay? And then these are sent and put on a SharePoint site and accumulated. And so you can think of these as one per crew, and the videos are short enough that you can review them across large numbers of teams.

Let's talk about our deployment practices. So in a live site world, we're always up, 7 by 24. There is no maintenance window. We do not say, "At 3:00 a.m. on Sunday, we will be down for blah, blah, blah." No, we're always up. For someone, that's work time. All of our deployment is fully automated. That means that we need to have decoupled services.

They may not be all micro yet, they're getting smaller as they go, and they need clear contracts. We depend heavily for our development on the notion of feature flags. Feature flags are the idea that we don't branch unnecessarily. So in the old days, if you were working on something, you'd create a branch, you'd go off, you'd hide your work in progress in the branch, and then eventually you'd come in and you'd merge.

And that created a lot of merge debt. The idea of the feature flag is you're working in master, and by doing that, your code will get deployed, but it will be hidden by a feature flag. And you can raise the flag in production at the granularity of the feature and the user, or group of users, and see how it works all the time, and it allows dark releases. It is one way of doing incremental deployment to a group of people, and one way of canarying.
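The feature-flag idea described here can be sketched in a few lines. This is only an illustration under stated assumptions, not the VS Online implementation: the `FeatureFlag` class, the `render_signup` function, and the user and group names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    """A runtime switch that hides deployed code until it is raised (hypothetical)."""
    name: str
    enabled_for_all: bool = False
    users: set[str] = field(default_factory=set)    # individual users
    groups: set[str] = field(default_factory=set)   # circles of users

    def is_on(self, user: str, user_groups: frozenset[str] = frozenset()) -> bool:
        if self.enabled_for_all:
            return True
        # On if the user is listed, or belongs to any listed group.
        return user in self.users or bool(self.groups & user_groups)


def render_signup(flag: FeatureFlag, user: str,
                  user_groups: frozenset[str] = frozenset()) -> str:
    # Both code paths are deployed; the flag decides which one a user sees.
    if flag.is_on(user, user_groups):
        return "new sign-up page"
    return "current sign-up page"


flag = FeatureFlag("new-signup")      # deployed dark: nobody sees it yet
flag.users.add("alice")               # lift it for one user
flag.groups.add("early-adopters")     # then for a circle of users
```

Because the check happens at request time, widening exposure is a data change rather than a new deployment, which is what makes dark releases and gradual exposure possible.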

It's not our main method of canarying at the moment. I'll come to that. So, all the code is deployed, but the first thing you've done is you've created a feature flag that hides it. At runtime, you lift it first for one user, then for a few users, then for a circle of users. You can discover from this that they say, "Hey, like the idea, but you got it wrong. This is too complicated. This is whatever." You can do A/B versions, whatever, and you can experiment and refine until you're ready to go broader.

Now, with these updates that we keep doing, people want to know what's going on, so we just keep blogging and saying, "Here's what's in this three-weekly release, and here's when it's coming on prem, and double-click here and you'll see the details of what's in there." And it really has changed our conversation with customers. We don't get asked, "What's your five-year roadmap, blah, blah, blah" anymore.

People just see, "Hey, there's this stream of value coming." Now let's talk about canarying. We started off in one data center. We're hosted on Azure. We started off in Chicago, and earlier this year we took the decision (I'll talk about the background in a moment) to start moving to multiple data centers.

So, the term SU here is scale unit. Think of that as a sort of self-contained data center in a data center, a whole rack with all of its switching and what have you. So we will deploy to San Antonio first. When that is up and running and has acted as the canary, we'll then, in a matter of hours, roll to the next data center, and this lets us start rolling out worldwide.
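The scale-unit rollout described here amounts to an ordered deployment that stops when a unit looks unhealthy. A minimal sketch, assuming the deployment and health check are supplied as functions; `roll_out`, the fake helpers, the build label, and the unit names are hypothetical, not the real pipeline.

```python
from __future__ import annotations

from typing import Callable, Sequence


def roll_out(
    build: str,
    scale_units: Sequence[str],
    deploy: Callable[[str, str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy to scale units in order and stop at the first unhealthy one.

    The first unit in the sequence acts as the canary. Returns the units
    that were deployed and passed their health check.
    """
    completed: list[str] = []
    for unit in scale_units:
        deploy(build, unit)
        if not is_healthy(unit):
            # Halt so the problem stays contained to this unit.
            break
        completed.append(unit)
    return completed


deployed: list[str] = []


def fake_deploy(build: str, unit: str) -> None:
    deployed.append(unit)


def fake_health(unit: str) -> bool:
    return unit != "su-2"   # pretend the second unit regresses


rolled = roll_out("build-73", ["su-canary", "su-2", "su-3"], fake_deploy, fake_health)
```

In this run the rollout reaches the second unit, sees it fail its check, and never touches the third.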

We're not up in Europe yet, but that will be coming soon. And so we can take advantage of the fact that Azure is now, I think, in 17 regions, and eventually start following that. Sorry, was there a question? 19.

19 regions. Sorry, I'm corrected. So, data center to data center, we have the canary. Let's talk about live site culture.

Need to keep moving. So, in a live site culture, you always put the first priority on the site being up. We've organized with a very small global response team. These are the people who follow the sun.

So like, in Australia, there's a guy, Grant Holliday, who is one of the book authors on We Have, Developer Quality, who's on this team. There's another guy in Europe, what have you. There are a handful of people on this team to whom issues will be escalated. And they have access to the code, all the code.

If they can't handle a problem, they will fall back to the so-called DRI by feature crew. There's a 168-hour schedule of who's on call from each feature crew. And the internal SLA is that the DRI has to be in the code within 15 minutes. Then there's a weekly live site review of all the incidents and a monthly service review that's much broader than just the incidents.
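The 168-hour schedule and the 15-minute SLA can be pictured as a lookup table with one slot per hour of the week, plus a simple time comparison. This is a sketch under assumptions: the slot layout (Monday 00:00 as slot 0), the function names, and the people in the schedule are hypothetical.

```python
from datetime import datetime, timedelta

HOURS_PER_WEEK = 7 * 24            # the 168-hour schedule
DRI_SLA = timedelta(minutes=15)    # internal SLA mentioned in the talk


def hour_of_week(moment: datetime) -> int:
    """Map a timestamp to a slot 0..167, with Monday 00:00 as slot 0."""
    return moment.weekday() * 24 + moment.hour


def dri_on_call(schedule: list, moment: datetime) -> str:
    """Return who is on call for a feature crew at the given moment."""
    if len(schedule) != HOURS_PER_WEEK:
        raise ValueError("schedule must cover all 168 hours of the week")
    return schedule[hour_of_week(moment)]


def met_sla(escalated_at: datetime, engaged_at: datetime) -> bool:
    """True if the DRI was in the code within the SLA window."""
    return engaged_at - escalated_at <= DRI_SLA


# Hypothetical crew: one person covers the first half of the week,
# another covers the second half.
schedule = ["dana"] * 84 + ["lee"] * 84
thursday_morning = datetime(2014, 10, 23, 9, 0)
on_call = dri_on_call(schedule, thursday_morning)
```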

And live site issues become product backlog items. Everyone sees the live site health dashboards all the time. This is one of the lunchrooms in one of the sites. Live site issues go into our VS Online.

You can see here, one of them. Almost everything on this screen is automatically filled in. You can see the outage window, and the remediation. Everything's driven to root cause.

And by the way, we don't really roll back. We roll forward because we've got all this customer data, and we can't really... We might drop to an earlier version of the application tier, but we'll never try to roll back the database. We're also tracking business success all the time, both in quantifiable and qualitative terms.

So we try to reach out to our customers, and we do reviews on success. So, examples. Everything has a funnel. Some things have multiple funnels.

We track trends, and we really try to follow how successful are we, and this is something that all the team watches. So we'll track month-on-month growth, which is in double digits, not just cumulative numbers, but really, you want to see growth rates, derivatives at the level of users and accounts. We'll do experiments. A simple early experiment was when we introduced Git as one of the version control options. We did good web page design by the book, kept things very simple, and only 30% of the people who got to here kept going.

And we thought that was awful. So we experimented with violating good web page design and making a much more complicated sign-up page, and all of a sudden, 50% went through. So by violating good web page design, we got a significant increase in our success rate and decided to keep going. We also track our top customers.
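The funnel and growth numbers mentioned above reduce to a few ratios. A small sketch, assuming visitor counts that were chosen only to reproduce the 30% and 50% figures from the talk; the counts, the monthly account totals, and the function names are hypothetical.

```python
def conversion_rate(entered: int, completed: int) -> float:
    """Share of people who reached a funnel step and kept going."""
    if entered <= 0:
        raise ValueError("entered must be positive")
    return completed / entered


def relative_lift(baseline: float, variant: float) -> float:
    """Improvement of the variant over the baseline, as a fraction of the baseline."""
    return (variant - baseline) / baseline


def month_on_month_growth(counts: list) -> list:
    """Growth rate between each pair of consecutive monthly totals."""
    return [(curr - prev) / prev for prev, curr in zip(counts, counts[1:])]


simple_page = conversion_rate(entered=1000, completed=300)     # 30% kept going
detailed_page = conversion_rate(entered=1000, completed=500)   # 50% kept going
lift = relative_lift(simple_page, detailed_page)               # about 0.67

growth = month_on_month_growth([100, 112, 126])                # 12% and 12.5%
```

Going from 30% to 50% is 20 percentage points, but relative to the baseline it is roughly a two-thirds improvement, which is why growth rates are more telling than cumulative totals.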

Now, we have very strict privacy policies, so we don't contact them without permission. We ask, "May we get in touch with you?" Almost everyone says yes, and then an engineer on the team is assigned as a buddy for contact. And we'll reach out and say, "How's it going? Do you have any suggestions? What do you like, not like?" So forth, as a way of connecting the team to the customer. And you get qualitative feedback that you just can't see with instrumentation. Speaking of instrumentation, we try to instrument everything. We also try to think of our telemetry like product code and monitor it, root cause everything.

And I show this, this is now practically two years old. This is one of our very early dashboards. But you'll see Twitter here, and we still use Twitter. Why?

Because there is no piece of infrastructure, network or anything else that is shared between Twitter and VS Online or Azure. So if everything is down, if someone disconnects Azure, microsoft.com, whatever, we can still tweet, and we can say, "Hey, there's a service outage and here's what's going on." We can tweet to our users, they can tweet to us, or someone can say, "I'm having a problem with my account. Can you look at it?" And it gives us a redundant path. Metrics.

And I'm going to have to do this really fast, and I'm sorry. So these are from our monthly service reviews. The charts you see at the bottom are time to detect, time to mitigate, the number of incidents, live site incidents by severity, and then the last two charts are kind of mislabeled. They're incidents by, it says by source, it should say by root cause, and time out by root cause.
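The charts described here can be derived from per-incident timestamps. A minimal sketch, assuming each incident records when impact started, when it was detected, and when it was mitigated, and assuming time to mitigate is measured from detection; the `Incident` fields, the sample incidents, and the root-cause labels are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime     # when customer impact began
    detected: datetime    # when monitoring or a person noticed
    mitigated: datetime   # when customer impact ended
    severity: int
    root_cause: str


def mean_time_to_detect(incidents: list) -> timedelta:
    return timedelta(seconds=mean(
        [(i.detected - i.started).total_seconds() for i in incidents]))


def mean_time_to_mitigate(incidents: list) -> timedelta:
    # Assumption: measured from detection, not from the start of impact.
    return timedelta(seconds=mean(
        [(i.mitigated - i.detected).total_seconds() for i in incidents]))


def incidents_by_root_cause(incidents: list) -> Counter:
    return Counter(i.root_cause for i in incidents)


day = datetime(2014, 1, 15)
incidents = [
    Incident(day.replace(hour=10), day.replace(hour=10, minute=10),
             day.replace(hour=10, minute=40), 2, "untested load"),
    Incident(day.replace(hour=12), day.replace(hour=12, minute=20),
             day.replace(hour=13, minute=20), 1, "configuration"),
    Incident(day.replace(hour=15), day.replace(hour=15, minute=15),
             day.replace(hour=16), 3, "untested load"),
]

mttd = mean_time_to_detect(incidents)      # 15 minutes
mttm = mean_time_to_mitigate(incidents)    # 45 minutes
by_cause = incidents_by_root_cause(incidents)
```

Grouping by root cause rather than by source is the correction the speaker makes to the chart labels.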

And you'll see a big bump in November. Why? It was last year. Because we had a launch then and we hadn't properly load tested the site or the service.

We discovered there was a queue that had a 64K limit that we'd never hit before, and no one really knew about it. So we reprioritized the backlog to make live site work take over feature work. And that had an effect, and a couple of months later, you can see we were getting much better. And we do this for every service, every constituent service.

But it meant that we were treating live site work along with feature work in one product backlog. And that led to the canarying across data centers. We're trying to apply the same principles to our installed products of great telemetry so that we can look at usage patterns there. And we're trying to productize from what we're doing internally.

And you may have heard of our Application Insights, which is going to provide opportunities for-- You'd have telemetry. Now, this is my help slide. The things that I would like help with are understanding how you're doing on, and how you assess improvement on dimensions of practice. We think of seven of them.

We think of how you handle agile scheduling and teaming, how you manage technical debt, how you think of the flow of value to customers, how you think of a backlog in terms of hypotheses and experiments, how you collect evidence and data, how you manage live site culture and dealing with production first, and then how you take advantage of the cloud. Are these the dimensions you think of, or are there others? That's the conversation I'd like to have. So with that, thank you very much.

I'll be here.

Thank you, Sam.