Implementing a One Engineering System

San Francisco 2016

Ed Blankenship

Product Manager · Microsoft

Sam Guckenheimer

Product Planner · Microsoft

We've been on an engineering transformation journey as we move down the path of consolidating our internal engineering systems. We'll share a lot of the stories from our journey and pitfalls to help you learn from our mistakes.

Full transcript

Sam Guckenheimer

One Engineering System.

The saying we have at Microsoft these days is from an African proverb: "If you want to go fast, go alone. If you want to go far, go together."

Satya Nadella, our CEO, is totally into the idea that engineering productivity is one of the key drivers of the company's success. He's an engineer himself. He believes adamantly in this. The whole senior leadership team believes in it.

We've totally turned into this focus on One Engineering System, and we're from the team that is delivering this. It's based on Visual Studio Team Services, the same SaaS offering that we make available to customers and we use internally.

But how did we get there? Well, if you go back a few years, this was the Microsoft org chart, circa 2011. It was drawn, interestingly, by a Google engineer because we didn't publish our own org chart at that time.

The engineering process in the company was that every major division, Windows, Office, Bing, et cetera, did its own engineering system, and this had been true for decades. The idea was that by giving them responsibility for their own engineering system, they could go faster. That was the principle. They could be more effective. They could do what they needed.

The consequence of giving them their own responsibility for their engineering systems is that they didn't quite invest enough in them, and they ended up actually going, well, more problematically and slower, and things got more difficult. That led to a very interesting unintended consequence: no reuse would go unpunished.

The consequence of that was the philosophy, "Take no dependencies. We need to do this. We can't count on that other team. Don't know that they're going to be around to support that stuff. So if you see something you like, fork the code."

The consequence of that was that we ended up with about 300 XML parsers. There aren't 300 versions of XML. We ended up with a similar number of tree grid controls because everyone likes tree grids, and we didn't have a standard. I don't know how many versions of jQuery and all that stuff were buried in these different code bases that were opaque to everyone. That turns into cruft, which turns into a kind of technical debt.

So we set out to say, there must be a better way. There must be a way to get modern engineering practices in place that can support this organization and its engineering teams at scale.

We bought into a North Star, and that North Star is that source in the company would be available to anyone, that any developer could offer improvements to anyone else's project, that all of the company IP would end up being made of reusable components and services, that anybody could find and reuse components from anywhere else, that devs would be rewarded for creating things that were reusable, not just things for this next deliverable.

There would be no lag time between one dev checking in and everyone seeing the work, so that instead of having these really long, slow build processes where no one quite knew what would happen, you'd get to see instantly the changes and the impact of changes. And the build and test time would be proportional to how much code was changed, not proportional to how much code was left alone.
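One way to picture build and test time being proportional to what changed is to rebuild only the changed components and whatever depends on them. The sketch below is an illustration only, not the system described in the talk; the component names and the `DEPENDS_ON` map are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical dependency map: each component lists the components it depends on.
DEPENDS_ON = {
    "xml_parser": [],
    "tree_grid": ["xml_parser"],
    "billing": ["xml_parser"],
    "dashboard": ["tree_grid"],
    "search": [],
}


def affected_components(changed, depends_on):
    """Return the changed components plus everything that depends on them."""
    # Invert the map so we can walk from a component to its dependents.
    dependents = defaultdict(set)
    for component, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(component)

    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

With this map, a change to `tree_grid` would rebuild only `tree_grid` and `dashboard`, while `billing` and `search` are left alone.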

Finally, we encourage people in the company to, as part of their career development, move around. As long as everyone was locked on their own siloed systems, that was very difficult. So we wanted to get to a place where if you grew up in one part of the company and you wanted to go to another part of the company, the tools and process were familiar.

So let's take a look at what it looks like in practice. Ed, let's switch to the demo.

Ed Blankenship

Yeah.

Sam Guckenheimer

Talk to us about how we work.

Ed Blankenship

This is going to be a really fun process. Anyone who ever knows about showing live systems, this is our live engineering system that we use to build our products from day to day.

First, I'm going to just start—

Sam Guckenheimer

So I just want to riff on this for a minute.

Ed Blankenship

Please, yeah.

Sam Guckenheimer

Okay. We are on the internet. We are connected—

Ed Blankenship

It'll be fun.

Sam Guckenheimer

—to our production engineering system. I'm crossing fingers. A little bit later we'll actually watch a live demo.

Ed Blankenship

Okay, so we're not quite sure what you're going to see. This will be fun.

First, our team in particular is really large. We have about 70 different feature crews within the broader team, and so we need different views. This is our main view across all of the teams, but each individual team has its own view. This happens to be one of our teams, called Blueprint. They can see—

Sam Guckenheimer

So hang on just a minute.

Ed Blankenship

Yeah.

Sam Guckenheimer

Let me riff on that. One of the issues that comes up all the time is, how do you deal with enterprise scale and team autonomy?

Ed Blankenship

Absolutely.

Sam Guckenheimer

So we just went from a roll-up dashboard of the pipeline down to a feature crew's dashboard, where this belongs to a group of, I think, 10 or 11.

Ed Blankenship

That's right. Each feature crew is about that big, and you talk about how individual feature teams' work accrues back up to broader initiatives that we're making. This is a view of their Kanban board. These happen to be individual stories and bugs, but they can map up all the way up to broader-level scenarios. But I'm going to hide that for a bit.

This is actually the live team's Kanban board. We don't know what's in there right now.

The nice thing about this is every team can decide how they want to do work, which agile practices they want to implement, et cetera. They can customize everything. Swim lanes, like if you're familiar with expedite lanes, this team happens to have one for customer support cases, live site incidents that—

Sam Guckenheimer

Yeah, if you notice that red one's blocked saying, "Wait for customer."

Ed Blankenship

Yep, and then this green one—

Sam Guckenheimer

It looks like Jet's having a problem.

Ed Blankenship

But you can organize the work in many different ways. All the teams do it a little bit differently. It's pretty nice to be able to do that.

The reason why I bring this one up is that at Microsoft, it's extremely important to make sure that you have traceability against all of the different artifacts in the system. So if you think about source code commits, work items, bugs, new features, test cases, deployments, builds, we've got linking across everything. That's particularly important when you start to want to see what is the delta between any different release that's going out or being impacted in some other part of the company.

We start right from here. Even from one of these bug work items (we'll talk a little bit about branching in a minute), you can start source code operations right from different parts of the application.

Now, each team also, we have tons of different builds. Like, tons. I just happened to find one that was kind of interesting. It's our ALM search team's build. It's a continuous integration build. You'll start to see that traceability start to show up. These are all the bugs and stories that were part of it.

Sam Guckenheimer

Wait, I don't know if everyone can read that. You're going kind of fast.

Ed Blankenship

No, I don't think so, yeah.

Sam Guckenheimer

So, yeah.

Ed Blankenship

I kind of don't want them to see all the new features either, so—

Sam Guckenheimer

Okay.

Ed Blankenship

We haven't disclosed them.

Sam Guckenheimer

Cool. So those little bars, where a few are red and most are green—

Ed Blankenship

That's right.

Sam Guckenheimer

—their height is how long the build took to run.

Ed Blankenship

That's right.

Sam Guckenheimer

Those are current builds, and then down there, you see associated work items. Those are the stories and bugs and so forth that actually make this up. And then on the right are the test results.

Ed Blankenship

Yeah. We've got test results, code coverage. What's nice is, since we build on every particular platform in Microsoft, we have to have an engineering system that can support all types of tests. We have a common test results store as well. Code coverage, even down to where this build might have been deployed, into which environments and which pipelines they happen to be used in.

This is really the start of all of that traceability. If I go and take a look, I can even start to break down from even our original dashboard. I can start to look at the different builds and what tests are in there. It's particularly nice when this build, the CI build, happens to have one failing test from today.

Sam Guckenheimer

Yeah, it looks like a security test—

Ed Blankenship

It is a security test.

Sam Guckenheimer

—that was failing.

Ed Blankenship

So hey, we have one failed security test out of our entire test suite. That's always good.

But it's got great information, stack trace, error messages, the logs. A bunch of our UI tests will actually put screenshots in here, so that's a nice way to do it. And then individual bugs that actually get filed right from the test failure, so you can start to get traceability right from there, too.

Finally, if I just open up one of the work items, whether this is a feature or a bug, what I really love is I usually start from there, as a product manager. I'm usually trying to see where this is in the development life cycle, and this just starts to pull information from other parts of the system to say, "Hey, here are some pull requests. Here are some branches. Here's some code commits." It'll also start to tell you where it starts to deploy, if that bug has actually been deployed or not.

So as a PM, all I need to really know is, "Hey, where are the features or bugs at, at any given time?"

Sam Guckenheimer

That's right.

Ed Blankenship

And so I'll cover pull requests in a second, but Sam, why don't you—

Sam Guckenheimer

Okay, good.

Ed Blankenship

—tell us a little bit about Git.

Sam Guckenheimer

Good. Before we switch back to the slides, I just want to comment on what Ed just showed you.

We were downstairs talking about security in the Lean Coffee session and the need to have traceability from your stories to your code, to your tests, to your deployments, your bugs back, your incidents back. This is how we do it. This is by no means the only way to do it or the only tool to do it, but this is how we do that traceability problem.
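The traceability described here can be thought of as a graph of links between artifacts. The following is a minimal sketch under that assumption, not how Visual Studio Team Services stores its links; the artifact IDs and the `kind:id` naming are hypothetical.

```python
from collections import defaultdict

# Hypothetical artifact IDs and links, written as "kind:id" pairs.
LINKS = [
    ("story:101", "commit:a1b2"),
    ("commit:a1b2", "build:5001"),
    ("build:5001", "test_run:77"),
    ("build:5001", "deployment:ring0"),
    ("bug:202", "commit:c3d4"),
]


def build_graph(links):
    graph = defaultdict(set)
    for source, target in links:
        graph[source].add(target)
        graph[target].add(source)  # links can be followed in both directions
    return graph


def trace(start, graph):
    """Return every artifact reachable from `start`, grouped by kind."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    grouped = defaultdict(list)
    for artifact in sorted(seen - {start}):
        kind, _, identifier = artifact.partition(":")
        grouped[kind].append(identifier)
    return dict(grouped)
```

Because the links go both ways, the same walk answers "where has this story been deployed?" and "which story is behind this deployment?".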

Let's switch back to the slides.

Now, what we do for code is that under Visual Studio Team Services, we are using Git. We're in the process of moving the whole company to Git. The benefit of that is giving everyone one master branch. Or making everyone use one master. We do use lightweight topic branches. So a bug will have a branch, come back with a pull request. Branch will get closed in a day.

Unlike the old days when you said, "I'm going to do something new, and I'll go off here and create a branch using centralized version control, and then I'll come back in three months, and it'll merge." And then you have all this merge debt, and everything's on the floor for a month.

These get collapsed immediately, within a day, within a few days. And you have tiny continuous merging. So the merging happens while the code is really fresh in your mind. It's not this long, what did I do? It's now, and everyone commits back to master through the CI build.

So let's take a look at that process and how we use pull requests.

Ed Blankenship

Yeah.

Sam Guckenheimer

Can we switch to the demo, please?

Ed Blankenship

Yeah, sure. Before we talk about that, I was going to mention one of my favorite stories is how bad Windows was for me in Git.

Sam Guckenheimer

Oh, my goodness. So Windows, can you imagine trying to clone a repo with 40 million files and bring it down to your own laptop?

The scale. There are some things about introducing Git that are really worth talking about.

One is start from the right. Get the release process working right first. You can deploy out from Git and have the pipeline work and get the benefits of the CI/CD pipeline, and then start moving back to the left if you've got an existing code base that's pretty monolithic and needs to be refactored. But you need to start refactoring toward a microservices model so that you have smaller repos, so that you have things that are independent and digestible before you go distributed—

Ed Blankenship

That's right.

Sam Guckenheimer

—with Git. Big learning.

Ed Blankenship

Yep. So instead of boring everyone by showing you what Git repos look like and branches and stuff, which are all fun and games, one particular thing that I have really enjoyed in our One Engineering System has been how we've implemented pull requests.

Sam Guckenheimer

Yeah.

Ed Blankenship

It is the main way that we start to get a first set of quality on a lot of the changes that are happening. Every change that goes through requires a pull request, and it's great to have another set of eyeballs on here.

I've done a little searching this morning about a cool pull request from the system. This one is interesting because I noticed Brian Harry, our vice president—

Sam Guckenheimer

Yeah, my boss, yeah.

Ed Blankenship

His big boss. Our vice president was on this pull request for some reason. So all fun, but it's got the normal things that you—

Sam Guckenheimer

Wait. You're going fast again.

Ed Blankenship

Absolutely.

Sam Guckenheimer

Let me make sure people in the back understand what we've got here.

Ed Blankenship

Please.

Sam Guckenheimer

So on the bottom, you see a review history.

Ed Blankenship

Yes.

Sam Guckenheimer

And who reviewed this pull request when and what the approvals were on this particular change. Then the reviewers are summarized over on the left-hand panel.

Ed Blankenship

That's right.

Sam Guckenheimer

So we have the idea that anyone has visibility into that history, and the review history happens on these tiny batches, on the pull requests, just on the change.

Ed Blankenship

Exactly.

Sam Guckenheimer

And you see the diffs like this of what those changes were as the reviewer.

Ed Blankenship

Then we can start to make comments. Threaded conversations, we can resolve them, all the typical things you would think about.

But I mentioned traceability is super important. We can start to make sure that things are linked together, but not only just a set of manual reviewing, we actually have a whole system that we've started to implement around making sure that certain branches stay at the highest quality.

For example, on our master branch, we have a set of branch policies that start to run as part of the pull requests. You can do things like making sure that a specific build is successful. That build can have a test pass as part of it. You can require that a certain number of people are—

Sam Guckenheimer

Yeah, but one of the great things we do is handling this at scale, I think.

Ed Blankenship

Absolutely.

Sam Guckenheimer

So go ahead.

Ed Blankenship

The one thing I was going to say is that one of the hardest problems, especially on really large code bases, is finding the right teams or the right people who should be the reviewers.

Sam Guckenheimer

Yeah. So you're going to have one person review all the code?

Ed Blankenship

Yeah. Of course not. So we've got, and this is actually all of our mappings, where we say, "Hey, if any changes are in this part of the source tree, we want this team or this person to be a required reviewer," and it'll automatically add it.

So as soon as someone tries to change something in the billing code, it will go and grab the commerce team and require that reviewer before it can get merged up to master. So it's actually been a really good helping—

Sam Guckenheimer

Yes.

Ed Blankenship

—thing for us.

Sam Guckenheimer

So we love pull request policies. We love the ability of pull request policies to automatically trigger reviews, and those reviews come from the right set of reviewers based on the corresponding directory group.
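As a rough sketch of the idea, path-based required reviewers plus a couple of merge policies could look like the following. This is not the Visual Studio Team Services implementation; the paths, team names, and the `PullRequest` class are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical path-to-reviewer mappings; the paths and team names are made up.
REQUIRED_REVIEWERS = {
    "src/billing/": "commerce-team",
    "src/search/": "search-team",
}


@dataclass
class PullRequest:
    changed_files: list
    approvals: set = field(default_factory=set)
    build_succeeded: bool = False


def required_reviewers(pr, mappings=REQUIRED_REVIEWERS):
    """Reviewers added automatically because of which paths the change touches."""
    return {
        reviewer
        for prefix, reviewer in mappings.items()
        if any(path.startswith(prefix) for path in pr.changed_files)
    }


def can_merge(pr, minimum_approvals=1):
    """Apply three example policies: passing build, approval count, required reviewers."""
    missing = required_reviewers(pr) - pr.approvals
    return pr.build_succeeded and len(pr.approvals) >= minimum_approvals and not missing
```

A change under `src/billing/` stays blocked here until `commerce-team` has approved, even if the build is green and someone else has signed off.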

Ed Blankenship

That's right.

Sam Guckenheimer

That happens very nicely. Okay. We'll switch back to slides.

Ed Blankenship

Yeah, let's do that.

Sam Guckenheimer

So slides, please.

What does all of this enable? Nicole Forsgren reminds me that it's all about culture and metrics. So what are our culture and metrics?

You can see at the top of this a little snippet cut out from our engineering dashboard, and you can see that we're tracking all of these fine-grained metrics on the engineering process. There are really four categories.

Live site health. We're very big on live site culture. Live site culture means that you work to keep the customers up, with time to detect, time to mitigate, and then you work to do a root cause repair so that the problem doesn't happen again. We call those incident prevention items, and you need to close them in a sprint typically, or two sprints.

We watch for live site problems that are aging, in other words, aren't getting closed, map against customer support metrics, map against SLAs, and we track those by individual customer. That's about live site health.
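To make the live site measures concrete, here is a small sketch that computes time to detect and time to mitigate for one incident and lists repair items that are aging. The function names and the choice to measure mitigation from detection are assumptions, not definitions from the talk.

```python
from datetime import datetime, timedelta


def incident_metrics(started, detected, mitigated):
    """Return (time to detect, time to mitigate) for one incident."""
    # Assumption: time to mitigate is measured from detection; some teams
    # measure it from the start of customer impact instead.
    return detected - started, mitigated - detected


def aging_items(open_items, today, max_age):
    """List open repair items that have been open longer than `max_age`."""
    return [name for name, opened in open_items.items() if today - opened > max_age]
```

For example, passing a `max_age` of one or two sprint lengths would flag incident prevention items that have missed the target mentioned above.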

Then we look at the velocity, which is the left-to-right movement. How long does it take to build, to test, to deploy, and so forth? And how long does it take us to improve, to implement learnings?

Engineering, we think of as a set of governors. So we used to look at these centrally a lot. Now it's basically up to the teams at a lower level. Things like bug cap per engineer. If your bug cap is four, you have eight engineers, you get to 32 bugs, you can't do anything new. You need to pay down your bugs before you go further. Keeps the debt out.

Ed Blankenship

Bug jail.

Sam Guckenheimer

Yeah. Called bug jail, bug hell, whatever.
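The bug cap arithmetic from the example (a cap of four per engineer, eight engineers, 32 bugs) can be written as a short check. The function names are hypothetical.

```python
def bug_cap(engineers, cap_per_engineer):
    """Team-wide cap: the per-engineer cap multiplied by team size."""
    return engineers * cap_per_engineer


def in_bug_jail(open_bugs, engineers, cap_per_engineer):
    """True when the team has reached its cap and should stop new feature work."""
    return open_bugs >= bug_cap(engineers, cap_per_engineer)
```

With eight engineers and a cap of four, the team reaches the limit at 32 open bugs and is still clear at 31.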

Aging bugs, how long are they around? What's the test pass rate and coverage? You saw that on the initial dashboard. Everyone sees that. Are there any failures? Are there any obvious reds in there?

And then, of course, we track usage, the outside in. How is this doing for customers? How well are we doing at acquiring new customers, engaging? How dedicated are they becoming? Is anyone churning out? And then in a fine-grained way, we can look at telemetry as well.

Now, we do that by means of a common telemetry pipeline. We have a pipeline. Again, this is something we use internally, and then we offer most of it to customers through Application Insights Analytics.

That pipeline, to give you an idea of scale across Azure in the first of these charts, is currently ingesting about 1.6 petabytes of data a day, 1,600 terabytes.

Ed Blankenship

A day.

Sam Guckenheimer

A day. And you can see the month-to-month growth here. It gives us all of the data on the running services. We can use it for this very high-volume ingestion, for fast queries over the datasets, and use it for text search.
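To get a feel for that ingestion figure, the daily volume can be converted to an average rate. This is plain arithmetic on the number quoted above, assuming decimal units (1 PB = 1,000 TB) and an even spread over the day, which real traffic will not have.

```python
PETABYTE = 10**15  # decimal units, matching "1.6 petabytes ... 1,600 terabytes"
TERABYTE = 10**12
GIGABYTE = 10**9
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_day = 16 * PETABYTE // 10  # 1.6 PB, kept as an exact integer
terabytes_per_day = bytes_per_day / TERABYTE
gigabytes_per_second = bytes_per_day / SECONDS_PER_DAY / GIGABYTE

print(f"{terabytes_per_day:.0f} TB/day, about {gigabytes_per_second:.1f} GB/s on average")
```

That works out to roughly 18.5 GB arriving every second, averaged over the day.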

So let's switch back to the demo, look at how we release.

Ed Blankenship

Yeah. I was just going to mention, one of the best things that, of course, we want to do, especially as a SaaS-based service, is to make sure that we get changes through the pipeline as quickly as possible.

Just to give you a sense of scale that we have, we have more than four million customers now on our SaaS service, in addition to all of the great engineers we have at Microsoft using the service. Getting it out there quickly has been an interesting challenge for us.

But, just going to pop up. I, unfortunately, don't have a live production one running right now, but I did find one that is running, so I'm going to show you.

Sam Guckenheimer

Okay.

Ed Blankenship

This is our release pipeline for production. We have five different rings that help us with exposure control, as you might imagine. Ring zero is our canary instance, which is actually the account that this is in. We give ourselves the first set of pain in case there's anything wrong.

Sam Guckenheimer

Yeah. So let's explain what a ring means. For practical purposes, the fine-grain definition is these are one or more Azure scale units.

Ed Blankenship

Right.

Sam Guckenheimer

Everything's deployed on Azure. You can think of these as data centers. By the time you get through ring five, you've gotten through 10 data centers worldwide.

Ed Blankenship

That's right. We actually do have a prod update going on right now. We've got all sorts of traceability and logging. We're going to see if the internet will work, and then we'll start to get some live output from one of the builds that are—

Sam Guckenheimer

Ring five, I think, is Australia and Brazil.

Ed Blankenship

That's right. And ring four here. Where is it? Oh, this is the fifth ring. That's right.

Sam Guckenheimer

Yeah.

Ed Blankenship

Start at zero. It's off-by-one error.

Sam Guckenheimer

Right. Okay.

Ed Blankenship

Always fun with those. Anyhow, normally you'd start to see all of the live updates. But once a deployment has actually completed, some releases go to the first ring, some releases go all the way. We can decide whenever we want for that.

But this is where the traceability all comes together as the glue. We've got our builds. Which builds have gone out? What branches did they come from? Who approved different releases? What work items were part of there?

So I think this is the part that's the most interesting, is you can start to get a full list of all the work items that are starting to be a part of this and compare them with previous releases at any time, as well as even down to the individual commits or the tests that are being run as part of this.
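Comparing the work items in one release against a previous one is essentially a set difference. A minimal sketch, assuming each release records the IDs of its linked work items; the release names and IDs are hypothetical.

```python
# Hypothetical release records: each maps a release name to its linked work item IDs.
RELEASES = {
    "release-41": {1001, 1002, 1003},
    "release-42": {1002, 1003, 1004, 1005},
}


def release_delta(previous, current, releases=RELEASES):
    """Work items added to and dropped from `current` relative to `previous`."""
    before, after = releases[previous], releases[current]
    return {"added": sorted(after - before), "removed": sorted(before - after)}
```

The same comparison could be made over commits or test runs, as long as each release records which ones it includes.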

This is kind of the glue. We're constantly monitoring this and seeing where things are going. But we're able to ship very quickly out to many environments that support over four million customers.

Sam Guckenheimer

It's very cool. You can see the traceability all the way to production.

Ed Blankenship

There you go.

Sam Guckenheimer

And I think ring four is waiting for an approval. The deployment's automatic—

Ed Blankenship

I think so, yeah.

Sam Guckenheimer

—but you do need an approval to say, "Okay, go ahead."
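The ring model described above, automatic deployment with an approval gate before going further, can be sketched roughly as follows. This is an illustration under assumptions: the ring layout, the scale unit names, and the idea that every ring needs its own approval are hypothetical, and the `deploy` callback stands in for whatever actually performs the deployment.

```python
# Hypothetical ring layout; the talk names ring zero as the canary but does not
# list which scale units sit in which ring.
RINGS = [
    ("ring0", ["canary-su"]),
    ("ring1", ["su-a"]),
    ("ring2", ["su-b", "su-c"]),
]


def roll_out(build, rings, approved, deploy):
    """Deploy `build` ring by ring, stopping at the first ring without approval.

    `approved` is the set of ring names someone has signed off on, and
    `deploy(build, scale_unit)` returns True on success.
    """
    completed = []
    for ring_name, scale_units in rings:
        if ring_name not in approved:
            break  # wait here until someone says "go ahead"
        if not all(deploy(build, unit) for unit in scale_units):
            break  # a failure stops the rollout so later rings are not exposed
        completed.append(ring_name)
    return completed
```

Stopping at the first unapproved or failed ring is what gives exposure control: a bad build reaches the canary before it reaches anyone else.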

Ed Blankenship

Yep.

Sam Guckenheimer

Okay. So let's switch back to slides.

All of that allows us to create a level of transparency. For example, when there are live site incidents that lead to outages that affect a number of customers, we blog about them, and we tell you exactly what happened.

For example, the one with the colorful charts, which were made possible by that telemetry, is going through what happened to the differences in query optimization between SQL 2014 and SQL 2016, and how we discovered the impact on memory usage and what that surprise was. That led to new process safeguards and changes in queries so that we could guard memory differently.

All of that's described openly to anyone who wants to get on the blog.

Ed Blankenship

That's right.

Sam Guckenheimer

Let's switch to the demo one more time because I want to—

Ed Blankenship

Yeah, I was going to say one thing that a lot of people seem to be surprised at sometimes, but is really important for a single engineering system at Microsoft, is the ability to both build and release any type of application.

You can imagine we have apps on iOS and Android and deploying out to Linux environments. We've got it all pretty much in the company.

Sam Guckenheimer

Right.

Ed Blankenship

We've had to build a build and deployment pipeline system that really does enable you to do everything from standard server applications all the way to, you can deploy to the production Apple Store if you want for your iOS apps, and then build pretty much everything you can imagine under the sun. That was a really important thing that we needed to make sure we did.

Sam Guckenheimer

Yeah. It's one engineering system. We have to do a lot of development for Linux now. We do a lot of development for iOS, for Android. Ed's driving a Mac, so—

Ed Blankenship

Ah, yes.

Sam Guckenheimer

—all of this needs to be supported here, and we have the corresponding tasks to do that.

Okay. Let's finish up on the slides.

So is this working? Here's a picture of one of the team rooms in our building where one of the feature crews is. They are responsible for one of the dashboards we showed you or the work shown there. When they want to do their daily standup, they walk to one side of the room, do their standup. They want to talk during the day, they swivel their chairs. And it's just a room for them. It's not a huge open space like this. It's just for them, and the little focus rooms on the side.

Ed Blankenship

I love it. You can even move the little desks around however you want.

Sam Guckenheimer

Yeah. They arrange the desks the way they want. The desks typically can be used for sitting or standing. No one here is doing standing. But it's totally up to them.

Now, we've tried to take that same agility into the org. Around every 18 months when we say, "Okay, you can change teams," what happens is that we take the team leads into a room like this.

Each pair of leads, the engineering lead and the product owner, stand up and say, "Hey, we're called Blueprint, and we work on this stuff in Agile, and yada yada yada, and here's what we're looking for and here's what we do," and so forth. Then the next team comes up and says, "We're about version control core, and here's what we do," and so forth.

Ed Blankenship

And they usually say, "Hey, my team's better."

Sam Guckenheimer

Right. Yeah. And then everyone boasts about how good they are.

Ed Blankenship

Of course.

Sam Guckenheimer

Then everyone in the room, we don't have them here, but you had them in the coffee session, takes a Post-it or three Post-its and writes down first, second, third choice on where to work.

Then the managers or team leads get those Post-its and then try to give people their preferences of where to work. We're hitting about 95% first choice right now.

Ed Blankenship

It's really an amazing thing in so many different ways.

Sam Guckenheimer

Yeah. So people get to decide where they want to work, to some extent. It's obviously within site, geographic site, and within organizational area, but you get to pick your crew.

And how is it working across the org? Well, this is telemetry from our One Engineering System. We're up to about 62,000. We're over 62,000 internal users right now. It's grown about 4X in two years. It obviously can't keep going that way. These are internal users.

What we haven't done yet, need to move to next, is measuring satisfaction with Net Promoter Score internally and so forth. But the signs are pretty good.

So we're moving from a place where no reuse would go unpunished to one where the assist gets rewarded.

Thank you very much.

Ed Blankenship

Thank you.