Implementing a One Engineering System
We've been on an engineering transformation journey, consolidating our internal engineering systems into one. We'll share stories and pitfalls from that journey to help you learn from our mistakes.
Sam Guckenheimer
One Engineering System.
The saying we have at Microsoft these days is from an African proverb: "If you want to go fast, go alone. If you want to go far, go together."
Satya Nadella, our CEO, is totally into the idea that engineering productivity is one of the key drivers of the company's success. He's an engineer himself. He believes adamantly in this. The whole senior leadership team believes in it.
We've totally turned to this focus on One Engineering System, and we're from the team that is delivering it. It's based on Visual Studio Team Services, the same SaaS offering that we make available to customers and use internally.
But how did we get there? Well, if you go back a few years, this was the Microsoft org chart, circa 2011. It was drawn, interestingly, by a Google engineer because we didn't publish our own org chart at that time.
The engineering process in the company was that every major division, Windows, Office, Bing, et cetera, did its own engineering system, and this had been true for decades. The idea was that by giving them responsibility for their own engineering system, they could go faster. That was the principle. They could be more effective. They could do what they needed.
The consequence of giving them their own responsibility for their engineering systems is that they didn't quite invest enough in them, and they ended up actually going, well, more problematically and slower, and things got more difficult. That led to a very interesting unintended consequence: no reuse would go unpunished.
The consequence of that was the philosophy, "Take no dependencies. We need to do this. We can't count on that other team. Don't know that they're going to be around to support that stuff. So if you see something you like, fork the code."
The consequence of that was that we ended up with about 300 XML parsers. There aren't 300 versions of XML. We ended up with a similar number of tree grid controls because everyone likes tree grids, and we didn't have a standard. I don't know how many versions of jQuery and all that stuff were buried in these different code bases that were opaque to everyone. That turns into cruft, which turns into a kind of technical debt.
So we set out to say, there must be a better way. There must be a way to get modern engineering practices in place that can support this organization and its engineering teams at scale.
We bought into a North Star, and that North Star is that source in the company would be available to anyone, that any developer could offer improvements to anyone else's project, that all of the company IP would end up being made of reusable components and services, that anybody could find and reuse components from anywhere else, that devs would be rewarded for creating things that were reusable, not just things for this next deliverable.
There would be no lag time between one dev checking in and everyone seeing the work, so that instead of having these really long, slow build processes where no one quite knew what would happen, you'd get to see instantly the changes and the impact of changes. And the build and test time would be proportional to how much code was changed, not proportional to how much code was left alone.
Finally, we encourage people in the company to, as part of their career development, move around. As long as everyone was locked on their own siloed systems, that was very difficult. So we wanted to get to a place where if you grew up in one part of the company and you wanted to go to another part of the company, the tools and process were familiar.
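The goal above, build and test time proportional to how much code changed, can be sketched as a walk over a reverse-dependency graph. This is a minimal illustration, not the system described in the talk: the module names and the `DEPENDENTS` map are hypothetical.

```python
from collections import deque

# Hypothetical reverse-dependency map: module -> modules that depend on it.
DEPENDENTS = {
    "xml_parser": ["billing", "search"],
    "billing": ["checkout"],
    "search": [],
    "checkout": [],
}


def affected_modules(changed, dependents):
    """Return the changed modules plus everything that transitively depends on them."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        module = queue.popleft()
        for dependent in dependents.get(module, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# A change to "billing" only needs billing and checkout rebuilt and retested;
# xml_parser and search are left alone.
print(sorted(affected_modules(["billing"], DEPENDENTS)))
```

The work done grows with the size of the affected set, not with the size of the whole code base.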
So let's take a look at what it looks like in practice. Ed, let's switch to the demo.
Ed Blankenship
Yeah.
Sam Guckenheimer
Talk to us about how we work.
Ed Blankenship
This is going to be a really fun process. Anyone who ever knows about showing live systems, this is our live engineering system that we use to build our products from day to day.
First, I'm going to just start—
Sam Guckenheimer
So I just want to riff on this for a minute.
Ed Blankenship
Please, yeah.
Sam Guckenheimer
Okay. We are on the internet. We are connected—
Ed Blankenship
It'll be fun.
Sam Guckenheimer
—to our production engineering system. I'm crossing fingers. A little bit later we'll actually watch a live demo.
Ed Blankenship
Okay, so we're not quite sure what you're going to see. This will be fun.
First, our team particularly is really large. We have about 70 different feature crews within the broader team, and so we need different views. This is our main view across all of the teams, but each of the individual teams, this happens to be one of our teams called Blueprint. They have their own different view. They can see—
Sam Guckenheimer
So hang on just a minute.
Ed Blankenship
Yeah.
Sam Guckenheimer
Let me riff on that. One of the issues that comes up all the time is, how do you deal with enterprise scale and team autonomy?
Ed Blankenship
Absolutely.
Sam Guckenheimer
So we just went from a roll-up dashboard of the pipeline down to a feature crew's dashboard, where this belongs to a group of, I think, 10 or 11.
Ed Blankenship
That's right. Each feature crew is about that big, and you talk about how individual feature teams' work accrues back up to broader initiatives that we're making. This is a view of their Kanban board. These happen to be individual stories and bugs, but they can map up all the way up to broader-level scenarios. But I'm going to hide that for a bit.
This is actually the live team's Kanban board. We don't know what's in there right now.
The nice thing about this is every team can decide how they want to do work, which agile practices they want to implement, et cetera. They can customize everything. Swim lanes, like if you're familiar with expedite lanes, this team happens to have one for customer support cases, live site incidents that—
Sam Guckenheimer
Yeah, if you notice that red one's blocked saying, "Wait for customer."
Ed Blankenship
Yep, and then this green one—
Sam Guckenheimer
It looks like Jet's having a problem.
Ed Blankenship
But you can organize the work in many different ways. All the teams do it a little bit differently. It's pretty nice to be able to do that.
The reason why I bring this one up is that at Microsoft, it's extremely important to make sure that you have traceability against all of the different artifacts in the system. So if you think about source code commits, work items, bugs, new features, test cases, deployments, builds, we've got linking across everything. That's particularly important when you start to want to see what is the delta between any different release that's going out or being impacted in some other part of the company.
We start right from here. Even from one of these bug work items, you can start source code operations right from different parts of the application. We'll talk a little bit about branching in a minute.
Now, each team also, we have tons of different builds. Like, tons. I just happened to find one that was kind of interesting. It's our ALM search team's build. It's a continuous integration build. You'll start to see that traceability start to show up. These are all the bugs and stories that were part of it.
Sam Guckenheimer
Wait, I don't know if everyone can read that. You're going kind of fast.
Ed Blankenship
No, I don't think so, yeah.
Sam Guckenheimer
So, yeah.
Ed Blankenship
I kind of don't want them to see all the new features either, so—
Sam Guckenheimer
Okay.
Ed Blankenship
We haven't disclosed them.
Sam Guckenheimer
Cool. So those little bars, where a few are red and most are green—
Ed Blankenship
That's right.
Sam Guckenheimer
—their height is how long the build took to run.
Ed Blankenship
That's right.
Sam Guckenheimer
Those are current builds, and then down there, you see associated work items. Those are the stories and bugs and so forth that actually make this up. And then on the right are the test results.
Ed Blankenship
Yeah. We've got test results, code coverage. What's nice is, since we build on every particular platform in Microsoft, we have to have an engineering system that can support all types of tests. We have a common test results store as well. Code coverage, even down to where this build might have been deployed, into which environments and which pipelines they happen to be used in.
This is really the start of all of that traceability. If I go and take a look, I can even start to break down from even our original dashboard. I can start to look at the different builds and what tests are in there. It's particularly nice when this build, the CI build, happens to have one failing test from today.
Sam Guckenheimer
Yeah, it looks like a security test—
Ed Blankenship
It is a security test.
Sam Guckenheimer
—that was failing.
Ed Blankenship
So hey, we have one failed security test out of our entire test suite. That's always good.
But it's got great information, stack trace, error messages, the logs. A bunch of our UI tests will actually put screenshots in here, so that's a nice way to do it. And then individual bugs that actually get filed right from the test failure, so you can start to get traceability right from there, too.
Finally, if I just open up one of the work items, whether this is a feature or a bug, what I really love is I usually start from there, as a product manager. I'm usually trying to see where this is in the development life cycle, and this just starts to pull information from other parts of the system to say, "Hey, here are some pull requests. Here are some branches. Here's some code commits." It'll also start to tell you where it starts to deploy, if that bug has actually been deployed or not.
So as a PM, all I need to really know is, "Hey, where are the features or bugs at, at any given time?"
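One way to picture the traceability described here is as a graph of links between artifacts, which you can walk from a work item to find related commits, builds, and deployments. The sketch below is only an assumption about the shape of such data; `TraceGraph` and the artifact IDs are hypothetical and do not reflect the product's actual data model.

```python
from collections import defaultdict


class TraceGraph:
    """Undirected links between artifacts, each identified as (kind, id)."""

    def __init__(self):
        self._links = defaultdict(set)

    def link(self, a, b):
        self._links[a].add(b)
        self._links[b].add(a)

    def related(self, start, kind):
        """Return all artifacts of `kind` reachable from `start` through links."""
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for neighbor in self._links[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        return sorted(n for n in seen if n[0] == kind and n != start)


graph = TraceGraph()
bug = ("work_item", "BUG-1")
graph.link(bug, ("commit", "abc123"))
graph.link(("commit", "abc123"), ("build", "CI-42"))
graph.link(("build", "CI-42"), ("deployment", "ring-0"))

# "Has this bug fix been deployed anywhere yet?"
print(graph.related(bug, "deployment"))
```

A real system would likely use directed, typed links so that a build containing many commits does not pull in unrelated work items; the undirected walk keeps the sketch short.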
Sam Guckenheimer
That's right.
Ed Blankenship
And so I'll cover pull requests in a second, but Sam, why don't you—
Sam Guckenheimer
Okay, good.
Ed Blankenship
—tell us a little bit about Git.
Sam Guckenheimer
Good. Before we switch back to the slides, I just want to comment on what Ed just showed you.
We were downstairs talking about security in the Lean Coffee session and the need to have traceability from your stories to your code, to your tests, to your deployments, your bugs back, your incidents back. This is how we do it. This is by no means the only way to do it or the only tool to do it, but this is how we do that traceability problem.
Let's switch back to the slides.
Now, what we do for code is that under Visual Studio Team Services, we are using Git. We're in the process of moving the whole company to Git. The benefit of that is giving everyone one master branch. Or making everyone use one master. We do use lightweight topic branches. So a bug will have a branch, come back with a pull request. Branch will get closed in a day.
Unlike the old days when you said, "I'm going to do something new, and I'll go off here and create a branch using centralized version control, and then I'll come back in three months, and it'll merge." And then you have all this merge debt, and everything's on the floor for a month.
These get collapsed immediately, within a day, within a few days. And you have tiny continuous merging. So the merging happens while the code is really fresh in your mind. It's not this long, what did I do? It's now, and everyone commits back to master through the CI build.
So let's take a look at that process and how we use pull requests.
Ed Blankenship
Yeah.
Sam Guckenheimer
Can we switch to the demo, please?
Ed Blankenship
Yeah, sure. Before we talk about that, I was going to mention one of my favorite stories is how bad Windows was for me in Git.
Sam Guckenheimer
Oh, my goodness. So Windows, can you imagine trying to clone a repo with 40 million files and bring it down to your own laptop?
The scale. There are some things about introducing Git that are really worth talking about.
One is start from the right. Get the release process working right first. You can deploy out from Git and have the pipeline work and get the benefits of the CI/CD pipeline, and then start moving back to the left if you've got an existing code base that's pretty monolithic and needs to be refactored. But you need to start refactoring toward a microservices model so that you have smaller repos, so that you have things that are independent and digestible before you go distributed—
Ed Blankenship
That's right.
Sam Guckenheimer
—with Git. Big learning.
Ed Blankenship
Yep. So instead of boring everyone by showing you what Git repos look like and branches and stuff, which are all fun and games, one particular thing that I have really enjoyed in our One Engineering System has been how we've implemented pull requests.
Sam Guckenheimer
Yeah.
Ed Blankenship
It is the main way that we start to get a first set of quality on a lot of the changes that are happening. Every change that goes through requires a pull request, and it's great to have another set of eyeballs on here.
I've done a little searching this morning about a cool pull request from the system. This one is interesting because I noticed Brian Harry, our vice president—
Sam Guckenheimer
Yeah, my boss, yeah.
Ed Blankenship
His big boss. Our vice president was on this pull request for some reason. So all fun, but it's got the normal things that you—
Sam Guckenheimer
Wait. You're going fast again.
Ed Blankenship
Absolutely.
Sam Guckenheimer
Let me make sure people in the back understand what we've got here.
Ed Blankenship
Please.
Sam Guckenheimer
So on the bottom, you see a review history.
Ed Blankenship
Yes.
Sam Guckenheimer
And who reviewed this pull request when and what the approvals were on this particular change. Then the reviewers are summarized over on the left-hand panel.
Ed Blankenship
That's right.
Sam Guckenheimer
So we have the idea that anyone has visibility into that history, and the review history happens on these tiny batches, on the pull requests, just on the change.
Ed Blankenship
Exactly.
Sam Guckenheimer
And you see the diffs like this of what those changes were as the reviewer.
Ed Blankenship
Then we can start to make comments. Threaded conversations, we can resolve them, all the typical things you would think about.
But I mentioned traceability is super important. We can start to make sure that things are linked together, but not only just a set of manual reviewing, we actually have a whole system that we've started to implement around making sure that certain branches stay at the highest quality.
For example, on our master branch, we have a set of branch policies that start to run as part of the pull requests. You can do things like making sure that a specific build is successful. That build can have a test pass as part of it. You can require that a certain number of people are—
Sam Guckenheimer
Yeah, but one of the great things we do is handling this at scale, I think.
Ed Blankenship
Absolutely.
Sam Guckenheimer
So go ahead.
Ed Blankenship
That was the one thing I was going to say is one of the hardest problems, especially on really large code bases, is finding the right teams or finding the right people who should be the reviewers.
Sam Guckenheimer
Yeah. So you're going to have one person review all the code?
Ed Blankenship
Yeah. Of course not. So we've got, and this is actually all of our mappings, where we say, "Hey, if any changes are in this part of the source tree, we want this team or this person to be a required reviewer," and it'll automatically add it.
So as soon as someone tries to change something in the billing code, it will go and grab the commerce team and require that reviewer before it can get merged up to master. So it's actually been a really good helping—
Sam Guckenheimer
Yes.
Ed Blankenship
—thing for us.
Sam Guckenheimer
So we love pull request policies. We love the ability to have pull request policies automatically trigger reviews, and those reviews come from the right set of reviewers based on the corresponding directory group.
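The policy behavior described in this part of the demo (path-based required reviewers, a successful build, a minimum number of approvals) can be sketched roughly as follows. This is not the Visual Studio Team Services API; `REQUIRED_REVIEWERS`, `PullRequest`, and `can_merge` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Hypothetical mapping from source-tree prefixes to required reviewer groups.
REQUIRED_REVIEWERS = {
    "src/billing": "commerce-team",
    "src/search": "search-team",
}


@dataclass
class PullRequest:
    changed_files: list
    approvals: set = field(default_factory=set)
    build_succeeded: bool = False


def required_reviewers(changed_files, mapping):
    """Collect the reviewer groups whose part of the tree is touched."""
    required = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        for prefix, reviewer in mapping.items():
            prefix_parts = PurePosixPath(prefix).parts
            # Compare whole path segments so "src/billing_v2" does not match "src/billing".
            if parts[:len(prefix_parts)] == prefix_parts:
                required.add(reviewer)
    return required


def can_merge(pr, mapping, min_approvals=1):
    """Allow the merge only if the build passed and every required reviewer approved."""
    missing = required_reviewers(pr.changed_files, mapping) - pr.approvals
    return pr.build_succeeded and not missing and len(pr.approvals) >= min_approvals


pr = PullRequest(["src/billing/invoice.py"], approvals={"alice"}, build_succeeded=True)
print(can_merge(pr, REQUIRED_REVIEWERS))   # blocked: commerce-team has not approved
pr.approvals.add("commerce-team")
print(can_merge(pr, REQUIRED_REVIEWERS))   # all policies satisfied
```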
Ed Blankenship
That's right.
Sam Guckenheimer
That happens very nicely. Okay. We'll switch back to slides.
Ed Blankenship
Yeah, let's do that.
Sam Guckenheimer
So slides, please.
What does all of this enable? Nicole Forsgren reminds me that it's all about culture and metrics. So what are our culture and metrics?
You can see at the top of this a little snippet cut out from our engineering dashboard, and you can see that we're tracking all of these fine-grained metrics on the engineering process. There are really four categories.
Live site health. We're very big on live site culture. Live site culture means that you work to keep the customers up, with time to detect, time to mitigate, and then you work to do a root cause repair so that the problem doesn't happen again. We call those incident prevention items, and you need to close them in a sprint typically, or two sprints.
We watch for live site problems that are aging, in other words, aren't getting closed, map against customer support metrics, map against SLAs, and we track those by individual customer. That's about live site health.
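As a rough sketch of the two live site measures named above, time to detect and time to mitigate can be computed from incident timestamps. The `Incident` record and the sample incidents are made up for illustration, and measuring mitigation from the start of impact (rather than from detection) is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when customer impact began
    detected: datetime   # when monitoring or a person noticed
    mitigated: datetime  # when customers were working again


def time_to_detect(incident):
    return incident.detected - incident.started


def time_to_mitigate(incident):
    # Assumption: measured from the start of impact; some teams measure from detection.
    return incident.mitigated - incident.started


def mean_duration(durations):
    return sum(durations, timedelta()) / len(durations)


# Made-up sample incidents, not real data.
incidents = [
    Incident(datetime(2017, 1, 9, 10, 0), datetime(2017, 1, 9, 10, 5), datetime(2017, 1, 9, 10, 45)),
    Incident(datetime(2017, 2, 3, 14, 0), datetime(2017, 2, 3, 14, 15), datetime(2017, 2, 3, 15, 15)),
]
print(mean_duration([time_to_detect(i) for i in incidents]))
print(mean_duration([time_to_mitigate(i) for i in incidents]))
```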
Then we look at the velocity, which is the left-to-right movement. How long does it take to build, to test, to deploy, and so forth? And how long does it take us to improve, to implement learnings?
For engineering, we think of a set of governors. We used to look at these centrally a lot. Now it's basically up to the teams at a lower level. Things like bug cap per engineer: if your bug cap is four and you have eight engineers, then once you get to 32 bugs, you can't do anything new. You need to pay down your bugs before you go further. That keeps the debt out.
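The bug cap arithmetic is simple enough to write down. In this minimal sketch the function names are hypothetical; the numbers are the ones from the example in the talk.

```python
def bug_cap(engineers, cap_per_engineer):
    """Team-wide cap: the per-engineer cap times the number of engineers."""
    return engineers * cap_per_engineer


def can_start_new_work(open_bugs, engineers, cap_per_engineer):
    """New feature work is allowed only while the team is under its cap."""
    return open_bugs < bug_cap(engineers, cap_per_engineer)


print(bug_cap(8, 4))                 # 8 engineers * 4 bugs each
print(can_start_new_work(31, 8, 4))  # under the cap
print(can_start_new_work(32, 8, 4))  # at the cap: pay down bugs first
```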
Ed Blankenship
Bug jail.
Sam Guckenheimer
Yeah. Called bug jail, bug hell, whatever.
Aging bugs: how long are they around? What's the test pass rate and coverage? You saw that on the initial dashboard. Everyone sees that. Are there any failures? Are there any obvious reds in there?
And then, of course, we track usage, the outside in. How is this doing for customers? How well are we doing at acquiring new customers and engaging them? How dedicated are they becoming? Is anyone churning out? And then in a fine-grained way, we can look at telemetry as well.
Now, we do that by means of a common telemetry pipeline. We have a pipeline. Again, this is something we use internally, and then we offer most of it to customers through Application Insights Analytics.
That pipeline, to give you an idea of scale across Azure in the first of these charts, is currently ingesting about 1.6 petabytes of data a day, 1,600 terabytes.
Ed Blankenship
A day.
Sam Guckenheimer
A day. And you can see the month-to-month growth here. It gives us all of the data on the running services. We can use it for this very high-volume ingestion, for fast queries over the datasets, and use it for text search.
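To put the ingestion figure in perspective, the quoted daily volume can be converted to a sustained rate. This is just arithmetic on the number from the talk, assuming decimal units (1 petabyte = 1,000 terabytes) and an even spread across the day, which real traffic would not have.

```python
TERABYTE = 10**12   # decimal units, matching "1.6 petabytes ... 1,600 terabytes"
PETABYTE = 10**15
SECONDS_PER_DAY = 24 * 60 * 60

daily_bytes = 16 * PETABYTE // 10   # 1.6 PB per day
daily_terabytes = daily_bytes / TERABYTE
gigabytes_per_second = daily_bytes / SECONDS_PER_DAY / 10**9

# Roughly 18.5 GB of telemetry every second, if spread evenly.
print(daily_terabytes, round(gigabytes_per_second, 1))
```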
So let's switch back to the demo, look at how we release.
Ed Blankenship
Yeah. I was just going to mention, one of the best things that, of course, we want to do, especially as a SaaS-based service, is to make sure that we get changes through the pipeline as quickly as possible.
Just to give you a sense of scale that we have, we have more than four million customers now on our SaaS service, in addition to all of the great engineers we have at Microsoft using the service. Getting it out there quickly has been an interesting challenge for us.
But, just going to pop up. I, unfortunately, don't have a live production one running right now, but I did find one that is running, so I'm going to show you.
Sam Guckenheimer
Okay.
Ed Blankenship
This is our release pipeline for production. We have five different rings that help us with exposure control, as you might imagine. Ring zero is our canary instance, which is actually the account that this is in. We give ourselves the first set of pain in case there's anything wrong.
Sam Guckenheimer
Yeah. So let's explain what a ring means. For practical purposes, the fine-grained definition is that these are one or more Azure scale units.
Ed Blankenship
Right.
Sam Guckenheimer
Everything's deployed on Azure. You can think of these as data centers. By the time you get through ring five, you've gotten through 10 data centers worldwide.
Ed Blankenship
That's right. We actually do have a prod update going on right now. We've got all sorts of traceability and logging. We're going to see if the internet will work, and then we'll start to get some live output from one of the builds that are—
Sam Guckenheimer
Ring five, I think, is Australia and Brazil.
Ed Blankenship
That's right. And ring four here. Where is it? Oh, this is the fifth ring. That's right.
Sam Guckenheimer
Yeah.
Ed Blankenship
Start at zero. It's off-by-one error.
Sam Guckenheimer
Right. Okay.
Ed Blankenship
Always fun with those. Anyhow, normally you'd start to see all of the live updates. But once a deployment has actually completed, some releases go to the first ring, some releases go all the way. We can decide whenever we want for that.
But this is where the traceability ends and meets together as the glue. We've got our builds. Which builds have gone out? What branches did they come from? Who approved different releases? What work items were part of there?
So I think this is the part that's the most interesting, is you can start to get a full list of all the work items that are starting to be a part of this and compare them with previous releases at any time, as well as even down to the individual commits or the tests that are being run as part of this.
This is kind of the glue. We're constantly monitoring this and seeing where things are going. But we're able to ship very quickly out to many environments that support over four million customers.
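Comparing the contents of one release against a previous one comes down to set differences over the linked artifacts. The sketch below assumes each release is recorded as a mapping from artifact kind to a set of IDs; the release contents and the `release_delta` helper are hypothetical.

```python
def release_delta(previous, current):
    """Compare two releases, each a dict mapping artifact kind to a set of IDs."""
    delta = {}
    for kind in sorted(set(previous) | set(current)):
        before = previous.get(kind, set())
        after = current.get(kind, set())
        delta[kind] = {
            "added": sorted(after - before),
            "removed": sorted(before - after),
        }
    return delta


# Made-up release contents for illustration.
release_41 = {"work_items": {"BUG-1", "STORY-7"}, "commits": {"abc123"}}
release_42 = {"work_items": {"BUG-1", "STORY-7", "STORY-9"}, "commits": {"abc123", "def456"}}

print(release_delta(release_41, release_42))
```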
Sam Guckenheimer
It's very cool. You can see the traceability all the way to production.
Ed Blankenship
There you go.
Sam Guckenheimer
And I think ring four is waiting for an approval. The deployment's automatic—
Ed Blankenship
I think so, yeah.
Sam Guckenheimer
—but you do need an approval to say, "Okay, go ahead."
Ed Blankenship
Yep.
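The ring-by-ring rollout with an approval before each step can be sketched as a loop that stops at the first ring lacking approval. The ring names, the callbacks, and the health check are assumptions for illustration; the talk only states that there are five rings starting at zero, that deployment is automatic, and that an approval is needed to proceed.

```python
# Hypothetical ring names; ring0 is the canary.
RINGS = ["ring0", "ring1", "ring2", "ring3", "ring4"]


def roll_out(build, rings, deploy, is_healthy, is_approved):
    """Deploy ring by ring, stopping when approval is missing or a ring is unhealthy."""
    completed = []
    for ring in rings:
        if not is_approved(build, ring):
            break  # wait for someone to say "okay, go ahead"
        deploy(build, ring)
        if not is_healthy(ring):
            break  # assumption: limit exposure by halting on a bad ring
        completed.append(ring)
    return completed


deployed = []
done = roll_out(
    "build-1",
    RINGS,
    deploy=lambda build, ring: deployed.append(ring),
    is_healthy=lambda ring: True,
    is_approved=lambda build, ring: ring != "ring4",  # last ring still awaiting approval
)
print(done)
```

Because each ring is gated separately, a release can stop after the canary or go all the way, which matches the "some releases go to the first ring, some releases go all the way" behavior mentioned earlier.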
Sam Guckenheimer
Okay. So let's switch back to slides.
All of that allows us to create a level of transparency. For example, when there are live site incidents that lead to outages that affect a number of customers, we blog about them, and we tell you exactly what happened.
For example, the one with the colorful charts, which were made possible by that telemetry, goes through what happened with the differences in query optimization between SQL Server 2014 and SQL Server 2016, how we discovered the impact on memory usage, and what that surprise was. That led to new process safeguards and changes in queries so that we could guard memory differently.
All of that's described openly to anyone who wants to get on the blog.
Ed Blankenship
That's right.
Sam Guckenheimer
Let's switch to the demo one more time because I want to—
Ed Blankenship
Yeah, I was going to say one thing that a lot of people seem to be surprised at sometimes, but is really important for a single engineering system at Microsoft, is the ability to both build and release any type of application.
You can imagine we have apps on iOS and Android and deploying out to Linux environments. We've got it all pretty much in the company.
Sam Guckenheimer
Right.
Ed Blankenship
We've had to build a build and deployment pipeline system that really does enable you to do everything, from standard server applications all the way to deploying to the production Apple App Store for your iOS apps, and to build pretty much everything you can imagine under the sun. That was a really important thing that we needed to make sure we did.
Sam Guckenheimer
Yeah. It's one engineering system. We have to do a lot of development for Linux now. We do a lot of development for iOS, for Android. Ed's driving a Mac, so—
Ed Blankenship
Ah, yes.
Sam Guckenheimer
—all of this needs to be supported here, and we have the corresponding tasks to do that.
Okay. Let's finish up on the slides.
So is this working? Here's a picture of one of the team rooms in our building where one of the feature crews is. They are responsible for one of the dashboards we showed you or the work shown there. When they want to do their daily standup, they walk to one side of the room, do their standup. They want to talk during the day, they swivel their chairs. And it's just a room for them. It's not a huge open space like this. It's just for them, and the little focus rooms on the side.
Ed Blankenship
I love it. You can even move the little desks around however you want.
Sam Guckenheimer
Yeah. They arrange the desks the way they want. The desks typically can be used for sitting or standing. No one here is doing standing. But it's totally up to them.
Now, we've tried to take that same agility into the org. Around every 18 months when we say, "Okay, you can change teams," what happens is that we take the team leads into a room like this.
Each pair of leads, the engineering lead and the product owner, stand up and say, "Hey, we're called Blueprint, and we work on this stuff in Agile, and yada yada yada, and here's what we're looking for and here's what we do," and so forth. Then the next team comes up and says, "We're about version control core, and here's what we do," and so forth.
Ed Blankenship
And they usually say, "Hey, my team's better."
Sam Guckenheimer
Right. Yeah. And then everyone boasts about how good they are.
Ed Blankenship
Of course.
Sam Guckenheimer
Then everyone in the room, we don't have them here, but you had them in the coffee session, takes a Post-it or three Post-its and writes down first, second, third choice on where to work.
Then the managers or team leads get those Post-its and then try to give people their preferences of where to work. We're hitting about 95% first choice right now.
Ed Blankenship
It's really an amazing thing in so many different ways.
Sam Guckenheimer
Yeah. So people get to decide where they want to work, to some extent. It's obviously within site, geographic site, and within organizational area, but you get to pick your crew.
And how is it working across the org? Well, this is telemetry from our One Engineering System. We're up to about 62,000. We're over 62,000 internal users right now. It's grown about 4X in two years. It obviously can't keep going that way. These are internal users.
What we haven't done yet, need to move to next, is measuring satisfaction with Net Promoter Score internally and so forth. But the signs are pretty good.
So we're moving from a place where no reuse would go unpunished to one where the assist gets rewarded.
Thank you very much.
Ed Blankenship
Thank you.