Moving to 1ES at Microsoft

London 2017

Sam Guckenheimer

Product Owner, Visual Studio Team Services · Microsoft

Full transcript

Sam Guckenheimer

We at Microsoft were well known for having an organization that looked something like this.

This was a parody from a set of org charts that were drawn about five years ago by a Google engineer. There are great ones of Google and Facebook and Apple and Amazon. But this was really our structure.

We had the idea that every division should have its own engineering system. This had been in place for about 20 years. The theory was, well, everyone knows what's right for them. If we let them make their decisions, they will be able to go faster. They'll be able to do what's right for them. They'll be able to focus on what's really important for Windows or Office or whatever.

And the reality was that over 20 years, they built up cruft. They went slower. There was a lot of duplication. Someone wanted an XML parser. Didn't matter that we already had 300; we wrote 301. Someone wanted to use jQuery. Didn't matter that we already had 60 versions in production; we used the 61st, and so on. It was a ton of duplication.

And the unintended consequence of that notion of letting everyone do their own thing was that no reuse would go unpunished. So, the philosophy was: take no dependencies.

We're doing it like this.

Well, about five years ago, less than that, three or four years ago, when Satya became CEO, Satya Nadella, there was a decision at senior leadership: no, we're going to do one thing. We are going to produce a world-class engineering system. This is going to work for us in Microsoft. It's going to allow us to do great engineering, and we are going to follow a principle of first party equals third party. That is, what works for us, we will productize and make available to our external customers as well.

So, I have the privilege of working on Visual Studio Team Services, which is the basis of this One Engineering System.

Now, I'm going to start by just giving you a look at how we work.

Sorry, this is a touch screen, which is a little sensitive.

So this is a dashboard from, if you read the URL, mseng.visualstudio.com, which is our instance of Visual Studio Team Services in the engineering system. And I'll just give you a feel for how it works.

Well, we're talking about DevOps. DevOps starts with a really good hardened pipeline. In that hardened pipeline, you need to be sure your code's good. So we do things with a pull request workflow using Git for Team Services. So that means all the code gets committed to one master, but before it gets committed to one master, it goes through pull requests.

And this little widget up here is showing me the pull request builds. And if I click on it, I'll see, huh, we've got right now that one. If I look at how many of these are running, you'll see there are a bunch that are currently in progress. You'll see that this is about the last hour, and my American colleagues haven't woken up yet, so it's a slow time.

If I look at one of these that has passed, you'll see that with that pull request, there are about 53,000 automated tests run. Those run in under 10 minutes, so that they give quick feedback to the submitter of the pull request. The whole thing took 22.9 minutes.

If I go and explore one of these, let's take... I don't know what any of these are. Let's take this one from Priyanka. You will see that for each of these submitted pull requests, I have the associated code changes here and the linked work items. Here, a bug, which I'll open.

So, in this particular case, this was a bug fix which had some associated changes. That bug fix went through the identification of this work, had a topic branch created for it. I'll show you what that means in a minute.

And that pull request went through a set of policies to ensure that the code is clean. So three reviewers approved. All the code has to be reviewed for ISO 27001 and SOC 2 and all the other compliances. The comments have been resolved. These optional things are tests that will happen after CI. I'll show you that. I have the history of the changes and approvals, and for each of these changes, can see what specific code is being changed, typically a few lines at a time.

So the idea is that everything happens in these tiny batches. People submit these changes, you're reviewing a few lines at a time, you're approving the pull request, the tests are then running. If the tests run, the builds run, you get the approvals, the security test runs, your pull request goes through.
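The gate described above can be sketched roughly as follows. This is only an illustration in Python, not how Team Services implements branch policies: the `PullRequest` fields and the `can_complete` function are hypothetical names, and the default of three approvals is an assumption taken from the pull request shown in the demo.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    """Hypothetical summary of a pull request's policy state."""
    approvals: int
    unresolved_comments: int
    build_passed: bool
    tests_passed: bool
    security_scan_passed: bool


def can_complete(pr: PullRequest, required_approvals: int = 3) -> bool:
    """Return True only when every required policy is satisfied."""
    return (
        pr.approvals >= required_approvals
        and pr.unresolved_comments == 0
        and pr.build_passed
        and pr.tests_passed
        and pr.security_scan_passed
    )


# Illustrative data: one request that satisfies every policy,
# and one that still has two open review comments.
ready = PullRequest(3, 0, True, True, True)
blocked = PullRequest(3, 2, True, True, True)
print(can_complete(ready))
print(can_complete(blocked))
```

Only a request that satisfies every check would move on to the continuous integration build; anything else stays in review.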

Now, if that pull request runs green (and they don't all do that; you see a bunch of red up there), it will automatically go from here to here, the continuous integration flow. And most of these do run green because of the level of checks before they get to continuous integration.

And to give you an idea of what the CI builds are like, these complete in about 20 minutes. If I go through the history of this CI build, we'll see that there are probably several in the last hour. Not so many. Two in the last hour, three the hour before, what have you. Like I said, this is a slow time of day because the U.S. hasn't woken up yet. But this screen is basically the last, what, nine hours of CI.

And from CI, we go to another set of tests, which are in pre-production environments. So back to here. I'll show you.

So these two displays are showing me the state of what we call... I will back up. In the pull request flow, you saw a bunch of tests run. These are what we call L0s and L1s. They run against the binaries. They run against binaries and dependencies. Here we're running in production-realistic environments, what we call L2s. These use fake data, fake identities as test data.

And most of them run green. Occasionally, you have a failure. Occasionally, you have one of these yellow things. The yellows are flaky tests. So those are tests where even though you'll see 100%, we think a test is unreliable because it's been running alternately green and red. So that requires some inspection.
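One simple way to flag a test that has been "running alternately green and red" is to count how often its result flips between consecutive runs. The sketch below is a hypothetical heuristic, not the rule the dashboard actually uses; the function names and the `min_flips` threshold are assumptions made for illustration.

```python
def count_flips(outcomes):
    """Count how often consecutive runs changed between pass and fail."""
    return sum(1 for prev, cur in zip(outcomes, outcomes[1:]) if prev != cur)


def is_flaky(outcomes, min_flips=2):
    """Treat a test as flaky when its recent history flips at least min_flips times.

    `outcomes` is a list of booleans, oldest run first (True means passed).
    """
    return count_flips(outcomes) >= min_flips


# The latest run passed, but the history alternates, so it needs inspection.
print(is_flaky([True, False, True, True]))
# A single change from green to red looks like a real regression, not flakiness.
print(is_flaky([True, True, False, False]))
```

This also shows why a test can appear as 100% on the current run and still be marked yellow: the flag depends on the history, not on the latest result.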

If I look at a failure, you'll have the same idea. You can trace through from the failure to the specific suspect test, and from there to the specific suspect bug. So this particular case, 782 tests ran, of which one failed, for which we have an error message, a stack trace, the corresponding attachments, the corresponding bug, and so forth.

That's the flow.

And when you get beyond those pre-production tests, of course, you get into production deployment. So if I look up here, I will see the state of production deployment. And this is organized in what we call rings. Rings are groups of data centers.

We have six of them for our service, these six columns. A canary, where we work. Ring one is the data center with external customers, but the smallest number. Ring two, data center with the largest number of external customers. Ring three, data centers with the highest latency to reach, in this case, Europe and Australia. And ring four and four A are in parallel, everything else.

And the last deployments were about 19 hours ago. I can look in, I can see what's going on in the particular data centers. If there's a failure, I can click in and see what specifically failed in that production deployment. And these are happening, as you can see, a few times a day.
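The ring order can be sketched as a staged rollout. This is a minimal illustration, assuming that a failure in one ring stops the rollout before later rings (the talk does not spell that out); the ring labels and the `deploy` callable are hypothetical, and rings four and four A are run one after the other here rather than in parallel.

```python
# Ring order as described in the talk; the names are illustrative labels.
STAGES = [
    ["canary"],
    ["ring1"],
    ["ring2"],
    ["ring3"],
    ["ring4", "ring4a"],  # described as running in parallel
]


def roll_out(deploy, stages=STAGES):
    """Deploy stage by stage and return the rings that succeeded.

    `deploy` is a hypothetical callable that takes a ring name and
    returns True on success. If any ring in a stage fails, later
    stages are not attempted.
    """
    completed = []
    for stage in stages:
        results = {ring: deploy(ring) for ring in stage}
        completed.extend(ring for ring, ok in results.items() if ok)
        if not all(results.values()):
            break
    return completed


print(roll_out(lambda ring: True))
print(roll_out(lambda ring: ring != "ring2"))
```

Starting with the canary and the smallest customer population limits how many users a bad deployment can reach before it is noticed.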

That's how we work.

So that's live. That's the engineering system, and I was connecting live, and you were seeing data coming out of our dashboards from the service on Azure.

Now, the goal of moving to this kind of engineering system was to get out of that world where no reuse would go unpunished, to one where the source code for everything in the company would be visible to everyone, and anyone could contribute on a shared source model. Where anyone could offer improvements. Where our company IP would be reusable and built as reusable components with contracts and microservices. Where anybody could find what they wanted. Where you got rewarded for producing things that were reusable and could be shared. Where there was no lag from the time you made a change to the rest of the company being able to get to it. And where build and test time was proportional to the amount of change, so everything could be continuous.

And then, by having one engineering system, you weren't stuck working in one particular place if you wanted to move around the company, because the tools would be familiar.

I did the demo. Now, an interesting thing happened. Everything I showed you was working off of Git and the pull request flow on Git. Git didn't scale for us.

Git is well known for being used for Linux. We had a problem with Windows that it was about 400 times the size in its largest repo, compared to Linux. And we went through a bunch of things. We'll refactor it. So we'll spend two years refactoring Windows. For what benefit? So we can use Git? That doesn't sound like a very good idea.

Then we tried a large file system on Git. That didn't quite work. We ended up creating something called the Git Virtual File System, which we've open-sourced on GitHub now, and are putting through the community process.

The kinds of improvements we've gotten in performance are across the board, around 250 to 300x. So a Git clone of the Windows repo took, in the past, using Git, 12 hours, if it worked. Now, if you had any hiccup in network connectivity, your laptop went into sleep mode, or something burped in the wireless, or what have you, it just didn't work. It failed. You started over. That's now about two and a half minutes. Checkout went from three hours to 30 seconds, and so forth.
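As a quick check of the arithmetic, the two before-and-after timings quoted above work out as follows. The `speedup` helper is just an illustrative name.

```python
def speedup(before_seconds, after_seconds):
    """Ratio of the old duration to the new one."""
    return before_seconds / after_seconds


clone = speedup(12 * 3600, 2.5 * 60)   # 12 hours down to 2.5 minutes
checkout = speedup(3 * 3600, 30)       # 3 hours down to 30 seconds
print(f"clone: {clone:.0f}x, checkout: {checkout:.0f}x")
# prints "clone: 288x, checkout: 360x"
```

The clone figure falls inside the quoted 250 to 300x range, and the checkout figure comes out somewhat above it.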

The way we did this is we did the same thing that you see with things like photo sharing sites, where you have the equivalent of thumbnails when you look at all the pictures that come up on the web. Those big picture files aren't actually downloaded to your machine until you click on them. But you still see all of the directory. Same idea here with the Git Virtual File System.
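The thumbnail analogy can be illustrated with a toy model in which the whole directory listing is visible but file contents are fetched only on first access. This is a conceptual sketch, not the Git Virtual File System implementation; the `LazyTree` class and its methods are hypothetical, and a plain dictionary stands in for the server.

```python
class LazyTree:
    """Toy model of a virtualized working tree."""

    def __init__(self, remote):
        # `remote` maps path -> file contents and stands in for the server.
        self._remote = remote
        self._local = {}  # only files that have actually been opened

    def list_paths(self):
        """The full listing is visible without downloading any contents."""
        return sorted(self._remote)

    def read(self, path):
        """Fetch contents on first access, then serve the local copy."""
        if path not in self._local:
            self._local[path] = self._remote[path]
        return self._local[path]

    def hydrated(self):
        """Paths whose contents have been downloaded so far."""
        return sorted(self._local)


tree = LazyTree({"kernel/a.c": "int a;", "shell/b.c": "int b;"})
print(tree.list_paths())   # every path is listed
print(tree.hydrated())     # nothing downloaded yet
tree.read("kernel/a.c")
print(tree.hydrated())     # only the file that was opened
```

The cost of a clone or checkout then depends on the files a developer touches rather than on the size of the whole repo.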

And we've now released that. We have 90% of the Windows org, many thousands of engineers now using this, and it took about 18 hours to cut over on a weekend.

So I showed you this in practice. One other thing I want to show you, which I didn't do in this demo, is I just wanted to shout out to Dominica for her great talk and say, what does a dashboard look like? How do you plan work?

Well, if I take a look at a team dashboard, I will see status for a team. You'll see the team members up here. Let me do a slightly better one. Everyone has their own dashboard, so let me get you one that I like a little more.

So you see team members up here. You see a cumulative flow chart up here. These are the last six sprints, tracking work in process. Dominica was talking about how that really is the way you keep things flowing, and they're tracking down here lead time and cycle time across their work.
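For reference, the two measures can be computed as below, assuming the common definitions: lead time runs from when a work item is created until it is done, and cycle time runs from when work on it starts until it is done. The function names and the dates are made up for illustration.

```python
from datetime import date


def lead_time_days(created, done):
    """Days from the item being created to being done."""
    return (done - created).days


def cycle_time_days(started, done):
    """Days from work starting on the item to being done."""
    return (done - started).days


# Hypothetical work item: created, picked up a week later, finished four days after that.
created = date(2017, 5, 1)
started = date(2017, 5, 8)
done = date(2017, 5, 12)
print(lead_time_days(created, done), cycle_time_days(started, done))
```

The gap between the two numbers is the time an item spent waiting in the backlog before anyone picked it up.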

If you look at how they actually are planning their work, it is on a Kanban board similar to what Dominica showed you in her sketches. And if you look up here: product backlog, things that haven't been started yet. A sprint backlog, what's been taken into this sprint. Active work, review, bug bash, done.

If you look up on top, you'll see on active work, 16 in red of 12. That's a WIP limit, going over a WIP limit. So there's a warning there: "Hey, you've got a WIP limit of 12. You're at 16. Something's wrong."
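The warning on the column amounts to a simple comparison of the card count against the limit. A minimal sketch, with `wip_status` as a hypothetical helper and the column name and numbers taken from the board in the demo:

```python
def wip_status(column, count, limit):
    """Describe a column's load and flag it when it exceeds its WIP limit."""
    if count > limit:
        return f"{column}: {count}/{limit} (over WIP limit by {count - limit})"
    return f"{column}: {count}/{limit}"


print(wip_status("Active", 16, 12))
# prints "Active: 16/12 (over WIP limit by 4)"
```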

You'll also notice as I go down here, there are swim lanes. So up top is called DRI. That's Designated Responsible Individual. In other words, the work that is done by the person handling live site issues. So this is intentionally a swim lane for unplanned work, reactive work.

And then down here are planned work columns. One says rules in UX. One says TFS import, TFS being the on-prem version, process sharing, and so forth. And you'll see work moves across the board, and then when it's done, you have the same kind of traceability.

And if I want to look not just at the view of what's going on in the current sprint here, but across sprints, I can say, "Okay, let me see what is happening over many sprints and many teams." And I see a little marker for...

Sorry. My screen is very reactive. I'm not quite sure what I just clicked. Let me go back. Plans.

I see a little marker, this green guy here, for today. I'm sorry, blue guy for today. This is roughly a planning horizon. This was a marker for our Build conference, and it goes forwards and backwards, and this is at the higher level of the backlog of features. And I can, of course, expand any of these teams to see what's really the detail in them.

So we have the granular Kanban board where the team's working. We have the aggregate view where you see the roll-up, and that's how we work.

If I'm on the Kanban board and I want to start work on something, I can directly on the card say, "Hey, I'm going to work on this bug." And here I can, directly on the board, just to make this... This won't get... See if this... Create a new branch in Git as a topic branch for this bug, which will then move forward. And then that branch will be squashed, closed as the work is completed so that we don't accumulate merge debt.

Okay.

So I'm conscious of time.

So the result of that is that we end up with a source tree that looks like this. Very lean, because we're squashing all these branches. But all the work is auditable. We have the full audit history. Code is fresh in your mind and we're using topic branches.

A word on metrics. This came up in yesterday's workshop session. We track four types of metrics. We track how we're doing on live site. That is everything related to qualities of service, availability, performance, time to mitigate, and so forth. Velocity, how quickly we get from idea or code change to production. Engineering governors of technical debt, so how many bugs are there open per engineer and so forth, which we use to control unplanned work from arising. And then everything from the outside in, the usage.

We're doing this with a common telemetry pipeline that we also make available to our customers as Application Insights. To give you an idea of the volume, it's about two and a half petabytes a day right now across Azure.

And if we have issues, we do a root cause analysis, and we'll blog about it. So we try to create transparency on everything.

Is this working?

Well, this is a team room, where a feature crew works. This is how feature crews get assigned. People put up stickies and say, "I want to work here." And they get first, second, or third choice. And we do about 90% first choice.

We do a One Engineering System day for the company where we say, "Here's what's going on." This is way oversubscribed.

And we measure NSAT on everything. So we're now at about 68,000 engineers using this. It's continuing to grow. That's been about 4x in two years.

And I think I'm at time. Am I at time?

Three?

I'm happy to take questions if I'm not.

Three minutes. Three minutes. So questions.

Come on. Really?

John.

Q&A

Q: Yeah. The virtual Git system you created.

A: Yeah.

Q: It seems like it's... I know you're Microsoft and you figured this out, but it's counterintuitive to the whole model of Git, of everybody having their repo on their laptop, on their environment, and knowing that that's what makes that whole system work. Now you've got this middleman with original cache, it looks like. It just seems like extremely complex and, of course, a lot of development to keep.

A: Well, actually, it's not. It's working very well. It lets us take advantage of all the Git workflows, like pull requests, without having the burden of saying, "Oh, you can't start from there."

So, many teams have been working prior to this, my own included, but now those workflows are possible for teams that are starting with large code bases as well. So, that's been very successful.

Q: Is there any guidance of what you'd say a large code base was? What feel, obviously your Windows code base.

A: Windows' largest repo, this is one of about 20 repos they have, is 270 gigabytes. That includes a lot of media files and things like that. But...

Q: And that was unworkable.

A: Right. Large probably means over 10 gigabytes; beyond that, you wouldn't use straight Git without hitting lots of performance issues.

Go ahead.

Q: Can you tell a bit about how this new way of working impacted or affected your marketing and sales organization within Microsoft?

A: How did it affect marketing and sales? That's a great question.

So we are continuously releasing. We did a little exercise and figured that on our team, for example, we are releasing at about 4x the rate of value that we were three years ago. And we put this out in these three-weekly sprints. Actually, we deploy continuously, but we plan three-weekly.

And we have this continual debate about, oh, should X be held for a marketing event? And in general, we don't, although we do claim credit at marketing events.

So, for example, we had our Build conference a month ago, and I talked about all these announcements that we were doing at Build. In fact, every one of the announcements had been around at least in preview for some time. So it's a different way of working.

I think Verity's giving me the hook. Is that correct? Am I done? One more question? One more question. Okay. Anyone?

All right, I got the hook. Thank you very much. I'll be around the rest of the day.