Evolving Windows: This Journey to DevOps

Las Vegas 2018

Principal Program Manager, Windows Engineering Systems · Microsoft

An overview of the journey Windows has been on to transform its process, tools, and culture to a DevOps model.

Catherine Kamerling is a Principal Program Manager in the Windows Engineering Systems team, and manages the Windows Engineering work management team. Her team oversees the largest VSTS account in the world and uses DevOps to listen, experiment, and engage with Windows developers to improve their productivity.

Full transcript

The complete talk, organized by section.

Host Intro (Sam Guckenheimer)

I'm Sam Guckenheimer from Azure DevOps. I'm Catherine Kamerling from the Windows Engineering System team.

Did any of you see my talk on Monday with Dylan Smith about... Great. Okay. So about half the room. So that was one experience report about moving to a SaaS, and we showed you how we worked with that.

Gene's been asking for a long time, "Can we hear about Windows? I want to hear about Windows. If Windows could move to DevOps, then anyone can." So, I brought someone who knows the story. And you can hear about doing this at a huge scale. Catherine.

Catherine Kamerling

Great. So thank you everyone for coming and taking the time to listen to our story. I thought what we would start with is to go a little bit in the way back time machine and go back to Microsoft in 2007.

In January of that year, Vista was released after a much needed time release. Remember, XP was in 2001. So we, after fits and starts, had eventually released Vista to the marketplace with much bated breath. I was a brand new PM at the time, joining the company, and my job was to go around to all of the areas around the world and to get the privilege of telling the country managers that they were not going to get their bonuses that year for the deployment of Vista.

And the reason why is when you take a look at it, I don't know if you remember any of the predictions, but the predictions were the deployment of Vista was going to be around 8 to 10%, and was going to re-energize the PC unit shipment base. And a lot of the country managers had talked to their employees, who had talked to all of their sales team and felt that this number was on par.

One of the challenges that we'll talk about is that we didn't really have good telemetry at that time, and so my job was to go around and scan, through a survey process, people's machines across the world to understand what was really deployed. And the reality was the deployment was 1%, vastly different than the 8% number that we had hoped and that a lot of our goals had been measured against.

And as you might expect, we were just stunned. We had no idea why this was happening. And a lot of that was because, at the end of the day, there were three big challenges that we just were not paying attention to, looking at the unconscious bias of our market position of where we were at the time.

The first one was telemetry. We just did not have the data to understand what was going on in the market. We relied primarily on third-party providers who provided that information that they had and would provide it back to us.

We had horrible customer connection listening systems. We were very good at reaching out to the customers to ask their opinions on all sorts of information, but it revealed the unconscious biases that we had from our position, and we didn't have any pipes that allowed that flow of information to come back the other way.

And as you all know where this story goes, a lot of the information that people were trying to tell us was that we had huge compatibility issues with Vista. If you remember, with XP being in the market for so long, our customers had built so many customized solutions and tools on that operating system. Even if they wanted to move forward to Vista, they were unable to do that, and they had to roll back. And we missed those opportunities to take that information.

And the third one was just speed. Our waterfall process was three years. This one had been doubled to six years. And by the time we had done that planning back in 2002 and we released it in 2007, the market at that point was starting to move much faster, and we couldn't respond to the plans that we had built.

So we knew we needed to do something differently. These are some of the big challenges that people know about the Windows database. We have about 11 million work items in our database across 12,000 engineers that work on this system. On an average day, we do two and a half million queries or updates, and we have about 350 million revisions. So if you take a look at that data from a different perspective, you can see how that can be perceived as five and a half Space Needles or steam engines or the population of Chicago. We have that much data.

And I think that is one of the things that everybody knows about Windows. But the other thing that really is the issue I think that we wanted to talk about today is that scale is so much more than just that content. It often is like a learning disability, is how we talk about it, where it's often invisible, and you need your own terms to figure out how you're going to move through what that scale means.

So we talked about it just in terms of content and just in terms of engineers, but when you look at our scale in a general update, there are day-to-day processes and activities that also create a lot of complexity for our system that aren't readily apparent. We work on over eight and a half million devices that are out there in the world. We scale everything from HoloLens to Xbox, to servers, to embedded OSs that we work proprietary for some of the companies that are out here. And as you can see, in a general release, we have half a million pull requests. That's half a million people asking for a code review from someone else. We have 3.2 billion test cases that we're doing. These day-to-day activities are put in front of a frequency that is every frequency imaginable.

So on a day and a frequency, when you look at a general release, we have some of our challenges, such as Defender and Store, that need to release multiple times a day. Whereas if you get to Xbox, it's monthly. Our core OS is twice a year. And if you're building hardware, you're on a longer cadence, such as an 18- to 24-month cadence.

Those issues, when you talk about not just the people, but you add in the complexities, the ability that we need to provide all of these engineers to be nimble for the environments that they have, but also have us all work in one structure, was not the system that we had when we left the Vista days. And so we needed to do a few things differently.

So we're going to talk about things that we did in three areas. First one I'm going to talk about is things that we have done to really help us be more agile when it comes to planning and work management. Then I'm going to hand it over to Sam. He's going to talk a little bit about how we moved agility through code and through the process of building and release. And then we'll finish it up with some of the challenges and changes that we've made for our customer listening systems.

But if you take a look at this, this is very similar to the Agile framework. It's a little bit different because Windows always thinks they're a little different, so it's got different names up here. But the intent of it is the same. When we ended up with Vista, and we decided that we really needed to restructure and move everything onto one OS code base, we were all in different systems with different languages, and we needed to merge to one language.

So in working with Sam and the DevOps team, we looked at the Agile framework. We have something very similar for Windows. And as you can see at the top is Story, which is a little bit like Epic in that Agile framework. Those are the stories that executives talk to all of you about, the things that we're releasing. We weren't that great at understanding what were these big rocks that we wanted to do and the story from the customer perspective. So we've gotten much more crisp about understanding every piece of code that we work on has to ladder up some way up into one of the stories that we do.

And just to give you a sense, this is what we did last year, a couple of the numbers of how this takes a look for the Windows base. We do about 43 stories across all of this work that we talked about. You can see that that ladders into value props, about 206 customer promises, down to about 143,000 or so tasks that engineers are working on.
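The idea that every task has to ladder up to a story can be sketched as a simple parent walk. This is a minimal illustration, not the Windows schema: the level names come from the talk, but their exact ordering, the `WorkItem` class, and all example titles are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional

# Level names follow the talk; the ordering shown here is an assumption.
LEVELS = ["Story", "Value Prop", "Customer Promise", "Deliverable", "Task"]


@dataclass
class WorkItem:
    title: str
    level: str
    parent: Optional["WorkItem"] = None


def ladder_up(item: WorkItem) -> Optional[WorkItem]:
    """Walk parent links and return the Story this item rolls up to, or None."""
    current = item
    while current is not None:
        if current.level == "Story":
            return current
        current = current.parent
    return None


# Hypothetical example data, not taken from the Windows backlog.
story = WorkItem("Faster sign-in", "Story")
value_prop = WorkItem("Sign in within seconds", "Value Prop", story)
promise = WorkItem("Biometric sign-in works on wake", "Customer Promise", value_prop)
deliverable = WorkItem("Camera driver warm start", "Deliverable", promise)
task = WorkItem("Cache sensor calibration", "Task", deliverable)
orphan = WorkItem("Refactor with no parent", "Task")

print(ladder_up(task).title)  # the story the task belongs to
print(ladder_up(orphan))      # None flags work that ladders up to nothing
```

A check like `ladder_up(item) is None` is one way to surface work that is not connected to any story.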

So now that we understood we had a custom language, we recognized that while we were working with Azure DevOps with Sam's team, and we were instituting all of our data in there, we needed a couple of extra ways to have the ability to assess what was going on in a nimble fashion based on all those invisible complexities that we had talked about.

So what my team has done is build a couple of extensions that we put on top of Azure DevOps, and I thought I would go through what some of those are today. The first one is Story Tracker.

So again, if you think about those 43 stories that we generally have in a release, the executives need to understand in the planning cycle which stories are we going to go to, and they need a better way to assess how are we doing on those stories as we're moving through the release. And so we built this extension. At the top for each one of the stories, you can see the state of the deliverables for all of the stories that are going on. In this particular story, half of those deliverables are gray, which means they're proposed. They haven't yet started being worked on. That might be okay, that may not be okay, depending on where we are in the cycle. So that gives the information for the executives to have those decisions.

Should they want to dig in a little bit more, on the right they have a bar chart that tells you all of the groups that are involved. And then below, you see all of the customer promises that align to that story, and you can dig in to each customer promise, also get a sense of how are we doing in terms of work that's been completed, down to the deliverable and task. So this is a great way. The executives meet monthly. They review these stories to get a sense of how are they doing across some of these big items that they're tackling.
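The per-story view of deliverable states described above amounts to a grouped count. The sketch below assumes deliverables are available as (story, state) pairs; the story names and every state name other than "Proposed" are invented for illustration, and this is not the extension's actual code.

```python
from collections import Counter

# Hypothetical deliverables as (story, state) pairs.
deliverables = [
    ("Story A", "Proposed"),
    ("Story A", "Proposed"),
    ("Story A", "Active"),
    ("Story A", "Completed"),
    ("Story B", "Completed"),
    ("Story B", "Active"),
]


def state_rollup(rows):
    """Return {story: Counter of deliverable states}."""
    rollup = {}
    for story, state in rows:
        rollup.setdefault(story, Counter())[state] += 1
    return rollup


def proposed_share(counts):
    """Fraction of a story's deliverables that have not been started."""
    total = sum(counts.values())
    return counts["Proposed"] / total if total else 0.0


rollup = state_rollup(deliverables)
for story, counts in rollup.items():
    print(story, dict(counts), f"{proposed_share(counts):.0%} proposed")
```

Whether a given proposed share is acceptable depends on where the release is in its cycle, which is the judgment the executives make from the view.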

But then we needed another layer, another layer for the senior leads for our teams. A lot of our teams are managing groups of 500 developers, anywhere from 500 to 1,000 engineers at a time. And one of the things with this complexity is we have a lot of dependencies. In Windows speak, a dependency is something that I'm building that Sam and his group needs me to finish before he can finish his work. And that is something that I'm producing.

Conversely, we have a dependency that's the other way. Sam's creating an API that I need him to finish before I can move forward with the plans that I've committed to for my team. And so I'm consuming work that Sam's building. In the past, all of that information was done organically and generally face-to-face, or people working to try to understand how things were moving through the system.

And so what we did was we created an extension to pull that information together in one place. And so this is the Dependency Tracker, which you can see here in the view that I just pulled up is these are all the groups that are creating work that other people depend on. So if you take the group in the left, which is called Sigma, you can see that they create the most amount of work that other teams across Microsoft need them to complete before they can move forward with their jobs as they're rolling out their release cycles.

You can look down below and understand what are those specific tasks, who's consuming it, who needs that information, and the risk. We also have a timeline view and a risk view, so that if I'm Sigma and all of a sudden I decide I'm going to cut these five features, and you're the first person here, and you're the team PCE, and those five features actually matter to you, you now are aware sooner rather than later that something is materially impacting your schedule. We've created, too, an immediate communication system so that emails go out to everybody to allow people to understand that these changes are happening and so that conversations can happen sooner and people can make the changes that they need and move forward without it causing gridlock or blocking issues for a release.
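The producing and consuming relationships described above can be modeled as a list of edges between teams. The sketch below is only an illustration of that idea: "Sigma" and "PCE" are team names from the talk, while the `Dependency` class, the third team, and all feature names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    feature: str
    producer: str  # team building the work
    consumer: str  # team waiting on it


# Hypothetical data for illustration.
deps = [
    Dependency("auth-api", "Sigma", "PCE"),
    Dependency("sync-engine", "Sigma", "PCE"),
    Dependency("sync-engine", "Sigma", "Shell"),
    Dependency("ui-kit", "Shell", "PCE"),
]


def produced_counts(dependencies):
    """Count how many dependencies each producing team owes to others."""
    counts = defaultdict(int)
    for dep in dependencies:
        counts[dep.producer] += 1
    return dict(counts)


def consumers_to_notify(dependencies, producer, cut_features):
    """Map each affected consuming team to the cut features it relied on."""
    affected = defaultdict(list)
    for dep in dependencies:
        if dep.producer == producer and dep.feature in cut_features:
            affected[dep.consumer].append(dep.feature)
    return dict(affected)


print(produced_counts(deps))
print(consumers_to_notify(deps, "Sigma", {"sync-engine"}))
```

The second function is the part that would feed a notification step: when a producer cuts a feature, every team consuming it is identified immediately.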

We've also rolled this out in our planning system so that all of our executives now have to think about these dependencies before they even start and what they commit to. And they all agree to these dependencies at a high level and are aware of how they're interacting with each other so that more thoughtful work is done throughout the release and an understanding of how this work crosses other teams.

Finally, we have for the leads who have information in terms of smaller teams, we have something that's a sprint view. In Windows, we call it an iteration, but think of it as a sprint. So this is a way for, again, a lead who's not a senior person, but who has multiple customer promises that they're working on to see how that work is laddering. And you can see here all of the work for a month or a sprint, what's going on in terms of what's been completed and what's not. We've added information to understand what's getting pushed out and what's getting pushed in so that we have more information to understand, are we losing ground in what that timeline is and how we can reground and have those conversations as needed.

Dashboards are a hugely valuable tool in the Azure DevOps world, and we use them as well. I just threw this up here on the right. You can see some customized widgets we have for bug tracking that we've created. So we have a heat map over here on the right. You can see for all of the groups that are related to this person's work, they get 48-hour bugs and all the way through the various definitions that they need to understand what's a hot bug, what's a blocking bug. And we've created a bug glide widget which we can share with people, but it helps us understand where we are in that process of getting to a healthy state before we put that out into getting ready to release.

And then we have two things that we have created for ICs. So if you are an engineer, and you're one of these 12,000 engineers who need to work in this base, it can get a little overwhelming, especially because a lot of our code is 30 years old and isn't as clean and as agile as we would like to have it be. So one of the things that we have created is something that's called an areas extension. And really it is a place for us to understand organizationally where information is situated. And so what you can see here in our OS project, these are the various groups that ladder up to the OS project. You can double click on those to get further and further information of various teams and where their information and their data sits, and then who owns it, so you can go have a conversation as needed and not feel like you have to work up through the task to understand where your information is and who are the people you need to talk to as you're working through building out your features.

The other thing that we've spent a lot of time on that might be unique to Windows is because our code base is so old and our culture was not set up in a way that is as agile as we would like, we have lots of binaries and lots of files that don't have owners associated with them. Some of it is really old code, but some of it is newer code that just didn't have an owner. So if you have gotten assigned a bug and you're trying to understand what's going on, you're looking in the binary, you don't know who to go to for more information.

So one of the things that we've spent a lot of time on over the past two years is how do we increase the percentage of files that have owners assigned to them. And so we have a process where we have this extension where you can go down to the file, see the area path, and see the owners that are associated. If an owner isn't there, you have information on the right that gives all of the history of the pull requests and the other history around that file so that you can get a sense of who it is that you need to talk to. And an algorithm is constantly working through this to help assign owners to files to improve that piece of information that people need to go and cover as they're going through their work.
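The talk does not describe how the owner-assignment algorithm works. As a purely illustrative stand-in, the sketch below suggests the most frequent pull request author for a file that has no recorded owner. The heuristic, the function name, and all data are assumptions.

```python
from collections import Counter

# Hypothetical pull request history: (file path, author), oldest first.
history = [
    ("net/tcp.c", "alice"),
    ("net/tcp.c", "bob"),
    ("net/tcp.c", "alice"),
    ("ui/menu.c", "carol"),
]

# Files that already have an owner on record.
known_owners = {"ui/menu.c": "dave"}


def suggest_owner(path, history, known_owners):
    """Return the recorded owner, else the most frequent PR author, else None."""
    if path in known_owners:
        return known_owners[path]
    authors = Counter(author for p, author in history if p == path)
    if not authors:
        return None
    return authors.most_common(1)[0][0]


print(suggest_owner("net/tcp.c", history, known_owners))
print(suggest_owner("ui/menu.c", history, known_owners))
print(suggest_owner("old/legacy.c", history, known_owners))
```

A file with neither an owner nor any history returns `None`, which corresponds to the case where someone still has to investigate by hand.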

So that is basically some of the tools that we've created in terms of planning and work management, but we know that there's much more to that. And part of that is how do we handle all of this code, this legacy code, and get that into a system that makes it more agile and aligns with this journey to DevOps. And for that, I'm going to hand it over to Sam, and he's going to talk a little bit about some of the work we've done in that space.

Sam Guckenheimer

Thanks, Catherine. So four or five years ago, when the senior leadership team under new CEO Satya said, "We're going to have one engineering system for the company, and everyone's going to use it. And it needs to work for the whole company," and that, of course, includes Windows.

We started this mail alias called Engineering System Architecture Discussion. And this was a torture machine. It had all of these threads about, "Well, we need to get to modern code practices and Git and pull requests and blah, blah, blah, blah. And the solution is we just need to refactor Windows." "And we've got this monolith. We need to turn it into microservices." Right?

And so explaining up the management hierarchy that you're going to refactor this thing, and it's a journey of we don't know how long, but certainly measured in years, and there's going to be no customer benefit or deliverable along the way, that didn't work so well.

We have in the core Windows repo, there are several side ones, but in this core repo that's at the center, the monolith, something over 7,000 developers who need to deliver code. That translates into about 11,000 topic branches that they work on. In a month, something like a third of a million commits and 30,000 pull requests and like 10,000 branch integrations because those topic branches get collapsed. If you look at that daily or in real-time, that's 10 commits per minute. That's 1,100 pull requests a day.

So if you think about people working in master, it's churning all the time. And if you assume that these are fantastic developers that say they only make a mistake one day a year, it means that your code's broken all the time. Which by the way is how Vista, and as Jeffrey talked about on Monday, Longhorn, didn't quite happen. It creates a merging problem that isn't nicely, "I'll take your pull request, you'll take mine," but it's like the freeway from hell.
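The rates quoted above follow from simple division, sketched below. The monthly totals and the 7,000-developer figure are from the talk; the number of days per month used for the conversion is an assumption, since the talk does not say, so two plausible values are shown.

```python
MINUTES_PER_DAY = 24 * 60

commits_per_month = 1_000_000 / 3   # "a third of a million commits"
pull_requests_per_month = 30_000
developers = 7_000


def rates(days_in_month):
    """Return (commits per minute, pull requests per day)."""
    commits_per_minute = commits_per_month / days_in_month / MINUTES_PER_DAY
    prs_per_day = pull_requests_per_month / days_in_month
    return commits_per_minute, prs_per_day


# 22 approximates working days, 30 calendar days (both are assumptions).
for days in (22, 30):
    commits_per_minute, prs_per_day = rates(days)
    print(days, round(commits_per_minute, 1), round(prs_per_day))

# If each developer breaks the build on only one day per year,
# this is the expected number of breaking changes landing per day.
breaks_per_day = developers / 365
print(round(breaks_per_day, 1))
```

Under these assumptions the conversion gives roughly 8 to 11 commits per minute and 1,000 to 1,400 pull requests per day, which brackets the quoted figures. The last line shows why the code is "broken all the time": even at one mistake per developer per year, about 19 breaking changes would be expected every day.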

So we know, and Jez and Nicole said yesterday afternoon, you're supposed to do integration and collapse branches every day. You're supposed to get your code to master all the time. How on earth do you do it at that scale?

Well, the good news is that we realized that the proprietary hierarchical version control that had been used in Windows and most of Microsoft for decades called Source Depot wouldn't cut it. But we did an eval, this is like four years ago now. Source Depot, we also looked at commercial alternatives like Perforce and looked at Git and said, "Hey, to get where we want to go, we need all these good things about Git you see there." Only Git was going to meet our needs of being able to work fast and get a pull request flow going and so forth, but only if we could get it to scale appropriately.

What do I mean? In that core repo of Windows, we had 360 gigabytes of data. Now, to put that in perspective, if you saw Dylan and me on Monday, we were showing you in Azure DevOps, in its most monolithic repo, maybe three gigabytes, so 100X down. If you compare this to Linux, which was built a different way from the beginning, it's more like 300X.

So to move to Git, we needed to do something about performance. It took 12 hours to clone that repo. And that's counting the successful ones. If your laptop went to sleep, you had to start over. If the wireless burped, you had to start over. So this is being on a great in-the-office network with a machine that's up and there's no hiccups in anything, and it doesn't have to restart. And just doing a Git status was eight minutes and half an hour for a commit. It was ridiculous. Unusable.

So to make Git work for Windows, we had to fix Git. We took three tries at this. You may have heard about the Git Large File System, that was one of those attempts and what have you. And it took three years and a lot of dedication, top-down to the belief that this was worth it. So we developed what's now called the Virtual File System for Git. It was GVFS in the beginning, and that's still the repo name, which gave us 300X performance improvements pretty much across the board. So that 12-hour clone was down to five minutes. A commit was not half an hour, it was six seconds.
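Taking the before-and-after timings quoted above at face value, the speedup for each operation is the ratio of the two durations. The sketch below does only that arithmetic; the dictionary layout is just a convenience.

```python
# (before, after) in seconds, using the figures quoted in the talk.
before_after_seconds = {
    "clone": (12 * 60 * 60, 5 * 60),  # 12 hours -> 5 minutes
    "commit": (30 * 60, 6),           # 30 minutes -> 6 seconds
}

speedups = {
    op: before / after for op, (before, after) in before_after_seconds.items()
}

for op, factor in speedups.items():
    print(f"{op}: {factor:.0f}X")
```

The commit ratio matches the 300X figure exactly, while the clone ratio comes out lower, so "300X pretty much across the board" is best read as a round number.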

The way we did that technically was essentially to use the pattern that you see on photo-sharing sites. So if you think about OneDrive or Google Photos or anything like that, you see thumbnails of everything. But you don't actually download the big JPEG until you click on the thumbnail. So you can think of this as providing thumbnails of all the files, but not downloading them until needed.
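The thumbnail analogy is a lazy-loading pattern: list everything up front, fetch contents only on first access. The sketch below shows that pattern in isolation. It is not how the Virtual File System for Git is implemented; `LazyFile`, the in-memory `REMOTE` store, and the file names are all hypothetical.

```python
class LazyFile:
    """Placeholder that knows a file's path and size but not its contents."""

    def __init__(self, path, size, fetch):
        self.path = path
        self.size = size
        self._fetch = fetch  # callable that downloads the contents
        self._data = None

    @property
    def hydrated(self):
        return self._data is not None

    def read(self):
        # Download on first access only; later reads use the local copy.
        if self._data is None:
            self._data = self._fetch(self.path)
        return self._data


# Hypothetical remote store standing in for the server.
REMOTE = {"kernel/sched.c": b"int main;", "shell/start.c": b"void start;"}
downloads = []


def fetch(path):
    downloads.append(path)
    return REMOTE[path]


# "Clone": create placeholders for every file without fetching any contents.
working_tree = {p: LazyFile(p, len(b), fetch) for p, b in REMOTE.items()}

working_tree["kernel/sched.c"].read()
working_tree["kernel/sched.c"].read()
print(downloads)
print(working_tree["shell/start.c"].hydrated)
```

Reading the same file twice triggers a single download, and files that are never opened are never transferred, which is where the savings on a very large repository come from.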

So we implemented GVFS, and we started moving parts of the Windows team to GVFS. How'd it go? Well, it took about six months from no use. The blue is Source Depot, the predecessor, and everything in Source Depot, and the orange color is Git. And if you notice, there was a point in March when we moved the bulk of the organization, and it happened over a weekend. And if you look at the heights of the curves, that's the number of pull requests, or before pull requests, what Source Depot called submits. Similar idea. So the amount of code activity actually went up. No interruption, which was quite remarkable. And now all of Windows is using Git, along with the rest of Microsoft under what we offer in the market as Azure Repos, part of Azure DevOps.

How did we deal with that problem of getting to the intraday pull request and that fast flow? Well, we have this problem of master's up here, and you've got all these people in Windows Core, 7,000 engineers working on their branches down here, and you need to figure out before the code moves to master how to validate it. So we developed some custom tooling.

When Dylan and I showed you what we do on the DevOps SaaS, you saw one build running for each pull request. So each pull request got its build, and when the build completed, that's when you saw the results. Windows builds take too long to do that. So we set up a system where we would have a continual build running as soon as it could, or continual builds running in parallel as soon as they could. And your changes in your pull request would be applied with an LKG, last known good, to be validated for master before that pull request could move forward to be committed to master. So it's a way of getting that high speed of changes back to master and having them validated before the commit to master. And we had to do some custom machinery for that.
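The last-known-good idea can be sketched as a gate: each pull request is applied on top of the LKG state, validated, and only then allowed to advance master. The version below is serial for clarity, whereas the talk describes builds running in parallel. The function names, the set-based state, and the validation rule are assumptions, not the Windows tooling.

```python
def validate_queue(lkg, pull_requests, build_and_test):
    """Validate each pull request against the last known good (LKG) state.

    lkg: frozenset of change ids already validated
    pull_requests: iterable of (pr_id, change_id)
    build_and_test: callable(frozenset) -> bool
    """
    merged, rejected = [], []
    for pr_id, change in pull_requests:
        candidate = lkg | {change}
        if build_and_test(candidate):
            lkg = candidate         # the candidate becomes the new LKG
            merged.append(pr_id)
        else:
            rejected.append(pr_id)  # master and the LKG are left untouched
    return lkg, merged, rejected


# Hypothetical validation: any state containing "bad-change" fails.
def build_and_test(state):
    return "bad-change" not in state


lkg, merged, rejected = validate_queue(
    frozenset({"base"}),
    [(101, "feature-a"), (102, "bad-change"), (103, "feature-b")],
    build_and_test,
)
print(sorted(lkg), merged, rejected)
```

Because a failing change never reaches the LKG, the pull request behind it is still validated against a working baseline instead of inheriting the breakage.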

Catherine Kamerling

So how'd it work? So one of the things that's working with the pre-code validation is we're in the middle of a whole bunch of pilots right now with Windows. It's working really well. Our hope is that next year we can come back and go deeper into our learnings and if it's something that we can roll out. But we wanted to share with you where we were in that journey.

The other place where we wanted to quickly isolate, remember back in the way back machine with Vista, and we had no customer pipes for listening, we created the Windows Insider Program. The Windows Insider Program is basically in almost all-- it's worldwide in almost all countries, 95% coverage in terms of the eight million devices we have and the 21 million apps that we have to take a look at how do we make sure that this build reaches out to these and doesn't cause issues. And our insiders are a key team with us.

We rolled that out in Windows 10, and you can see with the public previews how we have consistently increased our connections with this team of insiders. And so I don't know if anyone's an insider out here, but thank you very much. Since we started this in Windows 10, the Windows Insider teams have isolated and identified about half a million bugs that we have fixed as a result of that connection from our customers. And so we thank you for that.

The other thing that we wanted to share with you is, as we talked about some of these extensions, what we're trying to do is make them public so that if anybody is interested in using some of these extensions, if there are unique issues that you have on your team and some of these tools might be helpful, you have the ability to do them. So we have two tools on GitHub right now.

One is Work Item One Click. That is for the individuals, the engineers on your teams. If they are working through and need to create certain rules for their WITs or their queries, they can do this and have that help them with their workflow. We found that to be incredibly helpful because, again, with these 12,000 engineers, there's no consistent default system that we could set up that helps everybody. So personalizing this for them was the way for us to go. So that's there.

The second one is Work Item Migrator. You might have noticed that we had 11 million items in our account. That really affects the performance. And so we worked with Sam's team to create a tool to migrate some of the archive older code that we weren't using or older files into another system that is accessible if we need it, but helps us maintain our performance. And teams are finding all sorts of interesting ways to use that tool. That also is on GitHub.

And then finally, the Dependency Tracker that I talked about is on the marketplace store in Microsoft. If your team has unique needs around dependencies, that tool is available for you all to use and see if that can help meet your needs and provide additional value back to us that we can learn from and continue to move forward with advancements in that space.

So in conclusion, I guess I would say this is the challenge and the passion that Sam and I have, which is really trying to get these 12,000 software engineers to work independently and together. And we're on that process right now, and we'd love to come back to you next year and let you know how that's going in terms of some of the build and release cycles that we have. And with that, thank you very much.