Progressive Deployment, Experimentation, Multitenancy, No Downtime, Cloud Security, Oh My!
This experience report is about rearchitecture from a monolith to cloud-native practices. We cover moving stepwise from single tenancy to multitenancy, scaling up to scaling out, fixed resources to optimized variable costs, periodic upgrades to zero-downtime updates, single backlog to continual experimentation, linear to progressive deployment with a controlled blast radius, long release cycles to continual testing, opacity to observability, and pre-release security reports to continuous security practices.
Dylan Smith was a Microsoft MVP (ALM) and DevOps consultant for many years before joining Microsoft to lead the DevOps Customer Advisory Team. Now he works with Microsoft’s largest customers to help them accelerate their DevOps journey.
Sam Guckenheimer is the Product Owner for Microsoft Visual Studio Team Services and Team Foundation Server. In this capacity, he acts as the chief customer advocate, responsible for strategy of the next releases of these products, focusing on DevOps, Agile and Application LifeCycle Management.
Sam edits the website, DevOps at Microsoft. He is a regular speaker and has keynoted at many conferences including DevOps Enterprise Summit and Agile. He is the author of four books, most recently Journey to Cloud Cadence, Visual Studio Team Foundation Server 2012: Adopting Agile Software Practices: From Backlog to Continuous Feedback. Prior to joining Microsoft in 2003, Sam was Director of Product Line Strategy at Rational Software Corporation, now the Rational Division of IBM. Sam lives in the Seattle area with his wife and three children in a sustainable house they built that has been described in articles in Metropolitan Home and Houzz.
Full transcript
The complete talk, organized by section.
Sam Guckenheimer
We both work on something that's now called Azure DevOps. You may know it from its predecessor, Visual Studio Team Services, that I've talked about, or its on-prem sibling, Team Foundation Server. This is also the basis for Microsoft's One Engineering System.
To give you an idea of the scale at which we are doing DevOps, we rolled up from our SaaS some stats. We're doing about 78,000 deployments a day. Not on the slide is that we have 94,000 active engineers internally, who are our customers, in addition to millions of folks like you externally.
The State of DevOps Report, which I hope you've all read - Nicole and Jez will be here talking about it, I think tomorrow - talks about how, in order to become a high performer or elite performer, you need to go a lot faster. They emphasize all of these practices that high performers do with regard to the speed of delivery, the time to recover, the lower change failure rate, and so forth.
And the question that the report doesn't really speak to, that I get in customer conversations all the time, is: we've got an existing business. What do we do about our existing code? Should we throw it away and go cloud-native and start over, and how do we do that and keep the business going forward? So this is a story about that. Dylan.
Dylan Smith
Thank you, Sam. All right. What I want to talk to you about is what did that journey, that story, look like? We used to have, or we still have, TFS on-premise. Sometime around 2010, we started building what is now called Azure DevOps. I think the first preview was 2011, 2012. We've been on this journey for seven, eight years now, and we've told various aspects of our story at previous conferences like this.
Today I want to focus on a specific aspect of that journey. The first question that we faced when we decided to build this cloud-hosted software-as-a-service version of TFS was: we had this existing code, existing architecture. Do we re-architect it for the cloud, whatever that means, and then move it, or do we just move it as is and deal with the problems as we run into them? We chose the latter. We just moved it as is and went from there.
What I want to talk to you about is what did that look like? What were the problems that we faced specifically, and where have we come in the last five, six years on this journey?
When we started, we had TFS. We shipped it every two years or so, an on-premise server product. Some context: TFS architecturally is basically a SQL Server database that has all the data, and then application tiers and job agents, which are ASP.NET web applications hosted in IIS. It's not multi-tenant, but we did have a concept of a collection in TFS, a collection of team projects, and each collection got its own database. That'll be important. We could load balance the application tiers and job agents.
When we decided to move this to the cloud, we basically took what we had with almost no changes and threw it up into Azure. Specifically, it was called web roles and worker roles, which are an ancient Azure technology: basically a bunch of VMs and Azure SQL databases. We installed TFS up in Azure and made it available to our customers. About the only tweak we had to make was how we do identities. Everything else was almost identical to what we had.
We ran into some pretty significant problems almost immediately. The first problem was every time a customer signed up, they got a new collection inside our software. Each collection meant a new SQL Server database, and we very quickly had something like 11,000-plus databases in the cloud. I don't know if it was our software or the SQL Azure software, but that just fell over dead. It was never designed to handle that many databases. So the first problem was we need multi-tenancy. We need to have multiple customers in one database for several reasons, but having 11,000 databases just wasn't sustainable.
Our approach was a typical multi-tenant implementation. We added a customer ID column to every table and changed every query to filter by customer ID. We had some clever trick that we used to do automated testing to make sure we didn't miss any queries, so we weren't leaking customer data. That's the approach we went with. Now we have, instead of 11,000 databases, one or at least a small number of databases.
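The column-plus-filter approach can be sketched roughly as follows. This is a minimal illustration using Python's sqlite3 module rather than SQL Server. The table `work_items`, the column name `customer_id`, the `QUERIES` registry, and the `unscoped_queries` check are all hypothetical. The talk does not say what the actual testing trick was, so the check here is only one simple way to catch a query that forgot the tenant filter.

```python
import sqlite3

# Hypothetical query registry: every statement the app may run lives here,
# so a test can inspect all of them in one place.
QUERIES = {
    "insert_item": "INSERT INTO work_items (customer_id, id, title) VALUES (?, ?, ?)",
    "list_titles": "SELECT title FROM work_items WHERE customer_id = ? ORDER BY id",
}


def open_db() -> sqlite3.Connection:
    """Create an in-memory database whose table is keyed by tenant."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE work_items ("
        " customer_id INTEGER NOT NULL,"
        " id INTEGER NOT NULL,"
        " title TEXT NOT NULL,"
        " PRIMARY KEY (customer_id, id))"
    )
    return conn


def add_item(conn, customer_id, item_id, title):
    """Insert a row; the tenant key is always part of the write."""
    conn.execute(QUERIES["insert_item"], (customer_id, item_id, title))


def list_titles(conn, customer_id):
    """Read only the rows that belong to the given customer."""
    rows = conn.execute(QUERIES["list_titles"], (customer_id,))
    return [row[0] for row in rows]


def unscoped_queries(queries):
    """Return the names of statements that never mention customer_id."""
    return sorted(name for name, sql in queries.items() if "customer_id" not in sql)
```

Running `unscoped_queries(QUERIES)` as part of an automated test run would flag any newly added statement that does not reference the tenant column. A textual check like this is deliberately crude and would not catch a query that mentions the column without actually filtering on it.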
The next problem was now that we have one big database, it turns out that SQL Azure back in the day couldn't handle a giant database of multiple terabytes. I think at the time the limit was 500 gigs. But even if it could handle it, our cost - we get an Azure bill just like everybody else, and our Azure bill was big, especially the database part of it. So the second change we made to optimize our COGS, our cost of goods sold, our cost of running the service, was to move as much of that data as we could out of SQL Server into much cheaper blob storage. I believe nowadays we still have something like 60, 70 terabytes still in SQL Server, but that's just the metadata. All the customer data, the source code files, work item attachments, build outputs, all that other stuff is in blob storage.
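The split between metadata in the database and payloads in blob storage might look something like the sketch below. It is an illustration only: `BlobStore` is an in-memory stand-in for a real object store, the `attachments` table is hypothetical, and keying blobs by their SHA-256 hash is an assumption made for the example, not something the talk describes.

```python
import hashlib
import sqlite3


class BlobStore:
    """In-memory stand-in for a cheap object store (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # Assumption for this sketch: the blob key is the content hash.
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def open_metadata_db() -> sqlite3.Connection:
    """The relational side keeps only small, queryable metadata."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE attachments ("
        " name TEXT PRIMARY KEY,"
        " size INTEGER NOT NULL,"
        " blob_key TEXT NOT NULL)"
    )
    return conn


def save_attachment(conn, blobs, name, data):
    """Write the payload to the blob store and a pointer row to the database."""
    key = blobs.put(data)
    conn.execute(
        "INSERT INTO attachments (name, size, blob_key) VALUES (?, ?, ?)",
        (name, len(data), key),
    )


def load_attachment(conn, blobs, name):
    """Look up the pointer in the database, then fetch the payload."""
    row = conn.execute(
        "SELECT blob_key FROM attachments WHERE name = ?", (name,)
    ).fetchone()
    return blobs.get(row[0])
```

The database row stays a few dozen bytes no matter how large the attachment is, which is the property that keeps the expensive storage tier small.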
The other major problem was TFS on-premise, and this is still true today: when you upgrade it, you need downtime. We go and make changes to the database schema; it requires downtime. When we moved this up into the cloud, for the first nine months, I want to say, every time we wanted to upgrade our cloud service, we scheduled a maintenance window and took it down for the world. Today, in 2018, that obviously wouldn't be acceptable. But that's what it was. The first four or five major updates to Azure DevOps required downtime.
Our approach to that was that we basically had to come up with a system to do database updates, specifically. There's a lot of stored procs and database stuff in Azure DevOps, and we needed to do those without downtime. We have a system of PowerShell scripts that allows us to do that. The key implementation details are: every time we implement a feature, it has code changes and database changes. Those code changes need to work with the old database version and the new database version. That's the key detail of how we did no-downtime deployments. We can roll out the code change first, which works with the old database schema, and then use our PowerShell script framework to do that database change online, in place.
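The rule that code must work with both the old and the new database version can be sketched as below. The talk describes a PowerShell script framework against SQL Server; this example swaps in Python and SQLite purely for illustration, and the `builds` table, the `duration_ms` column, and all function names are hypothetical.

```python
import sqlite3


def has_column(conn, table, column):
    """True if the table currently has the named column."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return any(row[1] == column for row in conn.execute(f"PRAGMA table_info({table})"))


def record_build(conn, build_id, status, duration_ms):
    """Application code that works against both schema versions."""
    if has_column(conn, "builds", "duration_ms"):
        # New schema: store the extra value.
        conn.execute(
            "INSERT INTO builds (id, status, duration_ms) VALUES (?, ?, ?)",
            (build_id, status, duration_ms),
        )
    else:
        # Old schema: the new value is dropped, but the write still succeeds.
        conn.execute(
            "INSERT INTO builds (id, status) VALUES (?, ?)",
            (build_id, status),
        )


def upgrade_schema(conn):
    """Additive, in-place change: existing rows get NULL in the new column."""
    if not has_column(conn, "builds", "duration_ms"):
        conn.execute("ALTER TABLE builds ADD COLUMN duration_ms INTEGER")
```

In this sketch the order of operations is: ship `record_build` while the old schema is live, run `upgrade_schema` online, and only later remove the old-schema branch. Carrying that extra branch in every feature is the ongoing cost of the approach.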
That works great. The big downside, or cost, is every feature that we implement from now until forever is more expensive. Every single feature, we have to consider how do we make that work with the old database version and the new database version. That's a cost. I don't think of it as a cost; I think of it as a tax. It's a tax on all feature development forever. I don't see a way to avoid it. We've accepted that we are going to pay that tax. But anytime you see a change that is a tax instead of a one-time cost, you need to think very long and hard about that.
Another thing that we make extensive use of is feature flags. Pretty much every feature that we develop is hidden behind a feature flag for some period of time until we're ready to turn it on for the world. I'm sure you've all heard about feature flags and how great they are, so I'm going to tell a story about where it bit us.
Back in 2013, at a Microsoft conference called Connect - these are the conferences where during the keynotes we usually announce big new features - we had some big new features for what was called at the time VSTS or VSO, whatever it was called at the time. We had deployed them to production in the weeks leading up to the event, hidden behind a feature flag. Our plan was, an hour or so before the keynote, let's flip that flag on for the world, and then we can show it off in the keynote, these great new features.
Well, that didn't go so well. We flipped that flag an hour, maybe two hours before the keynote. This is a chart from one of our blog posts on the root cause analysis report of the chaos that ensued. We flipped it on for the world, and the feature started seeing load like it had never seen before. We tested it, obviously, but there's no place like production. The service started going up and down. It was down during the keynote. It didn't feel good.
We learned a few things from that experience. Number one, probably don't flip on major feature flags an hour before a major keynote. So if we ever have any Microsoft conferences with keynotes where we're announcing new features, go check out our service the night before that keynote, and you'll probably see some stuff that we haven't announced yet.
That was one lesson. But the bigger lesson we had here is that at this point in time, I think it was November 2013, we had one instance of our service in the cloud, which means if we break it, if it goes down, we break everybody. The blast radius is global. Our bigger learning is we need to do something about that. We need to slice our service up so we can limit the blast radius when we inevitably break stuff.
The way we did that was we split it up into things that we call scale units: effectively independent instances of our software in the cloud. Nowadays we have many dozens of scale units, and this helped us in a few ways. Number one, it helps limit the blast radius. If we break something, hopefully it's limited to just that scale unit. It's not always true, but often that's true.
It bought us a couple other really important benefits. We have these dozens of scale units, and we group them into rings. We have six rings, rings zero through five. This is how we do progressive deployment. When we roll out an update and put out our release notes every three weeks, we go to ring zero first, wait a little bit, ring one, wait a little bit. It takes about a week to go through all the rings.
We designed these rings specifically. Ring zero is our internal accounts. If we break something, we'll likely break it in ring zero first. It'll take us down, and we'll fix it before it hits our customers. But then the first few rings are specifically designed. The next ring, ring one, has specifically targeted accounts that use features in our product that we don't use internally. For example, we have Test Plans for manual test management. We don't really use that internally on our team, so that feature doesn't really get tested in ring zero. Ring one is customers that use the breadth of the features that we may not necessarily test extensively in ring zero. I can't remember exactly what each ring is, but one of them is a non-US geography, and one of them is very large customers. Eventually ring four and ring five is everybody.
When we do our three-week deployments, we deploy to a ring, wait 24 hours, and deploy to the next ring. When we do daily hotfixes, we wait an hour between rings, and we're basically waiting to see if any alarm bells go off. The third thing that the scale units allowed us to do is put an instance of our software in different geographies around the world. We have customers all over the world, even 11 customers in Antarctica. So that statement is true. Now we can put scale units in different regions. When you sign up for an account, your account lives in one of those specific scale units.
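The ring-by-ring rollout described above, with a bake period between rings and a stop when something looks unhealthy, could be sketched like this. The `Ring` type, the `roll_out` function, and the `deploy`, `healthy`, and `wait` callbacks are hypothetical names for illustration, not the team's release tooling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    name: str
    scale_units: List[str]


def roll_out(
    rings: List[Ring],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    bake_hours: int,
    wait: Callable[[int], None] = lambda hours: None,
) -> List[str]:
    """Deploy ring by ring; return the names of rings that passed their bake."""
    completed = []
    for ring in rings:
        for unit in ring.scale_units:
            deploy(unit)
        # Bake time: 24 hours for a regular deployment, 1 hour for a hotfix.
        wait(bake_hours)
        if not all(healthy(unit) for unit in ring.scale_units):
            # An alarm went off: stop here so later rings are never touched.
            return completed
        completed.append(ring.name)
    return completed
```

Passing `wait` as a callback keeps the sketch testable; a real pipeline would schedule the next stage rather than block in a loop.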
Let me show you real quick what this looks like. This is Azure DevOps, and this is our release screen where we release Azure DevOps. This is our active release. I think it started yesterday, and I'm going to have to zoom out my browser here to make it all fit on the screen. These individual boxes are effectively our scale units. The columns represent roughly the rings. I can see here that ring zero is up top; ring one is done, ring two is done; it's in the process of deploying to ring three. Every release goes through our rings. We've modeled the scale units and rings inside of our release tooling.
While I'm in here, I'm going to show you one other thing. I said we use feature flags extensively. When you hear us talk about features that are in private preview or public preview, how we surface that to our customers, if you've used our tool, you may have seen this. We can go in there, and I can see preview features. I can turn various preview features on and off just for me or for my account. We have our own little mini workflow that some of these preview features go through.
Not every feature flag is exposed through this screen. This is the big major ones, potentially disruptive ones. But by exposing this directly to our customers, it allows us to do some pretty useful stuff. When we have a major feature, we will release it, and it'll be off for everybody. At some point we'll flip it to on for everybody and allow them to turn it off. At some point we'll just get rid of the flag. That first stage, where the flag is there but turned off for everybody by default, allows us to release a feature when it's not done. Oftentimes some of these big features will be half done, but we'll release it in public preview hidden behind one of these flags, and that allows us to iterate on the feature in the open, getting feedback, not having to wait until we're done with the feature to put it in the hands of our users. We release features behind those flags very early. At some point, we feel that we're done or done enough, and we'll flip it to on for everybody. We'll monitor our telemetry, see if people are manually flipping it back off, and try to figure out why. Once we feel comfortable that people aren't turning it off anymore, we'll just remove the flag.
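The three-stage life of a preview feature flag (off by default, then on by default, then removed) can be modeled in a few lines. This is a sketch under assumptions: the `Stage` names, the `FeatureFlag` class, and the `opt_out_count` telemetry helper are hypothetical and only mirror the workflow as described in the talk.

```python
from enum import Enum


class Stage(Enum):
    OPT_IN = "off by default, users may turn it on"
    OPT_OUT = "on by default, users may turn it off"
    REMOVED = "flag retired, feature always on"


class FeatureFlag:
    def __init__(self, name, stage=Stage.OPT_IN):
        self.name = name
        self.stage = stage
        self.overrides = {}  # user -> explicit on/off choice

    def set_for(self, user, enabled):
        """Record a user's explicit choice while the flag still exists."""
        if self.stage is Stage.REMOVED:
            raise ValueError(f"flag {self.name!r} has been removed")
        self.overrides[user] = enabled

    def is_enabled(self, user):
        """Explicit choice wins; otherwise the stage decides the default."""
        if self.stage is Stage.REMOVED:
            return True
        default = self.stage is Stage.OPT_OUT
        return self.overrides.get(user, default)

    def opt_out_count(self):
        """How many users have explicitly switched the feature off."""
        return sum(1 for enabled in self.overrides.values() if enabled is False)
```

Watching `opt_out_count` after moving to `Stage.OPT_OUT` corresponds to the step of checking whether people are flipping the feature back off before the flag is retired.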
There is one other really important aspect of our journey that we touch on regardless of which aspect of the story we're telling. Now that we're in this cloud service, cloud-native world, we're releasing every three weeks instead of every two years. We're moving much, much faster than we ever have before. At some point in this journey, quality became a really big focus for us. We're moving very fast, which means we're potentially breaking things very fast, and we needed to have a really good handle on that.
We made a few changes to wrap our arms around the quality of our product. The biggest change that we did, in my opinion, is we combined the developer role and the tester role into one role: combined engineering. That really changed the culture. It wasn't one person responsible for features and another team responsible for the quality. Every engineer is responsible for their feature and the quality of their feature, and now also the health of their feature once it's in production.
The other thing that we did is we had to change our approach to automated testing, specifically how we implemented our tests. It used to be that pretty much all of our tests were end-to-end integration tests, whatever you want to call it, where the app needs to be deployed and it runs an end-to-end test. We had tens of thousands of them. They were slow. They were brittle. They often failed, and we weren't sure if it was really a bug or just a problem with the tests. At its peak, I believe it took 22 hours to do a test run, so we did it about once a day. We ran those tests every day for eight years, and never once did we have a completely passing test run. I'm told we came close one Christmas time when people were on holiday.
We knew that needed to change. The change that we made is we basically just adopted good unit testing practices. We came up with our own hierarchy of tests, and because we like to invent names for things at Microsoft, we call them L0, L1, and L2. L0 and L1 are like traditional unit tests. L2s, and then there are actually L3s also, are the more end-to-end scenario tests. Our philosophy was, if you can test it with an L0, don't use an L2. The vast majority of our tests should be these L0s, these unit tests that are fast and not brittle.
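To make the "if you can test it with an L0, don't use an L2" idea concrete, here is a sketch of what a fast, deployment-free test might look like. It assumes an L0 test exercises code purely in memory with fakes in place of real dependencies; the `InMemoryStore` fake, the `close_work_item` function, and the test names are hypothetical.

```python
import unittest


class InMemoryStore:
    """Fake persistence layer, so the test needs no database or deployment."""

    def __init__(self):
        self.items = {}

    def get(self, item_id):
        return self.items[item_id]

    def save(self, item_id, item):
        self.items[item_id] = item


def close_work_item(store, item_id):
    """Business rule under test: a work item can be closed only once."""
    item = dict(store.get(item_id))
    if item["state"] == "Closed":
        raise ValueError("work item is already closed")
    item["state"] = "Closed"
    store.save(item_id, item)
    return item


class CloseWorkItemL0Tests(unittest.TestCase):
    def setUp(self):
        self.store = InMemoryStore()
        self.store.save(7, {"title": "Fix login", "state": "Active"})

    def test_closing_sets_state(self):
        self.assertEqual(close_work_item(self.store, 7)["state"], "Closed")
        self.assertEqual(self.store.get(7)["state"], "Closed")

    def test_closing_twice_is_rejected(self):
        close_work_item(self.store, 7)
        with self.assertRaises(ValueError):
            close_work_item(self.store, 7)
```

The same rule could be verified end to end through a deployed service, but that version would be slower and could fail for reasons unrelated to the rule itself.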
The chart at the bottom of the slide represents the main scorecard that we used. To make that shift from a 22-hour test run of something like 50,000 end-to-end tests to what we have now took about two and a half years. That chart represents every column as a three-week sprint. The whole chart represents about two and a half years. What you're seeing is the orange bar is the old-style tests, and over on the right, the big bar is the L0 tests, and the smaller blue bars are L1s and L2s. It was slow. It took years, but we just had to start, and we scorecarded the crap out of it. We had other scorecards that show how many of those orange tests each team owns, and whether they are slowly working it down and shifting the mix. Over two and a half years, we got there.
Let me show you what that looks like. This is our pull requests. This is active, so everything on the screen is pull requests completed in the last 20 minutes or so. I hope there's nothing secret in those comments. If I pick one of these recent ones, we have various builds and tests that run on our pull requests.
In addition to shifting all our test mix to these fast, reliable L0 tests, we also needed to shift left. We want to run those tests, now that it's not 22 hours, now that it's more like 22 minutes, earlier in the process, before the code makes it into our master branch. In our case with Git, that means we want to run it on every pull request.
If I look at any of these pull requests and go to Tests, I'm going to see somewhere around 85,000 tests. Every time I look at it, it's a few hundred more. So 84,000 tests. These are all of our L0 and L1 tests. About 18 minutes, and every single code change has to pass all 85,000 tests, or it doesn't get in. That's our L0 and our L1 tests. The vast majority of our tests fall into that bucket.
But we do have what we call L2s and L3s. If I look at another dashboard here, this crazy colorful chart is showing once the code gets into master and we kick off some other builds, we're going to run some of our slower tests. Every column there is a build, a CI build. Basically, once a pull request is merged, it kicks off one of these CI builds. The blue stuff means it's still running, so over on the right will be the most recent builds. Each row is a different suite of our slower L2, L3 tests. Every build runs through about 10 or so suites of our slower tests. Most of them are green. I see there's a bit of red sprinkled in down there. That's our approach to testing and trying to keep quality high in the product.
One last comment I've been thinking about a lot lately is, if you look at some of our root cause analyses - we publish our root cause analysis when we have major live-site incidents. You may know, a month or so ago, we had one: the South Central US data center got hit by lightning, and bad stuff happened. What I notice is that the failure scenarios are getting increasingly complex. I think we've gone to microservices. I didn't really talk about that, but over the course of those six years, we carved up our monolith into about 31 separate services now. There are lots of benefits to that. Our teams go faster with higher velocity. The teams that are on the microservices versus the monolith, life is much better for them.
But I feel like there's a downside, which is the complexity has exploded. If we look at the causes of some of our live-site incidents, it's these really niche, intricate, very specific failure cases, and there are just so many of them now with the increased complexity of microservices. That's a challenge that we're still struggling with today. I think one of our approaches to help solve that is to really get into chaos engineering, or what we call fault injection testing, which we do a little bit of today, but probably not as much as we should. That's probably where our journey will lead us next. Sam.
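Fault injection testing, in its simplest form, means deliberately failing a dependency call and checking that the caller copes. The sketch below is a toy illustration of that idea, not the team's tooling: `FaultInjector` and `call_with_retry` are hypothetical names, and failing on fixed call numbers is a choice made here to keep the example deterministic.

```python
class FaultInjector:
    """Wraps a dependency call and fails it on chosen call numbers."""

    def __init__(self, fail_on_calls):
        self.fail_on_calls = set(fail_on_calls)
        self.calls = 0

    def call(self, fn, *args):
        index = self.calls
        self.calls += 1
        if index in self.fail_on_calls:
            # Simulate a dependency outage instead of calling the real function.
            raise ConnectionError(f"injected fault on call {index}")
        return fn(*args)


def call_with_retry(injector, fn, attempts=3):
    """Code under test: retry a flaky dependency a bounded number of times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return injector.call(fn)
        except ConnectionError as error:
            last_error = error
    # Every attempt failed: surface the last injected error to the caller.
    raise last_error
```

With faults injected on the first two calls, three attempts succeed and two attempts do not, which is the kind of boundary such a test is meant to expose before production does.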
Sam Guckenheimer
Yeah. Thanks, Dylan. Dylan has shown us how we try to go faster without breaking things: how we use feature flags to control exposure to whom; how we use ring deployment to control exposure to where, going from a data center with the smallest user count to the largest user count, to the highest latency; then how we modified testing so that we can test at the earliest level possible, shifting left as far as we can. And by the way, there's also monitoring that helps us shift right.
This has been a significant change in engineering process. If you remember the move to Agile, we got this idea of a definition of done. Well, the definition of done that we believe in in DevOps is that you have delivered code with tests and telemetry, and the telemetry that goes with your code will substantiate or diminish the hypothesis that motivated that deployment. In other words, you're not done until you can prove in production that you're getting the results that you wanted, whether those results are higher customer engagement, faster performance, lower abandonment, any of the things that you might want to achieve. You need to measure, and you need to be sure. If you're not getting those, you pivot.
That's a real change. As Dylan pointed out, we didn't have any telemetry when we were on-prem, and we had to go through this process of introducing extensive trace points and then using a big data pipe so that we can gather everything that happens. We gather eight terabytes a day of telemetry, I believe. Across Azure services, it's five petabytes daily, using what you would know as Azure Log Analytics and Azure Monitoring.
One of the other things that comes out of this experience is what the State of DevOps report highlights: the idea of a J-curve. We got better. We introduced feature flags, hit a bump. We introduced progressive exposure, hit a bump. Then we said, "Oh my God, we cannot go fast enough because our tests make us too slow." We had all of this technical debt in these long-running tests. Dylan talked about the 22-hour "nightly automation run." There was actually a full automation run which was more than twice as long. They were inconclusive. They always ran red, and someone needed to investigate. We needed to go through that valley of darkness and refactor the technical debt in order to get back to the place where we could actually go fast at high quality and high reliability for customers.
The lessons learned that we'll leave you with are, in our case, we said we're not going to throw away the existing business. We are going to refactor. We are going to make the code base work both for the continuing on-prem business and for the SaaS from the cloud. We did that incrementally. It starts with a single sprint. We're now in sprint 143, and we're still doing it.
We needed to then figure out how we could get to safe deployment by controlling exposure, not inflicting changes on everyone at once, but a little bit at a time: folks who were in the preview based on feature flags, people who were in the canary ring, then people who were in the later rings. We do that for every change.
That also allows us, with feature flags, to do continuous experiments as part of our continuous improvement journey. Without the experiments and the measurement of the results, we just wouldn't know. It lets us do trunk-based development where everyone's going into one master branch, and it's safe because the testing's done in the pull requests before the commit to master.
We get there because we have enforced the idea that green is green and red is red, and testing needs to be a reliable signal, which means culturally, you're responsible for the tests with your code. You're responsible for the telemetry. It's not done until telemetry says it's done.
That's the story of eight years. The change takes time. We're maybe halfway there. We're going to keep going. We're going to keep improving. But we wouldn't be halfway if we hadn't started.
That's what we have time for. There's a meet the authors this afternoon during the networking time, I think 3:15. We'll be up there to chat further and answer questions, and we'll have our laptops if you want to drill into any demos. Look forward to seeing you there. Look forward to seeing you at our booth. Thank you very much for joining us.