The Skype Journey to 1ES and Cloud

Sam Guckenheimer

If you've heard me talk here before, you have heard me talk about our move to One Engineering System at Microsoft. You've probably seen me speak at some things like that.

One of the things that is new is that I've put up a website called DevOps at Microsoft. The URL is there. That includes a lot of stories like the one we're about to tell you, and much more depth in the same vein.

So our story about Skype and Microsoft starts around 2011. That was the time of this org chart at Microsoft, where we had each division doing its own engineering thing. It was also the time when Skype was acquired.

Now, the theory of each division doing its own engineering system was that, hey, that can be faster. They have the support for them. They can do the right thing for their market. And the reality was they just went slower.

So when Satya Nadella became our CEO, we took a policy twist and said, "Hey, what we really need to do is to build one engineering system that is available to you on the outside as Visual Studio Team Services." And that one engineering system will allow us internally to go as fast as possible and to have world-class pipelines, and testing, and agile planning, and continuous delivery for our teams.

So we're going to tell you the story of Skype joining Microsoft, and then Skype's coming on to using and contributing back to the One Engineering System at Microsoft.

Jennifer Perret

Thank you, Sam.

So before I start talking about our journey, I first want to make sure everyone understands the context of where we were coming from with Skype.

Skype was founded in 2003, and its architecture was set up as a peer-to-peer network. Skype, the product, the client app, ran primarily on desktop PCs. And because of that, how it was architected was that the heavy processing power was put onto those desktop PCs. Most of the code, therefore, also lived in a monolithic client core library. I'm sure many of you understand this.

The tech stack was open source, completely non-Microsoft, which was also a really fascinating problem. And you're sort of bringing in the whole engineering, "Okay, how are we going to do this?" Primarily Linux, Java, PostgreSQL, those are the top three.

But because we had several different R&D sites distributed around the world, the engineers were empowered to use whatever tool they wanted. So we had not only a monolithic client core, a monolithic service on the back end running proprietary PCs, but a myriad of different tools and technologies being used by our engineers. No problem with open source. It was the variety that we had to sort of initially struggle with.

By 2011, most importantly, the world was changing. Mobile devices were rapidly outpacing desktop PCs. And with these mobile devices, there were absolutely fundamental challenges, paradigms that we needed to address.

Nobody wants their battery to constantly be drained. We don't want to have an app that we're running on our mobile device, and an hour later, we have no power on our mobile device. Customers won't tolerate it. We, as product owners, don't want to have that experience for our customers.

In addition, this whole client--updating a desktop PC is painful enough. Imagine if every update to your client required your mobile device to get upgraded. We've seen it. We all hate it when we have to upgrade our apps.

Perfect example here is the Skype blog from 2011. I just wanted to include this. Everything we did in terms of hotfixing, back at that point in time, was through the client updates. That was the only way we could get a hotfix out. And then as we had more challenges, as we were hitting new scale and new volumes with these mobile devices, the number of hotfixes we needed to get out was also increasing.

So mobile clients ultimately need to be small and lightweight.

In addition, within the industry, cloud computing was becoming a reality. You had AWS, Azure, Google Cloud Platform, all running production apps successfully. And the whole shift into big data and what you could learn end-to-end between your client and a back-end server or service was a reality through big data and that need for telemetry.

So by 2012, the decision was made to re-architect Skype as microservices on Azure. And that decision was made by basically focusing on two strategies.

First and foremost, we needed to re-architect on the cloud so we could take all that processing that we had built onto the client and get it off of the client and out into the back-end service.

We also recognized as a strategy that by moving to the cloud, we would be able to scale out rapidly and then scale back down that load as we had very, very specific traffic trends when it comes to Skype. The biggest day in our year is Mother's Day.

So historically, we've been building our data centers to meet that one day in the year, through these huge skews, right? It's amazing, but that's the day. So we wanted to be able to make sure that we could scale out and get that processing power off the client.

We also recognized that we had very globally distributed engineering teams, and so we wanted to embrace the trend and learnings that we had seen through these microservices architectures, which would enable us to decouple the teams' work so that instead of trying to manage a monolithic approach across global time zones, these teams could actually drive their own agility by owning their microservices.

So very loosely coupled, truly decoupled microservices was our approach.

Now, to enable movement forward on this decision, how do you even tackle this? Okay, great. We're going to re-architect this thing.

We recognized quickly that we needed to actually drive a major shift in our engineering mindset, and we really focused on this being a shift to the cloud mindset.

This I called a lot of times the moment when I could watch my engineers' heads just poof. There was this moment. Everything they had learned had to be adjusted, because as we talked about how you're going to build a service on the cloud, the expectation we wanted our engineers to take into their design and their development was to expect outages.

Right? These are engineers that are used to owning the data center and/or owning the ability to take a stick to their operations teams and say, "Why isn't this working?" And now we're saying, "No, no, no, just expect it to happen."

So build to high availability. And our premise and the way we focused here was we're going to deploy these services across multiple data centers, and at any point in time, you're going to expect to reroute traffic and take a complete data center, perhaps two, because sometimes we like to think about the real worst-case scenarios, completely out of your traffic load. And you're going to do it in under five minutes.

So that was to us what building to high availability meant.
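
To make the rerouting idea concrete, here is a minimal sketch in Python. It is not Skype's actual traffic manager; the data center names, the weights, and the `drain` function are assumptions made up for illustration. It shows only the core step: remove one or two data centers from the pool and spread their share of traffic across the healthy ones.

```python
def drain(weights, failed):
    """Return new traffic weights with the failed data centers removed.

    The share that the failed data centers carried is spread over the
    healthy ones in proportion to their existing weights.
    """
    healthy = {dc: w for dc, w in weights.items() if dc not in failed}
    total = sum(healthy.values())
    if total <= 0:
        raise RuntimeError("no healthy data center left to take traffic")
    return {dc: w / total for dc, w in healthy.items()}


# Hypothetical three-data-center deployment (names and shares are invented).
weights = {"dc-a": 0.5, "dc-b": 0.25, "dc-c": 0.25}
after_one_outage = drain(weights, {"dc-a"})
after_two_outages = drain(weights, {"dc-a", "dc-b"})
```

In a real system the hard part is that every remaining data center must have the headroom to absorb the extra load, which is why the talk pairs this with designing for the loss of one or even two sites.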

We also made sure as we shifted to this cloud mindset that our engineers understood they were going to be facing all new sorts of problems as we tried to build up to the scale that we knew was coming to our product with Skype. And that was that things break when you go into high scale in a completely different way. And any kind of functional testing that you think you can get done in a lab will never, ever get you the reality of what's happening in production.

And so critical to us was testing in production. Also critical with that testing in production, to roll out or deploy progressively. Right? Add load to your service in a managed way and constantly track your service metrics, again, looking for any indication that that service is not able to handle the load that it's receiving. And if it hits a problem, be able to rapidly roll back to the healthy version.
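
The progressive rollout described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline Skype used: the stage percentages, the error-rate threshold, and the `error_rate_at` callback (which stands in for reading real service metrics) are all hypothetical.

```python
def progressive_rollout(stages, error_rate_at, max_error_rate=0.001):
    """Step the new version through increasing traffic shares.

    stages         -- increasing traffic shares, e.g. [0.01, 0.1, 0.5, 1.0]
    error_rate_at  -- callable returning the observed error rate at a share
                      (a stand-in for querying real service metrics)
    max_error_rate -- hypothetical health threshold

    Returns (final_share, history). A final_share of 0.0 means the rollout
    was rolled back to the previous healthy version.
    """
    if not stages:
        raise ValueError("at least one rollout stage is required")
    history = []
    for share in stages:
        rate = error_rate_at(share)      # watch the metrics at this load
        history.append((share, rate))
        if rate > max_error_rate:        # service cannot handle the load
            return 0.0, history          # roll back to the healthy version
    return stages[-1], history


def overloaded(share):
    """Pretend service that is healthy up to 10% of traffic, then degrades."""
    return 0.0005 if share <= 0.1 else 0.02


final_share, history = progressive_rollout([0.01, 0.1, 0.5, 1.0], overloaded)
```

With the pretend service above, the rollout stops at the 50% stage and returns to zero traffic on the new version, which is the "rapidly roll back" behavior the talk calls for.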

And then, again, as I mentioned before, being able to actually reroute traffic, either because of an outage or because, as you're rolling something out, it simply didn't handle the scale, whatever the change was that you made.

Now, that was our shift to the cloud mindset.

But in our efforts to actually shift from a monolith to microservices, we had to do this in a very phased approach.

First and foremost, we were taking a huge, wow, this is new. We're talking real-time audio/video services running on the cloud at a minimum of four nines of reliability. The problems just don't get bigger than that. And then in the scale, there's really not much that hits as much scale as Skype does.

So we needed to take a very careful approach and make sure that we could prove that these microservices could actually handle that load.

So we first introduced a few of these microservices, and as I mentioned before, we deployed traffic to them very slowly. We watched, and we actually started putting these microservices in and slowly saw how much load they could take, as well as how quickly the team could manage incidents, even with smaller loads.

Then, as we actually saw that these microservices were working, and yes, our premise was right, we started to add more microservices into the stack. It was proven.

Now, we also needed to make sure in this staged approach that we could handle changes on our client. As I mentioned, a whole part of Skype's journey was that we were coming from a very large, monolithic core library that sat on the client, where we talk about it as our old client.

Well, those old clients were sitting on TVs, places that would never get updated. So we needed to make sure that we always were able to run side by side, where we could handle the traffic of somebody coming from an old client and then the new clients as well.

So this was very much a phased approach around how we made this shift.
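
One simple way to picture running old and new clients side by side is to route each request by client version. The sketch below is only an assumption about how such routing could look; the version cutoff `"2.0"` and the backend names are hypothetical and are not taken from the talk.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.9.3' into a comparable tuple."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (3 - len(parts))      # pad '2.0' to (2, 0, 0)
    return tuple(parts)


def pick_backend(client_version, first_new_version="2.0"):
    """Send old clients to the legacy path and new clients to microservices.

    first_new_version is a hypothetical cutoff chosen for illustration.
    """
    if parse_version(client_version) >= parse_version(first_new_version):
        return "microservices"
    return "legacy"


old_client_route = pick_backend("1.9.3")   # e.g. a TV client that never updates
new_client_route = pick_backend("2.1.4")
```

The point of keeping the legacy branch is exactly the one made above: some old clients will never be updated, so their path has to keep working while new traffic moves over.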

Now, we didn't just take a shift onto the cloud. We also shifted our entire engineering to DevOps culture. And this was tremendously important to our move to the cloud because it really put that accountability as well as the empowerment into our engineers' hands for the quality of these services.

But key to this was, hey, training an entire org on DevOps. Because it's globally distributed, and I say this for other folks out here who do have these global workforces, if you're moving to DevOps, make sure that you involve your HR team early, because international HR laws have to be involved. This is above my pay grade. But there are aspects of what does it mean for these engineers to have on-call in different countries? So make sure you involve them.

And then, as I mentioned before, for us, this whole move to DevOps was about driving empowerment and accountability to our engineers. But it meant that we really needed to make sure that we were providing the self-service tooling so that our developers could be empowered: how they managed their incidents, how they actually did on-call rotations. Everything needed to be enabled so that those developers could drive the actions they needed themselves.

And then most important, and this is one of those things that I will say it really is... Rarely will I say that something has to come from the leadership or from top-down, but it is critical for creating a healthy live site culture that from that leadership all the way down, that focus on a growth mindset needs to permeate.

And what I mean by that is that any incident could have our corporate vice president sitting on that call. And the absolute worst thing that we could have happen is that one of our executives, or even anybody in the middle, decides to start shifting an incident away from where we are focused: on the customer, first mitigating the incident for our customer, and then on what we learn from this for everybody, not just the team that was involved, but for everybody.

Because we wanted to make sure, hey, how are we learning from this incident and going forward? So the worst thing that could happen is within that culture, if there's a blame game or any aspect.

So I think what we got really lucky with is with Satya Nadella's leadership, the whole organization driving towards this growth mindset, is that it really permeated in how we managed incidents. We focused on, okay, first, can we mitigate for the customer? Second, what are those repair items? And then how do we drive those improvements back into our organization?

So we might find something through one incident and one team, and then we'd make sure that we looked left and right and understood if there were other teams that also would get tripped up by this problem and how to solve that.

Now, our journey didn't stop here. As I mentioned, we started early on testing these microservices. And as these microservices started to prove their scale, their reliability, and the quality, they also therefore started to actually be asked to be moved into workloads that were part of our Office 365 products, which meant we needed to meet the compliance and security bars for those products, and we needed to meet those bars well before they ever took any production load.

The key here, for us, was to understand that moving the needle from a dev plus ops culture into a DevOps culture meant that we needed to look at how we handle separation of concerns. How do we make sure, throughout the entire engineering pipeline, we have that trusted path where everything is auditable, we have understanding of no injection points, all of those kinds of high-quality bars that we need for our auditors?

And in particular, what we did is because we knew we were introducing something new, is that we actually put into our schedules and planned a lot of time with our auditors doing mock audits.

And what I mean by mock audits is we took the controls that the Office 365 product was meeting, and we took every single control and reviewed that with our auditors, saying, "Okay, well, now we're running on the cloud, and we're running in a DevOps model. What do we need to meet? How do we do it? And how do we make sure that we're meeting this bar?"

So if you are making this change, my point here being, if you're moving and changing how you manage that separation of concerns as well as other aspects of your development process from a dev plus ops to DevOps, make sure you build in that time for all of your compliance and security audits. Because you'll want to really go into those early knowing exactly what tooling, if anything, you need to enhance your pipeline with.

An example of this is that we discovered early on that, because it was the engineering team that was pushing to production, we needed to make sure that we managed that separation of concerns for anything that went into production. Turns out the Azure organization also needed to meet this, as did several other teams that were building on Azure.

So we had the opportunity where we actually did co-development with the Azure organization to make sure that we actually had not only that approval path, but also the audit trail within our pipeline to make sure that we met all those obligations.
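
As a rough sketch of what an approval path plus audit trail means in a pipeline, consider the Python below. It is an assumption-laden illustration, not the tooling co-developed with Azure: the `release` function, the rule that the approver must differ from the author, and the audit record fields are all hypothetical.

```python
from datetime import datetime, timezone


class ApprovalError(Exception):
    """Raised when a release does not satisfy the approval rule."""


def release(change_id, author, approver, audit_log):
    """Gate a production push on a second person and record the decision.

    Every attempt, approved or not, is appended to audit_log so that an
    auditor can later see who pushed what, who approved it, and when.
    """
    approved = approver is not None and approver != author
    audit_log.append({
        "change_id": change_id,
        "author": author,
        "approver": approver,
        "approved": approved,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    if not approved:
        raise ApprovalError(f"{change_id}: needs an approver other than {author}")
    return True


audit_log = []
release("change-1", author="alice", approver="bob", audit_log=audit_log)
```

The design choice worth noting is that rejected attempts are logged too; an audit trail that only records successes cannot show that the control actually blocked anything.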

Now, I also mentioned earlier in the slides, it wasn't just cloud computing that was happening in the industry, it was also big data. And we have a high-scale service running around the world that we're now saying, "Hey, we're going to run it on the cloud, but how do we make sure we're hitting that quality bar?" And so we transitioned our org into being very data-driven.

The key to this, and kind of in hindsight, I will say a couple of key learnings, was one, boy, you can flood your system with data. And in fact, initially we did. We flooded our back end with data, and we didn't understand early on, well, what are the KPIs we want to measure? And what's the data we actually need? And there will likely always be that pendulum as you're sort of figuring out, well, what data actually do I need?

So learnings from this: design your KPIs very carefully. Understand what your business metrics are, your customer scenario metrics. This, for example, would be calling. How are we actually looking at the call quality and measuring that? And then what telemetry do we need? And then also the service metrics.

And I say it very specifically with those three in mind, because for the service metrics, because we are microservices, one of our quick learnings was that our teams, they're focused on the microservices they own and develop. And they would add all sorts of great telemetry so they could see their service health. But they never thought about early on, oh geez, we need to correlate all this telemetry.

And so a key learning, one of those, boy, 20/20 and wish I would've known this back then, was really around, hey, you know what? Let's identify quickly what's that correlation ID? What are the things we need to do so in this microservices world, we can actually get those signals and look at them carefully?
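
A correlation ID can be sketched very simply: one ID is created at the start of a customer action and every microservice that touches that action stamps the same ID on its telemetry, so the back end can join the events. The code below is a minimal illustration; the service names, event fields, and helper functions are assumptions, not Skype's telemetry schema.

```python
import uuid
from collections import defaultdict


def new_correlation_id():
    """Create one ID per customer action (for example, one call attempt)."""
    return uuid.uuid4().hex


def emit(log, service, correlation_id, **fields):
    """Record a telemetry event; every service passes the same ID along."""
    log.append({"service": service, "correlation_id": correlation_id, **fields})


def join_by_correlation(log):
    """Group events from all services by the action they belong to."""
    joined = defaultdict(list)
    for event in log:
        joined[event["correlation_id"]].append(event)
    return dict(joined)


log = []
call_id = new_correlation_id()
emit(log, "signaling", call_id, ok=True)     # hypothetical service names
emit(log, "media", call_id, ok=False)
other_call_id = new_correlation_id()
emit(log, "signaling", other_call_id, ok=True)
joined = join_by_correlation(log)
```

Without the shared ID, each team would still see its own service health, but nobody could tell that the media failure above belonged to the same call that signaling reported as fine.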

And of course, as I mentioned about our traffic rollout and rollback, we also focused on experimenting continuously with our customers. So really doing A/B testing so we could see what's working for our customers, what's not, and basing our decisions on that data and on that experiment. Because otherwise it's really easy to try and go into study group of one, or as sometimes we like to say in Microsoft, the highest-paid person's opinion, AKA the HiPPO, making the decision. So we didn't want to do that.
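
One common building block for A/B testing is deterministic bucketing: hash the user and experiment name so the same user always lands in the same variant. This is a generic sketch of that technique, not a description of Skype's experimentation platform; the experiment name, the user ID format, and the 10,000-bucket granularity are assumptions.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_share=0.5):
    """Deterministically place a user in 'treatment' or 'control'.

    Hashing experiment and user together keeps a user's assignment stable
    within one experiment but independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0..9999
    threshold = int(treatment_share * 10_000)
    return "treatment" if bucket < threshold else "control"


# Hypothetical experiment over 10,000 made-up user IDs.
assignments = [assign_variant(f"user-{i}", "new-call-ui") for i in range(10_000)]
treatment_count = assignments.count("treatment")
```

Because the assignment is a pure function of the inputs, no lookup table is needed, and the decision can be compared against customer metrics rather than against anyone's opinion.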

So measure what's important. Focus on the usage, focus on your engineering team's velocity. How quickly can you address customer needs, changes to the market, and most importantly, live site health.

What we did not measure was what we would often call activity. What we focused on was the impact, the positive impact to the customer and/or to the business, not the activity of the developer.

So finally, in terms of Skype and how we had an opportunity coming through as this large open source engineering org. Remember, as I mentioned earlier, acquired in 2011, completely open source stack. One of the things that came with this engineering organization was this focus on code contributions and partnership.

And so we had the opportunity: not only did we move rapidly to Git under VSTS as part of the 1ES system, but we also co-developed with VSTS the Maven and Ivy package management solutions that are available to all of you as GA. That was actually the Skype organization that built that in partnership with VSTS.

And we also did a bunch of co-development with the Azure organization for areas that we needed work within the engineering pipeline, things that we needed to make sure our agility went forward.

The point being here is this is a huge opportunity for feature teams, what would normally be teams that focus just on features, to make sure they understand that because they're DevOps, they also own a part of that engineering system.

So Skype today, we consist of approximately 1,000 engineers, five different R&D locations. At the extreme, we're about 10-plus hours apart. So anywhere from the West Coast all the way to Estonia. There's a lot of early morning meetings to make sure that happens at a reasonable time for everybody.

Our service volume, just to give you a sense of the scale that we're dealing with: we have 1.8 million push notifications coming through our system daily and 2.5 million call connections being made daily. And in terms of being data-driven, the quick numbers that I could grab, we are actually sending 142 billion events daily to our back-end system. And on the back end, we are actually joining that data, which are signals around the service health, so that we can see quickly if we are hitting some quality issues.

Five habits we've learned so far, just to summarize.

Cloud requires designing for resilience. If you can have your engineering team go through that mindset where they understand they don't control it all, and the idea that blaming another organization or another group of engineers for an outage, move past it. Just focus on designing for resilience. It really changes things.

Agility requires developers being empowered, not only with self-service tooling, but empowered to own the quality of their service.

Change requires a growth mindset. I can't say it enough. This willingness to learn, this focus on getting better, it's critical. Everywhere you can foster that, it's going to make any shifts that you're driving into your organization absolutely successful. Without it, not sure.

Contributions back to the engineering system. This really is about those developers feeling empowered. Instead of the, "Oh, well, the engineering system team owns that," or, "Oh, wouldn't it be great if this other team would build this tooling?" Empower your engineers. Make sure that there is this focus on sharing code back, getting into each other's code bases, and as we like to say, lifting all boats.

And then finally, on the compliance and security, if you are making these kinds of fundamental shifts and you do have compliance and security obligations to your customers, make sure you work with your auditors early. Go through those mock audits. Make sure you understand, well, what does it take? But also understand that the auditors get it. They understand that there are changes here that need to be made, and if you focus on the controls, you'll get it right.

Sam Guckenheimer

So Skype's about 1,000 engineers. There are 75,000 more in Microsoft who are now using this One Engineering System. And this culture of getting the best engineering system for our developers to make them as productive as possible is part of what we do, and contributing back is part of what we do.

Skype, when it started, was the largest Java Linux user in the company, and they contributed back what they needed for their packages.

That's the experience report. We may have time for a question or two. I do have one call to action before we do that, which is if you look at your game on your mobile in the conference app, if you look at Games, you'll get 10 techie points for doing the DevOps assessment, which is there. And there was a message about this over lunchtime. You'll get 10 techie points if you do that and take a selfie with your results page.

So do that on your laptop or your mobile, or come by the booth and do that. We'll be there during the reception.

So are we allowed questions for two minutes? Two minutes, three minutes. I see.

Q&A

Sam Guckenheimer: Questions?

All right. We can't see you, but I can hear you.

Audience Member: All right. It's going to be an anonymous question from the back.

Sam Guckenheimer: Perfect.

Audience Member: Not anonymous, microphone at the back.

Audience Member: I have a question about SLA. So you take Skype and you break it into a bunch of different microservices, and then you lost them. What's the SLA for North America? You have one number.

Sam Guckenheimer: So the question is, how do you do a single SLA, like 99.95 or something like that, for North America when you've got a bunch of microservices?

Jennifer Perret: Great question. And in fact, one of the key things was, remember when I was saying, hey, initially our teams focused on getting the telemetry for their service only, and they didn't do that correlation? That was an exact example of, oh, wait, how do we actually see what our users are experiencing from end to end?

And so I think the key thing with the breaking up of microservices is make sure you understand what's that correlating ID or the way that you're going to join the data on the back end so that you can then pivot, hey, customer calls from North America to North America. What does that reliability SLA look like end to end? But it's critical for you to have that correlation. Great question.

Sam Guckenheimer: It also lets you get smart about alerts. Instead of flooding everyone with too many alerts, you can alert on the things that matter to customers.
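
To illustrate the end-to-end SLA answer above, here is a minimal sketch of computing one reliability number for a region pair from correlated telemetry. It assumes each event carries a correlation ID, caller and callee regions, and a success flag; those field names, the `hop` helper, and the sample data are all hypothetical. A call counts as successful only if every service hop for that call succeeded.

```python
from collections import defaultdict


def end_to_end_reliability(events, caller_region, callee_region):
    """Fraction of calls for a region pair where every hop succeeded.

    Returns None when no calls match the region pair.
    """
    calls = defaultdict(list)
    for event in events:
        calls[event["correlation_id"]].append(event)
    total = succeeded = 0
    for hops in calls.values():
        first = hops[0]
        if (first["caller_region"], first["callee_region"]) != (caller_region, callee_region):
            continue
        total += 1
        if all(h["ok"] for h in hops):
            succeeded += 1
    return succeeded / total if total else None


def hop(cid, service, ok, caller="NA", callee="NA"):
    """Build one made-up telemetry event for the example."""
    return {"correlation_id": cid, "service": service, "ok": ok,
            "caller_region": caller, "callee_region": callee}


events = [
    hop("c1", "signaling", True), hop("c1", "media", True),
    hop("c2", "signaling", True), hop("c2", "media", False),
    hop("c3", "signaling", True),
    hop("c4", "signaling", True),
    hop("c5", "signaling", False, caller="EU"),
]
na_to_na = end_to_end_reliability(events, "NA", "NA")
```

In the sample data, call `c2` looks healthy to the signaling team and failed only in media; the joined view is what turns per-service health into the single customer-facing number the question asked about.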