How A Hotel Company Ran $30B of Revenue In Containers
App modernization with containers for the rest of us.
Full transcript
The complete talk, organized by section.
Host Intro (Gene Kim)
Thank you, Adam and Lauren.
I have been a big fan of this next speaker for over two years. People kept coming to me and saying, "You've got to meet this guy because he is doing things with containers that will blow your mind."
It was true. I learned that, among many other things, he was containerizing all of the revenue-generating systems at a top hotel company that was collectively supporting over $30 billion of annual revenue. Dwayne Holmes did this work as a senior director of DevSecOps and enterprise platforms.
For years I wanted him to give a talk about what he was doing here at this conference, but due to a variety of reasons we were never able to make that happen. Thanks to him moving to a new company, I'm delighted that he's finally able to share his story, which, among many other things, earned him the title of Google Cloud Certified Fellow, having built and managed one of the world's largest Kubernetes installations.
I'm even happier that he is now joining our longtime friend John Rostarski at PNC Bank as their VP of converged applications and cloud. Please welcome Dwayne Holmes.
Dwayne Holmes
Thanks, Gene. I really appreciate the opportunity to be here. I have definitely been following your work, like everyone, and I appreciate the awesome words. My name is Dwayne Holmes, and I'm talking about my DevOps journey. To protect the innocent in this presentation, I will be vague.
All the way up until last month, I worked for one of the largest hotel companies in the world. In 2019, they had revenues of over $20 billion. They have 170,000 employees. They're over 90 years old, with over 7,400 locations or hotels and 1.4 million rooms available. In 2016, they had a large merger approved. In 2018, that merger integration was completed, and in 2019, we rolled out a massive program for our customers.
A little about the things that we accomplished during that time: all the way up until last month, I ran a team that supported over 3,000 developers across multiple service providers. Our model was that we had few FTEs but lots of service providers when it came to development. Development was done when there was a project that was greenlit by corporate.
In 2016, microservices and containers were running in production. In 2017, over $1 billion was processed in containers. I didn't say microservices because we had microservices and also micromonoliths running in containers at the time. Ninety percent of all new applications coming out of development were in containers, and Kubernetes was running in production in 2017. In 2018, we were one of the top five largest production Kubernetes clusters by revenue, according to Red Hat and JFrog. By 2020, when I left, we did thousands of builds and deployments per day. We ended up having two Google Cloud Certified Fellows, and we had experience running Kubernetes in five cloud providers. The one most people won't guess is Alibaba Cloud.
In order to know where you're going, you need to know where you've been. In 2012, I worked for a financial company, and over 95% of infrastructure was outsourced. Out of 500 employees, only five were retained. Developers thought that if they outsourced all infrastructure to a provider, the provider would do architecture, engineering, and operations. As everyone knows, the same amount of work was still required. We had no engineers or architects, but we had a large outsourced provider contingent that was able to help us.
I learned three principles when I was at this financial company. The first one: the CIO always talked about dial tone. I'm also a closet economist, so I believe in Adam Smith, division of labor, and trade specialization. Finally, I believe in automating everything.
What is the dial tone principle? I always tell this story anytime I get a new team. No one cares about any of the technology that you use to implement a phone. It could be Cisco or Avaya or whatever. It just better work. When the business picks up the phone, they expect a dial tone. Anything less makes the business upset. To me, this means that you focus on what is important.
This led me to Ruby on Rails. Most people, especially my mom, know what Etsy, Hulu, Square, Instacart, Airbnb, Twitter, and Twitch do. However, they don't know the development platform that all these started on, which was Ruby on Rails. The reason I love Ruby on Rails is the second pillar of the Rails Doctrine, which talks about convention over configuration. That basically means that we should focus on things that accelerate business value, not on things that don't. Because of that, a lot of decisions are already made before you even use the framework.
The other thing, because I'm a closet economist, is that I love Adam Smith's Wealth of Nations. I firmly believe in division of labor and trade specialization. There is a story called "The Lawyer versus Secretary." Suppose you have a secretary and a lawyer, and the lawyer can type faster, file faster, and use a computer faster than the secretary. Would the attorney choose to be a secretary or choose to be an attorney? Of course, they would choose to be the attorney. Every hour an attorney spends doing secretarial work is an hour they can't spend being a lawyer. Because of that, you bring on a secretary in order to maximize productivity.
That's how I feel about my DevOps or release engineering teams: developers are like the attorney. Their job is to put out amazing code, and our job is to support them. Every hour a developer is focusing on things that don't provide business value is an hour they are not providing that value.
The other thing I love is: when all else fails, automate. There are only two ways you can increase productivity. The first way is automation, and the second way is increasing resources. The issue is that, in my career, resources have always been scarce. Because of that, in order to increase productivity, I've always had to fall back on automation.
How does this all fit in? In late 2015, I was a vice president at a financial company. I had a corner desk that overlooked the harbor and the city. Because they got rid of most of the infrastructure, I had amazing career stability. Everything was great.
However, one day I went to a meetup, and this meetup filled my head with crazy ideas about containers. I went to learn more about Ruby on Rails because I was doing some work for my mom and doing lots of development. When I heard about containers, it satisfied three things. Dial tone: containers abstract infrastructure. Specialization: operations could create containers that devs could use over and over again. Automation: I could build containers over and over again, and everything would just work.
I knew I needed to make a change. I found out that this hotel company was willing to go all in on containers. The issue was that it was probably bad for my career, at least so I thought, and everyone told me that. I went from a VP to a contractor. I went from amazing stability to no stability at all because I was a contractor and this was an experimental project that could be cut if it didn't work. Instead of amazing views of the city and harbor, I was sitting at a table, not a desk, in a room with no windows. I didn't know if I had made the right decision, and I would call and talk over and over again with people about whether this was the right thing to do. Most people said I was a fool.
The thing that allowed me to stay was an amazing team. They formed a great cross-functional team of people with amazing talents. We had three developers and three infrastructure people. I love giving nicknames to people. We had our fearless leader, who rallied the troops; the genius, who was the superstar developer; the professor, who knew everything about everything, was both developer and infrastructure person, and actually suggested containers; and Superman, who had unbelievable energy and was our doer. I didn't give myself a nickname. I was just glad to help.
The goal of this team was essentially evolution versus revolution. The goal was that we would take something and totally change the way the enterprise worked through this cross-functional team. I learned lots of things. One thing I learned early on is that environments, especially lower-level environments, should be production-like. The reason is that we were a high-performing DevOps team, but unfortunately we didn't make any money. The only performance-testing slot we could get was from midnight to 5:00 a.m. because legacy teams had all the best slots, and we got the worst one.
The other great thing about having this amazing team was that, at the time, we couldn't Google anything about containers. We had to create our own orchestration engine in order to orchestrate containers on multiple VMs. We were able to prop each other up and bring ourselves along. As a result, we started creating container-based frameworks and libraries, and we thought about how to secure these and deploy them over and over again. Everything was kind of like a pyramid, where we built on things so that things would go faster.
The other thing is how I learned about the greatest microservices on the planet. I'm a Linux guy, and the way we thought about containers is that you have the command, which is a container; your command-line options, which are environment variables; and anytime you do a pipe, think of that as a sidecar. Linux makes the best microservices because you can take a command, change how it works based on command-line options, then use a pipe command and change it even more by adding extra commands. This is how we built a lot of our containers.
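The analogy can be made concrete with a minimal Python sketch. This is an illustration only, not the team's tooling: each function stands in for a container, an environment mapping plays the role of command-line options, and `pipe` chains a main stage with a sidecar-style stage. All names here (`upper_service`, `prefix_sidecar`, `pipe`, `SHOUT`, `PREFIX`) are hypothetical.

```python
from typing import Callable, Mapping

# A "container" here is just a function: text in, text out, configured by env.
Stage = Callable[[str, Mapping[str, str]], str]


def upper_service(data: str, env: Mapping[str, str]) -> str:
    """Main 'container': its behavior changes with an environment variable."""
    return data.upper() if env.get("SHOUT") == "1" else data


def prefix_sidecar(data: str, env: Mapping[str, str]) -> str:
    """'Sidecar': enhances the main container's output without modifying it."""
    return f"{env.get('PREFIX', '')}{data}"


def pipe(*stages: Stage) -> Stage:
    """Compose stages left to right, like `cmd1 | cmd2` in a shell."""
    def run(data: str, env: Mapping[str, str]) -> str:
        for stage in stages:
            data = stage(data, env)
        return data
    return run


service = pipe(upper_service, prefix_sidecar)
result = service("hello", {"SHOUT": "1", "PREFIX": "[svc] "})
print(result)  # [svc] HELLO
```

The same `service` run with an empty environment returns its input unchanged, which is the point of the analogy: one artifact, different behavior depending on configuration and on what is piped after it.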
The result was that we came up with a framework for how we could deploy containers on multiple servers in multiple ways. A lot of people believe that you have to be in Kubernetes day one in order to use containers. That's not true. People also believe containers are immutable. Depending on how you design them, they're not, especially if you're using them on a VM. We really believed that containers are awesome, and if you focus on container hygiene -- in other words, how you build your containers -- you can run anything inside a container on a VM. Especially when you start out, focus on maybe putting containers on VMs, and then you can go to Kubernetes.
The reason we loved these frameworks, especially the ones we built, is because of dial tone. We abstracted where we ran containers. No one knew where we were running these containers; they just knew there was a URL where they could get a microservice. In terms of specialization, we realized that a small team can service a much larger team. Finally, automation: we could build hundreds of times without us getting involved.
After the RAM project was successful, they asked me whether I wanted to come on as a full-time employee. I said sure. I asked for six things: all containerized workloads, developer tools, pipelines, platforms, and base images. Three years from then, that is considered modern operations. I was asking to run modern applications or operations, but I felt that containers, platforms, base images, pipelines, and developer tools were really important for me to do my job for release engineering.
The reality was that, if you look at the classic DevOps issue, development throws code over the DevOps wall and it hits operations. I was like code being thrown over the wall into operations. One, the infrastructure SVP didn't believe in the team. Two, I had few allies in operations because most of my time was spent with development. Multiple reorgs left me under a different VP who thought the DevOps team or release engineering team was really a QA team. On top of that, the operations team had created another DevOps team where they were essentially implementing a service catalog with Chef. I was thinking, "Oh my goodness, I'm a team of one. Even though I have all these developer friends, this is deja vu all over again and I'm by myself."
The issue was that we had two competing platforms, and the question was: what will devs choose? One option was a service catalog where you could have a dropdown menu and pick compute, memory, or storage. The other was a platform where you go into Git and hit commit, and afterwards Jenkins does some things, creates an artifact, which is a container, and deploys it to compute.
The key thing is that the service catalog, in my mind, was TMI -- too much information -- where developers had to know all this compute stuff. The pipeline was the abstraction I wanted. Fortunately, developers chose our pipelines 100% of the time. The service catalog's version of dial tone exposed infrastructure details that developers didn't want, whereas with our dial tone they just hit commit and saw their code running once Jenkins did its work. We managed Jenkins, containers, and compute so the developers could be amazing at what they do and do what they love. We focused on workflow, not necessarily building servers.
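A rough sketch of the commit-driven flow, written as plain Python rather than real Jenkins configuration. The registry host, repository name, and function names are assumptions for illustration. The developer-facing surface is a single call, `on_commit`, and the build and deploy steps stay behind it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    repo: str
    sha: str


def build_image(commit: Commit, registry: str = "registry.example.com") -> str:
    """Stand-in for the CI build step: derive an image reference from the commit."""
    return f"{registry}/{commit.repo}:{commit.sha[:7]}"


def deploy(image: str, environment: str) -> dict:
    """Stand-in for the deploy step: record what would run where."""
    return {"environment": environment, "image": image, "status": "deployed"}


def on_commit(commit: Commit, environment: str = "dev") -> dict:
    """The only thing a developer triggers; compute details stay hidden."""
    return deploy(build_image(commit), environment)


release = on_commit(Commit(repo="booking-api", sha="9f2c1ab7d4e0"))
print(release["image"])  # registry.example.com/booking-api:9f2c1ab
```

Contrast this with a service catalog, where the caller would have to supply compute, memory, and storage choices before anything runs.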
If you look at everything we built, this was our framework: how we built base images, pulled in libraries and frameworks, secured everything, how developers interacted with the pipeline and tickets, and how we secured things with code quality scans, static code analysis, and Aqua Security scanning containers and making sure running containers were secure.
Once everyone found out that 100% of things were going our team's way, we got even more work. Part of this work was merger integration work, customer product work, refactoring APIs, and going all in on international expansion with a partner. This is the phase I call "go faster."
The issue was that, even though we had done all these amazing things and wanted to use Kubernetes, the SVP wasn't at all sold on Kubernetes. Because we were a small team and developers were choosing us 100% of the time, we had an unbelievable workload. I thought to myself, "Oh my goodness, did anyone get the memo?"
Fortunately or unfortunately, depending on who you are, we found out one day that our multibillion-dollar website went down because of a release. Everyone was on the call. The SVP was hot. People were talking about how to roll back release changes. These calls tend to start with infrastructure, then more and more people are brought on as time goes on. The SVP and people on the call, mostly infrastructure people, assumed it would take hours to roll back the change.
We meekly shared our screen. We pushed one button, the easy button, and rolled back the change. We literally blew everyone away because everyone thought it would take hours to roll back the change, and instead it took minutes.
One of my favorite diagrams was one I created in 2017 after attending Google I/O, when I was obsessed with machine learning. I showed this to the SVP later on. You already know how operations and development people go: infrastructure people are always trying to figure out why developers are doing stuff, and developers are always like, "It's the network." I told my SVP that I had asked for developer tools and had most of them, though I needed some more. In the end, all these tools can be used to build models to do three things: grade commits, grade developers, and grade the team. Based on things they do, you can either reject or approve a release.
In order to do that, I needed Jira, Jenkins, Git, Artifactory, and various other tools. We could integrate them, and then they could begin to grade the different things developers do. That blew my SVP's mind, and he loved it. As a result, we got the green light to continue to consolidate. We formed a new infrastructure organization.
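The talk does not describe the models themselves, so the following is only a sketch of the shape such grading could take. It uses fixed, hand-picked weights instead of a trained model, and the signal names (`build_passed`, `linked_ticket`, `critical_findings`) and the threshold are assumptions rather than anything the team is known to have used.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CommitSignals:
    author: str
    build_passed: bool       # e.g. reported by the CI server
    linked_ticket: bool      # e.g. commit message references a tracker ticket
    critical_findings: int   # e.g. reported by a dependency or container scan


def grade_commit(s: CommitSignals) -> int:
    """Score one commit from 0 to 100 using fixed weights."""
    score = 0
    if s.build_passed:
        score += 50
    if s.linked_ticket:
        score += 20
    if s.critical_findings == 0:
        score += 30
    return score


def grade_developers(commits: list[CommitSignals]) -> dict[str, float]:
    """Average commit grade per author."""
    by_author: dict[str, list[int]] = defaultdict(list)
    for c in commits:
        by_author[c.author].append(grade_commit(c))
    return {author: mean(scores) for author, scores in by_author.items()}


def grade_release(commits: list[CommitSignals], threshold: int = 80) -> str:
    """Approve only if the team's average commit grade meets the threshold."""
    team_grade = mean([grade_commit(c) for c in commits])
    return "approve" if team_grade >= threshold else "reject"


commits = [
    CommitSignals("ana", build_passed=True, linked_ticket=True, critical_findings=0),
    CommitSignals("ben", build_passed=True, linked_ticket=False, critical_findings=0),
]
decision = grade_release(commits)
print(decision)  # approve
```

A real version would learn the weights from historical data across the integrated tools. The fixed weights here only show how commit, developer, and team grades can feed a single approve-or-reject gate.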
Another one of my favorite slides is about value. Customer service is what I call hands-on service: high touch, high cost, very low value. A lot of times when people think about DevOps organizations, they think of sitting developers and operations people together in the same room and, through osmosis, all this amazing stuff happens. I believe the amazing stuff happens when there is clear communication between operations and development. For example, if we create APIs that developers can use in order to use infrastructure, developers go faster and their happiness goes up.
One of the people I love listening to is Kelsey Hightower. He talks about NoOps. It is not removing the operations organization; it is forming product teams. These product teams control the end-to-end flow for how to provide a product to development. As a result, we think about how to productize things.
When I think about the value scale of providing DevOps, I think of customer service, which is low value. Then you go to the platform, which is location agnostic. Instead of being an artisan-focused team, you are now focused on process. Then CI/CD allows a team to be the enablers of all the people who are experts in their field. We can standardize tools and platforms, force good practices, and allow people to specialize. Finally, base images contain enterprise standards. They're opinionated and automated. The great thing is that they're a contract between development and operations.
Operations sometimes gets involved when a development team has 100 steps to go and they're on step 90, and no one likes being told at step 90 that they have to go back to step one. Our base images were a way that, if everything worked in conjunction with our pipeline, you could deploy to our platform as fast as possible.
The dial tone is that developers should understand how to use your platform. In our case, we used key-value pairs with Git, and this controlled the pipeline as well as Kubernetes, because Kubernetes is hard. We focused on Legos versus hands-on work. The specialization was that the team could focus on innovation. Developers could focus on providing business value while we thought about and worried about all the rest of the stuff. Automation meant taking all these pieces together over and over again to build something amazing.
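One way to picture key-value pairs in Git driving Kubernetes is the sketch below. The file format and key names (`app.name`, `app.image`, `app.replicas`) are hypothetical, since the talk does not show the real schema. The output is a standard `apps/v1` Deployment manifest built as a Python dictionary, so developers edit three lines while the platform team owns the rest.

```python
import json

# Hypothetical file kept in the service's Git repository.
CONFIG = """
# deploy settings
app.name = booking-api
app.image = registry.example.com/booking-api:9f2c1ab
app.replicas = 3
"""


def parse_pairs(text: str) -> dict[str, str]:
    """Parse `key = value` lines, skipping blank lines and `#` comments."""
    pairs: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs


def render_deployment(pairs: dict[str, str]) -> dict:
    """Expand a few developer-facing keys into a full Deployment manifest."""
    name = pairs["app.name"]
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": int(pairs.get("app.replicas", "1")),
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": pairs["app.image"]}]},
            },
        },
    }


manifest = render_deployment(parse_pairs(CONFIG))
print(json.dumps(manifest, indent=2))
```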
As a result, we went from two teams that had separate focus to one team where we combined cloud, base image, infrastructure, automation, shared service, general programming platform, and CI/CD. The results were that we could support lots of developers, provide microservices and containers running in production, process a lot of money on our containers, and have Kubernetes in production where most people were just playing with it in a lab. We could do thousands of builds a day and be multi-cloud. Most people can't get one cloud provider right. We got five cloud providers right.
If I were to give advice, first I would say take calculated risks. I didn't know whether my foray into being a contractor would be good. I thought it was the worst mistake I ever made. But I did it because I firmly believed in the technology and thought it would be a paradigm shift.
Also, form teams of like-minded individuals. When you feel down and like you are fighting against everyone, you can look to the person next to you and feel better. Digital transformation, unfortunately, is politics. I was offered a VP role in 2017, and I didn't do it because I wanted to be hands-on-keyboard and I didn't like politics. The issue is that sometimes you have to take a promotion if you need to control your own destiny. I love technology, but I really thought I did a disservice to my team.
Finally, start slow. You don't need to run Kubernetes today, but most workloads can go inside a container. A lot of people jump to microservices and doing all this amazing stuff. It's okay to take baby steps. We took baby steps a long time ago, and that is the reason why we were able to be where we are now.
To recap: dial tone means abstract. Specialize: have a team obsessed with release engineering. Automate, automate, automate. Resources are normally scarce, and the only way to overcome that is to automate.
This is the help I'm looking for: everyone needs to convince Gene to allow me to do a container, Kubernetes, and CI/CD pipeline deep dive. When I was at the hotel company, we were on our Gen4 containers. They were cloud portable and scalable. Health checks were built in. We had tests for latency versus CPU. Certs were no longer in the application or managed by developers. We focused on circuit breaking, and we had APM built in and zero trust. Our images were very small. That's all about container hygiene. Our sidecars were used to enhance everything.
The other thing I'm passionate about is pipeline security: how do you secure a CI/CD pipeline end to end? Have plugins in the IDE that give feedback all the time to developers. Have a library and framework remote repository. Most people use Artifactory and Xray to do that. Have container scanning over and over again. Aqua Security is great for that. When you talk about day-two operations of containers, you have to think about how to secure the host, how to whitelist commands, and how to do forensics, because containers are ephemeral and dying all the time.
The other thing I would talk about is that it's very important to use environment variables. Most of the time people keep env-prod, env-QA, or env-dev folders, and those folder trees are horrible to maintain. Configuration should go through a separate pipeline from your artifact, and environment variables should carry it.
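A minimal sketch of that idea, assuming hypothetical variable names (`DATABASE_URL`, `LOG_LEVEL`, `REQUEST_TIMEOUT_S`): the same code, and therefore the same container image, runs in every environment, and only the injected variables differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    log_level: str
    request_timeout_s: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables instead of per-environment files."""
    return Settings(
        database_url=env["DATABASE_URL"],  # required: raises KeyError if missing
        log_level=env.get("LOG_LEVEL", "INFO"),
        request_timeout_s=float(env.get("REQUEST_TIMEOUT_S", "5")),
    )


# Same artifact, different injected configuration.
dev = load_settings({"DATABASE_URL": "postgres://dev-db/app", "LOG_LEVEL": "DEBUG"})
prod = load_settings({"DATABASE_URL": "postgres://prod-db/app", "REQUEST_TIMEOUT_S": "2.5"})
print(dev.log_level, prod.request_timeout_s)  # DEBUG 2.5
```

Passing the mapping explicitly keeps the example testable. In a running container, `load_settings()` with no argument would read the variables that the configuration pipeline injected.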
If you can convince Gene to allow me to do that type of deep dive, that would be amazing. Thank you so much for your time.