Speed as a Prime Directive


San Francisco 2016


Vice President of Engineering · Hyatt Hotels Corporation

Hyatt is transforming into a technology company that delivers digital experiences in the Hospitality industry. We're applying Continuous Delivery in order to achieve our goals faster. In the process, we are simplifying and abstracting legacy environments and building a hospitality technology platform.


Full transcript


Ray Krueger

I'm Ray Krueger. I'm the Vice President of Engineering at Hyatt Hotels.

That's a recent title for me. My original role at Hyatt, which is what most of this talk is based on, is as our chief architect. I joined in 2015 as our chief architect to come in and figure out how we were going to embrace technology and move the company forward with technology.

But aside from the fun stuff, I also do the boring things, like defining standards and doing architecture reviews. But I've always referred to my job as being the chief evangelist for technology at Hyatt.

If you've browsed the website or poked around on the app, you might have seen this picture. That is my headshot. I use it because my wife hates it so very much. Those are the cold, dead eyes of an Illinois Army National Guard electronics technician.

I spent eight years with the United States Army and with the Illinois Army National Guard, and that influences a lot of my thinking and a lot of the ways I approach leadership and things like that. It's kind of what I learned in the Army. Also, again, my wife just hates this picture, so it lives on.

So why optimize for speed? Our stakeholders demand it. If we're not shipping a quality product quickly, they're going to be frustrated. So that's our main role. If we're not shipping, what the hell are we doing?

Decreasing our time to recovery. A lot of these are things we've heard a lot about in the past two days. So if something goes wrong, and we can build and deploy and ship things fast, then we can build and deploy and roll back things fast. So decreasing that time to recovery.

Security, again, sort of the same concept. If there's a hole, we can patch it and deploy very quickly.

An underlying thing that we rely on in our teams is the sense of urgency. Because we're always going fast, because the teams expect to be able to go fast, there's a huge amount of noise that happens when anybody impedes that. If anybody gets in the way of our speed, there's an immediate reaction to that. Right now, it's a very noisy reaction.

And then being able to run more experiments. This is where we really get the most benefit, is being able to build prototypes quickly, being able to build, ship, fail, and learn significantly faster.

So defining speed. Speed is really just the rate at which something happens or is done. This is literally the definition that I Googled in my frantic state putting together these slides.

So what things are we trying to increase speed on? Is it design, implementation, QA, performance testing? It's all of those things. But the main driver for us is accelerating our experimentation, our learning, and discovery processes. Being able to fail fast is really the main goal. To put as many experiments in front of people as possible.

So where are we coming from historically? Actually, most of my speaking is usually to startups, to small engineering teams, things like that. So for me to complain about legacy technology in this audience seems a little weird, but it is there. We have AIX and Informix and 4GL and PowerBuilder and things like that in our lives.

But the main thing that we are trying to deal with is this technology as a cost center. My first bullet point there is that technology was a cost center like heat or laundry, anything else that we really had to deal with as a business, and it was begrudgingly funded and begrudgingly accepted.

Many separate silos. At one point, we had multiple CIOs over an organization that is much smaller than probably most of yours.

There were also no standards, just silos and snowflakes. Everybody was free to do whatever the hell they wanted, and we got exactly that. And it was all very vendor-driven.

So product development at Hyatt was historically like a sausage factory. Money went in, product came out, nobody wanted to know what was going on in between, and quality was very suspect. And operations was always playing catch-up.

Every time something went wrong, we had to get on the phone with some vendor. We had to go dig around in code that we didn't understand. We had to basically do digital archeology every time something went wrong to figure out what the vendor did and what we inherited.

And vendors were really deciding how things got built and where they were deployed and how they ran. So we had no control over our own fate, really, by handing everything off to vendors.

So where are we trying to get to? We're trying to get to using technology as a differentiator. And it's not like I'm giving away a secret here to Hilton or Marriott or anybody else. We're all in the same boat. We're all on the same path.

So trying to use technology as a differentiator so that we can have happy guests, so that we can try to predict your wants and needs and improve the experiences that you have anytime you visit one of our hotels.

One of the things that I really like, this is where I'm going to go off on a tangent that I didn't plan for and screw up my time, but one of the things that I really like about working at Hyatt is that we're expressly in the business of making people happy. It's a pretty great business model.

And we never talk about increasing revenue. We always talk about improving the guest experience. That immediately drives revenue, but those are the things that we talk about, and it's one of the things I really like about working at the company.

Continuous integration, delivery, those kind of go hand in hand for us now, of trying to get to a point where we are finishing features and shipping them as quickly as possible, again, so that we can...

I'm pointing at my screen, but I should point at that one.

Continuous experimentation and continuous discovery. Get as many tests and experiments out in front of people. The idea is that any time you're following through on some experiment or some experience on the site, you're actually going through some experiment that we've put in front of you. The goal is to have many going at any one time.

And continuous discovery, always listening to our guests, always trying to get as much information to drive our decision-making processes.

So how do we get there? My talk is really going to be about the technology platform that we've built to try to go faster. So you won't hear about Deming and other important people much smarter than me. We're going to talk about a lot of technology right here.

So open source first. A big goal of ours right now is to try to simplify our technology stack. So we're not targeting big, bloated application servers like JBoss and WebLogic and WebSphere, if that still exists. We're primarily working off of the open source frameworks that those application servers are built off of. So we're kind of distilling it down to the technology we need, and really not much more than that.

And this model here also helps us hire. We need to increase the organization. We need to increase our throughput. We're always trying to bring in engineers. Open source helps us stay on that path.

Postgres, MySQL, Redis, these are all very powerful technologies that are readily available, and you can pay for enterprise support if you need that life insurance policy. But for me personally, Postgres and Redis are my favorite hammers. I can pretty well conquer the world with those two pieces of technology.

But all of this helps us drive our time to deploy, time to recover, time to replace, because we don't have to negotiate licenses when we want to scale up. We don't have to negotiate licenses when we want to build a new thing. We can just get rolling. And again, that helps us control our own fate.

So now the less technological piece, defining our standards. A big role of mine as our chief architect was to define standards.

Most organizations, and you're probably familiar with this, but most organizations usually have some spreadsheet on SharePoint that just lists out, "We use Java," and you don't really know any of the motivations or how you're supposed to use these technologies. "We use JBoss." What does that mean? Why are we using JBoss? How are we supposed to deploy JBoss?

So we wanted to have clearly documented and visible standards. We have standards around things like logging and monitoring and all of our data storage and VM sizes, DNS patterns. We've defined standards on these things, and they're not lines in a spreadsheet. They're actual full-blown documents that describe my motivation and my purpose and the implementation that we're going to bring to bear on the problem.

And to do that, we actually follow the IETF RFC process. It is the engineering practice and process that built the internet. If it's good enough to define how the internet works, HTTP and TCP, it's good enough for us to figure out how we're going to use Redis.

So the way we do that is we actually treat standards as source code. We have a Git repository full of Markdown documents. Anybody in the organization, not just the architecture team, not just me, is encouraged to fork that repository, amend the documents, create new documents, define new standards.

And then we bring that document in front of network and operations and security and architecture and everybody. We try to get as many opinions from as many smart people as we possibly can behind that document.

And then once that document has really started to build out and everybody starts to feel like it's come to some sort of good place, it's the role of the chief architect as benevolent dictator to decide that this document has gone on long enough. This looks good. Let's ship it.

And it's never concrete. We can go back, like I said, and amend those documents. It's going relatively well. I've only written about half of them, so the others have been provided by other engineers and folks within the organization.
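Treating standards as source code also means the documents can be checked automatically before they go out for review. The following is a minimal sketch, not Hyatt's actual tooling. It assumes each standard is a Markdown file with headings named Motivation, Purpose, and Implementation (hypothetical names, loosely taken from the description above) and reports which ones are missing.

```python
import re

# Hypothetical required headings; the talk only says the documents describe
# motivation, purpose, and implementation, not how they are titled.
REQUIRED_SECTIONS = ("Motivation", "Purpose", "Implementation")


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required section names that have no matching Markdown heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#{1,6}\s+(.+?)\s*$", markdown_text, re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


if __name__ == "__main__":
    draft = "# Redis usage\n\n## Motivation\nWhy we need it.\n\n## Implementation\nHow we run it.\n"
    print(missing_sections(draft))
```

A check like this could run on every proposed change to the standards repository, so a draft with a missing section is flagged before it reaches the reviewers.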

The standards themselves, though, aren't really all that useful if you don't provide the means for people to actually embrace them. So to tell people, "Do logging this way, do monitoring that way," it doesn't do any good if they have to actually spend time trying to figure out how to do that.

So what we provide is the frameworks to support the standards that we've established where it's appropriate.

For example, user interfaces and user experiences, we have a live style guide that we call Bellhop that provides all of the CSS and all of the JavaScript that you need to build an e-commerce-based application. Defining buttons and what labels should look like are all driven purely from just applying CSS classes that are just handed to you. You don't have to think about those things.

Web applications. We have a Node.js and Express-based web application framework, and I say framework loosely. Really, all we do is take the things that exist, patch them together, provide a generator to start your application from, and that's our framework.

So it comes with logging, it comes with monitoring, it comes with support for Docker. You don't have to think about any of those things. We provide the standards for you.

Same thing for our microservices. This is my first mention of Docker, but we're going into a container-based world. We're going into microservices both at the web and at the service layer.

So our service framework, again, I mentioned before, getting rid of JBoss, getting rid of WebLogic. So we've distilled our way down to Spring Boot-based applications. We have a template project that just spits out a Spring Boot application. You tell it the name. You tell it, "I want Postgres, I want Redis," and some other bells and whistles, and you hit Enter, and you have a project that, again, comes with all the standards pre-implemented for you.

So now your developers can immediately just dive into providing business value and implementing features, which means from these frameworks, prototypes come quickly. Experiments come quickly. We can get things built. We can get things tested. We can watch them flounder and fail in production as quickly as possible. So this helps us get immediate feedback from our product organization and from our customers.

Testing. So yeah, obviously we want to do testing. But you'd be surprised how little of that was actually going on two years ago, even before I got there.

So unit testing is obvious, but we've automated all of our... Not all, I shouldn't say all. But we've automated many of our service tests, performance testing for the website, all those sorts of components.

One of the more interesting things, and I did not realize that Sauce Labs was here, but we actually built a thing. We run all of the user interface tests through a framework that we built using Selenium and Sauce Labs to test all of our browsers every time we implement new features.

There's a lot of manual phases there still, where we're looking at UI/UX experiences as part of the test cycle, but we've automated a lot of it.

Again, automation, preaching to the choir here. We've all heard how important this is, but automation for us is not just a task in a given project. It's actually part of our platform.

In the past, everything was manual. There were 40-page Word docs that we would mail to a vendor. They would skip 10% of the instructions and screw up the other 20%, or another 20%. It was consistent. There would be 40 minutes of just dead silence on the phone. You could faintly hear typing in the background.

So getting VMs built was actually a very similar process. I'd open a ServiceNow ticket, wait days, weeks for the button-clickers to get around and go manually create my little snowflake VMs.

So now we've defined these VM standards. We know how big they are. We have a menu, just like going to Amazon or something and saying, "I need a small VM." We have ServiceNow, Rundeck, Ansible, that kick off a job that goes into vCenter and spins up VMs.

So what used to take days and weeks now takes 90 seconds for us to get a handful of VMs. And again, that's driven straight from ServiceNow through Rundeck and Ansible.
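The menu idea can be sketched as a small lookup that turns a size name into the options a provisioning job receives. This is only an illustration under stated assumptions: the size names, the CPU, memory, and disk numbers, the hostname pattern, and the `build_vm_request` function are all hypothetical, and nothing here reflects the real ServiceNow, Rundeck, or Ansible interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSize:
    """One entry on the VM menu. All numbers below are made up for illustration."""
    cpus: int
    memory_gb: int
    disk_gb: int


# Hypothetical menu; the real standard sizes are not given in the talk.
VM_MENU = {
    "small": VmSize(cpus=2, memory_gb=4, disk_gb=40),
    "medium": VmSize(cpus=4, memory_gb=8, disk_gb=80),
    "large": VmSize(cpus=8, memory_gb=16, disk_gb=160),
}


def build_vm_request(size_name: str, count: int, hostname_prefix: str) -> dict:
    """Turn a menu choice into the options a provisioning job might receive."""
    if size_name not in VM_MENU:
        raise ValueError(f"unknown VM size {size_name!r}; choose from {sorted(VM_MENU)}")
    if count < 1:
        raise ValueError("count must be at least 1")
    size = VM_MENU[size_name]
    return {
        "hostnames": [f"{hostname_prefix}-{i:02d}" for i in range(1, count + 1)],
        "cpus": size.cpus,
        "memory_gb": size.memory_gb,
        "disk_gb": size.disk_gb,
    }
```

The point of the fixed menu is that a request can be validated and filled in by a machine, which is what makes it possible to hand it to an automated job instead of a person.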

Jenkins handles all of our software distribution needs. So it runs the testing, it builds Docker images, pushes those to our centralized Docker repository, which is Artifactory right now.

So all of this is the platform that we work off of. It's not just a, "We should automate this task." It is a platform concept for us.
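As a rough sketch of the software distribution steps described above (run the tests, build a Docker image, push it to a central repository), the sequence could look like the following. The registry host, application name, version, and test command are placeholders supplied by the caller, and this is plain Python standing in for a build job, not a Jenkins configuration.

```python
import subprocess


def pipeline_commands(registry, app, version, test_command):
    """Return the ordered commands for one build: test, build image, push image."""
    image = f"{registry}/{app}:{version}"
    return [
        list(test_command),                      # project-specific test step
        ["docker", "build", "-t", image, "."],   # build the image from the current directory
        ["docker", "push", image],               # publish it to the central repository
    ]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed test run prevents the image from being built or pushed.
        subprocess.run(command, check=True)
```

For example, `pipeline_commands("registry.example.com", "homepage", "1.0.0", ["npm", "test"])` returns the three commands without running them, which makes the sequence easy to inspect before handing it to `run_pipeline`.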

Big part of our future is containers as a service. So I mentioned Docker a handful of times. I'm not going to go super deep into this discussion, but I will say that we're using containers in a very pragmatic way right now.

We use it to solve for packaging, distribution, and deployment. So without disrupting our standing operational model of building VMs and putting load balancers in front of them, what we've done is we've standardized how we put software on those VMs.

So we have Docker that handles building what is essentially a static binary for us as an image. It is our application code and the JVM that runs it, that brings that whole package to production. So no more, "It works on my machine" type scenarios.

Distribution is handled by Jenkins building these images and pushing them to a central repository. And we use Docker basically as our unified interface for operations.

Now they don't have to think about how to stop and start this JVM, how to stop and start this Node application. They know how to stop and start Docker. They don't care what's inside the container. It is our assembly line, where we build those images, we package them up, send them up, and we do security scans on all of these things.

All of this is about empowering operations, to give them this standard user interface that they can always operate off of.

You can't go fast in a rusty pickup truck with bald tires and a rod knock and a steering wheel covered in broken glass. You can, but it probably won't end very well.

So we've built this standard paved path for our application to get to operations. Operations is empowered to push back on somebody who's not following our standards. "We gave you all the tools to build it right. Why did you screw it up?" So they're empowered to push back on those sorts of scenarios.

And they have the engineering staff to help them build up tooling when there's something missing. When they see that there isn't enough automation in a certain area, or enough monitoring in a certain area, they have the engineering support to build those things up.

So this is the usual analogy you'll see for containers: the big ship with the containers on it. But this doesn't really cover the big picture. Building containers, getting the containers on the ship, there's a whole lot of stuff that happened. The containers had to get loaded, the containers had to get moved around on trucks and cranes and all these other things, and then you get to this glorious endpoint of them all sailing across the ocean.

But I don't think this is actually a very good analogy for us, in our world of how we use containers.

So the way we use containers is more akin to a semi-truck. The containers following our standards practice, following our frameworks, they don't have any concept of logging in them. There's no log rotation. There's no log files. They log to standard out.

They have no concept in them of monitoring or any of that sort of thing. All of this is infrastructure that is provided to them.

So in a truck, the trailer on a truck, it does not have air for brakes. It does not have electricity for taillights. Those things are provided to it by the environment. They're provided to it by the tractor.

So in our world, that red hose is logging. That blue hose is monitoring. So these are services that are provided by the platform, by the infrastructure.
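To make the logging hose concrete, here is a minimal sketch of an application that knows nothing about log files or rotation and simply writes to standard out, leaving collection to whatever runs the container. It is written in Python for brevity, while the applications in the talk are Node.js and Spring Boot; the logger name and message format are assumptions.

```python
import logging
import sys


def build_logger(name, stream=None):
    """Create a logger that writes only to a stream (standard out by default).

    There is no file handler and no rotation here: the platform is assumed
    to capture the container's output and ship it wherever logs belong.
    """
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.handlers.clear()   # avoid duplicate handlers if called more than once
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output from being repeated by the root logger
    return logger


if __name__ == "__main__":
    build_logger("homepage").info("request handled")
```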

So the analogy that we work off of and has really started to take hold is that the engineering and development teams right now, they build the trailers. They build containers. And it's our operations team that operate the platform. They're the folks that drive the truck. So if you're not building your containers to standard, if you're not building your applications to standard, they won't put you on the truck.

So where are we now? Where have we actually gotten to in this vision? Well, we're actually in production. So if you go to hyatt.com on your desktop, not on your mobile phone, because that's a whole different legacy problem, you will see a Node.js application running in Docker. That is our homepage.

Then aside from just the homepage, every header and footer that you see rendered, again on the desktop, is provided by a Node.js application running in Docker. So this is a pattern.

All of our promo landing pages that we run, there's a new fall promo running, or will be running. Those things are, again, Node.js applications running in Docker.

We have 10-plus new APIs running, new microservices running, providing APIs. They are, again, the Spring Boot framework that I mentioned before. So these are Java Spring Boot applications running in Docker. They are accessed through our API gateway, which is based on Netflix Zuul, which is also running in Docker.
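The gateway's basic job of mapping a public path to the service behind it can be sketched as a longest-prefix lookup. This is an illustration of the general idea only, not Zuul's API or Hyatt's configuration: the route prefixes, service addresses, and the choice to strip the matched prefix before forwarding are all assumptions.

```python
# Hypothetical routing table: public path prefix -> internal service address.
ROUTES = {
    "/api/reservations": "http://reservations:8080",
    "/api/profiles": "http://profiles:8080",
}


def resolve(path, routes=ROUTES):
    """Return the upstream URL for a request path, or None if no route matches.

    The longest matching prefix wins, and the prefix is stripped from the
    forwarded path (an assumption made for this sketch).
    """
    matches = [p for p in routes if path == p or path.startswith(p + "/")]
    if not matches:
        return None
    prefix = max(matches, key=len)
    return routes[prefix] + path[len(prefix):]
```

With this table, a request for `/api/profiles/42` resolves to the profiles service, and anything outside the listed prefixes is rejected at the gateway rather than reaching a service.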

So all of that is out there, but again, we're doing it off of this very pragmatic approach of putting containers on VMs without really disrupting the operational model. We wanted everybody to get comfortable with what it means to run containers, what it means to have a container-based environment. And now we're getting to the point where we can actually start to move forward with this in a bigger way.

So what is working so far for us? Jenkins, Rundeck, Ansible, that's our pipeline. Jenkins kicks off a lot of software-based jobs. ServiceNow, I mentioned before, kicks off a Rundeck job that builds VMs, and then we have a bunch of operational tests that are launched from Rundeck. But really, Rundeck is what we refer to as our first domino. It's what knocks over all the other Ansible parts and gets them moving.

Other teams in the organization. So the model has been that we basically found one customer that was willing to buy into my insanity. And we've started to build things. This is the e-commerce teams, and we've started to build products and build things and get them out there.

And other teams in the organization have now started to look and go like, "Hey, what are you guys doing? Why are you guys going so fast?" So other parts of the organization are now coming to us to figure out how we get onto the platform.

And folks are just excited to work on this stuff. This is cutting-edge stuff, Docker and Node.js and things like that. So teams are excited to actually get their hands dirty and learn these new technologies. What's the fun of being an engineer if you're not learning stuff?

So we're launching new things all the time. We're getting new experiments running all the time, and we're getting a lot faster at this stuff.

It's not without trouble, though. So the things that have slowed us down.

Until recently, this was done with no money. This is literally me convincing other people to spend their budget on my crazy ideas up to this point. So now, 2017, we're actually going to be investing pretty big into blowing this up for the whole organization.

The real world has major impact. If any of you are big travelers, if any of you are Hyatt loyalty members, thank you, then you know we've launched a new World of Hyatt loyalty program. Launching that program has been hugely disruptive for us in technology because it was something that just showed up one day that we were all going to do. And then it was leaked. And then we had to react very quickly to it. So the real world impacts this stuff.

And then you heard Suzanna and Ben from American Airlines mention this yesterday morning, but FUD: fear, uncertainty, and doubt. That is my main resistance working with teams. They're like, "I don't know. It sounds like crazy talk. I don't know if I want to change the way that I've always been doing things to get on board with this."

And before we started into this, there were just so many manual tasks that we had to prioritize the things that we can automate. So there are still lots of manual tasks that we deal with, and we're just prioritizing our way through that. But that just means that there's lots of opportunity for us to automate more things.

So what's next? I mentioned we're putting containers on VMs. That might sound like heathenism to some people, but we're going to keep doing that. It's working out well for us, but we're now actually working on building a cluster. We actually want to bring orchestration and clustering to bear on the containers now.

When we started down this road, those concepts were a little early. Persistence was kind of a black hole, and Kubernetes was young, and Swarm was a joke, and Mesos had kind of bolted-on container stuff. So we were kind of letting that all settle out before we got on the bandwagon. So now we're buying into that pretty hard.

More automation, more releases, just more experiments, more speed. So we've started to get more teams involved. Product is excited. We're going fast, and people are very excited about it right now.

So that's kind of all the material I have right now. Questions?

Thank you.

Q&A

Questions? Questions? I got one minute and 42 seconds. Yeah. Take advantage.

Q: Previous session in this room was on service management processes. Any comments on how you're approaching service management and operational processes?

A: I don't have a great answer for you there.

So we've been trying to... We have a team that is dedicated to ITSM and ITIL processes. We are in discussion with them.

The main focus for us, and I think this is really a key thing, is that they've actually bought into this, which was kind of surprising. I expected a lot of resistance from them. But they bought into the idea of, "Look, if we can deploy safely and quickly all the time, that's a standard change."

And they came to us with that. "If you guys can just do this all the time, well, then we don't even need to talk to you. We'll just mark it down as a standard change, and you can keep going."

And to that point, I actually missed one really important point when I mentioned the homepage being in production. The homepage is currently our model for continuous delivery. We actually deploy that right now timidly once a day, but we do it in the middle of the day. So we've proven that we can do it.

So we're actually launching the hyatt.com homepage in the middle of the day. The last time they did it was at like 3:00. And again, this is a standard change as far as our ITSM team is concerned.

I don't know if that was a great answer, but it's an answer.

Q: Just a follow-up on that...

A: You got to wait for the mic or they won't record it.

Q: Following up on that prior question, for the standard changes that you do, do you open a ServiceNow ticket? Do you automate the process of opening and closing those tickets? How does that work?

A: Yeah, that's another area that's ripe for automation.

So yes, right now there is a ServiceNow ticket that gets opened. We kind of follow a sort of Google-like model that whoever's going to get paged is going to be the person who pushes the deploy button.

So right now there's an operations team that's actually responsible for operating this platform. So they're the ones who push that button, and right now that's basically the product owner. Once we've gotten through the QA process, one of the manual check-in steps is that they open the ServiceNow ticket that says, "Ship it."

What else?

Going once, going twice.

Sold.