There is no Weekend Release, Only Zuul: Continuous Delivery for a 10-Year-Old Codebase

San Francisco 2014

Steve Neely

Development Manager · Rally Software

Steve Neely

So when there's something strange in your production stack, who you going to call? Ops pager.

Why do you keep trying to ruin these guys' lives?

What you really want to do is call the developers who wrote the code. They know how to fix it because they broke it. I'm a developer. I know that is the truth.

And then you want to get it out immediately. You want to ship that fix, get it out into production, get everything working again, back so customers are happy. And this is what we wanted to do.

So this is me. I'm Steve Neely. I'm a dev manager at Rally Software. That means I am 50% developer and 50% manager and 50% something else, usually.

This is actually a two-part story. So the first part is about doing continuous delivery with a 10-year-old code base. And the second part is about doing continuous deployment. So that's auto-deploys, and I'm going to talk about that too, and how we did that.

That's my Twitter handle down there. If you can out-tweet Gene Kim or me, I will buy all your drinks tonight. And I know that's a safe bet because that's not possible.

Okay, so here's our little agenda. To make it easy for you guys, whatever is orange is what I'm talking about. We'll go down through the orange things from top to bottom. So if you fall asleep and wake up again, you can figure out where we were.

So the best bit just happened. That was the music. So if you're really done with me, you can leave now.

So what's the deal with Ghostbusters? Ghostbusters is a 1984 American supernatural comedy. It's a film in which three eccentric parapsychologists in New York City start a ghost-catching business. It's these guys here.

Dana, played by Sigourney Weaver, becomes... Her apartment is haunted by a demonic spirit called Zuul. Zuul is a demigod worshipped as the servant to Gozer the Gozerian, a Sumerian shape-shifting god of destruction. And as the Ghostbusters investigate, Dana becomes possessed by Zuul, who declares itself the Gatekeeper.

Then Rick Moranis appears, becomes a giant dog. Then there's something about giant marshmallow men stomping around New York. And then the Ghostbusters cross the streams, win the girl, and the credits roll at the end. That's what we were trying to do.

Okay, so we've recapped Ghostbusters. We'll come back to that in a couple of slides' time.

But starting off, what were we actually trying to do? So what we were trying to do is we had a 10-year-old code base, like I said. It was a bunch of Java backend code, JSPs, some XML files, Tiles, a whole bunch of that stuff.

We were shipping code every eight weeks. So every couple of months, we would do a release. So we would do two-week sprints, have a week of hardening at the end where we'd find a whole bunch of defects, and then scramble to fix those before the release train went out.

And so what some of us wanted to do was go faster than that. We wanted to, as engineers, ship code maybe every week, or maybe every day, or maybe more frequently than that.

Oops. Where am I?

We were about 200 sprints in, and we decided it was time for a change. And we'd heard about this thing called continuous delivery, and some of us heard about this thing called continuous deployment, and that's what we wanted to try and do.

So continuous delivery is, as a recap: I can ship the head of my master branch at any time. So there is potentially a push button somewhere that I can press, and it will take whatever's in master, and it will go out magically into production.

Continuous deployment is a little different. So continuous delivery is a precursor for deployment. Think of it as auto-deploying if you haven't heard of it before. Most of you probably have. So it's this: you go directly to production. Do not pass Go. Do not collect $200.

And that's actually what secretly some of us wanted to do, but we didn't tell the business at that time because it would've freaked them out. Took us a while.

Why would we want to do this? I think that selling continuous delivery is easier than continuous deployment. But we knew that if we could ship more frequently than every eight weeks, we would have smaller batches of code going out into production all the time.

So since developers break stuff all the time (and I'm a developer, so I know that's true), when I break something and it's a small batch of code, it's something that I've been working on for a day or maybe a couple of days. If that gets out into production, there's a lot less for me to focus on. It's much less code.

Think about two, three, four teams working away for two months and then shipping this big ball of code, and all that stuff goes out. When something breaks, it's more difficult to find.

We used to get up at, oh gosh, it was 6:00 in the morning on a Saturday, we'd do releases. So we'd take downtime for our releases, which meant we actually got up at probably 5:00 in the morning to prepare, and stuff was going on before that even.

And everybody would log into the systems and watch our releases go out. And when I say everyone, I mean not just engineering. Our CTO would sign into the system, so would our CEO, and be on chat and watch everything going out. That is a stressful thing to do.

We knew that if we were shipping more frequently, we would become more practiced at it and it would become less stressful, and hopefully the CTO and CEO would stop watching.

More flexible, so you can try things out. Higher quality. If you get into this state where your master branch is in a potentially shippable state at all times, or is going out very frequently, you have to think about quality.

And the other thing, which appealed to some of us, is that I thought it would be fun. It was something fun to try. So it's fun building software, and it's fun building UIs and web apps and all that kind of stuff, but it's even more fun trying this new stuff.

So we had to sell it to the business a little bit. And fortunately, our business leaders, or our kind of executive level, were all developers at some time, so they understood quite a lot of this fairly quickly.

But selling it to product was interesting too, or convincing them to let us do this and invest in it. It gave them release control, and that's what they really liked in the end.

So we said, "You don't have to wait every two months to get something out the door anymore."

And there were many instances of things that we would start working on, and they would take nine weeks. So you start at the beginning of sprint one, you go through the eight weeks, you do a release, and then the ninth week it's ready, and then they have to wait another seven weeks because we didn't do out-of-band shipping.

We would push code for defects and emergency stuff, but not for features. So the product team felt pain sometimes because we weren't shipping more frequently. So yeah, they found it stressful to miss the release, too.

So here's what we did in a very brief summarized form, and I've called this section "The Equestrian Adventure" because here's a picture of me pair programming with somebody at the Denver office. It's a horse, in case you hadn't noticed.

The point here is, and this is a theme that started before this conference even began, and it's something that I've heard people talk about here especially, but in other places too: this idea of unicorn organizations trying to do things like continuous deployment versus horses.

So are you a workhorse or a unicorn? And I don't believe that we, as a group of engineers, were necessarily unicorns. We're not the Etsys or the Facebooks or Googles, the ones that often talk about this stuff at conferences and sound super cool.

Other people can do it. We didn't magic this out of pixie dust and rainbows. This was just hard work. And it was incremental.

So you can't go from shipping every eight weeks to deploying every commit to master. You've got to do it one step at a time.

So we would measure our progress. We would push the boundaries. Some of this is engineering challenges, and some of this is cultural challenges, and people often ask me about that.

It's like, how do you convince the QA team to suddenly step out of the way and allow you to push code direct to production without them doing the checks that they feel like they have to do? How do you convince the operations teams to give you the keys to their kingdom? Because they're the guys that get the phone call at 2:00 in the morning, it's not you. So it's the ops pager that gets paged when I break something.

So we would push boundaries all the time. First of all, we'd start off by saying, "Well, what if we shipped code every two weeks?" And it turned out going from eight to two was actually really easy.

And then we would try every week. So this was good and bad. It was good because we were shipping code more frequently, but it was bad because we were all getting up at 5:00 in the morning on a Saturday every week, and that really sucked, especially if you'd been up all night the night before.

And then we would push boundaries a little bit more. So you would say things like, "What about if we release on a Friday afternoon rather than a Saturday morning?" So Friday afternoons are quieter traffic times, so that seems like the obvious thing to do, right?

And then we'd say, "What about if we release on a Wednesday lunchtime?" Which is kind of arbitrary for picking days.

Or what about if we pick the time in the morning when people first log into the system and it's under the maximum load it could possibly be? What if we start taking down VMs and rolling out new code at that time?

And it's this pushing boundaries all the time that makes people feel a little bit uncomfortable, but they can take that little bit of discomfort. They just can't make the big leap. So you just have to work your way into their brains that way.

So we would be agile, too. So plan, do, check, act, and always measuring progress. So we were monitoring and watching everything.

Where did we start? We started by automating all the things.

And I've seen this slide before by many people who talked about this. Up at the Chef stand up there, they've got the little T-Rex saying, "How do I automate all the things?" Have you guys seen that? They drew. That's awesome.

So that's what we did. We built a value stream map. This is a super easy thing you can do.

We said, "Well, what's our process from a dev pulling a story from the backlog, starting to do some coding on it?" So we're pair programming, doing TDD. We're doing all the good stuff that you're supposed to be doing. And then it goes into some testing, then it maybe goes to a product owner, he has a look at it, then maybe it goes off to operations, who sit on it for a while. Maybe some database migration stuff has to be checked by DBAs.

Anyway, we would map this whole thing out, and we took a whole bunch of stickies and just put them on the wall and said, "This is our process. This is our workflow."

And then to make it a value stream map, we put times on it. So it takes approximately two days for stories to go from in dev into test, and then it takes a day to test, and it takes blah, blah, blah. And then find the slowest part in that process, and that was the thing we automated first. So we're looking for the biggest bang for our buck first.
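As a rough illustration of the value stream map described above, the sketch below puts times on each stage and picks the slowest one. Only "two days from dev into test" and "a day to test" come from the talk; the other stage names and durations, and the function names, are hypothetical placeholders.

```python
# Hypothetical value stream: (stage name, elapsed days).
# Only the first two durations are taken from the talk; the rest are made up.
stages = [
    ("dev to test handoff", 2.0),
    ("testing", 1.0),
    ("product owner review", 0.5),
    ("operations queue", 3.0),
    ("DBA migration check", 1.5),
]


def slowest_stage(stages):
    """Return the (name, days) pair with the largest elapsed time."""
    return max(stages, key=lambda stage: stage[1])


def total_lead_time(stages):
    """Sum the elapsed days across every stage in the stream."""
    return sum(days for _, days in stages)


bottleneck = slowest_stage(stages)
print(f"Automate first: {bottleneck[0]} "
      f"({bottleneck[1]} of {total_lead_time(stages)} days)")
```

Re-running the same calculation after each round of automation shows which stage has become the new bottleneck.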

And then we started automating all of our deployment pipelines as well, just taking it one step at a time.

And we would do things like play thought games, where we would say, "What prevents you from deploying now?"

So one thought game we played, which was, well, I was going to say it was fun, but I guess in retrospect not necessarily fun, but we would arbitrarily stop everybody in development and say, "Right, what if we were to deploy the code right now? The head of master. So we're going to take that and push it to production. What's going to happen?"

And quite often people would say something like, "Oh, I'm in the QA department. I just need to do some manual testing of some UI components. So some of the teams have just built some new fancy grid, and it's got all these whiz-bang thingies on it, and I don't know if it works or not, so I need to get in there and manually test it, so you can't deploy."

So that would be something we have to figure out. How do you fix that? How do you stop that from being a blocker?

Maybe you had manual deployment steps. So we had this, and I've heard this at many other talks here, people saying that you go step by step through some document and check boxes, saying, "This is how we do deployments."

So if you're doing manual deployment steps, that's necessarily a blocker to being able to auto-push to production too.

I talked about database migrations. Performance and load tests as well were interesting too. So our performance tests would take a couple hours to run, and if they hadn't run and you didn't know what performance was going to be like once you pushed the new code, then maybe you say, "Well, we're not going to be able to push and deploy right away."

What else? Something else that we discovered, which again is obvious in retrospect, is that if you have a test suite that takes nine hours to run, and deploying is predicated on all tests passing, the fastest that you can ever deploy is nine hours. So you need to have super-fast tests.

And this is something that we invested some significant time and energy into.

So yeah, we had a GUI test suite, or a browser test suite. It was Selenium-driven, so it would pop Firefox browsers up, log into the application, navigate around, make sure that the app was doing what it was supposed to be doing, and then report back based off of that. And it was only a couple of thousand tests. It wasn't a ton.

But it ran in serial. So the first thing we did was say, "Well, how do we parallelize this? How do we make it faster?"

So the obvious thing to do is to take it and break it up into chunks of, say, 10. So you've got 10 test suites of 200 tests, and you run all those in parallel, and you make things a little bit faster that way.

So the problem with that is that some of the chunks, or some of the buckets that we put the tests in, were much slower than other ones. So some would finish, and other ones would take another half hour, an hour to run, so you have to balance them across in some way.

So we were like, "Well, what about threading? What if we just have this queue of tests, and then we have this bunch of thread workers that can pop a Firefox and pull the next test, and then run that one, and then the next one, the next one, the next one?"
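The queue-of-tests idea can be sketched roughly as follows. This is a minimal illustration, not Rally's actual harness: the test functions are trivial stand-ins for browser tests, and `run_with_workers` is a hypothetical name.

```python
import queue
import threading


def run_with_workers(tests, num_workers):
    """Run (name, callable) tests on threads that all pull from one shared queue."""
    pending = queue.Queue()
    for test in tests:
        pending.put(test)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name, func = pending.get_nowait()
            except queue.Empty:
                return  # no tests left, so this worker is done
            try:
                func()
                outcome = "pass"
            except AssertionError:
                outcome = "fail"
            with lock:
                results[name] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Stand-ins for real browser tests (hypothetical).
def check_login():
    assert 1 + 1 == 2


def check_dashboard():
    assert "grid" in "dashboard grid"


def check_broken():
    assert False, "deliberately failing"


results = run_with_workers(
    [("login", check_login), ("dashboard", check_dashboard), ("broken", check_broken)],
    num_workers=2,
)
print(results)
```

Because a free worker always takes the next test, no worker sits idle while another works through a slow bucket, which is the balancing problem the fixed chunks had.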

That was super interesting. And the reason that was interesting was that we started to find concurrency bugs in both our test suite and in our application. And the ones in the application were leading indicators of ones that we were going to see in production about six months later.

So think about as you have more customers using the application, the company's growing, you're selling more, your sales team's doing all the right stuff, then you've got higher load, you've got more concurrency on the app.

So we saw this. We saw this in the GUI test, and we saw these tests that would just randomly fail sometimes. And we had a feeling it was a concurrency issue somewhere in the production code, but we weren't sure where it was, and nobody fixed it, and then we saw it six months later. And then we were like, "Oh, that's interesting." So we fixed it, which is a total nightmare to fix sometimes.

So we fixed that bug, and then we saw other ones popping up as well, and they followed the same pattern over and over again. And we saw that in our Java tests, and we've seen that in other places, too.

So if you start seeing that in your test suites when you start highly parallelizing things, it could be the same indicator for you. It's a useful thing to know.

Which brings me to this, too. Okay. And I could talk about tests for hours. We don't have hours, so I'm not going to.

But this is another super important thing as well that we learned as we went along this journey. And it was that if you have non-deterministic tests, so maybe concurrency issues, or GUI tests are really bad for this. You have a test that says, "Hit the landing page, log in, and then make sure that the dashboard page has certain things on it."

And the dashboard page sometimes loads fast enough and sometimes doesn't. So sometimes the test will pass, and sometimes it will fail. So what you're starting to see there is non-deterministic, or we call them flaky tests because they flake out, right?

If you have tests like that, teams will stop listening to them.

So what I mean by that is you will run the test suite, the build light will go red, or Jenkins or whatever it is you're using will tell you that your test failed. And you'll look at the test failure and say, "I've seen that one fail before. That's the one that when you log in, the dashboard sometimes loads fast enough, sometimes doesn't. It's just a flaky test. That's not a big deal. Let's just keep going."

And when people start ignoring tests, you're in a really bad space. Because you will find at some point, one of those tests will fail, you'll ship the code to production, and it will be a real bug.

So if you are ignoring tests that regularly fail, you're in a super bad state.

And unfortunately, concurrency bugs are really hard to fix, so nobody wants to fix them. Timing bugs are a lot easier. There's some tricks you can do there. You can use visual cues for timing bugs.

So think about what you do as a human. You would expect certain things to appear on the page. So you'd use the same visual cues in your GUI tests.
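One way to read "visual cues" is to poll for the thing a human would look for, rather than sleeping for a fixed time. The sketch below assumes nothing about Rally's test code: `wait_until` is a hypothetical helper and `FakeDashboard` stands in for a real browser page.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1):
    """Poll condition() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeDashboard:
    """Stand-in for a page whose widget appears after a few polls."""

    def __init__(self, polls_until_ready):
        self._remaining = polls_until_ready

    def widget_visible(self):
        self._remaining -= 1
        return self._remaining <= 0


page = FakeDashboard(polls_until_ready=3)
# The test proceeds as soon as the cue appears, however long the load took.
print(wait_until(page.widget_visible, timeout=1.0, interval=0.01))
```

With a real browser driver, the condition would check that a specific element is displayed; Selenium's explicit waits follow the same polling pattern.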

Okay. So we're a SaaS application. We are shipping code more frequently. One of the top three things that we did was implement feature toggles. The other stuff I said was important, too; making tests fast and reliable and trustworthy is super important.

And some people call these feature flags. I don't know, there's like a million different names for these things.

Does everyone know what a feature toggle is? Okay. Does anyone not know what a feature toggle is? Okay. One person. I'll talk to you afterwards. No.

Okay. In a really brief high-level view, what we have is, think about having an admin page or a super admin page in our application that our support team or product team can log into and turn on specific features just for certain users or for certain subscriptions or for groups of people.

We have ones that you can make specific to virtual machines as well. If you ever listen to any of the Etsy guys talk, they did this as well years ago, too.

So the real power of this is that you decouple the concept of deploying code from releasing it to customers. So I can deploy code all day long, and I hide it behind feature toggles. It's just conditionals in code, essentially.

And then, because that code is not live, it's not code that will actually execute, it doesn't matter if it works, doesn't matter if it breaks the app. It will not be turned on. So that's how we would deploy code so frequently and not have negative impacts elsewhere.
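To make "just conditionals in code" concrete, here is a minimal sketch of a toggle store that can turn a feature on for a user, a subscription, or everyone. The class, method, and feature names are hypothetical and are not Rally's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Toggle:
    enabled_for_everyone: bool = False
    users: set = field(default_factory=set)
    subscriptions: set = field(default_factory=set)


class FeatureToggles:
    """In-memory toggle store; a real one would sit behind an admin page."""

    def __init__(self):
        self._toggles = {}

    def _toggle(self, feature):
        return self._toggles.setdefault(feature, Toggle())

    def enable_for_user(self, feature, user):
        self._toggle(feature).users.add(user)

    def enable_for_subscription(self, feature, subscription):
        self._toggle(feature).subscriptions.add(subscription)

    def enable_for_everyone(self, feature):
        self._toggle(feature).enabled_for_everyone = True

    def is_enabled(self, feature, user, subscription):
        toggle = self._toggles.get(feature)
        if toggle is None:
            return False  # unknown features stay dark
        return (toggle.enabled_for_everyone
                or user in toggle.users
                or subscription in toggle.subscriptions)


def render_grid(toggles, user, subscription):
    # The new code is deployed, but only runs when the toggle is on.
    if toggles.is_enabled("new_grid", user, subscription):
        return "new grid"
    return "old grid"


toggles = FeatureToggles()
toggles.enable_for_user("new_grid", "tester@example.com")
print(render_grid(toggles, "tester@example.com", "acme"))
print(render_grid(toggles, "customer@example.com", "acme"))
```

Defaulting unknown features to off keeps newly deployed code dark until someone deliberately switches it on.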

It also allows testers to turn things on for themselves, so they could go and test it, and then when it's ready, you can actually switch it on.

So it's this concept of decoupling deploying code to the production stack from releasing to customers.

Our product team love this stuff. They think it's awesome.

So if you have me log into Rally, if any of you guys are Rally users or have seen Rally, what I see in Rally and what you see in Rally is totally different. Because we have features that are three or six months out that we've been working on, that we're testing internally and testing with some customers, but aren't turned on for everybody.

And the product team love it because they can then turn things on and time it with events. So there's some big conference happening or some kind of event, some marketing event, and they can turn things on live there. They think it's awesome.

So yeah. And I could talk about feature toggles for hours, too, because they're cool.

Okay, instrumentation and monitoring. Be obsessive about monitoring and be obsessive about instrumentation as well. So get in your code.

We've instrumented everything from the database layer through our JVMs. We even have stuff in the client layer now as well. So every time somebody clicks on something in Rally, I can tell you how long it took for that AJAX request to go out and get the data back, what the payload size was, how long they were sitting waiting for things to render on pages, and we've instrumented all that stuff. And everyone should be watching this.

So it's great to have instrumentation and monitoring and Splunk and Ganglia and all that kind of stuff, and Graphite and things. But if people aren't watching, then it's not giving you an awful lot of help. So you have to build that into your culture, too. A lot of this stuff is culture hacking as much as anything else.

We have a tool called Flowdock that we use, which is awesome. So this is something that we have. It's a product from Rally as well that we use.

And as a developer, I love this because I can sit and talk to people without actually talking to them, which is really cool. It's why SMS is so good. You can text people and you don't have to say anything to them.

So the thing about Flowdock is that we'd met some of the guys from GitHub, and they were telling us this thing about Hubot, which is a bot that lives inside their chat world.

And so we were like, Hubot was kind of cool because they were doing deployments with it and doing operational-style things. And so we took Hubot and worked with him a little bit and turned him into something that we call Jenkins. It was a hackathon project at first.

And so what we can do with Jenkins is we can say to Jenkins things like, "Tell me if the build is passing or failing," or, "Alert me if there's some kind of jump in the database load."

Or one that we did recently that was super useful was, "Tell me the difference between what's in production now and what's at the head of the master branch that everybody's working off of, because something's broken and I want to know what the diffs are between them."

We do GitHub control. We can trigger builds and deploys. We can do all sorts of operational stuff from within our chat room. And it's where we live anyway as developers. We're all talking in there anyway, so that stuff is super awesome.

And you can pug people and do meme generation or that kind of cool stuff, too. And if you want to see that, I could totally show you that later. It's fun.

Okay. Sorry. So where were we? Go back one. Yeah.

Okay, so back to the Ghostbusters thing. So that's continuous delivery. What about continuous deployment?

So as a check-in, we are at this point, after probably two years' worth of work, in a state where we're continuously delivering this big monolithic ball of code. The devs are happy. We're not getting up at 6:00 in the morning. We're doing deploys during the day, so everyone's around if something breaks.

We've got fast tests. Everyone trusts the tests. Everything is instrumented and monitored. And the business is happy, too, because we're shipping code all the time.

So let's go back to Sigourney. So remember, she has been taken over by Zuul, this demigod, and he declares himself the Gatekeeper.

So we decided we were going to try and break off some parts of our main application, and that was going to be our Gatekeeper.

We couldn't do this with our original code base. It was too big. The architecture was just not quite right for doing continuous deployment. We couldn't auto-deploy it.

So we were looking at splitting it up into a service-oriented architecture, and this is what Netflix had done as well. So we're following their path.

So what we were going to do is break up the application into small services, continuously deploy each service, and for the very first service, we were like, "Well, let's find something that is going to be easy to write tests around." So something that's not GUI-based, something that we can write a bunch of automated tests for, fully exercise it, and humans would never actually test it anyway.

So we were looking for an API service, something that only software would talk to. So if software is talking to software, you can write tests to talk to that software and test it and exercise it, right? Makes sense.

So that's what we did. We wanted testing that did not involve humans. So we decided to choose username and password checking, because that's easy. You take a username and a password and say, "Can this person get into the system or not?" Okay, so that's the authentication service.

So that might be a little audacious, but that's what we split off first, because there were some other business reasons for that as well. We had some other products like Flowdock that we wanted to do single sign-on and auth with that as well.

So that's what we did. We built it. We built out, it was like a mock of the existing, or a mimic of the existing deployment pipeline, but we simplified it significantly.

And so our auth service we called Zuul. That's the Ghostbusters reference. Zuul was the gatekeeper to all our products.

And this is our continuous deployment pipeline. So each one of these horizontals is a deploy. On the far left, someone is pushing code to GitHub. And then as soon as Jenkins wakes up and realizes there's new code there, it runs the sets of tests.

So there's some unit and integration tests, some test deploys, then there's some compatibility tests. So it's an API, and we had an SLA to say how long the API would be valid for. So we version tests over time, and then age those tests out to make sure that we weren't breaking legacy clients, and that stuff is interesting, too.
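Versioning compatibility tests and aging them out might look roughly like the sketch below. The talk does not describe the mechanism, so this is only one possible reading: the version labels, dates, and the `active_suites` helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CompatibilitySuite:
    api_version: str       # API version the suite was written against
    supported_until: date  # last day the SLA promises that version works


def active_suites(suites, today):
    """Keep only suites whose API version is still inside its support window."""
    return [suite for suite in suites if suite.supported_until >= today]


# Hypothetical versions and dates, purely for illustration.
suites = [
    CompatibilitySuite("v1", date(2014, 1, 31)),
    CompatibilitySuite("v2", date(2014, 12, 31)),
    CompatibilitySuite("v3", date(2015, 6, 30)),
]

for suite in active_suites(suites, today=date(2014, 10, 1)):
    print(f"run compatibility tests for API {suite.api_version}")
```

Old suites keep running against every new build until their window closes, so a change that would break a legacy client fails the pipeline before it reaches production.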

Then we do an engineering deploy, and then on the furthest right is a deploy to production. That is all automatic. So once you go from push, you really are going straight out into production.

Zuul was in production for a year before we cut over.

And so what we were doing was this: we took every incoming request and sent one off to the original auth service that was part of our monolithic Java code wonderfulness, and then sent the other duplicate of the request over to Zuul and then compared the results to say, "Is this person getting the same answer or not?"

It turned out that took a while.

If you're doing this, you need to use something like futures or promises or something that's truly asynchronous, because if you're not truly asynchronous, you're blocking the original request, and you're doing something that's more akin to a gray deployment than something that's dark deployed or tee'd.
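A tee'd request built on futures can be sketched like this. The caller only ever waits on the original service; the duplicate request runs on a background thread and the comparison happens in a callback. Everything here is illustrative: `TeedAuth`, `legacy_auth`, and `new_auth` are hypothetical names, and the deliberately wrong `new_auth` exists only to produce a mismatch.

```python
from concurrent.futures import ThreadPoolExecutor


class TeedAuth:
    """Answer from the primary service; compare the shadow service in the background."""

    def __init__(self, primary, shadow, executor):
        self.primary = primary
        self.shadow = shadow
        self.executor = executor
        self.mismatches = []

    def authenticate(self, username, password):
        # Fire the duplicate request first so it never delays the real answer.
        shadow_future = self.executor.submit(self.shadow, username, password)
        answer = self.primary(username, password)
        shadow_future.add_done_callback(
            lambda future: self._compare(username, answer, future))
        return answer

    def _compare(self, username, expected, future):
        try:
            actual = future.result()
        except Exception as exc:  # a shadow failure must never reach the user
            self.mismatches.append((username, expected, repr(exc)))
            return
        if actual != expected:
            self.mismatches.append((username, expected, actual))


def legacy_auth(username, password):
    return password == "secret"


def new_auth(username, password):
    # Deliberately wrong for "bob" so the comparison has something to find.
    return password == "secret" and username != "bob"


with ThreadPoolExecutor(max_workers=2) as pool:
    teed = TeedAuth(legacy_auth, new_auth, pool)
    teed.authenticate("alice", "secret")
    teed.authenticate("bob", "secret")

print(teed.mismatches)
```

If the shadow call were made inline instead, every user request would wait on the new service, which is the blocking behavior the talk warns about.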

That was fun. That took about a year as well, maybe a year and a half. Because we rewrote it three times. Because we're engineers, and we like doing that kind of stuff.

So we wrote it in Scala, Java, and then Clojure, if anyone's interested. Yeah. Now I want to rewrite it in Node.

So where should we go next? The final bit.

So continuous deployment and continuous delivery, is that something that only unicorns can do? No, I don't believe that it is. Like I said, we're just horses. You even saw that picture of me earlier, so that proves it.

So there's two things here. There's a big secret as well. Asking your dev teams to do continuous delivery and continuous deployment means that they have to up their game. So they have to increase the speed of their tests. They have to make them more reliable. They have to automate all the deployment pipelines.

So it's a much stronger engineering discipline. You're asking them to work at a different level.

And it's a journey. So if you're trying to convince your business to allow you to do this kind of thing and spend a couple of years building these things out, this is part of the message you give them.

Even if you don't get all the way to continuous deployment, then every step is an improvement. Even if you make your tests just a little bit faster or a little bit more reliable, or you automate just the slowest part of your pipeline and then you stop, you're in a better place than where you were when you started.

So you're going to end up with a whole bunch of systems that are more finely tuned. They're more governed by automation. They've got thorough testing. You've got all this monitoring and instrumentation. Everybody's watching. And you've hacked your culture to be better, too.

So in the last two minutes, I think it's time to slay the unicorn myth. You, too, could be a Ghostbuster.

We can all do this stuff. It's not just these fancy companies, the magic unicorn people.

Okay. And then my final slide that Gene asked us all to put in.

So there's still a whole bunch of stuff we haven't figured out. We have come a long way in the last, I don't know, four years or so we've been working on this.

And I think the thing that we still have problems with is orchestration. So we have some things that run with regular services in SOA. We have some things with microservices. Orchestrating all those things and making it simple for the dev teams to do is still difficult.

When we take any of the teams that haven't done any of these DevOps-style things and say to them, "We want you to build something new. We want it to be a service," it scares them. They think it's going to slow them down. They're like, "I know how to do that. I can go hack the back end in Java and make some changes to the Oracle database, all that stuff." That's the easy thing to do.

It's getting them in a position where they feel confident and they know how to manage all these different things together. Because there's a learning curve. It adds complexity. And that upfront complexity does scare people still.

So I think that's still something we need to figure out. We need to make this stuff simple.

And that's me. Thank you.