My Great Awakening: Top “Aha” Moments As Former Dev Manager

San Francisco 2015

Sr. Director, Platform Engineering · Pivotal

After spending her entire career as a software developer, with nary a moment doing operations, Cornelia Davis found herself working on an application platform that serves operations as much as development. In order to better understand that world, she spent one month on the team that runs that platform in production.

The experience brought lessons in organizational design, the value of pair-ops (in addition to pair programming) and test-driven development, the importance of addressing continuous integration as a first class concern, and how separating infrastructure ops from application ops serves the business and their customers better.

In this session Cornelia will share the “prod incidents” that brought these teachings; the audience will gain an appreciation not only for what, but why these practices and tools are so important.

Full transcript

The complete talk, organized by section.

Cornelia Davis

Good morning, everyone. Thank you so much for coming.

What a great conference, huh? The keynotes, just unbelievable. Mike Bland yesterday, all of them this morning, just so impressive. So glad to be here, and so glad to have you join me this morning.

My talk this morning is on me being a developer and learning about ops. So I'm going to tell you a bit of a story for the next half hour that has lots of lessons in it.

My name is Cornelia Davis. I am the CTO of the transformation practice at Pivotal. For the last couple of years, I've been leading a group within the Cloud Foundry division of Pivotal. The platform engineering group is a field-facing engineering organization from the Cloud Foundry side. What the platform engineering team does is it spends all of its time out field-facing, doing engineering, but field-facing, working with our customers and our partners. We do partner engineering as well, really helping them make the most of the platform.

So I have spent a lot of time with enterprises, and I've learned a lot, and I'll tell you some of those stories as we go along. But this story today is about what I learned in going out and working with customers on this platform.

I won't be talking about the platform too much. I'm going to talk about things in the abstract. So these things are going to be applicable to you whether you are using the Cloud Foundry platform or some other platform. The one thing that I will tell you, though, is that platform does matter. Having the right set of tools is absolutely essential for DevOps. But I'm going to talk, I think, a little bit more about just the lessons in the abstract today.

So let me jump into that story. When I first started working on platform as a service, I was still at EMC. I joined Pivotal through the spin-off coming from the EMC side. I was working in the chief technology office, doing emerging tech, and my boss said, "Hey, why don't you start taking a look at PaaS?" This was a little more than three years ago. "I really think this is going to play very prominently in the coming years."

And so I said, "Great," and I started reading about PaaS, and I was reading headlines like this: "PaaS makes developers happy," or, "PaaS is all about developers, helps them develop and test." And it was this developer mindset.

And I thought, awesome. I've been in the industry for 25 years. I've always been on the development side, had never done anything about operations. But then I moved into this platform engineering role that I just mentioned, and it took me less than a month before I had this oh-my-gosh moment. And I came home, and I told my husband, "I'm working on a data center product. I'm working on an ops product."

And it turns out that a platform like Cloud Foundry is at least as much about the operator and the operations teams and the operations processes as it is about developers. The good news is that the industry as a whole has come along now as well and recognized that these platforms are not just for dev and test and not just about developer productivity, because all of you who are operating apps in production, you know that that's the long tail, that you have to keep things running in production for a long time. So it isn't all about developer productivity.

So about a year ago, I reported for duty. I reached out to our director of cloud ops, the person who keeps our Cloud Foundry instance, our public instance, available running in the cloud, and I sent him an email where I literally said, "Reporting for duty, sir. I would like to spend a month doing ops. I would like to be a part of your team."

And this is my favorite admiral, of course, Grace Hopper, so I wanted to put a plug for her in there.

We run Cloud Foundry in production. It's open for business 24 by 7 by 365. We have no planned downtime, no maintenance windows or anything like that. And so that's what I did: I spent a month on that ops team, and I'm going to share those lessons.

The great thing is that the very first morning I started out, I got there and I sat down and started getting onboarded into the team using checklists, if you were in the room for the last hour. And we were two line items into the checklist, and we had an incident.

Here's what happened in that incident. We have a customer, the Sundance Film Festival, running on our platform. They're running their applications on the Cloud Foundry platform. They have their applications deployed, they have multiple instances of the app deployed, and they had a certain amount of traffic that was going to that application. No problem.

The platform does more than just host applications, though. It provides all sorts of services around that, services that are benefiting you on the operations team. For example, we do logging. So we've got four app instances. We aggregate all the logs together so that those logs can be persisted in some log storage and search system. So we have the compute that's running the apps, and we have the logging subsystem, and everything was going fine.

Now, the day that I started was two days before the film festival opened, and that was the day that they were going to release a bunch more tickets available through their web app. And so they had anticipated a spike in traffic. They had scaled up the number of app instances ahead of time. They knew this was coming, so this is like a Black Friday event. They knew this was coming. They had scaled the app instances, and you can see that they were serving that increased volume of traffic just fine.

And of course, an increased volume of traffic means an increase in the logs. And what we hadn't done was scale the logging subsystem. The log processors could only do so much, so they were still sending data out and persisting logs, but a whole bunch of logs were going to `/dev/null`. They were just dropping on the floor. So this was the incident.

Let me tell you how we dealt with that. By the way, within about 45 minutes, we had scaled the logging subsystem and everything was up and running. The good news is that nothing was ever affected. The apps were running fine, so it turns out that we didn't need the logs that we were dropping on the floor, and within 45 minutes, we were capturing all the logs again.
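To make the shape of that incident concrete, here is a minimal sketch of a log pipeline whose capacity did not grow with the traffic. It is a toy model, not Cloud Foundry's logging subsystem: the `LogDrain` class, the `simulate` helper, and every number in it are made up for illustration.

```python
class LogDrain:
    """Toy model of a log pipeline with a fixed capacity per tick (hypothetical)."""

    def __init__(self, capacity_per_tick):
        self.capacity_per_tick = capacity_per_tick
        self.persisted = 0
        self.dropped = 0

    def tick(self, incoming):
        # Persist what fits; anything beyond capacity is dropped on the floor.
        accepted = min(incoming, self.capacity_per_tick)
        self.persisted += accepted
        self.dropped += incoming - accepted


def simulate(processors, per_processor_capacity, load_per_tick, ticks):
    """Return (persisted, dropped) log lines after running for `ticks` ticks."""
    drain = LogDrain(processors * per_processor_capacity)
    for _ in range(ticks):
        drain.tick(load_per_tick)
    return drain.persisted, drain.dropped


# Illustrative numbers only: each processor handles 100 lines per tick,
# and the traffic spike produces 350 lines per tick.
for count in (2, 3, 4):
    persisted, dropped = simulate(count, 100, 350, ticks=10)
    print(f"{count} processors: persisted={persisted} dropped={dropped}")
```

With these invented numbers the drops only stop once capacity exceeds the load. The apps themselves never notice the shortfall, which is why this kind of problem tends to show up on the logging team's monitoring first.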

Let me tell you about what the operation cycle of that looked like. Just a simplified picture of what we had here. We have the apps, we have the logging subsystem, and we've got the persistence for the logs as well. That dotted line around that whole system is what we consider the platform.

And the group that's responsible for that platform is the Cloud Ops team. That's the team that I had reported for duty on. That's the team that I was working on.

Now, the logging subsystem is built by a separate team. The Cloud Ops team is in San Francisco. The logging subsystem team happens to be in Boulder. So they're responsible for a portion of the platform, and that portion is under heavy development. We're actually in the middle of changing our logging subsystem.

And so what had happened was they have their own monitoring. This is part of the whole DevOps story: the development team, the team that was building the logging subsystem, has their own monitoring, and they were getting alerts. They got the first alerts that said, "Hey, we've got too much volume. We're dropping logs on the floor." So they got those alerts, and they got paged using PagerDuty, of course, and their prod pair came in to start addressing this.

I'll say more about prod pairs in just a moment. We at Pivotal do pair programming all day, every day. There's all sorts of benefits on that. But we also do pair ops. So our operations teams work in pairs, always work in pairs, and that is actually hugely valuable.

So they got the alert, and they said, "We need to involve the team that's responsible for the entire system, because if we want to change the topology, we need to involve the Cloud Ops team." By the way, the Cloud Ops team's dashboards were fine, because they weren't looking at the fine-grained detail inside the logging subsystem. Not yet, because like I said, the logging subsystem was under heavy development. So they weren't getting alerted yet.

But the logging team paged us, and we came in, again as a pair. So we have a development pair on the right-hand side, and we have an ops pair on the left-hand side. And that's when we went in, and we scaled the logging subsystem.

Let me say a couple of things. Let me talk about some of the lessons learned there.

The one thing that I will tell you is that you've probably already noticed some parallels between the dev team and the ops team. We do pair programming. We do pair ops. The other thing that we do is that both teams have backlogs. There is a backlog for the dev team. There is a backlog for the ops team, because the ops team is not only in the business of fighting fires. They have their own product. They treat their deployment as a product, and they have a backlog of stories of things that they want to do, where they want to increase the level of automation that they have. They want to increase the level of visibility that they have on their dashboards and so on. So we treat ops just like any other product, and we have a backlog for that.

Both the dev team and the ops team have a prod pair. On the ops team at the time, we had two pairs. Occasionally, we had three pairs. When there was a fire, not every ops person started running around with their hair on fire. There was the prod pair, and if there's a prod issue, this is the pair that comes in. The other pair, or the other two pairs, keep working on the product backlog. So that kind of discipline is really important.

And then we have a very sensible way of managing deployments. So let's talk about that.

Oh, and let me go back. You'll notice that this is actually one of the pairing boards that we have. A couple of other things you'll notice is that the pipeline team here is the prod pair. We also have an interrupt pair, so we're very collaborative. We have a lot of synchronous collaboration on the floor in our development and ops teams, and so there are always teams going to other teams asking questions. What we do is we've identified, rather than interrupting everyone, just like a prod incident doesn't interrupt everyone, there's an interrupt pair. They're the ones that answer questions for the day.

So there's a couple of really good practices showing on this board.

Let's talk a little bit more about deployments.

The first thing that I'll tell you is that we do deployments during normal working hours. We do not schedule deployments at 2:00 in the morning. We do not schedule deployments on Saturday. And in fact, we categorize our deployments into a number of different categories.

There's something called a stemcell deploy. That means we're actually revving the operating system out from under all of the applications that are running. That one takes a fairly long time because we actually have to recreate VMs along the way.

We also have something that we call a manifest-only deploy, which is what we did with the logging subsystem there. It really just means that we're either adding or removing some nodes. So the whole rest of the cluster is fine, we're just fiddling with a couple of nodes.

Well, a stemcell deployment, if we don't start it by 9:30 in the morning, we don't do it, because it takes the entire day. A deployment where we're just going to add a couple of nodes and we estimate it's going to take 20 minutes to complete, we can do that at 3:00 in the afternoon. So we do deployments during the day.
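The scheduling rule described here can be sketched as a small check. This is only an illustration: the 9:30 cutoff comes from the talk, while the `may_start_deploy` function, the workday boundaries, and the exclusion of Sundays are assumptions added for the example.

```python
from datetime import datetime, time, timedelta

STEMCELL_CUTOFF = time(9, 30)  # from the talk: start by 9:30 or not at all
WORKDAY_START = time(9, 0)     # assumption, not stated in the talk
WORKDAY_END = time(17, 0)      # assumption, not stated in the talk


def may_start_deploy(kind, now, estimate):
    """Return True if a deploy of this kind may start at `now` (hypothetical rule)."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday (Sunday is an assumption)
        return False
    if now.time() < WORKDAY_START:
        return False
    if kind == "stemcell":
        # A stemcell deploy takes the entire day, so only the start time matters.
        return now.time() <= STEMCELL_CUTOFF
    # Anything else must be expected to finish before the working day ends.
    finish = now + estimate
    return finish <= datetime.combine(now.date(), WORKDAY_END)


tuesday_afternoon = datetime(2015, 10, 20, 15, 0)
print(may_start_deploy("manifest-only", tuesday_afternoon, timedelta(minutes=20)))
print(may_start_deploy("stemcell", tuesday_afternoon, timedelta(hours=7)))
```

The point of encoding a rule like this is that "can we deploy now?" becomes a question with a predictable answer rather than a negotiation.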

I was doing a talk at our conference earlier this year, and when I said that in the room, there was actually an audible response. People were like, "What? You do deployments during the day? I mean, that would change my world."

So the big question is, how the heck do we do that? And there's a whole slew of things, but there were three top things that I wanted to highlight here.

The first one is immutable infrastructure. If you were at the keynotes this morning, you heard John Willis talk about it very passionately. And I share his passion for this. Immutable infrastructure is absolutely essential. If you have snowflakes and you are making a change manually somewhere and you don't know what the downstream ramifications are going to be, you are going to be in a world of hurt.

So when you do deployments, when you make changes, you always have to start from a known starting point, and that is a blank slate. And you have to redeploy everything. So if we're redeploying a node or if we have to change the operating system, it's all automated, 100% automated, and we are never starting from a, "Well, we went in and tweaked on something." We don't even allow SSH access into some of our things. So no snowflakes. That's number one of how we do this.

The other thing that we do is this principle of single deployable artifact. When you're deploying something into prod, you are deploying exactly the same artifact that you deployed in your user acceptance test, in your test, and in your development environment. You are not rebuilding code as a part of the deployment.

Now, one of the advantages of that is this whole being able to do things during normal working hours. What you see here on the right-hand side of the screen is a very simplified version of one of our pipelines. We have a number of different subsystems within the platform, and each one of those subsystems has its own acceptance environment, Dijon and Tabasco. It depends. We use sauce names. The spicier the sauce, the further away from prod it is. And as you move all the way up toward prod, you get to A1, and then we actually have a ketchup somewhere, which I'm not showing on this particular slide.

But the interesting thing here is that as things move through this pipeline, you can see that over here at A1, A1 is pre-prod. That's the staging environment. But A1 and prod share something that we call the shared package cache. So when things are moving through the development pipeline and we get to A1, we build all of the things that are going to go onto the prod VMs, and we put them in the shared package cache, so that when I do the deploy... And by the way, that takes a while, because doing all those compiles takes a while.

But when we get to prod, then prod is going to do exactly the same thing. It's going to say, "Do I need to compile this?" And it's going to answer that question based on whether it's already been compiled. And it does that all with SHAs. So it takes a look at the GitHub SHAs, and it says, "Ah, if the SHA matches..." Well, it actually calculates the SHAs and says, "Hey, if the SHA matches something that's in the shared package cache, it's already been compiled, and I can just deploy that." So that's a specific technique that allows us to go much faster in prod.
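The idea of keying compiled packages by a hash of their inputs can be sketched in a few lines. This is not BOSH's implementation: the `SharedPackageCache` class, its method names, and the choice of SHA-256 are assumptions made for the example, and the "compile" step is a stand-in.

```python
import hashlib


class SharedPackageCache:
    """Toy content-addressed cache shared by staging and prod (hypothetical)."""

    def __init__(self):
        self._compiled = {}     # fingerprint (str) -> compiled package (bytes)
        self.compile_count = 0  # how many real compiles have happened

    @staticmethod
    def fingerprint(source):
        # Identical source bytes always produce an identical fingerprint.
        return hashlib.sha256(source).hexdigest()

    def get_or_compile(self, source):
        key = self.fingerprint(source)
        if key not in self._compiled:
            # Cache miss: pay the slow compile cost once.
            self._compiled[key] = self._compile(source)
        return self._compiled[key]

    def _compile(self, source):
        self.compile_count += 1
        return b"compiled:" + source  # stand-in for a slow compile


cache = SharedPackageCache()
staging_pkg = cache.get_or_compile(b"logging-subsystem source")  # compiles
prod_pkg = cache.get_or_compile(b"logging-subsystem source")     # cache hit
print(cache.compile_count, staging_pkg == prod_pkg)
```

In this sketch the second call does no compile work and returns the very same bytes, which is the property that both speeds up the prod deploy and keeps it on the artifact that was already exercised in staging.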

Single deployable artifact.

Another thing that is hugely valuable for serving this is that we have declarative deployments. We don't do Puppet and Chef. What we do on our platform is say, "Here, this is what we want the topology to look like. System, make it so." And our platform is designed to allow that. It allows you to say, "Hey, I want a topology with two logging servers," or, "I want a topology with four logging servers."

Remember when I said within 45 minutes we had this thing back up and running? Because the only thing we needed to do was we needed to go into a manifest and say, "I don't want two, I want four." And actually what we did, the only reason it took 45 minutes was that we started with two, and we said, "Let's try three." And we were still dropping a few logs, so then we said, "Let's try four," and then we were fine. So order of magnitude, a little bit different, but that shows the purpose.
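A declarative deployment of this kind boils down to comparing the declared topology with the running one and deriving the steps. The sketch below shows only that comparison; the `plan_changes` function and the job names are hypothetical and do not reflect the platform's actual manifest format or tooling.

```python
def plan_changes(current, desired):
    """Return the actions needed to move from `current` to `desired`.

    Both arguments map a job name to an instance count (hypothetical shape).
    """
    actions = []
    for job in sorted(set(current) | set(desired)):
        have = current.get(job, 0)
        want = desired.get(job, 0)
        if want > have:
            actions.append(("add", job, want - have))
        elif want < have:
            actions.append(("remove", job, have - want))
    return actions


running = {"logging_server": 2, "runner": 200}
declared = {"logging_server": 4, "runner": 200}  # the only edit: 2 becomes 4
print(plan_changes(running, declared))
```

The operator edits one number in the declaration and the system works out the rest, so the rest of the cluster is left alone.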

So having a system that's designed that way, that's designed to allow you to just declare your topology and then have the system make it so, really reduces that burden. Our director of cloud ops says you have to set up your operations environment so that if you get paged at 2:00 in the morning after you've just come in from being at a bar, you can still do your job. You don't want to make it super hard to do your job. And so the platform and everything that you have around there needs to support those operational capabilities.

All right. So far the application I've been talking about has been the platform itself. Our platform in particular is there to allow application developers to write and run their code on that platform.

So I'm going to start shifting from the application being the platform to the applications that are running on the platform and talk about a few lessons there, because I think that they're very key and they came out through the experiences that I had in that month as well.

So let me talk a little bit more about that single deployable artifact. I already said you have a single deployable artifact. You build it once and it moves through your entire development life cycle: dev, test, user acceptance test, and deployed into prod. Single deployable artifact.

Now, there's an important part of that, and that is that your environments don't in fact look identical. In dev, my customer database might not be a database at all. It might just be a mock. And when I get to test, I actually have a version of my customer database that is the entire customer database, but it's been cleansed of personally identifiable information. And then when I get to prod, I have another environment. Other things are going to change. Some network settings are going to change. There's going to be some differences.

So how do you do this single deployable artifact all the way through the life cycle? What you need there is you need an explicit layer of abstraction that's in the business of abstracting away environment configurations so that your single deployable artifact can just reference those environment configurations as it moves through the pipeline.
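One common way to get that layer of abstraction is to have the artifact read its settings from the environment it lands in. The sketch below assumes environment variables as the mechanism; the `load_config` function, the variable names, and the URLs are all hypothetical.

```python
import os


def load_config(env=None):
    """Read environment-specific settings; the code itself never changes.

    `env` defaults to the process environment and can be replaced with a
    plain dict for testing. All names here are illustrative.
    """
    env = os.environ if env is None else env
    return {
        # In dev this might point at a mock; in prod at the real database.
        "customer_db_url": env.get("CUSTOMER_DB_URL", "mock://customers"),
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }


dev = load_config({})
prod = load_config({"CUSTOMER_DB_URL": "postgres://prod-db/customers",
                    "LOG_LEVEL": "WARN"})
print(dev["customer_db_url"])
print(prod["customer_db_url"])
```

The same build runs in every stage; only the values handed to it differ, so nothing has to be rebuilt or translated on the way to prod.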

Now, this lesson came out for me during my one month on ops. On our platform, I'm used to this. I've been working on this platform, and this is like motherhood and apple pie. I sometimes say that this whole ops story is kind of great because sometimes it's a benefit to be clueless. I didn't come into my month at ops with any preconceived notions of what might happen. So I came in with this teach-me, completely open-minded attitude.

But I did have this background of, yeah, of course you're going to have this layer of abstraction that allows you to have this model for continuous deployment. Well, it turns out that our monitoring subsystem doesn't have that. You essentially are building your monitoring, your dashboards, on the fly.

Now, we have implemented our own continuous delivery around that monitoring software, which meant we had to do translations. You build a dashboard in test, and then you have to translate it to get it running in prod. And so one day I was working on that task with my pair. We were updating that monitoring software, and we did that transformation and something broke.

Now the long story short is that we had a bug in our transformation, the thing that transformed the dashboard from one stage to another stage. So the difference between staging and prod, we had a bug in that software. And the reason that we had to build that software is because the monitoring software doesn't have that layer of abstraction. They are not treating continuous integration as a first-class entity in their system.

So that's really what the single deployable artifact lesson is about: do your platform and your tools allow you to support continuous integration? Is that a first-class concern? It has to be, so that the artifact can move through the rest of the pipeline.

All right. So let's come back to dev and ops responsibilities for just a moment. We talked about that in the very first story where we had the dev team that was responsible for the logging subsystem. We had the ops team that was responsible for the entire deployment. So let me tell you another story.

If you remember this picture, it turns out that those applications themselves are actually running inside of containers. They're not running inside of Docker containers, but they're running... Well, actually, there's no such thing as really a Docker container. There's Docker images, but they're running inside of Linux containers, just like Docker images do. So we have a multitude of applications that are running on a VM, and we have a whole bunch of VMs that are running applications.

The big boxes are a whole bunch of VMs that are running applications. And the little boxes are the applications that are running on those VMs inside of Linux containers.

So this story came out of a day where we were going to do a prod deploy and I was so excited. I even tweeted, "Hey, I'm going to do a prod deploy today. I'm so excited." And somebody replied to my tweet and said, "Hey, with Cloud Foundry, isn't that supposed to be like a ha-yawn?" And I said, "Actually..." So three hours later I was like, "Actually, I'm really quite bored because there's nothing happening here."

But I did say to my prod pair that morning when we started, I said, "Oh, I hope something goes wrong." And he was like... And I said, "Really? Because I want to learn." And I got my wish. And you'll see, I swear it's because I asked, because you'll see how odd it would be that this type of failure would happen on that day.

But let me tell you about that failure. We were going through and we were updating each one of these VMs, and we do so with a canary-style rolling upgrade, zero downtime. We don't bring them all down at the same time. And we have, I don't know, 200 or so of these VMs. These VMs are called runners. So we were marching along and we were watching, and we were updating. We updated a runner, we updated another runner. And you can see, these are actually from the logs. It took about a minute to update each one of these VMs.

And we updated a bunch of runners. And remember that thing I said about immutable infrastructure? If one of them works, they should all work. If you remember your proofs, I think it's induction, where you prove it for the base case, you assume it works for N, and then you prove that it works for N plus one. That means it works for everything.
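A rolling upgrade of this style can be sketched as a loop that updates one VM at a time and halts at the first failure, so the untouched VMs keep serving. This is a simplified illustration, not the platform's deployment tooling: the `rolling_upgrade` function, the `update` callback, and the VM names are assumptions.

```python
def rolling_upgrade(vms, update):
    """Update VMs one at a time; stop at the first failure.

    `update` is a callable that returns True on success (hypothetical).
    Returns (updated_vms, failed_vm), where failed_vm is None if all succeeded.
    """
    updated = []
    for vm in vms:
        if not update(vm):
            # Halt here: every VM after this one is still running the old
            # version, so the service as a whole stays up.
            return updated, vm
        updated.append(vm)
    return updated, None


runners = [f"runner-{n}" for n in range(1, 6)]
done, failed = rolling_upgrade(runners, lambda vm: vm != "runner-4")
print(done, failed)
```

Because the first VM in the list acts as the canary, a bad release is caught after one update rather than after all of them.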

That should've worked. And what ended up happening is that on number 96, it failed. The update failed. But wait a minute, we're starting from scratch. What the heck happened? What failed? Why did it fail?

And here's what happened. Each one of those virtual machines is running a multitude of processes. And I'm only identifying two of the processes here. One's called the BOSH agent, one's called the directory server. It really doesn't matter what those are.

But what ended up happening was we took a look at one of the healthy VMs, and we took a look at the one that failed. And we took a look at the ports that those processes were using. Now, the BOSH agent on the left-hand side gets a port dynamically assigned, and the directory server gets a port statically assigned. It was defined in the manifest. It was statically defined.

On the healthy DEA, the BOSH agent had 35560, and the statically defined one was 34567. 34567. That's what somebody picked. 34567. And on the unhealthy one, the BOSH agent, which gets a dynamically assigned port, had port 34567. So when the directory server came up, it couldn't use that port.

So we scratched our heads and said, "Well, gosh, what's going on here?" And we didn't know, and we went and asked around, and we have Dimitri, who knows freaking everything. He's one of those wizards. And he said, "Oh, it's ephemeral ports." And we said, "It's what?" And he said, "It's ephemeral ports." And we said, "Okay."

And so we went, and we Googled it, and we found out that there's an ephemeral port range. How many people know about the ephemeral port range? Okay, maybe 10% of you. There's a range of ports from 32768 to 61000. It depends on what system you're on, but on Ubuntu, it's those. Don't ever statically assign anything to a port in that range, because systems that get dynamically assigned ports are able to pull them from anywhere in that range.
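A check like the following could catch a static assignment inside the ephemeral range before it reaches prod. It is a sketch under assumptions: the default range is the one quoted in the talk, treated as inclusive, and the helper names are made up. On Linux the configured range can be read from `/proc/sys/net/ipv4/ip_local_port_range`; the sketch falls back to the quoted range if that file is unavailable.

```python
EPHEMERAL_RANGE = (32768, 61000)  # as quoted in the talk; inclusive is assumed


def read_ephemeral_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Return (low, high) from the Linux setting, or the quoted default."""
    try:
        with open(path) as handle:
            low, high = handle.read().split()
            return int(low), int(high)
    except (OSError, ValueError):
        # Not Linux, or the file is missing or malformed: use the default.
        return EPHEMERAL_RANGE


def is_ephemeral(port, port_range=EPHEMERAL_RANGE):
    """True if `port` could be handed out dynamically by the OS."""
    low, high = port_range
    return low <= port <= high


for port in (34567, 8080):
    verdict = "unsafe to assign statically" if is_ephemeral(port) else "ok"
    print(port, verdict)
```

Running a check like this against every statically defined port in a manifest moves the knowledge out of one expert's head and into the pipeline.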

So check out the size of that range. There's 30,000 ports in that range. So what are the chances of us having this conflict? That happened because I asked the gods for a problem, and we got it.
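For a rough sense of those odds, the arithmetic can be sketched as follows. It assumes each VM draws one dynamic port uniformly and independently from the quoted range, which is a simplification of how an operating system actually allocates ports, so treat the result as an order of magnitude only. The function name and defaults are illustrative.

```python
LOW, HIGH = 32768, 61000
RANGE_SIZE = HIGH - LOW + 1  # number of ports in the quoted range, inclusive


def collision_probability(vms, dynamic_ports_per_vm=1):
    """Chance that at least one VM's dynamic port hits one fixed static port.

    Assumes uniform, independent draws (a simplification).
    """
    p_single = dynamic_ports_per_vm / RANGE_SIZE
    return 1 - (1 - p_single) ** vms


print(f"ports in range: {RANGE_SIZE}")
print(f"one VM:   {collision_probability(1):.6f}")
print(f"200 VMs:  {collision_probability(200):.6f}")
```

Under these assumptions the chance for any single VM is tiny, and even across 200 or so VMs a single deploy comes out under one percent. Repeated over many deploys, though, a collision eventually becomes likely, which is why the static assignment is the real bug.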

So what's the lesson on this? Basically, the developer that set up the directory server should never have assigned anything to 34567. And you know what? It's not her fault. Only 10% of you knew about the ephemeral port range.

So the point here is that DevOps is about eliminating barriers between dev and ops. That's fundamentally what it's about, and there's lots of different ways of doing that. But one of the ways of doing that is do not expect your developers to know some low-level infrastructure details.

And so what you need to do is you need to think about setting up your systems so that there is a natural separation between operations, between people who have to know things about the infrastructure, and people who need to know things about the application.

So this is something that we work with all of our customers on, this concept that you have an application team. Application teams do dev and ops. But do not make the application teams understand low-level platform details. And set up your environments to enable that so that your application developers, well, they create that deployable artifact. Platform operations teams are responsible for keeping the platform up and running so that the app teams can consume it, and the application operations teams are responsible for running the applications.

If you take a look at the details on here, they're doing similar things: deployment, monitoring, scaling, upgrading. They're doing similar things, but they're doing it in a different discourse, a different vocabulary.

All right, so just to wrap up here, then the final takeaways are: treat ops like any other product. I belabored that point earlier. We do deployments during the day, super critical, because when something does go wrong, you've got Dimitri on the floor who can say, "Ephemeral ports." He says it with a Russian accent. I can't do that.

Single deployable artifact, super important. Separate application operations from platform operations. And then the last one that I haven't belabored enough yet is that you have an app team that is made up of developers and operations.

I wrote up three blog posts about that experience, so each one of those stories is written up in more detail if you want to see them.

And then finally, my very last slide is, what do I need help with? And it's this: there are lots more men in this industry than there are women. And changing this is something that, if any of you have ever seen me speak, you know I speak on quite a bit. And I was actually chatting with my husband when I was working on this slide deck, and I said, "What am I going to put as the 'I need help with'?" And he said, "Why don't you put up diversity?" And I said, "That is a great idea."

So it's totally not a techie subject, although it is techie, and it is a very heavy subject, but it's my plea to you: let's all get together and figure out how we can make dev and ops and DevOps a much more diverse environment.

So thank you. I think I went over by a minute. Thank you for your patience. Have a great day.