DevOps: Human Scale Systems


San Francisco 2016


Technical Operations Leadership · GitHub, Heroku, 37signals

DevOps news is dominated by discussions about tools, and with good reason. It's not unusual for the amount of infrastructure-related code in a system to approach or even exceed the amount of code dedicated to the actual problem the system is solving, even in small systems. As our systems scale in size and complexity, we invest an ever-increasing amount of resources into building solutions to help manage our complex technical systems. And rightly so.

What's often overlooked, however, is the human component of our systems. All too often our approaches to tools, processes, and systems management attempt to remove humans rather than empower them.

I'll make the case that humans are not a source of entropy to be safeguarded against in our systems, but rather a fundamental source of resilience and even efficiency. We'll discuss ways that we can use this point of view to our advantage when constructing our systems to move faster without sacrificing safety. We'll look at things like tools and our interactions with them, team collaboration, and even organizational structure and policies.

We've had plenty of talks about building for web scale, cloud scale, and even planetary scale. Let's spend some time talking about designing for human scale.


Full transcript

The complete talk, organized by section.

Mark Imbriaco

I'm Mark Imbriaco, and I'm happy to be here. You'll note that the title of this slide is a little different than what's in the program, and we'll get to that in a minute.

So who am I? I've been doing this whole internet services thing for a little over 20 years now. I know I don't look that old. It's true.

I started way back in the mid-'90s with an internet company, and over that time I've done a bunch of things. I've worked as a developer and a sysadmin and an ops leader now. And for about the past decade or so, I've had the really good fortune to work at some of the amazing companies that Gene talked about.

I was in a leadership role at 37signals, Heroku, LivingSocial, GitHub, and DigitalOcean, doing technical operations. And those were not just amazing companies, but amazing groups of people, where I've learned a ton. And the things that I've learned at those companies are what led me to found Operable, a company I started about a year and a half ago.

And we're focused on taking the lessons I've learned over the last 20 years and bringing them into a place, using ChatOps, that all of you can use. Right? That people in the enterprise can adopt these same techniques that those of us in the startup world have used for a while.

And the last thing you should know about me is I've got lots of opinions. Sometimes they're even right.

So I was excited when Gene asked me to speak because I was in town last year when DevOps Enterprise '15 happened, and Gene invited me to come down for the day. So I came down, and to be honest, I wasn't expecting a whole lot. I was like, "You know what? I'm doing this startup stuff. I'm doing this technology that's sort of right at the leading edge. What am I going to learn there?"

And it's really interesting because I came away from it way more excited about DevOps than I was when I got here. I was frankly a little down on the term. I felt like DevOps had kind of become a synonym for continuous delivery, and that this cultural idea that we started with, this idea of improving collaboration across our organizations and having people work together better and empathize with one another, I felt like we had lost a lot of those things.

And then I came to DOES '15, and I was blown away. The passion and energy that was in the conference last year was at a level that was significantly higher than any other event I had been to, and it was infectious.

And the thing that I really liked about it was I was really struck, right? I went to some talks and I'd watch the talk, and I wasn't struck by the technology. The technology is the easy part, and if anything, that's the case in all the companies I've been at as well. Sure, there's lots of technology problems and we've got technical debt and everything's falling apart, but that's just how it goes.

The human side and the organizational dynamics are much more difficult. And I watched a bunch of people get up on stage and talk about all of the difficulty that they had been through and the challenges they had faced, not with solving technical problems, but with fighting an uphill battle to do what they thought was right to move their organization forward. And it was really compelling.

So I was super excited. And then Gene asked me to speak this year, and I was honored, but this is where that title slide takes a turn. So I was like, "Sure, Gene, I'd love to speak. I've got this topic, this human scale thing that I've been talking about for a while, moving the focus of DevOps back on the people, and I think it'd be great."

And he's like, "Sure, that sounds good." And then three weeks ago, he's like, "Hey, can we review your talk?"

"Sure. Sure, Gene, that'd be great."

So we get on a video call, and we start to go through it, and Gene asks a seemingly innocuous question. He's like, "Hey, can I make a suggestion?" Now, be really careful if Gene ever asks you this how you answer.

And he gave me the whole spiel that he just gave you about experience reports, which I found out about three weeks ago. So suddenly, I thought I had a talk done, and now I'm giving a different talk. But I'm excited about it because I like telling stories. So Gene basically gave me the license to tell stories.

So I'm going to tell some stories about some of the things I've learned at various places along the way, and this is obviously not an exhaustive list. I've learned lots of horrible things not to do. I'm going to try to share some of the things that I think you should do. And I'm going to go kind of chronologically, and I think I've got a narrative, but we'll see how it goes.

So 37signals. I joined 37signals about 10 years ago, I guess, at this point. I was the seventh employee at the company. And 37signals is the company that is responsible for Ruby on Rails, which I'm sure basically everybody here is familiar with.

They're also the company behind a product called Basecamp and a number of other productivity software packages that are offered as a software as a service. So I was the seventh employee. I was the first ops hire. And we had problems, as you might expect. We had an infrastructure that was built by the developers who wrote the code, which is not necessarily a bad thing, but they didn't have any experience either. So it had worked, and it was amazing, and it got us to where we were, but there were a lot of challenges and a lot of things for us to learn collectively, not just in terms of scaling technically, but also organizationally.

And one of the things that happens when you're the seventh employee at a very small company and you're the only one focused on ops is you have lots of things to do. And the more things you do and the more successful you are, the more you want to do, and the more you feel like you should be doing.

And one of the things that we decided at 37signals... So 37signals, another thing I should talk about is 37signals also wrote a number of books. One of them is called Getting Real. Everybody should go read it. Check it out. gettingreal.37signals.com, I think, still works. The company is called Basecamp now.

But it's a very short book with a bunch of business advice from this little startup in Chicago that is applicable across lots of organizations. And one of the things that they talk about is this idea of making tiny decisions.

We tend to get caught up in this a lot, especially in operations. We're faced with a problem, and we immediately go four steps down the road: Okay, I need a database for this service. What happens if this service grows really big? What happens if it outgrows the size of the server? What if the East Coast gets hit by a meteor? What if...

And we get caught in this never-ending spiral, and suddenly we've got this huge decision to make. We need to build sharding and we need to build this highly available intergalactic database.

When all we really needed, and I'll get back to a concrete example. Basecamp was growing incredibly quickly. We were growing something like double digits monthly when I joined the company. And we were constantly hitting problems with our database in terms of performance and size.

And around this time, there were a lot of other popular companies like Twitter, for example, that were using Ruby on Rails, and they were growing to significant sizes, and sharding databases was all the rage. And we went round and round on this for a while. And then we stopped and looked at it and said, "What exactly is the problem we're trying to solve?"

The database is too slow. Well, what can we do? Do they make bigger computers? They make bigger computers. Let's buy one of those and see how long that lasts.

And we did that, and it turns out Moore's Law helped us never have to make the hard decision. So we were able to spend money to solve this problem in a way that allowed the business to keep moving forward without us making this huge, disruptive change to our software development, to our software, to our infrastructure. We were able to sort of push this decision down the road to a point where we would have more information, and that's really the point.

The idea of making tiny decisions all the time. If you can decompose a big decision into a bunch of smaller decisions and figure out which ones are most important and solve those, you get more time to make better decisions with more facts. And if you're wrong, they were small decisions anyway, so you can go back and change them. Who cares?

And you're making all these decisions and you're the seventh employee, and you've got lots of work to do. And you always want to do more, so you end up with this hero culture thing. This is a big problem in ops. I'm sure this is not a new term for any of you, but we see this in ops a lot.

Ops people are frequently called on to sort of fight fires and solve problems when everything is burning down, right? And there's a rush involved. You get paged at 3:00 in the morning, you solve the problem, the site comes back, everybody's like, "Hey, Mark, you did a great job. That's awesome." And it's sort of a self-fulfilling, self-reinforcing process, right? You want to feel this rush again. You want to keep solving problems.

And even if you don't want to keep solving problems, you also don't want to get woken up at 3:00 a.m. So you always find more work to do. There's always things to do. And if you're like me, you don't know when to ask for help. And this is really common as well, and I'm not an outlier by any stretch.

So what you really need is you need to push back when you see this in your organizations. You need to enforce vacations. You need to tell people to take a break.

If you're one of the people like me who tends to work too many hours, and I do this with my own startup, I tell the people that work for me on the first day, "Don't be like me. This is how I am. This is not how you should be. Don't do this. This is not healthy. I know it's not healthy. I can't help myself. But please don't do it." And I make them not do it, right?

My co-founder helps me out, too. He tells me, "Mark, you're not allowed to work today. Go home."

And it's really important to kind of enforce this balance because people aren't going to do it on their own.

And then finally you get over this hurdle and you decide, okay, great, I can take a vacation. I got some extra help now. Things are in a reasonable state. I can take a vacation. We know how to deal with all these issues that might happen. Everything's good.

And then something else happens. And let's transition to Heroku. This next story is actually going to be about 37signals, but the remediation happens at Heroku.

So I finally go on vacation. Let me go back, actually. I finally go on vacation, and I go not just on vacation, but on vacation where cell phones don't work. And this was a big step for me. It was a little scary. I went camping with my family. We were up on this mountain where there was literally no cell phone coverage. And we were supposed to be there, I think, three or four days.

We ended up coming back a day early because I sprained my ankle. And we're driving down the mountain, and literally the second we got into cell phone coverage, my phone almost exploded from all the pages. And I'm like, what in the hell is happening? I'm not even supposed to be back for another day. What's going on right now?

It turns out that I had a plan for dealing with problems that would come up, or at least we thought we did. But a funny thing happens when you have a plan that you haven't necessarily practiced.

It's a segue to Heroku.

So I learned a lesson. The lesson was, if you have a plan that you haven't practiced, you don't have a plan.

So I joined Heroku after I left 37signals. I joined Heroku. Heroku's a platform as a service. I joined them in 2010 before we got acquired by Salesforce. And when I joined the company, we had 60,000 applications that we hosted on the platform. That sounds like a lot until I tell you that a year and a half later we had a million and a half apps on the platform.

So we did a little bit of work in a year and a half, and we learned an awful lot about kind of everything to do and not to do. But it was an amazing time, and Heroku was an amazing company.

So one of the things we did at Heroku kind of ties back to this practice idea. We had plans for how we deal with problems at 37signals, and they weren't as well tested as I would like. And when you have a plan that you haven't tested, it's much like a backup that you haven't tried restoring. It's kind of the old joke: if your backups aren't tested, you don't have backups. If your plans aren't practiced, you don't have plans.

So we took this lesson that I learned the hard way, and we built it into our process at Heroku. We had this notion of playbooks, and you guys call them runbooks. And sorry, I said "guys"; that's not actually what I mean. I keep doing this and I keep stopping myself, but bear with me. I should say "folks" or something else, but I apologize.

We had these playbooks, and we had a mandate that all alerts that were generated by our monitoring systems had to be actionable. They had to have a playbook attached that gave you the steps to use for validating that the problem actually happened, gave you the criteria for making decisions about what to do when you got this alert, gave you links to docs, and importantly, also gave you feedback mechanisms for who to contact if it doesn't work. Told you who to reach out to if you had problems while you were executing it.

And even further, every playbook had to have simulation steps so that you could reproduce the circumstances for the playbook in a test environment and walk through the process and practice how you would respond to that specific incident.

And we took it even a step further and we kept track of the last time those were performed on each of the playbooks, and we made new hires run through them before they could go on call in their first few weeks.

So by doing all these things, we were able to build a tremendous amount of confidence that not only we had plans for how to solve problems that we sort of knew might happen, but the plans worked, and they worked recently because I think we had a rule that they had to be tested within 30 days, right? So every 30 days, they had to be refreshed. And we did it kind of on a rolling cycle. We didn't just stop the world every 30 days and test everything. But the idea is none of them should be more than a month out of date.
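The playbook requirements described above (validation steps, decision criteria, doc links, an escalation contact, simulation steps, and a freshness rule) could be tracked with a small record like the following. This is only a sketch: the field names, the `Playbook` class, and the `stale_playbooks` helper are hypothetical and not taken from Heroku's actual tooling, and the 30-day limit is the figure the talk recalls from memory.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Assumption: practice runs must be no more than about 30 days old.
MAX_AGE = timedelta(days=30)


@dataclass
class Playbook:
    """Hypothetical record for one alert's playbook; field names are illustrative."""
    alert_name: str
    validation_steps: List[str]    # how to confirm the problem is real
    decision_criteria: List[str]   # what to do once it is confirmed
    doc_links: List[str]
    escalation_contact: str        # who to reach if the playbook does not work
    simulation_steps: List[str]    # how to reproduce the alert in a test environment
    last_practiced: Optional[date] = None

    def is_stale(self, today: date) -> bool:
        """A playbook never practiced, or practiced more than MAX_AGE ago, is stale."""
        if self.last_practiced is None:
            return True
        return today - self.last_practiced > MAX_AGE


def stale_playbooks(playbooks: List[Playbook], today: date) -> List[str]:
    """Return the alert names whose playbooks are due for a practice run."""
    return [p.alert_name for p in playbooks if p.is_stale(today)]
```

Running `stale_playbooks` on a schedule supports the rolling cycle mentioned above: it surfaces only the playbooks that are due, so nobody has to stop the world and retest everything at once.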

And that's not perfect, but it's a hell of a lot better than what we started with. And it gave people the confidence that they could expect the playbook to work, and it gave new people the confidence that they had already been through this process once. So when they were on call to solve a problem, we didn't just throw them to the wolves.

So it was huge.

But things aren't perfect, right? You're resolving these problems, systems go down, and how do you learn from those things, right? We've got these playbooks. Now what? What do I do? These playbooks have lots of things for me to keep track of. These playbooks have lots of decisions for me to make.

And one of the things we noticed early on was that we had to make decisions during incident response that weren't important. At least weren't important for the core problem that we were trying to solve, right? We had to make decisions like when should we update the status site? When is it a publicly facing event? Engineers had to make these calls on the fly. When should I contact the support team or the communications team? When should I tweet, right?

These things are not things that an engineer who gets paged at 2:00 a.m. should be thinking about. They should be thinking about, how do I get the service back up?

So we decided to be prescriptive about some of these things that weren't core to the mission that that person was trying to solve. We said, "If these services are down, create a status issue that says you're investigating and call somebody from support so that they can keep updating the site for you. That shouldn't be your problem. Your problem is figuring out whether it's a real problem or not. Let's err on the side of over-communicating. Let's let our customers know that we're investigating a problem. We're not sure if it's customer impacting yet, but we'll let you know soon."

And then focus on figuring that out and solving the problem. And by removing some of this cognitive load from the people that are trying to solve the problem, we got improved response times or improved resolution times for problems, right? Suddenly, these people, these engineers who are getting paged in the middle of the night, don't have to think about these things that they don't care about anyway. Incredibly valuable. It seems obvious now, but it wasn't obvious to us at the time.
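One way to take those side decisions off the responder is to write them down as data ahead of time. The sketch below assumes a simple lookup table; the service names and action strings are made up for illustration, since the talk does not list the real ones.

```python
from typing import Dict, List

# Hypothetical services and actions, decided in advance rather than at 2:00 a.m.
PRESCRIBED_ACTIONS: Dict[str, List[str]] = {
    "api": ["open status issue: investigating", "call support on-call"],
    "routing": ["open status issue: investigating", "call support on-call"],
}


def actions_for(down_services: List[str]) -> List[str]:
    """Return the prescribed actions for the affected services, without duplicates."""
    actions: List[str] = []
    for service in down_services:
        for action in PRESCRIBED_ACTIONS.get(service, []):
            if action not in actions:
                actions.append(action)
    return actions
```

The responder looks up what to do instead of deciding it, and services missing from the table simply prescribe nothing, which leaves the engineer free to focus on the fix.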

And then once you have these problems, how do you create an environment where you can learn from them? I'm a huge believer in this idea of learning reviews. You might hear them called postmortems or retrospectives. I hate the word postmortem, so I don't use it in this context. I call them learning reviews. And the idea is to examine your successes and failures, look for places where you can improve.

I'm not going to harp on this too much because I'm sure you've all heard it. But these are hugely important and incredibly valuable, especially if you make them a habit rather than an exceptional activity. Say that every time you have a SEV 2 or higher, you're going to have a postmortem or a retro or a learning review or whatever your terminology is. Or say that every time you deploy a major new feature, you're going to have a retro about that launch, and you're going to talk about it even if it was successful. What are the things we can do better next time? What are the things we need to make sure we keep doing?

And there's a ton that we can be learning from those things. And when we have big problems and we have these retros and these learning reviews, we often generate a lot of information that we can share publicly.

And I'll give an example. So when I was at Heroku, we had a significant outage. We called it the Skynet outage because it happened to happen on the day that, from Terminator, Skynet evolved. You may all remember this. It was 2011. I remember it very well. 2011, when Amazon's EBS didn't. EBS went out for about a day.

So their Elastic Block Store service stopped working for about a day. And we're running Heroku, we're running 250,000 database instances, and I'm not exaggerating, 250,000 databases that relied on EBS. And suddenly EBS stopped working. And a large part of Amazon's control plane for US-East stopped working.

So you can imagine it was an all-hands-on-deck thing. My favorite part of this story when I tell it to people is that I was on vacation when it happened, and I literally had no idea that it happened for the first 36 hours because my team didn't bother to call me because they had it under control. That felt pretty good when I found out. Well, after I stopped freaking out, it felt pretty good. I found out about it in the car on the way home, and my laptop was in my lap the whole rest of the way.

But we learned a ton during this incident. We learned a bunch of things. And we took the things we learned, and we fed them back into our process. But the other thing we did that was, I won't say unique at the time, we didn't pioneer anything. But we decided that we wanted to be transparent to the maximum extent possible with our customers about what was happening, what we were going to do to fix it, what we had done, how we'd responded, and so on.

So we took the results of our internal postmortem, we called them postmortems at the time, I hadn't learned to hate that term yet, and we shared it very transparently. Shockingly transparently.

And I've done this a lot of times at other companies since Heroku, with varying degrees of success. Actually, I shouldn't say that. Universally positive. It's shocking how much customers just want to understand what happened, what's going on, where are you at in resolution.

Think about it yourself, right? Your power goes out, or your cable goes out, or you're running ops for a company and your internet service provider goes down, and they're not communicating about the status. That's way more infuriating than the actual outage.

And we had had this experience a lot ourselves dealing with service providers. So we didn't want to be that guy. So we made it a point to be overly communicative. We had a policy: every 15 minutes we had to update the status site, unless we knew we were going to be working on something for, say, the next two hours. Then we would tell customers, "Okay, we're going to be working on this for two hours. Check back in two hours and we'll let you know how it's going."

But over-communicate and make sure that customers knew that we empathized with what they were going through, and that we wanted to be open about where things were. And this turned lemons into lemonade in a way that I did not expect the first time I did it. The amount of goodwill that being transparent and open generated vastly exceeded the amount of badwill that a significant outage, and in fact, a 67-hour partial outage of Heroku caused.
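The update cadence described above (every 15 minutes by default, or a longer interval once customers have been told when to check back) is easy to encode as a reminder. This is a minimal sketch; the function names are hypothetical, and nothing here reflects how Heroku actually implemented the policy.

```python
from datetime import datetime, timedelta
from typing import Optional

# Default cadence from the talk: a status update at least every 15 minutes.
DEFAULT_INTERVAL = timedelta(minutes=15)


def next_update_due(last_update: datetime,
                    promised_interval: Optional[timedelta] = None) -> datetime:
    """Return when the next status update is due.

    promised_interval overrides the default when customers were explicitly
    told to check back later (for example, in two hours).
    """
    interval = DEFAULT_INTERVAL if promised_interval is None else promised_interval
    return last_update + interval


def is_overdue(last_update: datetime, now: datetime,
               promised_interval: Optional[timedelta] = None) -> bool:
    """True when the status site has gone quiet for longer than promised."""
    return now > next_update_due(last_update, promised_interval)
```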

And as I've done more of these, I came up with this formula. And I share this a lot. I've given talks about incident communication a number of times. And I love this formula because it's so simple.

There's only three things to do to write a good public reason for outage or postmortem statement. First, apologize, but you've got to mean it. If it's, "We apologize for any inconvenience you may have experienced," just stop. Give up. You've already lost. That phrase is... just don't do that. It's completely insincere and people know it. But also, don't gush. Just be sincere.

The second thing you have to do is demonstrate a thorough understanding of the problem, right? Customers want to feel confident that you know what happened, that you understand the scope of what happened, that you have a clue about what you're doing, frankly.

And the third thing you have to do is just explain what you're going to do to reduce the likelihood of similar events in the future. But critically here, don't overpromise. Don't tell customers you're going to prevent this thing from ever happening again because it'll happen in six months, and you're going to look like an idiot.

But be transparent. Say, "These are the steps we're taking. We think this dramatically improves our ability to respond to this kind of event in the future. It reduces the likelihood," et cetera. And the amount of goodwill you generate is insane.

So LivingSocial and goodwill. I joined LivingSocial after Heroku. I was the VP of technical operations there. We were doing something like a billion dollars a year in transaction volume. And e-commerce sounded like fun. "It seemed like a good idea at the time" is probably the right phrase. It wasn't a great idea, but hindsight.

But one of the things I learned at LivingSocial, and it's directly related to that whole sharing and empathy with customers, is that empathy is a core engineering value as well.

And if we're here talking about DevOps, this should be obvious to us. This idea that empathy between teams, between us and our customers is important, but it's also important between us and our colleagues and across our organizations.

And the biggest problem that development and operations teams have is lack of empathy. And it's not because they really dislike each other. It's because they don't understand. They don't have the context. They don't talk. They don't have a frame of reference. They haven't made the effort.

When I joined LivingSocial, this was a significant problem for us. And again, it wasn't because anybody was a bad person, but you had an ops team that was two people, one of them a co-founder, and he had the same problem that I have with this sort of hero complex. And you had a development team that was something like 40 people.

So you got two operations people, you got 40 people. You can imagine that those 40 people can churn out software a whole lot faster than those two people can catch up with all the technical debt that they've built up in the last two years when a company went from zero to a billion dollars in transaction volume.

So I joined the company to take over operations and build a real organization around it. And one of the first things I did was sit down with the development team and the ops team and say, "All right, what's the problem? What do you need? What's standing in your way?"

And it seems obvious in retrospect, it was actually obvious to me at the time, but you sit down and the developers say, "Well, we've got all these product requirements, and we've got all these things we have to deliver, and the ops team keeps getting in our way. And we can't get code out, and why do you guys hate us? Why won't you let us run Mongo?" And all these other things.

And the ops team is like, "Well, why do you guys hate us? Why are you guys so incompetent? Why does everything break all the time? What are you doing? Why can't you guys learn how to write code?"

And when you really get down to it, each of them sort of believed... They actually didn't even believe it when you unpacked it. When you unpacked it, they didn't believe that the other person was incompetent or that the other person was really obstructionist. But they didn't know how to rationalize their environment or their needs with the response they were getting from the other side.

So what the developers really meant was, "Hey, we want to try some new technology once in a while," or, "Hey, we've got this problem to solve, and this technology looks like a good fit," or, "Hey, we need to be able to deploy new staging sites faster."

So when you start unpacking these problems and you say, "Okay, you've got this problem. You tell me that you want to run Mongo. What's the actual problem that you want to solve?"

"Okay, I want to store things that look like documents, and I need to store 100,000 of them."

You're like, "Well, okay, we've got MySQL. Just use that."

And you start unpacking these conversations, and you start unpacking and understanding what the actual problem that people are trying to solve is, and you can start moving from a place where you say, "No, you can't have Mongo," to a place where you say, "What is the problem you're trying to solve, and how can I help you solve it? How can I help you solve it in a way that helps both of us?"

And maybe the answer is, "You still can't have Mongo, but here's a better idea." Or maybe the answer is, "You can't have Mongo right now." But both of those are better than just a no.

When you just say no, you cut off conversation, you remove any chance for empathy, you remove any chance that the two sides can understand where the other is coming from. So this is a really big one for me. The idea of kind of unpacking and getting to the core of the problem you're trying to solve comes up in startup life all the time, right? People jump from a need to a technical solution, especially developers and operations people, at light speed, and you really need to push back and try to understand what they actually need. And try to have some empathy for it.

So empathy, GitHub. I joined GitHub after I left LivingSocial because, frankly, joining a daily deals business, it was just not a good idea. Just not a good idea. And I wasn't passionate about it. I'm passionate about enabling developers and the kinds of work that GitHub does.

So I joined GitHub. Everybody knows who GitHub is. I don't have to explain them. Hooray.

I didn't have a title. GitHub doesn't have titles or didn't at the time. But I was in a leadership role on the ops team. I was responsible for thinking about infrastructure and process.

So one of the things I learned at GitHub is this notion of being collaborative by default, right? You shouldn't have to go out of your way to share things that you're doing. The work you're doing should be visible, transparent, and collaborative to the rest of your team all the time.

So think about ways in your own work that you can make your workflow visible and collaborative, right? So I mentioned Operable, the company I started. We're heavily driven by this idea, and it's directly attributable to GitHub. GitHub had this idea of ChatOps. And I'd been using chat and bots for a very long time, but not to the extent that GitHub did it. GitHub took this notion of bringing your workflow to your chat room to insane levels.

And I mean insane. I gave a talk at Velocity New York, I don't know, two or three years ago, where I talked about how on my way to give a talk about ChatOps at Velocity New York, I responded to a significant DDoS event from the airport departure lounge using nothing but chat commands that made changes to our external BGP announcements, and got on a plane.

And there was an equal mix of sort of technical what and abject horror from the audience. And it wasn't so much horror that, hey, you did this in chat, or that, hey, you did this in an automated way. It was more security related.

But this notion of bringing the work into chat, everybody got that idea. And even if chat's not the right mechanism, think about ways that you can take the work that you're doing and bring it into the place where you collaborate.

Maybe you don't use ChatOps and maybe you want to say, "Hey, every time there's an incident where I have to update the status site, the support team should know about that." Even if you're not using ChatOps, there are some things you can do. You can probably set up a Pingdom alert that pages somebody in PagerDuty when the status site changes from green to red. There's a bunch of things you can do, but the point is, make the...

Gene's telling me don't rush. I've got extra time.

The point is bring your work into the place where you collaborate and make it collaborative by default because this whole DevOps thing, it's all about being a team sport. It's all about collaboration.

So let's take that back because when we collaborate, there's a ton that we can learn. There's a ton that we get out in terms of serendipitous interactions. There's a ton of value we get in training, right?

If I'm using ChatOps to do all of my deployments and a new hire starts, suddenly I don't have to train them on how to do deployments anymore. They saw it on their first day. Probably in their first 20 minutes, they saw somebody deploy some code. And now they know how to do it without asking.

And that's really powerful. And when I'm working an incident with my team and I pull up a graph in chat, whether it's the right thing that points me to the problem or not, everyone learns something. And I didn't have to say, "Go to New Relic and click here and click there and click there." And New Relic's a fantastic tool, but I didn't have to do that. I just showed them. Incredibly powerful.

And this is a benefit that wasn't obvious to me at the time at GitHub, but it has become more and more obvious. At LivingSocial, we had lots of compliance issues, as you might expect when you have that kind of transaction volume. And I became really friendly with the idea of compensating controls.

The idea here is that you can get away with a lot of process things that auditors might look at you funny about if you have visibility into the process and you have an audit trail. And by making this work collaborative, you get really far down the path to solving a lot of those concerns. So it's super interesting to think about ChatOps as a security vehicle and a compensating control for the kinds of process improvements that you all want to do anyway.
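To make the audit trail idea concrete, here is a small Python sketch, assuming a chat bot that routes every command through a single dispatcher. The `AuditedCommands` class, the `deploy` command, and the record fields are all invented for illustration and do not describe any particular ChatOps tool.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Dict, List


class AuditedCommands:
    """Run chat commands and record who ran what, where, and when."""

    def __init__(self) -> None:
        self.handlers: Dict[str, Callable[[List[str]], str]] = {}
        self.audit_log: List[str] = []  # one JSON record per attempted command

    def register(self, name: str, handler: Callable[[List[str]], str]) -> None:
        self.handlers[name] = handler

    def run(self, user: str, room: str, text: str) -> str:
        parts = text.split()
        name, args = (parts[0], parts[1:]) if parts else ("", [])
        entry = {
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "room": room,
            "command": text,
        }
        handler = self.handlers.get(name)
        if handler is None:
            # Rejected attempts are logged too; an auditor cares about those.
            entry["outcome"] = "rejected: unknown command"
            result = f"unknown command: {name}"
        else:
            result = handler(args)
            entry["outcome"] = "ok"
        self.audit_log.append(json.dumps(entry))
        return result


# Usage with a hypothetical deploy command.
bot = AuditedCommands()
bot.register("deploy", lambda args: f"deploying {args[0]}")
print(bot.run("alice", "#ops", "deploy web"))
print(bot.run("bob", "#ops", "reboot db1"))
print(len(bot.audit_log))
```

Because the log is written in the same code path that runs the command, visibility and the audit record come for free with doing the work in chat.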

Oh, and ops tools don't have to be ugly. This I learned at GitHub. I joined GitHub. The day I joined GitHub, there was a graph. In fact, I'm just going to click to the next thing.

So the point here is ops tools don't have to be ugly. You probably all have designers. Find a friendly designer to help you, and they can help you go from having an ugly Nagios graph to something that looks like this.

This is called FS Util. And this, I have no idea if they still use it or not. Probably not. It's not relevant anymore with their new architecture, but this used to show you a snapshot of the disk capacity across the file server clusters at GitHub, right? So where all the Git repositories are stored.

And the day I joined the company, this graph was about two-thirds red. And just looking at it and seeing it two-thirds red gives you a lot of information really fast, right? You say, "Oh, huh, first of all, I should probably do something about that." But second, I know where the problems are, and I know where to go to start solving them, and I have immediately actionable things to work on. And it's not ugly, so I don't mind looking at it, and I don't have to decipher a whole lot because it's really easy to glance at and understand.
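The data behind a glanceable view like that can be very simple. The Python sketch below turns per-server disk usage into colors and sorts the worst first. The thresholds, host names, and function names are assumptions made up for this example, not a description of how FS Util worked.

```python
from typing import Dict


def classify(used_fraction: float, warn: float = 0.70, critical: float = 0.85) -> str:
    """Map a disk usage fraction (0.0 to 1.0) to a color."""
    if used_fraction >= critical:
        return "red"
    if used_fraction >= warn:
        return "yellow"
    return "green"


def snapshot(usage: Dict[str, float]) -> Dict[str, str]:
    """Return host -> color, fullest servers first so problems are at the top."""
    ordered = sorted(usage.items(), key=lambda item: item[1], reverse=True)
    return {host: classify(fraction) for host, fraction in ordered}


# Usage with made-up hosts and numbers.
usage = {"fs1": 0.92, "fs2": 0.60, "fs3": 0.88}
colors = snapshot(usage)
for host, color in colors.items():
    print(f"{host}: {color}")
```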

So my first week at GitHub, I think I spent the first two or three days making that not be red by reorganizing things and bringing new file servers online, and it was great.

And one of the things that I learned from that, and one of the things I learned in my time at GitHub, is to build this culture of shipping. Right? So building a culture of shipping in your teams is incredibly powerful.

And this notion of shipping, I want to be really careful. Shipping does not just mean delivering new software features. Shipping just means you did a unit of work, and it's something that you can share with the rest of your organization. Something that you can celebrate together. Something that everybody can be excited about.

So my first ship at GitHub was making the graph not be red anymore, or making FS Util not be red anymore. And it was amazing. I shared that. GitHub had a... I have no idea if it still exists, but they had an intranet site called Team where you could share kind of status updates. And I dropped the graph in, and I got a bunch of emoji reactions, and it made me feel good about myself. I had solved a real problem and people noticed it and wanted to high-five me about it.

And GitHub had some other interesting ways to do it. Serendipitously in chat, you would see somebody deploy something, and like, "Hey, that's a cool ship." And you celebrate it.

And they had this, it's going to sound corny, but it was really fun, this idea of, we called them toasts. So when there was a significant new feature, they had this sort of cultural phenomenon where you would take a selfie of yourself toasting and post it to Team, or to the GitHub issue where that thing shipped. And you'd end up with dozens of selfies of people toasting the person that just shipped that.

And it makes people feel good, and it builds this self-fulfilling cycle of delivering new work and working together and celebrating everybody's successes.

DigitalOcean. I'm running low on time, but we're close here.

Do the simplest thing that could work. This is also about shipping. So shipping is also about processes. We had some processes at DigitalOcean that were problematic. We had this issue that I'm sure many of you have, where the ops team couldn't get anything done because we had a ton of interrupt-driven work.

Everybody has this problem, right? Everybody who has an operations team has this issue. And we talked about it, and it was really simple. You're like, "Okay, we've got this problem. It's eating up roughly an engineer's worth of time to deal with all this interrupt-driven work, and nobody can get anything done because we're all getting interrupted all the time."

And then you go, "Huh, an engineer's worth of time. What if we just said one person has an on-call rotation to do that stuff every week, and nobody else has to get interrupted?" And suddenly a process is born, and it took all of 15 minutes, and you didn't have to jump through any hoops. And best of all, it's so simple that if it doesn't work, you just throw it away and do something different.
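A process that simple also takes very little code to support. Here is a minimal Python sketch of a weekly rotation, assuming the handoff happens every Monday. The team names, the anchor date, and the `interrupt_handler_for` function are hypothetical.

```python
from datetime import date
from typing import Sequence

# Any Monday works as the anchor; it only fixes where the rotation starts.
ROTATION_START = date(2016, 1, 4)


def interrupt_handler_for(day: date, team: Sequence[str]) -> str:
    """Return who takes the interrupt-driven work during the week containing `day`."""
    if not team:
        raise ValueError("team must not be empty")
    weeks_elapsed = (day - ROTATION_START).days // 7
    return team[weeks_elapsed % len(team)]


# Usage with a made-up three-person team.
team = ["ana", "ben", "chen"]
print(interrupt_handler_for(date(2016, 1, 4), team))
print(interrupt_handler_for(date(2016, 1, 11), team))
print(interrupt_handler_for(date(2016, 1, 25), team))
```

Counting weeks from a fixed Monday, rather than using calendar week numbers, keeps the order stable across year boundaries. And in the spirit of the process itself, if it doesn't work, it is small enough to throw away.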

And feedback loops. We talk about feedback loops in DevOps a lot. And we think they come for free somehow because we're deploying all the time. And I learned, much to my chagrin, that even though you're talking to people and they see your problems, they don't necessarily see your problems.

We had a problem at DigitalOcean that my ops team was getting woken up literally every night to solve. They would wake up at 3:00 a.m. They would respond to the event that the support team told them about. They would Band-Aid it. They would get up in the morning and get a developer to help them solve it for real.

And this went on for four or five days, and I was in a meeting with the dev managers, and I said, "Hey, guys, what's up with this thing that's waking my team up every night? When are you going to fix that?"

They're like, "What are you talking about?"

I'm like, "What do you mean, what do I mean? You're helping us every morning fix this problem. How do you not know this is a problem?"

And they legitimately didn't know that it was a problem that was waking people up. It seems crazy, and I still don't really understand it, but we didn't explicitly say, "Hey, this is waking my team up every night."

And we had put in a tremendous amount of effort to build empathy, so as soon as they learned that it was waking my team up every night, they fixed it, it turns out, in about an hour, and nobody got woken up again.

Just because we went the extra mile to be explicit about a problem, and we had put in the work to build empathy across teams.

So being deliberate about the way that you share that kind of information pays a lot of dividends, and it's going to be organizationally dependent.

And feedback loops, speaking of. If you're interested in ChatOps and have kind of security or audit requirements, I would love to talk to you and pick your brain. It would be a huge help. And if you have any ideas or comments about my talk in general, I'd love to chat.

Thanks very much.