What We Learned from the 2016 State of DevOps Report

San Francisco 2016

Dr. Nicole Forsgren

CEO and Chief Scientist · DevOps Research & Assessment LLC

Jez Humble

CTO · DevOps Research & Assessment LLC

Four years and 25,000 respondents later, and we have learned a lot about what makes IT and organizational performance awesome. This year we include insights into security, containers, trunk-based development, and lean product management. Tune in for practical take-aways to make your teams' technology transformations even better.

Full transcript

Dr. Nicole Forsgren

So here's what we're going to go over today.

First of all, how to make your data suck less. We're going to focus on survey responses because that's what the State of DevOps Report is based on, but these methods work for all the types of data that you also have in your systems. They work for objective data, they work for log data, they work for system data. You can use this for several different types of things.

We'll talk about what we found that we did and didn't expect: things about continuous delivery, things about management, things about culture, this DevOps thing. That's what we're going to talk about today.

We'll start by talking about how not all data is created equal. I'll go ahead and turn around for this, because who here thinks surveys are shit?

Some hands. That was a solid response. See what happened last night? Yeah. Too soon. For those watching the video, this is the day after the election. It's a little rough. Okay. I'm just going to go and cry for a little while.

Jez Humble

I know.

Dr. Nicole Forsgren

I'm sorry, Jez. Okay. We're all sorry.

Who here loves the data from their log files?

Okay, pretty good response. Who here has seen shit data in your log files? Yeah, like every hand. Love you guys.

Here's the thing: we can see good data several places. We can see bad data several places. So we can go through steps to make our data better. We can do this for our log data. We can do this for our survey data.

In survey data, the best way to do this is through what we call a latent construct. I have a picture of a thermometer here because some things are hard to measure by just asking for a number. I can ask for your temperature, though. I can ask for the temperature of this room. I can get a number. If I ask Jez, I might get a different number because Celsius.

Jez Humble

Celsius.

Dr. Nicole Forsgren

But we can get a number. What if I ask you about your satisfaction? What's your satisfaction number today? Or your organizational culture number? That's harder to do, and so when we do surveys, we want to do things like latent constructs.

When you think about this, think about a super awesome Venn diagram with five or six overlapping things that have this core thing in the middle. That's a construct. That's how we get around and help our data be better with survey questions.

Because if I only ask one question, what happens if I say it wrong or you understand it differently? I don't know if it's bad. If I ask four or five questions and this one sucks and all the statistics tell me it sucks, I can throw that out and I still have four left, and I know that it was a misunderstanding or an oops or something.

The other really great thing about this is that people lie, right? That's fine. People lie. They're going to bias your data. It's hard to trick a latent construct, because that means you have to lie on every single item...

Jez Humble

Consistently.

Dr. Nicole Forsgren

Consistently, and every single person on your team has to lie on every single subcomponent consistently.

So it helps our data be better. It helps make our survey data good, or it gives us reasonable assurance that it's telling us what we think it's telling us. So now I can get your organizational culture temperature. I can roll it up into one number that kind of represents what that is.

It includes a couple different steps. The first step is a manual step: writing the questions. We call them items. Writing the items needs to be based on definitions, theory. It needs to be very carefully and precisely worded.

We do a card-sorting task. I will pull 20 or 30 people around, hand them a whole bunch of questions on three-by-five cards, ask them to sort them into piles by theme to make sure that the things that I think in my head make sense to a whole bunch of other people.

The next step is statistics. I do this with numbers. I establish validity, both discriminant and convergent. So once I get survey responses from a whole bunch of people: who here has taken the State of DevOps study?

Thank you. Love you. Love you so hard. Thanks.

When I get all of those responses back, I can do a statistical analysis on it to make sure that when you're answering questions about what I think that construct is, you're only answering about that construct, and you're not answering about anything else. Those are the validity items.

There are also reliabilities I can do. It's a statistical way to make sure that we're all reading it in about the same way. So again, if I have five questions and one of them's bad, it doesn't match up statistically, we say it doesn't load. It's like the statistical card-sorting task. That one thing just ends up in a weird pile. It doesn't end up in the same pile. I can throw it out, and I still have four left.
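As a rough illustration of the reliability check described here: the talk does not name the statistic used, so this sketch assumes Cronbach's alpha, one common reliability measure, and recomputes it with each item left out. The answers are made up, and the fifth item is deliberately written to disagree with the other four, like the question that "ends up in a weird pile."

```python
from statistics import pvariance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns (one list of answers per item)."""
    k = len(items)
    # Total score per respondent: sum across items, respondent by respondent.
    totals = [sum(answers) for answers in zip(*items)]
    item_variance = sum(pvariance(column) for column in items)
    return k / (k - 1) * (1 - item_variance / pvariance(totals))


def alpha_if_dropped(items):
    """Alpha recomputed with each item left out in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]


# Made-up answers on a 1-7 scale from six respondents to five items.
# The first four items move together; the fifth was (hypothetically) misread.
items = [
    [7, 6, 2, 5, 3, 6],
    [6, 6, 1, 5, 2, 7],
    [7, 5, 2, 4, 3, 6],
    [6, 7, 2, 5, 3, 6],
    [2, 6, 5, 1, 7, 3],
]

alpha_all = cronbach_alpha(items)
dropped = alpha_if_dropped(items)
worst_item = dropped.index(max(dropped))  # dropping this item helps the most

print(f"alpha with all items: {alpha_all:.2f}")
print(f"alpha without item {worst_item + 1}: {dropped[worst_item]:.2f}")
```

With this data, reliability improves most when the fifth item is thrown out, and the four remaining items still measure the construct. A single-question survey has nothing to compare against, so a check like this cannot be run on it.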

What happens if I only ask one question? I think it's awesome, but most everybody here misunderstands it. I have no idea. I don't know. That's why we use latent constructs.

We can do this with log files and log data as well. Sometimes we have tons and tons and tons of data. Who here is drowning in data? Drowning in data. We can combine our data the same way. We can say, "What would be a good indication of performance? What are some measures that we can combine for performance?" And we can do the same thing.

A good example here is culture. I brought up culture earlier. We started by asking ourselves, does it matter? People always talk about how culture matters. That's nice, but research, right? So I had to go dig into the literature and see if it matters, but also, what kind of culture?

Remember I said the definition really matters. Are we talking about national identity and norms? That's been shown to be important in financial literature. Are we talking about an adaptive culture, a culture that values learning? We used that in 2014. We've really focused the last few years on a culture that values information flow and trust, and this came from a researcher named Ron Westrum.

Do you want to talk about Westrum a little bit?

Jez Humble

Yes. He's a sociologist who was studying safety outcomes in healthcare and aviation, and what he found is he could predict those safety outcomes by talking about information flow. He developed a model that allows you to measure the extent of information flow along six different axes.

We took that model, which is right here, and we asked you questions which allowed us to put you along this spectrum.

Dr. Nicole Forsgren

This is the Westrum typology. This is what comes from his 2004 paper. We can read through these. As we read through the pathological culture, a bureaucratic culture, and a generative culture: who here has a friend that works in a pathological culture?

Okay. There's a lot of those friends. Lots of friends. I'm hearing about lots of friends. This is about a third of our population so far.

Who here has a friend that works in a bureaucratic culture? Narrow responsibilities, modest cooperation. This is about 55% of our population.

Who here lives in a generative, performance-oriented culture? High cooperation, messengers are trained, risks are shared, novelty is implemented and encouraged.

Jez Humble

And messengers of bad news. Do we actually train and encourage people to bring us bad news so we can act on it? Because it gives us opportunities to learn.

Dr. Nicole Forsgren

Okay, some hands here, too. Where are we? I think about 20% is generative here across our study.

What we needed to do was find a way to turn this into a latent construct. This isn't survey questions, this isn't survey items, but it's based in theory, it's based in literature. I know that this should be predictive of performance outcomes.

Here's what we did. We took the right side and we used strong, clear language. "On my team, information is actively sought. Failures are learning opportunities, and messengers are not punished. Responsibilities are shared. Cross-functional collaboration is encouraged and rewarded."

Now when someone reads through this, they can answer on a scale from one is strongly disagree to seven is strongly agree. The statement has to be something that you can strongly agree or disagree with. You don't want, "I kind of feel pretty okay when I work with my people, sort of, some days, on Wednesdays." It needs to be strong, clear statements so that I can get those shades of gray.
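A minimal sketch of how answers on that one-to-seven scale could be rolled up into a single number. The statements are the ones quoted in the talk; treating each sentence as one item, averaging the items, and the function names are assumptions for illustration, not the report's actual scoring method.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 7  # 1 = strongly disagree, 7 = strongly agree

# Statements quoted in the talk; splitting them into four items is an assumption.
ITEMS = [
    "On my team, information is actively sought.",
    "On my team, failures are learning opportunities, and messengers are not punished.",
    "On my team, responsibilities are shared.",
    "On my team, cross-functional collaboration is encouraged and rewarded.",
]


def construct_score(answers):
    """Average one respondent's answers into a single construct score."""
    if len(answers) != len(ITEMS):
        raise ValueError("expected one answer per item")
    for answer in answers:
        if not SCALE_MIN <= answer <= SCALE_MAX:
            raise ValueError(f"answer {answer} is outside the 1-7 scale")
    return mean(answers)


def team_score(team_answers):
    """Average the construct scores of everyone on a team."""
    return mean(construct_score(answers) for answers in team_answers)


# Made-up answers from three people on one team.
team = [[7, 6, 6, 5], [5, 5, 4, 6], [6, 7, 6, 5]]
print(f"team culture score: {team_score(team):.2f}")
```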

Jez Humble

These are called Likert-style constructs.

Dr. Nicole Forsgren

Yes. Now, the great thing here is that this has been found to be valid and reliable. It's also been found to be predictive of IT and organizational performance for three years in a row.

I know several teams in the industry now that are using these as quick pulse checks every three months on their teams, because we do know that when culture starts breaking down, technology starts breaking as well. When teams start fighting, technology starts breaking.

My nerd moment really quick was when Dr. Westrum actually emailed me several months ago and said, "I hear you've been using some of my work. Can I see it?" I'm like, "Yeah, that'd be awesome." And I emailed it off and I was like, "Oh shit, now I'm nervous. What if I did it wrong?"

No, he said that he was super excited about our results and we had applied it quite intelligently. So yay. We did it right.

Here's an example. We're all going to play along. We thought in 2014, so this originated in our 2014 study, that notification of failure would be highly correlated with success outcomes. Where do we find out about failure?

This was original in 2014. Can you spot it? Now here's the question I'm asking about. We threw this into that statistical card-sorting task. It didn't land in one pile. The statistics tell me that it lands in more than one pile. So read through it quickly. Think about what that might be.

It lands in two piles.

Jez Humble

And it doesn't tell us why it lands in two piles. It's like, "Sorry."

Dr. Nicole Forsgren

But when you take a look, do you see the difference? There's some notification that comes from internal embedded systems, and there's some notification that comes from Twitter. I don't know about you, I don't want to know that my system has gone down from customers.

So now we have two different constructs. That's the other nice thing about having several items. If one had just fallen out, I would've just tossed it. Now I had two different constructs: notification from far, I called it, and notification from near, from inside.
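To make the "two piles" idea concrete, here is a simplified stand-in for the statistical card sort. The study would have used factor analysis or something similar; this sketch only groups items by pairwise Pearson correlation. The item names, the answers, and the 0.7 threshold are all hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def sort_into_piles(items, threshold=0.7):
    """Greedy grouping: an item joins the first pile it correlates with throughout."""
    piles = []
    for name, answers in items.items():
        for pile in piles:
            if all(pearson(answers, items[other]) >= threshold for other in pile):
                pile.append(name)
                break
        else:
            piles.append([name])
    return piles


# Made-up 1-7 answers from eight respondents to four "how do we hear about
# failure?" items. Names are hypothetical, not the survey's wording.
items = {
    "monitoring_alert": [6, 7, 6, 7, 2, 1, 2, 1],
    "customer_report": [6, 7, 2, 1, 6, 7, 2, 1],
    "ops_dashboard": [7, 6, 7, 6, 1, 2, 1, 2],
    "social_media": [7, 6, 1, 2, 7, 6, 1, 2],
}

for pile in sort_into_piles(items):
    print(pile)
```

The items that describe notification from inside end up in one pile and the items that describe notification from outside end up in another, which is the pattern that suggests two constructs rather than one bad question.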

Guess which one is predictive of IT performance? Far? Near. Near. Yay. We're all scientists.

And then we do a whole bunch of other data tests. So I test for common method bias, common method variance, early versus late responders. I do a whole bunch of other stuff.

I do all of this before I even start looking for relationships among the data. So are we a little bit convinced that we can trust the data now? Maybe? Yeah. Yes? Yay. Yay. Okay.

One really quick note. We don't test for prediction unless we meet one of three conditions. So keep this critical hat on when you read stuff.

You either have to have a longitudinal design, which ours doesn't. We collect data several years in a row, but I don't link your responses to each other. You don't get a link to the survey with a cookie. So I don't have longitudinal data.

Randomized experimental design: I can't do that.

Jez Humble

And this is a problem in computer science in general. It's impossible to do randomized controlled experiments because there's so many variables. This has been a big problem in computers in general, is we just don't have that kind of thing.

Dr. Nicole Forsgren

And even in business. How am I going to say, "Dear this business, only spend this much money. Sorry if you end up losing in the market. That sucks, but I need it for my research design. Thanks, that'd be great." No takers. Shocker.

The third one is theory-based design. I mentioned we went back to the theory. Ron Westrum's work is a good example. He found in several other contexts that that type of culture was predictive of performance outcomes. So if I believe that's there, I can go ahead and test it. And if it is there, then I can say it's predictive.

Otherwise, you get crazy stuff like, have you guys ever heard of spuriouscorrelations.com? It's fantastic. Do you know that the number of pool deaths, children that die in pools, goes up the year that Nicolas Cage movies come out? Do we think one predicts the other? No. So this is why...

Jez Humble

I know. Impossibly.

Nicole is very strict about this. If I start using the word predict or cause in one of these things, Nicole will come over and be like, "Jez..."

Dr. Nicole Forsgren

Correlates.

Jez Humble

"Correlates, Jez."

Dr. Nicole Forsgren

Unless it meets one of these conditions, we will only report correlations in our report. And God bless them, because at one point, I was this random girl who was like, "You can't say that." And they were like, "Who is she?" But they did it. Yay.

Now we're to the data and the key findings. If there's only one thing you take away from the key findings, it's this: throughput and stability is not a zero-sum game.

Jez Humble

For years, we've been told, okay, going faster means breaking things. And you've all been like, "Oh, we should do continuous delivery." And then the IT operations people are like, "No!" Because you're going to break all our systems, quality's going to go down, blah. We're trained to think that by going faster, you reduce stability, you reduce quality.

This was shown to be wrong in manufacturing. The lean manufacturing movement won not because Toyota made shitty cars cheaply, but because Toyota made cheap cars that were better and got to market faster. It was a combination of things. They fundamentally changed the game.

What we're seeing in the DevOps movement right now is we're fundamentally changing the game. It's not a trade-off. The project management trade-off of cost and schedule and scope, it's not a trade-off. We're fundamentally changing the game and finding new ways to do things, and this comes through in the data.

We divided the population that responded to the survey into three groups, and the fact that we were able to do that in a statistically valid way shows us that these measures that we came up with are valid, so that's a huge deal in the first place.

These are our stability and throughput metrics. The throughput metrics are lead time for changes, which is the time to go from check-in to version control to release into production, and how frequently you release.

Stability is how long it takes to restore service in the event of an outage or a degradation of service. And then change fail rate is the proportion of the changes you put into production that you then have to either roll back or, because no one ever rolls back, emergency fix until it works.
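The report gathers these four measures through survey questions, but the same definitions can be applied to deployment records. The sketch below is one hypothetical way to do that: the `Deployment` record, its field names, and the choice to time restoration from the failing deployment are assumptions, and the sample data is invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime   # change checked in to version control
    deployed_at: datetime    # change released to production
    failed: bool = False     # needed a rollback or an emergency fix
    restored_at: Optional[datetime] = None  # service restored, if it failed


def median_delta(deltas):
    """Median of an iterable of timedeltas."""
    return timedelta(seconds=median(d.total_seconds() for d in deltas))


def summarize(deployments, window_days):
    """Two throughput and two stability measures for one observation window."""
    failures = [d for d in deployments if d.failed]
    return {
        "lead_time": median_delta(d.deployed_at - d.committed_at for d in deployments),
        "deploys_per_day": len(deployments) / window_days,
        "change_fail_rate": len(failures) / len(deployments),
        "time_to_restore": (
            median_delta(d.restored_at - d.deployed_at for d in failures)
            if failures else None
        ),
    }


# Invented records covering a two-day window.
deployments = [
    Deployment(datetime(2016, 11, 1, 9, 0), datetime(2016, 11, 1, 9, 30)),
    Deployment(datetime(2016, 11, 1, 11, 0), datetime(2016, 11, 1, 11, 40),
               failed=True, restored_at=datetime(2016, 11, 1, 12, 0)),
    Deployment(datetime(2016, 11, 2, 10, 0), datetime(2016, 11, 2, 10, 50)),
    Deployment(datetime(2016, 11, 2, 14, 0), datetime(2016, 11, 2, 15, 0)),
]

metrics = summarize(deployments, window_days=2)
for name, value in metrics.items():
    print(f"{name}: {value}")
```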

What we found is the high-performing group did better on both of these things. It's not a trade-off. High performers did better on throughput and better on stability. The reason for that is because the practices which enable them to do that enable both of these things.

Having everything in version control enables you to increase throughput because it means you can get testing environments on demand. It means you can release on demand. And it also increases stability, because when something goes wrong, you can restore service in a predictable, deterministic way in a known time.

So the same practices in the continuous delivery and DevOps canon that increase throughput also enable increased stability, and the data shows us that this is true.

Dr. Nicole Forsgren

I'll also mention that the low performers suck at everything, and the medium performers are in the middle on everything. So we just don't see trade-offs.

Jez Humble

Right. That strategy isn't working.

You can see this if you go to the report. You can see high performers are deploying multiple times per day. Lead time is less than an hour. Time to restore service is less than an hour. Change fail rate is nought to 15%, versus the low performers who are releasing on the scale of weeks or months, and their lead times are measured in months as well.

Dr. Nicole Forsgren

And it takes them less than one day. What's the asterisk here? Oh, the asterisk is that I report the medians because the distribution isn't always normal, so I report the median. But when you do the test for statistical differences, you use average and median, or average and...

Jez Humble

Mean.

Dr. Nicole Forsgren

Mean. Thanks. So the low performers are statistically different even though the median appears the same in this report.

Jez Humble

So the distribution probably has a long tail or something.
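A small illustration of that point, with invented restore times in hours: two groups can share a median while a long tail pulls their means far apart. The group names and numbers are made up, and no significance test is shown.

```python
from statistics import mean, median

# Invented time-to-restore samples, in hours.
medium_performers = [2, 4, 6, 8, 10]
low_performers = [2, 4, 6, 48, 120]  # same middle value, much longer tail

for name, hours in [("medium", medium_performers), ("low", low_performers)]:
    print(f"{name}: median={median(hours)}h mean={mean(hours)}h")
```

Both groups report a median of 6 hours, but the means are 6 and 36 hours, so a table of medians alone can hide a real difference between them.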

We see this in the data. Low performers do worse at all of these things. And this is across industries. We see large companies in the high-performing group. We see small companies in the low-performing group. We see people in government and healthcare in the high-performing group. We see startups in the low-performing group. It's not the case that just because you're big and in a regulated industry, you're going to be a low performer.

Dr. Nicole Forsgren

What he's saying is no excuses.

Jez Humble

Yeah.

Dr. Nicole Forsgren

Just get better.

Jez Humble

Stop whining.

Dr. Nicole Forsgren

Just do better.

Jez Humble

This is really hard, that's the other thing, doing all this stuff. It takes years and a large amount of investment. It's hard, but anyone can do it.

Dr. Nicole Forsgren

But it's worth it.

What we found is that IT matters. Firms in the high-performing group were twice as likely to exceed their profitability, market share, and productivity goals as firms in the low-performing group.

So we've been told for years that IT doesn't matter. Our data shows that it does matter. Again, we have this very clear definition of IT performance. We have a standard in-the-literature definition of organizational performance: profitability, market share, productivity. And we can show that the one predicts the other.

Jez Humble

Yep.

Dr. Nicole Forsgren

Sweet. Yes. IT performance is predictive of organizational performance.

Jez Humble

So how do you do this?

This is the model from this year. We basically expanded this model over a number of years. This is called a structural equation model. What this allows us to do is basically write this on a piece of card, use a tool to load the data, and actually, because we wrote the card down before we ran the data against it, it's theory-based, and we can actually...

Dr. Nicole Forsgren

And because they're in the literature and I spend months doing lit reviews that no one else wants to do, we know that these arrows should be here. We expect these arrows.

Jez Humble

And the data confirms that this is in fact the case.

We define continuous delivery in our model as being comprised of all of these things: effective test data management; comprehensive, fast, and reliable test and deployment automation; trunk-based development, which means everyone is working on branches that get checked into trunk or master on at least a daily basis. No one's working on long-lived feature branches that take more than a day to go into trunk.

Continuous integration, having everything required to reproduce your production system in version control, and having good practice around security. All these things together lead to low levels of deployment pain, high levels of IT performance, lower change fail rates, less rework.

We were trying to find this year a measure of quality, which we'll talk about in a minute. Our proxy variable for quality is the level of rework, and we found continuous delivery predicted low levels of rework. We'll talk a bit about identity in just a minute.

Dr. Nicole Forsgren

So, very quickly, some surprises. Take a look here and ask yourself which of these measure effective test practices. Again, if we're doing this statistical card-sorting task, which ones fall out? Which ones don't load? Which ones aren't the same thing?

Effective test practices: developers create and maintain acceptance tests. QA does the acceptance tests.

Jez Humble

Outsource people.

Dr. Nicole Forsgren

Outsource people. When tests pass, I'm confident the software is releasable. Test failures are likely to indicate a real defect.

Okay, do we have our theories? Do we have our guesses? Okay, here we go. Here's what we find. Those in red don't load. It doesn't work.

Jez Humble

And this one was particularly interesting. A lot of organizations want to take their QAs and teach them test automation. It doesn't work. My hypothesis around why it doesn't work, which is why I put this question in in the first place, is because if developers aren't involved in writing the automated tests, they won't write testable code. There's no feedback loop there.

You have to have the developers involved in the test automation and the testing because that's a feedback loop that encourages them to write testable code. The point of test-driven development is not the unit test you get out the other end. That's a nice byproduct. The point of test-driven development is it exerts a force on developers to write testable code.

So the data shows that I'm right. I love that.

Dr. Nicole Forsgren

Surveys are such a rich source of confirmation bias.

Jez Humble

Let's talk about unplanned work.

We've tried a number of different proxy variables for quality. Quality you can't measure objectively. It's inherently subjective. Quality, as Jerry Weinberg tells us, is value for someone. So you can't measure quality objectively. But we have proxy variables, and this was the one that actually worked out, that actually loaded.

Dr. Nicole Forsgren

We're looking at how much unplanned work, unplanned rework you end up having to do. High performers versus low performers.

High performers spend 20% more time on new work than low performers. So think about value creation in your organization. They also spend 22% less time on unplanned work and rework. So think about how much you have to redo work and how much time that gets spent there.

Jez Humble

Surprises with culture.

Google recently did a study where they looked at how to build a high-performing team at Google, and they were like, "Five Node developers and a product manager. Oh, well, that doesn't work." Eventually they came up with a model which had nothing to do, surprise, surprise, with tools or who does what or any of these other things.

So we wanted to validate that study. We also have a measure of how strongly you identify with your organization, which is from the literature.

Dr. Nicole Forsgren

From the literature, and as Jez mentioned, but I'm going to highlight a little bit, the cooler technical stuff you do, the more it contributes to your identity, which is, "I want to work for this company. This is an amazing place to work. I want to tell all my friends to work here." It's like an awesome NPS score, like a science NPS score.

Does this ring true? If you're doing super awesome work, do you want to stay working for your company? Who here is desperately trying to hire?

Okay, so many hands. So by making your technical practices better, that helps you hire, attract, and retain great quality people.

Jez Humble

And then we wanted to compare the results of that with the Westrum culture model that we'd already validated in previous years.

Dr. Nicole Forsgren

Here's a quick overview. The identity items on the left, the ones I just talked about: I'm glad I work for this company. This is a great place to be. The work I do is exciting. I want to keep working here. My values align.

The Google findings are on the right. As Jez mentioned, it doesn't have anything to do with technology or tools. It's all about psychological safety, dependability, the structure, and the clarity of the work. So we turned the Google findings into survey questions and survey items like we did before.

Here's the interesting surprise that we found: the Google items split in half. The top two load with the Westrum items. But think about it: Westrum's all about trust, information flow, psychological safety, dependability. The bottom items are all about the work that you're doing and the meaning that it has to you yourself and how you contribute to your organization. Those loaded with the identity items.

What we ended up doing was just retaining the Westrum items and the identity items in our research. We didn't keep the Google items in there. But that's also really interesting because that means that Google's findings have also validated other research in the area. It all comes down to...

Jez Humble

Reproducibility.

Dr. Nicole Forsgren

Yeah. Crazy science.

So we want psychological safety, a good environment to do our work, and we want work that is meaningful, that makes us want to stay there.

Jez Humble

So one minute on management.

This is one of the surprising things. Work in process, managing work in process: one of the key ideas of lean thinking. We want to manage WIP so we can get shorter lead times. Right? Obviously. In theory, right. I've heard this is a thing.

Turns out the correlation between managing work in process and IT performance is negligible.

Dr. Nicole Forsgren

Oh.

Jez Humble

What? Is Lean wrong?

What we actually found, we have a structural equation model we pulled out for Lean management, and again, we're defining Lean management as these things. These are models, so they are simplifications of reality. But they load, and the data confirms them.

WIP limits are only effective when used in combination with the use of visual displays to monitor quality, productivity, and work in process, and the use of application performance and infrastructure monitoring tools to make business decisions. You can't just impose WIP limits and achieve the effect. It's a combination of multiple things working together that gives you the impact on IT performance.

Dr. Nicole Forsgren

And then just as an FYI, it makes your culture better. It makes your IT performance better. It decreases burnout when...

So I think we're going to skip to the end now because we're out of time.

Jez Humble

Yep.

Dr. Nicole Forsgren

Okay. Conclusions. Even if you think it's obvious, test with data. If the results don't surprise you at all, you're doing it wrong. If every single thing you're doing surprises you, you're also doing it wrong. There should be a happy middle ground there.

Thing two, stability and throughput is not a zero-sum game. We are changing the game. We are not in a world of trade-offs.

IT matters. IT performance matters. It contributes to organizational performance, but you have to do it right. You have to have the technical piece. You have to have the process piece. You have to have the culture piece. If you have all three, it really benefits the organization.

We have a company. We do cool stuff. We can come and measure you. Check out our website.

Thank you very much.

Jez Humble

Yeah. Thank you.

Q&A

Q: Since I have the mic, we're going on an afternoon break now, but since we have kind of a long break, if we want to take a couple of questions, I'm sure they'd be glad to answer. Anyone?

Jez Humble: One or two super short ones, though, because we've got to run.

Q: Eric. So, super short question. IT performance relating to awesome company results. I wondered if you did that in a longitudinal way, because I wondered if you followed any companies that sort of dropped out or whether you just took a single year and looked at that.

Dr. Nicole Forsgren: I don't have longitudinal data yet.

Q: Okay.

Jez Humble: We validated it in multiple years, though.

Dr. Nicole Forsgren: Yes. It's been validated three years in a row.

Q: Got it.

Dr. Nicole Forsgren: Thanks, everyone.