The State of DevOps - Capabilities for Building High-performing Technology Teams
Technology drives value and innovation in every organization. At Google Cloud, we have learned a lot about what it takes to build and scale high-performing technology teams. Our own lived experience combined with a multi-year research program led by the DevOps Research and Assessment (DORA) team can be used to help you and your team transform into a high-performing technology team. This talk will dive into some of the findings of the 2022 DORA research program. We will couple these findings with stories from the field about how teams are putting these ideas into practice. There will be success stories and cautionary tales: let's all learn from one another. Spoiler alert! The best teams focus on getting better at getting better. You can do this, too!
Full transcript
The complete talk, organized by section.
Amanda Lewis and Nathen Harvey
Amanda Lewis: Good afternoon, everyone. Quick show of hands: how many of you have read the State of DevOps Report? All right. The 2022 Accelerate State of DevOps Report came out a few weeks ago, and today we're going to share some of the insights and highlights from it. I thought I would start, and I kind of had a different idea for today.
Nathen Harvey: Oh, what is this? Amanda, this is not what we were going to do. I thought some people had already read the report. How many people picked up the report today, since you've been here, and have a hard copy? One of the highlights in the report is that context matters, so I thought maybe, since we're together, we could have story time with Nathen and Amanda. No report reading.
Amanda Lewis: But I was going to dramatically read the entire thing.
Nathen Harvey: I know. I'm saving you.
Amanda Lewis: Fine. We can do story time on one condition.
Nathen Harvey: What's that?
Amanda Lewis: The story that we're about to tell is completely fictional. You all have to know that it's not about where we work, and it's not about where you work, though you might see some reflections there.
Nathen Harvey: The Google lawyers did tell us we had to include some disclaimers.
Amanda Lewis: Yes. This story, all names, characters, and incidents portrayed in this production are fictitious. No identification with actual persons, living or deceased, places, buildings, and products is intended or should be inferred.
Nathen Harvey: Also important: no animals were harmed in the telling of this story.
Amanda Lewis: There's still time. I think we have 23 minutes. I have one more disclaimer, a warning for you and the audience. The topic that I chose for today might cause some terrible flashbacks.
Nathen Harvey: Oh no. Do I need to sit down?
Amanda Lewis: There are no chairs, but we are having story time. If anybody in the back wants to come up and sit up front, please do. Today's topic is, everybody...
Nathen Harvey: I don't get a drum roll?
Amanda Lewis: Log4Shell.
Nathen Harvey: Who's pumped? No? Again?
Amanda Lewis: I think it's important to start with a timeline. Close your eyes. Go back to December 10th, 2021. Where were you when you first heard about the Log4Shell CVE?
Nathen Harvey: December 10th, 2021. It was a Friday. It was December. At my organization we love DevOps, we love Twitter, and we have a strict hashtag no-deploy-Fridays policy. It was Friday, and I wasn't going to deploy anything. I was a little bit behind on holiday shopping, so it was going to be a light, easy day. I might have coffee in the morning and then jet out.
Amanda Lewis: I went back and looked at your calendar, and you had a pretty simple Friday planned.
Nathen Harvey: I have a good memory.
Amanda Lewis: Unfortunately, it sounds like that's not how it turned out. Walk me through what really happened.
Nathen Harvey: I read about this CVE on Twitter and immediately went through the five stages of denial. First I had denial. Then I got angry: it was probably a real issue. Then I did some bargaining, then I had depression, and finally I accepted my fate. I knew this CVE was going to change virtually everything about my weekend.
Amanda Lewis: So it was not a straight line.
Nathen Harvey: No. It was not a straight line. The CVE felt a lot more like this. I picked up the phone, called my wife and family, and said, sorry, everything about this weekend is going to be put on hold. We have a real problem at work. Then I hung up, went to my toolbox, and pulled out one of my favorite tools for this sort of scenario.
Amanda Lewis: What tool is that?
Nathen Harvey: The OODA loop. Observe, orient, decide, and act. Unfortunately, I had already observed that there was a CVE. Now I had to orient myself: where are we impacted across our systems? Then figure out what to do, and then go do that thing.
Amanda Lewis: How large was your production system? What did this look like?
Nathen Harvey: We had about 400 different systems in production. Since most of them are Java, I pretty much knew right away that they probably used Log4j and that Log4Shell was going to affect them.
Amanda Lewis: Four hundred systems. How long did it take to assess where you were impacted?
Nathen Harvey: I don't know if you've heard of all the latest and greatest software supply-chain security tooling in the industry, but it only took me about two minutes, because I went to our trusty SBOMs, our software bills of materials, looked through all of them for those 400 systems, and knew exactly where and how we were impacted. If only it worked that way.
The truth is, we knew about this kind of work, but oddly enough we had not been able to prioritize it. Features, features, features, am I right? Since I did not know everything about these 400 applications, I had to start making calls and pulling in subject-matter experts. I knew it was going to ruin the weekend for all of us, because there would be a lot of manual inspection. By Monday morning, we had identified two primary applications we needed to fix right away.
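The two-minute SBOM lookup imagined above could look roughly like the following. This is a minimal sketch, not the tooling from the story: it assumes each system has an SBOM already parsed into a dictionary with a CycloneDX-style `components` list of `name` and `version` entries, and it treats every `log4j-core` 2.x release before 2.15.0 as exposed. The system names and the helper names `parse_version`, `is_exposed`, and `exposed_systems` are hypothetical.

```python
import re

# Assumption for this sketch: log4j-core 2.x releases before 2.15.0 are
# treated as exposed to Log4Shell (CVE-2021-44228).
FIRST_FIXED = (2, 15, 0)


def parse_version(text):
    """Convert '2.14.1' to (2, 14, 1), keeping only the leading digits of each part."""
    numbers = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def is_exposed(component):
    """True if the component is log4j-core 2.x older than the first fixed release."""
    if component.get("name") != "log4j-core":
        return False
    version = parse_version(component.get("version", ""))
    return (2,) <= version < FIRST_FIXED


def exposed_systems(sboms):
    """Return the sorted names of systems whose SBOM lists an exposed component."""
    return sorted(
        name
        for name, sbom in sboms.items()
        if any(is_exposed(c) for c in sbom.get("components", []))
    )


# Hypothetical inventory: system name -> parsed SBOM document.
inventory = {
    "order-management": {"components": [{"name": "log4j-core", "version": "2.14.1"}]},
    "storefront-cart": {"components": [{"name": "log4j-core", "version": "2.17.1"}]},
    "legacy-batch": {"components": [{"name": "log4j", "version": "1.2.17"}]},
}
print(exposed_systems(inventory))
```

A real inventory would also need to cover transitive and shaded dependencies, which is part of why the manual inspection in the story took a whole weekend.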
Amanda Lewis: What were those two applications?
Nathen Harvey: We're an online retailer. We have an application that has been around for about 12 years. It is the heart of the organization: our order management system. It sits in the middle of everything. Our business cannot run without it, so we knew we had to fix that. We also knew customers come to us through our e-commerce front end. We had a microservices application that makes up that front end, and we knew we had to fix that too. The e-commerce application and the order management system were our first priorities.
Amanda Lewis: Walk me through how you approached and solved those two.
Nathen Harvey: Which one should we start with?
Amanda Lewis: E-commerce. I'm going to guess the order management system was gnarly, so let's start with the easy one. That's where the order starts anyway.
Nathen Harvey: It is the front door. Our e-commerce site had been around for about two years. Two years ago we knew we needed to rebuild it, so we did the sensible thing. We hired consultants and said, build us that microservices architecture everyone is talking about. They built us a great microservices architecture for our e-commerce front end, and then they left. We paid for functionality, not for knowledge or documentation. We got the functionality we paid for. It was awesome.
The nice thing was that we did not have to do much with that system. Retail vendors and suppliers could manage the content there, so we pretty much left it alone. We might touch it once or twice a year.
Amanda Lewis: At the time the tradeoff made sense. Bringing in a vendor or partner can be successful if we upskill the team and don't just walk away. Since you updated it a couple of times a year, did you have access to the code or did you have to reach out to the vendor?
Nathen Harvey: That would be even worse. They gave us the code, and that part was great. We were happy to have the code. But it was about 27 different microservices, which equated to 27 different repositories in our version-control system. We had to dig through all of those repositories and figure out where and what needed to be updated. We identified that just about every one of those microservices needed to be updated.
Amanda Lewis: You had only been making updates a couple of times a year. Automated build process?
Nathen Harvey: What's that? We had these 27 things, found where the library or dependency needed to be updated, updated one dependency, and then it was a manual build process. We picked up the thing we built, put it into the test environment, opened the website, and started banging away and testing. It was all manual.
Amanda Lewis: I'm a little concerned this story is going downhill. There's going to be a lot of failure up ahead.
Nathen Harvey: There might be. Danger ahead.
Amanda Lewis: When you pushed it, what happened?
Nathen Harvey: We started with one microservice, because one thing at a time; we should build iteratively. We updated one microservice, did the real process, put it onto our staging environment, opened up the application, and got a 500. That was not the number of orders we saw; that was the response from the server. We updated one microservice and it broke the whole thing. As it turned out, we could not update this microservices application one microservice at a time. We had to manually build and test all 27 components. When we finally updated all 27 of them, things started working beautifully. That was Thursday-ish.
Amanda Lewis: That does not sound like a very fun weekend.
Nathen Harvey: Remember, we spent the entire weekend assessing how badly we were affected. We did not fix anything over the weekend. No fixes started until Monday.
Amanda Lewis: So it took about a week.
Nathen Harvey: By Thursday everything was ready to go. But remember hashtag no-deploy-Fridays. That thing had to sit there. We only deploy once or twice a year, and our change approval board never trusts the changes we take to them. They only meet on Tuesdays and Thursdays, but in an emergency we can ask them to meet sooner. They met on Monday. We went to them with what we had, and they said: you deploy this thing once or twice a year. Actually, you deploy it two or four times a year because every time you deploy it, it breaks, you roll it back, and then you deploy it again. So they told us to do more testing. Good thing too, because we found a couple other things that broke. On Tuesday we fixed everything, took it back to the change approval board, and they finally let us go. Then it was fixed.
Amanda Lewis: That was the e-commerce front-end microservices application. That was the easy one, right?
Nathen Harvey: That's how I feel right now.
Amanda Lewis: Did your team look like this?
Nathen Harvey: Pretty sure. They were super burnt out and maybe not coming back. The team on Wednesday was super crispy. It had been a long weekend. They were doing long hours. They were stressed because they knew there was a vulnerability in production and we were at risk of all kinds of terrible things happening. They also continued to struggle with: after I write the change, is it going to work? I don't even know if it is going to work or if I can get my changes approved. It was terrible.
Amanda Lewis: How are they doing now? It has been a year.
Nathen Harvey: They are doing a little better. The scenario exposed some of the ways we were working and where we had opportunities to improve. The e-commerce team could look around to other teams in the organization and see that not every team had such a bad time. Other teams did okay.
Amanda Lewis: Now we're going to talk about the order management system. I am going to guess the order management team was one of those other teams you were pointing to. Walk us through what happened.
Nathen Harvey: The order management system has been around about 12 years. It does not follow a microservices architecture. I would call it a macroservices architecture. There are about three big components: the O, the M, and the S. We went to the first one, the O, found where the dependency was, updated the library, and committed the code. Continuous integration kicked off, automated tests kicked off, and we went to the change approval board with green tests and a vetted pipeline. They said, ship it. So we shipped it.
Amanda Lewis: You said there were three components, but you only updated one.
Nathen Harvey: Sure, but we can deploy them independently. On Tuesday, we started with the M. We picked up the next component and ran tests. Part of the team started working on the third component while tests were running, but our tests failed. The test turned red.
Amanda Lewis: I remember you talking about this team. Wasn't this the team that always prioritizes broken builds?
Nathen Harvey: Whenever a build breaks, they stop what they're doing. We pulled the team off the S, and everyone dug into the M to figure out what was broken. That bug was elusive. It took the entire day to find it. By the end of the day, we had found the bug and made the updates, CI and the automated tests had run, and we were ready to go back to the change approval board. Ship it, then move on to the third component.
Amanda Lewis: Even in this situation, with everyone talking about Log4Shell, they still prioritized the broken build?
Nathen Harvey: Yes. The build is broken. We have to fix it. We can't just ship something bad.
Amanda Lewis: Can we clap for that team? They are a good team. Raise your hand if either of those teams reflects anyone in your organization or a past organization you worked with. Obviously, they're not on the team you are on now.
Nathen Harvey: You've all felt that before.
Amanda Lewis: You've given me a lot of good information. I think I can help.
Nathen Harvey: You can help how? I want our web e-commerce team to feel as empowered and as good as our order management team. How can you help?
Amanda Lewis: DORA.
Nathen Harvey: Dora the Explorer? No? Oh, right, the Digital Operational Resilience Act that the EU recently passed. That's what's going to get us there. DORA.
Amanda Lewis: Sorry, Nathen, not today. Today we're going to talk about this DORA: DevOps Research and Assessment.
Nathen Harvey: Excellent. Who is familiar with DORA? A lot of hands in the room. It is an ongoing research program that has been around for about eight years. It has been funded by different organizations over the years, and for a few years it was funded by the organization called DORA, founded by Dr. Nicole Forsgren, Jez Humble, and Gene Kim. In 2018, DORA the company was acquired by Google Cloud, and the DORA team at Google continues the research into the capabilities and practices that we know predict the outcomes central to DevOps. The research has remained platform- and tool-agnostic. For myself, I was a huge DORA fan. I remember the day I found out we were acquiring them. It has been exciting to work with the research team, not only because of the learnings each year, but because I have learned a lot about the research practice: the ethics and the passion they bring to it.
Amanda Lewis: Great. Now I get to read the report.
Nathen Harvey: We do not have time for me to read through the report. Over the years, this research has dug into technical, process, and cultural capabilities that help teams perform. One thing we found is that technology alone is not enough. We have to look at process and how people in the system come together to drive performance. Through analysis and research we are able to make predictions. As teams get better at particular capabilities, that is predictive of them getting better at software delivery and operations performance. Most importantly, as you get better at software delivery and operations performance, you help your organization get better and drive organizational performance.
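To make "software delivery performance" concrete, here is a small worked example using the numbers from the e-commerce story: four deployments in a year, two of which broke and were rolled back. It is only a sketch, assuming a hypothetical `Deployment` record and covering just two delivery measures, deployment frequency and change failure rate; the dates are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Deployment:
    day: date
    failed: bool  # True if the change broke production and was rolled back


def deployment_frequency_per_year(deployments, period_days):
    """Deployments per year, scaled from the length of the observed period."""
    return len(deployments) * 365 / period_days


def change_failure_rate(deployments):
    """Fraction of deployments that failed; 0.0 when there were none."""
    if not deployments:
        return 0.0
    return sum(1 for d in deployments if d.failed) / len(deployments)


# Hypothetical history for the e-commerce front end over one year:
# each planned release fails once, is rolled back, then is deployed again.
history = [
    Deployment(date(2021, 3, 2), failed=True),
    Deployment(date(2021, 3, 9), failed=False),
    Deployment(date(2021, 9, 14), failed=True),
    Deployment(date(2021, 9, 21), failed=False),
]

print(deployment_frequency_per_year(history, period_days=365))
print(change_failure_rate(history))
```

Seen this way, the change approval board's caution in the story is unsurprising: half of that team's deployments had failed.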
Amanda Lewis: So if you start with trunk-based development, you can earn two pennies more per share for your organization. Trunk-based development is what you should do next.
Nathen Harvey: So it's like a maturity model with a built-in roadmap?
Amanda Lewis: Nicole Forsgren would not be happy with you using those words, or with me saying that is exactly what it is. Context matters. There is no well-paved roadmap. There is no one-size-fits-all path. We have to think about how our team is doing relative to those capabilities and where we should make improvements.
Nathen Harvey: Year after year we have seen that delivery performance drives organizational performance. What is new this year is that delivery performance drives organizational performance only when operational performance is also high. Speaking of operational performance, we use reliability as its measure. Reliability is multifaceted. There is not one way to measure it, because at the end of the day, reliability is about our teams' ability to keep the promises we make to customers. We look at practices like regular reliability reviews, whether teams have reliability goals, and whether they reprioritize work when they are not meeting those goals. One fascinating finding this year is that reliability helps drive organizational and operational performance only as teams mature. Teams early in their reliability journey, doing only one or two practices, often lost some reliability initially. As they matured their practices, approach, and capabilities, they saw the J-curve of transformation, the inflection point where reliability started to go up. The key lesson is that when you start, you are likely to trip and fall. Get back up and keep moving forward. You have to stick with it.
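The practice of setting a reliability goal and reprioritizing work when it is missed can be sketched in a few lines. This is an illustration only: it assumes the goal is expressed as a target ratio of successful requests, and the function names and numbers are hypothetical rather than anything taken from the report.

```python
def meets_reliability_goal(successful, total, target):
    """True if the observed success ratio is at or above the target ratio."""
    if total == 0:
        return True  # no traffic observed, so nothing has violated the goal
    return successful / total >= target


def next_priority(successful, total, target):
    """Suggest where the team's next block of work should go."""
    if meets_reliability_goal(successful, total, target):
        return "feature work"
    return "reliability work"


# Hypothetical month: the goal is 99.9% successful requests.
print(next_priority(successful=99_950, total=100_000, target=0.999))
print(next_priority(successful=99_800, total=100_000, target=0.999))
```

The point of the sketch is the second function: the goal only matters if missing it actually changes what the team works on next.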
Amanda Lewis: In that same way, if you start those reliability practices, you are layering them on top of each other. Technical capabilities build on one another. In almost every talk over the last two days we have seen this: version control, continuous integration on top of that, continuous delivery, and loosely coupled architecture. Those organizations are more successful. The research shows 3.8 times higher organizational performance.
Nathen Harvey: Another key focus this year was the software supply chain, specifically the security practices and capabilities around securing it. We found good insights. Adoption has already begun. Maybe not surprisingly to this room, the leading predictor of teams that were best at those security capabilities was a high-trust, generative culture: high information flows, learning from incidents, and so forth. Those teams had a head start.
Nathen Harvey: Amanda says I have to move on because we are running out of time. When teams shift left on security and also embrace continuous integration, the two reinforce each other: teams with above-average continuous integration capabilities and above-average security practices had the highest organizational performance relative to their peers. These things have to go together. In our fictional story, we saw that play out. Even the application with a modern microservices architecture lacked continuous integration, and that hurt us when we had to address a security vulnerability. The other application, which had been around a long time and used a more traditional architecture, had prioritized how they do software delivery, and that paid off. The change approval board existed in both cases. In one it held them back and caused frustration and burnout. In the other it helped them get that out in three days.
Amanda Lewis: So when it comes to the State of DevOps Report and the entire body of research around DORA, the number one takeaway is that your team context matters. You have to start by looking at the capabilities: technical, process, and cultural capabilities. You can use this model, our structural equation model, which lays out the predictions and predictive analysis we see. Capabilities build on one another and lead to good outcomes. As a team, do an open, objective assessment of yourself and your capabilities. Where are you strong? Where do you have opportunities to improve? Then make the investments where you have opportunities to improve. Embrace the theory of constraints and make improvements where you are being held back.
Nathen Harvey: I recognize we did not introduce ourselves at the beginning, so I guess we will do it here at the end. First, shout out to my daughter Olivia for the cartoon drawing of us.
Amanda Lewis: Some family and friends think what Nathen and I do is watch Dora the Explorer episodes and talk about them all day long, so this was her depiction of us. I'm Amanda Lewis, a developer advocate focused on DORA, bringing the community together and helping people use the research.
Nathen Harvey: Hi. My name is Nathen Harvey. I'm also a developer advocate. You can follow me on Twitter there; just misspell my first name like my father did.
Amanda Lewis: A couple of weeks ago, we kicked off a community of practice. It is new and we are iterating, but we ask that you go to DORA.community and join us. We are having community conversations. We have had two so far. We start with some context and then do a lean coffee discussion. We would love anyone here who wants to lead one to reach out to us, because we do not want to be the ones kicking off all the conversation. Without you, DORA would not exist. You fill out the surveys, give us the information, share your stories, and we want to connect everyone so we can keep improving together. The community is about bringing together practitioners, leaders, and researchers in this movement so we can all help each other get better at getting better. Thank you so much for joining us today. I have copies of the report here, and I'm happy to go out in the hall and read it to you if you would like a real story time. Otherwise, you can download it from that QR code as well.
Nathen Harvey: Thank you so much.