The State of DevOps - Capabilities for Building High-performing Technology Teams

Virtual US 2022

Nathen Harvey

Cloud Developer Advocate · Google

Amanda Lewis

Developer Advocate · Google

Technology drives value and innovation in every organization. At Google Cloud, we have learned a lot about what it takes to build and scale high-performing technology teams. Our own lived experience combined with a multi-year research program led by the DevOps Research and Assessment (DORA) team can be used to help you and your team transform into a high-performing technology team.

This talk will dive into some of the findings of the 2022 DORA research program.

We will couple these findings with stories from the field about how teams are putting these ideas into practice. There will be success stories and cautionary tales: let's all learn from one another.

Spoiler alert! The best teams focus on getting better at getting better. You can do this, too!

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

Thank you, George. So anyone who was involved in the DevOps community, say, 10 years ago will recognize Nathen Harvey's name, as his contributions were so visible, especially in the Velocity community, which was partly his remit as the Vice President of Community Development at Chef. In more recent years, he's been part of the DevOps Research and Assessment project at Google Cloud, which became the new home of the State of DevOps Report, which Google Cloud acquired in 2018. I've mentioned before that the State of DevOps research, working with Dr. Nicole Forsgren and Jez Humble, is one of the things that I'm most professionally proud of, and I have so much admired and appreciated the work that Google Cloud has continued to do, helping validate previous findings and researching new frontiers. Incidentally, one of my favorite new findings is how the use of trunk-based development increases the likelihood of being a high performer by 2.5x. So Nathen is currently a developer advocate at Google Cloud, and he presented on the latest findings from the State of DevOps research at the DevOps Enterprise Summit in Las Vegas a couple weeks ago. He co-presented with Amanda Lewis, another Google developer advocate, who joined Google as part of their Stackdriver acquisition. Their presentation at DevOps Enterprise Summit Las Vegas was one of my favorites because they tell a tale of two cities against a backdrop of Log4j, with some incredible twists. So here to tell the story are Amanda and Nathen.

Amanda Lewis and Nathen Harvey

[~1:35] Here it is: the 2022 Accelerate State of DevOps Report. I am so excited to dig right into the findings. Let's cover them. Let's start reading right away. It starts here with the executive summary. Hey, Nathen. Yeah, the report. It's exciting. It's awesome. And as much as I'd love to hear you read that executive summary, I was thinking that, since one of the key findings this year is that context matters, instead of you just reading out all of the highlights, which would be fantastic, could we instead bring this report to life with story time? Okay, so I don't get to read it, but we're gonna bring it to life with a story. Okay, as long as it's clear, Amanda, to you and to me and to everyone out there: this is a fictional story. Yes, and along those lines, we really probably should add this disclaimer for the G lawyers. Could you just quickly read it for me? Yes. The story, all names, characters, and incidents portrayed in this production are fictitious. No identification with actual persons (living or deceased), places, buildings, and products is intended or should be inferred.

[~2:47] And additionally, no animals were harmed in the telling of this story. Now Nathen, since it's December, I thought we would choose a topic that would fit in with one of my favorite Christmas movies, The Nightmare Before Christmas. So for today's story time, I chose Log4Shell. Don't you think it is, like, one of the greatest gifts we received in December of 2021? Well, you could call it a gift. What else could we call it? Let's think back. Where were you on December 10th, 2021? Wait, hold on. Before you start telling me that, I'm gonna grab my whiteboard so I can take some notes while you tell me all about how you experienced this gift. Okay. Well, December 2021. I came into the office planning and expecting a very light day, because it's December after all. Oh, and it was a Friday, and you know our organization: #NoDeployFridays. So it was a pretty light day, and I'm sure I had some

[~3:58] holiday shopping I had to get caught up on. Okay, so your Friday was looking like this, but then it changed. Yes, in fact, everything changed. Would you mind holding my coffee for me? I would love to. Let me just set that down so I can take a few more notes. All right. It's December 10th. Walk me through what you did. You read the CVE, then what happens? Okay, so I love that you've drawn this as a dot, like a point in time, but the truth is it was much more like a roller coaster. I went through the five stages of grief: denial, anger, bargaining, depression. But finally, acceptance: another CVE was going to change virtually everything about my weekend plans. Oh, so what I'm seeing is that this really wasn't a straight line for you. It wasn't a moment in time. Like you said, this is a roller coaster. So you got up to acceptance. And then what did you do?

[~5:03] Oh, well, then I picked up my phone, called my family, and let them know that another CVE was going to change virtually everything about my plans for the weekend. We needed a plan, and I went to one of my favorite tools in a situation like this: the OODA loop. You know, the OODA loop. Can you remind me? Sometimes I get the O's and the D's screwed up. Yeah, of course, Amanda. The OODA loop: observe, orient, decide, act, then repeat. So when we observed, we observed that there was a vulnerability. We then had to orient: how many production systems were likely impacted by this? And then we decided what to do, which was obvious. We were going to fix the vulnerability, and then we were going to get to work. But at this point in the story, we've observed and we're now starting to orient. So as you were orienting and figuring out which systems were impacted, how many production systems did you discover were impacted by this? Yeah. So we started: there was that one, there were two there. Actually, it was about 400 production systems. It felt like most of them were going to be impacted as well.

[~6:13] Wow, now, that's a gift. I mean, it's like 400 gifts. How long did it take you to assess that many systems? Oh, 400 systems, no problem. We got them all assessed and oriented in about two minutes. We just did some querying through our SBOMs to find out which would be impacted. You know, SBOMs, the software bill of materials. We obviously keep those up to date and accurate for all of our systems. So Nathen, I've got to tell you, I'm shocked right now. This is really amazing. So you could do that in two minutes. And then what did you do next? Oh, man, I do wish it had been that simple. We recognize that SBOMs are important, but the truth is we just haven't been able to prioritize getting SBOMs in place for all of our applications. So what really happened was we ended up doing manual inspection for these 400 systems. But of course, no one person can do that alone. No one actually understands all of these systems. We had to call in subject matter experts from across the organization.
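The "two-minute" SBOM query the story wishes for could look roughly like the sketch below. It is only an illustration under stated assumptions: each system's SBOM is taken to be an already-parsed JSON document with a CycloneDX-style `components` list of `name` and `version` entries, the system names are invented, and `FIXED_VERSION` treats log4j-core 2.15.0 as the first release that addressed the original Log4Shell CVE (later releases fixed follow-up issues, so a real check would be stricter).

```python
import re

# Assumption: first log4j-core release treated as "fixed" for the original CVE.
FIXED_VERSION = (2, 15, 0)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.14.1' into (2, 14, 1); '2.0-beta9' becomes (2, 0, 0)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad to three parts so '2.15' compares equal to '2.15.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def find_impacted(sboms: dict[str, dict]) -> list[str]:
    """Return the names of systems whose SBOM lists an unfixed log4j-core."""
    impacted = []
    for system, sbom in sboms.items():
        for component in sbom.get("components", []):
            if (component.get("name") == "log4j-core"
                    and parse_version(component.get("version", "0")) < FIXED_VERSION):
                impacted.append(system)
                break
    return sorted(impacted)


# Hypothetical SBOM data for three made-up systems.
sboms = {
    "order-management": {"components": [{"name": "log4j-core", "version": "2.14.1"}]},
    "ecommerce-cart": {"components": [{"name": "log4j-core", "version": "2.17.1"}]},
    "inventory-report": {"components": [{"name": "guava", "version": "31.0"}]},
}

impacted = find_impacted(sboms)
print(impacted)
```

With accurate SBOMs for all 400 systems, a query like this replaces the weekend of manual inspection described in the story; without them, there is nothing to query.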

[~7:18] And we had to ask a lot of them to do work over the weekend. But by Monday morning, we had done an assessment and identified two applications that were the most critical, and we knew we needed to fix those first. Wow, so I am really feeling for your team right now, especially so close to the holidays. So which two applications were the most critical for your business? Yeah. Well, we have our order management system. It's the heart of our business. It's been around forever. But if this is offline, customers can't buy anything, we can't ship anything, and we can't manage our inventory. The other system was our primary e-commerce system. It's the front end of our business. It's where our customers come to purchase items. If it's down or compromised, they can't buy anything. That's no good for any of us. Now, when I think about these two applications,

[~8:12] let's start with discussing the e-commerce website. I expect it was easier to tackle than the order management system, and since an order starts there, it seems like a great next place for us to go in the story. Yes, but you see, we didn't actually have an easy resolution for the website. Decisions were made a few years ago that kind of came back to haunt us. You see, when we built the system, we knew we wanted a modern architecture using microservices. But our team didn't have any experience with microservices. It's okay, though, because we hired consultants and we used a vendor to help us build and ship the site. Ultimately, though, we paid for functionality, not for knowledge or documentation. And I can imagine at the time this trade-off made sense, right? I think bringing in that vendor or partner can be an awesome solution, especially when we do it in collaboration with our team, the organization we're working within, so that in the end that team is upskilled and they can continue on after the engagement. So it sounds like, obviously, in this case that wasn't what happened.

[~9:25] So do you have access to the code? Are you gonna need to work with that vendor to make these updates? Well, the website does run on our own infrastructure and it is in source control, which is great. I mentioned it's microservices, right? So it's spread across 27 different source code repositories. Now, the good thing about the way that they built this is that our marketing team has a user interface that they can use to manage inventory, to add specials and sales, and all of this, so we don't actually make a whole lot of changes to this application. In fact, we only make about one or two changes a year, and every time we do that, we have a lot of strategic planning, because it usually takes us about two months to validate everything and so forth. So it's not an automated process by any stretch of the imagination. We have to build everything manually, test it manually, and so forth.

[~10:18] Okay, so if you've only been making updates to the application a couple times a year, and without that automated build process and testing, I can only imagine that the likelihood of failure in this situation is going to be really high. Oh, not just the likelihood of failure, Amanda. We had actually experienced it. I mentioned that we only updated it once or twice a year. The truth is it was more like two to four times a year, because every time we tried to update it, something broke. Well, now we have this Log4j vulnerability that we need to patch. Well, it's microservices, so we'll do one microservice at a time. That's the beauty of microservices. So we updated the first microservice. We had to do a manual build. We took it and deployed it into our test environment, and we opened up the e-commerce front end on our staging environment. And guess what?

[~11:12] 500. Yes, it was broken. It just didn't work. So now you knew that everything's broken. I'm a little afraid to ask: how long did it take you to fix it? Well, as it turned out, we couldn't deploy the microservices independently. So we had to go through and update all 27 of them and then get those all deployed into our test environment, and then we would be ready to go to production. Wow, that does not sound like a very fun weekend, Nathen. Oh, no, no. Remember, Amanda, it took us all weekend to orient, to figure out which systems were likely to be compromised. We didn't even start this work until Monday. It definitely took us the full week, though, to go through all 27 different services and get everything into a test environment. Okay. So it took you about a week until everything was finished. Of course, a week. You know what happens a week after Friday? Another #NoDeployFriday, so we couldn't deploy on Friday, and this application obviously has to go through our change approval board, which only meets on Tuesdays and Thursdays. Luckily, we were able to convince the change approval board that this was urgent. We had to call an emergency meeting, so they met on Monday.
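The calendar arithmetic behind that delay can be sketched in a few lines. This is an illustration only: the function names are made up, and the "ready" date of Friday, December 17, 2021 is inferred from the story (work began the Monday after December 10 and took a full week), not stated outright.

```python
from datetime import date, timedelta

# From the story: the change approval board meets on Tuesdays and Thursdays.
# Python numbers weekdays from Monday = 0, so Tuesday = 1 and Thursday = 3.
CAB_WEEKDAYS = {1, 3}
FRIDAY = 4


def deploy_allowed(day: date) -> bool:
    """#NoDeployFridays: any day except Friday is allowed."""
    return day.weekday() != FRIDAY


def next_cab_meeting(ready: date) -> date:
    """First regular CAB meeting on or after the day the change is ready."""
    day = ready
    while day.weekday() not in CAB_WEEKDAYS:
        day += timedelta(days=1)
    return day


# Assumption: the e-commerce fix was ready on Friday, December 17, 2021.
ready = date(2021, 12, 17)
print(deploy_allowed(ready))    # Friday, so no deploy
print(next_cab_meeting(ready))  # next regular meeting
```

Under the regular schedule the earliest review would have been the following Tuesday, which is why the team had to ask for an emergency meeting on Monday.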

[~12:32] But they were not very happy with what they saw. In fact, they've been burned so many times by this. They told us we were not allowed to deploy yet. Turns out that was a pretty good thing, because, as you might remember, one vulnerability deserves another, so there were actually lots of Log4j releases that happened in pretty rapid succession. The CAB told us we had to hold off on deploying until things stabilized. That way we could batch up and do everything in one release. Turns out that month there were four updates, and the last one didn't land until the 28th of December. So it's like a post-Christmas gift. That doesn't sound really great at all. So what I'm hearing is, in the end, it really took the team over 20 days where they had to kind of sit with the stress of getting this all updated. Yeah. It was not a very fun December.

[~13:36] I have to tell you, Nathen, right now, as you keep telling the story, my stress level just kept going up and up, and I've got to ask: how was the team feeling? Were they getting burnt out during this? Oh, yes, everyone on the e-commerce team was definitely feeling crispy after that big deployment at the end of the month. Long hours obviously contributed, the stress that you mentioned contributed, and so did the opaqueness of the approval process. What was it going to take to get from making the change to getting that change approved, much less deployed and working? Yeah, that really sounds rough. So I guess I'm kind of curious: how has the team fared since? I'm thinking back just a few weeks ago, when the OpenSSL CVE came out. Was it easier for them this time around? It was easier because the team decided, forget SSL, we don't need... No, no, they didn't do that. But the team has definitely learned a lot, and this year we were actually spared a little bit because our OpenSSL versions on the front end are still on the 1.1 branch. So we dodged that bullet, if you will. Having advance warning of the pending change helped, but it also reminded us of some of the progress that we still need to make. This, after all, is a journey. Yeah, although I think after hearing everything they went through two years ago, I'm glad that this time they didn't have to re-experience that.

[~15:11] So that was a great recap of everything the e-commerce team went through. What about the order management system? You've said that system is the heart of the business, and I imagine, since it's been around forever, that it was even slower to update than a microservices-based front end. I'm a little bit nervous to even ask you to walk me through this. Well, I can see exactly how you would think that. As you mentioned, it's older and it's larger. It follows more of a macroservices architectural pattern than a microservices one. But unlike the e-commerce system, the OMS is something that our internal teams have been actively developing over the years. In fact, just in the last two years alone, we were able to go from quarterly releases to deploying updates to the system on about a weekly basis. So in many respects this team was better prepared when the vulnerability was announced.

[~16:09] Wow, I mean, that is surprising. So how did it go when they got started? Well, on Monday morning the team identified the three components that were impacted within the OMS. They upgraded the Log4j library in one of the three components, and their continuous integration process automatically kicked in. A JAR file was built, and some automated tests were run on that. It was automatically deployed to a test environment, where some additional tests were run. Then the team took the certified pipeline and the green, passing tests to the change approval board, and it was basically rubber-stamped. "Ship it," they said, and the team did so. Okay, so wait. You shipped it? But didn't you say that there were three components and you only updated one? Yes, yes, but the components were all built in a way that they can be independently deployed and tested. So everyone is also very comfortable with this, because that's how we've been doing things in practice for well over a year.
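The flow described here (build, automated tests, deploy to a test environment, additional tests, with everything stopping at the first red stage) can be sketched as a tiny stage runner. This is a toy model, not the team's actual tooling: the stage names paraphrase the story, and each stage is a placeholder function that simply reports success or failure.

```python
from typing import Callable

# A stage is a name plus a zero-argument function returning True on success.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order and stop at the first failure."""
    log = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'passed' if ok else 'FAILED'}")
        if not ok:
            break  # later stages never run against a broken build
    return log


# Hypothetical pipeline for the first OMS component: every stage is green.
first_component: list[Stage] = [
    ("build JAR", lambda: True),
    ("automated tests", lambda: True),
    ("deploy to test environment", lambda: True),
    ("additional tests", lambda: True),
]

# Hypothetical pipeline for a component whose tests fail after the upgrade.
second_component: list[Stage] = [
    ("build JAR", lambda: True),
    ("automated tests", lambda: False),
    ("deploy to test environment", lambda: True),
    ("additional tests", lambda: True),
]

first_log = run_pipeline(first_component)
second_log = run_pipeline(second_component)
print(first_log)
print(second_log)
```

A fully green log is the kind of evidence the team could take to the change approval board; a red stage halts the run before anything reaches the test environment.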

[~17:14] All right. So one down, two to go. I mean, that must have been pretty easy since they've been doing it for some time. Everything was fixed by Wednesday? Almost. The tests failed after the second component was updated, and it took a while to track down and fix that bug. Ah, I think you once told me about this team. Isn't this the one that always had the habit of prioritizing their broken builds? It's true. When they made the fix for the second component and started off the tests, some of the team started working on the third component, trying to figure out how to update that. But as soon as the tests broke, the team swarmed. They came back and they tracked down that bug. Good thing they swarmed, too, because even with the swarm it took them a full day to track down that bug and get it fixed and resolved. But by the next morning they were ready to deploy, going back to the CAB, and when Thursday morning came around, the final, third component was released and deployed as well.

[~18:12] You know, Nathen, thank you so much for this. This has been an incredible story time, and I really think I know how I could help. Really? So, Amanda, how do we help the website team have a similar experience to the order management system team in the future? Well, of course: DORA. The Explorer? No, not that Dora, Nathen. Oh, the Digital Operational Resilience Act? That's legislation that was recently passed in the EU, I think. Not that one either, Nathen. The Designated Outdoor Refreshment Area? Man, Ohio sure knows how to have fun. So no, Nathen, as fun as that would be, today we're gonna talk about something even better. For today's discussion, we're going to talk about DORA: DevOps Research and Assessment. DORA is an ongoing research program that's been around for about eight years, and the research program has been primarily funded by a number of different organizations over those years. For a few years, the research program itself was funded by an organization by the same name, DORA, and that organization, DORA, was founded by Dr.

[~19:26] Nicole Forsgren, Jez Humble, and Gene Kim. In 2018, DORA the company was acquired by Google Cloud, and the DORA team at Google Cloud has continued the research into the capabilities and practices that predict the outcomes that we consider central to DevOps. The research has remained platform and tool agnostic, and for me it has been this incredible experience to work with the research team, not only because of the findings and the learnings each year, but because I have a better understanding of the research practice: the oath, the ethics, and really the passion they bring to their work. Yeah, and these researchers have dug into capabilities for the many years of the research program, and what we're able to do through the data is draw predictive connections, or predictive analysis. We can say that these capabilities predict better software delivery and operations performance, which itself predicts better organizational performance. I think it's also important to think about the types of capabilities that we look into. Of course, there are technical capabilities, but we take a full system view. We're also looking at process and culture as part of our capabilities investigation.

[~20:47] So it's like a maturity model with a built-in roadmap. Oh, Amanda. No, it is not a maturity model. It is not a roadmap. There is no well-paved, one-path-fits-all. Context matters. We really have to think about where we are constrained and make improvements there. Right, because in the previous years' reports we had learned that delivery performance drives organizational performance. But this year's findings gave us that additional context: yes, delivery performance drives org performance, but only when operational performance is also high. That's right. And we also dug into reliability this year. Reliability itself is a multifaceted measure of how well a team is able to meet their commitments, the commitments that they've made to their customers or to the users of their applications. This year we dug into some of those different capabilities and practices, like: is the team doing regular reliability reviews? When you miss your reliability goals, are you using that as a signal to help reprioritize the capacity that you have on your teams?

[~22:00] And one of the most fascinating things that we found this year was that as teams are adopting these different capabilities and practices, initially they may actually see their reliability drop. But over time, as they mature their practice and bring on more capabilities across more teams, they'll reach an inflection point where those reliability capabilities start delivering truly more reliable applications. Because, like you said, while it's not a roadmap, the technical capabilities really build on one another. Teams get better, and they improve as they get better at additional capabilities. So when we think about continuous delivery and version control, they really amplify each other's ability to promote high levels of software delivery performance. If you're combining continuous delivery, loosely coupled architecture, version control, and continuous integration, that fosters software delivery performance that is greater than the sum of its parts. And we found this year that teams who make higher-than-average use of all of those capabilities have 3.8 times higher organizational performance.

[~23:17] Yes, and then we dug into security as well, and we found some really fascinating insights into security practices this year. First of all, what we found was that adoption of some of the security practices we looked at has already begun. There's ample room for more adoption, but at least we've started. We also found that healthier cultures have a head start. The best cultures are already moving forward. So Nathen, when you say healthier cultures, you're really talking about generative cultures, characterized by high trust and that free flow of information. These performance-oriented cultures are more likely to really establish those security practices than the lower-trust organizational cultures. That's right, Amanda, and we found that security also provides some other unexpected benefits. Sure, when we focus on security, our security gets better and we run more secure applications, but some of those additional benefits include things like reduced burnout on our team. And then we also found that there's a key integration point when it comes to adopting these security practices: many of the technical aspects of supply chain security hinge on the use of continuous integration, which provides the platform for many of those practices.

[~24:45] So this is another instance where these capabilities are really building upon each other, right? When we compared continuous integration and security, we found that those teams that were above average on both their security and their continuous integration had the best overall organizational performance. So having good continuous integration and good security really is a key driver of your organizational performance. Absolutely, and we saw this in our story time, right? The order management system team had invested a lot in their own continuous integration capabilities, and having those continuous integration capabilities gave them much more confidence as they addressed the security vulnerability. Compare that to the e-commerce team: they didn't have those continuous integration practices in place, and things were much more difficult.

[~25:44] Yeah, I mean, I guess it kind of fascinates me when we look at it in this way, because on one side a change approval board really slowed down the team and created more stress. It was really challenging. But then we see on the other side that it can actually be a useful tool and can help teams move faster and feel confident in the decisions that they're making. That's right. And as we said earlier, this is not a roadmap. We're not saying just go do continuous integration right now. Instead, we have to start by understanding: where are we today? What capabilities might we improve? Oh, and thinking back on our story, Amanda, we talked through two of our 400 applications. Unfortunately, there were a lot of meetings, negotiations, and blood, sweat, and tears that went into getting the rest of the fleet updated. In short, there was a very, very long tail.

[~26:41] So Nathen, what are your thoughts? So for these teams and... sorry, sorry. This is way too small. I can't read what's there. Can you go to the next slide, please? Is this better? Oh, that's much better. So this is a good view of those capabilities that are investigated in the research and how we can connect them to one another. You can see these capabilities are reinforcing each other, and on the next slide what we'll see is that some of those particular capabilities help decrease some important outcomes, so we can see a decrease in team burnout and a decrease in error-proneness on our team. But these capabilities also help increase other important outcomes, like delivery performance and operational performance. Well, I can tell you what, Nathen: maybe I am just coming back from burnout, because I just realized that we didn't introduce ourselves at the beginning of story time. Oh, right. So I'm Amanda Lewis. I'm a developer advocate at Google Cloud.

[~27:50] And hi, everyone. I'm Nathen Harvey. I'm also a developer advocate here at Google Cloud, and together we get to work on DORA all the time, and we would love to bring more DORA into your organizations. So what kind of help can we get from people that are here today, Amanda? The thing that would help us the most is if everyone could come and join us in the DORA community of practice. We're getting together, we're having community discussions, and it's an opportunity for all of us that are going through this and experiencing it to share, listen, and collaborate. If you go to dora.community, you can join the Google group, and then you'll get invites. You can start some asynchronous communications there in the group, and we hope that you'll join us on December 12th for our next community discussion. That's right. And make sure that you grab your own copy of the Accelerate State of DevOps Report. You can download one here at this URL. My recommendation: download one and give it to your manager, because they want to read the executive summary, which, by the way, says: for the last eight

[~29:00] Oh, no, I think we're out of time. We are out of time. But thank you everyone for joining us for story time. Thank you all so much. We'll see you soon.