DevOps and CAB: Mortal Frenemies

Las Vegas 2023

Dave Stanke

Developer Advocate · DORA

Hany Elemary

CTO and Founder · Navalia

Often dubbed as "eternal enemies", DevOps and the Change Advisory Board (CAB) pull organizations in different directions: rapid delivery vs strict governance. This presentation delves into the intrinsic differences between these approaches, exploring their origins, objectives, and cultural implications. Central to our discussion will be a case study showcasing a digital transformation journey at Wendy’s. We will highlight how, in partnership with Google Cloud, we identified the CAB as the main constraint impeding software delivery, and strategically transitioned to a greatly streamlined change approval process, supporting faster, higher quality releases, as reflected in DORA metrics. The journey provides insights into the challenges faced, the politics and strategies employed to get to tangible benefits of embracing agility without compromising on stability, security and compliance. We have found common ground where DevOps and the CAB are perhaps not yet friends, but closer to frenemies.

Full transcript

Dave Stanke

Hello, everyone. Thanks for coming. My name is Dave Stanke. I'm a developer advocate with DORA and Google Cloud, and with me is Hany Elemary.

Hany Elemary

I lead the engineering team at a global technology consulting company called Navalia. And I also wear a couple different hats at one of our clients.

Dave Stanke

One is, I love Wendy's. Big time. I love Wendy's, right? Give me a Baconator. Give me some of those fries, crispy, when you dip them in a Frosty. Amazing.

So the other thing I love about Wendy's is that they're a great customer and partner to Google Cloud. We've been doing great work with them, and you're going to hear about some of that today.

I love Google Cloud. Google Cloud is the best cloud for developers. When I want to make a change to DORA.dev, I hop on my Cloud Workstation. That's it: preconfigured developer workstation in the cloud. Got my favorite IDE on it. I can open up a pull request, and then Cloud Build is going to spin up a preview environment on Firebase so I can view it.

Then I'm going to send that pull request over to Amanda and ask for a review. And I'm going to wait, send a friendly ping: "Hey." Wait till tomorrow: "Hey, top of your inbox." And eventually, a review. I click go. We're live.

And then I'm going to walk away, and my Cloud Workstation is going to automatically deprovision because I'm not using it anymore, so I don't have to pay for it anymore.

Love DORA. DORA is a research program, the largest and longest-running research program of its kind. Since 2014, we've been asking thousands of developers how they work and what kind of outcomes they get from that. And we've developed a model which shows the technical, cultural, and process capabilities that predict software delivery performance, and the four key metrics that can tell us how well we're performing at software delivery, as well as the predictive effect that software delivery performance has on our organizational outcomes and the well-being of us as engineers.

So I'm beyond excited to bring you a story that combines all of my favorite things for you. Thank you, sir.

Hany Elemary

Well, welcome everybody. I'm incredibly excited today to share a digital transformation journey that I have been part of at Wendy's, one that I find not only interesting but, in so many ways, astounding and remarkably strange. You'll see why in just a second.

In order for me to tell you this story, I have to go back about a couple of years, so that we can talk a little bit about the Google partnership that was announced with Wendy's sort of at the tail end of 2021. And a couple months right after, my company, Navalia, we started engaging and started collaborating directly with Google. And it was Wendy's first in the innovation space, bringing conversational AI ordering to life.

So right now we're actually at one restaurant in Columbus, Ohio, doing a little bit of testing. And then, by the end of the year and closer to Q1 of next year, we'll start rolling out to other stores as well.

But the part that I really want to talk about today is actually Wendy's Digital, which Navalia, along with Google and Wendy's, started in October of 2022. And we'll walk through the journey via timeline here in just one second.

While Wendy's requires no formal introduction, Wendy's is a publicly traded company, which means that there are some strict requirements around compliance and regulations that are implied. They're also a global organization, so there are about 14,000 employees split between full-time and part-time. They have an expansive restaurant footprint, so about 7,000 restaurants in North America mostly, between U.S. and Canada. And then they have a growing presence in the U.K. and other areas of the world as well.

And then they have an ever-growing digital presence that started since the pandemic. My role at Wendy's, again, I kind of wear a couple of different hats. The one that's relevant to this talk is principal architect over the digital platform for Wendy's.

So that really means, if we take a look at the overview of the architecture, you have your Wendy's-owned channels, such as the iOS app, the Android app, web ordering, and restaurant kiosk ordering, along with all of the different integrations with delivery service providers, the Uber Eats, DoorDash, the list goes on and on.

And then right in the middle here, you'll see the platform. These are your business capabilities, your experiences addressable through APIs through the digital channels. And then on the right-hand side, we have the external dependencies, such as the point-of-sale system at each individual restaurant, payment processing service providers, fraud detection systems, loyalty providers for earn and burn, and things of that nature, in addition to taxing authorities.

And then, last but not least, is the platform engineering stuff that's developed on GCP.

If we take a look at the timeline, when we first started engaging in the digital system, we basically saw a very big problem with software delivery performance at Wendy's at the time. The development teams, the existing vendor partner that was there at the time, were deploying about once every six weeks, with some anomalies to bring it in a little bit, like once every four weeks. But for the most part, consistently once every six weeks.

We tried to understand why that was. We saw a lot of different challenges, but essentially, at the end of the day, we could bucket these challenges into two themes.

One of the themes was summarized as a heavy-handed process around software delivery and release management in general that involved our topic, but it wasn't just CAB-related. There was a lot of coordination in the process, less so collaboration, and less autonomy for individual teams. We saw a good opportunity there to change that immediately and reap the benefits of changing.

Then the other theme was that Wendy's at the time had an architecture that met the moment of the pandemic. They did very well to get over the hump of the pandemic and have good digital presence. At the time, it was point-to-point integration with the channels, with the different delivery service providers. And then after the pandemic, it really sort of failed to capitalize on adding new business capabilities, adding new experiences.

One challenge that I saw at the time was actually Wendy's was trying to experiment with different payment methods. You just wanted to get a proof of concept and just prove that. It was very difficult, and there was a lot of friction in the process.

So we said, okay, now that we're a brand-new team, we're jumping in. First order of business is that we're going to baseline our progress and measure our progress according to the DORA metrics.

So here's what we found. This was based on our assessment, and again, we saw a huge opportunity for improvement for Wendy's. You can see deployment frequency and lead time for changes were about that six-week mark.

And then I also want to draw your attention to the bottom two, which are time to restore service and change failure rate. When you take a look at this picture, you might think, well, this is good, right? We're a high performer. But for a QSR, a quick-service restaurant, having an outage that lasts even one hour is incredibly damaging. Millions of dollars are lost. Folks are actually getting to the restaurant sometimes, and they're like, "Hey, I placed an order," but then the order isn't really there. So there's also some operational complexity involved in that.

So there's a lot of improvement to make across the board. Time to restore service was sitting at an average of four hours at the time, but what we really wanted to focus on initially was the top two.
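
As a rough illustration of the baseline described here, the four key metrics can be computed from a list of deployment records. This is a minimal sketch, not the tooling used at Wendy's: the `Deployment` record, its field names, and the `four_key_metrics` function are hypothetical, and time to restore is approximated as the gap between a failed deployment and the moment service came back.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def four_key_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Summarize the four key metrics over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment to compute metrics")
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        # Throughput: how often and how quickly changes reach production.
        "deployments_per_week": len(deployments) / (period_days / 7),
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        # Stability: how often changes fail and how long recovery takes.
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
        ),
    }
```

With two deployments six weeks apart over a twelve-week period, this reports roughly 0.17 deployments per week and a six-week median lead time, which is the shape of the starting point described in the talk.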

So we said, okay, if we try to change some things in the process, just minor changes, allow people to collaborate a little bit more, even though the teams were sliced horizontally, front-end teams and back-end teams, we said, we'll just try as much as possible to go towards aligned teams. Even if the teams are still separate, we'll try to combine things. We'll minimize handoffs by having QA participate in the process early on.

And sure enough, after three months, we actually moved the needle forward, and we moved to the medium, that middle top. At this point in time, we were consistently deploying to production every couple of weeks.

So that was great, but we weren't done yet. We knew we had a lot of improvement to make. Next up, we weren't necessarily targeting the high-performer column. We just wanted to see how much more can we push. Can we get to two weeks minus one day, for example? Two weeks minus two days?

And we got stuck. We just got stuck at the two-week mark regardless of what we did. And by that time we had identified the main constraint in our process. CAB was essentially the main constraint.

I know DORA has done a lot of research in that area. What does DORA say?

Dave Stanke

Yeah, absolutely. DORA, at the heart of our research, is showing how moving change through the system quickly and continuously is good for outcomes. So it's only natural that a change advisory board, which is there to scrutinize every change, might become a constraint, right?

What we've seen in working with teams is that teams can bring the change approval into their processes. Use things like peer reviews to create that segregation of duties, use continuous delivery or continuous integration systems, and use monitoring to catch problems early and address them.

And by doing so, we find that we can move things through the system quicker without increasing the number of defects that are found. But there's also still a role to be played by the people, in that they're there to coach teams on process, connect teams to each other, and serve as escalation points for higher-risk changes.

Hany Elemary

Yeah, no, thanks Dave. Let me paint a picture of CAB as a process within Wendy's.

So much like traditional CAB in other enterprises, CAB is a centralized group of folks who may or may not be aware of all of the different moving parts in terms of software, hardware within the organization. So that's smell number one.

Then CAB also met twice a week, Tuesdays and Thursdays. If development teams wanted to get their changes into production, they had to submit paperwork, paperwork in an online tool, so to speak, 24 hours ahead of that meeting. And if they didn't, then they got slotted into the next meeting.

And sort of happy path, if everything is good, then you go to CAB, you present your changes, you get your approval, then you get a deployment window in which it's safe for you to actually deploy your changes.
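
The waiting time this schedule builds in can be worked out with a little date arithmetic. The sketch below is illustrative only: the meeting days and the 24-hour submission rule come from the talk, while the 10:00 meeting time and the `next_cab_meeting` helper are assumptions made for the example.

```python
from datetime import datetime, timedelta

CAB_WEEKDAYS = {1, 3}                    # Tuesday and Thursday (Monday is 0)
CAB_HOUR = 10                            # assumed meeting time; not stated in the talk
SUBMISSION_LEAD = timedelta(hours=24)    # paperwork is due 24 hours before the meeting


def next_cab_meeting(submitted_at: datetime) -> datetime:
    """Return the first CAB meeting held at least 24 hours after submission."""
    candidate = submitted_at.replace(hour=CAB_HOUR, minute=0, second=0, microsecond=0)
    while True:
        on_meeting_day = candidate.weekday() in CAB_WEEKDAYS
        if on_meeting_day and candidate - submitted_at >= SUBMISSION_LEAD:
            return candidate
        candidate += timedelta(days=1)   # try the same hour on the following day
```

Under these assumptions, paperwork submitted on a Monday at 09:00 is heard on Tuesday, but paperwork submitted two hours later misses the cutoff and waits until Thursday, and that is before any deployment window is assigned.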

So all of that stuff causes a lot of friction. I'm sure, I see a lot of head nods here in the audience. This sounds familiar.

So we tried to propose certain changes to the folks who were running CAB, and immediately were met with friction and politics. Again, nobody's ill-intended here. These are just people like you and me. They're trying to de-risk changes. They're trying to do their job. They're not necessarily the authors of this process. They're just hooked into that process.

But one thing that came up as we were talking with folks is SOX compliance. What about SOX compliance?

And again, you might relate to this. I've often found that compliance in general gets put out there as a smoke bomb to quiet people down. I've seen it with PCI compliance, with accessibility compliance. To a certain extent, there is something to that. But I've learned over the years to have compliance conversations as basically a conversation starter as opposed to a conversation ender.

And I think at that time, I wasn't necessarily familiar with all the ins and outs of compliance. So I said, okay, let me just go educate myself, do some reading, try to understand what SOX compliance is really all about.

So compliance really doesn't say you must have an entity called CAB. It doesn't say you should use tool X versus tool Y. I didn't really see any of that.

The high-level summary is that compliance says if you're a publicly traded company that collects money from customers, you've got to have accurate financial reporting. You've got to do your absolute best to prevent fraudulent activity. You've got to have an audit trail into your release management: who did what, what time, all of that good stuff. And then you must have a process to de-risk issues should they arise in production, a mechanism if you will.

As I was reading all of that, that's essentially what a CI pipeline does, right? All of these things we get from a CI pipeline.
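
To make that comparison concrete, here is a minimal sketch of the kind of audit record a pipeline could emit on every release. It is an assumption-laden illustration, not Wendy's pipeline or a statement of what SOX requires: `ReleaseAuditRecord`, `record_release`, and all field names are hypothetical. It captures who did what and when, refuses a release whose author is also its reviewer (the segregation of duties mentioned earlier), and notes the version to roll back to.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ReleaseAuditRecord:
    """Hypothetical audit entry written by the pipeline for each release."""
    commit: str          # what was released
    author: str          # who wrote the change
    reviewer: str        # who approved it
    deployed_by: str     # who or what performed the deployment
    deployed_at: str     # when, as an ISO 8601 timestamp
    rollback_to: str     # version to restore if the release misbehaves


def record_release(commit: str, author: str, reviewer: str, deployed_by: str,
                   rollback_to: str, now: Optional[datetime] = None) -> str:
    """Build one audit entry as a JSON line, enforcing author != reviewer."""
    if author == reviewer:
        raise ValueError("author and reviewer must differ (segregation of duties)")
    now = now or datetime.now(timezone.utc)
    record = ReleaseAuditRecord(commit, author, reviewer, deployed_by,
                                now.isoformat(), rollback_to)
    return json.dumps(asdict(record), sort_keys=True)
```

Appending each returned line to durable storage would give a searchable trail of releases without a separate manual form.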

So I said, okay, now I've got to collect some data. I've got to put a proposal together, and really I'm just going to present it to the executive leadership team.

At that point, what I had in mind was, let's actually measure cycle time. A typical cycle time from development to being ready to deploy. And then I also wanted to measure cycle time from that point to when something was actually deployed into production.

What I found was surprising. I found that if a feature took one week to develop, it roughly took one week to also deploy or to release into production. Far from ideal, obviously.

And then my next step was, well, what about smaller changes? If we actually have a change that took one day or two days to be ready to deploy, how long did it take to actually deploy that change into production? And it took no less than three days to deploy that change into production.

Again, we were getting stuck in process land, oftentimes more than the time it took to do the work itself.

So I took this data and presented the proposal and this data to the executive leadership team, to the CTO. They immediately got it. They had the same concerns, and we started having a conversation about how do we change things incrementally.

But then there was another wrinkle in the process. It turns out that the auditor who had certified the process at Wendy's certified the process according to specific language, and it also included the specific tool that Wendy's was using.

So at the time, the CTO told me, okay, this is the problem that we're having. We're going to have to reengage the auditor to make the language and the verbiage a little bit more flexible.

In my mind, I was just ready to wave the flag. I was like, this is just going to take another year. Okay, I guess we have to do that.

And to my surprise, this took two and a half months. So he went and fought the good fight, which is exactly what we needed, that executive leadership support. I do give the Wendy's leadership a lot of credit for that. And they got it across the line in two and a half months.

And after that point, it's like the floodgates had opened, right? It allowed us, a lot of different development teams, to really start making significant progress.

But at this point, we couldn't really say, "Hey, our job is to dismantle CAB. Our job is to get rid of you." Again, these are people that are trying to do the absolute best. Like you said, Dave, there is a particular role in terms of education around de-risking changes.

So our proposal involved four different steps. It was incremental in nature.

Our proposal was, how about we actually create a smaller group? We'll call it the digital CAB. It's made up of folks who are actually familiar with the tech stack, with the architecture: the engineering director, myself as the architect, other principal engineers, engineering managers. And if a team needed to deploy their changes, we'll create a Slack channel. They can just post to that Slack channel, and you just need two approvals.

We can drop everything that we're doing, unless it's an outage, obviously, and our focus is going to be to move forward. It's not like a Tuesday or Thursday. It's not like 24 hours. You just do it on demand. And that got us some good progress, but we knew we weren't done.

So the next iteration of that was to say, okay, well how about we actually take this and say, let's focus on architecturally significant changes. So if you have a standard change, just go ahead and deploy. We don't need to know about it. And we started classifying what a standard change might look like. We gave some examples, like simple changes, if you will. The bulk of the work really fit that. So we got even quicker with that.

Then the next iteration of the process was to default to yes. So we asked teams, even if you have an architecturally significant change, go ahead and post it to the Slack channel. Give people maybe 30 minutes to an hour. You be the judge, but give people some time to disagree or to barge in and say, "Hey, we're running maybe a promotion here," or "We're doing something there, so maybe hold off." But if you haven't heard anything, just assume yes, and you can go forward.

And then the last step of the process was the immediate activation. You deploy, and you let us know that you've deployed. We trust your rollback process at that point.
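
The four iterations can be summarized as a small decision function. This is a sketch of the rules as described in the talk, not the team's actual tooling: the `Stage` and `ChangeRequest` names are hypothetical, the 30-minute default stands in for the "30 minutes to an hour" left to each team's judgment, and it assumes that architecturally significant changes still needed two approvals during the second iteration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Stage(IntEnum):
    TWO_APPROVALS = 1      # every change needs two approvals in the Slack channel
    STANDARD_EXEMPT = 2    # standard changes deploy without review
    DEFAULT_TO_YES = 3     # significant changes proceed if nobody objects in time
    NOTIFY_AFTER = 4       # deploy first, then tell the group


@dataclass
class ChangeRequest:
    architecturally_significant: bool
    approvals: int = 0
    objections: int = 0
    minutes_since_posted: int = 0


def may_deploy(change: ChangeRequest, stage: Stage, wait_minutes: int = 30) -> bool:
    """Decide whether a change may go to production under a given iteration."""
    if stage == Stage.NOTIFY_AFTER:
        return True                      # trust the team and its rollback process
    if stage >= Stage.STANDARD_EXEMPT and not change.architecturally_significant:
        return True                      # standard change: just go ahead and deploy
    if stage == Stage.DEFAULT_TO_YES:
        # Silence counts as approval once the waiting period has passed.
        return change.objections == 0 and change.minutes_since_posted >= wait_minutes
    return change.approvals >= 2         # stages 1 and 2: explicit sign-off


# Example: a significant change posted 45 minutes ago with no objections.
print(may_deploy(ChangeRequest(True, minutes_since_posted=45), Stage.DEFAULT_TO_YES))
```

Each stage only relaxes the previous one, which is what makes the rollout incremental: a change that was allowed under an earlier stage is still allowed under every later one.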

And that was absolutely huge. The result of that: this past September was a record deployment month for us at Wendy's. We had more deployments than we had days in the month. And again, we're not done yet. I love some of the talks from yesterday that said the goal is to be transformative. It's not really transformation. So there's a lot of work that we still need to do.

A lot of this work, we started doing and we started evangelizing DORA not only as a software delivery performance mechanism, but also as sort of a proxy for adaptability and a strategy for risk mitigation.

So when we started talking to the business about DORA, we said, even though these metrics are technical in nature, they allow you to essentially move faster, to provide customer value faster, to change direction faster. And also, if there is an outage, we'll recover faster.

The way we did that is a comparison of all of the incidents, at least focused on P1 and P2, for the last three years. As you can see, 2021, we had 14. 2022, we had 13 incidents. Year to date, we had two incidents, P1. And it's hard to talk about average resolution time when you're only talking about two incidents, but right now we've got under two hours. So again, not where it needs to be, but again, there is good progress here.

One last thought that I'll leave you with is the best advice I've ever received, back when I was a young tech lead battling every problem that came my way. My mentor basically told me this: "Hany, if you find yourself fighting lots of little battles, it really means that you're not winning the big ones."

I found that so profound because we were dealing with these little battles. We were dealing with battles such as velocity isn't where it needs to be. Such as, well, how about we add more people to the team to speed things up? Again, everybody here knows the constraints. Adding more throughput to a constraint does not make things move faster. It makes things slower.

The other thing that I found incredibly provocative about this statement is that he said, it means you're not winning the big ones, not fighting. I think fighting would've actually been a much more accurate description, because I didn't know which ones to fight. Winning here implied that this is a must-win battle.

And the moment you win these big battles, just be prepared. They're going to take a long time. You might have some battles at the same time. All of these problems underneath are just going to deflate as a result of these issues.

So that's really all I have, folks. I really appreciate you. Appreciate presenting with you, sir, for the first time. Feel free to connect with me on LinkedIn. Thank you again.

Dave Stanke

Wow. Thank you. What a story. Being able to have a record deployment month at the same time as really minimizing harm, I'm so excited.

This story has my heart racing, and I really want to go get with my team and try to find places where we can optimize and continuously improve.

So if you'd like to learn more about Google Cloud, visit cloud.google, or stop by our booth. If you'd like to learn more about DORA, you can visit DORA. Change conversation.