Not dead, just resting! How To Win at Maintenance Mode

Las Vegas 2023

How do you do maintenance mode in a DevSecOps world, and why is that even a question? Maintenance mode. Keeping the lights on. BAU support. Evergreening. Whatever you call it, it happens when there’s no more funding for feature development in live digital services and data pipelines. There’s a need to resize teams to cut costs, to reassign people to start on new propositions… but those live services aren’t dead, they’re just resting! Who does the library upgrades, security patches and defect fixes? Who preserves availability targets when there’s no more money to look after the live services that are making money?

I’ve spent time with many different scaleups and enterprise organisations. Too many have used their operations team as a maintenance mode solution, and wondered afterwards why technical quality, reliability, and employee retention all took a hit. And if you think transitioning a live service into an operations team is hard, you should see the reverse when it’s back to the developers for more features.

I’ll cover the pros and cons of three different maintenance mode solutions - delivery teams, the operations team, and multi-product teams. I’ll share the results from a maintenance mode survey of 40 different organisations worldwide. And I’ll explain why multi-product teams as an extension of the You Build It You Run It operating model can give you a truly magnificent maintenance mode.

Full transcript

Steve Smith

Good morning, everyone.

Ron? Ron, where are the slides?

Oh, there we go. Thanks, Ron.

Hi, I'm Steve Smith. I'm going to talk about the maintenance mode scaling problem, the kind of unglamorous, inevitable technology problem that Gene Kim and I just love talking about.

I phoned you a year ago and I said, "Gene, I'd like to talk in Vegas." And he was like, "What about?" And I said, "A topic so boring, Gene, that you'll instantly regret ever having me." And he was like, "Oh, that sounds amazing. Tell me more."

Somebody last night in the restaurant said to me, "How do I get to know Gene?" And I said, "The problems that you're bringing him are not boring enough."

So we'll cover what the problem is. We'll cover three different solutions, and a survey of 40 organizations that Gene made me do as the price of coming here.

Who am I? Let's go. Yes, Gene, come on. No, no clicker, Ron. Really? I hung out with Ron out back, but clearly... Oh, see, Gene's going to go back and turn off Ron. There we go. Okay. Oh, there we go. No, Ron. No. Okay, thanks, Ron.

All right, so this is why you always get to know the AV people.

Who am I? I'm an engineer who likes to advise other engineering leaders. It runs in the family. My dad was an engineer. My dad's dad was an engineer.

I'm the Head of Scale service at Equal Experts. We're a global technology consultancy of around 3,000 people. I've been with the company for 10 years now, and it's my job to collaborate with our customers to identify and solve scaling problems. That's something that slows you down or harms you when you're trying to scale your teams up or down.

Before that, I worked at LMAX, a financial startup where we built an amazing, state-of-the-art, low-latency exchange. I worked with a guy called Dave Farley. I don't know whatever happened to him. Never heard of the guy ever again. Just kidding. He's on YouTube now. My daughter's really impressed by that.

And yes, it's true, I'm British. Somehow I snuck onto the program. You can tell I'm British because my humor is quirky, yet ultimately somehow lovable. I've never spoken to a lawyer in my life, which over here is kind of a big deal, I know. And in my household of four people and one puppy, I have one car. That is the correct ratio in Britain.

But don't worry. Equal Experts has a North American office full of lovely North Americans like Sid, who's down here somewhere in front. There he is. Hey, Sid. So Sid's our SVP, and he is from Nebraska. He lives in Florida. And in his household of four people, excuse me, three people in his household, he has four cars and one plane. That is the correct American ratio.

So what the hell is this maintenance mode problem?

You see, Sid told me last night to take out a joke about communism, and I swapped it out for a joke about him. And now that's gone down really well. My wife said not to do politics in the U.S. She was right.

So what the hell is this thing? Anyway, we're behind time, speed up already.

So last year George spoke about the leftovers. He talked about how to manage business-critical services when they lack funding or clear business ownership. Thanks, George, for sharing your story, and at the request of Gene, I'm going to try and build on that a little bit. And we're going to try and hang out tonight, me and George. He's been ducking me for a year. Now we're ready.

I see this as the maintenance mode scaling problem. You might know it as business as usual, or harvest mode, or keeping the lights on.

Let's say that you've got a hodgepodge of monoliths and microservices, or a zoo of data pipelines. Those are the correct collective nouns for them. The majority of them will be non-differentiating services. You'll build them because you have to, not because you want to.

And your teams work hard, right? They continuously deliver planned features to your users, and then once they've got live traffic and demand slows down, deploys slow down, and they reach zero demand. That means zero funding remains for feature development. The day-to-day work becomes maintenance tasks: upgrading libraries, security fixes, all that good stuff.

And the problem you have here is who manages these zero-demand services at scale? Because they're like the Norwegian Blue parrot in Monty Python. They're not dead, they're just resting, right?

And if you have annual funding, this problem is coming your way because you'll have pressure to reduce CapEx spend. You can listen out for this problem. You can hear people say things like, "We need to increase capacity for more services. We need to reduce costs." Or maybe you're measuring unplanned work rate for your teams, and it's slowly starting to creep up.
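As a rough illustration of the unplanned work rate mentioned above, here is a minimal Python sketch. The talk does not define the metric, so this assumes it is the share of a team's completed work items in a period that were unplanned. The names `WorkItem` and `unplanned_work_rate`, and the sample data, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    """A completed piece of work (hypothetical shape, for illustration only)."""
    title: str
    planned: bool


def unplanned_work_rate(items):
    """Return the fraction of completed items that were unplanned (0.0 if none)."""
    if not items:
        return 0.0
    unplanned = sum(1 for item in items if not item.planned)
    return unplanned / len(items)


# Made-up data: the same team in two different periods.
earlier = [
    WorkItem("checkout feature", planned=True),
    WorkItem("search feature", planned=True),
    WorkItem("library upgrade", planned=True),
    WorkItem("urgent security patch", planned=False),
]
later = [
    WorkItem("new proposition feature", planned=True),
    WorkItem("urgent security patch", planned=False),
    WorkItem("defect fix on old service", planned=False),
    WorkItem("broken deploy on old service", planned=False),
]

print(f"earlier: {unplanned_work_rate(earlier):.0%}")
print(f"later:   {unplanned_work_rate(later):.0%}")
```

A rate that keeps climbing from one period to the next is the kind of signal described above: zero-demand services are quietly taking time away from planned work.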

To solve this problem, you need to create an ownership model for zero-demand services that allows for teams to be reassigned, resized, or retired. That's going to cover those two new outcomes that you want: more capacity and lower costs.

But you also have some existing outcomes you want to protect. You want to protect your current levels of live service reliability, job satisfaction, and future feature delivery.

All right. What solutions do we have? How effective are they? What problems do they cause themselves? Because scaling problems are hard like that, you know, and we can't all be Spotify, can we? Just throwing Swedish money and Swedish Fish at our challenges. We've got to actually figure them out for ourselves.

So they're not going to hire me. It's fine.

The first solution is to have your delivery teams maintain their own zero-demand services in the background.

Here's a composite organization synthesized from many Equal Experts customers over the years, and I can't remember what any of them are, which is good. Let's pretend it's an American retailer. It's not really American, but I do like America.

The first time I ever came to America, I was made to sit in a glass room at LAX for three hours with all the other passengers while a tannoy yelled at us that we should not speak to any walk-up lawyers. I've never been so scared in all my life.

We've got 12 delivery teams here. There's a product details page and an online checkout team. They're building differentiators, they're innovating, they've got long-term funding. Forget about them already.

I'm much more interested today, at least, in the 10 other teams: the furniture team, the returns orchestrator team, all of the non-glamorous stuff I warned you about. They're being built for parity, right? They're going to reach zero demand once they're into live traffic.

With this maintenance mode solution, what happens is those services move into the background of the delivery teams so that they can pick up more work. That's what the blue lines signify here. Just shove it into the background.

And here's the end state of this solution. Orange means a team is maintaining a live service in the background while they're building a new service in the foreground. You can resize teams, but you can't really retire them. And the operations team does all the live support for you. Facts of ops, not that they chose that life.

What's good with this solution is there's more capacity, and there's a similar level of service reliability to before because the maintainers haven't changed. They're the creators.

But what's bad is future feature delivery is much slower than before, okay? Because a team will have two different business owners for their two different services, so prioritization becomes painful, or as I like to call it, a bun fight.

And what will happen is you'll probably allocate a dedicated amount of time each week to maintaining the background service. But then teams will come under pressure to squeeze that time down, or to shove in large, unfunded features, and it's pretty stressful.

What else is bad? Well, there's little sense of mission or purpose for your teams, and it's not easy to retire teams. You can't actually reduce costs with this solution. That's why this maintenance mode solution kind of happens by default until someone turns up and says, "We need to save some money around here."

So to test our beliefs, we surveyed 40 organizations across the Equal Experts network: telcos, financial services, retailers, the other ones. The columns here are the new outcomes that we're trying to achieve and the existing outcomes that we're trying to protect. And the row is going to be the maintenance mode solutions that we're looking at.

Green means we think that the outcome can be achieved. Red means it can't. Deal with it. And the percentages are our actual customer experiences.

For this delivery team solution, 58% of responders agreed that there was an increase in capacity, and 75% agreed that they could protect service reliability to a similar standard as before.

But 100% of the responders said there were no cost reductions, and 52% agreed feature delivery was slower than before. One person said to us, "We were still seeing new feature requests, high-value stuff that could make a lot of money, but IT managers would say, 'Uh-uh, we can't change that. It's in maintenance mode.'"

And 33% saw a drop in staff happiness. One person said to me, "Whenever we dial up maintenance tasks, I can see it have a real impact on team morale."

All right. Our second solution is probably the most popular one out there. Your operations team are already doing all of the live support, so why not make them do all the maintenance tasks as well, right? This could be a move from the first solution, or it could be your starting point.

So what does that look like? Here's our American retailer again. It's not really American, but I do like America.

You know, the first time I went to New York, I got into an awful lot of trouble in a bagel shop. I asked for a British-sized bagel. The waitress said, "What is that?" I said, "Your food sizes are insane in this country. Give me a half-size bagel, and I'll pay full price." She got angry. She threw me out. And now I can't get a bagel anywhere on the Lower East Side. The sad thing about that is that story is entirely true.

We've got the same 12 teams as before, the same 10 non-differentiators that we're interested in. Now when they reach zero demand, they are transitioned into the operations team, who will do all the maintenance tasks as well as live support.

And the end state is the operations team maintaining everything and supporting everything. And now you have the flexibility to resize teams, retire teams, allocate services to different teams. It's entirely up to you what you do with this. Listen to your heart.

But whatever you do, your operations team is going to be awfully busy.

So what's good here is you do achieve greater capacity and lower costs, which in this economy is super important. But there's a lot of problems here.

Number one is reliability takes a big hit, okay? Because your entirely blameless operations team are affected by two things.

One, their cognitive load is going to be really high. There's no limit to the number of zero-demand services that can be imposed upon them. They have to work long hours. Their work in progress is really high. Mistakes are going to creep in.

And number two, it's unlikely they'll be given the time and space necessary to acquire the technical skills and domain knowledge to rapidly complete maintenance tasks to a similar standard as when services were with the delivery teams.

Future feature delivery is another big problem here, right, that I'm sure people have seen before. One operations team owning many zero-demand services means one of two things: zero business owners or many business owners. So prioritization, then, is really hard, an endless bun fight over an ever-growing support backlog.

And there's a lack of job satisfaction here because developers feel that they're in a never-ending feature factory, and your operations team feel that they're in a never-ending dumping ground. And they're both right.

There's a countermeasure to the problems caused by this maintenance mode solution. You can do a reverse transition, right? You can suck a zero-demand service back out of the operations team into a short-lived, newly funded delivery team, who will complete a small backlog of stability enhancements or long-planned features, and then it can go back into the operations team.

Here's our American retailer again. You can see in the middle there's a payment service that's being sucked back out of the operations team into a payments V2 team, because of course the payments V1 team is long gone.

I've seen this solution, and this countermeasure, a bunch of times in the wild, and it's flawed.

Your new delivery team will lack the domain knowledge necessary to rapidly complete their work. And a reverse transition out of an operations team is like an order of magnitude more painful for everyone than a transition into your operations team.

I know a company where there's a 300-question spreadsheet before an operations team will take on maintenance tasks as well as live support. And to do a reverse transition back out of the operations team, there's a separate 100-question spreadsheet where every question is a cunningly worded variant of, "Could you please do a better job than last time?"

And yes, of course, when the developers are finished doing a better job than last time, when they've finished implementing those stability enhancements or those long-planned features, it's back into the operations team, and let's do that 300-question spreadsheet all over again. But this time your operations team are really waiting for you.

It's a time-consuming, wasteful merry-go-round.

Back to our survey results for the operations team solution. 66% of responders said, yes, we had increased capacity, and 55% said, yes, we saw reduced costs. That's good. That means you've achieved those new outcomes.

But 55% agreed that reliability took a really big hit. One responder said, "We had a lovely time once," by which I know they are British, and it was not a lovely time. They spent 20 hours fixing a broken deploy because their blameless operations team had been unable to complete maintenance tasks for a year, so there was no rollback pathway at all.

And 55% felt that feature delivery was more difficult than before. Somebody very angrily wrote to me saying, "The ability to handle maintenance tasks is a very painful concept. Here we have outsourced operations contracts with no time dedicated for all upgrade work. Everything has to be a purchase order and a change request. The amount of waste in this process is criminal."

Don't worry, I did actually phone the police about that one. They told me to stop phoning them.

88% of responders saw a drop in staff happiness. One developer said to me, "The operations team here is permanently unhappy looking after services they didn't build and trying to upgrade them all the time, and I feel bad for them."

So what we see with this solution is it achieves those new outcomes in the short term. That probably explains why it's so popular, right? But in the medium to long term, it kills your existing outcomes that you're trying to protect. Maybe that explains why I've seen this solution cause so many problems.

All right. Our final maintenance mode solution is to form multi-product teams. It's a solution we've designed with some of our customers. This could be a move away from either of those prior solutions that we've seen, or maybe it's your starting point.

Either way, the You Build It, You Run It operating model is a prerequisite for multi-product teams. So we'll need to talk about that first.

So at the Europe 2021 event, I spoke with Simon Skelton from John Lewis & Partners, a UK retailer similar to Nordstrom over here. He shared how we introduced the You Build It, You Run It operating model to 30 teams and 40 digital services as part of their transformation journey.

You Build It, You Run It is an operating model in which teams build and run their own digital services. It means product managers are incentivized to prioritize reliability alongside functionality because they're accountable for both. And it means that developers are incentivized for the long term to continually build reliability into everything that they do. Think of it as better insurance for your most valuable business outcomes.

This new operating model gave John Lewis & Partners an annual deploy count that was 26 times higher than before, a time to repair that was twice as fast, and revenue protection was four times higher.

I co-authored a print book on You Build It, You Run It a year ago. I've got some copies in the back. Come and find me afterwards if you'd like one. It's no problem at all.

So the obvious follow-on question we have here is what would happen if we applied You Build It, You Run It principles of outcome-oriented teams, zero handoffs, and clear incentives to zero-demand services?

Here's our American retailer one final time. It's not really American, but I do like America.

I went to a Walmart in Orlando last year, and I bought a giant bag of Cheetos. It was the largest bag of crisps I have ever seen. That was lunch and dinner all sorted. I did have some very strange dreams that night, though.

So we have the same teams here, but now they're organized into product families. You might know them as verticals. And those teams are also allocated into different domains.

We have our teams building and running their own services, and when they reach zero demand, they're transferred into a multi-product team dedicated exclusively to that family.

A multi-product team is a team of engineers who are accountable for reliability outcomes for all zero-demand services in that family. They don't have a product manager. That would wrongly imply the presence of a product backlog. But they do report into a product lead who owns that entire family.

So what happens is that you transition those services into your multi-product team, and then you can resize or retire teams or reassign people as you see fit, just as with the operations team solution. Again, numbers will vary here. It depends on your organizational context.

What's important here is the idea of You Build It, You Run It being applied beyond a single product team.

So what's good with this solution is your new outcomes are achieved. You've got more capacity and lower costs, and you also can protect your existing outcomes.

Multi-product teams have the skills and domain knowledge necessary to preserve a level of reliability similar to before. And their cognitive load is much more manageable because the number of zero-demand services that can be imposed upon them is limited to the number in their product family.
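The cognitive load point can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a model from the talk: the service names, the family names, and the `services_per_team` function are all hypothetical, and the count of services per team is used as a crude stand-in for cognitive load.

```python
from collections import Counter

# Hypothetical zero-demand services mapped to made-up product families.
ZERO_DEMAND_SERVICES = {
    "furniture": "browse",
    "gift-cards": "browse",
    "returns-orchestrator": "fulfilment",
    "delivery-slots": "fulfilment",
    "payments": "checkout",
    "vouchers": "checkout",
}


def services_per_team(services, model):
    """Count the zero-demand services owned by each maintaining team."""
    if model == "operations":
        # One operations team takes every service; nothing caps the count.
        return {"operations": len(services)}
    if model == "multi-product":
        # One team per family; each team's count is capped by its family size.
        return dict(
            Counter(f"{family} multi-product team" for family in services.values())
        )
    raise ValueError(f"unknown model: {model}")


for model in ("operations", "multi-product"):
    print(model, services_per_team(ZERO_DEMAND_SERVICES, model))
```

Adding a seventh service raises the operations team's count no matter which family it belongs to, while only one multi-product team is affected.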

Future feature delivery can happen at pace when necessary. When a multi-product team identifies an improvement opportunity, they can go to their product lead, who's an apex decision-maker for that entire family. They can make a quick prioritization call across all of the services that they own without any drama.

And the engineers in a multi-product team can have job satisfaction. They are responsible for end-to-end outcomes. What they're doing is making a difference. They're empowered to act.

What's bad here is you do need to spend time setting up the right guardrails to counter any dumping-ground cultural factors that are lurking in your organization. So you need an organization-wide definition of zero demand. You need a strong identity for multi-product teams and some really great product leadership.

And you do need to find funding for one multi-product team per family, which means you need a good answer for, "Wouldn't one operations team be cheaper than all of this madness, Steve?"

And the way to do that is to flip that cost-based conversation into a value-based conversation. You need to understand the kind of outcomes that you're trying to achieve together, and then you can look at the business metrics that are necessary to show if you're actually successfully protecting outcomes and achieving your new outcomes.

All right. What have we learned before we go to break? There are, like, three things to take away from here.

Number one, if you have annual funding, zero-demand services are inevitable. It's going to happen to you, so plan accordingly for how you're going to approach maintenance mode.

Number two, don't chuck all of your maintenance mode work blindly over the wall to your operations team. You are accidentally sacrificing important outcomes in the medium term.

And number three, if you move to You Build It, You Run It and empowered product teams, moving to multi-product teams for maintenance mode is going to be a natural fit for you.

So thanks to Gene for inviting me to speak here. Thanks to Anne for all the logistical help.

If you'd like to talk about any kind of scaling problems that are on your mind, I'm around all week. I'm at the conference. I'm the one that's easy to find in a crowd. Or you can get in touch with me on LinkedIn afterwards.

And thank you to Equal Experts North America for looking after me this week. Remember, Sid is here if you want to talk to a North American, someone who thinks like me, but sounds like you.

Thank you.