Operations - The Last Mile

Las Vegas 2018

Co-founder and Chief Product Officer · Rundeck

Why is it that some DevOps transformations stall while others continue to flourish?

This talk will make the case that Operations is the most predictable differentiator. So much of the energy of the DevOps movement has gone into activities that start in Dev and move towards Ops: agile practices, automated deployment pipelines, automated testing, and of course, the unofficial mantra of "deploy, deploy, deploy."

However, when it comes to Operations, too many DevOps transformations have stuck with the status quo and left problematic Operations practices in place. By not fully engaging with and transforming Operations, companies are preventing themselves from realizing the full potential of their DevOps investments. This gap is the last mile problem of DevOps.

This talk will first examine the trouble with the various siloed, ticket-driven, low trust, and centralized practices that have been accepted as status quo in Operations for far too long. Then we will look at the specific techniques being used by high-performing Operations organizations who are fundamentally transforming how they operate.

Damon Edwards is a Co-Founder and Chief Product Officer of Rundeck, Inc., the makers of Rundeck, the popular open source operations management platform. Damon was previously a Managing Partner at DTO Solutions, a DevOps and IT Operations improvement consultancy.

Damon has spent over 15 years working with both the technology and business ends of IT operations and is noted for being a leader in porting cutting-edge DevOps and SRE techniques to large enterprise organizations. Damon is also a frequent conference speaker and writer who focuses on DevOps and IT operations improvement topics. Damon is active in the international DevOps community as a co-host of the DevOps Cafe podcast and an early core organizer of the DevOps Days conference series.

Full transcript

The complete talk, organized by section.

Damon Edwards

How many folks think the things we are talking about at these conferences have the opportunity, the potential, to improve the lives of the people who work in technology organizations and improve the bottom line of those technology organizations? Give me a yes. Yeah. Me too, and that is why I am here to tell you about what is going to get in the way.

The point of this talk is in the title: The Last Mile. If you are not familiar with that term, it comes from the telco industry. You build a network of value, all these things you have done, but there is still that last mile to connect to the user and to the customer so we can realize the full potential and value of the work we are doing. My thesis is that last mile is operations.

Developers have had an unfair advantage over operations. For the last 17 or so years, Agile has been seeping into their brains. Maybe they have not been doing Agile, but it is in the textbooks, books, conference speeches, language, tools, and ideas of lean, flow, fast feedback, and working in small batches. What is the last intellectual movement that swept through operations? ITIL, 1989. So when we talk about DevOps concepts, developers have this unfair advantage.

Here we are in 2018, and ops is in a tough position. It is under pressure. On one side is go-go-go digital DevOps transformation: go faster, open things up. On the other side, often from the same business people, it is lock it down, do not be the next hack, do not be the next breach, keep us out of the news. As the digital pipeline becomes more and more the factory floor, this matters more and more. Ops is squeezed between these pressures, and it is hard enough to manage them, let alone find time to improve how things work and join the transformation to complete the last mile.

Story time. This is a true story; the names have been changed to protect the not-so-innocent. The company was big on change. Digital transformation was sweeping through. They were talking Agile, DevOps, a new SRE organization, cloud, Docker, Kubernetes, and microservices. It was great times: go, go, go. Everyone wanted to work there and be part of it. But nobody was talking about what happened after deployment. It was all about getting to deployment, and when you burst through that mirage of deployment, it was like 2005 all over again: silos, tickets, conflict.

On an average Tuesday, at 9:30 in the morning, the NOC starts seeing lights. There have been intermittent errors this week; maybe this is the same one, maybe not. Half an hour later a business manager calls about a customer issue and outage. The NOC says to escalate. Bob, the incident commander, opens a ticket. That ticket blasts out to the business manager and all the app-specific SREs because nobody knows what is actually going on. Through a lean lens, that is an interruption and a lot of context switching already.

People jump onto the bridge call and into the try-this-then-try-that loop. They do not have access to all the right systems, so they call in someone from the legacy system administrator team who has access to the production environments with customer data. Business managers find their way onto the bridge call asking whether it is fixed. There is waiting, a dog pile of everyone trying things, and disconnected access getting in the way. Eventually they decide it is a problem with the Foo service, but the Foo SRE says a new version is being deployed and they have not been told about it yet.

Bob escalates. In lean terms, that is partially done work: they have reached a conclusion but cannot go further, so they pass it to someone else. More waiting. Karen, the lead developer on the Foo service, is in the last day of the sprint, ignoring emails, when someone knocks and asks if she saw the ticket. She context switches and immediately needs more information. She does not have production logs, because she is not given access. She opens a ticket for help, then reaches out through HipChat to SRE or system administrator folks. Lee sends logs, but of course they may not be the right logs. They go around and around. The ticket becomes a context wagon: even when people are not active on it, it has their name on it and occupies a piece of their brain.

Karen finally gets the logs and realizes whoever restarted the services used incorrect environment variables. The whole service pool needs to be restarted with the right variables or more cascading problems will happen. Now it is 2:00 in the afternoon. Bob asks the middleware team for an urgent restart of the app pool with the right environment variables. There is more partially done work, waiting, interruption, and context switching. Melissa, the middleware manager, says they cannot restart services in the middle of the day without business approval. Because of scar tissue from something bad in the past, Bob must interrupt the SVP of the line of business to approve a restart of customer-impacting services.

Susan, the SVP, tries hard, but she has been far from the non-email end of the keyboard for years. The VPs discuss customer impact and microservices, interrupting everyone with more context switching, then approve the restart. At 5:00, Melissa asks who knows these production services best. Ellen does, but she is on a plane to Europe for a launch. Scott is next; he has only been there a couple of months. Everyone waits while Scott dumpster-dives through SharePoint and wikis to figure out the manual restart process. The context wagon grows, and the salaries in it grow too.

Scott starts the services manually. Bar waits for Acme; Acme startup fails. Scott escalates. Linda, the Bar SRE, notes that the new DevOps program added environment pre-flight checks that fail when dependencies are not ready; Bar cannot connect to Acme. The ticket is updated and both the network SRE team and Bar lead developer are pulled in. The developer can comment out the test, but that would need the CD pipeline, QA, and change management, so they try the network folks. The network folks are not answering because the business managers called the network team about a different network outage and told them to focus there.

Because Scott had beers with Carlos from the network team a couple of weeks before, he still has Carlos's cell number and gets help. Harry from the network team says traffic is blocked by the firewall and tells them to take it up with the firewall team. Freddy, the firewall engineer, says it cannot be the firewall because they only change it on Thursdays. Then he finds that a rule changed last Thursday blocking Bar from Acme because someone said it no longer needed that path. When asked to change it back, he says he can do it Thursday. The chief of staff points out customers are livid. Nicole from NetSec says this is a production change and CAB people need to weigh in. Someone threatens to call Susan the SVP, and the firewall rule change is approved.

Freddy, Scott, and Bob change the firewall and restart services. They think it looks good, but they are not trusted to test their own work, so they need the customer engagement manager with the right tools to check the APIs. It is 9:45 at night. Varsha is the person they need; it is her birthday, and she is out. She comes home, runs the tests, and says services started okay and everything is green. The incident is over. The next day Susan calls a meeting asking whose fault it is, why the organization is so bad at change, and what additional approvals and processes they will add so this never happens again.

Does this sound familiar? After investing in cloud, Agile, DevOps, and everything else, why does everything take so long and cost so much? The answer is that the organization largely ignored operations. Most companies chase symptoms and follow old conventional wisdom: better tools, more people, more discipline and attention to detail, and more change reviews and approvals. But more tools just leave the same problems with more tools. More people is usually a non-starter. More discipline is like telling developers to write fewer bugs. More reviews and approvals are scar tissue on top of scar tissue.

We have to challenge the conventional wisdom about how operations works and change the systemic conditions operations has been marinated in for years. Four forces keep getting in the way again and again: low trust, excessive toil, silos, and queues.

Low trust is about who has context and where decisions are made. In the story, the person touching the problem often was not the decision-maker; the highest-paid person in the chain was. John Allspaw showed how whether an action is dangerous always depends on context. The people with context are often over here, while the people making decisions are starved of that context. Low trust also ties to psychological safety: a shared belief that the team is safe for interpersonal risk-taking, or, as Sidney Dekker says, how easily you can tell your boss bad news and how easily your boss can tell the organization bad news. Google's research found psychological safety was the number-one predictor of team performance. Trust and safety go hand in hand.

Excessive toil is the kind of work tied to running a production service that is manual, repetitive, automatable, tactical, devoid of enduring value, and grows linearly as the service grows. The opposite is engineering work: creative, iterative, strategic work that adds enduring value and enables scaling. Organizations need balance. If toil goes to the max, there is no time to improve the business and no time to reduce the toil, so teams enter a downward spiral. Reducing toil creates capacity for operations to improve itself.

Silos are not just teams; they are a way of working. In a small group or startup, people may share a backlog, context, tooling, and priorities. But in the enterprise, nothing lives in isolation. We always need something from someone else, and those people have their own backlog, priorities, and information. These mismatches make teams turn inward to optimize themselves.

Queues cover for those disconnects and mismatches. We drop in ticket queues liberally because it is easy. But queues are expensive. Queueing theory and Don Reinertsen's work show that queues increase cycle time, risk, variability, and overhead, while lowering quality and motivation. Queues also blow apart the value streams and mental models that DevOps tries to build. Tickets create one-off snowflakes: technically correct at the time, but brittle and unreproducible. The next person or automation run hits something unexpected. The only thing worse than automation that does not work is automation that is just a little bit broken.
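To make the cycle-time point concrete, here is a minimal sketch using the textbook M/M/1 queue formula. The talk does not name a specific model; the M/M/1 assumption (random arrivals, a single server), the function name, and the sample utilization values are all illustrative choices, not something from the talk.

```python
def mean_wait_in_queue(utilization: float, mean_service_time: float) -> float:
    """Mean time a request waits before service starts in an M/M/1 queue.

    Uses the standard result Wq = rho / (1 - rho) * service_time,
    where rho is the fraction of time the team (server) is busy.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization) * mean_service_time


# Hypothetical ticket queue where each request takes 1 hour of hands-on work.
for rho in (0.5, 0.8, 0.9, 0.95):
    wait = mean_wait_in_queue(rho, mean_service_time=1.0)
    print(f"utilization {rho:.0%}: about {wait:.1f} hours waiting in queue")
```

Under these assumptions the wait equals one service time at 50 percent utilization and about nine service times at 90 percent, which is why a busy team behind a ticket queue adds far more delay than the work itself.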

What can we do differently? For low trust, shift left the ability to take action. Give the people closest to the context the process, trust, tools, and guardrails needed to make decisions and act. For excessive toil, track toil levels for each team. This is not time tracking; it is a regular sense of how the team is doing. Set a toil limit and fund reduction efforts. If a team's toil is above a limit, such as the industry benchmark of 50 percent, swarm to it and invest to free human capital for valuable business work. This comes from the SRE movement, including toil limits, error budgets, and service level objectives.
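One way to picture a toil limit is as a lightweight periodic survey rather than a timesheet. The sketch below assumes each team member gives a rough estimate of the fraction of their time spent on toil; the survey format, team names, and function name are hypothetical, and only the 50 percent limit comes from the talk.

```python
from statistics import mean

TOIL_LIMIT = 0.50  # the 50 percent benchmark mentioned in the talk


def teams_over_limit(survey: dict[str, list[float]],
                     limit: float = TOIL_LIMIT) -> dict[str, float]:
    """Return teams whose average self-reported toil fraction exceeds the limit.

    `survey` maps a team name to rough per-person estimates (0.0 to 1.0).
    Teams with no responses are skipped rather than guessed at.
    """
    flagged = {}
    for team, estimates in survey.items():
        if not estimates:
            continue
        level = mean(estimates)
        if level > limit:
            flagged[team] = level
    return flagged


# Hypothetical quarterly check-in results.
survey = {
    "middleware": [0.7, 0.6, 0.8],
    "network": [0.3, 0.4],
}
for team, level in teams_over_limit(survey).items():
    print(f"{team}: {level:.0%} toil, over the limit; fund a reduction effort")
```

The point of keeping it this coarse is that the number only has to be good enough to decide where to swarm, not to audit anyone's hours.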

For silos, the obvious answer is to get rid of silos, but not by making everyone do everything. It is about horizontal shared responsibility. Damon contrasts a loose Netflix model, where everything is cross-functional teams and there is no central operations organization, with a more Google-like model that still has development teams and an operations organization. Google uses clear handoff requirements to get into operations and error budgets with consequences that put responsibility back on development. The models differ, but both build shared responsibility.
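The arithmetic behind an error budget is simple enough to sketch. The example below assumes an availability SLO measured over a 30-day window and a policy where releases pause once the budget is spent; the talk only says that error budgets carry consequences, so that specific policy, the function names, and the numbers are illustrative assumptions.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0 < slo <= 1:
        raise ValueError("slo must be in (0, 1]")
    return (1 - slo) * window_days * 24 * 60


def releases_allowed(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> bool:
    """Assumed policy: development may keep shipping while budget remains."""
    return downtime_minutes < error_budget_minutes(slo, window_days)


# Hypothetical service with a 99.9% SLO that has been down 50 minutes this window.
budget = error_budget_minutes(0.999)
print(f"budget: {budget:.1f} minutes")
print("releases allowed:", releases_allowed(0.999, downtime_minutes=50.0))
```

A 99.9 percent SLO over 30 days works out to roughly 43 minutes of budget, so in this sketch 50 minutes of downtime would shift the development team's attention from shipping features to reliability work.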

Cross-cutting concerns bring the queue problem back, because every specialty cannot sit on every team. The industry trend Damon calls self-service operations is a design pattern for this. Take operations actions such as environment provisioning, restarts, health checks, cache clearing, scaling operations, and security checks, and turn them into pull-based self-services. People who need them can use them on demand. Developers and platform engineering teams can build self-service capabilities, hand them to operations and security for code review and vetting, and then access can be granted to the right people.
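The build, vet, then grant flow described above can be sketched as a small catalog of actions. This is a minimal illustration under stated assumptions, not Rundeck's API or any real product's interface: the class names, the role names, and the in-memory audit log are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class SelfServiceAction:
    name: str
    run: Callable[[], str]    # the operations task itself
    allowed_roles: frozenset  # roles granted access to this action
    vetted: bool = False      # set once operations and security have reviewed it


class Catalog:
    """Pull-based catalog: people run vetted actions on demand, no ticket."""

    def __init__(self) -> None:
        self._actions: dict[str, SelfServiceAction] = {}
        self.audit_log: list[dict] = []

    def register(self, action: SelfServiceAction) -> None:
        self._actions[action.name] = action

    def approve(self, name: str) -> None:
        self._actions[name].vetted = True

    def execute(self, name: str, user: str, role: str) -> str:
        action = self._actions[name]
        allowed = action.vetted and role in action.allowed_roles
        # Record every attempt, allowed or not, as evidence for later review.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": name, "user": user, "role": role, "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user} ({role}) may not run {name}")
        return action.run()


# Hypothetical usage: a developer-built restart, vetted, then run on demand.
catalog = Catalog()
catalog.register(SelfServiceAction(
    "restart-foo", lambda: "foo restarted", frozenset({"developer", "noc"})))
catalog.approve("restart-foo")
print(catalog.execute("restart-foo", user="karen", role="developer"))
```

The design choice worth noticing is that the review happens once, when the action is vetted, rather than on every request, which is what removes the queue from the handoff.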

The idea is to stay out of teams' way. Wherever you cannot remove a handoff, do not put a ticket queue there; put a self-service interface. This works with cross-functional team models and standard dev-and-ops models. People can focus on building the platform and self-service capabilities. That removes ticket queues and excessive toil and turns people back into operators. It is also a place to build in security and compliance. It would be anti-Deming to try to get quality by external inspection; build it into the system. Put security, compliance rules, and evidence collection into the self-service system.

Not all tickets are bad. Tickets are good for documenting true problems or exceptions, and for necessary approval routing that would otherwise become a mess in email. But organizations must stop using ticket queues as general-purpose work management systems and work-permission systems that run people's lives.

Examples from this conference show the pattern. Jody Mulkey, then CTO of Ticketmaster, described how Ticketmaster's public web-facing outage MTTR was 47 minutes. They had many escalations: people calling other people. They built self-service capabilities in dev and QA, tested and delivered them, and used them to empower the NOC and level-one ops teams. They even linked monitoring tools to self-service operations so responders could run key actions first. After the Support at the Edge program, they went from 47 minutes to 3.8 minutes on average.

Sean Norris described a high-compliance strategy at Standard Chartered Bank, a 165-year-old bank with more than 80,000 employees in over 60 countries. Everything was optimized for compliance. They used self-service to standardize operations across thousands of operations people and bake in compliance. Instead of every production change requiring manager review, self-service with built-in compliance produced more than 13,000 privileged-environment operations tasks in 12 months that did not require review, creating huge time savings.

Damon asks for help documenting the self-service operations design pattern. Rundeck has written a book available at rundeck.com/selfservice. He wants reviews, design patterns, ideas people have seen, and company stories, with the goal of creating an industry-agnostic view of how to empower operations, relieve pressure, and let operations get to the work that improves the business.

The recap: do not forget about operations. Challenge conventional wisdom. Deployment is not the goal. There is a lot to life after deployment. Understand the forces undermining operations work, internalize them, socialize them, and teach the organization to spot them. Shift left control and decision-making; push the ability to take action closest to where the context is. Learn from SRE, especially toil limits and error budgets. Focus on removing silos and queues, and leverage the self-service operations design pattern wherever possible to remove handoff points. Damon closes by pointing people to his pinned slides, his email or open DMs, and the Self-Service Operations book.