Runbook Automation: Old News or a Key to Unlock Performance?
Damon Edwards is a Co-Founder of Rundeck, Inc., the makers of Rundeck, the popular open source Self-Service Operations platform.
Damon was previously a Managing Partner at DTO Solutions, a DevOps and IT Operations improvement consultancy. Damon has spent the past years working with both the technology and business ends of IT operations and is noted for being a leader in porting cutting-edge DevOps and SRE techniques to large enterprise organizations.
Damon is also a frequent conference speaker and writer who focuses on DevOps, SRE, and IT operations improvement topics. Damon is active in the international DevOps community, including being a co-host of the DevOps Cafe podcast, an early core organizer of the DevOps Days conference series, and a content chair for Gene Kim’s DevOps Enterprise Summit.
Full transcript
The complete talk, organized by section.
Host Intro (Gene Kim)
Many people ask me, Gene, how exactly did you get into the DevOps movement? The answer I usually give is that it was a natural continuation of my 21-year journey studying high-performing technology organizations, which drew me to the center of the DevOps movement. But if you kept asking how exactly I stumbled into the DevOps movement, eventually you would hear me say that in 2010, when I was at Tripwire, I got an email out of the blue from someone I had never heard of, inviting me to be on a panel at an event I had never heard of.
So I went to it, and I was blown away by what I saw. There I met this amazing group of mavericks at the epicenter of the DevOps movement, including John Allspaw, who you heard from yesterday, John Willis, Patrick Debois, Andrew Shafer, and so many more.
That event was DevOpsDays 2010, the first DevOpsDays in the US, created by Patrick Debois. The person who emailed me was Damon Edwards, who was one of the conference organizers along with John Willis.
Over the years, Damon has been one of my favorite people to collaborate with. He has been a part of the program committee from the very beginning, and he helps shape all the next-generation ops and infrastructure programming talks. Damon will talk about the gap that still remains for operations, despite SRE, despite platforms, deployment automation, and the technology du jour.
Damon is a co-founder of Rundeck, and congratulations to Damon and team for their recent acquisition by PagerDuty. Here is Damon.
Damon Edwards
Thanks, Gene. I appreciate the kind words. It is great to be here today, even in this virtual Las Vegas in these unprecedented times. It is going to be great next year to be able to see all of you again in person. As Gene said, my name is Damon Edwards. You might know me from my work at Rundeck, where I was one of the co-founders. But as of a couple of weeks ago, big news, we are now part of PagerDuty. Rundeck has been acquired by PagerDuty. It is great to be joining forces. But enough about me.
Let us get started. So who knows what this is? Up, up, down, down, left, right, left, right, B, A, start. If you played video games in the late 80s or early 90s, your thumbs might have been twitching when I said that. Of course, this is the famous Konami cheat code. A whole lot of games unlocked extra lives, extra power-ups, slowed things down, sped things up, or gave some other advantage based on the system you were playing. That is what cheat codes are: unlocks. They unlock capabilities that make it easier to overcome obstacles in the system you are playing, moving toward whatever the goal or quest is for that video game.
You might be asking what this has to do with the DevOps conference. I started thinking recently, as our world has changed around us, about what brings us here. Of course there is the fellowship and camaraderie and often the commiseration. But I think there is more to it. At the base of it, I believe it is about these cheat codes, these unlocks: the design patterns, principles, and techniques we can learn from each other and then apply to the systems we work in to help ourselves, our colleagues, and our companies overcome obstacles, make barriers lower or farther away, and improve our overall performance.
I started to think about these cheat codes and what the next unlock is: the next areas we can focus on to unlock the most value. For me, one thesis of this talk is that the next great unlocks are going to come from operations. If you think about operations activity, not just what happens inside the four walls of the operations organization but operations activity wherever it may lie, there are two main parts that are distinctly operations. One is incident management: spotting and resolving problems. The other is service requests: how we take business requests or requests from colleagues and handle them as quickly as possible.
I want to focus on incident management, because that is where the rubber meets the road in terms of operations capability: the ability to spot and resolve problems as quickly as possible. What do we all want out of incident management? This has not changed for decades. We want shorter incidents. We want to solve problems for customers or users as fast as possible. We also want to do it with as little disruption to the rest of the organization as possible. Shorter incidents and fewer escalations are what we are after.
What has always gotten in the way? We have learned that what gets in the way is complexity. I do not mean complicated. A car engine is complicated. Complexity is the randomness and unpredictability of something like traffic in a city. As our systems have become more complex, dealing with that complexity has prevented us from having the shorter incidents and fewer escalations we have always wanted.
J. Paul Reed and John Allspaw have done a great job bringing the broader world of safety sciences, complex systems, and why problems happen and often do not happen into our domain. I have been lucky to work with them over the past few years in the community sense, and they have beaten into my head the idea that our world is complex and not deterministic. The difference between seeing what we do as deterministic and predictable, or seeing it as living in the randomness and unpredictability of complex systems, is often at the root of the DevOps conflicts we have gathered to solve.
From the development side of the house, you are trained to think from a much more deterministic point of view. You write code; it either builds or it does not, it either runs or it does not. It is a binary activity: inputs and outputs. If something goes wrong, you can put your finger on what it was. As we move into distributed systems, we often carry that same deterministic point of view and think it is a broader collection of deterministic pieces where we can predict inputs and outputs, version things, know what version we are on, and move in an orderly and predictable way.
If you come at this from the operations side, things do not look the same. Damon shows a Cornell visualization tool for microservices architectures: a modern, mid-size public SaaS where tiny gray labels represent service instances and blue lines represent communication between them. Aside from looking suspiciously like a 1990s data center wiring closet, it shows the world is not so orderly. On the left we imagine a technical system; on the right we realize it is a sociotechnical system, with activity and uncertainty we cannot control: network traffic changes, API performance changes, library updates, configuration changes, cloud providers, hardware variation, and constant tailoring for performance or business fit. It is never one size fits all.
Richard Cook calls the left side the system as imagined: the predictable, deterministic point of view our brains want to hold. The system as found is complex, more random, and more unpredictable. We can reason about it and make assumptions, but we can never perfectly predict what is happening.
Cook also points out the unique role of humans in these systems. Humans fix the system, and humans also cause problems in the system. When you watch people working in systems as found, they are monitoring, looking for signals; responding, mobilizing to make sense of what they are seeing; adapting, tailoring the system to make behavior match what is needed; and learning, using feedback loops and understanding what just happened.
Automation alone cannot do responding, adapting, and learning. Those are the domain where humans are best. Automation can help, but coordinating, redirecting attention, using creativity, being surprised, and asking what someone was seeing that caused them to act are things machines are terrible at and humans are great at. Research across high-consequence domains such as medicine, nuclear power, and transportation safety reaches the same conclusion: automation works best when it supports the human operator, not when it replaces the human operator. Attempts to replace the human operator have not gone well; systems built to support the human brain and operator get better results. The same should hold for our field.
Cook calls us to develop trust in our operators. Too much design has historically gone into preventing people from doing things, taking away choices, and making systems more like black boxes, and we have not gotten the results we wanted. If we learn from other domains, we need to reveal the actual controls that are available. We need to find the right levers and knobs and make them available to humans so the human brain can do what it is best at.
Damon describes this as finding the right abstraction layer. If the abstraction goes too high, to a black-box level, bad things happen. He points to Lisanne Bainbridge's 1982 paper The Ironies of Automation, which discusses how adding more automation can create unintended consequences, especially as it trends toward black-box automation. Along with Cook's How Complex Systems Fail, these papers are decades old but feel relevant to today's distributed digital business systems.
The other side is going too low: SSH, sudo, a bag of scripts, and say a prayer. There is randomness, variability, and problems there too. It is a delicate dance to find the right abstraction layer.
In most companies, Damon says, they punt on building that abstraction, and their experts become the abstraction. Someone like Alice, a key individual contributor, knows all the scripts, tools, and commands, and knows how to target them, what order to run them in, what options to provide and when, and how to interpret the output. We hold up these experts, and they become our abstraction. Then they become the bottleneck and the silo; everyone has to push through that expert to get anything done. Adding more experts means training more master craftspeople and adding coordination issues inside the expert silo, while not solving the outside pressure problem.
The answer is to apply self-service. Damon does not mean merely letting someone run a script. He means taking the knowledge from experts' heads about invoking the right thing at the right time with the right options, guardrails, error handling, notifications, and the ability to do what the expert would do to invoke underlying tools and scripts, and abstracting that into a self-service layer. The point is not just to make experts faster, but to safely give that self-service to people outside the expert silo, whether in traditional operations or in the broader you-build-it-you-run-it world. We are not changing the underlying tools, obfuscating them, or hiding them. We are capturing the knowledge of how to invoke and use them in a safe and repeatable way.
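A minimal sketch of what such a self-service layer could look like, assuming a hypothetical restart job. The class `SelfServiceJob`, its fields, the host names, and the roles are illustrative assumptions, not Rundeck's API and not something shown in the talk. The underlying action is left unchanged; the wrapper only captures who may run it, where, how errors are handled, and who gets notified.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SelfServiceJob:
    """Hypothetical wrapper capturing how an expert would invoke a tool."""
    name: str
    action: Callable[[str], str]      # the underlying tool or script, unchanged
    allowed_targets: List[str]        # guardrail: where this job may run
    allowed_roles: List[str]          # guardrail: who may run it
    notifications: List[str] = field(default_factory=list)

    def run(self, user_role: str, target: str) -> Dict[str, str]:
        if user_role not in self.allowed_roles:
            return self._finish("denied", f"role {user_role!r} may not run {self.name}")
        if target not in self.allowed_targets:
            return self._finish("denied", f"{target!r} is not an approved target")
        try:
            output = self.action(target)
        except Exception as exc:      # error handling the expert would do by hand
            return self._finish("failed", str(exc))
        return self._finish("succeeded", output)

    def _finish(self, status: str, detail: str) -> Dict[str, str]:
        # Record a notification for every outcome, including denials.
        self.notifications.append(f"{self.name}: {status}")
        return {"status": status, "detail": detail}


def restart_service(target: str) -> str:
    # Stand-in for a real script; a real job would call the existing tool here.
    return f"restarted web service on {target}"


job = SelfServiceJob(
    name="restart-web",
    action=restart_service,
    allowed_targets=["web-01", "web-02"],
    allowed_roles=["on-call", "developer"],
)

print(job.run("developer", "web-01"))   # approved role and target
print(job.run("developer", "db-01"))    # blocked by the target guardrail
```

The design choice worth noting is that the guardrails live in the job definition, so someone outside the expert silo can run the job without needing to know which targets are safe.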
For incident management, Alice may know many ins and outs, but she cannot know everything. When an alert comes in, her options are to look in the wiki and wonder whether the information is right, from the right person, or still valid; use shared scripts or tools and wonder whether she has the right version or the right flags; or escalate and pull in as many people as possible from other parts of the organization who might know the pieces. Along the way, incidents take longer and disruptive escalations spread through the organization.
With automated self-service, Alice can respond with the same procedures her expert colleagues would use. She can ask how the network team would diagnose this, how the database team would diagnose this, what the platform team would do, and run those herself. She may also have remediation steps: restart, clear the cache, reset, roll back to a known good state, or fail over. All of that can be put in Alice's hands so she can act with much of the same expertise her expert colleagues could provide.
There are two patterns Damon sees people applying. One is the Iron Man, or iron suit, pattern: augment the human with as much ability as possible to diagnose and resolve problems. The other is the robot pattern: preprocess alerts, run diagnostics before Alice logs in, or automatically call self-service from the alert system for known recurring issues to keep those problems off Alice. Even better, Alice can hand self-service off to someone else, focus on her other work, and distribute the operational burden through the organization instead of being constantly interrupted.
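The robot pattern can be sketched as a small alert router. This is an illustration under assumptions: the alert types, the diagnostic functions, and `handle_alert` are all hypothetical names, and a real setup would call existing diagnostics from the alerting system. Known recurring issues are remediated automatically; everything else has its diagnostics run ahead of time so the results are waiting when Alice logs in.

```python
from typing import Callable, Dict, List


# Hypothetical diagnostics and remediations contributed by expert teams.
def check_disk(host: str) -> str:
    return f"disk usage checked on {host}"


def check_connections(host: str) -> str:
    return f"database connections checked on {host}"


def clear_cache(host: str) -> str:
    return f"cache cleared on {host}"


# Alert type -> diagnostics to run before a human looks at the alert.
DIAGNOSTICS: Dict[str, List[Callable[[str], str]]] = {
    "disk_full": [check_disk],
    "db_slow": [check_connections, check_disk],
}

# Known recurring issues that may be remediated without paging anyone.
AUTO_REMEDIATE: Dict[str, Callable[[str], str]] = {
    "cache_stale": clear_cache,
}


def handle_alert(alert_type: str, host: str) -> Dict[str, object]:
    """Auto-remediate known issues; otherwise pre-run diagnostics and page."""
    if alert_type in AUTO_REMEDIATE:
        return {"page_human": False, "notes": [AUTO_REMEDIATE[alert_type](host)]}
    notes = [diagnose(host) for diagnose in DIAGNOSTICS.get(alert_type, [])]
    return {"page_human": True, "notes": notes}


print(handle_alert("cache_stale", "web-01"))
print(handle_alert("db_slow", "db-01"))
```

An unknown alert type still pages a human, only with no pre-run notes, which keeps the human in the loop for anything the automation does not recognize.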
This works for service requests too. In Alice's day job, people need things, want her to do something, or need her to answer a question. Those requests go into a ticket queue, which means waiting for requesters and interruptions pushed toward Alice, keeping her from other high-value work. With self-service, Alice can let other people help themselves: set something up, change something, run a report, perform performance checks, or handle other repetitive requests safely. Alice can focus on her other work.
Self-service also enables new organizational models. You build it, you run it has a lot of promise, but in big, highly regulated and secure organizations, simply giving people access to production environments is perilous. With self-service, operations, security, and compliance can vet code and procedures, then let nearly anyone run them without direct access to the system. Everybody is happier, including compliance, and self-service can relieve headaches while enabling new organizational models.
What is the magnitude of impact? Damon notes that finance may say that operations is paid to have these headaches, so what is in it for the business? Your mileage may vary, but with self-service, people talk about 30, 40, 50, or 60 percent shorter incidents. This is not an MTTR calculation; it is more of an anecdotal learning exercise comparing past events, escalation time, difficulty getting the right people to the right place to make the right decision and do the right thing, against a self-service model that pushes control closer to people at the edge, takes out escalation and waiting, and lets a broader audience diagnose and solve problems faster.
Organizations that go this way also talk about cutting escalations in half. For repetitive problems, pushing control closer to the edge can avoid escalation chains that constantly interrupt others. For service requests, Damon asks what instant gratification is worth: perhaps a 99 percent faster turnaround than filling out a ticket, interrupting someone, waiting, doing rounds of communication, and finally getting it done. With self-service, someone can do it themselves. Some people say 15 to 20 percent of total organizational time is spent in this operational area, and that much of it can be saved by applying self-service and cutting out waste. Each organization must do its own calculation, but it can be a massive unlock.
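As a rough sketch of that calculation, the arithmetic might look like the following. Every input here is a made-up placeholder, not a figure from the talk, and the function names are hypothetical; each organization would substitute its own numbers.

```python
def incident_minutes_saved(incidents_per_month: int, avg_minutes: int, reduction_pct: int) -> float:
    """Minutes of incident time avoided per month for a given percent reduction."""
    return incidents_per_month * avg_minutes * reduction_pct / 100


def turnaround_improvement_pct(ticket_minutes: float, self_service_minutes: float) -> float:
    """Percent faster turnaround when a ticket is replaced by self-service."""
    return round(100 * (1 - self_service_minutes / ticket_minutes), 1)


# Hypothetical inputs: 20 incidents a month, 90 minutes each, 40 percent shorter.
print(incident_minutes_saved(20, 90, 40))      # 720.0

# Hypothetical inputs: a two-day ticket wait (2880 minutes) versus 5 minutes.
print(turnaround_improvement_pct(2880, 5))     # 99.8
```

With these placeholder inputs, a request that used to wait two days in a queue and now takes five minutes comes out at roughly the 99 percent improvement mentioned above.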
How is this self-service created? One key design pattern is runbook automation: taking procedures and knowledge, creating automated workflows, and building the right abstraction layers that still show visibility down to the underlying tools and scripts while containing enough knowledge and guardrails to guide people in the right direction.
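A minimal sketch of a runbook expressed as an automated workflow, assuming a hypothetical `Step` type and `run_runbook` function (neither comes from the talk or from any particular product). Each step keeps the underlying command visible in the log, and the workflow stops at the first failure so that later steps never run against a broken state. The commands below are harmless placeholders that invoke the Python interpreter.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    description: str       # what the expert would say this step is for
    command: List[str]     # the underlying tool invocation, kept visible


def run_runbook(steps: List[Step]) -> List[Dict[str, object]]:
    """Run steps in order, logging each command, and stop on the first failure."""
    log: List[Dict[str, object]] = []
    for step in steps:
        result = subprocess.run(step.command, capture_output=True, text=True)
        log.append({
            "description": step.description,
            "command": step.command,
            "exit_code": result.returncode,
            "output": result.stdout.strip(),
        })
        if result.returncode != 0:
            break          # guardrail: do not continue after a failed step
    return log


# Placeholder commands standing in for real diagnostic and remediation scripts.
runbook = [
    Step("Check service health", [sys.executable, "-c", "print('healthy')"]),
    Step("Clear cache", [sys.executable, "-c", "print('cache cleared')"]),
]

for entry in run_runbook(runbook):
    print(entry["description"], "->", entry["exit_code"], entry["output"])
```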
Runbook automation is the do side of the problem: how to take action. The other side is the view side: how to augment humans with the right information and knowledge to take action. After some fits and starts, Damon says the AIOps and observability space is the next unlock there. It provides the view. We are not only giving people the ability to take action, but the information needed to take the right action, or at least learn from their systems and reason about the next step.
That is Damon's thesis and talk: the next great unlocks are self-service operations, runbook automation, AIOps, and observability. He invites people to talk more, whether they found him convincing or not, at damon@pagerduty.com, on Twitter, or at the Rundeck virtual expo booth. Slides are available at rundeck.com/does20, with links to papers and talks that can fill in the details. He thanks the audience and hopes to see everyone in person next year.