Shifting Left on Production Excellence with Observability

US 2021

Shelby Spees

Site Reliability Engineer · Equinix Metal

Liz Fong-Jones

Principal Developer Advocate · Honeycomb.io

As DevOps leaders, we want to empower developers to own their code in production. Production ownership is the best way for developers to deepen their skills and deliver business value, but many leaders hesitate because it involves on-call and firefighting. How do we set up our teams for success and not burnout?

Join Shelby Spees and Liz Fong-Jones as they share how Honeycomb established and evolved a culture of production excellence—maintaining reliability and scalability while shipping features at the rate of a company ten times the size. They will detail how the Honeycomb team uses Service Level Objectives to measure how their services are doing and which users are having a bad time. And they will walk through how the team combines observability with progressive delivery to create tight feedback loops and guardrails for experimentation.

Attendees will walk away with strategies for establishing these practices across their organizations with strong buy-in at the practitioner level. Cultural change isn’t just top-down or bottom-up, it needs support at every level.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

As you have undoubtedly noticed, one of the prominent themes at this conference is next-generation infrastructure and operations. We already heard from the team from Vanguard and Comcast, as well as the team from Google SRE.

Anyone who has even a remote interest in SRE, observability, or the technologies that enable it is likely familiar with the work of Liz Fong-Jones. I first got to meet her when she was staff site reliability engineer at Google, and she shared some of those experiences at DevOps Enterprise in 2018. She is currently a principal developer advocate at Honeycomb, and I'm so delighted that she will be presenting with Shelby Spees, who is now a site reliability engineer at Equinix Metal.

They will be diving deep into what observability is, what its goals and underlying principles are, and some fantastic practices that anyone who cares about delivering reliable services will want to know about, including how to make modern production environments something that we actually want to live in.

Here's Shelby and Liz.

Shelby Spees

Thank you, Gene. Welcome to Shifting Left on Production Excellence with Observability. We're here to talk about how observability enables production ownership sustainably at scale, so that your engineering org can support your business's need to move quickly in the market without forcing your software teams into this shitty, abusive relationship to their systems.

If your business wants to move faster, you need to be able to respond quickly to changes in the market as well as ever-evolving security and data privacy requirements. That velocity requires tight feedback loops, because it doesn't matter how fast you're going if you're pointed in the wrong direction. Like an airplane, you need to keep adjusting your heading, and in response to changes in your environment, you need to be able to adapt.

This feedback loop we're here to talk about today is production. We need to shift production left. Breaking down that wall empowers developers to learn from production and build better software, and it also positions them to identify how that software can better support your business's needs.

I'm Shelby Spees. I recently joined Equinix Metal as a site reliability engineer, and before that, I worked with Liz Fong-Jones as a developer advocate at Honeycomb.

Liz Fong-Jones

And I used to work with Shelby when we were both developer advocates. And we've utilized our many years of expertise as site reliability engineers to figure out what the key lessons are that you need to learn when you're thinking about this journey towards shifting your production environment left.

Shelby Spees

As an industry, we've already come a long way. You all deserve credit for that. You've probably adopted DevOps strategies like continuous integration and infrastructure as code in the parts of your org where it's feasible and where it makes sense. That helps a lot with the work that used to be manual and error-prone, and with catching things like build issues long before they have a chance to hit production.

We've already gained so much, but there's still more we can do. So what's the next step? Shifting left.

One question that frequently comes up is whether developers should be on call. It's true that if we want our developers to benefit from the feedback loop of production, putting teams on call for the software they write is very effective, but it's critical that we're not setting our people up for burnout.

The thing is, production has become so complex. This is especially true in large enterprise organizations where you might have some parts of your org moving to the cloud, adopting progressive delivery, and implementing all kinds of modern deployment practices, while other teams are holding it down on legacy systems and managing integrations with newer components. There's a huge range of technologies and skill sets.

All of this makes production really intimidating. On top of that, traditional monitoring tools are inscrutable. They're speaking the language of individual hosts, but they can't tell you where in the application things are going wrong. This is true even for seasoned ops engineers who often have to make decisions based on correlations between a blip in the graph and whatever they can find in the logs around that timestamp.

That expertise is valuable, but it only takes us so far because prod is always encountering new issues. We've done such a good job solving for so many known failure modes that what remains are these novel emergent failure modes where there's rarely a singular root cause. It's not just the latest change, but it's often that latest change interacting with some change from two months ago or two years ago, and maybe it only appears on certain kinds of traffic. Or another example, there's some external dependency that's changed and now the ground has shifted beneath us. We have to scramble to update our stuff in response.

Stripe's 2018 report found that developers are spending 42% of their time on bad code and tech debt. Our teams are always fighting fires. We're constantly in this reactive state, and so we can't make forward progress. We can't invest in improving systemic issues. When something needs to be upgraded or migrated to a new technology, it takes forever, which really hurts trust between software teams and business stakeholders.

Meanwhile, our heroes are exhausted. There's a few go-to people we rely on to keep things running, but they're too busy holding things together. They don't have time for knowledge transfer. So we maintain this low bus factor. It makes our sociotechnical systems fragile, and people get so burnt out that they end up leaving their jobs, even jobs they loved. It's a huge waste all around.

So while we've made amazing progress in the last decade or so of the DevOps movement, we shouldn't stop here. What we need is production excellence.

All teams need production excellence. Those of us here at DevOps Enterprise Summit have benefited from all these improvements, so as we continue to level up our organizations, we have a responsibility to pay it forward so that every software team can have a healthy relationship to their systems in production.

Production excellence isn't just about what technologies you're using. Buying the alphabet soup doesn't guarantee better outcomes because you can't buy DevOps. Rather, once you have better feedback loops, then you can start making data-driven decisions about, for example, whether adopting Kubernetes can actually help your teams improve future velocity and maintain reliability. Is it worth the complexity?

Production excellence isn't about how many nines you have. It's about investing in people, culture, and process, because the people building and operating your software systems, they're the lifeblood of your organization. It's a sociotechnical system where the tools are there to enable and empower your people to apply their expertise.

This is why production excellence is business excellence. You want to invest engineering effort where it'll have the greatest impact for the business, not just this next quarter, but for the long haul. This is your North Star, your guiding principle. So what's the first step? Start with observability.

Let's level set a little bit. Observability, it's the ability to inspect and understand a system's internal state using the telemetry data it's already generating, even if that internal state is something you didn't anticipate. For example, this isn't about going in and flipping your log level to debug in production. Debug logs now don't help you for the incident that started 30 minutes ago and then inexplicably resolved itself.

Also, observability isn't about going in and adding new timers around the blocks of code you think might be introducing latency and then deploying that change. You don't want a deploy between asking a question and getting an answer. It's not that debug logs or adding timers aren't important or valuable, it's that they just don't give you observability into what's happening right now.

Observability is for these hard-to-debug problems, these novel emergent failure modes that you can't possibly predict in advance. You can't know what dimensions, what attributes are going to be important someday. But that sort of data is prohibitively expensive to store as tags on time-series metrics and prohibitively difficult to parse and query with traditional log aggregation.

Observability data gives us the answers for distributed systems. We have traces. Identifying bottlenecks is a lot easier when every single event tracks its duration. Observability means making it cheap to capture lots of dimensions at write time, and then we can slice and dice it and filter it down at query time. And while we're capturing a much richer picture, all of it is still read-only.

Here's an example event. It includes the sort of data that we often see in flat logs, but parsing those logs is finicky and expensive. Labeling our attributes makes querying much easier. On top of that, we can link events together into a trace. Each event has an ID, and it can point to another event ID to say, "Hey, that other event called me. That's my parent." It's a directed graph.
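To make the parent-pointer idea concrete, here is a minimal sketch in Python. The field names (`trace_id`, `span_id`, `parent_id`) and the sample values are assumptions chosen for illustration; they are not taken from the slide or from any particular vendor's schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    # Illustrative field names, not a specific vendor's schema.
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root event of the trace
    name: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def children_by_parent(events):
    """Index events by the ID of the event that called them."""
    index = {}
    for event in events:
        index.setdefault(event.parent_id, []).append(event)
    return index


def render(events):
    """Return the trace as indented lines, root first."""
    index = children_by_parent(events)
    lines = []

    def walk(parent_id, depth):
        for event in index.get(parent_id, []):
            lines.append(f"{'  ' * depth}{event.name} ({event.duration_ms} ms)")
            walk(event.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical events from a single checkout request.
events = [
    Event("t1", "a", None, "POST /checkout", 120.0, {"build_id": "1234"}),
    Event("t1", "b", "a", "charge_card", 90.0, {"payment_processor": "examplepay"}),
    Event("t1", "c", "a", "render_receipt", 12.0),
    Event("t1", "d", "b", "fraud_check", 30.0),
]
print("\n".join(render(events)))
```

Because every event carries its own duration, walking the graph from the root shows at a glance which child accounts for most of the parent's time.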

Now, at runtime, you can capture data from across all parts of the stack, from the build ID to the user agent to which payment processor was used in that particular checkout transaction.

All of this data is worth capturing because technical decisions are business decisions. Let me turn it over to Liz to talk about making the business case.

Liz Fong-Jones

Thank you, Shelby. So you can capture all of this data, but does it really matter unless it's having an impact on the business? Let's talk about how you translate capturing observability data into actionable business insights.

So how do we figure out what the impact of observability data is? Well, we need to think about where our business needs us to invest our effort. How do we drive our business forward?

So often when we're making decisions as engineers, we're trying to do things like increase scalability, or pay down technical debt, or introduce new features. So we need to have a mechanism for deciding when we should speed up and when we should slow down and focus on the fundamentals.

Service level objectives, which are a concept from the discipline of site reliability engineering, can really help us get on the same page about what level of reliability we're targeting and whether we're achieving the results that we want. They're a way for us to describe in a common language between business, engineering, and customer stakeholders what success means for our customers and help us measure it with the telemetry data coming out of our systems for the entire life cycle of a customer journey.

So there are a couple of books that I would recommend, specifically the Site Reliability Engineering series by O'Reilly, as well as a Service Level Objective book by Alex Hidalgo.

But let's level set briefly about what a service level indicator is. A service level indicator is a mechanism of encapsulating all of our critical user journeys, things that have customer impact, things like homepage loads, API calls, or user queries. And the good news is that if you've invested in observability as a foundation, you already have that rich data about customer workflows being captured inside of your application and emitted as telemetry data.

So our service level indicator transforms the flow of events coming into our system into a categorization of those events, from good events to bad events. We're able to set thresholds to say things like, "The homepage is expected to load within 500 milliseconds." And then for each homepage load that's executed against our service, we can determine whether it met that threshold, whether it was successful and fast enough.
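A minimal sketch of that categorization, using the 500-millisecond homepage threshold from the example. Treating any 5xx status as a failure and the `status_code` / `duration_ms` field names are assumptions made for this illustration.

```python
THRESHOLD_MS = 500  # homepage loads are expected to finish within 500 ms


def is_good(event, threshold_ms=THRESHOLD_MS):
    """An event is good only if it both succeeded and was fast enough."""
    succeeded = event["status_code"] < 500  # assumption: 5xx means failure
    fast_enough = event["duration_ms"] <= threshold_ms
    return succeeded and fast_enough


# Hypothetical homepage load events.
homepage_loads = [
    {"status_code": 200, "duration_ms": 120},
    {"status_code": 200, "duration_ms": 480},
    {"status_code": 200, "duration_ms": 750},  # succeeded, but too slow
    {"status_code": 503, "duration_ms": 30},   # fast, but failed
]

good = sum(is_good(event) for event in homepage_loads)
bad = len(homepage_loads) - good
print(f"good={good} bad={bad}")
```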

And then we can broaden the view from a service level indicator to a service level objective and zoom out and say, "I want 99.9% of events over the past 60 days to succeed," for instance. And then we can compare our actual data against that target and see how we're doing. And that's the invariant that we're trying to maintain of our systems, because it doesn't matter how many features we ship if our customers cannot trust in the reliability of our service.

So we might want to have views such as the historical SLO compliance. How are we doing on a rolling 30-day basis? A week ago, how was the performance of day 37 to day 7? This enables us to understand what's the long-term trajectory of our service and give us guidelines as to where we should invest our time.
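One way to picture the rolling view is to compute compliance for the 30-day window ending today and for the same-length window ending a week earlier. The daily counts below are invented, and `window_compliance` is a hypothetical helper, not a real tool's API.

```python
SLO_TARGET = 0.999


def window_compliance(daily, end_day, window_days=30):
    """Fraction of good events in the window_days days ending at end_day.

    daily maps a day index to a (good_events, total_events) pair.
    """
    good = 0
    total = 0
    for day in range(end_day - window_days + 1, end_day + 1):
        day_good, day_total = daily.get(day, (0, 0))
        good += day_good
        total += day_total
    return good / total if total else 1.0


# Invented data: 37 days of 1,000 events each, with two bad days.
daily = {day: (1000, 1000) for day in range(1, 38)}
daily[5] = (990, 1000)   # small incident early on
daily[35] = (950, 1000)  # larger incident two days ago

today = window_compliance(daily, end_day=37)     # days 8 through 37
week_ago = window_compliance(daily, end_day=30)  # days 1 through 30
print(f"today: {today:.4%}, a week ago: {week_ago:.4%}")
```

In this made-up series the service met the target a week ago and misses it today, which is exactly the kind of trajectory the historical view is meant to surface.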

Now, some of you may ask, why not 100%? The answer is that we need to keep our users just happy enough. It is true that you can invest in infinite nines, but if you do so, you're trading away your ability to do any kind of feature development, and you're trying to deliver so much reliability that external factors will be the primary driver of the level of reliability your customers experience, not the investment that you've made into your service.

For instance, many of us access services via our cell phones, and your cell phone is only about 99.9% reliable. So why build a service that is 99.9999% reliable if customers will never experience it, and all that effort happened for naught? Most services are not necessarily life or death.

So if your service is not life or death, it is okay for users occasionally to have to press the reload button. That's the price of progress and of having your service be sustainable and affordable.

And we can use the difference between 100% and your service level objective as a guideline to the amount of allowed unavailability that we can have. This is the idea of an error budget. So for instance, that 99.9% service level objective, that means that we are allowed to have one in 1,000 requests fail. So if we're serving a million requests per month, 1,000 of them can fail, either for being too slow or for returning an error, and that's okay.
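The arithmetic above, written out as a small sketch. The function names and the figure of 400 failures so far are made up for illustration.

```python
def error_budget(slo_target, total_events):
    """Number of events allowed to fail while still meeting the SLO."""
    return round((1 - slo_target) * total_events)


def budget_remaining(slo_target, total_events, failed_events):
    """Allowed failures left; a negative value means the budget is overspent."""
    return error_budget(slo_target, total_events) - failed_events


budget = error_budget(0.999, 1_000_000)
print(budget)  # 1000: one in 1,000 of a million requests

# Hypothetical: 400 requests have failed so far this month.
print(budget_remaining(0.999, 1_000_000, failed_events=400))  # 600
```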

We just need to make sure we're managing the rate of burn to make sure that we're not spending it all on one big outage, or worse yet, blowing it entirely. So this is the idea of the error budget burndown.

And this helps us really think about when is it okay to take risks. If we've barely touched our error budget for the month, then we can release the chaos monkeys. On the other hand, if we overspend our error budget, that's a sign to us that we need to slow down.

There's no point in hanging onto extra error budget. It's like keeping extra cash under your mattress that you could be investing in your innovation and in your business. It's a missed opportunity.

But let's suppose that you've had a really rough couple of weeks. In that case, instead of launching brand-new services, we need to invest in reliability and pay down our technical debt, and then we'll see better SLO performance over the following month after that.

So in order to proactively manage our service level objectives and our error budget, we need to think about the idea of burndown alerts. We need to predict, based on the most recent few hours of data, whether we're going to run out of error budget in the next four hours. And if so, I need to wake someone up, because otherwise, without intervention, I'm going to end up with unhappy customers.

On the flip side, taking this mentality means that we no longer have to treat every single little problem as a life-ending emergency. If I'm going to run out of error budget in 72 hours, this is a problem that can wait until the next daylight hour, certainly, if not the next working day.
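A simplified sketch of a burndown alert, assuming a straight-line extrapolation of the recent failure rate. The 4-hour and 72-hour thresholds come from the examples above; the linear forecast and the "page" / "ticket" labels are assumptions, and production alerting usually uses more careful windowing.

```python
def hours_until_exhausted(budget_remaining, recent_failures, lookback_hours):
    """Extrapolate the recent failure rate to estimate when the budget hits zero."""
    burn_per_hour = recent_failures / lookback_hours
    if burn_per_hour == 0:
        return float("inf")
    return budget_remaining / burn_per_hour


def alert_action(hours_left, page_within=4, ticket_within=72):
    """Wake someone up only when exhaustion is imminent."""
    if hours_left <= page_within:
        return "page"    # needs intervention now
    if hours_left <= ticket_within:
        return "ticket"  # can wait for daylight or the next working day
    return "none"


# Hypothetical: 300 allowed failures left, measured over the last 2 hours.
fast_burn = hours_until_exhausted(300, recent_failures=200, lookback_hours=2)
slow_burn = hours_until_exhausted(300, recent_failures=10, lookback_hours=2)
print(fast_burn, alert_action(fast_burn))  # 3.0 page
print(slow_burn, alert_action(slow_burn))  # 60.0 ticket
```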

By switching to alerting on actual symptoms of user pain from our error budget burndown, instead of alerting every time a CPU flaps above 90%, this really helps us have much more tolerable lives as developers. And that enables us to feel comfortable taking on the pager rather than living in fear of it.

But let's talk for a second about how we actually take this idea of service level objectives and how we take this idea of observability, and how we manage to use that investment in that error budget to get faster feedback cycles so that we can get higher velocity and higher reliability at the same time.

So this is where we get to the idea of observability-driven development, and something that we weave into our development flow as a whole rather than just bolting it on at the end. And this is why we talk about the idea of shift-left observability.

So at Honeycomb, where I work, we practice continuous delivery, and we've gotten exceptionally good at it. We deliver code to production a dozen times or more per day. And there are many steps in this, but I want to focus on just three specific elements that I think are most key to our success.

First of all, we instrument our code as we write that code. For each change, we ask, how is this going to behave in production? How will I know that it's working? Just like I would not commit code without unit tests, I also do not commit code without instrumentation that helps me understand its production performance and behavior.
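As an illustration of instrumenting code while writing it, here is a standard-library-only sketch. The `instrumented` helper, the `EMITTED` list standing in for a telemetry exporter, and the `apply_discount` function are all hypothetical; in practice a library such as OpenTelemetry would play this role.

```python
import time
from contextlib import contextmanager

EMITTED = []  # stand-in for a telemetry exporter


@contextmanager
def instrumented(name, **attributes):
    """Emit one event per unit of work, with its duration and any error."""
    event = {"name": name, **attributes}
    start = time.perf_counter()
    try:
        yield event
        event["error"] = None
    except Exception as exc:
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = (time.perf_counter() - start) * 1000
        EMITTED.append(event)


def apply_discount(cart_total, code):
    # The instrumentation is written alongside the feature, like a unit test.
    with instrumented("apply_discount", discount_code=code) as event:
        rate = {"SAVE10": 0.10}.get(code, 0.0)
        event["discount_applied"] = rate > 0
        return cart_total * (1 - rate)


apply_discount(100, "SAVE10")
apply_discount(100, "BOGUS")
for event in EMITTED:
    print(event["name"], event["discount_code"], event["discount_applied"])
```

The `discount_applied` attribute answers "how will I know that it's working?" directly: once the change ships, you can count how often a code is actually honored.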

And then once I have that code written, and once it's appropriately instrumented, I ensure really fast feedback loops. We use CircleCI to build on every commit on every branch. And then we also think about being able to deploy to production on demand because if we keep the build in a green state, if we keep it so that every build that is built from the main branch is in a releasable state, that keeps us nimble and that keeps us on the ball rather than batching up commits 100 at a time to release into production.

But the most important thing is actually closing that feedback loop. If I wrote the telemetry as I went along, then it's important for me to look at that telemetry when my code reaches production. Because it's a lot easier to find a critical problem right after it's reached production, maybe an hour after I wrote the code. That state is fresh in my head, and that helps me debug it. Whereas if it waited hours or days until I actually released it, and if I weren't looking at it as it went out, I might have no idea what was going on or what I was thinking when I wrote the code.

So by observing the behavior of our changes in production, we're able to verify and validate that they work the way that we expect according to what we engineered into the code and according to what our users are actually doing with it.

So we use telemetry data about the production data that our customers are sending to us in order to verify not just correctness, but also things like usage patterns. Are people making use of the new feature that we added? Because if they didn't, then all of our effort was for naught. We have to really think about are we designing features that delight users, and are there things that we can do to decrease the risk in case there is a problem?

So this means that we often do things like A/B testing, things that are a little bit more experimental, so that instead of delivering large units of work more slowly, we deliver small units of work quickly, knowing that we can always roll back if there's a problem with our experiment.
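One common way to run this kind of experiment is a percentage rollout with deterministic bucketing, sketched below. This is a generic technique shown for illustration, not a description of Honeycomb's own tooling; the experiment name and user IDs are invented.

```python
import hashlib


def bucket(user_id, experiment):
    """Map a user to a stable bucket from 0 to 99 for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100


def variant(user_id, experiment, rollout_percent):
    """Serve the new code path to roughly rollout_percent of users."""
    return "new" if bucket(user_id, experiment) < rollout_percent else "old"


users = [f"user-{n}" for n in range(1000)]
at_10 = {u for u in users if variant(u, "new-query-engine", 10) == "new"}
at_50 = {u for u in users if variant(u, "new-query-engine", 50) == "new"}
print(len(at_10), len(at_50))

# Rolling back is a configuration change: set the percentage to 0.
rolled_back = {u for u in users if variant(u, "new-query-engine", 0) == "new"}
print(len(rolled_back))  # 0
```

Because the bucket is derived from a hash rather than chosen at random per request, each user sees a consistent variant, and widening the rollout only adds users to the "new" group. Recording the variant on each telemetry event would then let you compare the two groups directly.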

So Honeycomb managed to do this with 50 developers, but how do you make this practice of production excellence scale for an enterprise company with hundreds or thousands of developers? The answer is to invest in the right places in rolling out the culture. Your systems and teams are unique, so you're going to want to pick the most fertile place to invest rather than trying to do it by fiat all over the board.

So that's why you need to think about finding the right teams to start with. And the first thing you're going to want to do with one of those teams is to use automatic instrumentation to generate the data that those teams are going to need.

The second ingredient that we feel is really critical is decreasing that feedback time to make sure that when you are developing code, that you're getting it tested as quickly as possible, that you're rolling it to production as quickly as possible. And therefore, we think it's important to instrument your builds so that you know what is stopping you from rolling out software every 10 minutes, every 15 minutes, rather than waiting hours or days for a build to be complete and graduated to production.
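A small sketch of what instrumenting a build can reveal, assuming the steps run one after another and that the goal is a 15-minute pipeline. The step names and durations are invented.

```python
# Hypothetical build steps with durations in seconds, run sequentially.
steps = [
    ("checkout", 20),
    ("install_deps", 180),
    ("unit_tests", 420),
    ("build_image", 240),
    ("deploy", 90),
]


def build_report(steps, target_seconds=15 * 60):
    """Summarize total build time and the step that dominates it."""
    total = sum(duration for _, duration in steps)
    slowest_name, slowest_duration = max(steps, key=lambda step: step[1])
    return {
        "total_seconds": total,
        "over_target": total > target_seconds,
        "slowest_step": slowest_name,
        "slowest_share": slowest_duration / total,
    }


report = build_report(steps)
print(report)
```

With this data the build misses the target by under a minute, and the report points at the test step as the first place to look.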

Finally, we think that it's really important to have executive-level support, to have both a champion and a sponsor. You want to have someone with a strong sense of ownership, someone who really lives and breathes the DevOps mindset of kind of shifting that knowledge left, of sharing everything that they learn and leveling up the people around them. You want to help those people develop their observability skills and really focus on upleveling the rest of the organization around them. And you're going to need an executive sponsor who is willing to prioritize the investments in making those development teams able to move faster with the power of observability.

So some don'ts. Don't just treat it as one and done. You want to think about instead iterating over time. So for instance, maybe you don't pick the right service level objective to start with. That's okay. You can always go back and change that target to something that's more realistic. Similarly, it's okay to start off capturing a subset of the data that you need as long as you feel like you're empowered to move forward and add additional instrumentation and telemetry.

Next, you're going to want to make sure that you're able to scale things up to your organization. What works on one pilot team, you need to be able to roll out across dozens of teams. How do you do that? Well, you need to prioritize the developer experience. For instance, if you have central libraries that every team uses, adding OpenTelemetry to those libraries will give you automatic instrumentation across all of your teams.

You need to make sure that people are onboarded and given help as you start giving teams responsibility for their own pagers. You want to make sure that you have config as code set up so that people are able to get the right set of dashboards and get the right set of graphs, and also have the ability to dig in and evolve beyond the canned graphs.

So in order to substantiate this idea that organizations that are much larger than 50 people can do this, I wanted to talk specifically about Vanguard. And you'll be hearing more from Christina Yakomin from Vanguard later in this conference.

Observability at Vanguard really succeeded because of these four key elements. They had an observability champion in the form of Rich Anakor and in the form of Christina. They prioritized knowledge transfer and sharing among their teams. They adopted OpenTelemetry early on, and they adopted and went all in on service level objectives to help free up their teams' time.

They turned off a lot of old-school alerts that were too noisy and replaced them with SLOs. And replacing those noisy alerts with SLOs enabled their teams to have confidence in the changes that they were shipping and to spend less time firefighting.

So it's one thing that Vanguard was able to do this successfully, and that's in no small part because of Christina and Rich. But how do we deal with the fact that there are tens of millions of software development positions that are going to be open over the next decade at large enterprises? How do we solve this? Well, we can clone Christina, right?

We can't clone Christina, and we can't clone the original contingent of SREs and ops engineers who've been doing all this work. So let's talk about the next generation of developers. More and more software developers are doing remote CS programs like what I did, or they're graduating from boot camps, or they're self-taught, so they don't have the chance to learn operational skills the way the last generation of SREs got to learn. We have to grow them, and so we have this opportunity to make it a much better experience for this new generation. With production excellence, they don't have to burn out. They don't have to live in on-call hell.

So that's what we have prepared for you today: how you can adopt the production excellence and observability-driven development mindsets in order to shift production left in your organizations. And now I'd like to invite Shelby to share with you how you can help and how you can find us.

Shelby Spees

Thanks, Liz. We're really, really lucky to have a super active observability community. You can find us in the CNCF Slack in the TAG Observability channel, and OpenTelemetry contributors are very active in the CNCF OpenTelemetry channel.

There's also a new community growing around OpenSLO. Check out openslo.com, where you can find a link to join the Slack. Finally, we'd love to hear from you on Twitter about your stories of developing production excellence. Thank you so much. Thank you for joining us. Have a great conference, and we'll see you on the Slack.