Avoiding Goodhart’s Law: Use SLOs as Tools, Not Cudgels
The concepts of SLI, SLO and error budget are there to balance risk (rates of change) and reward (business contentment). Using such metrics as red lines to punish teams, or to force the business to accept risk, misses the point. My experience with SLAs in service contracts for hospitals informs this conversation: SLIs, SLOs and error budgets work better as a basis for conversations about the stress an application can withstand, and the measures should cover three dimensions. This session takes Goodhart’s law from economic policy as a frame for reconsidering SLIs and SLOs, and offers a few hints for approaching the negotiation meetings. Leave this session inspired to approach your SLO negotiations in the best possible way.
Full transcript
Marco Coulter
G'day. Welcome to Avoiding Goodhart's Law: Using SLOs as Tools, Not Cudgels.
The concepts of SLI, SLO, and error budget are there to balance risk and reward: risk around the acceptable rate of change, and reward being the business success and customer contentment. Using such metrics to punish teams for exceeding budgets or forcing acceptance of change within the business is a path to failure. This session is going to give you a few hints for success, and I'd like to thank my hosts at DevOps Enterprise Summit for giving me the chance to share some knowledge in this way.
First, let me introduce myself. G'day, I'm Marco. I am an ex-CTO who has worked for one of the top 50 international banks. I've supported data centers for hospitals and service providers, and worked for some of the industry's largest vendors. I've lived in three countries and managed teams across 13 countries. I also spent five years as an industry analyst running the data science team at 451 Research. Seeing technology from every side as an operator, developer, analyst, vendor, buyer, and CTO gives me a unique view on technology. You can read some of my writing or interviews in the publications on the left here, or at my own personal website, tech-whisperer.com. So enough about me.
It's good to have targets, right? Think of Robin Hood, the story where he places a child against a tree and he loads up his arrow and he aims, and then we're told of an apple as a target on the child's head. With that, the story becomes a story of a skilled archer. But without that target apple, it's just the story of a dangerous guy who's shooting arrows at children. So it's good to have targets as long as you use them correctly.
Today's session is going to come in three chapters. I will talk about how I experienced Goodhart's Law before I even knew it existed. Then we will think about the SLIs in a better way across dimensions. Finally, I will throw a few hints about negotiating your SLAs and give you some further reading.
So let's get going. I'll get to Goodhart's Law in a moment. I want to share an experience with you first. Depending on your personality, you will either pause, relax, and enjoy my story, or you might already be searching online for Goodhart on Wikipedia. And that's gaming the system.
Some folks define gaming the system as a smart play. Others equate the phrase to cheating. I guess I'm more in the second group. For me, gaming the system means taking the rules that are created to protect a system and instead manipulating the system towards a desired outcome or goal.
Here's an example. In a prior life, back in Australia, I worked for a service provider that supported all of the hospitals in a state. In hospitals, nurses take lab samples and they get sent to the labs. They are processed and the results get transmitted back to the patient record where the nurse back in the ward can immediately look them up. Pretty simple, right? By the way, the wards are often on the 16th floor somewhere while the labs are in the basement of another building on campus. It can be quite some distance between them.
Technically, it looked a little like this. Messages from the lab's Unix system would be sent to message queues, and the queues then fed the lab updates into the mainframe system that held all the patient records. Everything allegedly spoke a common HL7 standard, so there's never going to be any problems, right? Some of you already see where this is going. Different vendors had slightly different interpretations of the HL7 standard, and malformed messages would get stuck in the queue.
We would get phone calls from hospitals that they had to go to manual procedures. The backup procedure, if the message queues get stuck or if the data doesn't appear in the patient record, was for the nurses to physically run from the ward down to the labs to get results. This was not optimal as the patient's health was at risk, both from the delay and from the nurse being absent from the ward.
Being techies, we thought we would take care of the situation by setting an SLA that said if the message queues get higher than 100, we, the service provider, had to refund money back. That should address things, right? I even coded a monitor batch script so that when the queue length approached 100, alerts would start to go off. Monitor icons would turn from green to yellow to red.
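A minimal sketch of the kind of monitor described here, in Python rather than the original batch script. The limit of 100 comes from the talk; the warning level of 80 and the name `queue_status` are assumptions made for illustration.

```python
def queue_status(queue_length: int, warn_at: int = 80, breach_at: int = 100) -> str:
    """Map a message queue length to a traffic-light status.

    breach_at=100 is the SLA limit mentioned in the talk. warn_at=80 is an
    assumed early-warning level, not a value from the source.
    """
    if queue_length < 0:
        raise ValueError("queue length cannot be negative")
    if queue_length >= breach_at:
        return "red"      # queue has reached the SLA limit
    if queue_length >= warn_at:
        return "yellow"   # queue is approaching the limit
    return "green"


if __name__ == "__main__":
    for length in (0, 85, 120):
        print(length, queue_status(length))
```

Note that an empty queue reports green even when updates never reach the queue at all, which is exactly the gap the rest of the story exposes.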
As technicians, we focused on that measure as the target, as the goal. We even built capacity plans around making sure that that queue processing system got all the power it needed. You might think, great result, Marco, that's top-notch. The lab results are getting to the ward in time, right? Well, the only problem was we would still get those pesky phone calls and the nurses in the ward saying the system sucked, and nurses don't bite their tongue when they're telling you you're not helping them. They were always having to run down and manually collect results, and we looked, but the message queues were empty.
The problem was that transactions were timing out before hitting the message queue. We were managing the capacity plan, in fact, the whole application, to the metric and not the outcome. Years later, I learned that this pattern had a name, and yes, we're finally going to get to Goodhart.
Goodhart was an economist in the UK who in 1975 stated, and let me read this, "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." I know that's kind of wordy. He was English. He was a politician. Wordy is what you get. Here's what he meant. Basically, the law says that when a measure becomes a target, it ceases to be a good measure because people will game the system. Queue length was a good measure of queue function. We were managing queue length as a target for success instead of successful laboratory transactions in patient records, which is what the nurses needed.
So how do we avoid Goodhart's Law? We need to let SLIs be measures and SLOs be goals. What should we measure? The key is to stand in other people's shoes and see everything from a few different angles. Hence, three dimensions. Remember, DevOps is about balancing the risk of unavailability against rapid innovation and efficient operation. We embrace that risk by giving it a value through well-defined and governed service-level indicators, objectives, and agreements.
As you're identifying SLIs, we need measures that see the whole picture in the three key dimensions of code, infrastructure, and customer experience. Let me step it through a simple example based on that hospital environment. The example is not going to give you specific SLIs that you can apply in your environment. It's not meant to. It's intended to share the thought process.
First, let's quickly review and agree the SLI, SLO, SLA model. SLIs are numbers, and they work better when they're percentiles, so avoid averages. In mature environments, SLIs will be nested. They will combine SLIs that sit up against the code and the technology up to SLIs that sit next to the customer. They're a defined quantitative measure, a metric.
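To illustrate why percentiles tend to serve better than averages, here is a small sketch. The response times are made up, and `percentile` is a hypothetical helper using the nearest-rank method, one of several common percentile definitions.

```python
from statistics import mean
from typing import List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct% of n) in sorted order."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


# Hypothetical response times in seconds: most are quick, a few are very slow.
response_times = [1.0] * 90 + [60.0] * 10

if __name__ == "__main__":
    print("mean:", mean(response_times))
    print("p50:", percentile(response_times, 50))
    print("p95:", percentile(response_times, 95))
```

In this made-up data the mean lands between the two groups and describes nobody's actual experience, while the 50th and 95th percentiles show both the typical case and the slow tail.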
You're then going to set limits on the SLI, say an upper or maybe an upper and lower, and that's going to give you the SLO, the objective. SLOs should capture the performance and availability levels that, if barely met, would keep your typical customer happy. They're generally a target, such as failures will be less than X, or a range, such as responses will be between X and Y. Generally, I like to translate the SLOs into periodic budgets that I can track weekly. Whether weekly works for you will depend on your release cycles and so on.
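As a rough sketch of translating an SLO into a weekly budget: the 99% target and the request counts used below are hypothetical, and `weekly_error_budget` is an illustrative name rather than anything from the talk.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    allowed_failures: int  # failures the SLO tolerates for this week's traffic
    remaining: int         # budget left; negative means overspent
    exhausted: bool        # True once the budget is fully used up


def weekly_error_budget(total_requests: int, failed_requests: int,
                        slo_target: float) -> BudgetReport:
    """Turn an SLO target (e.g. 0.99 for 99% success) into a weekly failure budget."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    allowed = round(total_requests * (1.0 - slo_target))
    remaining = allowed - failed_requests
    return BudgetReport(allowed, remaining, remaining <= 0)


if __name__ == "__main__":
    # Hypothetical week: 10,000 requests, 40 failures, 99% SLO.
    print(weekly_error_budget(10_000, 40, 0.99))
```

Tracking the remaining budget, rather than a pass/fail flag, gives both sides something to talk about before the limit is hit.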
Then you hit the SLAs. They define the actions that are acceptable once the budget's used up. Defining these ahead of time is critical. One additional thing: it's not just about full outages these days. You need to focus on slowdowns as well rather than traditional uptime availability. Try to focus on the customer domain and their experience. Use successful customer requests instead of technology.
To capture the overall environment, the parent SLI supporting the CX ones should cover each of the following three dimensions. First, let's start with code. From our code, we want functional code that does not fail, and also feature additions and write-downs of technical debt. You're going to be dealing in this dimension with multiple languages, so you want to be sure that the metrics will work across them generically. Where you can, avoid metrics that only apply to a specific language. It adds too much extra handling.
Also, it's not going to be limited to applications you built in-house. Sometimes you're going to have to deal with things like SAP and the ABAP language or a SaaS application like Salesforce. Maybe what you're managing is environmental, like a Kubernetes environment or something, and then you're not just watching for code transaction errors, you're looking at configurations as well. Some of the nested SLIs around code might end up being YAML code accuracy or OpenAPI definitions.
What you want to avoid here is silos of data. You don't want the business team working off one source of metrics while the development team works off a different source. You need a single source of truth there as part of deciding what you will base the SLOs on. The examples will show you that.
For the SLI here, for the sample indicator, let's embed a few clarifications. Our first step would be to focus on well-formed updates, and that specifies the transaction. We want to update the patient record, and we want to acknowledge completion, and that specifies the reaction. In this case, we want to measure it somehow. We agreed on APM as the source. I'm familiar with AppDynamics, but you could use Datadog or New Relic, and because we're only concerned about code here, you might prefer some of the observability offerings like Honeycomb or Instana.
Some books recommend good-over-bad ratios for SLIs. If you can control all the code, that can work. But in this example, I am avoiding the fail ratio as too many elements are out of our control. The HL7 update transactions were coming out of purchased lab software. We had no way of fixing that code. We had to wait for patches. The queuing systems were third-party software. We couldn't tweak them to respond better to malformed entries. We had to wait for patches.
So we couldn't be certain we'd reach a point where the HL7 outputs were always well-formed coming into our code. The basic code SLI needs to be focused on well-formed updates getting to the code that we were writing and controlling.
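One possible reading of that code SLI, sketched in Python. The slide itself is not in the transcript, so the record fields, the percentage form, and the name `code_sli` are all assumptions. In practice the inputs would come from the agreed APM source rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LabUpdate:
    well_formed: bool     # message arrived at our code as valid HL7
    record_updated: bool  # the patient record was updated
    acknowledged: bool    # completion was acknowledged


def code_sli(updates: List[LabUpdate]) -> Optional[float]:
    """Percentage of well-formed updates that updated the record and acknowledged.

    Malformed messages are left out of the denominator because they originate
    in third-party software we cannot fix. Returns None if there was no
    well-formed traffic to measure.
    """
    well_formed = [u for u in updates if u.well_formed]
    if not well_formed:
        return None
    good = sum(1 for u in well_formed if u.record_updated and u.acknowledged)
    return 100.0 * good / len(well_formed)
```

Scoping the denominator to well-formed updates keeps the indicator tied to code the team actually controls. The malformed traffic still matters, but it is picked up later by the customer experience measures.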
For the code SLO objective, we apply a goal to that indicator. We're already assuming well-formed records, so we can set this fairly high. Again, we're being clear about the transaction, the reaction, and the source. It's easy to set this goal too high. People generally say we want 100%, but that means there's no experimentation, there's no innovation. It needs to be instead set just over the level where you will keep customers happy. Too high and you're wasting opportunity costs.
For the SLA here, we apply an outcome. Note that our SLA goal allows some wiggle room against the SLO. We added a time range here that it must be met over a sliding range, in this case of 28 days. The SLA should specify what happens when the SLA is missed. Does one department owe the other department a refund, or does the software release cycle get automatically frozen for the next 28 days to return to stability? See, the SLA part is the part that's negotiated.
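A sketch of checking an SLA over a sliding 28-day range. The 28 days come from the talk; the daily `(good, total)` counts, the target percentages, and the function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, Optional, Tuple


def sliding_compliance(daily: Dict[date, Tuple[int, int]], as_of: date,
                       days: int = 28) -> Optional[float]:
    """Percentage of good transactions over the `days` days ending on `as_of`, inclusive."""
    start = as_of - timedelta(days=days - 1)
    good = total = 0
    for day, (day_good, day_total) in daily.items():
        if start <= day <= as_of:
            good += day_good
            total += day_total
    if total == 0:
        return None  # nothing measured in the window
    return 100.0 * good / total


def sla_met(daily: Dict[date, Tuple[int, int]], as_of: date,
            sla_target_pct: float, days: int = 28) -> bool:
    """True if the window met the target, or if there was no traffic to judge."""
    compliance = sliding_compliance(daily, as_of, days)
    return compliance is None or compliance >= sla_target_pct
```

Because the window slides, one bad day keeps counting for 28 days and then drops out, which tends to be easier to reason about in a negotiation than a calendar-month reset.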
In a perfect world, this is defined by the business or customer. But in reality, it's going to be a negotiation between the parties, and it needs to be well defined. A lot of SLAs or error budgets are negatively stated. They punish teams for exceeding a budget by automatically stopping their next release cycle. I would also urge you to consider some positive options around the risk. For example, additional resources or funding may be released to tune the code or replace third-party systems. There might be an automatic escalation to a leader who can make decisions to reduce future risk. But the rules are clear for everyone ahead of time.
Then we have infrastructure. Here we are considering infrastructure across a few classes. First, how long does it take in the infrastructure layer? This needs to include the network. If you ask a network team, no matter what the question is, the answer will always be, the network's fine. But most tools don't watch the actual path taken by the application across the network, and that can be a significant trap.
The second class is capacity: not just performance in the moment, but capacity for the immediate future. This means thinking about systems of record like the mainframe and database, along with disks, memory, and compute. Then the third category is security, and it can be a bit painful. The more secure a system is, the less available, because when you get to 100% secure, the system is offline. We balance the risk across availability and security.
Here is another SLI for the infrastructure example. Again, we're focusing the transaction, reaction, and source of the measure. This time, the reaction is total transaction time, which reaches across databases, queues, mainframe, all of that to give the full end-to-end transaction. It leaves out the malformed messages and the failed transactions. If those things were common in the environment, you may need to address them with separate measures, or maybe you'd want to step back and include them.
People will game the system. That includes your customers. If you're going to measure towards metrics, if you're trying to use them for actions, people will tend to stop watching the outliers in the measurements and work towards the target. So for SLOs on infrastructure, you may want to express the SLO in the shape of a performance curve. Here, we expect the bulk to occur normally within 30 seconds. This was well within the capabilities of our infrastructure, but some may take longer when there's a high system load or something like that. We included a long tail in here of five minutes at the top of the curve.
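A performance-curve SLO can be sketched as a short list of checkpoints. The 30-second and five-minute thresholds are from the talk. The required fractions (90% and 100%) are assumptions, since the transcript only says "the bulk" and "a long tail".

```python
from typing import List, Tuple

# (threshold in seconds, minimum fraction of transactions finishing within it).
# The thresholds come from the talk; the fractions are assumed for illustration.
CURVE: List[Tuple[float, float]] = [(30.0, 0.90), (300.0, 1.0)]


def curve_violations(durations: List[float],
                     curve: List[Tuple[float, float]] = CURVE
                     ) -> List[Tuple[float, float, float]]:
    """Return (threshold, required, observed) for every curve point the data misses."""
    if not durations:
        return []
    misses = []
    for threshold, required in curve:
        observed = sum(1 for d in durations if d <= threshold) / len(durations)
        if observed < required:
            misses.append((threshold, required, observed))
    return misses
```

Checking several points on the curve keeps the outliers visible, so nobody can hit the 30-second number while quietly letting the tail grow.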
As we moved to the SLA negotiated with the customer, you see a big jump. We're only committing to the five-minute time with them. This actually came from their side after we had conversations with the ward nurses. We realized after the first debacle that we weren't understanding their needs. We went out to the wards, and we watched them and talked to them, and their view was different. They realized it takes time for the samples to get from the ward and be delivered to the labs. For them, the timeframe was about beating the time it took a nurse to run from the ward to the lab when the system was down. The results had better come back sooner than that, so a doctor wouldn't dispatch them. That takes about 10 minutes in this environment. When we offered them five minutes, they were really happy with that. It doesn't always have to be as fast as possible, just as fast as necessary.
When I worked for banks on stock trading systems, that was a whole different world. The processing time was their competitive differentiator for their traders, so fast as possible and never mind the cost was the approach. Very different. The dimensions here of code and infrastructure, though, that's not the full picture. I saved the best for last.
Our third dimension is the business, or if you're a nonprofit or government, then it's the customer experience. This is about the revenue or service production capabilities of the application. As we add in the business dimension, it can be difficult to measure the full experience, so you might have to get out to the customer interface, and that might require some browser integration or some mobile platform agents. For availability, or to track predictability of response times, you may even want to look at synthetic testing tools for synthetic transactions. I'm going to keep it a little simpler in our hospital example.
Here in this example with the SLI, we're looking at the doctor and nurse experience. We had built and owned the patient record application, so we knew we could add our own specific measure in there. From hanging out with the nurses in the wards, we worked out that the nurses had an instinctive expectation of when the labs would be coming back. That's when they'd start looking at the record.
If the update was not there, they'd do something else and come back in a few minutes. We worked out that repeated record lookups were our sign that we weren't meeting their instinctive expectations, and soon they would be calling us to complain. So we coded a repeat counter into the patient record application, a refresh counter.
As you look at the numbers here, why did we set beyond 10 seconds in? We had five minutes, of course, but why beyond 10 seconds? At first, we just had five minutes, but we kept missing the target because there were just one or two nurses in one or two of our hospitals who wouldn't wait at all. They'd just sit there hitting the refresh button again and again and again and again and again. So we put a beyond-10-seconds line in there to avoid those crazy impatient ones.
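A sketch of how such a refresh counter might work. The 10-second and five-minute values are from the talk, but the exact rule is an assumption: here a lookup counts as a repeat when it comes more than 10 seconds and at most five minutes after the previous lookup of the same record by the same nurse. The name `count_repeat_lookups` is hypothetical.

```python
from typing import List


def count_repeat_lookups(lookup_times: List[float], min_gap: float = 10.0,
                         window: float = 300.0) -> int:
    """Count repeat lookups for one nurse viewing one patient's record.

    lookup_times are timestamps in seconds. Gaps of min_gap or less are
    treated as impatient button-mashing and ignored. Gaps longer than
    window are treated as unrelated visits to the record.
    """
    ordered = sorted(lookup_times)
    repeats = 0
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if min_gap < gap <= window:
            repeats += 1
    return repeats
```

With this rule a nurse hammering refresh every two seconds adds nothing to the count, while one who checks back after a couple of minutes does.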
If we look at the SLO here, we're seeking a low number. You might have expected a tiny percentage, but the SLO now, because we're looking at the whole thing, includes all those malformed transactions coming out of the crappy lab system. We needed to be realistic with them. In fact, this SLO nested everything within the system. When we hit the SLA, again, we gave ourselves reaction room against the SLOs. The eight-hour timeframe here actually came from the nurses. They think in terms of their shifts, so if they saw it once within a shift, acceptable; if they saw it twice within a shift, they're phoning us.
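The shift-based rule can be sketched as a sliding eight-hour window. The eight hours and the "once is acceptable, twice is a phone call" threshold are from the talk. Using a sliding window rather than fixed shift boundaries, and the name `shift_sla_breached`, are assumptions.

```python
from typing import List


def shift_sla_breached(event_times_hours: List[float], shift_hours: float = 8.0,
                       tolerated: int = 1) -> bool:
    """True if more than `tolerated` missed-expectation events fall within
    any span shorter than shift_hours. Times are in hours."""
    ordered = sorted(event_times_hours)
    start = 0
    for end, current in enumerate(ordered):
        # Drop events that are a full shift or more older than the current one.
        while current - ordered[start] >= shift_hours:
            start += 1
        if end - start + 1 > tolerated:
            return True
    return False
```

Framing the window in the nurses' own unit of time, the shift, is what makes the agreement match how they actually experience the system.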
You've got to remember what you're working towards: it's not a metric contract. Use it as a tool, not to beat each other up, but to understand and capture assumptions. You need to consider all these three dimensions for success. The SLAs are not there to beat each other up. They're there to capture the mutual understanding. You reach that mutual understanding through negotiation. SLIs, SLOs, SLAs, error budgets, they are the tools to support negotiations.
Negotiating is a key skill for any DevOps professional. There are some great books out there, though the books sometimes can be contradictory. Getting to Yes, Getting to No, and pretty much almost all of them are aimed at sales folks. For me, I'm a win-win negotiator. I want everybody to feel like they won something, and that's not always possible. Here are a quick few thoughts based on my experiences, to save you reading all of the books that I've read.
A few quick steps. Know thyself is not a new thing. It was carved at the Temple of Apollo in Greece in the fifth century BC, so definitely not a this-century thing. Starting out is great. How much can you control? What risk can you absorb and keep your job? Is the risk spread evenly through the year, or do you have peak periods like a Black Friday or a Super Bowl or a New Year's Eve? Are you in a period of significant transformation as an enterprise? Will things be the same in 12 months? Or if you set SLAs today, will they be meaningless in 12 months, depending on the nature of the transformation?
Use all of this to gather your needs. You probably have a feeling for expectations that the business will have. What do you need to deliver that? Could you accept tougher SLAs if you could grow your team or purchase supporting tools? When consulting, I try and brainstorm this a bit to identify where my outer boundaries are. What would be unacceptable or be considered too easy?
Preparing to engage is about gathering the information and building it into a strategic model. There's a people factor here as you're gathering information as well. Gather opinions about the people you will be negotiating with. What are their goals, their attitude to risk or innovation? Even subtle things like what time of day are they more open to ideas or in a better mood? Are they happier if there's donuts in the room, or are they happier if there's coffee in the room?
If this will be you as the facilitator, then read up on this and learn it ahead of time. Facilitation's a very specific skill. Consider bringing a contractor in or an outsider. Actually, it can be great to grab a leader from a different part of the organization who you know is a natural facilitator, who can park their own ego and needs and draw input from everybody in the meeting. It can give that facilitator a career profile inside your organization, and they learn about another part of the company, so it's win-win. I told you I like win-win.
Then it's time to get negotiation on, to schedule things and set up the meeting. In a negotiating meeting, you want the warm-up. This is a discussion of the application and the dimensions and the scope of the meeting. Be brief. You don't want them to zone out. They're all experts in some aspect here or you wouldn't have them in the room. You don't want to turn over every stone yet.
Get everyone to talk. Ask them to spend one or two minutes describing their aspect of it. The nominated facilitator should politely close anyone down if they start to exceed a brief introduction. Your next step is to test drive something. Give them a shot at the indicators under consideration. You're testing the water here, so try to make it something that could live on as a final agreement, and make it something real.
Then you hit assess. Now they have something to talk about. Assess the business value. Is this the best place to start? Will pursuing this give ROI? Do you need to balance innovation and risk for this application? What actions will be effective for missed SLAs? Will you just freeze changes until you return to stability? If so, how long should that freeze be? An important part of this phase is extracting and capturing assumptions. Clarifying assumptions is why footnotes in SLAs can be longer than the actual SLA sometimes.
The next step is to propose. You've gathered information, everyone's turned over a few stones. A quick hint here: predictability is often more important than speed. The higher the variance in response times, the more the user experience is affected and the more you lose their trust. Avoid spreads greater than, say, six standard deviations. If you're more than six SDs, then you've got a low-capability process, so it's not good.
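One way to read the six-standard-deviations hint is through the process capability index from quality engineering, Cp = (upper limit - lower limit) / (6 × sigma). That reading is an assumption; the talk does not name the index. A minimal sketch with hypothetical limits:

```python
from statistics import pstdev
from typing import List


def process_capability(samples: List[float], lower_limit: float,
                       upper_limit: float) -> float:
    """Cp = (upper_limit - lower_limit) / (6 * sigma).

    A value below 1.0 means six standard deviations of the observed
    response times no longer fit inside the agreed range.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    sigma = pstdev(samples)  # population standard deviation
    if sigma == 0:
        return float("inf")  # no variation at all
    return (upper_limit - lower_limit) / (6 * sigma)
```

A result comfortably above 1.0 suggests the response times are predictable relative to the agreed range, while a result below 1.0 points to the low-capability process the hint warns about.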
You propose it, you get reactions. Now you recur, you assess the new proposal. Expect to iterate through this stage many times, because this is where the real negotiation occurs. For each meeting, you might have to schedule follow-up meetings. Make sure to take a few minutes to revisit the warm-up, restate the scope and goal, recap the conversations to date, and try to acknowledge something from everybody in the room. You want them participating and feeling respected. But don't talk long enough to zone them out.
Finally, you get to agree. This is the final presentation of a finished SLA, SLO, SLI for sign-off. It doesn't have to be a physical signature, but it's worth saying that you will need everyone to confirm by email. They need to commit on the record. If they're reluctant, you missed something during the assess phase. Return to that and try again. There's some assumption there or hindrance that they have.
Okay, so I went through that pretty quickly, but the slides kind of cover it. Here's what I want you to do. Learn from my experience. Don't manage on the metrics, focus on the outcomes, the full transaction, the complete process, the overall experience. Don't use service levels to beat each other up. Use them to become preemptive. Use them so that you can offer more services ahead of time.
When you build out your service levels, remember to assess them against the three dimensions: code, infrastructure, customer experience, CX. Are you seeing the full picture? Is some critical aspect being overlooked or assumed? One of the most critical things around customer experience, of course, is predictability. The higher the variance in response times, the more the user will lose trust in you and the application. It's an indication of a low-capability process, so don't fall into that trap. Keep your eyes on the transactions that are outliers. Outliers annoy the crap out of users, so watch those.
Finally, realize that if you want to be great at DevOps, then you need negotiation skills, and negotiation is good for life, and it's good for DevOps. As promised, here are some links to further research. This is the time to switch back to this screen and take a quick snapshot, although I'm sure the PDF will be available online. You can catch up on most of my thoughts on my website, tech-whisperer.com. If you found this session interesting, then please connect with me on LinkedIn or Twitter. I might have some more interesting thoughts tomorrow, you never know. Feedback on this presentation would be fantastic. What caught your attention? What was important that I missed so that this becomes stronger and stronger as we get more input?
With that, I want to thank Gene and Ann and the DevOps Enterprise Summit team for their support and help during this event, and particularly Ann's patience. I want to thank you for your time today, and I will see you in the chat rooms. Again, feel free to bring questions to me there. I look forward to meeting you and catching up.