Safe, Responsive, Trustworthy, Coherent: A Target Operating Model for Grainger Product Engineering
Shared models help clarify what we agree we’re trying to do. “Agile” has unfortunately become too misunderstood and jargon-y to serve this purpose. At Grainger, we’ve been exploring alternative framings that point more clearly to what we’re trying to do: “Continuous Delivery”, “Modern Software Engineering”, etc. The latest version I’m proposing is a Target Operating Model emphasizing four properties: Safe, Responsive, Trustworthy, Coherent. I’ll describe the problems this target operating model is intended to solve and the reactions it has received so far.
Full transcript
The complete talk, organized by section.
Host Intro (Gene Kim)
I met Jason Yip in 2019, back when he was at Spotify. You probably know him from the work that he did there, and maybe even before that when he was at ThoughtWorks. Since then, he has joined Grainger — a leading broadline distributor of maintenance, repair, and operating products (MRO categories) in North America. It was founded nearly a hundred years ago and now has over $15 billion in revenue.
He'll talk about his new mission at Grainger: helping with the CEO's new strategy of technology-enabled market share growth. He will talk about many things, including the target operating model, which emphasizes four key properties: safe, responsive, trustworthy, and coherent. I love how so many of these lessons are grounded in technology moving from just being a cost center to actually organizing into product teams as the company goes into growth mode. Here's Jason.
Jason Yip
I was thinking they were gonna play the song I generated for Grainger, but cool.
Hi. I'm Jason Yip. As mentioned, Grainger does — actually, I just want to check, how many people know who Grainger is? Wow, that's a lot. Almost no one ever knows. Okay — kind of an interesting thing here.
Gene mentioned broadline distributor and MRO. There's this tagline that people have heard: we keep the world working. When I think MRO, it's the stuff that keeps people working, and it's all kinds of stuff. In Las Vegas it would probably be tables, chairs, cups, et cetera. In a factory it would be nails, drills, all that kind of thing.
An interesting thing here is the revenue that was mentioned: $15.2 billion. Grainger has had 51 years of consecutive dividend increases. I mention that because the point is it was doing fine for a very long time. And 75% of the orders are now going through digital channels. This gives you a sense of how many different things are distributed by Grainger, as well as all the different industries that we're involved in.
01 Other useful background
It was founded in 1927, which means it's 97 years old, and it survived the Great Depression: they had essentially operated for two years and then got hit. So they've had practice at this. It's the leading broadline MRO distributor in the US with approximately 7% market share. I say approximately 7% because I don't know what I'm allowed to say there, but it's small, especially compared to Spotify, where I used to work and where our market share was gigantic in comparison. But here you're leading with only 7%. Over the last few years, the intention has been to shift toward growing market share. This is why there's all the modernization, and why someone like me showed up as well.
02 March 2023: Enter me!
That's me — working from home as everyone else is. My job is to improve how product and engineering interact, and really, I usually call it general problem solving — because I'm just looking for different challenges around product delivery to see how I can help.
03 One of the first problems I noticed
This one is something I've seen a lot. When you have a transformation, or any large project that has a lot of money and political clout, it's like a magnet for problems. Anyone else seen this? If you haven't, you need to look closer, because this is happening. And I don't want to say this is malicious. People don't do this because they're bad people. It's just that people have problems to solve, and they usually don't have enough money or people. So anytime something gets that attention, everything just kind of flies out of nowhere and attaches itself to it.
So you have a natural tendency for transformations to look like this: there's a very long line of problems, and the effort is split equally across all of them.
04 The Law of Raspberry Jam
How many of you have read The Secrets of Consulting? Oh, this is not a consulting crowd. Cool — I guess it shouldn't be.
The Law of Raspberry Jam: the wider you spread it, the thinner it gets. Weinberg is really good at these kinds of metaphors, and that's what's happening here. If we're talking about strategy in general — if it's unfocused, it's bad. If you have an unfocused transformation, I would call that a bad transformation strategy. Because you spread it wide, it gets thin, shallow, you're not really effective.
So that's not how you want to do it. You want your transformation, your large effort, to look like this — I'm going to stack people on high-leverage problems, not all of them. And therefore I get an outcome. The question is, how do you get this?
05 A target operating model for Product Engineering: Safe, Responsive, Trustworthy, Coherent
This is the idea. A target operating model for product engineering: safe, responsive, trustworthy, coherent. I kept the words simple and compact, mainly because the model is intended to create focus on high-leverage things rather than solve every problem.
I saw other people do hypotheses, so I should have one too. Hypothesis: A transformation narrative based on simpler, jargon-free outcomes will lead to a more focused and effective transformation.
Oddly enough, I was also trying to pay attention to how everyone was talking about their own efforts to see how jargon-free they were, because I'm kind of judging that too. So — there's a phrase that you hear at Grainger a lot (also because they're a lean shop): 'what problem are you trying to solve?' When you talk about, 'Hey, we're doing something,' you want to start with, 'what's the problem we're trying to solve?' Saying it in a simple way.
I have four main things that I observe, and these seem to be the top ones:
- Stable systems despite faster delivery → Safe
- Faster delivery to respond to opportunity, to reduce time-to-value → Responsive
- Reliable delivery to specific dates and/or outcomes → Trustworthy
- Effective coordination across boundaries to achieve shared goals → Coherent
06 Safe: Stable systems despite faster delivery
I couldn't figure out what numbers I'm allowed to say, so I'm just going to use thumbs up and thumbs down to represent it. I had this particular issue where if you talk to engineering and we talk about change failure rate, everyone's good — if you look at the number, it's good. If you talk to product, though, they would say it was not good.
So what's going on? Why do I have nominally the same concept but a different interpretation? It turned out that when people on the engineering side were talking about change failure rate, they were talking about the failure rate of technical deployments. I even heard this in some of the talks here. People talk about deployments, and I'm not entirely sure whether they mean a customer-facing release, from the perspective of the customer, or something from an internal perspective. When people in product and business were talking about change and change failure, they were talking about it from the perspective of the customer: what is the perceived rate of failure? And those did not match up, which is why you had this mismatch.
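To make the mismatch concrete, here is a minimal Python sketch. It is not how Grainger measures anything: the `Change` record, its fields, and every count below are hypothetical, chosen only to show how the same data can produce a healthy number from the deployment perspective and a poor one from the customer perspective.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One change to production (hypothetical record, for illustration only)."""
    customer_facing: bool  # did customers see this change as a release?
    failed: bool           # did this change cause a failure?


def change_failure_rate(changes):
    """Return the fraction of changes that failed (0.0 if there are none)."""
    changes = list(changes)
    if not changes:
        return 0.0
    return sum(c.failed for c in changes) / len(changes)


# Made-up numbers: 200 technical deployments, only 4 of them customer-facing.
changes = (
    [Change(customer_facing=False, failed=False)] * 194
    + [Change(customer_facing=False, failed=True)] * 2
    + [Change(customer_facing=True, failed=False)] * 2
    + [Change(customer_facing=True, failed=True)] * 2
)

# Engineering view: every deployment counts as a change.
engineering_view = change_failure_rate(changes)

# Customer view: only customer-facing releases count as a change.
customer_view = change_failure_rate(c for c in changes if c.customer_facing)

print(f"engineering view: {engineering_view:.1%}")  # 2.0%
print(f"customer view:    {customer_view:.1%}")     # 50.0%
```

With these invented counts, both groups are describing the same system, but they are dividing by different denominators, which is one way the thumbs-up and thumbs-down readings can coexist.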
So, this tagline: we keep the world working. When we talk about safe here, we're talking about keeping our customers working. This is a direct translation of the tagline that we have. When you say change, failure, and all the other things, you think of them from the perspective of the customer, not as internal concepts.
07 Differences in how customers see failure
This came up as well in this discussion around differences in how customers see failure. I put 'basic failure' here because I'm trying to be respectful. But in reality, when a customer talks about this, they're more likely to say like, 'that was a dumb failure.' So we have this dumb failure / basic failure versus what I call the 'what are you gonna do?' failure.
The difference is, if you do a basic failure, your stakeholders and customers will question your competence — which is sort of what was happening. I threw a few examples here — I'm sure you haven't had these happen [forgetting to renew a certificate or license, breaking basic functionality like can't log in or add to cart, deploying to the wrong environment]. Maybe you haven't — you should really automate a lot of these.
And then you have the more complicated ones, like, for example, if there was a worldwide outage due to some kind of bad security update. Most of the time they don't blame you — it's like 'hey, what are you gonna do?' So you kind of want to focus your attention on getting rid of the basic things first, and you deal with these more complicated ones later, knowing that the effect on reputation is not the same.
And if you are able to prevent the complicated ones, people are delightfully surprised — but you don't have to worry about them initially.
So the idea of safe is also: avoiding basic failures, and delighting people by avoiding complicated failures.
08 Failure demand
This diagram comes from our CTO, from when he was first talking about this idea: we want to deliver more quickly, but while we're delivering more quickly, we have to be careful about a feedback loop that can occur if you don't do it correctly, where you generate failure demand due to poor quality.
What's included in failure demand? If something goes wrong, like orders don't process, you have all this activity that starts to occur because of that: person-hours to contain the problem; person-hours to recover, including in distribution centers (I might have to recover from late orders by doing overtime in a distribution center, so I have a whole bunch of people doing that); people to analyze what happened and identify long-term remediation; and then actually doing the work for the remediation.
So every failure pulls its entire chain of effort. All of that effort would have been allocated to work that produces value, but instead it is now diverted to deal with the failure. So this is what we meant when we said, 'we want to be faster, but we don't want to just generate more work because we're moving too quickly.' That just undermines all the speed, because you have to reallocate effort to failure.
So safe is also reducing failure demand.
09 Automation and MTTR
There's something I think Google wrote about — human intervention adds about 10 to 20 minutes to MTTR. We kind of estimate around the same. I don't put the revenue numbers because I don't think I'm allowed to. But same sort of thing — safe also probably means automation. It's like, you don't even have to worry about labor costs; it's purely reducing the time, because it's worth a lot of money.
10 Safe: putting it all together
I'm just going to jump meta here. I'm talking about things like this because it creates context that shows how the activities and practices we're working on make sense within this larger transformation. So we have these things called Sensible Defaults and Site Reliability Engineering, which I'm not going to talk about in detail. But the idea is that they do the things I talked about in order to fulfill the goal. And the point of talking about it this way is that, if I'm on a team, I can understand why we're doing these things, and we know how to focus.
[Slide: Safe → Goal: Stable systems despite faster delivery | Mechanisms: Keep customers working, Avoid basic failures, Reduce failure demand, Automate | Main efforts: Sensible Defaults, Site Reliability Engineering.]
11 Responsive: Faster delivery to respond to opportunity and reduce time-to-value
I only have half the time, so I'm going to be a little bit faster here.
Deployment frequency of hundreds per year. Raise your hand if you think that's good or pretty good. Pretty good. It's not really. At 10 deploys a day, it would be thousands per year. But, you know, it's all right. Okay. What if I said your customer-facing releases were in the single digits? Okay, so you deployed a lot, but there's no closing of the loop. Back to this idea of understanding why you're doing stuff: it's for the customer. You're not closing that loop until you make contact with a customer. You're just making contact with the technical environment. Granted, you still learn things from that, but it's not quite the idea that we're looking for.
So responsive means closing the loop with customers.
One of our directors, Emily Rosengren (Senior Director Product Engineering), said this phrase: 'working software gives you options.' From what I understand, she didn't really think too much of it — just said it — but it really connected with the product VP they were working with.
It's a really basic idea, which people might've heard of before. If you look at the waterfall thing and a new opportunity shows up in the middle, you don't have any choice. You either ignore the opportunity or you throw away everything you have, because you have nothing. You don't have any options really. The better version is if you break things up into smaller releases. So if something shows up, you have the choice. You don't have to take the choice, but you have the option to say, 'Hey, I'm going to switch to the new opportunity. I already have value from what I've released.' Then you can make that choice. It doesn't say you have to abandon and then go to the new opportunity — it just gives you the option.
So if we can change how we deliver to make sure that each thing provides value immediately, then you always have that choice to change your mind. So responsive means providing options.
[Slide: Responsive → Goals: Respond to opportunity and reduce time-to-value | Mechanism: Close the feedback loop with customers, working software to provide options | Main effort: Small, customer-facing releases.]
12 Trustworthy: Reliable delivery to specific dates and/or outcomes
Are people familiar with the emotional bank account from Stephen Covey's Seven Habits of Highly Effective People? This is such an old book, so it's always weird — but it's new if you've never read it before.
The general idea is depositing trust: you do things and you deposit trust, and then when you've built up the bank account, you can screw with the other person because you have a lot of trust to draw on. Actually, that's not what you're supposed to do with it. But anyway, within the context of product development, I would say that reliable delivery is the primary mechanism of depositing trust.
So regular small releases are how we deposit trust. Every time you deliver something and it's working, they go, 'oh, they seem to know what they're doing.' They're more willing to forgive you when you screw up, because you will.
Associated with that, we do this thing (not all the areas do this yet): when we talk about forecasting-type stuff, like the fancy spreadsheet Monte Carlo simulation, the idea is not so much 'I'm doing this so I know when things are gonna get done.' It's part of that trust-building process: when you say, 'Hey, this thing will be done by then,' it's relatively reliable, and that builds trust.
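For readers who haven't seen a Monte Carlo delivery forecast, here is a minimal Python sketch of one common version of the technique: resampling past weekly throughput to estimate how many weeks a backlog might take. This is an illustration under assumptions, not the spreadsheet used at Grainger. The function names, the throughput history, and the backlog size are all hypothetical.

```python
import math
import random


def forecast_weeks(throughput_history, backlog_size, trials=10_000, seed=0):
    """Simulate weeks needed to finish `backlog_size` items.

    Each simulated week draws a throughput value at random (with
    replacement) from `throughput_history`. Returns the simulated week
    counts, sorted ascending.
    """
    if backlog_size <= 0:
        raise ValueError("backlog_size must be positive")
    if any(t < 0 for t in throughput_history) or not any(
        t > 0 for t in throughput_history
    ):
        raise ValueError("history needs non-negative values, at least one > 0")

    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    results = []
    for _ in range(trials):
        remaining, weeks = backlog_size, 0
        while remaining > 0:
            remaining -= rng.choice(throughput_history)
            weeks += 1
        results.append(weeks)
    return sorted(results)


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, for 0 < p <= 100."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


# Hypothetical inputs: items finished in each of the last six weeks,
# and 30 items left in the backlog.
history = [3, 5, 2, 4, 6, 3]
outcomes = forecast_weeks(history, backlog_size=30)

print("50% of simulations finished within", percentile(outcomes, 50), "weeks")
print("85% of simulations finished within", percentile(outcomes, 85), "weeks")
```

The output is a range with a confidence level rather than a single date. Committing to the more conservative percentile is one way a forecast like this could support the reliable delivery described above.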
So trustworthy is about regular, predictable delivery.
[Slide: Trustworthy → Goal: Regularly 'deposit trust' to improve relationships | Mechanism: Reliable delivery to specific dates and/or outcomes | Main efforts: Regular, small releases; Delivery forecasting.]
13 Coherent: Effective coordination across boundaries to achieve shared goals
The final one is coherence. We're structured into what we call product domains. There are some other things outside of that, but most of the product engineering work sits in different product domains. I saw a similar thing at Spotify. When you do things within a domain, it's easier. But there's always stuff that needs to cross, and that's when you get some of the issues. This is why we talk about coherence.
Really, two questions. There are more than two, but I'm going to say there are two questions you ask to ensure that teams are able to reliably act independently. One: do they know the overall strategy? And two: does that overall strategy make sense? Sometimes you don't get both. I guess you have to have a strategy for it to make sense, but sometimes you get a strategy that's all nonsense, so it doesn't help people understand what they need to do. You need both.
14 Alignment enables autonomy
This picture — someone has shown this one before. This was a big deal back at Spotify. It actually originates from The Art of Action. Someone else, one of the Equal Experts guys, said he was involved in this too. But I knew it from The Art of Action originally — then Henrik [Kniberg] did a nicer diagram. So the idea is that the alignment enables effective autonomy, so that you're going to this top-right thing ('aligned autonomy').
One of the issues that shows up, and I've seen various versions of this, is this: you have an overall strategy, and if you communicate it in a way that lets the version split, you get this mismatch where no one has the same version, because it drifts as it spreads along the communication pattern.
The answer is quite basic. It is just: talk to people together, regularly. That's it. There may be fancy jargony terms to talk about this, but it is just — all the people that need to know, you get them together and they just talk regularly.
For us, that's typically these groups: Product Engineering; Portfolio Managers (the group that is still doing more traditional project-type stuff); Business Partners; and Representatives from other product domains. Just get them together. They talk regularly, and we see that it dramatically reduces the amount of drift you get. It works out, and it's nothing fancier than that.
15 Shared economic understanding
Another thing, and this one is fancier, sort of. We had this question come up: should we invest in cross-region active-active replication of a service? Yes or no? Is this kind of a complicated answer? Well, what if we are losing $100 a minute? Yeah, I don't know, that's not that much. Who cares. What about $1,000 a minute? Ah, not even then, I'm not so sure. AWS is expensive sometimes. What about $10,000 a minute? And you go, 'oh, okay, no problem.' It's a no-brainer kind of thing.
The idea here was getting clearer on what the impact of being down means. I think without this, you get this thing where people are almost like, 'I want to be more reliable because of vibes.' Not because of money. Which also makes it very difficult to talk to other people, because they might have different vibes than you.
So coherent — beyond just getting people to talk together, there's this thing around shared economic understanding to guide decision-making. This is sort of in process, but it seems to help a lot.
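The replication question above can be written down as simple break-even arithmetic. The Python sketch below is only an illustration: the annual cost of replication and the minutes of downtime it would avoid are invented numbers, since the talk gives only the loss-per-minute figures, and the function names are hypothetical.

```python
def avoided_loss_per_year(loss_per_minute, avoided_minutes_per_year):
    """Money not lost each year because the downtime was avoided."""
    return loss_per_minute * avoided_minutes_per_year


def worth_investing(loss_per_minute, avoided_minutes_per_year, annual_cost):
    """True when the avoided loss is larger than what the investment costs."""
    return avoided_loss_per_year(loss_per_minute, avoided_minutes_per_year) > annual_cost


def break_even_loss_per_minute(avoided_minutes_per_year, annual_cost):
    """Loss per minute at which the investment exactly pays for itself."""
    if avoided_minutes_per_year <= 0:
        raise ValueError("avoided_minutes_per_year must be positive")
    return annual_cost / avoided_minutes_per_year


# Assumptions for illustration only (not from the talk):
ANNUAL_COST = 500_000   # yearly cost of active-active replication, in dollars
AVOIDED_MINUTES = 120   # minutes of downtime per year it would prevent

# The three loss-per-minute figures are the ones from the talk.
decisions = {
    loss: worth_investing(loss, AVOIDED_MINUTES, ANNUAL_COST)
    for loss in (100, 1_000, 10_000)
}

for loss, invest in decisions.items():
    saved = avoided_loss_per_year(loss, AVOIDED_MINUTES)
    print(f"${loss:>6,}/min -> avoids ${saved:>9,}/yr -> invest: {invest}")

print(f"break-even: ${break_even_loss_per_minute(AVOIDED_MINUTES, ANNUAL_COST):,.0f}/min")
```

Under these made-up assumptions the answer flips between $1,000 and $10,000 a minute, which matches the shape of the conversation: once the loss per minute is known, the decision stops being about vibes.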
[Slide: Coherent → Goal: Effective coordination across boundaries | Mechanism: Align independent actions to shared goals | Main efforts: Regular stakeholder sync; Domain economics.]
16 Takeaways
My takeaways are relatively simple. The idea, which I think you see a lot, is that we're trying to create focus, to the extent that you should feel uncomfortable about what you're not doing. And how do you create that? You want to step away from the jargon ('DevOps,' 'Agile,' 'Continuous Delivery,' 'Modern Software Engineering') and get to very clear, simple, jargon-free language for the outcomes you're looking for. And then everything lines up to that.
This is the summary of what I came up with at Grainger — we'll see how effective that ends up being.
17 Questions for you
I'm very curious about — not just what you're doing, but how you're framing what you're doing, and how that's creating focus. What are the things that you are deciding to focus on? I'm curious how you're addressing the problems that I mentioned, because I'm assuming you have similar ones. There's my socials and email — the old-school socials. Thank you very much. (https://www.linkedin.com/in/jasonyip/ · jason.yip@grainger.com)