Operability and You Build It You Run It at John Lewis & Partners
How do you increase deployment throughput from 10 to 5,000 deployments per year, while also improving overall website reliability for 30 teams working on a £3.5 billion retail website? How do you embed an operability mindset into product teams working at one of Britain’s oldest, largest, and most popular retailers?
John Lewis & Partners has provided its customers with retail merchandise since 1864. The company is co-owned by its 78,000 employees, and it operates 42 stores across the UK as well as johnlewis.com. Over the past few years, John Lewis & Partners has been on a digital transformation journey, replacing its ecommerce monolith with tens of microservices and teams, and a bespoke, award-winning digital platform.
We have a big emphasis on operability. We’ve built a Paved Road for telemetry, including availability targets and service level objective alerts. We’ve implemented You Build It You Run It at scale. We’ve adopted Chaos Days, post-incident reviews, and per-team incident management.
We’d like to share with you the successes we’ve had, and the lessons we’ve learned while adopting operability at scale. We’re hoping to encourage and inspire other folks working in large enterprise organisations! Your takeaways will be: how a digital platform can remove telemetry friction, how you can track leading indicators of operability, and how you can measure the cost effectiveness of You Build It You Run It.
Full transcript
Simon Skelton
Hi, I'm Simon Skelton, the Platform and Operations Manager at John Lewis & Partners. This means I have overall accountability for the smooth running of johnlewis.com.
And whilst I consider myself a newbie with a mere 20 years at the Partnership, throughout my career I've been an on-call programmer, a developer, led ops teams, and implemented ITIL across the Partnership. But I'm definitely an advocate for DevOps.
Steve Smith
And I'm Steve Smith. I'm from Equal Experts. I've worked there for seven years, and I recently spent two and a half years working at John Lewis & Partners with Simon and a whole bunch of other great people.
And we're here to talk about operability and You Build It, You Run It at John Lewis & Partners, and how we've gone from 10 releases a year to 5,000 deployments a year whilst also improving website stability.
Simon Skelton
Well, a little bit of background on John Lewis. This is definitely not John Lewis. This is the classic "Are You Being Served?" sitcom from the '70s and '80s. But our history actually goes back much further than that.
It dates back to 1864, when John Lewis Senior opened the first store in Oxford Street. But it was actually his son, John Spedan Lewis, who believed in fairness and humanity, and he experimented with a new business model as he thought it was unfair that the three owners earned more than all the 300 employees in total. And in 1920, he shared the first bonus of seven weeks' pay with them all.
This now means the 78,000 employees, or partners as we call ourselves, are all co-owners of the business. And as you'll see in the middle, John Lewis Partnership is the overall brand. We're talking about John Lewis & Partners, the department store, and we've also got Waitrose & Partners, the grocery chain, as well.
But these strong foundations have still allowed us to adapt and innovate, with Edgar the Dragon being our first combined John Lewis and Waitrose Christmas advert, which is often called the official start of Christmas, and that was trending number one on UK Twitter within two minutes of launch. And our stores continue to be updated as well to meet ever-changing customer needs, with much more focus on experiences. And indeed, they've had to change.
Looking back, 2019 was a very challenging year for retail, with the likes of Mothercare closing down. Unbelievably, Brexit no longer filled the headlines when coronavirus hit in early 2020, and other well-known brands such as Debenhams and House of Fraser had to close their shops too. As you'll see on the right, almost 10,000 stores have closed in 2020.
It's been tough for John Lewis too, but let's look at some positives. Here's a great example of how we've quickly adapted with our virtual services launch just two weeks after the first lockdown forced our shops to close. Virtual nursery, personal styling, home design, and beauty classes all proving really popular, and something that will likely last post-coronavirus.
Coronavirus has also accelerated what we believe will be a permanent shift more online. Johnlewis.com was already very successful at 40% of total sales, but this is likely now to remain at 60% to 70%, we believe. And Black Friday is normally our biggest day online, but of course the shops were closed once again last year, and our estimates proved very accurate, and we saw an additional 50% increase in sales on the previous year.
But I'm very pleased to say that the last three years' worth of investment paid off. The platform scaled perfectly throughout Black Friday and the whole Christmas period, and we traded without any issues.
But now let's step back in time to 2017 and look at some of the challenges we were facing then. We felt our speed to market for new features was too slow, and the technology was seen as constraining the business and not enabling it. It was also difficult to manually scale up the on-premise servers for the likes of Black Friday, as well as difficult to add more teams to work simultaneously delivering new features.
Also, this was a key decision point. Do we invest the majority of our budget and resources in the next 18 months to upgrade our commercial off-the-shelf e-commerce platform? But that would only enable us to stay in support without adding any new features, so no surprises for guessing what we did do.
Back then, we had six teams working on multiple e-commerce monoliths. They were a mix of third-party e-commerce packages and bespoke front ends as well. We had a central operations team called Application Operations Support, which was mostly comprised of third-party managed service with some partners as well. We only managed one overnight deploy a month, with summer clearance and Christmas trading period change freezes meaning we only did 10 deploys a year. And these big releases caused plenty of major incidents, and we had quality issues as well. We were losing millions of pounds a year in opportunity costs. We couldn't release new features fast enough.
So let's have a look now at what we did to tackle those challenges. This is a timeline and a brief narrative of a huge amount of work by a lot of people. We can't cover everything here today.
But in 2017, we made a commitment to replace our e-commerce monolith with digital services, while still delivering new features to customers. Those digital services would run on what we call the John Lewis Digital Platform. It provides a paved road, and its bespoke platform capabilities are built upon the top of Google Cloud Platform. This allows us to scale up product teams without compromising on throughput, quality, or reliability.
In 2018, our cloud search team were successful in taking 1% of the live traffic away from the old search engine. This validated not only the technology but the ways of working as well. By 2019, we had nine times as many teams, and we had those product teams on call for their own services, and we had new customer propositions emerging. By 2020, we continued to grow and accelerate, moving significant traffic away from our monolith to the new services, and as you saw from the Black Friday traffic, that's been very successful.
So back in 2017, we believed that product teams and You Build It, You Run It were prerequisites for daily deployments and high reliability. In the 2000s, we actually used to have combined delivery and ops teams, but they were eventually split as delivery deadlines were frequently missed and operational issues became overwhelming.
But what were those issues, and what could we learn from them? Back then, we had project-based delivery with infrequent business owner input. We now have agile product teams with frequent prioritization from a product owner. We had manual testing which didn't catch enough defects. We now have automated testing with continuous integration. Releases were infrequent, large, and manual. We now have continuous delivery with small, frequent deployments. And the on-premise test and live environments were too different and slow to provision. We now have the John Lewis Digital Platform with cloud-based self-service infrastructure.
But when it came to operability, keeping availability high and operational issues low, the question I kept asking myself, and Steve, was: how do you embed operability into digital teams at scale in an organization that's 150 years old?
Well, we broke operability down into these four areas: growing awareness by making product teams responsible for supporting live digital services; identifying concerns by standardizing and then visualizing leading and trailing indicators; testing proficiency by running chaos days and live load tests; and embedding principles by creating new learning pathways and opportunities for partners.
And now I'll hand over to Steve to give you some more details on these.
Steve Smith
Thanks, Simon. So an operating model is insurance for your business outcomes, and with You Build It, You Run It, it's a policy that can achieve high standards of deployment throughput and service reliability together in a way that's cost-effective.
This table shows how You Build It, You Run It works at John Lewis & Partners. There's a table of availability levels matched to revenue and out-of-hours support. So a product manager has an idea for a new digital service. They then go to the digital platform onboarding guide, which has a copy of this table, and they have to estimate the maximum amount of revenue that can flow through their digital service in a period of time. Then they match it to one of these levels and their own tolerance for risk, and that gives them an availability target, and it gives them out-of-hours support as well.
For example, if I'm a product owner and I have an idea for a cloud search service, and maybe in 45 minutes or so it will have £570,000 flowing through it, then that will match to the 99.9% target, and I have to have a team rota on call. Alternatively, maybe I have an idea for a merchandising service, and I think it might take £50,000 in somewhere between one and seven hours. In that case, I'd have a 99.0% target, and that would mean no on-call. We'll come to what that means in a moment.
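The 45-minute and seven-hour windows in these examples line up with the downtime that each availability target allows. As a rough sketch of that arithmetic, assuming a 30-day measurement period (the talk doesn't say which period JLDP uses, and the function name is made up for illustration):

```python
def downtime_budget_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period while still meeting the target.

    The 30-day default is an assumption, not something stated in the talk.
    """
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% allows about {minutes:.1f} minutes ({minutes / 60:.1f} hours)")
```

Under that assumption, 99.9% allows roughly 43 minutes of downtime a month and 99.0% allows a little over seven hours.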
The important part here is that it's a product manager that makes the decisions, not a platform lead such as myself, nor an operations manager or delivery manager such as Simon. The product manager is the budget holder. They make the prioritization decisions. It's up to them.
This is a good framework for revenue versus availability versus on-call, but it's not a recipe for all organizations. The maximum revenue that you tie to an availability level, the availability level that you choose, that's going to really vary based on your own business. We took our initial revenue numbers from an incident management policy across the Partnership and iterated on it, and you should take a similar approach.
This diagram shows the workflow of incident notifications for monolith on-premise and digital services hosted on JLDP. With a monolith, there's an alert that comes out of New Relic, it goes to the OpsBridge team, they scrabble around in a bunch of spreadsheets, and they hunt down the right member of the Application Operations team to call. They phone them, they invite them into a Google Chat room, and they pull in their major incident manager for incident response. OpsBridge also manually create a ServiceNow record.
With a digital service, it's very different. An alert could come from Prometheus, or it could come from New Relic. Both fire into PagerDuty, which has teams, services, escalation policies, and rotas all automatically provisioned as part of JLDP's paved road offering. All you have to do in your service, in a bit of config, is type in your service name, your team name, and your availability target, and JLDP provisions PagerDuty and ServiceNow for you.
So PagerDuty gets an alert, it matches it to a service, it matches it to a team, matches to an on-call engineer, and phones them straight away. It also immediately creates a record in ServiceNow, and there's bi-directional sync, so any changes in ServiceNow itself are reflected back to PagerDuty as well. An incident channel is created in Slack, and then the product engineer starts to do incident response. Other people can view the response because the channel is public and searchable, and the engineer also has a shiny button in PagerDuty called Declare a Major Incident that lets them pull in a major incident manager to use the exact same major incident process.
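To make the matching chain concrete (alert to service, service to team, team to on-call engineer), here is a minimal sketch. It is not the PagerDuty API or JLDP's actual config format: the class names, field names, and in-memory dictionaries are all hypothetical, chosen only to mirror the three values a team supplies (service name, team name, availability target).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ServiceConfig:
    # The three values the talk says a team types in; field names are hypothetical.
    service_name: str
    team_name: str
    availability_target: float


@dataclass(frozen=True)
class Alert:
    service_name: str
    summary: str


def route_alert(
    alert: Alert,
    services: Dict[str, ServiceConfig],
    on_call: Dict[str, str],
) -> Optional[str]:
    """Return the engineer to phone for this alert, or None if nothing matches."""
    config = services.get(alert.service_name)  # alert -> service
    if config is None:
        return None
    return on_call.get(config.team_name)  # service -> team -> on-call engineer


# Hypothetical data for illustration only.
SERVICES = {"cloud-search": ServiceConfig("cloud-search", "search-team", 99.9)}
ON_CALL = {"search-team": "engineer-a"}
```

The point of the sketch is that once the service-to-team mapping is provisioned from config, no human lookup sits between the alert and the phone call.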
On reflection, adding PagerDuty into the alerting toolchain was a really key part of the operability journey at John Lewis & Partners. It meant that the time to acknowledge an incident could come down from five to twenty minutes to 60 seconds consistently, because the end-to-end workflow was fully automated. It also meant that painful friction points in PagerDuty setup, and especially in ServiceNow setup, could all be eliminated.
It also meant that a commitment to working with all aspects of IT operations could be demonstrated, because there was no attempt to create a separate digital incident management process. I remember insisting myself that we use the process as is, and that we work with the incident managers to help them get the most out of their role. And because the incident channels are public and searchable, incident response and incident reviews can all happen in one place.
This is a diagram that shows out-of-hours support for digital services in early 2020 based on their on-call level. On the Y-axis, we have availability levels from low to high, and on the X-axis, we have product demand from low to high.
We mentioned earlier that different services have different levels of on-call. At the lowest availability levels, there is no on-call out of hours for a service, and that includes no fallback on an operations team. That's an intentional, carefully designed approach that's appropriate for the lowest level of revenue risk. We do this because having absolutely no operations fallback generates stronger operability incentives for the delivery teams. Because now they're thinking, "If there's an incident out of hours, no one else is going to fix it. I've got to fix it when I get in in the morning." So that would encourage people to think more about operational features up front.
If a service has a middling availability target, then what will happen is the product team engineer will be on call for their digital service, or an engineer in a sibling team in the same product domain. A product domain is a logical grouping of services in the same business domain, with a focus on customer outcomes and minimal cognitive load for engineers. And the way that it works is that, for example here, the add-to-basket service, the electrical service, the fashion service, they all operate in the same commercial journeys product domain rota. So tonight, one person from those three teams will be on call for those three services.
This is the secret to growing You Build It, You Run It at scale in a way that doesn't go up in a linear fashion as you increase number of teams and services. You don't want to have 20 people on call for 20 services, nor do you want to have one person on call for the world. This is a way of striking an effective balance.
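A small sketch of why on-call headcount doesn't grow linearly with services: count one engineer per distinct rota rather than per service. The level labels ("none", "domain", "team"), the team names, and the search domain name are hypothetical; the service and commercial journeys names come from the examples in the talk.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Service:
    name: str
    team: str
    domain: str
    on_call_level: str  # "none", "domain", or "team" (labels are assumptions)


def rota_key(service: Service) -> Optional[Tuple[str, str]]:
    """Identify which rota covers a service out of hours, if any."""
    if service.on_call_level == "team":
        return ("team", service.team)
    if service.on_call_level == "domain":
        return ("domain", service.domain)
    return None  # lowest level: no on-call and no operations fallback


def engineers_on_call(services: Iterable[Service]) -> int:
    """One engineer per distinct rota, however many services share it."""
    rotas = {rota_key(service) for service in services}
    rotas.discard(None)
    return len(rotas)


EXAMPLE_SERVICES = [
    Service("add-to-basket", "basket-team", "commercial-journeys", "domain"),
    Service("electrical", "electrical-team", "commercial-journeys", "domain"),
    Service("fashion", "fashion-team", "commercial-journeys", "domain"),
    Service("cloud-search", "search-team", "search", "team"),
    Service("merchandising", "merch-team", "merchandising", "none"),
]
```

In this made-up example, five services need two engineers on call tonight: one for the shared commercial journeys rota and one for the search team's own rota.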
And if a service has the highest availability target and the highest amount of customer demand, then the team operates their own on-call rota, and they have maximum operability incentives there. That isn't forever. If product demand slows down and the product manager announces that demand has been filled, at least for now, then that digital service gracefully transitions into the appropriate product domain rota.
A minority of digital services should be in a team rota. If too many services are in that rota, then it's considered to be an overestimation of revenue impact risk or an underestimation of mitigating downstream dependencies. And in each case, a team rota or a product domain rota needs to have a minimum of three to four product engineers. Both John Lewis partners and Equal Experts engineers all go on call together. No one's made to do it. It's just like with an operations team: it's all about personal choice and trying to get an on-call rota that works for the team.
Let's move on to identifying concerns with leading indicators. I vividly remember Simon saying to me at some point that trailing indicators of operability weren't enough. We needed to understand the presence of adaptive capacity, not just see it after it had been used.
This is a screenshot of the JLDP service catalog. It's a service that runs itself on JLDP, and it's kind of like a developer portal, I guess, is the trendy name now. But what this shows is a bunch of different services by their service level, their availability rate at present, and then there's an assessments column and a telemetry column. And these are showing leading indicators of operability that are relevant to the John Lewis & Partners context.
With telemetry, there are automated checks that look for bespoke telemetry. Now, JLDP gives every digital service logging, monitoring, and alerting out of the box. But it's been observed that teams who build their own bespoke telemetry on top of that are more likely to handle live traffic incidents as they occur in a timely fashion. So JLDP scans for bespoke telemetry and flags up if there's nothing there at all. Green would mean that there are no outstanding tests to complete. Red, as in this screenshot, implies that teams have some work to do there.
With assessments, this refers to a set of exploratory questions where teams self-assess themselves for their own services every quarter. It's called a service operability assessment, and they are how questions. There's no yes/no questions. It's all about how, and diving down into how teams actually operate their services.
For example, one of the questions that has to be completed says: how do you handle latency problems of a downstream dependency? And what might happen is you might look at that and think, "We need to put in a circuit breaker." You'd record that in your response. You'd write down a Jira ID for that task. That's all machine-readable. It's scanned by JLDP, and it's visualized in the catalog as something that needs to be handled.
As we can see in this screenshot, green for the first service means that there's been a recent assessment and there are no outstanding tasks to complete. Gray means that there's been no assessment for a while, and red means that there's been an assessment and there are outstanding tasks to complete. So all of this is about identifying operational problems, latent faults if you like, before we actually have a major incident.
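The colour rules for the assessments column can be sketched as a small function. This is only an illustration of the logic as described, not the JLDP catalog's code: the 90-day staleness threshold is an assumption based on assessments being quarterly, and the function and constant names are invented.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: "no assessment for a while" means older than roughly one quarter.
MAX_ASSESSMENT_AGE = timedelta(days=90)


def assessment_status(
    last_assessed: Optional[date],
    outstanding_tasks: int,
    today: date,
) -> str:
    """Map a service's latest operability assessment to a catalog colour."""
    if last_assessed is None or today - last_assessed > MAX_ASSESSMENT_AGE:
        return "gray"  # no recent assessment
    if outstanding_tasks > 0:
        return "red"  # recent assessment with follow-up tasks still open
    return "green"  # recent assessment, nothing outstanding
```

Checking staleness first means an old assessment shows as gray even if its tasks were all closed, which is one reasonable reading of the three states.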
We track trailing indicators as well. We use service availability and deployment throughput, both gathered by automated checks, to show adaptive capacity after it has been used. This screenshot shows a delivery indicator. It's a visualization of deployment throughput. It shows the number of days between production deployments and the amount of time it takes to do a deployment.
With this service, we can see that through 2019 into 2020, this service went from fortnightly deploys to weekly deploys, which is really good, and there's some wobbliness with how long it takes to do a deploy. So there's a couple of conversations that Simon or someone else can choose to have with that team about how they've improved on deployments, how they've made them smaller, more frequent, put themselves in a better position to diagnose problems and roll back quickly. And yet there's still a bit of wobbliness about how long it takes to get something out the door.
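The days-between-deployments figure behind an indicator like this can be derived from a list of production deployment dates. A minimal sketch, assuming the dates are already available (the function names are illustrative, and using the median as the summary is a choice made here, not something the talk specifies):

```python
from datetime import date
from statistics import median
from typing import List, Sequence


def days_between_deploys(deploy_dates: Sequence[date]) -> List[int]:
    """Gaps, in whole days, between consecutive production deployments."""
    ordered = sorted(deploy_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def median_interval_days(deploy_dates: Sequence[date]) -> float:
    """Typical gap between deployments; needs at least two deployments."""
    intervals = days_between_deploys(deploy_dates)
    if not intervals:
        raise ValueError("need at least two deployments to compute an interval")
    return median(intervals)
```

A service moving from fortnightly to weekly deploys would show this number dropping from about 14 to about 7.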
All of this data is quantitative, it's shallow. They're all placeholders for conversations, and there's no one going around with a clipboard saying, "You must do better." That's definitely not the John Lewis & Partners way.
One way we test operability proficiency is by running chaos days. We want to identify digital services that may fail in production under certain conditions before a major incident actually occurs. This is a photo of a chaos day review in our head office, and standing up there presenting is Rob Hornby, our product owner for the platform. This particular chaos day was targeted at the John Lewis Digital Platform itself in a test environment with some of the platform team members acting as agents of chaos. Product teams were asked to monitor their own services in that test environment and contact the platform team in their dedicated front door Slack channel if any issues were seen.
We run chaos days on a quarterly basis in a test environment, and we intentionally select the most experienced team members to be those agents of chaos to ensure they can't act as human runbooks during the incident response. We uncovered plenty of latent faults in the past, such as a product team who didn't notice their database had vanished. The learnings from a chaos day and follow-up tasks are captured, and we've observed that teams who fixed latent faults soon after the chaos days are less likely to endure painful incidents later on.
We also regularly validate our ability to handle Black Friday levels of traffic. We have a similar approach to that of our chaos days. We visualize key components of the website and use our knowledge and experience to determine what load scenarios to try. Although product teams do their own load testing per digital service, we still find that extreme simulations of customer browsing surface issues from interactions between the different johnlewis.com website components.
A live load test happens overnight to reduce customer impact, but real profiles of customer behavior are compressed and skewed to fit the Black Friday traffic profile. And they're injected into the live website. Product teams use the analysis from those live load tests to improve their own digital services and protect our Black Friday capacity.
Simon Skelton
We also take professional development of our partners very seriously. After all, they're co-owners in our business. From the very outset of our digital journey, we've ensured that partners have opportunities to learn new skills and move into new roles.
Partner engineers can embark on a number of different learning pathways. We've designed one specifically for operability that covers topics such as agile operations, security testing, performance, learning from incidents, and more.
And we've mentioned before that the AppOps support team was mostly staffed by third-party managed service with some partners. Well, those partners have invaluable skills and experience. And as we wind down that support team, and we reduce the managed service, those partners are gradually moving into product teams and into the platform team itself to share their operational wisdom and learn new skills as well.
So let's come more up to date now and look at the outcomes we've achieved.
Steve Smith
Thanks, Simon. In terms of deployment throughput, the graph on the left shows deploys from 2018 to 2021, and you can see that it's rocketed up from 10 to 5,000 a year. You'll see a drop around Black Friday 2019. That's because digital services were still in a change-freeze process then. There's no such dip for 2020 because, by that point, stakeholder confidence had increased and digital services had been lifted out of that process, which was great.
The graph on the right is JLDP service catalog again, and that's showing the time to first customer. The time to provision a new digital service has come down from six months to one day. The average timescale to the first live customer is now 90 days and is coming down all the time, and teams are reporting additional millions per year in incremental revenue as a result.
This is about service reliability. The graph on the left shows incident rate, and you'll see there's been no significant increase in major incidents for the past two years during the introduction of digital services. The graph on the right shows time to restore for those exact same incidents, and you'll see that for monoliths and digital services, there is a trend downwards, which is really encouraging. You'll also see that digital services have a much faster time to restore than the monoliths.
And this is my favorite slide. This is the magic table. This is all about service reliability. This is all about showing how the hybrid operating model works at John Lewis & Partners. This was an analysis between April 2019 and April 2020 of the different components of the live website at the time.
For You Build It, You Run It, at the time, there were six digital services operated by four rotas, so one service not on call perhaps, some services in a product domain. Again, remember, six services doesn't mean six people on call. The deployment frequency was daily. That's seven times faster than the third-party managed service operating the three monoliths under one rota.
Digital services had only six major incidents compared to 13 for the monoliths. The handoff rate, the number of incidents that required a second better-placed responder and incurred a time penalty, was one and a half times lower with You Build It, You Run It. The time to restore was three times faster. You might recall that the target for 99.9% was 43 minutes. Well, You Build It, You Run It is awfully close to that on average, which is pretty good.
And revenue protection effectiveness was three times higher. This is a measure that looks at the percentage of estimated revenue loss per incident that's actually protected because the actual revenue loss is less than the estimate because of a fast time to restore. So because You Build It, You Run It has a faster time to restore than the third-party managed service could achieve with the monoliths, more revenue could be protected and more money could be saved, which is a really good thing.
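One way to read the revenue protection measure as a formula is sketched below. The talk describes the measure in words only, so the exact calculation here is an assumption: protected revenue is taken as the estimated loss minus the actual loss, floored at zero, and expressed as a percentage of the estimate.

```python
def revenue_protection_effectiveness(estimated_loss: float, actual_loss: float) -> float:
    """Percentage of an incident's estimated revenue loss that was not actually lost.

    An assumed formula based on the spoken description, not a confirmed one.
    """
    if estimated_loss <= 0:
        raise ValueError("estimated_loss must be positive")
    protected = max(estimated_loss - actual_loss, 0.0)
    return 100.0 * protected / estimated_loss
```

For instance, with made-up figures, an incident estimated to cost £100,000 that actually cost £25,000 because it was restored quickly would score 75%.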
Simon Skelton
So what does this kind of speed and agility allow us to deliver for our customers? I've picked up one example here, which was pre-COVID. This was our first beta trial on johnlewis.com, where we wanted to improve the experience for choosing the right sofa. For an online retailer, it's not easy to gather feedback from our customers directly, but we have the advantage of being able to tap into the vast experience of our shop selling partners.
After putting the first iteration live on the website, some of the team visited one of the stores. The shop floor partners are most used to IT being multi-year projects to roll out the likes of a new POS till system. So they were absolutely amazed when they could see their feedback being implemented on the live website within the same day, which was excellent.
Let's move forward to what our current challenge is and where you may be able to help us learn from your experiences. We're still working out how we achieve the best-value support model, such as influencing teams to adopt the domain model. How do we safely reduce and remove the reliance on the 24-by-7 eyes-on support model? That's still work in progress. And of course, the ongoing challenge of evolving service management to become more agile.
So what are our takeaways? How do you embed operability into digital teams at scale for a 150-year-old enterprise? We think: test, learn, and continually evolve your model. Think about operability as early as possible to ensure sustainability. Maintain visibility of operability with both leading and trailing indicators. Encourage little-and-often deployments wherever possible to increase agility and reduce the blast radius of deployment issues or defects. And adopt You Build It, You Run It for all product teams to maximize operability incentives and create a cost-effective insurance for the business outcomes.
So it just remains for Steve and I to say thank you for listening. We've put a few references up here, one to our partner recruitment website, so please check that out. There's also some talks here from some of our colleagues and articles on medium.com and some of the EE playbooks. Thank you for listening.
Steve Smith
Thanks very much.