Paving a Secure Road at John Lewis & Partners

Amsterdam 2023

Simon Skelton

Platform & Operations Manager · John Lewis & Partners

Chris Rutter

Principal Security Consultant · Equal Experts

How do you continuously secure 100 digital services made up of 400 microservices with 7000 deployments per year? How do you ensure cross-cutting security controls that still enable 40 product teams to innovate and release without friction?

John Lewis & Partners has provided its customers with retail merchandise since 1864. The company is co-owned by its 78,000 employees, and it operates 35 stores across the UK as well as johnlewis.com. Over the past 5 years, we’ve been on a digital transformation journey, replacing our ecommerce monolith with over 100 digital services built and maintained by 40 agile product teams.

At DOES Europe 2021, we shared with the DOES community how we built our award-winning digital platform and adopted the You Build It You Run It operating model to help us deliver valuable and reliable products.

https://videos.itrevolution.com/watch/549298512

https://medium.com/john-lewis-software-engineering/our-award-winning-john-lewis-digital-platform-2d093e03d542

Now we’d like to explore how we’ve transformed security in our product delivery by solving common technical, organisational and process challenges with scalable and robust security controls and practical workflows, all without an army of security specialists.

We’ll cover the successes we’ve had and the lessons we’ve learned. Your takeaways will be: when to be opinionated and provide standardised security to delivery teams for free, how to use visible metrics and policy to distribute security ownership, and how to produce the necessary assurance to streamline security gates and enable rapid delivery.

Full transcript

Host Intro (Gene Kim)

All right, so all this morning we've heard so much about how world-class operations can help dev by letting devs better focus on dev things.

So up next is the amazing story of a platform team led by Simon Skelton, manager of platform and operations at John Lewis & Partners, a retailer that was founded in 1864, with annual revenue of nearly 5 billion pounds and approximately 75,000 employees. Over the last five years, the work of Simon's team helped them scale their e-commerce capabilities and enable faster time to market.

Simon previously presented in 2021 talking about their DevOps journey from monolith to microservices with agile product teams built on a paved road platform. But Simon is here today, along with his colleague Chris Rutter, principal security consultant at Equal Experts, to talk about their continued journey building on those strong foundations to make security valuable and integrated into the DevOps environment.

Here's Simon and Chris.

Simon Skelton

Thank you. Good morning, everybody.

My name is Simon Skelton and I'm the platform and operations manager at John Lewis & Partners. That means my team provide the digital e-commerce platform for our website, and it also means I've got operational accountability for the smooth running of the johnlewis.com website, which does about 3 million pounds of revenue each day.

And I consider myself a new boy at the partnership, at a mere 21 years of service. I think that makes me the young, wild one, but without the beard.

And Chris.

Chris Rutter

Hi everyone, I'm Chris Rutter. I'm a principal consultant with Equal Experts and I specialize in really helping teams realize the value in security activities. I've been working with Simon and his team for around about two years, since 2021. I think we've done a lot of good work.

Simon Skelton

So we're here today talking about paving the secure road. And as you'll see from this image taken from our recent Christmas advert, this is metaphorically about putting on your crash helmet so that when you're banging your head against a brick wall trying to talk to a security team who doesn't understand DevOps, you don't do yourself too much damage.

But first, let me give you a little bit of background to John Lewis because I realize it's a very UK-centric brand. The overall brand is the John Lewis Partnership, but we have 34 large department stores and also 329 Waitrose shops, both offering mid to premium goods to our customers.

It was back in 1864 that John Lewis Senior opened the first shop in Oxford Street in London. But actually here you'll see John Spedan Lewis, his son, who believed in fairness and humanity and experimented with a new way of doing business because he didn't think it was fair that the three owners earned more than all of the 300 employees put together.

Which is why now the 75,000 employees, or partners as we call ourselves, all have ownership in that business. We all have a share in that business, which means we share in the profits, but there's other benefits as well. Particularly for me, I'm looking forward to my six months' long leave when I reach my 25 years' service.

And these strong foundations have allowed us to adapt to an ever-changing retail market. On the top right there, you'll see one of our previous adverts with Edgar the Dragon, and our adverts are often called the official start of Christmas, but that was trending on Twitter and was number one within two minutes of its launch.

You'll see there our stores have had to evolve and adapt and are now much more experience-based and obviously targeting more convenience. But the e-commerce journey has also had to change as well along with that.

So back in 2021, we talked about operability and you build it, you run it. We talked there about a journey moving from a monolithic e-commerce platform with 12 releases a year, with one big development team handing over to a single operations team and all the constraints that went with that.

Now I'm pleased to say we have around 40 agile product teams where they build and run their own services, and they do about 7,000 deployments a year now. I'd like to agree with HSBC and David that it also has fewer incidents, which is fantastic.

So just a little bit of context. I won't go into detail for this, but our paved road, which is based on the paved road concept from Netflix, provides a number of standard services for all our tenants on the platform. For example, that allows us to spin up a new service within just one hour rather than the six months it used to take to provision hardware. It allows things such as simple config-driven Google Cloud Platform resource provisioning and standardized observability dashboards and alerts.

But we're here today to talk about the next part of that journey: how we've transformed our security by solving common technical, organizational, and process challenges.

I'm very relieved to say that I could include this slide because we were very pleased to win the Best Implementation of DevSecOps just last month at the Computing DevOps Excellence Awards. Obviously I'd have had one less slide if we hadn't won.

But how did we start our security journey? Well, it was really important for me to think about all the different key stakeholders. For me as a platform owner: how do I ensure and prove all of my systems are secure? For a product owner: how do I know when to spend effort on security and prioritize that? For a product team: they just want to build and release to customers as quickly as possible, but still securely. And then for our InfoSec security team: how can I make sure that every single deployment is secure?

I'll now hand over to Chris to talk about the landscape we were working with and how we went about tackling these challenges.

Chris Rutter

Thanks, Simon. Okay, so if we start with the impact that security had on delivery, let's look at our workflows about three years ago. Hopefully you can see a little bit of detail there.

If we look on the left-hand side of that diagram, it's a pretty typical development lifecycle. We define what we want out of our product. We do a series of iterative development sprints. This project, let's say it's about 12 weeks long, and now we're ready to try and release value to our customers.

Now, a couple of years ago, this is the point where we'd get stuck in the mud. As soon as we started the risk assessment process, we'd have a huge amount of back and forth with security teams. We'd have to explain every single part of the system one at a time, and then we'd get feedback on each part of the system, again one at a time.

We'd then have a pen test. We'd then identify security requirements. Then we'd try to stitch all of that together into the finalized risk assessment and try to decide and negotiate over what we should fix now, what we should fix later, and what risk we should accept.

The impact of this process was that usually from a 12-week project, we would have six weeks, or 50% of the delivery time of that product, spent in risk assessments and security work.

So if we dig into that risk assessment process, I'm not going to go into this in too much detail, don't worry, but every single line on that diagram on the right there was either an email or a phone call or a spreadsheet: hugely complex workflow.

The quote from one of our engineers was that we spent 75% of our time finding out what to do and 25% actually doing it. I think you might have been being generous there, to be honest.

What we realized is that the same security risks and requirements were being raised on each one of these risk assessments. The same controls were being implemented by every single team, but in different ways and with different levels of quality. This was obviously a hugely inefficient use of development efforts. Teams were terrified of going through this process, and what that resulted in is teams would avoid breaking down systems into smaller components. They would avoid delivering value faster and more iteratively.

That was what the platform was built for: this amazing capability to deploy to production very quickly. And teams were scared to do it because they didn't want to spend six weeks in a risk assessment process.

So how do we improve the situation? One of the first things we did was, okay, let's identify these security controls that are being raised every single time. Then we made some very strong decisions on which of these controls could we make cross-cutting and which were unique to each team and needed to stay bespoke.

Our platform teams built these cross-cutting capabilities for free so teams could use them on day one of their projects without having to build them themselves. Things like secrets management, identity permissions, logging. We made the decision that it's very rare a team is going to get a lot of value from implementing any of those things themselves. We get a lot more value from standardizing those things and providing them for free.

As mentioned in some other talks, we are extremely careful to treat our engineers as customers when we build these capabilities. We don't want an ivory tower architecture. We want empathic engineering. And as we know, software is not one and done. These things have to evolve over time to meet new use cases. We are very careful to factor that in when we build these products.

So what did that look like in practice? If I'm an engineer and I'd like to get two microservices that are just walking skeletons into production on day one, this is literally the only file that I need. One YAML file. We decide what we want the GCP project to be named. We define two microservices. We decide the permissions that they need, and we decide the name of our Slack channel.

This will be run in a pipeline and within 15 minutes all of this will be provisioned: GCP projects, Kubernetes namespace, everything that's required to deploy to production. Everything has IAM permissions, ingress, secrets management, logging, all done for free, all in a standardized way.
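The actual YAML file is not reproduced in this transcript. As a rough illustration only, the following Python sketch models the same four inputs (project name, two microservices, their permissions, a Slack channel) and shows how a pipeline might expand them into the list of resources to provision. Every name here (`TenantConfig`, `plan_resources`, the resource prefixes, the permission strings) is a hypothetical stand-in, not the John Lewis platform's real schema.

```python
from dataclasses import dataclass, field


@dataclass
class Microservice:
    name: str
    # Hypothetical permission names; the platform's real vocabulary is not shown in the talk.
    permissions: list[str] = field(default_factory=list)


@dataclass
class TenantConfig:
    gcp_project: str
    slack_channel: str
    services: list[Microservice]


def plan_resources(config: TenantConfig) -> list[str]:
    """Expand one tenant definition into the resources a pipeline would provision."""
    plan = [
        f"gcp-project/{config.gcp_project}",
        f"k8s-namespace/{config.gcp_project}",
        f"slack-channel/{config.slack_channel}",
    ]
    for svc in config.services:
        # Cross-cutting controls every service gets without asking for them.
        plan.append(f"service-account/{svc.name}")
        plan.append(f"ingress/{svc.name}")
        plan.append(f"secret-store/{svc.name}")
        plan.append(f"log-sink/{svc.name}")
        # Only the permissions are specific to each service.
        for perm in svc.permissions:
            plan.append(f"iam-binding/{svc.name}/{perm}")
    return plan


# Illustrative tenant: one project, two walking-skeleton services.
example = TenantConfig(
    gcp_project="checkout",
    slack_channel="#team-checkout",
    services=[
        Microservice("basket-api", ["pubsub.publisher"]),
        Microservice("basket-worker", ["pubsub.subscriber", "storage.objectViewer"]),
    ],
)
```

Calling `plan_resources(example)` returns a flat list of resource identifiers. The design point it mirrors is that the team declares only what is unique to them, and the standard controls are derived from that declaration.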

So all these things that you had to find out and negotiate right at the end of your project are now done for you on day one.

So what was the impact on the workflow that we looked at earlier? Now this is what a risk assessment looks like. If you're using the platform-provided controls, it's basically a tick box. So how are you handling secrets? Tick: I used the platform-provided mechanism. Thanks to teams not having to build these things every time, it saved us two to three weeks, 50% of every single risk assessment that we carry out. And that's a huge gain over the tens of teams that release new products every month.

So teams can release faster, deliver value in smaller chunks, and at the end of the day we want win-wins. It's more secure. It's very easy for us to prove the security across our whole system. We know how logging is done everywhere, for example. We know how ingress is done, and it's much easier to roll out improvements and updates over time to deal with emerging threats.

I think any cross-cutting control that you release that can't be changed is instantly legacy, and in security we have to keep up with emerging threats. So definitely one of the first win-win-wins that we achieved in our transformation.

Simon Skelton

So we've shown the improvements with the platform-provided services, but what about application security? Going back three years, we just had pen tests at the end of development. That meant we were finding vulnerabilities just before release, and obviously that was impacting the delivery timelines.

It also meant that we might do months of coding but then just time-box a pen test to a week. That risks missing some security gaps, and pen tests are only a point in time as well. The security risks surface all the time, as we know, but this didn't provide the best assurance for the whole life of the service.

Chris Rutter

Okay, so when we're trying to scale out security input in a fast-changing environment, one of the go-to methods I'm sure you're all familiar with is code scanning. It allows us to scale out security reviews to lots and lots of different systems. Thanks to our policy of encouraging innovation, we ended up with a stack made up of lots and lots of different tools and technologies. That's just an example of some of the tech in use there. There's way more, and there'll be way more next week when we check again.

If you're not careful, you can end up with five or 10 different scanning tools to help you scan all of this architecture, otherwise you have big gaps in customer-facing systems. Implementing all of those tools can get very time-consuming, and it can get very expensive. Often licensing models with our microservices architecture can be quite prohibitive.

So how do we get joined-up capabilities if we want to scan all of these codebases? How do we get reporting, investigation, compliance from 10 different tools?

We took an approach which we coined as bring your own scanner. Again, we really tried to understand what our different stakeholders needed. We realized that engineers just want fast feedback and quality scanning. Delivery leads and product owners just want to know: how secure is my thing? Business owners and InfoSec teams need to prove and demonstrate the security of the whole estate.

We found the hardest part about using all these scanning tools wasn't getting them scanning our code. It was gluing the scanning tool into how we work, into our communication methods, into our workflows. We had to glue lots of APIs, lots of backend scan servers, lots of user interfaces, and it ended up quite a complex workflow.

What we did instead, we realized: hold on, we know our environment the best. We know our workflows. So we built all of that stuff on the right. We built all the reporting capability, the vulnerability database, and we built dashboards on top of that.

What this allows us to do is to plug in any scanner we like, whether it's a lightweight CLI tool or whether it's a bigger enterprise tool, and easily plug it in within a week or two and get the same reporting and compliance that we do with everything else.
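One way to picture "bring your own scanner" is a small adapter per tool that converts its output into one shared finding record, so reporting and dashboards only ever see that record. The Python sketch below assumes this adapter approach; the two raw output formats, the scanner names, and the severity thresholds are invented for illustration and do not describe any specific scanner or the team's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    scanner: str
    service: str
    rule_id: str
    severity: str  # normalised to: critical, high, medium, low


def from_cli_tool(service: str, raw: dict) -> list[Finding]:
    """Adapter for a hypothetical lightweight CLI scanner that reports levels."""
    level_map = {"ERROR": "high", "WARNING": "medium", "INFO": "low"}
    return [
        Finding("cli-tool", service, r["check"], level_map[r["level"]])
        for r in raw["results"]
    ]


def from_enterprise_tool(service: str, raw: dict) -> list[Finding]:
    """Adapter for a hypothetical enterprise scanner that reports a 0-10 risk score."""

    def bucket(score: float) -> str:
        # Assumed thresholds, chosen only for this example.
        if score >= 9.0:
            return "critical"
        if score >= 7.0:
            return "high"
        if score >= 4.0:
            return "medium"
        return "low"

    return [
        Finding("enterprise-tool", service, i["id"], bucket(i["risk"]))
        for i in raw["issues"]
    ]


# Plugging in a new scanner means adding one entry here.
ADAPTERS: dict[str, Callable[[str, dict], list[Finding]]] = {
    "cli-tool": from_cli_tool,
    "enterprise-tool": from_enterprise_tool,
}


def ingest(scanner: str, service: str, raw: dict) -> list[Finding]:
    """Normalise one scan result so it can be stored in a shared findings table."""
    return ADAPTERS[scanner](service, raw)
```

With this shape, the vulnerability database and dashboards depend only on `Finding`, which is what would let a new tool be added without touching the reporting side.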

Just to bring that to life, this is an example of our dashboard. We have a few. This is the perspective of an engineer or a delivery leader or product owner. This relates to a single product on our platform.

We have a very clear security policy there at the top left. As an example, a critical finding must be fixed within five days. Very simple and easy to understand policy. There's no magic there.

Engineers love red-green. That's their language. They want pass or fail. So we have a pass or fail if any of your vulnerabilities are outside of policy. All of our scanners can be seen in this one dashboard. There's a dropdown. You can just choose two of the scanners or five or all of the scanners.
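The pass-or-fail rule described here is simple enough to sketch in a few lines of Python. Only the five-day limit for critical findings comes from the talk; the limits for other severities, the `Vulnerability` record, and the function names are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date

# Days allowed to fix a finding, by severity.
# Only "critical: 5" is quoted in the talk; the other values are illustrative assumptions.
FIX_WITHIN_DAYS = {"critical": 5, "high": 30, "medium": 90}


@dataclass
class Vulnerability:
    id: str
    severity: str
    first_seen: date


def days_left(vuln: Vulnerability, today: date) -> int:
    """Days remaining before the finding breaches policy (negative once breached)."""
    age = (today - vuln.first_seen).days
    return FIX_WITHIN_DAYS[vuln.severity] - age


def outside_policy(vuln: Vulnerability, today: date) -> bool:
    return days_left(vuln, today) < 0


def product_status(vulns: list[Vulnerability], today: date) -> str:
    """Red or green for a whole product: one breach is enough to fail."""
    return "FAIL" if any(outside_policy(v, today) for v in vulns) else "PASS"
```

A product with a critical finding first seen six days ago would report `FAIL`, while the same finding at four days old would still report `PASS`.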

The dashboard was built with Grafana sitting on top of a BigQuery database. It took us a couple of weeks to cobble together. This wasn't a huge six-month project to build this thing. Using open source tech, very, very easy for engineers to build and modify.

We found the effort of building the whole vulnerability database and dashboards was less than the effort we spent integrating a single enterprise tool into our own workflows. So very efficient use of our time.

Also, one of the key aspects we wanted to gain from things like reporting and dashboards is we really want to make security everybody's concern. We want to give engineers responsibility for their vulnerabilities and keep them engaged.

The last bit we did was to integrate a cloud function that sends a Slack digest every day to all of our engineers. You get a digest in Slack, in your team Slack channel: how many vulnerabilities do I have? How many are outside of policy? How many are almost outside of policy? So as a product owner, I know what to prioritize next sprint.
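The three numbers in that digest can be derived from a single value per vulnerability: the days remaining before it breaches policy. The following Python sketch builds the message text under that assumption. The wording of the message, the `warn_within_days` threshold, and the function name are hypothetical, and the step that actually posts to Slack is deliberately left out.

```python
def build_digest(product: str, days_remaining: list[int], warn_within_days: int = 2) -> str:
    """Summarise open vulnerabilities for one team's daily message.

    days_remaining holds, per vulnerability, the days left before it
    breaches policy; a negative value means it is already outside policy.
    """
    total = len(days_remaining)
    outside = sum(1 for d in days_remaining if d < 0)
    # "Almost outside" is assumed here to mean due within warn_within_days.
    almost = sum(1 for d in days_remaining if 0 <= d <= warn_within_days)
    return "\n".join(
        [
            f"Security digest for {product}",
            f"Open vulnerabilities: {total}",
            f"Outside policy: {outside}",
            f"Due within {warn_within_days} days: {almost}",
        ]
    )
```

A scheduled function could call `build_digest` once per product and send the returned text to that team's channel.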

We've seen this have a hugely positive impact on engagement and on engineer ownership of vulnerabilities.

Simon Skelton

So we've shown you how we've provided security visibility at a product team level, which leads to fewer delays from pen test issues being raised late in the development lifecycle. It also means we could release features more frequently to customers. We have better assurance throughout the whole lifecycle. And finally, we can also provide an export from our security scanner to our security team to give visibility and make sure that they feel assured that we are running everything securely for them.

But also what I wanted to do was integrate this with our senior leadership team and the service reporting. I know it says security metrics front and center. It's actually top right, but you get what I mean.

Alongside our key metrics, which are availability, major incidents, website performance, and of course our beloved DORA metrics, we've added in there the security vulnerabilities, those that are outside of policy. This helps prompt the conversation with our senior leadership team, allows us to request support for prioritization with the product leadership, and we can dig into it with more information with this high-level vulnerability dashboard that we have.

This lists all the services we're running on the platform. We can click on any of those services and drill into the dashboard that Chris showed earlier. We can sort that by the most vulnerabilities outside of policy, by the most critical vulnerabilities, or by the average age of the vulnerabilities over time. We can also select specific scanners, as Chris talked about. The charts are also really good at showing you the trend and the position over time so that you can spot if you've got issues building up.
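The three sort orders mentioned for the estate-wide view amount to ranking per-service summaries by different keys. A minimal Python sketch, assuming a hypothetical `ServiceSummary` record (the real dashboard is built in Grafana over BigQuery, and its schema is not shown in the talk):

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ServiceSummary:
    service: str
    outside_policy: int   # count of vulnerabilities past their fix deadline
    critical: int         # count of open critical vulnerabilities
    ages_days: list[int]  # age in days of each open vulnerability

    @property
    def average_age(self) -> float:
        return mean(self.ages_days) if self.ages_days else 0.0


# One entry per sort option offered by the view.
SORT_KEYS = {
    "outside_policy": lambda s: s.outside_policy,
    "critical": lambda s: s.critical,
    "average_age": lambda s: s.average_age,
}


def rank(summaries: list[ServiceSummary], by: str) -> list[ServiceSummary]:
    """Return services worst-first for the chosen measure."""
    return sorted(summaries, key=SORT_KEYS[by], reverse=True)
```

Ranking worst-first puts the services that most need a prioritization conversation at the top of the list.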

But what about critical emerging threats? I've heard this over the last day from many different talks talking about Log4Shell, Log4j, and absolutely, I remember that on a Saturday morning. I think I got a message in December 2021 going, "Hmm, the biggest critical vulnerability in the last decade. Maybe that's a problem that we should start investigating."

And like many organizations, we started running around, calling every single team out to review what they had. Did they have that vulnerability? Tracking it all in spreadsheets, working out if they'd secured their particular service.

So whilst the actual response was relatively swift, within a few days, it took a lot of effort. This drove the creation of our bill of materials dashboard, which Chris will now explain.

Chris Rutter

Yes, as Simon said, we definitely had a lot of learnings from Log4Shell and our capability to respond to it. What we built is our bill of materials service. What this does is every two or three hours it uses an open source tool called Syft to scan every single Docker image that we have in all of our artifact repositories. It fingerprints each one and again sends all of the bill of materials, every artifact, every library that's in use anywhere in that Docker image, into a central database, which we then pin a dashboard on top of, again using Grafana.

What we can do here is I can type in the artifact name of any third-party library. I can give it a range of version numbers, and I can instantly get a list of anywhere on my systems that's using that library.

The key thing is this also cross-references with data from our Kubernetes server and basically tells us which of these things are in production and which ones are not.
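The lookup described here (library name plus a version range, cross-referenced against what is running in production) can be sketched in Python over an in-memory list. The `BomEntry` fields, the team and image names, and the plain dotted-number version parsing are all assumptions for illustration. In the real service the entries come from Syft scans stored in a central database, and the version range used in the usage note is an example, not an authoritative list of affected versions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BomEntry:
    image: str    # Docker image the library was found in
    team: str     # owning team (hypothetical field)
    library: str
    version: str


def parse_version(version: str) -> tuple[int, ...]:
    """Parse a purely numeric dotted version such as '2.14.1' (a simplifying assumption)."""
    return tuple(int(part) for part in version.split("."))


def find_usage(
    bom: list[BomEntry],
    library: str,
    min_version: str,
    max_version: str,
    images_in_production: set[str],
) -> list[tuple[str, str, str, bool]]:
    """List (team, image, version, in_production) for every matching entry."""
    low, high = parse_version(min_version), parse_version(max_version)
    hits = []
    for entry in bom:
        if entry.library != library:
            continue
        if low <= parse_version(entry.version) <= high:
            hits.append(
                (entry.team, entry.image, entry.version, entry.image in images_in_production)
            )
    return hits


# Illustrative data only.
BOM = [
    BomEntry("basket-api:41", "checkout", "log4j-core", "2.14.1"),
    BomEntry("search-api:7", "search", "log4j-core", "2.17.1"),
    BomEntry("basket-api:41", "checkout", "jackson-databind", "2.14.1"),
    BomEntry("pricing-job:3", "pricing", "log4j-core", "2.11.0"),
]
```

Querying `find_usage(BOM, "log4j-core", "2.0.0", "2.16.0", {"basket-api:41"})` would list the checkout and pricing images and flag only the checkout one as running in production, which is the distinction that decides who gets contacted first.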

So you can imagine before without this tool, how do I understand where I'm using this particular library? You're manually scanning GitHub repos, GitLab repos. You're trying to understand what's in production, what isn't. The disruption to teams is massive. It's spreadsheet-based.

And then how do you understand which teams have fixed the issues and which teams haven't? Again, it's a two-, three-, four-week task to go and do all that stuff. With a bill of materials dashboard, it takes us maybe 20 seconds to get a list of all the teams that are using a particular library. We know inherently whether it's in production or not. It takes us less than an hour to send a Slack message to all of those teams that are actually affected and ask them to upgrade. And again, it takes us 10 seconds to figure out who's fixed it and who hasn't.

So it's saved us hundreds of hours of development time and coordination efforts whenever we've had any kind of near misses since Log4Shell.

So what's next? We've discussed all of the successes we've had solving problems. We said we have lots of scanning tools for different tech, but that never stops. We want to introduce more scanning tools to keep pace with engineers and calibrate the rules to make them more effective and have fewer false positives.

We'd like to build more realistic severity scoring. When we're dealing with third-party dependency vulnerabilities, often there's little context and a lot of vulnerabilities that are down as critical aren't necessarily that critical in our systems.

We'd also like to have the same ownership of real-time security alerts. We'd love it if we had a GCP real-time threat alert going into a development team so they're the first responders rather than being filtered through a central SOC.

Finally, to summarize, I won't go through all this, but our takeaways here: share security responsibility; be opinionated and standardized, but also know when to be unopinionated; understand security requirements for all your stakeholders; and decide what you want to build and what you want to buy.

Thank you very much for listening, and I hope you found at least some of this of interest.