Enterprise on the Bleeding Edge

Europe 2022

Technology Lead · Nationwide Building Society

Risk-averse enterprises need not fear new technology. By creating a fail-safe environment and a culture of experimentation, you can reap the benefits of being early adopters.

Full transcript

The complete talk, organized by section.

Rich James

Hi, I'm Rich James. I'm a technology lead at Nationwide Building Society, and I'm here to talk about how risk-averse enterprises can adopt bleeding edge technology in a safe way whilst reaping the benefits of being early adopters.

I'm going to tell the story through the lens of one team that went on a journey to the cloud, and talk about some of the techniques we employed along the way.

So this talk, I've divided it into five sections: playing catch-up, pioneering, some of the results of our efforts, techniques, and buy-in.

Just to level set, what do I mean by bleeding edge? So essentially, technology that is so new as to be unproven. The logos depict a subset of the tech that we've worked with in an unproven beta or preview state.

Situation 2018. So we worked in a very gray building. Admittedly, not that gray, but we did have all of the things you would associate with a dull, gray picture like that, which is manual deployments, basic on-prem infrastructure. We had no test automation. It wasn't the most inspiring environment that was going to attract new talent.

But that was all about to change, and that is, in fact, the sole reason that I joined Nationwide. We had a big digital transformation starting, and not just Agile, it was cloud, it was DevOps, it was a pivot to insourcing and building our own teams and capabilities, and thankfully, our offices were also getting a makeover.

So the team in question, they were building or rebuilding our vast website estate. So we had many, many thousands of pages of quite often junk content because it had just grown organically and people had lost track of what was where. It was becoming a conduct risk.

So as well as the underlying technology problem that we had, we had an experience problem. I'm going to focus now a little bit on the technology. So we actually wanted to move from having multiple CMS instances to a single headless CMS, so decoupled via APIs, from monolith to microservices, virtual machines to Kubernetes, on-premise to cloud, web forms to React, and manual deploys to DevOps full automation.

At the same time, we had another team building out a design system called the Nationwide Experience Language. This was a React-based design system that we now use for all of our web experiences. And so when we set out to build our new headless infrastructure, we wanted to use that same design system.

So we looked at some new technology, which was Sitecore's JSS SDK, which was a JavaScript SDK, which enables you to split apart the content delivery from content presentation. So the server-side rendering of your content could be a standard React application, and the headless content from APIs would come from the enterprise CMS that our content editors were used to working with. So standard request response architecture, but decoupled with headless content. So we moved on to a full implementation, and we started with a single-region deployment.

We then moved on to think about our microservice architecture. So what does it need to look like beyond the front end and the content management side of things? How are we going to structure our microservices, and what would that look like? So the backend for frontend pattern was commonplace at the time. So we started to review that.

We quickly realized that with a componentized UI of a CMS-based website, you might have a backend service for a component or widget, and you might have many of those on a page. And your content editors are effectively empowered to put multiple components on a page, so you can quickly have a position whereby your content editors can create a very chatty interface. So we didn't want that. That wasn't going to be a great user experience, specifically on mobile.

And one way to solve this, it's rather an elegant trade-off of the BFF pattern, is to use a GraphQL server where you have one single round trip. And so we set about productionizing Apollo GraphQL for our use case. So GraphQL, it was spun out of Facebook in around about 2015 as an open source project, and Apollo is a very mature implementation of it. So that was a cutting-edge API technology choice for us, which we de-risked with a mature implementation.
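To make the round-trip argument concrete, here is a minimal Python sketch that only counts requests. It is not Apollo and not the team's code: `FakeBackend`, the component names, and the content are all hypothetical stand-ins, and the "single query" is simulated by one call that names every component the page needs.

```python
from dataclasses import dataclass


@dataclass
class FakeBackend:
    """Hypothetical stand-in for the services behind page components."""
    data: dict
    round_trips: int = 0

    def fetch_one(self, component: str) -> dict:
        # Service-per-component style: one network round trip per widget.
        self.round_trips += 1
        return self.data[component]

    def fetch_many(self, components: list[str]) -> dict:
        # GraphQL style: one request names everything the page needs.
        self.round_trips += 1
        return {name: self.data[name] for name in components}


# Components a content editor might place on one page (made-up names).
PAGE_COMPONENTS = ["hero", "rates_table", "branch_finder", "promo"]
CONTENT = {name: {"title": name.replace("_", " ").title()} for name in PAGE_COMPONENTS}

chatty = FakeBackend(CONTENT)
chatty_result = {name: chatty.fetch_one(name) for name in PAGE_COMPONENTS}

single = FakeBackend(CONTENT)
single_result = single.fetch_many(PAGE_COMPONENTS)

print(chatty.round_trips, single.round_trips)
```

The page receives the same data either way; what changes is that the number of round trips no longer grows with the number of components an editor drops onto the page.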

This burger comparison is one of my favorites. It really shows you the power of GraphQL and what you can do with it. You get what you ask for and in the order you ask for it in. So tying that back to the original problem statement. So the analogy would be that you're getting your bun from the baker's, your burger from the butcher's, but the frontend doesn't care. Like you don't care when you go to a restaurant, right?
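The "you get what you ask for, in the order you ask for it" idea can be sketched in a few lines of Python. This is only an illustration of how a response mirrors the shape of the query; it is not how a GraphQL server is implemented, and the `select` helper and the burger data are invented for the example.

```python
def select(source: dict, selection: dict) -> dict:
    """Return only the requested fields, in the order they were requested.

    `selection` maps a field name to a nested selection, or to None for a leaf.
    """
    result = {}
    for name, sub_selection in selection.items():
        value = source[name]
        result[name] = select(value, sub_selection) if sub_selection else value
    return result


# Where each ingredient comes from is invisible to the caller (made-up data).
BURGER = {
    "bun": {"type": "brioche", "supplier": "baker"},
    "patty": {"weight_g": 150, "supplier": "butcher"},
    "lettuce": {"type": "iceberg", "supplier": "grocer"},
}

# Ask for the patty first, then the bun, and nothing else.
order = {"patty": {"weight_g": None}, "bun": {"type": None}}
plate = select(BURGER, order)
print(plate)
```

The lettuce and the supplier details never reach the caller, which is the point of the restaurant analogy: the frontend states what it wants and does not care which backend produced each part.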

So our internal NFRs for critical services stated that we had to meet greater than four nines for our website. Therefore, we had to build multi-region. And when you look at what that means from a cloud infra DevOps perspective, blue-green deploys, databases, et cetera, we realized this was going to get very complex.
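A quick piece of arithmetic shows why a four-nines target pushes towards more than one region. The sketch below is Python and rests on loud assumptions: the 99.9% single-region figure is an illustrative number, not one from the talk, and the formula assumes regions fail independently with instant failover, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


def parallel_availability(region_availability: float, regions: int) -> float:
    """Availability of N regions where any one can serve all traffic.

    Assumes independent failures and instant failover (a simplification).
    """
    return 1 - (1 - region_availability) ** regions


TARGET = 0.9999          # "four nines"
ASSUMED_REGION = 0.999   # hypothetical availability of a single region

budget = allowed_downtime_minutes(TARGET)
one_region = parallel_availability(ASSUMED_REGION, 1)
two_regions = parallel_availability(ASSUMED_REGION, 2)

print(f"four nines allows about {budget:.2f} minutes of downtime per year")
print(f"one region meets target: {one_region >= TARGET}")
print(f"two regions meet target: {two_regions >= TARGET}")
```

Four nines leaves roughly 52.56 minutes of downtime a year, so under these assumptions a single region at 99.9% cannot meet the target on its own, while two can.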

And being an enterprise, we are, of course, using Azure landing zones for multi-subscription environments and following a hub and spoke model. We had a problem. Our cloud platform team didn't have the funding to deploy all of our shared hub services to multiple regions, at least not in the timescales that we were working to.

Another issue that we bumped into, the regulator, the PRA. So we told the regulator that we would build multi-cloud, that we would meet our resilience targets for our platinum services, that we would be able to demonstrate portability in a stressed exit scenario, that we would test our disaster recovery. So we put our heads together and started to think about, how are we going to do all of this if we've only got a single region of compute? And that's where we decided to look at alternatives like Jamstack.

So the advantage of Jamstack is that your content, it's pre-rendered, so therefore it's as fast as possible delivery. So your browser has rendered your content in your webpage. As you're watching this, probably you've got rendered content in your webpage. So if that content is pre-rendered, as close to the webpage, as close to your browser as possible, you're going to get the fastest possible content delivery.

So we export all of the content, render all the pages in one hit, and that means that there's no server to actually do that rendering on the fly. No autoscaling of web servers, et cetera. It also means that your exported content is hyper-resilient, so you can deploy it very simply to multiple CDNs or blobs or buckets or whatever in multiple different cloud providers.
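The export-then-copy idea can be sketched with nothing but the Python standard library. This is an assumption-heavy toy: `render_page`, `export_site`, `publish`, and the two target folders are hypothetical, the folders stand in for buckets or CDNs at different providers, and the template does no HTML escaping.

```python
import shutil
import tempfile
from pathlib import Path


def render_page(content: dict) -> str:
    """Turn one content item into a complete HTML document (toy template)."""
    return (
        "<!doctype html><html><head><title>"
        f"{content['title']}</title></head>"
        f"<body><h1>{content['title']}</h1><p>{content['body']}</p></body></html>"
    )


def export_site(pages: dict[str, dict], out_dir: Path) -> list[Path]:
    """Pre-render every route to <route>/index.html under out_dir."""
    written = []
    for route, content in pages.items():
        target = out_dir / route.strip("/") / "index.html"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(render_page(content), encoding="utf-8")
        written.append(target)
    return written


def publish(build_dir: Path, targets: list[Path]) -> None:
    """Copy the same static build to several hosting targets."""
    for target in targets:
        shutil.copytree(build_dir, target, dirs_exist_ok=True)


PAGES = {
    "/": {"title": "Home", "body": "Welcome."},
    "/mortgages": {"title": "Mortgages", "body": "Rates and guides."},
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build = root / "build"
    written = export_site(PAGES, build)
    hosts = [root / "cloud-a", root / "cloud-b"]  # stand-ins for buckets or CDNs
    publish(build, hosts)
    copies = [
        (host / "mortgages" / "index.html").read_text(encoding="utf-8")
        for host in hosts
    ]
```

Because the output is plain files, "deploying to another provider" reduces to copying a directory, which is where the resilience and portability argument comes from.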

We set about modifying our architecture to support Jamstack, so quite a few things had to change. We actually adopted some new technology called Uniform, which has been built by some of the developers that built the Sitecore JSS SDK. So that was de-risked a little bit, the implementation, because they really did understand how this new JavaScript framework we were already using worked. And what they actually provide is a composable DXP, which is basically the glue between your CMS and the kind of Jamstack website. So it handles the content exporting for you, handles all of the tree walking for you so that you can then export everything in one hit.

We realized, though, that not everything could be static, so we still had some dynamic environments. So our PR environments, which were ephemeral, we didn't want to generate static websites. They were basically Kubernetes namespaces, and we want to just build it, deploy it, test it, run our suite of automation tests. If it works, blow it away, get rid of it, merge the PR in.

But preview and authoring environments, very different. So those two environments were dynamic for the sake of agility and fast feedback. So our authors, they want to work quickly, curate content quickly. Our approvers want to be able to see that content quickly. They don't want to wait for something to be generated. It needs to be dynamically generated in those scenarios.

Static environments, though, we had to not only have production, but also pre-production and integration, and that was another new thing to us, the idea that you'd have a private static site. Our website, it went from many thousands of pages down to less than 1,000, but it was still a lot. And you have to generate every page when you publish.

It actually took as long as 10 minutes when we first set out, and we said, "Right, we can't go any longer than that. It's just too long." And then we actually focused on optimizing, and we started doing things like batch exporting all of our content in one hit into a blob so that we could actually just generate all the webpages against that instead of going back to an API. And it turned out to be much, much faster, saved us a lot of time, put it down to a couple of minutes.
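The optimization described here, exporting everything once and generating pages against that snapshot, might look roughly like the Python sketch below. `FakeCms` and its methods are hypothetical and only count calls; the real export format and API are not described in the talk.

```python
import json


class FakeCms:
    """Hypothetical stand-in for a headless CMS API; counts calls made to it."""

    def __init__(self, content: dict[str, dict]):
        self._content = content
        self.calls = 0

    def get_page(self, route: str) -> dict:
        self.calls += 1
        return self._content[route]

    def export_all(self) -> str:
        # One call returns the whole content tree as a JSON document ("the blob").
        self.calls += 1
        return json.dumps(self._content)


def render(content: dict) -> str:
    return f"<h1>{content['title']}</h1>"


def build_per_page(cms: FakeCms, routes: list[str]) -> dict[str, str]:
    """Original approach: go back to the API for every page."""
    return {route: render(cms.get_page(route)) for route in routes}


def build_from_snapshot(cms: FakeCms) -> dict[str, str]:
    """Optimized approach: one batch export, then render from the local copy."""
    snapshot = json.loads(cms.export_all())
    return {route: render(content) for route, content in snapshot.items()}


CONTENT = {f"/page-{n}": {"title": f"Page {n}"} for n in range(1, 6)}

slow_cms = FakeCms(CONTENT)
slow_site = build_per_page(slow_cms, list(CONTENT))

fast_cms = FakeCms(CONTENT)
fast_site = build_from_snapshot(fast_cms)

print(slow_cms.calls, fast_cms.calls)
```

Both builds produce the same pages; the snapshot build simply replaces one API call per page with a single export, which matters more as the page count grows.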

Another interesting thing, unusual concept at the time, blue-green blobs. So blue-green deploys, we've all heard of. Blue-green blobs is now a thing. We use those for exactly the same purpose, but the extra advantage you get there is that you can actually run your automated tests or do manual testing for inspection to make sure everything's as expected. And you can do that in a blue-green sort of private environment before switching to live.
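A blue-green blob switch can be modelled in a few lines of Python. This is a sketch under assumptions: the `BlueGreenSite` class, the in-memory slots, and the smoke test are all hypothetical, and in practice the "switch" would be a routing change in front of two storage containers rather than a field assignment.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BlueGreenSite:
    """Two storage slots; only one is live at any time."""
    slots: dict = field(default_factory=lambda: {"blue": {}, "green": {}})
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def publish(self, pages: dict, smoke_test: Callable[[dict], bool]) -> bool:
        """Upload to the idle slot, test it privately, then switch if it passes."""
        target = self.idle
        self.slots[target] = dict(pages)
        if not smoke_test(self.slots[target]):
            return False  # live traffic never saw the bad build
        self.live = target
        return True

    def serve(self, route: str) -> str:
        return self.slots[self.live][route]


def has_home_page(pages: dict) -> bool:
    return "/" in pages


site = BlueGreenSite()
ok = site.publish({"/": "v2 home"}, smoke_test=has_home_page)
rejected = site.publish({}, smoke_test=has_home_page)  # broken build, no home page

print(site.live, site.serve("/"))
```

The broken second build lands in the idle slot and fails its check there, so the live slot keeps serving the last good version.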

So the results. How did we do? We had a lot of experience objectives. These were meant to address a lot of the pitfalls of the previous website. SEO was bad, search, et cetera. Conversion rates weren't very good. Lots of errors, and whether that's 500s, 404s, whatever. So what we wanted to do was improve on all of those. And we hit all of those targets. I've redacted all of the numbers for the sake of this presentation.

The icing on the cake, though, and this happened sometime just after launch. So WIMBA Global, who are a digital CX benchmarking organization, they ran a, I think it's qual and quant test, which they recruit people to come and do various different exercises on different industries' web estates. And they do that, and I think it's a service they provide to other people. They effectively tested all the VFS websites in Northern Europe, or most of them, a big chunk of Northern Europe, for both journey and non-functional performance. And they ranked us number one, and that was a massive win for us.

It was real proof that the work we'd put into everything from our content design, UX design, our hyper-performant Jamstack, isomorphic architecture, all of those things came together to just really get us right to the top of that list. That was a great win for the team.

From a DevOps perspective, we had a one-week route to live originally. It wasn't great. It was 40 hours of manual install and test in only two environments. It meant that getting anything into production was basically quite expensive.

We moved to a one-hour route to live, so that's fully automated deployment and test in four different environments. So your development, integration, pre-production, production environments. We actually proved that one-hour route to live on launch night. We discovered a bug, and we fixed it. We wrote an automation test, we submitted a change request, we deployed and tested in an ephemeral PR environment. We approved the PR, we merged, deployed, tested in an integration environment, again in pre-production, pushed it live, and we went through that entire process following all of our change processes within an hour. One-hour route to live.

So enterprises, generally laggards, right? How do you convince an enterprise organization that early adoption is a good idea? So with Jamstack, we did have that rock-solid NFR argument, the cost challenge. There are many reasons why that was a good idea. But there are other times where you might just want to adopt new technology because it's the right thing to do. So there's a bunch of techniques we now employ in order to de-risk new technology adoption.

Experimentation. So experimenting with hypothesis-driven development, that's a rock-solid approach to trying new tech. There's an example here. I wrote this myself. So we believe that Dapr, which is Microsoft's Lego for microservices application framework, will result in an increase in engineer productivity and increased portability. We'll have confidence to proceed when we've measured a reduction in boilerplate code or demonstrated portability.

As we saw from the PRA slide earlier, portability is one of the things that they want to see. They want to know that we're not getting tied into a cloud provider, that concentration risk, et cetera, that we can actually move our workloads from one environment to another. So Dapr is one of the application frameworks we're reviewing to see if that can actually help with that. So this is a good example of how you can look at new technology through a scientific lens and tie it back to some requirements.
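One way to keep a hypothesis like the Dapr one honest is to write its success signals down as checks that can be run against measurements. The Python sketch below is an illustration only: the `Hypothesis` class, the signal names, and especially the measurement numbers are invented for the example and are not results from the talk.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Hypothesis:
    belief: str
    expected_outcome: str
    signals: dict[str, Callable[[dict], bool]]

    def evaluate(self, measurements: dict) -> dict[str, bool]:
        """Run every signal check against the spike's measurements."""
        return {name: check(measurements) for name, check in self.signals.items()}

    def confident_to_proceed(self, measurements: dict) -> bool:
        # The wording joins the two signals with "or", so any one is enough.
        return any(self.evaluate(measurements).values())


dapr = Hypothesis(
    belief="Adopting Dapr for our microservices",
    expected_outcome="Higher engineer productivity and increased portability",
    signals={
        "boilerplate_reduced": lambda m: m["boilerplate_loc_after"] < m["boilerplate_loc_before"],
        "portability_demonstrated": lambda m: m["ran_on_second_platform"],
    },
)

# Made-up numbers purely for illustration.
spike_results = {
    "boilerplate_loc_before": 1200,
    "boilerplate_loc_after": 900,
    "ran_on_second_platform": False,
}

results = dapr.evaluate(spike_results)
print(results, dapr.confident_to_proceed(spike_results))
```

Writing the signals as code forces the "we'll have confidence to proceed when..." clause to name something measurable before the experiment starts.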

So ADRs, they are also great. It's an argument framework to force you to evaluate your options. This is architecture decision records. There's an example here on the screen. So when we started out rebuilding this website, we set out with a monorepo, which is actually ADR001. This is 002, and we needed some tooling for our monorepo.

So the monorepo tooling, we wanted to have the same local development build CI/CD experience as you have in your actual pipelines that deploy into all of your other environments. We tried a few things. We looked at Bazel and Buck, by Google and Facebook respectively. We chose a tool called Nix, which is a package manager from a Linux distro, with a Haskell-like syntax. Yeah, we chose it.

Then we undid that decision here. So it wasn't a great decision. Our developers couldn't pick it up. At the time, Skaffold wasn't mature enough, but throughout the course of the development of our website, Skaffold came along quite a bit and we decided to pivot to using a standard kind of Helm and Docker setup and using Skaffold to act as effectively the orchestrator glue for a monorepo so multiple different services could be built and deployed and hot reloaded and all the rest of it when you're in dev.
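The useful property of an ADR log is that a reversed decision is superseded rather than deleted, so the reasoning trail survives. Here is a minimal Python sketch of that idea. The `Adr` and `AdrLog` classes are hypothetical, the decision texts paraphrase the talk, and the number given to the superseding record is just whatever the sketch assigns next; the talk does not say which ADR number recorded the pivot.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Adr:
    number: int
    title: str
    decision: str
    status: str = "accepted"
    superseded_by: Optional[int] = None

    @property
    def ref(self) -> str:
        return f"ADR{self.number:03d}"


@dataclass
class AdrLog:
    records: list[Adr] = field(default_factory=list)

    def add(self, title: str, decision: str, supersedes: Optional[int] = None) -> Adr:
        """Append a new record; optionally mark an earlier one as superseded."""
        adr = Adr(number=len(self.records) + 1, title=title, decision=decision)
        if supersedes is not None:
            old = self.records[supersedes - 1]
            old.status = "superseded"
            old.superseded_by = adr.number
        self.records.append(adr)
        return adr

    def current(self) -> list[Adr]:
        return [adr for adr in self.records if adr.status == "accepted"]


log = AdrLog()
log.add("Repository layout", "Use a monorepo")
log.add("Monorepo tooling", "Use Nix")
log.add("Monorepo tooling", "Use Skaffold with Helm and Docker", supersedes=2)

print([adr.ref for adr in log.current()])
print(log.records[1].status, log.records[1].superseded_by)
```

The Nix record stays in the log with its status changed, so anyone asking "why not Nix?" later can find both the original argument and the one that replaced it.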

So there's a couple of techniques there, but you can tie them together, right? And this is a lean-ish evident enterprise governance workflow, starting with uncertainty in the top left, your architecturally significant question, something big that you don't know how you're going to do it. Portability, for example. You might write a hypothesis to effectively evaluate something, and then you might have some tech spikes, and then you might make a decision a little bit further down, and then you might go on to designing your solution and peer reviewing and going into a kind of evolutionary architecture loop there of build a bit, design a bit, build a bit, update your documentation.

So this is not necessarily a linear process, but it just gives you an idea of how you can tie these different techniques together in an enterprise environment to effectively de-risk that uncertainty, to become more and more certain as you get down towards that build and making sure you're doing the right thing or the most right thing and not wasting too much time.

Open source contribution. So two of the technologies that I've already mentioned, Dapr and Skaffold, they weren't mature enough to use within the enterprise when we started to review them and look at them and thought about adopting them. With Dapr, I just contributed myself. It was a really nice experience, very rewarding. I'd highly recommend everybody does it.

Just open source generally. The product missed a feature that we had an evaluation going on with our prototype team, and the simple thing was just submit a pull request, create a feature request, submit a pull request, update the documentation, and the guys at Microsoft actually helped me through that and made it a very pleasant experience, very rewarding, and I learned an awful lot.

Another tool that we contributed a fair amount to was Google's Skaffold. Mostly with feedback. There were a few pull requests, but it was more about trying to get the tool mature enough, and the team at Google have been absolutely rattling through updates to that. And almost everything that we submitted as an issue or ticket was fixed in the next version, and the next version was the next sprint. So it was just lightning fast, super quick.

Timing. Now, timing. I find this really key when it comes to adopting new technology. So, like many enterprises, we follow SAFe, and we work to quarters. And this, coincidentally or not, has honed my personal opinion on the scale or size of technologies that you should adopt. So I have a simple rule of thumb. If one engineer cannot pick up a new piece of tech and get it close to production within a quarter, it's really not something you want to adopt. So tech churn is really high, and the cost to change direction shouldn't be.

So for smaller items, we drew some inspiration from Intel's Tick-Tock approach. So the idea is new tech, one sprint, optimize the next. And we use this to limit the amount of new technologies in flight. And that meant that the team's cognitive load was not through the roof. And it meant that you wouldn't end up with a huge amount of instability in your code base because you'd adopt one thing new, you would mature it, you do your KT, make sure everybody understood how to use it, move on to something else. And so we funneled our technology into this kind of pipeline of Tick-Tock approach to adoption.
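The tick-tock rule, one sprint to adopt and the next to optimize, amounts to a simple scheduling constraint: only one new technology is in flight at a time. The Python sketch below is one hypothetical way to lay that out; the `tick_tock` function and the backlog item names are placeholders, not the team's actual plan.

```python
def tick_tock(backlog: list[str], first_sprint: int = 1) -> list[tuple[int, str, str]]:
    """Give each technology an adoption sprint, then an optimization sprint.

    Returns (sprint number, phase, technology) tuples in sprint order.
    """
    plan = []
    sprint = first_sprint
    for tech in backlog:
        plan.append((sprint, "tick", tech))      # introduce the new tech
        plan.append((sprint + 1, "tock", tech))  # mature it, do knowledge transfer
        sprint += 2
    return plan


# Placeholder names; any prioritized list of candidate technologies would do.
schedule = tick_tock(["new-tech-a", "new-tech-b"])
for sprint_number, phase, tech in schedule:
    print(sprint_number, phase, tech)
```

Because each technology occupies two consecutive sprints on its own, the second item cannot start until the first has been through its optimization sprint, which is how the approach caps cognitive load.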

So tech radars, a really useful way to track technology you're interested in. ThoughtWorks obviously provide theirs as an open source project, but simplest way, just use Miro. Means you can change the ring titles very easily and all the rest of it. And that is just easy.

So I personally use these. And it means that you can use it to time your adoption. And when I say timing adoption, I mean if you're looking at some new cloud product, like Azure Front Door wasn't GA when we were starting to look at it. It meant that we could track some of these things, and we had a good idea of what the alternatives were. Were we going to use CloudFront or something else like that if Front Door didn't actually mature in time?

But another good thing with the hyperscalers is you've probably got a line into them, especially if you're an enterprise organization, and they will tell you when it's going to go live, roughly. They won't give you an exact date, but generally, the rough date is pretty exact. Haven't ever seen slippage, personally.

This actually worked out in our favor massively when it came to the rules engine, which is a sub-component of Front Door and came along later. But we needed it to actually do a lot of redirects and some other bits and pieces, which services like Netlify provide, which is obviously one of the first or early Jamstack solution hosting providers. So we fed in our requirements into Azure Front Door's rules engine team, and the product actually became better and more usable for Jamstack websites because of some of the feedback that we gave them.

For CNCF projects, I think there's a different approach. You've got to be close to the community on Slack, on GitHub. You have to invest time in understanding that code base. This is obviously riskier, though it's less risky if you adopt technology which is being driven by a hyperscaler. Skaffold is Google, Dapr is Microsoft. Those are two technologies where they've put a significant investment into these open source projects. They're not going to let them just fall over and die that quickly.

So we talked about technology. Now, how do we get people around technology? So Conway's Law asserts that organizations are constrained to produce application designs which are copies of their communication structures. I read this as have one big team, build a monolith, have a front-end, back-end team, end up with an undesirable split in your architecture, which ends up with hand-offs.

So to perform an inverse Conway maneuver is to effectively design your organization to suit your desired architecture, and that is where we move into sociotechnical architecture. So there's some of the techniques that we employ. So Wardley mapping, for example. You want to focus all of your effort on the things at the top left. So it's a value stream that's actually quite useful. The top is your customer, the bottom is your commodities, et cetera, furthest distance away. So you've got power, compute, and stuff at the bottom. Right on the top left is where you'd put your developer resource.

And you basically want to shift all the things that aren't custom development over to the right. And this gives you a really good way of looking at your systems and how you might group them and who might own them. When you go to the next level of detail about those systems, domain-driven design is really, really useful. And we've started to use it to understand boundaries, friction points, and to create microservice blueprints. And then you can work out which teams should own them, following team topologies. So, standard team topologies, whether it's platform teams, stream-aligned teams, et cetera.

Got a worked example of that. So here's some microservices. So we've been through a domain modeling exercise, we've mapped some features, and we've worked out how we're going to build our services. We know what we're doing. We can look at that and go, well, actually, just from this lens alone, you could probably say that those three services on your left might be in an accounts stream-aligned team. On the right-hand side, you might say that could be a payments stream-aligned team. And you could draw these circles around and go, that looks like a sensible way to split these services up and have them built by different teams.

And that's great, but we're a big enterprise organization. We have loads of dependencies that are external. We don't have the, I guess, the position that a lot of small startups might have, which is that they own everything, everything's SaaS or cloud. So for us, we've got external teams, we've got systems of record, and we want to make communication as efficient as possible. So lining these up with our team topologies is also a very good idea. In this case, yeah, it kind of works out. You've got an external team and a system of record in these two different teams. Then, you can streamline your communications with those teams' systems of record. It's looking good.

So right now, we're going through a redesign of our website. So as we're doing this redesign, we're looking at the information architecture, and we kind of want our information architecture to line up, too. So yes, funnily enough, it does. We've got our home button, which would be in our app, would line up to things like your accounts. That's what you want to see when you log into a mobile app. Payments lines up to the other team. So that's looking good.

But if you drill down and look at how your information architecture would navigate you to those features, you might find that you have some undesirable friction points. So we've got natural fracture planes in here, but we also have a friction point. And this suite of techniques, I find really useful because it really does help you highlight how you should structure your teams and what you need to do to accommodate some of these undesirable friction points.

Now, from a user experience perspective, that was the right thing to do, but from a team perspective, it doesn't actually quite work out. So then you have to work out how you're going to overcome that hurdle. Is it going to be through collaboration? Is it going to be through much more loosely coupled architecture? Maybe you're going to use GraphQL and things like that, too, because it means that you can have the back end and front end a lot more separate than you would do with REST.

To innovate on the bleeding edge, the most critical element of all is psychological safety. So a high-performing team will be empowered to make mistakes knowing they won't be punished. And if you want to be a leader, you have to invest in innovation, and that means you have to celebrate failure and learnings from failure.

This diagram of Simon Wardley's, he claims it as a joke. It is a joke now, but I'm 100% sure it did not start as one. This used to be true. And maybe it still is in some organizations, but we're looking at doing the opposite now. So to lead, it means you have to be an innovator. It means to go first, and that's what you have to do if you want to be on the bleeding edge. And you can really only do that if you have empowered people.

Thank you very much. That's all.