How Betway Tests in Production: Hypothesis-Driven Development

London 2020

Betway has been following a “test in production” approach to building software for a few years. They test in production for two primary reasons: to validate business hypotheses and gain confidence in technical implementations.

In this session, Michael will share tips, experiences, and lessons learned from testing in production.

Some of the topics will include:

• A/B testing: how Betway implements A/B tests and gathers data to prove that a product change is adding value

• Trunk-based development: how Betway uses this technique to release incomplete and broken code into production, so it can be continually tested and refined within the production environment

• Managing problems: how Betway resolves both technical and business problems that arise

Michael will also share how the company implemented key aspects of testing in production across several systems. Plus, he’ll share a framework they adopted to streamline the whole process.

This session is presented by LaunchDarkly.

Full transcript

Michael Gillett

Hi, I'm Michael Gillett. I'm a solutions architect at Betway in London, and I've been with the company for a number of years, working from the front end all the way through to the back end on a number of the different websites and services that we offer around the world.

In this presentation, I'm going to take a look at the ways in which, over the last few years, we've really moved towards using testing in production as the way of doing releases, getting features out there, and how we are able to do more efficient testing. I'll also be sharing some of the tips and best practices that we've found over the past few years. So I'm going to share my screen and crack on.

So what I want to make sure, just to set everyone's expectations, is that you don't expect this to be something where I'm going to give you a load of horror stories. So I'm going to give you this disclaimer. What we've got today is going to be about how you can avoid things that look like this, that look scary. There's nothing scary about testing in production. And sure, we've had a few things go wrong along the way, but that's what, towards the end of this, I'm going to be able to talk about: the ways in which we've come up with ideas and processes to ensure that things don't go wrong, which is obviously what we all want. We want releases to go well.

So the talk is going to be around: how do we do testing in production? But it's obviously that element of: how are we building things in a reliable way? How are we making things that work, that do the things that we want them to do, and do it in a way that won't give us headaches and problems as we're getting them out there?

I'm going to take a step back from talking about software and look first at a wider picture, starting off with craft. What I mean by this is, this is how things used to be made hundreds and thousands of years ago. This is how we did things. We dealt with everything in a specific, bespoke way. Everything was handmade. It was a craft to build pots and make anything. It was done in this very particular way. Quality was often there, for sure, but it was a slow way of working. You couldn't scale it to the levels at which we often needed things. So it's quite inefficient, but it did often yield good quality things.

That then led, obviously with the Industrial Revolution, to mass production, where suddenly you could produce things in vast quantities, far faster than humans could make them by hand. We had loads and loads of things available to us. The problem, though, was that because everything was pretty cheap now to make, and you could make it quickly, things were made without really necessarily a need for them. Also, quality wasn't a concern. It was cheap to make this stuff, so they just made it. If 90% of every run through the factory was good enough, then that's fine. You just need to make a bit more, and you've got what you needed.

But obviously, that's not a very efficient way of working. And so that was refined into what Toyota is perhaps best known for, which was this idea of lean mass production, where you don't want that element of waste. You want quality built into what you're doing. You don't want problems. You don't want quality issues. What you want is everything that comes out of that factory to be good. You want it to have that quality baked in.

The way that was done was through understanding that pipeline, understanding the materials, the processes that were involved in producing these things. Through that understanding, constant refinement could be made. Hypotheses could be created as to better ways of working. And then the end goal was: well, you can validate whether that hypothesis was correct. In doing so, bake that in for the next run through the pipeline or to the product that you're shipping at the end of it. All with fairly low risk, but it just means that what is produced results in very little waste throughout the process, and what rolls off that production floor at the end is of good quality.

Then really what that is, is employing the scientific method in the way things worked. So moving from craft, treating everything in a bespoke way, suddenly you want to do things frequently. You want to make lots of the same thing again and again. You can do that. You can't necessarily do it reliably. So then you bring in this idea of looking at the numbers, analyzing things, being able to test things, repeat things, draw conclusions. That leads you to lean mass production and an extremely efficient way of working. Obviously, we all get to benefit from that in our day-to-day lives with most of the stuff that we actually use. It's all adopting that lean mass production technique.

Perhaps the best example of that is with NASA and getting humans on the moon. They started a decade struggling to get anything into orbit and finished the decade putting humans on the moon and bringing them back. That was not necessarily lean mass production, but it was the scientific method. What they wanted to do, they had an end goal, they had a hypothesis: can we get there? Everything that they did in that time was about constant improvement and refinement, collecting data, analyzing that data, and making sure that the next mission was more successful than the last.

It wasn't to say that everything was successful. Sure, NASA had a lot of problems and some very expensive rockets blowing up along that course to get to the moon. But the point is, they constantly refined. They constantly came up with ideas about improvements. They understood what it is they were dealing with, and through doing that ensured they reached what they wanted to get to.

Obviously we've got a modern-day equivalent with SpaceX. Only a couple of weeks ago, they put humans into space. But here you can see they're landing two rockets side by side at the same time, which had never been done before. Again, just like NASA, SpaceX have had problems along the way. They've had rockets blow up and things go wrong and launches be scrubbed. But the point is what they're doing is they're constantly learning from all of that stuff and improving it for the next time.

But NASA and SpaceX are effectively doing this stuff in production. Yes, they are analyzing stuff ahead of time. They're running models. They're checking the data. They're checking that things all look good. But ultimately, when a rocket that's meant to be going into space blows up, that is something going wrong in production for NASA or SpaceX. Obviously there are lots of similarities to the stuff that we do, but rather than dwelling on that, we should probably move away from NASA and SpaceX and rockets, however cool they are, and bring it back down to software and how we launch our stuff into production, not space.

So what we're therefore really going to be talking about, but bearing all of that stuff in mind, is hypothesis-driven development. The idea where we take ideas and we build stuff in accordance with those ideas, proving those ideas, implementing the ideas, and constantly refining our products and our workflows in much the same way that a lean mass production factory floor works, or NASA and SpaceX are getting things into space.

The one thing I question with this term, and it is perhaps a niche term but it is used within the industry, is the word development in hypothesis-driven development, because development maybe feels more like a craft than lean mass production. If we're employing the scientific method, we're not really developing things in that sense. Development might give this impression that every bit of software is unique and bespoke and, no, you can't treat software in this standardized way. Everything we do, every app we have, every system we have is a unique thing, as I've certainly heard. But maybe what we're looking at doing instead is bringing in these standards and processes that ensure the quality throughout, even if there's this element of difference. Perhaps then what we're really dealing with is hypothesis-driven engineering.

But how do we do this at Betway? I mean, that's the point of the talk, right? That's what I've been doing for the last few years and what I'm going to be sharing with you. So the first one, and perhaps the most obvious way of doing hypothesis-driven engineering, is the business hypothesis, where we need to validate an idea that someone has had to improve our product. They understand it, they understand the data, they understand what we're trying to do, and they've come up with a thing that they think would be better for our product.

So what do we call that? Well, that was probably an A/B test, right? A lot of you will have probably heard of the idea of an A/B test, where you've got version A of something and version B of something and you want to run a test: which one is better? Quite often we found that we could just go down the route of implementing a new thing. Doesn't mean it's better, it's just new, and we'd do a big-bang release and it would be there. But really what we now do, and what I think is a better way of working, is creating a hypothesis first. Why should this thing be done? Why is it going to be better? What is the thinking behind that?

Then you can split the traffic between your known state and the new state. You can make sure everything else in your environment, and I don't just mean production, I mean the wider environment, is the same. Then you can split your traffic equally between the two and see which one is better. Is it more clicks? Is it more conversions? Is it more revenue? Whatever it might be, whatever that success metric is, you can understand it and quantify it through logging everything of relevance to this experiment. Then you have to analyze that data. When you analyze it, you can see which one is going to be better for you or not, and therefore you can validate that hypothesis. Was it correct to do that or not? Has it done what the original idea was meant to do?
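The split-and-measure loop described here can be sketched in a few lines. This is a minimal illustration, not Betway's implementation or LaunchDarkly's API: the function names, the `homepage-test` experiment key, and the event fields `variant` and `converted` are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List


def assign_variant(user_key: str, experiment: str) -> str:
    """Deterministically place a user in 'A' (known state) or 'B' (new state)."""
    digest = hashlib.sha256(f"{experiment}:{user_key}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in the range 0..99
    return "A" if bucket < 50 else "B"  # equal split between the two states


def conversion_rate(events: List[Dict], variant: str) -> float:
    """Share of logged visits for one variant that reached the success metric."""
    visits = [e for e in events if e["variant"] == variant]
    if not visits:
        return 0.0
    return sum(1 for e in visits if e["converted"]) / len(visits)


# Hypothetical logged events: one record per visit in the experiment.
events = [
    {"variant": "A", "converted": False},
    {"variant": "A", "converted": True},
    {"variant": "B", "converted": True},
    {"variant": "B", "converted": True},
]
rate_a = conversion_rate(events, "A")
rate_b = conversion_rate(events, "B")
```

Hashing the experiment name together with the user key means the same user always lands in the same variant, while different experiments split the audience independently of each other.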

Well, you've done that, sure, and you've done it in production and you've tested something, and no longer is it this idea of thinking that it's going to be better or worse. You know if it's better or worse. You've done it with your real use cases, your real users, in your real production environment, and you have categorically proven whether the hypothesis was true or not.

I'm just going to go through a scenario that we've done at Betway. This was the homepage, and you'll notice I'm saying was. This was the homepage in the United Kingdom, and the crucial thing to note here is that we had multiple buttons that all were doing the same thing, and people wondered whether that was actually not really driving people to the right place. Those links were all going to registration, but perhaps this page is so noisy people were dropping off before they ever clicked any of those buttons. So a hypothesis was developed, which was: let's just have a single button. If we emphasize that single button, we might, we should get more registrations. And so we set about enabling this to happen.

The first thing we needed to do was design a new homepage. We wanted a slightly simpler one that can drive that message home a little better. This one you can see on the right is the new one, the hypothesis one. It's got a single Join Now button rather than multiple ones that are spread out on the left-hand side. It's still got a lot of what was in the original one. Obviously with this kind of stuff, changing a homepage of a brand as big as Betway comes with a lot of vested interest from a lot of other parties. We've got marketing, brand, SEOs to name just a few of them. Plus from a tech perspective, if we're doing a new homepage, is it an opportunity for a new app? Do you want the performance to be the same? There's lots to think about in this kind of thing, but the premise here is about that Join Now button. So let's just focus on that.

So what did we do for this? Well, we built that new homepage, and what we did was we ended up implementing a redirect on the Betway.com route using LaunchDarkly to split the traffic. What we can then do is send most of the traffic to the old page perfectly fine, but we can release stuff to the new homepage and only let devs and QAs look at that. Is it working? Is the feature that the dev is working on correct? Is everyone happy with it? Cool. What we can do is we can roll it out now for key members of the business. Let's get some user acceptance testing happening. Is everyone happy with the way it looks and works? Cool. Awesome. Okay, if everyone's on board and we're all happy and the boxes are all ticked, we can now roll out to 5% of our real users.

Again, there are still no deployments happening here. We've done our deployments. They're on a separate app. We've still got our old app running, our perhaps A, control, and now we've got our B, new one, and we're just sending some traffic there. In this case, we're just sending 5% of all UK traffic who have their browser in the English language. Cool. Nice and safe. No big bang, no massive rollbacks, no panic. It's just a nice, small incremental rollout. If that all is looking good, the data is fine, everyone's good and happy with that, well, then we roll out a bit more. We can roll it out to 25% of our UK English-language users. Cool. Okay. If it continues to look good, and by good, I'll come back to that in a minute, we can then roll out to 50% of the UK and our English speakers. But again, we don't have to do any releases for this to happen. At this point, we're in a full A/B test mode.
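The staged rollout described above (internal users first, then 5%, 25%, and 50% of UK English-language traffic) could look roughly like the sketch below. It is an assumption-laden illustration rather than the actual LaunchDarkly rule set: the `Visitor` fields, the `new-homepage` flag key, and the `GB`/`en` codes are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Visitor:
    key: str                # stable identifier for the visitor
    country: str            # e.g. "GB"
    language: str           # browser language, e.g. "en-GB"
    internal: bool = False  # devs, QAs, and business users doing acceptance testing


def bucket(key: str, flag: str) -> int:
    """Map a visitor to a stable bucket in the range 0..99 for a given flag."""
    digest = hashlib.sha256(f"{flag}:{key}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def sees_new_homepage(visitor: Visitor, rollout_percent: int) -> bool:
    """Decide per request which homepage to serve; no deployment is involved."""
    if visitor.internal:
        return True  # internal users always see the new page
    if visitor.country != "GB" or not visitor.language.lower().startswith("en"):
        return False  # outside the targeted audience
    return bucket(visitor.key, "new-homepage") < rollout_percent
```

Because each visitor's bucket is fixed, raising `rollout_percent` from 5 to 25 to 50 only adds visitors. Nobody who already saw the new page is moved back to the old one, and setting the value to 0 returns everyone except internal users to the control.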

So we've got our old and new homepages. Traffic is being split equally. If one is converting better than the other, there's no real other variations involved now other than that design and that single focus on a button. So this is what we're really testing. Is this page better than what was there before? Well, what we found was that page actually resulted in a 25% increase to our successful registration rate. Now, obviously, that is hypothesis proven. Awesome. Right? Really good. No big bang, no scary stuff.
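The talk does not say how the registration data was analyzed. As one common way to check a result like this, the sketch below computes the relative uplift between two variants and a two-proportion z-score. The counts are invented purely for illustration, and the choice of test is an assumption, not a description of Betway's analysis.

```python
import math


def relative_uplift(rate_a: float, rate_b: float) -> float:
    """Relative change of B over A, e.g. 0.25 means a 25% increase."""
    return (rate_b - rate_a) / rate_a


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z-score for the difference between two conversion rates (pooled variance)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err


# Hypothetical counts: registrations and visits per variant.
uplift = relative_uplift(400 / 10_000, 500 / 10_000)
z = two_proportion_z(400, 10_000, 500, 10_000)
significant = abs(z) > 1.96  # roughly the 5% two-sided threshold
```

A 25% increase is only meaningful alongside the sample size: the same uplift measured on a few hundred visits would give a much smaller z-score.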

That goes into the technical accomplishment of that project, where we had no rollbacks from a release point of view. We didn't have this crazy, "Oh, we've gone live with everything. We need to bring it all back." That was just not there. Sure, there were things that weren't quite right with that new homepage. As we expanded the experiment, we may have found issues. What we can do is just turn the toggle off. Just bring that back to our internal users. Fine. No real risk. No getting up early in the morning at 2:00 a.m. to fix a problem. It's just easy.

We didn't have any critical alerts either because of the way we were able to roll this out slowly. There was nothing that massively broke. And again, if there was, we can just revert that toggle down. What we found was, because of those first two things and that nice ability to respond to things in a calm way, we actually found that all of our exceptions that were occurring decreased over time. You can see that in this graph, where each bar is actually a different day, but each color is a different browser on a device. It's the number of exceptions that we saw in that browser on those devices. You can see it decreases over those five or six weeks.

But what's really cool is we could actually be very informed with the processes that we were doing. If we found an exception was being thrown in a particular browser on a certain device, we could actually stop the new homepage being served there, but still keep it rolled out to anyone else who fulfilled the 5%, 25%, 50%, whatever. Then we could also decide if an exception warranted us bringing back down the size of the group we had rolled this out to. We can just do all of this. We can look at all of this data because we need this data to know how the experiment is ultimately performing. But suddenly, it's also guiding who we're rolling this out to and how fast we're rolling this out. It's really, really powerful to be able to do this stuff with real users, real devices, real browsers. We're getting all of this, which might just not have been visible to us in the testing environment before going live.
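One way to picture this kind of exception-guided targeting is sketched below: count exceptions per browser and device pair, block the pairs that cross a threshold, and leave the rollout untouched for everyone else. The record fields, the threshold, and the function names are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Combo = Tuple[str, str]  # (browser, device)


def blocked_combinations(exceptions: List[Dict[str, str]], threshold: int) -> Set[Combo]:
    """Browser/device pairs whose logged exception count reaches the threshold."""
    counts = Counter((e["browser"], e["device"]) for e in exceptions)
    return {combo for combo, count in counts.items() if count >= threshold}


def allow_new_homepage(browser: str, device: str, in_rollout: bool,
                       blocked: Set[Combo]) -> bool:
    """Keep the rollout for everyone except the combinations that are misbehaving."""
    return in_rollout and (browser, device) not in blocked


# Hypothetical exception log entries collected from the new homepage.
exceptions = [
    {"browser": "Safari", "device": "iPhone"},
    {"browser": "Safari", "device": "iPhone"},
    {"browser": "Safari", "device": "iPhone"},
    {"browser": "Chrome", "device": "Android"},
]
blocked = blocked_combinations(exceptions, threshold=3)
```

The same experiment logs that measure the hypothesis also drive the targeting decision, which is the point made above about the data guiding who the rollout reaches.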

The second hypothesis that we can use here is a technical one. What we want to do in this scenario is actually validate an implementation. So this is about rolling out perhaps a new feature, maybe a new authentication system or a new performance improvement, which is potentially a very significant change. Again, you don't really want to do a big-bang release with that kind of change. You want to be able to just put it out to a small subset of your users to validate that that is working as you expect. Is it bringing in the performance gains that you wanted?

So you can target just 5% of users. You could target just a country like Canada. You could target device type, make sure that the performance on mobile is as good as it is on other things or as good as you wanted. But you can do more complicated things. So we've done more complicated even than here, but this is perhaps a nice example where we can target a subdomain and a user who was previously logged in because there's a cookie set for that. So now we can start testing things in the returning user journey. Are all of the processes that are going on there, are they all right? We can improve the performance of a system there. Okay, cool. We've proven that that is actually faster, and we only had to target 5% of the users that we were looking to improve there. So there's some really powerful things that you can do from a technical rollout perspective.
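The subdomain-plus-cookie rule mentioned here can be expressed with the Python standard library alone. This is a sketch under assumptions: the cookie name `previously_logged_in`, the `sports` subdomain, and the example URL are hypothetical, and a real flag service would evaluate the rule for you.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlsplit


def matches_returning_user_rule(url: str, cookie_header: str,
                                subdomain: str, cookie_name: str) -> bool:
    """True when the request is on the given subdomain and carries the cookie."""
    host = urlsplit(url).hostname or ""
    on_subdomain = host.split(".")[0] == subdomain  # first label of the host name
    cookies = SimpleCookie()
    cookies.load(cookie_header)  # parses a raw Cookie header string
    return on_subdomain and cookie_name in cookies


# Hypothetical request from a returning visitor.
targeted = matches_returning_user_rule(
    "https://sports.example.com/account",
    "previously_logged_in=1; theme=dark",
    subdomain="sports",
    cookie_name="previously_logged_in",
)
```

A rule like this would normally be combined with a percentage, so that only a small share of the matching returning users receives the change first.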

We also found that technical testing was a really interesting thing that we hadn't considered as something that we might be able to do. But what we found was: wouldn't it be great if we could load test our production client-side applications without impacting our downstream systems? Certainly for big sporting events, our client apps have to work at really high levels of load. But if we're not actually going into that fixture just yet, we might want to just test that that app is working fine without impacting everything else. What we then started doing was implementing toggles within our applications that would choose a mocking service rather than the production service. Then we can turn those toggles on for particular use cases. So if a particular header is there, then this is a performance test or a load test, and you can send the traffic through to the mock service rather than the production service. Obviously you can target on different things as well. You could test browsers or platforms or devices or particular networks to see that they all stand up as you expect. Testing things at load on 2G is quite interesting, especially considering the size of the India market.
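A minimal sketch of the header-driven mock toggle follows. The header name `X-Load-Test` and both URLs are invented for the example; the talk only says that a particular header marks the traffic as a load test.

```python
from typing import Dict

PRODUCTION_URL = "https://api.example.com"  # hypothetical downstream service
MOCK_URL = "https://mock.example.com"       # hypothetical mocking service


def choose_backend(headers: Dict[str, str], load_test_toggle_on: bool) -> str:
    """Route load-test traffic to the mock service, everything else to production."""
    normalized = {name.lower(): value for name, value in headers.items()}
    if load_test_toggle_on and normalized.get("x-load-test") == "true":
        return MOCK_URL
    return PRODUCTION_URL
```

Requiring both the toggle and the header means that ordinary users can never reach the mock service by accident, and that turning the toggle off ends the load test immediately.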

So that's the different ways that we can do hypothesis and testing by testing in production. But this now is some of the things that we found along the way, which I wanted to share with you because these are lessons perhaps that we've learned.

First is we've been able to adopt a process on some of our teams of trunk-based development. This is really interesting. Obviously we can still use Git, but rather than having lots and lots of feature branches, what we can do here is make full use of those feature flags in our production environment. The developers can work on the local environment and just push stuff that is contained within a feature toggle. They can push it to the production environment. Perfectly safe. It's not going to break the production environment. They can turn that bit of work on for just themselves, even if it breaks for them. Fine. Maybe it's giving them some error logs that they need to understand why that thing doesn't work in the production environment.

It's a massive time saver, though, because rather than building something locally, putting it to a test environment, having problems with the test environment, then getting it to live and finding another problem because the test environment and the live environment aren't quite the same, well, this is really nice because you can make sure that that feature is being built on the environment that it is going to run on, which is very efficient indeed.
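The idea of shipping unfinished work to production behind a toggle can be sketched as follows. `FlagStore` is a tiny in-memory stand-in for a feature flag service, and the `new-feature` flag and function names are hypothetical.

```python
from typing import Dict, Set


class FlagStore:
    """Tiny in-memory stand-in for a feature flag service (hypothetical)."""

    def __init__(self) -> None:
        self._enabled_for: Dict[str, Set[str]] = {}

    def enable_for(self, flag: str, user_key: str) -> None:
        self._enabled_for.setdefault(flag, set()).add(user_key)

    def is_on(self, flag: str, user_key: str) -> bool:
        # Unknown flags are off, so unfinished code stays dark by default.
        return user_key in self._enabled_for.get(flag, set())


def unfinished_feature() -> str:
    # Work in progress that has already been merged to trunk and deployed.
    raise NotImplementedError("this feature is not finished yet")


def page(user_key: str, flags: FlagStore) -> str:
    if flags.is_on("new-feature", user_key):
        return unfinished_feature()  # only reachable for targeted users
    return "existing page"
```

With the flag enabled only for the developer, the broken path fails for that developer alone, while every other user keeps getting the existing page from the same deployment.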

Next up, what we found was very important was being able to track a device and not just a user's session. Because what happens in that scenario, if you're only tracking a session, is the user comes along once, sees version A of something, then sees version B on their next session, and then as they come back a few times, they flip-flop between different versions of something. Now, that's not a good customer experience, but it might even go a bit further than that in the sense that it might actually skew someone's impression of the experiment itself, and they may not like the new feature purely because of the bad experience in which that feature is being delivered to them. So being able to track a device is very important.

This is certainly true of a logged-out user. Obviously, once the user's logged in, you can use their username and consolidate that data accordingly. But with that element of not necessarily knowing who that user is, you need to find a way in which you can actually, as much as possible, track that device. Certainly that's not possible in every scenario, for sure. But where it is possible, it's very good. Being able to share a cookie or something across multiple systems and apps can be quite useful as well, especially if you're running a larger experiment that maybe touches different applications and is not just defined within one. If you can have that way of keeping track of that same user across the whole lot, so that they always get that same A or B variation, that works really, really well.
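A rough sketch of device-level stickiness: store a generated identifier in a long-lived cookie and derive the variant from that identifier rather than from the session. The cookie name `device_id` and the dictionary standing in for a browser cookie jar are assumptions for illustration.

```python
import hashlib
import uuid
from typing import Dict

DEVICE_COOKIE = "device_id"  # hypothetical cookie name shared by all apps


def get_or_create_device_id(cookies: Dict[str, str]) -> str:
    """Reuse the device identifier if present, otherwise create and store one."""
    if DEVICE_COOKIE not in cookies:
        cookies[DEVICE_COOKIE] = uuid.uuid4().hex
    return cookies[DEVICE_COOKIE]


def variant_for_device(device_id: str, experiment: str) -> str:
    """The same device always maps to the same variant of an experiment."""
    digest = hashlib.sha256(f"{experiment}:{device_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest[:8], 16) % 2 == 0 else "B"


# Two separate sessions on the same device share one cookie jar.
cookie_jar: Dict[str, str] = {}
first_visit = variant_for_device(get_or_create_device_id(cookie_jar), "homepage-test")
second_visit = variant_for_device(get_or_create_device_id(cookie_jar), "homepage-test")
```

Any application that can read the same cookie and applies the same hashing will reach the same answer, which is what keeps an experiment consistent when it spans several apps.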

Now this is a term I've come up with. It's probably wrong, but I think it works for the issue that I'm trying to raise here, which is avoiding scheduling complacency. What I mean by this is when lots of teams start doing testing in production and make use of these feature flags to ensure that they can deliver things safely, reliably, and improve quality, well, what that might mean is that people don't need to work together so well at delivering something around the same time. Because maybe the assumptions are, "Oh, well, that other team, they can turn that on whenever we're done. We don't need to race to it yet, so we'll get to it a bit later." But if that other team are busy building that feature, that feature's now done but can't go live. Then that ends up with this perhaps wastage in terms of the resource that people had available to them, and that features are being done much sooner than they need to be. Which in some ways is great that that can happen, but it can lead to problems of real wastage of time management there. So it's worth bearing in mind that just because you can do it, perhaps you shouldn't be doing that.

Now this is a really cool one. A lot of what I've been discussing has really been feature toggles. They're short-lived things to prove a hypothesis or a test or an experiment, and once that's proven, you can do some work to tidy your code up, and now you've just got that good, better feature in your pipeline, your app, whatever it is. But what if you had a toggle that could live for longer? So we introduced this idea of a debugging mode. Some of our client applications can actually ship with both the minified and unminified version of the JavaScript. Now, all users by default get the minified version of the JavaScript. But it's great that, if needed, we can turn on the unminified version of that JavaScript, maybe for devs or QAs, or even in the instance where a particular user is having an issue, it could be that you could have the call center actually enable debugging for that user. Now suddenly the dev team is getting all of these logs coming in for that one particular user who's having a weird edge case problem. Really powerful to be able to do that, and not really something that was considered when we adopted the process of testing in production. But it's a nice byproduct of it, for sure.
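A long-lived debugging toggle of this kind might boil down to a per-user choice of script bundle, as in the sketch below. The file paths and the set of debug users are hypothetical; in practice the set would come from the flag service and could be updated by the call center.

```python
from typing import Set


def script_url(user_key: str, debug_users: Set[str]) -> str:
    """Serve the unminified bundle only to users with debugging enabled."""
    if user_key in debug_users:
        return "/static/app.js"   # unminified, easier to trace in logs
    return "/static/app.min.js"   # default for everyone else
```

For example, `script_url("customer-42", {"customer-42"})` selects the unminified bundle for that one customer while all other users continue to receive the minified one.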

So, moving away from the technical side a little bit, this is much more around making sure that working with testing in production and hypothesis-driven engineering is effective. What we found early on was that not everyone was in agreement as to what success looked like. What I mean by that is there are a lot of assumptions and expectations from stakeholders, from the tech side of it, from product. What it meant was that as the test was going on, the hypothesis was being proven, or even the conclusion was being developed at the end, people disagreed on what success really looked like for something, or disagreed on what failure looked like for something. If you get to the end of the hypothesis, and that's where the disagreements then are, it's a very ineffective way of having run that test because you might now need to rerun it, which is not ideal for sure.

Leading on from that, there's even the problem of really understanding what the sample is for that hypothesis. When people say, "Oh, we'll just target 50% of our users," what do you mean by that? Is it a particular device type? Is it all devices you want that 50% to run on? Is it particular countries, sub-brands, network types? Who knows what? There's so many things there that, again, people come with their own kind of preconceived ideas as to what they think the sample should be. But unless you're having those discussions to draw that information out, it can be that people just don't air those assumptions. Then at the end, again, the conclusion's in question because the sample isn't what other people thought the sample was going to be, and that's a real problem.
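One way to force those conversations before a test starts is to write the hypothesis, the success measure, and the sample down in a single structure and refuse to start while anything is blank. The sketch below is only an illustration of that idea; the field names are assumptions, not a framework described in the talk.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ExperimentPlan:
    hypothesis: str
    success_metric: str
    success_threshold: str        # what result counts as success, agreed up front
    countries: Tuple[str, ...]
    languages: Tuple[str, ...]
    device_types: Tuple[str, ...]
    traffic_percent: int

    def open_questions(self) -> List[str]:
        """Names of fields that nobody has filled in yet."""
        gaps = [name for name, value in vars(self).items() if value in ("", ())]
        if not 0 < self.traffic_percent <= 100:
            gaps.append("traffic_percent")
        return gaps


plan = ExperimentPlan(
    hypothesis="A single Join Now button increases registrations",
    success_metric="successful registration rate",
    success_threshold="",   # not yet agreed with stakeholders
    countries=("GB",),
    languages=("en",),
    device_types=(),        # nobody has said which devices are in the sample
    traffic_percent=50,
)
```

Here `plan.open_questions()` lists `success_threshold` and `device_types`, which are exactly the unspoken assumptions that would otherwise surface only when the conclusion is being argued over.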

So, wrapping it all up really, wherever possible, I think you should be looking to use hypothesis-driven engineering. It's definitely not possible everywhere, that's for sure. Certainly, we've encountered places where it really doesn't lend itself to being an effective way of working. But all the examples that I've given I hope give you a sense of the power with which you can do stuff in your production environment in a very safe way, gather information, understand where the improvements can be made and should be made, and that yields this idea of a flow of features going out all the time in a much better, higher-quality way that is very similar to lean mass production.

We're moving away from this idea of software being this kind of, well, it feels right, it looks right. Well, no, we're actually using the data, the logs, all of that information that we've got within our software and our applications to really prove that something is better or not. Then put it into the way in which we work, into our applications, into our processes to continually improve the actual end product that we're all building. And so I do think hypothesis-driven engineering is a really, really effective way of working.

Thank you very much for listening. I'll be answering Q&A in the chat, and happy for any feedback and comments. Thank you very much.