Getting Started with Multi-variant Testing

Amsterdam 2023

In this talk, I will discuss how my team reintroduced the importance of user testing to our development process. We will share how we proved the value of A/B and Multivariate testing to our stakeholders by starting small and gradually increasing its reach and implementation. By doing so, we were able to significantly increase our onboarding completion rate. Our approach not only resulted in a better user experience for our customers but also improved our overall product development process. Join us to learn how you too can leverage user testing to improve your product development process and ultimately enhance your customers' experience.

Presented by LaunchDarkly.

Full transcript

The complete talk, organized by section.

Gábor Csomák

All right. Welcome, everyone. Thank you for coming.

I'm Gábor Csomák from Bally's Interactive. My Twitter and Medium handle is @donkeycoder, but you can find me on LinkedIn and the conference Slack channel as well. Any questions or feedback will be welcome.

We had a very nice lunch today, so let me talk about the topic of today, which is, of course, desserts.

We have a great taste for ice cream. Everyone has their own preference. I myself change flavors every day when I go to the ice cream shop. Usually, most companies dedicate a lot of effort to testing their primary income sources: which type of ice cream sells more, which type of ice cream we should change a bit. It's all very standard.

But no one will talk to you about the secondary products, the ones that are not tightly correlated, not tightly measured. That's what I'm going to talk about today.

If you go to an ice cream shop, they will never ask which cone you want. They might offer a choice between a cone and a paper cup, but I'm talking more about these kinds of choices. Someone in the background will choose to order one of these in bulk, and then that's the ice cream cone you will have forever.

If you think about it, will that matter? Will it matter to the business how the secondary products perform? Well, we can argue about that, but it's best if we measure.

Now, if you are an ice cream shop, you need to have extraordinary memory to remember which customer took which one of your cones. But luckily, we work in a different industry, so that can be sorted out quite nicely.

This, by the way, is just the first Google hit. So I assure you, if you go on researching cones online, you will find some stuff.

Like the other week, when I ordered an ice cream, my thumb poked a hole in the cone. So that ice cream-eating experience wasn't the best. But I never went back and told the ice cream shop, "Hey, I broke it. Can you give me another cone?" I was too busy finishing up my two scoops.

Before I talk a bit more about ice cream, I want to talk about who I am and where I come from.

I work for Bally's Interactive. We are an online gambling firm. In reality, we don't sell ice cream, but we sell entertainment.

We have multiple brands across the globe. Those of you in the UK will be familiar with Jackpotjoy, Virgin Casino, Virgin Games, Monopoly Casino. In Spain we have Botemania. In the US we have various Bally and Virgin brands, and we have brands in Asia and Sweden as well.

Getting a user is very hard for us. These are all the funnels we need to go through. Some of them are regulatory requirements. Some of them are the KYC and other laws taking effect. Some are our promotions, where we try to get to know the player a bit better during the process.

All of this means it takes a huge amount of time for a user to register. Every second of the registration journey, we can lose potential customers.

We are very proud that we are serving entertainment with care. We are a heavily regulated company in a heavily regulated industry. We make sure that our players are enjoying themselves rather than wagering more than they should. We have all these steps to help us with that.

If we put it into some kind of analytics, I just made a very high-level proxy counter to help you. If 100 potential customers load the registration forms, by the time they are considered a fully functioning player, the amount of the funneling will result in 30 players after 400.

Every step of the way, like every text input, we can see that 65% will fill in the email form, then only 64 the first name, and only 63 the second name. It's very challenging in our industry.
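To make the funnel arithmetic concrete, here is a minimal sketch that uses only the per-100 figures quoted above. The function names are illustrative, and the jump from 63 to 30 stands in for all the later steps that the talk does not enumerate.

```python
# Counts per 100 visitors, taken from the figures quoted in the talk.
funnel = [
    ("loaded registration form", 100),
    ("filled in email", 65),
    ("filled in first name", 64),
    ("filled in second name", 63),
    ("fully functioning player", 30),  # all remaining steps collapsed into one
]


def step_retention(steps):
    """Return (step name, share of the previous step that continued)."""
    return [
        (name, count / prev_count)
        for (_, prev_count), (name, count) in zip(steps, steps[1:])
    ]


def overall_conversion(steps):
    """Share of the first step that reaches the last step."""
    return steps[-1][1] / steps[0][1]


for name, share in step_retention(funnel):
    print(f"{name}: {share:.1%} of the previous step")
print(f"overall: {overall_conversion(funnel):.0%}")
```

Looking at each step relative to the previous one shows where the loss is concentrated: the first field loses far more visitors than the next two text inputs do.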

Here is what our registration form looks like. I was having this conversation with my PO. We inherited the registration form in a way. The PO was new to it, and the tech team was half new to it, and we were discussing how we could improve it. The PO just said, "Oh, that banner, that banner on the left-hand side, that is increasing the funneling so much. That banner is adding so much value."

Do we have any POs in the room? I hope so. Okay, one. All right, very good. No offense to the rest.

Who thinks the PO is right that the banner has a meaningful effect, a positive effect according to the PO? No? Okay, not many.

Who thinks it has a negative effect? Okay. Who thinks it doesn't matter? Okay. Yeah, we aren't unified, is the point. So we need to have some decisions.

It would be very good to measure it, but measuring is hard. It's so painful. If you know the coastline paradox, then you know what I'm talking about.

Don't get me wrong. We have a full machine learning team or department, but they are busy with getting information about the ice creams, not the cones. If I went to them and said, "Hey, let's decide on this small problem," they would say, "Okay, maybe in three months if there's no higher priority," and then a higher priority will always come. I mean, this is an enterprise summit, so you guys should know this.

Measuring is very hard, and the more closely you measure, the longer the number gets. So how can we improve measuring?

One of my favorite metrics ever since I first heard of it is the bet-cost matrix, which, again, we are a betting company, so it hits home. We need to know how much it costs to implement the epic or story. We have many ways to estimate, depending on the team. But how confident are you in that estimation? This is what this chart is about.

Would you bet a beer that you're confident in that estimation? Would you bet your holiday on it? Maybe your car, a monthly salary? Or would you bet your house that you will deliver in time? Now this gets a bit tricky, right?

The same goes for product. Would you bet that that story will bring the described, the prescribed, effectively the sold-to-the-business value? In this case, I don't know, we have one PO in the room, so I won't single anyone out. But they will say, "Oh, it'll bring the company millions." If you ask, "Okay, would you bet your holiday on that?" then there will be a bit more questioning, and conservatism will hit the room.

As in every normal company, we hit the December period with everyone on holiday. So I, as a tech lead, could work on my Christmas backlog.

The Christmas backlog is great because we have time to clean up things. We have time to optimize things that were lost in the big rush of deliver, deliver, deliver. We are enterprise, as you saw in the morning slides: we don't really release during Christmas. Like on the long chart from HSBC, you saw every January there was a huge dip. I think it was at least 20% in releases.

For the Christmas backlog, I figured, okay, we have a new tool which we signed, LaunchDarkly, that has feature flagging. It also has experimentation features. I want to play with all these things and maybe get some business value out of it. Win-win for everyone. So I started experimenting.

Because I know you love code, here's some code. This is how the config for that banner is set up. Every site we have reads something like this. We have lots of images, chosen based on some rules from a JSON file.

I was thinking, okay, how will I use LaunchDarkly to do this? LaunchDarkly supports boolean flags, strings, and full JSON objects. So I figured, okay, let's push it to the limits. Let's put in the full JSON object and get on with it.

I anticipated this to be a very quick proof of concept, and then I hit the first hurdle. The onboarding application, the registration, has a BFF (backend for frontend). The BFF actually decides which image is served and passes it to the frontend along with some other aggregated results and config. That was an early optimization, but it makes things harder in this case: if I want to tie the flag evaluation to what happened on the frontend, then I either need to wire it through, or I can go with the technical shortcut and hack everything together on the frontend, which, because it's a proof of concept, of course we did.

The point is start small, start compact, as we saw together today in earlier presentations.

The first section: we create a LaunchDarkly user. We can fill it in with some custom parameters. This is a registration journey, so member ID will not exist. Later on, I just replaced this with a not-random number.

Then we have some custom parameters like which US state, which regulatory jurisdiction they are in, which venture, which environment. Then I can later use these to configure more advanced flags.

The second part is the flag evaluation itself. It gets the string ID of the flag. We just name it after our Jira ID to be traceable, and then it'll get the expected default value. In this case I just left an empty string, which will load no image.

The third part: once the registration is complete, we just dispatch a track event to the client, and it all works nicely.
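The slide code is not part of the transcript, so the following is only a rough sketch of the three parts described above: build a user, evaluate a flag with a default, and track a conversion event. `FlagUser` and `InMemoryFlagClient` are hypothetical stand-ins written for this example, not LaunchDarkly's SDK, and the flag key, event key, and attribute values are made up.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FlagUser:
    """Hypothetical user record: a key plus custom attributes for targeting."""
    key: str
    custom: dict[str, Any] = field(default_factory=dict)


class InMemoryFlagClient:
    """Stand-in for a feature-flag client; NOT LaunchDarkly's real API."""

    def __init__(self, flags: dict[str, Any]):
        self._flags = flags
        self.events: list[tuple[str, str]] = []

    def variation(self, flag_key: str, user: FlagUser, default: Any) -> Any:
        # Fall back to the caller's default when the flag is unknown.
        return self._flags.get(flag_key, default)

    def track(self, event_key: str, user: FlagUser) -> None:
        # Record the conversion event so an experiment could count it.
        self.events.append((event_key, user.key))


# 1. Build the user. No member ID exists during registration, so a
#    placeholder key is used here (an assumption for this sketch).
user = FlagUser(
    key="registration-visitor-1",
    custom={"venture": "jackpotjoy", "jurisdiction": "UK", "environment": "prod"},
)

client = InMemoryFlagClient(flags={"PROJ-123": "trial-banner.png"})

# 2. Evaluate the flag, keyed by a (hypothetical) Jira ID; "" loads no image.
banner = client.variation("PROJ-123", user, default="")

# 3. Once registration completes, send the conversion event.
client.track("registration-complete", user)
```

The empty-string default matters: if the flag cannot be evaluated, the form falls back to showing no banner rather than failing.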

This is how it all looked in the UI. You can see in the bottom half of the screen that we introduced a targeting rule, so it only targets one of our ventures. If the venture is not Jackpotjoy, then it'll serve the empty variation of the banner, where there's no banner. For everything else, we set it up to serve the other two variations.

I went with one image literally being a transparent image, and the second image a nice little trial image. I needed to get proper designs for that, because if I did the Photoshop work myself, it probably would not be a valid experiment.

I did some work together with the designers. Then when you create an experiment in LaunchDarkly, it'll ask you to provide a hypothesis. It'll ask you to provide your randomization unit, to name it, et cetera.

Getting this live took a few weeks. Again, the enterprise release cycle and everything else is there in our company. Even if this was a proof of concept, to get live we need to go through all the hoops.
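The targeting setup described above can be sketched as a rule plus a stable split. This is an illustration only: the hashing scheme is an assumption made for this example and is not how LaunchDarkly assigns variations, and the file names and flag key are hypothetical.

```python
import hashlib

NO_BANNER = ""                   # empty variation: no image is loaded
TRANSPARENT = "transparent.png"  # hypothetical file names
TRIAL = "trial-banner.png"


def bucket(flag_key: str, user_key: str) -> int:
    """Map a user to a stable bucket in 0..99 by hashing flag and user key."""
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode()).hexdigest()
    return int(digest, 16) % 100


def banner_for(flag_key: str, user_key: str, venture: str) -> str:
    # Targeting rule: ventures outside the experiment always get no banner.
    if venture != "jackpotjoy":
        return NO_BANNER
    # Inside the experiment, split users 50/50 between the two images.
    return TRANSPARENT if bucket(flag_key, user_key) < 50 else TRIAL
```

Hashing the user key rather than drawing a random number on every page load means the same visitor keeps seeing the same variation, which is what makes the later conversion event attributable to one variation.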

But then when we started the experiment, we saw these graphs. What does LaunchDarkly try to tell us? The width of the bell curves is the credible interval. The wider these humps are, the less confidence we have, because only a few users register every, I don't know, minute, second, day, depending on what you count as small.

We didn't have enough data, that's clear. But then we still saw, okay, the average is leaning towards the banner image, and we have a nice little probability, credible interval. Now with 90% confidence we can say that with the banner we will have this many registrations in an interval, and without it, that many.

We kept on refreshing. This was very new to all of us. This whole UI made it very engaging to the SEO guys who helped us deliver this and to ourselves. Every day, whenever we had some free minutes, we just refreshed the page: "Okay, and now? Now we are getting something."

Then we come back the next day. What we saw the next morning was, oh, the empty variation took the lead. Okay, let's start theorizing: why is the empty one better? I thought we said it's a clear benefit. You can see the humps are narrowing, so confidence is growing with time as we have more and more users.
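To show why the humps are wide with little data, here is a small sketch of a 90% credible interval for a conversion rate. It assumes a Beta posterior with a uniform prior, which is a common textbook choice; the talk does not say what model LaunchDarkly uses, and the visitor counts below are made up for illustration.

```python
import random


def credible_interval(conversions, visitors, level=0.90, draws=20_000, seed=0):
    """Approximate a central credible interval for a conversion rate.

    Uses a Beta(1 + conversions, 1 + non-conversions) posterior, i.e. a
    uniform prior, and reads the interval off sorted Monte Carlo draws.
    """
    rng = random.Random(seed)
    alpha = 1 + conversions
    beta = 1 + visitors - conversions
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1 - level) / 2
    low = samples[int(tail * draws)]
    high = samples[int((1 - tail) * draws) - 1]
    return low, high


# Same observed rate (24%), very different amounts of data.
small = credible_interval(conversions=24, visitors=100)
large = credible_interval(conversions=2_400, visitors=10_000)
print(f"100 visitors:    {small[0]:.3f} to {small[1]:.3f}")
print(f"10,000 visitors: {large[0]:.3f} to {large[1]:.3f}")
```

Both intervals are centered near the same rate, but the second is much narrower, which is the narrowing of the humps that shows up as more users register.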

The next day it was head to head, a really interesting battle. Again, we are a betting company. We might be placing bets with each other just for fun. Then we went on with our lives for the weekend. No one looked at the charts, I assume.

The next morning, when we came back, we saw a clear winner. LaunchDarkly even says, with 97%, that the banner with the image is the right one. The important thing: the credible intervals are still there, so we still cannot be 100% accurate about the future. But there's a one and a half percent gain over the empty image, which is not a lot.

But consider that the whole funnel in this period took about 24 to 25% of the users from registration to being fully accomplished players, so that gain is like a five, six percent improvement compared to the base value. So it was a massive improvement. If you think of a company our size bringing just a few percent more players to the tables on a year-on-year basis, this is huge.
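The step from a one and a half point gain to a five or six percent improvement is a relative-lift calculation. A minimal sketch, taking roughly 24% as the base rate quoted above (the function name is illustrative):

```python
def relative_lift(baseline_rate, absolute_gain):
    """Improvement expressed as a share of the baseline rate."""
    return absolute_gain / baseline_rate


# Figures as quoted in the talk: a base completion rate of roughly 24%
# and a gain of 1.5 percentage points for the banner variation.
lift = relative_lift(baseline_rate=0.24, absolute_gain=0.015)
print(f"relative lift: {lift:.2%}")
```

With a 25% base rate the same gain works out to 6%, so the "five, six percent" range in the talk is consistent with either reading.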

Even though the experiment was serving the better version to only half of the people, while the other half still saw the no-banner variation, the media team already reached out to us to say, "Oh yeah, yeah, we saw an increase in the immediate trades. What did you do? Did you change anything?" We were like, yes, we did.

It was quite a success story. The marketing team was very happy, and everyone was very happy that we proved something positive. Of course, we were lucky, because the hypothesis could really have gone any way. Like in this room, we had kind of a third split between no improvement versus better. So this ended up being 200% positive.

It was a smart bet to do a proof of concept around this, but even if it hadn't been a success, we could say, okay, now we can stop worrying about it and maybe simplify the code, clean up code, or just get on and try to pursue other things. So it definitely brought value to us.

Now we know which cone we need to serve the customers, if I go back to the original metaphor. That's a very good feeling.

What I want you to take away is to start with baby steps, start compact. As we heard earlier, there are always going to be hurdles. If you try to push, "Oh, we need to have a full experimentation suite," it'll be tough. You can do most of the things without any tooling. Just have a fetch request to some random backend, of course with the security and everything, but you need to start somewhere.

We have many ideas for future improvements: how do we convince the players to pass different parts of the onboarding journey better? We also did another experiment. Here you can see the whole experiment is far to the left side, which means only one in 200, one in 150 people use variation A or B. The rest ignored it. That means maybe we should just scrap that whole feature altogether.

But again, no success is also valuable data. Then if we go back to the original screen, we can debate what we want to put here. Do we want to maybe add the logo to increase the authenticity, the trustworthiness of the site, or anything else?

There's so much room to improve any form, even if it's there for 10 minutes or years. There's always room to improve. If you have the right product people, they will have all the ideas. So you just need to find out which ones are worth measuring.

We also want to implement a multi-armed bandit algorithm. Normally we have two releases: one at the start of the experiment to say, "Okay, these are the three options," then another at the end to say, "Okay, now only serve option A," and maybe a third release to clean up code.

Instead of having that, with the multi-armed bandit algorithm, we can say, okay, here's my config, at this frequency. Do the improvement, and every iteration it'll serve more of the winning option, but it'll still continuously try the variants least likely to win, to see: is it still the best? Was there any other side effect?

Maybe there was the Super Bowl, or different players were targeted. Maybe even different weather would affect the ice cream. If there was a marketing reactivation campaign, registrations will have all sorts of random inputs that would make it hard to measure anyway. With the multi-armed bandit algorithm, we can improve this massively. It simplifies config and we get to benefit from it earlier.

As I said, with only 50% being served the right variant, we already saw value outside of the teams. Now with this many iterations, we will see value earlier as well. That's a very important next step.
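The talk does not name a specific bandit algorithm, so the sketch below picks Thompson sampling as one common option that behaves as described: it serves the likely winner more and more often while still occasionally trying the other variants. The conversion rates are hypothetical and exist only to simulate users.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Thompson sampling over variants with yes/no conversions.

    true_rates are hypothetical conversion rates used only to simulate
    users. Returns how many times each variant was served.
    """
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    served = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw a plausible rate for each variant from its Beta posterior.
        draws = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
        arm = draws.index(max(draws))
        served[arm] += 1
        # Simulate whether this user completed registration.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return served


# Three variants with made-up rates; the second one is the best.
served = thompson_sampling(true_rates=[0.10, 0.30, 0.20], rounds=5_000)
print(served)
```

Because each variant's share of traffic follows the evidence, no separate release is needed to switch everyone to the winner, and a variant that starts performing better later (for example during a campaign) can regain traffic.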

We demoed this internally and, sorry again, POs and product managers came knocking on my door: "Hey, we want in. We want to do something similar. How do we do it?" Just this small experiment, just this small demo, helped us share the knowledge that we are doing this, this is possible, try small, and just a few steps triggered the whole movement to product being more data-oriented again.

Just start small. For a big company, it'll just mean extra profit. For a small company, it might mean life, because we are almost always wrong about our users.

Thank you very much. I was Gábor, and if you have any questions, we still have three minutes.