Surviving the Grand National


London 2016


Head of Operations · Sky Betting & Gaming

As one of the UK's largest online betting websites, we see the Grand National break records by every metric, every year. We have a complex technical platform, involving in-house and third-party software, serving a highly dynamic website on the busiest day of the sports betting year.

I'll tell the story "from the trenches" of how we planned for and ran the day, and what lessons we learned to make next year's even better. I'll cover the technical details (keeping our website running when normal customer traffic would look to most companies like a DDoS) and how we organised our responsibilities on the day itself, and I'll go through the various challenges we faced and how we overcame them to have our most successful Grand National yet.


Full transcript


Kevin Bowman

OK, so I'm going to talk about Grand National from the point of view of a betting company.

We are Sky Betting & Gaming. We are one of the UK's largest online betting companies. Depending on what metric you use, by some metrics, we are the largest; by others, we're second or third place.

Grand National is a unique day. It happens once a year. It's usually a weekend in April, and it's unique by many different metrics.

It's unique in the horse racing world. It's the largest horse race, actually, in the European horse racing calendar, not just the UK's. It's one of the biggest, one of the longest races. It's one of the ones with the most horses in it. And importantly, to excite customers, it's incredibly unpredictable. It's really difficult to know who's going to win the race.

In fact, the favorite horse has only won nine times in the last 70 years. So even us bookmakers, we're not very good at predicting who's going to win the horse race. Customers get really excited about it.

I guess from a UK betting point of view, it's the one event which, even if you don't bet in the UK, you probably still bet on the Grand National, even if it's just a sweepstake in your office. If you actually bet online, it's a bizarrely unique event.

It also gives us the highest traffic of the year. To give you an idea what UK betting website traffic looks like, this is a normal weekday. This is in bets. I guess for a bookmaker site, a bet is the equivalent of an e-commerce site's checkout.

We have a bet slip, which is like the shopping cart, and then when you check that out, when you place the bet, that's like your shopping cart checking out.

So a normal weekday looks a bit like this. A couple of exciting things to note here: we've got a fairly organic curve of traffic, but then at about 1:00 till about 9:00 PM, we get this crazy scribbling over the top of this graph.

This is because virtually every day in the UK, there are horse races going on. There are usually two or three horse racing meetings around the country. Every 10 to 20 minutes, there's a horse race actually starting in the UK. We see the traffic straight before a horse race suddenly climb as people are betting on it, and then when the horse race actually starts, there's a great big plummet because the horse race has started. You can't bet on that race anymore. And then when that race finishes, people come straight back in and bet on the next race, or take their money out, or do whatever. So that's a normal weekday.

We cope with this kind of excitement every day, but if we scale that down to just the bottom of this graph, then a Saturday is even more dominated by football betting. Football traffic on top of that.

In the UK, most football matches on a Saturday start at 3:00. What we can see on that graph is a fairly sharp rise from 1:00 to 2:00, and then at 3:00, all the football matches start simultaneously. There's a great big plummet in traffic, and then during the football match, there's lower levels. But then straight after the football match, customers come crashing back to the site, find out if they've won. If they have, then great.

On this particular day, I think quite a lot of customers won. There is a single late match in the UK on a Saturday, so we had quite a lot of traffic and then another dip for an hour and a half while the late match goes off.

So this is Saturdays. I guess a lot of websites would be shocked and horrified to see this kind of rise up to an event and then a sharp plummet in traffic afterwards. It would probably set off some kind of alarms or something, but this is what we cope with on normal Saturdays.

The Grand National is like a combination of both of those, but multiplied up again. Grand National is on a Saturday, and it's a high-profile horse race, and it looks like this. We've still got the weekday line faint in red on the bottom, but many, many times more than that is Grand National traffic. For reference, the Y-axis on this is bets per minute, to give you an idea of the traffic levels.

You can see on here, we still have that football shape. We still have the rough doubling of traffic, the rise up to 3:00, the drop again afterwards. But then the Grand National itself this year started around 5:00-ish. You can see in the last half hour towards the start of the horse race, it rises crazily. The sharpness of that rise is really quite exciting, I think is the word. And then when the Grand National itself goes off, suddenly no traffic at all on our site. Great, customers are actually watching the Grand National.

But the problem with that is that throughout the whole day, when customers are placing bets on the Grand National, that's fine. That's several hours' worth of bet placement. But then when the Grand National finishes, they all want to know if their bet won. We have this thundering herd effect straight after the race.

Now, this is bets, so this doesn't actually show you website traffic, but what you can see just after the race is a progressive ramp-up over about 20 minutes or so as we try and control that thundering herd of customers coming back into our site to preserve some semblance of service on our site. And then the day goes back to normal later on, and all being well, we go to the pub.
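The talk doesn't say how that ramp-up was implemented. As a minimal sketch of one way a progressive ramp could work, the Python below admits a growing fraction of returning customers over 20 minutes. The function names, the 20% starting fraction, and the idea of a stable per-customer bucket are all assumptions made for illustration, not a description of Sky Betting & Gaming's system.

```python
def admitted_fraction(seconds_since_race_end: float,
                      ramp_seconds: float = 20 * 60,
                      start_fraction: float = 0.2) -> float:
    """Fraction of returning customers let through, rising linearly to 1.0.

    ramp_seconds and start_fraction are illustrative values only.
    """
    if seconds_since_race_end <= 0:
        return start_fraction
    if seconds_since_race_end >= ramp_seconds:
        return 1.0
    progress = seconds_since_race_end / ramp_seconds
    return start_fraction + (1.0 - start_fraction) * progress


def should_admit(customer_bucket: float, seconds_since_race_end: float) -> bool:
    """Decide whether to serve a customer now or show a holding page.

    customer_bucket is a stable value in [0, 1), for example derived from a
    hash of the session ID (hypothetical), so a customer who has been
    admitted stays admitted as the fraction grows.
    """
    return customer_bucket < admitted_fraction(seconds_since_race_end)
```

Because the bucket is stable, nobody who has been let in gets pushed back out as the ramp progresses.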

Sky Betting & Gaming, it's probably useful to know how we're organized a little bit and where my place in that organization sits. We take inspiration from the Spotify model. We don't directly copy the Spotify model, but we're inspired by it.

We internally organize ourselves into a series of tribes. We use the same kind of language. In the top left, we have the betting tribe. In the top right, we have the gaming tribe, because we're Sky Betting & Gaming and we're not that imaginative.

What we also have in the bottom left is a series of shared service tribes, or tribes which cross-cut, providing services for the other ones. Unfortunately, they're not customer-facing, so they don't get shiny logos. But we love them all the same.

I work in the bet tribe. The bet tribe itself is probably the biggest of our tribes, and it's only betting and gaming that have their own P&L responsibilities. We try and make sure that everything needed to deliver for our customers is inside the bet tribe.

Although we have a top-level CTO in the company, my responsibilities are actually towards that bet tribe P&L. Everything inside of that, we need technology, product, and marketing. This is really crucial. We're not just a technology tribe. We try and do everything in the tribe. We make sure that everything that's needed to deliver for our customer is in the tribe.

However, in the technology world, we're loosely split into three-ish areas. These aren't necessarily the areas which define day-to-day work, however. Again, inspired by the Spotify model, we cross-cut our day-to-day work into squads, which cross-cut through all of those skill areas, I guess.

The verticals, we use architecture, engineering, and operations to try and promote the craft and to try and make sure that people's careers and their abilities are developed. But the day-to-day work is very much in those squads. And those squads own technical components. They own products, they own components. They deliver changes to those products, but importantly, they also own the future operational development as well as the product development of those.

We're trying to promote that even more. At one time, it would've just been the operations people in those squads who were on call and who were responsible for this 24/7. Whereas now we're trying to move that much more so that each squad has its own on-call rota, and very much, if they break it, they get called out. They need to fix it.

My world in that is operations, and I run these teams. Even within the operations world, we're still trying to promote various best practices, whether it's traditional Ops, site reliability, platform engineering. A more recent introduction to our family has been service management.

We've always done service management in our company, but more recently we've brought it inside the tribe. We're super happy to have them in our family. I'm excited to try and find out how we can best use service management inside our P&L tribe to make our customers' lives better, which is ultimately what we're trying to do.

Sky Betting & Gaming itself has gone through quite a rapid series of growth. Over the past few years, our growth has probably overtaken that of our competitors quite considerably.

We started off quite small, but to the point that just over a year ago, Sky, who used to own us 100%, sold a majority stake in us. We're now much more of an autonomous company. We were always quite autonomous, but now we're definitely not trying to be beholden to the way that Sky runs their other business units.

Now we're mostly owned by CVC, but this year's Grand National was the first year that CVC owned us. With new owners, it's a really high-profile year. And every year, the traffic levels we get in Grand National grow and grow and grow. We can't necessarily just copy what we did last year. We need to reevaluate what we're doing this year with new owners who are looking at us very closely.

To give you an idea of the architecture, a very summarized view. The blue part to the right is roughly how our bet works, how the bet side of our product works. The green bit is account and single sign-on.

It's important to note that although that was a bets-per-minute graph earlier, to bet you have to log in. So our friends in account and single sign-on also take quite a beating during Grand National day.

But both of us are backed off to a common system, this system in red at the bottom, which is provided by a third party. Now, although the third party develops this, they then give us an RPM file. They give us the software, and it's entirely up to us to manage and scale this software. It runs entirely in our data centers. It's our ops teams who are looking after it, and success for our customers depends on how well we scale that.

Because if that goes out, then our systems go out and our customers can't use our site. So we're very interested in how that system scales.

Notable in this, though, is that the third-party system all backs off to a single monolithic database, that IBM Informix database at the bottom. Informix is a great database, but the way this system works is that's a single system. So scaling it for Grand National day, more exciting. Woo. And we'll come back to that Informix.

The way we approach these things is just after Christmas, with about three months to go, I guess, we start a cycle of load testing. The way we did things last year, like I say, isn't necessarily relevant to this year, because our traffic levels will have risen again. So we need to relearn about our systems. We need to target what we think is bad about our systems. We need to load test it, we need to expose those problems, and then we need to make a plan to fix it, implement the fix, load test again to validate, move on to the next area.

We do this quite a bit. We were doing this probably once or twice a week in the first quarter of the year.

Unfortunately, load tests don't work. Load tests are great for highlighting problems in specific areas, but when you get managers coming to you saying, "Please tell me exactly how many bets per second we can do, how many logins per second," they won't tell you that.

You can't just load test to a number and say, "Yep, we can do that," because load tests will never replicate real customer traffic. They'll never replicate the background load going on. We can load test focusing on Grand National bet placement, but there's still horse racing, tennis, whatever else going on in the background. So load tests are a useful tool, but they're not the be-all and end-all of it.

Really importantly as well, load testing only works if you mix a little bit of expertise in with it. You've got to focus where you're load testing. You've got to use the expertise of your team. The only people who really know what the potential problems in your systems are are the people who own the systems. So we used quite a lot of load testing to target where to improve.

The kind of things we found whilst in that load testing period: the front end of our website is all PHP. It's all Apache PHP, and we don't really do any caching in front of that. Every request for an HTML page goes straight through to a PHP process.

We were hosting our PHP code on a big NFS server. Turns out even if you tell PHP to always cache its bytecode and never go back to its backend file system, sometimes it still does. So the NFS performance under predicted Grand National load levels was going to be a big problem.

We also found some other issues. The memory leak was an interesting one because load testing itself didn't highlight it, but this was much longer-term soak testing. We said, "OK, actually Grand National, it's a day. It goes off at 5:00, so there's a good 10 hours' worth of high-traffic betting going on, so we need to soak test that." And great, because we discovered a memory leak in one of our APIs.
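A soak test finds a leak like this by showing memory climbing steadily over hours, which a short load test can't reveal. As a rough sketch, assuming you have periodic memory readings for the process under test, a least-squares slope turns that into a number. The function names and the 5 MB/hour threshold are illustrative assumptions, not something from the talk.

```python
def memory_growth_per_hour(samples):
    """Least-squares slope of memory use over time, in MB per hour.

    samples is a list of (hours_elapsed, resident_megabytes) readings
    taken during the soak test.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one point in time")
    return covariance / variance


def looks_like_leak(samples, threshold_mb_per_hour=5.0):
    """Flag sustained growth above an (assumed) acceptable threshold."""
    return memory_growth_per_hour(samples) > threshold_mb_per_hour
```

Over a 10-hour soak, even a slow leak adds up to a slope that stands out from normal noise.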

And a little bit of learning about how we use VMware quite a lot, how we mapped our virtual CPUs to physical CPUs. And worst of all, our load injectors themselves had performance problems. We had to scale up our load injectors to really test this kind of traffic.

However, load tests will only tell you the things which you're expecting to go wrong. If you target your load testing, which is really the only way you can do it, there's all sorts of unexpected things which you're going to miss.

We love unexpected things. There wouldn't be a betting industry if the world was entirely predictable. However, we still need to cope with these.

A couple of unexpected events which impact the industry in general: Leicester City winning the Premier League was expensive, I think is the word. Horse race canceled. Competitor websites failing is something which you can never predict, but if one of our bigger competitors goes down, customers still want to place bets, so we tend to get a flood of traffic coming across to us. Which is great for us, but we can never predict when it's going to happen or how big that flood's going to be.

There are some unexpected failures which can more directly impact the tech stack. Generally, third-party systems which work fine in testing, but then under heavy load can fail. Particularly on Grand National day, banks aren't necessarily directly related to the betting industry, so they might not be expecting the flood of traffic from all betting companies simultaneously. Payment gateways as well.

Even things closer to your data center, network transit, storage backends, all sorts of things can fail. A good amount of pulling-plugs-out testing is great.

However, worse than these things failing is them slowing down, and this is the death knell of our technology stack. We need to do a lot of testing around this.

It would be great if we sent a payment request to a payment gateway, and the payment gateway immediately said to us, "Yes, that worked," or it immediately said to us, "Nope, we're under load. That didn't work."

The worst thing we can possibly get from a payment gateway is when we send it a request for a deposit or withdrawal, and the payment gateway hangs for a while before saying, "Oh. Ooh. No, that didn't work." It's horrendous. It ties up your own threads, it ties up your resources, and you can quickly lose a site based on that.

So we did a bit of work on this. We put a lot of circuit breakers, some good timeouts, some really good monitoring around how all these systems work.
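To illustrate the idea, here is a minimal circuit breaker sketch in Python. It is not the implementation used on the day; the class name, thresholds, and injected clock are assumptions for the example. It also assumes the wrapped call enforces its own timeout (for instance a socket or HTTP client timeout) and raises when the dependency is too slow.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call an unhealthy dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                # Fail fast instead of tying up a thread on a slow dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-off has elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

When the circuit is open, callers get an immediate error they can handle, which is much closer to the "nope, we're under load" answer described above than a slow failure.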

To add additional excitement this year, the organizers of the Grand National decided to change the start time. In previous years, we've gone with the top timeline here. The graphs we were looking at earlier, where football traffic builds up to the start of the race, sorry, to the start of the match, but is a lot quieter during the match itself.

In previous years, we've managed to neatly nest the Grand National itself and then settlement of all the bets for Grand National inside that quiet period. And then by the time the football matches finish, we can concentrate all of our database resources on settling those football bets.

This year, helpfully, the organizers decided to move the Grand National later, I guess for maximum conversion of people watching the football matches, then get them watching Grand National. Unfortunately, this meant that our really busy bet placement time overlapped with our busy football settlement time.

Both of these boxes in red put a lot of load onto our Informix database and behind all of our systems. So as much as the things which we develop, we can shard them and componentize them and scale them, we're still relying on this single Informix database, and we had no clue what this effect was going to be.

This was the first year that happened, and we had no historical data. We had no modeling. We had no way to know how customers would engage with this. So we came to know that as the danger zone.

On the day itself, we did a couple of things to try and make the day work. Crucially, I think two things were quite important.

First of all, we had a really good plan, which we quickly abandoned about 10 minutes into the day, but having a plan is still a really important thing. It let us measure how the day progressed. It let us tick things off as they happened, and if something unexpected in the graphs happened, it let us reference back to the plan to say, "Oh yeah, that's fine. It's because this horse race has just started. That's fine."

Not nearly as important, though, as having a really good team, and this is going to be a theme moving on. Really good teams, I can't express how important this is. Second only to letting them do their job properly. There's no point having a good team in place if you don't let them do their job.

We chose to organize ourselves slightly differently this year for Grand National. We effectively had a single coordinator who was broadly overseeing how all of our technical systems were working. And then we had small localized teams, each with a point of contact, who had very specific roles to look after various different parts of our architecture.

We physically co-located all these people. Yep, photo of the office. There's me on the far left. I was the coordinator. This is part of our office, but we had a couple of people looking after some very specific parts.

For example, we had a team who were purely responsible for looking at our web servers, and if they saw problems with our web servers, then their job was to flag up the fact that they had problems with our web servers, hopefully make some kind of plan to fix it, and then we would have a very quick discussion, and they would then just go away and fix that, was the plan.

In fact, in a couple of areas, this did happen, and it worked really well. We had teams looking after things as specific as databases, even the network, checking our network switches were still performing well. And that was great, because if we generally had people looking after the whole estate, then we could easily miss the fact that one of our network switches was getting overloaded.

Really important as well is that I had some separate assistants. We had a couple of people who were specifically not in teams, and they could just take notes about what was going on. They made sure that everything was recorded. They could take over from me if I needed lunch, for example.

The team organization went pretty well. However, the whole day didn't go pretty well. From our customers' point of view, it did. It was great. But there were a couple of things which very nearly went wrong, but our teams were brilliant and noticed them, got out in front of them.

A couple of examples. We have lots of risk management systems. We're a betting company. We're in the business of risk. We manage risk in lots of different ways.

This particular one is to do with giving some of our traders and our risk managers a filtered list of what bets are going through our system at the moment, whether those are risky bets, whether those are bets which indicate we might want to look into the pricing of some of our markets.

We redeveloped this system recently. The new system is called Orwell because it sees everything which is going on, Big Brother style. The old system which it replaced was unfortunately not cool enough to have a name, but we'll call it the old system.

Our traders and risk managers were using Orwell, and that was great. It was performing its job. Unfortunately, in the run-up to the 3:00 football matches, traffic got to a level where Orwell hiccupped, and they suddenly didn't have a complete view, a kind of up-to-date view, of all the bets going through the system. Which is fine. We have a fallback plan for this.

What happens is all those people move across and use the old system. Normally, that works fine. Unfortunately, on Grand National day, the old system couldn't cope with that traffic. I think crumbled is probably... It was heading towards crumbling.

The problem with that is that the old system, if it gets particularly bad, is in line with bet placement. If the old system starts to crumble, then the way our customers place bets starts to be affected.

However, we had a team. We had a small team who were dedicated to watching how our risk management systems were working. They noticed Orwell hiccup. They noticed the risk managers move across to the other system. So they realized this was going to be a problem.

They very quickly made a plan, and their plan bluntly involved making the Orwell side of things bigger. But the great thing about having these autonomous teams was that by the time they flagged this problem to me as the coordinator, they had already said, "We want to make Orwell bigger. We think the way we want to do it is to talk to our infrastructure people, get some more servers, get some more CPUs, but that'll mean we need to tweak the firewall. We need to reconfigure our load balancers to put these systems in, and then we need to move the traders back to that system."

So it was a two-minute conversation. I knew the team knew what they were doing. We had a spare ops person who had a good relationship with the infrastructure team as well. Between a small autonomous team, they went away, and they just fixed the problem and then moved the risk managers back to the new, bigger Orwell.

That was not a problem at all. Our customers' behavior was protected. They never noticed this was nearly a problem.

The other problem we had on the day, or nearly had on the day, was with Redis. We use Redis quite a lot. We love Redis. It's a fantastic piece of software if used in the right kind of way. We use it for session storage and bet slip storage, for the kind of short-term data that, if the server crashed, we could afford to lose, because losing it is an inconvenience as opposed to a complete failure.

Unfortunately, on Grand National day, we had so much traffic that the memory which Redis was using was trending somewhat alarmingly towards its configured limit. And when that happens... This was the session storage one. When that happens, we basically start logging customers out in the middle of their browsing session, which is really inconvenient for customers. And the more that happens, the more customers get logged out.

Again, we had a great team in place. There was a team entirely looking at how our session management was working. They saw that we were trending towards this faster than our predictions were, faster than our testing had said, and they discovered the best way to fix this was to say to Redis, "Hot reconfigure your memory limit."

It's a brilliant thing about Redis. You can hot reconfigure it on the fly. The problem is we'd never done that before. We didn't entirely know that was going to work, and we had about seven minutes before we were going to hit the limit when we realized this was the best thing for us to do.

Again, a quick conversation, make sure I'm aware that that's what's going to happen. We made a decision to try it on the failover. With about five minutes to go, a couple of minutes to try on the failover. Yep, that worked fine. A couple of minutes to do it on the live platform. Worked brilliantly. Customers weren't affected at all, and it was a very near miss, but it was one which we missed purely by having a great team in place who knew what they were doing, knew what the solution was, and just dealt with it, looked after it.
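The "about seven minutes" figure is the kind of number you get by projecting the memory trend forward. A minimal sketch of that arithmetic in Python, with made-up sample values and a hypothetical function name, might look like this. In Redis the memory limit is the `maxmemory` setting, which can be changed on a running server with `CONFIG SET maxmemory`; the talk doesn't say exactly which command the team ran.

```python
def minutes_until_limit(samples, limit_mb):
    """Project when used memory reaches the limit at the recent growth rate.

    samples is a list of (minute, used_mb) readings, oldest first.
    Returns minutes remaining after the last sample, or None if memory
    is not growing. A straight-line projection is an assumption; real
    growth may well be faster or slower.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    first_minute, first_used = samples[0]
    last_minute, last_used = samples[-1]
    elapsed = last_minute - first_minute
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    growth_per_minute = (last_used - first_used) / elapsed
    if growth_per_minute <= 0:
        return None
    return max(0.0, (limit_mb - last_used) / growth_per_minute)
```

With illustrative readings of 6,000 MB and then 7,000 MB ten minutes later, against a 7,700 MB limit, this projects seven minutes of headroom.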

Small teams who are masters of what they do, have a great purpose, and have the autonomy to do the right thing.

Finally, I talked about the danger zone earlier. What actually happened during the danger zone? We didn't know leading into Grand National what was going to happen during this period.

If you remember the bet placement graph from earlier, well, actually that huge spike just around five o'clock-ish is when the football matches are finishing and is about 20 minutes before Grand National itself starts. If I zoom in on that, we can see the plan we had to manage database load, given we didn't know quite how much load we were going to have, was to effectively batch up our settlement process and try and throttle a bit how our settlement worked.

But we do want to give money back to customers as quickly as possible. And what actually happened is that there's three rises. There's one just before five o'clock and a couple after five o'clock, and this pushed us well beyond our record bet placement for the whole day.

Actually, by controlling our database load and by giving our customers their money back in a good way without impacting bet placement, it actually pushed us into the most bets placed that we've ever had.
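As a rough illustration of batching and throttling settlement, the Python sketch below settles bets in fixed-size batches with a pause after each one, so the database gets breathing room for bet placement. The batch size, the pause length, and the `settle_batch` callback are assumptions for the example; the talk doesn't give the real values or mechanism.

```python
import time
from itertools import islice


def batches(items, batch_size):
    """Yield successive lists of up to batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def settle_throttled(bet_ids, settle_batch, batch_size=500,
                     pause_seconds=1.0, sleep=time.sleep):
    """Settle bets batch by batch, pausing after each batch.

    settle_batch is a hypothetical callback that settles one list of bets.
    sleep is injectable so the pacing can be tested without waiting.
    Returns the number of bets settled.
    """
    settled = 0
    for batch in batches(bet_ids, batch_size):
        settle_batch(batch)
        settled += len(batch)
        sleep(pause_seconds)  # leave database capacity for bet placement
    return settled
```

Tuning the batch size and pause trades settlement speed against database load, which is the balance described above between paying customers quickly and protecting bet placement.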

Brilliantly, one of our biggest competitors tweeted just after Grand National that their system had hit a record bet placement level of just under 10,000 bets per minute, and we were pretty happy with our strategy here because that pushed us beyond that for a good 20 minutes or so before the race.

We're pleased with that. Also, to put this into reference, the last but one Black Friday that Amazon UK had, they put out a press release afterwards saying that their system had hit a record of, I think, what translates to around 4,000 to 5,000 orders per minute.

So in a roughly analogous world, we are coping with far higher transactions, although I realize their transactions are more expensive to deal with than ours.

In summary, lessons learned: small teams are great, particularly if they are entirely in control of what they're doing, if they're expert at what they're doing, if they know what they're doing. Small teams are great. Trust them. You need to know you can trust them.

If they advise you to do something, as long as it doesn't knock on to anything else which they're not aware of, then it's almost always going to be the right thing to do.

If it goes well, everyone goes home happy, or everyone goes to the pub happy. One of the two tends to happen. It was a really good day. We enjoyed it. We had a good time.

And the one final thing, I don't have a slide for this, but to pose a question, to state some problems we're having at the moment.

We operate in quite a highly regulated industry, between Gambling Commission regulation, PCI, data protection, et cetera. We operate entirely in private data centers at the moment, but we're experimenting with moving some workloads out into AWS cloud.

One of our biggest problems is that, for all we are confident and we're happy that we can build things in a very compliant way in public cloud, we need our regulator to understand that as well.

So the question is: how do we educate the regulators to convince them that actually public clouds are good places to operate if done well?

Thanks for listening.