How William Hill uses PagerDuty and Rundeck to Deliver Full-Service Ownership in a Highly Regulated Environment
William Hill, one of the world’s leading betting and gaming companies, needs to satisfy both its customers and its regulatory requirements. In their quest for a negative mean-time to repair, find out how they use PagerDuty and Rundeck together to empower internal teams to take action and automate fixes to common problems – all while maintaining impressive levels of uptime and keeping the compliance team happy.
This session is presented by PagerDuty.
Full transcript
Matt Livermore and Rob King
Matt Livermore: Hi folks. Welcome to our session. My name's Matt Livermore, and I'm a principal solution consultant here at PagerDuty. And today I'm joined by Rob King, who's the head of technical automation at William Hill. We're going to talk to you a bit about what William Hill have been doing today. Before I do any more, I'm going to hand over to Rob. My first question, Rob, is can you help the customers in terms of understanding who William Hill are and what the journey's been so far?
Rob King: Hi. Yeah. Well, William Hill, as I'm hoping most of you know, is a gambling company. Started many years ago in London and has been growing internationally ever since, and recently expanded into the United States. That was so successful we've been bought out by Caesars, who are now taking that American part of the business and going to push sports betting and gambling in the US, while the rest of the world will continue on as William Hill, where we hopefully will give customers the best betting experience possible and the safest one.
In regards to where we are and what we are up to, the automation team are here at William Hill to improve everything we do. Whether it's for support staff or technical staff, we look to automate away the drudge, the toil, and make life better for our employees and ultimately our customers.
Matt Livermore: Okay, so let's touch on that automation piece then. In terms of the key workflows, what sort of things are we talking about there? What specifically are you working on?
Rob King: Well, one of the biggest things we've been working on recently is mean time to repair, so making sure that our customers can get to our product as often as possible. And if we do have an issue or we take some maintenance, that product is back with the customer as quick as possible. So it's generally where we start to concentrate our time. But more recently, especially with work from home, we're looking at ways we can improve how our staff work, how efficient they can be in what they're doing in their day, and to try and make their job easier, quicker, which ultimately does knock on to improve customer satisfaction. But, yeah, we mainly start with making sure our customers get the best product possible at all times.
Matt Livermore: Okay, got it. So in terms of how you do that then, obviously I'm expecting a big ecosystem of tools. Walk us through some of the sort of tools you're using and how you've put those together to deliver these sort of outcomes.
Rob King: Well, it's a mainly Linux estate, Windows underneath in part. From there we've got VMware, we've got AWS, all of the classic infrastructure elements. On top of that is overlaid classic corporate systems. We've got Office 365, et cetera, other corporate systems, time and attendance, those kind of systems. And then we run a system called OpenBet, which is a fairly ubiquitous gambling platform that helps run not just us, but our competitors too.
Also, what we tend to get is customer complaints, customer issues, issues from our monitoring platforms, whether that be sort of New Relic. We had CA Suite before. We've got Splunk looking at all of our logs, aggregating our logs. We're looking for patterns. Anything that's out of the norm we hope to spot it, and very much hope to spot it before the customer does, so we can get there early. But it's not always the case, so you sometimes get things direct from the customer, whether that be via tweets or phone calls, emails, et cetera, chats.
And what we're looking to do is string together all of those things, whether it's the service tool. So we've got ServiceNow, and then you want to alert into Jira if you're wanting a dev team to look at it, or another operational team via ServiceNow. Well, it used to be, and we used to use emails. And that's where we started to get the real value from PagerDuty, as we started to get that response automated and get the team we needed on those issues as quickly as possible. And what we're doing now is trying to string that first bit of getting an engineer on a line with the right information right through to actually actioning a fix. So that's where we are right now.
Matt Livermore: Okay. So let's dig a bit more into what you're doing with PagerDuty. So where does PagerDuty fit into this?
Rob King: Originally it was a way of getting the right people on a call during an issue, and it was a way we could also start to sort of unify the information around a fault or an issue. So most teams' first experience with PagerDuty was to be uploading their on-call rotas and their escalation paths if that on-call rota failed. So most people, that was their first value they got out of PagerDuty.
The person who was actioning the call just hit, "Call the DBA," "Call Network," and PagerDuty would do the rest. So we didn't have to know exactly who, what their mobile number was. Really clunky spreadsheets of who to call when. That was all gone. It was just, right, that is the escalation path for that team. That's their current rota. If that rota fails, it followed the escalation path. So we knew how to get the right people very quickly. That's where we started with PagerDuty.
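As an illustration of the rota-then-escalation behaviour described here, the following is a minimal Python sketch. It is not PagerDuty's API or William Hill's configuration: the `EscalationLevel` class, the `page_team` function, and the names in the example rota are all hypothetical, and acknowledgement is simulated with a callback.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EscalationLevel:
    """One step of a team's escalation path (hypothetical model)."""
    name: str
    on_call: list[str]


def page_team(
    levels: list[EscalationLevel],
    acknowledges: Callable[[str], bool],
) -> Optional[str]:
    """Walk the escalation path in order; return the first person who acknowledges."""
    for level in levels:
        for person in level.on_call:
            if acknowledges(person):
                return person
    return None  # nobody on the path responded


# Hypothetical escalation path for a DBA team.
dba_path = [
    EscalationLevel("current rota", ["alice"]),
    EscalationLevel("secondary rota", ["bob"]),
    EscalationLevel("team lead", ["carol"]),
]

# Simulate the primary not answering and the secondary picking up.
responder = page_team(dba_path, acknowledges=lambda person: person == "bob")
print(responder)
```

The caller only names the team's path; who is on the rota and what happens when they do not answer live in one place rather than in a spreadsheet.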
Then as it grew, we realized we didn't just buy it for that. We then started to build the service structure, so you could start to identify in more detail what things were, where it had gone wrong, what we should do next, who should be called. That service structure started to be built so you could align PagerDuty with, in our case right now is New Relic, and we would match up specific errors with specific PagerDuty responses. And that's when you really start to get into the fine art of improving your response time to incidents.
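A rough way to picture "matching up specific errors with specific PagerDuty responses" is a routing table from monitoring conditions to services. The sketch below is illustrative only: the application names, condition names, service names, and the `route_alert` function are invented for the example and do not come from New Relic, PagerDuty, or the talk.

```python
# Hypothetical routing table: (application, alert condition) -> PagerDuty-style service name.
ROUTES = {
    ("payments-api", "high_error_rate"): "payments-api-prod",
    ("payments-api", "slow_response"): "payments-api-prod",
    ("bet-placement", "queue_backlog"): "bet-placement-prod",
}

# Anything unmatched lands on a catch-all service, which can only page a human.
DEFAULT_SERVICE = "ops-catch-all"


def route_alert(alert: dict) -> str:
    """Pick the service an alert belongs to, falling back to the catch-all."""
    key = (alert.get("app"), alert.get("condition"))
    return ROUTES.get(key, DEFAULT_SERVICE)


print(route_alert({"app": "bet-placement", "condition": "queue_backlog"}))
print(route_alert({"app": "unknown-app", "condition": "disk_full"}))
```

The more alerts that fall through to the catch-all, the less there is to automate against, which is the granularity point that comes up again later in the talk.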
Matt Livermore: Okay, good stuff. So obviously that's all around the initial part of an incident response process.
Rob King: Yeah.
Matt Livermore: It's about getting the right people into a virtual war room as it's become over the last year or so, rather than a physical one.
Rob King: Yes.
Matt Livermore: But yeah, so that whole point of getting people together to work on an issue. So where did Rundeck come in? I believe you were using that product before PagerDuty acquired it, so tell us a little bit about that.
Rob King: Yeah. So Rundeck, it was born out of technical staff's want not to have to log on to every box to make a change, to run things in a batch. If you wanted to check a port is open across an entire service, all the VMs within a service, you wanted to know is that port open or closed, or you make a change or configure a new path, you could do that in an instant using Rundeck. And that's where we started from, just the techies starting to make improvements.
And then for us, the sort of eureka moment was shortly after we'd done a lot of work to improve our patching. So patching prior to Rundeck was very laborious. We'd ask each team to come and help with the apps, and then the central team would take them down. We'd take the database down, patch everything, bring it back up, bring the apps back up with the engineers. And the amount of engineers that it involved to make it quick so that we could get product back to the customer was far too high. And with Rundeck, we managed to reduce that by over 95%.
So what we asked for was the teams who ran these services to give us service wrapper scripts: start it up, bring it down, take one out of a load balancer, patch it, put it back in, all of the scripts needed to test those things. So in and out of a load balancer, check service is still up, reboot, apply patches, put it back in a load balancer, check service is good, move on to the next. That kind of follow it all the way through. We've put a lot of time into that, and in the end, we got to achieve over a 95% improvement in time taken to patch.
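The patching loop described above can be sketched as follows. This is a simplified Python outline under stated assumptions, not William Hill's Rundeck jobs: each callable stands in for a team-supplied service wrapper script, all names are hypothetical, and the demo uses an in-memory set in place of a real load balancer. One choice made here for safety is to health-check a node before putting it back into the pool.

```python
from typing import Callable, Iterable


def rolling_patch(
    nodes: Iterable[str],
    remove_from_lb: Callable[[str], None],
    pool_is_serving: Callable[[], bool],
    patch_and_reboot: Callable[[str], None],
    add_to_lb: Callable[[str], None],
    node_is_healthy: Callable[[str], bool],
) -> list[str]:
    """Patch one node at a time, stopping at the first sign of trouble."""
    patched: list[str] = []
    for node in nodes:
        remove_from_lb(node)
        if not pool_is_serving():
            add_to_lb(node)  # restore capacity before giving up
            raise RuntimeError(f"service degraded after removing {node}")
        patch_and_reboot(node)
        if not node_is_healthy(node):
            # Leave the node out of the pool so customers never reach it.
            raise RuntimeError(f"{node} failed its health check after patching")
        add_to_lb(node)
        patched.append(node)
    return patched


# In-memory stand-ins for the wrapper scripts (hypothetical).
pool = {"web1", "web2", "web3"}
events: list[str] = []


def remove(node: str) -> None:
    pool.discard(node)
    events.append(f"out:{node}")


def patch(node: str) -> None:
    events.append(f"patch:{node}")


def add(node: str) -> None:
    pool.add(node)
    events.append(f"in:{node}")


result = rolling_patch(
    ["web1", "web2", "web3"],
    remove_from_lb=remove,
    pool_is_serving=lambda: len(pool) >= 2,
    patch_and_reboot=patch,
    add_to_lb=add,
    node_is_healthy=lambda node: True,
)
print(result)
```

Because the loop only depends on the wrapper callables, the same flow can be run by a central team without knowing how each service starts or stops.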
But what these scripts also gave us was another ability. Not all applications are perfect, and you do try the classic Windows fix on things of just giving it a restart. So these scripts allow us to no longer need an engineer to restart them in the middle of the night. We don't even have the on-call lag particularly. We still use PagerDuty to raise a ticket, spot the issue, alert people, but those people can now run those scripts. Doesn't have to be the exact engineer from that team who might only have a small on-call team. It can be done by a central team who are given access via a Rundeck project, and they can restart applications.
And that's when it started to mesh together, and that's the point where we started to look at PagerDuty custom incident actions to actually allow an incident triggered to also trigger the fix.
Matt Livermore: Right, okay. Yeah.
Rob King: So it doesn't even need, potentially, depending on how you want to run it, it doesn't even need a human anymore to do some of the classic fixes. We can try a restart of a service before calling an engineer now.
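The "try a restart before calling an engineer" pattern can be sketched in a few lines of Python. This is an assumption-laden outline rather than the actual integration: `handle_incident` and its callables are hypothetical stand-ins for a restart script, a health check, and a paging step.

```python
from typing import Callable


def handle_incident(
    service: str,
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    page_on_call: Callable[[str], None],
    max_attempts: int = 1,
) -> str:
    """Attempt an automated restart first; page a human only if that fails."""
    for _ in range(max_attempts):
        restart(service)
        if is_healthy(service):
            return "resolved_by_automation"
    page_on_call(service)
    return "escalated"


# Simulated run where the restart fixes the problem and nobody is paged.
paged: list[str] = []
outcome = handle_incident(
    "bet-placement",                 # hypothetical service name
    restart=lambda svc: None,
    is_healthy=lambda svc: True,
    page_on_call=paged.append,
)
print(outcome, paged)
```

Keeping `max_attempts` small limits how long a failing automated fix can delay getting an engineer involved.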
Matt Livermore: Cool. Sounds great.
Rob King: Yeah.
Matt Livermore: I mean, okay, sounds great on one hand. Also sounds a little bit scary in terms of what happens if the machine runs away. So how do you tie that back to other systems for an audit trail or anything like that?
Rob King: The Rundeck scripts are all audited. I mean, they are logged, so that was one of the things we worked through with InfoSec. The actual system, because PagerDuty's a SaaS, it's up there in the cloud, and Rundeck can be different ways of doing it, but quite often, it's quite core because it can action things on servers that are potentially very critical to your business. So we put a lot of work in matching up to ensure that the payload was exactly what it should be from PagerDuty before it would initiate a Rundeck action.
So we had that thoroughly checked by InfoSec. They signed that off. We use a Lambda function. It was shortly after we developed that and put it into live that we did show PagerDuty, and that's when they started asking lots of questions about Rundeck, and then a few months later they bought them. So I still swear blind it was all down to me and my team.
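The talk does not say what the payload checks were, so the following Python sketch shows only one plausible shape for a gatekeeper function sitting between a SaaS incident tool and an internal job runner. Everything here is an assumption: the shared-secret HMAC scheme, the payload fields, the allow-lists, and the `verify_and_parse` name are hypothetical and are not PagerDuty's or Rundeck's actual formats.

```python
import hashlib
import hmac
import json

# Hypothetical allow-lists: only known actions on known services may run.
ALLOWED_ACTIONS = {"restart-service"}
ALLOWED_SERVICES = {"bet-placement-prod"}
EXPECTED_FIELDS = {"action", "service", "incident_id"}


def verify_and_parse(body: bytes, signature_hex: str, secret: bytes) -> dict:
    """Reject anything that is not exactly the payload we expect."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        raise PermissionError("signature mismatch")
    payload = json.loads(body)
    if not isinstance(payload, dict) or set(payload) != EXPECTED_FIELDS:
        raise ValueError("unexpected payload shape")
    if payload["action"] not in ALLOWED_ACTIONS:
        raise ValueError("action not allowed")
    if payload["service"] not in ALLOWED_SERVICES:
        raise ValueError("service not allowed")
    return payload


# Example request, signed with a made-up shared secret.
secret = b"example-shared-secret"
body = json.dumps(
    {"action": "restart-service", "service": "bet-placement-prod", "incident_id": "P123"}
).encode()
signature = hmac.new(secret, body, hashlib.sha256).hexdigest()

request = verify_and_parse(body, signature, secret)
print(request["action"], request["service"])
```

Only after a check like this passes would the function go on to start a job, so a malformed or forged request never reaches servers that are critical to the business.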
Matt Livermore: All down to you. Yep.
Rob King: Absolutely. I'm just waiting on my commission check, I really am.
Matt Livermore: So just on that part, again, you're talking about lots of different moving parts and everything else. How difficult has it been to sort of go through this journey? What level of effort have you had to put in terms of PagerDuty, Rundeck, bolting them together?
Rob King: The bolting them together, I think because it hadn't been done to our knowledge, and we didn't really find anything, and neither PagerDuty nor Rundeck at that time were together, it did take us quite a while. We were a small team, and we were learning certain aspects of our role because it was quite a new role, this sort of enterprise-level automation.
It was difficult. It did take a lot of work, but I have got some excellent engineers. Once we'd broken the back of it, though, and it was mainly around the communication. We had issues that were really down to how we originally set up PagerDuty. Which I can go into a little bit, but the Rundeck end of it, that Rundeck setup's really quite simple. The scripts, you can write them in many different languages. It doesn't declare what you have to do. It gives you a lot of options on how you want to do things. Which allows separate teams who own a service, the more now agile DevOps teams, they own that product. They can write them in the way they want to, but others can run them, which is very powerful.
And it frees up others to do all sorts of tasks. We've got great savings from the compliance team using Rundeck jobs.
Matt Livermore: Okay.
Rob King: So when we have to do PCI audits, instead of having technical people who really hate doing PCI audits...
Matt Livermore: Yeah.
Rob King: ...you can work with them, and you can write the scripts that will do, based on their input, the checks they want. So quite often a PCI audit'll ask for, "Can I have X number of this type of server audited, please, for evidence packs?"
Matt Livermore: Yeah.
Rob King: And now we can say, "Right. Well, there you are, compliance team. There's the Rundeck job. You hit go, you input whichever ones, you can pick them at random," which auditors prefer rather than being led by techies, which I always think they're a bit suspicious of. So the audit team or the auditors themselves, even external auditors, if they're being chaperoned, can run these things themselves. And it's saved a lot of time, and it just gets a lot of drudgery work off techies' backs. But yeah, the setup was difficult, but once it was done, and I'm sure it'll be a lot easier going forward now that PagerDuty and Rundeck are together, it's been pretty solid.
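The "pick them at random" step of such an audit job might look something like the Python sketch below. The inventory, server types, and `pick_audit_sample` function are hypothetical; a real job would read from an actual inventory source and then run the evidence-gathering checks against the chosen hosts.

```python
import random
from typing import Optional


def pick_audit_sample(
    inventory: dict[str, str],
    server_type: str,
    count: int,
    seed: Optional[int] = None,
) -> list[str]:
    """Choose `count` servers of the requested type at random."""
    candidates = sorted(host for host, kind in inventory.items() if kind == server_type)
    if count > len(candidates):
        raise ValueError(f"only {len(candidates)} servers of type {server_type!r} exist")
    # A recorded seed lets the same selection be reproduced for the evidence pack.
    return random.Random(seed).sample(candidates, count)


# Hypothetical inventory: hostname -> server type.
inventory = {
    "db01": "database",
    "db02": "database",
    "db03": "database",
    "web01": "web",
    "web02": "web",
}

sample = pick_audit_sample(inventory, "database", count=2, seed=2021)
print(sample)
```

Because the selection is made by the job rather than by an engineer, the auditor can see that nobody steered them towards particular servers.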
I know there's already improvements in place. The system we currently use is custom incident actions, which I think are still currently limited to three per service instance. But there are new systems coming along that we've got some view of, which allows a much wider scope of actions that can be carried out automatically by a PagerDuty alert.
Matt Livermore: Ah, yes. The new V3 webhooks that give you lots of...
Rob King: The webhooks, yeah. Which I think can even, depending on the flavor of what the alert is, will tailor what the options are, which is really quite interesting to us.
Matt Livermore: Yeah. Priority based. Yeah. Priority based runbooks, all that sort of cool stuff starts to come into play. Yeah.
Rob King: So while we're still in that end-to-end bit, we're still fairly basic. We're now building up and understanding how we bolt these things together in a better way.
Matt Livermore: Okay. So again, I want to drill a bit into the benefits. So you already touched on some of the productivity ones. If we come back to things like mean time to acknowledge. I can remember talking with some of the team at William Hill, like Alan, for example, Alan Alderson, around the we want to get all this down to 20 minutes end to end. Where are you at now in terms of mean time to acknowledge and mean time to resolve? I know it's different for different types of issues, but give us a flavor for the sort of benefits you've seen as a result of having PagerDuty and Rundeck integrated into the environment.
Rob King: Well, the acknowledgments, I barely get to the alert before it's acknowledged. I get it on my mobile phone. That's one of the great things about PagerDuty is the stakeholders can have everything if they want to, or you can have just your area, and you can have it by several different ways of getting that communication. But the acknowledgement rates, I'll hear my text go, and before I've even read the alert, someone's acknowledged it, and we're on our way.
In regards to mean time to repair, we've had our best Grand National, and the last two quarters we have had exceptional, exceptional uptime and resolution times. So PagerDuty and Rundeck and all of our other tools like New Relic are really starting to prove a great deal of benefit in our customer services.
Matt Livermore: Oh, fantastic. Good to hear. In terms of lessons learned, if you were going to start again now, what would you do different, do you think?
Rob King: So if I start at the PagerDuty end, I think you really need to know what your service structure wants to be. What tends to happen, you link a monitoring into PagerDuty, and quite often then PagerDuty might go into, depending on what your choice of service reporting is, another tool like a ServiceNow, and then that goes on further. If you don't line things up well, you run into trouble. You can set up PagerDuty and have just sort of one service, and you can run a large amount of products like that, but you just can't then automate off the back of it. It's not granular enough.
So you need to be quite granular in your service structure. I would say you need to understand where you're going to do your reporting, because in the work we did and the work other teams have done since PagerDuty came along, we did a lot of work to match it up with ServiceNow to help reporting. So the PagerDuty information would go into a ServiceNow ticket to help with the reporting. But again, if they're not lined up, if you've got a trading application that's got 1,000 different parts and you've got one big lump of a service definition in ServiceNow, you just can't match them up easily, and it just makes reporting a bit of a mess.
And in terms of the Rundeck end of things, I would say the biggest thing you can do to make it work for you is to have those restart scripts part of all deliverables, so that a team has to deliver restart scripts and management and maintenance scripts and compliance scripts as part of delivering any product. Because that will, from the get-go, empower many of the people to run those tools, including your service teams, and that should, I would say, empower you and Rundeck to really improve your service deliverables around mean time to repair, et cetera.
Matt Livermore: Okay. So you're talking about both the sort of service design within PagerDuty and also the sort of the jobs that you want people to have available out of the box effectively with Rundeck.
Rob King: Yeah.
Matt Livermore: Are you templating that stuff now to make it easier for teams?
Rob King: We've had a few goes at this. It does depend. A lot of times in tech, people are always, "Oh, it's going to be new, and it's going to be new and fancy, and it's going to work in a different way." And it rarely does. It is starting to become a deliverable to ensure that you deliver X, Y, Z in a certain way or within guardrails. I wouldn't say we have a strict template for Rundeck especially, because we don't want people to be forced to use certain things. But there are guardrails to what we're trying to achieve. So I think rather than a strict template on how you do it, it's more a, it must do these things within these parameters. That make sense?
Matt Livermore: Yeah. Okay. Get that. I guess going on from there then, you've achieved so much already in terms of, as you said, best Grand National ever, really great last couple of quarters. Touch wood, it's all going to keep going the way it's been going, everything else, and obviously making inroads, massive success in the US right now. What's next? AIOps is something you hear a lot about. What does that mean to William Hill, AIOps?
Rob King: Well, if you listen to our guru on capacity and monitoring and integration of APIs, it means a negative mean time to repair. For us, it's collecting the data through systems like Splunk, which obviously aggregates your logs, New Relic, which is our current monitoring tool, both the out-of-the-box monitors and the specific ones that you build for an application. And over time, from that information, learning and predicting from many sources of information what's going to happen next.
So if you know there's a marketing campaign, and there's going to be an uptake here, but you know that pattern's already on the way to a problem, and you know through the marketing update that there is going to be or should be more load, you can get ahead of that and make the change and scale things up.
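The arithmetic behind "getting ahead" of a known campaign can be sketched very simply. The Python below is a toy capacity estimate, not a description of William Hill's forecasting: the `instances_needed` function, the uplift figure, the per-instance throughput, and the headroom margin are all made-up inputs for illustration.

```python
import math


def instances_needed(
    current_rps: float,
    campaign_uplift: float,
    rps_per_instance: float,
    headroom: float = 0.2,
) -> int:
    """Estimate how many instances to run before an expected load increase."""
    projected_rps = current_rps * (1 + campaign_uplift)
    # Headroom keeps the estimate above the projection in case it is too low.
    return math.ceil(projected_rps * (1 + headroom) / rps_per_instance)


# 800 requests/s today, marketing expects +50%, each instance handles about 100 requests/s.
target = instances_needed(current_rps=800, campaign_uplift=0.5, rps_per_instance=100)
print(target)  # 15
```

A number like this only becomes useful once something can act on it automatically, which is the role described next for Rundeck.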
So what we're after is Rundeck and other tools like it to provide the muscle that we can attach to this sort of artificial intelligence and machine learning, pattern-matching brain to try and remove the human effort, human reports, human predictions, and rely on what's gone on in the past to learn about the future and make that happen. But to me, intelligent automation is linking many tools together. You can be triggered by AI, you can be triggered by a straight-up monitor, but it's what you do and how you communicate automatically, which obviously PagerDuty handles faults and communication and escalations really well, and Rundeck provides that muscle, if you've got the scripts written by the teams that you require, to give you an outcome. Hopefully, it's the right one if you've configured it right and your scripts run correctly.
Matt Livermore: Yep. Totally agree. Well, Robert, it's been fascinating hearing sort of what you've been doing at William Hill, and yeah, thank you very much for the time today. I found it really illuminating, and I look forward to seeing more and more successes at William Hill going forward.
Rob King: I hope so, too. Thank you very much.
Matt Livermore: Thanks a lot.