How Sky Betting & Gaming is Driving Real-Time Operations Transformation

London 2020

Amit Juyal

Sr Service Lifecycle Manager · Sky Betting and Gaming

Steven Wheeldon

Service Operations Manager · Sky Betting and Gaming

Neil MacGowan

Technology Evangelist · New Relic

Rachel Obstler

VP Product · PagerDuty

For many organizations, DevOps transformation is now a business imperative as it drives widely understood advantages in innovation, agility, and empowerment. However, many organizations struggle to implement and realize the true benefits of DevOps transformation due to challenges with culture, processes, and tooling–in fact, as many as 78% of organizations fail to get DevOps right.

Join Sky Betting, PagerDuty, and New Relic as they discuss Sky Betting’s transformation story and their lessons learned from over the past decade.

This session presented by PagerDuty.

Full transcript

The complete talk, organized by section.

Host Intro (Perrine)

Thank you for joining us today on our webinar, How Sky Betting and Gaming is Driving Real-Time Operations Transformation.

We've got a great presentation on what it looks like to become more operationally mature in your organization, with specific input from Sky Betting and Gaming on how they view maturity and what it takes to get there.

Before we start, some quick housekeeping. Today's presentation will be recorded and will be shared with all registrants. And we will have a Q&A at the end of the presentation. Please enter your questions in the chat box, and if we can't get to them during the presentation, we'll reach out afterwards.

Okay. So our presenters today, our speakers are Steven Wheeldon, who is Service Operations Manager at Sky Betting and Gaming. We've got Amit Juyal, a Senior Service Lifecycle Manager at Sky Betting and Gaming, Rachel Obstler, VP of Product at PagerDuty, and Neil MacGowan, Technology Evangelist and Strategist at New Relic. So without further ado, I'll go ahead and hand it over to you, Rachel.

Rachel Obstler

Thanks, Perrine.

So today, what we're going to talk about is first an overview of how digital is changing operations. And then I'll hand it over to Sky Betting and Gaming, who will talk about their DevOps and operational transformation and the key learnings that they had going through that process. And then lastly, we'll spend a little bit of time, myself and Neil, talking about tools that can accelerate your operational maturity, PagerDuty and New Relic. And then lastly, as Perrine said, we'll take Q&A at the end.

So if we move to the next slide, I'm going to talk a bit about how digital is really changing the way that you need to operate. And on the next slide, that leads to four key macro trends that we're seeing.

And the first one is in the digital world, and probably in the last 10 years, pretty much every company that's been around longer than 10 years is going through a digital transformation. And then there's all the companies that have been born in the cloud, who are those digital disruptors. But for all of them, it's clear that there's a real importance of real-time action, being able to respond to anything that happens in real time because your customers, and that could be your external customers, it could be internal employees, but they've grown to expect 24 by seven capabilities, always-on access to whatever they want. And in fact, there's a lot of metrics out there. One of them is that 81% of users will wait less than a minute before abandoning an application. So just saying, "This isn't working, it's taking too long, and I'm out of here."

And so what that also means is that companies have been transforming the way that they operate, and moving to more of a full-service ownership model, so where someone is coding and owning their software. And what that does is it improves agility. It means that the person who is operating something was the last one to touch it, knows what the problem could be and can more quickly and easily fix it. And then this moves to the developers really being the architects of the whole digital experience.

Another trend that's going on is the rise in operational complexity. There's a lot of infrastructure changing: you have migration to the cloud, you have monoliths changing to microservices, you have growth in services like Lambda, where you're leveraging and building on cloud services. And this is all leading to just a huge rise in operational complexity.

And a little bit of metrics around that are that the monthly events per responder that we've seen just in PagerDuty's data across all of our customers has increased three times in the past three years. So everyone who is managing these infrastructures, dealing with more noise, more information, and having to deal with an increasingly complex infrastructure.

So if we move on to the next slide, what that means and what that's leading to is we're seeing a lot of leading companies really reimagining the way that they're doing operations. And it means that the old reactive method of doing queued work or waiting to find out that there's an issue from a customer doesn't work. You need to be able to operate in real time. It's less of a queued system. It's more of a swarm when something goes wrong, and you bring in all the people that you need quickly. It's a lot more collaborative in that way.

It's putting monitoring in place so you can be proactive in figuring out that there's a problem. It's investing in automation so that you can instantly route data that you need to the right place, that you can spin up a response very quickly, that you're not going down a call list, for instance, and trying to find someone by calling them one by one to help you. And it's moving from a more command-and-control type of method towards the individual responsibility and democratization of data to make sure that the people on the front line can respond quickly. And then lastly, it's going from more static or rules-based approaches to a system that can learn and be intelligent and get smarter and help you operate better all the time.

So moving on to the next slide. What PagerDuty has done is we looked at what makes a company successful, working with a lot of our customers, and we've built a real-time operations maturity model to help companies understand where they are and how they can improve.

And so this maturity model is at a high level pictured here, and what it does is it looks at companies that operate in more of a reactive mode all the way through companies that are operating in a preventative mode. And some examples of that are when you're reactive, you're waiting until a customer files a ticket, and that's how you find out that you have a problem. Whereas when you move to responsive and proactive, you're getting timely information in automatically. You're using lots of monitoring capabilities, and you're automatically routing that information to the right person who can act on it in a matter of minutes, if not seconds.

So some other examples of what maturity looks like is that you may have knowledge in silos. There's a lot of people who know things that no one else knows, that when they're out, you have a problem resolving things. Whereas you have a much better method of sharing information when you're getting to be more proactive and preventative. And then in the preventative world, you're using techniques like machine learning and predictive capabilities to get ahead of issues before they even happen.

So what does that mean in terms of how you end up operating? What we did at PagerDuty is we actually did a survey, and we asked a lot of companies about their operational practices, and then we also asked them how they performed. And what we found is that the more mature organizations significantly outperformed their lower maturity peers along several key metrics. So some of those metrics are, for instance, more mature companies acknowledge incidents on average seven minutes faster. They're able to mobilize around incidents 11 minutes faster, and they're able to resolve incidents in general, so putting all those numbers together, plus the resolution, two hours faster. And then there is an average of seven incidents per month across these companies that were major incidents, so major typically meaning customer impacting. And so on average, the more mature organizations had 14 hours less downtime each month than their less mature counterparts.
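The 14-hour figure follows from the two survey numbers just quoted. A quick check of the arithmetic, using only those figures (the variable names are illustrative):

```python
# Survey figures quoted in the talk.
major_incidents_per_month = 7   # average number of major incidents per month
hours_faster_per_incident = 2   # more mature organizations resolve each ~2 hours faster

# Downtime avoided per month by the more mature organizations.
downtime_saved_hours = major_incidents_per_month * hours_faster_per_incident
print(downtime_saved_hours)  # 14
```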

So this has a real and very large impact to your customers when you can move up and operate in a more mature way. So with that, I'd like to hand it over to Amit at Sky Betting and Gaming, who's going to talk about what they did to mature the organization and get a lot more effective at responding to issues. So, Amit, over to you.

Amit Juyal

Thanks, Rachel.

So just a bit of background about us as a company. We are a leading betting and gaming company here in the UK, purely online based. And we are basically aiming to be the UK's best digital business. We do this with the help of 1,400 plus colleagues and aim towards developing some of the country's biggest brands in the online betting and gaming industry.

Just a bit about our product portfolio. So our product consists of, in the sportsbook areas, gaming, and free-to-play products. And that's just a wide range of products that we developed and manage in-house.

One of the key reasons for our success is how we do things, which is what we term the SBG ways of doing. To highlight some key ones: we are customer obsessed, so everything that we develop, everything that faces the customer, is always built with the key aim of doing what is best for the customer in terms of experience, features, promotions, and a few other examples.

We are game changers. The key message here is basically that we don't shy away from experimenting. We make sure that we take the risk with the right level of planning and the right level of research.

We learn and adapt, which is quite key for any company to be successful. We won't say that we are perfect all the time, but we make sure we take the learnings very quickly, adapt, and we move on from there onwards.

And the last one I would like to highlight is that we are all one team. That is one of the biggest strengths that we hold: we don't shy away from going and speaking to people within the organization or external to the organization, taking learnings, adapting them, and sharing ideas openly. Amazing colleagues do the right things by our people, our business, and our customers.

Moving on to the technical operations journey from here onwards. The world of DevOps was first introduced to us as a company in 2011. We had a centralized team looking to support multiple products and functionalities going live to customers almost every week. Along with this came our strategy to build new products in-house. Key events like the Grand National, the Cheltenham Festival, and big football events brought new challenges for the service and operations team year on year as we grew as a company.

In 2013, we adopted a big change, which is what we call as the tribe structure, which was inspired by a model that Spotify introduced to the industry. This evolved over a period of time and resulted in Sky Betting and Gaming divided into autonomous teams, first mainly at the product level, and then at the squads level.

The key theme here was that DevOps, which was previously a centralized team, now split into basically three main areas, as you can see in the presentation. You had a DevOps role, which was there within the squad. They were named SquadOps. You had reliability and platform engineering teams that would sit within the tribes to basically help drive all the evolution regarding DevOps and the operational areas. The aim of these roles was to help squads optimize the reliability and delivery of products and features to customers.

This fast growth brought its own challenge to the technical operations team. Steven Wheeldon, our service operations manager, will now take you through some of the key challenges we've experienced, and how PagerDuty as a tool helped us through this journey, to overcome them.

Steven Wheeldon

Great. Thanks, Amit. So this is just a quick overview of where we are right now with PagerDuty in this particular instance. It's taken us around two and a half years to get to this point we're now at. And while that may sound a little daunting to those of you considering similar transformations, you'll see in this deck that some of the most meaningful changes and improvements occurred within a couple of months of integrating PagerDuty.

In the beginning, we had a fairly unpolished, unloved, and typical monitoring setup. We had our alarms for all our hosts, but often they were riddled with useless default functions, false positives, and redundancy. Monitoring was a real black box pre-2016. Alongside this, we had a traditional, yet super inefficient escalation process, too.

Random contact numbers for random people in random places within our knowledge base. With information spread so sporadically, incorrect call-outs, human errors, and delays tainted our whole escalation process.

The scenario is this: You sit staring at your screen for hours, yet another unknown alarm goes critical. You panic. You start trying to match keywords in the alarms to scraps of information you've got in the knowledge base. You think you've got it. You pull up a page of numbers. You find your guy. You call as quick as your trembling fingers will dial. But no, you've called the CTO. You give him a fake name and hang up abruptly. The panic intensifies. Back to the knowledge base. The next number, could this be the one? You call. You call again. No answer. You're alone.

To try and combat this situation, we began to track every alarm we came across and record what action was taken when they occurred, so that the next time we might be better informed. More manual work, more pain. Just to add to our sorrows, many teams would have a single shared on-call phone. It was up to the engineers to make sure, A, this phone was handed over, and B, was charged or even worked. That is essentially how on-call was managed. As you can imagine, this caused a lot of problems. Something had to change.

In case those sort of words didn't quite hit home, I've put this together just as a little image of what I'm trying to say.

In the summer of 2016, we made our first step toward introducing PagerDuty in a minimal, albeit effective, capacity. We did away with manual on-call rotas, opting instead to build the rotas into PagerDuty itself. If an alarm fired, we would still need to figure out which service was impacted based on the alarms. But once we determined this and figured out which team was responsible, we would just need to select them from a drop-down, type a little message into PagerDuty, and let it do all the work.

It would notify engineers in a way they preferred, email, push notifications, or calls, and if they didn't answer it, it would automatically escalate to the next in line as per their escalation policies, which is what you're looking at there. This removes the manual process of entering phone numbers, searching for specific on-call rotas, and deliberating over how many missed calls is too many missed calls. It also killed the on-call phone practice I spoke about in the last slide.
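The escalation behaviour described here can be sketched in a few lines. This is only an illustration of the idea, not PagerDuty's implementation; the `Responder` class, the `escalate` function, and the simulated `answers` flag are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Responder:
    name: str
    answers: bool  # simulated: does this person acknowledge when notified?

def escalate(policy, notify):
    """Walk the escalation levels in order; stop at the first acknowledgement."""
    for level, responder in enumerate(policy, start=1):
        notify(responder, level)
        if responder.answers:
            return responder
    return None  # nobody acknowledged; a real setup might repeat or page a manager

notified = []
policy = [
    Responder("primary", answers=False),
    Responder("secondary", answers=True),
    Responder("squad lead", answers=True),
]
acked_by = escalate(policy, lambda r, lvl: notified.append((lvl, r.name)))
print(acked_by.name, notified)
```

In this run the primary does not answer, so the page moves to the secondary and stops there; the squad lead is never disturbed.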

This first basic setup we had with PagerDuty also granted us an invaluable break glass protocol through its response play function. If SBG services ever take a real hit, then we have the ability to have every on-call resource, which is well over 50 individuals, online within a matter of minutes, with only one call actually being made.

This protocol can also be used on a more granular level, too. Should one product go down, for example, we could page out all the resources of that product through a tailored escalation policy or response play. When your products have really hit the fan, being able to immediately reach out to everyone who may be able to help is a real lifesaver, and has gone a long way towards boosting our mean time to recovery.
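As a rough sketch of the break glass idea, paging either one product's on-call staff or everyone at once, assuming a made-up roster (`ON_CALL`, the product names, and the people are hypothetical, and this is not how PagerDuty's response plays are configured):

```python
# Hypothetical roster: product -> service -> person currently on call.
ON_CALL = {
    "bet": {"bet-platform": "alice", "bet-settlement": "bob"},
    "gaming": {"slots": "carol", "lobby": "alice"},
}

def responders_for(product=None):
    """Return everyone to page: one product's on-call staff, or all of them."""
    products = [product] if product else list(ON_CALL)
    people = {person for p in products for person in ON_CALL[p].values()}
    return sorted(people)  # de-duplicated so nobody is paged twice

print(responders_for("bet"))  # granular: one product
print(responders_for())       # break glass: everyone on call
```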

With the initial success of our PagerDuty integration, the next step was to chuck some automation into the mix. By early 2017, we'd tied up all the services to their responsible squads, removing once and for all any doubt as to who we should contact when a service has been impacted. All we need to do now is search for whatever it is that's broken in PagerDuty and let it do all the work for us.

PagerDuty has also helped us to manage the ownership of services. As services develop and responsibilities change, some services can get left behind in the process with no squad or team willing to take responsibility. PagerDuty provides an irrefutable catalog as to who will be contacted should an incident arise, and makes it very easy when you need to amend it.

At this stage in the course of PagerDuty's implementation with our services, we reported a decline in our mean time to recovery of approximately 20% midway through 2017. This will certainly have been helped considerably by PagerDuty removing all the manual time-wasting and trial and error processes of the past. Our mean time to acknowledge improved dramatically during this period, too.

The biggest change in the whole process was the move to where we are right now. We decommissioned our old monitoring tools and switched on Visibility in PagerDuty. We use a number of different tools to monitor different aspects of our infrastructure and services, and Visibility allows us to integrate them all into one central platform for monitoring.

Our New Relic instance, for example, feeds directly into Visibility in real time. Combining these two powerful tools has been invaluable to our operations. Neil will go into this relationship a little bit more later.

Our main source of information about service-impacting incidents is now the major incident section, which is the one in the middle, along with the service health section to the left. Naturally, the major incident section is reserved for incidents with real-time impact to services. Service health is for more generic, non-critical alarms. The section on the right, infrastructure health or bubbles, as it's fondly referred to in SBG, is helpful as a retrospective tool to analyze periods of instability for specific services. More alarms fired at a certain time, the bigger the bubble. If the bet platform had an issue at two o'clock last Friday, we can use bubbles to quickly build a picture of what services specifically were impacted and how.
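The "bubbles" idea, where more alarms in a period means a bigger bubble, amounts to counting alarms per service per time bucket. A minimal sketch under that assumption (the data, the one-hour bucket, and the function name are illustrative, not taken from the product):

```python
from collections import Counter

def bubble_sizes(alarms, bucket_seconds=3600):
    """Count alarms per (service, time bucket); more alarms means a bigger bubble."""
    return Counter((service, ts // bucket_seconds) for ts, service in alarms)

# Hypothetical alarms as (seconds since midnight, service).
alarms = [
    (50400, "bet-platform"),  # 14:00
    (50500, "bet-platform"),
    (51000, "bet-platform"),
    (50450, "lobby"),
    (58000, "bet-platform"),  # just after 16:00
]
sizes = bubble_sizes(alarms)
for (service, hour), count in sorted(sizes.items()):
    print(f"{service} {hour}:00 -> {count} alarm(s)")
```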

At the same time as we moved to Visibility, we also tied alarms to specific services and then configured the alarms to automatically call out the responsible squads. This completely removes the need for any manual work in monitoring alarms and puts the onus of maintaining alarms onto the engineers. This change really drove refinement in service operations, with tribes working to streamline their critical alarms as opposed to leaving it up to the service desk to determine if anyone cares that the disk space on host 203 is at 98% capacity. Tribes would configure these alarms so that PagerDuty would essentially know if anyone should care.
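One way to picture alarms that "know if anyone should care" is a squad-owned rule per service that decides whether an alarm pages anyone. The mapping, thresholds, and names below are hypothetical and only illustrate the routing idea:

```python
# Hypothetical configuration owned by the squads, not the service desk.
SERVICE_OWNERS = {"bet-platform": "squad-bet", "reporting": "squad-data"}
# Per-service rule: page only when disk usage reaches this percentage.
DISK_PAGE_THRESHOLD = {"bet-platform": 90, "reporting": 99}

def route_disk_alarm(service, disk_used_pct):
    """Return the squad to call out, or None if the alarm is not worth a page."""
    if disk_used_pct >= DISK_PAGE_THRESHOLD[service]:
        return SERVICE_OWNERS[service]
    return None

print(route_disk_alarm("bet-platform", 98))  # squad-bet is called out
print(route_disk_alarm("reporting", 98))     # None: this squad chose not to be paged yet
```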

With this latest step in our transformational journey completed, we've seen an additional decrease in mean time to recovery by approximately 8% since the go-live of Visibility and auto call-outs. From the beginning of PagerDuty's implementation, we've also had a total decrease in mean time to recovery by approximately 28%. This will certainly have been influenced by our PagerDuty integration.

Perhaps our most remarkable metric is the improvement in our mean time to acknowledge, which has fallen by 86% over the last two and a half years.
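For readers less familiar with the two metrics, mean time to acknowledge and mean time to recovery can both be computed from incident timestamps. The incident log below is invented purely for illustration:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: (triggered, acknowledged, resolved).
incidents = [
    (datetime(2019, 1, 5, 2, 0), datetime(2019, 1, 5, 2, 4), datetime(2019, 1, 5, 3, 0)),
    (datetime(2019, 1, 9, 14, 0), datetime(2019, 1, 9, 14, 6), datetime(2019, 1, 9, 14, 40)),
]

def minutes(delta):
    """Convert a timedelta to minutes."""
    return delta.total_seconds() / 60

# MTTA: average time from trigger to acknowledgement.
mtta = mean(minutes(ack - trig) for trig, ack, _ in incidents)
# MTTR: average time from trigger to resolution.
mttr = mean(minutes(res - trig) for trig, _, res in incidents)
print(mtta, mttr)  # 5.0 50.0
```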

The next steps in our transformational journey are loosely centered around additional integrations. Primarily, we want our PagerDuty to interact with both our Slack and Jira platforms. We recently launched a bot into Slack that will allow us to type a simple command that triggers a PagerDuty alarm, so we won't even have to go into the PagerDuty portal to use PagerDuty. The aim here is to claim back a little bit more time during break glass or major incident situations. Having said this, I'm reliably informed that PagerDuty are on the verge of launching their own version of this, and that will be up for grabs in the near future. With Jira, we're looking to avoid duplicating work by having Jira and PagerDuty work as one, where our PagerDuty call-outs will trigger the automatic creation of Jira tickets.
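A chat command that triggers a PagerDuty alarm could be sketched as building a trigger event for the PagerDuty Events API v2. The command format (`<service> <message>`), the function name, and the placeholder integration key are assumptions for illustration; SBG's actual bot is not described in the talk. The sketch only builds the JSON body and does not send it.

```python
import json

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"  # PagerDuty Events API v2 endpoint

def build_trigger_event(routing_key, command_text, source="slack-bot"):
    """Turn '<service> <message...>' from a chat command into a trigger event."""
    service, _, message = command_text.strip().partition(" ")
    if not service or not message:
        raise ValueError("expected '<service> <message>'")
    return {
        "routing_key": routing_key,  # integration key of the target PagerDuty service
        "event_action": "trigger",
        "payload": {
            "summary": message,
            "source": source,
            "severity": "critical",
            "component": service,   # hypothetical use of the field to carry the service name
        },
    }

event = build_trigger_event("YOUR_INTEGRATION_KEY", "bet-platform checkout failing")
print(json.dumps(event, indent=2))
# Sending is left out of the sketch: it would be an HTTP POST of this JSON to EVENTS_URL.
```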

Beyond those, we're actively working with our third party suppliers to take on PagerDuty to further boost our combined recovery. And with that, I'll hand you back to Amit for a closing statement on our bit.

Amit Juyal

Thanks, Steven. So all these brilliant functionalities through PagerDuty have helped us basically drive the three main key areas that we always aim to improve on.

It's helped us drive revenue and improve customer experience. 86% reduction in our mean time to acknowledge has ultimately resulted in us reducing the impact the customer sees every time the systems get impacted.

It has helped us improve people productivity and engagement. Smarter integration through the PagerDuty API helps us monitor services in real time and helps reduce false alarms and call-outs, ultimately giving people more productive time in the office rather than getting called out at two o'clock in the night when the service was not even impacted.

And finally, it reduces business risk and improves cost efficiency, because the fewer issues customers experience, the better it is from their perspective in terms of product journeys and other functionalities. On that note, I'll hand back over to Rachel to discuss more about the transformation of real-time operations with the right tools.

Rachel Obstler

Thanks, Amit.

That was a huge improvement that you saw going down 86% from 30 to four minutes in mean time to acknowledge, so that's awesome.
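The percentage can be checked directly from the two figures mentioned, 30 minutes before and four minutes after:

```python
before_minutes = 30
after_minutes = 4

# Relative reduction in mean time to acknowledge.
reduction = (before_minutes - after_minutes) / before_minutes
print(f"{reduction:.1%}")  # 86.7%
```

That is in line with the roughly 86% quoted in the talk.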

So I heard a lot of themes in what you were talking about, and I think it's important to talk about what it really takes to do an operational transformation, and you really talked about a lot of different elements. So one of them, of course, was a tool, but it was also changing your organizational structure. It was putting the right processes in place and also moving to a culture of ownership of people both building and owning their services. So all those things working together.

And so one of those elements is having the right tooling. And so, wanted to talk just really briefly in the next slide about PagerDuty's platform for real-time operations.

And another theme I heard, or a couple of themes I heard throughout your presentation was automation. So making sure that you can automate as much as possible about those mundane operational things that you don't need your brilliant team to be spending time on.

And so PagerDuty's platform does that with on-call management and modern incident response. So that's where we essentially automate the whole runbook and automate the capability to spin up a response with multiple people in a very surgical way. So you know when you have this problem, you need these five people from these five teams, and PagerDuty can automate that whole process.

Yeah. And then the other thing is about having the right information in front of people so they can be more effective. And one of the ways we do that, that was shown in the presentation, is through Visibility, so making sure that you can see all of the data, all of the issues that are going on at one time, who is working on them, what other services could be impacted.

We also have Event Intelligence, and what Event Intelligence does is it manages and helps manage the noise. So when you have a lot of data coming in, we can look across that data and say, "These things are related to each other." And instead of pinging with multiple incidents, we can automatically group together related issues into one incident so that all of that context is available for the responders, and they have more information from time zero when they first get told that there's an issue about what's going on.
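To make the grouping idea concrete, here is a deliberately simple sketch that merges alerts for the same service when they arrive close together in time. It is not PagerDuty's algorithm; the five-minute window, the same-service rule, and all names are assumptions chosen for illustration.

```python
def group_alerts(alerts, window_seconds=300):
    """Group alerts for the same service that arrive within a time window.

    alerts: list of (timestamp_seconds, service, message), in any order.
    Returns a list of incidents, each a dict holding its grouped alerts.
    """
    incidents = []
    latest = {}  # service -> most recent incident for that service
    for ts, service, message in sorted(alerts):
        current = latest.get(service)
        if current and ts - current["last_seen"] <= window_seconds:
            # Related alert: attach it to the existing incident as extra context.
            current["alerts"].append(message)
            current["last_seen"] = ts
        else:
            current = {"service": service, "alerts": [message], "last_seen": ts}
            incidents.append(current)
            latest[service] = current
    return incidents

alerts = [
    (0, "bet-platform", "high latency"),
    (60, "bet-platform", "error rate up"),
    (90, "lobby", "host down"),
    (1000, "bet-platform", "high latency"),
]
grouped = group_alerts(alerts)
for incident in grouped:
    print(incident["service"], incident["alerts"])
```

Four alerts become three incidents here: the first two bet-platform alerts arrive within the window and are merged, so the responder sees both pieces of context in one page.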

And then lastly, PagerDuty has analytics so that you can look back afterwards and really learn and improve. So look at how many incidents you had, what was causing them, how long did it take to resolve them? Is there anything happening that's repetitive? Are there certain services that are more noisy than others and maybe need an operational investment?

So that's the PagerDuty platform all built on top of an enterprise-class scalable platform and with a large amount of unique data that all these products on top of it can really leverage to help you create a learning and improving organization.

So with that, I will hand it over to Neil to also talk about monitoring and getting the right data into PagerDuty.

Neil MacGowan

Thank you, Rachel. I appreciate the introduction.

So one of the things that I'm here to talk about actually is the fact that it's incredibly important to make sure that when you use the tools, when you do put people into triaging incidents, et cetera, and you start notifying people, that they're working on the right things. Because nobody likes a false alarm, whether it comes to wake you up in the middle of the night, or whether it requires you to mobilize an entire team to try and resolve certain issues within your organization.

And an example of perhaps probably one of the worst false alarms that we've seen in recent years was the missile warning system in Hawaii, in January last year, which was triggered through actually a bad user interface. So it was down to a code-level issue, and notified the entire population of Hawaii that there was an imminent risk of them being attacked by a missile.

Now, we're not saying that when it comes to running your business-critical applications that the consequences of a false alarm are going to be quite so drastic, but it just highlights the importance of making sure that when you do trigger an alarm or an event, that it's accurate. And if you want to move your organization from being reactive to proactive and ultimately predictive, then you have to make sure that the notifications you provide can be relied upon.

So how do we do that at New Relic? Well, fundamentally, we expand the concept of APM to go way beyond just looking inside your application code or just looking at your infrastructure. Instead, we provide full out-of-the-box instrumentation that allows you to quickly ascertain what's the impact on the customer from every user experience mobile device through the applications tier and to the back-end infrastructure and cloud services that are supporting those particular applications. And that immediate visibility gives you an idea as to where problems are potentially coming from and what the impact is.

Secondly, you can extend the New Relic instrumentation with custom attributes and custom metrics, which capture key performance indicators relevant to your business. So how many bets are we taking, for example, within the betting and gaming industry? Or how many orders are we taking through e-commerce? And how is the alert that we're actually triggering at the moment impacting our ability to do business, and how many customers are being impacted? So this gives you much greater context on whether or not what we're triggering as an alert is meaningful to the business and requires the appropriate action.

And then finally, actually generating those alerts in an intelligent way. So not just basing alerts on fixed threshold breaches, but looking at using machine learning techniques to understand what's normal, what's abnormal, looking at things like cohort analysis. So for example, if metrics are supposed to operate in a similar fashion, maybe across a load of states and one of them starts to behave abnormally, that's something which you should know about. And also, how is the changing environment actually impacting that change in behavior? So how is deployment, which has resulted in perhaps a performance regression or introduced a new number of errors into the equation, and providing that information back in full context of a new trigger? Not only is it intelligent, but it's also providing the full context.
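The difference between a fixed threshold and these two approaches can be sketched with basic statistics. This is an illustration of the general idea only, not New Relic's method; the z-score limit, the median-based cohort rule, and the host names are assumptions.

```python
from statistics import mean, median, stdev

def is_abnormal(history, latest, z_limit=3.0):
    """Flag a value that sits far from the metric's own recent baseline."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_limit

def cohort_outliers(latest_by_member, tolerance=0.5):
    """Flag members whose value differs from the cohort median by more than tolerance."""
    mid = median(latest_by_member.values())
    return sorted(m for m, v in latest_by_member.items()
                  if abs(v - mid) > tolerance * mid)

# Baseline: response times (ms) that normally hover around 100.
history = [100, 102, 98, 101, 99]
print(is_abnormal(history, 101))  # False: within normal variation
print(is_abnormal(history, 150))  # True: far outside the learned baseline

# Cohort: hosts that are supposed to behave alike.
print(cohort_outliers({"host-1": 200, "host-2": 210, "host-3": 205, "host-4": 900}))
```

A fixed threshold of, say, 500 ms would miss the jump to 150 ms entirely, whereas the baseline check flags it because it is unusual for this particular metric.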

So fundamentally, what New Relic is doing is it's allowing you to connect the technical performance of your applications and infrastructure actually to the business value that they deliver.

So to sum that up really, if you think about becoming proactive, real-time, and delivering APM-driven operations, there are four key areas that you need to consider. One is that you need to leverage APM data. And when I say APM, I don't mean just looking actually at the code. Gartner recently expanded the APM category to include user experience and infrastructure, so that you have full context of what's going on, and so that when you're notifying the right teams, they understand everything that is going on within that complex environment.

The second thing is to reduce mean time to repair. Now, we've heard from Sky Betting and Gaming how they've made significant reductions in mean time to repair. Reducing mean time to acknowledge, so making sure that somebody is in receipt of that alert and is working on that faster, but also providing the context to make sure that it's the right people that are involved and they've got all the information to hand, and they can collaborate accordingly in order to resolve the issue faster.

And the third thing is make your alerts actionable. So if you can connect these two technologies that allow you not only to more intelligently alert, but also mobilize the appropriate resources to resolve the issues faster, then that is going to benefit your business greatly.

And then finally, organizations are now able to deploy at a much greater frequency, so it's never been more important to be able to detect issues faster. At the end of the day, the faster you go, the more often you'll make mistakes. So you have to be able to not only deploy quickly, but you have to be able to determine whether or not that deployment has had a positive or a negative impact. And if negative, you have to be able to roll it back just as quickly as you rolled it out. And that requires either the people to do that or the triggering of automated remediation in order to make that happen.
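A minimal sketch of that deployment check, comparing the error rate before and after a release and deciding whether to roll back. The function name, the one-percentage-point tolerance, and the sample numbers are assumptions for illustration; a real check would likely look at more signals than error rate.

```python
def deployment_verdict(errors_before, requests_before,
                       errors_after, requests_after, max_increase=0.01):
    """Compare error rates around a deployment and decide whether to roll back."""
    rate_before = errors_before / requests_before
    rate_after = errors_after / requests_after
    # Roll back only if the error rate rose by more than the allowed margin.
    return "roll back" if rate_after - rate_before > max_increase else "keep"

print(deployment_verdict(10, 10_000, 12, 10_000))   # keep
print(deployment_verdict(10, 10_000, 400, 10_000))  # roll back
```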