DevOps Transformation - A Case of History Repeating

London 2019

Platform Engineer · Sky Betting and Gaming

At Sky Betting and Gaming we've been on a DevOps journey since 2011. We've undertaken numerous "DevOps Transformations" and learned that this is an ongoing program rather than a "Big Bang" initiative. As the company has restructured into agile autonomous tribes and squads, we've identified and reused a number of patterns and techniques as we've evolved our ways of working. This talk aims to share those learnings.

However, unlike many talks at DevOps Enterprise Summit, this talk focuses on the practical rather than the strategic. It looks at how team structures have changed, the way work is managed, how we learn, how we own products/services and how we safely increase velocity. In addition to looking holistically at the whole company, we'll focus on one case study of how the Container Platform Squad has evolved from a small greenfield POC to be the platform of choice for many projects/products across the tribes. It also covers how patterns from the rest of the business have been applied and how they solved many of the challenges faced during the squad's evolution.

Andy has over 25 years of industry experience and has been part of the team at Sky Betting and Gaming for 5 years. He's held a number of engineering positions in the Bet, Data and Infrastructure tribes. Currently a Platform Engineer, he spends his day running and extending the Kubernetes platform.

Outside of work, Andy has been running the DevOps meetup group in Leeds for almost 6 years and is part of the organising team for the DevOpsDays London conference.

Full transcript

Host Intro (Gene Kim)

Next up is Andy Burgin. He is a platform engineer at Sky Betting & Gaming. He spoke here at DevOps Enterprise London 2017 about how he helped bring DevOps principles to the big data and Hadoop teams. We thought this talk was so fantastic that we asked him to present the same talk at DevOps Enterprise in the US.

To me, there are two very interesting things about Andy's work. One, he's always finding ways to solve problems using DevOps principles, and the results are always so novel and interesting. So this year, he'll be talking about how he chose to work with the newly formed container teams and figure out how to best integrate them into the existing development processes. Which brings me to the second thing that I find so interesting about Andy. After decades of managing teams, he's decided to jump into a more technical individual contributor role because he likes it, and he's doing the work that he loves and that got him into the field in the first place. Andy.

Andy Burgin

Morning.

Today, I would like to talk to you about the DevOps transformation work we've been doing at Sky Betting & Gaming over the last eight years.

Before I do that, I should probably do introductions. Hi, my name's Andy. You must be DevOps Enterprise Summit. How very nice to make your acquaintance.

I've been at Sky Betting & Gaming for the last five years. I'm currently a platform engineer in our container hosting platform squad because I really like YAML, and I'm very much a proponent of grassroots DevOps. I'm on the organizing committee for DevOps Days London, and I've organized the DevOps meetup in Leeds for the last six years. My Twitter handle is on the screen there if you would like to ask me questions or just troll me. I don't mind. I quite like the attention. And my pronouns are he and him.

So Sky Betting & Gaming is the UK's most popular online bookmakers. We have a number of products, our sports betting product, so Sky Bet, and a number of gaming products, so Sky Bingo, Sky Casino, Sky Poker, and Sky Vegas. We've a number of free-to-play games as well, and as you can see from the charts on the screen, we're doing okay.

So what does okay look like in terms of numbers? Well, we've 1,400 employees, most of them based out of our headquarters in Leeds. We have another office in Sheffield where a large number of the team is based as well, and we have some subsidiary teams based out in Solihull and in Hammersmith. We've been going through 20% or 30% growth year on year for the last five years, and that's presented us with a number of challenges: scaling customers, scaling tech, but also scaling teams. And I don't think we could've done it without DevOps and without agile adoption. I think it really has made that happen.

Our revenues last year were £670 million, which is quite a lot. But it's not nearly as much as the amount we were bought for last year by The Stars Group. We were bought for 4.7 billion Canadian dollars. Being an engineer, I'm quite good with numbers. I know my megabytes from my gigabytes, my petabytes from my terabytes. But trying to understand how much money that actually was, a little tricky. So I did a calculation, and I worked out for 4.7 billion Canadian dollars, I can take us all for lunch at Nando's, main and two sides, not scrimping on that, every day for 350 years.

I don't think I'm going to get that expense claim past the boss, so it's a nice idea, but it gives you the idea of the success we've been going through. So we're now part of The Stars Group. We are based in the UK. We have a sister company now in Australia called BetEasy, and The Stars Group have products and licenses across the globe. So it's a really exciting time to be part of the business.

But we weren't always so big. We weren't always so successful. If we go back to the beginning of our DevOps transformation at around 2011, we were very different. We had a couple of delivery teams working in agile ways, and we had an infrastructure team working in a waterfall way. They were concerned with data centers, virtual machines, network switches, routers, storage arrays, and the collaboration between those two teams was somewhat difficult. So even though it was an anti-pattern at the time in 2011, we created a DevOps team and filled it full of DevOps engineers.

Now, originally, they worked as a middleman to help with the collaboration, but they soon became a release team, and then a pipeline team, and then an automation team, and then an infrastructure-as-code team. And if we whiz forward to 2016, things are very, very different. That centralized DevOps team, that's gone. We've adopted the Spotify Tribes model, so we have tribes and we have squads. Some of the tribes have dedicated operation squads, but some of them are experimenting with cross-functional, putting operations engineers into squads. Very much depends. The level of adoption is very different across the business.

But I think the important thing on this slide isn't that it's called squads and tribes. It's we have autonomous teams doing agile software delivery. That's the important point here. And if we whiz forward to 2019, I've listed out some of the tribes in technology. That diagram is far too complicated to draw these days. These are the technology tribes. We have other tribes as well to do with customer experience, fraud and risk, finance, marketing, people experience. But these are our main technical ones, and you'll see that some of them are organized by function and some of them are organized by product. They're autonomous. They can choose how they do stuff.

So what I wanted to do today was explain how we've done that. And I wanted to start with a quote, and I really like this one. It's by a Greek philosopher, so it's obviously clever. But it's this idea of kaizen, that organizations work and learn from what they do and improve all the time. But that wasn't all I wanted to get across today. I wanted to get across this. The fact that we see the same problems repeating again and again. We see the same answers being applicable. It is a case of history repeating. Now, this quote you will all know and you will all recognize. It's from the great Swedish philosophers, ABBA. Who knew ABBA were DevOps visionaries back in 1974?

So today, what I want to explain to you is a couple of patterns. One of them is concerned with going from an idea through to a product. The other one is how we take teams on that journey. And when I'm talking about a pattern, what I actually mean is a collection of best practices, collection of principles, not a process, not a task list. These are guidance.

But for them to really work, it really helps if you've done eight years' worth of DevOps transformation already. So we have autonomous teams doing agile software delivery. We have had the difficult conversations with the developers about being on call. They now build it, they run it, they get up in the middle of the night when it cries. We've had the mindset shift from projects to products, and we've developed a culture of experimentation and learning where it's safe to fail. And that's really important because if you've not got that, then experimenting as you build the right thing can be very difficult.

So these patterns may help you on your journey to get to this point if you're not there already. But the good news is you're at the best conference on the planet to learn how to do this stuff. So make sure you go to those experience reports.

I'm going to cover off the product pattern first of all, and this works best with new products. So I'm not talking about feature stories or user stories that go into an existing work stream. I'm talking about the big initiatives, and typically, they're not products that affect our end customers, our end users. These can be internal teams. So an idea of these sort of products are: let's build out a data analytics function; let's build a new CRM system; let's build a container hosting platform.

So let's have a look at it in action. So it has four stages. We start with the proof of concept, the PoC. And for this to work, we obviously need a reason for building it. So we have stakeholders, and they will identify an objective: a capacity issue maybe, a customer problem we want to solve, or even a hypothesis or an idea, proving out that something will work or sometimes that it won't work.

So we assemble a small team, and we go around the iterative process of plan, do, study, act. But it's okay if at any point we realize we're failing with what we're doing. We can stop the project. Failing early is good, but also, so long as we're learning and doing it, that's okay.

So at the end of that, you will come out with your proof of concept. You will have proved the concept, and you'll present it or the findings to your stakeholders. And at this stage, they can get quite excited, especially if it's going to save money. They get very excited.

So DevOps Enterprise Summit, I need you to promise me something, a proper promise, a pinky swear promise, a real one. If you are faced with the situation that you have made a proof of concept and you present it to your stakeholders and they get excited, under no circumstances put it straight to live. Don't do that. If you do that, next time I see you, I will be very disappointed. And I have an incredibly good memory for faces. Okay. So don't do that.

Instead, take it into the next stage. Take it into MVP, minimum viable product. And we can start with our PoC, or we can start with the learnings from the PoC if it's not suitable foundations. And we introduce some extra people into this stage of the pattern. We bring in a customer, first of all. We're going to work with that customer and build the product for them. If you don't have a customer for your product, you probably should be asking the question, why are you even building it?

So we work with them going through the iterative process, fast feedback loops, getting their feature requirements, putting in our own. But we also introduced two other groups into this stage. We introduced service lifecycle management, and we bring in security. The idea with this is we build just enough non-functional requirements that our service lifecycle team are happy it runs in live. Minimal viable operability is something I invented sat over there earlier. You can use that if you like.

And we also work with our security teams to make sure what we're doing and how we're handling data and what data we've got going through the system is handled in a secure way, but also that we follow all our compliance. We're a heavily regulated industry, and we need to make sure that we follow those rules. As we go through this process of building out the MVP with some NFRs and some feature requirements, then at any point, again, we can fail. We can change our mind. We can stop. Or we could even go back to the PoC if we've realized that we've made a mistake there.

At the end of it, we should have our MVP. We then move on to the next stage, which is MMP, which is minimum marketable product, which is slightly different. At this stage, hopefully, your proof of concept and your MVP has got other customers in your organization interested and wanting to use the product or be involved in its development. Equally, we may actually go out and market the product so that we are finding new customers to bring on board in this journey. So we're getting more functional requirements. We also need, in this stage, as we approach a final product, to build out all the non-functional requirements as well, because tech debt as a service is bad.

So we're building out our product. Again, we can fail if we have to, if we need to, if it's not the right thing to do. And at the end of this, we will have a product, which is great. A fully featured, fully operable product which we can run in a live environment.
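
The four stages and the moves between them, as described above, can be sketched as a small state machine. This is only an illustrative sketch, not tooling used at Sky Betting & Gaming: the names `Stage`, `ALLOWED` and `advance` are assumptions, and the transitions encode only what the talk states (move forward one stage at a time, stop at any point, go back from MVP to PoC, and never jump from PoC straight to live).

```python
from enum import Enum


class Stage(Enum):
    POC = "proof of concept"
    MVP = "minimum viable product"
    MMP = "minimum marketable product"
    PRODUCT = "product"
    STOPPED = "stopped"


# Hypothetical transition table based on the talk: each stage may move
# forward by one step or be stopped; MVP may also fall back to the PoC.
ALLOWED = {
    Stage.POC: {Stage.MVP, Stage.STOPPED},
    Stage.MVP: {Stage.MMP, Stage.POC, Stage.STOPPED},
    Stage.MMP: {Stage.PRODUCT, Stage.STOPPED},
    Stage.PRODUCT: set(),
    Stage.STOPPED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the target stage if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

With this table, `advance(Stage.POC, Stage.PRODUCT)` raises a `ValueError`, which mirrors the promise asked of the audience: a proof of concept does not go straight to live.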

To do that, we're going to need to build the product with people. We need a team around it. Now, it could go into an existing team if there's synergy with the products that they own and run, or if they've got the right domain knowledge. But typically, we'll start with a new team on the big initiatives. That'll be a small team that's put together to form something we call the pilot team. And this goes hand in hand with the POC stage of the product pattern.

And this team are going to be a bunch of specialists that have the right domain skills to work on this product, and they're going to go on the journey through MVP, through to MMP, and eventually to product, collecting all the domain knowledge and ownership as they go. Now, ownership at this point can be a little bit vague. We're at very early stages. Maybe a stakeholder is acting as a product owner. Ideally, it's someone in the team because you want the ownership and the responsibility in the team. Or we may bring in a product owner into the team at MVP or MMP stage, but ideally, they're in the team.

We build the team out as we move into MVP. We add in generalist developers. We don't just want those specialists. We want all the skill sets necessary to provide a cross-functional team that can build out this system. And as we go through our journey with this, we will bring in our customer, we will bring in our security, and we'll bring in service lifecycle management, as I previously described.

As we hit the MMP stage, we're going to market this product to other customers. So it's the responsibility of this team to go out and sell the product to the rest of the business. It's not just down to the stakeholders. So by the end of this, we've built all our non-functional requirements, we've built all our functional requirements, and we're defining our service boundaries as we go along, and the responsibilities of how we would operate this product in production, because that needs to be done as part of the build of the product.

Ultimately, when the product is at a finished state, we might move it into this idea we have as a product as a service. We might move it to a service team to run because we're not necessarily building out new features at this stage. We're running the product, maybe doing some enhancements to it. But it moves to a service team. Now, when we move the product to a service team, what we also need to do is move some of that development team so they can act as subject matter experts in that team and take the knowledge of the journey it's been on with them. The rest of the development team may disperse amongst the rest of the customers which have been consuming the product to act as subject matter experts there.

We should have a look at an example of this. So in September of 2016, this document was created on our intranet, and it describes the kickoff of what became our container hosting platform. And it has a very good why and what we should do. For those at the back, I'll just read out a couple of salient points. This is intended to deliver the next iteration of the Bet Tribe hosting platform, supporting the platform's fit-for-growth branch of the lean value tree. Outside of the squad, there should be no humans involved in the value stream between feature and customer. So we've got a really strong why and what.

So to actually start the POC phase of this product, we need a team, our pilot team. And we wanted to build this out in cloud. Now, at the time, there wasn't a great deal of cloud development skills inside the organization because of our heavy regulation. We tended to build stuff on-prem in our data centers. But luckily, we did have some experience of building out in cloud. And the team we looked at had just finished building out this notification system that updates all the prices and scores in the mobile site and the website in real time. There's a lot of traffic goes through it, therefore it needs to scale on demand. So the team that built that, we took some of the team and formed our pilot team around that because they had the correct domain skills.

And the POC stage that this pilot team built was pretty much a technology selection piece, identifying which technology choice to make to fulfill the objectives that we've just spoken about. As we move to MVP stage, we expand the team, and we bring in our first customer. In this example, we brought in our trading squad. They wanted to build out their trading models using container technology. We wanted to host containers. Good match. So we worked with them to build out their requirements along with our requirements as well. We brought in IT service management, and we brought in security to make sure that what we were building was suitable to run in live because the MVP would go in a live environment at some point. And we also worked with the security team to make sure we were compliant and we were handling everything in a secure manner.

We did have a problem, though. At one point during the development of the MVP, the stakeholders changed their mind a bit. They wanted an on-prem version as well in our data centers. So there was some refactoring done. So we were building out for both environments. But because we had this flexible, iterative approach, that meant we could do that.

As we move to MMP, we're going to need more customers. Now, at this point, a lot of other teams were very interested in the work we were doing, so actually finding new customers wasn't too hard. Having said that, we still went out and marketed this product to the rest of the organization. So far, we've run a dozen workshops with about 10 developers in each workshop. So a large percentage of the 700 technical team at Sky Betting & Gaming have had hands-on experience of our platform and know its capabilities and its limitations as well.

So at this stage, we are building out our non-functional requirements, our functional requirements, getting it to a state where all the features are built. We're also defining our service boundaries so that we're not having to have difficult conversations when we actually get to live with this and finished. And eventually, we will move it to this product-as-a-service idea. We're not at that stage yet, and it's highly possible that given the fast-moving nature of container technologies, we won't for the foreseeable future. We may even end up always being a squad that enhances and runs this product. But if we did move it to another service team, we would move it to our platform services squad and take some of our development team into that squad to be the subject matter experts there.

So to recap on what we've just seen there, we started our journey with our POC and our pilot. We moved it through the MVP and the MMP, expanding out the team, and we have the option to take it into a service team as a product as a service. With a number of services running live, we have our trading models, who are our original customer. We have some promotional work, we have some feed ingestions, and we have the original notification system that the team we formed the pilots from worked in. It turns out that level of scale on demand works much better in containers than it did on virtual machines. So we've reduced the hosting cost for that particular product by 60%.

Now, I just want to cover off something that isn't a pattern, but it's something about our ways of working and one particular tool that we built that's really helped. We're predominantly agile at Sky Betting & Gaming. A very different mix of Kanban and Scrum. Some teams are very advanced. Some teams have all the ceremonies. Some teams don't. Some teams keep it simple. They're autonomous. It's up to them. We've also now got teams outside of our technology tribes wanting to do agile. Our people experience team, or HR as you might call them, are wanting to try out some of these methods.

And one of the smart things we did was we consolidated all our ticketing systems down to one. So our service lifecycle management, our security, and our development tickets are all in one system, so that makes it much easier to manage.

Tickets are great, but when tickets get out of hand, they can be a problem. And at the beginning of 2017, we were starting to have a big problem with tickets, particularly the unplanned work. So not the ones we were doing as feature stories, the ones that we didn't know were coming. And they were starting to block planned work. They were starting to become a blocker on getting work done, and collaborating across tribe and across squad was a real battle. There was a lot of effort spent moving tickets around and reassigning them when they didn't need to be.

So we were very heavy users of Slack, and we had this culture of really communicating with each other by it. And what typically happened when we wanted to get something unblocked was this. We would phone a friend and ask for help. Now, that's great. That gets the important stuff unblocked, but it messes with all your priorities and it means that the flow of tickets around the business is starting to get broken.

So one of our platform services developers used their L&D time, their R&D time, their 10%, their Friday afternoon to experiment with a chatbot. We called it Monkeybot in homage to our friends at Netflix. And what he wanted to do was create a bot that would work from within our chat system, create tickets, allow metadata to be tagged against them, track time, and then produce metrics at the end because a lot of the teams hadn't really got accurate metrics on unplanned work.

And this is what they built. So this is Philip, our engineer. He's got a problem with one of the build pipelines. So he goes onto the support channel for that particular instance and types @Monkeybot help and describes the problem. The Monkeybot springs to life. It creates a ticket in our ticketing system. It also talks to the PagerDuty API, finds the support engineer that's rota'd on, and gives them a nudge as well.

So David, our support engineer, dives on and starts looking at the problem. Immediately, he can see which system component it is and tags it inside the chat transcript and then investigates the problem, finds out the rather embarrassing error that David's made, or Philip's made, and logs some time against that. So within this whole chat transcript, we've handled a piece of unplanned work, we've recorded what it's about, and we've also recorded how long it takes. And we can also close the ticket. And that's all happened inside chat.
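
The chat flow just described (ask for help, create a ticket, notify whoever is on call, tag the service, log time, close) can be sketched in a few lines. This is a hypothetical in-memory stand-in, not Monkeybot's actual implementation: the class and method names, the service tag and the minutes are all assumptions, and a real bot would call the chat, ticketing and PagerDuty APIs where this sketch only stores values.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Ticket:
    requester: str
    description: str
    assignee: str
    service: Optional[str] = None
    minutes_logged: int = 0
    closed: bool = False
    transcript: List[str] = field(default_factory=list)


class SupportBot:
    """In-memory sketch of a chat support bot (hypothetical, no real APIs)."""

    def __init__(self, on_call: str) -> None:
        # Assumption: the on-call engineer is passed in directly; the talk
        # says the real bot looks this up through PagerDuty.
        self.on_call = on_call
        self.tickets: List[Ticket] = []

    def help(self, requester: str, description: str) -> Ticket:
        """Create a ticket from a help request and assign it to the on-call engineer."""
        ticket = Ticket(requester, description, assignee=self.on_call)
        ticket.transcript.append(f"{requester}: {description}")
        self.tickets.append(ticket)
        return ticket

    def tag(self, ticket: Ticket, service: str) -> None:
        ticket.service = service

    def log_time(self, ticket: Ticket, minutes: int) -> None:
        ticket.minutes_logged += minutes

    def close(self, ticket: Ticket) -> None:
        ticket.closed = True


# Usage: the exchange between Philip and David, with made-up details.
bot = SupportBot(on_call="David")
ticket = bot.help("Philip", "build pipeline is failing")
bot.tag(ticket, "build-pipeline")
bot.log_time(ticket, 15)
bot.close(ticket)
```

The point of the design is that every step happens through the bot, so the ticket ends up with a requester, an assignee, a service tag and a time total without anyone opening the ticketing system.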

No one's been into the ticketing system, but we've got a ticket with a full transcript, tagged service, and amount of time. And that gives us some really powerful metrics for targeting unplanned work. We can see which services are getting the most tickets and the most time associated with support on those tickets. We can also see who's requesting that so that when we're doing our plans for our next sprint or we're reviewing our work backlog, we can actually target planned work at reducing those tickets by applying our efforts in those particular areas. And that's so much more powerful than not knowing really how much unplanned work you had.
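
As a rough illustration of the kind of metric this makes possible, the sketch below totals ticket counts and logged time per service and ranks the services by time spent. The ticket data, the requester names and the function names are all hypothetical; the talk does not describe how the real reports are produced.

```python
from collections import defaultdict

# Hypothetical closed tickets: (service tag, requester, minutes logged).
tickets = [
    ("build-pipeline", "requester-a", 15),
    ("build-pipeline", "requester-b", 30),
    ("dns", "requester-a", 10),
]


def unplanned_work_by_service(tickets):
    """Return {service: (ticket_count, total_minutes)}."""
    totals = defaultdict(lambda: [0, 0])
    for service, _requester, minutes in tickets:
        totals[service][0] += 1
        totals[service][1] += minutes
    return {service: tuple(values) for service, values in totals.items()}


def rank_by_time(tickets):
    """Return (service, (count, minutes)) pairs, most support time first."""
    summary = unplanned_work_by_service(tickets)
    return sorted(summary.items(), key=lambda item: item[1][1], reverse=True)
```

A ranking like this is what lets a team point planned work at the services generating the most unplanned work when reviewing the backlog.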

And this has become the de facto way across the business now to ask for support requests. Yes, you can still go into the ticketing system and raise a ticket, but it's so much easier to do it and it fits our culture of how we operate with one another. And it's really made a difference to the amount of velocity in the business and what it's achieved. But it matches our culture, and that's the thing. If you don't have a chat culture, it's probably not going to work, but it works for us because we in tech are really heavily bought into that.

So to summarize today, we've seen two patterns. One of them is about taking product from an idea through to a finished product and the various stages we go through. We've seen how we take teams along that journey. So going from a pilot team on the POC through to running a final product as a service. And we've looked at our evolution over the last eight years and seen a little tool called Monkeybot, which helps us manage unplanned work.

I would estimate over the last five years because of the way we are able to change what we build with our products and how we manage them, the products that have made it through to completion, which is probably about half of the ones which were started, may not be the product they were intended to be at the beginning. But because of the way we can change the question as well as the answer as we go, that's enabled us to do that. And the other 50% of products that didn't make it, well, because we've got this culture of experimentation, it was probably right to not do those products rather than carry on blindly hoping that everything would be okay.

And my question for help from you all (there's my Twitter handle on the slides there, please ask, or I will be in the speakers' corner later) is this. I personally really value collaboration, empathy and experimentation within an organization as much as I value delivery of story points and team velocity. But I want some ideas from you as to how you can measure that, how you can demonstrate a business value around those three items. I think it's really important and I'd love to hear your ideas on it. So please reach out to me.

And personally, I would spend the rest of the day stood here talking to you but I think there's some other speakers coming. So with that I will say goodbye. Thank you all for listening. Thank you for laughing at the jokes. Thank you. Bye.