DevOps in a Data Warehouse: Inside Out

London 2017

Lead DevOps Engineer · Sky Betting & Gaming

Your organisation may be a high-flying, technology-driven company, providing highly available, reliable services to your customers. You may be mature in your DevOps adoption; maybe your local tech press refers to you as a unicorn. But somewhere in that organisation there may be some pockets of resistance where the DevOps adoption didn't reach.

This is a story of how DevOps was brought into one such pocket, not by a top-down strategic move but by a team of frustrated engineers who could see the benefits the rest of the company was enjoying. We proved that even in a non-traditional DevOps environment such as Data Warehousing and CRM it could still happen. This is how we did it.

Full transcript

Andy Burgin

So this talk is probably a little different to many others you've seen over the last couple of days. It's not about web-scale stuff. It's about data warehousing, the super sexy world of data warehousing.

And it's different in the sense that I've called it data warehousing inside out. I perhaps mean bottom up. It's a talk about grassroots DevOps. It's not about consultants or CTOs or CFOs or management, because it's about a team of engineers changing things.

I'll introduce myself. Hello, I'm Andy. Thank you all for coming. I'm Lead DevOps Engineer in the Data Tribe at Sky Betting & Gaming. That is my Twitter handle if you would like to follow me and ask questions or troll me. I don't mind. I like the attention.

And don't let these youthful good looks fool you. I'm actually middle-aged. Yeah, it was a shock to me as well. But it means I've done a lot of stuff in my career. I've done a lot of desktop work, originally some mainframe stuff, lots of development languages, web stuff, front-end, back-end, lots of platforms, and more recently, infrastructure as code and data warehousing in Hadoop.

That's what I do. But Sky Bet offer a range of products, probably best known for their sports betting offering, because we are an online bookmaker. That's what we do. We don't have any shops. We're purely online, which means if we're not online, we're not making any money. And where it's a bit different is if we are online, sometimes we even lose money. But that's the gambling industry for you.

Best known for our sports betting product, so that is betting on all the top sports: football, horse racing, greyhounds, golf, tennis, Britain's Got Talent, elections, all the top sports. And we've got some gaming products as well. So we've got our slot machines, our roulette, our poker, our bingo, and some free-to-play games as well, and our odds comparison site.

So that's what Sky Bet do. And we've been doing the DevOps thing for quite a few years now. We're not the earliest of adopters, but we're certainly no laggards.

We've had a DevOps function, originally as a DevOps team, since about 2011. Originally, they were put in to act as a bit of an interface between platform and infrastructure and the dev teams. Later, they moved on to being more of a release team and the tooling, CI pipelines, etc.

And as we get more up to date, we've adopted the Spotify Tribes model. Now, I don't want you to get hung up on the fact we're calling them tribes and squads. The important point here is we've got autonomous software delivery teams delivering software in an agile manner. That's the important bit. But you'll notice our DevOps engineers are now sometimes in squads inside the tribes, but also some of them are actually integrated into the squads as well. So it's kind of how it's evolved, and it's the right model for the right use case, if that makes sense.

And DevOps has certainly had a massive impact on the business. We've grown sort of 30% year on year for the last few years. A lot of that is attributed to the technology, the agile approach, certainly DevOps. And we've got sites in Leeds and Sheffield now. Our local tech media refers to us as a unicorn, which I think is hilarious. I don't know about you. I don't think we're what you call a unicorn, but...

So, I'll now talk about the Data Tribe, which is the bit where I work, and this is where the warehousing comes in.

To call us a tribe back then would be an exaggeration. We were a couple of devs and a couple of CRM and Insights people looking after a data warehousing product supplied by our then parent company, Sky TV. We're now owned by a VC company. And we basically outgrew it.

So by 2014, we've now got our own data warehousing solution. So this is a Hadoop cluster put together by a dev team with some testers, and you'll notice there's a DevOps engineer here as well, working alongside the devs and testers and also working with CRM and Insights. So they're kind of our customer.

The way it basically works is we have a number of data sources on the far left. We ingest from those into our Hadoop cluster. We analyze the data. We enrich it, and that's then consumed by our CRM team and our Insights team.

Now, this BI and this MI, as we call it, has given us a load of competitive advantage because we had a load of information which our competitors didn't. We had lots of information about liabilities. We knew what the impact was of certain football results depending on what the score was. We also were getting really good at marketing to our customers through the segmentation and analysis we were doing on customer activity. So we were getting some real wins.

And the company said, "That's brilliant. Can we have more, please?"

And what they meant by more was more of this business stuff. They wanted to prevent churn, better segmentation, better personalization, and some real-time stuff would be nice as well. So they said, "Okay, we want you to do this."

To be fair to them, they did scale up the team as well. So we moved from a single development squad to four. The original one was still there working on the cluster and doing the ingestion pipelines, but we now had a spinoff squad, which was like a research facility using the newer modern Hadoop technologies to ingest stuff faster and quicker and better.

We've got a promotions team which is looking at taking the information we've got in our cluster and pushing it out to the actual site so we can do more targeted promotion. And we've also got a data science function, so building models around user behavior, understanding how we can target them and segment them better, market better to them, but also some fraud and risk analysis as well.

And let's not forget the newly formed DatOps team, as it became known. And this is where I come in. Around the beginning of 2016, I joined that team. I'd been at Sky Bet for a few years by then.

So the Data DevOps squad, or DatOps as we're called, these are four engineers put together, and they weren't just operations people. They had sysadmin skills. They knew some Hadoop administration; some of them knew a lot. But we were put there because of our infrastructure as code skills, our ability to code with Chef and write tooling in Python. We were more a dev-in-ops team rather than an ops-doing-dev team, and that's kind of an important distinction.

So we come from this developer background, and this was the to-do list they asked us to do, which is great. Lots of cool stuff: improving the platform, improving the reliability of it, getting rid of all the kind of snowflake stuff that had been created over the years. Really going back to infrastructure as code stuff, getting some live-like test environments set up because we were kind of multi-tenanted on this cluster. But, of course, supporting the development teams. We have to help them with their tooling. There's lots of new teams. They're going to want lots of CI and lots of help with that.

So we're charged with all that. So we get going at the beginning of 2016, and it's not really, really good. We hit a load of problems, and it was making the team sad.

And when I say sad, this was really why. So if you look at our organization, we were kind of spending our day not doing the backlog. We were spending our day dealing with tickets from our development squads, and we didn't really know why.

But we helped them out anyway because we're a new team. We don't want to be one of those ops teams, the one that says no. We're trying to do the DevOps thing. We're trying to be collaborative. We're trying to help. So we start helping, and of course, we're into a vicious circle then. The more you help, the more you get expected to help, and the more stuff you get.

So the problem here is the devs aren't happy, so the ops aren't happy because we're not getting our backlog done, and all we're doing is troubleshooting and problem-solving. We're not actually using the skills we're employed for. We're not writing any code. This is why we're sad.

So what are we going to do about this? We've got a couple of options. And what we decided to do was we would go back and we would relearn some of the DevOps principles. Some of us knew them really well, some of them not so well.

So let's look at the classic DevOps scenario. We've got the team that's charged with doing features, the team that looks after stability and is charged with uptime, and the wall between them where there's no communication and bad handoffs and terrible feedback loops and a big war and a squabble between the two. We should probably all know that, being at a DevOps conference, but I'll mention it there just in case this stuff's new to you.

So that wasn't really our problem. We got along with our dev teams. We went for lunch with them. We went for beers. We were quite matey with them.

So we learned some other stuff. We started listening to a load of podcasts, so there's a couple there. We went to conferences. We came to this last year. We went to DevOpsDays, and watched a load of the videos that are online, learning this stuff. And we go to meetups. I happen to run the one in Leeds. If you're ever there, do pop in and say hello. We have some really good talks. But there's loads in London, and if you're outside London, there are loads more. Google is your friend on that.

But as well as that, we did some reading. So we read The Phoenix Project, which we should all do. Even if we don't like it, you should still read it. And we read Next Gen DevOps by Grant Smith. We would have read The DevOps Handbook, but it wasn't published at that point. We also would have read Effective DevOps by Jennifer Davis and Katherine Daniels, but again, that wasn't out at the time. But we would have read them if they were out. They are particularly good books.

And from all this reading, we basically came up with a short list of things that we thought we should look at. Some of them are really basic principles, some of them are a bit more advanced. And this is where we started looking for answers.

So if we look at the Three Ways now, hopefully, I'll not go over this in very much detail because you probably all know it. But just for brevity, we've got three ways. We've got getting flow going from the, "I've got a really good idea that will make us money," to here, making us money. Okay? Getting that flow going, removing the blockers across there and getting the work across.

The second way: amplifying feedback loops. If there is a problem, people downstream know about it, and we can do something to fix it and prevent it from happening again.

And the third way, which is this culture of experimentation in a learning organization. So particularly one that isn't satisfied with how things are, even if it works. They're always trying things out and experimenting to see if they can make things even faster than they are, even if they're not broke.

Okay, so that's the Three Ways. That was quick enough, wasn't it?

And we really couldn't see that this was going to help until we thought about it a bit more. We realized that we were the blocker, and we didn't know why. Because the work wasn't getting done because we weren't doing it, because we were doing other stuff. We obviously didn't have enough bandwidth for it, but we didn't understand why.

And when we looked back at the previous incarnation of the Data Tribe, and we thought back to where they'd come from, and we thought about that single dev team, we realized that this DevOps engineer here was a bit of a rock star. He was a really clever guy. He's actually a solutions architect now, but he was the DevOps engineer at the time. And he basically did all the stuff. He fixed all the things. He worked out how to do all the tricky stuff.

So this team here began to rely on him. So this idea of this martial arts kata, if you've heard of that in all the Japanese references to DevOps, the muscle memory, the way they react to problems, they'd kind of trained themselves to rely on that guy. And when they scaled up, they relied on the new team. So we'd worked out, at least from the Three Ways, what the problem was.

So the next thing we looked at was projects versus products. Now, these next two slides are stolen blatantly from Damon Edwards. Damon was meant to be here. He's unfortunately not able to make it. So I put a reference to a video with him talking about this because he'll do a much better job than I ever can.

But the basic theory is there is a problem with projects. So where you have functional teams, where you group your devs, your testers and your ops people by function, as you have the handoffs and the feedback between these teams, knowledge gets lost. So ultimately, the ops people don't really know why the code is there and what business value it gives. And your dev team doesn't necessarily know how it works in production.

This is bad, particularly on a project which has the final milestone as go live. And then all these people go away and the ops people are left with it. That's not a particularly optimum business solution.

So Damon explains that if you have cross-functional teams, and this has already been covered today, but if you have cross-functional teams that have responsibility for certain products, first of all, all that knowledge is shared because they're all in the same meetings, they're all working on the same stuff.

But if you, say, allocate this team the shopping cart, this one the stock inventory, this one the mobile application, they have ownership of those products. So they can obviously fix them if there are problems, but they can enhance them. The products have a lifespan. They live on after they've gone live. They're not just this temporary thing that a project created and then expected to live forever. They're actually living things.

And by doing this and allocating them to specific teams, they get responsibility. And responsibility's great because it gives this idea of autonomy, mastery, and purpose, and it gives people something to work on and ownership, and that's very fulfilling.

So we thought there must be something in this. But when we thought about products, did we actually mean services? Because we're not necessarily making something and handing it over and letting them run it. We're actually running it for them.

And when we thought about it in terms of the more traditional layered approach, we realized that our infrastructure team was supplying us with tin up to OS. We were then going from OS to working Hadoop. And then sort of this application or data layer, we're writing applications that produce data which was then consumed by the SaaS at the top, by the CRM platform.

And if we thought about operational responsibilities in that model, we realized that we were doing this and we understood our non-functional requirements. We were running this thing. But all these non-functional requirements that go along with running an application or owning an application, these were the things that were coming to us. These were the problems. We were being asked to do operational stuff for the application layer, effectively.

So we needed to find a way to get to the "you build it, you run it" kind of mentality, but without being seen to be shirking responsibility or being Teflon or dodging it, and certainly not being a team that says no and destroying the good working relationship we've got. We're still trying to do the collaborative DevOps thing.

So what we decided to do was try and drive this change in ownership or responsibility through providing different services to what we were doing.

We came up with these four E's, really, which are... They're a two-way thing. They're really about encouraging and making the team embrace these new responsibilities, supporting them in doing that, but also making sure we create two-way trust so that they're comfortable with it and we know they're going to be able to run with it, and hopefully they don't think we're jerks when we've finished doing it as well.

And these were the kind of services that we decided would facilitate that. So we already had our platform. That's our physical Hadoop thing. We're already doing that. And we were Chefing that as part of the work we were doing.

We then had the testing environments we'd been asked to do in AWS. So we were going to reuse that Chef code to create live-like test environments in AWS. But what we thought as well is if we built self-serving tooling around that to allow developers to spin up their own environments, that would empower them to do that, and then they could just be responsible for what happened in that environment.

And we use a heck of a lot of Jenkins, not just for CI, but for actual job scheduling. And we realized that a lot of the teams had pretty much the same install of Jenkins with some customization. So why don't we take that and build that as a service or a product and allow them to run their own Jenkins? And we did a lot of work around that. I'll explain it later.

And finally, the "Can we just have Docker, please?" which seems to be a recurring question we get asked. We decided we needed to stop container sprawl and not just install it everywhere and look at putting a platform in place. So we were actually containing the container sprawl. That's not a paradox.

So moving on to the four types of work. Now, if you haven't read The Phoenix Project, this bit's a spoiler. Sorry, you can stick your fingers in your ears or look away.

But apparently, there are four types of work. So there are business projects. These are the things we've been asked to do. That's our to-do list that we were given by the business at the beginning.

There's internal IT stuff. So this is stuff we need to do to keep the lights on, like patch and test our backup and upgrade our monitoring systems, things like that. These are all things that need doing.

And we work with service lifecycle management, so we are ITIL-based. We generate changes off the back of that. They're still work, they still need doing. Anything that touches a live service needs a change associated with it. So there are changes.

But The Phoenix Project teaches us about the fourth type of work, the unplanned work. And this is the killer, because it's described as toxic to the other three. If you have unplanned work, you can't work on the other three. It smothers it. It stops you being able to get on with what you need to get on to. And this was certainly the thing we were experiencing.

What The Phoenix Project also says is you must do everything when you're doing the top three to reduce the bottom one, because otherwise it's going to keep happening and it'll get worse. So when you're thinking about what you're doing with your business projects, your internal IT work, your changes, it should have an effect on the unplanned work.
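A minimal sketch of how the four types of work could be tallied to see how much of the load is unplanned. The ticket summaries, the `WorkType` names and the `unplanned_share` helper are all hypothetical; the talk does not describe how the team actually categorised its tickets.

```python
from collections import Counter
from enum import Enum


class WorkType(Enum):
    """The four types of work described in The Phoenix Project."""
    BUSINESS_PROJECT = "business project"
    INTERNAL_IT = "internal IT"
    CHANGE = "change"
    UNPLANNED = "unplanned"


def unplanned_share(tickets):
    """Return the fraction of tickets that are unplanned work (0.0 if empty)."""
    counts = Counter(work_type for _, work_type in tickets)
    total = sum(counts.values())
    return counts[WorkType.UNPLANNED] / total if total else 0.0


# Hypothetical ticket list: (summary, work type)
tickets = [
    ("Build live-like test environment", WorkType.BUSINESS_PROJECT),
    ("Upgrade monitoring", WorkType.INTERNAL_IT),
    ("Change record for cluster patch", WorkType.CHANGE),
    ("Dev squad job failing, please look", WorkType.UNPLANNED),
    ("Ad hoc troubleshooting request", WorkType.UNPLANNED),
]

print(f"{unplanned_share(tickets):.0%} of tickets are unplanned")
```

Tracking that share over time is one way to check whether work on the other three types is actually pushing the unplanned work down.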

Okay. Now, we were working with Kanban at the time. It was working for us. We were very disciplined, actually. We made sure that everything got ticketed up. Everything went through the board. We didn't have this black market of donuts, bribery, and stuff like that. Stuff actually got logged.

And if we look at how our development teams were working, we thought we might find a better way of working than just the Kanban.

So this is our dev team having a standup. These are some of the boards. I'll just show them in more detail.

So this is one of the development team's boards. This bottom area is backlog refinement. So that's BA and product owner territory to elaborate on the tickets ready for the sprint. They then work in two-week sprints. So they have prioritization, planning, story pointing. They do the work, then they have retros and look at the metrics. Okay, so two-week sprint, agile delivery with a scrum master to help.

We decided that we should keep our Kanban board, which is the bottom area here, and above it basically copy what they've done. So again, the same functions, the same scrum master to help us do this. And we put all the business stuff through that top board. Most of the internal stuff that wasn't just short little things, but was actual work, we put through there as well, and we put the unplanned work through the Kanban board.

And by doing that, we were tackling the problem of the unplanned work head on. All those tickets, those requests coming in, we were putting there. And we realized there were four of us, so we rotated someone onto this bottom board to work on those tickets for a week and then rotate off.

Now, you can look at that one of two ways. You could say, for one week of the month I get to work on that and I've got to triage it and I've got to deal with the incoming problems every day and it's a bit disruptive. Or for three weeks of the month, I get to work on the top board, I get to build product, I get to write code using the skills that I was employed for, and I get ownership of the product as well.
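The rotation can be sketched as a simple weekly round-robin. This is an illustration only: the engineer names and the `kanban_rota` helper are hypothetical, and the talk does not say how the rota was actually scheduled.

```python
from itertools import cycle, islice


def kanban_rota(engineers, weeks):
    """Return (week_number, on_kanban, on_platform) tuples, one per week.

    Exactly one engineer covers the Kanban (ticket) board each week;
    everyone else works on the platform board.
    """
    rota = []
    for week, on_kanban in enumerate(islice(cycle(engineers), weeks), start=1):
        on_platform = [e for e in engineers if e != on_kanban]
        rota.append((week, on_kanban, on_platform))
    return rota


# Hypothetical four-person team, as in the talk
team = ["engineer_a", "engineer_b", "engineer_c", "engineer_d"]

for week, kanban, platform in kanban_rota(team, 4):
    print(f"week {week}: Kanban={kanban}, platform={', '.join(platform)}")
```

With four engineers, each person spends one week in four on tickets and three on the platform board, and having only one person on the ticket board at a time caps how much unplanned work can be in flight.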

And that's a game changer. That overnight stopped the whinging. That turned that smiley that wasn't smiling into a smiley.

So let's have a look at some outcomes from all that.

And that's the metrics. Now, what we didn't realize was actually what a boost this would be. This blue area here is the Kanban board. The green area is the platform board. And you'll see as soon as we implemented, we got a load of work done, which we didn't realize we had the capacity to do.

So straight away, we're getting loads of easy wins. There's a bit of a dip here around Christmas. We had people away. And then there's a very odd, interesting shape here where we have loads of Kanban work, but no platform work in about February, March time. I'll explain that in a minute. But ultimately, we were keeping up with the demand on tickets as well. We weren't getting a swelling backlog.

In terms of the services we set out to build, the Hadoop platform got Chef'd. We got rid of all the snowflakes. It's wonderful. It's so much better and it's so much more reliable. That in itself reduces unplanned work.

The cloud platform we tried to roll out, we did the work, we built it, but we've still got compliance issues trying to connect Amazon to data sources in our data centers. We haven't solved that yet. There's loads of people working on it behind the scenes to get that sorted out. We will get there.

In terms of Jenkins, that's gone better than I could ever have hoped. Our development teams are really chuffed now they've got shiny new Jenkins boxes with the latest version of Jenkins. They've got all their new plug-ins on it. We built a load of tooling to help them move jobs between servers, which was open sourced as well. So you can copy jobs between instances, and it's all synced to version control now. So we've got it immutable. So if we wipe out those Jenkins boxes, we can recreate them with the config. Winner.
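The team's open-sourced tooling is not shown in the talk. As a rough sketch of the general idea only, copying a job between instances can be done with Jenkins' standard remote access endpoints: `GET /job/<name>/config.xml` returns a job definition, and `POST /createItem?name=<name>` with that XML creates the job. The hostnames, the job name and the `http_get`/`http_post` callables below are hypothetical; authentication, CSRF crumbs and jobs inside folders are deliberately left out.

```python
from urllib.parse import quote, urlencode


def job_config_url(base_url, job_name):
    """URL that returns a job's config.xml from a Jenkins instance."""
    return f"{base_url.rstrip('/')}/job/{quote(job_name)}/config.xml"


def create_item_url(base_url, job_name):
    """URL that creates a job when a config.xml body is POSTed to it."""
    return f"{base_url.rstrip('/')}/createItem?{urlencode({'name': job_name})}"


def copy_job(job_name, source, target, http_get, http_post):
    """Copy one job's configuration from the source to the target instance.

    http_get(url) -> str and http_post(url, body, headers) are supplied by
    the caller, so credentials and HTTP details stay outside this sketch.
    The fetched XML is returned so it can also be committed to version control.
    """
    config_xml = http_get(job_config_url(source, job_name))
    http_post(
        create_item_url(target, job_name),
        config_xml,
        {"Content-Type": "application/xml"},
    )
    return config_xml


if __name__ == "__main__":
    # Dry run with in-memory stand-ins instead of real HTTP calls.
    sent = []
    copy_job(
        "nightly ingest",
        "https://old-jenkins.example",
        "https://new-jenkins.example",
        http_get=lambda url: "<project/>",
        http_post=lambda url, body, headers: sent.append(url),
    )
    print(sent[0])
```

Passing the HTTP functions in keeps the sketch testable without a live Jenkins, and the returned XML is the piece that would be stored in version control so a wiped box can be rebuilt from it.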

Container platform is work in progress. We've just started to roll that out. That's fairly new. But we delivered everything on our to-do list we were asked to do, and we still managed to manage that unplanned work through the Kanban board. And it kind of acts like a WIP limit by having one person working on it because there's only so much bandwidth, and that kind of regulates it.

We faced some challenges with this. One challenge we didn't have is our organization already got the DevOps thing. It already did agile software development. It already was technology focused. We didn't have to have those conversations, which I suspect many people here are trying to solve.

We didn't realize that the muscle memory and the way people reacted could be like a form of technical debt, which you could amplify up. We thought technical debt was like code and stuff that hadn't been done properly.

But we were very careful when we were working on services that when we were optimizing locally, we thought about the effect on the teams that work with us. Because if you optimize locally and you don't think about that, then you're setting everybody else up for a fall, really.

The spike in Kanban was down to unplanned projects. They barged their way in. It turns out we had a data center migration, Grand National buildup, and PCI auditing all to do in February and March. So that kind of wiped out anything on the platform board because we threw all our resource at the Kanban board.

And it's really easy to revert to being that ticket silo. It nearly happened. We had that curve of adoption and nearly gave up, but we kept going and persevered and we won. And guess what? Grassroots organizational change is still hard from the bottom up.

I have a bunch of thank yous. So first of all, thank you all for coming out and watching today. I was a little worried it was just going to be me and the sound man, but luckily you've all come along. Thank you so much.

This isn't just my story, though. This is the story of the team. I didn't write this. This happened. So I want to just thank the team back in Leeds, present and former members of the team.

If you have questions, I am hanging around, but that is my Twitter handle. And we have an engineering blog which has got a load of cool stuff on it about the technology we use, but also covers all the web-scale stuff that the Bet and Gaming Tribe are doing as well, and they're doing some amazing stuff at scale.

And I was asked by Gene to whack in a question for the audience to see if they could help me. Although, again, we've got this really good DevOps adoption, moving DevOps into the squads, which is something we hope to do to break the team up, we're getting a lot of pushback from the development managers. Not in the sense they don't want help, but they only want those people to do the opsy stuff. They don't want to cross-skill into those T-shaped people.

If anybody's got any thoughts and ideas on that, I'd love to talk to you. And with that, I'm done. So thank you for listening.