DevOps in a Data Warehouse: Inside Out
Your organisation may be a high-flying, technology-driven company, providing highly available, reliable services to your customers. You may be mature in your DevOps adoption; maybe your local tech press refers to you as a unicorn. But somewhere in that organisation there may be some pockets of resistance where the DevOps adoption didn't reach.
This is a story of how DevOps was brought into one such pocket, not by a top-down strategic move but by a team of frustrated engineers who could see the benefits the rest of the company was enjoying. We proved that even in a non-traditional DevOps environment such as data warehousing and CRM, it could still happen. This is how we did it.
Full transcript
Andy Burgin
So this talk is probably going to be a little different to other talks you've seen over the conference.
First of all, it's not about web scale. It's not about transactional infrastructure. It's about data warehousing. And I've described it as "inside out," and what I probably meant to say was "bottom up," because this is really a story about organizational change, but from a grassroots level, by engineers, not by CTOs, not by CIOs, not by consultants.
And I am one of those engineers. So hello, my name is Andy. My day job is Lead DevOps Engineer at Sky Betting & Gaming in the data tribe. That is my Twitter handle. If you would like to ask me any questions, or you just fancy trolling me, that's fine. I like the attention. Please feel free.
And don't let these youthful good looks fool you. I'm actually middle-aged. And that means in my career I've done a lot of things. I've done some mainframe stuff. I've done lots of PC development, lots of Linux and Unix dev, front and back end, lots of databases, lots of systems administration, and more recently, infrastructure as code and big data technology in the shape of Hadoop.
So that's what I do, but what do Sky Betting & Gaming do? Well, you probably haven't heard of us unless you're from the UK, because that's where we're based and that's our primary market. We do have some services in Italy, and we launched in Germany last week. But we are essentially an online gambling company. We don't have shops. We don't have casinos. We are purely an online business.
Amongst our products, we have our sports betting function, where you can bet on all the sporting results: soccer, horse racing, golf, tennis, NHL, NFL, and even The Voice. You can bet on that, if you should wish to. Not that it's really a sport, but that's debatable.
And we also have a bunch of gaming products as well. So we have slot machines, roulette, poker, bingo. We've got some free-to-play games on our odds comparison site as well. So there, that's our portfolio stuff and where we operate.
And we were quite early adopters of DevOps. We've had a DevOps function since around 2011, when we put in a DevOps team. The original intention of that was to act as kind of a middleman between our infrastructure teams and our dev teams, because we were having some communication problems there.
So that evolved over time into more of a release team and a tooling team. And it also means we're now stuck with the word "DevOps" in our job titles, which is a bit of a downside, but there you go.
But moving to more recent times, as you can see, by 2016 we've adopted the Spotify tribes model. And I don't want to get hung up on the fact we're using the Spotify tribes model. Whether we've got tribes and squads or departments and teams isn't really the point. The point is we've got small, autonomous teams doing agile software delivery. That's the important point.
You'll also notice that we don't have a centralized DevOps team anymore. There are ops squads inside the tribes, and in some cases there's actually ops embedded into the development squads. It depends on what's right for that particular tribe or squad.
And DevOps has certainly been one of the main drivers, along with agile, along with lean principles, in our growth over the past three years. For the last three years, we've grown 30% year on year. So that's in terms of customers, in terms of profits, which is good, but also in terms of technology, and more importantly, in terms of developers. And scaling teams, as we all know, is hard.
We now have around 1,500 employees. Most of them are technical roles. We are a technology company. And we are based primarily out of a couple of sites in the north of the UK, in Leeds and Sheffield. And last year our revenue was just over half a billion UK pounds, which I think is like $0.6 billion or something like that.
And that means our local tech media refers to us as a unicorn, which is kind of nice.
But today's story isn't about all that growth or web scale. Today's story's about one of the teams that, I suppose, in a way got left behind or wasn't ready for the DevOps journey at the time. And that's the data tribe, where I work.
Because in 2011, when we started our DevOps journey, we were far from a tribe. There were four of us: a couple of engineers, some CRM and insights people, running on a data warehouse solution provided by our then parent company, Sky Television. We've since been sold to a VC company.
But we outgrew that data warehousing solution, and we replaced it with our own. We built our own Hadoop cluster. We had a small team that built and ran that, and wrote the code that ran on it to do data ingestion. We had some CRM and insights people working with us. But you'll notice there's an ops engineer embedded in that squad.
And for those of you of a technical persuasion in the room, this is what they built. They built a Hadoop cluster that pulls in data from various sources around the business, ingesting primarily off the database server that runs all the front-end websites, but also a number of third-party feeds and other feeds around the business. And that data's pulled in, it's analyzed, it's ingested, and it's crunched, and we produce valuable information that's used by our CRM and insights team.
And this gave us loads of MI and BI that our competitors didn't have. It gave us competitive advantages. We had information on our liabilities. We knew what the risks were on certain football results or certain sports results. We knew what would happen if the results went certain ways. Our competitors didn't have that.
We also were really good at marketing to our customers. Our segmentation of our customers was really good, and we could target effectively to them.
And the business said, "This is great. It's really giving us value. Can we have some more, please?"
And we said, "Well, okay. What do you mean by more?"
And they said, "Well, it's business stuff, isn't it? It's preventing churn, better segmentation, some personalization, maybe some real-time data rather than overnight batch flows. That would be really nice. Maybe some streaming technologies. And I'm sure you made the point when you put this Hadoop cluster in that it could do machine learning and artificial intelligence. Could we have some of that? That would be great."
And to be fair to the business, they invested in the team, and we moved from having one squad that ran the cluster to four squads, each with different functionality.
We kept the original squad that built and maintained the cluster and ran the flows, but we also spun off a kind of R&D function of that. So they were looking at newer big data technologies so we could do things better, cheaper, faster.
We also spun up a team that worked with the front-end websites to do promotions using the data we got in the cluster, to do better marketing and instant marketing to our customers.
And then we spun up a data science function as well to build analytical models so we could better understand our customers and the interactions they had with the site.
We also spun up an operations team to help run this, rather than a single engineer. And that operations team, the data DevOps squad, or DataOps, as we became known for short, was made up of four engineers. Now, some of them were very skilled in Linux systems administration, some of them in Hadoop administration. But what we all had was the ability to code.
We were more devs in ops rather than ops in dev. So some of us had Python skills. Some of us had more infrastructure as code skills, in this case Chef.
And we got given a lovely to-do list. It was just the kind of thing we were looking to do. We were looking at improving performance and reliability of the cluster. We were going to de-snowflake it, so get rid of all those little bits of tech debt that had crept into the server configurations. We're running in a multi-tenant environment, so we should spin up some test environments, possibly using all this infrastructure as code we're building.
And, of course, we've got three new development squads. We need to support those as well.
So we set off going, and it didn't really work. There was frustration amongst the operations team.
And the basic problem was twofold. Number one, we had a backlog of work to do, and we were getting given tickets from the development squads. Lots of tickets, lots of troubleshooting, lots of firefighting, lots of operational and administrator kind of things, and we weren't working on the backlog.
And because we're a new squad, and because we want to do the DevOps thing, we want to be collaborative. We want to help. We do. And that's kind of a vicious circle, because once you start helping with these things, you kind of get expected to help with these things, and then even more stuff comes your way. And we're not getting the backlog done.
But more importantly, the engineers are frustrated because they're not using the skills they were employed for. They're not doing any development work. They're just troubleshooting and firefighting.
And we were pretty sure this would be a problem that had been solved before, particularly in the DevOps space. So we went and relearned our DevOps education.
And we probably all know this. So this is the misaligned teams: one of them building features and incentivized on that, another team on stability and uptime and being incentivized on that. The wall between the two, so they don't communicate. The poor handoffs and the terrible feedback loops, and the general squabbling and fighting that happens in that.
I do hope you don't work there.
Nervous laughter. Maybe some of you do. Well, you're in the right place to find an answer.
That wasn't our problem, though. We got on okay with our development teams. We weren't fighting. We weren't squabbling. We just had a lot of work to do.
And we went and did some learning. We went and listened to podcasts. There's two good ones there. We went to conferences. We came to this in London two years ago. We went to DevOpsDays. We watched a lot of conference videos, saw a lot of DevOpsDays videos. And we went to meetups. I actually run the one in Leeds. If you're ever passing, unlikely as it is, do call in if we've got a meetup on. But if you do have a meetup in your towns and cities, go to them. They're really good.
We also did a lot of reading. We read The Phoenix Project, because you should. And if you haven't read The Phoenix Project, you should.
We would have read The DevOps Handbook, but it wasn't out at this time. And we would have read Effective DevOps by Katherine Daniels and Jennifer Davis, but that wasn't out either, so we couldn't read that. We did really like this Next Gen DevOps book by Grant Smith, though. We found that really good.
And out of all this reading and all this learning, we came up with these three areas that we thought might actually have the answer to our problem: three ways of DevOps, projects versus products, and four types of work.
So you've probably seen this 100 times at this conference. But I'll just run through it quickly.
So three ways of DevOps. First way: accelerate flow. So going from, "I've got a really good idea," to code running in production as quickly as you can. Going across those functional teams, removing the blockers, and getting the flow going. Looking for constraints.
Second way: amplifying feedback loops. So that's not just throwing problems and feedback down the chain. It's actually coming up with better ways to work together, to be more collaborative and more constructive.
And then the third way is this culture of experimentation, a learning organization, one that takes risks, one that isn't satisfied with how it's doing things at the minute, but is always looking to improve.
And we thought about this, and we thought, "Well, we haven't really got many handoff problems, other than the volume of problems. The data operations team is effectively the constraint. We're kind of this Brent character from The Phoenix Project."
And what we realized was the reason that was happening was because of the way the team had evolved in the early days. When the data warehouse was first built, the operations engineer in that team was the go-to guy, the clever guy, the person that always got the complicated stuff done, who took on a lot of the burden to let the devs concentrate on writing code. And that muscle memory, and that way of working, we'd kind of scaled that up.
In a way, the culture had become a bit of a form of tech debt in the way they worked. So we realized that we needed to change the way the development teams were working, and we didn't know how. So we went to find out.
These next two slides are stolen blatantly from Damon Edwards. He does a talk that explains it much, much better than I can. But this is around the concept of projects versus products. If you get the slides, there's a link at the bottom there to one of his videos that's a much more detailed version than this.
But Damon explains that where you have functional teams, where you have handoffs and feedbacks between them, every time that happens, a bit of knowledge is eroded to the point that the operations team in this example have no idea of what the code was actually written for, what business value it gives, or what it does. They know how to run it.
And equally, our dev team doesn't know how it runs in production. So we've got some problems here.
And particularly by the nature of projects, where you have something that runs for a certain time period and then finishes. I don't know how many times I've seen a Gantt chart with the final milestone as "go live." That's scary, because code lives on after it's gone live. It needs looking after. It needs maintaining. It needs to be run.
And particularly in that example where the whole of the development team might move on to other projects or not even be there. You might have outsourced the work. This is really bad.
And what Damon explains is if you move to more of a product-centric way of developing software, so rather than things living and then going live and being expected to survive on their own, they're actually products that are owned by teams, in this example by product teams. They can be enhanced. They can be bug fixed. They can live on.
And of course, the great thing about this is immediately, where you've got these cross-functional teams, all the handoffs and feedback loops are gone. So all that knowledge is known because they're all working on the same stuff. They're all working in the same meetings. That knowledge is there.
You're also ending up with much, much higher quality software because things are looked after, things are maintained, things are enhanced.
And I suppose the final point here is about the teams themselves. This idea of ownership of a product is really motivating for teams. It's Maslow's hierarchy of needs: autonomy, mastery, and purpose. It really enables that.
And in our example, I suppose, are we thinking about a product or are we thinking more about a service? Because it's not a thing we're just building. We're actually running it.
And if you move to thinking about things as services, and you look at this layered approach to services, in our example, our infrastructure tribe are providing us with VMs and tin, because we run Hadoop, we work on tin sometimes, with the base operating system installed and the networking set up.
We're then taking that infrastructure and running a software layer on top of that, a PaaS, effectively. This is the Hadoop software, configuring that, getting it running, and then that's used by a sort of an application layer, or in this case, a data layer. So our developers are writing applications and flows that run on that platform ingesting data, and then ultimately that data is consumed by the SaaS layer. That's our CRM platform, our insights platforms.
And if you think about the way the boundaries should be between these layers and you start thinking about the "you build it, you run it" ethos, then the non-functional requirements to build and run should be separated out between the teams like this.
But in our case, a lot of these non-functional requirements that, according to the "you build it, you run it" ethos, should be maintained by the application development teams were being looked after by us, the operations team.
We need to reestablish these responsibilities. We need to put boundaries in place, which sounds like a bit of an anti-pattern for DevOps, but this is how we're going to get stuff done.
And we couldn't just go to the point where we would say, "Right, from now on, we're not looking after these particular responsibilities. It's now your problem." Because we are a collaborative DevOps team. We don't want to do that.
So what we worked out is, as we release features into our products, we can use that as a way to reestablish these responsibilities.
So what we came up with was this idea of these four E's. So when we release a feature, we make sure that it's giving the development teams some empowerment to run it themselves.
To do that, we need to make sure we're collaborative and empathize with them so that we're not just throwing something back over the wall which they don't need. We need to support them in doing that and educating them about what it is it does and how to use it.
And it's a two-way trust thing. We need to trust that they're going to be able to run this thing in production, and they need to trust that we're building the right stuff.
And we had a lot of products. We didn't realize it. We just thought we kind of had this platform we ran.
But actually, we're moving to a world where we're going to start running Hadoop clusters in the cloud, so using the infrastructure as code we've been developing.
We've also got loads of Jenkins around the estate, not just for CI and CD, but also for automation and also for actually running our flows as a scheduling tool. And we thought maybe we could build some sort of generic version and provide that as a service, rather than running one monolithic Jenkins or fragmenting it so much that the teams had several ones.
And we were getting constant requests for, "Can we have some Docker?" which we were very wary of, not only the container sprawl, but also how we managed the installs of the Docker runtimes across our servers. It was more things to patch, more things to update.
And to tie this all together, if we look at the four types of work, and if you've not read The Phoenix Project, sorry, this is a bit of a spoiler. But The Phoenix Project tells us there are four types of work.
There are business projects. This is the to-do list your business gives you, things that come from product owners and managers. They're things you've got to do to deliver business value.
But you'll also have internal IT projects as well. For us, these are things like looking after our backups, our monitoring, our updating and upgrading, and also patching. They're the kind of things we have to do to keep the lights on.
And in an ITIL or service management environment, you're going to have changes to manage how you put this work into production and can track it.
But The Phoenix Project explains that the fourth type of work, unplanned work, by its nature, you can't plan for. It also describes it as toxic, because what happens when it turns up is it just takes over and pushes all the other three types of work out the way, smothers them. And we were certainly feeling this.
What The Phoenix Project also advises is when you're working on any of the types of work in the first three, you should be making sure that you're doing something to reduce the unplanned work, or otherwise it's never going to go away. You should constantly be working at reducing any tech debt that causes that.
So how do we start working differently to deliver these features and products, and work on technical debt, and reestablish these responsibilities?
Well, we were very disciplined with the way we did work. We put everything through Kanban. We didn't have Slack channels where people asked for favors. And in a way, by doing that, that's how we knew we were in trouble, because we could see the number of tickets coming through the board. We knew we had a high volume of stuff.
So we want to move to developing products and features. How do our dev teams do that?
So this is one of our dev teams having a standup, and some of the boards they use. It might be a little hard for you to see, so I'll describe this.
This is one of our development team's boards. The bottom area is an elaboration area. So this is where work is assessed by architects, product owners, BAs, and things are fleshed out on there until things are ready to go into the backlog.
The top board is a Scrum board. We work on two-week sprints. Work is refined, story pointed. There's a selection process to work out what's going in the backlog for the sprint. The work's then done, and at the end of it, there's retrospectives and review of the metrics, all facilitated by a Scrum Master.
So that's the way the dev team develops software. Well, we should do that, too, if we want to be developing products and features.
So we kept our Kanban board, which is the bottom board on this picture, and we introduced a platform board.
So again, we refine work, we story point it, we select it, we do the work, and then we do metrics, and then we do retrospectives as well, again facilitated by the same Scrum Master that the development teams use.
And what we decided to do was we put all the business work and all the internal work through the platform board, and then all the unplanned work could go through the bottom board.
And we realized that we were a team of four. So if we rotated someone to be responsible on a weekly basis for that bottom board, that meant one week a month, you probably had a bad week where you were dealing with all the unplanned work. But for three weeks of the month, you were writing code, doing what you were employed for. You were writing features for products that you owned, and that's really empowering.
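As a rough illustration of the arrangement described above, the two-board split and the weekly rotation could be sketched in Python as follows. The names (`WorkType`, `board_for`, `on_call_for_week`) and the engineer names are hypothetical, and routing change tickets to the platform board is an assumption: the talk only says where business, internal, and unplanned work went.

```python
from enum import Enum


class WorkType(Enum):
    """The four types of work named in The Phoenix Project."""
    BUSINESS = "business project"
    INTERNAL = "internal IT project"
    CHANGE = "change"
    UNPLANNED = "unplanned work"


def board_for(work: WorkType) -> str:
    """Route a ticket to a board by its type of work."""
    # Unplanned work goes to the Kanban board; planned work goes through
    # the sprint-based platform board.
    # Assumption: the talk does not say where changes were tracked, so
    # they are treated as planned work here.
    return "kanban" if work is WorkType.UNPLANNED else "platform"


def on_call_for_week(engineers: list[str], week: int) -> str:
    """Pick who owns the Kanban board in a given week (0-based), round robin."""
    return engineers[week % len(engineers)]


# Hypothetical team of four engineers.
team = ["engineer_a", "engineer_b", "engineer_c", "engineer_d"]
schedule = [on_call_for_week(team, week) for week in range(8)]
```

With four engineers, each name appears in `schedule` once every four weeks, which matches the "one week a month" on unplanned work described in the talk.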
And that overnight changed the dynamic of the team. It made them happy.
So what have we seen as outcomes from this, this journey of reestablishing responsibilities and managing work better?
Well, on the left graph, the blue area indicates the number of Kanban tickets we did. The green area indicates the number of platform tickets we did.
And what happened when we introduced the platform board was originally we got loads more work done. Maybe we weren't context-switching on work we were doing. Maybe it's because we were building things we wanted to build, and we were motivated to do it.
We had a little dip around Christmas where people were on holiday, as you'd expect, and then things picked up again in the new year. We then had something happen. We had a curveball from our management. They decided we were going to do a data center migration. And the only way we could really manage that was through the Kanban board.
So we stopped doing platform work for a couple of weeks while we managed that work through, which was okay, and then we picked up again afterwards.
The graph on the right really shows the number of tickets raised against done, and what that's sort of indicating is that the backlog isn't running away. We're keeping on top of all our requirements.
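One way to read a raised-versus-done chart like this is to track the running difference between the two series. The sketch below uses made-up weekly counts purely for illustration; they are not the figures from the talk, and `open_backlog` is a hypothetical helper.

```python
from itertools import accumulate


def open_backlog(raised: list[int], done: list[int]) -> list[int]:
    """Return the number of open tickets at the end of each period."""
    if len(raised) != len(done):
        raise ValueError("raised and done must cover the same periods")
    # Cumulative raised minus cumulative done gives the open count per period.
    return [r - d for r, d in zip(accumulate(raised), accumulate(done))]


# Illustrative numbers only, not data from the talk.
raised_per_week = [10, 12, 8, 11]
done_per_week = [8, 12, 9, 11]
backlog = open_backlog(raised_per_week, done_per_week)  # [2, 2, 1, 1]
```

A flat or falling series, as in this example, is what "the backlog isn't running away" looks like in numbers; a steadily rising one would mean tickets are arriving faster than they are being closed.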
And we built all the work, all the business objectives we were given in that list at the start of the talk. We got that done. We built out our physical Hadoop platform in the new infrastructure as code. We got rid of all the snowflakey stuff. We rebuilt every node.
We took that infrastructure as code, and we used it with some tooling to roll out to AWS. We built some preliminary self-service interfaces for that, but we've had to park it due to our compliance teams not letting us put the data we wanted into AWS. That's changing once we've got the audit trails in place.
And the Jenkins as a service worked really well. We created a common Jenkins and built boxes out for each of our squads. They could then customize them as they wanted. They could install the plugins they wanted. And all the config was synced to Git.
So if there was a problem with that box, we could just kill it and rebuild it, and it would be back as it was. And the teams are really happy. They've got the latest version of Jenkins and the latest versions of the plugins.
We also contributed to some open source tooling to transfer jobs around from the command line. So rather than recreating all the jobs in the new Jenkins, they could just copy the jobs effectively between them, or batches of them, or individual ones.
And we put out a container as a service platform to give squads dedicated Docker hosts to run on, but also a management framework around that and a catalog system, so they could define their own stacks and run them through a management interface rather than having to go command line.
So we delivered all the work, and that was on both the Kanban and the platform board.
Challenges. So yeah, we had a few.
We didn't have the challenge of getting our company to adopt DevOps because we'd already done that. The business already was bought into that mindset, so we didn't have to have that battle.
We didn't realize that culture and ways of working could be a form of technical debt which could sting you. We thought that was badly written code or hard-coded config or something like that.
But we worked out that to change responsibilities through delivery of features, we needed to be really careful, which is where this idea of four E's came in, and we needed to think about how we optimize the system globally rather than just locally.
Yeah. Big, urgent projects came in, caused us trouble. We had to adapt to that, and it's of course really easy to give in and just move back to doing the Kanban thing.
And guess what? Grassroots organizational change is hard. Yes, it is.
I need to give some thank yous because I'm just giving this talk. I worked on this stuff, but I worked with the team. So it's the members of the team, present and former, who really deserve the thanks on this, as well as some people that helped us.
That is my Twitter handle still. It hasn't changed. And we have an engineering blog that has a load of articles on our big data technologies and also all the work our betting and gaming tribes have been doing at web scale. So if you're interested, have a look at that.
Gene asked if I could ask you for help. And what I'm considering at the minute is moving from a model where we have a DevOps team in the squad to more of a platform and an SRE squad, and pushing operations engineers actually into the squads to not just do all the operational stuff, the non-functional stuff, but to make those first-class citizens, like monitoring, like logging, like metrics, like performance, like scalability.
And if you've done anything like that, if you've got any experience of that, I would love to hear your thoughts on what worked and what didn't.
So it looks like I'm just about out of time. So I'm going to say thank you. Thanks for coming down, and I'll be around for the rest of the conference.
Thank you.