ITV's Common Platform: Now with Added Windows

London 2017

In this talk, Tom will introduce you to ITV's key DevOps principles and open-source tooling choices, explaining how they're used to develop and manage the Linux cloud platform that underpins their multi-million-pound technology modernisation programme.

He'll then share details of ITV's plan to apply the same principles and tooling to their large Windows estate, allowing them to automate the lower half of the "IT estate iceberg" and ultimately delivering a truly Common Platform.

Full transcript

Tom Clark

I'm Tom Clark, and I'm the Head of Common Platform at ITV.

To kind of introduce myself, I've been working for about 16 years now. And no, I don't have an amazing plastic surgeon. I actually started straight out of school and started my first job with Jaguar Cars as a 17-year-old kid, actually, as a Windows sysadmin.

In my career, I've worked for organizations large and small: Global Radio, the BBC, ITV, Siemens, three startups that failed because they're startups, that's what they do, and most recently, ITV.

In that time, I've worked across a number of disciplines. I've been a sysadmin, I've been an architect, and also a Perl developer. But I consider myself a recovering Perl developer, so don't worry.

A few years ago, actually, I quit my job and I went traveling around Southeast Asia. I grew a big beard, and I motorcycled around, had a great time. And when I came back, I wanted a new challenge, so I decided to go into management. I actually went permanent with ITV for the first time in my career, and so I became the Head of Common Platform.

At ITV, I report to the Director of Infrastructure, and he reports to the CTO, and he reports to the board. So it's quite a flat structure. And please, you can see my Twitter handle there. Please tweet me. It makes my mother very proud.

So ITV, for those that don't know, is an integrated producer broadcaster, which essentially means we make content and have the ability to distribute it around the world. It was founded in 1955. The Brits in the audience will have known it as Channel 3 growing up. I wasn't allowed to because it's far too common.

We're a member of the FTSE 100, which means we're one of the largest companies in the UK, and we turn over about 3 billion pounds annually. To do that, we are a commercial organization, so we make a lot of popular content across the whole schedule.

Last year, we had the most popular television program broadcast with the England versus Iceland Euros match, most popular soap with Coronation Street, and the most popular new drama with The Durrells. And because we reach the whole schedule, the whole demographic, we've got things like Jeremy Kyle, things like Love Island, the really high-quality stuff, and then the terrible dross like Downton Abbey, and Victoria, and The Voice.

And because of that, we reach 75% of the ABC1 demographic. For those that don't know, those are the ones the advertisers care about because they've got the money.

Now, we do that with very little. We've only got five and a half thousand staff. I've met many people today, like Jason Cox from Disney, almost 100,000 employees, and Jose from Accenture, 400,000 employees. So when they do business change, I'm in awe of what they do because I've managed to do it with five and a half thousand people. So I got the easy mode.

I'm going to give you a little bit of history about what we have at ITV.

We moved into our data centers about five years ago, and we did a big P2V, and we lifted and shifted everything across. And we said, "One day, we'll clean all that up."

Back then, ITV was very much a waterfall organization, end to end. Everything, and I mean everything, took four weeks minimum. But it was fine. We put it on a Gantt chart, and everything moved out to the right, and it wasn't really an issue.

About three years ago, we started a very large modernization program because we realized all of our in-house developed systems, and by that I mean our sales systems, our rights systems, our content delivery systems, our broadcast scheduling system, the things that really differentiate ITV, they were slipping behind the business and making us uncompetitive.

So we started this modernization program, but we very quickly realized that waterfall and agile don't mix. Weird, huh? And so I was actually asked to build a replacement platform for the old waterfall one, and that's something I talked about last year at the DevOps Enterprise Summit. That's what we call our Common Platform.

So we had this Common Platform, but the Windows estate was the elephant in the room. It was the big outstanding issue.

The Windows estate hosts most of our COTS apps, our commercial off-the-shelf applications, and it was still there. It's still being kicked down the road, and it was still a big issue. And the hardware is reaching end of life. It's five years along. We did this big capital purchase, and it's about to be written off.

So it's still waterfall. The changes are still manual. It's still slow-moving, and it's still expensive to maintain. So what do we do? Move it all to the cloud, right? That's what everyone's doing. It's what all the cool kids are doing.

Jonathan Fletcher earlier on mentioned about doing a TCO, like a total cost of ownership analysis. And we did ours. We found out that actually moving all of these Windows apps, these COTS apps, into the cloud would actually be at least 40% more expensive for us.

Now, one of the main reasons for that was we're a broadcaster, and we have loads of broadcast technology, like big hardcore transmission equipment, in our data centers as well as our compute. So even if we moved all that kind of stuff out of the DCs, we'd only have to take away about 15% of our footprint, so not a massive saving. So it's going to be really expensive.

But as a result, we knew we needed to do a hardware purchase again, unfortunately. But we knew that we needed to do it better. We actually needed to do some config management, some automation, and actually make it better than it was previously with the old waterfall approach.

The thing I mentioned before, the Common Platform, this is what we built a couple of years ago to host our internally developed applications. So the rights and sales systems, like I mentioned, content delivery, the in-house developed ones with a product team attached to them. That was really good. It runs on AWS, and we're really happy with it, and it's config managed end to end and fully automated.

What we asked with the Windows world is: could we take the principles and the tooling that we'd developed with Common Platform and apply them to the Windows world? Still on-prem, not in the cloud, but the same principles and practices. And so I'm going to tell you about the idea we have to actually do that.

Key principles behind the Common Platform today, and many of these will be very familiar to you if you've done a similar agile and DevOps transformation.

Automation. It's a very obvious one. If you automate the boring, repetitive stuff, you take the humans out of the equation, and you let the humans do the really interesting stuff. And there's a cool side effect of automation: the more you automate, you get more time. And so you have more time to automate, so you have more time, which actually turns into a virtuous infinite loop that's really positive for everyone involved. So obviously, automation is key.

Standardization. A quick show of hands. Raise your hand if you drive a car. Good. Okay. Now, if you know which side of the road to drive on, you can lower your hands. In fact, you all have, so I'm assuming you all actually know which side of the road to drive on, right? So we've standardized on that.

Obviously, with a mixed international audience, you may drive on different sides of the road, so we need to work on that bit of standardization. But imagine how difficult it would be if you didn't know which direction the car would be coming in. It would make it very difficult. So another key tenet to the Common Platform is standardization. Very key.

Again, loosely coupled, but highly aligned. Standardization is really useful, but not if you all have to down tools and move in lockstep. So we ask ourselves, how can we actually be loosely coupled, but highly aligned? Loosely coupled, allowing ourselves to move independently, but highly aligned, always heading in the same direction. And one of the key ways we solve that one is actually through versioning pretty much everything we do.
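One way to picture versioning as the glue between independent teams is a compatibility check on version numbers. The talk does not say which versioning scheme ITV uses; the sketch below assumes semantic versioning ("major.minor.patch"), and the function names are hypothetical.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse(text: str) -> Version:
    """Parse a 'major.minor.patch' string (assumed format)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def satisfies(pinned: str, available: str) -> bool:
    """True if `available` is a safe upgrade from `pinned`.

    Assumption: releases within the same major version stay compatible,
    so a team can pick them up in its own time.
    """
    p, a = parse(pinned), parse(available)
    return a.major == p.major and a >= p


print(satisfies("2.3.0", "2.4.1"))  # same major, newer
print(satisfies("2.3.0", "3.0.0"))  # new major, needs a deliberate move
```

Under that assumption, a team pinned to 2.3.0 can take 2.4.1 whenever it likes (loosely coupled), while a jump to 3.0.0 is a conscious, visible decision (highly aligned).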

Blast radius reduction. Historically, and actually still at ITV, there's lots of shared infrastructure. And you know at some point in the past, there must have been a change made over here that took out all of this stuff over here. And you know someone said, "Hey, I know how to solve this one: more process."

They would've introduced a CAB, which I learned yesterday is actually a change advisory board, not a change approval board, which explains why we've been doing it wrong all the time. But you'd go into the CAB and all the directors would be there wearing their robes and their wigs, and you'd say, "Please, sir, may I do my release?" And they'd say, "Yes, yes, you may do your release. Now be gone out of my chamber." And you back away out of there, hoping to fit it into your release window.

Those kinds of changes actually do work. For unknown stuff, for really high-risk stuff, a CAB probably makes sense. But is there a better way? We think so.

So we have this concept of blast radius reduction. We like to draw rings around stuff and say, because of that ring, because of that barrier, because of that great wall we've built around this, we limit the area of effect of something going horribly wrong.

And if you draw around that domain and you say that domain is a product and that product has a product owner, and that product owner has a responsibility for actually the availability of their particular application, isn't it best they actually are empowered to make that decision? We think so.

The principle of least astonishment. Lots of techs in the audience, I'm sure. Who's run a command that did something they didn't expect? Good. A few people there. So now, who's had to push on a door that has a handle? Bit weird. And now imagine a green button marked "Stop." Okay?

Those three things have violated the principle of least astonishment. They've broken a contract they had with you. They've done something unexpected. All that means, actually, it breaks trust and it slows you down because you can't make assumptions anymore, and assumptions are really useful because they allow you to move quickly if you can do them safely.

So principle of least astonishment: when we develop on the Common Platform, we think about the most obvious thing it should do, and then we make sure it does that.

And then finally, you build it, you run it. And it's really obvious. I'm a big believer in workplace psychology. You give people responsibility, and they'll want to make it work. You get that quality through psychology, not through process. You don't say, "Make it good." Really good people want to make it good, and if they don't, they shouldn't be on your team anyway.

So what we've done to deliver this, we've got devs, testers, platform engineers. We've put them together in product teams, and we've given them full control of the means of production. So dev, stage, production, end to end, it's all there. And as a result of that, actually, they feel empowered to make it better. So really key.

No more throwing it over the fence to a separate operations team. No more making it someone else's problem.

I'm going to talk about this in the classic IT layer cake of infrastructure, platform, and application.

So now moving swiftly... Okay. I'm actually on NSX. Clap, boys. Almost got you.

Infrastructure, I'll touch on it briefly. Out with the old, in with the new. Out with the blade chassis, which we found are good at computing, but even better at heating your data center. So in with the new: Dell pizza boxes.

Now, it's important to say, we want to software-define stuff. We want to control the IaaS stuff, but we don't want to build a private cloud. Chris Saxton yesterday said it's a fool's errand trying to compete with Amazon. Obviously, we're not in that business, but we do want to config manage the IaaS layer for this stuff: compute, storage, networking, and how we could do that.

The existing stuff runs on VMware 5.5. The new stuff is going to be VMware 6.5, and it introduced something really cool called NSX, which is basically software-defined networking, and it allows you to do vNIC-level firewalling, like security groups from Amazon. And firewall changes were one of the big killers that we had because each one would take four weeks to get delivered, and it was an absolute pain in the Gantt chart.

Again, it's going to be built and managed by our existing managed service provider.

So infrastructure. I'm going to take you quickly through how we lay stuff out on AWS, but you'll see why this is relevant.

We have a product that we'd host, and we'd have environments here, and the environments would actually hold applications in them. And what we actually found is that the Venn diagram of this is basically non-production over here, so dev and stage, they're non-production, and then production is over here.

So basically, it's production or non-production, and we came up with a term for that of ecosystems. Either you have development ecosystem or your production ecosystem, and then we put some infrastructure management across them, again, at the infrastructure level.

This limits the blast radius. So if you have a horrible failure in your monitoring and management tools in production here, it won't affect anything else over here and vice versa.

That's how it looks in Amazon, and we map these Common Platform concepts to Amazon concepts. So ecosystems, we map to accounts on Amazon, and environments, we map to VPCs. So that's what it looks like on Amazon.

I'm going to show you now what it looks like on the new software-defined data center we're going to build. And if you played Spot the Difference when you were a child, see if you can do it now. So three, two, one.

All that's changed is we're mapping it to different concepts. In the new world, we're going to map it to one of these tenants, and ecosystems to business groups, and environments to subnets. But actually, fundamentally, it looks exactly the same. So we can use exactly the same language when we're talking about hosting on-prem or in the cloud.
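The two mappings can be sketched as a pair of lookup tables over one shared layout. The AWS and SDDC terms come from the talk; the dictionary and function names are illustrative, and reading "it" in "map it to one of these tenants" as the product is an assumption.

```python
# Common Platform concept -> name of the hosting concept it maps to.
AWS = {"ecosystem": "account", "environment": "VPC"}
SDDC = {
    "product": "tenant",  # assumption: "it" in the talk refers to the product
    "ecosystem": "business group",
    "environment": "subnet",
}

# One product's layout, described only in Common Platform language.
LAYOUT = {
    "non-production": ["dev", "stage"],
    "production": ["production"],
}


def render(mapping: dict) -> list:
    """Describe the same layout using the target platform's terms."""
    lines = []
    for ecosystem, environments in LAYOUT.items():
        lines.append(f"{mapping['ecosystem']}: {ecosystem}")
        for environment in environments:
            lines.append(f"  {mapping['environment']}: {environment}")
    return lines


for name, mapping in (("AWS", AWS), ("SDDC", SDDC)):
    print(name)
    print("\n".join(render(mapping)))
```

The layout never changes; only the lookup table does, which is the point being made about using the same language on-prem and in the cloud.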

And again, we're going to have multiple instances of the SDDCs. These are various business divisions we're going to be hosting. But again, blast radius reduction. Individual instances, not shared, not crossing over.

You can see in our workplace system here, there's been a catastrophic failure in one of their systems, but it's fine because of blast radius reduction. And I move that explosion around depending on who's annoying me that particular week.

So enough about that. On to platform.

Common Platform had a whole load of tooling that we used, open source tooling that we really, really liked. And I asked my team, "Can we use the same tooling that we've used on Linux on the Windows world?" And we went away and we did some R&D, and we found out some interesting results.

Config management. We currently use Puppet for config management on Linux. What do you think? We tried it on Windows, and it worked. It worked really well, actually. We got some really, really nice outcomes from that as well. So that's good.

Monitoring. We use Sensu on the Linux world. We found it's a very, very powerful tool on that side of stuff. We use SCOM on the Windows world at the moment, which is fine, but it's manually configured, and it looks a bit like a Christmas tree. And actually, sometimes we worry more when it goes green because that's a bit weird.

So we thought, monitoring with Sensu, trying to automate as much as possible. We found it works really well. It's better than SCOM, we found.

ELK. There's no log centralization in the existing Windows world. If we want to know if something's gone wrong in the event log, we have to log in to the server, which is very, very different for me as a Linux person.

So can we use ELK Stack, Elasticsearch, Logstash, Kibana, to centralize our logs? Actually, we found you can. There's a thing called Filebeat, there's a thing called Winlogbeat, and we're pushing all of that into ELK now as part of our R&D stack.

Metrics, so the TIG Stack: Telegraf, Influx, Grafana. We don't have any metrics collection on our existing Windows estate because it was never part of the initial requirements capture five years ago. So you're monitoring things like CPU state. Whew. Monitoring things like memory consumption. Huh. Monitoring things like disk usage. Cool. You can. It's amazing. It's like a brave new world for the Windows people.
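To make "metrics collection" concrete, here is a minimal sketch of what a single data point looks like on its way into InfluxDB. In the stack described, Telegraf would do the collecting; this only illustrates the idea using the Python standard library. The measurement name, tag, and helper names are assumptions, and escaping of special characters is deliberately left out.

```python
import shutil
import time


def disk_used_percent(path: str = ".") -> float:
    """Percentage of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Format one point in InfluxDB line protocol.

    Simplified: float fields only, no escaping of spaces or commas.
    """
    tag_part = ",".join(f"{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(f"{key}={value}" for key, value in fields.items())
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


point = to_line_protocol(
    "disk",
    {"host": "example-host"},  # hypothetical host name
    {"used_percent": round(disk_used_percent(), 1)},
    time.time_ns(),
)
print(point)
```

Points in this shape are what Grafana ends up graphing once they are stored in Influx.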

And then finally, orchestration with Jenkins. We don't know yet. That's what we use in the Linux world. I don't think it's going to be a good fit for Windows, actually. I think it's going to be a bit of a crowbar-shoehorn kind of activity. So something like Octopus Deploy might be a better fit, and if you've got any ideas, please grab me afterwards.

So ultimately, that's what I look like at the end of it. There's tons of stuff you need to do to build a platform, but a lot of this stuff is actually going to be translating the existing Common Platform work into Windows. We've actually done a lot of the R&D already.

We're also going to try and deliver the platform itself in an agile way. Historically, we would've done, again, big design up front, big build phase, and then only once it's fully built and signed off will we try and put an application onto it. We tried that in the past. It hasn't worked so well.

So first phase, and these are all in agile sprints. They are phases, but again, each one has multiple sprints in it. Signing off the SDDC that the managed service provider delivers to us. Can we change the firewall rules without talking to them? Can we spin up kit without talking to them? Can we configure a load balancer without talking to them? So sign it off.

Basics, actually installing the operating system, actually building out an application, seeing it spin up, seeing Puppet get installed. Big phase there.

The first application we actually migrate as soon as we possibly can. That phase, a little bit shorter because it'll be a P5 application, one of the least important ones that we host.

Strong and stable. We'll be enhancing our platform and actually making it better, getting it ready towards that P1 readiness state. So that'll be adding more metrics, more logging, more monitoring, more hardening to the system, and making sure we actually do things like patching.

And then finally, P1 ready, when we actually put that first P1 critical Windows app across onto the estate. But it's very important that we want to get apps onto it ASAP so we can find out all the mistakes we're making.

So people to operate it. A small number of brilliant people.

Historically, the managed service provider had hundreds of people on the contract operating the system for us entirely manually. And we found with the Linux Common Platform, a small number of people works much better.

So I have two brand-new Windows platform engineers on my team, Manon and Lawrence. They weren't Windows platform engineers before, but they had that talent. They were the ones doing the PowerShell. They were the ones trying to automate. I said, "You two, join the team. We'll do this POC. We'll do this pilot." And they're loving it. They're loving playing with all these tools. They're loving seeing metrics and graphs. It's really cool.

Two sets of engineers at ITV: embedded engineers and divisional engineers.

In the Linux Common Platform, we have product teams to develop software, 16 products that we host. And I embed a Linux platform engineer in each one of those teams. But we don't really have product teams for these COTS apps. The P11D expenses app doesn't have a whole product team supporting it, so it'd be crazy to have a one-to-one mapping of engineers.

But we need to embed them somewhere, and so we embed them in our business divisions. These are the five main business divisions here, and we align them with the business division, and they'll actually sit with that business division, and they'll know the end users. They'll know the app owners, they'll know the tech directors, and they'll build up a relationship, and they'll actually feel a sense of responsibility for the systems that they operate.

So what will they do? Three main things.

They'll build these systems, and they'll migrate them from the old estate into the new based on a priority order. And they'll build them really, really well. They'll actually understand them end to end, which has never been done before.

They'll obviously then operate them. They'll take responsibility from the managed service provider for operating them day to day, and they'll actually make sure they're operating 24/7.

And then finally, they'll also enhance them. Continuous improvement, continuous innovation was never factored into the contract we signed with the managed service provider five years ago. So it didn't happen. Why would it?

But we realize actually now that always making things better, always looking for something to improve, actually is a good return on the investment. So they'll be doing a lot of enhancement as well as they go forward.

Separate to the embedded engineers we have, we also have this concept of a core team, a set of subject matter experts. This is the model we had historically. You would actually have three product teams or three individual teams duplicating effort. So this is logging, alerting, monitoring, all being duplicated between all the different teams.

And what we'd do, we'd actually add a core team into the equation, and they would take responsibility for doing all of that work. Because you can see they've made a triangular wheel, they've made a square wheel, a pentagonal wheel, a hexagonal wheel. Doesn't really work. The core team comes along, and they make a round wheel, and they share it with all the other teams.

And you can see now that the other teams have got a lot more capacity left in their day to work. Very important to have a core team in the equation.

Now finally on to applications, because I can see I'm running out of time, unfortunately.

Oh, so... Has anyone got 300 Bitcoin they can lend me?

So here's the IT estate. The Linux stack is actually hosting 16 products comprised of many, many microservices, but with only 16, that's actually the easy bit. The Windows estate: 171 applications hosted on there at the moment, and that's down from 800. Originally, 800 applications on the list.

And we've rationalized this. These are the ones where we said, "Do we need this? Can we decomm this? Can we SaaS that?" After we've done that entire exercise, we're still left with 171 to host. It's a lot of work to do.

So we're going to be migrating them, as you'd imagine. We're going to do the P5s. Going and saying to my boss, "We've never done this before. We're going to do a terrible job." And after he picked himself up off the floor after I admitted I was going to do a terrible job, he said, "Okay, we can do some throwaway prototypes." And that's what we're going to do.

So we'll do a load of P5s, we'll do a bad job, and we'll learn. We'll build some patterns. Then we're going to do some P4s, we're going to do some P3s, we're going to do some P2s, and then by the time we get to P1 readiness, by the time we feel we're confident enough to do a P1, wouldn't it be negligent not to do all of the P1s at that particular phase?

So we do all of the P1s, and then we do all of the P2s, and we do all of the P3s, do all the P4s, P5s, and then I can finally retire.
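The migration order described above can be sketched as a two-phase sort: a learning phase that works up from P5 to P2 with a few pilot applications, then a main phase that works down from P1 to P5 with everything left. The function name, the `pilots_per_priority` parameter, and the sample applications are all hypothetical.

```python
def migration_order(apps, pilots_per_priority=1):
    """Return app names in migration order.

    apps: list of (name, priority) tuples, priority 1 (critical) to 5.
    Phase 1 migrates a few pilots per priority, least important first.
    Phase 2 migrates everything remaining, most important first.
    """
    order = []
    migrated = set()

    # Phase 1: throwaway prototypes and pattern building, P5 up to P2.
    for priority in (5, 4, 3, 2):
        candidates = [name for name, p in apps if p == priority]
        for name in candidates[:pilots_per_priority]:
            order.append(name)
            migrated.add(name)

    # Phase 2: once P1-ready, do all the P1s, then the rest in priority order.
    for priority in (1, 2, 3, 4, 5):
        for name, p in apps:
            if p == priority and name not in migrated:
                order.append(name)
                migrated.add(name)
    return order


sample = [("payroll", 1), ("intranet", 2), ("wiki", 2), ("menus", 5), ("lunch", 5)]
print(migration_order(sample))
```

With the sample data, one P5 and one P2 go first as pilots, then the P1, then the remaining P2 and P5.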

So moving swiftly on. Some would argue that we shouldn't automate everything. Surely, if it doesn't change very often, there's no point automating it. I don't disagree.

So this is frequency of change here, and this is applications over here. This is the Linux estate we have at the moment. It's very fast-moving. We do multiple releases per day. And this is the Windows estate. Lots and lots and lots of applications, and this actually stretches off all the way over here.

But surely the area under this graph here is the total amount of change in this estate, which you could define as this, and surely the area under this bit here is the total amount of change in this area, which you could define as that. And surely they're actually kind of possibly broadly similar.

So you're saying you would automate this, but you wouldn't automate that?

My thinking is actually a lot of the COTS vendors, their reaction to SaaS is going to be, if they can't SaaS their products themselves, is to increase the frequency at which they do releases of their software.

When your business starts coming along and saying, "An annual release of this software isn't good enough for us anymore. We want it quarterly," and all of these jump up by four, are you going to increase the size of your workforce by four times to satisfy that? Or maybe there's some way you could actually make them more productive, maybe through some kind of, ah, automation.
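A rough version of that arithmetic, treating total change as applications multiplied by releases per application per year. The 171 applications and the annual-to-quarterly jump come from the talk; assuming every one of those applications releases exactly once a year today is a simplification for illustration.

```python
def releases_per_year(app_count: int, releases_per_app: int) -> int:
    """Total change in an estate: the 'area under the graph'."""
    return app_count * releases_per_app


WINDOWS_APPS = 171

annual = releases_per_year(WINDOWS_APPS, 1)     # one release per app per year
quarterly = releases_per_year(WINDOWS_APPS, 4)  # vendors move to quarterly

print(annual, quarterly, quarterly // annual)
```

Under those assumptions the estate goes from 171 to 684 releases a year, which is the fourfold increase that would otherwise have to be absorbed by headcount.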

The result, as I'm running out of time, unfortunately.

We're going to have one Common Platform at ITV. It's going to be on-prem and AWS, Windows and Linux, self-documented, with a limited blast radius, higher quality, more frequent change done more safely, and a team of equal or smaller size to save money.

The biggest thing it's going to give us is what I call an off-ramp. Right now we couldn't move to the cloud because we don't know how half of our applications actually work. We can do a massive investment in application archaeology, working out how they actually work under the hood, config managing the entire thing end to end.

So in 2022, which really sounds like the future, we'll actually know how they work, and we can press that button and play that config and move the entire stack into the cloud and finally shut down our data centers.

Gene asked to add a slide at the end: how I need your help. Well, I've never done this before, and I'm hoping there's people in the room that have. I'd really like you to grab me afterwards and tell me how I'm about to screw up.

Thank you very much.

Q&A

We've got a couple of minutes, so anyone got any questions? Yeah. Any thoughts? Any questions? Maybe one or two. Sure.

Q: What did you learn from your Linux migration that's going to help you with the Windows one?

A: Good question. With the Linux migration, actually, we actually built out onto the new platform.

The question was, what did you learn from the Linux migration that you can apply to the Windows one?

Actually, I think the slightly cheaty thing was that there wasn't really a Linux migration. The Linux stuff that we had was all developed onto this new platform we built as part of our modernization program. So actually I think I've done the easy bit, and now the hard bit's coming up.

Q: So you mentioned there obviously the way you structure your team. You've got your core team, and then you've got your engineers sort of within the individual product teams. How do you actually maintain that alignment going forward? How do the individuals you have in the product teams not develop their own delivery identity?

A: So all of the platform engineers at ITV report to me. I think because of that, actually, they all identify as Common Platform first.

One of my team identifies it perfectly. She describes herself as a citizen of the Common Platform, but an ambassador for the platform within that particular product team. So she considers her allegiance ultimately to the platform.

But we get together as a function weekly, and we're in Slack together, and we operate as a family inside Slack. So actually we self-police ourselves.

Right. Thank you. Thank you.