Doubling Down on DevOps: Global Transformation at Ticketmaster

London 2016

SVP, Platform and Technical Operations · Ticketmaster

A view of Ticketmaster’s journey through scaling their DevOps transformation across the enterprise, including what worked well, lessons learned and where we have doubled down for the future.

Full transcript

The complete talk, organized by section.

Justin Dean

Good afternoon, everyone.

First, I just want to say how grateful I am to be here. When I was thinking about the talk and how I think about Ticketmaster now that I'm part of it, I always thought how old we are and the challenges. Then after today, all of that's just kind of blown out of the water, right? I'm sort of like, we have nothing to complain about when there was a company here from 1200. It's like, wow.

So anyways, the talk today is around things that we've invested in around doubling down on DevOps practices and sort of the learnings along the way.

Quick about me. I'm SVP of Platform and Technical Operations at Ticketmaster. I've been there for almost a year and a half. The Ticketmaster Tech Ops team, specifically, is a lean and mean 155 people. That's all of the traditional infrastructure-type stuff: networking, databasing, systems engineering, infrastructure build tools. So, quote unquote, "Ops."

A little history about Ticketmaster. We are a publicly traded company. We do about $7.6 billion in revenue and $25 billion in GTV. The company was founded in 1976. They disrupted the ticketing world by introducing computerized tickets. Then in the '90s, we brought that to the internet and launched ticketmaster.com.

In 2010, we merged with Live Nation, and we joined forces to create the world's largest live entertainment company. Then in 2011, we realized that we needed to really reboot the company and disrupt ourselves before the industry and the markets disrupted us. It seems to be the general theme of most companies who have been around a while: competition is fierce, and you're going to be disrupted, whether you disrupt yourself or whether others disrupt you. So we started our transformational journey.

What does Ticketmaster do? One of the things that I find people don't quite realize is the scale of Ticketmaster, and they just see the website. But we essentially power 26,000 live events each year. So every 20 minutes, there is a live event happening which we are powering: sports, concerts, festivals, theaters, arts, runs the gamut.

We've got about 53 million fans doing about a half a billion ticketing transactions. We also own and operate 167 venues, so think about sporting arenas and places where live events happen. And we have over 12,000 B2B clients where we power their ticketing business.

Our website gets about a billion unique web hits a year, and we are a top-five e-commerce company.

So we have some unique challenges. The way I like to describe it is we are a Fortune 500 company with all the complexities of a business that size and scale, and we're a web-scale company.

We have this phenomenon called an on-sale, which for us is essentially we invite a DDoS to our website all the time. That's how we make money.

So essentially, think about Adele. Adele goes on tour. She wants to sell tickets. She's got 400,000 seats available. There's about 10 million fans that show up to buy those at the same exact second, 9:00 a.m., .001. So it just hits all at once.

And then the other phenomenon that we have is the face value of tickets is lower than the market value of tickets. So we have a huge spike of bot traffic and broker traffic, right, where people are trying to buy them and then go sell them on the secondary market. What it creates for us is an internet load of traffic crushing our sites as part of our practice. It creates a huge cultural dynamic that's part of the fabric of our company to operate that.

Sorry for the terribly long slide here, but to give you an understanding of the complexity that we're dealing with, people say, "Hey, what's your tech stack?" And it's almost impossible to answer because it's everything.

From 1976, we built our ticketing system on top of VAX, and it was a custom application specifically built for ticketing. We still run that today, and we've emulated it, and it still runs in the same fashion with a lot of layers of technology to help modernize the operational aspects of it.

But we have everything from Big Iron sort of filers, and all of our cloud world connects to Big Iron like NetApp-type stuff. We're huge on NFS. We have a huge Xen Cloud that we run. A lot of applications in the 2000s in mod_perl.

We had to write a lot of our tools at scale before they were really a thing in the web world. So think about caching. Everyone takes it for granted. We built it before caching was a real thing. So it's embedded with our business logic, so it makes it really complicated for us to iterate on those things that we've built into the fabric of our business.

Then we have teams that are running more modern applications and immutable infrastructures inside of Docker containers, fully deployed Terraform-style in public clouds and the whole bit. So really it runs the gamut across all technologies.

We've got 22,000 or so servers across multiple data centers. We have 15,000 network endpoints, which is quite a unique challenge. But if you think about every time somebody walks into a venue, the little scanner, boop, lets you in, that has to get authorized somewhere. So that makes it through our network.

Then we have huge growth: our VM count grew 60% in the last year alone. So it's creating quite a problem.

In our products, we have 21 full-scale ticketing businesses within Ticketmaster. We've done a lot of acquisitions over the last 40 years that we've been around, and it's quite a challenge to keep up with. We have 150 unique products. Interesting fact: we have more products than all of our competitors combined, and we have fewer people working on those products per product than all of our competitors do.

So the note here for us is we just have this complexity at scale that is challenging as a business.

Competitive pressure, just like everyone else, I'm sure you have huge market pressures. Last year we grew the business by 12%, and we're tracking towards double-digit growth. It's super hard to grow a large business in those types of numbers.

We are the market leader in the ticketing industry, in the ticketing space. We have huge surface area, which means there's opportunity everywhere on our surface area for smaller companies like a startup to just get 1% of that business, and that could be a sustainable business for them. So we have a whole world of competition trying to get us. We understand that as a business, and we're betting hard on speed and agility.

So we realized we must get faster. How do we think about this problem? To achieve our business goals and our financial goals, we need more market share, we need more revenue, we need better products, we need better features. We need to deliver them faster.

The way in which we try to deliver them faster is by creating autonomous product teams. Essentially, we create the equivalent of a micro-business that has the speed and agility of a small startup competitive business. In order to really power that, we've had to double down on some of our DevOps practices and really simplify the system.

So what did we do? Quick note on our journey. We started where everyone else probably started: super waterfall, the standard chart where you see dev and ops with the wall. The wall was probably like 50 stories high, and waterfall deployments took months. It's quarters.

Then over the last couple of years, we've DevOpsed. We've gone full bore in every direction, read all the books, went to all the conferences, and just started making it happen.

We did a lean transformation across the company, moved everybody from the waterfall style of working to an agile style of working. We've built these cross-functional teams that are about delivering business value. We've automated everything we can. We'll continue to work on that, but we've made it part of the culture that it should be automated.

We've built a ton of build pipelines. We have metrics everywhere. To give an example of the scale of metrics, one of the systems that we use for just time-series data gets 400,000 metrics a second crushing the thing. So we have more metrics than we know what to do with. And we started blurring the lines between dev and ops.

Then we started realizing it feels like there's still some challenges. We started seeing cultural challenges around our on-sale process. So this is the Super Bowl, like every week. It was heavily ops ownership around it, and we built this protective culture of protect the business from themselves.

AKA, things are slow. We're really focused as a company on outputs and projects, and are we tracking towards projects and deadlines, and not really on the actual outcome. Is the needle moving? Are we actually working on the right things? Is it helping?

Teams were still highly siloed, functioned by skill set, super complicated environments, and it was still really common to hear teams that are blocked by something, or there's some friction, or we're waiting on something.

So what we decided to do was look at the core areas that we went out of the gate with on our DevOps practices and our lean transformation and figure out how to scale them up, how to make them better. So essentially, the iteration-two version of some of our practices. So we'll go through each of those.

We started out with cross-functional teams, and this was a big win from where we were. We built 55 teams that were supposedly autonomous and able to focus on delivering some product. So we made these teams of product, development, QA, UX. We didn't actually add any ops into it. So we missed the ops portion of the DevOps movement.

Then what happened was the teams got really fast, and then they got really fast to the point where they needed something done physically, like, I need this software to now be somewhere in an environment, and then insert the problems. So we didn't scale the ops team enough. We didn't scale our ops capabilities.

What happened out of that was we ended up with mega-teams. To combat that problem, the teams would go and they would grab a function like, "Hey, we need some network stuff," or, "Hey, we need some database stuff." They would go grab and add those people into the team. So before you knew it, you had teams that were 30, 40 people, and everybody brought their silos and their function into the team. Cross-functional team, but not sustainable and not what we were going for.

So the huge lesson that we learned as a company was full-stack teams are not the same as fully staffed teams. Full-stack teams are about teams that have the capability to run their business. Fully staffed is everyone brings their silos.

So what we did about it is we doubled down on how do we create autonomous product teams? And we got serious enough to make some pretty substantial org changes.

It's kind of a sensitive topic. But in my opinion, if you're going to do this for real, there's org changes coming.

So what we did was we moved the primary on-sale support, the Super Bowl of our business, to the product teams directly. That was a major thing for us. That's 40 years of history being done by ops, and the value of ops is on that specific function, and moving it to the product team. So that got all the product teams involved very connected to how their business is actually performing and running, and then making sure that they have things in their roadmap to make that process better.

We moved our application support teams out of Tech Ops into the product teams they support directly. So the team that operates our core ticketing system used to be in ops, and you had that huge wall. So we moved that team directly into the product team and instantly DevOpsed it.

I use that sort of jokingly, because obviously it doesn't instantly DevOps, but it removed 90% of the friction immediately. So you had this wall of like, "Oh, you can't have access," or whatever, and all that's just gone. Then they have a shared vision of what is important, not ops protecting them from themselves.

We've also changed the expectations around systems engineering and what that means. Systems engineers are one of those ops teams for us that's the most depended upon. So we changed them from being the team where you go and ask for things: we deployed them out, embedded them into product teams directly, and made their function in life being the -ility partners.

What I mean by that is they are the people who help product teams with their scalability, reliability, deployability, repeatability. They are an advisor sitting in the passenger seat, not the driver's seat. So it is a huge fundamental shift.

And self-service tools. We went big on self-service tools. We essentially look at all tickets coming in anywhere, and if there's more than a few of them, it's a problem. Culturally, we got really rigorous about saying self-service it, period. Invest in it to self-service it and get it done. That's just paying off hugely for us.
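As a minimal sketch of the "more than a few tickets is a problem" rule described above, the snippet below counts incoming tickets by request type and flags the types that recur often enough to be worth turning into a self-service tool. The `request_type` field, the threshold of 3, and the sample tickets are all assumptions made for illustration; the talk does not describe the actual ticket schema or cutoff.

```python
from collections import Counter


def self_service_candidates(tickets, threshold=3):
    """Return (request_type, count) pairs seen more than `threshold` times.

    `tickets` is a list of dicts with a hypothetical "request_type" key.
    Results are ordered from most to least frequent.
    """
    counts = Counter(t["request_type"] for t in tickets)
    return [(kind, n) for kind, n in counts.most_common() if n > threshold]


# Hypothetical sample data: five DNS requests, four access requests, one one-off.
tickets = (
    [{"request_type": "dns_change"}] * 5
    + [{"request_type": "grant_access"}] * 4
    + [{"request_type": "rack_move"}]
)

candidates = self_service_candidates(tickets)
print(candidates)
```

Anything that shows up in the result is a candidate for automation; the one-off request stays below the threshold and is left alone.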

Essentially what we've created is self-service businesses. These autonomous teams, now they are expected to, and they do, have the capabilities to build it, to run it, to own it, to optimize it, to monetize it. So we've essentially created micro-businesses.

Another area we looked at was our knowledge sharing. We started seeing a lot of problems with the complexities of our environment. Everyone brings in their silos and their knowledge base. The product team has different experiences and knowledge they bring to the table versus the operational team, and they just weren't really aligning well.

We saw lots of problems with, we have no way to structure the learnings. How do we teach somebody how to do things?

So what we realized is our teams were empowered. We're saying you have the ability to do what you need to run your business, but they just weren't enabled. They were missing either the access or the knowledge, or they were stuck on the complexity.

So we sort of built up the culture around every ticket is an error condition. Meaning, if you have to submit a ticket somewhere, that's a fail. That means either something's not self-serviced or you don't have the knowledge or the access in your team internally, and that's a problem.

So what we did was we invested heavily in a DevOps cross-training program, which was really structured with the curriculum around teaching people how to deploy and manage their software in prod.

I won't go through all the curriculum. You can see it in the slides. But all the systems, all the build tools, all the monitoring tools, PagerDuty, so that you can now be on call for your product. Yay.

Then we created this pretty elaborate curriculum. We sent all of the engineers to Hollywood, our headquarters, where we administer the course for a solid week of in-depth training. At the end of it, what they left with was the access. They had the right levels of privileges to do their job and to deploy their software and to run it, and they had the knowledge of how to use it.

What we got out of that was we've trained 250 engineers thus far to be self-sufficient. They've brought those capabilities to their teams. So now their teams are enabled and empowered to do what they need to do without asking outside of the team. So they have the keys to the car, and they know how to drive it.

We're seeing less time waiting on operations. So we're 3X faster. And product teams are now fully on call. So we have hundreds of engineers that are literally on call now, and ops is not on call.

A huge, I don't know if I'd call it a side benefit, but somewhat of a side benefit, is the graph showing the MTTR. So we went from an average incident response of 47 minutes to being able to resolve things in three minutes. Turns out, when the person who wrote it and understands how it works gets paged, they can fix it faster than a group of people who don't understand it.
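For readers less familiar with the metric, MTTR (mean time to resolve) is just the average of resolution time minus open time across incidents. The sketch below shows that calculation; the incident timestamps are made up for illustration and are not Ticketmaster data, and the talk does not say how its own figure was computed.

```python
from datetime import datetime, timedelta


def mttr_minutes(incidents):
    """Mean time to resolve, in minutes, over (opened, resolved) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total.total_seconds() / 60 / len(incidents)


# Hypothetical incidents: one resolved in 2 minutes, one in 4 minutes.
start = datetime(2016, 6, 1, 9, 0)
incidents = [
    (start, start + timedelta(minutes=2)),
    (start, start + timedelta(minutes=4)),
]

average = mttr_minutes(incidents)
print(f"MTTR: {average:.1f} minutes")
```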

Then we also looked heavily around... I didn't know what to call this, but ultimately what we started to see was we were talking the DevOps talk a lot. Everybody spoke the language, everyone knew the lingo, but it felt like we were starting to slow down in our returns.

What we started to see was everyone was doing more deployments than they've ever done, faster than ever, but there was a high degree of variance. Everyone's technical debt was not really decreasing. Continuous delivery wasn't happening as fast as we thought it was.

Ultimately what we saw was muscle memory. People were bringing in old habits and just DevOps-washing it, calling it something DevOps. "Yeah, we do the DevOps," and then things like, "We do our own releases," and then you find out actually it's one dude who does that for you, and it's manual. So once you start diving in a little bit, it's like, we're talking the talk, but we need to find a way to quantify it a little better.

So the lesson for us was we have to define and measure performance objectively. We need to quantify the DevOps.

So we spent a little time to figure out what's the scope of the problem, and what we realized was the complexity of our software factory was costing us significantly. We sent out some surveys and everything, and we figured out that about 50% to 70% of our time is spent cranking the dials on the system, moving bits around, non-value-add work to just get things through the system in some way.

So we're saying that you can boil that down to almost $100 million of extra innovation time that we could unlock by just solving some of those problems, along with the benefit of just getting faster as a team and as a company.

We found we had 150 custom-built ways to release products, often manually. We were really low on our maturity around key capabilities like monitoring, alerting, testing. Fifty percent of our incidents were caused by issues that should've been detected before they went to production. And we were just spending a lot of time reinventing wheels that don't need to be invented again.

So we invested in what we're calling maturity models. We've built two maturity models thus far: tech maturity and process maturity.

Tech maturity model is a model that our chief architect actually developed. It's essentially a way for us to assess about 50 capabilities within the dimensions of coding, building and testing, releasing, operating, and optimizing, to really understand a product's maturity objectively and give it a score.

Ultimately what we get out of that is we get a non-biased view of what is the actual maturity on the things that we care about in the sort of DevOps umbrella. So we know where we're at, and it provides a clear roadmap of where we need to go. And you get out of the verbal game of asking teams about their performance. Everyone thinks they're doing the DevOps or their performance is high until you look at it and see where you actually are.
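To make the idea concrete, here is a minimal sketch of how a score of this kind could be computed. It assumes each capability is graded on the four levels mentioned in the talk, that a dimension score is the plain average of its capabilities, and that the product score is the plain average of the dimensions. The averaging scheme, the capability names, and the sample grades are assumptions for illustration; the talk does not describe the real model's weighting.

```python
from statistics import mean

# The five dimensions named in the talk.
DIMENSIONS = ("code", "build_and_test", "release", "operate", "optimize")


def score_product(capabilities):
    """Score one product.

    `capabilities` maps each dimension to {capability_name: level},
    where level is an integer from 1 (least mature) to 4 (most mature).
    Returns (per_dimension_scores, overall_score).
    """
    per_dimension = {}
    for dim in DIMENSIONS:
        levels = list(capabilities[dim].values())
        if not levels or any(level not in (1, 2, 3, 4) for level in levels):
            raise ValueError(f"bad levels for dimension {dim!r}")
        per_dimension[dim] = mean(levels)
    overall = mean(list(per_dimension.values()))
    return per_dimension, overall


# Hypothetical assessment of a single product.
assessment = {
    "code": {"test_suite": 3, "logging_and_metrics": 3},
    "build_and_test": {"ci_pipeline": 2},
    "release": {"automated_release": 1},
    "operate": {"monitoring": 2, "alerting": 2},
    "optimize": {"feedback_loop": 2},
}

per_dimension, overall = score_product(assessment)
weakest = min(per_dimension, key=per_dimension.get)
print(per_dimension, overall, weakest)
```

The lowest-scoring dimension is the natural place for a team to start its roadmap, and sorting products by the overall score gives the kind of ranked view mentioned later in the talk.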

We released techmaturity.com internally for right now. We've made this extremely visible within the company, and we report on it, and everybody's aware of everyone else's performance. So we've got about 350 or so products that we've assessed, and their score is visible.

You can quantify the technical debt aspects of a product and actually put it on par with business features, so that you don't only feature, feature, feature, and you ignore tech. This is a quantifiable way for our teams to get funding to raise their maturity in languages that the business can speak.

Here's a snippet of what it looks like. This is the average of all of our teams across the board, or all of our products across the board. What we've learned from this is we're pretty... First of all, what we learned is we're not mature from the model in which we're grading by. There's four different levels. We're in the two-ish range. Level four is where we're aspiring to go.

Then you can see each of those dimensions there. It probably might be too small to see, but you essentially get what you measure.

Interestingly enough, one of our highest scores is around our DevOps practices. So we've been preaching it for the last couple of years in the company, and it's working in terms of our maturity level to some degree. But it gives us a clear roadmap on all of the other capabilities that we need to improve.

So what we found is we're pretty good at coding, we're a little not that great at build and test, and release and operate, and then we're decent at coming back and optimizing. So every team has a clear roadmap of where they need to go.

This is a little snippet of a little bit more detail of what those capabilities look like. Within the coding dimension, inside of there, you can see capabilities like what does your test suite look like? Do you get feedback and requirements, and logging and metrics? How mature are they? Do you do backwards and forwards compatibility? How well are you doing code reuse?

So you can see that it's fairly robust, and definitely a quantifiable way to look at your maturity level across your products. I won't go through reading all of these, but the team loves it.

We thought when we rolled it out, they would feel like, whoa, I'm being graded, like that's not cool. And really, it gives them the ammo that they need to go and get investment for maturing their products.

Then the second model that we built was a process maturity model. Really what this is, is team performance. Whereas the tech maturity is around the product performance, this is around the team performance. How well are we working as a team, and are we becoming a high-performance team?

So it has the same type of format, where there's a bunch of dimensions that we measure. The thing that we found was, number one, it's great to be able to quantify the team maturity levels, but really, when you get the team together and you assess them together, they understand where the gaps are as a whole, and they want to improve it.

Here's some of the elements that we look at. Is the team getting better, not just doing agile for agile's sake? Are they all aware of the strategic roadmap and the assets that they're building, and are they staying the course of that? Are they getting better at their inceptions and workshops? Are they clear about roles and responsibilities within the team? How do they measure success? Backlog grooming. There's a whole list of things that we're looking at in there.

But ultimately what we're trying to do is just a clear, quantifiable roadmap for people to understand where are we and what do we need to do to get better. And then ultimately what we want is teams that aspire to get there themselves, to level themselves up.

So imagine the culture difference when this is clearly displayed on a screen and you have a sorting function by score. You can see quickly that teams are not going to want to be in that bottom percent.

So all that to say is we're leveraging models and scoring a lot. We're seeing a lot of value out of them. Scoreboards don't lie. We're working on a risk assessment model, technology cost optimization model, model optimization model. Just kidding.

We're going to open source this really soon. We need to clean it up, make it publicly available. You can ping me directly or Alex Hasbun, who's our chief architect, who initially architected this, and then we'll get you involved, and we would love to scale it out more and get it out there in the community.

Another area that we looked at in the business was what we were calling our lean transformation. Essentially, this is our value alignment system.

So what we learned as a company is we can do anything, but we can't do everything. And we always try to do everything. That's just natural.

So as a company, we created a roadmap where we said, "These are the core few handful of things we're going to actually work on. These are the outcomes that we want." So we created this whole portfolio management system of the way we invest in our resources and the outcomes that we expect to get. Every team and every team member fits somewhere within this portfolio.

What we started to find was, it brought along some good things, but what we started to see was sort of like, I don't know if you guys have ever seen this commercial, the Facebook wall, where the old lady's saying she's going to unfriend people, and she's got a physical wall with pictures on there or whatever. It's like, you're just doing this wrong.

So we started looking at this and looking at some of the stats, and it was like, we're kind of doing this wrong. The purpose of this is to really get focused on delivering the value of the things we think are important as a company. When I looked at things like this one team has 66 FTEs, and they have 65 projects in flight, I'm like, that's not a good one-to-one project-to-person ratio. We're doing it wrong.

We were bringing a lot of our old structures and processes into the system, and we were doing a lot of value tagging. Let's say we had this strategic initiative to increase our stability. We would just say everything is in the name of stability, and just put a tag. Equals stability, oh, it gets funded.

So essentially what we built was a giant WIP machine, work-in-progress machine. The huge lesson that we had around that was around outcomes and not output.

The way we're attacking this is through what we're calling value-driven delivery. If you think about test-driven development, TDD, you build the test case, and then you do something to make the test case green. We're taking the same concept, and we've applied it to our business outcomes.

These are our business dashboards. This is literally one of my dashboards. And we build the test case as a graph of a business outcome that's live. So in order for somebody to do work, they need to move that dial on that. They need to turn that graph green, essentially. So we essentially are defining the goal upfront, and it's measured. So it's very clear if you're winning or not.
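The analogy to test-driven development can be sketched in code: define the target for a business metric before the work starts, then check the live value against it, the same way a unit test goes from red to green. The class name, the metric names, and the target values below are hypothetical and only illustrate the pattern; the talk does not describe how the dashboards are actually implemented.

```python
from dataclasses import dataclass


@dataclass
class OutcomeTest:
    """A business outcome with a target defined up front (hypothetical model)."""

    name: str
    target: float
    higher_is_better: bool = True

    def is_green(self, current):
        """True when the live metric value meets the target."""
        if self.higher_is_better:
            return current >= self.target
        return current <= self.target


# Hypothetical outcomes, written before any work begins.
conversion = OutcomeTest("checkout_conversion_rate", target=0.05)
latency = OutcomeTest("p95_checkout_latency_ms", target=800, higher_is_better=False)

# Hypothetical live readings pulled from a dashboard.
status = {
    conversion.name: conversion.is_green(0.043),
    latency.name: latency.is_green(650),
}
print(status)
```

Work is "done" only when the outcome turns green, not when the project ships, which is the outcome-versus-output distinction made earlier in the talk.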

We're moving the whole culture towards having the scoreboards and having one of these scoreboards for every team and one for every product, to treat our products like a business. A P&L sheet. Is it working? And having all the dynamics that we care about within just the simple version of a dashboard that shows every product as a business and the velocity and the maturity.

So imagine the case where if your product is not mature, your team velocities are really low product scores, your risk assessment is high, and you're costing a lot. That might bring up some conversations. So it gets really quantifiable about the reality and the state of the union of some of these products.

One of the things we've invested a lot of time and energy in is reshaping our leadership culture. Getting people comfortable with iterating on your org. That's an uncomfortable space for a lot of companies who have been around a while. You have leaders who have been in the company for a long time, and they have a certain version or vision of something.

So getting comfortable making org changes and being dynamic, we've been baking that into the culture, and we're evolving that. It's not easy by any means.

A quote here from my boss, Jody Mulkey: "People don't create value up to their boss. They create value towards the customer." Why I put that in there is because in a lot of cases what we're doing is taking people out of your traditional functional organization and moving them into a different team. But then maybe you're still the functional manager.

So a functional manager has to get used to, I don't manage the tasks. They're there to make that function better. If you're a systems engineering leader, your job is to make systems engineers better. Your job is not to manage the actual work that they're doing. The team where they add value helps manage their workflow.

We're creating the culture around dashboards versus words to create a move-the-needle culture. So ideally, we're going to get to the point where we don't allow text-based emails to send a status. Send the graph. TL;DR: graph.

So we're creating this culture of total transparency. We're putting all of our data, democratizing our data, and making it available in our faces: our product inventories, our business dashboards, our maturity scores, our costs, our team and product performance.

What that gives us is that visibility and transparency puts our leaders in the position to where they truly become the CEOs of their products. Is your line of business working? And you have all of the tools, all the transparency to know.

We're continuing on this. We obviously haven't solved everything. Just like everyone, we have a lot of room for growth and improvement here. But we've been able to create these product teams, and essentially, now we have the tools and the visibility into where we need to go. So now it's a matter of leveling them up.

As mentioned, we definitely haven't figured some of this stuff out. Areas of interest that I would love to get feedback on are leadership cultural shifts and the roles of leaders, especially functional leaders, as the future changes what is needed. What does that role look like? What does the day-to-day, what's the job description look like?

The other big topic, I think, for us that's interesting is how do you mix diverse skill sets at scale? Specifically because we're a 40-year-old company, which is kind of new compared to some others. But we have people who have been industry veterans that have been doing ticketing in the ticketing business. It's complex. They understand it, but maybe they're missing the modern web-scale practices.

So then we're mixing a lot of people who come from dot-coms and come from web-scale environments who have no industry knowledge of ticketing or these other systems. So we just have this epic collide of multiple cultures who don't speak the same language on a lot of things, but we need to collaborate and work together and figure out how to solve the problem and create the right shared vision. It's been extremely challenging.

That is it for me. Thank you very much for listening.