San Francisco 2014

Can Other Enterprises Operate Like Netflix Does? A conversation with Netflix

Netflix is often referred to as a prime example of a “DevOps Unicorn”. They are nimble. They are successful. They make tools that inspire envy. They have an organizational structure and way of working that defies industry conventions and rules.


But why does any of this work? Can we all get our organizations to feel more like a Netflix or are they just different?


In this session, Damon Edwards will be interviewing Netflix’s Dianne Marsh and Roy Rapoport about the principles behind how and why Netflix works the way it does.

Full transcript

The complete talk — auto-generated from the talk's captions.

So first of all, this is not a panel. Okay? This is a conversation. I know panels are the dreaded thing.

So Gene gave me a marching order, and he said, "Go find me a unicorn, buddy, and let's do an autopsy on stage." So I went and grabbed Roy, who I've seen do some great talks on explaining sort of not just how Netflix works, but why Netflix works. And then through Roy, got to Dianne and put together, I think, a good insight into the Netflix way. And that's kind of what the whole point of this talk is, is that we talk a lot about unicorns and horses. And we see Netflix being held up as this example of Netflix is a unicorn, right?

And people tend to think of one of two ways. Either it's kind of dismissive, like, "Oh, well, it's easy. If only we were a streaming movie company, or if we started in the last few years." Or they say, "Hey, let's just copy everything they do." But I think what kind of gets lost in the conversation is that unicorns are just horses that think differently. They fundamentally see the world differently, they think about the world differently, and then all these choices kind of flow from that.

So we want to talk a little bit about that thinking today. But first, can you guys just kind of give a quick introduction of yourselves to the audience? Dianne, I know you run all the engineering tools. Roy, you're the metrics guy.

But let's give them something more formal than that. Okay. I'm Dianne Marsh. I lead the engineering tools team.

That means that we're responsible for getting the code from the developers' desktops out, deployed into the cloud. It means we're building the continuous delivery platform at Netflix. I'm Roy. I manage insight engineering at Netflix, which is the group responsible for building the real-time operational insight systems at Netflix.

At minimum, that means when things go bump in the night, our system should be telling people about that, and ideally, actually, our systems actually automatically remediate and don't tell anybody about that. Great. So, first of all, let's get this whole horse versus unicorn thing, right? I feel like when you actually look at what Netflix does, it's not just some little streaming movie company.

This is big business. Give me some numbers. What are we talking about here in terms of the size and scale of what Netflix actually does? So what, we're about 22 billion market cap at this point, about $5 billion a year in revenues.

53 or thereabouts million customers, which of course also means about 53 million credit card numbers to... Actually, anybody here a Netflix subscriber? So, thanks for paying our salaries. So, 53 million credit card numbers to keep safe and to- Right ...

comply with things like PCI Level 1 audits, and SOX, and all of that stuff. We're like a real public company, and we actually operate like a real public company. So from your perspective, you feel like a horse? I don't usually describe myself that way.

Of course. That's a good point. All right, so let's talk about, that I think you see Reed Hastings, your CEO, he talks about the Netflix culture, and I think a lot of people hear culture, especially if they've been working in large enterprises for a while, it's kind of the thing you just sort of gloss over, like everybody has a mission statement, and everyone's got these values. But Netflix, I find it very interesting that these two words you keep coming back to, which is freedom and responsibility.

And it sort of permeates through the company, and I think it gives a very different flavor and a very different way of working when you actually talk to folks who work in Netflix or you see how Netflix works. So can you talk about those two words, freedom and responsibility? Like, in a nutshell, how would you explain to somebody from the outside seeing Netflix operate for the first time, what does that mean? So, I think that what it means to me is that we all at Netflix, all the employees, have the freedom to make decisions, but the responsibility to make sure that those are good decisions for the company, for our teams, for our subscribers.

And so it means that we don't have a lot of rules. We really depend on each of us to be making those decisions every single day. And decisions range all over the map, from what technologies we use to how we do expenses. So one of the things to note is that we don't actually have an expense approval process.

So as a manager, I don't approve my team's expenses. They just put them in there. I do get an email about them, but I don't have to approve them. They just get automatically approved.

We expect that our teams are using good judgment about how to spend Netflix money and that they're making good choices about the money that they spend. Yeah, I think for me it's about speed of innovation, right? So we say when we believe that fundamentally what's going to make us successful is maximizing speed of innovation, and I think that to maximize speed of innovation, you really have to decentralize innovation, which means also decentralize decision-making. Let's try this thing, right?

So if you make it so everybody in the company can make that let's try this thing- Mm-hmm ... kind of decision, that ends up meaning that you can essentially get a lot more sort of shots on goal. Right. So let's talk what that actually looks like, right?

So, I think everyone's got the same goal, right? We're all here. We all want to move faster. We all want more shots on goal, which is the kind of proven way to, year after year, get more out of an organization.

But what does it actually look like when you say decentralize, when you say give people freedom? Let's talk about the freedom side first, right? What does freedom actually mean? What does it feel like to work in that kind of organization?

What's a day-to-day world look like for a feature trying to make its way through the world to production? I'll brag for Roy a little bit because I don't know that he'll do it himself. So, when Roy was an individual contributor at Netflix, before he was a manager, he needed to make one of those features come to life. So he needed to put a service out in the cloud.

So he was a really great Python programmer and decided that he would rather write this in Python than in Java. And a lot of the company has a lot of support for Java. Our platform is written in Java, and so if you write in Java or JVM-based languages, you have a lot of support. But Roy is effective in Python, so what he decided to do was to write his service in Python.

That meant that he took the responsibility for building up the platform components that he needed in order to support his Python service being a good Netflix citizen. Did I characterize that correctly? I would probably say that I was just not a very good Java developer, but sure. But if you want to talk about freedom, I would say that the thing that's important to note here is I was brand new to the engineering organization.

I had to solve this problem. I decided to do it in Python. And for about a month between the time that I started talking about doing it in Python and when it was actually delivered, I had weekly one-on-ones with my boss, and he'd be like, "Python? Are you sure?

Are you sure that's the right thing?" So the thing about freedom as an engineer is that it doesn't mean your manager is always going to agree with you. But there's this concept of the override bar, which is how much do you have to disagree with an engineer before you actually say no? For me, for example, I would say that I have to know that I'm right, and I have to know that if the engineer is wrong, it's going to be significantly costly. That's pretty much never happened in the two years that I've managed my team, which means that part of the thing about freedom and responsibility for engineers, for example, at Netflix is you get to make the decisions even when your manager disagrees with you.

And I've been wrong when I've disagreed with my managers a whole bunch of times. It actually gives me a certain degree of freedom as a manager when I know that disagreeing with my engineer doesn't mean that they're not going to do what they think is right. So let's talk about it. You used the word management several times there, right?

Manager, right? I think some people get this assumption that freedom means chaos, freedom means do what you want whenever you want and there's no guiding force. People go, "Well, how is that actually going to scale?" Right? But there is a strong sense of management, and there's a strong sense of direction and mission to what Netflix does.

How does Netflix do that? What is the role of management kind of flowing down through the chain? So as managers, we have two big responsibilities. The first one is to attract and retain great talent.

So to talk to people like all of you and really hire great people and bring them into our teams and build a team, not just a group of individuals that are independently working on something. And our second responsibility is to provide context to our teams about what the rest of the company's doing, and then to take context from our teams and provide that out to the rest of the company. And so that's how we don't have chaos because we're spending a lot of time building this context. We're also hiring really great people who bring their own sense of responsibility into their work every single day.

But then, a key role of management, or at least in theory, is to define the structure of the organization. To say, "Here's how the organization needs to look, and here's the reporting structures, and here's how we divide up the work." How does that work at Netflix? Because I hear a lot about... Now, Adrian Cockcroft is here, I'm sure, somewhere, your former colleague.

He did a lot of pounding on the podium saying that it's not about DevOps, it's about, he calls it "NoOps." Right? Because he says, and to get to the bottom of it, he's not saying there is no operations work being done, but he's saying, "I don't need all these organizations. I need empowered teams of smart engineers who can go and get things done, and in a self-service model, pull the things they need to get that done and let them get on their way." So how does Netflix, from a management perspective, define organizational structure? Or how do you decide who goes where and who reports to who?

So I think we define organizational structure just like I think everybody else does, which is every once in a while, we talk and see if things make sense. I think maybe the worthwhile thing to note here is that we just don't pay that much attention to it. So, that means that my team is my team. They report to me.

I set salary, do all of that fun stuff. Their team meetings are with me. But, for example, while Dianne's team is responsible for deployment automation and my team is responsible for insight engineering, I had an engineer who decided that the right thing for us to do was to have somebody within insight engineering build our own independent deployment system because we had some special problems that the general purpose system wasn't solving. In a lot of other companies, "Well, that's not your job.

You can't do that." At Netflix, it's like, "Well, okay. If that's the right thing for you to do, then go ahead and do it." And you did provide context to me- Yeah ... that you were planning to do this, and I said, "That sounds great. Roy's going to be off doing some experimentation.

He's going to build something that works great for his team. The rest of the company doesn't need it today. They may never need it, but we're going to learn a heck of a lot about what that system is that Roy's building, and if it does make sense for the rest of the company, we'll fold in the things that do ultimately." And there's no contentious arrangement between us. We weren't- Yeah ...

feeling at all like a conflict. Yeah, so structure is more contextual than control-based. Interesting. So, when talking about control and the responsibility side of that, it definitely feels like there's this you build it, you run it, type of ethos there.

Or I guess actually in reality, that's how it works. I know you see a lot of kind of classic organizations, everybody bristles at that, right? Developers don't want to wear the pager because they don't want to be responsible for everyone else's problems. Operations side doesn't want to give up that control because they're afraid everyone's going to blow up the business.

You're Sarbanes-Oxley, PCI, report to Wall Street, big business, lots of credit cards. How do you allow developers to manage the things in production that they built? Thoughtfully. Okay. Here's the thing.

Developers don't want to be responsible for other people's problems, of course. But in the end, if they don't want to be responsible for their own code, I would argue that they probably shouldn't be working at Netflix. And as somebody who was a developer, I would say that obviously if I write code, I'm responsible for whether or not it works. And irrespective of whether or not I was at Netflix, I would say I've always been responsible for my code.

We tend to hire people who are interested in that particular thing. Now, as an ex-ops guy, as an ex-IT guy, I would also say that frankly, this whole idea that operations wants to hold onto production because developers will break it, which God knows is very emotionally attractive to ops people. The truth is, we set up these systems where the ops people have to push stuff to production because of this whole separation of responsibilities thing. Mm-hmm.

But it's not like your ops people actually know most of the time what they're actually going to be pushing, right? Mm-hmm. So you're really creating this, I would argue, sort of almost security theater game of having this separation of responsibilities, but really ops people are just going to do what the change control ticket, if you use that kind of thing, does. And gosh, if it breaks, right?

The person who pushed it, this poor ops person pushed this code out to production, it broke. They're not typically in a position to be able to fix it. Right? The developer who pushes the code out to production is taking on the responsibility to push it out there, but also is the one who's in the best position to fix it quickly, right?

So it feels to me like there's this disconnect between, let's keep the code close to the person who wrote it and the operation of that code, because that seems less risky to me than having some poor, uninvolved person push that out to production and then have to figure out what happens when everything blows up. Right. Now, Roy, you had a presentation that opened my eyes. It was, I think, Actionable Metrics, it was called.

It's a great presentation if you guys want to Google it. But what struck me is one of the things that seems like such the classic operations tasks, monitoring, right? Yeah. It's like, hey, it's a thing.

It happens with documentation and some deployment automation at the end. The afterthought, right? Gets thrown over the wall. Someone's got to figure out some monitoring.

And you were talking about how coming from a traditional enterprise two-tiered background, it was a mindset change. I'm putting words, I'm paraphrasing here, but you kind of became a service provider. You created a metrics service that you had to convince other people in the organization to use, otherwise they would go and build their own, and it involved getting the developers libraries. So basically, it was driven by developers, and you turned yourself into a service provider.

And I kind of see that model as the operations as a service, right? That operations is not going away, it's operations is becoming embedded with the teams that do the work. And operations is becoming tool smiths and service providers to give people what they need to get their job done. And you used a word there, you said, "And my job is to get the hell out of a developer's way." Right.

Can you talk a little bit about that and sort of how you approach doing something that seems like a classic operations task, like monitoring and metrics? So this is actually something that I'm deeply passionate about, and I think what it comes down to is a concept that we don't talk a lot about, which is operations engineering. Actually being thoughtful and conscientious about engineering your operations processes. And in fact, this is particularly important to us because Dianne's team and my team are two of the four teams that make up an organization within Netflix called Operations Engineering.

Mm-hmm. So there are practices and the day-to-day tasks, which are less interesting to me, and there are all the processes and tools and systems that you can build to engineer having these things be done well. And I think of that as essentially the operations engineering discipline. I'd love for us to be talking more as an industry about operations engineering.

So for us, this is really at the core. We don't really think of ourselves as an ops group, but we do very much think of ourselves as a group that is engineering the operations processes and systems at Netflix. Yeah. So, I guess just to drive that one step further, my team builds this continuous delivery platform, right?

We don't actually deploy anybody's code. Right? Roy's team builds the insight platform, but he doesn't actually monitor anybody. He doesn't write the monitoring code for anybody else's system.

They add their own insight monitoring- Right ... into their own services. Well, it's interesting. You said also though, if those teams find that that doesn't meet their needs or not meeting their needs, they have the freedom to either build their own or go buy it somewhere else.

Yeah. Again, we created our own deployment system. Mm-hmm. Though I'm hoping that by around Q1 or Q2 of next year we'll start using Dianne's deployment systems again.

And there are teams who have built specialized insight systems to solve their particular cases, because our general purpose insight systems weren't sufficient for them. Generally speaking, as a service provider at Netflix, the general sort of rule of thumb is if one team needs it, it's probably not something that we're all that interested in building. If more than two or three teams need it, that's probably something that's worth looking at as a general purpose tool. Interesting.

And so then- And probably a little bit eventually consistent- Yeah ... on a lot of tools. Yeah. Right.

So. Great. So I don't want to hog all of the time here. So if anybody has any questions in the audience, I'm going to keep going here, but just raise your hand and you can ask a question as well.

So, one of the things I'd heard ... Is that a hand? Where's there a hand? There's a hand over there.

Oh, yeah. Stand up, yell it, and I'll repeat it. So when trying to get these development and operations groups to work together, what do you do in businesses like finance or healthcare, where there are legal reasons to keep developers out of the operational environment? Okay, so it's about compliance; finance and healthcare were the examples.

How do you get this kind of freedom going when those regulations are there? Now, you were just saying how PCI and SOX, those are scary things as well. Yeah. So I would say I'm not a compliance guy. And I've thankfully worked in the finance industry for a very short time and never in the healthcare industry.

But I would say that what I would start with is really fantastic auditors and a great relationship with your compliance and auditing teams. I've worked with our compliance team within Netflix and our internal auditors at least, and we seem to be able to have conversations with them that result in reasonably positive outcomes. Mm-hmm. I suspect people come to a lot of those conversations, and gosh knows I did before, with a lot of preconceptions about how things should be looking around things like, for example, separation of responsibilities.

And I remember talking to a senior auditor at Deloitte who said, "Yeah, separation of responsibilities is not actually mandated in anything," but people seem to sort of gravitate to that as the one true way. So I would say I would start with maybe more open-ended conversations. Right. So taking a little bit more implementation detail, we've segmented pieces of the business that have to fall in categories that have compliance associated with them.

And so while any developer at Netflix can deploy our service to production, only a more constrained number of developers can get their fingers in the pieces that fall under compliance. And so what we found is that there are ways to separate out our microservices so that not all of them fall under those same concerns, and by separating them out, we're able to do more than we would if we just saw it as one big monolith. And I would add maybe two things to that. One, the parts that are under SOX or PCI are a trivially small part of our environment.

And two, and this is a mistake I've seen a bunch of different companies make, unlike a lot of other companies, we use the lowest level of control for any given part of our environment, and we don't try to have the same level of control across the board. I've seen companies that basically have parts that are SOX compliant, and so they use the same level of structure and control across their entire environment because it's easier and simpler to have just like, this is how we do all systems. We segregate those systems pretty ruthlessly, and we don't manage them like we manage other non-segregated systems. Got it.
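The segmentation described here can be sketched in code. This is only a minimal illustration, not Netflix's actual tooling: the service names, scope labels, and user names are all hypothetical, and the only idea taken from the conversation is that most services are open to any developer while a small set of compliance-scoped services has a restricted deployer list.

```python
from enum import Enum


class Scope(Enum):
    """Hypothetical compliance scopes; GENERAL means no special controls."""
    GENERAL = "general"
    PCI = "pci"
    SOX = "sox"


# Hypothetical mapping of services to the lowest control level that applies.
SERVICE_SCOPE = {
    "recommendations": Scope.GENERAL,
    "playback": Scope.GENERAL,
    "billing": Scope.PCI,
    "financial-reporting": Scope.SOX,
}

# Hypothetical allow-lists for the small set of restricted scopes.
RESTRICTED_DEPLOYERS = {
    Scope.PCI: {"alice"},
    Scope.SOX: {"bob"},
}


def can_deploy(user: str, service: str) -> bool:
    """Any developer may deploy a GENERAL service; restricted scopes
    require the user to be on that scope's allow-list."""
    scope = SERVICE_SCOPE[service]
    if scope is Scope.GENERAL:
        return True
    return user in RESTRICTED_DEPLOYERS.get(scope, set())
```

The point of the sketch is that the control is attached per service rather than applied uniformly, so adding a SOX-scoped service does not tighten the rules for everything else.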

There are lights in our face, so you have to raise your hand real high if you want. Jump up and down. You want to jump up and down. There we go.

Get it. So as you, at Netflix, foster the ability for different groups to invest in tools and infrastructure based on their needs, how do you keep an eye on the aggregated amount of resources you think are being applied to all these tools, like company-wide, so you can kind of balance what you centralize versus what you allow very well? That's a good question. So the question is about if everyone's making these choices about their own investments, how does Netflix manage the bottom line across the company if everyone's got the freedom and responsibility, the freedom to do whatever they want to do?

So I guess that's an interesting way to look at how do you look at cost, right? Because you guys are all on Amazon, which has been proven that unless you're really smart about it, it's not necessarily the cheapest. You build your own tools, which is definitely not the cheapest. You've stated on record that you make sure you're always paying market rate or above for any talent because you want the best talent.

That doesn't sound like you guys are doing a great job of controlling costs. So how do you think about it, right? And it's a public company, so I'm sure everybody wants to control costs, right? So to be clear- From the macro perspective ...

we don't pay above market, we pay top of market. Top of market, sorry. But also, we don't depend on premature optimization. So we're not prematurely going out and layering a control there that says, "You've built X system and you've built X system.

Let's all get in a room together and decide who's going to own it." We do let those experiments happen. Mm-hmm. And we feel like having those experiments happen provides us with a lot of information that we wouldn't otherwise get if just one team were looking at it. Mm-hmm.

And the responsibility piece comes into that. So if you're building something, figure out who else in the organization is doing something and collaborate with them. Right. So, two things.

One is a number, the other one is an observation. The number is $1.8 million, which is the revenue per employee, I think, in 2013 for Netflix. Okay. That's about 50% higher than Facebook.

It's about 50% higher than Google. I'm pretty pleased with that number, and I think it suggests that this sort of decentralized approach works reasonably well- Right ... for keeping your business going. The other observation is, if I own insight engineering, that basically means that I should be aware, I should have insight into- Yeah.

I should have insight into all insight efforts inside the company. And that doesn't mean I have to formally track them, but I would say that I would be a poor manager if I wasn't aware of what else is going on at Netflix and having ongoing conversations with other people who are doing this stuff- Right ... talking about optimizing. Do you think also you have an advantage because you've sort of made decentralized kind of customer-aligned teams that you can look at the holistic cost benefit of things far easier than you could if you're in a classic siloed organization where everybody kind of lived in their functional silo, and you didn't really understand where the costs were going?

Because I sort of hear this a lot when people are talking about, "Oh, we couldn't possibly let developers do the operations tasks as well because they're too valuable," right? When the reality is when you add up all the effort that goes into running the service, they pay far more than they would if they just let those developers build and own the thing that they had built. So I'm just kind of wondering how much of that being able to have more decentralized, sort of business-focused aligned cross-functional groups allows you to keep track of those, makes those costs clear to everybody, and so you make decisions a little easier and you understand your path to the customer. Or did I just make that up?

I don't know. I think it's highly useful, right? Again, the whole thing about context at Netflix is that you can't be siloed, and there's no thing that you don't get to know. Sometimes it almost feels weird.

I've had conversations with our facilities people about how much our lunches cost, because I'm kind of curious. And I've worked in a lot of organizations where, if you're asking some other group how much they're spending, it feels kind of personal and invasive. And our facilities people thought that was actually really nice and suggested that I was actually interested in what they do. So it's almost a contextual thing.

Hmm. But the other thing I would say is that the thing that helps us is we're not focused on small details or small dollars. This is why, for example, if you want equipment, if you want things like new power supplies or something, there are vending machines at Netflix. They don't track who got the stuff, right?

Because in the end, we don't think it's useful to track who got two AA batteries or a new mouse or a new keyboard. And so we don't track small expenses. We don't sm- Don't sweat the small stuff. To use a cliche, we don't sweat the small stuff. Right.

We just sweat the big stuff. But the cool thing about it is, on the vending machine, which doesn't take money and doesn't take a credit card, and you don't have to badge in or anything, it says how much each item costs. So, I tend to lose those little attachments for my Mac adapter. I don't know, like water.

I don't know where they all go. But it tells me how much those things cost. And so, I feel like I should keep better track of those things because I see the price, and then they had those cozy things, and that made everything so much better. Oh, those are wonderful, yes.

Yeah. Anyways, so it's good to know how much things cost, right? So- But make the decision yourself, yeah. But make the decision anyways.

Make it yourself. Cool. So, what- What's that? Over here.

Oh, sorry. Go ahead. Couldn't see you. No problem.

So, for leadership that is, or executives that are more risk-averse, there's this whole concept that Agile moves too fast, or DevOps moves too fast, and we need to be more deliberate in the way we progress towards production. You recently had some significant problems deploying Netflix in Europe. Is it tied to working too fast? So the question is, Agile means go faster.

Going faster traditionally leads to things breaking. When you have outages, can you look back and say, "We were moving too fast," or do you feel like there's usually some other systemic issue that you look to? So I've been in a bunch of incident reviews, which is what happens when you have production outages. I don't think that in the three years I've been attending those, I've ever heard the phrase, "We were moving too fast." But I have heard the phrase a whole bunch, "Our tools let us down." Hmm.

Interesting. Now, what kind of tools? Is that on the testing side? The deployment side?

All tools? What type of issues do they normally- Well, not just- What are the common tooling problems that let you down? Pretty much just in Dianne's team. Okay.

Fair enough. No, so for example, one of the things that we've done a much better job over the last year is, my team has built an automated canary analysis system. So we can now actually automatically validate that your new code push is looking at least as good as your previous code that's running in production. Operations engineering, at its heart, is all about letting you maintain your velocity while improving your quality.
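The idea of automatically checking that a new push looks "at least as good as" the code already in production can be sketched as a simple comparison. This is a minimal illustration under stated assumptions, not the system described in the talk: the function name, the use of error rate as the metric, and the 10% relative tolerance are all hypothetical choices.

```python
from statistics import mean


def canary_passes(baseline_error_rates, canary_error_rates, tolerance=0.10):
    """Return True if the canary's mean error rate is no more than
    `tolerance` (relative) worse than the baseline's mean error rate.

    Both arguments are sequences of error-rate samples (hypothetical
    metric) collected over the same window from the currently deployed
    version and from the new version.
    """
    if not baseline_error_rates or not canary_error_rates:
        raise ValueError("both samples must be non-empty")
    baseline = mean(baseline_error_rates)
    canary = mean(canary_error_rates)
    # The canary is acceptable if it is within the allowed slack of baseline.
    return canary <= baseline * (1 + tolerance)
```

A real analysis would likely compare many metrics and use a statistical test rather than a fixed threshold on the mean. Note that with a baseline of exactly zero, this sketch only passes a canary that is also at zero.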

There was a question in one of the earlier sessions about how do you build this blameless culture? And, I don't know how you get there, but I know how you stay there. And you stay there by when you have an outage and an incident review, you don't point the fingers at other people, you point them at yourself. Right?

So I think there's enough room for blame to go around. You're in one of these incident reviews and maybe somebody entered a configuration parameter incorrectly, and maybe it took down all of Netflix. That would never happen, right? But maybe it did.

And so in that incident review, the person who actually typed that thing incorrectly, right, says, "Gosh, you know guys, I could've done better." Right? I look at it and say, "My tool should've given you visibility to let you know what you were doing, so that you would have better insight into whether or not this was going to be a dangerous maneuver. Roy's tool could've given you better understanding that this isn't as good as the version that's out there in production today." And so instead of everybody pointing the finger at another person in the room, what you find is everybody's pointing the finger at themselves, and you all walk out with things that you should do to make it better. And so that's how you continue this blameless culture, is that everybody takes blame and figures out how to make it better the next time.

So move fast, forgive yourself. Right. Fail fast, but also fix it fast. So it almost sounds like the moral of the story here is if you want to improve this kind of culture, that taking the responsibility is as important as asking for the freedom.

Yeah, they don't exist independently. I think that is a great place to end it, because we're out of time. So everyone give Dianne and Roy a round of applause. Just a fast plug.

John Willis, wherever he is, my colleague in crime, we do a podcast called DevOps Cafe. Get it on iTunes. And we do this kind of stuff all the time. So, hope you enjoyed it, and thank you very much.

Thank you, Damon, Dianne, and Roy. Thank you so much.