DevOps at Netflix
Netflix is well known for its innovative company culture that emphasizes autonomy, transparency, and psychological safety. This culture has helped shape Netflix's agile and DevOps practices which have allowed the company to quickly scale and adapt.
In this session, I will provide insight into how Netflix thinks about DevOps and agile development. I will explore how Netflix's unique culture permeates its approach to practices like continuous delivery, automation, and collaboration between teams.
Participants will walk away with actionable ideas for how they can apply elements of Netflix's culture to their own organizations in order to improve agility, speed, and alignment. This will be a fascinating look into how one of the world's leading streaming platforms operates and innovates through its people-first mentality.
Full transcript
Tejas Chopra
Hi everyone. My name is Tejas Chopra, and today I'll be talking about DevOps at Netflix.
I'm a senior engineer at Netflix, and this talk is 25 minutes long. So I may not be able to get into the technicalities of the DevOps side; instead, I'll focus more on the cultural aspects of DevOps.
The agenda for the talk is as follows. I'll talk about what I do at Netflix, about Netflix, about DevOps, what DevOps means for Netflix, and how it has evolved as part of our culture. And if we get some time towards the end, we'll have some questions and answers.
So I'm a senior software engineer at Netflix. I work on cloud infrastructure for Netflix Studios. A lot of people know Netflix from the movies and the shows that you watch, but that is the streaming side. You'll be intrigued to know that only about 1% of the data, or even less than that, actually makes it to streaming.
Movie making is a very nuanced process, because for every shot you have multiple camera angles, multiple rendered images, multiple encoded images, and only a small sliver of it makes it to the screen. So the movie-making process is really very intensive when it comes to cloud infrastructure, and my team builds the storage and the transfer that get this data into the cloud.
Along with that, I'm also the founder of Go EB1, which is a thought leadership platform for immigrants. I have worked at companies like Box, Samsung, Apple, and Cadence in the past, where, again, I focused on cloud infrastructure. I'm a tech speaker. I speak on blockchain, cloud, and distributed systems, and I teach kids about software development at the University of Advancing Technology in Arizona.
When you think about Netflix, our main focus is winning moments of truth. What does that mean? It means that every evening when you go home, when you want to relax, we want to delight you as a customer and provide you entertainment options. We started off with just movies. Now we are into games and ads as well.
And so in order to do that, we recognize that our moat is the content. It is not solely the infrastructure. Today, almost all of our compute and storage runs on AWS. We use Akamai for UI and small assets.
When Netflix initially started, it did not have its own CDN, or content distribution network, which is what allows streaming at the edges. When you watch a movie from Netflix, it's not streamed from the cloud. It is streamed from one of the content distribution network hosts that are physically installed at your ISP's location.
When we started out, we used Akamai, and we realized this was very expensive for us. And if streaming is our main business, we had better own the parts of streaming that are critical. And so we moved to something called Open Connect.
Open Connect is our open source software for CDNs, which is installed at these ISPs. All of the video bits, all of the bits that you watch on Netflix, are streamed using our Open Connect framework. Not only that, Open Connect also gives us the flexibility to leverage the infrastructure for the movie-making process. Because it is a globally distributed network of nodes that are optimized for video, it naturally lends itself to movie making as well.
Let us talk about the scale. We serve more than 220 million members in more than 190 countries. We have thousands of microservices, thousands of daily production changes, tens of thousands of virtual machine instances and containers, hundreds of thousands of customer interactions per second, billions of time series metrics that we capture, and tens of billions of hours streamed every quarter. And we do that with tens of operations engineers. There is no network operations center, or at least nothing that is called a network operations center; the equivalent goes by a different name.
Someone took a stab at how microservices in Netflix talk to each other. And this is the graph that we saw. There's no person in Netflix who understands this. It is, I think, if I remember correctly, upwards of 1,500 microservices talking to each other. You can imagine the scale and you can imagine the chaos that it can cause, but it works.
When you go to Netflix, you see recommendations. Hopefully they are meaningful. Hopefully you get to watch good shows, and it all works seamlessly. So that brings the question: how do we do it? How do we really do DevOps at Netflix?
The first thing, before I dive into how we do DevOps, is understanding what DevOps is. I know a lot of folks here probably know about DevOps; we are at a DevOps conference, after all. But before DevOps, there was Waterfall, and then Agile.
I'll just spend five minutes on this. Waterfall is a very traditional approach to software development where you gather requirements, you lock yourself in a room, you work through the problem, and when you get out of the room, you're like, "Hey, this is your solution." And that's it.
If something changes between stages, you have to go back and start the process again. And that is why this is a problem. This used to work, and this still works very well in, let's say, hardware companies where you have well-defined release cycles. But for a software company, this is not scalable.
First of all, clients themselves often don't know what they need, which I think most of us can agree on. It's very expensive to make changes towards the end of the project because you have to go back to the start and do all the processes again. And in today's time, at least, software needs to be delivered faster and with far fewer resources.
So then we came up with a technique called Agile. Many people also call Agile a sprint technique, where the entire process of building software is broken down into actionable blocks that are called sprints. So you plan, code, test, review, and you do that regularly every two weeks or four weeks, whatever your sprint cadence is. And the idea is that instead of delivering the final product all at once, you iterate towards it.
This is a typical model of Agile workflow. You have a product backlog, which is all the needs, features, requirements. You do sprint planning. You come up with a sprint backlog, and there are people that sit in a room, and they decide what needs to be delivered at the end of the sprint. You have a deliverable product, which you give for review.
This works great, but the product that you give out after the sprint need not necessarily be the product that is actually deployed on the cloud. In most cases, before DevOps as a term was widely accepted, sprints would end with something working on your laptop, and you would demonstrate it to the product owner: "Hey, this is what we had planned to do. This works on my laptop. Life is good. I'm moving on."
There are advantages to that because, compared to Waterfall, client requirements are better understood in the sprint model, and there is constant feedback. The product is delivered much faster than with the Waterfall model.
But there are disadvantages. Like I said, the product only gets tested on developers' machines and not on production systems, because before DevOps came into being there was no way, at the end of the sprint, to actually deploy the deliverable on the cloud or on production systems. And most importantly, the development and operations teams work in silos here.
The job of the operations team in a traditional setting was to take the bits that a developer creates and put them on the cloud, scale the machines appropriately, find the right-sized machines, and so on and so forth.
DevOps is an evolution of this Agile model. Where Agile tries to bridge the gap between client and development, DevOps tries to bridge the gap between development and operations. That's why it's called DevOps.
According to the practices, traditionally, there are eight phases in DevOps: plan, code, build, test, integrate, deploy, operate, monitor. The lines are blurring between all of them. Many companies, most companies have a subset of these.
In the plan stage, it's very evident. The business owners discuss project goals and create a plan. You have programmers that code the application. They use tools like Git, Bitbucket to store application code. Netflix uses a layer on top of Bitbucket that we have been using for a while.
Build tools like Maven and Gradle: Netflix uses them today. Most of our services are written in Spring Boot, and we use Gradle a lot.
Testing: Netflix, again, open sourced something called Chaos Monkey, which was very popular as a testing framework as well. And we use Selenium and JUnit.
Integration: we primarily use Jenkins for it.
Deployment: Netflix has come up with something called Spinnaker, and towards the end of the session I'll show you a screenshot of it. For folks who are not aware, Spinnaker is an open source platform that allows you to deploy your apps to cloud or on-premise systems. And I'll show you what that looks like.
And finally, once it's deployed, you look at alerts, metrics, and monitor them to see if things are good or not.
This just captures all the different DevOps tools in the different phases. You have Git, Maven, Gradle, JUnit. For integration, there are Jenkins and Bamboo. For deployment and operation, you have Chef, Puppet, Docker.
Netflix uses Docker images. We have our own proprietary container runtime that we call Titus, which was developed before Kubernetes was developed. And we use that internally a lot. And for monitoring, Splunk and Nagios are good tools.
So given that now we know about DevOps, how does Netflix do DevOps? Really, we don't.
Coming to this conference, with a packed crowd waiting to hear about DevOps, it's very difficult for me to say this. But we don't do DevOps the traditional way. We do not have systems that say no to our engineers.
You'll be surprised to hear that every full-time engineer in Netflix has full access to production systems. Everyone can check movies, the screenshots, images, all of that. We do not build guardrails to prevent engineers from having access, even if it's the quarter end.
We do believe in freedom and responsibility. One of our goals at Netflix is to hire smart people, smart engineers, and get out of the way. And if we hire someone who is good at what they do, then they need to be given the freedom and responsibility.
And this is not just something that is a hiring goal. When, as an engineer, I have to write a microservice, I have the full freedom to decide which language I want to write the microservice in. Should it be Spring Boot? Should it be C++? Should it be Java or some other variant, or Python?
But along with that, I also have the responsibility to manage it. That is where we are different, because every engineer at Netflix is a DevOps engineer. We do not have separate teams that do DevOps.
As an engineer, for every service that I write, I am responsible for finding the right-sized machine in the cloud where this service can run. What are the memory requirements and CPU requirements for this service? I am also responsible for deploying the service to the cloud, managing the autoscaling, managing the different clusters, and managing the security for these services.
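To make the right-sizing step concrete, here is a minimal sketch in Python of picking the cheapest machine that fits a service's CPU and memory needs. This is not Netflix tooling: the `InstanceType` catalog, its names and prices, the `pick_instance` helper, and the 20% headroom are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str           # hypothetical label, not a real cloud instance name
    vcpus: int
    memory_gib: float
    hourly_cost: float  # made-up price, for illustration only


# Hypothetical catalog; a real one would come from the cloud provider.
CATALOG: List[InstanceType] = [
    InstanceType("small", 2, 4.0, 0.05),
    InstanceType("medium", 4, 16.0, 0.20),
    InstanceType("large", 8, 32.0, 0.40),
    InstanceType("xlarge", 16, 64.0, 0.80),
]


def pick_instance(
    vcpus_needed: float,
    memory_gib_needed: float,
    headroom: float = 0.2,
    catalog: List[InstanceType] = CATALOG,
) -> Optional[InstanceType]:
    """Return the cheapest instance that fits the service plus headroom, or None."""
    need_cpu = vcpus_needed * (1 + headroom)
    need_mem = memory_gib_needed * (1 + headroom)
    candidates = [
        inst for inst in catalog
        if inst.vcpus >= need_cpu and inst.memory_gib >= need_mem
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda inst: inst.hourly_cost)
```

For example, a service measured at 3 vCPUs and 10 GiB would land on the hypothetical "medium" type here, while a request that no entry can satisfy returns `None` so the owner has to decide what to do next.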
So that is how it empowers engineers and removes the silos between developers and operations teams.
So we do not optimize for uptime at all costs. There are some companies, many companies actually, for which uptime at all costs is very important. These are companies in finance, health tech, or healthcare sectors. But Netflix inherently is a creative company. We make content that delights people, and so we do not try to have processes that compromise on that. We really do not optimize for uptime at all costs. The repercussions of downtime for us are very different.
So what we instead do is we prize the velocity of innovation. Netflix as an org understands that if we have to delight our customers, if we have to give engineers the freedom to write good microservices, good recommendation algorithms, use machine learning and AI in the right way, then we will have to sacrifice some of the uptime.
There are ways to mitigate that. For example, a lot of you may be familiar with rolling deployments or canary deployments. Whenever Netflix comes up with a recommendation feature, we do not roll it out to the whole world. We start with a small subset of users that we believe is most representative to test out these features, and without you knowing, you'll suddenly see little changes there.
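The idea of exposing a feature to only a small subset of users can be sketched as follows. This is a minimal illustration, not how Netflix selects its cohorts: the talk describes choosing users believed to be representative, whereas this sketch swaps in a simpler technique, deterministic hash-based percentage bucketing. The function name `in_canary` and the feature label are hypothetical.

```python
import hashlib


def in_canary(user_id: str, feature: str, rollout_percent: float) -> bool:
    """Decide whether a user sees a feature at the given rollout percentage.

    The user is hashed into one of 10,000 buckets. Including the feature
    name in the hash means different features get different user subsets.
    The decision is deterministic, so a user does not flip between variants.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # bucket in 0..9999
    return bucket < rollout_percent * 100  # 1.0 percent covers 100 buckets


if __name__ == "__main__":
    # Hypothetical feature name, rolled out to 5% of users.
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, in_canary(uid, "new-recommendation-row", 5))
```

Because a user's bucket never changes, raising the percentage only adds users: anyone in the 5% group is still included at 50%, which keeps the experience stable as the rollout widens.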
It was very interesting for me to learn that when you watch something on Netflix, Netflix captures what type of environments you like, whether it's sunshine or winter, and which characters you like. And then for every movie, we try to show the image that is most representative of what you like.
So for two people under the same account, the same movie may have a different tile altogether. Even the video snapshot that we show when you just hover your mouse over a movie icon is different for different people. And this creativity from our production folks and our algorithms exists because we have allowed them the freedom to fail.
And so we don't optimize for uptime. We don't have processes and procedures. In DevOps especially, you'll come across a lot of processes and procedures, but that's a very bureaucratic way of thinking. And we try to avoid that by ensuring that everyone picks and chooses their own way of doing DevOps.
There's a lot of chaos there. So we've somewhat come to a point where there are teams that build tools that enable other engineers to work freely on their moat. What that means is every service requires some form of logging, metrics, and alerts. So we have teams that build just those systems, like the alerting system, and your service can just integrate with that.
The teams have built clients for different languages. So you can pick and choose which language you want. But if you're picking a less common language, like Rust, for example (which is not too uncommon, actually), then you are responsible for writing the client for the services that you're using.
And that's fair. In my case, I was writing a file system, and you know that file systems can be painfully slow in Java. So we used C++ and FUSE, the file system in user space. Metrics, alerts, and logging did not have clients in C++ because who codes in C++ these days? So then I had to write my own clients for that. And that's the deal. If you pick and choose a language, you're responsible for the deployment and the management of it.
We do believe in trust. We do not, like I said, have any issues giving full access to everyone in production. There are still some security requirements. There is still some two-factor authentication, which you anyway have on your laptop. But other than that, everyone has access to every content piece that we create.
We do not go for standards. For example, we have a lot of different languages and frameworks, and we do not have any prescriptions. Today, I think a lot of Netflix uses Python as well for the data science side. C++ is pretty much rarely used unless it's really closer to the metal use cases for optimizations and file system stuff. But a lot of it is Java. A lot of it is Node.js and JavaScript languages.
We focus on enablement. For example, many years ago we had our UI completely rewritten in one of the JavaScript languages. And we realized that the effort it took to rewrite the UI was significant. And that's when we started investing in teams that can build these logging, metrics, and other things that enable other engineers to build their microservices faster.
We don't work on having silos, walls, and traditional operational fences where the engineering group sits behind a fence over which code is thrown in the hopes that it'll show up in production one day. Instead, we focus on a "you build it, you run it" philosophy.
We do not rely on guesses or tradition; we rely on data. First and foremost, Netflix is an enormous data company. We have 2.5 billion time series metrics, and a vast majority of our decisions are based on data. Like I said, recommendations are a good example, where we collect enough data from our users to recommend, as well as we can, the things that they would like to watch.
There are many companies that have these walls that have televisions where there are graphs, and people sit and watch these graphs all day. I used to work in those companies as well. We realized that if at Netflix we had such televisions, it would actually be a huge wall of just television sets, and that is practically infeasible to have.
So we have leveraged artificial intelligence and machine learning to actually alert us. Instead of having graphs, we just get pages or alerts. And they have significantly boosted our productivity. So this is how data has been very critical for us.
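As a rough illustration of getting an alert instead of watching a graph, here is a minimal sketch that flags points in a metric series that deviate sharply from their recent history. The talk does not describe Netflix's actual models, so this substitutes a simple rolling z-score for whatever machine learning is used in practice. The function name `find_anomalies`, the window size, and the threshold are assumptions.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(
    values: Sequence[float], window: int = 10, threshold: float = 3.0
) -> List[int]:
    """Return indexes whose value is far from the trailing window's mean.

    A point is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` points.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is unusual.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Made-up latency samples with one obvious spike.
    latency_ms = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 100, 250, 101]
    for index in find_anomalies(latency_ms):
        print(f"ALERT: sample {index} = {latency_ms[index]} ms")
```

In a real system, the flagged indexes would feed a paging or alerting service rather than a print statement, so nobody has to stare at a dashboard waiting for the spike.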
So we don't do DevOps in the traditional way. What we do believe in is culture. We value our culture very highly. Netflix is notorious for its culture document. I don't know how many of you have read it. It's a long document that runs into multiple pages, but it actually reflects the way we think about software development. And I would highly encourage you to spend some time on it while you're on a flight, when there's no internet. You can just read it.
DevOps is a result of a healthy culture. And if nothing that I said today resonates with you, then you don't really have a DevOps problem.
I end with what DevOps means for me, because for every person in Netflix, it's different. I work on something called Netflix Drive, which is very poorly named, but very good software. It's a clone of Google Drive. Google Drive has issues when it comes to scale because you have artists that work in different parts of the world, and if they have to collaborate on a piece of art together, then Google Drive has limitations on number of folders, number of files, access permissions for all of them.
So we started creating our own edge software called Netflix Drive. It runs on macOS, Windows, CentOS. For CI/CD, we use Jenkins and Spinnaker. We have Mac VM instances on AWS for deployments. We use Brew and pkg on OS X, yum on CentOS, and Chocolatey packages on Windows. And of course, we use Sentry for logging and in-house tools for metrics and observability.
And this is what Spinnaker looks like. This is actually a screenshot I took this morning. I don't know if you've tried this platform or not, but what it shows is an app. It shows different pipelines, different clusters, load balancers, security groups, and you can actually see the different deployments here in different regions.
So we have an app called Pegasus that has multiple machines, multiple instances that are deployed. And as soon as you check in a piece of code, Jenkins triggers a pipeline that deploys it to the cloud: first in your test environment, then in your integration test environment, and finally in production.
So I would highly encourage folks to just try this out. It's a tool that's available online. You can see if it fits in your organization or not. That's my time.
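The promotion order described here (test, then integration, then production) can be sketched as a small loop that stops at the first failing stage. This is only an illustration of the flow, not Spinnaker's or Jenkins's API: the `run_pipeline` function, the stage names, and the `deploy` callback are hypothetical stand-ins for whatever actually deploys and verifies a build.

```python
from typing import Callable, List, Tuple

# Stage names follow the order described in the talk.
STAGES = ["test", "integration", "production"]


def run_pipeline(
    build_id: str, deploy: Callable[[str, str], bool]
) -> Tuple[bool, List[str]]:
    """Deploy a build to each stage in order, stopping at the first failure.

    `deploy(build_id, stage)` is a caller-supplied function that returns
    True when the deployment and its checks succeed for that stage.
    Returns (overall_success, stages_completed).
    """
    completed: List[str] = []
    for stage in STAGES:
        if not deploy(build_id, stage):
            return False, completed  # later stages are never attempted
        completed.append(stage)
    return True, completed


if __name__ == "__main__":
    # Hypothetical deploy function that fails in the integration environment.
    def fake_deploy(build_id: str, stage: str) -> bool:
        print(f"deploying {build_id} to {stage}")
        return stage != "integration"

    print(run_pipeline("build-123", fake_deploy))
```

The point of the ordering is that a build which breaks in the integration environment never reaches production, which is one of the ways engineers can move quickly without a separate operations team acting as a gate.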