DevOps Story of a Crisis & Conquest

US 2021

Engineering Manager (DevOps) · STARZPLAY

The ongoing pandemic hasn't just challenged the health and movement of people worldwide; its sheer unpredictability has also taken a toll on businesses, their strategies, and their technical delivery.

The Platform team at STARZPLAY, one of the largest video streaming providers in the MENA region, has some bittersweet memories and many lessons from this time. On one hand, viewership suddenly surged almost threefold as lockdowns drove people to binge-watch, which was a good sign for the business. On the other hand, the team had to scale up the infrastructure while keeping software delivery schedules intact. Adding to the woes were spikes in cyber attacks that tried to exploit the vulnerable times, and the challenge of equipping the organization for remote work.

This talk is a true story about the challenges presented to an engineering team amid the COVID-19 pandemic and how DevOps principles guided them out of troubled waters. The lessons learnt will go a long way in setting the right organizational culture to overcome a crisis.

Full transcript

Prasanjit Singh

Hello, and welcome to the DevOps Enterprise Summit 2021. My name is Prasanjit Singh, and today I'll tell you a DevOps story of crisis and conquest. This is an account of the challenges faced by a cloud infrastructure team serving millions of customers in the face of the pandemic. We will also talk about the lessons that this crisis taught us and how DevOps principles helped us overcome these challenges.

Before we go on with our story, let me quickly introduce myself. My name is Prasanjit, and I am the director of cloud and DevOps practice at STARZPLAY Dubai. Along with platform engineering, I am passionate about learning and teaching, and I have trained over 20,000 students at Coursera and other edtech platforms. With my team at STARZPLAY, I am responsible for deployments and site reliability.

STARZPLAY is a video-on-demand service that streams movies, TV series, documentaries, kids' entertainment, and live sports for 20-plus countries across the Middle East and North Africa region. We also support other OTT platforms across the Asia-Pacific region. In addition, we have a live subscriber base of more than two million, and our apps are installed on more than 10 million devices worldwide. That brings us to more than 50,000 requests ready to bombard the platform every minute around the clock, and that is a large number.

Now, let me tell you a story, or rather a diary account of an infrastructure engineer, covering the incidents that occurred over the month after the pandemic was declared in 2020.

11th March 2020: World Health Organization declares COVID-19 outbreak a pandemic. The WHO on March 11 declared COVID-19 a pandemic, pointing to over 118,000 cases of the coronavirus illness in over 110 countries and territories around the world and the sustained risk of further global spread.

"This is not just a public health crisis. It is a crisis that will touch every sector." I quote this from Dr. Tedros, the WHO director-general, at a media briefing. He said, "Every sector and every individual must be involved in this fight."

Wow. That's not something you hear every other day. This was happening, and this was real. We weren't sure how this would affect us as a company, being from the media and entertainment industry and serving the traffic, the servers, the infrastructure. We were not at all sure what to expect. We knew that we would have to be careful health-wise, but we had no clue we would have to scale up our servers as well.

16th March 2020: half of the employees were asked to work from home. Looking at the severity of the spread, we decided to have 50% of the workforce at the office and the remaining 50% working from home. That would reduce the chances of the spread by half, at least for the employees in our office. That's how we thought about it then.

With this announcement, we, the infrastructure engineers and DevOps advocates in the company, found ourselves in the limelight, helping to maintain the connectivity and communications that would hold the organization together during this great corporate dispersal. We already had our deployment pipelines and DevOps tools in place for our software development life cycle. It was then that we realized their value more than we ever had before.

I would say that productivity and the continuous delivery of services by the IT departments through the crisis rested on our shoulders. Aligning and syncing developer initiatives with the operations cadence was on us as well. That reminded me of the DevOps principle of end-to-end responsibility. As a practitioner of DevOps, you have to be responsible for everything. You can call yourself a jack of all trades, and you have to do it well.

We had no clue yet of what was about to come. However, we just reacted, prepared our VPN servers for half of the people that were going to work from home, scaled up our user accounts for Zoom and Google Meet, and rejigged our layouts to highlight binge-watching of movies like "I Am Legend" and "Contagion," because these were in high demand in those times.

Then comes 4th April 2020: nationwide lockdown declared. By now, the pandemic was wreaking havoc, especially in Spain and Italy, with case numbers rising sharply all over the world. On 4th of April, a nationwide lockdown was declared, and the entire city of Dubai, where I work, came to a standstill. We were all locked in our homes, and our work had to keep going. 100% work from home for a media and entertainment house means complete dependence on the technology team. That's when we had to roll up our sleeves and get into action.

In this time of crisis, DevOps practices really came in handy. I'll tell you how. We have a DevOps principle: automate everything you can. As organizations sent their employees home to work, it suddenly became a lot more apparent how many manual processes were at work within the organization. We realized that, in spite of having DevOps pipelines and automations in place, there were still lots of places where we relied on manual processes. We were dependent upon people being at their desks, running certain scripts, and being able to talk to each other directly to work out those challenges.

The more processes you have within your organization that are not automated, that require manual intervention, that may even require a physical presence, the more pain you're going to experience during this rapid shift to the new work paradigm, even if that shift is temporary in nature.

In order to reduce this pain, the only option was to move into a DevOps paradigm, and that is to put people, processes, and technology to work in order to eliminate these manual processes, manual bottlenecks, and the need for a team to be sitting in the same room in order to deliver functional software and safe database changes.

Think of automation as covering not only the software development process (continuous delivery, including continuous integration and continuous deployment) but also the whole infrastructure landscape, in a way that allows infrastructure to be versioned and treated as code as well. To automate a process, it needs to be converted into code. All processes normally actioned through the console have to be transformed into a series of API calls and scripted commands. So that's what we did. Whatever was being done by people in the console was converted into scripts, if it wasn't already.
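As an illustration of turning a console procedure into a script, here is a minimal Python sketch. The `Step` and `run_runbook` helpers and the step names are hypothetical and are not taken from the talk; each placeholder action marks where a real API call or command would go.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One action that used to be a manual click or command in the console."""
    name: str
    action: Callable[[], None]


def run_runbook(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Run the steps in order and return a log of what happened.

    In dry-run mode nothing is executed; the log only lists what would run.
    """
    log = []
    for step in steps:
        if dry_run:
            log.append(f"DRY-RUN {step.name}")
            continue
        step.action()  # an exception here stops the runbook at the failing step
        log.append(f"OK {step.name}")
    return log


# Hypothetical steps; each lambda stands in for a real API call or command.
restart_cache = [
    Step("drain traffic from cache node", lambda: None),
    Step("restart cache service", lambda: None),
    Step("re-enable traffic", lambda: None),
]

for line in run_runbook(restart_cache, dry_run=True):
    print(line)
```

Once a procedure lives in a script like this, it can be versioned, reviewed, and triggered or scheduled by anyone on the team, wherever they are working from.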

That helped to reduce the manual tasks and make things automated, so when people worked from home, they could conveniently trigger certain jobs, or the jobs were simply scheduled, and there were fewer dependencies on individual team members. Even though the team was dispersed, the processes were in place, and they proceeded as desired. With these things, slowly everything settled, and we got used to the new way of working. People started calling this new way of working the new normal.

That brings us to 6th April 2020. That was when we decoupled our deployments and releases. This is a lesson that we learned from certain challenges that we faced. Let me tell you more about it.

There is a DevOps principle that calls out continuous improvement, and in a bid to improve our productivity in this newfound scheme of things, we made certain changes to our delivery pipelines. The concept of decoupling deployment from release is a key thing for any DevOps team to be aware of, and it is important to understand how feature flags can make that possible.

Before we delve in a bit more, let's start by looking at what decoupling is. Deployment is pushing your code into some part of your infrastructure, and release is exposing your code to execution. The ability to decouple deploy from release means that you're able to push code anywhere without exposing it, and therefore without impacting your users.

This then allows you to gradually release the new feature to assist in internal testing, dogfooding, and progressive rollout. But what is most impressive is that, if done correctly, you can compare system health, metrics, and user behavior between the users who have access to the new feature and the users who do not yet have access, and thus learn much sooner if there are any issues.
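The comparison between users with and without the new feature could be sketched as follows. This is a minimal Python example under stated assumptions: `CohortStats`, `should_halt_rollout`, and the one-percentage-point threshold are hypothetical names and values chosen for illustration, and a real check would likely look at more signals than error rate alone.

```python
from dataclasses import dataclass


@dataclass
class CohortStats:
    """Request and error counts for one group of users over the same period."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero when a cohort has seen no traffic yet.
        return self.errors / self.requests if self.requests else 0.0


def should_halt_rollout(control: CohortStats,
                        treatment: CohortStats,
                        max_increase: float = 0.01) -> bool:
    """Return True if the cohort with the feature errors noticeably more.

    max_increase is the tolerated absolute rise in error rate
    (0.01 means one percentage point); the value is an assumption.
    """
    return treatment.error_rate - control.error_rate > max_increase


# Hypothetical numbers: most users are in control, a small slice has the feature.
control = CohortStats(requests=10_000, errors=50)   # 0.5% errors
treatment = CohortStats(requests=1_000, errors=30)  # 3.0% errors
print(should_halt_rollout(control, treatment))
```

Because only a fraction of users is exposed, a regression like the one in these made-up numbers shows up in the treatment cohort before it can affect everyone.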

This gives you the ability to roll out features faster and get them tested by live users themselves, because you're not exposing the entire user base. You're exposing only a certain fraction of it. Feature flags are what make decoupling deployment from release possible. A feature flag, or feature toggle, is implemented as a function call that controls access to a particular code path.

Unlike traditional compile-time flags, command-line flags, or configuration file entries, feature flags operate on a user-by-user basis rather than per server image, and they are controlled remotely, from outside the application. That means you can toggle a feature just by changing a property, without pushing new code.
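A feature flag of this kind could be sketched in Python as below. This is an illustration under stated assumptions, not the implementation used at STARZPLAY: the flag name, the in-memory `FLAGS` dictionary (standing in for a remotely managed configuration store), and the hash-based bucketing are all hypothetical choices.

```python
import hashlib

# Hypothetical flag store. In practice this would be fetched from a remote
# configuration service so it can change without a new deployment.
FLAGS = {
    "new_player_ui": {"enabled": True, "rollout_percent": 10},
}


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, flags: dict = FLAGS) -> bool:
    """Decide per user whether the code path behind the flag is exposed."""
    config = flags.get(flag_name)
    if config is None or not config["enabled"]:
        return False  # unknown or switched-off flags expose nothing
    return bucket(flag_name, user_id) < config["rollout_percent"]


def render_player(user_id: str) -> str:
    # The new code is deployed for everyone but released only to some users.
    if is_enabled("new_player_ui", user_id):
        return "new player"
    return "old player"


print(render_player("user-123"))
```

Hashing the user ID keeps each user in the same bucket across requests, so raising `rollout_percent` from 10 to 50 widens the audience without flipping anyone who already had the feature back to the old path. Setting `enabled` to `False` acts as a kill switch.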

That is how we started working on this, and there are many reasons one would want to do it. One I already mentioned: it makes deployments faster. It is also important if your DevOps team practices trunk-based development, where all code is committed to master at least once every day. Without decoupling, all the work-in-progress code would go live. So you should have feature flags that you can toggle, switching a feature off when required. The second reason is to enable safe testing in production so that your entire user base doesn't suffer.

We found that this worked very well for us in the given situation, and it resulted in getting code to production sooner, even if the whole team wasn't in the office working together.

Then comes 8th of April, when we enabled another feature toggling mechanism. In software development, a feature toggle is a mechanism that allows code to be turned on or off remotely. Feature toggles are commonly used by product engineering and DevOps teams for canary releases. We did that, and again it allowed us to toggle features on quickly whenever we needed them.

One more thing we did to help a dispersed team function well was to divide it into autonomous squads. Each squad had one member with expertise in each particular area. For example, a squad would have one database admin, one person who is well versed in DevOps tools and pipelines, a developer, someone who can merge code, and so on. Every squad would be self-sufficient, and even when working remotely, its members worked in tandem with each other so that not everyone had to jump into a call.

Say, for example, you have a big team of 30 people. Keeping everyone in sync is difficult when you're working remotely. It is better to have smaller teams with one expert from every area; such a squad becomes more functional and more productive than a bigger team. That is another thing we did.

That brings us to 10th of April 2020. Because of the nationwide lockdown, most people were at home. By 10th of April, two-thirds of the whole world was indoors, in fact. As expected, the one respite for everyone was watching movies and web series, and that applied to adults and kids alike. This led to a huge surge in traffic. We were seeing more than three times what we see on a normal day.

This meant scaling our infrastructure, app servers, databases, caching systems, load balancers, everything that we have on the infrastructure level, because the traffic just grew three times and we were being hit by lots of requests, and all organic requests, believe me. That's when, again, DevOps practices of infrastructure as code came to bail us out.

Being on the cloud, the elasticity of the cloud also came to the rescue here. The cloud's API-driven model facilitated collaboration between developers and system administrators. We used APIs and tools like Terraform and CloudFormation to rapidly scale up our systems. Then we used Ansible to configure our systems on the go. Because we had these practices in place, we could take on the challenge of 3X traffic and scale up within a couple of hours to serve our customers well.
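To make the scaling arithmetic concrete, here is a minimal Python sketch of sizing a server fleet for a traffic surge. The 50,000 requests per minute baseline and the threefold surge come from the talk; the per-instance capacity of 5,000 requests per minute, the 30% headroom, and the `desired_instances` helper are assumptions made purely for illustration.

```python
import math


def desired_instances(requests_per_minute: int,
                      per_instance_rpm: int,
                      headroom_percent: int = 30) -> int:
    """Return how many instances are needed to serve the given load.

    per_instance_rpm is the load one instance can handle (an assumed figure),
    and headroom_percent adds spare capacity on top of the measured load.
    """
    if per_instance_rpm <= 0:
        raise ValueError("per_instance_rpm must be positive")
    needed = requests_per_minute * (100 + headroom_percent) / (100 * per_instance_rpm)
    return max(1, math.ceil(needed))  # always keep at least one instance


baseline_rpm = 50_000          # from the talk
surge_rpm = 3 * baseline_rpm   # roughly threefold traffic during lockdown

print(desired_instances(baseline_rpm, 5_000))  # 13
print(desired_instances(surge_rpm, 5_000))     # 39
```

With infrastructure defined as code, a number like this becomes an input to the template rather than a series of console clicks: changing one value and re-applying the definition is what makes it feasible to scale in hours.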

Since the infrastructure was defined as code, engineers could work with it directly. Environments and servers could be duplicated or updated as required. We could ramp up our servers in no time to deal with the surge in traffic that we saw, and again, I would say bravo to DevOps.

That leads us to 11th of April, exactly one month from when the worldwide pandemic was declared. As if the threefold surge in traffic wasn't enough, we were now seeing malicious attempts on our servers, DDoS attacks, and unwarranted stress on our systems.

Most of these threats intensified because of opportunities that have arisen during the COVID-19 outbreak for hackers to try out different things, because they already know systems are under stress. However, having proper DevOps practices in place helped bail us out from this problem as well.

DevOps focuses on examining the entire process. Our objective was to monitor and detect troublesome areas of a process, and to analyze feedback from the team and end users so that we could spot emerging problems and improve the quality of our products.

Having good observability practices in place helped us quickly detect the threats and mitigate them. With our monitoring dashboards and tools in place, we were able to quickly detect the attacks, block them, and apply policies to stop them from happening again.
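One simple form of this kind of detection is flagging clients that send far more requests than normal within a short window. The Python sketch below illustrates the idea with a sliding window per client. The `RateMonitor` class, the limit, and the window size are hypothetical; the talk does not describe which tools or thresholds were actually used.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Flag clients that exceed a request limit within a sliding time window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client -> timestamps of recent requests

    def record(self, client: str, timestamp: float) -> bool:
        """Record one request; return True if the client is over the limit."""
        recent = self.hits[client]
        recent.append(timestamp)
        # Drop requests that have fallen out of the window.
        while recent and recent[0] <= timestamp - self.window:
            recent.popleft()
        return len(recent) > self.limit


# Hypothetical threshold: more than 3 requests in 10 seconds is suspicious.
monitor = RateMonitor(limit=3, window_seconds=10)
for t in [0, 1, 2, 3]:
    suspicious = monitor.record("203.0.113.7", t)
    print(t, suspicious)
```

In this run the first three requests stay within the limit and the fourth one crosses it. A client flagged this way could then be blocked or throttled by a policy, which is the detect, block, and prevent loop described above.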

That was how the month went, right from the time the pandemic was declared. Then we moved on, adjusted ourselves, and found ourselves in a much better place. On 11th April 2020 we were exactly one month into the pandemic, and it has now been more than one and a half years since then. Looking back, I'm happy that we were able to sail through, and I'm here to share the story of our experiences. In spite of the challenges, we overcame these times, and having DevOps practices in place helped us immensely.

I would say the pandemic is a cautionary tale to businesses and companies that refuse to evolve and change the core of their existence. DevOps is here to stay, and the advantages it possesses will ultimately give your business the wings that it would need to soar over unexpected futures such as this. The events in the past year have only made DevOps more relevant.

Thank you.