The Shift to a DevOps Model While Building Our Cloud Platform

US 2021

The Shift to a DevOps Model While Building Our Cloud Platform - You Build it, You Run it!

Director – Cloud Platforms · Discover Financial Services

Senior Principal Enterprise Architect · Discover Financial Services

At Discover, we've built the next generation of container platform based on Red Hat's OpenShift Container Platform, which uses Kubernetes as its core orchestrator. This platform is the nucleus of a larger container ecosystem (a.k.a. "Tupperware"). Tupperware groups numerous products and services together to provide container builds and deployments, software-defined networking, and brokered relational database, object store, and caching public cloud commodity services. Discover’s multi-cloud architecture is centered on Tupperware’s cloud abstraction design and processes.

In this talk, we will share how we transformed our engineering practices from a siloed model to one that combined the development and operations teams into a single DevOps team. Before DevOps was implemented, the development and operations teams worked as two independent squads, each with its own goals and objectives. The differences and lack of communication between these teams often impacted the product, which in turn affected the consumers of the platform and Discover. There was a lack of ownership and accountability, and the lack of a feedback loop from operations back into our product backlog was affecting us adversely.

We'll talk about the techniques we used to drive the cultural transformation towards a DevOps-oriented approach, the practices that we followed, and what worked for us. We'll also cover the lessons learned from taking our teams through this DevOps transformation and the failures that we learned from. We'll discuss how we measured the efficacy of the transformation, with success criteria centered on consumer feedback and metrics like system uptime, number of incidents that caused production downtime, number of customer issues, etc.

This shift in approach improved our system uptime, improved our team ownership and accountability, and resulted in fewer consumer complaints. We'll also cover how we leveraged training to upskill engineers and blameless postmortem analysis techniques to drive root cause analysis when “incidents” happened in production that caused outages. We’ll talk about our practice of Site Reliability Engineering and the focus on reliability, which was central to our transformation. You'll walk away with an understanding of what it takes to truly transform the practices within your product team and embrace a true DevOps mindset, and of the benefits that this approach entails.

Throughout the transformation journey to shift our practices from a siloed model to a collaborative DevOps model, one of the key aspects we focused on was the culture of the organization. Our goal ultimately was to create better outcomes for the consumers of the platform by offering improved reliability and better customer service. We created a set of teams that were self-organized and empowered to make the right decisions. We modeled this behavior within the teams by encouraging team members to make decisions for the scope of work they are responsible for, and we provided support as needed.

We recognized that mindset is everything when it comes to these transformations and new ways of working. We provided incentives in the form of “bravo” awards and recognition for team members to embrace the “You build it, you run it” mindset. When “incidents” happened in production, we leveraged blameless postmortem analysis to drive root cause analysis and take action. To summarize, these are some of the key aspects of how we embraced culture as an enabler of the “You build it, You run it” mentality within our teams:

• Empowering team members

• Encouraging self-organization and decision making

• You Build it, You Run it: Mindset is everything

• Blameless Postmortem Analysis

Full transcript

The complete talk, organized by section.

Bryan Payton

Welcome to the DevOps Enterprise Summit. My name is Bryan, presenting "You Build It, You Run It: The Shift to DevOps Model" with Discover Financial Services. Presenting with me today is Sakthi.

Sakthi is the director of cloud and application platforms at Discover Financial Services, whereas I am a senior principal enterprise architect focused on application platforms.

Let's take a little deeper dive into our presenters.

As mentioned before, my name is Bryan Payton. I am a technology strategist and engineer. I spent the majority of my career focused on government intelligence and data analytics and working a lot in future state and applied strategy in engineering. I enjoy playing sports, doing anything outdoors, and hanging out with my two boys, pictured below. As you can tell, they are a handful, and they are a lot of fun. You can reach me at my Discover email address, brianpayton@discover.com.

Sakthi?

Sakthi Kasiramalingam

Thank you, Bryan. Hello, everyone. I am Sakthi. I am a director in our infrastructure products area here at Discover Financial Services. I have been with Discover for the last two and a half years. And before coming to Discover, I spent the last 16 years of my career in various engineering roles in product development organizations. I am a mother of two girls, and being a mother has taught me how to ruthlessly prioritize my time, so I am maximizing my time for myself, my family, and for my teams at work.

I am a lifelong learner, always looking to learn something new and adding to my toolset on a daily basis. You can reach me through sakthikasiramalingam@discover.com. That is my email address.

As Bryan said, we both work for Discover, and Discover offers award-winning credit card and personal bank offerings, as you might be aware. While we are a financial services company, Discover is very much a technology firm that leverages a technology-first approach to offer the best experience for our customers. We also embrace a culture that is deep-rooted in innovation and volunteerism.

I would encourage you to check out our technology openings at the career website that we have linked here. And as a reminder, all the views that are expressed in this presentation are ours individually and not those of our employer.

With that, I will pass it over to Bryan to get us started.

Bryan Payton

Thank you, Sakthi. So what we want to talk about today, our DevOps journey, really focuses on our application platform here at Discover Financial Services. Discover's application platform can be summarized as a container-based ecosystem that we have overlaid on top of public and private cloud infrastructure. This overlay ecosystem allows us to have consistent capabilities, products, and an operating-model environment for our developers' experience.

One of the key attributes and highlights of this infrastructure, as you can see represented across the globe there, is a full network and service mesh topology. So a network mesh and a service mesh overlaid on top of the application platform, on all these little bubbles there that you see presented. This gives us a very high level of availability and disaster recovery.

All the dependencies of the products and capabilities, and the deployment and delivery of those products and capabilities across this environment, are abstracted through a common set of APIs and all packaged and available through Helm. So our developer experience in that fashion is always consistent and always delivering the same experience to the deployment and CI/CD processes.

All of our common operations for these dependencies, such as database backup and recovery, are also centralized on that common set of APIs. So again, this really allows us to delegate the operations back to the application community and gives them one control plane, one consistent set of APIs for everything that they do within our environments.

And then this gives us a very nice way to confine and consolidate our security architecture and enforce all of our applied standards on all these different private and public cloud infrastructures equally. And so this application platform has a lot of highlights. It sounds very challenging. It sounds very rewarding. And in that, there have been some struggles. And so that is really what led us into our DevOps journey, which we want to take a look at next.

So this DevOps transformation that we went on was really rooted in a core set of problems, and the problems are highlighted here on this page and illustrated with this car engine that is fuming, which we can all relate to as a very inconvenient thing to happen to you in the middle of nowhere, right? That is how we felt with our platform. As we have been developing our application platform, we do not want to leave our application teams feeling stranded or feeling vulnerable, not knowing what happened and why.

So here are some of the things that we ran into. Our decreased platform reliability was the first indicator that we needed to do something. The reliability of the platform can be measured as the uptime rate, the experience of the consumers, and their applications' uptime rate as well, right? That kept going down and up, down and up over time, and it really was a wide set of problems and a wide set of different issues that resulted in that. But nevertheless, it was consistently inconsistent.
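As an illustration only, the uptime rate mentioned here could be computed along the following lines. This is a minimal Python sketch, not the tooling used at Discover; the function name `uptime_percent` and the sample outage durations are hypothetical.

```python
from datetime import timedelta


def uptime_percent(period: timedelta, outages: list[timedelta]) -> float:
    """Return the percentage of the period during which the platform was up."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    # Total downtime, capped so overlapping reports cannot exceed the period.
    downtime = min(sum(outages, timedelta(0)), period)
    return 100.0 * (1 - downtime / period)


# Hypothetical month with two production outages.
month = timedelta(days=30)
outages = [timedelta(minutes=30), timedelta(minutes=13, seconds=12)]
print(f"Uptime: {uptime_percent(month, outages):.3f}%")
```

Tracking this number per month makes a "down and up" pattern visible as a trend rather than an impression.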

Lack of ownership and accountability was another thing that we identified pretty early on in our journey that we needed to correct. What this looked like on a day-to-day basis: as in most environments, and ours was not unique, we have a segregation between our operations support and our platform engineering or core engineering teams. And so that turnover is not always great, that communication is not always great. And in our environment, that resulted in a lot of finger-pointing and, "Hey, we did not get the proper turnover," or, "Something was made and it was not communicated to us," and, "You did not fix this per the SOP or per the documentation or using our automation." And just a lot of that back and forth. And so there was not really a good sense of ownership of these problems and these things that were causing our platform to be inconsistent.

Lack of product orientation surfaced itself as well, as teams focused on troubleshooting and not really focused on the brand of our platform and promoting that brand and building around that product so that we were differentiating ourselves in our environment and differentiating ourselves to our consumers as, here are the benefits and here is your mindset into this platform.

The feedback loop into the product backlog is where we were lacking: taking those lessons learned from those different areas of issues that we were just expressing and putting those back into our agile backlog so that we could generate some successful remediation to those issues, or take a deeper dive and really find a root cause that maybe we were not able to identify during a postmortem, right, or during a fallout. And so we were lacking that and constantly fixing the same things over and over.

And then time to market suffered. Our time to market in delivering new products, delivering new capabilities into this platform, or delivering the platform itself, was starting to slow down because of our time spent in all these other areas. And so we narrowed it down to these areas and said, "These are our core problem areas that we want to focus on in this transformation."

So how are we going to start remedying or fixing and correcting these problems? Well, we had to identify techniques that we were going to use, right? We sat down and said, "What tools are we going to take out of our toolbox to actually solve these problems?" Right? There are a lot of different ways that we can handle this. What is going to be the most effective? Let's sit down and think about this before we just present a bunch of problems to the team and say, "Start fixing these problems."

One thing we started off with is a single backlog for development and operations. So we consolidated our operations and our development engineering teams. And what that enabled us to do, and what we set out to do with that, is reestablish that accountability, that ownership, understanding that when you build something and you have to provide support for it after you build it, it really makes you focus on building it the right way, building automation and remediation tasks and documentation around it. Those sorts of things, when you are put into the situation of supporting that product, really make you focus as an engineer. So we were rotating our folks through that process.

Operations as a rotational responsibility was the segue from that. And so now that we have collapsed the teams and they can see both sides of the picture, we do not want anyone to sit in operations too long, we do not want anybody to sit in engineering too long and lose that focus. So we rotate our people through operations cycles and engineering cycles on different sprint cadences in our agile framework. And that has allowed them to, again, understand the issue, take it and put it in the backlog, and also see it when they are developing: how is this going to be handled on the operations side?
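One simple way to picture such a rotation is a round-robin schedule per sprint. The sketch below is an assumption for illustration, not the scheduling method described in the talk; the function `rotation_schedule`, the `on_ops` parameter, and the team names are all hypothetical.

```python
def rotation_schedule(engineers, sprints, on_ops=1):
    """Return a list of (operations, engineering) assignments, one per sprint.

    Engineers take the operations seat in round-robin order, so nobody
    stays in operations or in engineering for too long.
    """
    n = len(engineers)
    if not 0 < on_ops < n:
        raise ValueError("on_ops must be at least 1 and smaller than the team size")
    schedule = []
    for sprint in range(sprints):
        start = (sprint * on_ops) % n
        ops = [engineers[(start + i) % n] for i in range(on_ops)]
        engineering = [e for e in engineers if e not in ops]
        schedule.append((ops, engineering))
    return schedule


team = ["Ana", "Ben", "Chi", "Dev"]  # hypothetical team members
plan = rotation_schedule(team, sprints=4)
for number, (ops, engineering) in enumerate(plan, start=1):
    print(f"Sprint {number}: operations={ops} engineering={engineering}")
```

With four engineers and one operations seat, each person covers operations once every four sprints.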

Single operating mechanism and Scrum ceremonies: this is going back into our agile framework adoption here at Discover. Making sure that our operations teams are following the same procedures, making sure that everyone is included in the same agile processes and Scrum ceremonies, allows people to voice things earlier, such as, "You need to consider this in the user story. You need to consider this during development," and allows more voices to be heard earlier on, instead of waiting until it is too late or waiting for a problem to occur for somebody to address it.

Upskilling through training and exercises: we have established a very rigorous training environment here at Discover, and this is meant to develop our internal staff. And so we have built many different courses, many different developer courses and administration courses and engineering courses and manager courses, all focused on educating and improving our engineering across different domains and within our application platform in particular.

Exercises were a way for us to make people get comfortable being uncomfortable. And what is meant by that is we would set up mock environments in the lab, and we would break things. And in breaking those things, certain people would have to respond and correct those things. It was kind of a training exercise to evaluate how you handle the stress of fixing something when it breaks, as you would if you got paged out, and what documentation are you using when you fix it, what pipelines, what automation are you using around that when you fix it, to make sure that people understand where our resources are, to make sure people are getting comfortable in responding to issues in those situations. And it really just became a big learning environment for people, and we do that on a quarterly basis.

Blameless postmortem analysis: this was our way to level set after an incident and make sure that we got rid of the finger-pointing and really focused on what was the issue, how are we moving forward from that, and how are we generating work to avoid this happening again.

An operations review meeting: this is kind of a weekly thing that we do, to evaluate how everything is going on the operations side. And again, because of our collapse of engineering and operations, the amount of feedback in those sessions and productivity in those sessions has really increased.

So now that we have identified the techniques, we had to make sure that they are successful. And to make sure they are successful, we have to measure those.

And so these are some of the things that we felt were important for us to measure as we are progressing through the DevOps journey. There is more that anyone can include in their journey, but a couple that were helpful for us are exercise scores and metrics. We talked about doing those training exercises. We gave scores based on how the issue was corrected: was it corrected in the right way? Was it a kind of Band-Aid fix? Was it following the process? And so we treated it as a pass-or-fail type of environment. And so that really gave our guys a sense of accomplishment and showed them areas where you need to do this the right way because it has consequences, right?
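The scoring questions above could be recorded in a small structure like the one below. This is a hedged sketch: the talk does not describe the actual scoring format, so the class `ExerciseResult`, its field names, and the all-three-must-hold pass rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExerciseResult:
    """Outcome of one engineer's response to a simulated incident (hypothetical schema)."""
    engineer: str
    issue_resolved: bool     # was the break corrected at all?
    root_cause_fixed: bool   # False means a Band-Aid fix
    followed_process: bool   # used the documented procedure and automation

    def passed(self) -> bool:
        # Assumed rule: an exercise passes only if all three checks hold.
        return self.issue_resolved and self.root_cause_fixed and self.followed_process


def pass_rate(results: list[ExerciseResult]) -> float:
    """Fraction of exercises that passed, or 0.0 when there are no results."""
    if not results:
        return 0.0
    return sum(r.passed() for r in results) / len(results)


results = [
    ExerciseResult("engineer-a", True, True, True),
    ExerciseResult("engineer-b", True, False, True),  # resolved, but only patched
]
print(f"Pass rate: {pass_rate(results):.0%}")
```

Comparing the pass rate from one quarterly exercise to the next gives a rough view of whether the training is working.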

Team exhaustion and inclusion surveys: these are a way for us to understand, are we pushing our people too much? Are we asking them to do too many things? Do we need to scale back capacity on our planning and make sure people have time to be effective in their jobs and be effective in operations and engineering work?

And then the number of incidents that were escalated. And so the reduction of issues that were open, the reduction of production incidents that were open from our consumers, is a really easy key indicator of are we taking what we have learned and generating new ways to automate that or correct that long term. And so these are just some of the metrics that we decided to evaluate and measure throughout our journey.

But it is not just the technical issues and techniques and measurements that were going to make us successful on this journey. We understood that there has to be a focus on culture. There has to be kind of a feedback loop between lessons learned and bumps in the road. And to talk more about that is Sakthi.

Sakthi Kasiramalingam

Have you heard of the saying, "Culture eats strategy for breakfast"? Well, I think if you are not careful, it is true that culture eats strategy for breakfast, lunch, and dinner.

So far, Bryan talked about the specific challenges we faced initially that drove us to do something different. He talked about the specific techniques we applied to overcome them, along with how we measured the efficacy of our success centered around certain objectives and key results.

While we focused on applying specific techniques to drive this transformation, as well as measuring what matters, we recognized that culture and mindset are everything when it comes to change and transformation. The biggest challenge around DevOps and transforming to a new way of working is not the technology or the metrics, but it is the people and the behaviors exhibited on a daily basis. So we decided to focus on our people and leverage culture as a key enabler for this transformation.

So you might wonder, how did we go about creating a culture within the team that enables this transformation? What did that even mean for us? First, I purposefully moved away from a command-and-control type of an approach, and instead focused on creating a set of teams that were self-organized and empowered to make the right decisions. We modeled this behavior within the teams by encouraging team members to make decisions for the scope of work they are responsible for, and we provided support as needed.

I embraced an approach of leading with questions instead of answers to guide the team towards self-organization. When incidents happened in production, to take the finger-pointing and the blame out of the picture, like Bryan talked about earlier, we used techniques like blameless postmortem analysis and five whys to drive for root cause analysis and to learn from the failures.

When handling routine operational support, our product owners encouraged our team members to get to the root cause of these incidents so they are resolved once and for all. Most importantly, as a leadership team, we focused on establishing psychological safety within the teams by being open and engaged, and by listening and responding to team members' feedback.

We centered our practices on the philosophy of you build it, you run it. In order to break the knowledge silos that existed between our development and operations teams, we helped establish a weekly learning series where our product teams came together on Friday afternoons for upskilling and cross-training to eliminate the single points of failure that existed within the team. This led to a culture of being a learning organization that continuously learns and improves.

While we had our fair share of success and excitement about the new way of working, we also faced some significant challenges, especially initially. There were some initial failures that we faced, but we quickly inspected and adapted our approach to make some tweaks as required. This helped us to turn our failures into stepping stones for success.

So let us take a look at some of our early failures and then how we overcame them. Number one, our technical exercises resulted in failures and lower morale within our team. Our technical exercises were nothing but simulated chaos tests of possible incidents that could happen in production, so that our team members could practice, learn from the experience, and become ready to handle real incidents. Initially, these exercises led to failures because our team members were not able to get to the root cause of the incidents. So we had to restructure our technical exercises with clear outage steps, misconfiguration injections, expected outcomes, and lessons learned. And we also recognized the top performers from these exercises to motivate our team.

Secondly, the training material that was delivered was not very effective in helping meet the needs of our team. So we slowed down the training program to ensure quality training materials and lessons were delivered. We reviewed the material prior to presentation. We also started loading these as tasks in our Agile planner to help keep us on track.

Last, our engineers were being overworked balancing operations and product delivery. Based on the feedback from our team, we made adjustments to on-call rotations so that engineers were not on support 24/7 when they were on call. We also adjusted our product development cycles so that engineers on call were not allocated towards product development initiatives during that particular time period. This helped our teams to maintain a better work-life balance.

The onset of the pandemic and the move to fully remote work also led to increased burnout for some of our team members. So we helped establish forums like virtual happy hours, watercooler virtual team chats, and informal coffee chats to help our teams have some social forums outside of our day-to-day operating model and product delivery initiatives.

So in addition to the initial failures we faced, let us take a look at some of the key obstacles we faced, and most importantly, how we overcame them. First, not all engineers were comfortable with operational responsibility. We overcame this by gradually exposing our engineers to customer issues through tickets, incidents, as well as recurring chaos test exercises. Using these approaches helped us to instill the confidence and the experience that the team needs to really embrace the approach of you build it, you run it.

Secondly, there was a lack of prioritization for the team's training, exercises, and documentation. With an extremely busy product backlog, I am sure you can relate to that, it was very challenging for us to prioritize the time for team training and documentation. Feature development and product delivery initiatives took a higher priority over training the team. But we had to make a conscious effort on being a learning organization and make these trainings a priority to ensure our team's overall success. This sometimes meant being willing to put a deliverable back in our backlog so that we can properly train and mature our team, documentation, and practices.

Finally, addressing technical debt. With a huge volume of support issues that started coming our way and the demand from the enterprise, it was very easy for our team to just resolve the issue at hand and call it done. But as part of our new operating model and the new way of working, we helped establish a weekly operations review meeting where our team developed a new muscle of looking at this operational backlog of issues through a different lens to see what repetitive patterns of issues came along, and then how we can put capabilities in place to improve automation and reduce technical debt.
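To make the idea of spotting repetitive patterns concrete, a review could start from a simple count of tickets per category. The following Python sketch is illustrative only; the ticket fields, the category names, and the `min_count` threshold are hypothetical and not taken from the talk.

```python
from collections import Counter


def recurring_patterns(tickets, min_count=3):
    """Return (category, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(ticket["category"] for ticket in tickets)
    return [(category, n) for category, n in counts.most_common() if n >= min_count]


# Hypothetical week of operational tickets.
tickets = [
    {"id": 1, "category": "certificate-expiry"},
    {"id": 2, "category": "quota-exceeded"},
    {"id": 3, "category": "certificate-expiry"},
    {"id": 4, "category": "dns-lookup"},
    {"id": 5, "category": "certificate-expiry"},
    {"id": 6, "category": "quota-exceeded"},
]
for category, n in recurring_patterns(tickets):
    print(f"{category}: {n} tickets this week, candidate for automation")
```

Categories that clear the threshold week after week are natural candidates for a backlog item rather than another one-off fix.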

So what are some key learnings that you can take away from our experience that can help you in your journey? First, of course, it is people first. Get to know your people, what motivates them, and how they like to be rewarded. There is no one size that fits all when it comes to motivating your teams and providing leadership. Building a culture that is centered on trust and shared accountability takes a lot of time and energy, but it is very well worth it. So invest in your people and center your transformation leveraging a people-first approach.

Second, be flexible and learn from mistakes. When you do something new, there is a possibility that you might fail, especially initially. But that is okay. Set a clear vision and goal to help you stay focused on the outcomes, but remain flexible on how you will get to your desired outcomes.

Next, focus on metrics that matter. It is important to measure what matters to ensure that you are making progress. It is also important to project visibility around these metrics with your entire team and your leadership team. When you face obstacles, use the objectives and key results as a guiding light to keep the focus on where you need to go. Leverage data to make decisions whenever possible.

Next, be customer obsessed. Keep your focus on the customers of your product and focus relentlessly on delivering value for your customers. Find a way to get periodic customer feedback through surveys and other feedback channels, and make this a regular practice for your product teams.

Finally, evolve and grow. You are never done. Any change or transformation is a journey, and it is not a destination. So allow yourself and your teams to fail. As much as we have had a successful transformation to a DevOps way of working over the last couple of years, we are not done. We also constantly have a newer set of challenges pop up, but it is about looking at these challenges through the lens of continuous improvement to see how we can iterate, evolve, and then grow through that process.

So those are some of our key learnings from our experience in going through this DevOps transformation. Hopefully, you have learned something from our experience that can help you in your environments as well.

Feel free to reach out to Bryan or me through the emails that we have shared. We would love to hear from you if you have any feedback or if you want to ask us any questions. Thank you.