Accelerating Value Delivery with Site Reliability Engineering

Las Vegas 2023

Erika León-Ravinez

Tribe Leader DevSecOps & Resilience · Banco de Crédito BCP

Maria Luisa Polo

Head of Software Engineering · Banco de Crédito BCP

Luis Alberto Guevara Sandoval

Digital Architecture Manager · NTT Data

Our success story is based on the joint work between BCP and NTT Data. BCP is the largest bank in Peru, with more than 12 million clients; it belongs to Grupo Crédito and is one of the most important banks in South America (source: BCP's Memory). NTT Data is an IT services company, part of NTT Group, that ranks in the top 6 worldwide and number 1 in Peru, with a presence in more than 80 countries (https://www.nttdata.com). Together we evolved DevSecOps, a cultural, process and technological paradigm at BCP that generated the highest level of value delivery acceleration, towards an approach that includes operational stability as a fundamental pillar of software development. We show how BCP changed its traditional, reactive way of working on application stability, oriented towards operations management, into a proactive "assurance" scheme of operational stability that spans the whole software development cycle. This brought significant results (avoiding economic losses of up to USD 2.4 million per year) and ensures the balance between speed, quality, security and "stability".

Full transcript

The complete talk, organized by section.

Luis Alberto Guevara Sandoval

Hello. Well, hello everyone.

Today we'll talk about site reliability engineering, focusing on the success story of BCP and on how applying this practice in different ways can really accelerate value delivery while ensuring operational stability and resilience.

First of all, let me introduce myself. I'm Luis Guevara, Technology Director of NTT DATA. I don't know if you heard about us, but if you haven't, NTT DATA is a global leader in information technology services, and the number one in Peru, Latin America.

And we're proud to say that we're the chosen strategic partner. We're the first option of 75% of Fortune 100 companies. We belong to the NTT Group, a top Japanese company, and we have over 200,000 employees, 15 high-performance centers dedicated to rapid technologies, and more than 50 operations in more than 50 countries.

And that's why we can lead digital transformation processes and tell you about successful cases of SRE, like the one with BCP. So I would like to introduce Erika and Maria from BCP. They are renowned experts in the field of software engineering, DevOps, and resilience. And as a partner of BCP, we're proud to be part of this journey. And I will leave you with Maria.

Maria Luisa Polo

Hi, everybody. I am Maria Luisa Polo. I'm the CIO and Head of Software Engineering at BCP, and with me is Erika León-Ravinez, who is responsible for the DevSecOps and Resilience tribe.

So let's start. Let me take a couple of minutes to talk about the bank.

We are the largest bank in Peru, and our presence extends beyond our country. We operate in four countries across Latin America. We have 1,700 employees organized in 35 tribes and more or less 300 squads.

But beyond the numbers, let me address why BCP is so relevant in our country. As we said in the previous slide, our purpose as a company is to transform plans into reality. BCP delivers financial services for the majority of Peruvians, and its speed and operations are crucial for our country's economy and success.

We also have the commitment of delivering a wow experience for our customers. And because of that commitment, we began this transformation journey I will tell you about in the next slides.

Okay, we started, surely like many of you, with very low delivery speed. It took us months to overcome this. We invested in technology frameworks and digital capabilities, but the results, as you can imagine, weren't what we expected.

Why? Because of adoption. We had all the toys, but nobody used them.

So we decided to take our strategy one step further. We implemented a comprehensive approach that included processes and people. We worked with all the levels of our company, starting with the C-level, because we needed a lot of support and sponsorship in those days.

We explained, we taught, we sold the idea that the benefits of these practices were going to turn into success for our organization. And, of course, things happened. Not everything went well at the beginning, but we took these lessons and put them into our strategy, and we moved forward.

With this approach, we multiplied our delivery capacity by six. We went from 6,000 releases to 35,000 releases per year. We reduced our delivery time by 80%, from months to just days. And we maintained our release frequency at one week for the most important applications in the bank.

This posed many challenges regarding technological adoption, mindset, and of course ways of working, but also a lot of benefits in terms of flexibility and value delivery. So we were good.

But despite achieving this acceleration in delivery capabilities, the pandemic brought us a new challenge. The high demand for digital services caused us instability, lots of incidents, and this resulted in economic and reputational impact.

Erika León-Ravinez

Hi, I am Erika, and I will show you today how BCP improved using SRE.

This scenario showed us that operational resilience has to be more rigorous to ensure stability. But how could we achieve this?

Well, we have five components: observability, scalability, dependency, availability, and show evidence for full application design.

As you know, hope is clearly not a strategy. So we decided to leverage SRE as a practice to achieve this objective. For us, SRE is not only a methodology. SRE is a software engineering approach with a primary goal to ensure reliability in our applications all the time.

People usually associate SRE with post-operation: improve response times for problems, resolve incidents. However, at BCP we take a different approach. We apply SRE as a whole, from the planning phase all the way through post-operation.

This helps us to generate resilience stories from the beginning of the process. Those stories also take the business point of view into consideration and ensure that all requirements are born with an application and stability mindset.

Resilience is not just a technical matter. It must be aligned with business objectives. It's not just about getting a system up and running. It's about making it work in order to create an optimal experience for our customers.

This is when SRE becomes a strategic tool to build business objectives and provide the best possible experience for our users.

Our first step was to collect all the information, practices, knowledge, reference knowledge, and all experience, both globally and locally, exploring the best practices of industry leaders like NTT DATA, DevOps Institute, Gartner, McKinsey, all of them.

From that learning, we approached the case and created a pilot that allowed us to see how we apply these concepts and generate value in our organization.

This pilot was a crucial starting point. We applied it in wholesale banking services, generating an important backlog and actions that helped us to give visibility and increase the availability of our applications.

By showing the value of SRE through this pilot lab, we took the next step and developed an implementation model. We started with a centralized approach based on the most important banking channel, and then scaled all the capabilities to other channels that need and require speed and resilience.

And our expansion included attention to critical areas such as visibility, observability, alerts, and monitoring, which allowed us to achieve excellent results.

Well, we started with proactive monitoring. We managed to reduce 75% of false positive alerts. It's a lot.

On the other hand, in the massive incident management front, we were able to identify application maps that generated the incident and were able to minimize the impact by 30% on the application.

Once controlled, we found that there were several components, both software and infrastructure, that had neither redundancy nor scalability mechanisms. So we had to review the design and generate more structural changes.

Also at the deployment level, we identified recurring errors and established procedures to resolve them. And finally, we generated one evolution backlog that will continue to ensure operational stability.

Well, we had very, very nice and good results. We increased our main channel's availability from 95.97% to almost 99.9%, which represents millions of dollars of savings for the bank.

This was accomplished in less than one year in just one application, and represents almost 3% of the revenue generated by the channel. This just by adopting SRE practices. Can you imagine how much we can get if we apply it to the main applications? That is a lot, considering that it will not be applied to all applications.
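To make the availability figures concrete, here is a rough sketch of the downtime they imply per year. It assumes the channel is measured around the clock (24 hours a day, 365 days a year) and treats "almost 99.9%" as exactly 99.9%, so the second figure is a best case; the function name is ours, not BCP's.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption: 24x7 service)


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return (100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR


before = annual_downtime_hours(95.97)  # figure quoted in the talk
after = annual_downtime_hours(99.9)    # "almost 99.9%", taken as 99.9% here

print(f"before: {before:.1f} h/year")
print(f"after:  {after:.1f} h/year")
print(f"downtime avoided: {before - after:.1f} h/year")
```

Under those assumptions the channel goes from roughly 353 hours of downtime a year to under 9.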

Okay, and now Luis will tell you our learnings and conclusions. Thank you.

Luis Alberto Guevara Sandoval

Well, thank you, Erika.

So, you have heard about our journey, the benefits we have talked about, and the progress we have made so far. We are still evolving. In fact, we're now evolving with two dimensions in mind: scalability and technology.

In any case, the scalability model allows us to grow from a centralized model, with some specific and limited characteristics, to a federated one, for more deployment and speed. As a result, we are now on a hybrid model, which combines the best characteristics of each model. And therefore, we must implement these models with specific criteria and a special focus on when and where they are applied.

On the other hand, if we talk about the technological model, we started automating observability and monitoring of applications following standards and frameworks such as OpenTelemetry. And we are really improving root cause analysis. And equally, we use generative AI across the whole software-building process, for example to improve reliability.

So finally, we continue evolving by implementing practices like chaos engineering, continuous recovery, and self-healing platforms.

And well, to conclude, we want to share with you some lessons and insights that can be very useful for your organizations, future implementations, and future projects.

So first and foremost, sponsorship is key. If the organization at the highest levels is not committed to the value of the practice, we cannot succeed in this implementation. Just for example, BCP defined the entire organization model responsible for implementing this practice at all levels.

Another important factor is culture. Teamwork and human capital focus are critical factors, and aligning efforts with business objectives is crucial to achieve goals through promoting the use of these practices and how to use them, as well as showing the business benefit they provide.

On the other hand, it's essential to remember that development demands resilience over the entire software process. It's not just operational stability. It's also the ability to plan, design, and adapt the whole software-building process.

Well, in the same order of ideas, applying SRE practices across the entire software value stream is crucial to avoid incidents that can be mitigated from the outset.

In addition, it's essential to establish counterbalancing practices within your organization to protect from business pressure. For example, take the case of the product owners that want to release more functionality without taking into consideration that maybe a simple extra feature may compromise the entire application. That is unacceptable.
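The talk does not say which counterbalancing practice BCP uses. One common SRE mechanism is an error budget that pauses feature releases once reliability falls below target. The sketch below illustrates that idea only; the `ErrorBudget` class, the `release_allowed` function, and the 99.9% target are hypothetical, not BCP's policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical error budget for one application over one time window."""

    slo_pct: float        # availability target, e.g. 99.9 (illustrative value)
    total_requests: int   # requests served in the window
    failed_requests: int  # requests that failed in the window

    @property
    def allowed_failures(self) -> float:
        """Number of failures the target tolerates for this traffic volume."""
        return self.total_requests * (100.0 - self.slo_pct) / 100.0

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative when overspent."""
        if self.allowed_failures == 0:
            # No traffic or a 100% target leaves no budget to spend.
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def release_allowed(budget: ErrorBudget, min_remaining: float = 0.0) -> bool:
    """Allow a feature release only while budget remains above the threshold."""
    return budget.remaining_fraction > min_remaining


healthy = ErrorBudget(slo_pct=99.9, total_requests=1_000_000, failed_requests=400)
exhausted = ErrorBudget(slo_pct=99.9, total_requests=1_000_000, failed_requests=1_200)

print(release_allowed(healthy))    # budget left, so new features can ship
print(release_allowed(exhausted))  # budget overspent, so stability work comes first
```

The point of a rule like this is that the decision rests on agreed data rather than on whoever pushes hardest, which gives the team a neutral way to answer a product owner who wants one more feature.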

And well, finally, observability is just the foundation of resilience. Remember, that is merely the starting point. There is a lot more to discover and explore.

Well, thank you very much. I hope this insight helps you in your future implementations, for future projects. And feel free to reach out to us. Thank you so much.

Thank you.