Rethinking Operations to Deliver Business Value Faster at DBS Bank

Las Vegas 2019

Dapeng Liu

VP, Development · DBS Bank

Shaun Norris

Field CIO Asia-Pacific · Pivotal

Join Shaun Norris, Pivotal CIO for APJ, and Dapeng Liu of DBS Bank as they share how a team at DBS Bank has scaled deployments from 10 per year to 4 per day. We will look at the operations side of this transformation and at how a solid foundation of automation made that increase possible.

We will start at a high level view of why automating operations can help development teams increase velocity, and then dig into the details of typical ops activities such as HA restarts, VM lifecycle, network provisioning, patching of polyglot microservices, sharing security credentials, setting up centralized logging and more.

We will also cover how DBS moved key applications from monoliths to microservices, how an automated platform has taken over a number of key operations activities, and how the mindset around production change control has shifted in the organization.

Full transcript

Shaun Norris

I'm Shaun Norris. I'm the field CIO for Pivotal, covering Asia Pacific. I'm based in Singapore. This is my fifth DevOps Enterprise Summit and the fourth one here in the US; I was at the first one way back in San Francisco in 2014. It's a real privilege to be here and introduce my friend and working partner, Dapeng Liu from DBS. Dapeng, do you want to introduce yourself quickly, please?

Dapeng Liu

Yep. Hi, everyone. My name is Dapeng. I am from DBS Bank, Singapore. I came all the way from Singapore to here just to share some interesting stories that we think are worth sharing. I lead the group's credit risk development team, and maybe we'll actually talk more about what's going on, right?

Shaun Norris

So we're just waiting for our slides to come up. This will be really fun if we do 30 minutes with no slides. We probably can, because we've done a dry run, but the interesting thing, when Dapeng and I got together and had coffee in Singapore six or seven months ago and started this idea of doing this talk together, it was really based around the idea of the Accelerate book. The work that Dr. Nicole Forsgren and team have done, if you saw her talk yesterday, it was great. That book has been really informative as I go around and talk to customers, and in my roles working for large banks in the past, especially these big four metrics.

When I got talking to Dapeng from DBS and he started telling me the transformation story that they've been on over the last two and a half or three years, I was like, this sounds to me like one of those elite performers that we read about in the State of DevOps reports and in the Accelerate book. Let's start at the end of the story in terms of business outcome. This is really a technology talk. But in July this year, first of all, how many people have heard of DBS? Wow. That's good. Awesome, yeah. DBS is Development Bank of Singapore. I'll let Dapeng explain a little more, but maybe let's start with talking about the award that DBS won in July this year. Can we have the slide, please? Oops. Looks like we may actually have some slides. Oh, they went.

Dapeng Liu

Okay, so this year we have achieved global recognition, being recognized as the best bank in the world. Thank you. So that was a magazine in the European Union called Euromoney.

Shaun Norris

Yep. In July of this year, they voted DBS Bank the best bank in the world. You may not have heard of them, but hopefully you'll go look that up and reference it. I think we're going to use that as a placeholder. We're not going to go into balance sheet reading and prove to you that the DevOps outcomes talked about in Accelerate are definitely in place. But let's just use that as a placeholder and agree that DBS is one of the trendsetters. They're one of the elites in the banking world.

As we go back and remind ourselves of what Accelerate talks to us about, I was hoping to have the big four metrics up here, but since we don't, remember what they are. They fall into two groups: throughput and stability. You've got two metrics on throughput: how long it takes you to get a change into production, and how often you go to production. In a lot of the organizations I've been part of, you went to production maybe once a month, and we're going to hear from DBS that was the position they were in a couple of years ago.

And yet, it's not good enough just to have lots of throughput and be going really fast. You also need stability. The other two of these four key metrics talk about how often things break when you go to production, and how fast you recover if something breaks: change failure rate and time to recover. I'm hearing a lot more about those four metrics as I talk to customers across Asia. I want to use the Accelerate report, or the State of DevOps report, as a backdrop for what DBS have been up to.

In the Accelerate book, the punchline is that if you adopt DevOps practices and principles, you are going to improve your delivery capability of software and operations, and that, in turn, is going to improve your business outcomes, however you measure that, whether it's profit, revenue, turnover, et cetera. We've already gone to the end of that story and talked about how DBS was voted best bank in the world this year, but maybe let's talk about what are some of the specific principles and practices your team picked up, Dapeng, over the past couple of years?

Dapeng Liu

Since I'm not from the marketing department, I will leave the sales talk to my marketing department. By no means do I want to steal their job. Giving you an example of where we have been and where we are now: I joined the bank about three years ago. Back then, our in-house-developed so-called legacy credit approval application was about 8 million lines of code, .NET Framework, ASP.NET. Don't judge; actually, we know. It was really painful that we could not afford to have more than one release every month into production. We used to have this tradition saying that every time we go into production, we need to have a party, because it's just way too much effort to put in and we need to give ourselves a pat on the back.

Now, after three years of development, we are talking about almost 100 times into production every month. Since we only do weekday deployment, and we don't actually do Saturday or Sunday, we are talking about four times a day into production. That's where we have the title from: 10 times a year into four times a day. If you convert that four times a day into a year, that's roughly about 1,000 times a year.
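Editorial illustration (not part of the talk): the conversion between the rates quoted above can be checked with a few lines of arithmetic. The working-day counts are assumptions chosen for illustration; the talk only says that deployments happen on weekdays.

```python
# Rough conversion between the deployment rates quoted in the talk.
# Both working-day constants are assumptions, not figures from DBS.
WORKING_DAYS_PER_MONTH = 22
WORKING_DAYS_PER_YEAR = 250


def per_day_from_monthly(deploys_per_month: float) -> float:
    """Average weekday deployments given a monthly total."""
    return deploys_per_month / WORKING_DAYS_PER_MONTH


def per_year_from_daily(deploys_per_day: float) -> float:
    """Yearly deployments given an average weekday rate."""
    return deploys_per_day * WORKING_DAYS_PER_YEAR


daily = per_day_from_monthly(100)   # "almost 100 times ... every month"
yearly = per_year_from_daily(4)     # "four times a day"
speedup = yearly / 10               # versus the old 10 releases a year
```

Under these assumptions, 100 deployments a month works out to a little over four per weekday, and four per weekday to about 1,000 a year, roughly a hundredfold increase over 10 a year.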

Shaun Norris

We have some slides up now. A really quick rundown: we don't have a monitor here, so we're going to do the awkward thing and look behind us. If you think about it in the Accelerate backdrop, lead time went from 32 hours down to three, release effort went from 12 to two, and infrastructure provisioning now, instead of taking a week, happens the same day. If you've not seen these slides, go download them from the DORA report, read them, take them back, and share copies in your organization.

DBS has been doing things like using cloud-native architecture, using a PaaS, using private cloud and automated delivery pipelines, and self-service of infrastructure provisioning. All the stuff you read in the State of DevOps reports and the Accelerate book saying these are good things and they're correlated with high performance. We now have a real-life example of a team in a highly regulated, complex financial services industry that are doing these things. They're achieving these sorts of improvements in their technical outcomes, and they've been voted best bank in the world.

There are seven key areas that the Accelerate book talks about in terms of practice areas that you want to think about adopting if you're going to get on the path to be one of these elite performers. We wanted to pick two, and we also wanted to talk about both dev and ops, but shade it a little more towards ops because it's maybe the forgotten part sometimes of the DevOps story: what platform teams and infrastructure teams are doing to help achieve these improvements in delivery. We want to zone in on two specific things. One is how DBS adopted a platform as a service, and the second is what sort of things they are doing around infrastructure as code.

Let's talk about infrastructure as code first. The Accelerate book tells us that if your teams are manipulating their infrastructure as code, they're almost twice as likely to be an elite performer as teams that don't, especially if they have declarative, version-controlled environments. If you managed to see my colleague Cornelia Davis' talk yesterday, it was fantastic on using Kubernetes and having a declarative functional model for managing your infrastructure. That's the theory, but let's hear from Dapeng and DBS on what they have actually done around infrastructure as code.

Dapeng Liu

From the self-provisioning point of view, we do have Cloud Foundry to help us provision compute, RAM, and disk. However, if we think about software development as a whole, infrastructure shouldn't necessarily only be classified as the hardware or computers. It is also the stuff that is relevant to the code but is not part of the code. Therefore, we think the pipeline and the automation part should also be considered part of the infrastructure.

These days, we have put in enough effort to make it real, in the sense that you can automate not only environment provisioning, but also the rest of the stuff: provisioning the Git repository, and provisioning the pipeline, which does the build, test, scan, and deployment. To date, we have about 260 live repositories, and another 130 code repositories have already been decommissioned. Put into perspective, that's about 400 repositories in three years' time. Assume people work about 270 days a year; that's about 800 days. Therefore, every other day, we are putting up a new repository. These activities have started to become the norm.

From the pipeline perspective, we have a one-to-one mapping from the code repository to the pipeline. Every repository is backed by exactly one pipeline. Every time there's a push into the repository, the pipeline will be automatically triggered. Last month alone, we managed to push into our testing environment no fewer than 200 times. This is fundamentally different from how we worked on the old legacy application.
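Editorial illustration (not part of the talk): the "every other day" figure follows from the numbers quoted above. The variable names are ours; the values are the ones given in the talk before rounding.

```python
# Repository cadence using the figures quoted in the talk.
live_repos = 260
decommissioned_repos = 130
years = 3
working_days_per_year = 270  # the speaker's own assumption

total_repos = live_repos + decommissioned_repos   # rounded to "about 400" in the talk
working_days = years * working_days_per_year      # rounded to "about 800 days"
days_per_new_repo = working_days / total_repos    # roughly one new repository every two days
```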

Shaun Norris

Very cool. Let's talk a little bit more about this pipeline automation. You were sharing with me how before, when it took a long time to get a pipeline, you ended up with really giant microservices. I think this would be a great anecdote to share with the community.

Dapeng Liu

The story hasn't always been this romantic. We all start with something. Once upon a time, we had some so-called microservices, but by no means were they micro. The bigger ones had hundreds of controllers in them, tons of people working in the same repository, and people competing with merge conflicts because there were so many branches going on. Despite the constant emphasis about putting up the right architecture and doing microservices right, we didn't actually see people spinning up new repositories. We were right on the trajectory of being a certified monolith.

Then we sat down and talked about this problem. Why do we think there are good things we need to do, but people choose not to do them? We had this conversation, and I think it was one of the most important conversations I ever had in the whole development effort. We asked, what is the reason why people wouldn't want to spin up a new repository, given the fact that the codebase has already grown a little too big? The feedback was: I only need to write about 100 lines of code to get my ticket out of my hands, but setting up the pipeline took me about three days. In terms of risk and reward, it doesn't really gel. Suddenly that realization kicked in: just because doing something is right does not mean people will do it if it's not easy to do.

So we said, let's sit down and streamline the pipeline setup. More and more microservices started to pop up. These days, if you set up a new pipeline, it takes you no more than five seconds. If you kick off the pipeline and run it through the whole testing environment, it takes about ten-something minutes end to end. In the beginning, when you have an application running locally on port 8080 given a JDBC URL, that's all you need to have a pipeline.

However, another problem kicked in. As more and more pipelines were put into the environment, we found that every microservice had its own little pipeline specification or definition file. At around 50 to 70 pipelines, we felt that it was becoming a little bit unmaintainable, because every now and then there would be a bespoke requirement for the pipeline. We figured out there was a way for all of the pipelines to inherit from a common definition. For each and every one, you don't have to put out the full-blown definition. It is sufficient to provide the necessary key pieces of information: where is your code, and where do you want to deploy it? It's a very iterative approach. We never anticipated landing in this situation. Today, all of our 200-something repositories with 200-something pipelines inherit from a single master definition. For the rest, you don't really have to do anything. As we go, we figure out more problems, and then we try to find solutions to solve them.
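Editorial illustration (not part of the talk): one way to picture pipelines inheriting from a single master definition is a shared default that each repository overrides with only its key facts. This is a minimal sketch, not the DBS implementation; the talk does not name the pipeline tool or its file format, so every key name, stage name, and URL below is hypothetical.

```python
from copy import deepcopy

# Hypothetical master definition shared by every repository.
MASTER_PIPELINE = {
    "stages": ["build", "test", "scan", "deploy"],
    "trigger": "on_push",
    "timeout_minutes": 30,
}

# The only facts a repository has to supply in this sketch.
REQUIRED_KEYS = {"code_url", "deploy_target"}


def resolve_pipeline(repo_spec: dict) -> dict:
    """Merge a repository's minimal spec over the master definition."""
    missing = REQUIRED_KEYS - repo_spec.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    pipeline = deepcopy(MASTER_PIPELINE)  # never mutate the shared default
    pipeline.update(repo_spec)            # repository-specific keys win
    return pipeline


# Placeholder repository: where the code is, and where to deploy it.
credit_service = resolve_pipeline({
    "code_url": "git@example.invalid:credit/approval-service.git",
    "deploy_target": "test",
})
```

With this shape, a change to the master definition reaches every pipeline at once, and a bespoke requirement becomes one extra key in a single repository's spec rather than a forked definition file.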

Shaun Norris

The other key part of the story is the use of platform as a service. If we refer back to the backdrop of the Accelerate book and the State of DevOps reports, teams using platform as a service in last year's report were one and a half times more likely to be elite performers. Our theory is that this is likely because developers are spending more time above that value line and less time wrangling infrastructure and doing non-value-added heavy lifting. Dapeng, share with us a little bit about how the platform has enabled your team.

Dapeng Liu

As a development manager, I put myself in a position where my job is to optimize developers' time. I really want them to focus on one thing, which is dealing with the business logic rather than anything else. The platform in this case is Cloud Foundry. Here are the things we are enjoying on a daily basis.

First, about the abstraction of the environment: you don't have to call anybody to spin up the environment to run your application. All you need to do is specify how much RAM and what the disk requirement is, and boom, there you go. You have your environment running.

We don't have persistent storage. It started as a shortcoming, but it turns out to be a very good constraint. It pushes the design of the application toward the stateless end of the spectrum, so we can easily spin up more and more services and scale.

Another very key point for all of the achievements so far is control of the DNS layer, where we have the capability to manipulate the route and the load balancing algorithm. This gives us the capability to do blue-green deployment. All of the roughly 1,000 deployments that happen every year happen at 2:00 PM during office hours. We don't have to request a special downtime.
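Editorial illustration (not part of the talk): blue-green deployment comes down to starting the new version alongside the old one and moving the route only once the new version is healthy. The sketch below is a toy in-memory model under that assumption; `Router`, `blue_green_deploy`, the hostname, and the app names are all hypothetical and are not Cloud Foundry APIs.

```python
class Router:
    """Toy stand-in for a platform routing layer: one hostname maps to one live app."""

    def __init__(self):
        self.routes = {}

    def map_route(self, hostname, app_name):
        self.routes[hostname] = app_name

    def resolve(self, hostname):
        return self.routes[hostname]


def blue_green_deploy(router, hostname, new_app, is_healthy):
    """Switch the route to new_app only if it passes its health check."""
    if not is_healthy(new_app):
        # The old version keeps serving traffic; nothing was interrupted.
        return router.routes.get(hostname)
    router.map_route(hostname, new_app)  # a single route update, no downtime window
    return new_app


router = Router()
router.map_route("credit.example.invalid", "credit-blue")
live = blue_green_deploy(
    router, "credit.example.invalid", "credit-green", lambda app: True
)
```

Because the old version stays mapped until the switch, a failed health check leaves users on the previous release, which is what makes a mid-afternoon deployment tolerable.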

Coming back to our interesting tradition, we used to have this tradition: let's go for a party after each deployment. I don't think, as a bank, we have that kind of money to host that many parties after each and every deployment. And even if we do, there's a physical limitation of people's belly. How much can you stuff in?

Shaun Norris

Deployment was so rare and so special that you used to throw a party every time. This was like two years ago, and it used to be at 11:30 on a Saturday night in a green window. Now, how many times a day on average are you going?

Dapeng Liu

In a day, it's actually about four to five, and sometimes it can shoot up to twenty-something. Just think about 20 parties in a day.

Shaun Norris

This is around 140 applications in production and 600 containers, so it's a sizable chunk of business logic.

Dapeng Liu

Actually, a special shout-out for the concept of buildpacks. For those of you who are not very familiar with Cloud Foundry, it has a very interesting characteristic where the containers are built inside the platform. As the developer, I don't really need to do anything but specify, here's my code; run that on the platform for me. I don't have to specify the base image. I don't have to specify the user ID. I don't need to specify the JVM flags and switches. I don't even need to specify which port to export. All of these things have been taken care of by the platform. It's such a huge time saver for all of the developers.

Shaun Norris

Speaking of time savers, I'm going to let you in on the behind-the-scenes of this speaking gig. Gene talked about how sometimes you want to step over the other side of the velvet rope. Well, our clock just came back on and we have 10 minutes left, so let's crack on. We want to talk quickly about things like incident management and legacy processes like change management as well, because this is a bank. It's a traditional organization that's been around for a long time, and as you can imagine, traditional ITIL processes are still the norm across much of technology. Let's talk about incident management. Explain how incident management has improved as you've been on this journey.

Dapeng Liu

I have another interesting story. We used to have this reporting module where, under very subtle circumstances, one of the queries would fire up, take way too long to run, and eat up way too much memory. As a result, this application crashed in the production environment. In the old legacy world, a crash and running out of memory is sometimes going to trigger the whole world, boiling the whole ocean.

Fortunately, we run multiple instances of this report module, and no one was actually raising an incident report to us. It was more like our platform operation team started alerting us and saying, hey, guess what, one of your microservices has started crashing. Then we go in and take a look: what happened?

Now we have the option to say that, in case of catastrophic failures, out of memory, stuff like this, the platform is there to help automatically revive the failed service so that we don't have to immediately put everybody into the war room and figure out what happened next. We have the capability to understand the problem first, and then take action, whether immediately or maybe later. Another interesting thing I observe is that these days we are talking about rolling back less and less often. We are talking more about rolling forward. Anyway, we are going to roll, so why roll backward? Let's roll forward. Understand the problem and fix it and then move on. From the traditional incident management point of view, it used to be a very reactive process. Now, we're not completely there yet, but we try to be a little bit more proactive and preemptive, to figure out what is happening as the application is running.
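Editorial illustration (not part of the talk): the self-healing behaviour described above can be pictured as a reconciliation step that compares the desired number of instances with the ones still running and starts replacements for the rest. This is a simplified sketch of the idea, not how Cloud Foundry implements it; `new_instance` and `reconcile` are hypothetical names.

```python
from itertools import count

_ids = count(1)


def new_instance():
    """Start a (simulated) application instance."""
    return {"id": next(_ids), "state": "running"}


def reconcile(instances, desired_count):
    """Replace crashed instances so the running count matches desired_count.

    Returns the new instance list and the crashed instances, which are kept
    so the team can investigate the failure after service is restored.
    """
    healthy = [i for i in instances if i["state"] == "running"]
    crashed = [i for i in instances if i["state"] != "running"]
    shortfall = max(0, desired_count - len(healthy))
    replacements = [new_instance() for _ in range(shortfall)]
    return healthy + replacements, crashed


instances = [new_instance() for _ in range(3)]
instances[1]["state"] = "crashed"  # for example, an out-of-memory failure
instances, to_investigate = reconcile(instances, desired_count=3)
```

The point of the sketch is the ordering: capacity is restored first, and the crashed instance is handed back for analysis, so understanding the problem no longer has to happen in a war room.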

Shaun Norris

Change management is another topic that I think a lot of people in the room are probably still grappling with to some degree. Talk through some of the tactical things you've done around improving change management, even though you haven't quite got to where you want to yet.

Dapeng Liu

After all, as I said, we are still at a bank. My team is not here to break all the glasses or upset a lot of risk-control people, so we still follow exactly the process that has been set up for the bank. One of the key pain points for the team is raising the form. To go to production, we have to raise several forms. Over many years, many different fields have been added to the form. Just because you have the right information does not necessarily mean you know how to fill out the form. People joke that it takes serious art to fill out the form in a way that people won't reject it. Most of the errors happen when people put the right information into the wrong field.

We said, what can we do to ease the pain? All of the information needed to fill out the form is already there in our code repository. So that's easy: write a generator or some little automation script to fill out the form automatically, so the right information goes into the right field and the right sequence of events happens exactly the way it needs to. That helped us a lot. The chances of a submitted change request being rejected are very, very low now. Preparing a single change request now takes about five to 10 minutes up front, and we don't really have any rejections these days.
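Editorial illustration (not part of the talk): a form generator of the kind described can be as small as a function that maps release metadata onto named form fields and refuses to emit a form with blanks. The talk does not describe the bank's actual form, so the field names and the `ReleaseInfo` structure below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReleaseInfo:
    """Facts assumed to be available from the code repository."""
    service: str
    version: str
    commit: str
    rollback_version: str


def build_change_request(info: ReleaseInfo) -> dict:
    """Put each piece of release information into its designated field."""
    form = {
        "application": info.service,
        "change_summary": f"Deploy {info.service} {info.version}",
        "source_revision": info.commit,
        "backout_plan": f"Redeploy {info.service} {info.rollback_version}",
    }
    empty = [name for name, value in form.items() if not value.strip()]
    if empty:
        raise ValueError(f"empty form fields: {empty}")
    return form


request = build_change_request(
    ReleaseInfo(
        service="credit-approval",
        version="1.4.2",
        commit="abc1234",
        rollback_version="1.4.1",
    )
)
```

Since the mapping from information to field is written once in code, the "right information in the wrong field" class of rejection is removed by construction.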

Shaun Norris

Along with the specific technical principles or process changes you've done in the ways of working, I found it really interesting when we talked about this before: the culture changes that have happened. Three themes came out around your team topology or structure, what's happened around sustainable pace, and how engaged employees are. We've heard a lot already at this conference about this, and I think this is going to be an interesting area for you to share as well, Dapeng.

Dapeng Liu

The team used to look like layers upon layers. We had the front-end developer, the back-end developer, and the so-called cross-cut developer, who was there to take care of the framework-building part. When I first joined, this product manager always came to me every day after standup and asked the single question: for this feature, is this supposed to be front-end or back-end or the cross-cut? That kind of confusion is unfortunate.

Then we said, why do we still give people titles in terms of what they do? We looked not only at the development team, but a little bit further: there was segregation between QA and the developers and also the BAs. We are still having the conversation: why do we still give people titles? Maybe we should only give one title to all of the people relevant for building software. We should call them software makers. In the making of software, there are different activities that you need to carry out. Just because you are doing something does not mean you have exclusive access to do only one activity. If you can do more than one, why not?

The other one is about automation again. There used to be this invisible wall: my job is done, I throw it over the wall, and then it's ops' problem. We have this Loggregator set up. All of the logs pile into a single log-mining and logging system. Anytime there's a 500 error happening in production, no matter what kind of 500 error, that will be automatically fed back to the development team. Operations will not need to know what exception is happening in production. Instead, they are looking at the operations part, like capacity planning.
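Editorial illustration (not part of the talk): feeding production 500 errors back to the owning development team amounts to filtering the aggregated log stream for 5xx responses and grouping them by owner. The sketch assumes JSON log lines with `app` and `status` keys and a hypothetical ownership table; it does not reflect Loggregator's actual log format or API.

```python
import json

# Hypothetical mapping from application name to owning development team.
OWNERS = {
    "credit-approval": "credit-risk-dev",
    "reporting": "reporting-dev",
}


def server_errors_by_team(log_lines):
    """Group every 5xx log entry under the team that owns the application."""
    alerts = {}
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured log entries
        if not isinstance(entry, dict):
            continue
        status = entry.get("status")
        if isinstance(status, int) and 500 <= status <= 599:
            team = OWNERS.get(entry.get("app"), "unassigned")
            alerts.setdefault(team, []).append(entry)
    return alerts


sample_logs = [
    '{"app": "reporting", "status": 500, "path": "/reports/monthly"}',
    '{"app": "reporting", "status": 200, "path": "/health"}',
    'plain text line from an older component',
    '{"app": "unknown-app", "status": 503, "path": "/"}',
]
alerts = server_errors_by_team(sample_logs)
```

Routing by owner is what lets operations stay out of application exceptions and concentrate on concerns such as capacity planning.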

Shaun Norris

How about sustainable pace? You've shared with me how, if developers are working almost every weekend, and working and deploying on weekends is normal, it's often not a very enjoyable job and engagement suffers. What changes have you seen in that area?

Dapeng Liu

When I first joined the bank, it used to be an accepted norm where people should work overtime, and it was okay for you to come back Saturday or Sunday. We no longer do that, in the sense that by throwing more time into the project, you can only scale the team out so much. The ceiling is actually very low. Instead, we changed the team-building philosophy slightly: can we look after the people and then transitively let the people look after the applications that are running in production?

Most companies put a lot of emphasis on the starting time of the day. We put a lot more emphasis on the ending of the day. 18:30 is the time to cut off your work, and you should go home. Every time we figure someone needs to stay behind after 18:30, that's a good opportunity for a conversation. Why do people need to stay back? Is there a lack of support, lack of expertise, or lack of planning? There has to be a reason. In terms of that, I think people are getting much happier.

Shaun Norris

I really like this policy that if you do have to work a weekend, you get twice as much time off during a weekday at your choosing, when it's convenient after. I think that's a real positive step towards engagement. Let's finish off here and talk about the improvements you've seen in measurable engagement.

Dapeng Liu

In my first year, the team engagement survey score came back at only 66. I told myself, actually not that bad; I can win an election with that much of an approval rate. However, it's not good enough at all. You look at our department, and the average is about 80. I said, never mind, we know our problem; let's sort our problem out. The next year it came out to be 89, which was pretty good. Two weeks ago, I had this year's score, which actually, guess what, is exactly 100%. Thank you.

I don't think 100% means we are done now. There's a lot more we can do. But it means you can see the progression from 66 to 89, all the way to 100.

There is another very interesting thing, talking about culture. I do see a lot about culture at this conference. I love it. There are a lot of organizations that say they are a learning organization, but where's the test for it? We run this Lunch and Learn series every Thursday at lunchtime. We gather everybody together, buy the food for them, and hopefully someone will put up a topic and discuss. The topic can be very broad. We have had topics about credit risk approval, the latest and greatest technology, and we even had a topic about black holes.

When we first started this series, it was really a pain to get the speakers there, because people generally prefer not to speak in front of an audience to share something. These days, we have to set up something more like job scheduling to make sure people don't fight too much over their speaking slots. That sharing mindset is already there. I personally believe when you have a learning organization and you have a learning culture, the sharing part is the ultimate test for whether you have passed that or not. Obviously, I don't say that we are already done, so there's still a lot more to be done. But yes, we think we are on the right trajectory now.

Shaun Norris

I think that's a perfect spot to wrap things up here. I just want to thank Dapeng, and let's give him a round of applause for coming and sharing his story. Thanks very much.