Real-World DevOps Experiences from a 165-year Old Bank

London 2018

Global Head - Cloud Infrastructure Services · Standard Chartered Bank

Shaun Norris heads up the Cloud Infrastructure team at Standard Chartered Bank in Singapore. With 20+ years of experience in the industry, Shaun has previously held roles at EDS in Ottawa, lastminute.com in London, VeriSign in Cape Town, and Amazon Web Services in Singapore. Shaun’s team are building a developer tools pipeline and corresponding cloud capability as the new standard for software delivery in the bank.

Full transcript

The complete talk, organized by section.

Shaun Norris

My name is Shaun Norris. I'm from Standard Chartered Bank. I'm based in Singapore.

And for the next 30 minutes, I want to run through what I would call the good, the bad, and the ugly of our DevOps discussions and journey at the bank.

A little bit about me first, starting from the bottom up here. I started my career at EDS in Canada, ended up working here in London at lastminute.com for about five years. Worked my way through various other companies, including a stint at Amazon Web Services, till I now find myself running cloud infrastructure at Standard Chartered Bank.

So why am I here talking about DevOps? Well, when I'm not working, I'm usually doing what you see in the right half of the slide here, which keeps me out of trouble on the weekends.

But why am I here talking about DevOps? Well, I've been passionate about this subject for a long time. Back in 2012, I was busy working through an MBA via distance learning while working full-time. I don't recommend that to anyone. However, having made that foolish choice, I then had to decide on a thesis topic.

And so I'd heard this buzzword kicked around of DevOps, and my initial thought was I was a bit skeptical. I wasn't sure there was anything to this concept. So I ended up writing my MBA thesis, and I will say it was written quite badly, and it was a fairly mediocre effort by the time it was done.

But nevertheless, I managed to get exposed to all these people like Gene Kim and John Willis, to reading things like "Keeping the Data on Time," and to all these new practices that were transforming development and operations. So really, since then, it's been a real journey and learning exploration to be more involved with the DevOps community and find out more.

A lot of people talk about imposter syndrome. I'm fairly certain I am actually an imposter standing up here on stage talking about DevOps, but I'm hoping that a little run-through of what we've found, both what's worked and what's not worked at the bank, is something you'll find interesting over the next few minutes.

So let's jump straight in and talk a little bit about Standard Chartered as an organization.

We're in the UK, so you probably best know us for being on the front of the Liverpool Football Club jersey. But what is Standard Chartered? Well, we're 165 years old. Queen Victoria signed our charter back in 1853, and we've printed banknotes in Hong Kong. If you've visited Hong Kong any time, you'll notice that it's a public-private partnership, but we print a good portion of the banknotes in Hong Kong and have done so since 1862.

Fourteen billion U.S. in operating income, more than nine million individual customers, and we run over 1,000 applications in production. Since we're at a technology conference, that's the part you're probably most interested in.

So what I want to cover today is, quickly, why it's easy to be pessimistic in banking, and certainly in banking technology. There's lots that, if you're of a pessimistic mindset, can get you down and get you discouraged about the size and magnitude of the task. I want to then talk about why we're still optimistic, and some of the things that have worked and some of the things that haven't, and where we're headed next.

One of the other reasons I'm up here is Damon Edwards at Rundeck. He and I got chatting at the last DevOps Enterprise Summit in San Francisco, and we were talking about some of the struggles we're having and things we're working through at Standard Chartered. And he said, "Well, it'd be interesting if you came to the next DevOps Summit and talked about it."

And I sort of dismissed it at first, but eventually they convinced me. So I think before we go any further, if it goes well for the rest of the presentation, then I'm planning to take the credit. But if it goes badly, then let's all just agree that it's Damon's fault.

To give you an idea of our footprint at the bank, all the blue countries are countries where we operate or have some presence. And so you can see that's more than 60 countries.

The reason why this is interesting from a challenge point of view is that every one of those countries has regulators. This is one of your typical buzzword slides. What I want you to take out of this is that in every country, we have at least one regulator.

Across these 60 countries, there's lots of agreement, but lots of disagreement among those regulators, among things like data sovereignty and use of public cloud and how often you must report things to them, et cetera. So we have, I think, a good relationship with our regulators, but it's a challenge keeping those 60 relationships all going at once and just keeping them straight.

So from a technology point of view, to give you a sense of the challenge: I'm only about four years into banking out of a 25-year career in technology, so a lot of this has been eye-opening to me quite recently, and I think some of you may find it eye-opening too.

For example, for any new application going live in our environment, at last count, there's more than 250 security controls that have to be mapped, evidenced, audited, logged, et cetera, before that new application can go live.

And so if you're in the infrastructure side of the house or the ops side, and this is part of the new ops track here at DevOps Enterprise Summit, what's the result of all that regulation I just mentioned?

Well, all that regulation and the accompanying compliance that goes along with it means that our processes are really optimized for compliance and not for speed. So many of the processes that we use to manage risk were designed for one or two waterfall releases a year. That was the way we did things not that long ago, really, in our bank.

Probably five years ago, if you came in, everything was waterfall. Today, we're probably two-thirds agile, and the rest are busy getting there. But all of our operations processes are still really designed and built to handle that one or two waterfall releases a year.

And so what we're finding is something that I've heard through a number of other talks at the conference: that as teams embrace agile and move faster and get more efficient at building high-quality software and working with their business better, operations increasingly becomes a bottleneck in the back, the constraint.

In our environment, servers still typically take weeks to provision. And while parts of the delivery chain are automated, a lot of it still remains manual.

To put this in more perspective, this here, and I'm not going to go through every box, is our standard change control process flow. Okay? So we map out and processize all this stuff.

I counted through quickly. I think there's more than 35 individual steps, and this is what we have to go through for every single change that wants to make its way into production, whether it's a new release or change to an existing application.

Most of this is manual. It's driven with meetings, it's driven with Remedy, and it's prone to delays. And really, you need to have built up a good relationship network of who to go talk to and call to figure out the next step, because sometimes it changes without notice, and sometimes it depends who you talk to and what kind of mood they're in.

And from talking to colleagues at other banks, we're not unique in this, which I think we can take a little bit of solace in.

Now, what happens when things go wrong?

So this was an incident since I joined the bank, and it wasn't really remarkable in any particular way, but I wanted to give you an idea of Conway's Law in action. I think at a DevOps Enterprise Summit, I think the seventh one, everybody probably knows what Conway's Law is. Well, this is a manifestation of Conway's Law.

This incident was quite a serious one. We had everyone from the business to app dev to app support, service management, database support, Unix support, network support, all the way down to country technology teams. And all these are different teams, not just individual people.

They all have their own management. They all have their own off-sites. They all have their own goals and objectives, and those overlap, but not entirely.

So when we look at this, the result is that very few people in our environment actually have the full end-to-end view of how things work from all the way from the data center concrete up to a satisfied customer using a mobile application.

None of these teams have APIs. The protocols we use are still email and meetings and Remedy. And you could get very discouraged if you spent a lot of time thinking about this.

The other thing is that, as we embark on doing things in a more modern way, we've already tried once. Several years ago, we stood up a very classic bimodal-type effort, and it didn't work. Bimodal didn't really work for us.

I've heard from several people, including Scott Prugh and Erica Morrison yesterday, that it sounds like it doesn't work for some others as well, and it certainly didn't work in our environment. Communication ended up being quite opaque. Delivery was fairly poor. We didn't realize much value from it, and the costs were really high.

Another quick anecdote I wanted to give about our environment: for those of you who may not work in banking, you can imagine that when anyone wants to access a production system, it's quite a detailed affair. We need to know who's accessing it, what change request or incident ID and ticket is linked to that, and then what things they did with that production access, say, root or elevated-privileges access, while they're in production.

All very sensible, common-sense things that you, as a bank customer, would expect if it's your money in an account that this person's going to go in and work on those systems.

However, in order to satisfy auditors, the question becomes: how do we evidence that the person who went in and made those changes under that change request actually did the things they said they did?

Well, our solution was to put in, essentially, a video recording system. So a digital CCTV, which recorded screen grabs of everything the person did. That recording then goes through another ticket system, which assigns it to their manager, who has to watch the entire video end to end of everything they're doing in production, and then attest that all the things they did matched up to the things they were supposed to do, and that they didn't do anything they weren't supposed to do while they were using that production ID access.

So I'm going to come back to that, because those are the sorts of encumbrances we have in financial services when it comes to doing DevOps and moving faster. There's lots of interesting reasons why.

So at this point, hopefully, you're all feeling a bit discouraged and thinking, "Wow, this is not a great news story."

Despite the obvious, there are a number of things that are quite encouraging that we're working on. So I'm going to talk about these and a couple other topics really quickly, and I'm going to try and run through them in a hurry.

I want to talk about the DevOps tools pipeline we've built. I'm going to talk about cloud. I'm going to talk about Rundeck a little bit. And I'm going to finish up talking about culture and some of the interesting and encouraging signs we're seeing around the bank's culture as a whole changing.

Where to start, though? Now, the reason I put this up there: we had a strategy session about a month ago where all of the technical services folks, of which I'm a part, got together and we talked about our roadmap for all the products we manage in technology services over the next 18 months.

And what we said is, as we go through and present these, if we hear something come up in conversation that we think is a fundamental tenet or a principle that we all agree with, let's write it down and let's capture it.

And this was the output at the end of that session. So this is really encouraging to me because it tells me that we're fundamentally, as a management team, thinking about the right things and we're putting the right kind of tenets and principles in place.

The other thing I want to outline here is I'm part of the core technology services group. I'm not part of an innovation lab. I'm not part of an incubation startup. We are the main group that runs all 1,000 applications for the bank.

Every cent of revenue that the bank generates and every dollar of asset under management is managed in production by our group. So we are the bank's technology. We're not a kind of splinter, bimodal, stealth group going off the side saying, "Hey, look at all the cool stuff we're doing."

One of the things that has really started to develop traction is that we're using all the standard tools. I don't think any of the tools on here are going to be a big surprise. But if you were still using our old version control system, then Bitbucket, which is not on here for some reason, is a big upgrade for you. And compared to having to use Remedy, the Atlassian suite is a big upgrade.

So this has started to be useful. We've called this our VX pipeline. We've tried to put some internal branding around it.

And how is it working? Well, we've got more than 200 teams in the bank on it. Some teams have seen around 90% reduction in turnaround time by being on this platform. We're doing 6,000 builds a week.

But to riff a bit on Damon Edwards' chat, the part that we haven't figured out is the last mile. Of these 200 teams, there's only one that can now go all the way from idea into production, automated through the pipeline all the way, and that only went live a week ago.

So I want to be honest here about where exactly we're at. The rest of these teams are some distance along the way, usually to dev test and maybe to test, but not to production. So that's really our effort and our focus this year.

We've also tried to get better at architecture. Now, if you can imagine in an environment like ours, when you come along with a new investment and you want to build a new project, eventually, after your developers do some design and say, "We're going to build an application that looks like so," well, you come to the infrastructure folks in our team and say, "We need you to host this thing for us."

And what tends to happen today is that it's all manual and bespoke. We go off and spend a couple of weeks and we generate a really pretty Word document for you, and it says you need two web servers and seven app servers and three database servers, whatever it is you need. We package it all up, write some nice architecture diagrams, and then we send it to you and we give you a big internal cross-charge for that effort.

One of the things we're rapidly trying to adopt is pattern-based architectures, for example if you're running a three-tier Java app. We're trying to map that using the Banking Industry Architecture Network, BIAN, which looks like it's going to be really interesting stuff because it maps banking capabilities from a business point of view.

And we want to map those to architectures so that when you come along and say, "I'm doing this banking activity in this jurisdiction," we can map it to an architecture that we've already done, and we can just instantiate an instance of a pre-approved architecture that meets all the bank standards and is following all our best practices. And we can do that in minutes rather than days and weeks.

And it's also going to be expressed in the form of a Terraform script and Ansible, not in the form of a Word document. So this is pretty exciting as well. It's still early days, though.
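As a rough illustration of the pattern-based idea described above, the lookup from a banking activity and jurisdiction to a pre-approved architecture could be sketched like this. Everything here is an assumption made for illustration: the catalog entries, the activity and jurisdiction codes, the module paths, and the function names are hypothetical and are not the bank's actual tooling or BIAN's actual vocabulary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArchitecturePattern:
    """A pre-approved hosting pattern (all fields are hypothetical)."""
    name: str
    web_servers: int
    app_servers: int
    db_servers: int
    terraform_module: str  # illustrative path, not a real repository


# Hypothetical catalog keyed by (banking activity, jurisdiction).
CATALOG = {
    ("retail-payments", "SG"): ArchitecturePattern(
        "three-tier-java", 2, 7, 3, "modules/three-tier-java"),
    ("retail-payments", "HK"): ArchitecturePattern(
        "three-tier-java-onshore", 2, 7, 3, "modules/three-tier-java-onshore"),
}


def find_pattern(activity: str, jurisdiction: str) -> Optional[ArchitecturePattern]:
    """Return the pre-approved pattern, or None if a bespoke design is needed."""
    return CATALOG.get((activity, jurisdiction))


def render_request(pattern: ArchitecturePattern, team: str) -> dict:
    """Build the variables a provisioning run might receive for this pattern."""
    return {
        "module": pattern.terraform_module,
        "team": team,
        "web_count": pattern.web_servers,
        "app_count": pattern.app_servers,
        "db_count": pattern.db_servers,
    }
```

The point of the sketch is only that the answer becomes a data lookup plus a parameterized module rather than a hand-written document; a missing catalog entry is the signal that a new pattern still has to be designed and approved.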

Earlier this week, we pushed eight dev environments for eight different teams live with this system for the first time. So we're waiting to gather feedback on that, but really I feel a bit embarrassed even getting up here and talking about it because it's really in-flight.

The other interesting thing I want to talk about, at least I hope you find it interesting, is around private cloud.

Now, you probably saw that I spent some time at Amazon and I was very well drilled at Amazon that there should be no such thing as private cloud, and everyone should use the public cloud at all times for everything because it's wonderful. And for lots of industries, that's true.

What we're finding in financial services is that our regulation is actually getting more difficult over time, not easier. So I think there was this view, certainly I held it, that if we just wait long enough and all the benefits of public cloud become clear, regulators will all get on side and say, "Yes, off you go. Go into public cloud."

It hasn't panned out that way. It's actually becoming a steeper uphill climb to put workloads into public cloud. There's more scrutiny, more oversight, and it's not impossible, but it is lots of overhead and paperwork and committee meetings.

So our previous approach could be summed up fairly simply as, how fast can we adopt public cloud? Right now, we're seeing a private cloud solution be a tactical stopgap until such point as we can go faster towards public.

Also, that 60-country map I showed you, that introduces huge complexities in terms of just keeping it all straight across 60 countries where we can do what. So by running in our own data center, but exposing our infrastructure with an API and giving a cloud-like experience to it, we get some of the benefits of cloud, and we can also go a lot faster.

We're working with StrataScale on this. Early days, but we're really encouraged with the results on that.

Containers as well. We've heard lots about containers. I loved John Willis' talk yesterday. I learned a lot as well. I was the guy that he asked and said, "What version of Docker are you running?" And I said, "I don't know." So just to put my hand up and admit my limitations.

When we look at containers, it's pretty clear that we're going to end up on some flavor of Kubernetes. We want to standardize and manage that centrally.

What we saw is there have been a few efforts in cottage industries across our technology organization to start using containers. But we now want to centralize, standardize, make sure it passes best practices and all of our compliance. And then we want to figure out how to connect the pipeline I mentioned so that it can deploy things to containers.

We expect that this is where the majority of our apps are going to be, let's call it in two to three years. So that's the timeline we're looking at: if a majority of our workloads are not running in containers in three years' time, I'll be really surprised. But it won't be in three months' time.

So let's talk a little bit about public cloud.

Up till now, most of our work has been with AWS. I've tried to put them here in more or less alphabetical order, so I'm not showing any bias to any of the three. Any of you who work for large enterprises with sourcing committees and lots of governance around that will realize why I do that.

What I would say is that we expect in the next one to two years that we'll have business relationships with all three. One of the things we're hearing from regulators is that they would like us to have options. They're worried about concentration risk, and so we want to get good at operating in the clouds, not just the cloud.

And so that may mean that we don't get to use some special features or secret sauce in one cloud, but we think the benefits of being portable probably outweigh that. And in cases where we do want to use secret sauce, there are aspects of each of these three that are unique and really cool. But we should take a mindful decision to say, "Hey, we're going to lock ourselves into that technology because it gives us so much benefit that it's worth being able to use something that's not available anywhere else."

The controls and regulatory landscape is really the hard part of public cloud when it comes to banking. It's not the technology. We've already got lots of folks. We've hired actually quite a strong team around public cloud. And it's the regulatory and compliance and controls that is the difficult work here, not the actual technology itself.

The other thing I want to talk about is that sometimes agile and DevOps (and I'm not saying they're the same thing, but they're kind of overlapping Venn diagrams of communities) get a get-out-of-jail-free card when it comes to really rigorous reporting of milestones and value delivery.

Certainly, some of us have had a questionable reputation when it comes to, are you actually getting stuff done? Are you delivering the things you said you would? And, "Oh, well, it's agile, so we don't have to report," or, "Oh, well, it's a DevOps thing, so we're working iteratively." That's not really an excuse.

And so this has the capacity to hamstring your DevOps efforts if it gets the reputation for just not having rigor and discipline. This should actually make it easier to report on where things are at. And the fact that you're regularly taking feedback and changing direction or pivoting regularly shouldn't remove the need to report those types of decisions and to be transparent and clear.

So one of the things we've done is we've brought in some people with good, strong agile background, but also very strong program management skills to be able to really report on our program as we go. And it's early days on that, but we're encouraged with the results.

A couple specific wins that we've had. This is in our core estate. There's no cloud in sight here.

Our monitoring estate, as you can imagine, the polite way to put it is heterogeneous. What we found is that when engineers working on a problem picked something up, then depending on which monitoring portal or platform or email queue they were looking at, it might not filter back to the people who could kick off an incident bridge and get all the machinery of incident management rolling to get it resolved as quickly as possible.

So one of the things we built was a virtual Andon Cord. Again, at a DevOps conference, I probably don't have to explain what Andon Cords are. If you're not familiar with the concept, happy to chat afterwards, or you can look it up.

But the idea came from Toyota manufacturing. Anyone on the line, if there was a defect, could pull the cord, and the whole line would stop while everyone swarmed the problem and helped them sort it out. It was completely opposite to the way manufacturing worked before that.

We found this works really well. This is just a simple web tool that guys knocked up as a prototype in about a week and a half, and now we're using it for all incidents across a couple lines of business. And it also keeps a really nice timeline of all the communications of who did what, when, et cetera, so that when we get together for post-incident debriefs, we've got all that stuff built already. You don't have to do a whole lot of work finding it.
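The talk doesn't describe how the tool is built, so the following is only a minimal sketch of the two behaviours mentioned: anyone can raise an incident, and every communication lands on a timeline that is ready for the post-incident debrief. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime
    who: str
    what: str


@dataclass
class Incident:
    summary: str
    raised_by: str
    timeline: List[TimelineEntry] = field(default_factory=list)

    def log(self, who: str, what: str, at: Optional[datetime] = None) -> None:
        """Record who did what, stamped with the current UTC time by default."""
        self.timeline.append(
            TimelineEntry(at or datetime.now(timezone.utc), who, what))

    def report(self) -> str:
        """Render the timeline in time order for a post-incident debrief."""
        ordered = sorted(self.timeline, key=lambda e: e.at)
        return "\n".join(
            f"{e.at.isoformat()} {e.who}: {e.what}" for e in ordered)


def pull_cord(raised_by: str, summary: str) -> Incident:
    """Anyone can open an incident; the first timeline entry is the pull itself."""
    incident = Incident(summary, raised_by)
    incident.log(raised_by, f"pulled the cord: {summary}")
    return incident
```

A real version would also notify the people who run the incident bridge; that part is left out because the talk gives no detail on it.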

One of the other things, in terms of specific takeaways that I hope you find useful, is we've been doing a bunch of work with Rundeck.

Now, I think how we started using Rundeck is maybe even more interesting than the tool itself. One middle-seniority engineer decided that he was unsatisfied with the status quo of how we did operational tasks. And so, as a result, he went through the considerable effort and, let's call it bureaucracy, to have this tool introduced to the bank.

So he didn't cheat, but it was unfunded. It was non-strategic. It didn't come out of anyone's roadmap or planning for the year before. He just went and, through convincing the right people, managed to get it in. And once it was in, in one really small part of our operations, the guys looking after our middleware and web support started scripting all the manual tasks that they did and exposing them as Rundeck jobs.

When I started, I heard they were doing this, and I had a similar problem because I haven't been running cloud the entire time at Standard Chartered. I actually started running all of the retail technologies, all the 250-odd applications that run the retail bank. I initially started, for about the first nine months I was there, supporting that.

So we quickly put it in, to the point that this year we ran 10x as many jobs in Rundeck as we did last year, so 13,000 jobs.

Remember that story I told you about managers having to watch videos? Well, it turns out that if Rundeck does the job, nobody has to watch any videos because it's done in a consistent way. The auditors can come in and see that the scope of control around the job is well-bounded, and they're quite comfortable that, provided there's a good, safe procedure on how you update and release Rundeck jobs, they're quite happy for those to be run and just an audit log recorded of the action, without this kind of double-checking where people have to watch everything go through.

So every Rundeck job is one video that a manager doesn't have to watch in terms of production access and change reviews. So you see, 13,000 jobs is not just kind of cool, it's actually a big savings for us as well.
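The control described above can be sketched generically: a job is reviewed once, its parameters are bounded, and each run leaves an audit record tied to a ticket. This does not use Rundeck's API or configuration format; the registry, the function names, and the record fields are all hypothetical and only illustrate why a bounded job is easier to audit than a free-form session.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical registry of reviewed jobs: name -> (allowed parameters, action).
JOBS: Dict[str, Tuple[frozenset, Callable[..., str]]] = {}
# Hypothetical append-only audit trail; a real one would be durable storage.
AUDIT_LOG: List[str] = []


def register(name: str, allowed_params: Set[str], action: Callable[..., str]) -> None:
    """Add a job after it has passed whatever review process applies."""
    JOBS[name] = (frozenset(allowed_params), action)


def run_job(name: str, user: str, ticket: str, **params: str) -> str:
    """Run a registered job and record who ran it, under which ticket."""
    if name not in JOBS:
        raise KeyError(f"unknown job: {name}")
    if not ticket:
        raise ValueError("a change request or incident ID is required")
    allowed, action = JOBS[name]
    unexpected = set(params) - allowed
    if unexpected:
        # The job's scope is fixed at review time, so extra inputs are refused.
        raise ValueError(f"parameters not allowed: {sorted(unexpected)}")
    result = action(**params)
    AUDIT_LOG.append(json.dumps({
        "at": datetime.now(timezone.utc).isoformat(),
        "job": name,
        "user": user,
        "ticket": ticket,
        "params": params,
    }))
    return result
```

For example, after `register("restart-service", {"service"}, lambda service: f"restarted {service}")`, a call to `run_job("restart-service", "alice", "CHG-1", service="web")` runs the action and appends one audit record, while a call with an unlisted parameter is rejected before anything runs.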

So we've had good luck with this. It's actually a key part of our next-gen pipeline I talked about as well. That's going to be the button that someone clicks to actually do a promote to production, mainly so that we capture the audit and tracking around that.

I want to finish up by talking a little bit about culture.

When I joined the bank, we had five values or principles that I honestly can't remember because they were a bit bland, and I didn't hear them mentioned on a daily basis very often.

Several months ago, the bank, after some effort, went through and we relaunched our valued behaviors. And the reason I'm excited about this is I think they're much easier and more accessible: do the right thing, better together, and never settle. And then some specific things around sub-behaviors under each of those.

This reminds me a lot of my time at Amazon. They had 14 leadership principles when I was there, and very rarely would a day go by that you didn't hear someone reference something like customer obsession, which was the first and probably most important one there. These were kind of the signposts as to how you settled arguments or broke ties or reminded people to get back on focus.

So I think this is early and we're only going to see the true results of this in several years' time. But the reason it's exciting is since these were launched, I'm seeing people hashtag these into email. When two teams collaborate together, I'm seeing people put hashtags at the end of the email saying, "Better together," when there was a win. Or when there's a difficult conversation, I'll hear people asking, "Hey, are we really doing the right thing here?"

And so it's starting to develop a flywheel effect, where I think these are going to be the signposts for course correction in our culture as we go forward.

So there's tons of work in progress, as I've mentioned. Our process universe, we literally have a thing called the process universe, and it's all the processes in the bank. It's, as I mentioned, engineered for the way things were, not necessarily the way things are going.

So we have a huge amount of work to do to just rewrite a lot of our processes to be more lean, shift things left, apply DevOps principles, but yet still protect the bank.

Our support model: we're taking a bet on SRE. We think that's going to be the easiest implementation of DevOps, or the best one for us. So we've got a lot of work to do on that, and we're just starting it.

And then the business case as well. I'm lucky enough to be in an organization which has kind of taken the leap of faith and said, "Hey, we're going to put 100-odd people on this effort and see how it goes." But we're in the business of money, and it's not going to be years and years with no returns and no value delivered that we get to play with cool new tools. So these are all things that are in progress.

Really quickly, things we've learned. Well, start with your tenets and principles. Figure out what those simple things are that you agree on, and then, when you make decisions, you can always refer back to them as a signpost.

Innovation comes from unexpected sources. If we had followed every rule to the book, we'd never have Rundeck in the environment. So don't be too quick to dismiss things that people bring in through the side door.

Obviously, I'm not talking about putting the environment at risk, and in this case, we did bring it through all the necessary checks and balances. But it took somebody supporting that effort. And so I'd encourage you as leaders, when you see someone doing something slightly out of the ordinary that you think, "Hey, wait, that's not the right way to do that," don't necessarily dismiss it out of hand. Sometimes good things can come from that.

And to steal Cornelia Davis' line, when it comes to cloud, how matters a lot more than where. How you get your applications from version control into cloud is actually a much more complicated and interesting story than just what data center they're running in at the end.

And if you wave a magic wand on your enterprise environment and just shut down your data centers and move it into a public cloud without doing all the work of having that pipeline to build it consistently and test it automatically and show all the clean-room-type stopgaps that Topo Pal has been talking about, well, you're going to end up with the same mess. And in a lot of cases, you're going to end up spending more.

So what help are we looking for? Feedback.

This is the first version of this talk, so if you think there's things I've missed, if you think there's things that are wildly egregious, feel free to come collar me at the end. I'd love to hear about things that have worked in your environment that maybe we haven't mentioned.

How to package DevOps and sell it as a philosophy and not just a toolchain. A lot of people hear DevOps and they immediately think deployment tools. So at this conference, I don't have to convince any of you, but I'd love to hear if you've had success marketing that.

And then we're looking for good talent. We're hiring all over Asia and here. I'll avoid actively poaching all of you in the room, but if you have ideas on how you've done that, I'd love to hear them.

And I'll just finish up with our latest marketing campaign, which I think is a good way to finish, which is that greatness has no finish line. We're never done in this. We heard this from Cornelia this morning: if you think you're done, you're probably not.

And so as a company and as a tech organization within a large bank, we're hopefully developing the mindset that we're never done.

Thanks very much.