Next-Generation Ops at Standard Chartered Bank


Las Vegas 2018


Global Head, Cloud Infrastructure Services · Standard Chartered Bank

Shaun heads up the Cloud Infrastructure team at Standard Chartered Bank in Singapore. With 20+ years of experience in the industry, Shaun has previously held roles at EDS in Ottawa, lastminute.com in London, VeriSign in Cape Town, and Amazon Web Services in Singapore. Shaun’s team are building a developer tools pipeline and corresponding cloud capability as the new standard for software delivery in the bank.


Full transcript


Shaun Norris

Good morning. I'm Shaun Norris. I run Cloud Infrastructure Services at Standard Chartered Bank in Singapore, and it's a real honor and pleasure to be here today and to share a bit of our journey of next-generation operations, and a bit of our DevOps journey in general.

So how did we get here? I met Damon Edwards at the DevOps Enterprise Summit last year in San Francisco, and when we got chatting about some of the stuff we were doing in operations, he suggested that it might be an interesting idea to come and share it at the summit in London. After some trepidation and really a fantastic experience in London, amazingly enough, Gene and the organizers asked me to come back and give you an update on where we're at. So it's a real privilege to be here.

It's a bit humbling because my DevOps journey started in 2012. I was working on an MBA while working full time. I don't recommend that to anyone. While doing that, I was searching around for a topic for a thesis paper, and I kept hearing about this DevOps concept. First off, I'm an operations guy by trade. I started in the mid-'90s with EDS, which you can see up here, and I ended up in London in 2000 with Lastminute.com and then with VeriSign doing some security work, but these have all been infrastructure jobs. My first job was as a Novell systems administrator.

So I started coming across agile teams in the early 2000s with Lastminute, and then really around 2011, 2012, I started hearing all this talk about DevOps and being skeptical, being an operations guy. Paranoia is a very healthy trait if you've spent as long in operations as I have. And I was skeptical. So my thesis was really a fairly poor attempt, if I'm honest, at trying to investigate: was there any reality to this DevOps thing? Were there actual real practices, or was it just a marketing buzzword to sell the next version of software?

So that got me introduced to people like John Allspaw, at least reading their books, and Gene Kim, Dr. Nicole Forsgren, et cetera. And so I went on this journey of getting immersed in this community and learning a whole lot from it. I ended up at the very first DevOps Summit in San Francisco. This is my fourth one now. It's definitely imposter syndrome to be here on this stage and sharing it with people who've really started and created the movement. I consider myself hopefully a fast follower of those who've built this movement, and it's awesome to be here.

Let's talk a little bit about Standard Chartered Bank. You've probably heard of us because we sponsor the Liverpool Football Club jersey. So our name is on the front of the football club jersey. But I think a more interesting story to introduce the bank is what we do with that. We've got this really amazing charity called Seeing is Believing that we've partnered with since 2003. In 2003, some folks got together and said, "How can we help prevent blindness?" Just to give you some idea of the numbers, there are over 200 million people globally who have some sort of vision impairment. More than 30 million of those are actually blind. And they estimate that 80% of that could be prevented or cured with the right medical interventions.

So Standard Chartered set out in 2003 to raise $100 million US, and it's really awesome to share with you guys to start out today that just this year, we've achieved the $100 million two years early. You can see the numbers up here. We've helped 4 million people with actual interventions. We've trained 300,000 health professionals. We've done hundreds of projects worldwide. This is one of the most gratifying parts of working at this organization, that we're more than just a bank. We're involved in the communities we're in, and we're trying to make a real difference in the lives of our customers and the communities they live in. So I'm really pleased with this. It's an exciting part of working for the bank.

So on the topic of the bank, we've been around for almost 165 years. Queen Victoria signed our royal charter that initiated the bank way back then. So we predate things like electric lights and electricity, and definitely things like DevOps and cloud. We had about $14 billion in revenue in US dollars last year. Our technology is headquartered in Singapore. We operate in over 60 countries, and we've got more than 9 million individual customers. We've got more than 1,000 applications in production, and we've got really all the way from mainframe through to microservices running in containers and everything in between.

If we look at our footprint, this is one of the things that makes Standard Chartered unique. If I go back really quickly, you can see our mission up there in the top, that it's really about our unique diversity, and this unique diversity comes from our really unique footprint. All the blue shaded areas you can see on this map are areas where we operate as a bank. So you can see we're really heavily involved in Africa, the Middle East, and Asia. And the interesting thing about this is that each of these countries has its own financial regulators. They often have their own views on things like data sovereignty and cloud governance and many other things.

So a bit of background about me. I don't come from a banking background. As you saw the NASCAR slide of some of the companies I've been lucky enough to work with, I've really only been in banking for about three years in the technology side. And one of the most surprising things was just how much paperwork there is. You hear about it, but when you actually experience it on the inside, it's immense. You think of those 63 or so countries that we're in with regulations. We also have obviously a lot of internal compliance and policy. And so to put this into context for you, the average application we put into production, we have to look after at least 150 different security controls that have to be mapped, attested, evidenced, audited, et cetera. And so this is one of the overwhelming challenges of doing technology in a financial services environment versus doing it in a regular enterprise.

If we now come into the infrastructure world, where I've operated for most of my career, the result of all this regulation and paperwork is that our processes over time have grown organically to be optimized for compliance and not speed. Many of these processes are around managing risk and controls, and they were really invented or designed for waterfall, one-to-two-times-a-year releases. And so in an era when development teams and businesses want to go faster, and they want to embrace agile and they want to release more often, the status quo for infrastructure provisioning looks increasingly antiquated.

Servers still take weeks to provision in our environment. And while parts of the delivery chain are automated, a lot of it remains manual. One story in particular I think will illustrate how we tend to add controls and bureaucracy to our work over time. We have this idea of a break-glass mechanism for production. If you want to do an operation in production, say there's a production incident going on and you need to SSH into a Linux server to maybe do some troubleshooting and remediation, you have to go through and unvault a privileged password from a system that stores the password. You unvault it, you log who you are and which incident ticket you're working on, you go in and do your work, you record it in the change request, and you revault the password.

As if that wasn't enough, in order to make sure that you only did the things you were supposed to do with that privileged password, we have an extra component where we like this idea of maker-checker. This is an idea that well predates technology even, or information technology, that in a banking environment, you'd really like anyone's work to be checked by someone else. And as an account holder, I'm sure you appreciate that level of diligence around maintaining the right account balances, et cetera. But in production, if you're doing a change, then you need to have your manager sign off.

So what that means in practicality is that after this change request is done and checked back in, your manager then has to watch a video of your entire screen session and then attest and sign off that you did the things you were supposed to do, and you didn't do any of the things you weren't supposed to do, effectively at least doubling the amount of resource effort that goes into production changes.
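As a rough illustration of the break-glass and maker-checker flow described above, here is a minimal Python sketch. It is not the bank's tooling: the vault is just a dictionary, the "recording" is a list of commands, and every name (`break_glass`, `BreakGlassSession`, `attest`, the account and ticket values) is hypothetical.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class BreakGlassSession:
    """One privileged session tied to an incident ticket (hypothetical model)."""
    engineer: str
    incident_ticket: str
    commands: list = field(default_factory=list)
    opened_at: Optional[datetime] = None
    closed_at: Optional[datetime] = None
    approved_by: Optional[str] = None

    def record(self, command: str) -> None:
        # Stand-in for the screen recording: keep a log of what was done.
        self.commands.append(command)


@contextmanager
def break_glass(vault: dict, account: str, engineer: str, incident_ticket: str):
    """Unvault a privileged password, yield it with a session log, then revault."""
    password = vault.pop(account)  # unvault: the entry is gone while checked out
    session = BreakGlassSession(engineer, incident_ticket,
                                opened_at=datetime.now(timezone.utc))
    try:
        yield password, session
    finally:
        vault[account] = password  # revault, even if the work raised an error
        session.closed_at = datetime.now(timezone.utc)


def attest(session: BreakGlassSession, manager: str) -> None:
    """Maker-checker: someone other than the engineer signs off afterwards."""
    if manager == session.engineer:
        raise ValueError("checker must differ from maker")
    if session.closed_at is None:
        raise ValueError("session is still open")
    session.approved_by = manager


# Example run with made-up values.
vault = {"root@app01": "example-password"}
with break_glass(vault, "root@app01", "alice", "INC-1234") as (password, session):
    session.record("systemctl restart app")  # the remediation itself
attest(session, manager="bob")
```

Even in this toy form, the cost is visible: every session needs a second person to review it after the fact, which is where the doubling of effort comes from.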

So this is a flowchart, nothing out of the ordinary, of our change and release process. It's roughly 37 steps. It's about a 10-day SLA for a normal production change, and this is largely manually driven. We heard yesterday from Dr. Forsgren and Jez Humble that change approval boards are not really correlated with IT performance. I wonder how they feel about pre-change approval boards, where you have a meeting to prepare for the change approval board meeting, because we've got those, too. And you might not be surprised that this process and this way of working is sometimes prone to delay.

In a typical incident, this is one that I was involved with about a year ago. We had this many different teams all on a conference bridge to try and resolve the problem. These aren't just different individuals. These are actually different teams with different leadership who all had to get on, and so each of them had their different lens of what was working and what wasn't working. And what you find is the further down the stack you go, by the time you get to network or data center people, they often don't know the application or its context of how it serves the bank or its customers.

The other challenge we've got is that we've already had a failed cloud transformation. We set up a proper bimodal kind of consulting-compliant separate team, and we spent a couple million dollars on it, and we didn't accomplish very much, and it was a real big failure. I won't spend a lot of time on that, but failure is expensive, and banks historically don't like failed projects. And so people are a little nervous about this DevOps and even agile thing.

People think things like DevOps just means, "Oh, I'm using Jenkins, so I'm doing DevOps," or, "Oh, our DevOps team takes care of all the DevOpsing." And sometimes people have been using agile as an excuse to just skirt around the bureaucracy, which, fair enough, their intention is good, but we've got commitments to regulators and compliance, et cetera.

So at this point, you're probably thinking, well, this is a pretty depressing story, and there's lots to be pessimistic about. And perhaps that's true, but really what I want to focus on this morning is talking about why we're optimistic that we're making progress, even if it's small incremental progress, towards what the DevOps movement is really all about.

We're going to talk about a few things up here. One of the things in general is that we've got recognition at all levels of our organization that we need to change the way we do technology, and we've got really enthusiastic participation, really across groups and outside organizational structures to do that.

One of the things we did is we got together in my group, which is called Technology Services. Cloud Infrastructure sits within Technology Services. That's an important point to bring up, that I'm not part of an innovation lab or a separate kind of incubator or a bimodal go-fast team. We're actually part of the core infrastructure team, the technology services that runs all thousand-plus applications in the bank. We happen to be doing some of the cloud and DevOps pipeline work, but we're right there with the rest of the infrastructure folks in the bank.

And we got together with all of our leadership earlier in the year, and we actually came up after a one-day workshop with this set of principles and tenets. You probably saw, I spent some time at AWS earlier in my career, and one of the things that I took away from that experience is how powerful this idea of starting with the right thinking is: that if you can come up with the right principles and agree on them, and then implement them and hold each other accountable to implement them well, you can do really big things. It also streamlines things and removes bureaucracy because now you don't have to argue over first principles all the time because you've agreed on them. In the Amazon six-pager process you've heard of, talking about principles is kind of a key part of that. And if you were to prepare a six-pager internally that didn't have the principles laid out, usually someone senior in the room would ask you, why not?

Let's jump in and talk about some of the stuff we're doing. One of the things I want to call out is that when you're talking about cloud in particular, which is the high-level title of my team, my big takeaway from the last year and a bit is really that how is bigger than where. I have to credit people like Cornelia Davis at Pivotal, who I first heard this from. But if you think of that long provisioning chain that we put up earlier, the bottleneck in that process is not a systems administrator struggling to find out how to right-click and launch a new VM in vCenter.

And so we've heard this a lot, and it's really resonated the last couple of days at the conference, that if you just pick up your data center with a forklift and you put it into anyone's public cloud kind of infrastructure as a service, you're going to be disappointed. And I saw this on the other side of the table working for a cloud provider: folks who didn't want to do the work. Andrew Clay Shafer's analogy of health and fitness as a metaphor for digital transformation really resonated, and I'm going to steal that from him. I'm sorry. I'm going to use it over and over again because I think it's awesome. It reminds me of the kind of maybe person looking a bit like me who's sitting on the sofa watching someone buying weightlifting equipment or something and going, "Yeah, I'd really love to be thin, but I don't want to put in the work." So that really resonates in terms of how I think a lot of enterprises, not just ours, kind of operate.

So one of the things we've done is we've built a DevOps pipeline. These are all pretty standard tools. But what we've tried to do is say, how do we modernize the way that teams deploy to production? If we look at the best teams in the world, whether it's Amazon or Google or Microsoft or Netflix, one of the commonalities I've found is that they all have a very rigid, defined process of how you're allowed to go to production. At Amazon, it was called Apollo. I caught up with some of the Microsoft team a couple of weeks ago. They told me that within Azure, they've got something very similar. Within the bank, we're a little more fluid. We've got several different routes. And so when you don't have that standardization of how to go to production, it's difficult.

The last mile has been a really big challenge, though. When I got involved with this team about six months ago, it was a situation where lots of teams were on the left side of the pipeline, the kind of stuff you see up here in collaborate and build and maybe even test, but they weren't getting all the way to production. So, of a 10-mile journey, only eight or nine miles were paved, and the rest of the time it was too bumpy to even drive. You kind of had to get out and walk.

One of our big focuses for 2018 has been: how do we get apps end to end, all the way from idea and from the business all the way to production, automated all the way through with all the compliance checks, and they have to end up on a cloud environment? Now, I would love for our definition of cloud to be the same as what we heard last night in terms of the five things you really need. Right at this point, we're starting incrementally. Cloud is like, can I drive my infrastructure with an API? So we said, if it's API-driven infrastructure in a public cloud or in our own data center or on a container platform, hopefully Kubernetes, then we'll count that as kind of compliant.

Now, when I talked to some of you in London in June, we had one app in the whole bank that was live end to end on that journey. As of today, we've got 13, so we've made good progress. We've been doing kind of more than one every two weeks since I last talked to you. So we feel good about that. But when you look at overall numbers, it's still a lot of work to do. We've got roughly a third of the bank's applications on our new pipeline. We've done almost a million builds on the platform this year. About 4% of those can deploy to some environment, and 1% of those can deploy to cloud.

Now, you might say, "1% cloud? Wow, you guys haven't even gotten started." What I would say is that's 1% by number of applications we've got. We're using cloud to do things that we really can't do on our own data centers. Just as an idea, to give you an example of this without sharing hard numbers, we spin up more cores every day in public cloud to do some risk simulations than we have in all of our private data centers combined by a factor of two or three. And then we shut them down again six hours later. So we've got a couple of applications in that 1% that are using huge amounts of compute that we couldn't really afford to spin up and down like that anywhere else.

One of the other things that we get challenged a lot is, what's your multi-cloud strategy? How do we avoid lock-in? We hear this from regulators, we hear this from our executive, from our business. And so we've been thinking a lot about what multi-cloud means. I've done just a simple quadrant map of how we're thinking about cloud. In the top half is public, kind of somebody else's data center. In the bottom half is our own data center. On the right half is maybe more modern cloud abstractions, and on the left is more infrastructure as a service. We think in the fullness of time, we're going to be doing business with all three of the large IaaS providers. The regulation and complexity of operating in those 60 countries, though, really means that we have to have a strategy for running in our own data centers for some time. I think if you look 10 years in the future, it's quite possible we could be entirely in a public situation, but probably not in five.

What we've learned a lot along the way is that this is hard. We've got some OpenShift running in our own internal environment, if you look over on the right side. We think Kubernetes is the way as well. I went to John Willis's talk at this conference, and it was awesome as usual. And I agree that Kubernetes and containers are the future. We've also been seeing that the friction for our teams going live, like of those 13 apps that I talked about, more than 10 of them are in containers on OpenShift. So those were way easier to do, and the app teams were able to move a lot faster than they could otherwise. So this is kind of how we're thinking about cloud.

Really, I think the top-left quadrant is already obsolete. If you're having to manage VMs and log in and patch them, it kind of doesn't matter whose data center they're in. Maybe if you're using composable infrastructure and it's all Terraformed up and you don't really have to touch things, and there's lots of tools out there to slipstream and make managing VMs easier, but we really want to get to the top right or at least the bottom right.

Let's talk about some of the less sexy stuff or the more legacy parts of our environment, though. When I joined the bank, it wasn't actually to run the cloud team. It was to run a production operations team, running all the applications in the retail bank. So about 250 applications were directly under my purview in that. Whenever we had stability issues, we went back and did a bit of a review, covering a few months of stability incidents. And what we found from that is that there were some common causes. One of the common causes of instability was the fact that we were doing a lot of things manually: things like code deployments, things like DR failovers, things like service restarts.

And so, really interesting story of innovation. One of the things banks grapple with is how do we be more innovative? How do we avoid being disrupted? How do we keep up with the cool fintech kids? How do we be more innovative? And so you see lots of things going on, whether it's innovation labs or incubator labs or those sorts of things, and those are all great, but you need to be innovative in your core technology team as well. That's why bimodal doesn't work for me. You need to innovate and improve in your core operations as well, because that's where you'd probably need it most.

And so a junior engineer about two years ago in the bank said, "I've heard of this thing called Rundeck, and I think we should start using it." So it was an unfunded project, and he really had no budget to do it. He didn't really have the organizational position to get it done, but he just kept at it, and he kind of got it through the bureaucracy to get it installed as a POC, and he got it up and running. That was in the web support team. This was like web middleware, WebSphere mostly support. And they started using it for things like service restarts and some failover type activities.

Well, when I came in and started running the retail team, I said, "Well, we need some of this." And so this has moved on quite a bit. With people like Damon's help, we've now gotten to the point where we've run hundreds of thousands of Rundeck jobs. Really interesting from this, because we're using it for specific incident remediation. Say there's a service restart. Rather than having to go in and unvault a password and do that whole video-watching story that I told you about earlier, now because it's constrained and in version control and can be audited, we can pass that requirement because it's a kind of a chain of custody thing. You know exactly what the script's going to do every time, and you can tell its provenance, et cetera. And so we've found that it reduces incident time. So TTR reduces by on average 25 minutes for apps where we're using Rundeck versus where we don't. So that's really exciting.
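The chain-of-custody idea, that a constrained, version-controlled, auditable job can replace an ad hoc privileged login, can be sketched in a few lines of Python. This is not Rundeck's API or job format; the job catalog, the `run_job` function, and the audit record fields are all assumptions made for illustration.

```python
import hashlib
import subprocess
import sys
from datetime import datetime, timezone

# Hypothetical job catalog; in practice this would live in version control.
JOBS = {
    "print-health": [sys.executable, "-c", "print('service healthy')"],
}

AUDIT_LOG = []


def job_fingerprint(name: str) -> str:
    """Hash the job definition so a run can be tied to a reviewed version."""
    definition = "\x00".join(JOBS[name])
    return hashlib.sha256(definition.encode("utf-8")).hexdigest()


def run_job(name: str, operator: str, incident_ticket: str) -> int:
    """Run a predefined job and append an audit record; unknown jobs are refused."""
    if name not in JOBS:
        raise KeyError(f"unknown job: {name}")
    result = subprocess.run(JOBS[name], capture_output=True, text=True)
    AUDIT_LOG.append({
        "job": name,
        "fingerprint": job_fingerprint(name),
        "operator": operator,
        "incident_ticket": incident_ticket,
        "exit_code": result.returncode,
        "output": result.stdout.strip(),
        "ran_at": datetime.now(timezone.utc).isoformat(),
    })
    return result.returncode
```

The operator can only pick a job by name, never type a command, and every run leaves a record of who ran which version of the job and for which ticket. That is the property that lets a reviewer check the job definition once instead of watching every session.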

We think next year, Rundeck is probably going to save us about 28 people years worth of work, at fairly conservative estimates, and we haven't even rolled it out particularly widely yet. So this has been a huge win for us. I'd be happy to talk to anybody if you want to find out more about that.

The other thing we've been experimenting with is shifting to an SRE team model. Now, at banks, it's a lot easier to just call things new names rather than actually do the new name. I think in this case, we're actually taking good steps towards doing the new name. A colleague of mine named Venkat Raghavan has been experimenting with an alternative to the typical L3 support model we used. What happened there is that production support people like Venkat and me in my previous job, we would pay a budget to the development teams to actually fund L3 work. So this is like bug fixes, operational fixes. Inevitably, what would happen is those would go down to the bottom of the pile priority-wise, and then they wouldn't get done.

So Venkat said, "Well, why don't I take some of this L3 budget back? You lend me some developers, I'll pull some people with coding skills up from operations." So we kind of had developers and operations working together, and they made some really good results within a couple of months. This is really early. It's still experimental. We're still kind of feeling our way through this and seeing what works and what doesn't, but this is really exciting. They've reduced their backlog by 80-some percent in the first few months they've been doing this, and they've reduced the amount of incidents. They've improved stability in the platform. It's a really exciting story.

If I look at the bank and the organization as a whole, you always hear in these talks, and this is really a converted audience already, but culture is so important. One of the exciting things that happened earlier this year is we streamlined and simplified kind of our codified culture. Before, we had six or seven different statements of culture or valued behaviors that were good, but they weren't as good as these, and these are really simple: do the right thing, better together, never settle, and then with some individual specific sub-behaviors under those. This has been rolled out bank-wide, not just in technology. And it seems to have started to become sort of a movement that I see people regularly hashtag things like #BetterTogether when they're trying to collaborate with a team in a different department. And if someone's kind of pushing back that we should do something better, they'll hashtag it in an email or on our internal social media with #NeverSettle. So this is really optimistic. I think it might take a long time before the true outcome of it is really realized, but that is exciting.

We've got a lot of work still in progress. We're still trying to figure out things like our support model, like how we really do SRE and make it scale across the whole team; like how do we retool our processes from this kind of manual, handmade paper process to something that is API-driven and eliminates handoffs, kind of implement lean across our process world. We're struggling with that. It's a work in progress, and we've got a lot of work left to do.

And so if I kind of finish up with things we've learned along the way: if I go back to the story of this junior engineer who kind of launched our Rundeck idea, often innovation comes from unusual ideas. The other story I sometimes tell is about Gmail, that at Google, that apparently came from somebody's 20% time. This was just an engineer who thought, "Hey, it'd be cool to have an email system." And in his 20% time, he came up with a prototype, and then it kind of snowballed and went from there.

Sometimes we think the more senior leader you are, well, all the innovation's going to come from me and my team, and I'm going to have an offsite, and we're going to incubate innovation ideas, and we're going to go that way. It doesn't always work that way. Be open to innovation coming from unexpected sources, but eventually, if you're a senior leader in an organization, it's going to need your kind of backing to get it further and get it to the next level.

Don't let perfect be the enemy of the good in this. Start small. We're not fully cloud as in all the five factors of scalability and burstability, et cetera, like we heard about yesterday, but we're starting somewhere. Let's get our infrastructure API-driven. Let's remove manual processes. Let's take our core world and start making it better that way. So that's really encouraging.

What I'll finish up with is help I'm looking for, feedback on this. If you have complaints, disagreements, questions, et cetera, more than happy to chat with you. Kubernetes: it may be and probably is the future, but it seems to change so fast that by the time we get the semblance of a strategy written down, it feels old and antiquated.

We're also grappling with how do we extend what we've done really around application deployments now and apply it to data as well. How do we take things like evolutionary database design and scale it across a large number of teams to make kind of a DevOps model for how data and schema changes, and putting data in secure, safe ways into public clouds, et cetera, gets done?

And compliance as code is the other one. A lot of our compliance is manual. We need to get to continuous compliance. I liked the ideas this week I heard of kind of minimum viable compliance. We're still looking for inspiration and examples and ideas of that.
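One way to picture compliance as code: each control becomes a small automated check against a deployment's configuration, and each run produces an evidence record instead of a manual attestation. The Python sketch below is only an assumption about what that could look like; the control IDs, config keys, and registry name are made up and do not represent the bank's controls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Control:
    control_id: str
    description: str
    check: Callable[[dict], bool]


# Illustrative controls only; IDs and config keys are hypothetical.
CONTROLS = [
    Control("SEC-001", "Data at rest is encrypted",
            lambda cfg: cfg.get("encryption_at_rest") is True),
    Control("SEC-002", "No privileged containers",
            lambda cfg: not cfg.get("privileged", False)),
    Control("SEC-003", "Image comes from the approved registry",
            lambda cfg: cfg.get("image", "").startswith("registry.example.internal/")),
]


def evaluate(cfg: dict) -> list:
    """Run every control and return one evidence record per control."""
    return [
        {"control": c.control_id, "description": c.description, "passed": c.check(cfg)}
        for c in CONTROLS
    ]


def is_compliant(cfg: dict) -> bool:
    """A deployment passes only if every control passes."""
    return all(record["passed"] for record in evaluate(cfg))
```

Run on every pipeline execution, checks like these would give continuous evidence rather than a point-in-time audit. Scaling this to 150 or more controls per application is the part that remains an open question.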

With that, I'm going to wrap up. I hope this has been of some use. It's been a really interesting journey. I want to thank the organizers again for the opportunity to be here. I'll leave you with my details and my Twitter handle up in the top left. This is from a recent advertising campaign we did: our journey is really just getting started. It has no finish line, and it is going to continue. So I really like this as a metaphor for continuous improvement and continuous learning. Thank you very much.