Enabling Delivery at Scale at Australia’s Fastest Growing Digital Bank with an API Driven Platform

Las Vegas 2023

Area Lead, Cloud and Orchestration Platforms · ANZ

Our business thrives on the speed of our engineering. Look at how ANZ has built an orchestration framework to enable an API-driven platform. We automate GCP, GitHub, Kubernetes, Istio and much more, such that a new starter can deploy a workload on Cloud Run within our VPC in 10 minutes... at a bank!

Full transcript

Daniel McKeown

All righty. Thanks for joining my talk today, "Enabling Delivery at Scale at Australia's Fastest Growing Digital Bank with an API-Driven Platform."

My name's Dan McKeown, and I'm the area lead for Cloud and Orchestration Platforms.

I just noticed earlier today, this is actually the fourth talk already by a bank about platform engineering. So thanks for showing up. I appreciate your time. At first I was kind of like, "Geez, maybe people are a little bit over it by now." But no, it's wonderful to see just how much interest there is in the community around fellow engineering teams and leaders solving very similar problems, which is really great to see.

Maybe before I get started, can I get a show of hands: who's working in a platform engineering team? Yeah. Okay.

Who works in an application team that's deploying their workloads to an internal developer platform? Yeah. Okay, cool. So lots of relevant stuff.

Today I'm going to take you through our platform journey, how we went from kind of just provisioning DevOps tooling and cloud infrastructure to finally being able to build a platform. I'll take you through our internal developer platform, the X Framework, no relationship to Twitter at all. We'll do a deep dive on how GitOps has been a big enabler for the build of our API-driven platform, and we'll finish up with progress and plans, the help I'm looking for, and time permitting, questions.

Okay, so before we get into the tech, I'll talk a little bit about the business context to help you get a sense of why.

I work for a large financial institution in Australia that's building a new digital proposition, a new digital bank called ANZ Plus. We launched in March last year with a really neat digital join experience, where we integrate with biometric providers and identity providers. We'll load a virtual card or a digital card on your phone within a couple of minutes.

We've got a pretty feature-rich Transaction Save account, which we're super proud of. You'll see the numbers on the screen there. They were released about seven months ago. We have come a long way since then, so we've got a few more customers than that, a little bit more in funds under management, but those numbers are something we're really proud of.

But things are going to get far more interesting next year when we migrate the existing bank's customers into the new digital proposition, ANZ Plus. ANZ Bank has about 6 million customers, about 300 billion Australian dollars in home lending, and the entity as a whole has about a trillion in funds under management.

So why is ANZ Plus investing in an API-driven platform?

The visualization you see on your screen is a snapshot from our roadmap. As you can see, we're working hard to not only meet the current feature set of the retail bank, but also exceed it. As I mentioned, Transaction Save has already launched. We've got home loans in pilot testing, but we're yet to release credit cards, business banking, brokerage, insurance, frequent flyers, Cashrewards, all those many, many things that the current retail bank offers.

We're forming new feature teams almost weekly to deliver the roadmap, and we're not stopping anytime soon.

We're quite lucky at ANZ to have leadership that understands that the business thrives on the speed of our engineering, which is really important. Because of that, we've been given the endorsement and the resourcing we needed to build a platform to support a build of such a large scale. We're talking about thousands of engineers, multiple years, and hundreds of millions of dollars.

All right, let's talk about our platform journey. I think this will be familiar to many of you that are also working in platform engineering teams.

Three and a half years ago, we had nothing. We had to choose our cloud provider, we had to choose our DevOps toolset, and then we sort of started off with this noble vision of wanting to abstract away the complexities of the cloud. But the reality was we had to certify, provision, and get regulatory approval for all of our cloud and DevOps toolset.

March last year, we launched, and the platform's been rock solid, and we're really proud of that. But the developer experience we offered was poor. Repo sprawl and inconsistencies meant that not only was it really difficult to consume what we offered (it's a stretch to call v1 of our offering a platform), but it was also very difficult for us to run.

So in September last year, we partnered with Google to help design our next iteration. Now we're well progressed with the build, most workloads have now been onboarded to the new platform, and we're adopting it and extending it as we learn.

All right, so let's talk a little bit about our internal developer platform.

We've built an API-driven platform known internally as the X Framework. What you'll see on the screen are the four principles that underpin our platform.

First of all, it's API-driven. So all key platform user journeys have been automated using CRD interfaces and a Go templating code base that executes in both pipelines and operators. Fleet-wide changes can now be made from a single location in code.
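As an illustration only (this is not ANZ's code), the "single location in code" idea can be sketched with Go's standard `text/template` package: one template defines the shape of a primitive resource, and every tenant's manifest is rendered from it. The `Workspace` type, its fields, the `cost-centre` label, and the `Render` function are hypothetical names invented for this sketch.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Workspace is a hypothetical, simplified stand-in for a tenant's intent:
// the few fields a team would declare in a custom resource.
type Workspace struct {
	Name       string
	CostCentre string
}

// namespaceTemplate is the single place where the shape of every tenant
// namespace is defined. Adding a label here changes it for the whole fleet
// the next time manifests are rendered.
const namespaceTemplate = `apiVersion: v1
kind: Namespace
metadata:
  name: {{ .Name }}
  labels:
    workspace: {{ .Name }}
    cost-centre: {{ .CostCentre }}
`

// Render expands one workspace's intent into a primitive Namespace manifest.
func Render(ws Workspace) (string, error) {
	tmpl, err := template.New("namespace").Parse(namespaceTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, ws); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Illustrative tenants; the same Render call could run in a pipeline
	// or inside an operator.
	fleet := []Workspace{
		{Name: "payments", CostCentre: "cc-100"},
		{Name: "home-loans", CostCentre: "cc-200"},
	}
	for _, ws := range fleet {
		manifest, err := Render(ws)
		if err != nil {
			panic(err)
		}
		fmt.Printf("---\n%s", manifest)
	}
}
```

Because the rendering function is plain Go, the same code path can in principle be called from a CI pipeline and from a server-side operator, which is the property the talk describes.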

Our framework provides self-service tenancies. It's based around our workspace construct, which is roughly aligned to a value stream. Workspaces allow teams to manage their projects and their namespaces, and application repositories are created under workspaces. We use templates quite heavily, and DevOps tooling is provided out of the box.

We've gone all in on GitOps for KRM. This has been a big enabler. I don't know if those of you in the community also struggled with having independent tenant application teams deploying Kubernetes from separate repositories. I'll talk a little bit more detail later, but it just meant that, as operators, it was so challenging for us to do things like roll out fleet-wide changes, like apply a label to this resource across the entire estate, or there's a new security control that's come in that you will need to adhere to. How do you go about rolling that out from a fleet-wide perspective?

I'll also touch on the fact that we've started using Kubernetes Config Connector, so KRM, to create and manage infrastructure resources. We've moved away from Terraform for a lot of what we're doing.

Have any of you folks out here, and I'd love to see a show of hands, have you struggled with change management, atomic changes, et cetera, having a separate Terraform repository from your application code? Has that made atomic changes difficult, rollbacks difficult? Yeah.

Once again, I'll touch on it a little bit more later, but leveraging KRM, and so having both Kubernetes and GCP infrastructure resources in the same repository as your application code, solves a ton of those problems. That really is just me rephrasing that infrastructure is now managed alongside application. It's an absolute game changer.

All right, some benefits of the X Framework.

Domain modeling our ecosystem. For us, it's CRDs. So custom resources, YAML files stored in Git, they're published to cluster. That not only allows us to automatically provision, which is essentially the main driver and super important, but we also get benefits of asset transparency because we store those same CRDs in a queryable data store. It just makes it so much easier to do things like report on usage.

Have you guys tried to scrape data out of just parsing YAML or logs or repos? We've got, I guess, a first-class representation of all of our CRDs in a queryable data store.

We also use this as a service that supports value propagation. So if you create a high-level resource, let's say it's a project ID, you need to pass it on to child resources. We use this queryable data store in our pipelines to facilitate that as well.
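To make value propagation concrete, here is a minimal sketch under stated assumptions: an in-memory map stands in for the queryable data store, and a child resource inherits a value such as a project ID by walking up its parents. The `Resource` and `Store` types, the `Lookup` method, and the `projectID` key are all hypothetical; the talk does not describe the real store or its schema.

```go
package main

import (
	"errors"
	"fmt"
)

// Resource is a hypothetical record for one custom resource as it might be
// held in a queryable store.
type Resource struct {
	Kind   string
	Name   string
	Parent string            // name of the parent resource, "" for a root
	Values map[string]string // values produced when the resource was provisioned
}

// Store is an in-memory stand-in for the queryable data store, keyed by name.
type Store map[string]Resource

// ErrNotFound is returned when neither the resource nor any ancestor has the key.
var ErrNotFound = errors.New("value not found in resource or its ancestors")

// Lookup walks from the named resource up through its parents and returns the
// first value found for key. The depth bound guards against parent cycles.
func (s Store) Lookup(name, key string) (string, error) {
	for depth := 0; name != "" && depth <= len(s); depth++ {
		r, ok := s[name]
		if !ok {
			return "", fmt.Errorf("resource %q: %w", name, ErrNotFound)
		}
		if v, ok := r.Values[key]; ok {
			return v, nil
		}
		name = r.Parent
	}
	return "", ErrNotFound
}

func main() {
	// Illustrative data only.
	store := Store{
		"payments": {Kind: "Workspace", Name: "payments"},
		"payments-dev": {
			Kind:   "Project",
			Name:   "payments-dev",
			Parent: "payments",
			Values: map[string]string{"projectID": "payments-dev-1234"},
		},
		"ledger-api": {Kind: "Application", Name: "ledger-api", Parent: "payments-dev"},
	}
	id, err := store.Lookup("ledger-api", "projectID")
	if err != nil {
		panic(err)
	}
	fmt.Println("ledger-api deploys into project", id)
}
```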

Reducing developer cognitive load. I'm not going to say golden paths one more time because I've heard it probably four or five times already. But the real key thing that we've done here is that we've taken away the need for the application engineers to work with large surface areas of primitive configuration.

Those raw resources, it's repetitive, it's verbose, it's that massive surface of configuration. If you're a Golang API engineer, you shouldn't have to worry about that. We've made it far easier for them to configure not only their cloud infrastructure, but also the DevOps tooling by interfacing with 10 to 15 lines of YAML.

Operator toil. I've said it before, but fleet-wide changes, as an administrator of a platform, you need to be able to make fleet-wide changes. For us, it was all about consolidating on GitOps and consolidating on that Golang templating code base.

Finally, community governance. Now this is probably the thing I want maybe you guys to take away most of all out of this whole thing.

Building a platform is not something that you can impose on an application team. You can't build it in a platform engineering team and then go, "There you go, application engineers. You have to use it." It just doesn't work.

So we realized very early on that we needed to set up an appropriate governance model. We borrowed heavily from Tekton CD. We set up a steering committee with diverse representation.

For those of you out there working in platform engineering teams, or application teams that may not be fully on board with your platform, set up that steering committee. Get representation from the application teams. Often those engineers that are the most vocal, that might be causing you the most grief, are actually going to become your strongest allies once they're in that position on the steering committee, and it becomes our platform, and you get that program-wide buy-in.

All right, just talk a little bit about the interfaces that we provide.

So CRD YAML files at the moment, they're created either via a CLI or via Backstage software templates. Sure, if you're a YAML weaver, you can handcraft them yourself. You're welcome to do that as well.

But I just wanted to call out that we're also looking at building a custom, richer UI next year, and we're also exploring creating a CDK as well. In Australia, Amazon is pretty big. A lot of application engineers are used to using CDKs to declare their infrastructure. So we're looking at providing a mechanism by which we can generate the CRs off the back of a CDK as well.

Embracing GitOps for KRM.

All right, so prior to going all in on GitOps, we faced multiple challenges caused by our teams doing CI/CD to deploy resources directly from independent repositories.

Large, static, long-lived clusters. Can I get a show of hands: who has large, static, long-lived clusters? Yeah.

Who's able to easily recreate clusters on the fly, tear them up and down pretty regularly? Yeah. Okay.

Because we had application teams all deploying independently, if we spun up a new cluster as a platform team, it's empty, right? How do I go about getting all of those resources deployed such that the workloads are hydrated into the new cluster? I've got to run like a hundred separate pipelines. It was a really, really big challenge for us as well.

We're relying on Velero backups for DR. So this one, again, I'm really interested in a show of hands. Who uses Velero backups for DR? Yeah.

Who completely manages their stateless backup through GitOps? Yeah. Cool.

Listen, Velero's great and it saved our bacon. Fortunately, it was in non-production, but we've absolutely done a full restore using a Velero backup. But it only gets you about 90% of the way there. All of those dynamic cluster-specific values that, if you brick a cluster, need to be recreated, you need to manually patch those in. So that adds an hour or two to your outage time.

Configuration drift. What we found was that any manual intervention was really quite rarely backported into Git. Because `kubectl apply` is a patch, it's not going to delete explicitly for you. We found that those value stream, those application team pipelines didn't clean up after themselves. That resulted in unmanaged dangling resources. So our representation in Git just didn't represent what was in our environments.
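The dangling-resource problem comes down to a set difference: whatever is live in the cluster but no longer declared in Git. The sketch below is illustrative only and assumes resources have already been listed and reduced to `Kind/namespace/name` keys; the `Dangling` function and that key format are hypothetical. GitOps tools typically close this gap by pruning such resources during sync.

```go
package main

import (
	"fmt"
	"sort"
)

// Dangling returns the resource keys that exist in the cluster but are no
// longer declared in Git. An apply-only pipeline never removes these.
func Dangling(declared, live []string) []string {
	inGit := make(map[string]struct{}, len(declared))
	for _, k := range declared {
		inGit[k] = struct{}{}
	}
	var orphans []string
	for _, k := range live {
		if _, ok := inGit[k]; !ok {
			orphans = append(orphans, k)
		}
	}
	sort.Strings(orphans) // stable order for reporting
	return orphans
}

func main() {
	// Illustrative data: what Git declares versus what the cluster holds.
	declared := []string{
		"Deployment/payments/ledger-api",
		"Service/payments/ledger-api",
	}
	live := []string{
		"Deployment/payments/ledger-api",
		"Service/payments/ledger-api",
		"ConfigMap/payments/ledger-api-old", // left behind by an earlier release
	}
	for _, k := range Dangling(declared, live) {
		fmt.Println("dangling:", k)
	}
}
```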

Inconsistent environments. Now this is really around versioning. It's a real challenge when you work in a large program and you have multiple application teams all deploying independently (and we love autonomy; they need to be autonomous). But what does a version mean for your system?

You've got your platform. Let's say right at the lowest level, what's your GKE version? What's your mesh version? Then you've got your cluster bootstrap, and on top of that you've got all your workloads as well.

For us, we had inconsistency across our environments. I couldn't easily say, "Put exactly what's in SIT into non-production, platform workloads and all." So by conforming on a central GitOps repository, we now have strongly versioned resources and workload configuration, and we're now able to reproduce environments.
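One way to picture "strongly versioned" environments is as a pinned version per layer and workload, which makes "is non-production the same as SIT?" a mechanical comparison. This is a hypothetical sketch: the `Versions` type, the `Diff` function, the layer names, and the version strings are invented for illustration and are not taken from the talk.

```go
package main

import (
	"fmt"
	"sort"
)

// Versions maps a platform layer or workload name to the version pinned for it.
type Versions map[string]string

// Diff lists every component whose pinned version differs between two
// environments, including components present in only one of them.
func Diff(a, b Versions) []string {
	seen := map[string]struct{}{}
	var out []string
	for _, env := range []Versions{a, b} {
		for name := range env {
			if _, done := seen[name]; done {
				continue
			}
			seen[name] = struct{}{}
			if a[name] != b[name] {
				out = append(out, fmt.Sprintf("%s: %q != %q", name, a[name], b[name]))
			}
		}
	}
	sort.Strings(out) // map iteration order is random, so sort for stable output
	return out
}

func main() {
	// Illustrative pins only.
	sit := Versions{"gke": "1.27", "mesh": "1.19", "bootstrap": "v42", "ledger-api": "3.4.1"}
	nonProd := Versions{"gke": "1.27", "mesh": "1.18", "bootstrap": "v42", "ledger-api": "3.4.0"}

	for _, line := range Diff(sit, nonProd) {
		fmt.Println(line)
	}
}
```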

Finally, different CD pipelines and processes for prod and non-prod. We now use the same repo and process for both production and non-production, and we're discovering fewer issues with pipelines in prod, because previously they were simply different from the ones in non-prod.

Right, let's talk about benefits for GitOps.

I'll reiterate disaster recovery. GitOps simplifies disaster recovery through version control and declarative infrastructure, enabling easy rollbacks to known working configurations and providing a clear, auditable history for rapid recovery.

Fleet-wide changes. I've said it before, but as operators of platforms, you need to be able to consistently roll out fleet-wide changes to your cluster. For us, the way that we achieved that was converging on a centralized Golang code base that executes in both pipelines and operators, so pipelines and server-side.

Finally, through this, we've now put ourselves in a position where we can automatically provision clusters. That's not just about having healthy clusters that are recreated on a regular basis, but it's about providing flexible cluster tenancy models to our application teams.

We are not doing it right now. We're still quite heavily namespace-based, but we actually want to get to the position where we can quite readily offer a customer team their own cluster. And it's no overhead to us as a platform team.

Progress and plans.

All right, maybe just before I go through this, I just want to recap on, I guess, five main takeaways that I'm hoping that you guys get out of this talk.

First of all is really understanding the why. Building a platform is not cheap. So know why you're going to do it. Know what success looks like and how you're going to measure it.

Secondly, set yourself some principles for your platform build. We want our engineers to be autonomous in their decision-making, and there's a ton of micro-decisioning that goes on during implementation. So set those principles, communicate them out, and allow your engineering teams to then be autonomous in the build.

Model your domain. Think of your organization, your teams, your cloud provider, your DevOps tooling estate as just a collection of entities. It's no different from you putting together schema for a microservice.

Build an API-driven platform. Now, there's lots of different ways to do that. You could have REST APIs and a CLI that invokes them. For us, it was CRD YAML files that are in Git, deployed to cluster, and are actuated on by operators.
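For readers less familiar with operators, "actuated on by operators" means a control loop compares the declared resource with what actually exists and decides what to do. The following is a deliberately simplified sketch of that decision step, not a real controller: the `State` type, the `Action` values, and the `Reconcile` function are hypothetical, and a production operator would be built on a Kubernetes client library rather than plain maps.

```go
package main

import "fmt"

// Action is what the control loop decides to do for one resource.
type Action string

const (
	Create Action = "create"
	Update Action = "update"
	Delete Action = "delete"
	Noop   Action = "noop"
)

// State is a hypothetical, flattened view of one resource's spec.
// A nil State means the resource does not exist.
type State map[string]string

// Reconcile compares the desired state (from Git) with the actual state
// (from the cluster or cloud provider) and returns the action needed to
// converge. It is level-triggered: it looks only at current state, not at
// what changed.
func Reconcile(desired, actual State) Action {
	switch {
	case desired == nil && actual == nil:
		return Noop
	case desired == nil:
		return Delete
	case actual == nil:
		return Create
	case !equal(desired, actual):
		return Update
	default:
		return Noop
	}
}

// equal reports whether two states hold exactly the same keys and values.
func equal(a, b State) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if bv, ok := b[k]; !ok || bv != v {
			return false
		}
	}
	return true
}

func main() {
	desired := State{"region": "australia-southeast1", "minInstances": "2"}
	actual := State{"region": "australia-southeast1", "minInstances": "1"}
	fmt.Println(Reconcile(desired, actual)) // prints "update"
	fmt.Println(Reconcile(desired, nil))    // prints "create"
}
```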

And finally, be community-minded, community-governed, and open for contribution. Because the success of your platform, the key to frictionless adoption, is it being a joint effort. It can't be done independently from your application teams.

All right. So the main focus for us so far has been automating Google Cloud, Kubernetes, and also GitHub. We're doing most of the big-hitting GCP infrastructure, the runtime stuff, namespace, et cetera, and Kubernetes we've done. In terms of GitHub, it's the creation of repos and it's the configuring of Actions pipelines.

What's next for us? It's probably going to be a focus on, I think, ServiceNow. That's our asset management system. It's also our change management system, so we can all get a lot of value out of that. But you can see there that we're also looking at other parts of our estate, such as Black Duck, Checkmarx, SonarQube, et cetera.

The help I'm looking for.

So this is really to hear from you folks. So please, if you've got any questions and you don't ask them directly, pop them in that Slack channel, the Leadership discussion. I'll get back to all of them by the end of the day, but I'd just really love to continue the conversation.

I think you guys can tell I'm pretty passionate about platforms, particularly those that offer really strongly bounded self-service tenancies. We're a bank. We can't have folks from one team deploying into another team's workspaces. We've backed it all up with OIDC, et cetera.
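The tenancy boundary can be pictured as a check that the identity running a deployment belongs to the workspace it is deploying into. The sketch below is an assumption-laden illustration, not the bank's implementation: it assumes the OIDC token has already been cryptographically verified elsewhere, uses the `repository` claim that GitHub Actions OIDC tokens carry, and invents the `Tenancy` mapping, the `Authorize` method, and the `example-org` repository names.

```go
package main

import "fmt"

// Claims holds the subset of already-verified OIDC token claims used here.
// Signature and issuer validation are out of scope for this sketch.
type Claims struct {
	Repository string // for example "example-org/ledger-api"
}

// Tenancy maps each repository to the single workspace that owns it.
type Tenancy map[string]string

// Authorize allows a deployment only when the calling repository is
// registered to the workspace it is trying to deploy into.
func (t Tenancy) Authorize(c Claims, targetWorkspace string) error {
	owner, ok := t[c.Repository]
	if !ok {
		return fmt.Errorf("repository %q is not registered to any workspace", c.Repository)
	}
	if owner != targetWorkspace {
		return fmt.Errorf("repository %q belongs to workspace %q, not %q",
			c.Repository, owner, targetWorkspace)
	}
	return nil
}

func main() {
	// Illustrative ownership data.
	tenancy := Tenancy{
		"example-org/ledger-api": "payments",
		"example-org/loan-quote": "home-loans",
	}
	caller := Claims{Repository: "example-org/ledger-api"}

	for _, target := range []string{"payments", "home-loans"} {
		if err := tenancy.Authorize(caller, target); err != nil {
			fmt.Println("denied:", err)
			continue
		}
		fmt.Println("allowed: deploy to", target)
	}
}
```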

But yes, here's a couple of points that I'm interested in.

If you've gone through a similar journey, progressing from simply provisioning cloud infrastructure and DevOps tools to finally being able to build a platform, how'd you go about adoption? It's not easy to re-platform.

When designing your platform, did you bake off things like Pulumi, Crossplane, KCC, or did you build your own custom orchestration? What did you go with and why?

What's your view on rendering intent to primitive server-side versus in pipelines? Do all primitives need to be stored in Git, or is intent enough? This is a War and Peace type debate in our team. It was, anyway. To answer your question, intent's fine for us. So we trust our operators, we bootstrap accordingly, the operators are there, we deploy the intent, happy days.

What's your entry criteria for building an operator? Are they worthwhile for simple use cases? Once again, there's always a lot of competitive tension in our team between do we do something in a GitHub Action, or do we do something in an operator?

How much of your cloud provider and DevOps tooling estate have you been able to configure through platform APIs?

And finally, what approach does your team use to automatically provision clusters with KRM config originating from multiple tenant repositories?