The Art of Platform Engineering

US 2021

Sanjeev Sharma

SVP, Head of Platform Engineering · Truist Financial

John Comas

SVP, Platform DevOps Automation · Truist Financial

Sanjeev Sharma is an internationally known thought leader in DevOps, Cloud Transformation, and Data Modernization, as well as a technology executive and author. Sanjeev’s industry experience includes tenures as CTO, Technical Executive, and Cloud Architect leader. As a former IBM Distinguished Engineer, Sanjeev was recognized at the highest levels of IBM’s core of technical leaders. Sanjeev is currently Head of Platform and Automation Engineering at Truist, the 6th largest bank in the US, created by the merger of BB&T and SunTrust.

Sanjeev provides leadership to drive the adoption of cutting-edge solutions, architectures, and strategies for DevOps, Cloud, and data-driven transformations, working with the C-level executives leading these transformations. Sanjeev published his second bestselling book, The DevOps Adoption Playbook, in 2017. He regularly blogs and podcasts on DevOps, Cloud, and Data Modernization on his popular blog, http://sdarchitect.blog

John Comas is currently Senior Vice President of Platform DevOps Automation at Truist Financial managing the DevOps transformation for the 2nd largest bank merger in US history.

Formerly, he was Manager of DevOps Solutions at NBCUniversal for over 9 years, where over 200 critical applications were automated with DevOps enterprise services.

Full transcript

John Comas

Hi, everyone. I'm John Comas.

I was born and raised in Northern New Jersey, and, well, I guess you can say I'm a Jersey boy. I've been working with DevOps principles and practices even before DevOps was an industry buzzword. I'm currently SVP of Platform DevOps Automation for Truist Financial. I was previously the DevOps leader for barnesandnoble.com and NBCUniversal. I received my PhD in systems engineering/DevOps from Stevens Institute of Technology in Hoboken, New Jersey. And in short, I've been a lifelong advocate of advancing technology through promulgating the principles of enterprise DevOps and the elegance of standardization and simplification.

Sanjeev Sharma

Thank you, John. And thank you, Gene, for having us again at the DevOps Enterprise Summit.

Just like John, I've been in the DevOps industry since long before it was called DevOps. I actually met John in 2013. He was one of my clients. I was at IBM, and I was the first distinguished engineer at IBM who was helping clients adopt DevOps. And I got an opportunity around that time, as Gene mentioned, to work with dozens of clients around the world, helping them adopt DevOps at large scale.

And today, I'm at Truist, the seventh largest bank in the US and the fourth largest insurance holding company on the planet. And I'm taking what I learned working with several of those clients and implementing that, working with John, as he mentioned.

Now, for those of you who do not know Truist, it was created back in 2019 by the merger of BB&T and SunTrust, and that made us a very large bank. Very rarely does a top 10 bank enter suddenly out of nowhere into the top 10 list, but here we are. The merger is ongoing, and while the merger is ongoing, we are taking this opportunity to improve our capabilities, improve our developer experience, our developer productivity, and that's what we're going to talk about today: how are we doing that?

So to take a step back, let's talk about what is a platform. We call this session The Art of Platform Engineering. It sounds very philosophical, sounds very artistic. But at the end of the day, it is all about improving developer productivity and developer experience, as I mentioned. And there's not a better quote or description I could find on that than Matt Skelton's description or definition of what a platform is. And I'll read it verbatim for you because it is concise, and I love things which are simple and concise. He says, "A platform is a curated experience for engineers to accelerate the delivery teams that use it."

Let's dissect this a little bit, because this is the philosophy. This is what we are building at Truist, and this is what we're going to talk about. The first keyword there is curated, as opposed to ad hoc. We don't want developers to have an ad hoc experience which varies based on how long they've been there, what technology they're using, what tool set they're using, who was on their team, what kind of product they are building. It should be a curated experience. Now, I agree that in a large enterprise, it cannot be the same experience. It might vary. But we want to reduce that variance and make it a better experience, for lack of a better term.

It's for engineers. Now, we are using the term engineer here generically. Anybody who's a stakeholder in getting requirements converted to code running in production. We're using the generic term engineer here, and running in production also. So it includes people in operations, it includes people at the help desk, incident management, everybody. And our goal is to accelerate their capability to make them more productive, make their life easier, reduce toil upon them. That, at the end of the day, is the definition of what we are calling a platform and what we are building here.

I've been at Truist for less than a year and a half, and John joined a couple of months after me. When we got started, we looked at what was going on, and we decided, okay, let's say we had a blank slate. Now, trust me, we don't. Obviously, software is being developed here. But if we had a blank slate, how would we design a platform? What would we want it to look like? And we came up with a few tenets. Now, if you've heard me speak before, I've been talking about these tenets long before I joined Truist, so this is not new. But these are what we are implementing right now, and I'll go into detail about how we are implementing them later in the presentation.

First and foremost, the first tenet is everything should be self-service. So how can we take all the capabilities and the services which our shared services organization, which we are a part of, is delivering to the developers, to the engineers, so that they can do their jobs? We need to make sure it is self-service. It's not manual, it's not ticket-driven.

Second thing, and this is very important when you're talking to a large enterprise, especially one in a regulated industry, it needs to come with permission to act. It's pointless having something which is self-service if you still need to get approvals from five people before you can self-serve yourself. In order to give permission to act, what we need to make sure, and that's our responsibility, and we'll talk about that, is that we build guardrails around every service we are delivering via self-service so that people don't break things and it doesn't create jeopardy for us and our company.
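As a rough illustration of pairing self-service with guardrails, here is a minimal Python sketch. Everything in it is hypothetical (the `ServerRequest` fields, the limits, and the `check_guardrails` and `self_serve` functions) and is not taken from Truist's platform. It only shows the general shape: a request that passes every guardrail proceeds without a human approval step.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical guardrail limits; real values would come from policy.
MAX_CPUS = 16
ALLOWED_REGIONS = {"us-east", "us-west"}


@dataclass
class ServerRequest:
    team: str
    cpus: int
    region: str
    encrypted_disk: bool


def check_guardrails(req: ServerRequest) -> list[str]:
    """Return guardrail violations; an empty list means the request may proceed."""
    violations = []
    if req.cpus > MAX_CPUS:
        violations.append(f"cpus {req.cpus} exceeds limit {MAX_CPUS}")
    if req.region not in ALLOWED_REGIONS:
        violations.append(f"region {req.region!r} is not allowed")
    if not req.encrypted_disk:
        violations.append("disk encryption is required")
    return violations


def self_serve(req: ServerRequest) -> str:
    """Provision with no human approval when every guardrail passes."""
    violations = check_guardrails(req)
    if violations:
        return "rejected: " + "; ".join(violations)
    return "provisioned"
```

In this sketch the "permission to act" lives in the guardrail checks themselves: nobody signs off on a compliant request, and a non-compliant one is rejected with the reasons listed.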

All of this will only work if we have a culture and environment of trust. The engineers who are consuming our services trust us that the service will behave and perform and function as designed, and that we'll be able to deliver the SLAs, or SLOs, we are promising. And we can trust them that they are not trying to bypass the guardrails and game the system. There's a mutual understanding of trust.

As we started building the platform and designing the platform, we looked at it as a layered cake approach. We need to have an ability to provision environments and configure environments and de-provision environments, an environment pipeline. Next, we need to have an application delivery pipeline, which is what John will talk about. He's responsible for that. And all of this needs to be secured and compliant. Even if you're not in a regulated industry, you need stuff to be secure and compliant. But when you work for a bank, when you're in a regulated market, everything needs to be secure and compliant.

We started building this multi-layered cake, and we're in the middle of doing that. We're going to share how we got where we got and where we are headed. But here, very briefly, is the capability map of what each of those areas looks like. So we have environment engineering, which forms the basis. You can't do anything without an environment. Then you have the ability to capture your requirements, write code, test code, deliver code, the entire application delivery pipeline. And then on top of that, you have a data pipeline. How do I get data to the right people at the right time so they can make the right decisions? And this includes test data. This also includes other data which is feedback coming from operations as to how the application is performing and behaving in the real world. And obviously, all of that has to have an aura or a layer, a fondant layer or a chocolate layer on that cake of security and compliance.

On the top, you see what we are specifically building, a portal, by which the developers can consume all these services via self-service and participate in that curated experience we are talking about.

What I'd like to do now is hand it over to John, and he'll talk about the application delivery pipeline of the layered cake, which he's responsible for. So over to you, John.

John Comas

Thanks, Sanjeev.

When I started my career back nearly 20 years ago now, one of the most pervasive industry challenges I encountered was how large corporations had difficulty in releasing software successfully into prod. There was a definite need to deploy changes rapidly to a live customer-facing system with minimal to no disruption, but there was no mechanism to do so, and teams were not organized in such a way to make this viable.

In and around 2008, the industry really began to see the DevOps movement take hold as a way to modernize traditional software development practices. But one of the challenges I encountered was that principles and practices of DevOps were often viewed as more conducive toward a startup or smaller organization. Implementing DevOps at a large corporation was seemingly a monumental task that involved changing the way hundreds or even thousands of people work daily and how they interacted with their peers. And this is where enterprise DevOps was born for me. Much like the agile process itself, getting the IT organization to scale DevOps consisted of many small, iterative steps to achieve a cohesive, well-oiled end-to-end process.

It's crucial that when you embark on your enterprise DevOps journey, you steer clear of pervasive anti-patterns which will present themselves. I've learned that these traps can sneak up on you and muddy the waters for your transition. Firstly, and most importantly, DevOps is not a new silo that sits between dev and ops. DevOps engineers are highly comprehensive in their knowledge set and understand the application undergoing development, the SDLC tooling, the hosting model, et cetera. DevOps is not a tools enablement team for developers. DevOps is also a cohesive end-to-end automated process, and you cannot automate half of the software development process. I've learned that to be successful, it's really all or nothing. For example, you can't say that we implemented continuous delivery for dev, QA, and stage, but we're still doing manual prod deployments. That just doesn't work. And we have to be careful not to confuse SRE, site reliability engineering, with DevOps. We need to prevent the use of non-standardized and/or the use of multiple team-centric DevOps frameworks, which can create a hero anti-pattern.

Next slide.

So as we all know, the term DevOps, as well as the term DevSecOps, is a portmanteau of development, security, and operations. In the industry, the term DevOps has multiple definitions and perspectives on its core philosophy. In my career, and here at Truist, I have defined DevOps through the use of the five C's: continuous integration, continuous delivery, continuous testing, continuous monitoring and feedback, and continuous compliance.

And while we all know the core of what CI, CD, CT mean, there are key aspects to these practices which a DevOps leader must be aware of. When we implement CI, we're not just implementing continuous build; we're implementing full, true continuous integration. Remember that you're not just setting up pre-flight builds, nightly builds, et cetera. You have to successfully branch and merge and continuously integrate your developers' check-ins back to the mainline.

With CD, there's a big difference between continuous delivery and continuous deployment. You have to judge for yourself what is the most realistic and appropriate for your organization. And I've learned that which is best depends upon the application. With continuous delivery, your DevOps pipeline is creating a deployable asset and release package, which is deemed fully tested and approved and in a holding pattern, ready to be deployed to prod. With continuous deployment, the deployable asset is automatically deployed to prod as soon as it's deemed ready. This is a critical implementation decision which needs to be made based on your industry type and organizational appetite for change.
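The distinction between the two can be sketched in a few lines of Python. The `ReleaseMode` enum and `next_step` function are hypothetical names used only for illustration. The sketch assumes a release package is "deemed ready" once it is tested and approved, as described above.

```python
from enum import Enum


class ReleaseMode(Enum):
    CONTINUOUS_DELIVERY = "delivery"
    CONTINUOUS_DEPLOYMENT = "deployment"


def next_step(tests_passed: bool, approved: bool, mode: ReleaseMode) -> str:
    """Decide what happens to a release package once the pipeline finishes."""
    if not (tests_passed and approved):
        return "blocked"
    if mode is ReleaseMode.CONTINUOUS_DEPLOYMENT:
        # Deemed ready, so it goes to production automatically.
        return "deploy to prod"
    # Continuous delivery: the package waits in a holding pattern
    # until someone makes an explicit release decision.
    return "hold, ready for prod"
```

The only difference between the two modes is the final step. Everything before it (build, test, approval) is the same pipeline.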

Also, and I can't stress this enough, never silo your DBAs. DevOps for the database is critical. The DB changes should be implemented right along with your non-database code in the same CD process.

So if you ask me what the single most important step toward implementing enterprise DevOps has been for me, I would say that it was simplicity. I started the enterprise journey being as simple as possible, never trying to boil the ocean. The goal has always been to improve the quality of code developed, to deploy faster and more efficiently at a reduced cost. I like to promulgate the idea of the five-point DevOps star: simplicity, traceability, accountability, repeatability, and reliability. The DevOps pipeline has to build, configure, and deploy software of any platform and technology.

We designed a robust system that is both highly scalable and highly available. And so quite simply, the end-to-end pipeline should be able to deploy anything, anywhere, on-premise or cloud. You should be able to achieve everything with your single enterprise pipeline.

Next slide.

So to the greatest extent possible, we standardize our tool sets and promulgate a single unified path to production. Standardizing build and deployment practices reduces your costs and prevents errors through automation, with faster and more frequent releases. We create a unified development standard which delivers confidence in code quality. Your pipeline needs to exude confidence and always reliably and rapidly push out changes to production to meet the ever-changing business needs.

All developer tools used in your process need to come from the enterprise toolbox, and this is very critical to ensure stability and compliance. And so as you can see here, I like to look at DevOps tooling as a pyramid. At the bottom of the pyramid, our developers have access to a rich and diverse set of tools from the enterprise toolbox. But as we move up the pyramid, which represents the various states, dev, QA, stage, prod, the standardization of the tool set narrows. So when you reach the apex of the pyramid, we're to the greatest extent possible achieving a unified path to production.
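One way to picture the pyramid is as a set of allowed tools per stage, where each stage permits only a subset of what the stage below it permits. The Python sketch below uses placeholder tool names and a hypothetical `is_pyramid` check. It is an illustration of the narrowing idea, not a description of the actual enterprise toolbox.

```python
# Placeholder tool names; each stage allows a subset of the stage below it.
STAGES = ["dev", "qa", "stage", "prod"]
ALLOWED_TOOLS = {
    "dev": {"tool-a", "tool-b", "tool-c", "tool-d"},
    "qa": {"tool-a", "tool-b", "tool-c"},
    "stage": {"tool-a", "tool-b"},
    "prod": {"tool-a"},
}


def is_pyramid(allowed: dict, stages: list) -> bool:
    """True when every stage's tool set is contained in the stage below it."""
    return all(
        allowed[upper] <= allowed[lower]
        for lower, upper in zip(stages, stages[1:])
    )


def tool_permitted(tool: str, stage: str) -> bool:
    """Check whether a tool from the toolbox may be used at a given stage."""
    return tool in ALLOWED_TOOLS[stage]
```

With these placeholder sets, `tool-d` is usable in dev but not in prod, and only one tool remains at the apex, which corresponds to the unified path to production.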

Next slide.

Much like our holistic universe, Mother Nature, and humanity itself, our IT enterprises are comprised of highly complex systems of systems where each individual system, while capable of independent operation, interoperates with other independent systems to create a fully comprehensive system, which is overall greater than the sum of its parts. You have to be very careful and critically aware of system interdependencies and how you roll out releases to production. That's something I've learned throughout my career. Transitional state deployment problems are actually a core of my PhD thesis.

Application interdependencies and the potential impact a deployment may have on live systems need to be very carefully analyzed. I cannot emphasize enough the importance of understanding the effect a deployment can have on the holistic system undergoing change.

Next slide.

Enterprise DevOps standardization and simplification is a fundamental building block to our core implementation. And because our systems are so complex, you need to bring order to the chaos and be simple. I've always said, think about how you can do more with less. Think about using an aphorism like, "I'm implementing a single button instead of a QWERTY keyboard." The simplified methodology allows you to focus your energy on developing the logic necessary to deal with enormously complex, interdependent software systems of systems. You have to understand the transitional states during a software deployment and continuously assess risk throughout all aspects of the enterprise DevOps pipeline.

When I was working in e-commerce, I remember encountering a situation where in a highly complex order management system, a change to one of the key middleware messaging systems between our front end and the fulfillment system caused major disruption. During a deployment to the live system, customers were able to place online orders, but they could not see their order status being updated as the deployment was affecting the messaging queues. The messages were there and generated properly, but just being held in a queue for release post-deployment.

So frustrated customers who couldn't see their updated order status called up customer service requesting refunds and order cancellation, which the customer service reps obliged. However, even though customer service reps canceled the order, those cancellations didn't update because the message queuing system was still backlogged from the deployment. So all those orders still flowed to the fulfillment system and got processed. Even though the order was canceled and credit cards were refunded, customers still received their orders. And since you couldn't recharge the credit cards of customers, people received free merchandise. And you can imagine, that's not great for the financial health of your business.

Sanjeev Sharma

Thank you, John. Now let's go into how we take these two banks, which were separate and had different philosophies and different IT technologies. How are we merging them and bringing that simplicity that John talked about?

If we look at the organizations, they were very diverse, very different. And I worked with dozens of large enterprises around the world, and they all have a high variance of technology stacks, tools being used, team maturity. And now you're taking two enterprises who had that variance and putting them together and trying to create one unified organization. How do we bring all those people together? Remember, we are talking of tens of thousands of developers supporting thousands of applications for tens of millions of customers.

Furthermore, our goal was not to build to support the bank as it would be today, as a combined Truist, but to support the bank as it would be three years from now, based on our strategic plans and our growth plans.

One of the things we are doing, and those of you who watched the TV series Loki will get this reference, is acting like the TVA. We are pruning how things are done. We are reducing it, moving up John's pyramid of standardizing as you move up to higher and higher environments and towards production. So reorganizing ourselves to support this large and diverse team set, even as we start moving towards standardization, is very important.

So we are the platform team, which is taking the capabilities from across all IT shared services and delivering it as the platform to the developers, to the engineers who are consuming them. What John is doing and what he talked about is, how are we taking these diverse application delivery tools for the various teams we support and helping them standardize. Not standardize just in terms of the standards we are writing, but standardize on that unified path to production so that we can make life easier for not just ourselves, but like it was talked about in the "Dear Auditor" book.

Where are we today? And you see some color coding here, and I'll go over it in detail, but I want to walk you through what we have achieved. Now, I've been here, as I mentioned, slightly over a year, and we are still very early from that perspective in our journey. But we weren't starting at zero either. We have a bank with a lot of automation already in place, but it is still a long way to go. But let's talk about it area by area.

First of all, I'd like to start off with test automation. That's probably the most mature area I have as far as automation is concerned. We have 100% coverage for our performance testing, for example. I just got an email a few weeks ago from one of the projects, which is way ahead of schedule, which is rare to hear of nowadays for any project anywhere. We are ahead of schedule of getting all the performance testing done. But that's an area where we are very mature, but it is not an area where it's self-service, by definition, because performance testing is highly specialized, and we have a specialized team which goes in and works with the application teams to figure out what needs to be tested, what kind of loads are they expecting, load and stress, and all the various criteria which go into performance testing.

The other aspect of testing where we have pretty much close to 100% coverage is test automation. We have a centralized test center of excellence, which has developed a test framework which everybody uses. And all the teams which have been onboarded to it, and as I said, it's close to 100% adoption, utilize that test framework to test their applications. Of course, there are always outlier applications whose technology doesn't fit the test framework. Barring those, the rest are using the standardized test automation framework we have. We are gathering all the test data at a centralized location to ensure coverage and ensure compliance with all our QA requirements.

And this goes on actually even in the area of security testing, where we are doing code coverage analysis and code vulnerability analysis. All of that is automated and is done to a very large extent by self-service, by the individual teams who are wanting to consume the set tests.

At the other extreme, we have areas like SRE. In fact, I mention SRE now because just this morning, I got an email from a project where we are piloting SRE. SRE is a new area for us where we are looking at what the most common areas of toil, as the SRE folks call it, are for our operations team. We are working with one of our divisions, and we have experimented with what we can automate to reduce that toil on the operations teams when it comes to handling repetitive tasks. And this is mostly in the data space where we have started, and we are helping the operations team address some of these common issues, which can reduce the toil and make life easier for our operations team.

Another area of, actually a place where we already have a portal, a self-service portal, is in the area of environment provisioning. Putting out servers. All that work is completely automated. The workflows are automated. Our teams can come in and provision their own servers. Where we are looking at right now, where we can truly make it self-service, is in the guardrail space. Remember the four tenets I spoke about? Self-service is one of them, but self-service is kind of not possible to roll out to various teams to consume if we don't have the guardrails in place. Without the guardrails, you cannot control what kind of service people publish, configure, and provision for themselves. So we are working on providing those guardrails. That way, we can open up that self-service portal to anybody, any team that needs servers.

To put it very frankly, I can go into test data automation. All the test data we use is masked. And all the privacy data is obfuscated in it. That's 100% coverage. What we want to get to is self-service of test data. So we can use data virtualization so that a delivery team can self-provision data into their test environment. We do that today, but it's not self-service.
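To make the masking idea concrete, here is a minimal Python sketch of field-level masking using the standard library's `hashlib`. The field names, the salt, and the `mask_record` function are assumptions made for illustration. The talk does not describe which masking technique is actually used, and a salted hash of a low-entropy value such as an account number is not, on its own, strong protection.

```python
import hashlib

# Hypothetical configuration: which fields count as private, and a salt
# that a real system would keep secret and manage outside the code.
SENSITIVE_FIELDS = {"name", "ssn", "email"}
SALT = "test-env-salt"


def mask_value(value: str) -> str:
    """Replace a value with a deterministic, non-reversible token."""
    digest = hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()
    return "masked-" + digest[:12]


def mask_record(record: dict) -> dict:
    """Mask private fields and pass every other field through unchanged."""
    return {
        key: mask_value(str(value)) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }
```

Because the masking is deterministic, the same input always yields the same token, so joins between masked tables still line up in a test environment.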

Another area where we have a lot of atomic automation is the space John was talking about: the application delivery pipeline. I call it atomic because the automation works and all the Lego bricks are there, or, as John put it earlier on, the individual keys of the keyboard are there, but we don't have the QWERTY pipeline, so to speak, the whole keyboard. Your builds, CI, CD, all the automated gates that an application needs to go through in order for us to say it is ready for deployment to production: all those individual automations are available.

What we are working on is creating a composable pipeline so that people can self-serve that pipeline itself and, with the push of a button, provision that pipeline. And of course, the pipeline will need to be configured based on the application, the technology stack being used, and the maturity of the team, which determines the level of hand-holding we in the DevOps team need to provide to the team, what John was referring to as synergize. How much hand-holding is needed depends on the maturity of the team.
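A composable pipeline of this kind can be sketched as a function that assembles existing "atomic" stages into an ordered list based on the application's technology stack and the team's maturity. The stage names and the `compose_pipeline` function below are hypothetical and serve only to illustrate the composition idea.

```python
# Hypothetical atomic stages that already exist as individual automations.
BASE_STAGES = ["build", "unit-test", "security-scan", "package"]

# Hypothetical extra stages that depend on the technology stack.
STACK_STAGES = {
    "java": ["dependency-check"],
    "database": ["schema-migration"],
}


def compose_pipeline(tech_stack: str, team_maturity: str) -> list:
    """Assemble an ordered pipeline from atomic stages."""
    stages = list(BASE_STAGES)
    stages.extend(STACK_STAGES.get(tech_stack, []))
    stages.append("deploy-nonprod")
    if team_maturity == "low":
        # Less mature teams get an extra hand-holding step.
        stages.append("devops-team-review")
    stages.append("prod-readiness-gate")
    return stages
```

For example, `compose_pipeline("java", "high")` yields the base stages plus `dependency-check`, followed by the non-production deployment and the production readiness gate. A low-maturity team would additionally get the review step.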

The really dark color, which is purple, marks things which have full adoption. That automation is built, and we have 100% adoption with varying levels of self-service. The ones coded in orange are work in progress, where we are building automation and onboarding teams, doing pilots, like I talked about earlier on. And then there is work in progress. These are some new areas like SRE, where we are doing some pilots to really understand for ourselves what we need when it comes to SRE. What do we want to focus on? Where is the low-hanging fruit, and what should our long-term strategy be?

We do not today, as you can probably guess, have a true platform. It's not a platform. These are atomic services which are available. Some of them self-service, some of them not. What we really want is a portal, a marketplace, just like a private PaaS, where our teams can go in and consume these various services and compose them together to create the pipelines they need for their specific consumption needs and their specific requirements.

So how can you help? We're on this journey. The journey is going to go on for a while. But if you have been on the journey that we've been on, and you're a little bit ahead of us, we'd love to learn from you. If you know where the landmines are and where the trap doors are, share that with us so that we don't step in the same ones again.

Secondly, we need some guidance. We are still figuring out how to make the onboarding to the platform itself self-service. Which means if one of my sister organizations, let's say the network team, wants to build self-service for our SDN capabilities, and we want to make that available on our platform, what are the SLAs? What are the criteria we use to allow them to self-serve onboarding their own new service or updating a service they have available? Because we are not just going to put the services that John has on this broader platform. As you saw, it's much broader than just DevOps. Other services will also be available so that we can have an end-to-end developer experience.

And lastly, do mandates work, or do we go with the approach of "if you build it, they will come"? We'd love to get some feedback from you.

With that, I'd like to thank you all for your time. We'll be on Slack, taking questions and answers. And thank you. It's great to be here.
