San Francisco 2017
DevOps at American Express

DevOps is the next step in our transformation journey at American Express.


We are one of the horses, a world class financial services company founded in 1850. Since 2011 we have been laying the foundations for DevOps by focusing on: developing our people, bringing in engineering skills, driving agile and toolchain automation adoption and co-locating our teams. We are now increasing our focus on shifting our service operations structure to improve availability and reduce service disruptions.


This is our story so far.

Full transcript

The complete talk, organized by section.

Lee Barnett

My name's Lee Barnett. I work for American Express. This is Chad.

We'll do full introductions in a moment, but thank you for coming along to our talk this morning. We're going to talk about the DevOps journey at American Express.

Picture of me on the right-hand side there. I've been at American Express since 1979, so a couple of years now. I did have hair when I started. It's nothing to do with the company, but there you go.

I worked for the first 10 years in the operations side of technologies and then moved into development, so the last sort of 25 years has been in development. I remember that first couple of weeks of moving from the superhero support operations world to development, where I had six months to deliver something. And I remember thinking, "What am I going to do for the first five and a half months of this? I'm so used to doing everything right now."

I currently work in delivery and am responsible for tool chain automation.

Chad Avery

My name's Chad Avery. I'm a director in our service operations team. I'm based out of Florida.

I just realized when you said you had been there since 1979, I was in kindergarten, just to throw that out there. And I'm an old guy, so.

I run our DevOps enterprise program, and I'm based out of Florida.

Lee Barnett

You do know, Chad, in England, they take us straight out of kindergarten into school.

Chad Avery

Oh, okay.

Lee Barnett

Okay. Just put context around that.

Chad Avery

There you go.

Lee Barnett

So American Express. Who's heard of American Express? Who's got an American Express card? Yes! I love that. There's a lot of Amex people in the audience as plants, so we were just taking names and numbers as we went through that one.

So a little bit about our company. We were established in 1850, so coming up for 170 years old now. We have, and these numbers are at the end of last year, 5.4 billion net income. We have a trillion worldwide billed business and over 110 million cards in force around the world.

Our key kind of business areas that you probably are aware of: the charge card and the credit card. Big travel business. Traveler's checks, which is kind of on the way out. I used to work for the TC business. Moving into lending and some other areas of the business.

Chad Avery

I'll give you a little bit of an overview of our technology organization. Lee will get through some of the stats. But if you look, one of the things that you'll see is that we're trying to implement DevOps in a very, very big company with a wide breadth of technology. A company that's very focused on security and enhancing the American Express brand.

Lee Barnett

And there's numbers on here. Again, these are really focused on the first half of this year.

You echoing with me there?

Chad Avery

I'm sorry. Yeah, I'm trying to mute you out.

Lee Barnett

Okay.

So a little bit about what we do. American Express, when I talked about the charge card and the travel business, those sort of things, who uses reward points on some product or other? American Express invented those.

Next week is Thanksgiving. You're hearing this from an Englishman, which is great fun. But Thanksgiving. So Black Friday, Shop Small Saturday, Cyber Monday. Shop Small Saturday, American Express. All these things have technology demands behind them, and they're leading to sort of innovative changes throughout the industry.

So just some of the numbers from the first six months of the year. Five and a half thousand employees work in technologies based across the globe. Phoenix is our headquarters for our technology organization. We've got over 3,000 apps out there. There's all those different business lines there.

We have products that stretch across business lines, so when we start to talk about some of the capabilities as we went through delivery transformation, I'll talk a little more about Agile and cross-functional teams and those sort of things.

Forty-eight thousand servers. Just a couple sitting out there that we have to manage and look after. One of the key things you'll see is that we are more than 160 years old. When technology started, we started. So we've got some apps that kind of stretch back almost to when I joined the company. Perhaps not quite that old.

But nearly 6,000 mainframe code deploys in the first six months of this year. Alongside that, 27,000 other deploys. That's around 300 deploys a day for a major financial institution.

If you look back over time, that's a 4X growth in the last couple of years since we went down this transformation journey as we're leading towards DevOps.
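As a rough check on the deploy figures quoted above, the daily rate can be worked out from the two half-year totals. This is only a sketch: the day counts are assumptions, not numbers from the talk, and "nearly 6,000" is treated as 6,000. The result depends heavily on whether calendar days or working days are counted, which may be why the spoken figure is rounded to "around 300".

```python
# Figures quoted in the talk for the first six months of 2017.
MAINFRAME_DEPLOYS = 6_000   # "nearly 6,000", so treated here as an upper bound
OTHER_DEPLOYS = 27_000

# Assumptions (not from the talk): how many days the half-year contains.
CALENDAR_DAYS = 181         # 1 January to 30 June 2017
BUSINESS_DAYS = 125         # rough working-day count after holidays


def deploys_per_day(total_deploys: int, days: int) -> float:
    """Average number of deploys per day over the period."""
    if days <= 0:
        raise ValueError("days must be positive")
    return total_deploys / days


total = MAINFRAME_DEPLOYS + OTHER_DEPLOYS
per_calendar_day = deploys_per_day(total, CALENDAR_DAYS)
per_business_day = deploys_per_day(total, BUSINESS_DAYS)

print(f"total deploys: {total}")
print(f"per calendar day: {per_calendar_day:.0f}")
print(f"per business day: {per_business_day:.0f}")
```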

So the story itself. It's a multi-year story. We started in 2011 and the focus then was, how do we get rid of waste in our development process?

We were Waterfall. Everything was Waterfall. When we looked at it, our testing was done kind of the day before implementation, and we weren't really checking the code and we didn't have great testing in there. But we were doing a lot of things again and again and again. And every time you wanted to get more data, every time you wanted to run a test, we didn't have anything we reused in there. So we really focused on, how do we remove the waste? And shift left kind of came to the fore at that point.

Over the next couple of years, as we get into the delivery transformation program, we focused on three things: people, process, and tools.

And the people. We've heard a couple of people over the last couple of days talk about insourcing. We changed our model from predominantly outsourced to much more sort of 50/50 focused on insourcing and the vendor model. Now recognizing, as you go through this change, it isn't just about those people that are full-time employees in the company. You have to bring your vendors with you. And again, yesterday there were a couple of references to that. The change is all-encompassing for all the service providers, all the people we use to write the code for us.

We're Java and COBOL predominantly. COBOL on the mainframe, Java for everything else. But there's a whole bunch of other stuff in there.

We changed the process. We are still going from Waterfall to Agile. It's a multi-year journey. There's no magic switch in there. A lot of training has to go on. You bring in experts to talk. You get people involved in sessions for a day or a couple of days. You bring the business along with you. You focus on the leadership. You focus on the engineers. There are many, many components to that.

And so we moved to Agile. We're continuing to go through that. We're pretty much there now, but there's still some pretty large mainframe backends that you can't release every two weeks. You have to plan differently and think differently about what you're going to do with those.

And then automation. So we went people, process, and then we started to look at the tool chain piece of it. That's kind of where I specialize. Looking in the early days at service virtualization and release automation.

We then moved into, and we still have challenges around, test data. Most of our data sits on big DB2 backend engines. We've got to get that to the front end, to the mobile, to the web, and we've got to get it into testing environments on a very frequent, regular basis. And with some of the new legislation coming in, certainly in Europe around GDPR, looking at how you use synthetic data to meet your needs as well. So there's a lot of work and a lot of movement going on there.
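To make the synthetic data idea concrete, here is a minimal sketch of generating test records that contain no real customer data. Everything in it is hypothetical: the `SyntheticAccount` fields, the `TEST-` identifier prefix, and the value ranges are illustrative assumptions, not the schema or tooling described in the talk.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SyntheticAccount:
    """A made-up account record; the fields are illustrative only."""
    account_id: str
    country: str
    credit_limit: int
    balance: int


def make_accounts(count: int, seed: int = 0) -> List[SyntheticAccount]:
    """Generate reproducible synthetic accounts from a fixed seed."""
    rng = random.Random(seed)          # local generator, so runs are repeatable
    countries = ["US", "GB", "DE", "FR"]
    accounts = []
    for i in range(count):
        limit = rng.randrange(1_000, 20_001, 500)
        accounts.append(
            SyntheticAccount(
                account_id=f"TEST-{i:06d}",     # obviously fake identifier
                country=rng.choice(countries),
                credit_limit=limit,
                balance=rng.randint(0, limit),  # never exceeds the limit
            )
        )
    return accounts


accounts = make_accounts(5, seed=42)
for account in accounts:
    print(account)
```

Seeding the generator means a test environment can be rebuilt with the same records on demand, which fits the need described above to refresh test data frequently.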

I think when I look back on the journey, release automation was probably the easiest piece if you're going to start somewhere. Look at the tools that are out there, and that's something that the engineers, once you help them understand it, very quickly can pick up on.

The other end of that is regression and automating regression. But they're some of the learnings as we've gone through the journey. I've been talking to some people here who have looked at it from different angles, from using an orchestrator. But release automation was probably where we saw the biggest bang in the early days in terms of return.

And then now very much focused on CI/CD/CT: continuous integration, continuous deployment, and continuous testing. Work in progress. We've got a pretty good tool chain in place, a pipeline in place for distributed. We do on the mainframe as well, and we're investing more in that too. And now today, continuous testing, having environments always available for whenever the engineers need to do the testing.

Chad Avery

If we look at what was missing, we spent a lot of time in 2016 looking at what the industry was doing around DevOps, and especially financial services organizations and what DevOps meant there. And we really anchored on five pillars for what we wanted DevOps to be at American Express.

The first is operations-focused design. So Lee talked a little bit about shift left, and you've heard that from some of the speakers over the past couple of days. But we really wanted to get the operations teams involved up front in feature development and code development, when there's an opportunity to really make a difference around things like availability and resiliency and that kind of thing.

Lee Barnett

So do you think we're up with Disney with our graphics? Good. Thank you. We worked hard on this, believe me. There is one that moves in a minute. Really proud of that one.

So I just talked briefly about automated CI/CD/CT and Agile at scale. These are two areas that we're continuing to invest in, to look at opportunities.

While I'm up here talking, there's probably another five tools that have been released that we could look at for release automation or test data or something like that. It's an ever-evolving world. And when you start on the journey, picking products is an important challenge and a large part of the challenge, but keeping abreast of where the industry's going is just as important. The changes are rapid in this space, and you have to build a pipeline that's flexible. You have to focus on integration. You can't really focus on one size fits all.

So, if we were to put our NASCAR slide up there, then most of the companies that are upstairs in the exhibition hall appear on there. I walked around yesterday, I think 80% of them appear somewhere in our pipeline.

And Agile at scale. So we started on the Agile journey in 2012, made significant progress, but last year recognized that that's great. So we've got all these teams that are working on a two-week cadence. That's the cadence we picked. Going fast, effective, lean thinking in there. But when you look at our products, they stretch across platforms. Thirty platforms wouldn't be unusual for a deployment in something like the card business that we have. And so you have to start to think about how you bring the teams together and operate at scale.

And so this year, we've made a significant kind of move into the Scaled Agile Framework, 4.0 at the start of the year, 4.5 now.

Chad Avery

Yep. So the next pillar is development line break-fix. And really what we mean by that is having a shared sense of ownership for production code across all the teams.

There was a history in our company of the development team writing code, looking after it for a warranty period of X number of days, and then it becomes an operations problem. One of the things that we've spent a lot of time this year doing in our DevOps journey is taking some of the responsibilities that were previously the purview of the operations team and actually moving them into the development team, and moving resources out of the operations team into the development organization. And that's what we'll talk about, some of the results, in a second.

And then monitoring designed in from the start. We use the term "data in your face" a lot.

Lee Barnett

In your face?

Chad Avery

In your face. In your face.

We want to make sure that telemetry and dashboards and really looking at our applications from a customer perspective and from a transactional perspective is part of everyone's day-to-day job. So we've spent a lot of time this year on that.

One thing I'll mention about this slide, you'll notice that the bottom three are in yellow. The reason for that is that in 2017, these three areas have been our primary focus.

So what are we solving for? We'll talk a little bit about where we are, what we're going through.

Lee Barnett

I like this slide because in the day, back in the day, I could throw you the code. I wish I had some code I could throw at you, Chad, in your ops world.

Chad Avery

Yeah. And I would probably catch it.

Lee Barnett

Sometimes you did.

Chad Avery

Yeah. Sometimes.

And that's really one of the key things. So one of the problems that we saw, and Lee will talk a little bit about the Whac-A-Mole in a minute, is that in the model we were in, where the development team kind of threw code over the wall, we had developed a lot of technical debt. And this technical debt was not getting worked down. So what we saw out of that is a lot of repeat problems, and that's one of the things that we've tried to tackle.

Lee Barnett

Yeah. And technical debt's an interesting one. I think as you start on the Agile journey, when you start with your business partners, technical debt's an issue. In the past, it's always been hidden. When we did Waterfall, it was just something we did on the side, whether it was infrastructure or something to do with maintaining the code and uplifts.

When you move into Agile, it exposes those technical needs. In the early days, the business partners sort of look at you as though you're green and think, "Well, you've always done that. Why can't you continue doing it?" But you have to prioritize.

Over time, through discussions, through arguments, whatever you want to call it, what I found is that as teams mature into the Agile process and into true product owner prioritization, technical debt just becomes debt. It just becomes another story. But it is a journey. There is a pain point at the beginning, but through the journey, it just becomes part of the prioritization you're going to go through.

And the Whac-A-Mole thing, I actually was going to bring a hammer and hit Chad on the head, but they told me there was a health and safety issue with that, which is pretty disappointing.

Chad Avery

Code of conduct. Pretty sure it was in the code of conduct.

Lee Barnett

But Chad mentioned, as you go through the DevOps change, all those things that just consistently fail, you start to pick up on, and you spend time fixing them. And Chad, in a moment, is going to talk about some of the findings we've had from a metrics point of view, and you'll see the impact of that in some of the numbers we put up there.

Chad Avery

Just the last two things on here. I kind of touched on it earlier, but as we went down this journey, we looked at production support tasks specifically. And we broke some of that up, as I mentioned, and handed it from the ops team over to the development team, and then the resources actually went with it as well.

All right. So, our approach. I'll talk a little bit about this. One of the first things that we did as we started on this journey is we established a centralized DevOps implementation team, and that team actually reports up to me. And basically, I have a group of subject matter experts on DevOps that we have embedded into each one of our unit CIO organizations, and they're actually coaching teams through this journey.

One of the early things that we did, though I wouldn't say it came next sequentially, is we also established a DevOps community of practice. When we started talking about DevOps, what we found is there were a lot of teams across the organization that had started doing DevOps organically, and we wanted to put a forum together where we could bring these teams to talk about best practices and figure out ways to kind of break down some of the barriers that we were seeing as well.

Lee Barnett

And that's a pretty standard thing at American Express, communities of practice. We have them for big data, we have them for automation. There's a whole bunch where you bring like-minded people together, share ideas, expose problems. They're a great way of getting to root causes of issues and getting a common feeling amongst a group of engineers. We have regular meetings for these communities of practice, and quite often 200, 300, 400 people attend these things.

Chad Avery

Yeah. That's right.

As you can imagine, with a company the size of American Express, we could not do a big bang and try to roll out DevOps all at once. So we actually went to our unit CIO organizations and basically looked for a couple of what we call technical platforms that we could start on. And a technical platform is a group of applications that are kind of grouped together by the technology function. So we're working right now with about 42 different technical platforms, and that represents somewhere around 200 application teams.

And as we went down this journey, one of the things we wanted to do is drive consistency. We didn't want to be so prescriptive with DevOps that there was a checklist where everybody had to check off 10 items to be DevOps. We wanted to allow some flexibility, but at the same time, we did want some consistency. So we built a playbook around our DevOps journey. We use it in a very repeatable way, we build on it as we go along and learn things, and this is what we're using to kind of coach the teams through the journey.

Lee Barnett

And through that journey, back in 2012 when we started on the Agile piece of it, we created a playbook and we iterate on that playbook all the time. It's openly available. We've done the same with automation. So they again become some of the artifacts that we found very useful, especially as you've got a global audience. Having access to that type of thing if you can't get to somebody is a very useful artifact.

Chad Avery

Yeah. The last item on here is communication. So I think this is extremely critical in educating your workforce, educating the different parts of the organization around what DevOps is, talking to our business partners, explaining the differences that they're going to see from a DevOps perspective and what the benefits are. So we spent a lot of time building a communications plan that we rolled out.

Another thing that we did early in the journey is we actually brought Gene Kim in to our Phoenix location and had him for a day, and really got a lot of benefit out of him talking about what other horses were doing, if you will. And that was kind of a good jump start to our journey.

Lee Barnett

And when you look at DevOps, there's lots of frameworks out there. CAMS, CALMS are the general ones that people look at. We looked at frameworks. We use frameworks loosely. But you have to adapt to what your organization needs. Don't just take something by the book and go with the book. Work out what you're trying to deliver. What's the end game for you? What are the outcomes you're looking for?

Use some of the knowledge in the frameworks. It may work all the way through, but for American Express, we adapted as we went through and built our own version of the DevOps framework.

The other thing that we found when it comes to discussion is what you do around metrics. How do you measure your success? And in the early days, we talked a lot about maturity. Maturity brings with it, eventually, a view to how do I game the numbers. So we're much more focused now on proficiency and helping people along a journey, recognizing that some are starting, some are towards the end, but there's all different areas and opportunities, whether it's DevOps in total, it's automation, it's Agile. So now we have proficiency metrics as an approach to what we do.

Chad Avery

Yeah. And speaking of metrics, I'll talk a little bit about the results we're actually seeing as we go through this journey.

When we set out on the program, we really looked at four key areas that we wanted to measure ourselves against: execution, quality, efficiency, and availability. Some of this might be difficult to see in the back, but I'll talk a little bit about each one of them.

I mentioned that we're working with 42 technical platforms right now, and that spans about 200 application teams. When we look at quality, what we were trying to measure is incident rates. So we take a baseline early in the journey, and then we measure against that on a month-to-month basis as the teams go through the DevOps journey. We right now are seeing about a 5% decrease in incidents. So that's okay. I don't think that's quite where we want to be. I think we'll see more as time goes along. But we are seeing the incident count go in the right direction.

One of the biggest increases that we've seen is around problem resolution rates. So we have a tendency at American Express to have problem tickets that are generated after several incidents that would go into a backlog and not get worked. And they would stay open for months and age. On the teams that we've taken down this journey, we're seeing 149% increase in problem resolution rates after the switch. So that's really, really encouraging.
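The baseline-then-monthly measurement described here comes down to a percentage change against the baseline. The sketch below shows that calculation. The raw counts are invented purely to reproduce the quoted percentages (a 5% decrease and a 149% increase); the talk does not give the underlying numbers.

```python
def percent_change(baseline: float, current: float) -> float:
    """Percentage change of `current` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) * 100.0 / baseline


# Hypothetical monthly counts, chosen only to match the quoted percentages.
baseline_incidents, current_incidents = 200, 190
baseline_resolved, current_resolved = 100, 249

incident_change = percent_change(baseline_incidents, current_incidents)
resolution_change = percent_change(baseline_resolved, current_resolved)

print(f"incidents: {incident_change:+.0f}%")
print(f"problem resolutions: {resolution_change:+.0f}%")
```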

Our senior leadership team also said that DevOps can't cost more money than what we're spending today. We purposely did not set cost savings targets for DevOps. But at a minimum, it has to be cost neutral. And so far, we've been able to hold the line on that as well.

And then finally, and probably the most interesting stat on this slide, is MTTR. When we set out on this journey, we fully expected to see a dramatic decrease in the mean time to restore, as the idea is, as the ops and dev teams work closer together, when there is an issue, you should be able to resolve it more quickly.

On the teams that we've taken through this journey, we're actually seeing an increase in MTTR so far, which is not exactly what we expected. I mentioned that technical debt is kind of a problem at the company. Our hypothesis is that we had actually artificially lowered our MTTR by keeping incidents around that were really, really easy to close, like these Whac-A-Mole type things that Lee mentioned earlier.

So one of the first things that happens as these teams move to a DevOps grouping is you get rid of that noise really quickly, and what's left are the tickets that are harder to close. So we think that's why we're seeing this increase. I will say that this number, it's a 5% increase now. It actually was higher a few months ago, so I think this bubble is starting to wear down, and we're starting to get a little more efficient.
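The hypothesis is a mix effect: an average can rise even though no individual incident takes longer to restore. A small worked example, using made-up restore times rather than any data from the talk, shows how removing the quick repeat incidents pushes the mean up.

```python
from statistics import mean

# Hypothetical restore times in minutes; not data from the talk.
easy_repeat_incidents = [10, 10, 10, 10, 10, 10]  # recurring, quick to close
hard_incidents = [120, 180, 240]                  # genuinely difficult

# Before: the quick repeat incidents dilute the average.
mttr_before = mean(easy_repeat_incidents + hard_incidents)

# After: the repeat incidents are fixed at the root, so only hard ones remain.
mttr_after = mean(hard_incidents)

print(f"MTTR before: {mttr_before:.1f} min over {len(easy_repeat_incidents) + len(hard_incidents)} incidents")
print(f"MTTR after:  {mttr_after:.1f} min over {len(hard_incidents)} incidents")
```

In this toy case the incident count drops from nine to three while the mean restore time rises, so MTTR is easier to interpret alongside incident volume than on its own.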

Lee Barnett

So what you'll see from these numbers is, yesterday morning when Gene kicked the conference off, he shared where the DORA research is showing the industry's numbers going. Our numbers are moving in the right direction generally. So it's a journey. What Gene and Nicole are talking about is a four-year journey. We're still in the early stages of moving everything towards this.

On the 5% increase one, I made a point earlier on, we're doing four times as many changes now as we did four years ago, five years ago. We have more incidents because we've got four times as many changes, but the impact of those incidents is significantly less. You haven't got a three-month or six-month or even an 18-month deployment that goes in and causes major issues. What you do have to be aware of, though, is that lots of small incidents can also hurt the business, and beware of death by a thousand cuts. But just recognize, as you put more changes in, you're quite likely going to get more incidents, but the impact is less.

Chad Avery

So there are a lot of words on this slide, and I'm not going to read all of them. But I'll tell you one of the things that we did as we went down this journey is we tried to collect lessons learned from the teams that are going through the process, so that we can take those learnings and apply them to other teams as we start.

I'll start at the beginning. One of the first things that we've found that's been very, very effective is getting the product owner and the business partner involved early in the process. And like I mentioned earlier, telling them what they can expect, what they're going to get out of DevOps has been really, really key.

Additionally, a lot of our teams that are going down this journey have decided, as we reorganize, to have a centralized quote, unquote, "DevOps team" that's responsible for production support, is responsible for things like automation, performance engineering, that kind of thing. And that's been very successful in the teams that have done the best.

Lee Barnett

And people at Amex have heard me for years now say automate. As much as you automate, you also need to think about the automation itself and connecting the automation together. You truly get at your pipeline when you automate the automation. So don't ever lose sight of that.

People say, "Yeah, we got Jenkins, we're automated." That's just one piece of the pipeline. You have to have front-to-back capability that's all connected to enable single click in there. So the process efficiencies, the culture. This is a mindset change more than anything else. The tools have existed for years, the capabilities have, but helping engineers, helping leaders, helping the business understand the benefits of the change is an absolutely vital part of it.

If you ignore that piece, the rest of it becomes very, very difficult because you're always trying to compete in a space where, "Well, I've always done it this way. I know I'll just about get there." This is a culture change.
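"Automating the automation" can be pictured as one entry point that chains the individually automated steps, so a single action runs everything front to back and stops at the first failure. The sketch below is a minimal illustration of that idea. The stage names and placeholder functions are hypothetical and do not represent the actual pipeline or tools used at American Express.

```python
from typing import Callable, List, Tuple

# A stage is a name plus an action that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order, stopping at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, action in stages:
        if not action():
            print(f"stage failed: {name}")
            break
        completed.append(name)
    return completed


# Placeholder actions; a real pipeline would call build, test and
# deployment tooling here instead of returning True.
def build() -> bool:
    return True


def unit_test() -> bool:
    return True


def provision_test_data() -> bool:
    return True


def regression_test() -> bool:
    return True


def deploy() -> bool:
    return True


stages = [
    ("build", build),
    ("unit-test", unit_test),
    ("provision-test-data", provision_test_data),
    ("regression-test", regression_test),
    ("deploy", deploy),
]

completed = run_pipeline(stages)  # the "single click"
print(completed)
```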

Chad Avery

Yeah. And then I'll talk about organization a little bit as well.

With a company the size of ours and with us having multiple development centers across the globe, co-location is very difficult. But where you can co-locate, we found that to be very, very beneficial as you go down this journey, as you can get your ops and your dev team closer together. That's really, really important.

Where you can't, take advantage of the tools that we have around video and audio and telepresence and things like that, because that tends to bridge that gap a little bit as well.

Lee Barnett

And so I think the key is, as you go through the transition, it's about thinking differently, lean thinking, enabling your teams to fail fast, and adapting as you go. It's a continuous journey.

Chad Avery

So in addition to the team learnings that we found as we went through this journey, we've also run into some, what I'd call kind of program learnings. And these are the things that we're still really working on, things that we haven't solved so much for.

The first one, and probably the biggest one, is separation of duties. As a financial services company, obviously we are very highly regulated, and we have to be very specific on compliance rules and things like that. So we are working very closely with our information security and our compliance teams on making sure that we have the right solution for production access controls.

Listening to the Capital One guys this morning, they seem to have it all figured out, so maybe we'll just ask them. But for us, it's a little more difficult.

Lee Barnett

They know things.

So we're trying to work through that. And yeah, I talked earlier on about tech debt and the challenge of balancing the features against the technical demands. It's a journey. Just recognize that going in. There's no easy answer to it. It's something you need to talk about. It's something you need to work with your product owners on.

Chad Avery

Yeah. And the last item up here is really around culture. And, gee, I'm having a hard time with this thing. It doesn't like me.

One of the pushbacks that we've heard as we went on this journey is our development team not wanting to be on call. And I get it, you know what I mean? I'm an operations guy, and nobody likes getting woken up at 3:00 in the morning. But having that shared sense of ownership is critical to making this whole thing work.

So we're working through with our development teams on how to really explain this culture change and make it palatable to the development team who may say, "I'm past that. I was a production support guy early in my career, and I've made it past that. I don't want the pager duty anymore." Right, so.

Lee Barnett

You're looking at me, aren't you, when you say it?

Chad Avery

I am looking at you.

Lee Barnett

Okay. All right. Fair enough.

Chad Avery

It's your fault.

Lee Barnett

So these are just the three key areas. There's a whole bunch more behind this, but these are the three that Chad and the team are really focused on for this year.

So what's next? This is a continuous journey, and this slide really just sort of pulls together the five key areas we're looking at: scale, integration, proficiency, model, and roles. It's an ongoing journey. I talked at the beginning about people, process, and tools, and it continues to be about those three key components. You need to focus on all three.

You need to continue with the training and the development. Scaled Agile's been an area that we're pushing lots of people through this particular year. Tool chain, we continue to invest, develop, implement, and integrate the tools in there.

And the roles. Site reliability engineer. We've changed some of the organizational construct around this, brought in some new roles that we recognize are key to us being able to continue delivering in this space.

Chad Avery

Yeah. And the other thing I'd say, as we look to 2018, our approach for DevOps is kind of to do a deep and wide approach. I mentioned that we're working with around 42 technical platforms. We expect next year to expand that out to a larger portion of the organization, while at the same time using our centralized DevOps team to go deeper into some of the more advanced DevOps maturities and try to kind of pull up those teams in terms of how they're doing their DevOps practices.

Lee Barnett

Cool. And that's it. Thank you for listening. Glad you enjoyed the graphics. We worked hard on those. But enjoy the rest of the conference. Thank you.