Improving DevEx at American Airlines
Join GitHub and American Airlines in an insightful discussion exploring American Airlines' journey of improving their developer experience, from tackling initial pain points like boilerplate tasks and cloud permissions to implementing solutions with Backstage. We'll discuss how these efforts have evolved, the current state, and the tangible benefits to the business, along with our future strategies. Perfect for developers, team leads, and stakeholders focused on enhancing productivity and outcomes.
Full transcript
The complete talk, organized by section.
Philip Holleran (host)
So today we're talking about improving DevEx at American Airlines, discussion with Phil.
Hey, I'm Phil. I'm the Field CTO for the Americas at GitHub.
Tim Haagenson
I'm Tim Haagenson. I'm the product manager for developer platform at American Airlines.
Conversation
Phil: So as we kind of kick things off, really the goal is just to have this be a conversation about a lot of the things that American Airlines has done over the past several years to improve and build on a great developer experience. As you kind of think about this, Tim — what would you describe as the goal of what you're trying to do for developer experience at American Airlines today, and where you're at?
Tim: That's a good question. I did put up a mantra on the screen that we used originally. This was started about four years ago when we first started looking at developer experience and developer platform. Our goal was to empower developers and boost their productivity. So at the time, four years ago, we had gone through what we were calling "delivery transformation" at the time. And part of that was shifting from old ways of working to new ways of working — to being able to just empower our developers and let them take control of their destiny, and try to help them boost their productivity along the way.
Phil: It's actually probably a good idea to take a look back at four years ago — right about now, four years ago, I think we were all kind of blissfully unaware of what was coming, in terms of just, well, developer changes, but also things that cause us to look at how we're operating the world a little bit differently. So what was it like to be a developer at American Airlines back in about like 2020 or so when you started to kick off this effort? And how has the process kind of evolved in terms of that thinking?
Tim: That's a great question. So it truly was a mixed bag. We had a few of our aa.com teams had had success with delivery transformation, and like switching to small-batch changes, like lean-manufacturing-style principles and things like that. And the company had said, "this is pretty amazing — let's see how we can scale this out and we can actually bring other people on board."
So we were transitioning away from CABs, the change advisory boards. I still remember one of my first reviews at American Airlines. I went to a CAB and I was like, "I've got Node.js. I wanna deploy this thing out to production. It's gonna gather some statistics about our customers." And they're like, "uh, no, no Node.js. No, we don't use that here. That's not approved." So it was a little rough. And I was like, "but we've already built this little thing that we can deploy, it'll be out today." And they're like, "no, no, no, no, you gotta go rewrite that in Java." So there were teams that were like that, and we were trying to transition away from that, to make the teams empowered so that they could actually deploy their code out the way that they saw fit and with the skills that made sense for that team.
Around the same time, American was starting to invest in dojos. We had attended this conference about that time — maybe six, seven years ago — and had said, "hey, you know, Target did this cool thing, let's try dojos." You'll see actually a picture — we call ours the Hangar. So like continuing on the airline metaphors from this morning, we stood up the Hangar. Actually, one of my compatriots in the audience, Ben, was the person who was leading the Hangar at American Airlines.
And we noticed within that Hangar, it was like, "okay, this is great. We're trying to teach teams these new ways of working, but we spend the first month and a half just standing up infrastructure and getting the team started." It was a real beatdown, having an entire team sit there while you just slogged through Cherwell requests and waited for people.
Phil: So during that six weeks, you're obviously doing other things in the Hangar with them at that time, while waiting for all the infrastructure to come online. What types of stuff were you doing there as they were preparing to actually get their hands on a keyboard and start doing work?
Tim: Yeah, so we were trying to teach them things like test-driven development. We were trying to teach them agile ways of working. For a lot of these teams, a normal deployment cycle would be every quarter or every six months. So it's like, how do you decompose their thinking around that and get them shifted around? So that's really what we were focusing on during that time period, while we're waiting on all of these other Cherwell tickets to get set up, while we're teaching them how to do pipelines. It seemed like every single team that we brought in had never heard of a pipeline. So it's like, okay, how do we work through that and how do we get you started and get you onboarded?
Phil: And all of that kind of relied on a fairly synchronous in-person way of working, which also got upended. So can you talk a little bit about — we all went through the quick change to remote work with COVID — but how did that specifically impact you and what you were trying to build out at AA?
Tim: It was very significant. So like Southwest said this morning, our revenue dried up overnight. We went from a very successful quarter to now having to figure out, "how do we save our airline from going into bankruptcy?" And what impact that had on our shared services teams and our coaching team was — the people that we were helping no longer needed help because they were doing something else. They're going back and trying to figure out ways to bring in revenue. We were doing creative things like flying additional cargo for COVID-19 masks and COVID-19 supplies and things like that. They were focusing on that, and then our coaches were like, "okay, what do I do? Like, how do I coach somebody who isn't here? I've only ever had success coaching them in person."
Phil: That compounds a lot of challenges. You've got the need to get up and running faster, which was identified earlier on. Then you've got the need to certainly start entering new areas of business pretty quickly, and the need to help all those application teams ramp up in a way that they've not worked before. How do you start prioritizing that? Or how did you all think about where to spend your time first? Because that's a pretty daunting set of problems to have to tackle.
Tim: Yeah, that's a great question. So around this time, oddly enough, the director of user experience at American Airlines came back from a conference and he is like, "I heard Spotify's doing this Backstage thing. I don't know much about it, but it seemed really cool — you guys should look at it." And he talked to our Hangar coaches about that. And they're like, "yeah, we've got time. Let's try it. We think this could accelerate our coaching engagements."
So we stood up Backstage. It took us a couple of months to get it stood up — you know, just the initial catalog and things like that. And it's like, "okay, how do we actually help people? How can we get them from where they're having problems at?" Because right now they're out in the operation, they're struggling through things. We're not really coaching them. Like, how do we do it? And we've got this new Backstage toy to play with, to see if we can get it going. We had a lot of different things we looked at.
Phil: So how did you go about gathering up — figuring out where those pain points were, and what did you end up picking as your initial problem set that you wanted to go solve?
Tim: Yeah, so our first thing was movement, if you're familiar with the seven deadly wastes. So we started looking at Cherwell tickets. Movement is one of those seven deadly wastes, where you're essentially swivel-chairing and handing things around, and things are moving around and they aren't getting worked on. So we said, "all right, what are these high-volume support tickets that we can work on, that we can automate for teams, and just bring down the amount of time that they're waiting?" And then we said, "okay, let's look at other places of wait times for the developers and see if we can work on that."
We did that a few different ways. The easy way was Cherwell: it has its own ticketing system, its own reporting system. So we just found the high-volume ones, and the ones that had long wait times. That was pretty easy. We went to the teams that own those processes, interviewed them, and said, "we will write the code for you. You don't have to be developers. We'll help you." We also started interviewing the software-development teams across the company, asking, "what problems are you running into that don't show up in Cherwell requests?" So those were kinda the first pieces of it, where we just started writing a few things here and there.
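The triage Tim describes, finding request types with high volume and long waits, could be sketched roughly as follows. This is a minimal illustration only: the `Ticket` fields, category names, and sample numbers are made up for the example and do not come from the ticketing system's reports.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Ticket:
    category: str       # hypothetical request type, e.g. "resize-filesystem"
    wait_hours: float   # time from submission to completion


def rank_automation_candidates(tickets):
    """Rank ticket categories by total hours developers spent waiting."""
    totals = defaultdict(lambda: {"count": 0, "wait_hours": 0.0})
    for t in tickets:
        totals[t.category]["count"] += 1
        totals[t.category]["wait_hours"] += t.wait_hours
    rows = [
        (category, agg["count"], agg["wait_hours"])
        for category, agg in totals.items()
    ]
    # Highest total waiting time first: volume and wait time both push a category up.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up sample data for illustration only.
sample = [
    Ticket("resize-filesystem", 504.0),
    Ticket("resize-filesystem", 480.0),
    Ticket("prod-access", 24.0),
    Ticket("prod-access", 30.0),
    Ticket("prod-access", 18.0),
]
for category, count, wait in rank_automation_candidates(sample):
    print(f"{category}: {count} tickets, {wait:.0f} hours waiting")
```

Sorting by total waiting time is one possible choice; sorting by ticket count or by median wait would surface different candidates.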
We also had some insights from company strategy and initiatives. Ross Clinton had literally just joined the airline and had some great ideas on things that we could do. He has this famous quote internally of "don't waste a good crisis," where he took the people that were supposed to be working on coaching and said, "alright, come up with cool things that we can do that's gonna drive the company forward."
Phil: A couple of deep-cut Target references there with the dojo, and then Ross, and a few others as well. So you talked about how some themes kind of emerged around the types of problems that you were looking to solve. I was wondering if you could talk a little bit about maybe one or two of those themes that came out that were really, really useful, and maybe something that you thought would be useful that turned out not to be?
Tim: Sure. So there are definitely a number of themes that came out of it. The tickets were a big one. We went through all kinds of random things. A lot of people ask me, "alright, I'm standing up Backstage — what did you guys work on first? What should I work on first?" That's always a difficult question to answer because at this point in our journey, we're four years into Backstage. We have 30 or 40 different plugins.
Some of our earliest plugins were things like extending Linux file systems. That was super important for us. We had a ton of Linux file systems that developers weren't allowed to touch without actually having multiple tickets. It's like — one ticket to get to the ops team, the ops team puts in a ticket with the outsource team, and then three weeks later you would get your Linux file system resized. So that's one of the first things that we tried, that we were able to do.
We also tried, early on, we ended up making some changes to the way that PE teams would deploy. We set up Azure App Services as our core deployment platform, four years ago. I'm not sure — I don't know if you guys are familiar with Azure App Services. It's okay. It's not great. It's about the same as Elastic Beanstalk. It's not a way to run an airline. We had to transition out of that. We failed miserably on it, and we learned about a year in where it was like, "oh no, we have to undo this. We have to stop forcefully pushing teams onto Azure App Services." So you learn quite a bit along the process while you're building these things out, and like you're just listening to your customers and trying to make it as easy as possible for them.
Phil: So we started cutting away at that six-week wait that you had for infrastructure here, as you're kind of coming on down — and then eventually start hitting these challenges of actually having to stand up those environments for those developers and help them kind of get that experience that they're looking for. What did you end up coming across as that tipping point, or that realization that helped you start really chipping away at that?
Tim: It's an interesting question. So a few different things. We had also done DORA metrics and things like that. DORA metrics are still a key component of what we're gonna focus on. But as we're transitioning out of App Services, it's like, "okay, let's focus on something more holistic, more complete for the company." We ended up settling on Kubernetes. It's a fairly complex Kubernetes installation that we have. We started off simple, but it got complex over time.
And we realized that "I've got 3,000 people to train on Kubernetes — it's gonna take me a decade. I can't train 3,000 people on Kubernetes." So we started creating our own custom CRDs. Even Google told us, "no, don't do that." They're like, "it's a bad idea." We disagreed with them, and thankfully we were right. We built our own custom CRD, so that when you deploy to production — or any of our shared-services environments — you have a YAML file that's 20 or 30 lines. It's unheard of for Kubernetes manifests. Most people have like dozens of manifest files, hundreds of lines per manifest file. So that helped us to focus on simplifying it for the users, into something that they can understand.
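To illustrate the general idea of a short, developer-facing spec standing in for many lines of manifests, here is a minimal sketch that expands a small dictionary into a standard Deployment and Service. It is not American Airlines' CRD: the spec fields (`name`, `image`, `port`, `env`, `replicas`), the defaults, and the app name are all hypothetical, and a real custom resource would be reconciled by a controller inside the cluster rather than by a standalone function.

```python
def expand_app_spec(app):
    """Expand a short, hypothetical app spec into standard Kubernetes manifests."""
    labels = {"app": app["name"]}
    env = [{"name": k, "value": v} for k, v in sorted(app.get("env", {}).items())]
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app["name"], "labels": labels},
        "spec": {
            "replicas": app.get("replicas", 2),  # assumed platform default
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": app["name"],
                        "image": app["image"],
                        "ports": [{"containerPort": app["port"]}],
                        "env": env,
                    }]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": app["name"], "labels": labels},
        "spec": {
            "selector": labels,
            "ports": [{"port": 80, "targetPort": app["port"]}],
        },
    }
    return [deployment, service]


short_spec = {
    "name": "baggage-tracker",  # hypothetical app
    "image": "registry.example.com/baggage-tracker:1.0.0",
    "port": 8080,
    "env": {"LOG_LEVEL": "info"},
}
manifests = expand_app_spec(short_spec)
print([m["kind"] for m in manifests])
```

The point of the pattern is that platform-wide defaults live in one place, so the file a developer writes only carries what is specific to their app.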
And then also being able to focus on bigger things across the company. Because now that they're on these shared services, I can focus on security on the shared services. It's like within my service mesh, you only have HTTPS traffic. Our critical baseline controls — that used to be called Baker's Dozens — that all teams had to focus on. Our CISO had set that up, and teams struggled to do these. So it's like, "alright, you deploy on my platform, I handle like eight of them for you. The other five is all that you have to do. I've got the rest for you — just come here and I can do it for you."
Phil: One of the interesting things I think that you've done throughout your program is also implement a lot of design-based thinking, in terms of how you've approached this. It's interesting — you called out having to train up a bunch of engineers on Kubernetes. We also heard that on stage a little bit earlier from the folks at Deere. Thousands of lines of code and YAML is kind of inherent to maybe an engineer or a coder's way of thinking, but not necessarily designing an entire solution, or kind of thinking about what that end-user experience might be. I know that I'm not great at it, because apparently I know what's going on under the hood. How did you kind of shift that approach into taking a design-first, double-out approach to problem-solving?
Tim: Yeah, it's a great question. So about two years ago, two and a half years ago, we found that we were building really cool systems that no one could figure out how to use. We started interviewing our customers. We calculate a SUS (System Usability Scale) score based off of interviews. It's similar to a Net Promoter Score, and it was in the tank. It was awful. So it's like, we have these really cool things: people that understand it like it, people that don't understand it hate it because they can't figure out how to use it.
So at that time we worked with our leadership team and said, "hey, we need a designer. Not just a UX person, we need somebody who truly cares about design and how people are gonna use it." The designer we hired brought in new ways of interviewing our customers. She brought in simple things like click tests.
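For readers unfamiliar with SUS scoring, the standard System Usability Scale questionnaire has ten items answered on a 1 to 5 scale, and one respondent's score can be computed as sketched below. The talk does not say whether the team uses the standard questionnaire or an adapted one, so treat this as the textbook calculation rather than their method.

```python
def sus_score(responses):
    """Score one respondent's ten SUS answers (each 1-5) on the 0-100 scale."""
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("SUS needs ten answers, each an integer from 1 to 5")
    total = 0
    for index, answer in enumerate(responses):
        if index % 2 == 0:      # items 1, 3, 5, 7, 9 are positively worded
            total += answer - 1
        else:                   # items 2, 4, 6, 8, 10 are negatively worded
            total += 5 - answer
    return total * 2.5


print(sus_score([3] * 10))      # all-neutral answers
print(sus_score([5, 1] * 5))    # best possible answers
```

A team-level number is usually the mean of the per-respondent scores.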
That's one of the most humbling experiences of my career as a product manager. I was certain we had a feature. I was certain, I was like, "this thing needs to go right here." I was certain of it, absolutely certain of it. We were bringing on a new feature like making Argo CD deployments easier to manage and things like that. And she said, "okay, let's prove it." So she created a click test. She brought up the screen, she sent it out to a few hundred people. We got like 40 or 50 people who responded. And she came back to me with the results a week later, and she says, "well, good news. We found where people expect this setting to be. The bad news is, only one person thought it was where you thought it was gonna be." And I'm the one who clicked it.
One of the most humbling experiences of my career, that further reinforced that hiring her was the right decision. And she's continued to do it. We're redesigning our documentation experience right now to be based off of people's learning styles. I never would've considered that. She's like, "I've interviewed people, I've been able to boil it down to these two sections. Here's the type of things that we need to include so that it's easy for people to learn about the platform. And oh, by the way, when we do this, their scores of satisfaction go way up."
Phil: I think that's one of the big things — transitioning to thinking about developer experience as a product, right? Sitting on the vendor side, from any of you that are paying GitHub money, we were constantly hearing your feature requests: "I want this to be able to do this exact thing right here." And usually it's like eight people telling us eight different things in the same way. And I'm glad to hear that we're not the only ones hearing that — that you're hearing that from your own internal teams too.
But kind of thinking about some of those bigger wins and places where you had a chance to really make an impact — there were a couple I know that you have had some really big wins on. Might want to share a few of those here.
Tim: Absolutely. So I've got a video that I'm gonna show you here in just a moment of these. This is our home-grown Backstage instance — we call it Runway. We love talking about it. You've probably seen videos of it, if you've heard of us.
There are three areas that I wanted to talk about today. There are many — we have dozens and dozens of plugins.
The first one was Cloud Elevated Access. This is our number one plugin. Within American Airlines, if you want access to production, you're not allowed to do it by default. So literally no one in our Azure instance has access to production except for about five different administrators of the overall cloud. And they wouldn't know what they're doing anyway, so they don't mess with people's production stuff. So if you want production, how do I get access? It used to be a ticket, and you would wait and you would wait and you'd put it in preemptively. If there was a SEV call, you'd have to wake somebody up. It was annoying. Or, I just try to guess that I might need production access tomorrow or in two days. Yes. So we built that into Runway. We now have a plugin where you can go in, we detect that you have the rights to that area, and you can get cloud elevated access for four hours. After four hours, we take your access away. Within the last year we've actually enhanced it to where now you're required to also put in a change-management ticket — which is still automated. You can do that and you have to link up that change-management ticket for SOX compliance and things like that. But it's completely automated that you get access when you click the button. What'd you do in there, why'd you do it, all that fun stuff.
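The shape of that flow (check entitlement, require a linked change ticket, grant access that expires after four hours) can be sketched as follows. This is only an illustration under assumptions: the `Grant` type, the `entitled_scopes` lookup, the scope name, and the ticket ID are hypothetical, and the real plugin talks to the cloud provider and the change-management system rather than an in-memory dictionary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

ACCESS_WINDOW = timedelta(hours=4)  # the four-hour window mentioned in the talk


@dataclass(frozen=True)
class Grant:
    user: str
    scope: str
    change_ticket: str
    expires_at: datetime

    def is_active(self, now):
        """A grant is usable only before its expiry time."""
        return now < self.expires_at


def request_access(user, scope, change_ticket, entitled_scopes, now):
    """Issue a time-boxed grant if the user is entitled and linked a change ticket."""
    if scope not in entitled_scopes.get(user, set()):
        raise PermissionError(f"{user} has no rights to {scope}")
    if not change_ticket:
        raise ValueError("a change-management ticket is required")
    return Grant(user, scope, change_ticket, now + ACCESS_WINDOW)


entitled = {"dev1": {"payments-prod"}}  # hypothetical entitlement data
start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
grant = request_access("dev1", "payments-prod", "CHG-0001", entitled, start)
print(grant.is_active(start + timedelta(hours=3)))
print(grant.is_active(start + timedelta(hours=5)))
```

Storing the ticket ID on the grant keeps the audit trail (who, where, why, until when) in a single record.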
The next one's API Management. This is one of our very first plugins that we did. We had a problem with being able to have secure API communication across the company. This is one of our plugins that we're the most proud of, because we were able to teach another team that didn't know software development how to write React code and build components for them within React — to where this is probably one of our best plugins that we have in the platform, and is completely managed by a different team. So as a product manager owning a platform, I'm pretty proud of that — that we have such a vital thing that exists that another team is able to manage it within my platform. Even though they don't require us to do anything for them.
[Video of Runway — Cloud Elevated Access plugin]
Just going through getting access. We have a few things that you have to fill out — basically related to the information that you're trying to get to, or where you're trying to get access to. You choose the type of elevated access that you need. We actually have three tiers. Some of the tiers give you less access. And then you fill that out.
[Video — API Management plugin]
You can do anything from creating a new API that's managed by the system, providing access to clients that have requested access. You can also set yours to be auto-approved so that you don't have to do it if you've got nonsensitive data. And recently, this team — actually, when Phil and I were first talking about it, I go to look at the plugin, there were new features on it that I had never seen before. Which is always exciting as a product manager — to see another team enhancing your product and making it easier for your customers.
In this scenario, I can approve or reject people access to this app. I can look at the proxy analytics and just kind of see how things are going.
Phil: That's actually something that's kind of neat. One of the things that's been big there, and we haven't actually called it out formally in the questions, is just the number of contributors you have to this. You mentioned being a product manager, going in, and finding something had changed there that you didn't even know about. What's the community response been like to this?
Tim: Yeah, so just to give you an idea. We have about 3,500 software engineers at American Airlines. 150 are direct contributors to our Backstage instance, and then around another 50 own their own plugins. So we're at about 200 out of the 3,500 that have made core contributions to our platform.
And this last one that we're going through is our Secret Vault service. Essentially you can set up secrets through our platform to be managed and get access to your own secret vault. We worked a little differently with this team on this one. I assigned two software engineers to the team for a year, and they helped them build this plugin. It's not good enough for us to say, "hey, you get your own vault and you can manage it." We also built in configuration for GitHub OIDC, so that's automatically set up behind the scenes for them and their secrets are available to their repos. And we also did it with our shared Kubernetes platform, so that's set up behind the scenes for them too.
Phil: And that's where you're starting to get at some of this other exciting stuff that's kinda more near and dear to my heart, being inherently tied to software version control, things of that nature. It's great that anybody can go and spin up a repo or something, but if I don't have a place to develop it, test it, get it into production somehow, and start getting useful information back from it, it can still be daunting. You know, I may not be waiting on infrastructure, but I still need to know exactly what I need to do in order to get this thing production-ready. What have you all done to kind of enhance that experience for users, so that they can have that kind of similar magic experience of having a ready-to-use application?
Tim: Yeah, great question. So kind of breaking it down: we have permissions pass-through between GitHub and Backstage. All of our permissions are handled down range. It's about 50/50, and we use both: Azure Active Directory groups plus the permissions pass-through from GitHub teams determine who's allowed to do what.
We have software templates that are scaffolded with Backstage. Our software templates are quite complex. I'll show a video on it in a moment. It creates a repo. It gives you a number of shared actions plus reusable workflows that kind of enhance your experience and make it super easy. Kicks off initial releases within GitHub. So like we scaffold directly into GitHub and control a few things. So that by the time you're done with a software template, you have an application running in non-production, directly on our shared clusters. You have your repo, you have your CI/CD pipeline, and we are now also setting up ephemeral environments that watch your PRs and set up an ephemeral environment for any of your PRs. And then update comments into the PR and say, "here's your ephemeral environment, here's your URL." And it gives you some instructions on how you can now start linking repo to repo and have these ephemerals that are nested and chained.
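The per-PR naming and the comment posted back to the pull request could look roughly like this. The naming scheme, the `nonprod.example.com` domain, and the comment wording are assumptions made for the sketch; the talk does not show the actual format, and posting the comment would go through the GitHub API, which is left out here.

```python
def ephemeral_env_name(repo, pr_number):
    """Build a DNS-friendly environment name from a repo name and PR number."""
    slug = "".join(c if c.isalnum() else "-" for c in repo.lower()).strip("-")
    return f"{slug}-pr-{pr_number}"


def pr_comment(repo, pr_number, base_domain):
    """Render the Markdown comment that points reviewers at the environment."""
    name = ephemeral_env_name(repo, pr_number)
    url = f"https://{name}.{base_domain}"
    return (
        f"Ephemeral environment `{name}` is ready.\n\n"
        f"URL: {url}\n"
    )


# Hypothetical repo name and domain.
print(pr_comment("Baggage_Tracker", 42, "nonprod.example.com"))
```

Deriving the name deterministically from the repo and PR number means a later push to the same PR updates the same environment instead of creating a new one.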
Phil: That's some great experience. You're essentially building the next version of Heroku in your own environment.
Tim: We truly are.
[Video of software template]
This is just an example of one of our software templates. If you've seen some of our older videos, they are quite a bit different than what they used to be — because our designer has worked on front-loading things that caused people to abandon the process in the middle of the process. She's front-loaded all of those things up at the front with a nice description explaining the things that people would get hung up on, and letting them set their prereqs early.
And then as you go through this process, it doesn't take too long. This is slightly sped up, but it's only about 30 seconds shorter than the actual time that it took me to go through this. At the end of this, you end up with running code in non-production — which is one of the biggest things that we wanted. We went from six weeks to stand all this up, to now you're at about 60 minutes, fully round-trip, depending on which template you pick.
Phil: So that's a great greenfield experience for this. What is the experience like if I am one of those people working on like a 5-, 10-, 15-, 20-whatever-year-old application that's still responsible for making sure that my ticket's issued right, my baggage gets to where I need to be?
Tim: That's a good question. So our core platform is mostly for containerized apps — 12-factor containerized apps. We do have a lot of modular components that are useful on the existing brownfield apps. We also have some software templates that do a pull request into existing repos, and will bring in the files necessary to go to the shared platform, or bring in the files necessary to give you different CI/CD components and things like that, to make it as easy as possible.
That's the end of that video. This is just kind of what's happening underneath the hood. That's something I don't have to worry about anymore. Exactly. I don't have to scaffold these out — it's just getting all set up for me and dropping that information right in.
This is an example of our own custom CRD. Currently, if you look, line 15 is actually where most of the CRD ends. Everything after that is environment variables. So 15 lines of code sets you up with an ephemeral environment that auto-spins up on every single PR. There's also a little bit more behind the scenes, but as far as our customer knows, this is all they see.
We also gave an example here of our GitHub Actions. We are quite committed to bringing information to developers. Giving a centralized action isn't good enough. That action has to inform them where they are working, so they know what to do. So this is an example of our image-scan bot. It explains why they're seeing it on their PR. If there were vulnerabilities, you would see it. In this case, since it's a new one, there's not. But just as an idea. And we do this both for job summaries and for PRs, depending on where the person's working.
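As a rough sketch of the job-summary half of that idea: GitHub Actions exposes a file path in the `GITHUB_STEP_SUMMARY` environment variable, and Markdown appended to that file shows up on the run's summary page. The report wording, the finding fields (`id`, `severity`, `package`), and the image name below are hypothetical and are not taken from the image-scan bot shown in the talk.

```python
import os


def render_scan_summary(image, findings):
    """Render image-scan findings as Markdown, explaining why the report exists."""
    lines = [
        f"### Image scan for `{image}`",
        "",
        "This report appears because the pipeline scans every image it builds.",
        "",
    ]
    if not findings:
        lines.append("No known vulnerabilities were found.")
    else:
        lines += ["| ID | Severity | Package |", "| --- | --- | --- |"]
        lines += [f"| {f['id']} | {f['severity']} | {f['package']} |" for f in findings]
    return "\n".join(lines) + "\n"


def publish(markdown):
    """Append to the job summary when running in GitHub Actions, else print."""
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        with open(summary_path, "a", encoding="utf-8") as handle:
            handle.write(markdown)
    else:
        print(markdown)


publish(render_scan_summary("registry.example.com/baggage-tracker:1.0.0", []))
```

The same rendered Markdown could be reused as a PR comment body, which fits the point about meeting developers wherever they are working.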
Phil: That's really helpful. I think the emphasis on plain language and being descriptive is really here. If there's one thing — and we'll chat briefly about AI in a second — that we're seeing, is that the emphasis on natural human language, and just standard communication, really helps to break down a lot of barriers for folks, whether we're talking about it being English or non-English, or just being able to make sense of what's actually going on.
One of the big things that you've been doing as well is adopting some form of generative AI to help developers out. Whether it's taking that old application and making it into a 12-factor application that you can containerize to leverage a lot of these other benefits, or starting from scratch, how have you used this system to surface the benefits of that, and identify maybe where you've got some opportunities to improve?
Tim: Yeah, it's a good question. So this — I went ahead and brought up one example here with our Copilot report. So this is a Backstage plugin that we wrote that just hooks up to the GitHub APIs. We do plan on open-sourcing this — I don't know when, but at some point we're gonna open-source this for y'all that use Backstage.
It essentially just lets you know how many people are using it and what your acceptance rate is. We kind of track that over time. When we first started using it and didn't provide much training, our acceptance rate was around 15 to 18%. We started adding training, and our acceptance rate now usually hovers around 28 to 32%. The week that I chose happened to be about 28%. We do have about 700 people using Copilot internally. We expect that to grow. It's mostly been like, "hey, if you ask, you get access." In the future, we plan on signing up for various alpha programs that GitHub has, like custom skills and things like that, and building that out as well.
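An acceptance rate of this kind is typically the number of accepted suggestions divided by the number of suggestions shown. A minimal sketch of that calculation over a week of daily counts follows; the field names and the numbers are invented for the example and are neither the plugin's data model nor American Airlines' figures.

```python
def acceptance_rate(days):
    """Return accepted / shown suggestions as a percentage across all days."""
    shown = sum(d["suggestions"] for d in days)
    accepted = sum(d["acceptances"] for d in days)
    if shown == 0:
        return 0.0
    return 100.0 * accepted / shown


week = [  # made-up daily counts for illustration
    {"suggestions": 1000, "acceptances": 290},
    {"suggestions": 1500, "acceptances": 410},
]
print(f"{acceptance_rate(week):.1f}%")
```

Summing the counts before dividing weights busy days more heavily than quiet ones, which usually gives a steadier trend line than averaging daily percentages.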
Phil: So you've had four years to kind of really build this up, and I know that you're probably still scratching the surface of what you'd really like to do for your developers at American. So what's next on your list? What are you excited to build for your folks next?
Tim: So: further integrating security — we want that heavily in the platform. Better catalog management with Azure catalog processors. So we have a full service catalog in Backstage. Repository ownership managed with GitHub custom properties — that's pretty huge for us, so we can figure out who owns what. Securing our software supply chain with GitHub Sigstore. And accessibility is gonna be a huge focus for us. We found that we have a lot of people that use screen readers to use our app, and we're gonna be improving that for them to make it easier.
Phil: Thanks for sharing that all with us.