Monorepos, Mainframes and Modus Operandi

Las Vegas 2019

Senior Manager of Infrastructure and Application Architecture · American Airlines

We had a legacy application with nine development teams that all needed to build at the same time for a successful deploy. Our challenge was moving to a 'super build', in which we tried to combine all development into a 'monorepo', including the challenges posed by our legacy mainframe.

Full transcript

Philip Knezevich

So the other day, I was looking at how many lines of code I had written in an open source project that I was working on in React Native. It's kind of an obsession of mine coming back from college days, when I would do word counts on theses. I was very fussy to see how many lines I had gotten.

The number came out to 6,000 lines of code. But with all the accumulated packages, components, and modules downloaded, it actually came out close to about 19,000 lines of code. My reaction was subtle: I decided that was pretty good and got my family involved in the celebration. But seriously, as part of the open source community, you inherit everyone's code. It becomes your responsibility. You fork and check stuff back in, so you want to make sure you do things right. Tongue in cheek, it's 6,000 lines, but I like to think of it more as 19,000.

On the other end of the spectrum, the largest source repository in the world is built in a monorepo. It's by Google, and it contains two billion lines of code. Two billion. It seems like the stuff of myth and legend. They have close to three and a half million commits. It's 86 terabytes in file size. It is monstrous. Fifteen thousand developers have worked on this over the last 10 years. We get nervous when there are 15 developers and one of them leaves, and what might happen with the code, let alone a thousand times that amount.

Companies like Uber, Airbnb, Babel, and Twitter all use a monorepo as well. The idea was brought to me internally by an architect in my group who suggested, "Hey, we could do this at American Airlines. We could build our own monorepo." From reading about it, the technologist in me got giddy and excited, and I thought, "Yeah, we could probably do this." But it's just another itch in my DevOps scratch, and I'm sure a lot of you have them. We wanted to solve things like code reusability. We wanted to do package management better. We wanted to do cross-project changes better. We wanted to simplify dependency management.

You're probably wondering why I picked three subjects as one. At the core of everything we do is our modus operandi: simple Latin vernacular for mode of operations. You all have a modus operandi at your company. Over time, you've built procedures and patterns and practices, and some of them may have lived for a long time. We're in the mainframe world, and some of our practices have lived back to the '70s, which is crazy. But we are trying to modernize on top of that. We have other initiatives; the monorepo is one, and there are several others. Then, at the exosphere, at the top of all of this, comes DevOps. There is the technology side of DevOps, and then there is the cultural side of DevOps. It's almost like a chemical reaction when DevOps mixes itself with processes that have been around for a long period of time. I want to tell you a story about the things we went through to try to mix these worlds together.

Before I get into the talk, a little bit about myself. I'm Philip Knezevich. I work in the maintenance and engineering department of American Airlines. I've been there 10 years and am originally from Australia. I've been told my accent switches between the two: over there they think I'm American, over here they think I'm Australian, or something in the middle. I have a developer background, and yes, I have written more than 6,000 lines of code. But I'm scratching my head because I came from the Microsoft world, where we had fancy GUIs and nice flashing things, and now we're back into CLI. Git, Node, Kubernetes, Yarn, Homebrew: all these initiatives have gone back to CLI. Software engineers in the group may think I'm crazy, but it really is a change for me to understand. I feel like I'm a generation back from that.

A little bit about American Airlines. We are the world's largest airline: largest by flights, destinations, employees, and fleet size. We fly a staggering 6,700 flights a day. In that time, we need the utmost in security, safety, and reliability. We have to make sure our pilots and flight attendants get to each flight on time, which can be challenging when they have to connect. We have to conform to FAA, NTSB, and TSA mandates. We have to work around airport congestion, air traffic, and bad weather. We have to cater for many different user needs on our flights, and people are different. On top of all of this, we have to get you all there on time.

We fly to 350 destinations across 50 countries and five continents. Unfortunately, we don't fly to Antarctica yet, but if we ever do, I'll be the first one to go. We have 130,000 employees, 27,000 flight attendants, and 15,000 pilots. Our fleet size is in excess of 950 and growing. This is a deliberate $25 billion investment into new aircraft. An advantageous offering we have against our competitors is new aircraft. From a customer experience, being in a new aircraft is like being in a new car. You don't get the new car smell, but everything else is pretty much the same.

Internally, we made a shift to DevOps. We started with the hero developers and engineers: people who had been around the industry, had been to conferences like this, and were seeing the shift. They came in saying, "We need to do DevOps. We need to do CI/CD. We need to validate test cases, security, and infrastructure as code." The people in the company who didn't know about DevOps would smile and nod and say, "That's great. Go ahead and do it," without really enabling them. That had to change. Internally, our CIO, Maya Leibman, made a change to bring a DevOps-first culture to the airline. We now have an enterprise-level toolchain, and we have changed our project structures to make sure DevOps is more heavily baked in and coupled with Agile.

The industry is changing fast. In my area alone, we picked out five areas that caused problems for us. The first is that IT is disconnected from the business and vision. I think of developers on level four in the corporate office, headphones on, listening to Metallica, or now maybe Enya and relaxing music, with keyboards, two monitors, coffee, writing code. That code gets passed through groups, pass-the-parcel style, until it hits our end users. In my world, those users are mechanics, line maintenance staff, and stores clerks. They work in airports with jet noise, planes taking off and landing, machinery, and very different computer specs. Bridging the product experience from the corporate office to that environment is something we are trying to get better at by involving the business in our Agile process and improving customer feedback.

The second problem, which resonates with me more deeply, is that leadership tracks activities and not results. Initially, I was obsessive about making sure Agile rituals were followed: daily stand-ups, backlog grooming, acceptance criteria, all the things you do within Agile. Now I've learned that teams know themselves better than we do, thankfully, and know what works in their culture. So I step back and look at results. I use the magical MTTs you've all heard: mean time to debug, mean time to repair, mean time to implement, and variations like mean time to break-fix. We look at how fast we're tracking and how much we're improving from a time perspective.
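To make the "MTT" idea concrete, here is a minimal sketch of a mean-time-to-repair calculation. It is an illustration only, not the tooling used at American Airlines; the incident timestamps and the `mean_time_to_repair` helper are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average the (resolved - detected) duration across all incidents."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident log: (detected, resolved) pairs.
incidents = [
    (datetime(2019, 6, 3, 9, 0), datetime(2019, 6, 3, 11, 0)),     # 2 hours
    (datetime(2019, 6, 10, 14, 0), datetime(2019, 6, 10, 18, 0)),  # 4 hours
    (datetime(2019, 6, 17, 8, 0), datetime(2019, 6, 17, 11, 0)),   # 3 hours
]

mttr = mean_time_to_repair(incidents)
print(mttr)  # 3:00:00
```

Tracking the same figure per release or per quarter shows whether the trend is improving, which is the "results, not activities" view described above.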

The third problem was our project structure, which was fundamentally broken. Funding was typically done by divisions. Each division gets some money from finance and then has to figure out what to get done. Everyone knows that if their project is not approved this year, it may wait a year or longer. So the list becomes priority one, priority one, priority one, priority one, priority one, priority 10. Finance draws a line, and then everyone asks what about all the priority ones below the line. It's confusing. Now we're moving into the product model. We have a recurring and ongoing backlog, so we can keep priorities there and address what we see as most important.

Problems four and five are interesting to me. I facilitated a Lean Coffee session with 12 people from 12 groups. We got stuck on the topic that business values and IT values do not match, and all 12 agreed this was the number one problem in their group. One example was a spelling mistake on a simple HTML page. It could not get into production for two weeks because the change was bundled with technical debt changes and non-functional changes. Suddenly there was a release cycle and approvals. It took two weeks to fix a spelling mistake.

Cynicism in IT has been around for a while. I've collected these memes many times over. One that gets me is when PMs and managers ask developers for estimates and then use them as deadlines. The conversation goes: "How long is it going to take?" The developer defensively says, "If I had a requirements document, no unplanned work, and that gun developer helping me, plus or minus 30%, three months." The PM says, "Okay, three months. Good." Then that's what they use. It's crazy, and both walk away feeling awkward.

Internally, we're trying to rectify these problems. In my area, business people, such as line maintenance folks and airport maintenance and engineering staff, typically interface with a UI. Behind the scenes, they don't care what happens. We typically go to a SOA layer, then to an abstraction team, because we don't allow direct access to our mainframe. The problem is that each of these silos has its own team, managers, priorities, and deadlines. The SOA team has to deal with other groups that also have their own managers, teams, and priorities. Those groups push back because they are working with other app teams. Lines of convolution and messiness sprawl all over the place. Working with the mainframe team, which is also working with all these other app teams, becomes confusing.

We had to think differently. There is a notion of moving from projects to products. What we did was take resources out of those groups and create a product team. We're still in our infancy, but fundamentally it has been described as a "get out of my way" team, a "get out of my way" architecture. It's a small business model. We have the resources we want from each group, incubated, designed, and self-sustainable. My product team has it all. We're trying to bring speed to delivery. We have also included architects, SREs, QA, and UX as part of these product teams. One challenge is that, as we move to DevOps, product teams have their own multiple pipes, and we also have a monorepo.

Now, a little about the mainframe. I say this fondly: they are not really the black sheep of the family. If you search for DevOps, you get many examples in JavaScript, Java, and .NET. Mainframe, not so much, because they've been around since most of you were born. But there is a lot of work being invested into DevOpsifying the mainframe, and we went down that route as well. We partnered with Compuware. The first thing we did was get our source control into a proper repo. We're using ISPW for source promotion and build. We use Topaz TotalTest to execute and validate test cases. There is analysis and gate-checking before we promote to our integration environment, and ISPW helps with that. We started making sure we had the right checks and balances in the mainframe COBOL code we were writing.

Where this became challenging is that the mainframe is a tightly coupled dependency to other systems. If a field changes in the mainframe, downstream systems have to react. We wanted a way to put that into our Jenkins pipeline and bundle it together. For that we use XebiaLabs XL Release, which orchestrates JavaScript, .NET, Java-level code, and mainframe code together so it builds as one.

Let me change gears and talk about toothpaste. Anyone with kids knows they squeeze toothpaste from the front and throw it away even though there is more left. When we were writing test cases, we had a similar feeling. The mainframe is not an object-oriented framework. COBOL is prescriptive. Mainframes rely on three-letter transaction codes that interact with MFS, the Message Format Service, which I think of as a glorified parameterizer. It interfaces with PSBs, a logical IMS data structure represented as a program. At first, we were writing tests and squeezing at the front; the tests came out okay. Then we hit conversational transactions: you send input to the mainframe, it sends a result, you make a decision, send it back, and this can happen a few times through a decision tree. We realized many test cases were not inclusive of the use-case scenarios and decision forks being made. The tool that was meant to be a slave for us became more the other way around.
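One way to picture the coverage gap in a conversational transaction is to enumerate every path through its decision tree and compare that with the paths the tests actually exercise. The sketch below assumes a made-up three-step conversation; the step names, the `TREE` structure, and the `all_paths` helper are hypothetical and are not real transaction codes or part of any Compuware tool.

```python
def all_paths(tree, node, prefix=()):
    """Yield every root-to-leaf path through a decision tree."""
    path = prefix + (node,)
    children = tree.get(node, [])
    if not children:
        # A step with no follow-up decisions ends the conversation.
        yield path
        return
    for child in children:
        yield from all_paths(tree, child, path)


# Hypothetical conversation: each key is a step, each value lists the
# possible next steps depending on the decision made.
TREE = {
    "lookup": ["found", "not_found"],
    "found": ["confirm", "cancel"],
    "not_found": ["retry"],
}

# Paths the existing test cases cover (assumed for illustration).
tested = {("lookup", "found", "confirm")}

untested = sorted(set(all_paths(TREE, "lookup")) - tested)
for path in untested:
    print(" -> ".join(path))
```

Here only the happy path is tested, so the two remaining forks show up as untested, which is the "squeezing from the front" situation described above.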

You can automate and make things happen faster, but the same problem remains. Once you invest in automation, you can do things quicker, but you may miss many things. We partnered with DXC; Norm Wall has put a lot of this together. We had to think methodically about what to put into our load-testing tool. The fundamental story is: don't automate too much with fast, fancy tools. For us, we could have bought a $120 toothpaste extractor, but it is limited to certain forms of tube, is harder to clean and maintain, and has unnecessary dependencies because it is embedded in a wall. We felt that way about a lot of testing automation thinking, such as the old Mercury LoadRunner days when you had to isolate your own environment, get test data, and use an environment whose purpose was to bring it down through capacity testing. We had to think about what we were actually trying to do with testing.

In the end, the $8.99 tool from Amazon, the little glider that comes with the back, is more metaphorical for what we were trying to do. Take a step back. Look at the process. Promote inner-sourcing, open trust, and collaboration between developer communities. We actively try to get mainframers to speak to our technical people. We are promoting the idea that test coverage becomes integral to what we do, and trusting the fail-fast process of DevOps. Every developer's number one thing is, "Let's get the code into production. I don't care about test cases." But failing fast means you learn quicker, and you don't learn from production mistakes; you fail in the environments that get to production. We had to bring that mindset back. Through DXC, we save about 30 hours per build via automation.

The overall stack has two sides. On the mainframe side, we use ISPW and Topaz, and in application operations intelligence we use Compuware and zAdvisor. On the traditional side, we use GitHub Enterprise, Coverity, and then it forks between ADO and Jenkins. XebiaLabs XL Release bundles it all together. We use Moogsoft to aggregate logging into one easy-to-view place, and Dynatrace for runtime or real-time troubleshooting and debugging.

Another change of gears: plumbing. Water comes into one pipe, hits your hot water, branches into two pipes, disperses around the house, and then goes out through one pipe. Software has evolved from TCP to SOAP services, then RESTful APIs, and now microservices. But the fundamentals are that you're building, compiling, testing, and releasing code. The monorepo is built with this in mind: you are here and need to get there, just like the plumbing, but everything in between can be handled in one go.

I tried to make a jingle for my team by substituting the word ring from The Lord of the Rings with build: one build to rule them all, one build to find them, one build to compile them all, and with DevOps bind them. It didn't catch on for some reason. Simply, the monorepo works much like a traditional SDLC. You sync your workspace to your repo the same way, except now you have everything. You write code the way you usually do, check it in, and review it. I'll go into that process more, because neither the check-in nor the commit is that simple.

Could we actually do this? I was asked that question, and I said, "Why don't we try a subset experiment?" We picked 15 apps. We decided on UI ones first because they were JavaScript and had a mature DevOps stack. We now have some Java and .NET in there as well. We compiled it with 25 shared packages intrinsic to American Airlines, things like connecting to flight services, cargo, and maintenance services. We have 90-plus third-party dependencies from the NPM world: Commander, React, Express, Async, Body Parser, Lodash, and Angular 7. We started with 70 developers. Some had been working in one product for 20 years and did not know about the whole other IT world around them. Suddenly we were saying they needed to bring this in and own everything we do. Our goal was one CI/CD pipeline, with targeted build and release.

The simplified monorepo view has apps sitting in subfolders as they always have. The difference is that node modules, build scripts, dist, libs, tools, maybe bin or etc folders, which typically existed inside each app structure, have moved into a shared space. App teams now reference them there and use them as necessary. I think of it as include files from the old days, though I've been told it's more complicated. If a module in the shared area is changed by app two, it may affect app four, seven, nine, and 14. App two has to ensure that what it did will not break those other apps. We check that code against everything else. If we had a more TDD-mature environment, this might be easy: run the test cases, and if test coverage is good, you're done. But many of these apps are legacy and don't know what TDD is. We had to do a lot to make the monorepo more fluent in dealing with these situations.
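The "app two changes a shared module and may break app four, seven, nine, and 14" problem is a reverse-dependency lookup. Below is a minimal sketch under stated assumptions: the dependency map, the package names (`flight-client`, `ui-kit`, `cargo-client`), and the `affected_apps` helper are all hypothetical, and a real monorepo tool would derive the graph from the source rather than a hand-written dictionary.

```python
from collections import deque


def affected_apps(deps, apps, changed):
    """Return the apps that depend, directly or transitively, on `changed`."""
    # Invert the graph: for each package, who depends on it?
    dependents = {}
    for name, needs in deps.items():
        for dep in needs:
            dependents.setdefault(dep, set()).add(name)

    # Breadth-first walk upward from the changed package.
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen & apps)


# Hypothetical graph: name -> set of things it depends on.
deps = {
    "app2": {"flight-client"},
    "app4": {"flight-client"},
    "app7": {"ui-kit"},
    "app9": {"cargo-client"},
    "ui-kit": {"flight-client"},  # a shared package using another shared package
}
apps = {"app2", "app4", "app7", "app9"}

print(affected_apps(deps, apps, "flight-client"))  # ['app2', 'app4', 'app7']
```

Note that `app7` is picked up only through `ui-kit`, which is why a change in the shared area has to be checked against more than its direct consumers.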

We started looking at tagging. Every application is attributed to certain files, and everything runs off the master trunk in GitHub. For targeted builds, we do cherry-picking. Application seven needs these modules and dependencies, and that creates the build for that app. We define the app through a JSON file. We have gotten smarter with tagging: we know which files affect which applications directly, so as soon as you edit a file, it automatically builds the associated apps.
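As a rough illustration of tagging files to applications, the sketch below assumes each app is defined in JSON by a list of path patterns and that an edit triggers a build of every app whose patterns match. The JSON layout, the paths, and the `apps_to_build` helper are assumptions made for this example; the talk does not describe the actual file format.

```python
import json
from fnmatch import fnmatchcase

# Hypothetical app definitions: which paths belong to which application.
APP_DEFINITIONS = json.loads("""
{
  "app7": {"files": ["apps/app7/*", "libs/flight-services/*"]},
  "app9": {"files": ["apps/app9/*", "libs/flight-services/*", "libs/cargo/*"]}
}
""")


def apps_to_build(changed_files, definitions=APP_DEFINITIONS):
    """Return the apps whose tagged paths include any changed file."""
    hits = set()
    for app, spec in definitions.items():
        # fnmatchcase's "*" also matches "/" so nested files are covered.
        if any(
            fnmatchcase(path, pattern)
            for path in changed_files
            for pattern in spec["files"]
        ):
            hits.add(app)
    return sorted(hits)


print(apps_to_build(["libs/cargo/manifest.js"]))         # ['app9']
print(apps_to_build(["libs/flight-services/client.js"]))  # ['app7', 'app9']
print(apps_to_build(["README.md"]))                       # []
```

A change under a shared library path fans out to every app tagged with it, while a change inside one app's folder builds only that app.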

It is scary because you don't know a lot about other people's apps. The number-one fear is to avoid using the shared area as much as possible. But there are brave people who say, "I want to upgrade from Angular 6 to Angular 7," and they have to do it for all the apps. That is where the value of something like this comes in.

From my old programming days, this was a stress: midnight, 6 a.m. deadline, someone making a massive change, and you think, "Don't break that build." Now it is not just one app. You're responsible for everything in the entire organization. That's where our check-in process has become very thorough. We queue several check-ins and analyze them before committing. Some take days. Some become so old that we abandon them because they have been superseded by new changes or a developer has a new check-in. We don't want to ignore things forever because they take too long, so we have been intensive about making code reviews more thorough.
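The queue of pending check-ins can be pictured as a triage step that sets aside anything too old or superseded before review. This is a sketch under loud assumptions: the talk does not give the actual rules, so the age limit, the "same author touching the same files" test for superseding, and the `CheckIn` and `triage` names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CheckIn:
    id: int
    author: str
    files: frozenset
    submitted: datetime


def triage(queue, now, max_age):
    """Split pending check-ins into (to_review, abandoned)."""
    ordered = sorted(queue, key=lambda c: c.submitted)
    to_review, abandoned = [], []
    for i, checkin in enumerate(ordered):
        newer = ordered[i + 1:]
        # Assumed rule: a later check-in by the same author on overlapping
        # files supersedes this one.
        superseded = any(
            n.author == checkin.author and n.files & checkin.files
            for n in newer
        )
        stale = now - checkin.submitted > max_age
        (abandoned if superseded or stale else to_review).append(checkin)
    return to_review, abandoned


now = datetime(2019, 6, 20)
queue = [
    CheckIn(1, "alice", frozenset({"shared/http.js"}), datetime(2019, 6, 10)),
    CheckIn(2, "bob", frozenset({"apps/app2/a.js"}), datetime(2019, 6, 17)),
    CheckIn(3, "bob", frozenset({"apps/app2/a.js", "shared/x.js"}), datetime(2019, 6, 19)),
    CheckIn(4, "carol", frozenset({"apps/app4/b.js"}), datetime(2019, 6, 18)),
]

to_review, abandoned = triage(queue, now, max_age=timedelta(days=7))
print([c.id for c in to_review])  # [4, 3]
print([c.id for c in abandoned])  # [1, 2]
```

Check-in 1 is dropped for age and check-in 2 because the same author submitted a newer change to the same file; the other two go on to code review.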

The result is that we consolidated all our development into one source of truth. That was the big win. It's all in one place now. We do code sharing from legacy apps, reuse, and simplified dependency management. There has been a massive increase in development teams, whether you like it or not. It is scary if an app has no developers available and no test cases and you are making a change into that app. But teams have learned to speak to each other more. We started with 70 developers and are at about 95 now. We're realizing the benefits of write once, use many: DRY versus WET, don't repeat yourself versus write everything twice. We're on the DRY side.

Result-wise, it is 330 times faster to deploy these dependencies. That is a big win for us. Anyone with technical debt or end-of-service-life upgrading can see this as a solution. There is a 250% increase in developer collaboration; we see them interacting much more. We gained 4,200 work hours by eliminating duplicate efforts in package management. Given there are around 1,600 or 1,700 work hours in a year, that is close to two and a half years.

Finally, one challenge is that TDD is a problem. We need more help with TDD. We are struggling with feature toggling. I wish I had more time to speak about feature toggling; go see LaunchDarkly if you can. We have our own internal solution. We still rely on companies that helped us with the monorepo, including Nrwl, ex-Googlers who specialize in the monorepo. They came in to help. Once we cut the cord with them, I don't know if we'll sink or swim. I don't think we're drowning, but I don't think we're swimming either; we're in the middle.

The biggest problem is that most of our projects are still in the traditional polyrepo. I have 15 apps in the monorepo; in my area alone I own 330, and around American Airlines in general there are thousands. Could we ever expand this to the larger organization and become a true monorepo like Google or Airbnb? I don't know. I'd like to. It's a pipe dream. It would require a lot of cultural change.

My final slide: three different topics, one destination. We are moving to one repo, one pipeline, which also includes the mainframe. That's our goal. If you're thinking about trying some of these things in your company, try to understand the right approach. Your teams have to be transparent. You can't have cowboy developers sitting in the corner saying, "I can do this better than anyone else," trying to prove they are better and doing it on their own. You need extensive collaboration, which begins with trust. My name is Philip Knezevich. You can get me at @bigapplepk. Please tweet me. I hope you enjoyed my chat. Thank you.