Shaping the Future of Travel with DevOps and Cloud
In one of the largest data centers in the world, with more than 37 petabytes of storage and over 16,500 devices, Amadeus processes more than 49,000 end user transactions per second and more than 47 billion SQL executions daily during peak times. This is all while performing more than 5,500 IT changes and more than 540 application software loads monthly and maintaining an average availability of 99.99%.
As Amadeus is a world-class technology company dedicated to the global travel industry, it needs to go even faster and do even more. The Ops teams and the 5,000+ developers are re-organizing into DevOps teams while the technology stack is morphing into the Amadeus Cloud Service. Part of the challenge is to seamlessly move 300K daily jobs that manage some of the mission-critical functions of the business.
This session discusses the management and cultural changes that are central to making this transition a success, along with some of the technology, such as OpenShift, Kubernetes and Docker, and organizational changes, such as Agile methodology, full CI/CD automation, infrastructure as code or jobs as code, that are crucial to success.
Full transcript
The complete talk, organized by section.
Damien Profeta
During this presentation, we will try to explain to you how we'll shape the future of travel with DevOps and cloud.
A short agenda is: we will explain who Amadeus is. We'll make a quick poll: who has ever heard about Amadeus? Really? Oh, it's fine. Oh, yeah.
So we'll make a little bit of history to explain where we come from, because it's quite interesting to know what was before DevOps and why there was something else at the start. What are the benefits and challenges? And then the conclusion about the key takeaways.
So first, who we are. I am Damien Profeta, expert in software engineering. And here's Alice.
Alice Albano
Alice Albano, software engineer in Nice.
Damien Profeta
Yeah. So as you can see, we are developers, so you get feedback from developers that want to be also ops. We could not make the ops that want to be developers also, but we discuss with them quite a lot, so I think we will represent them in a good way.
So Amadeus. Who is Amadeus? We are a leading provider of IT solutions to the global travel industry, so for its different ecosystems, meaning that basically, if you took a plane to come here, you used our services. But we are not only doing airplanes. We are also doing trains and hotels and things like that.
If we want to have an overview about Amadeus, it's 15,000 professionals, and we are in more than 190 countries. Just to give you an overview, 190 may seem a big number. The UN has 193 countries on its list, so it's basically everywhere. Maybe we have two or three countries where we are not there, but we are basically everywhere.
And we are there because people travel all around the world, and we need to have someone to help airlines or to help travel agents to make the business, to enable people to travel. Amadeus is used from searching the flights, so when you want just to know how much it costs to go there, up to when you go inside the flight, to carry the baggage, to make the procedure for the takeoff, and this all over the world.
So we have quite a number of sites in the world to make development, to be close to the customer. So you see, quite everywhere inside the world.
And what we are doing, basically, because we are not a travel agency, we are not an airline, so we are in between. We try to connect people. So the people that provide a way to travel, so the airline, the train, the airport, et cetera; the guys that are selling them, that make a package with the airlines, the hotel, and things like that; and the people that want to travel.
So this is the three parts that we connect together to give you a nice experience during your travel. This is called the distribution, to enable the connection between the different people that travel, and then to help each part. So to help the travel agent, to help the airline. We also have IT solutions to have revenue management, sales, inventory.
And so the main part was airport, airline, but we try to diversify into hotel and railways. So you can see that we are here since a long time, since the '80s, basically. As you saw in the keynote, even the whole company tries to move into the DevOps movement. And so we are also on this path, and we will try to explain why.
So first, a little bit of a quiz to see how big we are.
How many servers do we run? Knowing that Google doesn't publish its numbers, so the estimate is 2 million servers. And so, any idea about Amadeus, how many servers? 70K? Seventy? Yeah, so not so big: 7K.
About the number of transactions per second, knowing that Google Search is kilo-transactions per second. 170. We are at about half of Google Search traffic, and it's increasing every day.
About the number of lines of code. So we are a very big application because we connect many different parts together and many jobs to maintain because we make many reports. We have to send the customer list to make the boarding pass and things like that. And so 300,000 jobs that are maintained every day.
So a little bit of history. In the '90s, we started to make Amadeus. To give you an overview, in '91, it was the creation of the World Wide Web. So Amadeus is older than the World Wide Web, and we already enabled people to travel thanks to TPF, Transaction Processing Facility, a wonderful piece of hardware made by IBM, but very difficult to make evolve.
It's like a big switch that you can program, a big switch for the traffic, and only do transactions. So if you want to do batch, you cannot basically do it with TPF. It was really good because you can live-update anything. You can change the CPU without outage. But unfortunately, it was not easy to evolve from a code point of view.
So while we were having TPF, as it was a complex piece of hardware, we decided to divide the responsibility between two different kinds of people: the developer that will develop and deliver the new software, and the application manager that will ensure that TPF is running well, that makes capacity planning, that the data is really secure, et cetera.
So it was working quite well, but we forecast, with the internet, so with the World Wide Web, a huge increase of traffic and a really bad look-to-book ratio. So look-to-book is a kind of keyword in our industry because we get a fee when there is a booking. And when you just look for a flight, we reply, so we are processing the query to get you the different flights and the pricing, but we do not get any money with that.
So with internet, there are more and more people that just look at the flight but do not book. And so it was not good for the business. So we had to change the way we make computation to lower the cost of looking to a flight.
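To make the look-to-book ratio concrete, here is a minimal sketch in Python. The traffic figures, the booking fee, and the cost per search are invented for illustration; they are not Amadeus numbers.

```python
def look_to_book(searches: int, bookings: int) -> float:
    """Number of search queries processed for each booking made."""
    if bookings <= 0:
        raise ValueError("bookings must be positive")
    return searches / bookings


def margin_per_booking(ratio: float, fee: float, cost_per_search: float) -> float:
    """Fee earned on one booking minus the cost of the searches behind it."""
    return fee - ratio * cost_per_search


# Hypothetical figures, for illustration only.
before_web = look_to_book(searches=10_000, bookings=1_000)
after_web = look_to_book(searches=1_000_000, bookings=1_000)

# Same fee and same cost per search, but far more looks per book.
margin_before = margin_per_booking(before_web, fee=5.0, cost_per_search=0.004)
margin_after = margin_per_booking(after_web, fee=5.0, cost_per_search=0.004)
```

With the fee fixed, the only lever left in this toy model is the cost per search, which is the point made in the talk.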
So as I said, the complementary workload, so the job scheduling, was also very important at that time, and TPF was not able to handle it. And the maintenance was quite difficult, as was hiring people to code in assembly on TPF.
So we decided to make a new version of our software with Open Back End. That was how we called the new framework. But we were still keeping the differentiation between the developer and the operation because we saw no need to change it.
So this was based on open standards with TCP/IP, Unix, SQL. Almost everything was written by ourselves. So I was personally a developer on the enterprise service bus, the homemade enterprise service bus, or the homemade application framework.
We tried to make it as portable as possible through a service-oriented architecture, so really a nice design. But we kept the same operational model. And so it was running quite well for 15 years, but we wanted to do more, to do even better. And so it was not enough to handle the workload that we saw.
The issue is that we wanted to be closer to the client, so to get better response time. We wanted to cope with the traffic growth, which is huge. And we were not delivering fast enough.
Today, as it was said during the keynote, we have procedures not to break anything. I think that we had the experience, it was not Amadeus, but with BA last week, that when there is an outage in the air industry, it's a massive outage. So we try to not have such kind of outage. And for that, we have a massive procedure to test really well our development.
So we only have one load per week, but we want to improve, obviously.
So that's why the goal of the new platform that we want to build is to be able to make it run everywhere, in public cloud or in our private data center, to optimize the efficiency, to be able to load more often, highly scalable because we know that it will grow and will continue to grow, and to be able to use open-source standards, like everybody, because it's easier.
So we decided to change both the framework and the operational model. So this is called ACS, Amadeus Cloud Services, and with DevOps.
So the idea is that we want jointly, with global operation and R&D, to be able, using automation, to deploy everywhere our platform.
And the key point here to understand is that we wanted to do DevOps to be able to load faster. And while the discussion went on, we discovered that the only way to have DevOps working is to use the cloud technology, because otherwise you don't have enough automation to be able to let each team have its own responsibility and to manage everything from the development to the load. And so that's why we changed everything at a single time to have both DevOps and the technology.
So now I will let Alice explain to you about the challenge that it takes to go to DevOps and cloud at the same time, but also the opportunity of it.
Alice Albano
So first of all, I think it's working. Yeah, it's working.
First of all, we're going to talk about the benefits of DevOps philosophy. There are four main pillars: the technical part, the operational part, the cultural change, and the business value of it.
So from a technical point of view, I think it's pretty clear: the DevOps philosophy gives us an opportunity to use the local development and the local testing. And that's the dream, I think, of all developers. The possibility to have continuous build, continuous integration, continuous delivery, and deployment, that's the dream of all developers.
From an operational point of view, it's important because you're able to operate your platform and your application from everywhere. You don't care where the application is running, or if it's running in multiple data centers. You can operate from your own desk.
Then there's the standardization part, which is really important because it allows automation, the possibility to automate the deployment of the application. But also, which is really important for operation, the automation of the recovery, the capability to be able to fall back what you've loaded if it needs to be fallen back.
From a cultural point of view, the product ownership is really important because DevOps are empowered with their own product. They can follow it from the beginning till the end, and they have responsibility of their own products, which means that they are more engaged. The engagement is higher, and they are much happier because they can follow what they're doing.
From a business point of view, the time to market: you can deliver your product much faster. Then there is an improvement in communication and collaboration between teams and different applications. And since you don't have to spend too much time operating the platform, you've got more time in innovation.
But unfortunately, there are still challenges in moving to the DevOps philosophy. Damien said we are a huge company. We've got sites everywhere in the world. In some of those sites, we have developers and operations together, but there are other sites, for example, in Nice, we are only developers. In Erding, in Germany, we have mainly operations. And it could be difficult to work together.
Not only from a geographical point of view, but also because there are these two groups today, developers and operation, belonging to two different management lines. So we are going right now, nowadays, through a reorganization. And the decision that has been made is to have operation and developers all together in the same teams or in the same departments, even if they're on different sites, in order to ease the communication, the collaboration, the exchange, the knowledge sharing on the application itself.
There are still some questions which are not completely answered. For example, the integration teams, because today we've got lots of applications which are still relying on integration teams before deploying their software. And for example, support teams, which are all over the world today, we don't know if they should belong to the teams, like dev and operation, if they should stay there.
Then from a technical point of view, the standardization is perfect. It's great. It's working. It's going to ease things, but it's hard for developers. You know that you need to learn other stuff. Like, for example, if you were using CVS or Mercurial, you need to move to Git, and it's not easy.
What scares developers the most, I think, is: yes, they're going to be empowered. They're going to have the ownership of their product. But that means that they're going to have to improve all the automation they've got on their chain. They're going to have to improve all the tests they have. They have to improve the monitoring they have.
And then, since Amadeus is a huge company and there are lots of applications talking together, when we've got a single customer request, it doesn't go only through one application. It may go to different applications. For example, when you look for a flight, it goes to the shopping platform, then to the availability and the inventory application, then it goes to the fare application.
And that's the fear that we may have, and that we have, is that the global operation were the owner of the customer incidents. And we are afraid that if the responsibilities are spread among all the different teams and all the applications, it's going to be harder to communicate with the customer on the state of the incident and to recover the incident itself.
And then for the same reason, since we've got lots of applications talking to each other, the APIs need to be really, really strict.
So one big question that we had when thinking about DevOps philosophy and the transition to DevOps is: is it possible to have it with the current architecture, the OBE architecture that he talked about? The answer is no. The DevOps approach would be too complex and too expensive because of the lack of automation we had on the other system, and it would take too much time in order for DevOps to develop and operate the application.
And then we would have asked them to have knowledge of the other layers, like hardware and network. That's when the cloud-based architectures come in handy.
So which are the benefits of the cloud? The most important, I think, is the separation between the different layers, between the application layer, the platform layer, and the infrastructure layer. Which means that when you work on one of the layers, you don't need to know what the others are doing. You just need to have APIs between the layers, and that's going to work.
And then there is the flexibility to manage the resources, the possibility to add new nodes on your cluster, to delete nodes if needed, the multi-tenancy, the quota limits for different applications. Then the hardware, the standardization of the hardware, then the redundancies, scalability, isolation, the tolerance to failure, which is really, really important. If a virtual machine is going down, you want to be able to delete it from the cluster immediately.
And then there is the support of the open-source community.
So now we're going to talk about the technical changes. I think that's my best part. I like it. I'm more comfortable with this part.
So we've got the different layers. First of all, we've got the hardware layer. It might be inside Amadeus or somewhere else. Then there is the infrastructure layer part, the platform layer. We have a partnership in Amadeus with Red Hat to use OpenShift. And then there is the application layer, where we have all kinds of Amadeus applications.
We are going to focus on the PaaS layer here because the applications are interacting directly with it in order to deploy and to follow the build, the integration, deployment of their application.
So inside the platform layer, we've got different PaaS's, platform as a service. I don't know if you know it. Okay.
Inside the platform layer, we've got different players inside. First of all, we are using Docker. There are two main concepts in Docker: the Docker image and the Docker container. The Docker image gives everybody the possibility to build your application in a unified way, meaning that it doesn't matter what your application is, you can build it using Docker. And once your image is built, you can ship it and run it in all Linux platforms.
And then there is the container, the Docker container, which is a running image, which gives us the possibility of a process isolation for each container.
But that's not enough, because if you have to cope with a lot of traffic, you cannot have one container or just two containers processing your traffic. And if you've got a lot of applications, Docker is not going to be enough. We need an orchestration of all the containers.
What we have chosen in Amadeus is Kubernetes, which is an open-source project started by Google, which helps us orchestrate containers. Actually, in Kubernetes, we don't just talk about containers; there is a way of packaging containers into pods.
So via Kubernetes, you can schedule pods. So it's pretty easy. Kubernetes enables you to schedule the pod on your VM, and if the scheduling is not okay, the pod is failing, it's going to be killed and rescheduled again.
And then there are the cluster management capabilities, meaning that you're able to scale up or down the number of your pods depending on the traffic you need to deal with.
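The scheduling and scaling behaviour just described can be sketched as a toy reconciliation loop. This is plain Python, not the Kubernetes API: the `Pod` class, the function names, and the capacity figure per pod are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    healthy: bool = True


def desired_replicas(tps: float, tps_per_pod: float,
                     minimum: int = 1, maximum: int = 10) -> int:
    """Pods needed for the current traffic, clamped to [minimum, maximum]."""
    needed = math.ceil(tps / tps_per_pod)
    return max(minimum, min(maximum, needed))


def reconcile(pods: list[Pod], target: int) -> list[Pod]:
    """Drop failed pods, then add or remove pods until `target` are running."""
    running = [p for p in pods if p.healthy]
    next_id = len(pods)
    while len(running) < target:
        running.append(Pod(name=f"pod-{next_id}"))
        next_id += 1
    return running[:target]


# One pod has failed while traffic grew to 2,500 TPS (hypothetical numbers).
pods = [Pod("pod-0"), Pod("pod-1", healthy=False)]
scaled_up = reconcile(pods, desired_replicas(tps=2500, tps_per_pod=1000))

# Traffic drops again, so the same loop scales down.
scaled_down = reconcile(scaled_up, desired_replicas(tps=500, tps_per_pod=1000))
```

The failed pod is replaced rather than repaired, and scaling up or down is the same operation with a different target.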
And then on top of Kubernetes, we have chosen to use OpenShift, which is built on top of Kubernetes, and it has a lot of tools especially written for DevOps, like the build config, for example, which allows you to build your application directly inside the platform. For example, if you've got a code change in Git of your image, if your image is changing, you're able to build your application inside the platform.
Or for example, the deployment config, because you're able to tell to the platform how you want to deploy your application. For example, if you want to do a software update, you can say that you want to create all new pods with your new version and then delete the old ones. Or maybe you want to do a rolling upgrade, just loading some pods and deleting the old ones, and so on.
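The two update strategies mentioned here can be compared with a small simulation. It is only a sketch of the idea in plain Python, not an OpenShift deployment config; the function names and the batch size are assumptions.

```python
def all_at_once(pods: list[str], new: str) -> list[list[str]]:
    """Create a full set of new pods, then delete all the old ones."""
    fresh = [new] * len(pods)
    return [pods + fresh, fresh]


def rolling(pods: list[str], new: str, batch: int = 1) -> list[list[str]]:
    """Replace pods batch by batch; returns the pod list after each step."""
    current = list(pods)
    steps = []
    for start in range(0, len(current), batch):
        for i in range(start, min(start + batch, len(current))):
            current[i] = new
        steps.append(list(current))
    return steps


full_swap = all_at_once(["v1", "v1"], "v2")
gradual = rolling(["v1", "v1", "v1"], "v2")
```

In this model the first strategy briefly needs twice the capacity, while the rolling upgrade keeps the pod count constant but runs both versions side by side for several steps.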
And then there is a web console, which is always important to have, because command line is not always a good solution.
But we don't only deal with Amadeus products, the Amadeus applications. We also have and use external products. An example is Control-M by BMC. It's a huge part of our business because we've got hundreds of thousands of jobs defined in Control-M, and we have thousands of jobs which are running in parallel.
Usually when you talk about DevOps, we always talk about web applications because it's easier to talk about web applications. But you don't always have only that inside a company. You have jobs as well, and it's important to treat them as if they were part of your online software.
Because from a development point of view, you want to just treat code like all the other software. It means you want to build it, you want to run and launch your tests, and then have the continuous build, the continuous integration, and the continuous delivery of your jobs.
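As a rough illustration of what treating jobs as code can mean, the sketch below keeps job definitions as plain Python data and checks them the way a build would check any other source file. The `Job` structure, the job names, and the commands are hypothetical; this is not the Control-M definition format or its automation API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    command: str
    run_after: tuple[str, ...] = ()  # names of jobs that must finish first


def validate(jobs: list[Job]) -> list[str]:
    """Return a list of problems; an empty list means the definitions are consistent."""
    problems = []
    names = [job.name for job in jobs]
    for name in sorted(set(names)):
        if names.count(name) > 1:
            problems.append(f"duplicate job name: {name}")
    known = set(names)
    for job in jobs:
        if not job.command.strip():
            problems.append(f"{job.name}: empty command")
        for dep in job.run_after:
            if dep not in known:
                problems.append(f"{job.name}: unknown dependency {dep}")
    return problems


# Hypothetical definitions that would live in version control next to the code.
jobs = [
    Job("build-passenger-list", "python build_list.py"),
    Job("send-boarding-report", "python send_report.py",
        run_after=("build-passenger-list",)),
]
broken = [Job("send-boarding-report", " ", run_after=("missing-job",))]

ok_report = validate(jobs)
broken_report = validate(broken)
```

A check like this can run in continuous integration, so a bad job definition fails the build instead of failing in production.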
We have done some workshops with BMC. We were able to build a prototype to integrate using the Docker image that we had from BMC. Using the automation APIs, we were able to do a prototype with them, and we can say that last week we had the first job deployed in test system using what we have built together.
So on top of all this stuff, what we have created is our own platform layer, which is called Amadeus Cloud Services, which is built on top of OpenShift, but that allows us to gather together all the other players like Control-M or, for example, database, or the ESB that we have inside Amadeus.
So that's a journey that is not finished. We still have a lot of work to do, but we have now a running application with thousands of TPS, of transactions per second, that is running in production, in a public cloud. We will have another application that will go into production this month. And we are working on migrating all Amadeus applications to the ACS, and we're going with that.
Damien Profeta
So as key takeaways about this presentation, I think that you understood that we decided to go to the cloud because we wanted to change our organization. And retrospectively, I think it would not have been possible to go to DevOps with our OBE world.
So it means that technology and organization go hand in hand, and one enables the other. So you have always to look to both of them to choose where you want to go.
But then one really key point that is often said during conferences is that the operational model is really important. So Conway's law, that says that your technology mimics your organization, is really important, and you have to treat it and to do the reverse Conway's law. Which is to say that you want to go that way, so you put your organization first in front of that way.
And one last point that is quite important also, because there are many people that are trying to go to the DevOps model and that feel it difficult, is that if you want to change, if you need to change, it means that you are still there, that it's working, that you are growing. So it's a mark of success. So you need to be proud to have to change, because it means that it's working well and that you want to cope with the challenge.
I think that's all for this. Now, if you have any questions about the presentation.
Q&A
Q: Hello. Kalle from Finland. Thanks for the presentation. Almost at exactly the same company size as us, so a lot of silos and a lot of Conway's laws. But what is the final target of the OpenShift platform? Are you going to run it in hybrid mode or going full-stack Google or AWS, or what? Because we have a lot of problems running on OpenStack.
A: Really?
Q: Yeah. Because the version upgrades are, we have been doing a lot of many years at least, but okay, now it's going to be in that condition that you can run it in production also. But are you going to run it on OpenStack or Google or AWS?
A: Okay. So today we're running in OpenStack. Mainly all of our applications are running on OpenStack for test phases, as is the one that is going to production in June. And the other one, I think it's not...
A: Public cloud.
A: It's a public cloud, yes. It's not OpenStack, the one which is in production since almost a year, I think. So the aim is really to have a hybrid cloud.
A: Yeah.
A: To have both private and public cloud, and on the public cloud to use the technology that is there, and on the private one to use OpenStack.
Q: Martin Gander at Home. Just wanted to say, it sounds to me like the change in terms of your organizational structure and the work you've gone through, the transformation internally and personally must be enormous. How are you really coping with that? Is it actually running fairly smoothly? How are you onboarding the teams, and how many people are involved? And could you give me a bit of background on that?
A: So it's quite recent, the reorganization. We knew that it would come, but the decision has been announced a few months ago. Talking about the number of persons, I would say it's 2,000 persons...
A: Yeah, it's possible, yeah.
A: ...that are involved. So yeah, it's really a huge reorganization that took months to decide where people will go, et cetera.
But for the applications that are already running in the cloud, the organization has been done before. And now since we want to integrate the existing applications that we have in Amadeus, we're going to do the organization, we can say division by division, in order to ease the migration.
And I think...
A: I think we are 2,000, yes.
A: I think that the aim is to... So we started the reorganization. We have a few departments that are kind of guinea pigs, where we really have dev and ops. A few ones where we have ops that are closer to the development, but we still keep ops that are the main people that are operating.
And I think that we'll continue to reorganize to more and more move to the DevOps migration, taking the experience from the guinea pigs into account for the future reorganization.
Q: And can I ask another question? What's the transition plan of moving clients over to this new architecture? What's the sort of longer-term strategic planning on that?
A: For the client?
Q: For the clients, yeah.
A: For the client, it should be completely transparent. So normally we should migrate them, and they will not see it, in fact.
Q: Well done.
A: No answer.
Q: Basically a question two in one. You have a massive load on your systems, 24 hours a day, all days. How are you actually doing this kind of load testing and redundancy, and how are you assuring that you're really up 24 hours, all days? What are you doing for especially, I guess you must have a lot of experience of how to make sure that your systems are performing and are there any day. This new concept adds to it, but what are you doing more?
A: So there are multiple things that we do to be sure that we are running 24 hours.
So first, we have global operation that is all over the world. So there are always the people, some guys that are monitoring the different figures. Then the monitoring, of course, has alarms to trigger if there is any drop or if any change that we see in the traffic that should not be there.
So the monitoring, it monitors between the previous week because we have a kind of strange pattern that we call a camel, because it makes two peaks in the morning and the afternoon. And depending on the day of the week, it changes, so we monitor from the previous week.
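A minimal sketch of that week-over-week comparison, in Python. The 20% tolerance and the sample values are assumptions for the example; the talk does not give the real thresholds.

```python
def traffic_alarms(current: list[float], previous_week: list[float],
                   tolerance: float = 0.2) -> list[int]:
    """Indexes of samples that deviate from the same slot last week by more than `tolerance`."""
    if len(current) != len(previous_week):
        raise ValueError("both series must cover the same time slots")
    alarms = []
    for i, (now, ref) in enumerate(zip(current, previous_week)):
        if ref == 0:
            # No traffic last week: any traffic now is unexpected.
            if now != 0:
                alarms.append(i)
            continue
        if abs(now - ref) / ref > tolerance:
            alarms.append(i)
    return alarms


# A "camel" day with two peaks (hypothetical values); the second peak drops this week.
last_week = [100, 400, 250, 420, 120]
this_week = [105, 390, 260, 200, 118]
alarms = traffic_alarms(this_week, last_week)
```

Comparing against the same slot of the previous week means the two daily peaks are expected and do not raise alarms by themselves.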
And I have to think about the third one. But you mean about...
Q: Predicting on one side, and another side, you're massively testing or whether you're doing it all the same time?
A: Yeah. So for the load itself, yeah, we are testing for the whole week with live traffic to see how it's behaving, if we have exactly the same response as before or if the changes are there for a purpose. If we see memory leaks or something like that. So it's a stable environment. And then after that week, if it's going well, only then do we load to production.
A: We've got a stress test for two or three days, I think.
A: Yeah.
A: It depends on the application. And then the shadow test, which just takes production traffic to test it with the new version.
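The shadow test can be sketched as replaying the same requests against both versions and keeping the differences. This is an illustration in Python under simple assumptions: both versions are modelled as plain functions, and the pricing logic and request shape are invented.

```python
from typing import Any, Callable


def shadow_test(requests: list[Any],
                current: Callable[[Any], Any],
                candidate: Callable[[Any], Any]) -> list[tuple[Any, Any, Any]]:
    """Replay each request against both versions and collect the mismatches."""
    mismatches = []
    for request in requests:
        expected = current(request)
        try:
            actual = candidate(request)
        except Exception as exc:  # a crash in the new version is also a mismatch
            actual = f"error: {exc!r}"
        if actual != expected:
            mismatches.append((request, expected, actual))
    return mismatches


# Hypothetical versions of a pricing function.
def price_v1(request: dict) -> dict:
    return {"price": request["base"] + 10}


def price_v2(request: dict) -> dict:
    surcharge = 11 if request["base"] > 100 else 10  # changed behaviour
    return {"price": request["base"] + surcharge}


captured = [{"base": 50}, {"base": 150}]
differences = shadow_test(captured, price_v1, price_v2)
```

Each mismatch then has to be reviewed to decide whether the change is there for a purpose or is a regression.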
Q: Thank you.
A: Any other question?
Q: Hello. Thanks for the presentation. Can you elaborate a little bit more on how BMC and these jobs as code are helping you through this journey? So what exactly is this Control-M doing? You have a central repository of jobs...
A: Yeah.
Q: ...and then in your code, you define where they trigger or whatever, or dependencies with software?
A: So yeah, basically we have many reporting to do to send just before the flight takes off, things like that. And so BMC is taking care that we do it in the right order, that we do not miss a job, so everything is sent before the flight takes off.
So with the migration with ACS, if a VM crashes, the job is rescheduled on another node and all things like that. And going to scale, so as you saw with 300,000 jobs that are defined. So that's the point, to use a job scheduler for that purpose.
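Running jobs in the right order comes down to respecting their dependencies. A small sketch with Python's standard `graphlib` module is shown below; the job names and dependencies are hypothetical, and this is not how Control-M is configured.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs; each job maps to the jobs that must finish before it starts.
dependencies = {
    "build-passenger-list": set(),
    "build-baggage-report": set(),
    "send-pre-departure-reports": {"build-passenger-list", "build-baggage-report"},
}

# One valid execution order that respects every dependency.
order = list(TopologicalSorter(dependencies).static_order())
```

A real scheduler also handles timing, retries, and rescheduling on another node, which this sketch leaves out.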
Q: Thank you.
A: Thank you. Any other questions? Okay. I think thank you very much.
A: Thank you very much.
A: Thank you.