The Road to Enable DevOps Within the Payment Industry


London 2016


Gebrian uit de Bulten

DevOps lead Gallia (Netherlands, France, Belgium, Luxembourg) · Ingenico ePayments/Accenture

Vincent van Kooten

Domain Manager Front Office · Ingenico ePayments

What if your system needed to handle thousands of transactions per second, and a single second of downtime would affect most of the biggest internet sites in the world? This is the environment Ingenico ePayments has to cope with every day.

In this talk, Vincent and Gebrian explain their journey to enable DevOps in their main application, where they needed to refactor a 15-year-old monolithic application into a state-of-the-art microservices platform. They give insight into the approaches they chose, the challenges they faced, and the road ahead.

Their DevOps journey was inspired by (and they learned from) Spotify (organizational and cultural change), Netflix (architecture enablement), and Facebook (continuous delivery).


Full transcript


Vincent van Kooten

So my name is Vincent. I work for Ingenico ePayments.

Gebrian uit de Bulten

My name is Gebrian uit de Bulten. I work for a very small company called Accenture. You've probably never heard of it. It has around 400,000 people.

My role there is DevOps lead for what we call Gallia, which means the Netherlands, Belgium, Luxembourg, and France. So the people that are still in the EU.

Too soon? No.

But that's a little bit where I come from. So Vincent, what is this Ingenico stuff?

Vincent van Kooten

Let me introduce a little bit what Ingenico is all about, to give you some context.

Ingenico is a French company that was founded in 1980, and it's basically known for manufacturing payment terminals. So if you go to the supermarket, a gas station, or you check out in the hotel, there's a terminal where you put your card in, you enter your PIN code, the correct one, I hope, and then the payment is all being taken care of.

The count right now is 32 million terminals, so you can imagine what volume of payments is going through their systems. Ingenico is operating right now in 170 countries and has 6,000 employees.

Recently they decided to also enter the market of online payments. I worked previously for a company called GlobalCollect, which was acquired by Ingenico in 2014, which is based in Amsterdam. Another payment service provider, Ogone in Belgium, was also acquired by Ingenico recently. And now together, jointly, we are Ingenico ePayments.

This talk will be focusing purely on the Ingenico ePayments platform.

So let's take a look at what a payment service provider actually is. Let me try to explain that in two slides.

I can't give any names, but imagine you have a gaming platform and you sell games online. Customers go to your website, they pay, they download the game. And basically, since you're online, you want to sell these products all around the globe.

But if you have customers in Chile, Alaska, maybe in China, Japan, wherever, and you want your end consumers to pay in some kind of local payment system, or they need to pay in pesos, dollars, euros, pounds, you would have to implement a lot of payment solutions.

That's kind of what I tried to depict in this picture over here. You would have to implement all of these solutions. You have to maintain all of these solutions, while what you would want to do is just focus on selling your games online, right? So that's a lot of complexity there.

Where you see the red arrows is basically where my team within the company comes in. The previous presenters were all CTOs, CIOs. I'm not anything like that. I just run several Scrum teams which are basically responsible for the platform that does all the online processing for these payments. So very critical. Those are the red lines.

The gray lines are reporting. So you can imagine, if you would, in a regular situation, have payments going on in 50 countries, for example, you would receive a lot of reports, a lot of feedback on this, and then there's refunds and all kinds of different currencies.

So what we do is we take away that complexity, and we provide access to one single API to a multitude of payment products all around the globe. That could be credit card transactions, bank transfers in dollars or euros or Swiss francs, doesn't matter. It could be PayPal because you want to integrate with PayPal, Konbini in Japan, Citrus in India. I know that.

And we just provide it to you through one single API. So you only have to implement this, and hardly any maintenance. Earlier this year, we went live with a REST API, which is really nice. If you go to our website, you go to the developer hub, you can just take a look at it. You can see all the specs. We provide you SDKs for iOS, for Android, that you could just use and implement in your own app, also for Java.
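As an illustration of the "one single API" idea, here is a minimal sketch of a facade that dispatches to per-product handlers. It is not the Ingenico ePayments API: `PaymentGateway`, `PaymentRequest`, the product keys, and the status strings are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class PaymentRequest:
    product: str        # hypothetical key, e.g. "card" or "paypal"
    amount_minor: int   # amount in minor units, e.g. cents
    currency: str       # ISO 4217 code such as "EUR"


@dataclass(frozen=True)
class PaymentResult:
    product: str
    status: str


# Every provider integration is hidden behind the same signature.
Handler = Callable[[PaymentRequest], PaymentResult]


def _pay_by_card(req: PaymentRequest) -> PaymentResult:
    # Stand-in for a real acquirer integration (assumption).
    return PaymentResult(req.product, "AUTHORIZED")


def _pay_by_paypal(req: PaymentRequest) -> PaymentResult:
    # Stand-in for a real redirect-based integration (assumption).
    return PaymentResult(req.product, "REDIRECTED")


class PaymentGateway:
    """Single entry point; the merchant never talks to a provider directly."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, product: str, handler: Handler) -> None:
        self._handlers[product] = handler

    def create_payment(self, req: PaymentRequest) -> PaymentResult:
        handler = self._handlers.get(req.product)
        if handler is None:
            raise ValueError(f"unsupported payment product: {req.product}")
        if req.amount_minor <= 0:
            raise ValueError("amount must be positive")
        return handler(req)


gateway = PaymentGateway()
gateway.register("card", _pay_by_card)
gateway.register("paypal", _pay_by_paypal)
```

The merchant integrates once against `create_payment`; adding a new payment product means registering one more handler on the provider side, which is where the complexity ends up.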

So it's really easy to integrate with us, and we take away all this complexity, which means we own all the complexity, which is a little bit of a headache for us. I'll get to that later on.

Reporting, everything is consolidated in one single feedback to you. There is a service you can log in to. You can see all the payments, you can trigger refunds, et cetera. And as far as the funds go, we would even take care for you to aggregate everything into one single currency, in one account, wherever you want it. So no longer payments in 150 countries, all in one place.

You can imagine, so now I go to the numbers. I never do these presentations, so I went to the marketing team. We process billions of flow, and you can imagine that we cannot suffer one minute of downtime or 30 seconds of downtime. Our merchants, our customers, would be complaining right away.

So it's really crucial for us that we are stable, we maintain our uptime, and we don't introduce any issues on production. So we process billions of flow. I thought it was a lot until I saw the presentation of Barclays yesterday, but they process this every day.

And we only do this with about 800 employees, which I think is pretty nice, pretty lean. So I'm proud of that. You can see a number of customers on the right side.

There's one merchant or customer that we would have liked to mention, which is Tidal. I don't know if anyone has heard about Tidal before. It's like Spotify, but then owned by Jay-Z. And you can imagine they do some exclusive launches of albums, and if a guy like that just tweets, "Hey, my new album is online right now only on Tidal," we get the peak of traffic right away. So we also need to be able to deal with that. Gebrian will talk more about that later on.

Shift away from the monolith. All those red arrows I showed you before, the platform we have right now is not as old as 1994, but you can imagine that back then we were a front-runner. But right now, everything was in one single big monolith, and some years ago we decided to take it apart because we really needed to move into microservices to be able to keep up with, well, also with the competition.

First, we were one of the first. Now competition is really fierce, and we have this one big, huge application that we need to maintain to do all these payments in. So we started to chop it up in pieces.

I think everyone knows this picture of the elephant. We've been doing that together with Accenture. We're on a journey together. We're not there yet, but we made some huge steps over the last period. And you can imagine that it's not been easy chopping up this monolith in small pieces, migrating bit by bit, and still being able to handle up to 1,000 transactions per second if needed.

So if anything would go wrong anywhere, we would have immediate impact, and we could even lose our business because we can't suffer a lot of downtime.

Gebrian is one of our DevOps champions, I would say. That's a word I learned yesterday. I wasn't so much into the scene, but I'm convinced now, so to say. We moved into a more agile way of working, and we're really getting the benefits from it now.

I think I'll stick to that, because otherwise you don't have any time anymore.

Gebrian uit de Bulten

Vincent explained a little bit the challenge. They asked us to come together on a journey to make this possible.

First of all, Beyoncé, for example, released two months ago. Then everybody, probably some of you as well, just clicks on and wants to get that album as soon as possible. What happens then, of course, is huge amounts of load. It also needs to get paid. That's where our system comes in.

That's one. However, we're handling also credit cards, PayPal, so everything to do with payments. So we have a lot of regulation, security, that kind of things.

However, the company, as in Ingenico, wanted to go really further, like the Spotifys, the Netflixes, the Facebooks that you see here. They asked us to go together on the journey.

So I asked them, "Hey, I hear also a round thing called the Death Star. I can build that also. That's probably easier than what you're asking now from me." However, they wanted to still do it, so I said, "Okay, let's see how we do it."

So in the next 20 minutes, I'll discuss how we did it together. It was really a combination of us working together to get this done.

For me, I've presented similar things as well for our boards, and then it took six hours. So hopefully today you will be home earlier than the six hours, but let's see how far we can go.

The first part is, you heard about Spotify, Netflix, and Facebook almost the last two days. Everyone talks about these unicorns and that kind of thing. To be honest, I don't think they are unicorns. For me, they are just horses, and I probably would duct tape a horn on it, because unicorns for me are myths. And for me, it's not the case.

So what a lot of companies try to do is copy what those unicorns are doing. That's not what we did. We looked at those companies and decided, "Hey, what can we learn from them?" And just have that within our system, within our ecosystem, within everything. What can we really pick from those ones?

So we divided this huge undertaking into three areas. Organization, culture, charts: we took the Spotify example. Architecture: everybody, microservices are cool, we looked at Netflix. And continuous delivery: we really looked at how Facebook does it.

Development, how you build it, is really about teams, right? We talked about it continuously. However, for us, it's different because we, of course, are a vendor and we work together. So we constantly discuss, "Hey, we're in this journey together." So if I have a problem, I'm one of those bicycles, and I'm there, and I'm having a hard time...

Vincent van Kooten

You're one of the guys, not the bicycle.

Gebrian uit de Bulten

Yeah, I'm one of the guys. And we need to work together to get across the finish. If Vincent just kicks me and that kind of thing, it's not the way we want to work, right?

That's why I like the Spotify model. Ron yesterday from IG also talked a little bit about it, so I'm not going into that deeply. But what we really wanted to focus on is our teams. We want to create them as feature teams, and they should be like mini-startups, self-organizing, working really end to end, and taking the responsibility.

And not me as a manager or Vincent as a manager just pushing around. No, it's the team's responsibility. And all the rest, infrastructure, the app platform, iOS, et cetera, all those teams need to be enablers. They should be making it possible for features to be brought faster to production, and not the other way around, which most of us are doing.

This was our organization, how we started it. On the top, we had a product owner. We had a project manager, like we all do. Then we had vendor one, and we had another vendor as well. One is doing development, the other testing. One is mostly combined in Mumbai. All the test is in Pune. We have some project manager onshore, and we had, I think at the time, one developer onshore as well, and maybe a little bit more, and a few testers.

But it was just a mix all over the place. If something goes wrong, finger-pointing: "No, it was the other vendor. The test is incorrect. Development's incorrect."

Then we still have operations. We have what we call operations level one and level two. For us, since we are very heavily regulated, what you as DevOps people would really like to have, continuous delivery where you push straight to production, is impossible. Regulation does not allow it. We don't have the right, and we will never get the right, to push a button and put it on production, because of what we call PCI compliance. It's all regulated, so we need to deal with that.

So this is the current situation where we are. We have been pushing the teams very hard to be self-sufficient. So we took this Spotify model with the cool stuff, with the tribes and the squads and all kinds of shit, and the first thing we did, we removed all those strange names. Right? Nobody knows what it is. A squad, a tribe, I don't know. Let's just call it a domain.

That's where Vincent comes in. Vincent is this guy we call the front office product owner. It's more like a domain manager, so he's leading the domain. For me, I don't know what title I have. I call it development lead or delivery lead or whatever. And since we have this big offshoring, we have people there as well that are the leads.

However, we are not the old-school managers. We call them servant leaders, I think I heard it yesterday. We are really pushing the teams to take ownership. We still have on the top what we call projects. But currently we are finalizing these projects as well, because the teams for us are the ones that are end-to-end responsible.

So we have really the feature teams where we have people from the business. You put them in a team and make them responsible. What we call projects is more over multiple domains, and we still have this team end-to-end responsible for this domain. So we have some projects, and the other one, what we call fast track, is really about teams that have functionality that's less than two to three weeks of work. Else, we go through more to our project team.

So you need to see, for example, fraud. For us, fraud is very important. We have a team very dedicated to handle this fraud functionality. So really about having a feature team.

Vincent van Kooten

What would be nice to mention here as well is that we used to have a very classic vendor-client relationship. That's gone. That was the first thing we moved off the table.

So in these teams there are people from Ingenico, there are people from Accenture, people from other vendors collaborating in those teams. We really don't care who's paying their salary. We just want to make sure that they work together as a team and deliver what they need to deliver.

Gebrian uit de Bulten

That's really important.

We've also gone from, "Hey, this whole platform, we have a development team, then we deliver something to a test team, and they work with it." No, they are now one team. And we scattered them. So we have people sitting in Pune and one tester that's sitting in team one, for example, and we have a developer sitting in team one, and the guy next to him is sitting in team two.

We constantly have open-line communication and that kind of thing, and it worked great for us. Instead of just having an offshore team and doing some development stuff and then moving it to testing, now it's really about feeling as a team.

Still, we have what we call the PCI-compliant environment, and we're really working closely together with them. They have a really hard job, of course, about the security and that kind of thing. So we're really involved now with them as well. Daily, we have calls with them every morning to discuss how we're going to set up the environments and that kind of thing.

To be honest, going forward, those are the points also where we will focus on, to see what can we do with the regulation to get more of those people in our team, so we can really start pushing to production from the teams.

The projects as well, we discussed it. We are currently doing a proof of concept with one team where we really want to remove whole projects from there. So that's a little bit the setup we currently have.

Then the other point I discussed was architecture. I'm not sure if you can see what this is, but this is a Lego contraption, it's called. I'm a big nerd. I confess, I like Lego.

So what they have here is, once in a while they have a conference a little bit like this, and they all build a Lego machine, and there goes a football, so a soccer ball, through their system. What they discussed exactly: one ball needs to go from one system to the other one every second. So they have a very clear, what we call, interface.

I really like that from a standard, because everybody talks about architecture, it should be like Lego. You put a few blocks in it, you can reuse it, but you can build everything. So that's why I like this picture.

But everybody talks now currently about microservices, right? Yeah, we have this monolith, we need to split it up, et cetera. I'm not going to discuss it because this morning there was a great discussion about what microservices is, so I'm not going to repeat it.

However, the question that I still didn't, to be honest, really see during most of the presentations is: today we have a monolith, and tomorrow we have microservices. How do you get there? What do you do? And in this regulation?

So this is how we do it. We have a big monolithic application, and we use what we call the strangler pattern. There's a story Martin Fowler wrote about it. Really look into it.

What we do, instead of just building a complete new platform, this is a platform with a million lines of code, so it's huge, we can't just do it. No, we are slicing it.

I took the example here, for example, for getting the order. For some reason, everybody, our clients want to get an order, what the status is. I don't know why. I think if you pay us, it's enough, but okay.

Vincent van Kooten

They want to see all of their money on the back.

Gebrian uit de Bulten

Yes, they want to know where the money is.

So, get order status is one of our main calls. It's a small call. What we did, we created a complete new platform next to the already existing one, but created only the basis. Think about logging, exception handling, everything on an application server, so completely clean, what we at the time called a clean bucket, and we slowly moved first the get order status to that.

So really focus on one particular element and get it running. And then, since we have so much high peak load, see how it works. That way, we could slowly slice our application.

The strangler pattern is about, hey, you have a tree, and you build a small one within it, and then slowly it kills the currently existing tree. So that's a little bit what we did with the strangler pattern.
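The routing idea behind the strangler pattern can be sketched as follows. This is only an illustration under assumptions: `StranglerRouter`, `legacy_monolith`, `new_platform`, and the operation names are hypothetical, and the talk does not say how the actual routing was implemented.

```python
from typing import Callable, Set

Handler = Callable[[dict], dict]


def legacy_monolith(request: dict) -> dict:
    # Stand-in for the existing monolith (assumption).
    return {"served_by": "monolith", "operation": request["operation"]}


def new_platform(request: dict) -> dict:
    # Stand-in for the new, clean platform running next to it (assumption).
    return {"served_by": "new-platform", "operation": request["operation"]}


class StranglerRouter:
    """Sends migrated operations to the new platform, everything else to the monolith."""

    def __init__(self, legacy: Handler, modern: Handler) -> None:
        self._legacy = legacy
        self._modern = modern
        self._migrated: Set[str] = set()

    def migrate(self, operation: str) -> None:
        self._migrated.add(operation)

    def rollback(self, operation: str) -> None:
        # Moving an operation back is a routing change, not a code change.
        self._migrated.discard(operation)

    def handle(self, request: dict) -> dict:
        if request["operation"] in self._migrated:
            return self._modern(request)
        return self._legacy(request)


router = StranglerRouter(legacy_monolith, new_platform)
router.migrate("get_order_status")  # the first slice moved over
```

With this shape, each slice is migrated by one `migrate` call once it has proven itself under load, and the monolith keeps serving whatever has not been moved yet.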

We started with that, and how did we then move from there to the other one? We continued that journey, and instead of just having really this big monolith and only strangling that, we slowly created services out of it.

So really pick up particular functionality and say, "Okay, can we isolate this from what we already have existing, or should we move it to a different service?" A lot of them we now have. I think our whole platform is now 30, 35 applications, something like that, the whole platform.

So instead of just only the strangler pattern from our core business service, we also created new trees, right? That's a little bit how we did it, how we combined it.

As you can see, we are doing a lot of functionality. PayPal, for example, is an example. Authorization and the other ones are all different bank modules that we have below it. Because that's of course our issue as well. Besides that we are talking about so much load and performance and that kind of things, we have to do with tons of different systems, and we need to connect most of them to what we call providers, acquirers, banking systems. Some of them are almost like smoke signals that you need to send over the line, and not all APIs or that kind of thing. So we need to deal with all kinds of different technologies and that kind of thing.

Okay, cool. Now we have that. Here is an example. Now we have a service.

What we did here was an example of: we do the get order status, and we fire as much load on it as humanly possible. This is on a test environment. It's not a Netflix style that we do it on production. We have there Docker, Amazon Cloud, everything is up and running currently in this new framework.

What you can see is that on the left side, you see how many transactions per second we fire on this one simple machine. We just fire a lot of load on it. It handled around 1,000 transactions per second, and we saw a trigger came up: it's above 50% of CPU, create an instance from scratch. Not existing, switching load over or anything, create a new instance completely from scratch. Took a few minutes.

Vincent van Kooten

Automatic trigger.

Gebrian uit de Bulten

Sorry, yeah. So it creates it, and suddenly we see it has been created, and we can handle 2,000 transactions per second.
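The scale-out rule described here can be written down as a small sketch. The 50% CPU threshold and the roughly 1,000 transactions per second per instance come from the talk; `ScalingPolicy`, `should_scale_out`, and `instances_needed` are hypothetical names, and the real trigger presumably lives in the cloud tooling rather than in application code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    cpu_threshold: float = 0.50    # from the talk: trigger above 50% CPU
    tps_per_instance: int = 1000   # from the talk: about 1,000 TPS per instance in the test


def should_scale_out(cpu_utilization: float, policy: ScalingPolicy) -> bool:
    """True when average CPU utilization (0.0 to 1.0) exceeds the threshold."""
    return cpu_utilization > policy.cpu_threshold


def instances_needed(offered_tps: int, policy: ScalingPolicy) -> int:
    """Smallest instance count that covers the offered load, never below one."""
    return max(1, math.ceil(offered_tps / policy.tps_per_instance))


policy = ScalingPolicy()
```

For example, `instances_needed(2000, policy)` gives two instances, which matches the jump from 1,000 to 2,000 transactions per second seen in the test. Because a new instance takes a few minutes to build from scratch, the threshold has to fire well before the existing instance is saturated.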

That's a little bit where we want to go, because somewhere in this world, I don't know, the newest World of Warcraft game comes out and they didn't inform us. Well, probably that one they did. But then suddenly you see a huge amount of peak load, right? And that's what we need to deal with.

Last point: commitment to deliver. That's where we focus on. We took the Facebook example. There is a link in most of my slides as well. I really would ask you to look at that link. It's really good. It's Chuck Rossi explaining how the Facebook mechanism works from going to production.

Why we took this one as an example, as a baseline for where we wanted to go: they release every week. They have a huge platform. They still have a monolith, so we tried to use it first from that one. And they can deliver twice a week or twice a day to production as well. One big release and twice per week to production.

So what is our situation? We're running multiple projects at the same time. You guys are probably doing the same thing as well. So really focusing on having more streams at the same time and that kind of thing.

Releasing to production, we did that less than 10 times a year. From the moment some poor developer committed code until it reached production, it took months. Very high defect ratio, because what we did at the time was: if we put something on production and it went wrong, we rolled it back, but we only removed that fix from the code, and it was a mess.

The deployment itself took days, sometimes even weeks at a time. So the problem that you have is if you do these multiple streams also, one project is trying to fix something, it's rolled back from production, and the other one is late as well, then the whole planning for all those projects will be delayed. We're just building up dependency on dependency on dependency, and in the end, nothing went to production anymore.

Vincent van Kooten

Correct. So what you get, you get a collision. A full collision on the environment.

Gebrian uit de Bulten

So how did we solve it? Very simple. What did the train guys do? A station. It's as simple as a train station. The train or bus or whatever, well, bus maybe, but a train or flight doesn't wait for you.

So for us, the schedule is Tuesday and Thursday. Sometimes a little bit different, but we're trying to focus on it. That's the way to do it. So if you're too late for your release that goes out of the door...

Vincent van Kooten

Then you're too late. You take the next train.

Gebrian uit de Bulten

Right? That's a little bit how we implement it. Instead of just going from, hey, we wait for everything and that kind of thing.
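The train-station rule can be sketched as a small scheduling function. The Tuesday and Thursday departures come from the talk; the noon cutoff and the names `RELEASE_WEEKDAYS`, `CUTOFF`, and `next_train` are assumptions made for this example only.

```python
from datetime import date, datetime, time, timedelta

RELEASE_WEEKDAYS = {1, 3}   # Tuesday and Thursday (Monday == 0)
CUTOFF = time(12, 0)        # hypothetical cutoff; the talk gives no time


def next_train(ready_at: datetime) -> date:
    """Return the date of the first release a change can still board."""
    day = ready_at.date()
    # Past the cutoff, today's train (if any) has already left.
    if ready_at.time() > CUTOFF:
        day += timedelta(days=1)
    # Walk forward to the next scheduled departure.
    while day.weekday() not in RELEASE_WEEKDAYS:
        day += timedelta(days=1)
    return day
```

A change that is ready on a Tuesday morning boards that day's release. One that is ready on Tuesday afternoon waits for Thursday, and nobody holds the train for it.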

Now we have one of the most regulated systems probably in the world. We have most of the credit cards, probably one of the biggest databases in the world from credit cards, and we now are releasing to production at least every week, sometimes twice, sometimes three times a week in exceptional cases, but normally around once a week.

Now it takes like a week, sometimes even less, to go through to production, to our environment. We have automatic test cases that can be run. 20,000 of them can be run within minutes on our platform. We have a lot lower defect ratio, and the deployment itself takes a lot less than hours, and we can even speed that up.

So instead of just constantly going forward and having a lot of teams, we just have also, again, for the developers under us, one code base.

Vincent van Kooten

We're pretty strict on the teams. Because if they're not checking in time or it doesn't work, then they miss a week, and the next slot is next week.

Gebrian uit de Bulten

Yep. And that's the only way to ensure we keep moving changes to production on and on and on.

One of the last slides is just a little bit how our environment then really looks. Our complete test environment is currently running in a cloud. You see the tools. I think you saw most of them already today. We have that running on what we call the ACP, Accenture Cloud Platform, but we're trying to migrate that probably as well. We have our environment there up and running.

We do have our on-premise environment. One thing to mention is that we are looking into new ways of setting up our on-premise environment as well.

However, what we did, because you still see here there's a big line between the cloud and on-premise. How do you handle that? Well, real simple. We just had two or three persons that we put together from operations and development, and made this automatic.

So as you can see, almost our whole environment is fully automatic. Semi-automatic, I mean, we still need to get approvals. That's the only thing that is stopping us. Our deployments are completely, fully automated. Our operations guy did a great job.

What we saw in the beginning, we had level one and level two ops. The level one, to be honest, doesn't exist anymore. They are all in our teams. We only have the PCI compliance going forward.

But now sometimes we go in one day from integration to pre-production to production.

Vincent van Kooten

And that's something that a year ago was unimaginable. It was more like a month.

Gebrian uit de Bulten

Not even thinkable, I think.

Some of the lessons learned. Talked about in the beginning as well. Really, vendor versus client relationship, it's a combination. We need to do it together. In the beginning, we had a lot of struggles and that kind of thing, but that's really important.

The other one: no separation of teams, mixed onshore and offshore. But also a very important part that has changed our world as well is really when we had a business person that was really responsible for it on the floor, in the team. So not an IT product owner, but someone from the business end-to-end responsible for that.

Vincent van Kooten

That's something that's been happening over the last months, because we've been trying to put the teams together. So we've been just going to the business, pulling the people in, and now they are sitting with us. That really makes a change.

Because otherwise, sometimes you would have surprises that the business says, "Ah, this is not what we wanted." So we said, "Okay, just come sit with us, and before it ships, you say it's okay. And you define what we're going to build." And it works.

Gebrian uit de Bulten

For us, a few things are really important. Indeed, don't use feature branches. That was for us, we tried it, it broke everything. Just one branch, one branch only. And still our teams are discussing, "Hey, Gebrian, can we have a branch?" Hell no. Absolutely not. One branch for everything. One ring to rule them all.

Overall for me, we had a lot of regulation about offshore teams and the delivery. What we did in the beginning was really a little bit strict. We said about autonomy as well. We started really strict, and now we are in a situation that we are loosening up the teams.

So to make sure that everybody looks at it the same way, and we did that with automatic tooling, about quality, about performance, everything. Everything for us is automated.

Last slide. So, hey, cool, Gebrian, cool journey. What do we still need to do then?

One is we are pushing through getting these feature teams up and running.

Vincent van Kooten

That's really important, yeah.

Gebrian uit de Bulten

So purely independent ones.

The other one is our development speed. Since now we have a new platform, we have heavily invested in creating a new platform. We need to go faster, faster and faster.

Cloud possibilities, I think that should be quite clear.

Metrics. What we have, to be honest, been lacking in the journey is a little bit the metrics: how much we improved and that kind of thing. That's something we need to do going forward.

Last point, Conway's Law. I think, for me, it is both ways. On one end, it is not only that you need to change your organization. Sometimes you also need to change your architecture, everything around it. It's both ways. If you really want to go forward, you need to change both.

And the question that I wanted to ask, what are we still looking for? You can already see it: PCI compliance in the cloud. How are we going to do it? We don't know yet. We are investigating it. We'll see what happens. And the same with Docker.

Our environment is quite regulated, strict, has a lot of rulings and that kind of thing. But our main issue, it's that, everything you have as well. However, we have tremendous loads going through our system. Really tremendous, with high amounts of peaks.

I think that's it for me. Vincent?

Vincent van Kooten

If you give me the clicker for the last slide.

Thanks for the really great explanation, Gebrian.

Something I want to mention, because we were discussing this yesterday while we were sitting here listening to the other presentations, actually something really big was happening back in the office. Because we've been going from this monolith to all the small services. We chopped it down into, I think, even more than 25 different services right now.

One of the last big migrations was taking place yesterday while we were here, and we were following on WhatsApp what was going on, and everything went smooth. And I was really, really proud of the teams. So if you guys are watching, thank you, guys.

I think a couple of months ago, this is something we wouldn't have trusted. We would have been there, we would have canceled this, we would have been there in the office. But everything went fine. The teams took care of it all together. They pushed everything from integration, pre-prod, production, moved the flow.

And to give you an impression, I think it was about 10% of the entire flow that moved yesterday, which, if you looked at the numbers before, is a lot of payments. So we were sweating a bit, but everything went fine. It really proves for me that the direction we've chosen is working, and that we together have to continue with the journey we started.

Last slide. We're Dutch, so we like this guy, Verstappen. Yay.

I heard today he crashed his car, but that doesn't matter.

Gebrian uit de Bulten

Shit.

Vincent van Kooten

Anyway, I think it's really important to not be afraid of change. Change will always be there. You need to embrace it. You will always need to change. Because if you stop doing that, you'll lose, basically.

So like these guys at Red Bull, they put a 17-year-old guy in the car. Maybe he was even 16 years old. It took a lot of guts to do that, but it paid off. They had their first victory in a long time.

Gebrian uit de Bulten

Didn't even have his driver's license.

Vincent van Kooten

Yeah. He didn't even have his driver's license.

And we are doing the same thing. We have a risky business. We need to make sure it goes fine, but we don't step back on improving and changing the way we work.

That's it. So thank you all.

Gebrian uit de Bulten

If there are any questions, then we are around still for probably one hour. Just grab us. We're quite open.

So, thank you all.