What's It Like When DevOps Is Just Another Day at the Office?

London 2018

Director IT Innovation Process & Agile Coaching · bol.com

bol.com is the largest (online) retailer in the Netherlands & Belgium.

Frederieke Ubels joined the company when it was 2 years old, in 2001, as head of their bookstore. In 2008 she switched to IT and helped implement Scrum across the whole company, then around 100 people. Since then they've grown very fast, to 1,400 people.

At the moment she's part of their IT team and the bol.com management team. Within these teams she's responsible for their business & IT innovation process and agile coaching. She also participates in or leads large IT programs and projects. To make sure their innovation process is as effective and efficient as can be, she focuses on removing bottlenecks, on the business side as well as within IT.

In the past years she was responsible for their transformation from 'just agile' to DevOps, the only way for the company to make sure that they could grow & scale any further.

Full transcript

Frederieke Ubels

My name is Frederieke Ubels. I'm from the Netherlands, from a company called bol.com. Does anyone in the room know bol.com? Yeah? Okay. All speaking Dutch?

So, bol.com is the largest retailer in the Netherlands. We're a pure player online, selling everything from books and entertainment to baby apparel, and from stuff for home and living to everything in health and beauty. And to people abroad, it's easiest to describe ourselves as the Dutch Amazon.

It's our mission to be the best place to buy for all consumers in Holland and Belgium, and to be that, we also have to be the best place to sell for our partners and our retailers. So we're not just a store, we're also a platform for all other retailers to sell their stuff on.

We started in 1999 as a startup of 10 people in a portacabin, and ever since, we've grown very fast in all kinds of aspects and dimensions. So in the number of categories in our store, number of products in our catalog, but also, luckily, in the number of customers, the number of orders, in revenue, and in people in our building. At the moment, we have 1,400 people in the World Trade Center of Utrecht, in the center of Holland.

When you're growing at a pace like that, you're transforming all the time, whether you want to or not. From the outside, because you have to keep up with all the changing demands of your customers and your market, but also from the inside, because you have to make sure that you're set up for success and that your processes and your organization and your people are fit for purpose for all those changes on the outside.

And that's okay. It's evolution. It's continuously improving. It's being a little or a lot better every day. And it also offers a lot of opportunities for people in the company, and I'm a living example of that.

I started in 2001 as a head of our books department. So my background is not in dev, and it's not in ops, it's in business. And in 2008, I crossed over to the bright side of IT, just in time to witness our biggest and last-ever waterfall project, and also in time to be part of our transition to an Agile way of working with Scrum.

And ever since, I've been responsible for our way of working within IT. So not the why or the what, but the how. How do we go from a brilliant idea on the business side to having that working for our customers in production? How do we go from the aha to the ka-ching?

I think in some companies, I would be a process manager or a transformation manager, but somebody once called me a catalyst, and I think that describes it very well.

My role started and is still based within IT because, traditionally, IT is seen as the biggest or the most visible bottleneck. But I think you and I know that what looks like an IT problem most of the time is a business problem or a company problem, so I spend a lot of my time with business people as well.

But talking about IT, we have in-house IT people, systems as well as infrastructure. We're with about 400 people right now. They work in approximately 70 Scrum teams on a lot of services that all run in our own data center.

A little over three years ago, we went through a major transformation that was planned, and that brought dev and ops closer together. And I want to start by showing you how that works out three years later in day-to-day life, and also how it helped us grow and evolve our company further.

And after that, I will dive in a little deeper into this transformation and also in what started it. And in the end, we'll have a look at some challenges that we are still facing right now.

I want to start this by looking at it from the perspective of our most important entity in a company, and that's the team.

So I want you to meet Team 4B. Team 4B is responsible for part of our financial processes, together with the financial process business team, of course. And Team 4B is a product owner, it's a couple of software engineers, and they have a test engineer. And they work closely together with some people, like a business analyst, an IT engineer, sometimes a project manager or a coach from the business side.

Team 4B is a proud owner of a couple of services, independent services. For example, the FFT, our financial fulfillment service. And correct me if I'm wrong, Marcel, but it's responsible for matching incoming payments to customer invoices and orders.

And they build new features into those services. That's why we call them a feature team. But they also do changes and bug fixing, of course. And they're responsible for deploying those new features and bug fixes, et cetera, to production themselves.

And they can do so because their services are really independent. They are compatible, they're resilient, and they're always online deployable without downtime.

During the daytime, they monitor their services. They take a look at how they perform, if they're up and running, if their response time is okay, and they also build dashboards for their business stakeholders so that they can monitor the most important functional KPIs on a day-to-day basis. For example, the percentage of invoices that is matched, or the number of people that complete a transaction in our webshop.

So this is Team 4B. And together with the other Teams 4, because there are more Teams 4 that all work on the financial process, they form a fleet. And those teams work loosely coupled and tightly aligned, just like their services do. And they share a team manager, but it's a vacancy right now, and they share a floor in one of our buildings together with their business department.

The financial platform operations fleet is not our only fleet. Together with a couple of other fleets, they form a space. In this case, the retail platform space. And together they have an IT director, who is called the space director. He's over there. And one or two business directors, and they all work in the same business domain.

And more importantly, they share an SRT, a space reliability team. It's the team with the blue people in the middle. It's a team of former ops people that help the fleets and the teams to run their services as autonomously as possible.

So we don't have DevOps teams, and we don't have one ops guy in every team. But we have SRTs, space reliability teams, instead. And that's because we think that the ops capability is something to treasure and to nurture, and we don't want it to constantly be overwhelmed by the majority of dev priorities that we have and the dev topics. So that's why they're in a separate team.

During the daytime, Team 4B runs their services autonomously, but during the night, the space reliability team takes over. And that might sound a bit strange and totally not DevOps, not having devs woken up at night, but there's a reason for that.

It takes a certain kind of person, we think, or a certain kind of personality, to do the right thing when you're woken up in the middle of the night. And, to be honest, we don't want all our dev engineers, especially some of them, to touch their systems in the middle of the night with a sleepy head, because there's only so much of an ops mind you can put into a dev body.

But there's also the matter of financials and of efficiency. It's very costly, it's very expensive to have everybody on call in an on-call pool, also when they're not really on call or when they're not woken up at night. So that's how we do it right now.

And, well, don't forget, those guys are all space buddies, so if somebody in the SRT is woken up in the middle of the night because of somebody in Team 4B, they sure know where to find them the next day.

So this is what our total IT department looks like. It's now almost three years ago that we toppled it from a dev and an ops department to spaces like this. So we have three feature spaces with the fleets and the teams in it and the SRTs, and they are aligned to business departments.

And we also have a technical platform space. It's on the bottom. And as the technical platform space director always tells us, they're the ones that keep the light on and keep the water running. So there's an infrastructure team in there, a tooling team, and also a small security team, and they support the rest of the feature spaces.

This spaces model helps us to localize or compartmentalize dependencies that we have. So instead of all of the company to one department, there's now spaces with one or two business stakeholders. And together, those stakeholders and IT are responsible for what's happening in the fleet, so for the people as well as the product.

So they're the ones that prioritize what the fleets are working on, which fleets to grow and which to stabilize, but also, if necessary, what the priorities of the SRTs are.

And with this shared responsibility comes also shared ownership. So business stakeholders feel very much owner of what we used to call IT priorities or IT projects, like security or performance, and whether or not we're ready for the holiday season, so the most beautiful, the most wonderful time of the year, when we do almost 25% of our revenues in just six weeks' time.

On the team level, business and IT together decide when to release and what. And they do so more and more often, as you can see in this graph. It's our deployment frequency per month over the last four years. It has increased very much, and for us that means that we have smaller releases, and that they are less risky, and that we have a faster and faster time to market.

And it also is a metric or an indicator for flow. If this deployment frequency stagnates, then it means that there's too much release overhead and that we should investigate what bottlenecks caused that.
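As a minimal illustration of that indicator (a sketch, not the tooling used at bol.com), the snippet below counts deployments per month and flags months where the count did not rise. The function names and the "did not rise" rule are assumptions made for this example; the talk only says that stagnation should trigger an investigation.

```python
from collections import Counter
from datetime import date


def deployments_per_month(deploy_dates):
    """Count deployments per (year, month), sorted chronologically.

    Months without any deployment do not appear in the result.
    """
    counts = Counter((d.year, d.month) for d in deploy_dates)
    return dict(sorted(counts.items()))


def stagnating_months(monthly_counts):
    """Return the months whose count is not higher than the previous month's.

    Assumed rule for this sketch: 'stagnation' means no increase
    compared with the preceding month in the data.
    """
    months = sorted(monthly_counts)
    return [
        current
        for previous, current in zip(months, months[1:])
        if monthly_counts[current] <= monthly_counts[previous]
    ]


if __name__ == "__main__":
    # Hypothetical deployment dates, for illustration only.
    deploys = [
        date(2017, 1, 10), date(2017, 1, 24),
        date(2017, 2, 3), date(2017, 2, 14), date(2017, 2, 27),
        date(2017, 3, 6), date(2017, 3, 13), date(2017, 3, 20),
    ]
    per_month = deployments_per_month(deploys)
    print(per_month)
    print(stagnating_months(per_month))
```

A flagged month is only a prompt to go and look for release overhead; it says nothing about the cause.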

I hope you noticed that we deploy all year round, also during November, when the impact might be very big if something goes wrong. But we trust our teams to deploy also in that time period, where we used to have a deployment freeze or a release freeze until, I think, two years ago.

So less risky doesn't mean that there's no risk at all and that never anything goes wrong. Some things go wrong, and when they do, we do postmortems, and we try to keep them as blameless as possible.

And our board of directors has been a great help in that over the past years, because in front of the whole company, in all-team meetings, every now and then, every few months, they hear our confessions of things that go wrong, of big outages, but also things that nobody noticed. And by that, they want to show us that it's okay when things don't go as planned, as long as you fix it as fast as possible, as long as you learn from it, and you share all those learnings.

And there have even been times that all kinds of teams were sending each other emails that started with, "I have a confession to make," and that was their way of sharing things.

By the way, we also do postmortems when engineers leave the company, because we don't like that, and it's also something that went wrong. And it's something that might have even more impact than a big outage, especially for the long run.

So we try to learn from failure, but also from success, of course, and from experiments. We even made our own version of PDCA. It's called CISL, as in, "Hey, let's do CISL together," or, "Have you been CISLing that first?" And it stands for Continuous Improvement through Structured Learning, and that's exactly what we aim for.

So it's all blue skies, no clouds, and the land of the Teletubbies and the unicorns?

Well, it depends. Lots of times it is, but not always. We tend to focus more on the challenges that are still before us, and I will tell you a little bit more about that later.

And it hasn't been like this forever. I said three years ago, we went through a major transformation, and it was quite hard work to get this all up and running.

So to show you a bit how that went, I want to take you back in time. And we go back in time to the summer of 2014. And this summer, those were the best days of our lives. We had just completed the coolest, the biggest project we'd ever done. We had insourced our web operations by building our own data center. And it was the coolest data center we could imagine.

Until then, we had outsourced our production environment to a third party. And although those guys did a very good job in teaching us about reliability and about uptime, it also restricted us in many ways. For example, we were only allowed to deploy once every four weeks. And we had trouble getting the best engineers on board because, hey, when you're not allowed to touch production, what's the fun?

So all in all, there was a big wall between dev and ops, and we wanted to get that out of the way.

And the summer of 2014 was also the period in time after five very successful years of Scrum. It had proven to be very scalable, and in those years, we had grown from, I think, eight or nine teams to 40 teams. And our IT value chain, it looked a bit like this, and that also posed a problem.

Because we had a feeling that, yes, we were scalable, we could still grow, but we couldn't grow very effectively and very efficiently. There were all kinds of teams working on the same environments. They were working in the same tools. Some of them were even working in the same code base. There were dependencies all over the place. You were always waiting for someone. Someone else always broke something.

So yes, we could grow, but it was very frustrating for people. You heard it when you talked to people, and you could also see it in our metrics, although I realize that it's very difficult to see on this screen.

But there's a vague line in the background. That's the number of IT people, and it was growing very consistently. But the number of user stories, our output, wasn't growing accordingly. And I know it's not the best metric for it, but it was very ominous at that time.

Another problem that we encountered was that we found that it was very difficult, more and more difficult, to release our sprints. And at that time, time to production was our most persistent bottleneck. It's the time between the end of the sprint and releasing to production.

And as you could see in the talk before, there can be a long period of time between, and that was also the case in our company. So when the sprint was finished, we still had to do some performance and load testing. We had to do some integration testing, regression testing, business acceptance testing. So all in all, it took about three weeks, and then we had to deploy to production in one big bang, everything at the same time, in the early morning with downtime.

And three weeks was kind of okay when you have a rhythm of four-week sprints, but when it's more than four weeks or five weeks, there's a queue of sprints and you can't release anymore.
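The arithmetic behind that queue can be sketched in a few lines. This is a simplified model under stated assumptions, not a description of the actual release process: a sprint ends every `sprint_weeks`, the post-sprint testing and release steps take `release_weeks`, and only one sprint can go through them at a time. The function name and the model are hypothetical.

```python
def release_delays(sprint_weeks, release_weeks, n_sprints):
    """Weeks between the end of each sprint and its release to production.

    Assumed model: sprints end at regular intervals and pass one at a
    time, in order, through a release phase of fixed length.
    """
    delays = []
    pipeline_free_at = 0  # week at which the release phase is free again
    for i in range(1, n_sprints + 1):
        sprint_end = i * sprint_weeks
        start = max(sprint_end, pipeline_free_at)  # wait if still busy
        pipeline_free_at = start + release_weeks
        delays.append(pipeline_free_at - sprint_end)
    return delays


if __name__ == "__main__":
    # Three weeks of release work fits inside a four-week rhythm.
    print(release_delays(sprint_weeks=4, release_weeks=3, n_sprints=3))
    # Five weeks does not: every sprint waits longer than the one before.
    print(release_delays(sprint_weeks=4, release_weeks=5, n_sprints=3))
```

In this model the delay stays constant as long as the release phase is shorter than the sprint, and grows with every sprint once it is longer, which is the queue described above.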

And we had no clue how that could have happened because we had just built our own data center. It was very cool. It was built with the state-of-the-art DevOps principles in mind. So infrastructure as code, configuration management, continuous integration. We had everything. We had very cool tools, and it's 2014.

And you see the poker faces of the system engineers. They are very proud of what they had accomplished.

So in our minds, we had dev, we already had dev, we had ops, so we could sit back and watch the DevOps automatically happen. Well, it didn't. And it turned out that when we insourced web operations, we also insourced the wall between dev and operations.

And it didn't help that everybody had his own vision of what this DevOps thing was. Was it automation? Was it culture? Was it organization? Was it tooling? Could be everything. And I remember that there was somebody who said, "Well, DevOps is like teenage sex. Everybody's talking about it, but nobody has ever done it."

So we had to find our own journey, and we had to find our own metaphor, and preferably not a teenage sex metaphor.

And we found a great metaphor in this: to put a man on the moon. That's what we called our project.

And, well, I don't know who of you was there in the '60s. I wasn't, but it must have been, wow, it must have been wonderful that Kennedy said, "Hey, we're going to put a man on the moon." It's a very clear goal. You can see it there. You can see it every night. You don't have a clue how to get there and what it might look like, but it appeals to creativity, to inventiveness, to commitment of everybody involved to get there.

And for us, our goal was not doing DevOps. Our goal was to be scalable again in a productive way and have fun along the way. That was our moon.

So, with this in mind, what did we do? Where did we start?

Well, first of all, we soaked ourselves in DevOps inspiration. So we went to conferences, to DevOpsDays. I listened to hours of podcasts on the bike. And we also looked at other companies that went through hypergrowth, and Spotify was a very good example of that, and they made beautiful movies about it.

So we watched those with all of our teams. So I guess I've seen the engineering culture movies 40 or 50 times or so. And we asked the teams, "Hey, describe your moon. What does your moon look like? Imagine it's two years from now, we're twice as big, and everything works fine. You can be productive. You can really have an impact."

And it was a good thing we asked them because they came up with a lot of different things than we had already thought about. And it turned out that the clue to being a better company for us was having fewer dependencies and more autonomy.

So then, after that, we could ask them, "Hey, what are you waiting for?" And then the movement really started. And there were teams that needed no more encouragement. They jumped right in. They made their own flight plans to the moon by killing all their bottlenecks. And every time they reached a milestone, we celebrated with very official-looking NASA certificates and with pieces of cake and bitterballen, of course.

But there were also teams that were more hesitant because autonomy also can be quite scary. It's not just the right to do everything you want to do or think is necessary to do, but it's also the responsibility for all the things, and you need the capabilities that go with it.

And around this time was also when we started with the SRTs, with space reliability teams, and they were a big help in transferring knowledge and responsibilities from operations to the dev teams. And also the frontrunner teams, the ones that jumped right in, they shared everything they learned in storytelling sessions or over lunch.

So in this way, a kind of phased approach evolved of teams first starting to build really independently, so making their services compatible, resilient, and online deployable. And after that, making sure that they could deploy them themselves to production. And after that, running them themselves.

And if you've built it and you've deployed it and you run it, then you own it and you love it. It's your baby. So we call it, "You build it, you run it, you love it." And the acronym for that, YBIYRIYLI, is still the way we describe our way of working.

So by transferring all this knowledge and all this work from ops to dev, we created room for ops to get out of the firefighting mode and to do their own work again, get out of the ticket business and automate even more. And they did. And it's still the basis for our scalability at this moment, three years later.

Because in the past three years, we've grown very fast in biz and in dev, and relatively moderately in ops.

Also, in terms of productivity and scalability, we were back on track. The number of user stories was more in line with the number of people in IT, and our quality was also improving. The number of incidents was going down and down, and halfway through 2016, we even stopped measuring incidents, and we switched to mean time to recovery because it's more relevant for our teams.
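For readers unfamiliar with the metric, mean time to recovery is the average time between the start of an incident and the moment service is restored. The sketch below shows one common way to compute it; the function name and the incident data are hypothetical, and the talk does not say how bol.com defines or records the start and end of an incident.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average recovery time over (started, recovered) datetime pairs."""
    if not incidents:
        raise ValueError("at least one incident is needed to compute MTTR")
    total = sum(
        (recovered - started for started, recovered in incidents),
        timedelta(),  # start value so that timedeltas can be summed
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Hypothetical incidents, for illustration only.
    incidents = [
        (datetime(2016, 7, 1, 9, 0), datetime(2016, 7, 1, 9, 30)),
        (datetime(2016, 7, 8, 14, 0), datetime(2016, 7, 8, 15, 30)),
    ]
    print(mean_time_to_recovery(incidents))
```

Unlike a raw incident count, this number improves when a team gets faster at fixing things, which is something the team itself controls.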

So all of this is not just great for IT, it's also great for our business because IT is a driving force behind our innovation and our growth. And if it were not for this man on the moon, we couldn't have had those revenue growth figures that we have had over the last three years.

In 2016, we transferred our mission controls back to the space directors. They're over there. And the day-to-day mode was on, but it didn't mean, of course, that our transformation mode was off because, well, if you keep on growing, you're never done transforming.

And this spirit of continuously hunting down dependencies and bottlenecks is still in the teams. Some of them call it the "You Love It" phase. Some of them call it Mars. But it's still going on.

So that brings us back to the here and now. And I said we still have some challenges before us, and I'm very curious to hear from you if you could help us with those, maybe over lunch later.

One of them is that, yes, we have a lot of services, but we also have legacy. We're 18 years old, so there's two big legacy systems that we have. One of them is our order management system, and Team 4B is taking that down step by step. But the last bites are, well, they're tough.

And the other one is our webshop. We're not taking that down, but we managed to deploy it first every four weeks, then every two weeks, and now we deploy it daily and automated. And, well, we don't know where it goes from there.

And another thing is the cloud. In 2014, everybody was already talking about cloud, and we had just built our own data center. And that was okay because we could kill most of our bottlenecks in there. But times change, and we're in the middle of a big cloud project right now. And that's also going to bring a lot more different ways of working, more autonomy. But it also brings back our biggest elephant in the room. So what do we do with the 24/7 for devs? We're not finished discussing that.

And last but not least, a challenge that's not just technical, but more about way of working and culture: how do we keep everybody motivated? How do we keep everybody in? And how do we stay the best place to work, where you can be the best version of yourself and also do what's best for the company?

And we know that autonomy plays a big role in that, but it's also about alignment. And how can we stay away from command and control on the one side and chaos on the other side, and still give everybody focus and let everybody know what they're working for?

So I think together with the cloud, this is going to be our next big step that we have to take.

So now I could close off with a couple of takeaways that maybe you could use in your journey to the moon or wherever you're going, but I will not be doing that. Instead, I want to tell you a story. And this story for me really captures the feeling of DevOps.

And it happened last year at Black Friday, our best day ever, as we have every year. And we were at the office with a couple of hundred people who just wanted to be there.

And there was one room in one of our towers, and Team 4B was there together with some people from ops and from the business and from management as well. And they were all watching how our most important payment method, iDEAL, was very flaky at that moment. Interfaces of the banks were down a lot. They were overloaded.

And as a result, some of our customers didn't get a confirmation of their payment, so they paid again and again, or they started calling our call centers.

Now, two years ago, three years ago, in the old days, there would be an ops guy watching technical monitoring and saying, "Hey, it's down a lot." And he would look up at the highest-paid IT person in the room. He's there. And he would say, "Okay, let's shut it down."

But not this year. This year, there was the business owner of our checkout sitting together with the customer service manager, and they were watching the queues in our call center. And they had decided together that they would shut iDEAL off if the queues got longer than a certain number of people, or the waiting time for customers got longer than a certain number of minutes or seconds.

And it wasn't, so we stayed in business.
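A decision rule like the one agreed on that evening can be written down very simply. The sketch below is purely illustrative: the names `ShutoffRule` and `should_shut_off` are hypothetical, and the threshold values are made up, since the talk does not mention the actual numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShutoffRule:
    """Business-owned thresholds, agreed on before the peak (hypothetical)."""
    max_queue_length: int    # people waiting in the call center queue
    max_wait_seconds: float  # waiting time for a customer


def should_shut_off(rule, queue_length, wait_seconds):
    """Shut the payment method off if either threshold is exceeded."""
    return (
        queue_length > rule.max_queue_length
        or wait_seconds > rule.max_wait_seconds
    )


if __name__ == "__main__":
    # Made-up thresholds, for illustration only.
    rule = ShutoffRule(max_queue_length=50, max_wait_seconds=300)
    print(should_shut_off(rule, queue_length=20, wait_seconds=120))
    print(should_shut_off(rule, queue_length=60, wait_seconds=120))
```

The point of the story is who owns the rule: the inputs are customer-facing measures watched by the business, not technical monitoring watched by ops.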

And at the same time, somewhere late at night, the ops guys started to pack their things and wanted to go to the bar. And at first, that startled the biz guys and the dev guys, because, "Hey, what are you doing? You're going to the bar."

Until they realized, "Well, we are running the show right now, so let them go to the bar."

And I think that that's DevOps all through the value chain, and that's the ownership that we want our people to feel and to enjoy.

And that's it. Thank you.