DevOps and Organisational Transformation From the Trenches

San Francisco 2017

This session will present a detailed case study on the organisational transformation SEEK has undergone over the last 4 years within its IT Department. Responding to rapidly changing market conditions, SEEK significantly expanded its IT workforce in 2014 to meet new challenges in a marketplace it had traditionally dominated. Through broadening its technology platforms and fundamentally changing the focus and the way IT engages the business and delivers product, it is now in a constant cycle of continuous improvement focusing heavily on decentralisation, autonomy and automation.

The presentation will focus heavily on our people, processes and the way we used technology to achieve our objectives, along with the numerous challenges we faced in this environment of cultural change. This is a presentation delivered from the coalface. It is not a prescriptive manual on how to “do DevOps” (there are more than enough of these in existence now); instead, it will take the audience on a journey through this four-year timeframe, not shying away from the mistakes we made or sugar-coating the difficult decisions we faced.

We start from a time when our Operations teams were hamstrung by a technical debt legacy, constrained to a single vendor platform, and living with constant system failures, reactive decision making and a hero culture that led to countless late nights, with production deployments numbering only 2 per month. We delve into how we set about meeting these challenges and when we had to change our approaches to keep getting better. We will look at how an ambitious project to get rid of testing environments led to our first foray into AWS and how, using the hard lessons gained from that experience, we pushed forward, focused on constantly improving the three key areas of people, process and technology, and eventually got to the other side with a now incredibly diverse technology platform that would have seemed like a fantasy when we started.

Finally we will explain why we are still not satisfied and how we still need to keep improving. This is not a “tech heavy” case study, nor a lecture on the wonders of DevOps. It's a warts and all story of organisational change, full of hope, suffering and redemption with broad appeal to both technical and non-technical people alike.

Full transcript

The complete talk, organized by section.

Andrew Hatch

Hello, everyone. Thank you very much for coming. My name is Andrew Hatch.

I work for a platform engineering team. I manage that team, formerly a DevOps team. If you want to reach out to me, you can catch me on Twitter at my handle down there.

The purpose of this talk today is, I called it "From the Trenches" because this is not a talk about how great or super awesome we are. This is actually just what we did: the journey that we went on, the troubles, trials, and tribulations that we went through, where we are today, and the learnings.

If you can't tell by my accent, I am not from around here. I work for a company called SEEK, based in Melbourne, which is down in the southeast corner of Australia. The climate is pretty similar to San Francisco, actually, so I'm quite enjoying it here.

The company's been around since 1997, and we have over 700 people who work in our little Melbourne office. We still run some systems out of an old data center in Sydney, but since 2015, we've been putting everything into AWS, where we literally have hundreds of systems and services.

We are known for building great websites, in particular for employers to post jobs and for employees to apply for jobs. That's our core business, but over time, we're moving more into the human capital management market. At the moment, we are still the number one job site in Australia and New Zealand. We do have an office in Auckland, New Zealand, as well.

In recent years, we've actually started investing in other businesses in Southeast Asia: JobStreet in Malaysia, JobsDB in Hong Kong, Zhaopin in China, and also Bdjobs in Bangladesh. We also invest across the globe, in Catho in Brazil and OCC in Mexico. This is all around similar companies that have our philosophies, where we think that we can help them and they can help us.

In our 20th year, we managed to generate a billion dollars in revenue. That's Australian, not American, so it's not as much as you think. But ultimately, it was great for us. It was our best year ever and pretty good for a little company down in Australia.

In Melbourne, though, this is what we focused on. We currently run 15 programming languages in production and supporting production systems. We run things on a mix of Windows and Linux. We favor small teams, the two-pizza teams you talk about, with APIs that share information. We don't like teams building huge class libraries that then have huge version dependencies and deployment dependencies and things like that. That's not what we like.

We did over 1,000 production changes in September, which was a massive deal for us, and I'm going to go through exactly why that was such a big deal. Those changes were done in real time, while people were actually using the site, browsing the sites, et cetera. It says 90 on the slide, but I checked just the other day: we have over 100 accounts we currently manage in AWS, too, which is quite extensive for the size of our workforce. Our IT team is only about 120 people.

This talk, as I mentioned, is not going to be about how awesome we are. It's going to focus on this journey and the impacts it had to our people and our processes. That's the main part. Our technology is also going to factor into this, too.

If you are worried that I'm going to tell you how to do DevOps, don't be. I'm not going to do that to you. What we are going to look at is the last four years and focus in on the key points in those last four years.

Let's get started. In 2013, we were an agile development shop, so we did understand the virtues of Scrum, Kanban boards, and things like that. We had cross-functional teams, but those teams were only for product and delivery. We had a product manager working with a developer, working with a tester, working with a UX person, but just for product and delivery.

To think of what this scenario looked like back then, think of every product team as a little castle on an island. They worked together, but they had a lot of dependencies among each other, and because they were all in isolated little castles, they really didn't exchange information very well. It could almost be considered like carrier pigeon. It wasn't really automated and really slick like it is today.

The other problem was our operations teams lived on their own island, too. Very similar scenario. There was a great amount of distance between them, and actually getting anyone to see eye to eye was really hard.

Looking at this model that you can see here, it's like what we're saying to our people is, "You're mostly free to build your code and test it, but not to deploy it or support it."

Which led to a lot of challenges, namely that because ops were in control of everything, we had a deep love affair with a particular vendor. We built everything on their platform. That's not a negative comment on this vendor, by the way. It's just that's what ops mandated.

What it did lead to, though, was us building a giant monolithic system. For those of you familiar with 2001: A Space Odyssey, I couldn't think of another thing for a monolith.

The problem was that it created this bad culture. We had a superhero culture in product and delivery, so when something went wrong, there were four or five people that you just went and spoke to, because if you didn't get their okay, you wouldn't touch it. In operations, it was like a firefighting department. They just ran around from one fire to the other.

It led to other challenges, namely that ops policed absolutely every single change that went out and asked 120 questions before things got out, because they saw it as a grenade coming over the wall every single time there was a release. I'm serious.

And every time product and delivery said, "Oh, look, we want to bring out some new... We've heard about this thing called RabbitMQ, and we've heard this," ops were like, "Don't want to hear about it. Doesn't run on that platform. Doesn't do what we say it has to do. We don't want to know about it." As you can imagine, change management processes were a bit of a battle.

This sounds like a familiar story. I'm glad this is not an Australian thing.

The worst problem, though, was that our development test environments were always breaking. Why? Operations policed them, too. So if you were a developer back then at SEEK, you had control over your laptop or your desktop, and that was it. That was really it. That's all you could control. Not great.

If we think about this in pure strategy terms, we had what we would refer to as a chain-link logic system. If you think of a chain, a chain's only ever as strong as its weakest link. It doesn't matter how strong you make certain links. If you're all agile in one area, if your operations team is still back in the dark ages, it's not going to work.

You build up inventory, or batching, batch sizing, as we sometimes call it. One release is depending on another release that's yet to come out, and then this one broke because that one wasn't updated with the next version, and so on. Similar story. That's just to get in a staging environment, let alone trying to get into production.

So we had to change things. We were only doing two deployments to production a month at this stage, which for SEEK at the time, we were the market leader. Thankfully, we still are. I've got a job. But two deployments to production a month wasn't great, and we knew we had to change.

The first thing we said, looking at it, is this is a product of our culture. If you have operations teams that prize stability and resiliency and are purely focused on costs, this is what you're going to get. You're going to get big-bang projects, huge releases, and you're going to have a command-and-control hierarchical management style that's going to try and keep the whole thing under control. So, business as usual.

To change, we needed to change how our people work. That was the main thing. At least shift these castles onto the same island.

In our processes: where are our bottlenecks? Where are our constraints? How can we reduce the time it takes to do things and the waste that we are producing, or the technical debt?

Lastly, we were always in a data center, a data center that we had to deal with. It was old, clunky, had challenges with the vendor who was operating things. Thankfully, a huge American company opened a data center in Sydney in 2013, and we thought, "Woo-hoo. Finally, we can update our platform, and we can have more versatile infrastructure, more elastic infrastructure, and hopefully reap those benefits."

But then 2014 came around, and the wolves were sort of at our door, because for those of you who might have heard of companies like Indeed or LinkedIn, they started entering the online employment marketplace and made their presence known in Australia. They're not the only ones. There are quite a few.

So we had to double down on our competitive advantage. What has made us the market leader? How do we strengthen that? Let's not try and do everything that everyone else does. Let's just keep our brand strong.

We upped our headcount by 40% in a single year, which meant we had to up our operations count at the same time, too. Our clunky old deployment system, how was it going to serve all our new people? It wasn't. So we tried to re-platform them. The analogy is, let's try and make them into trucks that are delivering product rather than a clunky old conveyor belt.

On the last slide, these development test environments, how are they going to support all these people? They weren't. So we had this idea: why don't we shift the development test environments into AWS?

All sounded like great ideas at the time, and they were.

But one of the things that we wanted to do was when we were going to get more operations people, we didn't just want to hire more sysadmins. We wanted to find some unicorns, specifically developers with a great ops mindset, or ops people who actually understand what version control and continuous delivery pipelines mean.

That took us a long time. We ended up finding one gentleman from Spain, another gentleman from New Zealand, another gentleman came by way of Iran. So we had built quite a multicultural team because talent was hard to find back then for us.

The other challenge we had was that the deployments were setting the monolith on fire because there were so many. Of course, what this meant, it set the monitoring system off all day and all night. If you were on call back then, you would have made a great extra on the cast of The Walking Dead, because you were like a zombie. You'd come into work, you'd be on 10 coffees before morning tea, eyes sunk back in your head. It was pretty gruesome.

The other idea, we thought, well, why don't we put the monolith into the cloud? Because you can do that, and that must be great. We'll create the same IP addresses and the same host names, and we'll just clone it out like sheep. You just want another environment, you hit a button. Sounded like a great idea.

The problem was we had a monolith, and monoliths don't work in the cloud like they do in a data center. So these environments started springing up everywhere as devs and testers just created more and more and didn't turn them off, and then decided, "Oh, great, I can get an M10 20,000x-large instance. I'm going to put that in as my web server, because we're Google, right?"

It had a very, very, very, very negative effect on our cost. So we got bill shock. We survived. It did set off alarm bells in our finance department, and we kind of pretty much concluded that lifting and shifting your data center in the cloud is not a great idea. It can turn into a monster really quick.

But we also realized, too, this thing is not a switch. You don't just turn it off or on. But we did learn a lot from this process. It was hard. It was a real grind. We worked crazy hours, but we were up to 100 deployments a month. So something's obviously working. We're not sure what it is, but it's working.

The thing we learned was that we're in a data center, and when we have a monolith in a data center, it's like having a building on foundations that are just crumbling. If you just keep trying to add more levels to that building, it's going to fall down.

But the thing is, we need to free up our ops people because we wanted to shift more into the cloud. So how do we do this?

Well, we have to find time. That's our biggest challenge in any organization. Where do you find the time? You're not just going to get time. You have to find ways to actually explore and find it.

Two ways we thought: unplanned work. Unplanned work is the killer. That's when the pager goes off, you drop everything, you lose two hours of your day, or someone just comes up to your desk and says, "I need a deployment. I need a deployment." It's like, this is crazy. There goes another hour of my day, or another hour of my day.

The other is boring work. There's this thing that sometimes in these hierarchical structures, it's like, "Well, no, you can't do that. That's my job. You can't do that." And it's like, but I just have to push a button on a tool. That's all I need to do. So we thought, where are we actually doing things like this, and how do we get rid of them?

The first thing was our people. Moving our ops guys into delivery was working. We had 100 deployments, but they often just had their own vertical on the Kanban board. So we want this to start flowing through. How do we do this?

We took a different approach, and we said, why don't we turn them into coaches and actually show devs how to do these deployments so they can drive the trucks themselves? Here's the button, push the button. There's the pager, there's a monitoring system. Just watch it, and if nothing goes red and things don't fall over, then you're fine. You're good to go. You can keep doing that.

The other was that our monitoring system was turning us all into zombies. Well, why don't we just get rid of this thing and actually build a monitoring system with a brain? Now, I think there are tools out in the marketplace for this, but we didn't have them back then. So we just brought in a new system. We tuned it and tuned it. We wrote scripts that it can trigger to remote into machines, to recycle services, to clean up disk space, ultimately put the fire out. That got rid of a huge amount of unplanned work.
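The idea of a monitoring system that triggers scripts to put the fire out can be sketched roughly as below. This is only an illustration, not SEEK's actual tooling: the `Alert` type, the check names, and the handler functions are all hypothetical, and the handlers return a description instead of remoting into a machine.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Alert:
    host: str
    check: str  # hypothetical check name, e.g. "disk_full"


def recycle_service(alert: Alert) -> str:
    # A real handler would remote into the host and restart the service.
    return f"recycled service on {alert.host}"


def clean_disk(alert: Alert) -> str:
    # A real handler would remove old logs or temp files on the host.
    return f"cleaned up disk space on {alert.host}"


# Known alert types map to an automated fix; the mapping is an assumption.
REMEDIATIONS: Dict[str, Callable[[Alert], str]] = {
    "service_down": recycle_service,
    "disk_full": clean_disk,
}


def handle(alert: Alert) -> str:
    """Run the automated fix if one exists, otherwise page a human."""
    action = REMEDIATIONS.get(alert.check)
    if action is None:
        return f"paged on-call for {alert.check} on {alert.host}"
    return action(alert)


print(handle(Alert(host="web01", check="disk_full")))
print(handle(Alert(host="web01", check="cpu_spike")))
```

Only alerts with no known fix reach the pager, which is where the reduction in unplanned work would come from.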

The other is infrastructure as code. We were predominantly a Microsoft shop back then, and then PowerShell DSC arrived, which is great for these kinds of things. But it was more than just a server. It was like, how are we actually deploying things like APIs? How are we actually doing the work that we're doing?

Ultimately, what we did is we were seeing times where three weeks to do a task was dropping to 30 minutes. This is where we were finding all the time that we needed, which we wanted to safeguard so then we could help support us moving to the cloud.

I do get that it's 2017 right now, and you're probably all running in the cloud, most of you, I assume. In Australia, that wasn't really happening in 2015 a lot. Some companies were. Little startups were, that's fine, but in more established organizations, this was a hard hurdle, and especially for a business that's grown up in the data center. All our services are built in the data center, all our processes, everything is orientated and geared to that. So how do we change this?

The first thing we had to do, we knew we had to devote more operations people to a particular castle. We had to pick a project that we knew was like a greenfields project, so not tightly integrated in the monolith, but sort of sitting off the side. We taught them how to monitor, we taught them how to build things, and we taught them how to deploy to AWS, and it worked.

We actually turned what was originally one of the worst on-call systems that we ever had into one of the best, which it still pretty much is today. We thought, "This is great. There's value here."

But this was a small thing. How can we do it better? So we thought, why don't we actually create a team that is not reliant on anyone, and we build all this ops capability in with the dev capability, and we can barely distinguish between the roles anymore?

Along with that, they still have to monitor things, but they have to use a more... Everyone calls them microservices. I call them right-size services, because I don't really know what the difference between a micro and a macro service is, so we call them right-size.

Ultimately, the other thing: you can choose whatever technology you want to run it on. Do you want to run it on Windows? Do you want it on Linux? Do you want to use Docker? Do you want to use something? You can do that.

But the other caveat that we put on this team was that you need a continuous delivery pipeline for every single release, which means when your code is checked in, it gets built, it gets tested, and it's in front of our customers, and it's done in 20 minutes.
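The shape of such a pipeline, with each stage gating the next and the whole run held to a time budget, might look something like the sketch below. The stage names come from the talk; `run_pipeline`, its return shape, and the way the 20-minute budget is enforced are assumptions made for illustration only.

```python
import time
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage], budget_seconds: float = 20 * 60):
    """Run stages in order; stop at the first failure or when over budget.

    Returns (names of completed stages, whether the release went out).
    """
    started = time.monotonic()
    completed: List[str] = []
    for name, step in stages:
        if not step():
            return completed, False
        completed.append(name)
        if time.monotonic() - started > budget_seconds:
            return completed, False
    return completed, True


# Stand-in stages; real ones would compile, run tests, and deploy.
stages = [
    ("build", lambda: True),
    ("test", lambda: True),
    ("deploy", lambda: True),
]
print(run_pipeline(stages))
```

A failed test stage stops the run before deploy, so a check-in only reaches customers when every earlier stage has passed.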

And it worked. Who would have thought? All of a sudden, the business thought, "Hang on a minute. You mean I can make this change and it can be in front in 20 minutes with no downtime?" Like at 10:00 in the morning when everyone... Sorry, no, lunchtime for us. That's when everyone who's on their lunch break and hates their jobs comes to our site. This is a really good idea.

So we thought, there's a lot of value here. One thing is we're not arguing anymore. Our deployment notifications are Slack posts and emails, and we do so many of them now we can barely keep track of them.

The other great thing is this massive chain-link logic system that we had for everything else was reduced to something small, a small chain, predictable links, equal size, equal measure. It became very quick and very predictable.

By 2016, we were up to 500 production changes in a month. So in the space of two years, we'd gone from two to 500, which was a great story. The business was very happy about this.

The other awesome thing is that we still have incidents, but our time to resolve them has fallen through the floor because the people who are actually fixing the incidents are people who wrote the code in the first place. There's no having to go up one hierarchy and then down the other for communication to work. The people are just there, and they're doing it.

And then we thought, well, maybe we could lift and shift the monolith now because we're getting all the experience. We canned that idea. I don't recommend doing things like that.

Our monolith looks like this now. We just take chunks of it at a time and reappropriate it in AWS. So if there's an extra week or two on the delivery time on a project, that's okay. It makes sense.

The best thing, though, we got what we wanted. We got time back. As these teams proliferated, we got even more time. So we started sending more infrastructure to the tip, or the garbage dump, I think as you call it here. We started to think better. We built in more automation into everything that we do. And you know what? In the end, we actually had an operations team that thought and acted like software engineers.

I read Google's SRE book earlier this year. I don't know if anyone's read it, but reading that, I'm like, "Yeah, this is great. We did something like Google." We actually thought, you know what? Software engineering, I'm from a software engineering background, that works in operations, too.

The best part, we got sleep. And when you've got a happy employee and a rested employee, they're happier at work, they're happier at home. In one year, we had four babies born in my team, which is something we're celebrating. But then as a manager, I have to balance people. Childcare is not working or anything. But anyway, I'm very happy for them. I've got two kids.

So out of all this, the second part: what did we learn out of all this?

The first thing that we learned is the cloud is designed for consumption, and ideally, it wants you to keep eating more and more of all the products and services that they provide. I went to re:Invent last year, and I was utterly blown away at the rapid pace that AWS builds things.

The catch is it bites you on cost, and we worked that out. We have spent a huge amount of effort in trying to control our costs. It's got great technology, but no one wants a monster truck when all they asked for was a little scooter. So there's some enthusiasm that you just might need to control with the more techie-minded people.

The other thing is, it's not invincible. That's an actual thing. If you check the SLAs, some services can be down for 21 minutes per month, every single month. That's something you have to architect for.
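The 21-minute figure is consistent with a 99.95% monthly uptime commitment over a 30-day month. The talk does not name the percentage, so that value is an assumption here. The arithmetic can be checked with a short sketch:

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime per period that still meet the uptime percentage."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - sla_percent / 100)


# 99.95% is an assumed SLA level, not one quoted in the talk.
print(f"{allowed_downtime_minutes(99.95):.1f} minutes")  # prints "21.6 minutes"
```

Chaining several such services together lowers the combined availability further, which is part of why this has to be architected for.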

The other thing is the cloud doesn't have infinite capacity. They say they do. Try getting a C4 large instance in the AWS Sydney data center in March of this year. You couldn't. Someone spun up, I don't know, 10,000 of them or something, and that was it. It was all gone. It was out of the pool.

The thing I'm trying to say here is that from a more traditional software solution, enterprise architecture, non-functional requirements, they are still a thing. They are still really, really important. They're just different in the cloud. You just need to think differently about how you do it. But they are disciplines that are still as valuable now as they were then.

Your finance department, your poor old finance department. Everyone gets swept away in faster deployments, and everyone's got all this great tech and everything else, and maybe your finance department's fine if it gets one bill. But what if you do a consolidated bill, and all of a sudden, they have a massive, massive 20-page document of things that they have to process? How are they going to know what to do with it?

You need to understand how tagging policies work. You might do separate accounts so they can capitalize or operational-expense things. I don't know what it's like in the U.S., but in Australia, the way that we deal with capitalized and operational expenses has different implications for tax and end-of-year reporting.
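As a rough illustration of why tagging matters to finance, the sketch below rolls a list of bill line items up by an owning-team tag. The line-item shape, the `team` tag key, and the `untagged` bucket are all hypothetical and do not reflect any real billing export format.

```python
from collections import defaultdict
from typing import Dict, Iterable


def cost_by_tag(line_items: Iterable[dict], tag_key: str = "team") -> Dict[str, float]:
    """Sum line-item costs per tag value; items without the tag are grouped."""
    totals: Dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


# Made-up line items for illustration only.
bill = [
    {"cost": 10.0, "tags": {"team": "search"}},
    {"cost": 5.0, "tags": {}},
    {"cost": 2.5, "tags": {"team": "search"}},
]
print(cost_by_tag(bill))
```

The size of the `untagged` bucket is a quick signal of how much of the bill finance cannot attribute to anyone.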

The other thing is reserved instances. When you reserve capacity, how does that work? Is that a capitalized expense or an operational expense? It might depend. You might need a tax ruling on something like that.

Your vendors. Don't allow yourself to become captive to your vendors. You need to be in a position to churn and burn your vendors, and I apologize to any vendors in the room, but I'm sorry, this is a thing.

Once you start doing this rapid pace of change that we talk about here, and especially if you then hook yourself onto AWS's rocket ship and you start flying away with them, they move fast. You need to move fast to capitalize on the changes that come out, and your vendors do, too.

So if you're not getting the value out of them, the last thing you want to be doing is signing a three- or five-year deal which locks you in, which locks your people in. What happens in the three or five years when everything has changed? That's something to consider.

So what have we learned about DevOps?

We've definitely learned it's all about culture. It's not a team, and it's organizational in focus. When we think about organizations, organizations are unique. If you look at a snowflake under a microscope, they're all different. So if someone says, "Here's the instruction manual. I followed it in all these other companies, and I'm an awesome consultant. You can follow it here, too, and it'll be exactly the same," or, "Here's a really great tool that I've put DevOps on," guess what? They don't exist.

Sure, you can inject some of this goodness into various links in your chain, but if it's still on fire right at the end, guess what? You're still in the same position you probably were before. You have to look at everything.

Build smaller teams and build more of them. I think the gentleman who spoke earlier from Nike talking about pizza teams, couldn't agree more. That is really important.

We used to have a DevOps team. We don't anymore. We actually went and spoke to our stakeholders, critically assessed and evaluated what we do, and we thought, you know what? We're actually a platform engineering team. That is what we do, with our focus areas around operations and site reliability. No one has DevOps in their title anymore. That's a great thing. We understand it's culture, and I got tired of being sniggered at and laughed at when I introduced myself as a DevOps guy at every meetup I went to.

The other thing is leadership. This is really, really important when you start evolving. You can have someone who sits there and tells you what to do, or you can have leaders who actively show people the way to do it. People who actually actively want to employ people who are smarter than them. Leaders who aren't obsessed with being the smartest person in the room, but they actually want to get the smartest people in there, and they see their job as helping them and helping their careers. It's a different mindset, but it definitely works.

The other is your focus. Everyone can be focused on cost and the bottom line. Try focusing on how much waste you're actually creating or reducing in the value that you're providing. Leads to much less reactive decision-making, I've found.

Culture. A blame culture doesn't work. If you're in a blame culture, you're not going to speak up. You're going to be quiet for fear of having your head chopped off if you stick it up. It's okay to fail as long as you learn from it. As long as you conduct experiments, you understand what went wrong, you do understand that you don't have evil people in your workforce. Everyone's trying to do a good job. No one wants to get stressed out and upset at work. Look after your people.

Balancing all of this, I like to think of it as, I read in a strategy book once, the tension between decentralized autonomy and coordinated action. This is really important. When you grant more autonomy, you do need to raise the technical capability of the people below you.

I think earlier in this conference, someone mentioned Captain David Marquet, the U.S. Naval submarine commander. This is a point he makes really strongly. If you haven't read the book Turn the Ship Around!, I strongly recommend it. It's one of the best books I've read.

So when we give an example like this, if we think of something decentralized, it's fine to use whatever coding language you want or whatever operating system. That's fine, as long as you're prepared to support it. But you know what? We're still going to coordinate around security. We're still going to coordinate around vendor management, so we don't have devs signing up massive deals with huge vendors, and also PII data.

For us, we have people's resumes, we have people's personal information. We can't afford to just throw that out to everyone. We need to ensure that we protect that, because there's significant brand reputation, obviously commercial damage, if we did.

Help. This is the other thing. We're not satisfied with where we are, and we know we need help in things. We don't need several people just to wear the same hat. We really encourage people, we call it professional development, not career progression, but professional development. How many more things can you do? Are you prepared to do? That's really great. We still need certain specialists, but ultimately the more useful you are around the organization, the more you grow your career, the more experience that you gain.

Strategy. A lot of IT departments and the way a lot of them work, in my experience, don't think very strategically. When I say strategy, I mean where you actually will do your problem diagnosis of your objectives. You'll evaluate your guiding principles. You'll have actions that you measure against and feed that back in and keep refining it. What I'm saying is, don't just roll the dice and think, "Well, this just might work." Actually have some planning around that.

The last thing is with your tech, you don't want to be chopping down trees with an axe when new chainsaws arrive. That's really important. But as I mentioned, the monster truck scenario, don't let it turn into a massacre either. So just keep a lid on certain things.

My final thoughts for all this, and thank you very much for attending. It's fantastic to be able to be here and present a story from our little company from the other side of the world.

Buy-in has to be driven top-down. That's really important. We're fortunate that our senior leaders of the organization really put their faith in us and their trust in us. That makes me feel great as an employee, to feel that people trust me enough, so I really want to do a better job.

The other is that a strong software engineering focus does not just exist in development. It exists in operations, too, and that's really important. When your operations people start thinking and acting like software engineers, it's amazing the amount of flow and throughput that you actually get.

My final thought really is you need really good leaders that keep you steered in the right direction. Leadership is not something that everyone is born with, but it's certainly skills that everyone can acquire and master. It's important that you have leaders who don't have egos, who don't want to brag about the number of direct reports or the size of their budgets, but they're just people that want to help those they're responsible for and get the best out of them, no matter what.

So thank you very much. I really appreciate it. Thank you.