From Dev Oops to DevOps

San Francisco 2017

Richard Fong

CICD Delivery Manager · Mitchell International

Erez Nir

CTO · Mitchell International

The Mitchell International Journey to DevOps. I had a long conversation about this with Gene Kim yesterday, and we have been talking over the years about me presenting. I wanted to reach certain milestones before I actually described our journey. We are not there yet, but we now have something I feel comfortable talking about.

Full transcript

Erez Nir

We can go either really fast or really slow.

Thank you for having me. My name is Erez Nir. I'm the CTO at Mitchell International, and we're a software company in San Diego. And this is the story of our journey, which we dubbed as "From Dev Oops to DevOps."

So a little bit about Mitchell. Mitchell is an enterprise software company. We deliver software as a service for the P&C industry, which is the property and casualty industry.

We have 23 of the top 25 insurance companies as our customers. We have multiple business units and multiple business lines, and of course, that comes with different product and software stacks.

We run about 100 million business transactions a year through our systems, and about $70 billion of spend on claims, which is out of a total spend of about $200 billion. So it's a big chunk of this country's property and casualty industry spend on claims.

Our customers are all basically managing all lines of insurance that you can think of, from damage to a car, to damage to a body in the car, to being hurt in a workplace with workers' compensation, pharmacy spend, and we are also starting to go global.

So our mission is mainly to help our clients deliver the best possible outcomes when they settle claims. And part of this is really helping their customers to restore their lives after a pretty challenging event. Think about it: a car accident can be a pretty challenging event. So we help restore lives, and that's how we connect with our mission.

We do it with a lot of technology. We have a lot of technology and intellectual property that we have developed over the years and a lot of our expertise. And Mitchell has been around for 71 years, believe it or not. We started as a book publisher, and now we're totally in the software business.

So this is our journey, and this is a journey of managing change. This is a journey of the speed of change, and this is a journey of, I would call it, organizational behavior and change. There were a lot of changes that we were required to make, and I'll take you through our journey, which started with... It's a few years in the making.

So the first episode, and I'll tell you this as a story with episodes. So the first one was stabilize, and it started... I joined Mitchell in 2005, and basically in 2007, I took over development and ops, and operations, and infrastructure.

So we were DevOps according to at least the naming convention, and that's how I started. I call it the red period. It was code red almost every week; if not every week, then every second week, sometimes every day. And these were code reds running left and right, and the reality is that people got used to it.

People actually got used to it, and I had one developer telling me when I said, "What the hell is going on?" They said, "Erez, don't worry. It's not about how many times we're down. It's how fast we can recover."

Needless to say, he's a former employee of Mitchell.

But this was my email. The email that went out, with "code red" in the big font, was basically saying the systems are down. There had been a nightly deployment before, and here we go down again, and customers are not happy. And of course, our teams were firefighting. Our teams were in firefighting mode almost daily.

Nightly deployments, you stay in the nightly window between 10:00 and 2:00. You wake up in the morning, if they don't wake you up at 4:00 in the morning because something went wrong. And then you go again to firefighting during the day, service degradation to our customers, so they're not happy and calling my CEO to ask what's going on.

And then we miss deadlines because people are not working on what they need to do. They would work on firefighting, and we lose productivity.

So that was the story, and it didn't take a while, sorry, to figure out we need to fix it, and the problem is us. We need to fix something fundamental. And we decided to go to WAR.

It's not war-war. It's WAR in weekly, daily, and monthly operations review. That's how we named it. We created this scheme. Because I was DevOps, I owned all the stakeholders.

Basically, we said every day at 4:30, we're all sitting in one room. We're talking about the incidents of the day, the deployments planned for the night. Once a week, we're going to watch over metrics and statistics and understand, are we getting better or not? And we'll rinse and repeat until we get better.

It took a while, I'll tell you. Stabilizing, if you saw yesterday the talk about moving a big ship, it took a while to actually stabilize this.

And the concept was we had this all... This is like on an intranet. Before SharePoint and Confluence and all these other tools, this was our dashboard. If it was red, it was a code red. We had an actual outage in production. If it was blue, it was just an incident. Sometimes we were able to prevent an outage, sometimes we were not.

All I wanted to say: give me a clean week. That's all. We called it a white week. It was our white whale. Just think about Moby-Dick. So this was our white whale. We were just chasing one week, and it took a while.

As I said, I joined in 2005. This started about 2007, and it took a while to get this white week. And once we got one, people said, "Wow, we can get used to that."

I promised the team that when we got a white week, I would open a very, very expensive bottle of champagne that I had kept in a fridge for a long time, and we were able to pop it in March of 2010. So that just gives you a feel for the timeline and how long it takes.

And one should think, okay, so this is nirvana. I got DevOps. We got stable. The teams are working. Everybody's happy. And as you know, and I know, everybody knows it, the only constant is change in life. So here we go.

And we go straight to episode two.

So at around the same time, we started looking at our engineering culture and said, how do we elevate this culture to be much more significant? A lot of our people are engineers. We needed to show them the value, the respect, trying to empower them. And part of it was also the conversations around Agile and Agile transformation.

I tried Agile. Actually, I tried XP in '97, '98. It was going well in very small teams. I had two or three teams in my organization at Mitchell that were doing Agile on a small scale. I just made sure that they are outside my office, so I can see that daily stand-up and... But we were not sure how it would scale.

We launched an event for our engineers. We call it DevCon. We launched it in 2011, and we said, "Let's get all of our engineers together for one day of content, keynote speakers, breakout sessions, and really put everything on the same theme."

And I remember in 2011, and I showed it to the team, I did not even mention the word Agile in my conversations, but our speakers did. Chris Fry from Salesforce, and we had speakers from Microsoft and Intuit, and they all started talking about Agile and the transformation that is now taking place in the industry.

And within a year, we launched five release trains. So this is a SAFe terminology of five release trains. We formed 60 Agile teams. We completed 15 PIs or PSIs, and we completed 73 sprints in one year. So this was a swift move, very fast, very quick, and again, you almost think like, I'm in nirvana.

We moved into a new building in 2012. We needed... it took a permission from God to put anything on the walls because it was so clean, and you know how it is when you move to a new building. Within a year, all the walls, this was our wall art. Because I always said, software is an art as much as it is engineering. This is our art. This is our work of art. It's going to go on the wall.

So this is how we did it, and we paved our own path. Basically, we adopted SAFe, and we adopted it, and then changed it, and then invented, and innovated, and improvised to make it work for us.

So it should've been awesome. And then comes episode three.

And episode three is, we're misaligned. We got one cog of these wheels to go really fast. Everything is relative, but compared with releasing once a year or once every 15 months, moving so fast that we release every 10 weeks was awesome. It was really fast, but nothing else was as fast.

We have many other processes that didn't run as fast as our development teams. So it felt from a speed to value, which was one of the things we were chasing, it felt like we have a roadblock. Perfect road looking ahead, but somehow the things are not moving fast enough. Our quality was not meeting the demand that we wanted to see, and we started seeing some things breaking in the process.

At that time, I was looking for another answer.

I was lucky because to our DevCon, on DevCon 2, we had Jez Humble. Lucky me. And on DevCon 3, we had Gene Kim. Lucky me. And they wrote books, and I read the books, and I listened to the books, and it was awesome.

So the first book was Continuous Delivery, and the second book was Gene Kim's The Phoenix Project. And as I read them, I basically started to think, okay, clearly we're not a unicorn. We're an established enterprise with legacy code. We're horses because that's what they call it. That's fine. So we are a horse.

So how do we make it work for us? How do we make this whole movement of DevOps work for us? And the evening before the DevCon that Gene joined us, Gene and I had dinner, and I asked him that question.

And guess what happened?

Gene Kim

And maybe to paint the genesis of how this conference came to be, I remember a conversation I had with Erez Nir. He is the CTO of Mitchell, an amazing, leading claims management software company. They were founded in 1946. They have 15,000 repair shops as their customers.

And he asked me a question, "What do DevOps transformations look like when organizations have legacy code and processes that existed for decades?" And I told him, "Honestly, I don't know."

But I think Erez was thinking something that we're all thinking, which is, what do DevOps transformations look like outside of the unicorns?

Erez Nir

So I guess that's why we're all here for the fourth year in a row.

And that's when I went to the conference on the second year of DOES, it was in '15, and pretty much I figured out that bridging this great divide and getting our problem solved, one of the things that we're missing to help with our quality and speed of delivery and speed to value was that we did not have enough automation in our processes. It was just not there.

We did not have a decent pipeline. We did not have a decent practice around CI/CD. We had to build this capability before we could go anywhere else.

So we said we need to level up. We need to take it to the next level. And we put a case together. We started looking for the right resource, and then we found in San Diego an ex-Yahoo, ex-Intuit guy whose name is Richard, Richard Fong, and he's going to join us and tell us a little bit about what he did.

So Richard, here we go.

You changed.

Too early. Yes.

Hey, you changed.

Yeah.

There you go.

Richard Fong

Thank you, Erez.

So when I first joined Mitchell, the first thing I did is, okay, what challenges are there? Of course, as I look around, I said, "Yes, we are enterprise. We are legacy. Oh yes, we are monolithic."

But the more interesting thing is that, yes, I've been doing CI/CD for a while now, but it's all on the Linux side. But what's there in Windows?

So I take a look at the landscape of Mitchell, and it's like 80% of them are running on Windows, and 20% are running Linux. So I say, "Okay, great. What tools are out there?" Well, very interesting. Not much.

Then I start looking at the architecture of the product. I need to know what I'm releasing. Actually, more importantly, the developer guys, they need to know what they are releasing. So nothing is documented.

When you have a monolithic application and you try to push that to production, it's very important to know that everybody has domain knowledge of their own components or services, but when it gets to the end, there's no single person who actually knows the whole picture.

And then furthermore, you have components that are actually in production that are often... I mean, nobody owns them. So what do you do?

They have a release process. But actually, because we are practicing Scrum, we are doing Agile. Every team is doing Agile on building features, but the dev guy, the operations guy, the SCM guy are still very downstream. It's very waterfall in terms of the release process and how products go to production. So you have these silos in the release process.

Because of the silos and different teams, they all have their own problems. You have teams that actually have really good web operations, but they're not so good with automated QA testing. And then you have others that are really good at QA testing, but they're terrible at operations.

And then the leadership, what's in it for them? Well, I'm here to build product. I'm making money for the company. This CI/CD stuff, I don't know how that is going to help me.

So actually, this is funny. When I joined, I partnered up with an in-house resident, Raj Makkers, my partner in CI/CD. He came up with this quote, which is, "You know what? We need to build a highway, not traffic lights."

I was like, "What does that mean?"

He was like, "We need to build a CI/CD pipeline that enables developers to move really fast. We don't want to stop them by building traffic lights and gate them every time they move to a corner and say, 'Okay, you have to stop here and then wait and then move forward.'"

I said, "Wow, that's interesting." So with that mantra, we start, okay, let's take that and see what we can do.

So there are a lot of tools out there. Quite frankly, even I myself, and probably most people, have not experienced every single tool for CI/CD out there, right? So we say, "Let's paint a picture. Let's not think about tools; let's think about what we want to do."

So we actually drew this picture on a whiteboard, and then we made it a little prettier, and we still have more work to do on the prettier side. But we say, "When you develop code, you're really just writing code. Well, what we want to do is take your code and bundle it up into a small component so that we can ship it and it's more manageable that way."

Well, at the end, I still have my service. So if we can take all those components and assemble them back and ship them as an independent unit, a single deployable unit, that would make things a lot faster and easier to manage.

This actually created a lot more confusion, because then Docker comes along and people say, "Hey, we've got Docker containers."

I say, "No, no, no. We're just using this picture to say that we need to ship the whole thing as one unit."

But then later on this actually kind of fit into the picture as well, because we're saying, "What goes into the Docker container?" Oh, just RPM packages and things like that that we build. Maybe an EAR file, a JAR file, or who knows.

But then now that we say, "Okay, now we have a single deployable artifact," then we say, "Well, really the system is just a composition of all of them."

So then I was talking to guys who are doing microservices in Docker, and they say, "Yeah, I have this thing called a bounded context."

I said, "What is it?"

He was like, "Oh, just a bunch of different containers running, and then they work together."

I was like, "Okay, that sounds like a system."

So I say, "Okay, that sounds great. It fits in the picture, too."

So when we have this picture of system creation, we say, now we can ship the entire thing as a functional system. I can ship the product. It can deploy anywhere I want it to, and not only that, when you have a development pipeline, dev, QA, UAT, prod, I can promote it pretty easily.

So we're like, okay, there are so many technologies. As I mentioned, there's even more technology than I can handle. Let alone give it to the developers and have them learn all of it; good luck on that. Then I go to the other side and say, "Give it to operations. Tell them to learn it." Well, good luck on that too.

So what we need to do is make things simple, right? So how do we simplify that? Maybe abstract it so that they don't have to look at all the tools themselves.

So then Erez comes and says, "Go to this conference, the DevOps Summit, and tell them all about it. And you know what? Geek out a little bit. Tell them a little bit more on the technical side."

I was like, "We only have what? Five, 10 minutes. What can I say?"

So I focus on the pipeline picture that we created because remember, we're 80% Windows and 20% Linux. There's no packaging mechanism on the Windows side, not really. Let alone something that works on both Windows and Linux. Well, nothing out there.

So what we did is we created our own package manager. We called it the Mitchell Package Manager. Then the PM comes about and says, "Oh, I get it. MIPM." Okay, so I said, "MIPM it is."

Monolithic application. Now that we have a packaging mechanism, we are able to leverage it to decouple the monolithic application into individual deployable units, each a package or a container.

The tool itself helps us to unify all the technology. No matter whether you're using .NET and trying to encapsulate your DLLs to deploy, or you're doing Java, Node.js, or Python, it works for all of them because it's just content management. You're taking something and you drop it onto a server or a container.

It works on Windows and Linux. Yeah, we built it in Python, so it works anywhere, kind of.

We use a simple JSON configuration to define how and what goes into the package itself, and we borrowed a lot of that schema from Node.js. The reason I did that is because a lot of the .NET guys or Java guys have already been working with either JavaScript or AngularJS, so they understand the npm technology. So that's one less thing for them to learn. So we took that and changed it a little bit.
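The talk does not show the actual MIPM schema, so the following is only a minimal sketch of what an npm-style JSON package manifest and a loader for it might look like. The field names (`name`, `version`, `files`, `dependencies`), the example package, and the `load_manifest` helper are all hypothetical, modeled loosely on the shape of npm's `package.json`.

```python
import json

# Hypothetical manifest, modeled loosely on npm's package.json.
# None of these field names are confirmed to be MIPM's real schema.
EXAMPLE_MANIFEST = """
{
  "name": "claims-service",
  "version": "1.4.2",
  "files": ["bin/ClaimsService.dll", "config/service.json"],
  "dependencies": {"shared-logging": "2.0.1"}
}
"""

# Assumption: a package needs at least a name, a version, and content to ship.
REQUIRED_FIELDS = ("name", "version", "files")


def load_manifest(text):
    """Parse a manifest and check that the required fields are present."""
    manifest = json.loads(text)
    missing = [field for field in REQUIRED_FIELDS if field not in manifest]
    if missing:
        raise ValueError("manifest is missing fields: " + ", ".join(missing))
    # Dependencies are optional in this sketch; default to none.
    manifest.setdefault("dependencies", {})
    return manifest


if __name__ == "__main__":
    manifest = load_manifest(EXAMPLE_MANIFEST)
    print(manifest["name"], manifest["version"], len(manifest["files"]), "files")
```

Because the manifest only lists content to drop onto a server or container, the same file format could describe a .NET DLL, a JAR, or a Node.js bundle, which is the "just content management" point made above.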

It is cloud-ready. We can spin up an EC2 instance today, and you can just do the same installation, install the package there, and up you go. So that makes that transition a lot easier.

Lastly, the local developer. Actually, what we didn't expect is that the developers were like, "Hey, you know what? I can spin up a local VM. I can just deploy this thing to it, and before I check in my code, I can do local functional testing."

It's like, okay, great.

The way we decouple it is we abstract each deployment into three layers. There is always going to be an environment configuration; every environment has a different config. The application now becomes immutable with the packaging solution. And it always sits on some image. Even with Docker, you pull from some baseline image and you start adding your own content to it, and that's your application.

So what the packaging tool does is target the application layer to help the developers, who need the help the most, to say, "Okay, I'm coding something. Make it easy for me to deliver what I need."
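The three-layer split can be pictured with a small sketch. This is an illustration under stated assumptions, not Mitchell's implementation: the `Deployment` class, the `promote` and `resolved_config` helpers, the image name, and the per-environment settings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """One deployment, split into the three layers described in the talk."""
    base_image: str   # layer 1: the baseline image everything sits on
    packages: tuple   # layer 2: immutable application packages as (name, version)
    environment: str  # layer 3: which environment's configuration applies


# Hypothetical per-environment settings. Only this layer differs between
# environments; the application packages stay exactly the same.
ENVIRONMENT_CONFIG = {
    "dev": {"db_host": "db.dev.example.internal", "log_level": "DEBUG"},
    "qa": {"db_host": "db.qa.example.internal", "log_level": "INFO"},
}


def promote(deployment, target_environment):
    """Return the same image and packages, bound to a different environment."""
    if target_environment not in ENVIRONMENT_CONFIG:
        raise KeyError("unknown environment: " + target_environment)
    return Deployment(deployment.base_image, deployment.packages, target_environment)


def resolved_config(deployment):
    """Look up the configuration layer for the deployment's environment."""
    return ENVIRONMENT_CONFIG[deployment.environment]


if __name__ == "__main__":
    dev = Deployment("windows-base", (("claims-service", "1.4.2"),), "dev")
    qa = promote(dev, "qa")
    print(qa.packages == dev.packages, resolved_config(qa)["log_level"])
```

Freezing the dataclass mirrors the idea that the application layer is immutable: promotion creates a new binding to an environment rather than modifying the packages.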

To understand that a little bit, and what the benefit is: the tool allows us to, say, promote from dev to QA, and it is able to identify the changes and say, "Package one, I don't need to update, so leave it alone. I'm just going to update what I need to update."

What that does is give me speed, because now I don't have to do the entire thing.

To understand this a little bit more, and what the impact is, think about an enterprise application, not microservices, and I have hundreds of these. Literally, this is a real case. I have hundreds of these packages that I need to manage and I need to deploy.

Imagine the tool is able to identify the changes, and now we are only deploying what has changed. So not only is my application deployment now stable, because we guarantee that we deploy exactly what you wanted, or what the developer wanted, but also, when you promote from environment to environment, we are only pushing what has changed.

So imagine hundreds of those changes being deployed versus, in this case, maybe just 10 of them. So the deployment was actually reduced tenfold.
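The "only deploy what has changed" idea can be sketched as a comparison of package versions between two environments. This is a minimal illustration, assuming each environment can be described as a mapping of package name to version; the `plan_deployment` function and the package names are hypothetical, and the real tool may detect changes differently.

```python
def plan_deployment(installed, desired):
    """Split the desired packages into those to deploy and those to leave alone.

    Both arguments map package name -> version string.
    Returns (to_install, unchanged).
    """
    # A package is deployed if the target lacks it or has a different version.
    to_install = {
        name: version
        for name, version in desired.items()
        if installed.get(name) != version
    }
    # Everything else already matches and is skipped.
    unchanged = sorted(name for name in desired if installed.get(name) == desired[name])
    return to_install, unchanged


if __name__ == "__main__":
    # Hypothetical state: what QA currently runs, and what dev wants promoted.
    qa_installed = {"estimating-ui": "3.1.0", "claims-service": "1.4.1"}
    dev_desired = {
        "estimating-ui": "3.1.0",
        "claims-service": "1.4.2",
        "shared-logging": "2.0.1",
    }
    to_install, unchanged = plan_deployment(qa_installed, dev_desired)
    print("deploy:", to_install)
    print("leave alone:", unchanged)
```

With hundreds of packages and only a handful changed, the deployment work scales with the size of `to_install` rather than with the size of the whole application.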

So, the outcome of this simple tool that we built... We have built many tools, but this is just one example, the one with the biggest impact. Using the tool, we were able to componentize our monolithic application, so now the developers can manage individual components.

We built a tool that is not specific to one target: it works on Windows and Linux, so it's platform agnostic. That philosophy actually carried over to other tool selections and to other things we built as well; we always make sure that they work in all of the environments.

The product itself was undocumented, so documenting it is something that we did. We actually had to go and do a survey for what we call a tech product taxonomy. We mapped out every single product, every single component, every single service so that we have the entire inventory, and then we went to the business units and identified each individual owner.

Sometimes there are still lost components, but what we did that is quite clever is that we go to the source repo. Hopefully everything's in source control, right? We go to source control and we say, "Do a scan." If you're .NET, we know you're using a solution file, and if you're on Java, we scan for the POM file. So we had the entire inventory mapped pretty easily.
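A scan like the one described could look roughly like this. It is a minimal sketch, assuming the source is checked out to a local directory and that a `.sln` file marks a .NET project and a `pom.xml` marks a Java (Maven) project; the `scan_repository` function is hypothetical, and how Mitchell actually walked its source control is not described in the talk.

```python
from pathlib import Path


def scan_repository(root):
    """Walk a checked-out source tree and list the projects found in it.

    Returns a list of (technology, relative_path) tuples.
    """
    inventory = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() == ".sln":
            # A Visual Studio solution file marks a .NET project.
            inventory.append((".NET", path.relative_to(root).as_posix()))
        elif path.name == "pom.xml":
            # A Maven POM file marks a Java project.
            inventory.append(("Java", path.relative_to(root).as_posix()))
    return inventory


if __name__ == "__main__":
    # Scans the current directory; point it at any local checkout.
    for technology, location in scan_repository("."):
        print(technology, location)
```

The resulting list is only a starting inventory; matching each entry to a business owner is the manual step described above.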

The release process. So yes, we built a pipeline. We made the pipeline transparent so that everybody can see what is going on in the release process. Not everybody can click the button and go do a deployment. We didn't let the devs do that yet. But what we did is create this transparency so that everybody knows what is going on in a deploy. So now they can just look at the pipeline and see what's going on.

We partner with the business units to solve their problem. So the people aspect of that is more crucial because we're not going to be successful just by building tools. They have to get something out of it.

So by identifying the core problem, and then once we had pushed out a tool and helped them to onboard, and the problem went away, this actually incentivized them to say, "Hey, I want to work with you guys. You guys actually helped me a lot."

And then in addition, other teams that we had not even engaged come to us and say, "Hey guys, this is great. Can you help me? I want some of that."

So there's some results. I think I'm going to give it back to Erez. He's going to talk about it.

Erez Nir

Thank you, Richard.

So next episode is initial results, and I must admit that every year I come here, and every year I'm having a conversation with Gene, who's telling me, "Hey, can you come and tell your story?" And I don't know. I'm only telling stories when I have the facts on my side, and it's really about achievements and results.

And I was not feeling comfortable to come here before, but last year I told Gene, I think that next year, I'll be able to talk about our results because I knew what's coming.

So once we implemented CI/CD, and we got this going on on a pretty large scale. If seven years ago it was a big monolith, yes, there were components, but it was a big monolith, very unstable deployments, long effort. Dev didn't have much control over this.

About four years ago, we said, let's just try to stabilize it a little bit better, and we packaged the whole monolith into a single unit. But now deployments were really long. Stability was there, but the flexibility was gone, and speed was gone, and different teams were helping.

And today we're deploying, dev controls the dependency, dev controls the componentization, the time, and what to release. So we're definitely making progress. And as I said, this is a long journey. For a very long-lived enterprise architecture, this is a long journey.

We were maniacally watching our stability of the pipeline. I went to Pivotal Labs. I've seen how they actually watch this all the time, and we really liked what they're doing. We started like this. There was a lot of the orange, red bars basically saying pipeline is not stable.

And then we moved to this, which is we are now watching... And by the way, when you look at the scale, the old scale was in hundreds of minutes. The new scale is in minutes. So we definitely sped up the process, and we are watching it like hawks to make sure that it's really stable.

And then, of course, when you look at productivity, there's no comparison between deploying once every two weeks and deploying on every check-in. That is something that everybody can appreciate, especially developers.

Integration and other environments were not stable, and sometimes when they were down, they were down for two or three days at a time until somebody actually took care of them. Now we have a team that basically fixes things in a matter of minutes.

We're moving from this war room of attention for deployment to basically watching Jenkins. Jenkins runs, deployed, move on.

And the one significant thing, when you talk about hundreds of developers: we ran statistics to understand how much time developers actually code. And we were at 40%, which meant, what is everybody doing with the other 60%? Well, there were meetings, but there were a lot of operational concerns, stabilizing, chasing. We have now moved to 60% coding.

If you multiply this by the hourly rate and how much money this is saving, this is huge productivity, and this is what business folks really understand.
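The arithmetic behind that statement can be sketched as follows. Only the 40% and 60% coding shares come from the talk; the developer count, hours per year, and hourly rate below are placeholder inputs chosen for illustration, not Mitchell's figures, and the function names are hypothetical.

```python
def extra_coding_hours(developers, hours_per_year, before_pct, after_pct):
    """Additional hours per year spent coding after the share of coding time rises."""
    return developers * hours_per_year * (after_pct - before_pct) / 100


def relative_gain(before_pct, after_pct):
    """How many times more coding time there is per developer than before."""
    return after_pct / before_pct


if __name__ == "__main__":
    # From the talk: coding time went from 40% to 60% of a developer's time.
    before_pct, after_pct = 40, 60

    # Placeholder inputs, NOT from the talk; substitute real figures.
    developers = 100
    hours_per_year = 2000
    hourly_rate = 50.0

    hours = extra_coding_hours(developers, hours_per_year, before_pct, after_pct)
    print("extra coding hours per year:", hours)
    print("value at the assumed rate:", hours * hourly_rate)
    print("coding time multiplier:", relative_gain(before_pct, after_pct))
```

Independent of the placeholder inputs, moving from 40% to 60% means each developer spends one and a half times as much time coding as before.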

So measuring happiness, invaluable. You cannot measure... I'm a developer. I ended up moving into production system and infrastructure. I am this. I am happy when my code is running in production. Measuring happiness is invaluable, and our teams are really happy that we can actually let them run as fast as they can.

So where is our next episode? Where is this journey going?

And you all have little kids sitting in the back of the car and saying, "Are we there yet?" I'm having it all the time. I have kids in the back yelling at me, "Are we there yet?"

DevOps is a journey. We are just somewhere in the beginning. We have a lot to go, but we are looking for ways to actually improve.

And the one parting comment that I will make is the comment that I made in our last DevCon. Having awesome, shiny pipelines is really nice, and I really know how to appreciate plumbing. I like to play with it sometimes.

Having no water in the plumbing is meaningless. For me, water in this plumbing is all those automated processes, all the automation of QA, all the automation of deployment scripts. Without it, we did nothing. And for me, without this, we will not be ready to call ourselves true DevOps in nature.

So we need some help.

The next step that we're going to take is on our QA automation, and we have interesting environments, legacy environments. We're looking for ways to automate our processes on the QA side.

Managing DB changes. All of us are talking about pipelines, and only now the comments are starting to come. What are you doing with DB schema changes? What do you do with DB package deploys? They're not participating in this conversation.

My DBAs are still very concerned and very strict on who touches what on our DB. So getting that into the pipeline is really important.

And of course, you heard Richard, we are predominantly a Windows shop. We have a lot of Java and Linux running, but we are predominantly a Windows shop, and the tools on Windows are still scarce. So we would welcome any advice.

If you like what you heard and you want to hear more, feel free to contact us. If you have ideas and you can help us with some of these other topics that I brought up, feel free to reach out to us as well, and we'd like to share with you.

Thank you very much.