A Healthier Software Development Life Cycle with DevOps

London 2017

Chris Hill

Software Development Manager · Jaguar Land Rover

Full transcript

Chris Hill

I'm Chris Hill from Jaguar Land Rover. I'm here to talk about a healthier software development life cycle with DevOps.

Just to give you a little background on Jaguar Land Rover: we have roughly 40,000 employees, 22 billion pounds in revenue per year, roughly 900 software personnel, and 135 different software groups within our organization.

I actually just recently made a move. I'm from Portland, Oregon. That's actually where I met Gene Kim. I moved my whole family here to the Midlands, United Kingdom, specifically to work in HMI proof of concepts. I'm now here working mainly in infotainment and production software.

So before I get started, I do have a quick story to share. This is an infotainment system. I think you guys are pretty familiar with that. It's a touch screen that we can manipulate as we're driving the car.

My second time here in the UK, I jumped into a supplier's brand-new Mercedes. There were four of us, and this was early last year. Combined, there was about 30 years of automotive software experience in the vehicle. I only made up about four months of that. So you can imagine the level of talent sitting in this vehicle.

So, as you can imagine, we set the navigation system. We have music playing. Typically, navigation fades in, music fades out, right? You guys have dealt with that before. We get a phone call on our way. Phone call comes in, music fades out. Twenty seconds into the phone call, the music starts back up again, but the phone call's still going.

So now we've got navigation that's cutting in and out, we've got the person on the phone trying to give us directions, and then we've got music playing as well. So total chaos actually ensues in this vehicle. The driver is trying to use voice commands to stop the navigation. I'm in the passenger seat trying to figure out how to shut off the music, but the infotainment system thinks that we're actually in phone mode, so I have no ability to shut off the music.

And finally, the guys in the back say, "No one is doing this right."

And so I really thought about this, and he is right. He's referring to the automotive software industry, but no one is actually doing what we all feel could be the most ideal way to develop software. It's obviously something that's very complex.

So I went back to Portland and met with the rest of my team, and we decided to come up with indicators of what we felt was an unhealthy software development life cycle.

These are those indicators. So my first one is: our devs build prod on their own machines.

This one scares me. I can't even fathom how many production software machines I use on a daily basis that were probably built, or could have been built, on a developer's machine at their home, or maybe even in their corporation, just on their machine.

When you've had to ask yourself during an outage, "What version is running in production?" as well as, "Which developer's machine did the build and deploy actually happen on?" you're probably looking at an unhealthy software development life cycle.

Version 11, build Dave, right?

You probably don't have continuous integration. You probably do have some manual testing and manual deployments. You've got variability in your deployments. You have no idea what's running in production. If Dave goes on vacation, how are you going to reproduce that build in order to debug whatever's going on, right?
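One way to make "what is running in production, and where was it built?" answerable is to stamp every build with a small manifest. The sketch below is only an illustration under stated assumptions: the function name `build_manifest` and its field names are hypothetical and do not come from the talk.

```python
import json
import socket
from datetime import datetime, timezone


def build_manifest(version: str, commit: str) -> dict:
    """Describe one build: what was built, from which commit, where, and when.

    Hypothetical helper for illustration; the field names are assumptions.
    """
    return {
        "version": version,
        "commit": commit,
        # The machine that produced the artifact, so "build Dave" is traceable.
        "built_on": socket.gethostname(),
        "built_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # The version and commit values here are made up for the example.
    print(json.dumps(build_manifest("11", "abc123"), indent=2))
```

Shipping a record like this alongside the artifact means the version and build host can be read from production instead of asked of a colleague.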

"Oh, that? We'll fix it later."

So this is my technical debt story. I feel that infrastructure and product rely heavily on each other, right? They come hand in hand. I feel infrastructure typically is the second-class citizen and is much easier to accumulate technical debt in because you find yourself saying that, "Well, the product is still functional. We haven't really thought this out completely. Let's just ship it."

What I've seen is these technical debt accruals actually happen in small batches, which is funny. We promote small batches in DevOps, but technical debt probably doesn't fit into that category.

I've also seen technical debt where, at the moment you accrue it, you know the ideal way it should be done, and that's the way you'd actually like to do it. But when you go back to actually take care of the technical debt, a lot of times it still doesn't get done in the ideal way you wanted.

"I've approved my own pull requests." So this one for me is pretty funny. We can map out an entire path to production. We've got an entire branching strategy. A developer makes his way all the way to the master branch and then approves his own pull request. We've got this whole mechanism that gates everything on the way to production, and then he approves it himself, which ultimately defeats the entire purpose of having that approval gate, right?

So you're going to miss out on accountability between developers. There could be a teaching moment there. There are many opportunities to gate software on its way into production or into your master branch, and ultimately you miss out on all of them if you're approving things yourself, right?
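The approval gate described above only works if the author's own approval does not count. A minimal sketch of that rule follows, assuming a hypothetical `PullRequest` record that is not tied to any particular hosting service:

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Hypothetical pull request record, for illustration only."""
    author: str
    approvals: set[str] = field(default_factory=set)


def independent_approvers(pr: PullRequest) -> set[str]:
    """Return the approvers who are not the author."""
    return pr.approvals - {pr.author}


def can_merge(pr: PullRequest, required: int = 1) -> bool:
    """Allow the merge only with enough approvals from someone else."""
    return len(independent_approvers(pr)) >= required
```

With this rule, a pull request approved only by its author stays blocked until a second developer has reviewed it.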

"We didn't have time to test." So this is probably one of my favorites.

Especially in the proof-of-concept world and in minimum viable product world, I see a lot of things just get disregarded as, "Oh, no, we're just trying to do this minimum product. We're not going to try to do any of this testing stuff that you talk about."

And I'm a huge proponent of an elementary type of test or basic fundamental type of testing. I find it strange that sometimes I can't go to a developer, let's say an app developer, and I can't just ask him, "Show me the test that just says your service is running when you're running in production." And they can't even show that to me. They typically are focusing elsewhere.

And I feel like you don't have to boil the entire ocean when you do testing. You can remain fundamental. You can remain elementary.
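As a rough illustration of how small an elementary "is the service running?" test can be, the sketch below only checks that something accepts TCP connections on a given host and port. The function name `service_is_running` is hypothetical, and an open port is a deliberately basic signal, not a full health check.

```python
import socket


def service_is_running(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A test like this takes minutes to write and already answers the most basic question about a deployed service.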

"That's the integration team's job." So my colleagues will laugh at this one, but if you've got supplier A who delivers software to you and supplier B that delivers software to you, and they depend on each other, and they don't get along with each other, and the deliveries that come out of both of them don't work with each other, I've seen cases where instead of actually fixing the root cause of that problem, you hire supplier C to make sure that A and B play nice with each other, right?

Unfortunately, this is automatically siloing your groups. It's automatically putting C at all of the risk, and it's giving basically the entitlement to A and B to deliver something that could potentially not be of top quality.

In fact, a lot of times I hear the excuse from A and B that says, "Well, it's C's fault because they couldn't figure out how it worked with everything else." Right? Super frustrating.

This happens: rush deliveries without testing. I've seen things come in that were completely not prepared at all.

I've seen no virtualization or simulation. If you have a situation where you have multiple suppliers that are depending on each other and you need for them to develop with autonomy, it's important that they're not only delivering code to you or a product to you, but they're also delivering the ability for other teams to simulate their software, their latest software, at the same time.

This is really critical for me, especially at JLR, where we have 25 different suppliers working on the same product. They all have to work together with each other, otherwise nothing will work.

So what's the impact on the business?

There's an impact to product: delayed or buggy releases, brand confidence, lost profits, outdated tech. I think it's awful that there's a possibility you get into a brand-new vehicle this year where the software was actually developed four to six years ago. To me, that's embarrassing. However, I understand why it is the way it is. There's no reason we can't change and get better.

The impact to people: this is the one that probably hurts me the most. If engineers can't see their work come to fruition, or make it all the way to production, and they've got no closure on their ongoing development, they feel disenfranchised. They feel disappointed. They feel like what they're doing isn't actually helping the organization. They feel like they're not contributing value.

I've seen engineers turn over because of this, and it really is painful to watch talent leave your company. I'm sure we've all had good talent leave the company, and it is painful.

The triage time tends to increase as complexity rises, and as you're trying to Band-Aid fix things, the amount of time it takes to actually figure out what's wrong with your production system increases, right?

Senior engineers obviously turning over. Technical debt accumulation. This is obviously painful for some engineers.

I feel like back in the States, where I'm from, when you graduate with a computer science degree, you may or may not have $80,000 in financial college debt, and you go off and you're ready to start your career in the software industry, and you think, "Oh, I'm finally in a position where I can start going the other way and paying off the debt," and then you accumulate it on a day-to-day basis at work. It can obviously weigh on your shoulders, right?

So, in order to remain competitive, we have to adapt, right?

Motion creates emotion. Emotion drives passion. Passion drives change. So if there are ways we can be strategically disruptive, right? And I say strategically because there are ways to not be strategically disruptive. But if we can instill the change and encourage engineers to speak up, I feel like these changes are actually reasonable.

What hurts me a lot is if I hear an engineer that says the reason they didn't bring something up is because they felt like the resistance was going to be too much. And it's really disheartening to hear that somebody could be in a culture where they feel like they can't actually come forth with their ideas, right?

Just as the last presentation said, thank you for doing that: "It is not the strongest of the species that survive, nor the most intelligent, but the one most responsive to change."

Obviously, Charles Darwin is referring to the fact you must stay flexible. You must stay competitive. If not, you will lose out in the market. We will not have a company to work for anymore, right?

So there are some principles to making this right this year.

We can increase deployments per day per developer. This is obviously a great indicator that Gene uses.

We can automate at scale, make sure that we're actually adjusting our infrastructure based off of the incoming development demand. This was a huge win for us.

Identify bottlenecks on iterations. Sometimes you'd be surprised if you look at the time a developer spends between committing one feature and committing the next. If you actually understand the iteration cycles they're going through, you'll find that the creative process is probably a small percentage of that.

If we find bugs earlier and we find bugs in an automated fashion, they won't actually get to triage. I've seen cases where some sort of defect, some sort of bug, gets all the way to triage, and we waste an engineering day trying to figure out where it goes. We waste an engineering day trying to figure out, with purchasing and the supplier, whose fault it was. And then we waste another day with the supplier actually trying to figure out how to fix it, and then maybe another day for an actual fix and delivery, which probably takes 10 minutes, right?

So if you can resolve those issues before they ever get into a triage life cycle, you save your organization a significant amount of time. Does that make sense?

Of course, remain agile in our development.

So I'm a huge proponent of self-service capabilities. I like to be able to allow the suppliers to work on their own with autonomy. I don't like to hear the excuse, "The delivery we made is of low quality because we weren't able to actually replicate your production environment."

Okay. Well, let's give them the tools to replicate the production environment. That way, everything they deliver to us should already meet and should already be validated against our quality criteria, right? Instead of an integration team, your integration team would actually be your team that writes your automated tests so that your deliveries are either accepted or rejected from suppliers, right?
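The accept-or-reject idea can be pictured as a set of named checks run against every incoming delivery. This is a minimal sketch under stated assumptions: the delivery is modeled as a plain dictionary, and the check names and fields in `EXAMPLE_CHECKS` are hypothetical, not JLR's actual quality criteria.

```python
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    accepted: bool
    failures: list[str]


# Hypothetical quality criteria; real ones would be automated tests.
EXAMPLE_CHECKS: dict[str, Callable[[dict], bool]] = {
    "unit tests passed": lambda d: d.get("tests_passed") is True,
    "simulator included": lambda d: bool(d.get("simulator")),
}


def evaluate_delivery(
    delivery: dict, checks: dict[str, Callable[[dict], bool]]
) -> Verdict:
    """Run every check and reject the delivery if any of them fails."""
    failures = [name for name, check in checks.items() if not check(delivery)]
    return Verdict(accepted=not failures, failures=failures)
```

Because the same checks can be handed to suppliers, a delivery that would be rejected can be caught on their side before it is ever sent.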

Decrease barriers for change. This goes back to speaking up when you feel like something isn't right and should be changed. If you have fewer barriers to getting those ideas out in the wild, or you're able to establish an infrastructure so that people can test their ideas, typically that creates a more flexible and more accepting environment for change, right?

Spread integration efforts across all teams. So this goes specifically with self-service.

I had an incident where we were talking to a supplier, and they wanted to install their software on one of our servers, and they wanted to install their software on one of our components. And they said, "Yeah, schedule about a week with one of our systems engineers to get it all installed and set up, and then you can incorporate it into your component."

And I said, "Well, I'll add you to two Git repositories. One, infrastructure as code. You can change your own infrastructure. You can build it how you want to. You can replicate our entire environment. And here's the source code to the component. How about you put it into our component for us? You tell us whether or not this is going to be of value for us or whether or not it works, rather than me having to employ somebody for a week to try and get your software working on our system," right?

This is the flexibility. This is the empowerment.

Oh, that one looks terrible.

Collaborative agreements within partners. So what's really important to me is when you establish a relationship with a supplier or with another vendor, I like to get out of this buyer and seller mentality.

When you have a partnership that isn't built off of trust, when you have a partnership that is built off of blame, where you're wringing the other person's neck or always telling them they did something wrong ("Your scope was incorrect, you didn't follow the scope, you didn't listen to our changes," et cetera), when you have a negative relationship like this, typically your software will follow the same negative path.

If you're able to establish a relationship, almost like dating, where you can be fully committed to one another and realize that both sides will make mistakes, both sides will be there to help you fix the mistakes, and both sides will make sure to stay accountable for the end goal, which is your product, everyone succeeds, right?

Decreasing developer ramp-up time. I really like the ability to take a brand-new developer, put them on a project, and in less than 30 minutes, get them up to speed with making a change off of the entire end-to-end system. Right? Here are the tools. Here is everything you need to make your change within a half an hour.

Even if it's the most complex system, I still like to give them the capability to know how to make a change, and where everything will go, within a half an hour. This is helpful not only for keeping developers and moving them from project to project. It also makes moving between projects, or starting as a new developer, seem like less of a burden.

I've seen ramp-up times in terms of months rather than days, sometimes six, seven months, just to get to the point where you're actually contributing well on a project. That's a little extreme, but it is reasonable to say that for the majority of projects I've seen, a complete developer environment setup is usually about a week away.

I want to get away from that, right? I want to get to the point where we're doing 30 minutes, you're ready to go.

Establish a dev community. This is important not only for transparency but also to make sure we provide help to any new starters.

I didn't think it was possible, but we've got a sarcasm-free help channel in a British company. You can actually go to this help channel on IRC and get somebody who's genuinely interested in helping you, not just making fun of the fact that something was done wrong or that it never should have been explained this way. This is actually genuinely helpful.

I love the idea of ChatOps. I love the idea, "Hey, Chris, I need a new account," or there's a huge backlog of requests. I can assign one to myself and just type a chat to a bot and get what I need done via a backend Python script.
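To make the ChatOps idea concrete, here is a minimal sketch of the part of a bot that maps a chat message to a backend function. Everything in it is an assumption for illustration: the `new-account` command, its reply text, and the registry are hypothetical, and the IRC connection itself is left out.

```python
from typing import Callable

# Registry of chat commands, filled in by the @command decorator below.
COMMANDS: dict[str, Callable[[list[str]], str]] = {}


def command(name: str):
    """Register a function as the handler for a chat command."""
    def register(func: Callable[[list[str]], str]):
        COMMANDS[name] = func
        return func
    return register


@command("new-account")
def new_account(args: list[str]) -> str:
    # A real handler would call the account-provisioning script here.
    if not args:
        return "usage: new-account <username>"
    return f"account request queued for {args[0]}"


def handle_message(text: str) -> str:
    """Parse one chat line and dispatch it to the matching handler."""
    parts = text.split()
    if not parts:
        return "no command given"
    handler = COMMANDS.get(parts[0])
    if handler is None:
        return f"unknown command: {parts[0]}"
    return handler(parts[1:])
```

Keeping the dispatch separate from the chat transport means the same handlers can be exercised in tests without a live channel.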

I love the idea of pipelines spitting out activity to me. I can see the chatter. I can see the actual software activity happening.

When we have tech discussions on IRC, we've got timestamps. We've got a historical log of how we came to a decision. This is really powerful to me, to have a community that we've built around it.

So results so far. We have seen build frequency increase by about 200 times.

The community, like I was mentioning, has gone from 15 users to 120 users, not including bots.

I thought it was funny that Jason mentioned, he said, "Services with personality obviously have configuration drift." We actually inject personalities into bots now. We're actually giving machines personalities on purpose so that they'll actually give us some sort of a witty answer when we give them some sort of a command, or will react differently to some people than other people just to troll them.

Our prototypes can reach about 5% representative of production in a 20th of the time. This actually is my biggest indicator that we're just starting. I can't wait to see where this actually is in six months.

Six other software organizations have now adopted the toolset. We now have people who have never operated in a DevOps methodology, or never operated in this way with continuous integration, continuous deployment, using our toolset, which has been great.

Increased developer happiness. This obviously goes back to the turnover problem and the senior engineer issue. We actually have engineers who want to stick around, and we have engineers who recognize that what's happening is actually going to come to fruition.

Our biggest surprise, which you guys hear all the time, so it's really no surprise, is that most of our problems are not technical. Most of our problems are behavioral-related, process-related, bureaucratic-related, some other part of the organization-related. If they were technical, we'd have them solved already.

We are still looking for help. We have more vehicle variants than grains of sand on this earth. What automated tests should we run based off of the incoming change? What makes sense?
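One simple starting point for choosing tests from an incoming change is to map source paths to the suites that cover them. The sketch below assumes such a mapping exists. The path prefixes and suite names used in the usage note are hypothetical, and this is not presented as an answer to the question at the scale of vehicle variants described above.

```python
def select_tests(changed_files, suites_by_prefix, always=("smoke",)):
    """Pick test suites whose path prefix matches any changed file.

    Suites listed in `always` run for every change.
    """
    selected = set(always)
    for path in changed_files:
        for prefix, suites in suites_by_prefix.items():
            if path.startswith(prefix):
                selected.update(suites)
    return sorted(selected)
```

For example, with a mapping such as `{"audio/": ["audio_mixing"], "nav/": ["navigation", "audio_mixing"]}`, a change under `audio/` would select the `audio_mixing` and `smoke` suites, while a documentation-only change would run `smoke` alone.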

We'd love to do hardware simulation pre-silicon. This is the idea of left-shifting. We'd like to be able to take hardware that we're going to put in cars, be able to test, or at least get to 95% validation before it's actually delivered to us. There's no reason we can't do this.

Systems engineering and infrastructure junkies. I can't tell you how happy I get and how excited I get when I'm sitting in an interview and somebody mentions Packer or Terraform or Chef or Puppet and tells me how they've been using it to help out their previous companies.

Like I said, I am actually located here in the UK now. I did make a big move. I am in this time zone. Would love to hear from you. Would love to have help. Would love to have discussions, anything related to DevOps.

I appreciate your guys' time. I appreciate the DevOps Enterprise Summit, and have a good rest of your conference.

Thank you.