Transforming Traditional Enterprise Software Development Processes by applying DevOps and Agile Principles at Scale


San Francisco 2014


Vice President of QE, Release and Operations · Macy's

This talk covers how to transform traditional Enterprise Software development processes by applying DevOps and Agile principles at scale, instead of the more typical approach of scaling Scrum. The approach starts with clarity on the business objectives for the transformation. Next, it highlights the importance of creating an Enterprise-level continuous improvement process, which is very different from an aggregation of team-level continuous improvement processes.

One of the most important steps for creating an Agile Enterprise is keeping code releasable across the Enterprise.

This presentation will go deep on the fundamentals of DevOps, CI, and CD based on what has been found to be successful in transforming legacy organizations. The final step will provide a framework for re-thinking the planning process to provide an Enterprise-level backlog and long-term commitments.


Full transcript

The complete talk — auto-generated from the talk's captions.

Thanks, Gene, for a great group of presenters that have been here today. I think really high-quality people. In reflecting back on yesterday, I realized that he'd collected a group of my people. I realized throughout my entire career, I've always felt the need to change the world.

I don't know why. Maybe it's a defect that I have, and I didn't consider that I couldn't change the world. And when you look at changing large-scale traditional enterprises, you need people like that. You're not going to change horses into unicorns if you don't have the Don Quixotes of the world that are willing to race into windmills and challenge the status quo and make it happen.

So, that was very encouraging for me to see that there are other people out there in the enterprises working on trying to make these things happen. I was a little frustrated, though, as an executive that leads these large-scale transformations, first at HP and now at Macy's, that the executives are still getting in the way, the resistance. They're one of the barriers to making this happen. And it's frustrating for me because I feel like if they understood the potential impact that this could have on their business, they'd be in front of it, they'd be leading the change, and they'd be helping make it happen.

And I think it's important that we figure out how to get them engaged, not just supporting, but actually engaged in helping to lead the transformation, because they're uniquely positioned in the enterprise to harness the resources that you need to bring to bear on problems of this size to get them solved. And I think if they understood that you could potentially improve the development productivity by two to three X, would they care? If they could start saying yes to all the business requests that they figured out how to say no to, would we be able to get them engaged and get to the point where they really understand how important this is and how critical it is to their business? Two to three X and saying yes to all business requests may seem fairly far-fetched.

I'm going to spend a few minutes going through the LaserJet experience because it's documented business results that are out there in the industry. It's not protected by IP, and it shows what is possible. And having led a transformation at HP and Macy's, I would say that these results are not only possible, they're what's required to be successful and competitive in the business. In 2008, I took over the LaserJet business, the embedded firmware for running all the LaserJet printers, and it was the bottleneck for the business.

We could not add another product to the vintage chart, which was our plan for which products we were going to deliver, without checking with firmware. Marketing had essentially given up asking for features. We had added resources around the world as quick as we could, and we couldn't get off the bottleneck. Through a process of starting down a path of continuous improvement and agile, we completely transformed that, and three years later, we looked back, and it was a situation of doing a bunch of sprints together.

We looked back, and firmware was no longer the bottleneck. Basically, we had taken the development cost down from over $100 million a year to $55 million a year, completely transformed it. We were supporting a 140% increase in the number of products under development, and our capacity for innovation had moved from 5% to 40%. If your executives felt like they could do that for your business, can you think of anything that should be a higher priority for them?

And I'll leave you with the book. I won't go into a lot of details because it's there, but it's in the public domain. You can use it. It's a quick, easy read.

I would've made it shorter, but they wouldn't have published it. I would highly recommend that you get executives to read it and go through it. It's fairly simple. It's not overly technical, but it does kind of tell you what a journey feels like and what it looks like.

I'm working on a second book that's really... The first book's a case study. The second one's more of a how-to. I'm going to focus a lot more on the DevOps side, but I thought I'd just lay out the framework here briefly.

And I would not recommend anybody do agile. I would not recommend anybody do DevOps. If you're going to talk with your executives and you're going to get them to engage in the change, start with the business objectives. You don't do a major transformation for the sake of doing a major transformation.

You do it to meet some business objectives, so you always need to start there, and that's the first place to start. The next thing is this is going to take a while. It is going to be a journey. You're not going to go off and put together a four-year plan and go implement it.

You're going to need to learn and adjust along the way, and it's not just at the team level, it's also at the executive level. And the key thing you need to do is set up sprints, set up objectives, and learn and adjust along the way. And as an executive, my main role was helping to sort of paint the vision, the business objectives, the direction we were going, and then lead the effort on the continuous improvement process where we would set objectives for a month, and then I would spend my time wandering around trying to understand the things that weren't getting done. Because as an organization, we felt they were important, we felt like we could commit to them and it would be getting a milestone, but as we got further into it, we realized they weren't getting done.

And it wasn't going up and saying, "Steve, why didn't this happen?" It was, "You know, Steve, we thought this was important. We thought we were going to do it. Why isn't this getting done?" And it's a very different approach for a leader to take when you're going out and being an investigative reporter and trying to understand what's working and what's not working in your organization. And then as a leader, if you prioritize the things that were getting in Steve's way and help improve them, it builds credibility, and then people are more willing to share what's going on.

They're more willing to share the constraints. They're more willing to share the problems. And you can start a process continuing to improve your process. One of the biggest things that I learned, we looked at those business results, and we looked at the transformation, and then we looked at how the teams operated, and we didn't really tell the teams how to do Agile right.

We had some teams doing burnup, burndown charts. We had other people doing more traditional things. And what we realized at the end was, in the enterprise, how the teams behaved was a second order effect. How the teams came together and delivered value to the customer was a first order effect.

Most large enterprise implementations start with the team and then try to figure out how to scale up. And I think what the industry's starting to realize is that never really gets you to the point where at the enterprise level you're continuously releasing code and doing the basic Agile principles that were put in place. And I think that's one of the primary reasons DevOps has come on the horizon, is because Agile failed when it scaled to the enterprise to accomplish what was one of the basic principles. Steve talked about that a little bit.

And so I'll go a little deeper into that and spend most of my time there because I think that's what the conference is about. I will say that on my blog site and in my next book, a key focus will be how to apply Agile principles in the enterprise. I think it's missing. I think it's hard to get executives to figure out why they treat software differently.

Hopefully, I've characterized that. I think there was somebody that had that as one of their asks yesterday. You can go look at that or follow up with me offline, but I think that is a key part of changing, because once you're delivering code on a regular basis, you need a good process for having a prioritized backlog. One of the things they say start with business objectives.

When I went to Macy's, there were a couple of observations that were very interesting. One was they hired me in, they brought me in as vice president to lead this change of continuous delivery and make it happen. I read Jez's book. I was all excited.

I met Jez. We talked, we went round and round, and I go, "I'm going to go do continuous delivery." I got there for a period of time and realized that we were doing it all wrong. I had set off to do continuous delivery, and we were taking one application and trying to deploy it all the way out, and it was taking forever, and it wasn't delivering any business value. And it wasn't helping, and the organization wasn't getting excited and enthused about it.

After a long debate with Jez and Tim Brown and others at ThoughtWorks one afternoon, I walked away with, okay, I knew this before, you need to start with the business objectives. So what are the objectives that you're trying to achieve with DevOps? And what we got to at Macy's is, well, really, we're trying to increase the quality of feedback to the developers. And we're also trying to make sure that we're testing in as operational-like an environment as we possibly can, as close to the developers as we can.

And then we're trying to reduce the time and resources between a release branch and production. And we're trying to improve the repeatability, deployability, and stability of our environments. Those were the fundamental things that were plaguing the business, and everybody in the organization could rally around that. You could track it.

You could measure it. You could prioritize what you were going to work on, and you could make progress with it. The next thing that was interesting to me is I thought we had a continuous improvement process when I joined the organization. I didn't think to ask whether we had a good build process.

It took a while to convince the organization that we really need to version all of our artifacts. Snapshots, I'm not sure if anybody's done snapshots, but snapshots are not really versioned artifacts. And we went through this process of asking: if we're going to build up a large enterprise system, can I get a green build with version one of all the artifacts with a few simple automated tests? And then if I do another build with version two and component B causes it to go red, can I back out component B and have version two of everything else and get back to a green build?
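The "back out component B" question above can be illustrated with a minimal sketch. It assumes each build is just a pinned set of component versions and that some smoke test can say whether a given set is green; the names `find_green_build` and `smoke_test` are hypothetical and not from the talk.

```python
from typing import Callable, Dict, Optional

# A build is a mapping of component name -> pinned artifact version.
Build = Dict[str, int]


def find_green_build(
    last_green: Build,
    candidate: Build,
    is_green: Callable[[Build], bool],
) -> Optional[Build]:
    """Return the candidate if it is green, else try backing out one component.

    Backing out a component means pinning it to its last known-good version
    while every other component stays on its candidate version.
    """
    if is_green(candidate):
        return candidate
    for name in candidate:
        trial = dict(candidate)
        trial[name] = last_green[name]  # back out just this one component
        if is_green(trial):
            return trial
    return None  # no single back-out fixes it; needs a closer look


# Hypothetical smoke test: version 2 of component "B" breaks the system.
def smoke_test(build: Build) -> bool:
    return build["B"] != 2


v1 = {"A": 1, "B": 1, "C": 1}
v2 = {"A": 2, "B": 2, "C": 2}
result = find_green_build(v1, v2, smoke_test)
print(result)  # {'A': 2, 'B': 1, 'C': 2}
```

This only works when every artifact has a real, immutable version to pin to, which is the reason unversioned snapshots get in the way.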

If you don't have that fundamental in place, your architecture's too tightly coupled, your build process is tied to its dependencies, and you're not going to be able to build up a very large and complex system in the enterprise. And so I made the mistake of not starting soon enough on the build process and making sure it was independent, and I would highly recommend people start here with your architecture. The next thing, and it seems like I make this mistake every time, is under-investing in test automation. I made it at HP.

Every time we looked back, we said, "Oh, yeah, we should have done more investment in that." And every time we thought we invested a lot more. It's the key thing: if you're going to go fast, if you're going to have always-releasable code, test automation is key. When I got to Macy's, they were running about 1,700 automated tests every 10 days on a release process, and that's how long... Or manual tests on a release process to get it done.

We're in the process. We have about 5,000 automated tests running daily now, and as we scale that across the browsers and then we scale it across the different ways that you can go through the data paths in the thing, within three months, we should be running about 100,000 automated tests a day. So if you think about the quality and the code coverage you get with 100,000 tests a day versus 1,700 every 10 days, it gives you a feel for what types of coverage and quality you need to have, and you can have with test automation that you can't have in other places. The key thing you need to understand, though, is how to architect that correctly.

And a lot of people go into doing test automation, and they turn it over to their quality assurance team to automate what they have been doing all along. And I would say that's a huge mistake, having done it and having been a part of that. You need to get some of your best development architects to engage with your best quality engineers to understand where and how you need to test it. Jeff Morgan's got a really good book on how to do it.

Basically, think of the idea this way. Say you go to Macy's website and you're going to write a test where you go to the homepage, and then to men's, and then to pants, and then to Levi's, and then you pick a product. If you run down that path and you write a bunch of tests like the manual testers would do, and you end up with thousands of tests that go down that path, then if you change the homepage, you fundamentally have to go in and change all of your tests, and they're not maintainable, and you'll get behind, and your automated tests won't work. Instead, you create a page object for each page on your website. He also has a data magic gem that, when you hit that page, automatically fills in random data. Then when you write an automated test, you say, "When I go to checkout, make sure I get the right tax price." Well, if you have tests written in that way that are actually validating what you're trying to validate, instead of testing the whole path, when you go back and you need to change the homepage, you change one page object.

You don't change thousands of tests. And you're going to struggle with a lot of test automation if you do not architect it correctly, and you will fight this on an ongoing basis. So invest heavily. This is an example.

You don't have to do it exactly this way. Think about how your code's going to be changing. Think about how to componentize your test architecture so it will work. Otherwise, you're going to die under the weight of maintaining your automated tests.
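As a rough illustration of the page object idea, here is a minimal sketch in Python. It is not Jeff Morgan's library: `FakeSite` is a hypothetical stand-in for a real browser driver and site so the example runs on its own, the 10% tax rate is made up, and the random data fill he describes is not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeSite:
    """Hypothetical stand-in for the real site and browser driver."""
    clicks: List[str] = field(default_factory=list)
    subtotal: float = 0.0
    tax_rate: float = 0.10  # made-up rate, for illustration only

    def click(self, label: str) -> None:
        self.clicks.append(label)

    def add_to_cart(self, price: float) -> None:
        self.subtotal += price

    def displayed_tax(self) -> float:
        return round(self.subtotal * self.tax_rate, 2)


class HomePage:
    MENS_LINK = "Men"  # homepage layout knowledge lives here and nowhere else

    def __init__(self, site: FakeSite) -> None:
        self.site = site

    def open_mens(self) -> None:
        self.site.click(self.MENS_LINK)


class CheckoutPage:
    def __init__(self, site: FakeSite) -> None:
        self.site = site

    @classmethod
    def with_item(cls, site: FakeSite, price: float) -> "CheckoutPage":
        """Reach checkout with one item in the cart; tests don't care how."""
        HomePage(site).open_mens()
        site.add_to_cart(price)
        site.click("Checkout")
        return cls(site)

    def tax(self) -> float:
        return self.site.displayed_tax()


def test_checkout_shows_correct_tax() -> None:
    # The test states only what it validates, not the path to get there.
    checkout = CheckoutPage.with_item(FakeSite(), price=50.00)
    assert checkout.tax() == 5.00


test_checkout_shows_correct_tax()
```

If the homepage link were renamed, only `HomePage.MENS_LINK` would change; the tax test itself would stay as written.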

It's fine when you get a few out there, but when you start building up larger and larger groups, it gets much more difficult. I'm a horse guy. It sounds like there's a lot of people in the room that are horse guys. There's some unicorn people out there.

This is for horses, and horses are somewhat unique. And it's not that you can't apply DevOps principles, it's just that you've got to understand that you're a horse, and you've got to deal with the fact that you're a horse and approach the problem that way. One of the interesting things to look at is what is the cost of failure? If you're a Facebook app, you can probably afford to get by with a lot of unit testing and approaches like that.

If you're a medical device, you might want to do a fair amount more system testing and a little more rigorous testing on the end that's going to take a little bit longer. Then there's ease of deployment: if you're a Salesforce, a Facebook, a Netflix, where you can deploy it and, if you see something, quickly do a canary deployment, you can turn it off, you can change it. It's going to be much different if you have a laser printer where people are only willing to upgrade the firmware once a year, if that, or maybe never. You have to be much more careful.

So think about where you are in this equation and how you should be treating the problem. The other interesting thing is how like your production environment can you make your large-scale testing environment? Think about hundreds of thousands of tests running as close to the developers as you possibly can. If you're a car manufacturer, you can't have a bunch of cars driving around in the parking lot running hundreds of thousands of tests.

It just doesn't work. So you're going to have to do emulators and simulators. If you're Google and you've got gobs of money and you run on servers, you can afford to make your test environments look pretty much exactly like your other piece. And this last one, I think, is the most interesting one.

Most of the unicorns you hear talked about have microservices-based architectures. It enables pizza teams. It enables small groups to operate independently. It enables them to deploy independently.

It enables them to treat all that differently. The horses that I've lived with have much more tightly coupled architectures that enable building up and treating the whole thing as a system. And while it's ideal and best to break the architecture down into microservices so you can allow people to run independently and go fast, the reality is, do you improve the development productivity first, or do you improve the architecture first, or do you do it a little bit at a time? It's a challenge.

But I will be talking more about how you deal with a tightly coupled architecture and how you build it up in a releasable way. There are other unique challenges of being a horse. If you are embedded software and firmware, you're going to need to deal with simulators and emulators. If you're SaaS, web, it's continuous delivery.

If you're packaged software, you have to deal with the complexities of branches. Anytime you have a branch in your system, you should think of it as evil. It takes away from the efficiencies of software, the idea that you write it once, you deploy it, and you successfully get it out there. If you have branches in the system, it's waste, especially if you have any manual testing, it's additional waste.

You need to avoid it at all costs. One of the things that you're looking at from improving the productivity of the organization, its effectiveness, is what you're trying to do is you're trying to localize and find the offending code as quick as you can. And if you're dealing with a very large system of hundreds of developers and you're doing a release integration cycle once a month, once every two months, a build, a development, and then a test cycle, what you're trying to do in the end is you're trying to find who amongst these hundreds of developers in the last six weeks brought in code that broke the build. And you get to the point finally where you go, "Steve, what were you thinking when you brought the code in six weeks ago and you broke this functionality?" To which Steve says, "Six weeks ago?

Are you sure it wasn't Gene? Because I haven't been in that code for a while." And all you're doing as a manager is beating up on Steve, and he's just getting frustrated that he didn't get it fixed and he's feeling bad. Whereas if I had a system designed so that I gave him feedback that it was one of you three this morning at 11:00, you're going to become a better developer, and you're going to become more effective, and your whole organization is going to be more effective. So when you think about how you build up your system and put it together, what you're trying to do is localize that down to the fewest number of people as quickly as you possibly can. Macy's started in the 1800s.

I'm not sure there's code from the 1800s. But from the 1960s, I'm pretty sure it's still there, and some of the people who wrote it in the 1960s may still be there too. So how do you take a very large enterprise system that's tightly coupled all the way from the website that's Java-based, fast-moving capability type of things, all the way back to mainframe, AIX type of systems that are slow to move, hard to qualify, and going and using much more traditional methods? Well, think about building this up in such a way that you're trying to localize the offending code as quick as possible.

You'd start with taking any individual component and making sure it has all its automated unit tests running. I don't see any reason why a developer should ever be allowed to check in code that has a unit test that doesn't pass. The next thing you're going to want to do is stand up as much of the enterprise system as you possibly can, as often as you can, to localize it. You don't want to use mocks unless you have to, or service virtualization.

But when you need to, it's extremely valuable. So between the legacy organizations in Macy's and the agile organizations, we use service virtualization to mock out and sort of localize the code. We've got people in San Francisco, and we have people in Atlanta, and we're trying to isolate those differences and sort of make sure that this piece is fully qualified and working before we integrate it with that piece. As we add new functionality, we use the service virtualization to make sure it's working with the two pieces, and then on an ongoing basis, you need to make sure that you put the system together on a regular basis.

One of the things that I learned at HP that was very interesting is we started with some rules that sort of said you couldn't commit and run. It was a lab felony. You need to stick around until there was a green build. Because what happens is you end up with a train wreck, and somebody has to fix it, and nobody else can check in code until it gets working, and you have this big problem.

And we came up one time we'd had a train wreck for about three or four days. No code was getting through. And I went to the build team and said, "What's going on? We got to get some code through.

I've got to get the feature throughput up." And they said, "Well, we stopped, we added a bunch more tests because the code base was getting unstable." I was like, "Well, okay, so it's not getting any more unstable because no code's getting through, but it's not getting more stable either. What we really need is a process to let the good code in and keep the bad code out." And they came up with autorevert. First they sort of said, "Duh, easy for you to say, hard for me to do." And they came back a few weeks later with something they called autorevert, which later became more of an industry standard called gated commits, where there's a certain minimal level of quality that your code has got to stay at on an ongoing basis with your build acceptance test. And if it doesn't pass that, it never makes it into the SCM.
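The gate described above can be sketched in a few lines. This is only an illustration of the general idea, not the HP autorevert tooling: `Trunk`, `submit`, and `build_acceptance_test` are hypothetical names, and the "test" is a toy check on commit labels.

```python
from typing import Callable, List


class Trunk:
    """Minimal stand-in for the shared mainline in source control."""

    def __init__(self) -> None:
        self.commits: List[str] = []

    def submit(self, change: str, acceptance_test: Callable[[List[str]], bool]) -> bool:
        """Land the change only if trunk plus the change passes the gate."""
        candidate = self.commits + [change]  # trial merge; trunk is untouched
        if acceptance_test(candidate):
            self.commits = candidate
            return True
        return False  # rejected: it never lands, so nobody else is blocked


def build_acceptance_test(commits: List[str]) -> bool:
    # Hypothetical gate: any change labeled "broken" fails the build.
    return all("broken" not in c for c in commits)


trunk = Trunk()
results = [
    trunk.submit(c, build_acceptance_test)
    for c in ["feature-a", "broken-b", "feature-c"]
]
print(results)        # [True, False, True]
print(trunk.commits)  # ['feature-a', 'feature-c']
```

The failing change is kept off the mainline, so the change submitted after it still gets through.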

So it basically takes that train wreck and moves it off the tracks and lets the rest of the code move on through the system. And if you're going to be building up a large, complex system and keep it stable on an ongoing basis, you need some of these techniques to work. You also need to think about when you're taking sub-components, you get them stable, and then you build them into other pieces. You may want to let the code make it into the SCM because you're trying to balance the feedback time versus the stability, but you need to build up bigger and bigger pieces of the enterprise system and make sure it's working together.

And you want to make sure that this is working on an ongoing basis. Continuous delivery at Macy's, we spent a lot of time on this. Similar to object-oriented, as you get into the test automation, you need to apply similar principles when you think about continuous delivery and how to architect it. Otherwise, you're going to end up with the same idea of long monolithic scripts that are hard to manage and hard to keep going.

One of the things that we found is to think about the job of orchestration differently than trigger scripted environments, the deployment, EDD, and automated testing. And if you do that, it enables you to leverage more common components down below that can flow through the system, because you can abstract out your complexity into your orchestrator that describes the uniquenesses for the different environments. The other thing that we spent a fair amount of time on: the first thing that was there was automatic deployment for the environments, and there was kind of manual on the scripting, and what we realized is there was a different branch for doing the automation in the QA environments, the pre-prods, the performances, and the other environments. So every time we moved from one environment to the other, we had to re-deal with similar problems.

The way we did scripted environments is we came up with a common script with unique environment descriptors that sat on top of it. And then as we rolled that out, we kept the same common script, the same code, the same branch, and what we changed was the environment descriptors that would call into that. And that enabled us to get consistency across the whole thing. I don't know how far people have gone down the path of keeping the code base releasable on an ongoing basis.
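The common script with per-environment descriptors could look roughly like the following sketch. The descriptor fields, host names, and the `deploy` function are assumptions made for illustration; the talk does not describe the actual format used at Macy's.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EnvironmentDescriptor:
    """Everything that differs between environments lives in data like this."""
    name: str
    servers: List[str]
    db_url: str


def deploy(env: EnvironmentDescriptor, artifact: str) -> List[str]:
    """The one common script: the same steps everywhere, driven by the descriptor."""
    steps = [f"configure {host} with db={env.db_url}" for host in env.servers]
    steps += [f"install {artifact} on {host}" for host in env.servers]
    return steps


# Hypothetical descriptors; only these change from one environment to the next.
QA = EnvironmentDescriptor("qa", ["qa-1"], "db://qa")
PREPROD = EnvironmentDescriptor("preprod", ["pp-1", "pp-2"], "db://preprod")

for env in (QA, PREPROD):
    print(env.name, deploy(env, "site-1.4.2"))
```

Because `deploy` is the same code on the same branch for every environment, a fix made while deploying to QA carries forward to pre-production unchanged.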

You'll probably run into people saying, "Well, it doesn't work for my database. I need to make a schema change." You will run into that argument frequently. Have them go read this book. There are ways of doing it.

Same with your services layer. You need to version your services layers and have the application code be able to call the version that it needs and then deprecate it. You don't do alters. You don't do modifies in your database.

It's like crossing the beams. Cats and dogs start sleeping together. All those bad things start to happen. One of the interesting things at Macy's that we ran into was when we first started setting up some test automation, we started running into these issues, and we would configure the servers, we would deploy the code, we would run our system tests, and we would have a failure. And we'd run into this problem where we'd work with the developers, and we'd go, "Well, here's the problem." And it took a while to localize, but eventually they'd find either a routing device or a server that either didn't get configured right or didn't get deployed to correctly, but it was a long and complex process, and it took time.

And what they said is, "Well, wait a minute. I shouldn't even be looking at this test failure because the deployment wasn't successful or the environment wasn't right." So we stepped back and we did it a different way. We would configure our server and our routing devices, and then we'd validate that that was correct. And then we'd deploy our code and validate successful deployment at the individual server levels.

Similar to where you're trying to localize the problem to the offending code, here you're trying to localize the offending server or routing device so you can quickly isolate it. Because if you have hundreds of servers and you have something that failed in there, it's going to take a long time to debug because it's going to be really intermittent for a long period of time. And then when you've done that, your final step can validate the code. So you're trying to isolate offending code from the environments, from the deployments, and get that efficiently found.
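The staged validation described above can be sketched as follows: check configuration on every server, then check the deployment on every server, and only then run system tests. The function and the check callables are hypothetical placeholders, not the checks used at Macy's.

```python
from typing import Callable, List


def validate_release(
    servers: List[str],
    config_ok: Callable[[str], bool],
    deploy_ok: Callable[[str], bool],
    run_system_tests: Callable[[], bool],
) -> str:
    """Report the first stage that fails, naming the servers involved."""
    bad = [s for s in servers if not config_ok(s)]
    if bad:
        return "environment problem on: " + ", ".join(bad)
    bad = [s for s in servers if not deploy_ok(s)]
    if bad:
        return "deployment problem on: " + ", ".join(bad)
    # Only now is a test failure worth a developer's attention.
    return "code ok" if run_system_tests() else "code problem"


# Hypothetical scenario: the deployment to web-2 did not succeed.
outcome = validate_release(
    servers=["web-1", "web-2", "web-3"],
    config_ok=lambda s: True,
    deploy_ok=lambda s: s != "web-2",
    run_system_tests=lambda: False,
)
print(outcome)  # deployment problem on: web-2
```

In this scenario the system tests are never run, so nobody spends time debugging a test failure that is really a failed deployment.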

Here is the other challenge that I'm having with Macy's. This is a little hard to look at, but the dirty little secret that Jez doesn't ever tell is to do continuous delivery, you need to have trunk at production quality. That's a long way away for most horses. During that period of time, you are getting stories done.

I heard people talking about what the definition of done is. You are getting defects fixed, and you're getting tests passing. One of the roles that your product management team needs to own is that getting ready for a release branch also means all of those things are done and not just the code piece. So I'm getting tight on time, so I'll keep going.

These types of results are possible. They're credible for your organization, and I think if your executives understood that this was possible and you could make it happen, they would be more than willing to lead the transformation. And then here's where I could use your help. I've got a lot of passion and energy around doing this.

I've done it successfully at Macy's and at HP. My last day at Macy's is next Thursday. After that, I'm trying to figure out how do we get executives to help lead the transformation in their organizations? How do we help them understand how important it is, and how do we help them figure out how to make that happen?

And I think I'm somewhat uniquely positioned as having done it to have a credible voice to help you guys because a lot of times engineers come to these conferences, get all excited, then they go back into the organization, and it throws a wet blanket on the spark and it goes out. So how do we help get them engaged and help lead some of these transformations? And I'm past time, so I think I'm getting the hook. Thank you, Gary.

Okay. All the way.