Accelerating Customer Experience Innovation Through DevOps

San Francisco 2015

Director – Systems Engineering · Verizon

The talk walks through the iterative process of continual improvement driving Verizon IT’s Consumer applications, including:

- Coffee Anywhere – building fast changing CRM experiences

- Coffee Tech Center – creating a brand new tablet based technician experience through continuous delivery

- Overall – the experience of iterative and incremental improvement in the software delivery model across the entire portfolio of applications

Full transcript

Chivas Nambiar

I am the director for the platform engineering team at Verizon.

It's kind of a strange title, right? DevOps is not something that you can hire for, is what we've heard for a while now. And then a DevOps platform, what does that mean?

I have to say it's mostly around our company at Verizon making a decision that we want to change how we operate. It's about changing how our technology teams get better at building products. So that's why I have this role. I'll talk to you about how I got here a little bit. I'll talk to you about how Verizon does what it does.

So let's talk about Verizon for a minute. Who's Verizon? We run some of the most reliable, some of the best networks in the country and around the world.

How many people here have a Verizon phone?

Ah, a lot more of you have to have Verizon phones, clearly.

We have a large number of customers, and there are many different types of customers. So we have consumers that get TV, phone, and internet services from us. We have customers that are wireless consumers, so anyone who has a cellphone, a handset. We also have small and medium businesses and enterprise customers. So most of the Fortune 500 is actually tied to Verizon as a customer and a client.

We have tens of thousands of employees in our technology organizations, driving product development, driving service development, that turn into all the things that our customers consume. So our goal internally is to build relevant, intensely personal products for our customers.

What we saw with Verizon over a few years, though, is we started to get slower and slower. And if you look at the marketplace and what's going on in the marketplace, I don't have to tell you, you see all the proclamations from T-Mobile, all the proclamations from Sprint, everybody coming in and trying to do what we do, and trying to one-up us every couple of weeks in the marketplace.

So at Verizon, we had to take a step back and try and figure out: how do we get better?

That's our current challenge, right? And the tipping point for us was with COFE. Being a telecommunications organization that's been around for more than a century, it is true: we have run out of acronyms that are two or three letters long. Now we're up into the four.

COFE is not this. I'm not handing out free coffee samples.

COFE is this. It's a little bit of a terrible name, but it's a converged front-end engine. What it essentially means is it's a one-stop customer relationship tool that is used by all of our customer service reps, by all of our technicians out in the field, by all of our dispatch organizations to decide when and where to send you technicians. So it's a one-stop shop. People look at it, and it forms the basis of how we relate to our customers.

So if you think about a tool like that, that does all of this work, you can very quickly assume that that "converged" portion of it is aspirational, right? There is no way that any company that I know of has a converged customer relationship management tool that does all of that stuff.

When we started looking at this set of tools initially in 2010, we were at 44 separate systems that did all of that work.

And if you think about the history of something like Verizon, a company like Verizon, it's all based on acquisitions, it's based on mergers, a lot of companies coming together. And what that means is every time you had multiple companies come together, you had one of every system come together, right?

So at a certain point in 2010, when we started looking at this, we had 44 separate systems that we were working with.

The really crazy part was every time somebody decided to do something to simplify it, the way they tried to do it was they would go out and build a gateway and say, "Okay, there are these four systems that do the same thing. We're going to build a gateway, and that gateway is going to provide the same service to the other systems."

In 2010, when we looked at it, I think we had six separate layers of gateways talking to each other. Right? That's how complex the environment was, which is not a great place to be.

So, the before state: dozens of legacy systems.

It was not just the systems and the gateways. It was also geographical, because we had small companies we'd acquired in different regions that did things differently, so the product rules in those areas were different. So if you travel a lot and you look at market segments in California, that's very different from market segments in New York, and the competitors in those marketplaces are very different.

So we had to deal with that, and the way we dealt with it was we built small systems in each of these places, or we acquired systems in each of these places that had different rules. They all had different business processes, right?

So if you think about the flow of a call, I'll walk you through a call. A customer calls in. When they call in, you start to get something on a rep desktop that says, "Here are the systems that tell you who the customer is, what they have today."

The customer tells us that there's an issue or they want to buy a product. We then have to go through and look at all of our order management systems to see what we can offer them. We have to look at their history to say, what have they called in for prior to this? And assuming we fix their issue or we sell them a product, we then have to go touch all of the billing systems so that they can be billed, and we get paid for it.

If we have to do dispatches or technicians have to go out to their households, then that is another set of processes that we have to go through. So very complex systems.

Now, I don't have to tell anybody in this room that if you have these complex systems, when you start to make changes, you start to bog down very quickly, right? Because you have all these interdependent systems that are talking to each other.

So every time you had to make a change, you'd then find out that you're breaking 10 other things. And then the natural instinct at that point, because we were not enlightened, is to say, "Stop everything. We're not going to do this anymore." Right? Everything's a freeze, and you wait till you're absolutely sure that everything works before you push it out the door.

So very quickly, we ended up in a mode where our release train became once every two months. An enterprise release is assigned. If you have any work that you need to do, regardless of how many customers you're impacting, regardless of what the business impact of that feature is, you get to slot it into an enterprise release.

And naturally, the instinct for product teams, for business teams, is to say, "Hey, I want to make sure my stuff gets priority." So they'd go up and fill up that pipeline for years out. Right?

So if you now have a new feature that you need to put in to address something in the marketplace, your options are: go have tough conversations with every other business team to try and move something off the pipeline, or put something out there three years out.

And I don't know about everybody else in this room, but I can't think of where the marketplace is going to be three years from now. And to make the assumption that a company, an internal team, is smart enough to predict what we'll need in three years, that we're that smart, I think that's a false assumption and something that would put a company out of business very quickly.

So this was our issue, right? And we had geographical constraints and complexity. So we had to deal with all of that.

We had some really amazing dev teams, and I am privileged to work with these guys. Amazing teams that could build great products. Our technical talent, our team leads, we have some of the best in the industry.

All of these teams came together and they built systems, and I'll talk about this in a second as to what we did, but all these teams built code that used to sit there, and you've heard this theme over and over again since this morning. We used to have code that sat there for three months, six months, 12 months in some cases, before it actually went out the door. Right?

Beautiful code. Dev teams wrote perfect code, excellent code that worked really well in their local test containers. And then they would go out and put it into an integration environment six months later because that's when the release train got there. And they'd realize they'd written a beautiful square peg for a round hole. And now you're back again trying to deal with that while also trying to deal with all of the other stuff that you're trying to do for your next enterprise release.

So this was the world.

We actually made a couple of key decisions. We came back and said, "The amount of technical debt we have accumulated over time because of this design is not something we can handle easily through an iterative process. We've got to make a hard decision on some stuff."

The first piece there was we said, we're going to move away from our current technology stack. And I know DevOps is not a technology conversation in many ways, but the reason we made that decision is so that we could have a clean slate to work from.

So we used to have desktop environments. We decided to move away from that. We decided to move to web-based interfaces for everything that we do, primarily to also help us build the agility, right, and have a way for us to deliver features, code, and content faster.

We also made a conscious decision that we were going to simplify. And the target was to take those 44 applications over three years and bring them down to one single application.

Now that's crazy because, at a certain point, we had a million lines of code across these 44 systems that we were going to bring down into one single application.

We also decided to take the teams that we had and have them move to a completely different technology stack. And this was not by design, more so the fact that we decided we were going to do a web-based application. We decided to move to .NET and Microsoft technologies for the application. It was a new set of teams we brought in, and they were able to go out and build things from the ground up.

The other piece that we did that really helped us is we went in and looked at our gateway over gateway over gateway problem and said, "We're going to figure out how to simplify that structure, and we're going to take our backends and our mid-tiers, and we're going to get them to be a service-based architecture."

Now, this is not rocket science. Everybody does it today. But four years ago when we made that decision, it was a lot harder to commit to taking all these systems and re-architecting them. But I'm here to tell you that was absolutely the right decision. If we had not done that, we wouldn't have gotten where we are today.

We decided to redesign the infrastructure. So as Tom was up here talking about it, I'm laughing in my head because we used to have these conversations four years ago where we said, "Hardware is so expensive. How do we go out and build hardware that's going to be that resilient? We can never do this."

And then over the past four years, the cost of hardware has become so small compared to the amount of effort it takes to actually drive enterprise releases. We actually went out and built resilient infrastructure through software across multiple sites, and redesigned the entire infrastructure to live in that kind of environment.

The last piece was we decided to break down the silos. So when we started looking at this, we had development teams that went out and did their dev. We had test teams that came in two weeks prior to an enterprise release and tried to get everything tested. And then we had an operations team that would stay up at night, all night, trying to get this stuff out the door and invariably breaking things or rolling things back.

And so we decided that for COFE as an application and a system, we were going to have a different model. And we broke down the silos, put all the teams together, and said, "Nope, the way that we do it is we're going to iterate on COFE on a daily basis. We're going to test what comes in as you build it, and then we're going to deploy it through automation and through scriptable deployments," really more than anything else.

And in doing that, we were able to get much, much better and much, much faster at getting stuff out the door.

So that's what the teams were doing. And I was peripherally involved while a lot of that stuff was going on in the first year. In the second year, I decided to come on board with this team and take over the responsibility for the operations team. So production support, site reliability, DB support.

And in the first few weeks that I was there, I don't know how many people run operations teams here, but it seems to be a trend. You take a job in one of these roles, and the first few weeks, everything breaks loose, right? It's utter chaos and people are asking you, "Why can't you fix something?" when you just came into the job.

So we got there, we fixed it, we patched it. There were a lot of heroics involved. And then I spent some time talking to the teams to try and figure out what we thought we should be doing.

So I went to talk to our development team partners, and this is what their mandate was: build simple, reliable, personal experiences that delight our customers. This is what we want our development teams to do.

Came back to my operations teams and asked them, "What do we need to do?"

And the answer was, "Well, every time we deploy something, we break things, so we should stop as many of the deployments as possible. Changes cause defects; therefore, we should have no more changes."

And it's incredibly hard to make people who have made a decision to help protect customer experience think about this differently. Because the bottom line is you're advocating stagnation as a business strategy, which is not a great way to go.

So we went around and talked to all of our operations teams, talked to all of our development teams, and we sat down and said, "Look, this is not going to work."

The real answer is that if we can't do it fast, if we can't do it well, we have to keep doing it over and over again till we get better. And I think Jez Humble put it much better: if it hurts, do it more often and bring all that pain forward so that when you're actually doing it for real, you're not worrying about it.

In doing that, we were able to go out and explicitly do something that teams at Verizon had not done before. We said, "We're going to have the exact same goal as our development teams. Our operations teams do not have a different goal. They have the exact same charter. They have the exact same mandate. If you cannot do this, then don't think of yourself as part of a team that's built to go deliver code to your customers."

We started doing this, and the way we did this is in weekly sprints. We sat down with our development teams and said, "Tell us what is breaking in the environment. Let's go figure out how to fix it together." And I'll talk about a couple of things that came out of it.

We also started talking to teams about how to get better at it. And if you think about the feedback loops that we needed, that weekly process helped us to identify how to fix things. We started looking at it as a system end to end.

And we also allowed our teams to experiment a little bit because now they were not dealing with issues on the day of the release, issues with when we were getting things out the door.

So if all this sounds a little familiar, it's because it ties very closely with my favorite how-to, which is "DevOps: The Novel."

I don't know, has anyone read "DevOps: The Novel"? It has a different working title. It's called The Phoenix Project. Right.

I loved that book when I read it, and it was just a way of thinking about systems and a way of thinking about how to get better at doing things within a system that reflected my experience. And I love the fact that we were able to show that to other people and say, "Hey, here's your starter. Go do this."

So let's talk about impact. What did this do for us?

Across the entire system that we have, our changes went from six times a year to twice a week. That's pretty good.

In terms of agility, we had automated everything, so our releases went from hours to minutes. Great.

The really interesting part is that our business teams and our partners started perceiving IT very differently. We were no longer the necessity that they had to go through to get things out the door. We started becoming a competitive advantage because they could come to us and say, "Hey, I want something because I see something happening in the marketplace. I want to be able to make a change. Can you do this for us?"

And the confidence that we built by having good code delivered every two weeks on the dot allowed them to make business decisions that helped drive the company forward. So that was great.

I wore an operations hat, so this thing is really close to my heart, which is quality. Over the course of the three years, every year, we had a 30% reduction in the number of incidents that we had and the number of outages we had.

And if you just think about that, and if you think about the fact that we had all of these engineers who were on call 24/7, dealing with this stuff, typically at 2:00 a.m., incredible change in the quality of life for our engineers. And I cannot overstate the importance of that because these are the same people who then come back and look at everything else you're doing across the system to try and improve what you're doing.
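
As a rough illustration of what a 30% reduction every year adds up to, here is a minimal sketch of the arithmetic. It assumes each year's reduction is measured against the previous year, so the reductions compound; the talk does not say which baseline was used. The starting count of 1,000 incidents is hypothetical, not a Verizon figure.

```python
def remaining_fraction(yearly_reduction: float, years: int) -> float:
    """Fraction of the original incident count left after compounding reductions.

    Assumes every year removes the same share of the previous year's incidents.
    """
    if not 0.0 <= yearly_reduction <= 1.0:
        raise ValueError("yearly_reduction must be between 0 and 1")
    if years < 0:
        raise ValueError("years must not be negative")
    return (1.0 - yearly_reduction) ** years


if __name__ == "__main__":
    baseline_incidents = 1000  # hypothetical starting point
    for year in range(4):
        left = baseline_incidents * remaining_fraction(0.30, year)
        print(f"after year {year}: about {left:.0f} incidents per year")
```

Under that compounding assumption, three years of 30% reductions leave 0.7 × 0.7 × 0.7 = 0.343 of the original volume, which is roughly a two-thirds cut in incidents and outages overall.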

We had some sort of second-level benefits that we didn't quite expect. So as we went through and looked at all of these systems and started shutting them down and bringing them down across those 44, what we realized was there were other systems in that same bucket that we could start to either simplify, because you didn't need all those rules, or you could kill, because now you could deliver the features and the capabilities that were in those systems in a modern fashion, in the new system, much faster.

And so we ended up reducing about 300 applications from across our portfolio, which, starting from a base of about 800, is pretty tremendous. We're working more through that.

Our enterprise release model moved to once a month. And you look at that and say, "Well, that doesn't sound great." And it doesn't sound great, but the actual benefit there is the enterprise model is now only used for some very specific things, where you have to have 10, 15 systems do an individual piece of work that has to be delivered on a specific date.

All of the other systems were actually moved to a release-when-you're-ready model, and that's why we get to two deploys a week. Right? You take away all the constraints that you have where systems need to talk to each other.

So a lot of pull-through as we went through the process.

So lessons learned. Focus on the business result. If we looked at what the challenge was, there was no way for us to fix any of this without actually increasing our agility, right? So picking a problem statement that was a business statement, one that required us to do better, is what drove us to do better.

Leadership was key. So we needed support in our org. Making sure that our senior leadership understood why we were doing this, and making sure they understood that the mandates and the dictates were not helping. Getting them to buy into that is really hard when you're in a senior leadership position, where people have always been used to going out and telling teams to go do this, and then they go off and do it. Instead, you have to understand that you need to give dev teams and operations teams the leeway to do things better, and they will surprise you with how much better they do. That's important.

Building safety nets. So Tom talked about risk. And for us, the risk management was in building these safety nets, but with software-defined infrastructure that helped us scale, that helped us fail cleaner, faster.

And we automated everything. And with automation, it's kind of interesting because everybody looks at automation and says, "Well, there are always these special snowflakes." And I ran a couple of teams that were special snowflake teams. Our database teams, and I love those guys, but I asked them, "How many database changes in a year, or what percentage, can you automate so that you can roll forward and roll back without issues?"

And when we started this journey at the beginning of 2013, the answer was maybe 5%. At the end of 2013, we had 97% of our database changes doing automated deploys and automated rollbacks where necessary. And that was tremendous for our DB teams because, again, they're not staying up at night anymore doing this work.
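
The talk does not describe the tooling behind those automated database deploys, so the following is only a minimal sketch of the general roll-forward and roll-back pattern, not Verizon's implementation. It uses Python's built-in `sqlite3` module for portability; the `Migration` class, the `schema_migrations` bookkeeping table, and the function names are all hypothetical. Each change carries its own undo statement, and the schema change and its bookkeeping run in one transaction so a failed step leaves the database untouched.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    version: int
    up_sql: str    # statement that rolls the schema forward
    down_sql: str  # statement that undoes up_sql


def applied_versions(conn: sqlite3.Connection) -> set[int]:
    """Return the versions recorded in the (hypothetical) bookkeeping table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)"
    )
    return {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}


def _apply(conn: sqlite3.Connection, change_sql: str,
           bookkeeping_sql: str, version: int) -> None:
    """Run a schema change and its bookkeeping update in one transaction."""
    conn.execute("BEGIN")
    try:
        conn.execute(change_sql)
        conn.execute(bookkeeping_sql, (version,))
    except sqlite3.Error:
        conn.execute("ROLLBACK")  # leave the database exactly as it was
        raise
    conn.execute("COMMIT")


def roll_forward(conn: sqlite3.Connection, migrations: list[Migration]) -> None:
    """Apply every migration that has not been recorded yet, oldest first."""
    done = applied_versions(conn)
    for m in sorted(migrations, key=lambda m: m.version):
        if m.version not in done:
            _apply(conn, m.up_sql,
                   "INSERT INTO schema_migrations (version) VALUES (?)", m.version)


def roll_back(conn: sqlite3.Connection, migrations: list[Migration],
              target_version: int) -> None:
    """Undo applied migrations newer than target_version, newest first."""
    done = applied_versions(conn)
    for m in sorted(migrations, key=lambda m: m.version, reverse=True):
        if m.version in done and m.version > target_version:
            _apply(conn, m.down_sql,
                   "DELETE FROM schema_migrations WHERE version = ?", m.version)


if __name__ == "__main__":
    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN / COMMIT / ROLLBACK statements above control transactions.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    changes = [
        Migration(1, "CREATE TABLE customers (id INTEGER PRIMARY KEY)",
                  "DROP TABLE customers"),
        Migration(2, "CREATE TABLE orders (id INTEGER PRIMARY KEY)",
                  "DROP TABLE orders"),
    ]
    roll_forward(conn, changes)
    print("applied:", sorted(applied_versions(conn)))
    roll_back(conn, changes, target_version=1)
    print("after roll back:", sorted(applied_versions(conn)))
```

Pairing every change with a tested undo step is what makes a push-button rollback possible; a change without a safe `down_sql` would fall into the small remainder that still needs a person.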

So there are no special snowflakes.

Mainframe was another great example where we went from five-, 10-person teams working through releases on release weekends to push-button changes that were happening across the board.

So that's what we learned, and this is something that's doable if Verizon can do it. We're a technology company. We're an engineering-focused company, but our IT organizations were in the same morass that a lot of enterprise teams are. If we can do it, you guys can do it.

So I'm super excited about learning from everybody as we go through this process for not just our IT organization, but across the rest of our technology organization.

So what are we doing next? We're going to try and scale this across the entire company. So you saw something that was focused on the customer experience side of the house. We're going to do this across our network systems. We're going to do this across all of our technology organizations. And this is how we're doing it.

We're going to establish a corporate DevOps program. We have a platform and a model. And something that's a little bit more unique to Verizon is we have a very strong Lean Six Sigma program internally. And so what we've been doing here is using them to define the business and the measurement model.

And what that helps us with is it helps us go out and make the argument that this is something we need to do, and it helps our bottom line in many cases, and helps drive revenue acceleration in other cases.

This is something near and dear to my heart. We're going to actually be spending some time talking about measurement and metrics over the next six months with teams that Gene has set up. And if you're interested in this stuff, stop me in the hallway. I'm happy to talk about it because I think this is important. This is how you make the changes happen in large enterprises.

The other thing that we're doing, which is based on experience, is building a corporate-wide toolchain that helps you do the automation, that helps you get the feedback, that helps you move things fast.

Based on just personal experience, I recommend that you get a dedicated team that groks what you're trying to do, help them build the standards, and go out and use that to drive the experience. But one thing that's very important is don't be prescriptive.

You cannot be prescriptive about this stuff because you're going to be challenged, and it's going to look like you're stifling innovation. If you can be opinionated and if you can provide a better path than what they have today, the developers will come.

Because I was a developer, and I know how lazy I was. And if I could do something through a method and a system and a process that had been set up for me that was better and easier than the one that I had, then I was likely to use that.

So we are working through this process across our entire organization. We're excited about it, working with some very cool partners.

Our measures of success, again, this is what I was talking about. We're going to work through this. We have a target for this year and the next year to improve a few of our key parameters, cycle time, deploy frequency, and defects per feature, and to drive up our financial impact, by 30% each.

One kind of interesting experiential learning thing: don't make teams compete against each other around this stuff. It's great for people to be able to point to each other and say they're doing something better. But if you take teams and have them compete on these metrics, you're going to get bad behaviors coming out of it.

The best way I've seen teams do this is to have them compete against themselves over time. So point-in-time measurements, compared over a period of time, are really the best way to drive this conversation.
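
One way to picture a team competing against itself is to take the same three measures named above (cycle time, deploy frequency, defects per feature) at two points in time and compare them. The sketch below is an assumption-laden illustration, not the measurement model used at Verizon: the `Deploy` record, its fields, and the definition of cycle time as commit-to-deploy hours are all hypothetical choices.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Deploy:
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when it reached production
    features: int           # features shipped in this deploy
    defects: int            # defects later traced back to this deploy


@dataclass(frozen=True)
class Snapshot:
    median_cycle_time_hours: float
    deploys_per_week: float
    defects_per_feature: float


def snapshot(deploys: list[Deploy], start: datetime, end: datetime) -> Snapshot:
    """Measure one team over the half-open period [start, end)."""
    in_period = [d for d in deploys if start <= d.deployed_at < end]
    if not in_period:
        raise ValueError("no deploys in the requested period")
    weeks = (end - start) / timedelta(weeks=1)
    cycle_hours = [
        (d.deployed_at - d.committed_at) / timedelta(hours=1) for d in in_period
    ]
    features = sum(d.features for d in in_period)
    defects = sum(d.defects for d in in_period)
    return Snapshot(
        median_cycle_time_hours=median(cycle_hours),
        deploys_per_week=len(in_period) / weeks,
        defects_per_feature=defects / features if features else 0.0,
    )


def percent_change(before: float, after: float) -> float:
    """Change from an earlier measurement to a later one, in percent."""
    if before == 0:
        raise ValueError("cannot compute a percent change from a zero baseline")
    return (after - before) / before * 100.0


if __name__ == "__main__":
    day = timedelta(days=1)
    t0 = datetime(2015, 1, 1)
    history = [
        Deploy(t0, t0 + 2 * day, features=2, defects=1),
        Deploy(t0 + 8 * day, t0 + 9 * day, features=2, defects=1),
        Deploy(t0 + 15 * day, t0 + 15.5 * day, features=1, defects=0),
        Deploy(t0 + 20 * day, t0 + 20.5 * day, features=1, defects=0),
    ]
    earlier = snapshot(history, t0, t0 + 14 * day)
    later = snapshot(history, t0 + 14 * day, t0 + 28 * day)
    change = percent_change(earlier.median_cycle_time_hours,
                            later.median_cycle_time_hours)
    print(f"median cycle time changed by {change:.0f}% against the team's own baseline")
```

Because both snapshots come from the same team's history, the comparison shows that team's own trend and produces no ranking across teams.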

Where do we need help? Standardization in measurements. How do you measure business benefits? How do you define success criteria? I think there are a lot of teams that are grappling with this. How do you get your senior leadership to buy in?

So getting there, we will help, but we'd also like some help from the industry getting there.

Last bit here. It really enhances our customer experience when we're able to deliver things faster. We look at that, and we also say we want to improve our engineering teams and the lives of our engineering teams. So help us show the world that it's okay to want to do things better, build better software, build better code.

And so that's where we want to be. If you have any questions, I'm happy to take them outside of this room because he's telling me that I'm out of time.

Okay.