Deploy or Die in a DevOps World – High Availability is No Excuse

San Francisco 2014

With a mandate from our CEO to aggressively grow our real-time services and analytics businesses, we have found it essential to quickly implement DevOps and CD over a suite of more than 30 different services – some of which are required to maintain 99.999% uptime – while at the same time implementing a new service delivery platform and leveraging the cloud.

Natalie Diggins recently joined Neustar as VP Cloud Strategy and Platform to help with the transition. In this talk, Natalie will present the strategy used to socialize the changes and build consensus around them. She’ll also share specific elements of the plan that will be applicable to any DevOps implementation in a high-availability environment.

Full transcript

The complete talk, organized by section.

Natalie Diggins

Hey, thanks, guys. Again, my name is Natalie Diggins. I'm the vice president of cloud strategy and platform for Neustar, and I decided to do this talk today because for so many years now, I've learned from you guys, and I keep iterating as I go. So we're doing something kind of cool at Neustar. We haven't quite figured it out yet. It's a work in progress, but I want to share back and see how we can continue to iterate.

So, how I learned DevOps. I think many of you are similar to me in terms of how you came to it. You kind of learned it. You fell into it.

Back in the day, I was working for a competitive local exchange carrier, and this was a time where we were competing with the incumbents, the local exchange carriers, in a telephony platform, a telephony environment. And the reason why we did that is because we thought we could do it better, right?

But the problem was, most of the leadership, they were all telephony people, and for them, innovation was terrible. You didn't do it, right? The answer to every deployment, to every question was no. Why? Because they valued stability over everything else. So on Mother's Day, you never wanted your mother to pick up the telephone and get a fast busy.

So we quickly found out we couldn't compete, and that company went bankrupt. So I went out with some friends, some colleagues, and we decided we're going to do it different. We're going to do it better. Somehow, we thought we had figured it out.

We went out and I co-founded New Edge Networks. At New Edge Networks, we raised half a billion dollars in debt and equity, and we built the largest asynchronous transfer mode network in the world. And on top of that, we ran pretty much every service that you could think of, from TDM to frame relay to ISDN to DSL.

Now, in our world, we didn't know the meaning of the word no. We innovated off the charts, and subsequently, our stuff didn't work. So I went from having this really locked-down, rigorous telephony environment, where innovation was bad, stability was everything, to the exact opposite, where we rolled stuff out every day in this crazy environment, and nothing ever worked.

So I kind of continued on in my career path trying to figure it out, and went to several startups, SaaS providers. And one day, not too long ago, a couple of years back, I was telling a gentleman what I was doing, and he said, "Oh, you're DevOps."

I said, "No, I'm technical operations. I do this, that, and the other."

And he's like, "No, you're DevOps."

And up until that point, I'd never even heard the concept of DevOps. It was just what I was doing every day in my work world, where I was trying to balance the speed and agility on one side with the stability of my services on another. And come to find out, it's DevOps.

So I suspect you guys are pretty similar in that regard. You may not have known what you were doing was called DevOps, but it just kind of makes sense. It makes sense because it allows you to roll as fast as you can in as stable an environment as you can.

So today, I'm working at Neustar, and I've been at Neustar for just over a year. Neustar brought me on board because I was really working with a lot of continuous integration, with continuous deployment, not quite successfully yet but working on it, and in a DevOps agile environment.

Neustar is really two different companies under one. So the first is, you guys probably know if you're porting one of your... Say you have a cell phone. You're porting from one carrier to another. We have the monopoly on that in North America. So we provide the carrier services. We also provide order management to carriers, et cetera.

So that's on one side of the house. On one side of the house, we're very telephony-focused, but in the new world environment where we're going, which now comprises more than half our revenue, we're in an information services and analytics space.

So what does all that mean? Just to bring it home into the products and services you may know, we are the top-level domain registrar for several domains, including .biz, .nyc. We have the SiteProtect products, so for DDoS mitigation, we have an incredible roster of clients who subscribe to the services.

We also do IP analytics. We do marketing services. We do web metrics, so we do all the analysis on your website in terms of performance. So we can pretty much tell you who's hitting your website, the demographic. We can tell you the health of your website. So we've got all this going on.

The problem that we have is, how do you now, under one company, combine the speed and agility and the stability? So you've got the telephony five nines, but you're also moving into the IS&A space.

Now, to compound that, as if that wasn't enough, we have today now more than 50 products and services. So not only do we have these two companies, we've got more than 50 products and services that we are supporting in this environment, and we have grown rapidly through acquisitions.

So you guys know what M&A is like. You buy a company, and what comes from that company to you is pretty much a heterogeneous platform, right? The tools are all different. The culture is all different. So if you're Neustar, you're this multi-billion-dollar publicly traded company, how do you quickly integrate all of these acquisitions into your platform so you can scale?

And it's still a work in progress. We're still figuring it out. But we're doing that through really more of a cultural change and what we call Technology 3.0. It encompasses several things, and we'll talk about that.

So we used to be what we call 2.0, and we just needed to kind of hit the restart button. Now in this new agile world, in a DevOps environment, CI/CD, we needed sort of a playbook of how to express what it was we were trying to achieve.

Think about it. We've got 1,500 employees, and we started out in this telephony space. So when you start to talk to them about DevOps, they're like, "Whoa." Or ITIL: "Whoa, ITIL. We don't change." How do you bring about that sort of transformation?

So what we came up with is almost like a playbook. It's a Technology 3.0 model, and that's really because so much of it is the culture and the people. We'll talk a little more about the tools and technically what's going on, but I think the hardest part is the people.

So you've got the cloud, the platform, the tools, and then the most important thing is the adoption. Because we can have all these great things, but if we can't get people to use them, we might as well turn the lights out and go home.

So we created this manual. And just to give you an idea, we started out with a small working group of about 10 people. It was called our NextGen team. They were the next-generation team to help us determine, what are the values? How does it start? We wanted something that came from the bottom up in terms of the values, and so they created that.

Also, what does it look like to be 3.0? What does it look like to be DevOps? What tools do we use? How do we solve problems? What are our values? So we also identified that, and then we also asked for commitments from the employees in terms of what they would do, things like automation.

But then in return, our CTO made commitments out to the entire employee base about what he would do, and they were things like he would nurture our careers, he would reward excellence, he would call out non-performance, et cetera.

So it was a technology manual, but it was almost like this pact that we have with each other in terms of how we're going to perform and how we're going to behave.

I mentioned the common values. Now, for you guys, you guys probably think this is pretty simple and common, like a "duh, why do you have to put that up there?" But frankly, automation. We all live and die by that, right? But if you're coming out of the telephony world, you don't really automate much. You call Ericsson, and they send you a new switch upgrade, right?

So this is actually something really, really important. Also, simple is better. That goes along with the buy, build, or partner sort of decision. We want the simplest, most elegant solution, because then we have fewer opportunities for failure. So here's kind of what we came up with.

Also, there's no I in team. We had a lot of folks for whom knowledge was power, right? And if they started to share that power, suddenly they lost their power base, their knowledge base.

So we tried to call these out into very prescriptive common values that we would use. So if I was doing ego run amok, doing something crazy, these values allow us to come back and challenge in a non-personalized way: "Well, Natalie, where's the I in team with what you just did? Explain to me." So it started to create this framework that we could bring about a cultural change.

I mentioned the common tools and the platform and the DevOps. So you guys are familiar: the infrastructure layer, the PaaS layer, the SaaS layer. And the way we looked at it is, at the bottom, we could have a cloud approach, we could have a hybrid approach, and we could have a metal approach.

So Neustar, we have our own carrier-class data centers, and some of our contracts preclude us from allowing the data to go off-platform, off our platform, out of our control, right? Because these are carrier agreements. That's fine. Peace. We don't need to take everything off-platform. We could go in a hybrid environment, or we could do a pure cloud environment, particularly when we're getting into more of the web-based products.

So we kind of looked at each one of these and said, "What makes sense?"

Now, next on the stack, we go to the platform. So if you think about this, we have all of these different teams. I mentioned all these different products, all of these different engineering teams building these products. So one particular product might need identity and access management. Well, they all pretty much need that, right? But in the old world order, they were building these on their own.

So what we did is we have quite a few different platform services that all of these engineering teams can come to the platform team and essentially buy, for lack of a better phrase, these services that they would normally have to build on their own. So you can already start to see where we're getting scale. Instead of 12 teams building an identity and access management module, we've now got one, and then we're shifting it out to the individual teams to use in their products. So that's one of the ways for scale.

And then at the SaaS layer, again, I mentioned we still have our engineering teams building the individual applications, but based on the common infrastructure, based on the common platform, we can really start to focus our intellectual property at an application layer, and that's where a lot of the DevOps comes in.

And one of the most important things is that not every product and service can utilize all of these different aspects of this Technology 3.0 culture. And one of the things I learned early on is: do not try to shove this stuff in. Go for sort of the easy ground, the low-hanging fruit. We'll talk a little bit more about that in just a minute.

I mentioned that DevOps is really hard, and I'm preaching to the choir here. We looked at quite a few models. So the Google SRE model. Some things I really like about that: I like the concept of the error budget. Once you have enough defects or faults on your platform, pretty much all production deployments stop until the entire team can holistically heal the platform. So you're pushing your speed on your delivery as fast as possible on one side, but you're protecting your SLA.
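
The error-budget idea described above can be sketched in a few lines. This is only an illustration of the concept, not Google's tooling or anything Neustar runs: the function names, the 30-day window, and the choice to measure the budget in seconds of downtime are all assumptions made for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def error_budget_seconds(slo: float, window_days: int = 30) -> float:
    """Downtime allowed in the window for a given availability target.

    slo is a fraction, e.g. 0.99999 for "five nines" (hypothetical input).
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_days * SECONDS_PER_DAY


def deploys_allowed(slo: float, downtime_seconds: float,
                    window_days: int = 30) -> bool:
    """True while observed downtime is still inside the budget.

    Once the budget is spent, production deployments stop until
    the platform has been healed.
    """
    return downtime_seconds < error_budget_seconds(slo, window_days)


if __name__ == "__main__":
    budget = error_budget_seconds(0.99999)
    print(f"five-nines budget over 30 days: {budget:.2f} s")
    print("deploy with 10 s of downtime?", deploys_allowed(0.99999, 10.0))
    print("deploy with 40 s of downtime?", deploys_allowed(0.99999, 40.0))
```

With a five-nines target the budget over 30 days comes to roughly 26 seconds, which gives a sense of how little room a carrier-grade service leaves for failed deployments.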

Another thing I really like about what they do is their SREs are the ninja engineers. These are the guys you send in. They do the crazy stuff. They're the guys who are like, "Why?" And they're teams that rotate around individually, product by product by product, depending on who's not making their error budget, and there are agreements on how they're deployed. Love what they're doing. Love.

But we don't have, at Neustar, even with our size, we don't have the scale to really do that, right? But there are some things that they're doing that are really cool that we can take advantage of.

Also, there's the pure startup model. So I call it the SoMa warehouse model. That's where I live. So the guys, they're in a dark warehouse for three days, and you just put pizza under the door every couple of hours, and they're happy.

This is the model where the folks, the engineers, they build it, they test it, they deploy it, they support it. So first, frankly, those are not the people Neustar can attract. Let's just assume we want to attract those people. We don't provide that cultural environment.

We provide a very, I would say, more balanced environment where we provide really fantastic benefits. We work really hard, but we know sometimes when to lay off of work. And those folks really don't work in our environment.

But then beyond that, we have compliance issues as well. There are ways to get around that, obviously, through automation, but it's just not a model that works for us. And so what we started coming up with is what I call kind of a true hybrid model, right? Looking at what everybody's doing, looking at what you guys are doing, and learning from you, and building almost like a Frankenstein model of what works for Neustar.

So here are just some sort of examples. If you think about, if you're an engineer, if you're dev, what are you motivated and incentivized to do? You're incentivized to check your code in. You're incentivized to deploy as fast as you can, right?

And if you're an ops guy, how are you motivated? You're motivated for stability. And when do you break? When there's change. So you inherently have these two teams in the same company, and they're at odds. They're immediately at odds because of their different goals.

So what we start to do is create this environment where the dev guy has his development-time-to-commit goals. But you know what? Suddenly he has an SLA goal, too. He's also responsible for keeping that service up once it goes into production.

Now, on the flip side, you go to the ops guy. He's used to having his SLA, his five nines. Great. But suddenly now he has a delivery goal as well. And what you start to see is that the two sides that were normally fighting it out, we can start to get more to center.

Again, we're still working on that. We haven't perfected it. And again, I just want to say right up front, we're a work in progress. But those are some of the things that we've taken advantage of, that we've learned from other folks and what they're doing.

So if there's only one thing we've done right, if I could say one thing, it's this adoption matrix. And I want to talk you guys through this, because I don't know where you guys are in your deployments of DevOps or CI or CD or what have you, but if you're thinking about it at the C level, they all want to know what's happening. What kind of progress are you making? How do you start quantifying that and qualifying that by team? And again, it's not easy to do.

So what we've come up with is, first, a qualitative measure, how well we're doing it, and then penetration, sort of the take rate on the adoption. So what you see here is a rating. If you look at the verticals, and you guys probably can't see it, each vertical represents a product. And again, this says products 35-plus. We're at about 50 products and services now at Neustar. So each one of these verticals here is a product.

Now, if you look down over here, these are all of the different subcategories that make somebody 3.0. So it's everything from--you only see a very small portion here, but it's the platform services, the things I talked to you about. Are they using identity management? Are they using our billing application? What about our API? All of those platform services.

And then there's also, you would see infrastructure services. Are they on the cloud, et cetera. So we've taken all of the--there are about 30 tools and services that make up, that comprise this 3.0 methodology. And within that, we've rated them.

We start with zero, and this is why it's so important. Zero is this product or service is not appropriate for this tool. So right now, when we go to later calculate our adoption, we take that out of the denominator.

And then this is a qualitative measure, from one, not using, to five, ideal usage, so that we're starting to get a qualitative read. So not just are they using it, but how well are they using it? How well are they adopting this service or this tool?

One thing here is that these today are self-rated. We will eventually go to a more rigorous structure, but we had to start somewhere. So you can imagine if you're self-rating it, I can go through team by team by team, and they've all rated themselves, let's say, a three. But how I see the services and the tools or the methodologies deployed in production, totally different between them. So it's very subjective, but we had to start somewhere.

So you can see this right here. That's the qualitative measure. Now we want to calculate the quantitative measure. So if you've got 30 different services that each individual product or service can subscribe to to be Technology 3.0, what percentage is each one of these products?

So if, let's say, the product was not eligible for a tool, it just wouldn't be a fit, say it was a carrier product where we couldn't put it fully in the cloud, we remove that tool from the denominator. So every one of our products can get to 100% adoption of what's relevant to them.

Because what was happening is we were leaving all of the individual elements in the denominator, and I'd be like, "Well, why is that product only at 40%? What's going on, guys?" It's like, well, no, but you know what? They can actually only ever get to 60%.
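
The denominator adjustment can be shown with a small worked example. This is a sketch under stated assumptions, not Neustar's actual spreadsheet: the talk gives the 0 to 5 scale and the rule that a 0 leaves the denominator, but treating any rating above 1 as "adopted" is an assumption, and the tool names and ratings below are made up.

```python
from typing import Dict

NOT_APPLICABLE = 0  # tool is not appropriate for this product
NOT_USING = 1       # applicable, but not adopted yet (5 would be ideal usage)


def adoption_summary(ratings: Dict[str, int]) -> Dict[str, float]:
    """Compare naive and adjusted adoption for one product.

    ratings maps a tool name to its 0-5 self-rating.
    """
    for tool, rating in ratings.items():
        if not 0 <= rating <= 5:
            raise ValueError(f"rating for {tool} must be 0-5, got {rating}")
    total = len(ratings)
    if total == 0:
        raise ValueError("ratings must not be empty")
    eligible = [r for r in ratings.values() if r != NOT_APPLICABLE]
    adopted = [r for r in eligible if r > NOT_USING]  # assumption: >1 counts
    return {
        # every tool left in the denominator, applicable or not
        "naive": len(adopted) / total,
        # the most the naive figure could ever reach for this product
        "ceiling": len(eligible) / total,
        # not-applicable tools removed from the denominator
        "adjusted": len(adopted) / len(eligible) if eligible else 0.0,
    }


# Hypothetical product: 30 tools, 12 not applicable, 12 in use, 6 unused.
example_ratings = {f"tool_{i:02d}": 0 for i in range(12)}
example_ratings.update({f"tool_{i:02d}": 3 for i in range(12, 24)})
example_ratings.update({f"tool_{i:02d}": 1 for i in range(24, 30)})

if __name__ == "__main__":
    for name, value in adoption_summary(example_ratings).items():
        print(f"{name}: {value:.0%}")
```

In this made-up case the naive figure is 40% and can never exceed 60%, while the adjusted figure is about 67% and can reach 100%, which is the distinction the matrix is meant to expose.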

So it was the way that we determined that we could start to track really where we are. And it tells us who is falling behind. We can develop heat maps from it, and it's a completely level playing field on the quantitative axis. So this has just been tremendous for us in terms of helping.
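
One way to picture the heat map is as a ranking of products by adjusted adoption, with a colour band per product. The sketch below is illustrative only: the product names, tool names, ratings, and the 50% and 80% band thresholds are all hypothetical, and it again assumes that any rating above 1 counts as adopted.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical matrix: product -> tool -> self-rating (0 = not applicable).
matrix: Dict[str, Dict[str, int]] = {
    "product_a": {"identity": 4, "billing": 3, "api": 5, "cloud": 0},
    "product_b": {"identity": 1, "billing": 2, "api": 1, "cloud": 1},
    "product_c": {"identity": 3, "billing": 1, "api": 2, "cloud": 4},
}


def summarize(ratings: Dict[str, int]) -> Tuple[float, float]:
    """Return (adjusted adoption, mean rating) over applicable tools."""
    eligible = [r for r in ratings.values() if r != 0]
    if not eligible:
        return 0.0, 0.0
    adoption = sum(1 for r in eligible if r > 1) / len(eligible)
    return adoption, mean(eligible)


def band(adoption: float) -> str:
    """Map adoption to a heat-map colour (thresholds are assumptions)."""
    if adoption >= 0.8:
        return "green"
    if adoption >= 0.5:
        return "amber"
    return "red"


def heat_map(data: Dict[str, Dict[str, int]]) -> List[Tuple[str, float, float, str]]:
    """Rows sorted so the products falling furthest behind come first."""
    rows = []
    for product, ratings in data.items():
        adoption, quality = summarize(ratings)
        rows.append((product, adoption, quality, band(adoption)))
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for product, adoption, quality, colour in heat_map(matrix):
        print(f"{product}: {adoption:.0%} adopted, "
              f"mean rating {quality:.1f}, {colour}")
```

Because not-applicable tools are dropped before the percentage is taken, a carrier product that cannot use the cloud is ranked on the same footing as a web product that can.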

Our learnings. One thing we did, I think, really well is we ring-fenced the sensitive applications. I was just talking to a colleague last week, and he was telling me it's a Fortune 500 company, very large, financial services, and he's rolling this out right now, and he's got all kinds of compliance issues.

And he said, "The biggest mistake we made is I didn't give anybody an out." He said, "I really pushed this hard, not recognizing that for some services, it's not appropriate." And he said, "I wish I could go back and do it again."

So this is one thing we did right, is that we recognized one size didn't fit all for the different products. Where the service didn't fit, we took it out of the denominator. Definitely something that I'm happy we did.

Next is, again, building the adoption matrix. We were really, really struggling to figure out where our gas pedal was, where our lever was. And we're right at the point now where it's, "Okay, that's all fine and good, Natalie. You have adoption metrics. Show me the beef. Now show me the results."

And that's what we're working on now, is, okay, so now that we're getting traction, we've got adoption, what does this mean to the business? How are we cutting down deployment cycles? How are we building resiliency, et cetera? Our uptime, how is this improving our uptime?

So something we didn't do so well is gaining understanding and buy-in from the rest of the organization. I think we really spent a lot of time on the technology organization, really working in roundtables, doing all-hands, soliciting feedback. We still could do better, but I feel like where we really stubbed our toe is elsewhere in the company.

So the company starts to hear things like, "Well, what's 3.0?" Well, absolutely, we have a course to explain that to them. "Well, what is the benefit? I don't understand. Why are we wasting all of these cycles on learning X?" Well, we haven't done a good job of explaining to the business, and if I could go back again, I would redo that.

So when I originally put together the plan for migrating the company onto this new methodology, there were three positions I thought were key, in addition to just the regular managerial, and these were adoption folks.

The first thing I thought was key was a product manager. And the reason why is because, again, we've got all of these different product people in our company. I needed somebody to be able to go and sit with the product manager and explain to the product manager, "Hey, how much budget do you have? Well, you know what? For that budget, let me tell you how I can help you roll features faster." So somebody who almost productized from a platform perspective internally into the company. So we do have that, so that's good.

The second thing was a solutions architect, or really an SE for the platform, who could go right in with that product manager, who had one foot in the platform world and can sit down and code with the engineers, but then on the other hand, could talk directly to the product folks who know that.

The thing that we didn't do that I think would've fixed this, and it's contrary for a technology organization, is I felt we needed a marketing person. And if I could go back and change it, I would've thrown my body in front of the bus to get that marketing person, because we've really stubbed our toe on getting buy-in from the rest of the business.

And then that goes hand in hand with ongoing communication. I think we're just all so down in the details. We're deploying every day. We've got release after release after release, and we're constantly doing upgrades. We've got tech debt, and we're trying to migrate to the new platform. And we as engineers aren't naturally good communicators in general.

And so we played to our strengths, and we burrowed into the technology and our vernacular and our world, and we didn't do as good of a job as we could've in communicating.

So next steps for us. Again, it's a holistic process. We're trying to work across the business. We're just now starting to get the metrics that show them the benefits. So at our recent all-hands, it was last month, we actually had one of the product managers come in, and he was one of the most averse. He absolutely did not want to do this. He was against it.

He put his product into the 3.0, the next-gen, the cloud environment, and he came out at our all-hands absolutely singing our praises. We had quantifiable metrics. He got up there, he told qualitative stories, said he was a happy customer, and now what we need to do is get more of those happy customers to help us tell the story.

And then again, I mentioned we've got this framework, but it's a work in progress. We still have a lot to do on the execution side. Again, one of the great things is I can come to something like this, and I can hear what you guys are doing, and I can listen to what you're doing, and I can go back and apply and make what we're doing better.

And then again, the business metrics. Right now it's all about how we can show value to the business, but we're right there. We're just at the point where we've got adoption, and now we can start to show the benefits.

Okay. Any questions?

Q&A

You guys are quiet. It's not beer time yet. Yes, sir.

Q: Just in regards to your customers, are you internal or external?

A: So the question is, in regards to our customers, are we internal or external? We're both. So not only do we have all these products and services, we are both. So we sell essentially internally into our product groups, and then we also sell externally. We're a full-blown service provider, both to carriers and to enterprise.

Q: Right. Okay. I suppose the question I was leading to, though, was in terms of the way that we incentivize our operations and our delivery teams, are we jumping--when I say we, sorry.

A: Hey.

Q: Are you jumping in line--

A: The royal.

Q: --in terms of your internal customers? So in terms of your 3.0 and the teams that you're bringing on board with your shared services, what happens if someone comes along and says, "You know what? This isn't working," or, "This isn't..." How do you prioritize that work? Because you're incentivized to do so in terms of your KPIs and things like that for your managers.

A: Yeah. So when they come around, do you mean like a product person?

Q: Your internal customers. So the teams embracing your--

A: Right. So right now we're pretty much self-funded. We're self-funded in that regard, but that's about to change. So we've got to get enough traction so that they will fund us, essentially.

So what we want is our guys going out into the product groups and applying their budget to our platform. The other thing we said is, my favorite words are end of life. Right? You can't make me any happier right now than to tell me I'm end-of-life-ing a platform, because then, this sounds bad, but they don't have much choice, so they're more willing to work with us.

So right now we're self-funded, but as their platforms go EOL, we'll go ahead and we'll capture them with this new platform. Did that answer your question? Maybe we can stick around later. I'd be happy to talk to you about it further.

Okay. Any other questions? Yes, you, sir.

Q: The concept of value stream mapping, have you used that in your system or in your overall process to try to identify waste or ways to speed up the process?

A: Yeah. So the question is value stream mapping and how have we looked at that in terms of an overlay on our process to determine where we might have waste or excess, or really what I would say is opportunity, too, right?

So we've done that informally, very informally and non-scientifically, right? We know where our platform's bleeding, right? Any ops person, any dev person, you know where your pain points are. We've not done it scientifically yet, though. I think, though, that as we then gain adoption, we will start to pick that up. We're just not mature enough yet. We're just not mature enough.

Okay. Anybody else? Okay.

All right, guys. Thank you. Thank you very much.