DevOps Transformations At National Instruments and Bazaarvoice (And Infosec!)
In this presentation, I’ll share the thrills and chills of the real-world successes and setbacks in culture and collaboration, speeding up software releases, embedding DevOps engineers into product teams, implementing agile processes with operations teams, integrating testing and information security into daily work, automation and its pitfalls, metrics and their weaponization, and more. I’ll also discuss how we integrated security objectives into all these initiatives.
Full transcript
Ernest Mueller
I'm Ernest Mueller from Austin, Texas, and I'm here to talk to you about a series of DevOps transformations that I've helped lead.
Before you ask, no, I am not Paul Blart, Mall Cop. I get that question frequently.
I spent the first 10 years of my career in Memphis, Tennessee, working for both FedEx, huge enterprise IT, and then start-up IT. Then I came to Austin, embarked on a series of four DevOps transformations, or that's what they would become known as once the word DevOps got invented.
What I wanted to do is just share with you all. This isn't a big theory talk. I'm going to go through the four different transformations I was part of, explain the problems, explain what we did, explain how it turned out, and show you four different ones that are designed to address four different situations.
The first company that I worked for in Austin is called National Instruments. They make test and measurement hardware and software for the scientific and engineering market. Their products help power everything from the super collider to LEGO robotics.
Has anybody here ever used LabVIEW? Yeah. Aha. All right. We've got a couple LabVIEW users.
My first job there was managing a web systems team. The National Instruments website was very important to their operations. A lot of sales went through it, half their leads, all their product information. They had a large team of developers working on it, and hundreds of applications and a variety of stacks. It had moved over time from Notes Domino to Oracle PL/SQL, like straight PL/SQL, to Java, to a variety of things. They were aggressively growing.
The problem we faced there was that the previous team management had been actively anti-automation. I don't know if any of you have been in a place like this, but his theory was that if you built a system to do it, you would forget how to do it; therefore you needed to do it manually all the time.
Unfortunately, what this ended up reaping was extremely low site availability, and it was a mess. The metrics included 200 on-call pages a week, and two destroyed on-call pagers as a result of those 200 pages a week.
Application releases started Friday night at 7:00, and everybody, all developers and operations, would get online, and they would end whenever on Saturday they got done. Just your standard big-bang sort of releases.
My team had been formed out of the other IT infrastructure teams because they had learned early on that we had your standard IT infrastructure setup of Unix team, Windows team, network team, storage, like eight different teams. They figured out pretty early that having all of those folks trying to support the website was destined to failure, so they put us in as like a glue team. Today you'll see some people be like, "I'm going to put a DevOps team between dev and ops." And that's what that looked like.
What we did there, the first step was stop the bleeding. When you have 200 pages a week and all this downtime, you don't really have a lot of opportunity to do anything else. Getting metrics, starting to track quality of life, doing root cause analyses on those 200 pages and figuring out what it is that's causing them: doing all that is your first step.
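As a rough illustration of that first step, here is a minimal sketch of tallying pages by cause so the biggest offenders surface first. The page log, host names, and causes are made up for the example; the talk does not describe the actual tooling or data that was used.

```python
from collections import Counter

# Hypothetical on-call page log as (host, cause) pairs.
# All entries are invented for illustration.
pages = [
    ("app-server-1", "jvm out of memory"),
    ("app-server-2", "jvm out of memory"),
    ("app-server-1", "jvm out of memory"),
    ("app-server-3", "jvm out of memory"),
    ("db-server-1", "disk full"),
    ("db-server-1", "disk full"),
    ("web-server-1", "cert expired"),
]


def top_causes(pages, n=3):
    """Return the n most frequent causes as (cause, count, share_of_all_pages)."""
    counts = Counter(cause for _, cause in pages)
    total = sum(counts.values())
    return [(cause, count, count / total) for cause, count in counts.most_common(n)]


for cause, count, share in top_causes(pages):
    print(f"{cause}: {count} pages ({share:.0%})")
```

Ranking causes this way is one simple way to decide which fix or automation to do first; any real analysis would also look at time of day, affected service, and whether the page was actionable.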
Then we automated our way out of some of that: automated service restarts, application deployments, essentially trying to cut out big chunks of the work that were going on so that we could then automate higher-value things.
We really worked a process up where we hadn't yet... DevOps wasn't a thing yet, right? So we didn't figure out the whole "embed people in the teams," but we did come up with a process that the development teams should look at during each phase of their service. It's like, okay, you're designing the service, here's the operational questions you need to ask yourself. Okay, you're coding it, here's...
So we did a lot of that. Then once we got out from under the big crush, we started partnering with both the apps teams and the business. It used to be kind of business team talks to developers, talks to operations team. One of the things I did was get us a seat at that initial table and drive some common goals.
The marketing department wanted higher performance and availability and all that, but frequently that was treated as an operations problem, not as an everyone problem.
We got through that, and after a while, we got staffed up. We got really good at this, and we formed a security practice and an application performance management practice, kind of centers of excellence. A lot of our developers in that organization were hired right out of college. They're still one of those places that does a lot of hiring for talent. The people in the operations team tended to have more experience, so a lot of the more high-level things ended up finding a home with us.
This worked pretty well. The thing that I'm the most proud about is that the director of marketing really started to value our team. Before that, he barely knew we existed. But he eventually said, "You guys, you know what it is that I'm trying to do, and you help me do it."
We improved the quality of life of the engineers. We improved uptime. But it was not an unqualified success. We improved a lot of things, but in the end, we were still the bottleneck.
I remember one meeting I was sitting at, and it was the business and apps and all of this, and they're just like, "Yeah, the web admins are doing great work," all this, "but they're still the bottleneck of us getting things out."
That was my first kind of epiphany moment. I said, "We've invested a huge amount in tooling and in process and in doing all these things, and whenever I talk to other people in the industry, our vendors, they are very complimentary about it. It's like, 'Oh, you guys are doing some very high-end stuff.' But there has to be something fundamentally wrong. If this is the amount of success that it's possible to have, pushing this much work at it, that can't be right. There has to be some other solution."
There hadn't been any DevOpsDays yet, still, at this time. The Velocity conference had just started, and we had gone and started hearing some of the things other people were talking about.
The next thing that I did was move into a different role. Our engineering department decided it's time to get into the world of SaaS products. We hear the kids nowadays love SaaS products.
They started a greenfield team, and they tapped me as a systems architect and a number of other people from our IT organization to come over and start that. They had a number of ideas of products they wanted, but it was all also experimental. It's like, well, is this something we can do? How successful can this be?
Given what we had learned, we went in with some initial decisions. They said, "Okay, well, what do you want to leverage out of IT? What do you want to leverage out of engineering?"
We said, "We want to leverage exactly nothing out of IT."
Because we had learned the hard way that their goals weren't aligned with the sort of things we needed to be doing to ship products. Even just in the previous role, we had gotten to where we had kind of okay dev-to-ops collaboration, but I would say our ops-to-ops collaboration was pretty bad. As a team, as our values shifted to align with that overall web team and say, "Okay, well, we need to roll things out faster. We need to provide you self-service," we found ourselves having trouble with the other infrastructure organizations because those weren't their values.
Another breaking point I had early on was we brought in VMware. In general, it took me six weeks to get a new server out of all those teams because, of course, it would have to go from the Unix team to the network team to the data center team and so on.
So we put in VMware. VMware comes in, does demos. Look, provision new server, 15 minutes. It's all good. Okay, so we brought that in, and then guess how long it took me to get a new server? Four weeks, because the turnaround time for Dell to ship the server was removed.
But I started to realize it's like, okay, it takes 15 minutes to make it, and we have four weeks of overhead. Like, really? Is that really okay? IT organizations up to that time, they felt like that's okay, but it's not okay.
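To put a number on that gap, here is a small back-of-the-envelope sketch comparing the 15 minutes of actual provisioning work with the four weeks of elapsed time. Treating the four weeks as calendar time (24 hours a day, 7 days a week) is an assumption for the example; the function name is hypothetical.

```python
def flow_efficiency(touch_minutes, lead_time_minutes):
    """Fraction of the total lead time spent on the actual work."""
    return touch_minutes / lead_time_minutes


# Assumption: "four weeks" counted as calendar time.
FOUR_WEEKS_MINUTES = 4 * 7 * 24 * 60

efficiency = flow_efficiency(15, FOUR_WEEKS_MINUTES)
print(f"{efficiency:.4%} of the lead time is actual provisioning work")
```

Even if you count only working hours instead of calendar time, the ratio stays far below one percent, which is the point of the story: almost all of the delay is handoff and queueing, not work.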
So we said, "Never mind. We're going to host it all on Amazon. We're going to use engineering's processes." The engineering department, they have a lot of ISO certifications and stuff like that. You had to do design reviews, have design documents, had a whole new product introduction process. So we adapted that to Agile, and we used that.
We had to give up all of our old tools. We had invested a lot in tools, application performance management tooling, but it was all either network-based or its licensing wasn't cloud-friendly because, again, this is kind of back in the day where most people's licensing wasn't cloud-friendly.
We knew that security was going to be a huge barrier to adoption for these products. For example, one of the products we were coming up with is called the FPGA Compile Cloud, where people would upload their FPGA designs to be compiled. Well, that's super-sensitive intellectual property.
We felt that both security and a lot of the other operational attributes, we felt like those were important to be kind of product features, not just something we were doing as a tax to produce the product, but that if we leaned forward on security, that would help us actually gain adoption of the tools.
Then also, we decided we needed to automate from the beginning, and we actually ended up developing our own tool. We had a lot of requirements to do some Windows automation, stuff like that. This is back when Chef and Puppet and whatnot, they existed, but they didn't have much in the way of Windows support. But also, we had been doing this long enough. We said, "We want to be able to automate..." Whoops. Wrong button. "Automate everything."
So when you draw out one of your systems back on your whiteboards, doesn't it look like this? You've got your boxes, got your lines. Pretty straightforward.
All of these different tools: do they actually model this, or do they just provide little chunks that you still are having to conceptually align to model this?
Configuration management, like the cloud and cloud provisioning, gives you the yellow boxes. Those are the roles of systems. CM gives you those little white boxes, but who gives you the lines, to oversimplify? It was a service management problem.
So we decided to write our own tooling. We basically modeled the system in XML. This was a hard decision for us because we said, going in and writing our own CM and command dispatch and everything tool, that's a lot of work that we're biting off. But when we looked at the problem, we said, we think we can do it, and we also think that the benefits of it will far outweigh the time we're going to spend on it.
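To make the "boxes and lines" idea concrete, here is a minimal sketch of a system model in XML where the lines (dependencies) are explicit data. The element names, attributes, and service names are all hypothetical; the talk does not show the actual schema the team used.

```python
import xml.etree.ElementTree as ET

# Hypothetical model: services are the boxes, <depends> elements are the lines.
MODEL = """
<system name="example">
  <service name="web" role="appserver">
    <depends on="db"/>
    <depends on="cache"/>
  </service>
  <service name="cache" role="memcache"/>
  <service name="db" role="database"/>
</system>
"""


def load_model(xml_text):
    """Parse the model into a role map and a list of (service, dependency) edges."""
    root = ET.fromstring(xml_text)
    roles = {}
    edges = []
    for service in root.findall("service"):
        name = service.get("name")
        roles[name] = service.get("role")
        for dep in service.findall("depends"):
            edges.append((name, dep.get("on")))
    return roles, edges


def dependents_of(edges, target):
    """Return the services that depend directly on the target service."""
    return sorted(source for source, dest in edges if dest == target)


roles, edges = load_model(MODEL)
print(dependents_of(edges, "db"))
```

Once the lines are data rather than a whiteboard drawing, a tool can answer questions like "what has to be told when this database changes?" without a human tracing it by hand.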
So we modeled the system. We had a ZooKeeper-based service registry that kept everything hooked up. We had a system where you could replace a database server live, and all of the services that depended on that database knew it and could just reparse their own configs and restart just automatically.
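The behavior described here, dependents finding out when a database is replaced, can be sketched with a tiny in-memory registry. This is a stand-in for illustration only: it does not use ZooKeeper or any ZooKeeper client library, and the class names, endpoint strings, and the "restart" counter are all assumptions.

```python
class ServiceRegistry:
    """In-memory stand-in for a service registry (no network, no sessions)."""

    def __init__(self):
        self._endpoints = {}
        self._watchers = {}

    def watch(self, service, callback):
        """Ask to be called whenever the given service's endpoint changes."""
        self._watchers.setdefault(service, []).append(callback)

    def register(self, service, endpoint):
        """Record the endpoint and notify everyone watching that service."""
        self._endpoints[service] = endpoint
        for callback in self._watchers.get(service, []):
            callback(service, endpoint)

    def lookup(self, service):
        return self._endpoints[service]


class WebApp:
    """A dependent service that rewrites its config when its database moves."""

    def __init__(self, registry):
        self.db_endpoint = None
        self.restarts = 0
        registry.watch("db", self.on_dependency_change)

    def on_dependency_change(self, service, endpoint):
        # A real service would regenerate its config file and restart here.
        self.db_endpoint = endpoint
        self.restarts += 1


registry = ServiceRegistry()
app = WebApp(registry)
registry.register("db", "db-1.internal:5432")
registry.register("db", "db-2.internal:5432")  # live replacement of the database
print(app.db_endpoint, app.restarts)
```

The design choice worth noticing is that the dependent service subscribes to the dependency rather than an operator pushing new config to every consumer by hand.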
That was night and day of an experience of administering systems. You're starting to see this approach show up a little bit more now with, especially, a lot of the Docker and containerization orchestration tools. But treating this service-level management as a first-order concern is something that was not done at the time.
A lot of the other stuff we did: collaboration between the development and operations staff, it's hard to get that shared culture.
I had the application architect come to me, and he said, "Okay, we're having these design reviews. These operations people are asking me weird questions. I don't even know what they're talking about, let alone know the answers to their questions. Maybe we should have separate, we should have a dev design review and an ops design review, and we should do that."
And I said to him, "Well, so you've seen how that turns out, right? Because that's the environment we just came from. I think if we power through that, and once we do enough of those, you'll learn what they're talking about. I understand you don't know right now. It's not your fault. It's not their fault. It's a hard thing. But if we keep this up, you will know what they're talking about, and we'll be able to make a better overall product."
That's what we did. Sometimes you hit these bumps, and really the answer is just keep doing it until you can do it. It's like learning how to ride a bike.
The big result here was that we discovered we could release software-as-a-service applications in a third of the time that our organization could release software applications. The average time to market for an NI software product is three years. SaaS product, we did this for three years, and we delivered one a year, even with the hit upfront of developing our own whole provisioning system. So that was a huge validation of the approach.
Also, our uptime was 100%, whereas the system we had just worked on that we could still see there at the same company was not trending to 100%. Some of that, well, it's greenfield, and greenfield's always easier. But on the other hand, results are results. Is that a fair comparison? I don't know, but we've got 100% uptime, so to a degree, the fairness doesn't concern me all that much.
Meanwhile, the DevOps revolution had kicked off. We had gotten hooked into it, kind of the quickie timeline, but a lot of the key DevOps folks came to Austin and did a thing called OpsCamp Austin in 2010. Out of just pure luck, I happened into that. I thought it was some local event, and I showed up, and I'm like, "I recognize a lot of these guys from Velocity and stuff. What's going on?"
Out of pure luck, I kind of got tuned into it, and that was a big help because a lot of what we had done to this point, now you have the benefit of a billion presentations that say to do a lot of the things that I just said we did. We didn't have the benefit of that, so we weren't sure if we were going the right way or if we were going crazy rogue. Like develop our own service management system, nobody else is doing that, or are we just stupid? Is that our problem?
But this really helped us understand, and as a result, I really value that ecosystem. We started a DevOpsDays Austin immediately, and that's still going on. We're in our fifth year, so come on out to Austin this year for DevOpsDays.
My next job was at Bazaarvoice. Bazaarvoice is a software-as-a-service company that does ratings and reviews as a service. For example, if you go to walmart.com and look at all the product reviews, those are actually collected, curated, and hosted by a third-party service. Very large scale because we weren't doing that for one retailer. We were doing it for 30% of leading brands and a lot of the top retailers. Very large-scale systems.
My first job there was release manager. They had been making the transition to Agile and had hired up a bunch of engineers. It had started off as a more marketing-driven company that outsourced a lot of its engineering. But they decided to pivot into a more engineering-driven company. They brought a bunch of folks in, moved to Agile, and they tried to move from their big 10-week release cycle to a two-week release cycle as they started going into sprints, and it failed awfully.
They're like, "Great, release," and everything fell apart.
So they said, "Okay, well, let's bring in somebody that's going to help us make that work." And so that's where I came in. I didn't actually have a team. I was working with engineers on the various development teams.
This was your standard first-gen SaaS product. Still kind of monolithic in architecture. A lot of branching. Very low percentage of automated testing. Those two things create a vicious cycle.
The longer your cycle is, the more tempted people are to make late check-ins because they're like, "Well, we can't wait like 12 weeks for the next release for this one because Cabela's really wants it," or whatever. Those releases were always slipping. They always had issues. A lot of the time, devs would throw it to QA, and then QA would throw it to operations, and that was the process.
Also, a lot of the other teams that were stakeholders in this, they felt like they never knew what was going on. As a company that's producing a SaaS product, it's not one of your portfolio of products, it's your primary product. Everybody in support and implementation and sales, they wanted and needed to know what was going on with this system. But they hadn't moved past... I'm sure a lot of you have IT organizations where you shake your fists at them because they never tell you anything about anything that's going on. That was the pain that they were feeling.
This was different. My previous two kind of DevOps transformations didn't really have much to do with the release process. Even the SaaS stuff at National Instruments, we got to where we could release any time. Continuous deployment wasn't part of what we did because that wasn't the problem we had. Here, this is the problem we had.
When you do DevOps, you can't just take the formula and be like, well, continuous integration and delivery, that's going to solve your problem, if you don't know what the problems are. A lot of our problems had a lot more to do with systems management in those previous roles.
Here, product management supported us in letting the dev teams take off a couple sprints to essentially up their percentage of automated regression testing. We did some core tooling. We changed the branching strategy to trunk only. There's trunk, and then there's a single release branch that we cut right before each release. The only thing that can be checked into it are fixes for critical bugs that have been found in that release.
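The branching rule just described can be written down as a small policy check. This is a sketch under assumptions: the `release/<name>` branch naming, the `Change` fields, and the function name are hypothetical, and the talk does not say the rule was enforced in code rather than by convention.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Change:
    target_branch: str
    fixes_critical_bug: bool = False
    found_in_release: Optional[str] = None  # release where the bug was found


def may_check_in(change, current_release):
    """Trunk takes everything; the release branch takes only critical fixes
    for bugs found in that same release; no other branches exist."""
    if change.target_branch == "trunk":
        return True
    if change.target_branch == f"release/{current_release}":
        return change.fixes_critical_bug and change.found_in_release == current_release
    return False


print(may_check_in(Change("trunk"), "42"))
print(may_check_in(Change("release/42", fixes_critical_bug=True, found_in_release="42"), "42"))
print(may_check_in(Change("release/42"), "42"))
```

Keeping the rule this narrow is what removes the temptation to sneak late features into a release branch.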
We put responsibility on the original developer to shepherd that change all the way through to production. It's like, no, you don't reassign that Jira ticket or whatever. That is your change, and you need to make sure that it makes it through all those steps. If it fails in test, if it fails in deploy, it's your problem. It's your baby. You need to make sure it gets all the way through.
And we did a lot of communication around the releases. This is something, in my previous roles as well, I had a lot of luck with really just ramping up the level of communication, both to direct and indirect stakeholders.
I would have people come into my office where they kind of looked like they were on the verge of tears, and they would say how happy they were that somebody was finally telling them what was going on. That release got delayed a day or whatever, and they really needed that information.
Now, you'll have one in 20 people that will tell you you're overcommunicating. Those people are assholes, and you need to ignore them. Tell them that you've got a junk folder in your email for a reason.
Especially at Bazaarvoice, I'd say, "Well, if you think that you don't need to know what's going on with the core service that makes all of our money and all that, you can put that in your junk folder. But frankly, if I were you, I would think about my life a little bit before you did that."
The results of this. Our first release, the time came, and it was no-go. This is actually very positive because up till then, nobody wanted to be responsible for delaying a release. It was always finger-pointing. It's like, well, the release is going to slip, but it's because QA isn't getting done in time. Well, we're not done in time because the devs, whatever.
Here, we had set up a standard go/no-go meeting. All the dev managers came in, and they looked at it. They sweated for a while before they could say this, but finally they said, "No, we don't feel like it's up to snuff. It's no-go."
So that release got delayed, had some customer issues when we rolled it out. Second one, on time with one issue. Third one, on time, zero issues. As everybody got used to it, it got a lot better.
Developing that discipline is important. None of these changes happen immediately because people can't adopt change and have it become second nature immediately. You have to have revs on it. It's like exercise. It's like anything.
We tuned the process over time. When we'd have big blowups, we'd add more process. When things started going well, we would remove process. Those go/no-go meetings, those evaporated after a couple months because they weren't needed anymore.
Every piece of process you have is an impediment. So you always want to be curating that and saying, "Okay, is that still needed, or was it just needed at the lower level of sophistication we were at?"
Because of all those changes we made, a couple of months later, I said, "Well, smaller batch sizes are better. Let's go to weekly releases."
Going to biweekly releases had been this giant folderol, multi-month thing. I sent out an email. It's like, "Yeah, we're going to move to weekly releases." And people were like, "Oh, yeah, that sounds good." I'm like, "Okay, we're at weekly releases." And it was all fine, because we had set up all that discipline ahead of time, so we could tune it to however frequently we did want to release.
In this case, we still didn't necessarily want to do it continuously because of folks like the sales team and implementation team and all that. They needed to be able to intake those changes because they were building on top of them. So we didn't want to release more frequently than that, but we could.
That was the third one.
My fourth gig, I worked myself out of a job on that one. We turned the process into something that was automated and easy to do, and then we just started rotating it amongst the development team. It's like, "You're release manager this week." And so I was like, "Okay, my work here is done."
So they had me become an engineering manager of the teams supporting the core product. I mentioned before the monolithic architecture. Just like everybody else, we spawned a bunch of new teams that were going to rewrite the whole thing off to the side in microservices and all this. Then we were going to get that done, and we were just going to shift over.
The first thing I started with was the operations team. We kind of declared the death of the operations team and embedded them into the different product teams. A lot of them went into the legacy team, and then some of them went into the new teams.
We had a lot of offshore outsourcers. They were very important to our development historically. We had outsourcers that had many more years of experience on the code base than any of our incumbent engineers.
Problems here. Who's heard of the strangler pattern at the conference? That word hadn't really been invented yet. We had this assumption that we were just going to build up the whole existing stack, but brand-new in six months, and then just shift over, and it was going to be great.
Hint: please don't come up with that plan. That legacy stack is still in existence today.
Eventually, we pivoted into more of a strangler pattern, but we didn't think that's what it was going to take, so we made some decisions that kind of put that team behind the eight ball a little bit, especially because even though we ramped back new product development, we had huge data load, and that data load was doubling year over year.
People would say, "Well, you don't need to do new development. We're not adding big new features." I'm like, "So, being able to geometrically scale your data stack is a feature. I don't know if you know that, but it's a feature."
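To see why doubling year over year counts as a feature in its own right, here is a small projection sketch. The starting load, the capacity figure, and the function names are illustrative assumptions; the talk gives only the doubling rate, not absolute numbers.

```python
def projected_load(current, years, growth_factor=2.0):
    """Load after the given number of years of compounding growth."""
    return current * growth_factor ** years


def years_until_exceeded(current, capacity, growth_factor=2.0):
    """Whole years until the load first exceeds the given capacity."""
    if current <= 0 or growth_factor <= 1:
        raise ValueError("need a positive load and a growth factor above 1")
    years = 0
    load = current
    while load <= capacity:
        load *= growth_factor
        years += 1
    return years


# Illustrative units only: a stack at 1 unit of load with 10 units of headroom.
print(projected_load(1.0, 3))
print(years_until_exceeded(1.0, 10.0))
```

With doubling, a stack that has ten times its current load in headroom still runs out within a handful of years, so standing still on the data tier is not an option even with no new features.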
So we were growing. Black Friday was a huge deal for us because, of course, it's a big deal for every retailer pretty much worldwide, and we're hosting most of them. That's something where we'd serve out 2.6 billion review impressions during one of those periods.
We had kind of poor SLA on our support tickets that were escalated to engineering. And we had a lot of compliance needs. As a growing international company, besides your kind of SOX and ISO and normal stuff, TUV, AFNOR. Europeans are a lot more touchy about privacy and security and stuff like that, so they have some more prescriptive kind of standards.
What I did, I had actually moved the ops teams to Agile first. This is kind of interesting. I moved the ops teams to Scrum, and then when they moved and joined with the devs, they taught the devs Scrum. You don't usually see that, but that's how it just evolved.
That didn't work immediately. I had a 40-person team, most of which had not been trained on Agile. We tried to break them up into four teams and all that sort of thing, but it took a couple of tries for them to really get it.
The other big thing is balancing your backlog with your interrupt-driven work. This is something I could talk about for 15 minutes, but I'm running low on time. But that's something that operations teams, but also your development teams that are balancing new feature work with sustaining work, that's a key issue, and we tried a bunch of different ways to do it.
Again, developing custom software, sometimes you need to do that. This was a visualization we put together for the metrics for our system. Because we had monitoring systems. We had 20 linear feet worth of Zabbix metrics. But it was very hard for that to tell you at a glance how your system was going. So we built one, did some more stuff, blah, blah, blah.
This worked out very well. We got our customer support SLA up. We started measuring everybody's velocity once they went to Agile. We could see that it was growing continuously using that embedded model, and we successfully weathered those Black Friday spikes.
As I mentioned, we had to pivot to the strangler pattern for the new stack. But also all that automation, it made compliance easy. There was one time where we had some ISO auditors in, and they wanted to know about backups, and so I showed them the Confluence page and then the links into source control, where the automation just automatically did all the backups and had retention times. They were like, "Oh, man, this takes us weeks in a lot of places." Like, done. That DevOps approach, it helped us really fly through a lot of those things.
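As an illustration of why backup automation in source control is easy to audit, here is a minimal sketch of a retention rule expressed as code. The function name, dates, and the seven-day retention period are assumptions for the example; the talk does not describe how the actual backup automation was written.

```python
from datetime import date, timedelta


def expired_backups(backup_dates, today, retention_days):
    """Return the backup dates that are older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in backup_dates if d < cutoff)


# Illustrative data: three backups and a hypothetical 7-day retention policy.
backups = [date(2015, 1, 1), date(2015, 1, 24), date(2015, 1, 30)]
to_delete = expired_backups(backups, today=date(2015, 1, 31), retention_days=7)
print(to_delete)
```

When the retention period is a parameter in a reviewed, versioned script, the auditor's question "how long do you keep backups, and how do you know?" is answered by pointing at the code and its history.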
But one of the most important things, and I'll kind of close with this, is quality of life. We talk about culture a lot in DevOps. We don't necessarily put the fine point on it and say it's about quality of life.
That whole Black Friday period, it's huge for our company. It's huge for our customers. It's a big deal. We do months of prep for it. But we didn't have a big crisis room with everybody working at the office over Thanksgiving weekend. Everybody went home.
That's that same visualization y'all just saw, and it's running next to my family's turkey and cheesecake. Because that's what you want. A lot of the benefit of these things is that you're able to conduct business in a much less interrupt-driven, much less crisis-fraught way. That has a lot of benefits to your business, but also has a lot of benefits to your people.
If your engineers and all of your people are not dealing with conflict, and not dealing with painful interrupt work, and not dealing with repetitive work, that's when you get their initiative, that's when you get their innovation. That's where you go with that.
The end here is basically: help your people. Improve your latency and your bandwidth.
Everybody talks about the value chain, and you're trying to shorten the steps in the value chain. That's kind of improving your latency from the network point of view. You also need to try to find chunks of work that are going into that pipeline that shouldn't be, and automate them and get them out of the pipeline at all.
Don't be penny-wise, pound-foolish. We were never able to embed DevOps engineers into the teams in the National Instruments gig because we didn't have enough. I came up with that idea out of my own brain, and we tried it, but we were like, "Well, you're going to have to cover these three dev teams, and you're going to have to cover these three dev teams," and that doesn't work.
A little bit of hiring two more ops engineers or whatever, if it allows you to go from playing man to zone, or from playing zone to man, that gets you huge benefit.
And my other two points that I won't read to you. Any questions?
All right. Well, thank you. Feel free to reach out to me if you want to ask questions. I know each one of those was quick, but a lot of times here, we either get theory talks or one big transformation, and I wanted to show how you can kind of go through a series of them that have different problems and apply different techniques and see how that works out in practice.
So thank you very much. Thank you.