Top 10 Ways to Fail at DevOps
Scott Willson is Product Marketing Director, Release Automation at Automic Software. He has over 20 years of technology experience that spans software development, pre-sales, post-sales, and marketing. Scott is passionate about technology and helping businesses achieve value through technology, and he was leading DevOps at organizations before the term DevOps was coined.
Scott Willson
What we're going to be talking about is the top 10 ways to hopefully not fail, but to fail at DevOps at the enterprise scale. The thoughts and things I'll be sharing with you today actually come from large enterprises and just some of the gotchas that can get you as you try to implement DevOps at scale.
Hit the right one. So what we're going to start with is just going over the definition of DevOps, just to lay the foundation work, and then we'll get to the top 10 list.
So the first thing is: what is DevOps? Gartner defines it basically as a philosophy. And I think that's good to start with, because as I meet with lots of people, especially in a conference like this, and I ask people what DevOps is, I will get all kinds of different answers. Right?
And I think it's important to start with the foundational idea that it really is a philosophy. If you're doing microservices, for example, that doesn't mean you're doing DevOps. If you're in the cloud, that doesn't necessarily mean you're doing DevOps. You can, in my opinion, have Waterfall and still do DevOps if you're doing some old core back-end system like the mainframe and so forth. Right?
So with that said, let's consider how this philosophy perhaps shouldn't be applied at the enterprise scale.
We're going to start right away with tooling. We heard Steve Brodie yesterday talk about application release automation tools. Gartner and Forrester believe that ARA tools are key to getting DevOps to scale to the enterprise. All right? In fact, Gartner has said over the next few years, up to 2020, up to 50% of enterprises are actually going to have at least one ARA tool in their portfolio.
So there's expected to be a huge amount of growth. And the reason for that, I believe, isn't just the release management capabilities, and it's not just the automation capabilities of these ARA tools; it's the modeling capabilities of them. All right? The modeling capabilities of ARA tooling allow you to abstract a lot of those tiny little details into a more digestible and easier-to-use format. So you're not having to build a huge network or hierarchical nest full of workflows and automation bits and pieces. It makes it a lot easier to manage.
So number 10: not getting an ARA tool in your organization.
Number nine: failing to include management buy-in. This is a big deal. If you remember Heather Mickman's talk yesterday, she's been here every year at the DevOps Summit, right? And I've noticed how her path, her journey, has grown. And this year, what was remarkable to me is how it really skyrocketed once they got a new CIO. Right? And the CIO said, "We're doing DevOps."
A good story I have for this is: why is that the case? Well, first and foremost, everyone here should understand that the language of business is finance. That's it. Right?
I met with the CIO of a bank, and we were working out this DevOps plan for him. Right? We had all these metrics, and one of the great savings was going to be to save his people from HESS syndrome. Of course, he's like, "HESS syndrome? Man, is that contagious? What is that?"
I'm like, "Well, it kind of is contagious. It's holidays, evenings, Saturdays, Sundays. That's what HESS is." Right?
He's like, "Oh." He didn't laugh too much. He goes, "You know what? This is great, but I got to admit to you. Now, this is going to sound bad," he says, "and I love my employees, right? But the bottom line is, if you do something faster or you save my employees time, that does not move the needle at this organization. Their salaries are fixed. Right? If they work 30 hours, they work 80 hours, from the balance sheet, it does not matter. So you've got to help me understand what the real value of DevOps is, and it can't be in some of those classical metrics that people tend to think of."
The other thing to remember with management is not only speaking their language, but helping them understand why you have to change. Change for change's sake is sort of like what's going on right now with what I call, if you're a follower of the NFL, the Dallas Cowboys syndrome. Their starting quarterback, Tony Romo, got hurt in a preseason game. Right? They go into the first game of the season, and they've got a rookie at quarterback. And guess what's happened? They've only lost one game. They're winning.
And now their starting dreamboat quarterback is healthy again, and do you think the executives and the team want to bring him back in? The answer is no. Why? Because they're winning. Right? And business leaders have the same objective. If it's not broken, don't fix it. Right?
You have to help them understand the need for change. Change for the sake of change won't get your management team motivated to help you. And like I said, you've got to speak their language, which is finance.
How about this one? Becoming too reliant on open source software.
Now, you hear a lot about open source software, and what I'm saying is I'm a big proponent of it. I don't want anyone coming away thinking that, "Oh man, Scott Willson is really against open source." I'm all over it. It's really cool, and it's amazing how it's transformed the world.
What I will say, though, is it isn't necessarily the end-all, be-all. Right? Open source requires you to put in a lot of sweat equity to build out these solutions. Whereas if you buy something from a vendor that's packaged and purpose-built, in theory anyway, you can buy it and get up and running rather than having to build something out yourself.
Additionally, since it is in an open community, there are some risks with open source software. Right? I was having a conversation with a good friend of mine who's really heavily involved in the open source software industry and market, and I was bringing up some of my concerns. And he said, "But Scott, the thing is, the good thing about open source is it's peer-reviewed. Everybody can see the code. Everybody can make sure it's safe. Everybody can see what's going on."
And I said, "Right. That's true until it isn't."
And he said, "Well, what do you mean by that?"
I said, "Well, look at Heartbleed. Anyone here remember Heartbleed?" Right. The Heartbleed bug. A few hands. That was an open source problem. The bug was released in production out into the world for everyone to use, and it sat there for two years. Now, the NSA found that bug on day two and kept it under wraps for two years, secretly decrypting the internet at scale, right, at speed.
But the point is, you have to really stay on top of open source. There are always patches. There are always fixes, especially for a lot of your open source libraries. If you're not willing to put in the work to keep on top of these things and secure your environment, your infrastructure and all that, you could actually wind up introducing vulnerabilities into your environment.
And again, like I said, DevOps is a philosophy. Just bringing open source software onboard at your company doesn't mean you're now magically doing DevOps.
All right. How about number seven? Failing to consider IT history.
Well, IT history. I would like to say that IT kind of looked like this in the '80s. Yeah. You got a bunch of people out there, green screens, sitting around, pretty advanced. You move on into the '90s, we've got separation of systems, ERP solutions, and hey, we've got these cool things called PCs that can be networked together. Right? Things start growing.
And we move into the 2000s. We've got our first SaaS offerings and so forth. All the while, while this is going on, by the way, the mainframes are going to be exterminated. We're moving off the mainframe. Right?
Then we move on to where we're at now. We have even more things. The cloud's getting bigger and cloudier. And then the big thing is everything is interconnected, interdependent, and talking to each other.
One of the big lessons to learn for the enterprise is the enterprise rarely sunsets anything. They just keep collecting more things that you have to be accountable for and account for when you start trying to do your deployments.
So when you hear a lot of the success stories we're doing now, you have to concern yourself with: how is this going to grow? Right? What's the big impact? And what's the impact as far as the interdependencies upon a lot of the apps that we onboard to our continuous delivery practices and so forth? Failing to recognize this or consider this, I think, will lead you to trouble in just a matter of two years.
In fact, there was a great story I remember from Bill Gates years and years ago when he was starting Microsoft. And as you guys all know the story, right, he built his first DOS operating system, sold it to IBM, and the idea was for him to kind of get into the hardware thing. He didn't want to do that. He looked at the history of the mainframe and said, "No, no, no. People are going to build clones. We're going to stay software only." He had some people who thought he was crazy for doing this, but as you know, he was right.
Looking at history can provide you some clues for the future. The other thing then is to realize, if your enterprise IT portfolio is going to continue to grow over the next two to five years, that anything you do with DevOps has got to consider that growth. If you keep yourself focused on such a small, narrow slice of the enterprise portfolio, you're really setting yourselves up for failure or to have multiple silos of DevOps and tooling that fight against each other, which is the exact thing we don't want to have happen with DevOps. Right?
And that kind of leads me to my next point: boxing yourself in.
What I have found, company after company, as I have visited with organizations, is that they start their DevOps journey with a very simple app. It goes well. And then a short time later, they start realizing, "Yeah, well, we actually want to expand the use case. We need to do more."
For example, as I stated up here, what about the mainframe and legacy systems? Right? A lot of the apps that are out there that we interact with, online banking and so forth, actually have dependencies on these systems on the back end. Right? And what very quickly happens is that, as you start having these successes, I have found that IT management starts going, "Well, can we do this for this use case and this use case?"
Or it's just the simple thing of: what about your core back-end system, your commercial off-the-shelf software? Why automatically deploy only custom-written software? You should be thinking in terms of automatically deploying your commercial off-the-shelf software too. Failing to do this really boxes you in.
We've already heard a lot of sessions here: DevSecOps, including security in it from the beginning. And what about the DevOps toolchain? When you think about tooling, you really need to consider the entire portfolio.
Like I say, when I've met with IT leaders, I find that very quickly they think, "We're going to use some nifty tool because it's so easy to learn and it's so easy to onboard," and that works great on day one. And then on day two they realize, "Well, we want to integrate it with our ITSM stack." Then they're like, "Well, we can't do it. It doesn't work. This tool wasn't designed for this, or our practices that we set up weren't designed to handle that."
All right. What you really want is a great deal of flexibility, tech stack agnosticism, and you want to have sophisticated controls and calendaring. Calendaring is actually a big deal. All right?
I met with a very large retailer, and they were looking for, in their case, an ARA solution that had advanced calendaring. Why? As they told me, "We need help saving ourselves from ourselves." Because what would happen is, when they do their approval process for their various production environments, some distribution center manager would receive and download the update, because they're following agile practices, and he'd quickly look at the calendar. And as he looks at the calendar, he goes, "Okay, this is good. I approve it, and I will push it in on this date," what he thought was a clear date. Only that date was Christmas or Thanksgiving or some other holiday.
And so they wanted a solution where the calendaring could enforce blackout dates, so that even if somebody approved a release with the best intentions, without really thinking through what the date was on the calendar, it would prevent them from moving something in on that date and maybe roll it to the next available date.
In fact, that was the other thing they wanted. They wanted to be able to set what were static blackout dates so that anyone who does approvals wouldn't get themselves in trouble.
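A minimal sketch of the blackout-date behavior the retailer described, under stated assumptions: the function names and the two example dates are hypothetical, and a real ARA tool would manage this in its own calendaring feature rather than in a script. The idea is simply that an approved date is checked against a static list and rolled forward to the next available date.

```python
from datetime import date, timedelta

# Hypothetical static blackout dates; a real list would come from the business calendar.
BLACKOUT_DATES = {
    date(2016, 11, 24),  # US Thanksgiving, 2016
    date(2016, 12, 25),  # Christmas Day, 2016
}


def next_allowed_date(requested: date, blackouts: set[date] = BLACKOUT_DATES) -> date:
    """Return the requested date, or the first later date that is not blacked out."""
    candidate = requested
    while candidate in blackouts:
        candidate += timedelta(days=1)
    return candidate


def schedule_deployment(requested: date) -> tuple[date, bool]:
    """Return (scheduled date, True if the request had to be rolled forward)."""
    scheduled = next_allowed_date(requested)
    return scheduled, scheduled != requested
```

With this sketch, an approval for December 25 would be scheduled for December 26 and flagged as rolled forward, while an approval for an ordinary date passes through unchanged.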
Another company I met with also had a calendaring problem, but a different one. They're a very large enterprise, and they have a lot of changes that have to go out the door. Their window is very small because they cross so many time zones. And what they told me is, "If we do our deployments all in one batch across the globe, we've got, like, I don't know, a 10-minute window to make that happen. That's just not going to happen for us. So what we need to do is say our window's at 2:00 a.m. in any time zone and have rolling deployments go across those time zones."
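The rolling window that company asked for can be sketched roughly as follows. This is an illustration only: the region names and fixed UTC offsets are assumptions made up for the example, and fixed offsets ignore daylight saving time, which a real scheduler would have to handle. The sketch converts "2:00 a.m. local" in each region into a UTC start time and orders the regions by when their window opens.

```python
from datetime import date, datetime, time, timedelta, timezone

# Hypothetical regions with fixed UTC offsets in hours (daylight saving is ignored).
REGION_OFFSETS = {
    "tokyo": 9,
    "london": 0,
    "new_york": -5,
}


def rolling_schedule(day: date, local_hour: int = 2) -> list[tuple[str, datetime]]:
    """Return (region, UTC start time) pairs for a deployment at local_hour on `day`
    in each region, ordered by when each window opens in UTC."""
    schedule = []
    for region, offset in REGION_OFFSETS.items():
        tz = timezone(timedelta(hours=offset))
        local_start = datetime.combine(day, time(local_hour), tzinfo=tz)
        schedule.append((region, local_start.astimezone(timezone.utc)))
    return sorted(schedule, key=lambda item: item[1])
```

Each region gets its own 2:00 a.m. window, so the deployment rolls across the globe instead of having to fit into one short shared window.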
The point is with this, if you don't think about that when you onboard any technology or process, if you're not thinking about expanding the use case for continuous delivery, for DevOps, you very quickly box yourself in. And then you find yourself in trouble. And of course, going back to the management use case, you could find yourself trying to justify your failed attempt at DevOps or continuous delivery.
Okay. This is a big one. It's one that I find most people do not think about too often. But I worked for a company, a DevOps company, and they were very excited about what they could offer and how easy it was to extend the product. All right?
And when I met with these folks, I was asking, "Well, how do you extend the product, actually? If I want to add, I don't know, a connector to Azure or, I don't know, an ITSM stack or what have you, whatever. Another touchpoint. How do I do this?"
"It's easy, Scott. All you got to do is write a plug-in in JRuby and Perl."
I was like, "Huh. JRuby and Perl." Do you notice those two languages are not on this list? This list is compiled by ZDNet. What they did is they looked at lots of different language indexes. They brought them together and normalized them to come up with what the true top languages are.
Now, why do I tell this story? Why is it important? Because I met with an executive who said that they bought some type of DevOps tool. I forget the language that you had to write stuff in. But five years after that acquisition, they had to go buy another tool. Nobody they could find in the marketplace knew the language that the tool used to extend itself.
So this is a big deal. If you jump on something and there's a new trendy language going on right now, you have to think in terms of, well, what's this going to be like for the organization two years from now? Will I be able to hire people to support it? Is the language on its way out or on its way in?
Speaking of going out, Google's Go, as you see, is not even on this list yet. Not saying that that's a bad thing, not that Google is bad. The Go language, I actually like it and have written several things with it. But the point is you need to be careful.
Now, one of the things I have found that's time-proven, that transcends all these languages, is the very untrendy, unsexy thing called command-line interfaces. Anyone heard of those? I get a few fist pumps. All right. They're not very cool, and I come from a development background. Worked for a broker-dealer, worked for some Silicon Valley-type startups, all that stuff. And it doesn't sound cool.
But the thing is, command-line interfaces require virtually no overhead to learn. They're virtually self-documenting. I don't have to learn variable constructs and memory allocations. I don't have to worry about any of those things. Literally, you pass a command that tells it what you want to do and the parameters for doing that thing.
And for all of our newfangled interfaces, RESTful APIs, Amazon, Azure, they all provide a command-line interface, or a CLI, to access their things. And the cool thing is, that means you transcend the language trendiness.
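To make the "command plus parameters" idea concrete, here is a small sketch of a command-line interface built with Python's standard `argparse` module. The tool name `deployctl`, its `deploy` command, and its flags are all hypothetical, invented for this illustration; they do not correspond to any real product.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical 'deployctl' tool with one 'deploy' command."""
    parser = argparse.ArgumentParser(
        prog="deployctl", description="Hypothetical deployment CLI (illustration only)."
    )
    commands = parser.add_subparsers(dest="command", required=True)

    deploy = commands.add_parser("deploy", help="Deploy a package version to an environment.")
    deploy.add_argument("--package", required=True, help="Name of the package to deploy.")
    deploy.add_argument("--version", required=True, help="Version of the package.")
    deploy.add_argument("--env", required=True, choices=["dev", "qa", "prod"],
                        help="Target environment.")
    return parser


def describe(argv: list[str]) -> str:
    """Parse the arguments and return a one-line summary of the requested action."""
    args = build_parser().parse_args(argv)
    return f"{args.command} {args.package} {args.version} -> {args.env}"
```

A caller in any language, or a person at a shell, would invoke something like `deployctl deploy --package billing --version 1.4.2 --env qa`, and `--help` documents the command without anyone needing to know what the tool itself is written in.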
I met with an organization a few weeks ago, and I was talking about this, and they said, "Man, I wish you were here a few years ago because the mantra at our company was that we don't script anything. We don't allow it."
Okay. I saw someone question, what? That doesn't make any sense. Well, here was their thinking. "We don't script anything because scripting makes you dumb, and we don't want dumb employees. We want everyone to really know precisely what they're doing. We don't want somebody to just kick off some script and have no idea of what it does."
That kind of led the management decision at the time. And what's happened over the last two years at this company is that these same geniuses have all retired. All right? And as they've retired, guess what? There's no real good documentation trail of what exactly they did. They were all these black boxes named Bob and Dan and Sarah. Now that they've left, the company's like, "I don't know what they did."
Providing scripting or even an automation platform allows you to actually see a best practice, to see, well, this person's really smart. Let me see what they did.
I remember when I first got... So I started writing code when I was young, 15. And I went into the professional world thinking I was pretty much all that. I learned real quick, man, there are some bright people. I don't know if you guys have had that experience. Just when you think you're pretty smart, you come across people who are really brilliant. And I realized it was great to learn from them.
So it's sad that with this company, that knowledge was not documented in the form of scripts and so forth, which meant new people coming onboard couldn't learn from what the others had already figured out.
It's almost like sometimes in technology, especially with developers, the impulse is just to keep reinventing the wheel. You actually don't want to keep reinventing the wheel. If you've figured out a best practice, then set it up so you can repeat it over and over.
Secondly, along with these languages and the concept of a command-line interface, if you have an automation platform where you can automate that best practice, think about it this way. You're ensuring the business can transcend turnover, disaster recovery, whatever it is. If you lose a key employee, you can survive a few weeks till you can hire and onboard another employee.
Now, I met with a company, I won't name who they are, and we at Automic were talking with them. We have ourselves an ARA solution, and they told us, "No, we're good. Everything is automated. We don't need to talk to you guys. Everything's good. Our continuous delivery, fully automated. No worries."
Four weeks later, we get a call back from these guys. And this guy was the senior VP of IT, and so I ask him, "Well, what changed? Four weeks ago in our discussions, kind of a discovery discussion we had with you, you guys were pretty confident. In fact, you guys had me believing you. This is great. I ought to work for you guys. You've really got your stuff together."
And he kind of embarrassingly was like, "Well, problem is Automated quit."
I was like, "Automated quit?"
"Yeah, I didn't know it. We had a guy named Automated."
So they're telling me this is automated. Deployments are automated. It was great until he quit. And what literally happened, he tells us the story, is that two weeks after this guy leaves, they go do another really big release, and they couldn't do it because everyone had always gone to this one guy, their Brent from the book The Phoenix Project. Brent had left the building, and they had no idea what this guy did. Nobody did.
From a high level, they thought everything was automated, but it really wasn't. It was automated by one guy named Automated, as I jokingly refer to him. So their business continuity just didn't exist. They were in crisis mode that weekend.
The cool thing about putting some of these scripts and things in an automation platform is that, in these guys' case, it would've bought them time. You could hire a new person. You could let them inspect the automation mechanics, let them learn from the way things are done.
And this is important because I find in our development community, there's kind of this propensity for us to try to pad our own resumes and not think about what's right for the business. And business continuity is a big deal. The average tenure of anyone at any company is 4.5 years, is what the US Bureau of Labor Statistics employment statistics tell us. So you've got to think about that. If you don't have things that'll transcend your leaving, then businesses are constantly in a crisis mode, and it's costly to be in this crisis mode.
For this company, it was very costly, and they wound... Well, the good news for Mr. Automated is they hired him back as a consultant for a couple of weeks until they could get something onboard to help automate things, and they could train and onboard new employees.
All right. What about number four? Centralizing DevOps. This is a great way to fail at DevOps at the enterprise.
Now, you're probably thinking, what are you talking about? Isn't that what we're all here for? Isn't this what we do? We want to centralize DevOps. Well, and I'll get to this in a moment, I really like Topo's talk about organizing people by product lines.
Because the US military made this mistake. I had the greatest success in my career doing DevOps before it was called DevOps. But I always called our DevOps teams Navy SEAL teams.
The US military made a mistake with the Navy SEALs in the '80s. I'm not going to go into the details, and I haven't fully researched them anyway, but there was some big crisis in Panama, let's say, in the '80s. And the US needed to go send some troops down to go fix whatever had gone wrong.
Well, somebody in the military, in the Pentagon, had this great idea. And I could only imagine how the conversation might have gone. Maybe they're talking over beers about how great the Navy SEALs are, and you could imagine the conversation: "Could you imagine? Could you just imagine a whole army full of Navy SEALs? Wow." Sort of like if you could imagine a whole baseball team with all Hall of Famers. Wouldn't that just be amazing?
Well, that's what they did to solve the problem in Panama. They grabbed all the Navy SEALs, put them into a big Army unit, and shipped them down to Panama, and it was a disaster. What they learned very quickly is that Navy SEALs work well in small teams. They don't work well as big conglomerates.
So you may be asking, well, if that's true, then how do we get DevOps at scale? Well, the first thing to think about is organizing your DevOps teams by skill set and by product line or by pipeline. It's good to have perhaps a centralized DevOps tooling group that can arm your DevOps teams, but you really need to think in terms of doing things by product or by pipeline.
Why is this important? Well, there's a great book I highly recommend to everyone called The Wisdom of Crowds. And one of the things The Wisdom of Crowds thesis postulates is that any group of experts is dumber than a group with experts and one ignorant person.
Now, it sounds kind of funny, and it's actually counterintuitive. But the reason is, they said this quote, unquote ignorant person is actually in possession of information the experts don't have, and as a group it makes them smarter.
So I like to liken it, building on what Topo said this morning about having his DevOps teams by product. That means it's kind of everybody onboard. So you have your DBA on your DevOps team, and he or she is involved in knowing that a release is going forward, even if there are no database changes going on. The reason being, this guy or gal may actually have some information should something go wrong or some alert come up that you may not have planned for or may not see, and he or she will have insight into that to help you more quickly solve that problem.
And one of the things that this book, Wisdom of Crowds, talks about is that as long as there's diversity of thought and autonomy with the individual opinions, the crowds work far more efficiently and much more smartly than they otherwise would.
All right, number three. I don't know if anyone's ever done this or had a child who's done this. But ouch. Right? I always love pictures like this because I always know what the impetus for this was. It always began with the great sentence, "Watch this."
But one of the things I hear a lot with DevOps messaging is that it's good to fail. We learn from our mistakes. We should be able to fail. And that's true. We do learn from our mistakes, but learning from your mistakes in production is actually not good. In fact, it could cause great harm to people.
I remember talking to... I was in a DevOps forum, and there was a lot of disagreement about this. They thought, "No, but we're just tired of operations telling us, 'You cannot fail,' when we need to fail. We need to fail more quickly, in fact."
And one of the things that I was reminded of is a great story. I don't know if anyone here has heard of Knight Capital. I find nowadays, only, what has it been, three years removed from this incident, most people have not heard of Knight Capital.
And I'll tell this story in kind of a narrative, because I was struck by how quickly life can change. As I remarked with my wife, I can only imagine being the CIO of Knight Capital. They owned 17% of both the NASDAQ and the New York Stock Exchange trading volume.
And if you're the CIO, man, your parents are proud, your family's proud, you've moved up. You've got a job that's probably ensuring great financial security, and wow, you have arrived. This is a great job to have. You're a mover and a shaker, and your financial security is, in theory, set.
You're driving to work on a Monday knowing that that weekend, a release had just been pushed into production. You get in there, you double-check. You'd already approved things over the weekend, and you make sure with your team, what few are there, that everything is great. And there's only a few there because many of them were up till 3:00 a.m. the night before finishing the release.
And then 9:30 happens. Wall Street opens up, and within 45 minutes, Knight Capital accumulated a nine billion, with a B, dollar position in the market. A nine-billion-dollar false position.
How did this happen? Well, there was a program called the Retail Liquidity Program, and it had a flag in it. And depending on whether this flag was set or not set, it triggered another adjacent, interdependent program called the Power Peg. What Power Peg was designed to do was throw up all kinds of volumes of trades. Crazy things. Make bids and asks really high and really low, kind of insert chaos so that they could test the robustness of their very complex trading algorithms.
So this is great in QA. The problem is, of the eight servers they had that routed to Wall Street, one of them did not get the updated RLP code. So what that meant was they had an intermittent trading problem. Every time trade requests were routed through that eighth server, it ran their test volume, and it was an absolute disaster.
In fact, it was so bad, Wall Street cut them off the line. The conversations began real quick. "Hey, we think you guys got a problem. Are these trades legit? What's happening? What's this? What's going on?" They couldn't know.
I talked to someone who was actually tied to this, who said they were very quickly trying to track down the developer who helped move this stuff over, and he was in bed. He'd gone to bed really late the night before.
So the summation of the story, as I always remark, is that this guy goes to work one day, it's like any other day, and comes home unemployed. Company's going out of business, which it did. Knight Capital did. They negotiated down to a $440 million position and ended up selling out and cashing out pennies on the dollar. Not to mention the great resume builder that would be on those people.
But great harm was caused to people. You guys are probably aware of Blizzard, right? There's a company called Blizzard. Just an example of the damage done that morning in that 45-minute span: Blizzard saw their stock go from where it was happily sitting at around $3.50 up to $14.76 in that span. So it's bad because now you're having to talk to people who bought and sold this stock based on these trends. It was a mess.
And my point is, great harm can come to people if you're not careful about this.
I worked for a broker-dealer and reported directly to the CIO, and this story has always reminded me of my conversation with him. He brought me down. I was going to be managing the apps that we were going to be doing, and also he wanted me to oversee our operations team to make sure bad things don't happen.
And he told me, he goes, "Scott, there's two things that I have accepted as the CIO of this company. First of all, I'm an officer of the company. This means I'm registered with the SEC and the NASD, now called FINRA. And what this means is, if anything kind of hanky-panky goes on in the production environment, something's kind of off or weird, I could go to jail. I'm not going to jail for you. So you make sure I don't go to jail."
I'm sitting there like, "Okay."
He goes, "Secondly, if something goes wrong, if we don't meet our compliance or regulatory reports," he goes, "which falls on you, by the way, I personally will get fined."
He brought up a website showing all the executives who've been fined by FINRA and the SEC, and I was seeing fines of $50,000, $70,000, $220,000. And he told me, he looked and he says, "I cannot afford a $70,000 fine." He says, "I need you to make sure I'm safe and my family's protected, because I'm not going to jail for you. I certainly don't want to have to pay some crazy fine for you either."
It really changed my perspective on what goes on in production and made me far more conservative and careful.
Now we're going to get to number two, which is related to this. How do you make sure that production is safe? And it's a very overlooked tenet of continuous delivery. All right? Which is to begin with the end in mind.
Production is the end in mind. Right? For all the talk of saying, "We just want to move fast. We want to move code. We want to do whatever," the whole point isn't just to deploy stuff. It isn't just to move fast. In the end, it is in fact to bring value to a production environment where it touches our customers, right, or other businesses who are our customers.
If that's the end, then that's the end in mind we must always be aware of.
In Jez Humble's book, you see the footnote at the bottom, the book actually called Continuous Delivery, page 115, chapter five, he states this: that it is important, in fact critical, foundationally, that you use the same automation mechanics in every environment. Okay?
The reason you do this is because in dev, you may run those mechanics dozens and dozens of times. Right? And then in QA, you're going to run them less frequently. And then as you move on up the promotion path, less and less frequently. The point is, by the time you go to production, you've actually vetted out your automation mechanics.
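One way to picture "the same automation mechanics in every environment" is a single deployment routine where only the environment data changes. The sketch below is a hedged illustration, not anything taken from the book or from a specific tool: the step names, the `Environment` type, and the server names are all hypothetical, and the steps are only logged rather than executed.

```python
from dataclasses import dataclass

# Hypothetical, fixed sequence of steps shared by every environment.
DEPLOY_STEPS = ("stop_service", "install", "start_service", "smoke_test")


@dataclass(frozen=True)
class Environment:
    name: str
    servers: tuple[str, ...]


def deploy(env: Environment, version: str) -> list[str]:
    """Apply the identical ordered steps to every server; return a log of what ran."""
    log = []
    for server in env.servers:
        for step in DEPLOY_STEPS:
            # A real implementation would execute the step here; this sketch only records it.
            log.append(f"{env.name}:{server}:{step}:{version}")
    return log


DEV = Environment("dev", ("dev-1",))
PROD = Environment("prod", ("prod-1", "prod-2"))
```

Because `deploy` is the only code path, every run in dev and QA also exercises the exact steps that will later run in production, and the returned log records what was actually done on each server.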
The goal being also, as he says around this page, is that you don't want human beings in production. You don't want them touching production. Right? You want the automation mechanics to do it.
The great thing for this, one of the reasons you want to do this, is because, as I always say, computers are fantastic. They do a great job at doing what you say, not what you mean. All right?
I had my own experience, sort of like Knight Capital, but not nearly as bad. All right? We had done a production release, and we had an intermittent problem. And the good news is we were a broker-dealer, and we also had an insurance side to the business. So this is affecting the insurance side, not the regulatory side.
And what this meant, though, is that I got to know the CEO on a personal basis, first-name basis. Every day we'd meet. I'd have to go meet him 8:00 a.m. in his office, and we'd discuss the progress on solving this bug, this problem. It cost us two weeks.
Okay. And what we found after our two weeks is that one of my system administrators, in patching servers in preparation for our deployment, had skipped one of the 50 servers that we had. Similar to Knight Capital. Right? The guy made a mistake on one of the eight. So this gentleman had made a mistake.
Now, the funny thing is, as we audited it and tried to figure out what could have gone wrong, our spreadsheets, our plans, all said that he followed the plan. When we asked the guy, "Well, what do you think you could have done?" he did the same thing on every server. Right? Because obviously, why would I do something different, Scott, on this server than I would have on another server?
Human beings, we're great at misremembering what we think we did. Computers, they do what you say, not what you mean. Right? So one of the benefits is, when something goes wrong, you can very quickly go in and figure out what did and did not, in fact, happen.
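To make the audit-trail point concrete, here is a small Python sketch, assuming a hypothetical `apply_patch` callable and made-up host names. It records one entry per host as the patch runs, so that afterward you can ask the record, rather than a person's memory, which servers were actually patched.

```python
from datetime import datetime, timezone


def patch_fleet(hosts, apply_patch):
    """Apply a patch to each host and record one audit entry per attempt."""
    audit = []
    for host in hosts:
        try:
            apply_patch(host)  # hypothetical callable supplied by the caller
            outcome = "ok"
        except Exception as exc:
            # Record the failure instead of hiding it.
            outcome = f"failed: {exc}"
        audit.append({
            "host": host,
            "outcome": outcome,
            "at": datetime.now(timezone.utc).isoformat(),
        })
    return audit


def unpatched(hosts, audit):
    """Return the hosts that have no successful entry in the audit trail."""
    done = {entry["host"] for entry in audit if entry["outcome"] == "ok"}
    return [host for host in hosts if host not in done]
```

With a fleet of 50 hosts, calling `unpatched(fleet, audit)` after a run gives the list of servers that were missed or failed, which is the question that took two weeks to answer by hand in the story above.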
In the case of Knight Capital, it's my belief, and there are a lot of things to learn from the Knight Capital story, but I think had they used the same automation mechanics everywhere else, they would have quickly learned what had happened. There was one flag in the RLP code that was wrong, and that movement did not happen.
Since a human being manually moved that code over, there was no audit trail, no way to actually figure that out, to see exactly what had happened. So they were left trying to figure that out and troubleshoot that on their own to, of course, what we know is a very ignominious end.
So that's one of the things you guys got to think about. Right? Automation mechanics, they've got to be used everywhere, including production. And the last thing I would say on this is, if that's true, then whatever automation mechanics you choose, you need to make sure that it's production-proven.
My thought is, my experience has been, it is way easier to start from the top of the mountain, being production. Right? The scale, scope, and standards of production. Worrying about development, I mean, that's easy. In development, a lot of things all fit in a single box. You don't have the scale, scope, and standards.
So if you start with production with those automation mechanics, it's a lot easier to scale it down the mountain than it is to go work your way back up.
All right. We ready for number one? Drum roll.
A lot of times I hear people think DevOps means no ops. In fact, I would say that it tends to be a resonating thing. I hear a lot of developers talk about a way to get rid of operations, to circumvent operations. I would even make the argument that DevOps kind of started with that impetus. That, yes, there's collaboration, but it was really about trying to get around operations.
So I would like to postulate the case for Agile ops. And I say that because when you think development, when you think DevOps, let's just be honest. What's the truth of the matter? Development practices are already agile, are they not? We already do agile development practices. And what happens is you push them by project, by time, by change, by scrum, what have you, into the operation queue. And it's in a queue because operations are interrupt-driven. They don't follow agile practices. Right? They have a low amount of continuous automation going on on their side of the fence.
I've often asked operations people, "If you're in operations, you come in on any given day, who's able to keep to your schedule?" I have found no one does. Right? See all the heads shaking right now. It doesn't happen. You're like, "Ha, that's a good one." Right?
In fact, the book, Gene Kim's book, right, The Phoenix Project, talks about that, work in progress versus the disruption. This is the key inhibitor to achieving DevOps. That's why I put it number one. We all think about engineering and all these metrics and all these things. If you're not handling what has to happen on the operations side, then you're really missing the boat.
And you're not just missing the boat because of all the standards operations has. But remember, going back to my earlier point about the skill set in operations, it is valuable. According to the book Wisdom of Crowds, it's very valuable information for them to have on the development side. Right?
Why are they concerned? You meet with a security officer. Well, why is he concerned about X, Y, and Z? They're not just making up these standards a lot of times just to annoy you or get in your way. Right? A lot of them have learned painfully from experience. And as long as you know what those experiences and that information is early on in the process, you can bake in security, you can bake in compliance, audit, all those things, way back in on the dev side of things.
And then, as I said, from the Agile ops, if you actually automate operations, then you can make them streamlined as well.
How do you make operations agile? Well, there's a lot of things here. What about having self-help portals? So if somebody needs their password reset, why are they calling the help desk for that? If development needs a new environment, why are they putting in a trouble ticket for this? They shouldn't. That just disrupts operations.
Operations should also themselves have the automation mechanics in a self-help packaged format to allow people to help themselves.
Now, I've met with many people who run operations. They get a little scared about this, but I'll say to you what I've said to them: yeah, but if you think about it, if you've actually scripted things out in some type of automated format, is that not now your locked-in, pre-approved change process? So what does it matter whether someone in operations kicks it off or a business user or a developer? Right? Computers are going to do exactly what you told them to do, so there's safety in this. And you actually offload your workload so you can be more responsive to the business, more responsive to developers.
Now, for development, what does it mean, self-help? If I want to provision an environment to deploy my new version of the app to, you're not having to go to a menu. It could just be the last little command in your Jenkins run. Right? Developers may not even know the operational mechanics that run and make things move. Right? But they get the benefit of getting what they need on demand when they need it.
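One way to picture self-help as a pre-approved change process is a catalog of scripted actions that anyone can trigger. The Python sketch below is an assumption-laden illustration: `CATALOG`, `pre_approved`, `request`, and the `provision-environment` action are hypothetical names, not part of any real tool. The same scripted action runs whether operations, a developer, or the last step of a pipeline requests it.

```python
from typing import Callable, Dict

# Hypothetical registry of pre-approved, scripted operations.
CATALOG: Dict[str, Callable[..., str]] = {}


def pre_approved(name: str):
    """Register a function as a self-service action under the given name."""
    def register(func):
        CATALOG[name] = func
        return func
    return register


@pre_approved("provision-environment")
def provision_environment(app: str, version: str) -> str:
    # A real action would call the provisioning tooling; this one returns an ID.
    return f"{app}-{version}-env"


def request(action: str, requested_by: str, **params) -> dict:
    """Run a catalog action for any requester and return a record of the run."""
    if action not in CATALOG:
        # Anything that was never scripted and approved cannot be requested.
        raise KeyError(f"{action!r} is not a pre-approved action")
    result = CATALOG[action](**params)
    return {"action": action, "requested_by": requested_by, "result": result}
```

A pipeline could end with something like `request("provision-environment", "pipeline", app="shop", version="2.1")`, and the developer never needs to know what the action does underneath.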
By the way, one of the benefits of a simple thing like making sure provisioning is onboarded into an automated best practice is that operations teams, and actually the business, get the benefit of being able to automatically deprovision those same environments.
Has anyone here ever had the problem where you want to clean up your VMs because you have a ton of them, and you start deleting them and accidentally deleted the wrong one? Anyone have that happen?
I was at a very large financial company in South Georgia, and they were talking about scaling out their VM, their cloud-based thing. And as we were talking to them, doing some analysis, the question was, "Well, great, but couldn't you just remove the VMs?" And boy, there was this hush in the meeting room.
"Oh, we don't do that anymore. We did that once, and we accidentally deleted the one VM that helps run the business." It was undocumented, shadow IT. Right? Well, that's the problem. If you're not making operations agile, then shadow IT happens. And then you have all these undocumented pillars in the business that are there, and you wind up nixing one of them.
Well, if IT or operations can be on demand, if your provisioning can be on demand, and developers are following a best practice through automation mechanics, then this doesn't really happen. Right? Development gets what they need when they need it. QA gets what they need when they need it. And when they don't need it, it's removed.
I remember talking to a guy, actually he's one of our customers, and they had this epiphany moment. What they do is run the whole process to provision and deploy any of their products on one big automated workflow. Okay? And it's dynamic. It can change. It's got some dynamic flexibility to it.
But one of the things they do in this process is that when they provision an environment, they track everything they provision, and they track the usage of it so that when it's no longer in use, they deprovision it.
So QA had used the self-help portal. They had their test environment provisioned with the specific version of the app, and three weeks later they go to test on it, and it was no longer there.
So the QA manager comes walking into the IT manager's room. "You guys blew away our test environment. Now we're going to be late. We can't meet our deadline." He goes off. And the IT manager just sat there quietly, waited for him to finish his rant.
And he goes, "Well, that's not a problem."
Of course, the QA manager, "What do you mean that's not a problem? You're putting everything late. We're going to be behind. We're going to be..."
He goes, "Just go back to the portal. Click it. Fifteen minutes, you will have your whole use case provisioned back up. The reason we killed it is because you provisioned it and weren't using it, and that's costing us money on the Amazon cloud. So we killed it."
And the guy was like, "Whoa, wait, 15 minutes?"
He goes, "Yeah, that's all it's going to take. A few clicks, you'll have your whole test environment. By the way, it can automatically be tested. So by the time you get it and your people look at it, a lot of the automated testing, Selenium scripts, and so forth, have actually already run, and you can actually deal with the result of that."
That fundamentally changed this organization to realize, wow, you mean operations, IT can actually be on demand. And so you're always following this best practice. You're not creating inadvertently shadow IT or technical debt, inadvertent technical debt.
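The provision, track, and deprovision loop in that story can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `EnvironmentTracker` class, its method names, and the idle limit are hypothetical, and the talk does not say how the customer actually measured usage.

```python
from datetime import datetime, timedelta


class EnvironmentTracker:
    """Track provisioned environments and remove the ones that sit idle."""

    def __init__(self, idle_limit: timedelta):
        self.idle_limit = idle_limit  # assumed policy, not stated in the talk
        self.last_used = {}  # environment name -> time of last recorded use

    def provision(self, name: str, now: datetime) -> None:
        """Record a newly provisioned environment; provisioning counts as use."""
        self.last_used[name] = now

    def touch(self, name: str, now: datetime) -> None:
        """Record that an existing environment was used."""
        if name not in self.last_used:
            raise KeyError(f"{name!r} is not provisioned")
        self.last_used[name] = now

    def reap(self, now: datetime) -> list:
        """Deprovision every environment idle longer than the limit."""
        idle = [n for n, t in self.last_used.items() if now - t > self.idle_limit]
        for name in idle:
            # A real system would tear down the machines here.
            del self.last_used[name]
        return idle
```

Because everything that gets provisioned goes through the tracker, nothing undocumented is left behind, and an environment that was reaped can simply be provisioned again on demand.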
And with that, that's basically it. This is my contact information. Like I said, two Ls in Willson. Here's my email address, Twitter, LinkedIn, the whole nine yards. And yeah, if you have any questions, you can come see me after this. And otherwise, enjoy your day and enjoy the conference, and safe travels back home. Thank you.