Solving Day 2 DevOps Challenges

London 2019

Brian Muskoff · Director, HCL UrbanCode Product Development · HCL

Chris Nowak · DevOps Transformation Strategist · UrbanCode

As DevOps enters its second decade, a "day 2" phase, high-performing industry leaders are seeking new ways to deliver software faster and more responsibly. Agile practices and pipeline automation are fundamental enablers of continuous delivery, but they are only the start.

In this new era, many new opportunities are emerging, including leveraging DevOps to advance culture, gaining visibility to business value across many tools, and scaling best practices across the enterprise to realize the ROI of DevOps.

Brian Muskoff is the Director for UrbanCode, the portfolio of DevOps software products that enable continuous application delivery. As the engineering head for UrbanCode, he leads product teams to create tools that streamline the flow of software change through the delivery pipeline. Brian joined HCL in February 2018 as part of the IP Partnership deal with IBM. He joined IBM in April 2013 with the UrbanCode acquisition where he was Director of Operations responsible for Professional Services, Technical Product Support and back office functions.

Brian's product development leadership experience includes roles at Insurance.com, FedEx, and CSC, and as COO at Moss Corporation, a high-growth online marketing startup.

Twitter: https://twitter.com/BrianMuskoff

LinkedIn: https://www.linkedin.com/in/bmuskoff/

Chris Nowak is the Client Transformation Strategist at UrbanCode. Chris has 23 years of experience in Change Management, SDLC & DevOps. He was Head of DevOps Services at 2 of the top 4 US banks. He designed, led and scaled services from the ground up, including service strategy, process optimization, organizational change, large-scale engagement / on-boarding, and operations.

At Wells Fargo, his teams automated 350 applications for the Trading, Securities and Investment Banking divisions. At Bank of America he led a combined organization of Automation Engineering, Systems, Engagement, and Deploy Operations. His teams automated and supported 800 applications from source to prod. Chris left BAC in 2016 to focus on Transformation and DevOps Strategy Consulting. In 2014, he was awarded a US Patent in DevOps Methodology (USP #8875091).

Full transcript

The complete talk, organized by section.

Brian Muskoff

Our talk is about solving day two DevOps challenges. And for those of you that have been on the expo floor, you may have seen these dots, but for those of you that haven't, we're going to get to it: what are these dots you're seeing? We are quite excited to be here today. This is one of the great events in our space, and it's really quite humbling to be in this crowd. There's a lot of experience, a lot of intelligence. I've already learned a ton in the first day and a half, and I hope you get a lot out of this presentation.

By way of introduction, I'm Brian Muskoff. I head up the product development organization for UrbanCode. My kind of claim to fame is I hired UrbanCode when it was just one guy. Our founder, Maciej Zawadzki, and I met back in 2001, and he was just starting to dabble with some open source. And the rest, as they say, is history. And I'll share some of that with you here in a minute.

Chris Nowak

And I'm Chris Nowak. I actually recently joined HCL. I actually was on your side of the fence for a very long time. I was in industry banking for, I don't know, 23 years or so. Consulted for a few years after that, and now I'm on the software side. But I'm a longtime user of these products and DevOps in particular. In fact, like probably many of you, we were doing DevOps before it was a word. So now I'm here.

Brian Muskoff

Great. So DevOps day two. The best way we thought we could frame a little explanation on our own company history is to tie it into what we see as chapters in our space. And so in this pre-DevOps stage, that's really the foundation of the UrbanCode organization itself. It was founded way back in 1996 in Cleveland, Ohio. And as I mentioned, when I met the founder, he was starting to dabble with some open source projects that ultimately led to Anthill. And for those of you that were in this space over the last, say, 15 to 20 years, you'll know Anthill ultimately became Anthill Pro, and that was the first product that UrbanCode brought to market.

And the interesting thing about that timing, it was right along the same time that the Agile Manifesto came to be. And you start to look at the different dynamics with the teams. Agile started to speed up the flow of work from idea to time it was coded. But then you had a new bottleneck, and that was, how do we get all of this great work from the developers flowing into the build and test environments? And ultimately, that was the origin of Anthill Pro. And Anthill, through the years, ultimately became really the first enterprise CI product and actually started to get into the continuous delivery space before continuous delivery was even a concept. And actually, that's when UrbanCode first met Chris as well over at Bank of America.

And so day one is really born when DevOps was created, and well, believe it or not, it's been over 10 years already. And in 2009, there's a customer story that always resonated with me, and it happens to be 10 years and a day from that event. And you all may remember, Michael Jackson died in 2009, June 25th, so yesterday, 10 years ago. And what was significant in our industry, from my perspective, with that event was the way that two retailers responded to that event. At the time, Amazon was really just a bookseller and was dabbling in some music and so forth. They were actually able to put a marketplace or a site together focused on Michael Jackson within 24 hours of his death. Another competitor, Barnes & Noble, actually took about 60 days to get that equivalent site to market. And it really just highlights the business agility and kind of the operational efficiency that DevOps brings to the table.

But what we found with our customers working in the CI space is that once you automated the build automation, a new bottleneck was created, and that was actually deploying those artifacts through the upper environments. And so, working with our customers, we came out with the UrbanCode Deploy product. You may know it as UDeploy. We ultimately repurposed the Anthill Pro into a product called UrbanCode Build, which is today an enterprise CI tool. And then ultimately, we came out with a product called UrbanCode Release, which was focused on release management, solving those very complex weekend-long quarterly releases and just solving the problems around that space. And so we classify this as day one DevOps, and in many ways, we've seen that align with our own customers' journey. And it's been very much focused on automation in the build and deploy space. But now in day two, we're starting to see new challenges emerge, and that's really the focus of our session today.

And I should mention in day one, UrbanCode went from startup to acquired by IBM, and so there's an IBM UrbanCode product line available today. And then about 18 months ago, IBM entered a partnership with HCL, and that's where Chris and I are today. The product development organization's over in HCL, and we continue our partnership with IBM. But what it's allowed us to do is, well, essentially double the size of our development team and extend our reach into new spaces. And what we've recently launched over the last year is a product called UrbanCode Velocity that's really addressing these day two challenges. But through it all, we continue to be a top tier leader from an analyst standpoint, and we hope to show you why.

And so just a little perspective on what we're seeing in the marketplace. Obviously, our industry, this IT industry, is full of buzzwords, and we're all in the midst of digital transformations. Well, what does digital transformation mean from our point of view? Really, we see at least three dimensions of change going on right now. The architectural dimension, where we're moving from monoliths all the way to microservices. The infrastructure component, moving from physical servers distributed to cloud. And then, of course, where we're focused, the ways of working. And so you've got these legacy applications in large organizations that basically, as one of our customers put it, creates a battlefield of change. And just because of the economics and the finite resources available in organizations, you end up having applications spread out across all of these different dimensions. And Chris has a unique perspective on how he sees this change.

Chris Nowak

Yeah, so I don't disagree with any of that at all. Being on the other side of it the whole time, first at Wells Fargo and then at Bank of America for many years doing it, when I look at this picture, what I see, it kind of reflects the daily life of things that we went through. We had some small successes, but in a large organization, you have a CIO organization, a CTO organization, you have risk organizations, sometimes QA is completely separate. And what you really have are at least three dimensions and probably more of paths of the way people are going in different directions doing what they're doing. CTOs say, "Yeah, we've got to go cloud, infrastructure," all these things. But they might be trying to do that for groups that may be consuming something in a waterfall fashion. And so what I see here, it's a lot of basically cultural conflict or time, like each organization or part of an organization's at a different speed, a different place in where they are.

And so you do get some good pockets. You have some solid pockets there where you get these larger bubbles that grew, and we figured it out, like this group is really good at CI/CD. They're doing agile, they're doing the right things. And you get smaller groups, which maybe aren't quite so far. And then when I look at this, too, when Brian first put this up, I'm like, well, I also see a couple of different colors. So in a large organization, what we also saw, and I had to live through and basically navigate, were alliances that started to form as well. And I think it would be disingenuous if we sat here and said that anytime you get into a large organization, there aren't politics. This reflects to me the organizational chart and the politics that you have to fight to get things done. Because very often we kind of forget we're all kind of there for the same reason. We're there to deliver code for the business. But when it comes down to it, we're working in our own little areas, and I think that that's kind of what we discovered in day one, moving into day two, was that it was kind of easy to organize our own spaces, but the minute we had to interact, sometimes it worked well and sometimes it didn't. So we really started to say, well, automation is cool, but we're starting to bump into process and collaboration and team friction. And that when I look at this, I actually see basically culture, automation, and process depicted in various degrees of basically affinity for each other. So really this is what we end up with. This is what it feels like today, so.

Brian Muskoff

Great. So look, most of us have been to this conference and have spent much of the last 10 years focused on what we would classify as day one problems, CI/CD. Wouldn't say it's a solved problem. There's still a lot more work that can be achieved there. But the reason we keep having these conferences and we keep having momentum behind these initiatives is we're achieving those results we set out for. It's better, faster, cheaper, happier. Whatever slogan you want to put on that, we are seeing those business results. You take, for example, a large insurance company that we've been working with. In just a year's time, they used our UrbanCode Deploy product to onboard over 2,500 applications. And you can imagine a 150-year-old insurance company, when you think of that battlefield of change, they've got applications all over the spectrum. But now they're doing over 1,000 deploys a month, and we've got dozens and dozens of customers with similar stories.

When I think about the success that we've had as an industry in this CI/CD world, I go back to Nicole and the State of DevOps reports. These results are not just kind of operational metrics. They're actually making their way up to the top line. And one of the most unique perspectives I saw on this was some work a colleague of ours did, Eric Minick. He put together a DevOps portfolio of stocks, and the way he identified those stocks was to find basically, who are companies that were in the DevOps community sharing their models of DevOps. And what you can see, his blue line there is much higher than the gold line. The gold line is the S&P 500. And for those of us that have our retirement funds in mutual funds, you probably know you'd be in the top 1% of fund managers if you could beat the S&P 500 by five points or more. So I think it's a really clever way to depict the success we're having as a market. Another study that was done by Forrester, they went out and interviewed a handful of the IBM UrbanCode customers and found, yes, there is a significant investment being made into their CI/CD practices, but there's a 5x payback here. I won't go through all of the individual benefits, but it just reinforces that it's the right thing to do from a business perspective.

But it's also the right thing to do from a human perspective. If you think about some of the work that has been done with our UrbanCode Release product, it's shrunk those weekend releases from a Friday night to Sunday morning event, just to say, a non-event or maybe a short Friday evening type release. It really changes the quality of life for the individuals involved.

Chris Nowak

Yeah. That's very true. And actually, on that point, I've got a personal experience with that where we would be struggling through the weekends for these multiple orchestrated release type problems. And it turns out that when we implemented this the right way, we had a dark pod, light pod concept. We were able to actually have stuff proofed out Thursday before the release even started, and at that point, it was just a bleed and a swing, and it was really quite amazing to the point where untrained release managers, non-highly technical, could actually click through a UI, see an error, send it to the person who was supposed to resolve it, like a WebSphere deploy fail or something, and it was resolved before they actually even filed a ticket. I mean, it was that easy to kind of resolve things.

So the soul-destroying thing, different story. Early on when we started implementing some of these things, it was a foreign exchange trading group, and at the time, this was back in the day, we had this internal system where these yellow flash stickies, we could message each other, and if you're support, your screen was plastered on a Monday morning, you couldn't read your emails, right? You can't get rid of it till you answer it. And so we implemented this, and it was all of these multiple, 12-server delivery, and it used to take all weekend, but we automated the whole thing, and it was literally almost a push button, and it went in parallel, 12 servers done in 20 minutes, validate for an hour. Monday morning, it was about 7:00 in the morning, I'm getting these frantic phone calls from the FX support teams. They're like, "It's broke." "What do you mean it's broke?" "It's all broke. It's so broke, we're not even getting messages. Everything's broken." So we're digging through logs, and it turns out nothing was broken. We didn't expect it. It was such a weird thing that everything, all these flashes they expect on a Monday morning, nothing was there. Everything just actually, and I hate to say it, but it actually happened exactly as designed and it was perfect. But it was so weird and uncommon that everybody thought that it was so broke, it was just broke forever. So we no longer had a soul-destroying morning, I guess, is kind of the punchline there.

So as we did some of this, we started to say, as we went through this journey, and I think, again, a lot of you can relate to this, we looked at our dev teams, and a lot of the dev teams said, "We're going to do Agile, something like Agile. We're going to work better together." And that worked out, right? So we did Kanban and other things. Hopefully, next to them, we also had CI/CD platform teams or somebody saying, "Yeah, you know what? We're going to fix that build bottleneck. We're going to fix the deploy bottleneck, get the environment management team involved, get those testers in here." And we actually started to automate that whole process. But as you can see on the upper right, there's kind of a bulge there. And when you look at what's really happening, we did really well around the Agile dev and test. But like any bottleneck, per the theory of constraints, when you solve one, it pops up somewhere else. We completely left off the left and the right side, focusing on what was basically our current pain point. And we're still missing, basically, what's the high-level goal, right? What's the end-to-end insight? And it was great because we solved our current problems, which had to happen, but this kind of gets us to this day one question. Well, wait a minute. We're better, but we're not as good as we thought, and the business is saying, "You made yourselves good. What about us?" Okay. So we're saying, how do we align? How do we do other things?

So Brian presented what does winning in the DevOps world look like. Well, it's kind of like, well, what have you done for me today? This is what losing looks like in a DevOps world. I'm not going to really read through those for you, but anywhere from 50% to 90% is basically saying it's not as good as we want it to be, depending on what you pick. And the last one here I actually find really, really interesting because ultimately, this isn't a technology business. It's all about our people, right? When somebody's completely disengaged, they've had their soul destroyed or whatever it is, the top of the house, they're losing 3,400 for every 10,000 in salary. You think about that, that's 34% where people aren't happy, and they're not doing the work you're asking them to do. That number bothers me every time I hear about a toxic environment. So, okay, thanks.

So here are the things we're hearing from our customers and also some of the things that I was experiencing over the last decade or so. At the top of the house, right, it's really about, are we connecting the IT value that we're trying to do to the business value? We talk about that in our scrums and our business meetings, but did we really, when they authorized a project with an ROI, can we really say, "Here, it's operating in production. Your users are using it." Did we match? We don't know. I don't know anybody that can measure that and say, "Yeah, we hit it point for point." Right? And so I'm going from the top, sort of a strategic, and then an operational and tactical. What does that really mean? We're adapting the organizational cultures and alignment to it. We're looking to spread our best practices. So can we even spot, well, that team's doing really well. We might know that. Well, why are they doing well? How are they doing well? If they're doing that, do more of that for everybody else. We can't really figure that out yet in most cases. So we want to scale it. What it really does mean at more of an operational level, we do want to see what that operational flow is. What we're really looking at is we have some small value streams at a team level. We're going to stack them across teams, and eventually, what we do get to is one firm-wide value stream, the business value stream. So we're stacking value streams in a way, right? Taxonomy of value streams.

The other thing we have, especially in an enterprise, many, many decades of either multiple tools, multiple mergers, and I think in 1990, the top four American banks were 35 separate companies, right? And in many cases, we never really rationalized the technology. That's a problem. We thought in day one, we're like, oh, yeah, we're just going to solve all that, collapse it, and do it. Guess what? We never did. And I think we're realizing now it's a reality. We have to figure out how to deal with the fact that we're going to have a huge amount of technology sprawl, and it's only getting worse as technology advances. That's actually becoming a growing problem we never anticipated, I think. And then finally, what does it really mean? So at the tactical level, if you have a DevOps team or you have an Agile team, are you really pulling in your security person into those scrums? Are your ops people there? I'm not going to ask for a show of hands because I think we all say we want to do it, but we really don't. Is your environment management person in those scrums? Dev and ops isn't really dev and ops; it's dev and, oh, ops, kind of thing.

So where do we go with that? I think we look at it one way, or at least this is how I like to look at it. We want to start treating our IT like a business, and in fact, when I first started my teams, I would say, we want to run like a startup in a large company. And so we behave that way, and it actually worked out pretty well for us to create services. And you can see some of the things there. Ultimately, we do want to say, well, when that business office approved this project spend, are we aligning that to the outcomes? Okay. We want to make sure we measure the ROI and make sure it's what we're doing. If you're doing something that wasn't authorized, stop doing it. Okay, that's one thing. If you ever read The Goal, they talk about that type of thing. We do want to make software delivery sort of a first-class citizen. It's a core competency. It's sort of another business line. It's a capability of the business to think that way. As software folks or technology folks, we actually have to start thinking a little bit more like a business. It'll help us, if for no other reason, we start to speak the talk of the language of the business folks.

And so I'm not going to spend time on these points, but as we go through there, it really ultimately says, capture the value stream and optimize it. And I don't mean just CI/CD. Move it left, move it right, move it into the operational space, the post-deployment space where most of the software operates and people have a user experience. And then lastly would be the technology sprawl that I talked about that seems like it's never going to go away. You need to own that, or it's going to own you. Okay? You have to figure it out. And so we're looking at ways, how do we capture the information coming from these tools and operate them better if we're not going to collapse them?

So, I'm going to go through this very quick. I'm going to just call out these points. There's at least 28 separate boxes. This is a real-world example. This is a single SQL file deployment, basically in one application or one part of an application. Four different groups involved, at least 28 steps. If you went down those branches and did yes, no, it broke here, it didn't go there. I think I stopped counting at 114 different paths. And a lot of red lines and a lot of red boxes and at least three or four manual ticket handoffs. I guarantee you this piece of SQL didn't take 15 minutes to get a turn back into, once they fixed it. Okay?

We automated this thing. This actually, we did this in UrbanCode Deploy. Red boxes with an X, they went away. Manual processes went away. Anything with a green X meant we automated it. One really neat thing to note: oh, I guess I kind of gave it away. Is there any one particular thing you guys see here that's interesting? Four lanes down to three. An entire team didn't need to be there. Think about the drag that put on the organization because it was there, and we just didn't look at the process, right? That's what it looked like. Ultimately, the cleaned-up version when we put it in, and here are the results from it. So anywhere from 50% to 90%, depending on what you're talking about. We went down to zero handoffs. Half your support teams to call, process steps dropped significantly. This is the kind of thing that was possible six, seven years ago when we looked at it in a localized fashion. Multiply this thing by probably 500 in a larger organization across multiple products, and just think about the amount of waste we have in our systems.

Okay? So to get there, we started with this picture where everything was uncoordinated, and you can look at that any way you want to. We ended up putting a few things in boxes and pulling it together. We realized it wasn't enough. So we want to get to this thing where we're running it like a business, primarily to create business agility. Think like a business. Well, how do we do it? I think we're entering this day two phase of how do we align? We're going to get into ways of aligning investment, revenue, cost, risk. That's how we have to start thinking so that we can express things to our businesses that make sense and say, we are being good stewards of the money you gave us.

You've seen these numbers. These come from various, I pulled them from a number of different DORA reports, the State of DevOps reports. Those are the things you can expect. And we're not making this up. DORA found these things in State of DevOps. Agile enterprises, optimized IT investments, and your wins in a resilient organization.

Brian Muskoff

Thanks, Chris.

Chris Nowak

Yep.

Brian Muskoff

Okay, we've got about five minutes left, and I promised I'd get to the dots, so let's change gears a little bit and talk about that. So, shortly after UrbanCode was acquired by IBM, we had the good fortune of seeing our sales skyrocket by an order of magnitude. And so, what that meant was we had much of the team shift from planned work to unplanned work. And given IBM's a very complex, large organization, you can see we had a number of support systems that needed to be basically aggregated to report status out to customers or even just manage the process ourselves. And we basically just had a lot of friction, a lot of drag in our ability to execute our development team. And so, at UrbanCode, we've always had a culture of hack time or research time, about 10% a week. Our staff, our engineers can use that time to improve their own skills or go solve a problem. And we had a set of developers on our team that were so frustrated by this problem, they spent all of their time, in fact, went above and beyond nights and weekends, to find a very creative solution to this problem.

And we called it Dots internally. The view you're seeing here is unplanned work. So on the left-hand side of the screen here, you can see these little dots are actually incidents or PMRs for those of you that worked with IBM, and they're mapped to their SLA. So SLA is based on severity, and so they'll pop up on the screen based on their severity. And it's almost like a game of football, right? Where you're basically playing defense so it doesn't cross that line or hit that goal. You don't want to get past your SLA. And then on the right-hand side of the equation, you can see when the L3 engineer kicks it back to L2, you basically have a bird's-eye view of all of the activity across many systems and many people on the team.
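Editorial illustration (not part of the talk): the idea of placing each incident on the screen according to how much of its SLA has been used up can be sketched roughly as below. The severity-to-hours table, the `Incident` fields, and the function names are all assumptions made for the sketch; the talk does not describe how Dots actually computes positions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical SLA windows per severity (hours); the talk gives no real values.
SLA_HOURS = {1: 4, 2: 24, 3: 72, 4: 168}


@dataclass
class Incident:
    ident: str
    severity: int
    opened_at: datetime


def sla_position(incident: Incident, now: datetime) -> float:
    """Fraction of the SLA window consumed: 0.0 = just opened, 1.0 = at the SLA line."""
    window = timedelta(hours=SLA_HOURS[incident.severity])
    elapsed = now - incident.opened_at
    # Dividing two timedeltas yields a float; clamp so future-dated items sit at 0.
    return max(0.0, elapsed / window)


def closest_to_breach(incidents, now):
    """Order incidents so those nearest to (or past) the SLA line come first."""
    return sorted(incidents, key=lambda i: sla_position(i, now), reverse=True)
```

A value above 1.0 would mean the dot has already crossed the line, which is the situation the "playing defense" analogy is meant to prevent.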

And ultimately, this led to terrific results. We reduced our work in process by more than half and actually got us back to much more planned development with the same size team. And so the development team started thinking, "Hey, we had such great success with this visualization. What else can we use it for? What's next?" And so we started to apply it to planned work. And this starts to get into that mode of value streams. And for those of us that have done value stream mapping in the past, it's usually an event, right? It's a finite or static snapshot in time where you take the current state, you try to optimize for a planned state, and you execute a plan. What we have here is really a real-time value stream. You can start to see visually where your bottlenecks are, what needs our attention. If I just completed something, what should I go work on next?

And then you can start to have all kinds of other views on this data. In this case, we've got epics. If you're a product manager and you've got an epic that's tied to some business deliverable, you may want to know, am I tracking for my milestone dates? I think of it as a visual burndown chart. And so we started to present these internal tools to some of our customers, and it really resonated. They really latched on. And ultimately, what we ended up doing is pulling that visualization capability into our UrbanCode Velocity product. And so you can see here on the left, in the top backlog, those are basically work items, right? Whether they're from Jira or Rally or RTC or whatever tool you may be using. Then you can also start to pull in incidents, right, from basically those feedback loops or that unplanned work that's coming in from production.

And ultimately, what we do is, I think of it as kind of a snowball rolling downhill. As the data is accumulated, tied to these work items, you can start to see basically the traceability from work item to source control to build systems to test to security all the way through production deployments. It provides a kind of an unprecedented level of visibility and traceability to all of the business value flowing through your value streams.

And once you have this data aggregated in these views, you can start to do all kinds of interesting things with metrics, right? You can start calculating lead time, cycle time, deployment frequency. In this case, if you click on a dot, you can actually drill down and see in real time all of the basically audit trail associated to that work item. And so that's very much at a team level. What we also do is roll that up at what we call a portfolio level. So, if you're a CIO or some kind of transformation leader, you may want to start to see across your teams who are the high-performing teams, the low-performing teams, what's the quality of the security scans for the different teams, so on and so forth. I like to think of the different levels of insights as flow, value, risk, custom reporting.
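Editorial illustration (not part of the talk): once work items carry timestamps from the connected tools, the metrics named above reduce to simple date arithmetic. The sketch below assumes one common set of definitions (lead time from item creation to production, cycle time from start of work to production); definitions vary between teams, and the `WorkItem` fields and function names are hypothetical rather than anything from UrbanCode Velocity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class WorkItem:
    ident: str
    created_at: datetime   # item entered the backlog
    started_at: datetime   # work on the item began
    deployed_at: datetime  # change reached production


def lead_time(item: WorkItem) -> timedelta:
    """Time from backlog entry to production (assumed definition)."""
    return item.deployed_at - item.created_at


def cycle_time(item: WorkItem) -> timedelta:
    """Time from start of work to production (assumed definition)."""
    return item.deployed_at - item.started_at


def deployment_frequency(items, window_start: datetime, window_end: datetime) -> float:
    """Production deployments per day within [window_start, window_end)."""
    days = (window_end - window_start) / timedelta(days=1)
    count = sum(1 for i in items if window_start <= i.deployed_at < window_end)
    return count / days


def summarize(items, window_start: datetime, window_end: datetime) -> dict:
    """Team-level summary; medians are less sensitive to outliers than means."""
    return {
        "median_lead_time": median(lead_time(i) for i in items),
        "median_cycle_time": median(cycle_time(i) for i in items),
        "deploys_per_day": deployment_frequency(items, window_start, window_end),
    }
```

Running `summarize` per team and comparing the results side by side is one simple way to picture the portfolio-level roll-up described above.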

And I should point out, we started really in our sweet spot. We come from CI/CD. What you see here is very CI/CD centric. But what we're doing is basically starting by going upstream to the ALM systems. Ultimately, we're going to go further upstream, start to pull in the economic numbers, right? At the end of the day, we need to see cost, revenue, so on and so forth. We can also go downstream, start to get those feedback loops, the A/B testing results. Did we get the expected value from the investments that we made?

And so what we think sets us apart with this Velocity tool is not only do we have this level of insights, but you can actually take action in the tool to make improvements. So, on the top two views there, that's what we just reviewed from a team and a portfolio level. But on the bottom there, we also have pipeline capabilities. So you can actually orchestrate, you can think of it as pipeline of pipelines. If you have multiple microservices or components or applications that are using different automation tools, you can start to build those dependencies and execute the deployments with whatever tool that you want. And from a release management standpoint, you can also start to pull in ServiceNow ticket updates and other types of, say, manual tasks that may still exist in your system.
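Editorial illustration (not part of the talk): the "pipeline of pipelines" idea, where components deployed by different tools depend on one another, can be pictured as a dependency graph that is released in waves. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the component names and the `deployment_waves` function are hypothetical and say nothing about how Velocity orchestrates deployments internally.

```python
from graphlib import TopologicalSorter


def deployment_waves(dependencies):
    """Group components into waves.

    `dependencies` maps each component to the set of components it depends on.
    Every component lands in a later wave than all of its dependencies, so the
    members of one wave could be deployed in parallel by whatever tool owns them.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical release: a web front end needs the API, which needs auth and a schema change.
release = {
    "web": {"api"},
    "api": {"auth", "db-schema"},
    "auth": set(),
    "db-schema": set(),
}
print(deployment_waves(release))
```

Manual steps such as a ticket update could be modeled as additional nodes in the same graph, so they gate the components that depend on them.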

And so as we look at Velocity, we think it really puts a bow on our DevOps platform. It builds on the CI/CD capability that we have and now brings in the business value, the business agility that was promised with DevOps and starts to really connect the business value with the IT investments.

And so to close, we love that CALMS, or the CALM acronym, and we love it so much we actually put it on a T-shirt. We encourage you to stop by our booth on the main floor and grab one and continue the conversation. We're just really starting our journey in this value stream area, and we'd love to get your feedback and start to understand your ideas on how you could apply this to your business to get some good results. So thank you for your time, and enjoy the rest of the show. Thank you.