From Velocity to Value – Scaling the DevOps Impact
As large organizations embrace DevOps, some are seeing the pace of innovation accelerate, but many still struggle to realize a real increase in the value delivered by DevOps initiatives. This is often a result of attempting to move at DevOps speed while needing to work with parts of the organization that remain based on legacy processes and systems.
Join us to learn about companies that have taken an approach based on policies and automation to help developers get quality code into production as rapidly and safely as possible. You’ll hear about how they leveraged existing DevOps toolsets and added centralized access to key metrics and insights to improve the entire application value stream.
This session is presented by ServiceNow.
Full transcript
The complete talk, organized by section.
Eric Ledyard
Hello, and welcome to today's DevOps Enterprise Summit session from ServiceNow. This is ServiceNow's discussion around scaling to enterprise DevOps.
My name is Eric Ledyard. I'm a principal product manager for the DevOps business unit, and today joining me remotely will be Ben Riley, an advisory solution consultant as part of our Sweagle acquisition.
We're going to cover three key things today. One is how to scale DevOps; two, case studies from customers that have been with us during our journey; and three, Q&A so that we can take your questions.
So the business imperative for scaling DevOps is really all around the fact that cloud native is becoming the de facto standard way of doing things. Mode two, or cloud-native or microservices-based development, is really the way that most companies are organizing to deliver code to their customers.
The reason for this is that cloud adoption is rapidly accelerating. When you go to that world, legacy traditional code development methodologies are just not going to cut it. A lot of organizations are not only moving to the cloud, but they're also transforming the way that they deliver code and value to their customers. Tracking that end-to-end value stream is really what's important to most organizations that are adopting the DevOps culture.
The second big initiative is that DevOps is a very strategic initiative now. Especially post-COVID, we're seeing a lot more organizations need to move faster, but with less risk. DevOps adoption is going to roughly double in the next year, from 41% of companies that tried in 2017 up to as much as 80% or more that are going to attempt this in 2021.
The next big part is infrastructure as code and infrastructure automation. Companies are going to start doing more and more around treating their entire infrastructure deployments as code, and treating pretty much everything in their organization as code, whether it be pipelines, software development, or pretty much everything. Everything is turning into as-code so it can be managed and tracked as part of an automated pipeline.
The other part accelerating DevOps adoption and value is the challenges that a lot of companies are facing. One big challenge is that many companies that have tried to adopt DevOps have failed to see any significant release-frequency increase. A lot of folks might tell you that value is more important than release frequency, but I can tell you the reality from experience. I'm a former executive from Bank of America who tried this at a 200-year-old bank. All of our projects were funded on the premise that we wanted to go from a nine-month release cycle, from ideation to production, down to nightly builds. Our immediate metrics were all around release frequency.
We built three complete software factories. We trained 500 individuals how to be agile and do their development in an agile methodology, and yet no one would actually let us push the code to production, because we hadn't really solved for governance, risk, compliance, audit, or any of the IT service management components. That's a common theme. A lot of companies, while they can write code quickly and develop code iteratively, are not able to push that code to production, so they can't get it into the hands of their customers to drive value.
Another big part is developer time. Of the developers interviewed for the 2019 ActiveState survey, 61% said they spend less than four hours a day writing code and less than about an hour a day writing mission-critical, feature-differentiating code that would help drive business value to their customers. About 70% of the tasks they do are the mundane, process-driven, manual legacy carry-ons from their old world. That is definitely a challenge for a lot of those individuals.
Another problem is that 90% of release problems have configuration errors. When we do root-cause analysis, configuration management becomes a massive challenge because there are so many configuration files traveling throughout all of these deployments, from application configuration all the way to infrastructure cloud configuration. We're seeing configuration management becoming more important to this end-to-end service delivery.
The other side is that it takes a long time to get changes approved in most organizations. On average, we've seen around 23 days to get a normal change pushed through. The second part is that there's a massive explosion in configuration data taking place. This challenge of configuration is just getting more and more exacerbated as we grow the organization's capabilities.
What we're doing is bringing ServiceNow and DevOps together. We're bringing the enterprise service management capabilities of the ServiceNow platform together with the DevOps needs of our customers. Our customers need speed to deliver faster, visibility across all of their toolchains, and increased productivity.
The trick to doing all of this is through a strong integrations framework so we can connect to all the tools that customers currently use. We don't want people to change any of their toolsets. We want to leave them productive in the toolsets they use every day, integrate with those tools, pull the data into our platform, and then do all of the functionality that we're doing, which out of the gate is around change automation, developer insights, and push-button audit.
What we did was try to bring together the two worlds of dev and ops seamlessly into our platform. It's a little bit challenging because we have to meet the needs of developers, who want to move quickly and get products to market as quickly as possible with minimal process bog-down and minimal wait time, whereas ops needs to make sure that we control and govern things so we're not introducing risk into the environment and we're not taking outages and downtime for our customers.
The way we did this was we leveraged ServiceNow's core data model, our CSDM, Common Service Data Model, and built a DevOps data model to pull in all the data from CI/CD tooling: from planning to coding to orchestration to static code analysis to testing to artifact management, all the way through that entire life cycle. We bring all that data into the platform and it allows us to do a lot of things.
One is the change automation piece, where we can use decision trees to say, "If this and this are all met, approve this change into production and let it deploy automatically with the CI/CD tooling." From there, it extends through agile team planning to continuous compliance: both upfront in pre-production, where we check compliance and governance before you even deploy, and in post-production, when you're actually running in a service environment, to maintain governance, risk, and compliance on an ongoing basis, leveraging things like our GRC platform and SecOps.
It goes all the way to service health, which ties into our IT operations management suite and being able to do remediation, mean time to repair, and seeing a service degradation in production that's due to a change that just got pushed out. All of that information is being brought into the platform so we can see that end-to-end. We help do root-cause analysis. We help reduce the impact of outages. There are a number of benefits across the service-health landscape.
Finally, there's the piece that I am probably most excited about, being a former leader: DevOps insights and analytics. This is the ability to see, across all my toolchains, which teams are high performers, which teams are low performers, which teams need some help, and where we have breakdowns in our end-to-end value stream of delivery. All of these pieces are done by having analytics and insights across all of the data.
Then finally, we bring on the configuration management piece, which is all around managing configuration data, securing configuration data, and validating configuration data for both preventative reasons before we actually deploy, as well as remediative reasons for after we deploy and see an issue in production that could be caused by a configuration mismatch.
All of this end-to-end visibility culminates in the ability to come in and see your end-to-end value stream of delivery, from ideation to planning to commit insights to development insights, deployment insights, change acceleration, and system health: pretty much the entire set of Accelerate metrics from the DORA report and the Accelerate book. We've brought them into our platform to measure the high performers, the low performers, and all of the teams that we manage.
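The four key metrics from Accelerate (deployment frequency, lead time for changes, change failure rate, and time to restore service) can be computed from basic deployment records. The following is a minimal sketch under stated assumptions: the `Deployment` record and `four_key_metrics` function are hypothetical names for illustration and do not reflect the ServiceNow data model or any product API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def four_key_metrics(deployments: list, window_days: int) -> dict:
    """Compute the four Accelerate metrics over a reporting window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used here rather than means so that a single unusually slow change does not dominate a team's numbers; that is a design choice for this sketch, not something stated in the talk.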
I can tell you, as a former leader, I was blind to all of this across my teams. I had no way of seeing data on how well teams were doing in their delivery cycle, or where we had breakdowns in their value stream. This is extremely important to me, and this became the reason why we were named a value stream management leader in the Forrester Wave this year. It's all because we have all this data, and we're tracking it with performance analytics, so we have trend analysis over time. We can see whether we're up or down, trending in the right direction, and start to see performance across our teams.
This brings together a lot of benefits for us. The most robust, easy return on investment is change automation. What we do is estimate that for every 100 users we bring into the platform, we save about $1.5 million. That's done by returning 14% of the time to the development teams that was wasted in legacy change processes. But that's not the only place we're saving money, and that's not the only place we're driving value.
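As a rough check on how those two figures relate, the arithmetic can be written out. This is only a sketch of what the quoted numbers imply if the savings come entirely from the 14% of time returned; the talk does not state the underlying cost assumptions or the period the savings cover, and the function name is hypothetical.

```python
def implied_value_per_user(total_savings: float, users: int, time_returned: float) -> float:
    """Value of one user's full time implied by the quoted savings.

    Assumes the savings come entirely from the fraction of time returned,
    which is an assumption of this sketch rather than a claim from the talk.
    """
    return total_savings / (users * time_returned)


# $1.5 million across 100 users, with 14% of their time returned
value = implied_value_per_user(1_500_000, users=100, time_returned=0.14)
print(f"${value:,.0f}")  # prints "$107,143"
```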
We're really driving value across all four planes of project execution: executing products faster and increasing flow throughout the system. We can drive top-line revenue by being able to get products to market sometimes three to four times faster than we would have before, capturing revenue from our customers and translating it into revenue to the streets. That's an incredible driver for most executives.
We reduce CAB meetings, obviously, and reduce the impact of change. We've got reporting and analytics, which allows you to keep your developers writing code rather than sitting in status meetings or trying to update their leadership. As a leader, I can see what all of my teams are working on and drill in proactively without having them waste time reporting what they're doing or how well it's going.
Finally, developer productivity. It's all about getting developers back to writing code. That's what this whole platform is about: optimizing the end-to-end workflow, automating as many places as possible, and allowing us to give back developers a lot of their time so they can focus on writing more stories per sprint and increase the flow of value throughout the end-to-end delivery cycle.
So now we have some use cases that we can talk about, and a lot of this is around deploying fast but safely. One of our key lighthouse customers is DNB. They were basically the largest bank in the Nordics as of 2019. They're transitioning into becoming a true digital bank.
When they started this project, they looked at what they knew on the dev side. The challenges there were: reducing the cycle time from ideation to implementation; a large variation in their Kanban structure, so processes were disjointed and the status of ongoing work was hard to see and track; one pipeline toolset per team, so lots of disparate tools spread across those areas; and no coordinated pipeline structure or policy set for driving governance around change management or any of those pieces.
That caused a lot of complexity and slowdowns because it was hard to get any of the three big use cases we talked about: automating change, visibility and traceability across your entire landscape, and audit. All of that made it very complex and challenging.
What they knew from the ops side was that they had a very time-consuming change-ticket form that no one liked to fill out. They had a very long cycle time for change advisory boards and tech board meetings. They had strict policies and little insight and knowledge when they were actually making the operational judgment calls. There was a long distance between the approvers and the developers. There were many different layers of process between the people wanting to make the change and the people who would approve it.
Basically, they set out a carrot for rewarding good performance: change tickets would be automated for teams that put the guidelines in place. They've been seeing a lot of great results by saying, "Your CI/CD pipeline tool is accepted and the deployment process is at least partially automated. You have a set pipeline that separates environments. You'll always run the different tests." Those became the criteria teams had to meet in order to have changes approved automatically.
They've seen a pretty great return on this. The overall business case they brought to us was that they estimated about a 20-minute reduction per change ticket. That can really add up, especially when you have thousands and thousands of changes every year. For the first dev team of six deploying two increments a day, they saved two hours per day, or 10 hours per week. The next team of 20 has not been compliant with the policies, but will still save around two hours per day, or 10 hours per week.
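The criteria quoted above can be pictured as a simple policy gate. This is a minimal sketch, assuming each criterion can be reduced to a yes/no fact about a team's pipeline; `PipelineFacts` and `evaluate_change` are hypothetical names for illustration, not DNB's implementation or a ServiceNow API.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PipelineFacts:
    """Hypothetical yes/no facts about a team's pipeline, one per criterion."""
    approved_cicd_tool: bool      # the CI/CD pipeline tool is accepted
    deployment_automated: bool    # deployment is at least partially automated
    environments_separated: bool  # a set pipeline separates environments
    tests_always_run: bool        # the agreed tests run on every change


def evaluate_change(facts: PipelineFacts) -> tuple:
    """Return (decision, unmet criteria); auto-approve only if every criterion holds."""
    unmet = [name for name, ok in asdict(facts).items() if not ok]
    decision = "auto_approve" if not unmet else "manual_review"
    return decision, unmet
```

Returning the list of unmet criteria alongside the decision means a team routed to manual review can see exactly which guideline it still needs to put in place.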
If you start to add that up across many, many different developer teams and many pieces, the capital savings ends up growing dramatically. That's where we've seen DNB be very successful in their return-on-investment numbers as they've moved forward with scaling out DevOps to their enterprise.
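The savings arithmetic is easy to reproduce. In this sketch the only figure taken from the talk is the 20-minute reduction per change ticket; the number of tickets per day is an input you would supply. Note that two hours per day corresponds to six tickets per day at 20 minutes each, though the talk does not spell out how the team's increments map to tickets.

```python
MINUTES_SAVED_PER_TICKET = 20  # estimate quoted in the talk


def hours_saved(tickets_per_day: float, days: int = 1) -> float:
    """Hours saved over a number of working days at a given ticket volume."""
    return tickets_per_day * MINUTES_SAVED_PER_TICKET * days / 60


print(hours_saved(6))          # 2.0 hours per day
print(hours_saved(6, days=5))  # 10.0 hours per five-day week
```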
Now we're going into the next step, talking about eliminating configuration management challenges. I'm going to pass it over to my colleague and let him explain to you about our Sweagle acquisition and configuration management.
Ben Riley
Thanks, Eric. Really good session.
I'm going to go into a little more detail on configuration data management. We're from Sweagle, and we were acquired by ServiceNow in July. We're a configuration data management solution. We really aim at improving the way that our customers manage configuration, use it, test it, all of these different things.
We see it as one of the biggest areas for our customer base at the moment, trying to get a handle on how they're dealing with configuration. By its very nature, you want it to be configurable, malleable; you want to be able to update different settings, different features, canary deployments, all that kind of stuff. We do see that handled by config, but there's still a huge proportion of outages caused by configuration mistakes. Mistakes happen, but the time lost to those outages across the user base of an organization is pretty high.
What we do at Sweagle, and what we do as part of our configuration data management solution, is firstly try to centralize it and put some good practices in place. We apply tests and integrate with lots of different tools in quite an automated fashion to act almost as a watcher of config. We watch it, track it, understand when changes are made, and based on that, we give you either dependencies or validations.
Can we see who else requires that config? Can we see who needs to go and use it? Or do we need to validate the quality of that change? For example, somebody has changed a region from EMEA to North America. Previously, that's hurt us. We're not allowed to deploy there for whatever reason. Let's stop that change. Let's stop that as part of a pipeline. Let's stop that as part of an automated deployment and make sure we are pushing a preventative methodology when it comes to configuration rather than a reactive methodology, i.e., an incident happened, oh dear. What we want to do is get to: okay, that incident happened; now we're aware of it, we can put in a validation or policy that can prevent somebody from making that change again.
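The region example can be sketched as a validation rule that runs before deployment. This is an illustration only: the `Rule` type, `ALLOWED_REGIONS`, and the function names are hypothetical and are not Sweagle's validator API.

```python
from typing import Callable, Optional

# A rule inspects a proposed config and returns an error message, or None if it passes.
Rule = Callable[[dict], Optional[str]]

ALLOWED_REGIONS = {"EMEA"}  # hypothetical policy added after a past incident


def region_is_allowed(config: dict) -> Optional[str]:
    """Reject deployments to any region that is not on the approved list."""
    region = config.get("region")
    if region not in ALLOWED_REGIONS:
        return f"region {region!r} is not an approved deployment region"
    return None


def validate(config: dict, rules: list) -> list:
    """Run every rule and collect failures; an empty list means the change may proceed."""
    return [msg for rule in rules if (msg := rule(config)) is not None]


errors = validate({"region": "North America"}, [region_is_allowed])
if errors:
    # In a pipeline step, a non-empty result would fail the stage and stop the deployment.
    print("change blocked:", errors)
```

Each new incident can then be turned into one more rule in the list, which is the move from reactive to preventative that the talk describes.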
I want to talk about a couple of customer journeys and use cases they've been taking on. One is a rather large telco using our validation engine, which tests and tracks things and makes sure they are good quality around their data center. Essentially, they had three principal CMDBs, a large infrastructure estate, and a huge amount of duplication of resources depending on which CMDB you looked at. From an SLA perspective, from a contracts perspective, and just from a risk perspective, not very good at all.
It was very hard to synchronize those three sources. Therefore, when either an incident occurred or somebody needed to do some work, understanding where to look for the truth was really difficult. It was hard to measure contractual compliance. It was basically a very manual process to work that out. Generally speaking, by the time that synchronization effort had finished, it had to start again. That was a long process to get through, so you're constantly chasing your tail. The result was poor data quality, inconsistent data, and no real source of truth that would let you get into the weeds quickly when an incident does happen and understand what's happening and where. And there was a lack of compliance in being able to understand what is deployed out on the estate.
There are various other things, like the amount of time spent, human effort, and rework going in to try to work these things out rather than value-add into the business. As an agnostic repository with a validation engine, we saw it as a really good opportunity to start to do some machine-to-machine correlation.
We very much believe that the more high-quality automation you can do, the more you free people to get on with value-add tasks. Sweagle has a graph data model in it, and what that means is we can store things in lots of different ways, and it's very easy for us to do that synchronization. So we're continuously looking at modifications in any of the three sources to check: is that modification good, bad, or ugly? Based on that, does another CMDB need to know about it, or does it already have that information?
We've positioned it in a manner that we could collect data from all of them and synchronize data between them. Synchronization scripts trigger workflows to correct something, enrich something that's missing, but do it in a totally automated fashion. It's really useful in a very busy estate with different tooling, automation tooling, and tool preferences depending on which team you're working with. From our perspective, it doesn't matter. It's heavily API-driven, so we can integrate in a very non-intrusive way, kind of in the background, but making sure all of that work still gets done.
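The machine-to-machine correlation step can be pictured as comparing records for the same asset across sources. The sketch below uses plain dictionaries rather than a graph model, so it is not how Sweagle stores or synchronizes data; the source names, field names, and `find_discrepancies` function are all hypothetical.

```python
def find_discrepancies(sources: dict) -> dict:
    """Compare records for each asset ID across sources.

    `sources` maps a source name to {asset_id: record}, where each record is a
    flat dict of hashable values. Returns, for every asset that is absent from
    a source or has differing field values:
        {asset_id: {"missing_from": [...], "conflicts": {field: {source: value}}}}
    """
    all_ids = set().union(*(records.keys() for records in sources.values()))
    report = {}
    for asset_id in sorted(all_ids):
        present = {name: recs[asset_id] for name, recs in sources.items() if asset_id in recs}
        missing = [name for name in sources if name not in present]
        fields = set().union(*(rec.keys() for rec in present.values()))
        conflicts = {}
        for field in sorted(fields):
            values = {name: rec.get(field) for name, rec in present.items()}
            if len(set(values.values())) > 1:  # sources disagree on this field
                conflicts[field] = values
        if missing or conflicts:
            report[asset_id] = {"missing_from": missing, "conflicts": conflicts}
    return report
```

A report like this is only the detection half; deciding which source wins, and triggering the workflow that corrects or enriches the others, would sit on top of it.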
That reduces people's time and effort dealing with and managing these things, and gives an improved set of data. You can make better decisions. You can really believe that if there is an incident or a problem, you can investigate it without having to do a lot of research first, and it helps to address those SLA breaches and that contractual compliance element.
The second customer I want to talk about is more finance-based. This is more of a traditional application release pipeline space, so CI/CD. It had a broad technology space, and not just broad but deep as well: a mixture of legacy and greenfield applications, a mixture of manual processes and automated deployments.
The thing that really stuck with me was a quote: "Can only find our configuration incidents in production." They were really struggling to get a handle on that anywhere earlier, in any of their UAT, test, or performance environments, because for them there is only one production. Therefore that's where it really emerges. Testing in production, maybe not the aim.
There were large sets of human interaction with their config, which is not necessarily a problem, but something you need to track and be aware of. It was just a larger application estate: 200 applications, a mixture of microservices and legacy components, and dev configurations essentially being transformed slowly into production and not in a manner that's required. It was giving them poor customer experience, outages, a lot of incidents, a lot of rework to fix those problems, and quite a lot of security issues as well, especially around tokens, passwords, those sorts of things.
As I mentioned a minute ago, we try to be as automated, or as highly automated with quality, as possible. Acting as a central repository meant that we could slip into lots of different processes, primarily CI/CD pipelines, as a couple of steps that collect configuration that's being added or modified or removed, and validate the quality of what those changes are.
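The "collect what was added, modified, or removed" step can be sketched as a diff between the previous and proposed configuration. This assumes flat key/value config for simplicity; `diff_config` is a hypothetical helper, not part of any Sweagle or CI/CD tool API.

```python
def diff_config(old: dict, new: dict) -> dict:
    """Classify flat key/value config changes as added, removed, or modified."""
    return {
        "added": {k: new[k] for k in new.keys() - old.keys()},
        "removed": {k: old[k] for k in old.keys() - new.keys()},
        "modified": {
            k: (old[k], new[k])  # (previous value, proposed value)
            for k in old.keys() & new.keys()
            if old[k] != new[k]
        },
    }


changes = diff_config(
    {"region": "EMEA", "port": 8080, "debug": True},
    {"region": "North America", "port": 8080, "timeout": 30},
)
print(changes["modified"])  # {'region': ('EMEA', 'North America')}
```

Keeping the previous value next to the proposed one gives a later validation step the context it needs to judge whether a specific change is acceptable.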
Taking that data, seeing that we're changing a region from EMEA to North America or changing a port from X to Y, whatever it might be, having the contextual information around that meant we can test that the change is actually good. If we've been burned by something before, then maybe that's a point where we can notify, raise an incident before that deployment happens. One of the big aims was a preventative methodology.
Reactive is, I think, the world that lots of people live in, but we're really trying to push this customer into a more preventative space, meaning somebody can make a change in full confidence, knowing that if there is a problem, if we've been burned by something before, it's going to get stopped as part of that pipeline. Even if the pipeline doesn't stop, the knowledge of what that change is, the auditability, the securing of secrets, and those types of things are all handled.
That really allowed them to promote good practice in terms of continuous delivery and continuous integration. It also gives them the ability to start to standardize what teams are doing with configuration. Rather than everybody going off in their own ways, it gave them freedom and flexibility, but everybody comes through that same standardization process in the background, without changing tools. They're not having to leave the environments they like to work in. But everybody ends up going through that same pipeline, and the quality that gets put through there is higher.
If an incident happens, it happens. Mistakes happen. It's the world that we live in. But we've given them a platform to prevent that mistake from happening again.
Q&A
Those are two stories we've got around configuration. We've fixed a huge number of other problems in this space. Hopefully, it's been interesting. If you've got any questions, now is a great time. Really appreciate you joining the session. Feel free to follow up with me or Eric after.