GitOps & SLO-Driven Automation Driving Faster and Better Releases!
Did you know that 90% of DevOps & SRE automation code that is currently developed must be rewritten or thrown away within the next 12 months? It’s because most automation frameworks have GitOps and SLO-driven orchestration as an afterthought resulting in high levels of code duplication and technical debt. This is especially true when implementing use cases like automatic monitoring configuration, deployment validation or SLO management.
If you are a DevOps, SRE, Performance or Automation Engineer then join this session where Andreas (Andi) Grabner, core maintainer of the Keptn CNCF Open-Source project, will show you how early adopters of the new data-driven cloud automation have increased speed of app delivery by 75%, improved deployment quality by 50%, and helped scale DevOps beyond lighthouse projects. And, do all of this without developing and maintaining custom pipelines or DevOps tool integrations!
This session is presented by Dynatrace.
Full transcript
Andreas Grabner
Hi, DevOps Enterprise Summit. My name is Andi Grabner, and I'm really pleased to be here with you, at least virtually, and give you some insights on a topic that I care about a lot. I'm here in my kitchen, but now let's go and start sharing the screen, because this is what I really want to show you.
Today's topic is GitOps and SLO-driven automation, driving faster and better releases. I'm Andi Grabner. I am a DevOps activist at Dynatrace, but also in DevRel for the open source project Keptn. I will talk a lot about Keptn today, and therefore, if you're interested in learning more, please make sure to check out some of the links, follow us on Twitter, star us on GitHub, or join a Slack conversation. We are a CNCF project and really want to make the life of DevOps engineers and SREs easier.
All right. Let me get started. I want to kick it off with a little, let's say, breaking news alert, because I'm pretty sure all of you are trying to get better in automating your delivery and automating your operations. You're moving to the cloud, you're moving to Kubernetes. Yet just moving to these new technologies doesn't necessarily give you all the stuff you need, right? Just moving to Kubernetes doesn't give you resiliency as a service. So that means we may all need to prepare for situations where systems go down, where systems don't act as expected, especially systems that we may not even control.
In our case, at Dynatrace, we also run our systems in the cloud. And thanks to the way we are embracing DevOps, we're embracing SRE, we're embracing automation, we're leveraging observability, we were able to withstand a four-hour AWS EC2 outage in the Frankfurt region with zero impact for customers. I really love that Thomas Reisenbichler shared this story with me. He is leading our ACE team, the Autonomous Cloud Enablement team, which is responsible for running, operating, and deploying our software for our SaaS and managed customers. If you want to read more, there's a blog. But really, this is a sharing session: I want to tell you how we are doing things internally, how that shaped what we've been giving back to the open source world as part of Keptn, and how we then also enable our Dynatrace customers to become better in DevOps and SRE practices.
First of all, to kind of remind ourselves, I'm sure Gene and others have been talking about this for many, many years. We as SREs and DevOps need to deliver faster and better. We're measured against different dimensions of metrics. I think DevOps, on the one hand, is using automation to speed up delivery. We are measured against some of the DORA metrics like deployment frequency or lead time for changes, so speeding up. On the other side, we have SREs, or however you call them in your organization. I see them emerging out of operations, where they are now using automation to ensure resiliency of the environments they are responsible for, measured against things like change failure rate or time to restore service in case something eventually goes wrong.
Okay, so speeding up delivery and also ensuring resiliency, both heavily relying on automation. And I think one of the things that connects them together are SLOs, service level objectives, because in the end, whatever we do, however often we deploy, or whatever we do in production, we also always want to make sure that our services are available to our end users, to our business stakeholders, based on what we have agreed to deliver. These are then our SLAs, service level agreements, but we often now measure them as SLOs, service level objectives.
So we need to do a lot of things to get there. I think DevOps and SREs must automate many tasks through their pipeline, through their automation scripts. I just highlighted a couple here, and I'm pretty sure they are definitely not complete. Whether it's about automated testing, automating security scans, automating your monitoring observability, adding notifications to it, doing more around what they call zero downtime deployments, whether it's blue-green, canary. All these things we, as SREs and DevOps, need to figure out how to automate into our pipelines.
Now, as we all know, there is no shortage of what I call do-it-yourself Swiss Army knife tools or scripting. Take one of my favorite tools, Jenkins. I can execute tests with Jenkins. I can add my test result analysis. I can add notifications to notify people about the results. I can integrate with my APM, with my observability platform. I can add an approval process. I can add chaos engineering, which is, I think, an emerging practice. I can add security scans, spread the whole thing across multiple stages, and then also add these zero-downtime deployments, right? So nothing keeps me from doing this with the tools we have available by doing a lot of scripting with these tools.
The thing, though, is if we do it with the existing tools and if we are really proficient with writing our automation scripts, then we may end up like Christian Heckelmann, a senior DevOps engineer who is responsible for almost 1,000 CI/CD pipelines. He's constantly reacting to, "Pipeline broken, please fix." Now, why is that? Because his pipelines that he built, his automation scripts that deploy tests and then just do some evaluation, ended up being very complex. This is just one of his pipelines with more than 1,000 lines of code, more complex than some of the microservices he's deploying and testing with it. And well, he started a while ago and that escalated pretty quickly, right? Because it's very hard to maintain these pipelines.
The next example is from Dieter, one of my colleagues at Dynatrace. He's responsible for kind of the new cloud-native workloads. He and his team also started using Jenkins pipelines. Again, well-known system. We know we can do magic things with it. One service was onboarded, more services got onboarded. They needed their little specific things, like a different testing tool, a different type of notification, a different metric to pull in for the evaluation. And this thing kind of exploded into what we call the snowflake effect: many different permutations of these pipelines.
Dieter also then did the analysis of actual code duplication across our different automation scripts we use for deployment and keeping things in production. And there we saw we are victim to the same thing that the engineers that write business code are always falling victim to, which is high technical debt, high code duplication, high code complexity.
So we thought, how can we solve this? Because we, as DevOps and SRE, need to automate more, but we shouldn't be drowning in the automation scripts. We shouldn't need to individually maintain the tool integrations that make up the bulk of these scripts, right? So what can we do to help the industry? And this is where Keptn comes in.
Oh, Keptn, my Keptn. You can go to keptn.sh or to GitHub and Slack to find out more what we do. We really want to make automation easier. Automation for DevOps and SREs, not just delivery automation, but also automation for operations.
So how did we try to solve the problem? What was, again, the problem we saw? If we look at your classical automation script, whatever tool that is, you have hard-coded steps where you may prepare your system for monitoring. You then deploy, calling your deployment tool of choice, even sometimes having your Helm scripts, your manifests, whatever it is, hard-coded in the pipeline. You then run your tests. Again, you have a hard-coded integration between the pipeline tool and the testing. Then you do some evaluation, right? You're pulling back the log file from the testing tool. You may pull in some data from your monitoring tool through its API, and then you try to figure out, is it a good or not a good build? And then you are either promoting it to the next stage by calling another tool, or sending notifications.
Again, the challenge with this is, as we've seen, there's a lot of hard-coding integration between the process and the tooling, and there's also often configuration in these pipelines, whether it's test scripts, YAML files, and/or the metrics you want to analyze. So what we thought with Keptn, we want to, first of all, remove these hard-coded dependencies, which means we said, let's get rid of this whole thing that combines everything, process and tooling. Move the tooling to the right. On the left side, just keep the definition of tasks you want to automate as part of a sequence. So for instance, prepare, deploy, test, evaluation, promote, and on the right, you have some tools or capabilities that can then fulfill certain activities.
Now, the configuration, there should not be any tool-specific configuration in your pipeline definition or in your automation sequence definition. This all moves to the right to Git. And then we're using eventing to connect everything together, because basically we broke process and tooling, and now we have loosely coupled process and tooling definition. They're connected through eventing, just as we do in normal software engineering.
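The separation described above, a sequence of task names on one side and tools that subscribe to task events on the other, can be sketched in a few lines. This is only an in-process illustration under stated assumptions: `EventBus`, `run_sequence`, and the three tool functions are hypothetical names, and Keptn itself distributes events between services rather than calling handlers directly.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical in-process event bus; it only illustrates the decoupling and
# does not reproduce how Keptn actually transports events.
Handler = Callable[[dict], dict]


class EventBus:
    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, task: str, handler: Handler) -> None:
        """Register a tool for a task name such as 'deploy' or 'test'."""
        self._subscribers[task].append(handler)

    def publish(self, task: str, payload: dict) -> List[dict]:
        """Send the task event to every subscribed tool and collect replies."""
        return [handler(payload) for handler in self._subscribers[task]]


def run_sequence(bus: EventBus, tasks: List[str], payload: dict) -> List[dict]:
    """Walk the sequence in order; the process knows task names, not tools."""
    results = []
    for task in tasks:
        for reply in bus.publish(task, payload):
            results.append({"task": task, **reply})
    return results


# Illustrative tools: each one only knows how to handle its own task.
def helm_deploy(event: dict) -> dict:
    return {"tool": "helm", "status": "succeeded", "image": event["image"]}


def slack_notify(event: dict) -> dict:
    return {"tool": "slack", "status": "succeeded"}


def load_test(event: dict) -> dict:
    return {"tool": "loadtest", "status": "succeeded"}


bus = EventBus()
bus.subscribe("deploy", helm_deploy)
bus.subscribe("deploy", slack_notify)  # several tools may react to one task
bus.subscribe("test", load_test)

results = run_sequence(
    bus, ["deploy", "test"], {"image": "shop:1.2.0", "stage": "staging"}
)
```

Swapping the deployment tool means changing one `subscribe` call; the sequence definition itself stays untouched, which is the point of the design.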
All right. So to give an example, I've got three examples. The first: automate performance sequences in staging, a very common use case. In Keptn, you would define a sequence of deploy, run a test, and then evaluation. Now, evaluation is highlighted here with a special color because SLO evaluation is core to what we do. It's part of every workflow. So now you have a model for how you want to deploy and test, and you can provide this to, let's say, your engineers, and an engineer can simply say, "Keptn, I want to trigger the performance sequence in staging, and here's some additional information like the image that I want you to deploy." So what Keptn does: it starts with sending out an event, because it knows the first step is deploy. It sends out a so-called CloudEvent with the information about deploy: what image, which stage, and maybe some additional information like, "Hey, this should be a blue-green deployment." And then you may have one or multiple tools subscribing to this event. One tool would obviously be the deployment tool. This could be Helm. This could be an existing Jenkins pipeline, a GitHub pipeline, anything that you want to use to deploy an image with blue-green in staging. Additionally, we can have multiple tools subscribed to these events. For example, the notification tool is also subscribing to it, because maybe you want your Slack channel to be updated whenever a deployment happens.
So this was deploy. Then test happens. Keptn triggers a test event, which is picked up by the testing tool. Once the testing tool is done, it sends the result back. By the way, the deployment tool and the testing tool, they all get their configuration from where? From the Git repository that is completely managed by our system. And then the last step is the evaluation, right? Evaluation means I want to pull in metrics from a monitoring tool, from a testing tool, from any type of tool. I want to get these details and then calculate whether everything is good or not good, depending on your SLO definition. And again, here, notifications might be interesting. You want to notify people in the end.
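The evaluation step can be pictured as comparing measured SLI values against objectives. The sketch below is a minimal illustration, assuming a made-up tuple format for objectives and made-up metric names; it is not Keptn's SLO file format.

```python
import operator
from typing import Dict, List, Tuple

# Comparison operators an objective may use. The criteria syntax is invented
# for this sketch and is not Keptn's SLO file format.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Hypothetical objective shape: (SLI name, operator, threshold).
Objective = Tuple[str, str, float]


def evaluate(slis: Dict[str, float], objectives: List[Objective]) -> Dict[str, bool]:
    """Return pass/fail per objective; a missing SLI counts as a failure."""
    outcome = {}
    for name, op, threshold in objectives:
        value = slis.get(name)
        outcome[name] = value is not None and OPS[op](value, threshold)
    return outcome


def is_good(outcome: Dict[str, bool]) -> bool:
    """The build is 'good' only if every objective passed."""
    return all(outcome.values())


objectives = [
    ("response_time_p95_ms", "<=", 500.0),
    ("error_rate_percent", "<", 1.0),
]
# Values as they might come back from a monitoring tool after the test run.
measured = {"response_time_p95_ms": 430.0, "error_rate_percent": 2.5}

outcome = evaluate(measured, objectives)
```

Here the response time objective is met but the error rate is not, so `is_good(outcome)` is false and the sequence would not promote the build.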
So to give you some terminology: we call the process definition, or let's say the definition of automation tasks and sequences, a shipyard. We call the tools or capabilities that take part in that workflow, that subscribe to these events, Keptn services, and they're part of Keptn's uniform. So that's kind of the uniform that the Keptn is wearing. What else do we have? The configuration. We have configs in a config repo. And we've got CloudEvents. This is a standard, and it's something we're also currently driving with the CDF: task-specific metadata is sent by the process orchestrator to the individual tool that then picks it up.
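To make the event idea concrete, here is a minimal sketch of a CloudEvents 1.0 envelope carrying task-specific metadata. The four required context attributes (`specversion`, `id`, `source`, `type`) come from the CloudEvents specification; the event type string, the source, and every field inside `data` are assumptions for illustration, not Keptn's actual event schema.

```python
import json
import uuid
from datetime import datetime, timezone


def make_cloud_event(event_type: str, source: str, data: dict) -> dict:
    """Build a CloudEvents 1.0 envelope in JSON-friendly form."""
    return {
        # Required context attributes in CloudEvents 1.0.
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        # Optional context attributes.
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        # Task-specific metadata; these field names are assumptions.
        "data": data,
    }


event = make_cloud_event(
    event_type="example.task.deployment.triggered",  # hypothetical type name
    source="example-orchestrator",                   # hypothetical source
    data={
        "project": "shop",
        "stage": "staging",
        "image": "shop:1.2.0",
        "deploymentStrategy": "blue_green",
    },
)
print(json.dumps(event, indent=2))
```

A subscribing tool only needs to understand the `type` it cares about and the `data` fields for that task, which is what keeps the orchestrator and the tools loosely coupled.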
So this was automate performance sequences. Now, let me give you a little more complex one: automate canary rollouts. First of all, the sequence has changed. This is now for production, and we want to do a canary rollout. We have some additional tasks, like prepare and release. On the right side, I removed some of the tools. We may stick with the same monitoring tool, but now we may have a different deployment tool that can do the canaries, and maybe we have a different notification tool for production because different teams are interested in it. But in the end, it's the same concept. You say, "Keptn, trigger a certain sequence," in this case, canary rollout for a particular stage with particular metadata. Same thing happens again. Deployment event is sent out. Now, maybe the deployment tool, D2, that can do the canaries and that has been kind of officially assigned for production deployments is now kicking in.
What about SLO evaluation? Same thing. The SLO evaluation pulls data from the monitoring tool, and then at the end, depending on the result, we may send the release event, where the deployment tool will then say, "Okay, now we are scaling the canaries up to 50%, to 100%." So you can see here, it's, again, just a process on the left to define your automation sequences, and then you have different tools on the right that participate in that workflow by subscribing to these events. And they are getting their real configuration from the Git repository that is also organized in stages: staging, production, whatever you have.
Last example, because we want to make sure that you understand this is not just another automation tool for delivery: production remediation. So here in production remediation, I may have some additional tools, right? When problems come in from your monitoring tool, for instance, you may have some infrastructure automation, some ticketing system. So if your production monitoring system finds a problem, it can say, "Keptn, trigger remediation," or it sends an event over to Keptn and says, "I found a high failure rate problem in staging, and the root cause seems to be a log disk latency issue."
Now, on the Keptn side, you can specify a so-called remediation sequence, but most importantly for remediation, we have a special concept where you can also specify remediation actions in a so-called remediation YAML file: actions that you can specify per root cause. And then what Keptn does, it takes the first action, like clean disk, and sends the event, same concept as before. The individual tools that participate and can deliver that action, like cleaning the disk, pick it up and send back the status once it's done. Additionally, you may also want to send everything that is done by automation to a notification tool, again, as shown here.
Now, most importantly, after every action, Keptn again reaches out to the monitoring tool and does the evaluation. If it's good, thank you: process done, problem solved, everything good, no human interaction needed. If it's not good, if the evaluation fails and the system is still down, then Keptn executes the next action, like a rollback. Same concept: the event is sent, tools are pulled in, and so on and so forth. So after every remediation action, if it's good, everything is fine. If it's not good, it continues. And the last step could potentially be, well, let's escalate this whole thing, right? Maybe in this case, we'll create a ticket so that somebody can really follow up.
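The remediation loop just described (run an action, re-evaluate, move on to the next action, escalate when none helped) can be sketched as follows. The root cause key, the action names, and the callback signatures are all hypothetical; in Keptn the actions would come from the remediation file and be delivered through events.

```python
from typing import Callable, Dict, List

# Hypothetical mapping from root cause to an ordered list of action names.
REMEDIATIONS: Dict[str, List[str]] = {
    "disk_latency": ["clean_disk", "rollback"],
}


def remediate(
    root_cause: str,
    run_action: Callable[[str], None],
    is_healthy: Callable[[], bool],
    escalate: Callable[[str], None],
) -> str:
    """Try each action in order, re-evaluating after every one.

    Returns the name of the action that fixed the problem, or 'escalated'.
    """
    for action in REMEDIATIONS.get(root_cause, []):
        run_action(action)   # e.g. publish an event that a tool subscribes to
        if is_healthy():     # e.g. SLO evaluation against the monitoring tool
            return action
    escalate(root_cause)     # e.g. open a ticket so a human can follow up
    return "escalated"


# Simulated environment: the system only recovers after a rollback.
executed: List[str] = []
tickets: List[str] = []


def run_action(action: str) -> None:
    executed.append(action)


def is_healthy() -> bool:
    return "rollback" in executed


result = remediate("disk_latency", run_action, is_healthy, tickets.append)
```

In this simulated run, cleaning the disk does not help, the rollback does, and no ticket is created. A root cause with no configured actions goes straight to escalation.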
So three examples. It was the first one on test automation, the second one was on canary deployments, and now on auto remediation. Why are we doing this? Again, remember, I started off with the problem statement that many of you are building automation with your existing tools, which are perfectly fine, but don't build too complex automation scripts that are maybe more complex than the microservices they deploy.
We want to help you with reduced complexity. With Keptn, we have seen 90% less automation code with a clear separation between process and tooling. GitOps is ingrained. All the configuration is in Git. Every time you make changes, you can trigger a new process. SLOs are core to the whole thing. We'll show you in a second. And most importantly, Keptn is not replacing the tools that you have made investments in. Keptn is connecting them to really automate sequences for delivery and operational purposes.
So this is typically the moment when I am on stage and say, "Please now take a picture." Because, again, the left side is kind of where most people are currently heading by building their own automation. The right side is what I want. What I always say is: friends don't let friends build their own automation, right? Friends suggest to their friends, "Please have a look at Keptn first and how Keptn can take away a lot of the automation pain."
Most important, again, is to leverage your existing tools. Take your tools that are maybe already deploying, then just trigger a Keptn sequence that does a test and evaluation. You decide how to bring in automation with Keptn. Very important, Keptn really, really, really does a great job in orchestrating all these tools. And in the center, there's always the SLO evaluation. That means after every sequence, after every task, Keptn can reach out to your observability platform and say, "Are we good to go or are we not good to go? What do we do?"
Cool. Now, I typically get the question, how do people get started? Most adopters start with integrating the SLO validation, which is kind of the simplest process that you can have with Keptn, into their existing pipelines. The typical use case is that a lot of our users already have pipelines. They already do some deployment, some testing, but then they manually sit there and validate: did the deployment actually happen? How did the test execution go? So they build dashboards in the most popular observability platforms. This is an example from Dynatrace. But manually looking at these dashboards, however beautiful they may look, takes a long, long time, and it takes people who need to be available to do the evaluation. And therefore, we say, let's automate that.
If you already have a dashboard you look at, that means you know which metrics, which SLIs are important, and what are the SLOs, what are you looking for? And then we can automate this completely with Keptn and bring this down to a fraction of the time.
One of the examples, and I want to give you some adoption examples, is Mike Kobush from NAIC. He's a performance engineer. You can see on the left, he's building these beautiful dashboards in Dynatrace, where after every test or while the test is actually running, he's looking at key performance metrics from his performance tests that are monitored by his observability platform. So he's building this dashboard, but he actually augmented it with SLO information. He's then checking this dashboard into Git, making it accessible to Keptn, and then Keptn is completely automatically analyzing the dashboard for him, giving him an easy-to-understand score in the end. So if you want to know more about this, watch the video. If you want to know more about the scoring, there's also more information out there on how we do SLI and SLO scoring. So one example: automating the validation of a deployment or a test.
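One way to arrive at a single easy-to-understand score is to weight each objective and express the points earned as a percentage. The sketch below assumes half credit for a warning and pass/warning thresholds of 90% and 75%; these numbers, the result labels, and the SLI names are illustrative assumptions and not a statement of how Keptn computes its score.

```python
from typing import Dict, List

# Points per objective result; treating a warning as half credit is an
# assumption of this sketch.
POINTS = {"pass": 1.0, "warning": 0.5, "fail": 0.0}


def total_score(results: List[Dict]) -> float:
    """Weighted score in percent across all objectives."""
    max_points = sum(r["weight"] for r in results)
    if max_points == 0:
        return 0.0
    earned = sum(r["weight"] * POINTS[r["result"]] for r in results)
    return 100.0 * earned / max_points


def verdict(score: float, pass_at: float = 90.0, warn_at: float = 75.0) -> str:
    """Map the score to an overall result using illustrative thresholds."""
    if score >= pass_at:
        return "pass"
    if score >= warn_at:
        return "warning"
    return "fail"


# Hypothetical per-objective results, e.g. derived from dashboard tiles.
results = [
    {"sli": "response_time_p95", "weight": 2, "result": "pass"},
    {"sli": "error_rate", "weight": 1, "result": "warning"},
    {"sli": "throughput", "weight": 1, "result": "pass"},
]

score = total_score(results)
```

With these inputs, 3.5 of 4 possible points are earned, so `score` is 87.5 and `verdict(score)` returns `"warning"`. Weights let a team say that one SLI matters more than another without changing the evaluation logic.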
Another great adoption example: Raiffeisen Software. They're responsible for Austrian online banking. In case you wonder where my accent comes from, I'm also Austrian. We work closely with them. They are triggering this from their Jenkins pipeline, where Jenkins is doing the deployment into the UAT environments and then running some tests. So then, however, they go off, define relevant SLOs that should be analyzed fully automatically, and then Keptn in the end calculates a complete total score. And as I said, everything is triggered in there from Jenkins with the link back between the tools to navigate easily. Most important thing is, if everything is green, nobody needs to look at the data anymore. Keptn provides the release validation recommendation.
Last example: we also see a lot of integrations with other CI/CD tools like Azure DevOps. We have a great partner in Realdolmen. They built an Azure DevOps integration, same thing as we've seen before. That means people are using the automated SLO-based evaluation for deployments. And with a Belgian government agency, they were able to speed up delivery and especially reduce manual work.
So some of these examples. Now, we also have some great testimonials from people like Tarush, performance engineer at Facebook. It's great to see comments like this on LinkedIn: "Keptn feels like a reference implementation of Google's Site Reliability Engineering and the Site Reliability Workbook." Really great testimonial for somebody that really knows SLOs, SREs, performance, and automation very well.
So to kind of close it up, I talked a lot about Keptn. It's our open source project that we want to contribute back to the world. And the nice thing is, whoever you are, a DevOps, an SRE, you pick your use case that you want to automate. You can start with just the quality gate or the SLO evaluation. You can go all the way into auto remediation. Every use case needs some configuration for your specific tools. Most importantly, you connect your tools. We're not replacing your tools, we connect them. We are giving you the chance to not build and maintain your own tool integrations and orchestration.
Therefore, Keptn automates monitoring, delivery, reliability, and remediation. We have our UI, we have our API, so we can automate everything. Most importantly, everything happens through event-driven orchestration, as you saw. SLOs are at the core. We always evaluate. All the configuration is in Git. Everything is declarative. And on the bottom right, standards, very important. We're working with the CNCF, but also with the CDF to standardize all the events we use for tool integration.
Now, I talked a lot about Keptn. I also want to say a big thank you to Dynatrace, which makes all of this possible. Remember, in the very beginning, I started off with: we're all DevOps and SREs. Even if, as we move to the cloud, we believe that cloud technology automatically makes us more resilient, that's not the case. You need to invest a lot, like in the example I brought, where we were able to withstand a data center outage. Now, we are using Dynatrace to monitor our systems, but we also use automation and, more and more, the stuff from Keptn, which we also bring to our customers. Because we want to help them speed up delivery, and we've seen from our examples that this can be by 75%.
Improved quality, very important. You have to be confident in the stuff that you deploy. Therefore, we are enforcing SLOs, not just in production, but also as part of delivery automation. With the data, bringing it all together will increase collaboration between the DevOps and the SRE teams, so the DevSecOps teams. And thanks to all the automation, thanks to the self-service it enables, this can be scaled enterprise-wide.
So hopefully, you liked what you saw, and there were certain things in there for you. If you want to follow up later on, then here again are all the details: how to get in touch with me, but most importantly, how to have a look at our open source project and everything we do on the Dynatrace side. Thank you so much. As I said, I would love to be there with you face-to-face. I'm sure it's happening next week. Not next week, but maybe next year. Okay. Bye bye. See you.