Observability With Intelligent Trustworthy Actions

Las Vegas 2023

To operationalize IT automation for business outcomes at scale, Product Owners, Developers, DevOps, SRE, CloudOps, and ITOps teams must have access to the same data in the context that matters to them. Stakeholders need to work across teams to ensure application performance / SLOs are met continuously and build trust with automation in order to achieve elasticity at scale.

Join us for this session to learn how to take intelligent, trustworthy actions using IBM Instana and Turbonomic: real-time observability that everyone can use, and a hybrid cloud cost optimization solution you can safely automate to unlock elasticity without compromising performance. Teams can build trust with automation to scale more efficiently and build, deploy, and manage applications and services anytime, anywhere. Automate intelligent actions using observability and cloud cost optimization solutions.

Full transcript

The complete talk, organized by section.

Odera Nweke

Good morning. Good morning. Good morning.

Thank you.

DevOps engineers, IT managers, good morning to you all. Happy to be in Vegas, right? Right? No, guys? We're day one. We're day one, right? Happy to be in Vegas. No? Yes? Okay. Great. Great, great, great, great.

My pleasure to be in front of you guys today. My name is Odera Nweke. I'm with IBM. I'm sure you guys have heard of that company before? I hope. I hope.

Today we're going to talk about taking intelligent actions using two IBM-branded products, one being Instana and two being Turbonomic. Has anybody heard of either? Got one in the back, or a few. Any users? Oh, we have a user. Yes. Who are you with? Discover. Yes. Yes. Turbonomic or both? I love that. I love that. We're going to lean on him for everything I say. I want everyone to look at him.

Amazing. Amazing.

So, to get things started, the number one takeaway that I want everyone to have from this discussion is about managing our applications while saving as much as possible, right? When it comes to operating these tools and managing our overall business applications, whenever something goes awry, a lot of times we're talking about, "Hey, let's give it more resources," right?

How many times do we hear that? "Well, you know, our application, one of our servers, our JVM is running a little slow. Let's give it some more memory," right?

Oftentimes we don't know if that's the right answer. Is it usually the right answer? It's pretty hard to tell.

So, that's the main issue, the main problem that we're looking to solve today: the IT complexity. As complexity grows, managing those services, managing how we handle cloud spend, how we manage our resources on-prem, whether we're in multiple cloud vendors, right? Are we looking at our distributed side? Are these mainframes? Are we managing these Windows, Red Hat, VMware, right? How are we actually going to handle managing performance while also managing resources as well?

So we want to give our application the resources it needs in real time without wasting, while also understanding what impact that has on the individual services that make up our top-level business applications. So we understand our cloud spend is out of control, right? We want to speed up how we're able to deploy applications, deploy code, but we need insights. Our apps aren't performing. Why? If you know that answer, you're probably in the wrong session.

So why do we need this one? Again, apps are not easy to observe as a whole. We're all familiar with observability solutions. I come from a sports background. I used to play basketball in college. I like to think of observability kind of like how you look at sports and they have stats, right?

If you're looking at football, we're looking at all of the individual stats. You log on to ESPN, I want to know exactly how a quarterback is performing. I'm going to look at every single stat. That doesn't really tell you much.

I want to know, in context, what happens when something goes awry. A quarterback gets hurt. What's that impact on our team? That's the type of insights we need when we're looking at observability solutions.

So we need context. Context is going to be key. If a service goes down, our Spring Boot application service goes down, what happens? What's that impact on our overall business application?

Two, you want to be able to adopt automation where we see fit. So what if that answer is it needs more resources, right? We need an automation engine that can actually drive those resources in real time. We're not going to automate everything at once. We're not going to press the button and everything just goes. No. We want you to actually take a crawl, walk, run approach to enabling automation.

Am I lying yet? No, right? Okay. Okay, cool.

So basically, like I was saying a little bit before, we want to make sure that that application continuously performs at the lowest cost possible. How we're doing this is with two tools.

One is going to be Instana. Instana is going to be our enterprise observability solution with IBM, the way IBM attacks observability, especially for those microservices-built applications, right? Those environments that are using any Kubernetes, any OpenShift. We want to make sure that you're able to manage performance on those overall applications.

And underneath that, this is all built and controlled by the dependency mapping that builds up, and that's what stitches all of the individual entities together, right? So whenever you deploy, and this is going to be agent-based solutions, so we're understanding that you guys are familiar with probably Datadog, AppDynamics, all these other APM solutions, right? You're deploying a single agent or you're deploying an agent to pull in these metrics on that overall host.

But what makes Instana different is that there's only going to be a need for one agent. There's going to be one single host agent. That agent is going to automatically deploy these mini-sensors that give that auto-contextualization to your overall environment.

So if I want to know what's living on these individual servers, I get that in real time. And at a one-second metric granularity, we're tracing 100% of the transactions that pass through your environment. You'll have a distributed trace for every single transaction while keeping overhead low and auto-contextualizing your environment.

So when a service goes down, I can easily understand what that impact has on our application.
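As a rough illustration of the impact question above, here is a minimal Python sketch that walks a dependency map upward from a failed entity to find everything that depends on it. The entity names and the `DEPENDS_ON` map are hypothetical, and this is not how Instana represents its dependency map internally.

```python
from collections import defaultdict, deque

# Hypothetical dependency edges: each key depends on the entities in its list.
DEPENDS_ON = {
    "checkout-app": ["cart-service", "payment-service"],
    "catalog-app": ["search-service"],
    "cart-service": ["mysql-db"],
    "payment-service": ["mysql-db"],
    "search-service": [],
    "mysql-db": [],
}

def impacted_by(failed, depends_on):
    """Return every entity that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency up to its consumers.
    consumers = defaultdict(set)
    for entity, deps in depends_on.items():
        for dep in deps:
            consumers[dep].add(entity)

    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in consumers[current]:
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted

print(sorted(impacted_by("mysql-db", DEPENDS_ON)))
# ['cart-service', 'checkout-app', 'payment-service']
```

In this toy map, a database failure reaches the checkout application through two services, while the catalog application is untouched.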

That make sense so far? I still need verbal. Yeah. Yeah. It's Monday. I mean, it's day one, not Monday, but it's day one. You guys still be tired.

So with Turbo, think of Turbo as the automation engine that attaches to that overall business application, or your overall entire, not business application alone, but your overall IT environment, your IT stack.

So how Turbo works, and I'm going to actually go into how Turbo works a little bit, but what Turbo's going to be is that overall automation, those automatable decisions that you can make to increase RAM or CPU, or pull back when we need to. We want to be fully elastic, no matter whether you're running in the cloud, on-premises, or even in a hybrid approach.

So again, when it comes to Instana, an observability solution, what happens is once you deploy that agent, you're going to have that auto-contextualization. And with that, you can collect all that accurate data, those statistics that I was speaking about a little bit before. But now, whenever I'm looking at our golden signals and I see high latency, right, a decrease in throughput, well, what's that impact on my application? What is the end user experiencing? I can get that full understanding.
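To make the golden signals mentioned above concrete, here is a small Python sketch that rolls one-second samples up into latency, throughput, and error rate. The sample format and field names are assumptions for illustration, and error rate is included as a commonly used third signal even though the talk names only latency and throughput.

```python
def golden_signals(samples):
    """Roll up per-second samples into three summary signals.

    Each sample covers one second and holds: calls, errors, and latency_ms
    (the mean latency of the calls in that second).
    """
    total_calls = sum(s["calls"] for s in samples)
    total_errors = sum(s["errors"] for s in samples)
    if total_calls == 0:
        return {"latency_ms": 0.0, "throughput_per_s": 0.0, "error_rate": 0.0}
    # Weight each second's latency by its call count so busy seconds count more.
    weighted_latency = sum(s["latency_ms"] * s["calls"] for s in samples)
    return {
        "latency_ms": weighted_latency / total_calls,
        "throughput_per_s": total_calls / len(samples),
        "error_rate": total_errors / total_calls,
    }

# Two hypothetical one-second samples for a single service.
window = [
    {"calls": 100, "errors": 0, "latency_ms": 20.0},
    {"calls": 50, "errors": 5, "latency_ms": 80.0},
]
signals = golden_signals(window)
print(signals["latency_ms"], signals["throughput_per_s"])  # 40.0 75.0
```

The second sample shows the pattern described in the talk: latency rises while throughput falls, which is the cue to ask what the end user is experiencing.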

We can monitor synthetics. We can pull in metrics, right? Metrics, understanding 100% what's going on, and then being able to take that intelligent action to prevent issues from arising in the future.

Because we're collecting metrics at a one-second granularity, the MTTR is going to decrease: we're getting an understanding of what that overall impact is, and we're shooting that issue over.

Let's say you get a Slack message on a database server, right? I want to know when that message gets sent over to the DBA, they should have an understanding of what all led up to that to cause that individual issue to arise, right? What ended up setting off our MySQL database? We want to provide an automated distributed trace that can actually give full context to that DBA to go in and fix that issue. That would decrease that MTTR that I was speaking to right before.

Now, when it comes to Turbonomic, this is going to be the engine underneath the covers. What Turbo does is it uses a supply chain mechanism to understand what that full resourcing looks like within your environment.

So what happens is you stand up Turbonomic. Mind you, by the way, both of these can be deployed in a SaaS environment or they can be deployed on-prem. Let's say we take SaaS, for example. We spin up a SaaS environment. What happens is Turbo's going to send bidirectional API calls to all of the individual entities that you're using today to manage your environment.

So what that means is Turbo is going to allow you to connect to your APM solution. In this case, we're looking at Instana, what we work with the best with, of course, being IBM. We're going to tie into your MySQL databases. We're going to tie into your Oracle databases. We're going to tie into your Google Cloud environment, your VMware environment, your Windows environment, your Red Hat environment, your storage environment, to get an overall understanding.

All we're looking to do is see a supply chain for that application from the top down. I say top down because if I'm looking bottom up, I'm not making the proper decision on that overall business application. Because if I see that a VM needs resources, but I'm looking bottom up at the hardware, that physical host, I don't really understand whether that VM is living on the proper host, right?

So what Turbo's going to say is, well, we're looking from the top down: application first, down to the middleware components. So those services that make up that business application. If it's living in some sort of Kubernetes environment, we're going to look at that entire cloud native environment all the way down to the physical host and the individual cloud VMs.

And what's going to happen is, let's say you have a service that needs more resources. So Instana tells Turbo, "Hey, our service is actually struggling. That makes up this business application."

What Turbo says is, "Well, where does that service get its resources from?" Right? Is this a hybrid application?
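The top-down question, "where does that service get its resources from?", can be sketched in Python as a walk down a supply chain. The `SUPPLY_CHAIN` map and its entity names are hypothetical, and the sketch assumes the chain has no cycles; it is not Turbonomic's actual data model.

```python
# Hypothetical supply chain: each entity draws resources from the providers listed.
SUPPLY_CHAIN = {
    "checkout-app": ["payment-service"],
    "payment-service": ["payment-pod"],
    "payment-pod": ["k8s-node-1"],
    "k8s-node-1": ["gcp-vm-a"],
    "gcp-vm-a": [],
}

def resource_chain(entity, supply_chain):
    """Depth-first, top-down list of everything `entity` draws resources from."""
    chain = []
    for provider in supply_chain.get(entity, []):
        chain.append(provider)
        chain.extend(resource_chain(provider, supply_chain))
    return chain

print(resource_chain("payment-service", SUPPLY_CHAIN))
# ['payment-pod', 'k8s-node-1', 'gcp-vm-a']
```

Starting from the struggling service rather than from the hardware means the walk ends at the specific VM that backs that service, which is where a sizing decision would apply.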

So Turbo makes that decision to say, "Okay, well, it's getting some of its resources from a Google VM, and we're going to be able to tell you here's what that would cost to give it more resources." And you can actually automate that decision to possibly size up for compute, size down, while understanding what that full impact is on that overall business application.
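As a worked example of "here's what that would cost," this Python sketch computes the monthly cost difference of moving a VM between sizes. The size names and hourly prices are made up for illustration; real prices depend on provider, region, and instance family.

```python
HOURS_PER_MONTH = 730  # common approximation used for monthly cloud estimates

# Hypothetical hourly prices in dollars; not taken from any real price list.
HOURLY_PRICE = {"small": 0.05, "medium": 0.10, "large": 0.20}

def monthly_cost_delta(current, target, prices=HOURLY_PRICE):
    """Dollars per month added (positive) or saved (negative) by resizing."""
    return round((prices[target] - prices[current]) * HOURS_PER_MONTH, 2)

print(monthly_cost_delta("medium", "large"))  # 73.0
print(monthly_cost_delta("large", "small"))   # -109.5
```

Attaching a number like this to each sizing action is what lets a team weigh the performance benefit against the spend before automating it.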

The same for on-prem. If we're looking at a business application where a service is getting its resources from a VM that is virtualized using VMware, I want to know if the underlying physical host has the proper resourcing to provide to that virtual machine.

So Turbo's going to say, "Well, one, does the supply match the demand? And if not, let's put it to a state where the supply does match the demand. So let's do a vMotion to automate, and then let's size up that virtual machine to help out that application service that's being hurt currently in your environment."
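The "does the supply match the demand" check, followed by a move and a size-up, can be sketched as a small planning function in Python. The `Host` type, the names, and the rule of picking the host with the most free memory are assumptions for illustration only; this is not Turbonomic's placement logic.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    free_mem_gb: int

def plan_resize(vm_name, vm_mem_gb, extra_mem_gb, current_host, hosts):
    """Return the ordered actions needed to give the VM `extra_mem_gb` more memory."""
    # Supply already meets demand: resize in place.
    if current_host.free_mem_gb >= extra_mem_gb:
        return [f"resize {vm_name} by +{extra_mem_gb} GB on {current_host.name}"]
    # Otherwise find a host that can hold the VM at its new size.
    needed = vm_mem_gb + extra_mem_gb
    candidates = [
        h for h in hosts
        if h.name != current_host.name and h.free_mem_gb >= needed
    ]
    if not candidates:
        return []  # no host can satisfy the demand; more capacity is required
    target = max(candidates, key=lambda h: h.free_mem_gb)
    return [
        f"move {vm_name} from {current_host.name} to {target.name}",
        f"resize {vm_name} by +{extra_mem_gb} GB on {target.name}",
    ]

hosts = [Host("esx-a", 2), Host("esx-b", 32), Host("esx-c", 10)]
for step in plan_resize("vm-payments", 8, 4, hosts[0], hosts):
    print(step)
# move vm-payments from esx-a to esx-b
# resize vm-payments by +4 GB on esx-b
```

The order matters: the move comes first so that the resize lands on a host that actually has the supply.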

So Turbo is under the covers managed by the supply chain, whereas Instana has that auto... The word starts with C.

You good? T-shirt. Woo.

So here's how it works. We're all used to so many alert storms when it comes to the observability solution. We're going to start at the application. So we're still looking from the top down.

So starting with Instana, one, we're going to make sure once we're deployed, let's say we're deploying SaaS, both of them. We're going to tie Instana into Turbo, meaning you're going to type in your Instana credentials into Turbo, your admin rights, excuse me.

And what happens is, whenever there's an event, which is how Instana handles alerts, right? We call them events within Instana. So you get an event that populates within Instana that says, "Hey, this service is in need of X amount of resources," or memory, or whatever that case may be, right?

You can't take that action within your observability solution. But what happens is, once you get that alert, you're going to have full contextualization within that alert. What all led up to this action to where we're saying, "Hey, we need resourcing on this specific service."

You're going to have a distributed trace that's automatically created for this alert, and then you're going to be able to actually take action within your observability solution, to where you can see in real time the impact on that overall business application.

So how it works will be when Instana sends a message to Turbonomic to create an action, Turbonomic is going to create that action and then automatically accept that action, assuming you have automation completely configured. This is not out-of-the-box configuration. But once that configuration is done, it's going to send that message to Turbo. Turbo's going to find out where it gets resources from.

So whether that's on-prem, hybrid, right? We're looking in the cloud. Is it a Kubernetes environment? And we're going to actually manage those resources within Turbonomic, which is all going to be done automatically. It's going to increase that capacity that it needs and send a message back to Instana to say, "Hey, this event has been taken care of."
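The event-to-action loop described above can be sketched in Python as follows. Every class, field, and mode name here is hypothetical; the sketch models the flow (event arrives, action is created, action is accepted only when automation is configured, event is marked resolved) and does not call any real Instana or Turbonomic API.

```python
from dataclasses import dataclass, field
from enum import Enum

class Mode(Enum):
    RECOMMEND = "recommend"  # only surface the action
    MANUAL = "manual"        # wait for a human to approve
    AUTOMATIC = "automatic"  # accept and execute without approval

@dataclass
class Event:
    event_id: str
    service: str
    resource: str  # for example "memory"

@dataclass
class AutomationEngine:
    mode: Mode
    executed: list = field(default_factory=list)
    resolved_events: list = field(default_factory=list)

    def handle(self, event):
        """Create an action for the event and execute it only in automatic mode."""
        action = f"increase {event.resource} for {event.service}"
        if self.mode is not Mode.AUTOMATIC:
            return f"pending: {action}"
        self.executed.append(action)
        # Stand-in for the message sent back to the observability side.
        self.resolved_events.append(event.event_id)
        return f"executed: {action}"

engine = AutomationEngine(Mode.AUTOMATIC)
print(engine.handle(Event("evt-1", "payment-service", "memory")))
# executed: increase memory for payment-service
```

Starting in a recommend or manual mode and switching to automatic only once the recommendations have earned trust is one way to stage the rollout of automation.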

And this is just going to be a slide of just telling you how it makes sense to be better together. Yes, you can use these in isolation. One does not have to come with the other, or whatever the case may be. Turbonomic is a standalone enterprise solution, and then we have on the other end Instana, a standalone enterprise solution, that combines to paint that picture that I was pretty much explaining a little bit before.

Where we have these actions created, that auto-contextualization of those individual services that make up your business application no matter where it lives, right? Especially in microservice environments where we know it's tough to manage. And then you have that automation engine with Turbonomic to make those resourcing decisions, to then message back and forth between Instana and Turbonomic to then lead to faster remediation when it comes to code issues.

But overall, let's understand that, for one, within the cloud, every resource is tied to an individual cost, right? Which leads to overspending, which happens to be the leader of some of these FinOps conversations that I'm sure a lot of you guys are a part of today, where we want to know what that cloud spend is. What does it take to actually assure performance for our overall business applications, and what impact does restoring performance for those applications have on those individual services, right?

So what we're doing here at the very top, you see all those little people, right? We're getting everyone on the same page, basically what we're saying here, right? So we have our developers, DevOps engineers, the SREs, the cloud ops people, your IT operations managers. Everyone should be on the same page because, one, we can understand what capacity we have or need. Is there a need for additional resources physically or in the cloud? What does that cost, right?

And then we need to figure out in real time how our applications are performing because our end users need these applications today. So that's what that's saying in that marriage together.

Now, let's say in some cases we can have it on the standalones where, "Hey, our DevOps team needs an observability solution because we need to understand what that full scope is and how it's performing in real time." That makes sense, and that's what that blue is going to represent.

On the opposite side, we have IT teams that are saying, "Hey, we want to take control of our cloud operations, understanding what our spending is. What does it take to actually increase additional resources, take away resources? Can we overall save money by decreasing all of these instances, these EBS volumes, these Kubernetes nodes that pretty much just have too much capacity?" Right? We're overspending, but we have no insights into it, so I can't really prove it.

That's what the red circles are going to represent, and in the middle, the better-together story that I'm speaking about today.

So here's one of the customers that currently has both Turbonomic and Instana. And this basically is a highlight of what we were able to accomplish by marrying the two, right? Leading to, in this example, if you look at the lower middle portion where we're saying there's a 10% reduction in memory and CPU overallocation, what does that cost look like within your cloud environment today for a 10% reduction?
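To put a rough number on that question, here is a back-of-the-envelope Python sketch. The monthly spend and the share of spend that is overallocated are invented inputs, and the formula assumes savings scale linearly with the reduction, which real cloud pricing may not do.

```python
def monthly_savings(monthly_spend, overallocated_share, reduction):
    """Estimate dollars saved per month by trimming part of the overallocated spend."""
    return round(monthly_spend * overallocated_share * reduction, 2)

# Hypothetical inputs: $100,000 monthly spend, 30% of it overallocated,
# and the 10% reduction in overallocation mentioned in the talk.
print(monthly_savings(100_000, 0.30, 0.10))  # 3000.0
```

Plugging in your own spend and overallocation figures gives a first estimate to bring to a FinOps conversation.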

How easy would it be to send off that message to some of the people that run your cloud: "Hey, you're overspending within your cloud. How are we going to control this? And will it pay for itself?" In a lot of cases what we see is that would be the case.

Now, this kind of speaks to, again, that overall story of managing a humongous IT operation for, in this case, 280,000 users within Instana, what that would look like. So in this case, that's going to look like 22 tenants with 1,300 host agents being deployed, all having an understanding of what that impact has on each individual tenant, right? And then all that is going to be managed from a resourcing position, meaning from Turbonomic, and then observed with Instana.

So what happens here is that what I was explaining with Turbonomic is spoken to as ARM, and that's something that I do want you guys to take away today. And if not, come talk to us at the booth, of course, which is going to be application resource management. Meaning, we want to make sure our applications are performing how they should be performing while managing overspending and adhering to any current business policy that you have set in place today.

That's ARM. I don't have to, but we're going to provide something beyond APM, which is going to be observability. APM tells you what. Observability should tell you why, and what is overall being impacted.

Overall, we want to bring all the teams together to get an understanding of what that impact is. So we want to make sure everyone is on the same page: SREs, developers, your CloudOps engineers. That way we can take back some of that time that's wasted, some of that overspend that's wasted, and provide intelligent actions for managing our overall applications to make sure that they're running as.