How OpenTelemetry Improves Observability and Monitoring
The CNCF project OpenTelemetry is increasingly becoming the standard for getting reliable and consistent application and machine data to your monitoring and observability tools. Many organizations realize the power of decoupling their metric, log, and trace data collection from their monitoring stack, giving them more freedom and capabilities to improve the observability of their applications. Organizations today want to discover application issues quickly and have more confidence in supporting their applications.
In this session, learn about:
1) What is OpenTelemetry
2) What is the architecture of the OpenTelemetry Collector (OTel)
3) How do you build a strategy around OpenTelemetry
4) How do you get started with OTel
Standardizing on OpenTelemetry makes your application more observable and helps your organization implement better observability and monitoring practices.
Learn the why, what & how for OpenTelemetry, a new observability standard for app & machine data.
This session is presented by Splunk.
Full transcript
The complete talk, organized by section.
Johnathan Campos
Hi everyone. My name is Johnathan Campos. I'm a senior product marketing manager at Splunk. I've been in the industry for about 15 years, and I'm very excited to be here at the DevOps Enterprise Summit.
We're going to talk today about making applications observable with OpenTelemetry. Let's get started.
When it comes to making applications observable with OpenTelemetry, we want to first understand why we want to make these applications observable. It has to do with the new architectures that we are leveraging today with our deployments. Historically, we've always started with a monolithic architecture, where applications were deployed in a specific client-server type deployment. We found them to be very slow-moving, with infrequent changes and limited user transactions. Things were housed in one place and not necessarily distributed.
When we evolved to a microservice architecture, we found that we had much more distributed services: tens to hundreds of different services across many different hosts and different clouds, with high transaction volumes and frequent code pushes across our CI/CD pipeline. Things are a bit overwhelming when it comes to microservices architectures.
Let me give you an example when it comes to monitoring. On the left, we have an example of the different microservices that put together, on the right, this online boutique. It's a very simple online boutique where we can purchase several items, and you can see all of the different microservices that are involved, from the front-end service to the checkout service, to the payment service, currency service, and more. Each one of these microservices is communicating with one another, and we have very high transaction rates that occur between each and every one of these microservices.
If we have an issue with one of these services, where is that issue located? Where can we find it? It's not necessarily clear. This is where observability can really help us find that needle in the haystack.
This is why I cannot stress enough that latency is absolutely the new downtime. Absolutely the new downtime. We find ourselves consistently dealing with SLAs for customer experience, making sure that we keep this top of mind. This is exactly why, again, I can't stress enough that latency happens to be the new downtime. It's not our CPU being pegged. It's not memory being pegged. It is latency.
This is exactly why monitoring must evolve into observability. Observability really helps detect, investigate, and resolve the unknown unknowns quickly and efficiently. When we monitor, we're just reactively keeping an eye on things, seeing if something goes wrong: a server went down, a service went down. But with observability, we really can understand what happened and why. It's because we have visibility into each and every one of these transactions and each and every one of these microservices. We're going to talk about how we do that with OpenTelemetry today.
The first step in making your application observable is all about telemetry. We have to gather the data. The data is the key to understanding exactly what's going on.
How do we do that? First, we're going to build a strategy. We have to think about our application, and we have to think about it in detail. Factor in how many microservices we have. Factor in what we want to monitor and what we want to see happening with these applications. Then, of course, factor in the information that we want to highlight.
Second, this is where instrumentation comes in. We instrument our applications to capture traces, spans, and metrics, to really see exactly what's going on. Third is configuring our observability backend so that we can build the dashboards and charts that we need to clearly see exactly what's going on, so that our organization can maintain its SLAs.
Now let's take a look at some terminology. With observability, we look at it as the measure of how well internal states of a system can be inferred from knowledge of its external outputs. This is everything and anything that happens within our outputs of the different systems or microservices that we're using.
Some of the data sources that we're going to collect involve traces, which track the process of a single request; metrics, which are a measurement about a service captured at a given runtime; and logs, which are produced when certain blocks of code are executed. All of this data is important, and I definitely invite you to check out our glossary with respect to OpenTelemetry on all of the different terms that are used throughout the OpenTelemetry community.
When it comes to tracing, we have to first think about the context that we'll use. Typically, we'll find that W3C Trace Context is what's used. This is the format for propagating distributed tracing context between services. The tracer is responsible for creating spans and interacting with the context.
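To make the propagation format concrete, here is a minimal sketch of the W3C Trace Context `traceparent` header (version 00), using only the Python standard library. The function names are illustrative and are not part of any OpenTelemetry API; a real client library handles this for you.

```python
import re
import secrets

# traceparent, version 00:
#   "00" - trace-id (32 hex chars) - parent-id (16 hex chars) - trace-flags (2 hex chars)
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def _format(trace_id, span_id, sampled):
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def new_traceparent(sampled=True):
    """Start a new trace: random trace ID and span ID, plus the sampled flag."""
    return _format(secrets.token_hex(16), secrets.token_hex(8), sampled)


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # IDs made only of zeros are not valid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(header):
    """Header for the next hop: same trace ID, fresh span ID, same sampled flag."""
    parsed = parse_traceparent(header)
    if parsed is None:
        return new_traceparent()  # nothing valid to continue, so start a new trace
    trace_id, _parent_id, sampled = parsed
    return _format(trace_id, secrets.token_hex(8), sampled)


incoming = new_traceparent()
outgoing = child_traceparent(incoming)
print(incoming)
print(outgoing)  # shares the 32-character trace ID with `incoming`
```

Because every service forwards the same trace ID and only swaps in its own span ID, a backend can stitch the spans from all services into one trace.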
The span is the unit of work, and it contains a name, a given action, a start time, and an end time. It will typically contain the kind, which is either client or server, or producer or consumer; any attributes, like the version number or any other metadata that we want to include; and any events, plus links to related spans, which help with batched operations. We're going to see that when we look at a configuration example here in a minute.
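The fields just listed can be pictured as a small data structure. This is a sketch of the shape of a span, not the OpenTelemetry SDK's own class; the field names and the example values are assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class SpanKind(Enum):
    INTERNAL = "internal"
    CLIENT = "client"
    SERVER = "server"
    PRODUCER = "producer"
    CONSUMER = "consumer"


@dataclass
class Span:
    name: str
    kind: SpanKind
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None
    start_time_ns: int = field(default_factory=time.time_ns)
    end_time_ns: Optional[int] = None
    attributes: dict = field(default_factory=dict)  # metadata such as a version number
    events: list = field(default_factory=list)      # (timestamp_ns, name) pairs
    links: list = field(default_factory=list)       # (trace_id, span_id) of related spans

    def add_event(self, name):
        self.events.append((time.time_ns(), name))

    def end(self):
        # Ending twice keeps the first end time.
        if self.end_time_ns is None:
            self.end_time_ns = time.time_ns()


# Hypothetical span for a checkout request handled by a server.
span = Span(name="checkout", kind=SpanKind.SERVER, trace_id="a" * 32, span_id="b" * 16)
span.attributes["version"] = "1.2.3"
span.add_event("payment authorized")
span.end()
print(span.name, span.kind.value, span.attributes, len(span.events))
```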
Then, of course, we have the sampler. We want to leverage the sampler when not all requests in a given application are really needed to be captured. We want to balance observability and expenses. We also have the span processor, which is responsible for forwarding these spans to the exporter, and that exporter could potentially be the OTLP exporter, Jaeger exporter, Prometheus, Splunk Observability Cloud, or any one of these exporters.
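One common way to balance observability against cost is to keep a fixed fraction of traces and to make that decision from the trace ID, so every service reaches the same verdict for the same trace. The sketch below illustrates that idea under the assumption that the low 64 bits of the trace ID are uniformly random; it is not the SDK's sampler implementation.

```python
def should_sample(trace_id_hex, ratio):
    """Keep roughly `ratio` of all traces, deterministically per trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Map the ratio onto the 64-bit range and compare against the
    # low 64 bits (last 16 hex characters) of the trace ID.
    bound = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(should_sample(trace_id, 1.0))  # a ratio of 1.0 keeps every trace
print(should_sample(trace_id, 0.0))  # a ratio of 0.0 keeps none
```

Since the decision depends only on the trace ID and the ratio, a trace is either kept in full or dropped in full, rather than ending up with gaps.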
When it comes to metrics, there are five points in particular to consider: the context, which is the span and the correlation; the meter, which is used to record a given measurement; the raw measurement, which includes the measurement name, description, and unit of value; the measurement, which is a single value of a measure, such as CPU utilization or memory utilization; and the metric, where we actually identify that given counter, such as CPU or memory, or in some cases leverage an observer. Then, of course, the time: the time where we're going to capture this particular metric.
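As a rough illustration of how these pieces relate, the sketch below models a meter that hands out a counter, where each recorded measurement carries a value, a timestamp, and optional attributes. The class and metric names are hypothetical stand-ins, not the OpenTelemetry metrics API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Counter:
    name: str
    description: str
    unit: str
    points: list = field(default_factory=list)  # (timestamp_ns, value, attributes)

    def add(self, value, attributes=None):
        if value < 0:
            raise ValueError("a counter can only increase")
        self.points.append((time.time_ns(), value, dict(attributes or {})))

    def total(self):
        return sum(value for _timestamp, value, _attributes in self.points)


class Meter:
    """Creates instruments on behalf of one instrumented library."""

    def __init__(self, library_name):
        self.library_name = library_name
        self.instruments = {}

    def create_counter(self, name, description="", unit="1"):
        # Asking for the same name twice returns the same instrument.
        if name not in self.instruments:
            self.instruments[name] = Counter(name, description, unit)
        return self.instruments[name]


meter = Meter("shopping")
requests = meter.create_counter("http.requests", "Handled HTTP requests")
requests.add(1, {"route": "/cart"})
requests.add(2, {"route": "/checkout"})
print(requests.total())
```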
What frameworks can help with making your applications observable? OpenTelemetry is definitely one. It's a CNCF OSS project with a strong community and an implementation of a collector and an agent made available to collect all of this telemetry data. It actually is becoming the new standard for observability. It really helps with keeping all of our observability data vendor-neutral and building a standard that we can all use no matter which particular language we're developing our applications with.
Some of the OpenTelemetry components include a specification for your given language, an API and SDK, instrumentation libraries also for your given platform, which would be a single library per language for all your given signals, and a collector, which receives and processes all of your telemetry data and exports that data to a given backend of your choice.
Again, I can't stress enough, everyone is contributing and adopting OpenTelemetry. It's quickly becoming the new standard for observability. We see cloud providers making it available, several vendors, end users, as well as different other third-party entities. You can check out the two links below to see the different adopters and contributors for OpenTelemetry. I highly recommend it.
Why the adoption? It's simple. First and foremost, we're going to offload the responsibility from the application. That simply means compression, encryption, tagging, redaction, vendor-specific exporting, control flow of data - all of this is considered. This is why we see a lot of adoption.
One big reason is the time to value. It's language-agnostic, which makes changes easier: kind of set it and forget it. Instrumentation is ready for the collector. The biggest reason is that it's vendor-agnostic. You're going to instrument your application one time with OpenTelemetry, and because it's vendor-agnostic, it's very easily extensible. Think about that time to value.
Let's take a look at the reference architecture. You can see that we have two different types of deployments, one on the left and one on the right. The one on the left allows us to see that we have an application deployed, whether it's in Kubernetes or on a given bare metal host. This application is using the OpenTelemetry library, and it's sending telemetry data to the collector. The collector is then sending the information directly to a backend.
On the right-hand side, we see the same approach, except the collector is sending all of the telemetry data to an OTel collector service, also known as a gateway. We typically find these collector service deployments in a data center, in a given region, or in a given availability zone, depending on your type of deployment.
The OpenTelemetry Collector is a vendor-agnostic implementation for how we receive, process, and export telemetry data in a seamless way. With a single binary - a single library, rather - that can be deployed as an agent or as a gateway, as we saw in the previous slide, we leverage this collector as the default destination for OpenTelemetry client library data.
Let's take a look at some of the details behind the OTel Collector configuration. When we configure the OpenTelemetry Collector, we must first define the given component and then enable that given component. What are we defining and enabling? The receivers, the processors, the exporters, and the given extensions.
Let's be clear on what each of these is. On the left-hand side, we have OTLP set up as a receiver. A receiver is how you get data in, and it can be push- or pull-based. In the example, we're using OTLP to receive data from a given instrumented application. We'll leverage the processor to do something with this data, whether we batch the data, add metadata, or redact data - say, a given Social Security number. We would use processors to do that. Then, of course, our exporter exports that data so that we can get the data out and view and understand exactly what's happening with our application. This entire process is how we create a pipeline.
We'll define these given components - receivers, processors, exporters, and extensions - and then, of course, enable them. Let me not forget extensions. Extensions are things that you do in the collector, typically outside of processing data. What does this mean? Say you want to have a health check of the collector. You would use an extension.
Now let's take a look at a configuration example. In yellow we've defined the receiver, in green we've defined the processor, and in blue we've defined the exporter. The receiver uses OTLP, with specific protocols, sending to a given endpoint; the processor is set to batch; and the exporter leverages a given access token and endpoint. On the bottom, we can easily see how we've enabled them by specifying the given receiver, processor, and exporter that we want to use. All of this is done using YAML.
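The define-then-enable structure can be sketched as a Python dictionary that mirrors the layout of the collector's YAML file, with a small check that every enabled component has a definition. The exporter name `example_backend`, the token placeholder, and the endpoint URL are made up for illustration; a real configuration uses the exporter and settings for your own backend.

```python
# Components are defined at the top level and enabled under service.pipelines.
config = {
    "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
    "processors": {"batch": {}},
    "exporters": {
        "example_backend": {  # hypothetical exporter name
            "access_token": "${ACCESS_TOKEN}",
            "endpoint": "https://ingest.example.com/v2/trace",
        }
    },
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["example_backend"],
            }
        }
    },
}


def undefined_components(config):
    """List components that a pipeline enables but that are never defined."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline_name, pipeline in pipelines.items():
        for section in ("receivers", "processors", "exporters"):
            defined = config.get(section, {})
            for component in pipeline.get(section, []):
                if component not in defined:
                    problems.append(
                        f"{pipeline_name}: {section[:-1]} '{component}' is enabled but not defined"
                    )
    return problems


print(undefined_components(config))  # an empty list means every enabled component is defined
```

The reverse case is not an error in this sketch: a component that is defined but left out of every pipeline is simply never used.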
Let's take a look at a quick example where we've defined a given processor, and this processor's action is to delete any key or metadata containing SSN. It will remove that information from the given trace in this case. On the top, we've defined the attributes processor, and below, we've enabled that attributes processor so that we can then take action on any key that has SSN. The same thing happens for any key containing user: we apply a hash to that metadata, so that when we look at this information on our backend system, we'll clearly see that no Social Security information has been sent to our backend system.
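The effect of the delete and hash actions can be sketched in a few lines. This is an illustration of the behavior described above, not the collector's code: the attribute keys `ssn` and `user.email`, the exact-match rule on keys, and the choice of SHA-256 as the digest are all assumptions of this sketch.

```python
import hashlib

# Hypothetical action list, shaped like the entries of an attributes processor.
ACTIONS = [
    {"key": "ssn", "action": "delete"},
    {"key": "user.email", "action": "hash"},
]


def apply_attribute_actions(attributes, actions):
    """Return a copy of the span attributes with each action applied in order."""
    result = dict(attributes)
    for action in actions:
        key = action["key"]
        if key not in result:
            continue  # nothing to do for spans that lack this key
        if action["action"] == "delete":
            del result[key]
        elif action["action"] == "hash":
            value = str(result[key]).encode("utf-8")
            result[key] = hashlib.sha256(value).hexdigest()
        else:
            raise ValueError(f"unsupported action: {action['action']}")
    return result


tags = {"ssn": "123-45-6789", "user.email": "jane@example.com", "http.method": "POST"}
print(apply_attribute_actions(tags, ACTIONS))
```

Hashing rather than deleting keeps the attribute usable for grouping, since the same input always produces the same digest, while the original value never reaches the backend.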
To better understand the OpenTelemetry configuration, what I'd like to do is demonstrate how we would use processors to omit or redact a given end user's Social Security number before it is sent to a backend service. With that, we're going to first take a look at the configuration that does not have the attributes processor enabled. You can see we have a definition for receivers, where we're receiving Zipkin data on port 9411. We're exporting to logs. We're not doing anything special or sending anything to any backend service at this point, just for the demonstration. Again, we have a definition here for a given processor so that we can remove the Social Security number from any payload that may be sent to the backend service. On the bottom, we've enabled only the Zipkin receiver and only the logging exporter.
On the second configuration, we've then enabled that attributes processor, which we can see here. This is what the payload looks like. You can see that we have a trace ID, a parent ID, a kind, the name, the timestamp, the service name, and so on. As far as the tags are concerned within this given trace, you can see that one of them happens to be a Social Security number. The other happens to be an email of that given user, the status code, the method, and so on.
To get started, we're going to first start taking a look at those logs. We can see here that the collector is ready: begin running and processing data. That looks good. We'll then go ahead and replace the agent config file so that we can reflect the configuration that contains no processor attribute. Again, that's this one right here: no processor attribute. Now that we've done that, we'll restart the collector. We can see that reflected here in the logs. Everything is ready: begin running and processing data.
We'll then go ahead and send a POST. We can see that the backend, or in this case the logs, stored the actual Social Security number here, along with the user email, the HTTP target, and so on. Everything here is reflected and collected on this given backend service, which in this case is just a logger.
We'll then go ahead and replace that configuration file with the one that enables the attributes processor, which we can see here. We've done that. We'll restart the collector again, and then we'll run that POST once again and see that the actual user's Social Security number has been redacted.
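The difference between the two runs comes down to whether the processor is listed in the pipeline, not whether it is defined. A minimal simulation of that, assuming a tag named `ssn` and a single delete action (both hypothetical), looks like this:

```python
# The processor is defined in both runs; only the enabled list changes.
DEFINED_PROCESSORS = {
    "attributes": {"actions": [{"key": "ssn", "action": "delete"}]},
}


def run_pipeline(tags, enabled_processors):
    """Return the tags that would reach the exporter (here, what a logger would print)."""
    out = dict(tags)
    for name in enabled_processors:
        for action in DEFINED_PROCESSORS[name]["actions"]:
            if action["action"] == "delete":
                out.pop(action["key"], None)
    return out


payload_tags = {"ssn": "123-45-6789", "http.method": "POST", "http.status_code": "200"}

first_run = run_pipeline(payload_tags, enabled_processors=[])               # processor not enabled
second_run = run_pipeline(payload_tags, enabled_processors=["attributes"])  # processor enabled

print("ssn" in first_run)   # the Social Security number is still present
print("ssn" in second_run)  # the Social Security number has been removed
```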
Again, this is exactly how we can use the configuration of the OpenTelemetry Collector to leverage processors to redact metadata from given payloads that are sent and stored as far as traces are concerned.
Now, what about instrumentation? When we want to instrument our applications, we first want to instantiate a tracer. We want to create spans, enhance those spans, and then configure our SDK. That's for traces. For metrics, we want to instantiate a meter, create metrics, enhance those metrics, and then, of course, configure an observer.
Let's take a look at an example where we automatically instrument an application. In this case, we're looking at Java. In yellow, we've emphasized the Java command, where we're executing this application called MyApp.jar as an example. We've emphasized the Java agent and a path to the OpenTelemetry library, and some specific information that we want the library to use: for example, where to send the telemetry, localhost port 55680, and what service this particular application is associated with. Let's say this is the shopping microservice as a given example.
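The shape of such a launch command can be sketched by assembling its arguments. The agent jar path is a placeholder, and the system property names used here (`otel.exporter.otlp.endpoint` and `otel.resource.attributes`) as well as the endpoint format are assumptions that depend on the agent version, so check the documentation for the release you run.

```python
import shlex


def javaagent_command(agent_jar, app_jar, otlp_endpoint, service_name):
    """Build the argument list for launching a Java app with the agent attached."""
    return [
        "java",
        f"-javaagent:{agent_jar}",                                   # attach the agent at startup
        f"-Dotel.exporter.otlp.endpoint={otlp_endpoint}",            # where to send telemetry
        f"-Dotel.resource.attributes=service.name={service_name}",   # which service this is
        "-jar",
        app_jar,
    ]


command = javaagent_command(
    agent_jar="/path/to/opentelemetry-javaagent.jar",  # placeholder path
    app_jar="MyApp.jar",
    otlp_endpoint="http://localhost:55680",
    service_name="shopping",
)
print(shlex.join(command))
```

Only the launch command changes; the application code inside MyApp.jar stays as it is.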
Auto-instrumenting your application is very beneficial. Why? Because you leverage this library with absolutely no code changes, and it's configurable via environment and runtime variables. One thing to emphasize is that all you're doing is really just updating your runtime command for your given application, and now your application is instrumented. It can coexist with manual instrumentation if need be.
Manual instrumentation requires code changes, where we instantiate that tracer. We would then create the span, name that span, specify when the span will end, and then enhance that span with specific metadata or certain attributes that we would use to identify what's going on for a given version or for a given action.
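The sequence of instantiating a tracer, creating and naming a span, ending it, and enriching it with attributes can be sketched with a small stand-in tracer. This is not the OpenTelemetry API; the `Tracer` class, the span dictionary layout, and the span and attribute names are hypothetical and only show the order of operations.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Toy tracer: tracks the active span and collects finished spans."""

    def __init__(self, library_name):
        self.library_name = library_name
        self.finished = []
        self._active = []  # stack of spans currently in progress

    @contextmanager
    def start_span(self, name):
        span = {
            "name": name,
            "parent": self._active[-1]["name"] if self._active else None,
            "attributes": {},
            "start_ns": time.time_ns(),
            "end_ns": None,
        }
        self._active.append(span)
        try:
            yield span
        finally:
            # The span ends when the block exits, even if an exception was raised.
            self._active.pop()
            span["end_ns"] = time.time_ns()
            self.finished.append(span)


tracer = Tracer("shopping")
with tracer.start_span("checkout") as span:
    span["attributes"]["version"] = "1.2.3"
    with tracer.start_span("charge-card") as child:
        child["attributes"]["payment.method"] = "card"

for finished in tracer.finished:
    print(finished["name"], "parent:", finished["parent"], finished["attributes"])
```

Tying the end of the span to the end of a code block is one way to make sure a span is always closed, which is the part that is easiest to forget when instrumenting by hand.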
When it comes to metrics, we're instantiating the meter, leveraging the given library name or library that we're going to use, and emphasizing that we're going to gather metrics on CPU usage. Manual instrumentation does require a bit of code change, but we add this information to understand exactly what our application is doing. This is where observability comes to life: really understanding what each and every one of these microservices is doing.
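An observer differs from a counter in that it is read on demand through a callback rather than updated by the application on every event. The sketch below shows that idea with the process's own CPU time; the `Observer` class and the metric name are hypothetical and not part of the OpenTelemetry API.

```python
import time


class Observer:
    """Reads a value from a callback each time an observation is requested."""

    def __init__(self, name, callback, unit=""):
        self.name = name
        self.callback = callback
        self.unit = unit

    def observe(self):
        return {
            "name": self.name,
            "unit": self.unit,
            "time_ns": time.time_ns(),  # when the value was captured
            "value": self.callback(),
        }


# time.process_time() reports the CPU time used by this process, in seconds.
cpu_time = Observer("process.cpu.time", time.process_time, unit="s")
print(cpu_time.observe())
```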
With that, I thank you for joining today's presentation. I did provide you with a couple of links to help you get started with OpenTelemetry and check out a few demos that we have. We have the CNCF demo. We have a tags webinar, which I think would be great. This will allow you to really add the data that you need to really understand your traces. We also have a great blog post that you can read on OpenTelemetry, as well as a Gitter page where you can have fun with OpenTelemetry.
With that being said, my name is Johnathan Campos. Thank you so much for watching, and have a great day.