DevOps Without Measurement is a Fail


London 2017


Neil MacGowan

Director of Digital Intelligence · New Relic

Ricardo Santos

Technical Account Manager · New Relic



Full transcript

The complete talk, organized by section.

Neil MacGowan

My name's Neil MacGowan. For the next half hour or so, I'm going to be talking to you about how we address the challenges of trying to measure the performance of applications and the process of delivery of those applications in a fast-paced environment.

And then I'm going to hand over to my wingman over here, Ricardo, who's going to go into a little bit more detail around how we actually do it at New Relic. And I'm also going to be relying on the works of Tom Cruise every now and then to emphasize a few points.

How many people in the room have heard of New Relic? Okay, pretty good. How many people are actually customers or users of New Relic? That's pretty good as well. So hopefully you'll have some understanding of what I'm talking about here, but hopefully as well we're going to show you something new, something which you can take away and maybe implement in your environment.

So first of all, I'm going to start with a brief video.

These days, everything is fast, and fast is everything. Fast news, fast food, fast delivery, fast dating. Too fast for words. But today's fast needs to be clever fast. The William Hill app is now clever fast, so you can get on all sports, accumulators, insurance faster than ever.

So that was a recent advert from William Hill, who happen to be a New Relic customer, fortunately. But I think the key thing to take from that advert, and we're seeing a lot more and more of this on the TV nowadays, is the search for speed in terms of the way that people interact with your applications and your services. Everybody's looking for a frictionless, fast, reliable experience.

And now more than ever, the actual brand experience, the way that people interact with you and understand and form an opinion about your company, comes from the digital experience. You just look at the statistics. Domino's Pizza, for example. You'd think they were a pizza company. Well, actually, they're a technology company that just happens to make and deliver pizzas.

Recently, we had on stage at our FutureStack presentation Ryanair presenting and saying that they used to be an airline; now they're a software business with a few planes. So I can't remember the last time I actually went into a bank. All my banking's done online. I expect to be able to access whatever I need, wherever I am, whenever I want it.

And similarly, revenue from mobiles, whether it's e-commerce, gaming, et cetera, is increasing all the time. So the way that your customers, your consumers, experience your brand has changed dramatically from them walking into a store or interacting with a person. And for that reason, speed really matters, because you only get one chance to make a first impression.

So a recent survey with regard to the state of mobile applications by Forrester showed—it's a word from our sponsors, I think—showed that if people have a negative experience, they're likely to switch to another application. Forty-six percent of them have switched to another application straight away, and if they get a regular issue with that application, they're likely to delete it.

So if you're making significant investments in putting these applications in the hands of your customers, you need to make sure that they deliver the right experience every time, and time, and time again.

So in that respect, speed is important, right? Time is money. Recently, Amazon stated that a one-second slowdown in performance resulted in a 7% drop in conversion rate on their site. Similarly, for Bing, a two-second delay means a 4.3% reduction in their advertising revenue. So what we really need to do as organizations is this.

“Hey, Maverick.”

“Yeah?”

“You hear about Ice?”

“What's that?”

“You want another one?”

“Really?”

“Yeah.”

“I feel the need—”

“The need for speed.”

So I'm showing my age a little bit there. I'm sure there's a few people in this room that weren't even born when that film came out. And if you're wondering when it came out, that will be at the end of the presentation.

So there is a need for speed. There's the obvious need for speed in terms of the way your applications perform in front of the customers. But there is a much wider need for speed as a business, and that wider need for speed covers a whole host of things. It covers being able to innovate faster, being able to be the first to market with a new offering, a new technology, a new service, et cetera; experiment faster, develop faster, et cetera. The commonly used “fail faster,” right?

So we're constantly under pressure to do everything faster, whether it's being able to identify issues, diagnose the problem, and resolve them, and also to learn from our customers, the way that they interact with our applications, and then consequently change and continue to evolve and innovate to delight our customers in terms of the way they interact with our applications.

So in that quest for speed, we as organizations are changing a lot of things. We're changing the technologies that we use, and we're also changing the processes that we adopt in order to deliver against those challenges. And that's why we're all here today.

Everything is changing. Everything's dynamic. The data center is moving to a hybrid cloud solution. Physical servers are moving to containers, and monolithic applications are being broken up into microservices, et cetera. I'm sure it's a common theme throughout the event this week, but it's a fact. Everything's dynamic and everything is changing.

We're adopting new technologies and dynamic infrastructure. We're adopting new processes and a faster rate of change, faster velocity. And we're adopting new business models and revenue streams. Businesses that were known for one thing are now offering new services which might be left field from the perspective of what you thought. A lot of organizations nowadays are trying to drive value from the data that they hold about their customers, which might be in a completely different area to their original business.

So in that environment where everything's changing, not only are we faced with a high degree of complexity in terms of the technologies which are at play, but also in terms of the scale at which we have to operate. If you come up with something which really takes off, then how do you expand to support the demand that you've generated as a result of producing that offering?

So what problems might lie ahead? We've got all of these new technologies, dynamic architectures, new processes, et cetera. Is it just a car crash waiting to happen?

Well, in a changing world, we're moving from the previous static world where we were quite secure, we understood what was going on, we had a clear vision how to operate and maintain those systems, and we had a very specific responsibility. Now in the dynamic world, we're in a situation where we're continuously delivering change into that environment. We no longer have that static set of infrastructure components. Instead, we've got stuff which is being created and destroyed on the fly. And we're trying to operate as a combined unit where everybody has responsibility for the maintenance and the support of those applications in the production environment.

In that dynamic world, things are accelerating even faster because of the dynamic cloud. So with EC2, we've now got servers running your applications and your processes. With Docker, the processes are running as a command. And with Lambda and other serverless computing technologies, now they're just a piece of code which is operating or running somewhere, and we don't really care where.

And that environment, that change in the way that we build and we deliver our applications and our services to our customers, is borne out by New Relic's own observations in terms of its customers. We manage thousands of customers' data in the cloud. We have visibility into their infrastructures, their architectures, their applications. And as a result of analyzing that data, we recently discovered that the typical age of a container is measured in hours rather than days or weeks or months.

Actually, if we drill into that in a little bit more detail, 11% of the millions of containers that we've observed in our customers' environments exist for less than one minute. So it's not a server. It's a function. It's something which is just being created to fulfill a function and then destroyed. And this is becoming more and more evident in this dynamic approach to delivering applications, whether that's just delivering the core capabilities of those applications or, more importantly, scaling those applications to meet demand.

So in that dynamic environment where we're faced with all of this increased complexity and these challenges, let's not repeat the mistakes of others, right? Let's not bring the old approaches into the modern world. Let's not do the same thing over and over again and expect a different result.

So what we really need is a changed approach. We need to stop thinking about monitoring things and measuring things in the old world, and we need to start thinking about monitoring the life cycle, the whole cycle of creation, deployment, maintenance, et cetera. And we need to measure it from a whole host of different perspectives.

There's a lot more stakeholders involved in what we do now, rather than just the operations team that were responsible for maintaining the applications in the wild in the past. So behind every app that you develop, there's an entire team that have some responsibility, some accountability, and some requirements from that application.

So we need to be able to support the interests of everybody within this team in terms of being able to measure how your applications and the infrastructure that support them are performing. So if you think about that from a digital business in the cloud, there's various areas of responsibility. If you're using a public cloud, for example, then their responsibility is providing the scale, the security, offering new services and innovation in that cloud, and so on. From your perspective, your responsibility is about ensuring the user experience, the code and the configuration, and anything that you've got on your side as well.

But what you really need in order to manage those digital applications in that environment is the ability to see across the full stack. We need to see everything in context. We need to see the customer experience, and not just one customer, but every customer. Every customer counts.

We need to be able to see into the application code and understand where the time's being spent, where the errors are being produced, whether or not the application is delivering what we need as a business. And we need to be able to see into the infrastructure or the components which are actually supporting the delivery of that particular application. And we need that so that the whole team, from the business right the way through to the operations team, knows exactly what to do next. They can understand the performance of their applications in full context, and they can prioritize their resources appropriately.

So this is where New Relic's software measurement framework comes into play. So yes, we're interested in ensuring that we can drive engineering velocity. We want to understand how long it's taking us to do things, how successful we're being in terms of deployments, et cetera; how quickly we can resolve issues, et cetera.

We need to understand about service quality, our uptime, whether or not we've got errors, the success of our deployments, and whether or not we're improving things for the customers. But we need to see it from the application and infrastructure perspective. How are the apps performing? What are the slow queries, et cetera?

From the customer engagement perspective, is the app doing what we want it to do? Are customers engaged? Are we getting the right conversion rates, et cetera? How's the performance? And we need to see it from the business perspective. So inside those applications is an incredible wealth of data that allows us to understand, how are we doing from a revenue perspective? What's our average cart volume, if it's e-commerce, et cetera?

So all of that data is there for us to tap into and monitor and measure every aspect of that application life cycle.

So why is it so hard, right? Why is it so hard to carry out that degree of measurement? Well, fundamentally, it boils down to the fact that there are lots and lots of data sources. New Relic provides a mechanism to collect data from a whole host of these, whether it's from the end-user perspective, the mobile perspective, the application, the infrastructure, et cetera. But there's also data from a whole host of other sources, logging data, machine data, even taking feeds from social media, et cetera, that feeds into understanding how our applications are performing.

Now, if we take a simple application that our customers interact with, and we look at a fairly simple architecture that supports it, one customer comes in and they get a decent response, a 300-millisecond response. Another customer comes into the same application, and they get a one-and-a-half-second response. Not quite so good. Now, that additional 1,200 milliseconds might be the difference between them buying and not buying, depending on what your application does.

But if we only measure or sample or average or aggregate the data that we're getting from those users, we end up looking at what looks like a reasonable response. Less than one second is the average response for our customers. So we can very quickly, if we're just scratching the surface in terms of trying to pull the data out, get a false impression as to how good we are.
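As a minimal sketch of that averaging problem, the snippet below takes the two response times from the example (300 ms and 1.5 s) and reports the mean next to the worst case and the share of slow requests. The function name `summarize` and the one-second "slow" threshold are assumptions made for illustration, not something taken from the talk or from any New Relic API.

```python
from statistics import mean


def summarize(response_times_ms, slow_threshold_ms=1000):
    """Return the mean alongside figures that the mean alone hides.

    slow_threshold_ms is a hypothetical cut-off for a "slow" response.
    """
    slow = [t for t in response_times_ms if t > slow_threshold_ms]
    return {
        "mean_ms": mean(response_times_ms),
        "worst_ms": max(response_times_ms),
        "slow_share": len(slow) / len(response_times_ms),
    }


# The two customers from the example: one at 300 ms, one at 1.5 s.
summary = summarize([300, 1500])
print(summary)
```

The mean here comes out under one second even though half of the customers waited a second and a half, which is the false impression described above.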

But really, to actually understand how your application is performing and everything's performing, you need to tap into every single source, every transaction, every user, and collect that data in real time and make it available so that we can analyze and we can start to determine exactly how we really are performing. And that's difficult, right? There's a lot of data, and that amount of data can be so huge it can dwarf the amount of data that even resides in your applications themselves. The requirement to collect, store, analyze the data dwarfs what you're actually doing in the application itself.

So is that an impossible mission, right? Being able to collect all of that data, being able to query it, detect issues with it, alert and analyze it. Well, no, it's not. It's not as long as you have three core requirements or three core capabilities to enable you to do that.

The first is that you have to have access to all of the data. You have to be able to instrument the full stack, whether that's the application tier, looking at everything which is going through your application, every transaction, whether or not it was successful, how long it took, where the performance issues are, where the bottlenecks are, and so on. Every customer interaction, so every individual user's experience, and all of the dynamic infrastructure that supports it, whether that infrastructure component existed for an hour or a second. We need to know what was actually happening. So you have to have that full-stack instrumentation.

The second thing that you need in order to be able to address this challenge is to be able to scale to the biggest day. So you need an architecture that not only can collect that data, but can consume that data, analyze it, and present it back in a timely manner. And when we talk about built to scale to your biggest day, your monitoring should not impact your application.

I've spoken to many customers where they've been doing logging, for example, and they've actually had to turn it off when their applications get busy. So Black Friday, for example, the volumes on their applications go up so high that they're actually having to turn monitoring off because it's impacting the performance of their applications.

So we talk about scale to your busiest day in terms of Grand National, Black Friday, bank holiday sales, whatever it may be, media events. So being able to cope with the increased demand on your applications, and as a result, the increased demand in terms of the volume of data which is produced.

And the final thing is to take all of that data that we've collected and present it back to everyone, everyone that needs it, in a timely fashion. So analytics for everybody. Fundamentally, what we want to do is we want to take this data and democratize it so that people can have access to it, they can benefit from the expertise of others through curated dashboards, or they can create their own dashboards that present the information in the manner that they want to see it.

So those are the three core requirements. Fundamentally, to deliver that, what you need is an intelligence platform. You need a cloud architecture, a multi-tenant cloud architecture that allows you to scale to meet the demands of all of your customers whenever they have their peak periods. It needs to be secure, it needs to be open through the use of APIs, and it needs to have some built-in intelligence, which we'll come into in a little bit more detail later on.

You obviously need the data, the full-stack instrumentation. So you need to be able to get the customer perspective, the application perspective, and the infrastructure. And then you need to be able to provide the analytics on top of that, which generates the events, the alerts, the dashboards, the maps, the views, so that people can consume that data.

And it just so happens that New Relic has one of those: a digital intelligence platform.

So for those of you that are familiar with New Relic, you're probably familiar with some of these solutions that fit in these areas. From a data collection perspective, we're founded in the APM space. That's where we come from, in 2008, focused on looking inside the applications, modern applications, modern languages, and being able to pull all of the information out of those.

Extending into the customer experience with mobile monitoring, browser monitoring, so every customer interaction through the webpage, and synthetic monitoring as well. So monitoring APIs is as important as monitoring the users themselves.

And then most recently, with the release of New Relic Infrastructure, which was built from the ground up to support dynamic cloud-based environments, integrating with the current cloud platforms, but also collecting data from standard technologies as well.

And then on top of that, we have the Insights platform, which allows us to do the analytics, the visualization, and so on, based upon all of that data. And all of that sits on the multi-tenant SaaS architecture, which allows us to support 15,000 customers, from digital disruptors, born-in-the-cloud companies, through to global enterprises, major financial institutions; over half a million users; and 1.4 billion events coming into our system every minute. In terms of scale, that dwarfs things like Twitter and various other social media streams.

So it's that digital intelligence platform that attracts customers to New Relic. And this is one particular customer, the Trainline. How many people in here use the Trainline app? Okay, excellent. They continue to innovate. Tell you which carriage you should be waiting at to get a seat if a train's busy, constant updates, live tracking of the trains, and so on.

If you think about your journey here today or yesterday, whenever you came down, there's many points where you probably interacted with an application. Maybe getting your coffee at Costa or Starbucks, maybe booking a hotel room. There's many occasions where you've interacted with something that New Relic is enabling that organization to manage and deliver that service.

But the Trainline here, they go to Insights. It's the first thing they look at in the morning. It's the last thing they look at at night. And Chris Turville, who's the head of cloud infrastructure and operations at the Trainline, was recently quoted at one of our events saying that they calculated a 300-millisecond delay in response cost them $10 million.

So coming back to that speed, yes, it's speed in terms of the user experience, but it's also speed in terms of how we deliver that innovation to our customers.

So I'm about to hand over to Ricardo, who's going to show you how this actually works in practice. But before I do that, I just wanted to say that we, as New Relic, are a DevOps company. We were doing DevOps before DevOps really became mainstream. We've learned on the fly. We operate at huge scale. We support thousands of customers. We're continuously innovating. We're operating software development teams as small startups, and we have embedded in them site reliability engineers that ensure that our customers get the best possible service.

So where are we going? What does New Relic look like in the future? So a short video giving you insight into what's out there now and some of the things that are coming, and then we'll move on to Ricardo.

So I encourage all of you to visit the website and see what's new, what's coming from a New Relic perspective, and how that will help you in your challenge in terms of delivering innovation and applications to your business.

But one of the things that was featured in that short video there was something called Project Seymour. And this is a code name for a development project within New Relic, which is applying various aspects of machine learning, artificial intelligence, to the huge volumes of data that we're collecting from all of our customers.

And the idea here is that basically we can provide expertise for everyone. In one of the earlier slides, I talked about analytics for everyone, democratization of data. New Relic is uniquely positioned to have the largest data set of performance management information on the planet, and the largest number of customers who have the largest number of subject matter experts.

So we can bring that data and those people together to create applied intelligence, where New Relic is able to identify stuff that you haven't even thought of, things that may become problems in the future, and actually give you the information that you need in order to resolve those, rather than just say, “Hey, this looks weird.”

So that's coming down the line. What I'm going to do now is I'm going to hand over to Ricardo, who's going to talk about how we can apply New Relic in a DevOps environment. If it was bugging anyone, by the way, the films were the following, and the years that they were released are shown there. But I'm now going to hand over to Ricardo, who's much younger than me, and he's going to talk you through how we actually deliver it.

Thank you.

Ricardo Santos

I remember all the movies, by the way. Hi, everybody. So let me change presentation here.

There we go. So what do we do at New Relic? We've seen the stacks of what we cover. So basically, you have these end-to-end business applications, and we provide visibility onto every single part of those end-to-end business applications. We cover the customer experience, the application side of it, as in the application service running in the back end, and also the infrastructure. But we bring all of that data together in the same place so that if your end users are having issues, if a business isn't performing as well as it should according to the website, you have a bunch of metrics, dashboards, and alerts that give you visibility onto it.

Now, similarly to that, we then have a plethora of products that fit into those segments. So all of them, for me, basically, they are effectively different data collection sources. So different types of data that we pull in from the different sides of an end-to-end business application that all fit together in these dashboards. So I'm not going to go in detail on this because Neil covered it very well.

So what do we do in a DevOps environment? How do we help customers?

And basically, DevOps is a process where you start off by designing an app, you hand it over to development, they start writing code. You then have builds that take place. You have the test teams that start getting onto the applications. They do the functional tests, the load tests. Eventually, it goes to the deployment team, and it reaches production, and then all hell breaks loose if you don't have visibility there.

And that's typically when monitoring comes into place, whereas we have the view that monitoring should come into place at the very beginning, when you start to design an application. You should take into consideration what the application is doing, the architecture of it, and as soon as the first line of code is developed, you should have something like New Relic monitoring the code's development there.

So really, DevOps is looking at making that whole life cycle much better, much faster, so that you can do smaller code batches, you can test in production as if it's a development environment, because you just make small deployments into production that are easier to test and easier to roll back as well if something goes wrong.

So all of that is to make the developer's life easier, and to make life a bit easier for the people that manage the apps, the admin guys, as well, so that the business can come and say, “We need this new functionality here,” and you can take hours or days to take that functionality into production rather than months, as it used to be with things like the waterfall methodology.

So we bring all of that together to create a frictionless and collaborative process. So in every single one of these steps, we have the ability to provide some value, to provide some data that helps you then manage all of this process and optimize it in a bunch of ways.

So in this presentation, I'm going to cover how New Relic can help in each one of these segments here, in each one of these steps.

Now, in the design phase initially, the first time that you go to the drawing board and you start designing an application, there's nothing that New Relic can do because there's no data being collected. But DevOps is an iterative process, so by the time that you deploy and you launch some code, you have enough data to then go back to the design stage, and this is where we can bring value on a second iteration.

And we do that by giving you a very good view of what is the architecture of the application. How does it mesh together? How does the application scale? So that then a lot of times leads to design changes. So you optimize the application. You would go and split a service into a lot of different microservices as well so that it can scale better. But you have all of that data and all of that ammunition to make those decisions on how to change the architecture and design of the applications. That's one of the things we do to help the design phase.

By the way, if you have questions, let me know. Just raise your hand.

Also, we give a very good view of things like key transactions, the maps above, the performance of the whole application, the performance of microservices, the performance of, again, key transactions, errors in the transactions, and so on. All of that then helps you change the design of the application.

Now, in the development phase, this is where you would first go and implement New Relic. So as soon as the first line of code starts to be written, you should have an agent installed on the developer's laptop that's collecting data.

And the data that you get initially, it's something along the lines of this dashboard here, where you have a view of the transaction time, a view of errors as well, a breakdown per transaction, the throughput compared with performance. Also, you can see that the more load I throw into the application, the higher the response time is, so I have to understand how I can make this application scale a bit better.

But then you go into the detail of each individual transaction. You see within the transactions, what's the code element? What are the methods? What are the packages that are the slowest ones? Where do I need to focus my energy to make the code better? Where am I having errors? All of that is to make sure that by the time that my application reaches production, it's better, it has fewer bugs, and it's performing as fast as I can make it perform. So you find these bottlenecks before the application makes it to production.

And we go all the way into the detail of actually showing individual transactions, the methods executing inside it. Is this new to anyone here, the level of detail that I'm showing here? Cool. Okay. A few people. That's good. So a lot of you know already that we can do stuff like this, which is good.

So for developers, this is really gold, right? Because this means you're writing code, then you know that this method here or this query is really slow. So I don't need to run a lot of tests. I don't even need to replicate the issue on my development laptop. I know where the problem is, so I can just focus on this and fix the issues that I have.

Now, when it gets to the build phase, the agents are running in tests, the agents are running in development, but the build environment is all about a developer doing a commit and a build getting executed at some point, and we can also collect metrics from this stage.

We don't have agents that we install there, but we have the ability to collect and ingest custom data. So you can send whatever custom data you want to us, and we'll show it in dashboards like these. Now, I gave the example a while ago when I was speaking with a customer that if you know how to get the level of coffee in a coffee machine, you can pass that on to us every five minutes, and then map that onto the number of commits and see if the developer teams are more productive if they have more coffee ready for them. So you can correlate these two metrics, which is kind of cool.
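The coffee-and-commits correlation can be sketched in a few lines once both series are sampled on the same schedule. This is an illustration only: the `pearson` helper is written out by hand, and the sample readings are invented, not data from the talk.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Hypothetical readings taken at the same intervals:
# coffee level in the machine (percent) and commits in that interval.
coffee_level = [90, 75, 60, 40, 20, 10]
commits = [12, 11, 9, 7, 4, 3]
print(round(pearson(coffee_level, commits), 3))
```

A value near 1 would suggest the two move together, near -1 that they move in opposite directions. Either way it shows association, not that coffee causes commits.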

The same thing you can do with builds. If a build is made, if a commit is made, if you can intercept that call, you can send it to us. So all of a sudden, you have a very good idea of how your developers are doing. You have a very good idea of, am I making more builds than I was a year ago? Am I failing more on builds than I was a year ago? So how's my process overall?
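A rough sketch of what intercepting a build and turning it into custom data could look like follows. The event name `BuildEvent`, the attribute names, and both helper functions are assumptions made for this example. The snippet only prepares the JSON body and computes a failure rate locally. The actual endpoint, account ID, and insert key for sending custom events are deliberately left out and should come from the New Relic documentation.

```python
import json
import time


def build_event(build_id, status, duration_s, committer, timestamp=None):
    """Describe one CI build as a flat custom event (hypothetical schema)."""
    return {
        "eventType": "BuildEvent",  # assumed event name, chosen for this sketch
        "buildId": build_id,
        "status": status,  # "success" or "failed"
        "durationSeconds": duration_s,
        "committer": committer,
        "timestamp": int(timestamp if timestamp is not None else time.time()),
    }


def to_payload(events):
    """Serialize a batch of events as a JSON array ready to be posted."""
    return json.dumps(events)


def failure_rate(events):
    """Share of builds in the batch that failed."""
    if not events:
        return 0.0
    failed = sum(1 for e in events if e["status"] == "failed")
    return failed / len(events)


events = [
    build_event("build-101", "success", 120.0, "alice"),
    build_event("build-102", "failed", 95.5, "bob"),
]
print(to_payload(events))
print(failure_rate(events))
```

Comparing `failure_rate` for this year's batch against last year's is one way to answer the "am I failing more on builds than I was a year ago?" question from the data itself.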

So all of these metrics, all of this telemetry, gives you an idea of how your DevOps process is being optimized and how it's functioning. And all of this stuff you can send to us. You can send all of this data to us. And then you can do silly things like the total commits per developer, the ones that committed more code, for example, and then give them a bottle of champagne every month.

Now, on the test phase, a lot of value here, and there's a lot of ways we can help with tests. We have New Relic Synthetics. Who uses Synthetics here? One? One. Cool.

So Synthetics has a lot of ways of testing an app, right? You can go and say, “I want to test or monitor the loading of the homepage,” or, “I want to run through a specific journey like this one here.” And you basically script a journey where you go to the homepage, you log in, perform some sort of transaction, and log out, and constantly run that.

If when you're running that journey test, at the same time you have load tests executing and functional tests, then that's going to tell you how the journey is performing. It's going to create a baseline that says typically, each one of these steps takes two seconds, but when I throw this amount of load to it, it jumps up to five seconds, so I have an issue. And you find that issue in a test environment, not in a production environment, so no one is actually being affected.
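The baseline comparison can be sketched as a simple check of each journey step under load against its usual duration. The step names, the `regressions` function, and the 1.5x tolerance are all hypothetical. They illustrate the idea and are not how Synthetics itself computes baselines.

```python
def regressions(baseline_s, under_load_s, tolerance=1.5):
    """Return the steps whose duration under load exceeds tolerance x baseline.

    Both arguments map a step name to a duration in seconds.
    """
    flagged = {}
    for step, loaded in under_load_s.items():
        usual = baseline_s.get(step)
        if usual is not None and loaded > usual * tolerance:
            flagged[step] = (usual, loaded)
    return flagged


# Hypothetical scripted journey: homepage, log in, transaction, log out.
baseline = {"homepage": 2.0, "login": 2.0, "transaction": 2.0, "logout": 2.0}
under_load = {"homepage": 2.2, "login": 5.0, "transaction": 2.4, "logout": 2.1}
print(regressions(baseline, under_load))
```

With these invented numbers only the login step is flagged, jumping from two seconds to five, which mirrors the kind of issue you would want to catch in the test environment.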

You go back to the drawing board, you troubleshoot the data, because a lot of the data that I'm showing here is just a high level, but there's a lot of background data there that allows you to troubleshoot things. So the first goal that we have is to show where do we have a problem. The second is what is causing the problem, so that you know exactly what to fix.

Each one of those scripts is then going to give you things like availability, the load time, and for each one of the results, you're going to have a waterfall chart that says, “This is each one of my pages loading within that script. Here's the objects inside of pages and how long they're taking to load, how long they're taking to render on the screen,” and so on.

So all of those things allow you to understand if a web page is slow because you have, for example, a JavaScript snippet that's breaking there, or you have an image that's way too heavy.

Now, the cool thing about this is, because for every single page, we collect every single object and the timings for every single object, you can then ask New Relic questions like, “This image here, what's the impact that this image is having overall in all of my web pages? What's the impact that this JavaScript snippet is having on all of the pages where it's loading?” Which is quite a powerful question, right? Because you get to understand this little thing here, if I optimize it, how much performance gain am I going to have?
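To make the "what is this one image costing me everywhere?" question concrete, here is a small sketch that totals load time per object across page loads. The record shape, URLs, and timings are hypothetical; in practice the product would answer this from its own stored data.

```python
from collections import defaultdict

# Hypothetical per-page waterfall data: object URL -> load time in seconds.
page_loads = [
    {"page": "/home", "objects": {"/img/hero.jpg": 1.8, "/js/app.js": 0.4}},
    {"page": "/search", "objects": {"/js/app.js": 0.5, "/css/site.css": 0.1}},
    {"page": "/product", "objects": {"/img/hero.jpg": 2.2, "/js/app.js": 0.3}},
]


def asset_impact(loads):
    """Total the time spent on each object across every page that loads it."""
    totals = defaultdict(lambda: {"pages": 0, "total_s": 0.0})
    for load in loads:
        for url, seconds in load["objects"].items():
            totals[url]["pages"] += 1
            totals[url]["total_s"] += seconds
    return dict(totals)


impact = asset_impact(page_loads)
worst = max(impact, key=lambda url: impact[url]["total_s"])
```

The object with the largest total is the first candidate for optimization, since shaving time off it pays back on every page that includes it.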

Now, what you can also do, and this goes for test environments, it goes for production environments, is you can automate the deployment of New Relic so that ideally you would get to a point in DevOps where, at some stage, a test environment gets created automatically, and that test environment then has a load of functional and load tests executing. At the end of it, the test environment gets destroyed automatically as well.

So you can automate New Relic within this as well, so that by the time that the hosts are created, they actually have the agents running inside. So the web pages that are then being sent to browsers, they have New Relic Browser JavaScript there as well, monitoring the page load. So all of this can be automated so that you don't have to have any work in configuring the environment every time we do a test, or every time we do a build, or every time we do a launch.

The same thing goes for production. If you have an environment that's auto-scaling, so now you have four Docker containers or Kubernetes or whatever, you have four containers. Tomorrow, you know that you're going to hit a heavy load because it's Black Friday or something that's important to you, a big event, and you auto-scale from four to 100. So all of those 96 new hosts that are coming in, the 96 new instances, they can have New Relic already deployed automatically. So less work in configuring, the data just comes in and feeds directly to it.

We are pure SaaS, and we have our own data center; we're not on AWS or Azure or something like that. We have our own physical data center, a multi-tenant supercluster. Because we have that, and as Neil was saying, we consume an enormous amount of data, so it doesn't really matter if you send a little bit of data or a lot of data to us. We'll be able to ingest it. We'll be able to give you dashboards with millisecond response time.

So if you go through a stage where you have four Docker containers, you're going to hit Black Friday, you're going to go up to 100, you don't have to change anything in New Relic to cope with the load. Whereas if you have an on-prem solution, you'd have to go and make sure that you can scale that on-prem solution and so on. So a lot of advantage by having a pure SaaS environment like us.

You can also integrate with the deployment process and help to roll back deployments. We also track deployments. So every time that, in each one of the microservices you have, you do a new deployment, you send a marker to us, and then we tag that point in time and we say, “Here's the deployment that took place. These are the changes for that deployment.”

So you get the Apdex score, which is a measure of the health of that microservice, along with the number of errors, response time, and throughput. You see changes, and here it's like a 180% increase in the number of errors. That's bad. This is a 5,000,000% error increase. That's very bad. And you immediately see this.

But then if you click on each one of those deployments, you actually get a view for a bunch of metrics like response time, throughput, CPU, physical memory, error rate, so on. There's a bunch more here. You get to see things like this is the before and that's the after. And the only thing you need to do is send a marker to us to say, “In this point in time, I released a new version of this microservice,” and that's it. We provide all of these reports afterwards.
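The before-and-after report around a deployment marker can be approximated as a percentage change in each metric's mean on either side of the marker. This is a sketch only: the sample data, metric names, and the 50% rollback threshold are assumptions, and it does not use any New Relic API.

```python
def percent_change(before, after):
    """Relative change from before to after, as a percentage."""
    return (after - before) / before * 100.0


def compare_deployment(samples, deployed_at):
    """samples is a list of (timestamp, metrics dict); split them at the marker."""
    before = [metrics for ts, metrics in samples if ts < deployed_at]
    after = [metrics for ts, metrics in samples if ts >= deployed_at]
    report = {}
    for name in before[0]:
        mean_before = sum(m[name] for m in before) / len(before)
        mean_after = sum(m[name] for m in after) / len(after)
        report[name] = percent_change(mean_before, mean_after)
    return report


# Hypothetical samples; the deployment marker is at timestamp 300.
samples = [
    (100, {"errors": 10, "response_ms": 200}),
    (200, {"errors": 10, "response_ms": 200}),
    (300, {"errors": 28, "response_ms": 220}),
    (400, {"errors": 28, "response_ms": 220}),
]
deployment_report = compare_deployment(samples, deployed_at=300)
should_roll_back = deployment_report["errors"] > 50.0  # illustrative threshold
```

With these samples the error count rises by 180%, matching the kind of jump called out on the slide, so the illustrative rollback check fires.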

Now, for each one of the deployments as well, we go into the detail of showing every single transaction. So you get to see all of these metrics: throughput, total time, average time, Apdex score. You see the before and after. So here, we made a massive improvement on the transaction owners get. We went from 117 seconds average to two seconds average response time. So this is a successful deployment. But very quickly, I can see if it was an unsuccessful deployment and just roll it back.

Now, once everything hits production, once the application actually hits production, that's when you start to look at these dashboards to see if something breaks. You want to see if you get alerts coming in, if you have spikes on your metrics and things like that.

And one of the ways that you can use New Relic is you can create custom dashboards to track incidents. So we have an analytics platform where you go and write these queries that are similar to SQL, but a million times simpler to write. So the learning curve is an hour, two hours maximum until you're professional in that language. It's quite quick to learn.

And then you can create these dashboards that say, “I'm investigating a symptom here,” and you write a bunch of queries that then bring up the root cause of that issue. So you keep adding symptoms, and then you find out what the root cause is. So here you have, for example, a count of those incidents, like you have an incident count by escalation policy and so on, all of these custom dashboards. And that also then gives you the back-end data to help you manage a very large and complex environment.

And we use service maps there on top for that, where we have an end-to-end business application. This is quite a simple one because I didn't have a lot of space on the slide. But these can be quite complex, and you have all of a sudden an overview of how all of these microservices talk to each other.

So you can then understand, do I need to optimize them? Do I need to make them a bit more efficient? Do I need to change the architecture and join a few microservices or break a very heavy microservice into sub-microservices? So all of these are decisions you take with real data from a production environment.

And then we show overviews on a global perspective of page load time. Where do I have the slowest pages? So that if I see that I have an application being served centrally, and if it's really fast in the UK and really slow in Australia, then the problem is not on my side, it's on the Australia side. Something's wrong with the network or with the ISPs. We then give enough data to troubleshoot what's actually happening on that side of the world.

Another thing we launched recently is our health maps. So once you have the applications being monitored, you also then have the operating system or the host of that application being monitored. So you have a lot of metrics like CPU, memory, and so on. And this is what we came up with to bring all of these metrics together.

So here you have a view of all of your microservices with the hosts inside. So you get to see this inventory service here, I've got four hosts inside. In this case, it's four containers. But if one of them goes red because the CPU or memory is being exhausted, then that's going to tell you what the problem is.

So from here, what you then do is, I found a symptom, you start an incident tracking and you say, so Samita here, for example, she went and said, “Okay, I found the error rate spiked up. Let's investigate this.” So she opened an incident, passed it on to the team, and then Andrew came in and said, “Well, actually, I found that the error rate coincides with my application response time, and this seems to come from the request queuing.” So then he adds that chart there.

Samita then came back and said, “Well, actually, I found where that is coming from, and it's actually coming from recent events method in the event controller.” And then someone went in and drilled down a bit more and said, “Well, that's actually coming from setup event list.” So this is several teams collaborating to bring the incidents together so you can have an investigation process happening at the same time.

Any questions so far? Nope. Everybody had lunch, so everybody's feeling a bit heavy, right? Cool.

So what you can also do is, there's a lot of data we collect, right? There's a lot of transactions on the back end, all of the pages in the front end. We don't sample the data. Everything is collected and stored, and you can go and analyze it later, which is a big difference, right? You have all the data there available to you.

This means that you can have a very good view of how the deep-dive technical things are doing, what's the performance of them, what's the number of errors there. But also you can start to bubble up things to surface, like the revenue, the number of orders processed. And this gives you the ability to create dashboards for a DBA, for a developer, or to go to your CEO and show a dashboard that says, “This is the revenue we're making now, and this is the availability or the performance of the e-commerce application we have.” So you put all of those metrics together.

So the ability we have to create dashboards really gives you access to literally all the data you have that's being collected, which means you can create local KPIs, global KPIs, pretty much anything you want.

So all in all, we're looking at a process where New Relic adds value to every one of those stages and kind of brings everything together. So the goal here is to accelerate the rate at which customers do code, accelerate the rate that you go and deploy code into production. And the way we do that is by bringing visibility onto each one of those steps. We give you enough data to then make decisions on how to alter the design of an application. And then all of that gives you access as well to a lot of different KPIs that you can flesh out to different business units.

So now I want to show you a couple of dashboards and how easy it is to create them.

So this is an example of a dashboard that we made earlier. And basically what this is showing is, this is an e-commerce application, and it's a business-level view. So in this business-level view, I want to see the user trends that I have. So how are my users doing? And I can see that in the last 24 hours, I had pretty much the wave of users that I had the previous day, slightly less users today than I had yesterday.

I can then look at revenue and see how's my revenue progressing over the last 24 hours. The timelines that I have there, they're just examples. Most of my customers, what they do is if they're selling stuff on the website, they would look at what's the orders processed the last seven days on a rolling window compared to the previous seven days.

We have Ryanair, for example, every time you book a seat, that's tracked that the seat was booked, and then there's a chart that shows the last seven days of seats booked compared to the previous seven days. And if they see that the current number is lower than last week's current number, then they have some sort of an issue there.
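The rolling comparison described here is just two adjacent seven-day sums. A minimal sketch, with invented daily counts rather than real booking data:

```python
def rolling_comparison(daily_counts, window=7):
    """Compare the latest `window` days with the `window` days before them."""
    if len(daily_counts) < 2 * window:
        raise ValueError("need at least two full windows of data")
    current = sum(daily_counts[-window:])
    previous = sum(daily_counts[-2 * window:-window])
    return current, previous, current < previous


# Made-up daily counts of seats booked, oldest first.
seats_booked = [
    120, 130, 125, 140, 150, 160, 155,  # previous seven days
    110, 115, 120, 118, 130, 135, 128,  # latest seven days
]
current, previous, needs_attention = rolling_comparison(seats_booked)
```

When `needs_attention` is true, the current window is behind the previous one, which is the cue to look for either a technical cause or a commercial one.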

The issue could be technical, right? The response time goes high. Or it could be that users just need to be reminded that Ryanair is really cool and they need an email with a voucher to go and buy cheaper tickets. But it allows them to action things, right? Before the users even call and say there's an issue, they get an email with a voucher or someone goes and actually fixes the problem. So you start to become a bit more proactive in how you handle the issues rather than waiting for someone to pick up the phone and have the time to complain.

Together with that, we then can track things like custom journeys or critical journeys, like going to the homepage, logging in, browsing items, and purchasing. So we know we have a conversion rate of 8%, quite a low conversion rate. We see that users are dropping off between home and login. We then have the home duration here of how many pages are above three seconds for the last day. So 35,000. So that may indicate that users are not logging in because the homepage is too slow to load. But you have enough data to go in and troubleshoot that as well.

One of the funnels that's typically seen, and that customers are able to implement, is the checkout conversion funnel. Because a lot of users just go on an e-commerce website to browse and window-shop. They don't actually end up buying anything or even logging in. However, if a user clicks on checkout, you want to make sure you have a conversion rate of 100% from the point that they click checkout until the point that they actually buy stuff.

So you can put a checkout conversion funnel there that shows you each one of the steps of checkout, how many users are progressing through to the next one. And then, because you collect all of that, you can actually go and see which users didn't make it, which ones failed to go to the next page. You can then actually call them and say, “Well, you had issues. We'll send you a voucher for 10%, so you can go and buy again.” You can do things like that if you collect usernames.
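A checkout funnel of this kind can be sketched as set arithmetic over the users who reached each step. The step names and user IDs below are hypothetical, and identifying individual users assumes usernames are actually being collected.

```python
FUNNEL_STEPS = ["checkout", "shipping", "payment", "confirmation"]

# Hypothetical data: the set of users seen at each step.
reached = {
    "checkout": {"u1", "u2", "u3", "u4"},
    "shipping": {"u1", "u2", "u3"},
    "payment": {"u1", "u2"},
    "confirmation": {"u1", "u2"},
}


def funnel_report(steps, users_by_step):
    """For each pair of consecutive steps, return the conversion rate and who dropped."""
    rows = []
    for step, next_step in zip(steps, steps[1:]):
        entered = users_by_step[step]
        progressed = entered & users_by_step[next_step]
        dropped = sorted(entered - progressed)
        rows.append((step, next_step, len(progressed) / len(entered), dropped))
    return rows


funnel_rows = funnel_report(FUNNEL_STEPS, reached)
overall_conversion = len(reached["confirmation"] & reached["checkout"]) / len(reached["checkout"])
```

The `dropped` list in each row is the set of users you could follow up with, for example with the 10% voucher mentioned above.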

A lot of the data that we collect from the front end is giving us an indication of performance, of availability by city, by browser as well, by operating system. So you can bring all of those metrics together again. So that's going to give you a view of we have the response time spiking up. Is it because someone is using a really old browser that's not supported? Or is it because a specific ISP is having issues? Is it because the geographical location that they're accessing the page from is just not a good geographical location, where there's just poor network performance?

So all of those metrics are there for you to troubleshoot. And a lot of the times, there's things you can't really fix. If an ISP is slow, there's nothing you can do, but you avoid spending time investigating something that you can't fix. So all the data is there to help you do that.

So now I'm going to create the dashboard very quickly. Any questions so far, by the way? No. Everyone with me? Awake, happy? Cool. Awesome.

So let's create a dashboard. Now, one of the things that customers use our data for is, randomly, they have this idea, I want to go and show a dashboard to my CTO, or my marketing team is asking me a question. So very quickly, we put a dashboard together.

One of my colleagues was on-site at John Lewis on Black Friday, and the marketing team asked them, “Well, we have Black Friday running at the same time as Price Match. I wonder which campaign is more successful.” So in about five minutes, we put a dashboard together that showed how many people clicked on the Black Friday banner versus how many people clicked on the Price Match banner.

And he saw that there were about 350,000 people clicking on the Black Friday one and about 50,000 on the Price Match, so seven times less. But when he showed the conversion funnel, they were able to see that actually Price Match had seven times fewer clicks but had a much higher conversion rate, which makes sense, because if you go on Black Friday, you do window-shopping, so you're not necessarily going to buy something. But if you go and see the Price Match, you're very likely to buy something.

So this showed the marketing team that they're very wise in running two campaigns at the same time, and also that even though Black Friday is a big thing, they should still keep the Price Match campaign, definitely.

So how do we create dashboards here? Let me zoom in a bit more. Can everyone see that? Cool.

So the way we create dashboards—and who's technical here, by the way? Okay. So apologies to the others. Yeah, you were technical.

So the way we create these dashboards, we have a query language, and the training that we give on these (not that you actually need training, but for customers that want training, I go on-site) takes me about two hours to take them from not knowing anything about this until they can write the most complex queries in this language. It's pretty quick. Two hours is the learning curve.

So if I'm putting something together in terms of looking at my applications, I would typically start with looking at performance. So I go, select an average duration from page view since one week ago, and then time series means it's a query like this. So I also want to then say, well, let's compare it to the previous week. So compare with one week ago, time series. And what I have now is a weekly performance comparison, which is pretty cool.

I then want to say, well, performance is good, but I want to match that with throughput, so with the number of requests coming in. So let's see: select a count star from transactions since one week ago. So also compare it with the previous week. And this gives me the throughput of the application. So weekly throughput comparison.

Now, what's also interesting to see is the number of errors that I have in the system and if they affected performance or not. So let's see: count star from transaction error since one week ago, compare with one week ago, time series. And I have three nice charts here. So this is the weekly error comparison.

So let's make this a bit prettier.

But then the thing is, I have all sorts of business applications here. It's not just one application that I have, or I have a bunch of microservices. So really, I need a way to filter all of those microservices. So let's add a table here. So select count star from transaction, but then group by application name since one week ago. And this gives me a really nice table. So then let's make sure we have all of them. And this is our busiest applications. Let's call it that: busiest applications.

And we add this like this over here. And I have my list of applications.

Then another thing which would be interesting is let's bring in the user side and the business side of things. So let's see how many sessions I have of users. This is my number of requests. So count star session. What I have now is my active users, because I'm looking at the past five minutes.

Because I have e-commerce apps here that do revenue generation, let's have a look at how much money we're generating here. So let's see. Let's do a sum of the item price. So every time an order is made, we're logging the item price effectively. So since one week ago, compare with one week ago. So let's see the revenue that I've been making this week. So this is the revenue comparison. So weekly revenue comparison.

Let's put this here. And now let's look at the trend of revenue. So I have a sum of, again, item price, transaction since one week, compare with one week ago. So now I have a weekly revenue trend.
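Written out, the queries dictated during the demo would look roughly like the NRQL strings below, held here as Python strings keyed by the chart titles used on stage. This is a reconstruction from speech, not the exact text typed on screen: the event types `PageView`, `Transaction`, and `TransactionError` and the `appName` attribute are assumed to be the standard ones, `FACET` is assumed for the spoken "group by", and `itemPrice` is a hypothetical custom attribute name.

```python
def weekly_comparison(select, event_type, timeseries=True):
    """Assemble an NRQL week-over-week query string."""
    query = f"SELECT {select} FROM {event_type} SINCE 1 week ago COMPARE WITH 1 week ago"
    if timeseries:
        query += " TIMESERIES"
    return query


# Chart title -> query text. `itemPrice` is a hypothetical custom attribute.
dashboard_queries = {
    "Weekly performance comparison": weekly_comparison("average(duration)", "PageView"),
    "Weekly throughput comparison": weekly_comparison("count(*)", "Transaction"),
    "Weekly error comparison": weekly_comparison("count(*)", "TransactionError"),
    "Busiest applications": "SELECT count(*) FROM Transaction FACET appName SINCE 1 week ago",
    "Weekly revenue comparison": weekly_comparison(
        "sum(itemPrice)", "Transaction", timeseries=False
    ),
    "Weekly revenue trend": weekly_comparison("sum(itemPrice)", "Transaction"),
}

for title, query in dashboard_queries.items():
    print(f"{title}: {query}")
```

The only difference between the revenue comparison and the revenue trend is the trailing `TIMESERIES`, which turns a single pair of totals into a line chart.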

But what would be really cool is to be able to filter by these applications, because I have a bunch of applications here, and I don't want to see the overview of them all the time. I want to be able to filter through specific applications. So let's create a filter here. Link.

And I have my high-level dashboard, which I put together very quickly. Then we have a view of the page load times compared to the previous week. I have throughput, which tells me if I have a peak in traffic, if I hit a Black Friday or a Christmas and people are doing more shopping, then I would see an increase there.

And typically, if you have an increase on—if you have one of this and that has the same line, then that means you have scalability issues. So very quickly, I'm able to see that my system is scaling more or less okay because I still have the waves there in performance.

I was on-site at a customer of mine this Black Friday, and we were able to compare throughput against performance. And regardless of how high the traffic trend went, the performance was still the same. Response time was exactly the same. So they built their systems really well this year. And we did a lot of work with them since August last year to make sure that by the time they reached production, which was the end of August, that the applications were built for scaling.

At the same time, we have error comparison. We have a bunch of applications here, the number of active users. If I wanted them filtered by application, I go and say, let's look at the order service. How much money am I making there? And now I have a view of the throughput and error comparison just for the order service application. Here's the revenue that microservice is bringing there.

And I did this in what, two minutes? Three minutes? About two or three minutes. There you go.

Q&A

Neil MacGowan: Any questions? We're at the end of the session in terms of the presentation and the demonstration, which I think was fabulous by Ricardo. Any questions? We've got just a couple of minutes for questions, if anybody has anything they wish to ask.

Q: From a technical standpoint, how much effort does it take and how invasive is it in your code for microservices to know about each other? This one is communicating with that one, or to know about the content of shopping baskets and stuff like that. How much do I have to do as a developer to get that information into your system?

A: So a really good question. There's two questions there. The first question is, what is the performance overhead of monitoring these microservices, I guess? The second is, how much effort is required, first, to implement New Relic, and second, to start bringing in things like revenue and business-level attributes?

So in terms of overhead, it really depends on your application, but these agents are built to run in heavily loaded production environments. Typically, it's around 2% CPU or memory. But if you have a very chatty application, it could be a bit more. If it's less chatty, it could be a bit less. Depends on how your code is written. If you have hundreds of thousands of method calls per transaction, there's more data being collected.

But the thing is, you can fine-tune the data that's collected. You can say, “I want to ignore all of this section of code,” be it package, class, method, whatever, or, “I want to add more detail on this other level of coding or section of code.”

In terms of implementing it, that's one of the best things in New Relic, which is you don't have an on-prem component you have to install. You don't need to install a server to manage the data. You don't need to install a database to store the data. You install an agent, which takes you about two minutes. That agent communicates with our platform and just pumps in all the data there. So almost no effort here. You can also automate all of that, which is what I was speaking about a few slides ago.

Now, the data that gets collected out of the box is all technical data, because we know what a transaction is. We know what a method is. We don't know if the application is an e-commerce app or if it's an agent or app. And that's where development comes in.

So until now, development doesn't have to do anything. You can put this in production without even telling development if you wanted to. But when you start talking about capturing revenue, then the effort is, as a developer, you would know where in the code is an order placed, like properly placed and confirmed. In that place of the code, you have the information on what's the value of the order, what are the items in the shopping basket, the categories of the items, and a bunch of other attributes.

For each one of the attributes, you call a one-liner which says `NewRelic.agent.API.setCustomAttribute`. Then you pass a name-value pair, which means you pass in, this is the order value and this is the actual value here, the numerical value. And that's it. So it's one line per attribute you want to store. You can also create your own instance, your own map with name-value pairs, and just pass the map to us.
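The "one line per attribute" pattern can be sketched without the agent by passing in whatever function records a name-value pair. Everything here is an assumption for illustration: the attribute names, the order shape, and `set_attribute`, which stands in for the agent call mentioned above and is replaced by a plain dictionary so the example runs on its own.

```python
def order_attributes(order):
    """Flatten a confirmed order into name-value pairs, one per custom attribute."""
    items = order["items"]
    return {
        "orderValue": sum(item["price"] * item["quantity"] for item in items),
        "itemCount": sum(item["quantity"] for item in items),
        "categories": ",".join(sorted({item["category"] for item in items})),
    }


def record_order(order, set_attribute):
    """Call set_attribute(name, value) once per attribute, where the order is confirmed."""
    for name, value in order_attributes(order).items():
        set_attribute(name, value)


# Stand-in for the agent: collect the pairs in a dict instead of sending them.
recorded = {}
order = {
    "items": [
        {"price": 20.0, "quantity": 2, "category": "books"},
        {"price": 5.5, "quantity": 1, "category": "games"},
    ]
}
record_order(order, recorded.__setitem__)
```

Building the whole map first, as `order_attributes` does, also matches the second option described: passing a map of name-value pairs in one call.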

Any more questions?

Q: Maybe one part of my question was, how do microservices know about each other?

A: Oh, sorry. I forgot. So the way that they communicate with each other is every time that you load a microservice into memory, our agent's going to intercept that, and then it's going to see things like, “Oh, this code here is really interesting. I'm going to monitor this method.”

But then it does the other things, like, for example, for Java, this is a JDBC call, so it's making a database call. So it tracks there's a database call. It tracks the SQL as well. And similarly to that, if that microservice is making an outbound call to another microservice, it recognizes most of the protocols that we have out there. If you do a bespoke protocol, it's not going to recognize it out of the box, right?

But if it's something like JMS or RMI or something like that, then it knows that this is about to make an outbound call, and it injects a bit of data there. It injects a unique identifier. On the other side, if there's an agent monitoring the other microservice, it sees that there's an inbound call being made, and it checks, do I have any New Relic tags there? And if it does, it links the two.

And that goes to the aggregate level of data, so you can see that this microservice is calling the other one. That's how the maps are created. So the dependencies between the maps, they're built automatically. You don't have to configure anything. Just put the agent on all of the microservices, and it starts to generate dependencies as well.
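The linking mechanism described in this answer can be illustrated with a toy version: the caller side tags the outgoing request with a unique identifier, and the callee side looks for the tag and records an edge in the dependency map. The header names and functions are hypothetical and do not reflect the agent's real wire format.

```python
import uuid

TRACE_HEADER = "X-Example-Trace-Id"   # hypothetical header names
CALLER_HEADER = "X-Example-Caller"


def outbound(service, headers):
    """Caller side: tag the outgoing request, reusing an existing trace ID if present."""
    tagged = dict(headers)
    tagged.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    tagged[CALLER_HEADER] = service
    return tagged


def inbound(service, headers, edges):
    """Callee side: if the tag is present, link caller and callee in the map."""
    trace_id = headers.get(TRACE_HEADER)
    if trace_id is None:
        return None  # untagged call (for example a bespoke protocol): nothing to link
    edges.add((headers[CALLER_HEADER], service))
    return trace_id


edges = set()
first_hop = outbound("web", {})
trace_one = inbound("orders", first_hop, edges)
second_hop = outbound("orders", {TRACE_HEADER: trace_one})
trace_two = inbound("inventory", second_hop, edges)
untagged = inbound("billing", {}, edges)
```

Because the same identifier travels through both hops, the two segments can be stitched into one trace, and the accumulated `edges` are what a service map would draw.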

But what it also means is, when I was showing a while ago the method-level detail of the transaction, if a transaction starts in one microservice and continues in the other one, and they're both slow, you can follow that call through. You can follow the trace through to all of the microservices and see the code execution in each node as well.

Q: Thank you.

Neil MacGowan: One more question, because I appreciate we're delaying you from other sessions. I'm available after as well. Yeah, well, come to the stand if you want some more detail, but apologies, the hand up there was first.

Q: Actually, just following on from that. So that sounds like some technical tracking across microservices, but what about the application level? Is there a way to say that this business entity that may have started here is the same one that is in existence in this other microservice? Is that possible to connect those two?

A: Yeah, it is. So that's already in the UI. So for us, we're looking at the microservice level. For us, we have an application which is a JVM, or a cluster of JVMs, or a group of containers of one JVM. And we track that technically by saying that the agent then has the same name for all of those.

But then that could be part of several business applications, and that's when the service maps come in. You can see which microservices are talking to each other, but ideally, you want to have an end-to-end map that looks at a business application. So those ones, you can create them, but it takes about five seconds to create because you're just dragging the right components.

On top of that, dashboards like these, that's where you would come and create dashboards specifically for every business, because you want to bring together data like the performance of the application, the availability, throughput, the revenue the application is generating, the users there, trends that you have. And they use dashboards like the one I created to surface all of that from a business application standpoint.

Neil MacGowan: All right. Thank you, guys, for your attention. We appreciate you coming and listening to what we have to say. And thank you, Ricardo, for your good demonstration.

Ricardo Santos: Yeah. Thank you very much. Thanks.