A Scalable, More Continuous Future for Performance and DevOps

US 2021

Director of Customer Engineering, Tricentis NeoLoad · Tricentis

There’s no question that enterprises today want to further integrate continuous performance testing into automated pipelines. However, many are finding it difficult to reconcile the mismatched clock-speed of testing with today’s accelerated pace of development/delivery. You’ll learn, among other things, the key steps to continuous performance testing in DevOps:

- Gather the right metrics to assess your gaps
- Prioritize, then systematize across your application portfolio
- Plan for acceleration across the whole delivery cycle
- Design concrete measurements with the end in mind
- Pick the right targets to automate
- Make scripting easy for multiple teams
- Develop performance pipelines
- Use dynamic infrastructure for test environments
- Ensure trustworthy go/no-go decisions

This session is presented by Tricentis.

Full transcript

Paul Bruce

Welcome, everybody. I have the extreme pleasure to talk to you today about a passion of mine: more continuous futures for performance and DevOps.

Who am I? I'm Paul Bruce. I'm one of the directors of customer engineering for NeoLoad, now part of the Tricentis family. I also chair some events, All Day DevOps, DevSecOps Days, and some of those things around observability and OpenTelemetry. I'm a huge fan of my local Boston DevOps community, and I've been one of the core organizers of DevOpsDays Boston for the past couple of years. And in terms of DevOps in high-compliance organizations, the IEEE just released a standard that I and a bunch of other people have been working on for a while, IEEE 2675. We're hoping to get that adopted as an ISO international standard soon as well.

At least for today, the scope is performance in a continuous context, but I want to tie into some of the other transformation themes of the event. I'll describe what I mean when I say more continuous. We'll look at how some of the customers that I've worked with are doing that, what approaches and framings are useful to do this stuff, and briefly, how the Tricentis platform might be able to help.

Let's start with transformation. Ooh, all the rage these days, right? I say that jokingly, but in reality, transformation is the new norm. It's happening in every organization. It's happening in almost every aspect of every organization. It's what's driving a lot of IT changes. It really is the new norm.

But like every large system, various parts and pockets abound in these organizations, so it's rare you see major transformations at a day-to-day level. Sometimes it's as unfamiliar to us as, say, the universe of the very tiny, the quantum realm. If you don't mind, I love particle physics. It's just food for my nerd brain. So if I can make a quick correlation here: at the quantum level, there are no exacts. There's no perfect measurement, no absolutes. It's all statistical distributions. Hey, like performance engineering, right?

Subatomic processes are just like the teams and transformations at your organization. Just when you think you know what's going to happen and that you have a standard model pictured in your head, something changes. We learn something new, and we have to reconsider how our reality actually works.

So when I hear people say things like, should we change or not? We can't afford to change. This is going to take too long. Or in the very worst circumstances, let's just keep doing business as usual. Eh. Consider that the transformation ship has already sailed, and you really don't want to be left on some distant island. You cannot afford to do that, not at an organizational level for competitive industry reasons, and not even at a personal or team level, because slow doesn't work for anyone anymore. You can't afford to be on the fence about change.

Now, look, I'm a performance and reliability nerd, right? That's my practice. That's my scope. That's where we're going to head with the rest of this talk. I also listen to hundreds of other practitioners and transformation leaders each year from every conceivable size, geography, and cultural space we can think of. What they say can be summed up as what I call the performance imperative.

It is imperative that your systems (your software, your hardware, and your peopleware) are high performance and can scale. For consumers, those users that produce revenue and pay your paychecks, if something is slow, it's equivalent to broken. They don't have tolerance for downtime or unreliable practices.

As we've seen in the past year, massive digital expansions were driven by remote work. Unemployment processing, well, that was a mess. Overwhelming gaps in online education and healthcare management abound. We don't need this stuff breaking. And remember, if it works on your machine but not on mine and tens of thousands of other people's, it's broken.

For business management and leadership, this consumer expectation translates into an authoritarian demand that these systems be at their peak performance by default, not due to heroics. Superheroes in IT are single points of failure and are toxic to transformation and DevOps culture. There, I said it.

But misalignments, like IT prioritizing desktop experiences over mobile performance, still happen. It's still a big problem, since more than half of all internet traffic is now mobile-driven. And as Conway's Law suggests, our systems are an outcome of our teams, not just in terms of communication patterns, but also of functional and dysfunctional behaviors. The responsibility lies everywhere, not just with QA or subject matter experts.

For all of us in IT who actually are not superheroes, but just trying to row in the same direction, it becomes essential and urgent to build this demand for a high degree of performance into our systems and processes, because when somebody else's stuff goes down, we need to recover fast. It's not just about what we're building and shipping ourselves. From a performance perspective, you are only as strong as the weakest link in your chain. Multi-regionality in cloud providers exists for a reason.

Cloud providers have this problem too. AWS, Azure, and GCP zones go down. Everyone has downtime, and because we depend on those providers, we don't have the luxury of waiting to see whether an outage hits us too. The point is, everybody has the performance imperative.

But what are organizations doing about it? We've seen decades of performance being treated as an afterthought, as a silo. In my mind, this is for two reasons. One, it wasn't factored into planning because it was long and hard, so it was left to the end. Two, because it was left to the end, very few people in the organization developed the expertise to properly apply performance engineering practices, let alone had time to grow that expertise in non-experts. So performance stayed siloed because it was siloed. Oh.

By now, you shouldn't be shocked when I tell you I'm a strong believer in the principle that if something has to be done and it's hard, we should bring the pain forward. Bring it on. Get so good at it that it's no longer painful, but simply just a part of our approach.

This old siloed model has been disappearing rapidly in the past five years for exactly these reasons. The implementation of the new approach I see most often in at-scale enterprise performance engineering teams is consultative. Not external consultants, internal consultants. Advocates. Subject matter expertise, or SME, is applied intelligently across many groups. Knowledge is transferred, and practices are documented, yes, but also automated. And there's a concerted drive to reduce cycle time and toil, especially in testing processes.

All this is really great. It really is. The sad news is that it only scales so far before we have to start pushing it further. The third approach is to build on those consultative wins and let various teams use self-service processes and platforms for the easy stuff, in order to develop awareness and proficiency. This buys back time for SMEs to assist and coach performance practices even more intelligently, making sure that the proper guardrails are built into these processes. This is how the next phase of performance engineering is happening right now.

In all my work with Fortune 250-scale global and international organizations across all industries, I see not only an escalation of the performance imperative but an elevation of performance and reliability to key strategic elements of IT, business, and user experience. These organizations understand how important it is that this not be left to the end, and that driving toward a more continuous process means bringing good performance practices along with it.

Organizations that are successfully crossing chasms like this typically have a blend of Agile and DevOps. They have a true performance practice with a capital P, run by some consultative experts but expanding out to self-service models. This is so they can match the clock speeds of development cycles with right-fit performance feedback loops, providing even more value to those teams as they deliver products in smaller, more frequent batches to their consumers.

The goal here, though, is not self-service. Self-service is just a tactic in the broader approach of moving toward a more continuous and scalable model for software delivery, which is itself a tactic for harmonizing the broader performance imperative with transformational change.

With this in mind, right-fitting performance and reliability into that more continuous model has been a key focus of mine and the NeoLoad team, along with Tricentis, for years now. There's a lot that goes into performance, and it's not just testing when you ship new code. How about rolling out new infrastructure? You forklift some components over to a new cloud region or something. What's your observability over those changes? How is continuous monitoring factored in, not just to production, but to lower environments as well?

How do you pragmatically verify that your systems are resilient to operational fluctuations, like when pods start crashing because of ill-informed timeout defaults or unexpected auto-scaling phenomena? Maybe you should be injecting faults into your continuous testing cycles to address these emergent behaviors too.
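To make the fault-injection idea a little more concrete, here is a minimal Python sketch, assuming a test harness that calls service functions directly. The `inject_faults` decorator, its `failure_rate` and `added_latency_s` parameters, `fetch_profile`, and `call_with_retry` are all hypothetical names for illustration, not part of NeoLoad or any Tricentis API. A seeded random generator keeps the injected faults reproducible from one pipeline run to the next.

```python
import functools
import random
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Hypothetical exception raised in place of a real dependency failure."""


def inject_faults(failure_rate: float, added_latency_s: float, seed: int = 42):
    """Hypothetical helper: slow every call and make a fraction of calls fail.

    failure_rate is the probability (0..1) that a call raises InjectedFault;
    added_latency_s is extra delay added before every call.
    """
    rng = random.Random(seed)  # seeded so pipeline runs are repeatable

    def decorator(fn: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs) -> T:
            time.sleep(added_latency_s)  # simulate a slow network hop
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(failure_rate=0.2, added_latency_s=0.001)
def fetch_profile(user_id: int) -> dict:
    # Hypothetical stand-in for a call to a downstream service.
    return {"id": user_id}


def call_with_retry(fn: Callable[[], T], attempts: int = 3) -> T:
    """The resilience behavior under test: retry a few times before giving up."""
    for attempt in range(attempts):
        try:
            return fn()
        except InjectedFault:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
    raise ValueError("attempts must be at least 1")


def run_trial(calls: int = 100) -> Tuple[int, int]:
    """Return (succeeded, failed_after_retries) over a batch of calls."""
    ok = failed = 0
    for uid in range(calls):
        try:
            call_with_retry(lambda: fetch_profile(uid))
            ok += 1
        except InjectedFault:
            failed += 1
    return ok, failed


if __name__ == "__main__":
    ok, failed = run_trial()
    print(f"succeeded={ok} failed_after_retries={failed}")
```

The retry policy here is deliberately simple; the point is that the fault rate and latency become explicit, versionable test parameters rather than something you only discover in production.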

The point is, it's not about testing all the things. That's a myth, an anti-pattern. It's about testing the right things and providing valuable feedback as fast as possible. It's about thinking holistically about what needs to be in place daily, not just weekly or monthly. How can we make sure that building repeatable processes on these reliable platforms gets us to the point where we can scale, so that running a test isn't an infrastructure budgeting and planning firefight every single time?

Continuous performance is a key component to scaling, yes, your systems, but also your knowledge, your velocity, your learning, and your ability to do this proactively with less and less waste, toil, and risk. Like Steve Jobs said when he rolled out the iPhone, are you getting it? The right combination of capabilities multiplies your positive impact at a day-to-day level.

But hold up. What's so different about this, though? Haven't people been telling us to go faster and automate all the things for years now? Let's take a breath, especially me. Let's step back for a moment and ask why. Why is always a good question for good engineers to ask.

This maniacal demand for automation isn't just a reaction to increasing business and product velocity. It's a response to the complexity of our systems and to how both our systems and our delivery processes have changed. How have they changed? Apps and networks used to be simple. Now they're highly distributed and componentized. And yes, of course, there are still plenty of monoliths, just as complex, that those new microservices and systems depend on. There are a million ways to do everything and few ways to do those things right. We live in an increasingly complex world, and that complexity increases every year.

Product delivery cycles have compressed to a point where there's a pipeline for everything now, or at least people would like to think that. How does performance testing fit into that model? If it doesn't, what happens? When you ship a new change that's supposed to make users happy, without the proper feedback on performance, happy faces turn frowny very quickly. Oops, your database scaling set isn't big enough for that new increase in query throughput. Now we have a performance and an availability problem because some of those front ends can't connect to the database because the databases are saturated.

Now comes the reactionary response. Let's change some code and deploy again, which usually has to be fast-tracked through those fancy pipelines you think you have. Again, lacking proper feedback, your change to that change may not actually solve the problem. It'll probably make it worse. See, without right-fitting feedback on performance or reliability into your automated delivery processes, you are basically begging for this kind of failure.

We're not talking about running all the big, fat load tests all the time. Nothing gets into that process easily if it doesn't fit the process, and pipelines have time budgets too. That's why, when we ask ourselves how to approach performance and reliability in a continuous context, the answer is that it's going to take a little work to pick the right things. And not all of that work has to be done by test engineers, developers, or release engineers. The easier you make something, the more people can do it.
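One way to think about "picking the right things" under a pipeline time budget is as a selection problem. The sketch below is a hypothetical illustration, not a NeoLoad feature: `PerfCheck`, its fields, and `select_checks` are invented names, and the durations and priorities would come from your own teams. It uses a greedy value-per-minute heuristic, which is simple and predictable but not guaranteed optimal (the underlying problem is a knapsack problem).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PerfCheck:
    """Hypothetical description of one performance check a pipeline could run."""
    name: str
    est_minutes: float  # estimated wall-clock duration; assumed to be > 0
    value: int          # team-assigned priority: higher means more important feedback


def select_checks(checks: List[PerfCheck], budget_minutes: float) -> List[PerfCheck]:
    """Greedy heuristic: take the highest value-per-minute checks that still fit."""
    ranked = sorted(checks, key=lambda c: c.value / c.est_minutes, reverse=True)
    chosen: List[PerfCheck] = []
    used = 0.0
    for check in ranked:
        if used + check.est_minutes <= budget_minutes:
            chosen.append(check)
            used += check.est_minutes
    return chosen


if __name__ == "__main__":
    candidates = [
        PerfCheck("login-api-smoke", est_minutes=3, value=8),
        PerfCheck("checkout-api-baseline", est_minutes=10, value=9),
        PerfCheck("search-api-spike", est_minutes=15, value=6),
        PerfCheck("full-soak", est_minutes=240, value=10),
    ]
    # With a 20-minute budget this picks the smoke and baseline checks;
    # the soak test belongs in a scheduled, non-blocking run instead.
    print([c.name for c in select_checks(candidates, budget_minutes=20)])
```

The long soak test still matters; it just runs on a different clock, such as a nightly schedule, rather than blocking every commit.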

The who-does-what question is something that you and your teams need to figure out for your organization. Actively work out what works best in which pockets of engineering. But please provide a purpose and a vision that doesn't get us stuck in holy wars over what shift left is or what it means to properly do DevOps. What we can all agree on is that having the right feedback at the right time is what keeps us delivering on time without accruing architectural debt and unplanned work.

Performance is not a checkbox. But hold up: if asking performance questions is not part of your definition of done, performance likely isn't going to be tested, and therefore it isn't providing feedback for good release decisions.

For continuous performance feedback, we need quick sampling that's good enough in early cycles. It should not be hard for development or product teams to express their API details in a way that gives them feedback from within their own environments and tools. It should not be hard to produce testing artifacts that harmonize with more rigorous, more scalable testing processes when teams check those artifacts in and run them in an automated pipeline, the same as other tests.

Out of those pipelines comes a frequent stream of baselines on various environments, which may each have their own SLOs, raising the question of what the proper SLOs and metrics are. This proves that you are exercising the performance mindset, and it simplifies operational readiness for when you do transition to much larger environments.
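As a rough illustration of turning those baselines and SLOs into a go/no-go signal, here is a minimal Python sketch. It assumes you already have a list of response-time samples in milliseconds from a short pipeline run, plus a stored baseline p95. The function names, the 10% regression tolerance, and the choice of the nearest-rank percentile (one of several common percentile definitions) are all assumptions for illustration, not how NeoLoad computes its SLAs.

```python
import math
from typing import Sequence, Tuple


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (pct in (0, 100]) of a non-empty sample."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


def go_no_go(
    samples_ms: Sequence[float],
    slo_p95_ms: float,
    baseline_p95_ms: float,
    tolerance: float = 0.10,  # hypothetical: allow 10% drift from the baseline
) -> Tuple[bool, str]:
    """Gate a release on an absolute SLO and on regression versus the baseline."""
    p95 = percentile(samples_ms, 95)
    if p95 > slo_p95_ms:
        return False, f"p95 {p95:.0f} ms breaches SLO of {slo_p95_ms:.0f} ms"
    if p95 > baseline_p95_ms * (1 + tolerance):
        return False, (f"p95 {p95:.0f} ms regressed more than {tolerance:.0%} "
                       f"vs baseline {baseline_p95_ms:.0f} ms")
    return True, f"p95 {p95:.0f} ms within SLO and baseline tolerance"


if __name__ == "__main__":
    samples = list(range(100, 200))  # stand-in for 100 measured latencies
    print(go_no_go(samples, slo_p95_ms=250, baseline_p95_ms=180))
    print(go_no_go(samples, slo_p95_ms=250, baseline_p95_ms=170))
```

Checking both an absolute SLO and drift from the last baseline catches two different failures: a service that is simply too slow, and one that is still "within SLO" but quietly getting worse with each change.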

To take an analogy here, you don't ever expect to run a marathon with your shoelaces untied and zero training. Why would you ever expect high performance in your systems and in your teams, your peopleware, if you aren't exercising this continuous muscle?

Going back to how we scale this out, it has to be easy. You don't start day one training for that marathon running the entire thing either. You don't start with big things. You start with smaller goals and progress, and you make that progress to the point where you understand your capabilities and you can apply your efforts wisely.

If you want someone to get better at something, you start with easy and small wins: exercises and practices that anyone can pick up and run with; stuff that doesn't bust paradigms entirely, but moves it over time to a more continuous approach to improvement. This is how we do modern performance engineering. We build self-service processes and platforms that non-experts can use as guardrails to adopt a continuous mindset over performance and reliability.

We start with APIs and microservices, these new distributed systems (sometimes distributed monoliths) that now have even more network latency. At the same time, we extend our subject matter expertise to more complex situations. We give our teams speed and scale on performance engineering tasks so that they can be more continuous and transform safely. It all starts with making things easy.

And why do I say this? Because I've seen it work over and over and over again. I've been working with the NeoLoad team for over three years now, driving to meet this imperative. Our platform's easy to use, it's easy to learn in terms of scripting, it's easy to execute tests from anywhere in the world on your laptop, your browser, and your CI pipelines. Most importantly, it's easy to understand the outcomes of this testing.

There should be flexibility in all things, absolutely. But there definitely should be ways that are already thought through. I work with enterprise performance and automation engineers every day to right-fit what patterns they build and make available to their product teams so that it's easy to do, and it's easy for product teams to see the positive impact of.

From the Tricentis perspective, Tosca, NeoLoad, and qTest can accelerate those go/no-go decisions on release by having performance feedback in the right places at the right time. We're always happy to discuss this further with folks because everyone's slightly different. We're in slightly different places, different journeys, and right-fitting this stuff into your overall strategy takes collaboration. We're all on a journey. It's very cool for me when I get to see people, process, and technology working together and helping to grow better practices with organizations.

However, technology is just one part of the transformation. True transformation takes time, effort, and the right decisions. Decisions affect people, process, and technology. So this is really actually a three-body problem, an n-body problem. You can't change one without affecting the others. If you want to scale the transformation, you need to make sure that you're always considering these three aspects, like we've learned from the DevOps mindset. It takes effective communication, collaboration, and adaptation.

The good thing is over time, as these practices become part of your culture and you iterate, you get better. Never stop iterating, because transformation is a rolling stone, and it's always moving, and so should you. Once you start getting better at these performance practices, they lead to better continuous motion, which leads to faster transformation, and the positive impact compounds. Further transformation drives demand for better performance. So new practices and better continuous outcomes need to come out of that as well.

At the beginning, I mentioned I love particle physics. I also love gardening and permaculture. That's just my thing. The folks in the permaculture space have a saying: if you want a healthy tree now, the best time to plant one was 20 years ago. Or now. You should definitely start now. It's the same thing with this stuff. Start now. Never not be looking to improve. Keep on.

On that note, I greatly appreciate your time, and I really look forward to discussing the future of performance and reliability with everyone here.