Las Vegas 2020

Adobe’s DevOps Journey: Finding — and Measuring — Customer Happiness

Vice President of Adobe Cloud Engineering · Adobe

Adobe’s Digital Experience Operations has experienced more than a decade of extraordinary growth, both in volume (more than 8000% growth since 2009, with 377 trillion transactions in 2019) and complexity (nearly 3000 services, most of which are interconnected and interdependent).

With that incredible growth have come incredible challenges: how do we continue our mind-boggling (in size and sophistication) trajectory, while keeping individual customers happy? How, in short, do we scale?

Our answer? A shift from a “hero firefighter” culture to a “build awesome fire suppression systems” culture. In other words, a DevOps transformation.

This is the story of how the Adobe Cloud Engineering organization has embraced (and is still embracing!) a DevOps culture by leading from the roots and from the top, how we’re measuring our progress, where we’ve succeeded (and where we’re still working to succeed), and what we’ve learned along the way.

Full transcript

Brandon Pulsipher

Hello. It's great to be here, coming to you from the Adobe building here in Utah. I thought I would be with you live in Vegas, and then I thought I'd be with you from my living room, and then we realized that this beautiful building down the street was pretty empty, so we were able to take advantage and film this session from here. I look forward to being with you in Vegas next year, and I'm excited to be here today. I'm Brandon Pulsipher. I am with Adobe, and I'm excited to share about the Adobe Experience Cloud journey around DevOps today.

I'm happy to share a little bit about our DevOps journey and hopefully some things that are helpful to you. To give you a little context on my background: I studied computer science. I've been a QA engineer, a software developer, an IT sysadmin, and a network administrator. I've spent the last 25 years watching this hybrid of IT, operations, and software development come together. Especially over the last few years, to see the fusion of these worlds come together has been really exciting and a lot of fun, and it has been amazing to see how that can impact the experience for our employees, our customers, and everyone around us.

I've spent the last 10 years with Adobe leading our technical operations, which is our cloud operations and infrastructure organization. We've now transformed ourselves into a cloud engineering organization. You can see even in the way that we name things and think about things, it changes the way that we act and behave every day.

I wanted to first share a little bit of context. Most people are familiar with Adobe and our long history in the Creative Cloud space. You're probably familiar with Photoshop and Illustrator and these products that have become verbs in our life and help create the beautiful content around me. The Document Cloud is something most people know, and especially through this COVID time, as people are doing even more online and more digital signatures, it is a really fundamental part of our business.

I'm going to share today a little about our Adobe Experience Cloud. This started about 10 or 11 years ago in Adobe's entry into digital marketing and has really evolved into an exciting space around customer experience management and the ability to personalize and create a unique and personal experience for every consumer in every engagement. The awesome thing is it's built entirely in the cloud.

As I share a little bit of that journey, I want to ground you in some of our challenges and aspects that really led us here. If I look back at our software engineering and operations team, these have grown both organically, as we've built and developed new products internally, and through more than a dozen acquisitions. We have a variety of cultures, geographies, maturities, and companies that are very early startups, mature startups, and public companies. We've had to bring together this very diverse set of cultures and practices into our organization.

As we've been making that transformation, we've also been on this incredible growth journey. We now process more than a trillion customer-facing transactions per day. That's not back-end database queries; this is really customer-valuable, customer-facing transactions per day. Dealing with these two things at the same time has created some fun but unique challenges.

A lot of times we look at our products, or our customer looks at our offering, as this single box or single entity. This is the way our Experience Cloud solution often shows up to our customers. They love the vision. They want to accelerate their digital transformation, especially in COVID times when we've all had to move to doing more and more things online.

But if we know much about what's behind the scenes, it's never this simple. What looks like a simple app is actually a very complex set of technologies and services that have to work together. Some are very large, some are very small, and they're globally distributed. If we go even further down into the way this is built and works, we map out all the dependencies and the way everything has to interconnect and interoperate with each other. This creates a massive amount of complexity and really makes our DevOps challenge even harder.

Not only do we have this complexity of all these services that have to interoperate together, they're globally distributed across data centers, colos, and public cloud environments. We have a massive footprint with massive scale and massive complexity, and that sets up the challenge that we had.

As we brought these solutions together, we saw a lot of things that we liked, but we saw a few things that we didn't like. The symptoms of our service delivery started to show up in a few ways. We saw quality issues that we weren't pleased about. We saw ownership confusion: something goes wrong, is it the developer? Is it the code base? Is it the operational implementation or the rollout?

We started to click on that a little deeper and see that these were symptoms. We attacked some of the symptoms and saw some incremental progress, but what we really started to see were themes. We stepped back and said that what we were really looking at was a deeper lack of alignment between our operations teams and our engineering teams, and a lack of common goals.

That initiated our focus on DevOps. We said we were going to not just attack this incrementally and evolutionarily, but make a real revolutionary step forward in how we build and deliver our cloud solutions. For us, it really was about finding our why. If the illness is the lack of alignment, then the why became our cure. Finding motivating principles that all of the teams could align around, and that could become our rallying cry, was really exciting. For us, that became the customer experience. That became the center of everything we were doing.

While there are other great things that we do -- innovation, features, efficiency -- it came down to centrally identifying our why and rallying the team around that. Our hypothesis was that as we did that, we would see the right outcomes in our scale, in the collaboration of the teams, and certainly around reliability, security, and efficiency. I'm pleased to say that's played out in a lot of positive ways.

How did we go about this journey? First, we started with alignment around the principles. Every organization has people that are passionate about this. We had a handful of developers that wrote what they called the DevOps manifesto. A manifesto isn't necessarily motivational to an entire organization, but we took the concepts in that manifesto and turned them into a set of DevOps principles. We brought leaders together with the engineering champions and unified that into a common set of DevOps principles that we all agreed on.

From there, not only did we take that and publish it via PowerPoint or a document, but we put it into GitHub. We made it part of our code base. I think it's critical that we speak developer language when we're talking about DevOps.

The next step was to put the principles to work. We could have applied this across 200 teams and services, but that's a lot to take on and manage. We said, let's start, prove this out, and then show the teams the success and how to accomplish that within their organizations and services. We didn't want to pick only easy solutions that we knew would work. It's easier to take new services, like our Adobe Experience Platform, which we've built from the ground up and where we don't have cultural or code legacy concepts to battle. But we identified about a dozen services -- 13 actually -- including services that have been around for a long time with cultural legacy challenges, as well as new solutions. We wanted a diverse portfolio so we could know whether the principles and concepts were effective at any level.

Then we asked the teams to put some skin in the game. Everyone has to be aligned. We have to commit, pivot the responsibilities to match the principles, and hold everyone accountable to ensure we deliver a great customer experience.

In order to pivot and get everybody aligned, unified became our word: unified everything, a unified engineering approach. This is the money slide, because this is where things went from concepts and principles to results and the changes we had to apply.

We started with a unified engineering approach, and that meant unifying our on-call. No longer were we going to bring ops into the war room first if there was a problem and then figure out if we needed engineering and call them. When someone hits the big red button and we bring the teams in, our system automatically calls out to the engineering point and the operations point. They both come in and help solve the problem together. That has created value and insight for engineering teams to learn what operations had to go through, and for operations to start thinking more with an engineering mindset.
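
As an illustration that is not part of the talk, the unified on-call idea can be sketched in a few lines of Python: one lookup returns both the engineering and the operations contact, so neither role is paged ahead of the other. The `OnCallRoster` structure, the service name, and the contact names are hypothetical; the talk does not describe the actual paging system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnCallRoster:
    """Who is currently on call for one service (hypothetical structure)."""
    service: str
    engineering: str
    operations: str


def responders_for(service: str, rosters: list[OnCallRoster]) -> list[str]:
    """Return everyone to page for an incident on the given service."""
    for roster in rosters:
        if roster.service == service:
            # Both roles are paged together, rather than operations first.
            return [roster.engineering, roster.operations]
    raise KeyError(f"no on-call roster for service {service!r}")


# Hypothetical example data.
rosters = [OnCallRoster("checkout-api", engineering="eng-alice", operations="ops-bob")]
print(responders_for("checkout-api", rosters))
```

Raising an error for an unknown service is a deliberate choice in this sketch: a missing roster should be loud, not silently skipped.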

We unified our code, and we unified everything about our code. Not just the features, functionality, and application stack, but all the way down to the infrastructure, test and config models, automation, documentation -- everything needed to live in our code repository. Then we unified our backlog, which was another transformational change. We used to have several buckets for this work: features and innovation, operational improvements, cost efficiency, security. They were close, but slightly different.

It was powerful once we said we would unify all this into a single backlog and make prioritized decisions with our operations team, product management team, and engineering teams around the most important issues facing our customers. We saw interesting results. When we had events and outages, we'd always conducted root cause analysis and created action tickets. Typically teams would pick up the first few meaningful tickets and the rest would sit in Jira, which is the system we use, and may or may not get actioned over time. With the unified backlog approach, teams were excited about capturing and solving this. We saw our problem resolution ticket queues go up, because teams saw opportunities to solve things opportunistically. Maybe as they're working on a re-architecture or a feature, they could say, I can solve that problem. There was very little additional work to solve that while we went through it, but that visibility was powerful.
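
As an illustration that is not part of the talk, a unified backlog can be thought of as merging the former buckets into one list ordered by customer impact. The `Ticket` fields and the numeric `customer_impact` score are assumptions made for this sketch; the talk only says that prioritization was done jointly and that Jira was the tracking system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    title: str
    source: str           # the bucket the work used to live in
    customer_impact: int  # hypothetical score: higher means more important


def unified_backlog(*buckets: list[Ticket]) -> list[Ticket]:
    """Merge all buckets into one backlog, highest customer impact first."""
    merged = [ticket for bucket in buckets for ticket in bucket]
    # sorted() is stable, so tickets with equal impact keep their input order.
    return sorted(merged, key=lambda ticket: ticket.customer_impact, reverse=True)


# Hypothetical example data.
features = [Ticket("New dashboard", "features", 5)]
operations = [Ticket("Fix recurring failover bug", "operations", 9)]
security = [Ticket("Rotate service credentials", "security", 7)]

for ticket in unified_backlog(features, operations, security):
    print(ticket.customer_impact, ticket.source, ticket.title)
```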

Some of the tools we had to put in place were part of that people, process, and technology space. We started with service level targets. We have to know what our target is. If we don't know what we're trying to reach, we're never going to get there. If we don't know where our destination is, how are we ever going to know when we arrive? Service level targets are fundamental. What does good performance look like, not just in nines, but holistically?

Once we've defined that, we have to measure it. We put instrumentation and indicators in place via the SLI approach. Once we had measurement in place, we could define what was acceptable, and the shortfall from perfect that we were willing to accept became our error budget. Once we cross that budget, we have to change our behavior and our actions, so we have to measure, report, and constantly inspect this.
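
As an illustration that is not part of the talk, here is a minimal sketch of how a target, an indicator, and an error budget relate, assuming the indicator is a simple ratio of good events to total events. The function names and the example numbers are hypothetical.

```python
def sli(good_events: int, total_events: int) -> float:
    """Measured indicator: the fraction of events that met the target experience."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(slt: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent.

    slt is the service level target as a fraction, for example 0.999.
    """
    if not 0.0 < slt < 1.0:
        raise ValueError("slt must be strictly between 0 and 1")
    if total_events <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slt) * total_events  # the budget, in events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Hypothetical numbers: a 99.9% target with 500 bad events out of 1,000,000.
print(sli(999_500, 1_000_000))
print(error_budget_remaining(0.999, 999_500, 1_000_000))
```

With these numbers the budget is 1,000 bad events and 500 have been used, so roughly half of the budget remains.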

The SLT is king. It defines the customer experience. If we haven't defined the experience we're trying to deliver, we're going to argue all day about what's good enough. It becomes subjective and debatable. We're going to have some customers happy and some that aren't. This takes discipline, engagement from the entire product organization and the business, and discipline and focus to stick to it. It will test leaders in terms of commitment to say, this is what quality looks like, and this is the experience we want to deliver.

Once you've defined that, you have to instrument, because it's not enough just to know what our target is; we've got to measure our progress against that. Traditionally, we've all looked at availability, or success rate: does the transaction complete or not? That might result in a number of nines, but it's not enough. What's the response-time experience going to be? Do we want a click or action to happen in 10 seconds, 10 milliseconds, or somewhere in between? What's acceptable? We needed to define that. We needed to understand throughput and traffic levels, because that was important to delivering a quality experience. Then we had to manage capacity. We have to know where our system is going to fail so we can stay ahead of that and proactively address it. We found a valuable element in capacity utilization: the better we understand that, the better we can scale our system down when we don't need that capacity and save the business money. We identified these as our four golden signals. They may or may not be yours, but they were a good fit for us.
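
As an illustration that is not part of the talk, the four signals described above (success rate, response time, throughput, and capacity utilization) can be checked against targets in one place. The field names and every threshold value below are assumptions for the sketch, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    success_rate: float          # fraction of transactions that completed
    response_time_ms: float      # for example, a high percentile of response time
    throughput_rps: float        # requests per second currently being served
    capacity_utilization: float  # fraction of provisioned capacity in use


@dataclass(frozen=True)
class Targets:
    min_success_rate: float
    max_response_time_ms: float
    max_throughput_rps: float        # highest load the service is known to handle
    max_capacity_utilization: float


def breaches(observed: Signals, targets: Targets) -> list[str]:
    """Return the names of the signals that are outside their targets."""
    problems = []
    if observed.success_rate < targets.min_success_rate:
        problems.append("success_rate")
    if observed.response_time_ms > targets.max_response_time_ms:
        problems.append("response_time")
    if observed.throughput_rps > targets.max_throughput_rps:
        problems.append("throughput")
    if observed.capacity_utilization > targets.max_capacity_utilization:
        problems.append("capacity")
    return problems


# Hypothetical targets and one hypothetical observation.
targets = Targets(0.999, 300.0, 5000.0, 0.8)
observed = Signals(0.9995, 450.0, 1200.0, 0.55)
print(breaches(observed, targets))
```

In this example only the response time is out of range, which is the case the talk highlights: a service can show plenty of nines on success rate and still deliver a poor experience.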

Once SLTs and SLIs are in place, we have to manage against them. Error budgets are important because once we've defined the target, we can also define what happens when we miss it. Making this decision ahead of time is much easier than making it in the heat of the moment. When you're above your error budget and things are operating well, you can continue to work on innovation, features, and improvements. When you fall below that threshold, everybody in the organization -- coming back to unified engineering -- has to stop what they're doing and work on addressing that issue.
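
As an illustration that is not part of the talk, deciding ahead of time what happens when the target is missed can be written down as a tiny policy function. This sketch assumes the input is the fraction of error budget still unspent, so "budget left" corresponds to things operating well and "no budget left" corresponds to falling below the threshold. The `WorkMode` names are hypothetical.

```python
from enum import Enum


class WorkMode(Enum):
    FEATURES = "continue innovation, features, and improvements"
    RELIABILITY = "stop and work on the customer-facing issue"


def work_mode(budget_remaining: float) -> WorkMode:
    """Pre-agreed policy: any budget left means feature work continues."""
    if budget_remaining > 0.0:
        return WorkMode.FEATURES
    return WorkMode.RELIABILITY


print(work_mode(0.4).value)
print(work_mode(-0.1).value)
```

Treating exactly zero remaining budget as a miss is a choice made for this sketch; the point is that the rule is fixed before the incident, not debated during it.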

I think about this like my teenagers' grades. Every Friday, I get a report that tells me what their grades are. If their grades are an A or B, or whatever we've agreed their targets are based on their classes, I might say, go have a great weekend. But if they come home with a C-minus, they've missed their error budget. They're going to stay in on Saturday and do homework, and I'm going to sit down with them and figure out what they need to do and change to get back on track. We apply this in other places in our lives, but don't always bring it into the software development life cycle.

Once you've got all that in place, you've got to measure, report, and have clear metrics. We took these SLTs and SLIs and built a DevOps quality scorecard. We shared it with the teams. We agreed to look at it as an executive committee every week. We started with executives, then realized we weren't really facing the problems on the front lines, so we brought in engineering and operations leaders who understood the technical issues. It forced them to better internalize and understand the issues their teams were facing, and for us to make decisions together. This was powerful as we measured, reported, and inspected on an ongoing basis.
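
As an illustration that is not part of the talk, a weekly scorecard can be as simple as one row per service, sorted so the services in the worst shape are reviewed first. The input format (remaining error budget per service), the status labels, and the service names are assumptions; the talk does not describe the layout of the actual scorecard.

```python
def scorecard(budget_remaining_by_service: dict[str, float]) -> list[tuple[str, float, str]]:
    """Build scorecard rows of (service, budget remaining, status), worst first."""
    rows = []
    for service, remaining in budget_remaining_by_service.items():
        status = "OK" if remaining > 0.0 else "MISSED"
        rows.append((service, remaining, status))
    rows.sort(key=lambda row: row[1])  # lowest remaining budget at the top
    return rows


# Hypothetical weekly numbers.
weekly = {"analytics-ingest": 0.6, "targeting-api": -0.2, "campaign-ui": 0.1}
for service, remaining, status in scorecard(weekly):
    print(f"{service:<18} {remaining:>6.0%}  {status}")
```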

This led us to a new normal. Looking back at 2019, we go through growth cycles and periodic customer events, product launches, sporting events, or world events, but we see a normal growth pattern. What we saw this year as we entered COVID-19 and moved to doing everything digitally was a new normal in our traffic. Every day of the week, we were exceeding our holiday traffic levels, and that continued to go up and up. We were able to handle this because we had the foundation in place and were operating with a DevOps principle pattern and culture in mind. The work-from-home transition was seamless. People could join, and we already had all the facilitation needed for anybody to join from anywhere. We had the tools, structure, and responsibility in place, and we've been able to adapt in real time.

As we look at this journey and what has changed and how we've had to adapt, it is certainly an iterative process. While you want to stay committed to the approach you've outlined, as you see things, you need to adapt and change. A lot fell into place. With unified engineering, we went through a struggle to get engineers on call and get every team involved. We heard every excuse: I'm busy, I'm a software developer, I don't do that; my country's works council doesn't allow me to be on call. We had complex issues to work through, but we got through them.

Then we started to see challenges like observability. While we had good tools and good data, we had gaps. We had to double down and invest in metrics and observability. We had to better automate response and on-call work. Then we saw opportunities to say, if we can automate on-call and pulling people in, what if we went a step further and auto-remediated the problem? That's been exciting to solve customer problems even faster. We've stayed true to the principles, continued to focus on measurements and data, and maintained accountability across our teams.
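
As an illustration that is not part of the talk, auto-remediation can be sketched as a lookup from alert type to an automated fix, falling back to paging people when no fix is known or the fix fails. The alert names, the `restart_service` stand-in, and the return strings are all hypothetical.

```python
from typing import Callable

# A remediation takes a service name and reports whether it succeeded.
Remediation = Callable[[str], bool]


def restart_service(service: str) -> bool:
    """Stand-in for a real remediation; this sketch always reports success."""
    return True


# Hypothetical registry of known, safe-to-automate fixes.
REMEDIATIONS: dict[str, Remediation] = {
    "process_hung": restart_service,
}


def handle_alert(kind: str, service: str) -> str:
    """Try the automated fix first; page on-call if none exists or it fails."""
    action = REMEDIATIONS.get(kind)
    if action is not None and action(service):
        return "auto-remediated"
    return "paged on-call"


print(handle_alert("process_hung", "checkout-api"))
print(handle_alert("disk_full", "checkout-api"))
```

Keeping the human page as the fallback means automation only shortens the path to a fix; it never removes the people from the loop when the situation is unfamiliar.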

I hear the question a lot: who owns DevOps? Is it engineering? Ops? Executives? How do we drive this, get started, and begin? Is it a groundswell movement or an executive mandate? I think the answer has to be both. We have to find champions and passionate engineering and operational leaders who want to do things differently, and we have to have the right leadership alignment. Bringing those together is what made our journey finally accelerate. The common principles and the time spent getting alignment allowed us to bootstrap the program and see it take off.

We had good alignment from leadership. We went further and regionalized that, finding the sponsor in each region where we have a major presence for engineering and operations. Then we said we would go find champions in each team. As we went beyond the first 12 or 13 services and expanded, we identified 120 DevOps champions. In this area of the business, we have about 2,000 to 2,500 engineers, so 3 to 5 percent were named champions. It was probably a mix of volunteerism and being asked to volunteer, but we found people with passion who wanted to see things change and were committed to this.

With those champions living this, breathing this, acting this every day, and talking to their peers, we saw change. This cycle of action and accountability at the individual level, plus leadership inspection and alignment on outcomes, was exciting. We even made organizational changes. We took a bunch of our SRE folks and, as we made progress, embedded them into engineering teams and completed that journey around unified engineering. Organizations are different, and what we did may not work for you, but sometimes organizational change can be another catalyst to help people see that we're serious and we're going to make changes.

To recap: first, find your why, and it has to become your rally cry. You have to be passionate and deeply committed to it, stand in front of the organizations as leaders and link arms, then see the teams link arms and commit to winning these battles. Get alignment across teams and show a commitment to DevOps, not just talking about it occasionally. It has to be built into the way we run the business. The unified experience and unified engineering concept starts to shift as we bring together software development, QA testing, site reliability engineering, and user experience, and everybody becomes responsible.

You have to find your champions. It's important to have experts on the team, on the ground, embedded in the team, who will drive change. Pick a lighthouse. It doesn't make sense to solve all this at once. Pick a lighthouse. Start with one, then two, then five. Pick a few services where you can apply this in your organization and culture, achieve success, and then champion and highlight that success across the business. Let it be a living process. It's okay if it changes and tweaks. This is the beginning of a journey, and it should continue and evolve.

As your culture grows and customers grow, this is such an interesting time we're in. We were fortunate to start this journey before the pandemic and put ourselves in a good place to ride it out successfully. But I've reflected on what if we hadn't done this then: would we do it now? I think the answer is yes. It may be harder and we may have to go about it differently, but if you're not there, don't wait to start. This shouldn't wait another year. Continue or get started on your DevOps journey, and I think you'll see those benefits and that value happen. Take the time to think about it, apply it, and let it happen.

Thank you for your time. I wish I could be together with you in person. I'd love to connect. You're welcome to reach out directly with questions or comments. There's always things I can learn and we can learn. I will be on Slack to answer questions for the next little bit, so I look forward to engaging and hearing from all of you, and wish you luck in your DevOps journey.