Your Baby is Ugly. Let's Be Friends
Highly decoupled teams allow enterprise-scale engineering organizations to move at greater velocity. But how do you build a shared culture of quality and reliability among disparate teams? Going around breaking their software doesn’t sound like a good way to build systems or friends.
Matt Simons, Senior Engineering Manager of Quality Assessment, will share how his team partners with the dozens of teams and 400+ engineers at Workiva to help them understand quality in distributed systems by breaking their software without breaking their spirits. He’ll discuss the challenges of building cross-team reliability and how they approached it using Chaos Engineering.
Full transcript
The complete talk, organized by section.
Matthew Simons
Matthew Simons: Okay, we're live. Yeah. Sorry, we should've done like a "three, two, one, go," but here we are.
I'm Matt Simons. This is Jason Yee, and this talk is "Your Baby is Ugly. Let's Be Friends." We're hoping to make some friends here today.
This is the rough agenda. This is a story talk. It's a war story. I want to tell the story, or a story, of a DevOps transformation that we went through at the company that I work for, Workiva, and kind of its state and where we're at with that. I want to dedicate a portion of that to talk about chaos engineering specifically, and Jason Yee is here from Gremlin to talk about that.
I don't know if you want to introduce yourself, Jason, or we'll hit that later.
Jason Yee: Yeah, we can do that later.
Matthew Simons: Okay. We'll do that later. I don't know.
Sounds good. After we do that, then we'll come back, and I'll give you the rest of the story and share all the morals, the lessons learned, if you will. That's the rough outline.
So I guess let's get to it. Maybe the place to start here is that this is a war story. It's not a success story, not entirely. Like many war stories, the moral may well be at the end of the day that war is hell, and so is organizational transformation.
We had successes. We had huge successes, but we had failures, too. As cliche as it may sound, we made some friends along the way. So come, friends, listen to a story of triumph and failure, of love and loss, of ambition and betrayal. My hope is that you'll be able to learn from my experiences and take something with you in your DevOps transformation, something that you can use in your war.
Without further preamble, let's start here: your baby is ugly.
That's right. I went there. Your software sucks. It's a hacky mess of minimally viable garbage, and it's damaging your relationship to your customers. That's essentially the message that my team had to carry with us to every other team that we engaged with. As you can imagine, it made us very popular.
Let me back up a bit. I'll go back to the beginning here. At Workiva, the first stage of our product back when we were a young startup was working on a monolith. We've worked in the public cloud since the beginning. Our first product was built on Google App Engine, which encouraged, at least for us, building in a monolith.
That meant that we had a lot of alignment, I guess you could say, just because the platform sort of forced us to work that closely with each other. We had one monolithic code base, and that really gave us a sense of alignment that we took for granted.
Moving forward, we came to a place where we decided that was holding us back in some ways. As many teams and companies have done, we decided to split off from that and go to more highly autonomous, highly decoupled, and cross-functional teams.
This gave us huge velocity wins. But we found out that it also came with some drawbacks: velocity without a high degree of coordination eventually led us into misalignment and some degree of drift. It's almost like a queen who walks out to her subjects and says, "We must expand the empire," and all the knights gather their armies and then just charge off in different directions. You need something to bring people together.
We needed a high level of coordination to augment the autonomy that we gave the teams. For us, that misalignment showed up as a lack of quality, or at least quality was the place where we saw it first.
So my team was born out of that. The quality assessment team was what we called ourselves. Assessment, not assurance. Assurance was someone else's job. Ours was to basically go to these teams and in painstaking detail describe exactly how their babies were ugly.
What we really wanted to try and avoid in this was becoming the Inquisition. That was the big threat. We were essentially there interrupting their workflow, mandated by executive leadership to show up and figure out why this team was having issues with quality, or where they may have latent quality issues that just hadn't reared their head yet.
We triaged. We went to lots of different teams, but we tried to focus on those that had either a really large blast radius in their product or that had had quality issues so far.
We decided to devise a process that would hopefully help us avoid becoming the Inquisition, which was the assessment. The assessment had three main pillars to it. The first was that we did some quality consulting. This helped us build some goodwill with the team. We just showed up and said, "What can we help with? What hurts?" and kind of rolled up our sleeves and got to work with it.
The second was an actual standards assessment, where we would go through a checklist of standards and best practices at the code and product level and essentially provide a grade for the team.
The third was to do chaos testing. This was new to us. Workiva had not yet been doing, at least in any sort of widespread or coordinated way, chaos testing. For many teams, this was something that we introduced them to, and for almost all teams, it was something that even if they were familiar with the concept, they hadn't really gotten a chance to do yet.
This was a really, really big part of the process for us, and it was a pretty huge win for us. I wanted to take a really significant chunk of this presentation and dedicate it to that, because if it's not something that you've done before, it is something you should do, and it should be a part of your DevOps transformation.
With that, I'll kick it over to Jason Yee, who's the director of developer advocacy over at Gremlin, which is a provider of chaos-as-a-service that we use, and I'll let him tell you all about it.
Jason Yee
Jason Yee: Thanks, Matt.
What is chaos engineering or chaos testing? You've probably heard of it. As Matt's last slide said, "Hey, kid, you want to break stuff? All the cool kids are doing it," those cool kids being Netflix or Amazon or other cool places like that. You've probably heard these stories of Chaos Monkey and how it randomly destroyed servers.
And so you have these maybe achy-breaky products, which leads us to: have you met Billy Ray Chaos, king of the mullet?
So what is chaos engineering? If you use the mullet definition, it's maybe some business on the front end and party in the back end. But that's probably not a great definition of chaos engineering. If we're serious about it, chaos engineering doesn't really care if it's front end or back end or infrastructure, and it's not about random destruction.
So really, what is it? At Gremlin, we have the definition that is this: chaos engineering is thoughtful, planned experiments that are designed to reveal the weaknesses in our systems.
When we talk about systems, we mean both our technical systems, things like our applications and our infrastructure, but also our human systems. Where are our processes broken? Where do we need better documentation? How do we treat the whole on-call and incident response process? All of these things can be encompassed, and you can learn a lot about how you think things work by running experiments.
If we take that idea of experiments, experiments are scientific, so let's do it for science. What do we do if we have a science experiment? The first thing is we have to start with a hypothesis. How do we think our systems, or our applications, or our monitoring tools, or our response processes work when failure happens?
You come up with this hypothesis, and that hypothesis does not necessarily mean that they survive. If I have a database and I say that I'm going to kill that server, it may come back. But my hypothesis might be that my application goes down and that my end users receive, hopefully, some sort of meaningful error. That's a totally valid hypothesis, not surviving.
Along with that, my user should receive errors. My monitoring tool should show me what's gone wrong. Those are perfectly valid hypotheses. Don't focus on simply just surviving. Focus on the things that you need to run good operations.
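One way to make a hypothesis like this concrete is to write it down as data before the experiment runs. The following is a minimal sketch, not anything shown in the talk: the `Hypothesis` and `Expectation` classes and their fields are hypothetical names chosen for illustration. It records the injected failure and the outcomes we expect to observe, including "the application goes down but users get a meaningful error."

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Expectation:
    """One observable outcome we predict will hold during the failure."""
    description: str
    observed: Optional[bool] = None  # None until the experiment has been run


@dataclass
class Hypothesis:
    """The failure we inject plus what we expect to see (hypothetical model)."""
    failure: str
    expectations: List[Expectation] = field(default_factory=list)

    def holds(self) -> bool:
        """True only if there are expectations and every one was observed."""
        return bool(self.expectations) and all(
            e.observed is True for e in self.expectations
        )

    def surprises(self) -> List[str]:
        """Expectations that were checked and did not match reality."""
        return [e.description for e in self.expectations if e.observed is False]


# "Not surviving" is a valid hypothesis: the app may go down, but gracefully.
db_outage = Hypothesis(
    failure="database server is killed",
    expectations=[
        Expectation("end users receive a meaningful error"),
        Expectation("monitoring shows what has gone wrong"),
    ],
)
```

After the experiment, each `observed` flag is filled in by hand or by a check, and `surprises()` lists the places where the system behaved differently from what the team believed.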
Next, we set abort conditions. We want to be responsible adults. It's not about that random destruction. If we're testing this for science, we want to have guardrails. What are our abort conditions? What happens if that small little failure that we're injecting leads to a massive cascading outage?
We want to think about what conditions cause us to say that we're going to stop doing this. What are we monitoring for, and what's our backup plan? What's plan B in case everything goes bad?
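The guardrail idea can be sketched in a few lines. This is an illustration under stated assumptions, not Gremlin's API: `AbortCondition`, `run_with_guardrails`, the metric names `error_rate` and `p99_latency_ms`, and the thresholds are all hypothetical. The loop watches metric samples while the failure is injected and calls a rollback (the plan B) as soon as any abort condition trips.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional

Metrics = Dict[str, float]


@dataclass
class AbortCondition:
    """A named guardrail: if the check returns True, the experiment stops."""
    name: str
    should_abort: Callable[[Metrics], bool]


def run_with_guardrails(
    samples: Iterable[Metrics],
    conditions: List[AbortCondition],
    rollback: Callable[[], None],
) -> Optional[str]:
    """Watch metric samples during an experiment.

    Returns the name of the first abort condition that trips (after calling
    rollback), or None if no condition tripped.
    """
    for metrics in samples:
        for condition in conditions:
            if condition.should_abort(metrics):
                rollback()
                return condition.name
    return None


# Hypothetical thresholds; real values depend on the service being tested.
conditions = [
    AbortCondition("error rate above 5%", lambda m: m["error_rate"] > 0.05),
    AbortCondition("p99 latency above 2s", lambda m: m["p99_latency_ms"] > 2000),
]

# Stand-in for samples polled from a monitoring tool during the experiment.
samples = [
    {"error_rate": 0.01, "p99_latency_ms": 300},
    {"error_rate": 0.02, "p99_latency_ms": 2500},  # latency guardrail trips here
    {"error_rate": 0.30, "p99_latency_ms": 9000},  # never reached
]

rollback_log: List[str] = []
tripped = run_with_guardrails(
    samples, conditions, lambda: rollback_log.append("halted")
)
```

Writing the conditions down before the experiment starts is the point: the decision to stop is made calmly in advance rather than in the middle of a cascading failure.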
Ultimately, you just have to do it. You want to start small, and you want to start in a non-production environment and, just like your code or anything else, move up to staging and then move to production to get comfortable with it. But as my friend Bruce Wong likes to say, if you're trying to get in shape, you don't work out first before going to the gym. You just go to the gym and you work out. You have to start. Don't overthink things. Don't over-plan them. Just jump right in there and do it.
Like any science, it's all about repetition and iteration, and similar with our engineering work as well. As you start to do your experiments, you'll find that sometimes things work great. They work exactly as you expect, which is fantastic. It means that you know a lot about your systems and how they operate.
When that happens, you'll want to work to increase your magnitude and your blast radius: how severe your tests are and how much of your systems they affect. You'll want to keep iterating. Try to find those edges or those limitations in your systems, and that'll help you gain a better understanding of their reliability.
But sometimes things don't work as expected. In these cases, you'll want to create those tickets, Jira, Trello, whatever you're using, and ensure that you're fixing these things and then coming back to these tests and actually rerunning them to ensure that what you thought was a problem and what you thought you fixed has actually been fixed.
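The iteration loop described here (escalate when the hypothesis holds, ticket and rerun when it doesn't) can be sketched as a small planning function. Everything in it is an assumption made for illustration: the `Experiment` fields, the 1 to 100 scales, and the doubling step are hypothetical, not a prescribed schedule.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Experiment:
    name: str
    magnitude: int     # hypothetical 1-100 scale: how severe the failure is
    blast_radius: int  # hypothetical 1-100 scale: percent of the system affected


def plan_next(
    exp: Experiment, matched_hypothesis: bool, cap: int = 100
) -> Tuple[Experiment, Optional[str]]:
    """Decide the next run from the result of the last one.

    If the system behaved as expected, escalate magnitude and blast radius
    (doubling, capped). Otherwise keep the same experiment and return a
    ticket summary so the fix can be verified by rerunning it.
    """
    if matched_hypothesis:
        bigger = replace(
            exp,
            magnitude=min(exp.magnitude * 2, cap),
            blast_radius=min(exp.blast_radius * 2, cap),
        )
        return bigger, None
    ticket = (
        f"Fix and rerun '{exp.name}' at magnitude {exp.magnitude}, "
        f"blast radius {exp.blast_radius}"
    )
    return exp, ticket


first = Experiment("inject latency", magnitude=10, blast_radius=5)
escalated, no_ticket = plan_next(first, matched_hypothesis=True)
same, ticket = plan_next(first, matched_hypothesis=False)
```

Returning the unchanged experiment alongside the ticket keeps the rerun honest: the fix is checked against the same conditions that exposed the problem.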
Sometimes things go horribly wrong. We all know that from our own experiences as engineers. When that happens, you have two options. One is you should have already set some abort conditions. You're going to be monitoring against those and you may want to halt. If things go horribly wrong and it affects your customers, then you may want to stop that so your customers don't have a bad experience.
But depending on how things go wrong, you may want to lean into it and keep going. It's a good opportunity to practice your incident response, to validate your runbooks and your documentation. When failure happens, sometimes just lean in and take advantage of that failure. Take advantage of the fact that people are already in the mindset of wanting to fix things and they're already available.
Ultimately, though, it's about learning more. As we talk about experiments in science, the word science comes from the Latin word for knowledge. It really is all about the more you know. The more you know about how your systems operate and how they fail, the better you'll be equipped to build more reliable systems and to be more reliable engineers, to know how to operate the systems that you're running.
All that said, whether you choose to do the mullet style of chaos engineering and do a little bit of partying in your back end, just go out there and do it and start to learn and build up your knowledge. With that, I'll turn it back over to Matt.
Matthew Simons
Matthew Simons: Thanks, Jason.
Chaos for us was one of the big wins. In fact, as we talk about successes here in our story at Workiva, it's the first that I'll mention. Chaos was brand new to us, as I said before. It was something that hadn't really found its way into our standard practice.
This was a chance that we had as a team to bring this to the teams that we worked with. As a standard part of our assessment, it became an opportunity for these teams who, again, had never really worked with us, to go through the entire process: to start from finding hypotheses, to actually seeing their product break, and in some ways break in ways that they hadn't expected.
Those are always some of the biggest wins for us. This is something that we're now doing in a more concentrated and intentional way at Workiva. More systemic, maybe, is a good way to say it. We're really happy to be able to pioneer that and proof-of-concept that for the company.
The next was quality. Quality did improve. This was in fits and starts, maybe, or in spurts. We were working with individual teams, but these individual teams that we worked with came away with standards and best practices that they hadn't necessarily been exposed to, and that left a mark on them, as well as a bunch of what we called findings, but were essentially deviations from those standards and best practices that they were able to take away from our assessment and go to work on.
Maybe most notably, one of the teams that we worked with, which was in charge of our unified messaging bus, took every single ticket that we made. Understand, this is sort of a gold-standard checklist. If you could put together any possible practice that would impact quality in a good way, that's what this represented.
For them to have essentially closed out every single ticket that we created for them, whether it be mitigation or actually performing the upgrade that was prescribed, was pretty incredible. The results were pretty incredible, so much so that it doesn't even make sense talking about it in terms of a reduction in percentage of incidents or SEVs, whatever you might call them. We call them SEVs.
In the time since they did that, they've had two minor SEVs. That's kind of what it comes down to. Before that, quite a bit more. So the efficacy, at least on the team level of this, the impact that it had on quality was pretty big. That was a big win for us.
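The standards assessment described in the talk, a yes/no checklist that produces a grade and a set of findings, could be modeled roughly as follows. This is a sketch under assumptions: the talk does not say how the grade was computed, so the pass ratio here is a guess, and the example standards are hypothetical placeholders, not Workiva's actual checklist.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ChecklistItem:
    standard: str  # the standard or best practice being checked
    met: bool      # binary yes/no answer for this team


def assess(items: List[ChecklistItem]) -> Tuple[float, List[str]]:
    """Return (grade, findings).

    The grade is the fraction of standards met (an assumed scoring rule).
    Findings are the deviations: every standard the team does not meet,
    each of which would become a ticket for the team to work on.
    """
    findings = [item.standard for item in items if not item.met]
    grade = (len(items) - len(findings)) / len(items) if items else 0.0
    return grade, findings


# Hypothetical checklist entries, for illustration only.
checklist = [
    ChecklistItem("service has health checks", True),
    ChecklistItem("alerts page an on-call owner", True),
    ChecklistItem("runbook exists for each alert", False),
    ChecklistItem("dependencies have timeouts", True),
]

grade, findings = assess(checklist)
```

A team that closes every finding, as the messaging bus team did, would see `findings` come back empty on the next assessment.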
And we didn't become the Inquisition. That was maybe the biggest win for us. We really didn't want to become the Inquisition. All the teams thanked us. Some of them even meant it.
Think about that for a moment. We were an interruption for them. We were an inconvenience forced upon them by executive leadership that had concerns about their product quality. In some ways, our very presence was an indictment that indicated lack of trust from the company. That's maybe hyperbolic, but you could see how if you were put in their shoes, that might be how you would interpret it.
But the formula for making friends is actually pretty simple: do something nice for them, and keep doing something nice for them. Eventually, if they have a functioning conscience and they're not totally insane, a debt builds up in the face of that kind of an imbalance, and eventually you start trading back and forth. That's how friendships are often born.
For us, we made friends. This whole process was about making friends and then asking our friends to increase their product quality, which was actually super easy. It was something they wanted to do anyways. It was really just showing them how. That was one of the biggest wins for us.
But to go to failures now, I'll be honest, this was a very difficult portion of the talk to write for me. I have gone through lots of written versions of it and never really stumbled on something that I was super happy with. It's hard to talk about. Failure is not an attractive color on anyone. It doesn't make anyone look good.
But I think if we're being honest, and if we're trying to fulfill the spirit of what this conference and conferences in general are about, it's about sharing what we learned, and we have to learn from our failures, too.
The lesson here, I think most succinctly, that we learned was that we weren't enough. Our team was not enough. We had done what we could within our sphere of influence. But these were four-week engagements. An assessment was a four-week engagement where we would go to a team, spin up on their tech stack, have a dialogue with them about what they needed help with, start the actual assessment, the binary checklist of yes/no, are you doing this, are you doing that, at the product level, and go through chaos planning and experimentation with them. It was a heavy thing.
The teams that we went through it with, I hope, left it changed. But we have dozens of product teams at Workiva. We have hundreds of services. There was really never a way that we were going to be able to get around to everybody. We were playing Whac-A-Mole with quality.
The lesson there for us to learn was that quality is not a thing that you do. It's a stance that you take with your product. We could probably get into a whole tangent around what that means. The short version is that quality is not always a "more is better." There's an appropriate level for your product based on your industry, the product itself, your customers.
But the reason why we got into this, the whole circumstances of why our team was even necessary in the first place, were not something that we really addressed. Our stance from a company standpoint was off, as evidenced by the need for our team. We were going around applying patches and fixes to specific teams. We were swimming upstream.
We got really good at swimming upstream, but ultimately we decided to disband. Our team decided to disband. That was kind of a tough pill to swallow. It felt like a failure.
But what we did with that, and in sort of our estate planning, if you will, was to try and take the things that we had proof-of-concepted, the things that we had validated, chaos testing, standards work, and general quality consulting, and get those baked into the processes of the company in ways that would outlive us.
The interesting thing about that is that had we not gone through the exercise of working with these teams so closely, had we simply tried to rush straight to that point, I don't know that we would've had the influence, the relationships, the friendships, the political capital, whatever you want to call it, to actually get that done.
This leads me to the last point. This is the last slide, and I'll leave you with a few thoughts here.
If you'll allow me a moment of self-referential indulgence, I gave another conference talk where I asserted that the DevOps values are empathy, collaboration, and automation. I know that automation is a thing we value, not a value in and of itself. But I can confidently assert, having gone through this experience, that those values are still the right values, especially if you are looking to go through your own DevOps transformation.
We could have kicked in doors and brandished our badges and demanded cooperation from these teams. But I'm more convinced than ever that had we done that, the result would've been short-term begrudging compliance and a legacy of resentment.
The advice that I give to new managers is this: I tell them that I always come to work with the assumption that nobody has to do their job. I think that's true in any kind of big transformation, maybe more so than even just our day-to-day jobs. I work with people that probably would tell me this is a generational thing, but I would contend that people of any generation and all backgrounds will be more successful, will be more productive if they are aligned in terms of their desire, if they understand the goal that we are trying to achieve and are not just acting out of a sense of duty.
Ultimately, the question of making a DevOps transformation has almost nothing to do with the specifics of DevOps and almost everything to do with change management itself within an organization.
When it comes to your own transformation, I can't tell you exactly how to do it. Your company's internal cultures and political structures will all look different and require different solutions and different strategies. Hell, I can't even tell you that I completely successfully enacted a DevOps transformation at the company I work for.
But what I can tell you is this: I can tell you with confidence that to have any measure of success, you will need to embrace those values of empathy and collaboration. You'll need to make some friends, the kinds of friends that will still be friends even after you tell them hard truths. Truths like, "Your baby is ugly."
Thanks for listening. That's our talk. We'll be around for questions.