On the Care and Feeding of Feedback Cycles

San Francisco 2014

Director for Quality Engineering at Cloud Foundry · Pivotal

Nothing interrupts the continuous flow of value like bad surprises that require immediate attention: major defects; service outages; support escalations; or even scrapping just-completed capabilities that don’t actually meet business needs.

You already know that the sooner you can discover a problem, the sooner and more smoothly you can remedy it. Agile practices involve testing early and often. However, feedback comes in many forms, only some of which are traditionally considered testing. Continuous integration, acceptance testing with users, even cohort analysis to validate business hypotheses are all examples of feedback cycles.

This talk examines the many forms of feedback, the questions each can answer, and the risks each can mitigate. We’ll take a fresh look at the churn and disruption created by having high feedback latency, when the time between taking an action and discovering its effect is too long. We’ll also consider how addressing “bugs” that may not be detracting from the actual business value can distract us from addressing real risks. Along the way we’ll consider fundamental principles that you can apply immediately to keep your feedback cycles healthy and happy.

Full transcript

The complete talk — auto-generated from the talk's captions.

I can't tell you how grateful I am to all of you for still being here, because when I realized that I was in the very last slot, I figured it was going to be me and somebody else having a nice chat, sitting on the stage. Because it's the last day. So thank you. I'm really grateful.

Thank you all for being here. So, I want to make a slight correction to my title. I am the Director of Quality Engineering for Cloud Foundry at Pivotal. Pivotal Labs does not actually have a Director of Quality Engineering.

They don't need one for a very good reason. It's not like it's a missing person. They literally do not need one. And I work exclusively on Cloud Foundry, and Pivotal also has commercial interests in Cloud Foundry.

Cloud Foundry is an open source project, and there is a commercial version, so I work on that as well. It's a job title I swore I would never take again, and yet somehow I ended up being the Director of Quality Engineering. My past experience as a Director of Quality Engineering was that it was a horrible job. You're at the end of the cycle, everybody's yelling at you.

They want to know why you missed all of those bugs, and then they want to know why you didn't ship yesterday. They don't see the contradiction. So I swore I would never do this job again. When I took the job, I had some very long conversations with my boss, Rob Mee, who's a guy who is amazing, and I've known him now for a decade.

And it just happened to be perfect timing when I decided I was done with consulting. I called Rob. I said, "Hey, Rob, I know it's time for whatever is next for me, but I don't know what that is. What advice do you have for me?" And I expected Rob to say, "Well, I think that you should look at this technology because it's really cool," or, "I think you should meet this person, let me make introductions." And instead, he said, "Your timing is really weird.

Do you know what Cloud Foundry is?" And I said, "No." And he said, "Do you know what PaaS is?" And I said, "Easter egg coloring kit?" And he said, "No, but that's okay. We'll talk anyway." This was in the fall of 2012. And it turned out that my timing was, in fact, perfect because Pivotal had just gotten deep-- Pivotal Labs, this was before Pivotal was born. Pivotal Labs had just gotten deeply involved in this Cloud Foundry thing.

And as Rob and I talked about what I might be able to help him with, we were talking about a role that sounded like a director of quality engineering, only nothing like what I had ever done before. My job at Pivotal is to see to the care and feeding of the feedback cycles. So you hear a title like Director of Quality Engineering, and you picture me directing a bunch of quality engineers. We have zero people with that title.

And that's because every single person, they may not have that title, but every single person on the team is a quality engineer. I don't just mean they're good, because they're also good, but I mean that they're focused on quality. So, I'm going to talk a lot about feedback cycles, a little bit about Cloud Foundry. Very happy to answer any questions that you might have.

Let's just go on. Okay. So before we can talk about feedback cycles, let's talk about what I mean by feedback. I mean you do a thing and you find out how that went.

Pretty simple, right? And we all are familiar now with Plan-Do-Check-Act, the Deming cycle, or OODA loops or lean startup, whichever flavor of this it is, it's still basically the same thing. You do a thing, you see what happened, and you adjust accordingly. You steer.

That's what a feedback cycle is. Now, in software development, we need feedback on several things. Let's start with, there was a core need somewhere. There was an actual need.

Somebody needed a thing. They needed widget as a service, or they probably didn't actually think they need that. They probably thought that they needed to solve the problem that widget as a service would solve, but there was a need. And then we set some intentions to go about and address that need.

Now, the problem is, we don't actually know whether or not our intentions would address the need until we can get around to implementation, and then we can deliver the thing. So we need feedback on the extent to which our intentions match the actual need, the extent to which our implementation matched our intentions, and then the extent to which the implementation met the actual need. Now, you would think that this is transitive. This is not even remotely transitive.

I can give you an example of a non-transitive situation here that had nothing to do with software. So there I was in Kiev, and I needed cash. And I had an ATM card, so this should be a solvable problem. But here's my actual need: I need cash.

So I trundled down the street for quite a ways until I found an ATM that didn't look like somebody had backed their van up, and I found an ATM that I trusted that looked like it was in a bank, and I proceeded to put my card into the ATM, at which point it gave me an error, something to the effect of, "Transaction rejected," and ate my card. Now, here's my actual need: get cash. Somebody set the intention of having the ATM machine, the software there for the ATM machine, was supposed to be able to serve up cash, and I guess it was supposed to be able to do some error handling as well. So they set an intention.

The intention was definitely designed to meet my actual need. It's not like they completely missed on the intention. However, somewhere in the implementation, something happened with the way that they interpreted errors, and it must have had something to do with my having an American ATM card, such that the final result was that my needs were not even remotely met. So these are not transitive things, and we actually do need feedback on the relationship between all of these things.

Then there's this concept of feedback latency. Now, we talk about slow feedback, and over the last three days, I've actually been amazed and excited by how many talks involve the words testing and quality. Because I like to talk about those concepts but try to avoid using those words, and I was floored and astounded and happy to hear the extent to which folks recognize that those are essential things for us to be thinking about when we're talking about software development just in general. Do you know that there are entire conferences dedicated to software testing and software quality? And they're all over there, and there's more content on that here.

I think they all should come over here. Don't you think? Woo. Yes.

Okay. So feedback latency. We all know that we want faster and faster feedback, but we're going to talk a lot about different kinds of feedback. So feedback latency is the entire round-trip time it takes from do a thing to observe the results of that thing.
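To make that definition concrete, here is a minimal sketch of measuring feedback latency as the time between an action and the moment its result is observed. The loop names and timestamps are made up for illustration; nothing like this was shown in the talk.

```python
from datetime import datetime, timedelta


def feedback_latency(action_at: datetime, observed_at: datetime) -> timedelta:
    """Round-trip time from doing a thing to observing the result of that thing."""
    if observed_at < action_at:
        raise ValueError("a result cannot be observed before the action")
    return observed_at - action_at


# Hypothetical example: one change, and when each feedback loop reported on it.
committed = datetime(2014, 10, 1, 9, 0)
observed = {
    "local unit tests": datetime(2014, 10, 1, 9, 5),
    "CI build": datetime(2014, 10, 1, 9, 40),
    "story acceptance": datetime(2014, 10, 2, 11, 0),
}

latencies = {name: feedback_latency(committed, seen) for name, seen in observed.items()}

# List the loops from fastest to slowest.
for name, latency in sorted(latencies.items(), key=lambda item: item[1]):
    print(f"{name}: {latency}")
```

Recording the same two timestamps for every loop you care about makes it easy to see which one is slowest and therefore where tightening would pay off first.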

And let's use that concept to see why nobody does this anymore. By the way, this is now the conference, I have to tell you, this has been a momentous conference for me. For years, I have given talks like this and thrown up the straw man of traditional practices and said, "Of course, here's this agile thing." Because I do a lot of agile talks. "Here's this agile thing, and then here's a straw man.

Don't do this. This is the antithesis of agile." Only in the past, it seemed like what I was saying resonated a lot for people because they could recognize the straw man. What I'm realizing is nobody actually does this anymore at all, anywhere. Even in insurance companies or very traditional financial companies or other forms of companies that are dealing with regulations that result in them having implemented practices like that, even those companies are not doing this anymore.

So I'm thrilled to say I have to retire my straw man. It's- Don't do that yet. Don't do that yet. Oh, you're going to make me sad.

Oh, okay. Well, anyway, let's go on and let's talk about why nobody does this anymore. So suppose you are following a traditional model where you've got an analysis phase and a design phase and an implementation phase and then a stabilize phase. Does this really sound very familiar to people?

Yes. I'm so sorry. Okay. In the last talk, somebody asked, "How do you keep your motivation up?" Sometimes I keep it up by ignoring reality.

But we're going to keep going. So the first opportunity you have to see if your intentions are matching the implementation is three phases in, and these are not short phases if you're doing very, very traditional stuff. I worked in companies where this cycle took a year. I, as a consultant, saw companies where this cycle took five years.

They were in a very different industry than my previous experience. They were in the medical industry, and it would seriously take five years to get through this entire analyze, design, implement, stabilize kind of cycle, and just the stabilize phase would be 18 months alone. So if your very first opportunity to see if your intentions match your implementation is about three-quarters of the way through this, that's going to hurt a lot. And if your first opportunity to see if your implementation matches your actual need is a year or five years later, the entire world will have changed by then.

So let's look at this through a different lens. Let's talk about speculation. Now, at the beginning when we're doing analysis, we're writing specs. Everybody thinks that stands for specification.

They're wrong. It stands for speculation. So we're going to look at over here on the right side of the diagram, that's our axis for speculation. And during each of the phases, the level of speculation is going up.

So as we're analyzing, we are speculating that we have understood the actual needs, that we've understood the problem that we are trying to solve. Then we move into the design phase, and we're speculating that we understand how to solve that problem and are writing happy documents that describe in great detail how we're going to solve this problem. And then we get around to implementing, and our speculation levels are still going up because we're speculating this thing is actually going to work when we ship it or put it in production. And then eventually, we get to the stabilization phase, and we suddenly now lose all notion of speculation because the rubber hit the road.

We're actually trying things. We're finding the problems. We've got masses of bugs. We're holding bug triage meetings.

We're figuring out what to do with the information we got with that feedback, but we're taking action, and we're finally able to steer towards something that's going to work. That whole area under the curve, that's risk. Massive amount of risk. This is why projects go so over schedule and over budget, because we spent so long speculating, and with each calendar day that passes where we're speculating without checking that what we are assuming is actually reality, with each calendar day, that's increasing the amount of risk under that curve.
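The "area under the curve" idea can be sketched with some simple arithmetic. The model below is an assumption made for illustration, not something from the talk: speculation rises by one unit per calendar day and drops back to zero whenever real feedback arrives, and risk is the sum of the daily speculation levels.

```python
def accumulated_risk(total_days: int, days_between_feedback: int) -> int:
    """Sum the daily speculation level over a project (area under the curve).

    Assumed model: speculation grows by one unit per day and resets to zero
    each time empirical feedback arrives.
    """
    risk = 0
    speculation = 0
    for day in range(1, total_days + 1):
        speculation += 1           # another day of unchecked assumptions
        risk += speculation        # add today's slice of area under the curve
        if day % days_between_feedback == 0:
            speculation = 0        # feedback arrived, so we can steer
    return risk


# Hypothetical 360-day project: one big stabilize phase at the end
# versus feedback every five days.
waterfall = accumulated_risk(total_days=360, days_between_feedback=360)
short_cycles = accumulated_risk(total_days=360, days_between_feedback=5)

print(waterfall, short_cycles)
```

Under these assumptions the single long cycle accumulates 64,980 units of risk and the five-day cycles accumulate 1,080. The absolute numbers mean nothing; the point is that the area grows roughly with the square of the time spent speculating.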

So the lessons here are, number one, that empirical evidence trumps speculation every single time. No matter how good we think the ideas are, until we actually put them into production, we don't know. We are still speculating. The second thing here is about the time value of information.

Anybody who has ever taken a business class has probably heard the concept of the time value of money, that a dollar today is worth more than a dollar tomorrow. So the same thing applies to information, only information has a much shorter shelf life because information that is true today may very well be false tomorrow. But a piece of information today is worth more than that same piece of information tomorrow. And so we need to be thinking about this as we're designing our feedback cycles.

Now, theoretically, agile enables us to get out of that speculation build-up and risk trap because we're shipping all the time, which means that we don't have the opportunity to be running on speculation for very long periods of time. We are either shipping, delivering, putting things into production, or if we're shipping something that has to ship to an actual customer and they aren't going to be able to receive it, at least being shippable, so continuously deployable, if not continuous delivery. So this is the theory. However, we should talk just a little bit about fragile. So anybody doing a process that calls for a stabilization sprint?

Uh-huh. Okay. So let's watch what happens with that. So we're going along, we've got these iterations or sprints, whatever you want to call them, and we are getting to done with each sprint.

Except if we plan for a stabilization sprint, we're not actually done. We're not done, done. We're not really done. We're done everything except for that last mile thing, whatever it was.

Maybe it was the security testing. Maybe it was the load and performance testing. Maybe it was that last mile, the whole reason why we have a conference, because that last wall that we're going to throw something over is the one that goes into ops. Maybe that stabilization sprint is about getting through the ops checklist.

Whatever it is that we haven't done yet that we're going to do in that stabilization sprint, that's a little piece of speculation that builds up over time. We are speculating that nothing horrible that would cause us to have to extend the cycle for another 18 months, nothing terrible is going to go wrong during that period. So that little speculation, that just grows and grows and grows. Is that looking familiar?

That's why it doesn't work. Because what you've just done is a different form of waterfall with the same exact risk curve. So this is why all the stuff that people have been talking about, like I loved hearing the stories from Raytheon and Disney and all of the case studies that we've heard, because what I heard over and over and over again was automate all the things. No, really, all the things.

Automate them all. And get your feedback cycles really, really short. And make sure that we have cross-functional teams so that we don't have these walls between things. And that's what actually fixes this.

So agile is a piece of the puzzle, but DevOps adds the most important critical last mile piece, and that's why I love this conference. All right. So let's talk about some different kinds of feedback because I've been slinging this word feedback around like it actually means something. And it kind of does mean something, but there are so many different forms of feedback, and you need all of them.

I used to go to testing conferences and I'd talk about unit testing and I'd talk about system testing, and I would get people coming up to me after talks and they would say, "Well, if we're doing all of this system testing, I don't see any reason why we should do all that unit testing because we're just covering the same things." And my soul died just a little bit inside every single time. It was tragic. So you remember that principle that a little piece of feedback now is worth more than a little piece of feedback later? Applying that to our feedback cycles, one of the most important parts of that feedback cycle is the part of continuous integration that happens locally on your machine.

When Martin Fowler talks about continuous integration, he's not just talking about you've got a CI server or a Jenkins or a Bamboo or I don't care what you use. He's not just talking about that. A critical piece of continuous integration is that as a developer, before I check in my change, I do my Git pull or SVN up or whatever it is that I'm using to get the most recent stuff, integrate locally, and then run all the fast tests. Sure, I'm not going to run the three-hour system test run on my local machine before I check in, but the working agreement with the team is I'm going to run all those unit tests and they're going to tell me if anything that I did caused any part of the system to violate the expectations that we have for the code.

And if it did, I'm going to fix it. That's part of collective code ownership. I'm just going to fix it. I'm not going to turn to somebody else and say, "Well, obviously this is your fault." Right?
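The local part of that working agreement (get the latest changes, integrate locally, run the fast tests, and only then check in) could be scripted roughly like this. The specific commands are assumptions for illustration; substitute whatever version control and test runner your team uses.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical steps: the talk only says "Git pull or SVN up or whatever" and
# "run all the fast tests", so these exact commands are placeholders.
STEPS = [
    ["git", "pull", "--rebase"],                      # get the most recent changes
    ["python", "-m", "pytest", "tests/unit", "-q"],   # fast unit tests only
]


def run_command(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(cmd).returncode


def safe_to_check_in(
    steps: Sequence[Sequence[str]] = STEPS,
    run: Callable[[Sequence[str]], int] = run_command,
) -> bool:
    """Return True only if every local integration step succeeds."""
    for cmd in steps:
        if run(cmd) != 0:
            print(f"stopped at: {' '.join(cmd)}")
            return False
    return True
```

Calling `safe_to_check_in()` before pushing gives a yes or no answer in the time it takes the fast tests to run. The `run` parameter exists so the routine itself can be exercised without touching a real repository.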

So, that local integration and unit testing piece is a critical first line of defense. And then, of course, automated CI, we've heard a lot about that. Story acceptance. So at Pivotal, we use the Pivotal way, which is a variant on extreme programming.

Our product managers own story acceptance because we figure that the very best people to do acceptance testing are the people who asked for the thing to begin with. They know what's acceptable and what's not. And so that acceptance testing is a crucial piece of feedback, and it's happening the minute something is ready to be accepted. We don't wait till a sprint demo.

It just happens continuously. Exploratory testing. You might have noticed on my About Me page, I wrote a book called "Explore It," available through PragProg, and it's all about this art of simultaneously learning about the behavior of the system while using all of your test design skills to design tests, but instead of documenting them for execution later, you execute them as soon as you think of them. And as you're executing them, you're getting a little piece of information about the system, you're learning more about the system, and you can use what you observed to steer to drive to your next little experiment.

So it's a series of mini experiments, and you could do this at any level. You can do this at a component level. If you really wanted to, you could do this with your unit testing framework at a class level. You could do this on an entire system, but this exploration is how you discover risks that you didn't even think about to begin with.

I believe it was, I can't remember the speaker, but we had a speaker today who was saying he wasn't sure how to solve the problem where his auditors said, "But who tests the tests?" This is how I solve that. This is what we do on Cloud Foundry to help ensure we're not fooling ourselves. So we do test-driven development, pair programming, continuous integration, all of these great practices. They're necessary, but they're not sufficient because it is certainly possible that we fooled ourselves all the way up to the point where then somebody's actually going to try to do something in the real world and discover that it doesn't do what we expected it to do for them.

Exploratory testing is how we make sure we're not fooling ourselves. Now, for that matter, we talk a lot about monitoring, canaries. All of those are feedback cycles. And in our environment, we keep our eyes on the monitoring for our staging environments as well as our production environment. And whenever something spikes that shouldn't spike, all of us are looking at it and going, "What happened?" See, that's a form of feedback.

We did a thing. We observed a result. It's a farther away result than the unit tests that we're running locally, but it's an important piece of information that gives a clue to something going wrong. And if we catch it in our acceptance environment before it goes to production, and we have on several occasions, then we don't have the downtime that would've been associated with that.

Part of that also is dogfooding. We use our own stuff internally. And then, of course, I'm on every single support ticket. I see all of those emails come in, and I can recognize patterns sometimes or draw the attention of the right engineer to a series of support cases that seem to be related.

So these are all different kinds of feedback, and you need all of them. And this is when I say that my job is to see to the care and feeding of the feedback cycles. I mean that I need to make sure that we are not only doing all of these things, but attending to the health of them. So, for example, if I were to be removed from support tickets, no longer have access to them, that would be a broken feedback loop.

I was the one who argued to make sure that all of our product managers and team leads had full access to the support database to make sure that they could see things as they were coming in. So this is what it means to see to the care and feeding of your feedback cycles, to get any impediment out of the way of people who need to see the results of their actions, to make sure that they have full access to that information. All right, so this is our team room. Creating visibility around all of these feedback cycles is very important to us.

If you look above the word visibility, you'll see a bunch of monitors. On the far right-hand side of your screen, you'll see something that looks a little bit Christmassy. We've got some red and some green going on there. We use an open source build display thing, utility, called Checkman.

One of our engineers wrote it. It's open source. You can totally use it. It allows us to compact a whole lot of builds onto a single page.

We have hundreds of builds at this point across all the teams. And so these are separated out. Each team gets a monitor for their builds. There are some monitors back there that are really, really hard to see, but they're showing the Datadog metrics from our acceptance environment, our development test environment, and then on the other side of the wall where you can't see it, there's our production environment.

So we are literally surrounded by information as we're working. So how do you figure out what are sufficient feedback loops? And here's a simple recipe that I would recommend that you try. Now, this is a recipe.

It's not a prescription. As a recipe, you're going to have to decide what the proportions of ingredients are right for your particular environment. But it's really checking, exploring, and then releasing. So you want to make sure that you have automated checks for all of the expectations in your system.

Code-level expectations, system-level expectations, integration-level expectations. Any part of the system that's responsible for a thing, you want automated checks that it does that thing. Then you want to explore to discover risks. So you've got all of those checks going all the time.

Oh, by the way, and you stop the build on red, right? So the build goes red. It doesn't stay red for six weeks while it gets punted between three different teams. Your fault.

No, your fault. No, your fault. We get that back to green as a top priority. I'm a little worried because I'm getting dead silence.

Does that mean you're looking forward to the beer, and I'm between you and beer? No, it's their fault. It's their fault. Okay.

Moving on. And then even if you cannot release for whatever reason, actually going through all of the motions for everything except for put it into the hands of the customer, that rehearsing of the releases is crucial to making sure the actual release goes very, very smoothly. And as you're doing this, you want to be tightening those feedback loops. So when I joined Cloud Foundry in fall of 2012, the process that was in place then was very different from the process that we have ended up with now.

At the time, there was a Gerrit for automating the code review. But the Pivotal way is to pair on everything. So there's always two sets of eyes on every line of code that gets committed. And we prefer that to using Gerrit or a code review mechanism.

I'm not saying you should prefer that. I'm saying we do, because I realize that there's a debate out there, and that it's a religious issue for some people, and I'm not trying to offend your religion. But one of the things about the way that Gerrit was implemented in this environment was that I had-- One engineer once told me that she spent an entire week just trying to get a simple, single change into the code base. This is a highly qualified, presumably highly paid engineer who had the frustrating and soul-crushing experience of not being able to get a simple, single change into the code base because the Gerrit process was, you submit your thing and then it has to get plus-two'd, and the only people who have the plus-two ability were the very senior people.

The very senior people didn't care about her particular productivity, didn't particularly care about her fix, had their own set of things that they were responsible for and would take days to even bother to look at her particular check-in. Now, during those days, of course, other people's check-ins were getting merged, and so her days looked like get the latest stuff, merge, rebase, test, and then resubmit. That was a week of her time.

So, one of the first things that we did was to get those kinds of wait states out of the system. We dismantled, I'm really sorry if I'm offending your particular religion, but we did totally dismantle the Gerrit check-in process and instituted a new rule: you can get a code review, but we're not going to manage that through Gerrit.

You can either get a code review with somebody, or you can pair, and then that quickly turned into, you can pair. That's your option. You can pair to get code into the system and still have it code reviewed. At the time, there was a QA process that involved a QA team that was offshore.

Still part of the company, but they were somewhere else, many, many, many time zones away. They were really good, and I could tell they were really good because they would run the tests and then report back not only with the bugs, but having isolated the bug down to the line of code that was the problem, but they were not empowered to fix it. So they would be lobbying for weeks at a time, "Dear engineering team, this set of tests failed. Here's the problem area.

Could you please prioritize the fix so that we could get back to green?" And the engineering team didn't have direct visibility into it. It wasn't in their face. They didn't care. Nobody was listening to the QA team.

So it was too far away. This is why we moved to teams are responsible for their own quality, and there is nobody else who is going to test this for you. So, our process, we work in small pieces. Stories take a few days to implement.

Absolute worst, like a week's worth of stuff. We tend to split stories when they get that big. We found and squashed all of the wait states in the process. The end result was that we were able to go from a given set of changes going through the system took weeks to a given set of changes going through the system took days, and we're now at the point where we can do it in hours.

Parallelize all the things. We put a lot of energy into being able to parallelize our deployment as well as parallelizing our tests, both. Taking the time to remove duplicate tests. The test suite that the QA engineers that were offshore that they had written and that they were running, wasn't a bad test suite, but it took a lot of hours to run, and a lot of the tests actually tested the same thing.

They didn't look like it on the surface. It takes work to go through and curate. But taking out all of those duplicate tests paid off in spades. Our acceptance test suite now runs in 10 minutes.
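The talk does not say how the duplicate tests were found. One possible starting point, sketched here purely as an assumption, is to record which behaviours each test exercises and flag tests whose sets are identical. The test names and behaviour labels below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def find_duplicate_candidates(coverage: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tests that exercise exactly the same set of behaviours.

    These groups are only candidates: a person still has to confirm that
    the tests really check the same thing before deleting any of them.
    """
    groups = defaultdict(list)
    for test_name, behaviours in coverage.items():
        groups[frozenset(behaviours)].append(test_name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Hypothetical coverage map for a system-level suite.
coverage = {
    "test_push_app_via_cli": {"push", "stage", "start"},
    "test_push_app_via_api": {"push", "stage", "start"},
    "test_scale_app": {"scale"},
    "test_bind_service": {"bind", "restage"},
}

for group in find_duplicate_candidates(coverage):
    print("possible duplicates:", ", ".join(group))
```

A report like this only narrows down where to look. Tests that look different on the surface but check the same thing still need the manual curation described above.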

And then finally, part of what we did there was to take tests that were being done at the system level, and they really reflected responsibilities that were at the unit level. And so somebody somewhere coined the term the Tetris game of testing, where you're pushing it down to the lowest level. If you've ever played Tetris, you want to get your puzzle pieces down to the bottom level because otherwise it fills up your screen and then you lose the game. So the Tetris game of testing or the Tetris principle of testing is to drive your tests down to the lowest possible level so that they will run as fast as possible.

Okay, and now here's where I'm going to let you in on the secret sauce of this entire conference. I don't know if anybody has shared this secret with you yet, but the thing about feedback is that if you look at learning models, learning models involve do a thing, see what happens, integrate that new knowledge, get a new idea, integrate that new knowledge, do another thing. It's a feedback cycle, right? It's kind of like plan, do, check, act, except it's do, observe, explore.

Or Kolb's learning cycle, which is concrete experience, reflective observation, notice what happened, notice how you feel about what happened, abstract conceptualization, consider something new, and then active experimentation. This learning cycle is just another kind of feedback cycle, which means the better you get at feedback in your organizations, not only are you able to release software faster, you're creating a learning organization. And that ultimately is how you remain competitive in this rapidly changing world. That's my big ta-da.

Okay. Woo-hoo. I think we have one minute for one question. Does anybody have a question?

Nobody. Okay. Oh, sorry. Who's got the microphone?

How did you get into testing? How did I get into testing is the question. Somebody should be running around with a microphone. I don't know where it is.

So I fell into testing totally by accident. I had done tech writing and programming, and I was taking a very short-term contract, and my entire set of qualifications for the short-term contract was that having worked at Sybase, I knew something about databases. And the contract was with a QA department who needed somebody who knew something about databases and could automate the setting up of their environments. And so I went in for a four-week contract and stayed four years and fell in love both with my teammates, because it was a wonderful team, but also with this notion that I could apply all the skills that I had before to this new area, and I got to play with stuff every day.

Was there another question? I was going to ask, with test exploration, which is something we love to have the time to do, but don't even have the time enough to do unit testing, is it that automating all your tests allows you now the time for test exploration? So I'm going to rephrase your question a little bit, which is how do you get the time to explore? Partly because I'm going to do that politician trick of answering the question I want to answer instead of the one you actually asked.

But I will say that yes, part of the way you get the time to explore is to automate all the things so that you're not spending your very valuable time doing that soul-crushing work of manually repeating steps. Like push this button, fill in this field, I hate my job. Push this button, fill in this field, verify that the window came up. I hate my life.

Push this button, right? If you're doing that, then you don't have time to explore, so automating certainly helps with that. One of the ways that we get time to explore on our project is everything goes into our backlogs, bugs, chores, stories, everything goes in. We're Pivotal, so we use Pivotal Tracker.

And everything is in stack ranked order in that prioritized list, and the PM is the one who prioritizes it. We have adopted the convention of making exploratory charters, which is just a mission for what you're going to explore and the information you're seeking in that exploratory session. So that's a charter, and we've adopted the convention of making them a story. It's a zero-point story.

It's one that the product manager can accept. And the reason the product manager has a motivation to prioritize that story high is because it turns out product managers always have questions, especially if the whole team is responsible for the quality of the delivery. The product manager is right in there with that whole team. They're going to have questions about risk, and they're going to have concerns and worries.

And so if it becomes just part of the work that is prioritized for the team, then you're going to have time to get it done because it's just in the list of things to do. I think that we need to be done. There's a red light flashing at me. So I thank you all so much.

It was a great pleasure to get to meet all of you.