Pain Is Over, If You Want It

San Francisco 2015

Technology is always the easiest part of any problem.

This was true of Google in 2005, when Mike Bland joined the Testing Grouplet’s effort to drive adoption of automated testing throughout a highly successful company as its organization and systems increased in complexity at an alarming and unstoppable rate. This was true in late 2013, when the Healthcare.gov crisis led to a stunningly successful recovery after private industry experts were given clearance to fix the technical issues. It is also true of the U.S. federal government today, as Mike has joined 18F as part of the effort to modernize how software is developed and procured, and to steer the culture towards maximum transparency, autonomy and collaboration.

This talk will outline Mike’s experiences at Google that shaped his outlook and honed his organizational skills, and describe his efforts to capitalize on the opportunity produced by the Healthcare.gov recovery to effect broad cultural change throughout the federal government.

Full transcript

The complete talk, organized by section.

Mike Bland

Thanks. Thanks, everybody. Let's get right to it.

You might remember, in October 2013, healthcare.gov was being crushed under the load, which on the one hand, it's good news because there's actual demand for this service. But healthcare reform was about to fall apart because of a website. Seems like a ridiculous proposition.

But fortunately, a month later, they opened the doors so that folks from industry could come in, do what they do, and lead the recovery effort. And so by the time enrollment closed at the end of April 2014, they had enrolled not only more than the 6 million people they were shooting for as a revised target, but also more than the 7 million that they had as an original target. They ultimately accumulated 8 million people enrolled in the program.

So this website recovery was a dramatic success, thanks to a very dramatic change in the tech culture of healthcare.gov. Because once the administration decided that it wanted to fix the problem, rather than just adhere to decades of conventional wisdom around contracting and procurement, it fixed the problem.

But the big question is: what happens next? How does it stay fixed? And is there a way to build on that momentum to try and reform how IT in government is done across the board? And can they learn lessons from experiences such as the drive to promote automated testing adoption within Google?

And how did I get mixed up in all this stuff? Because at the time, I had actually quit the tech industry. I had left Google and the industry in September 2011. I had gone to Boston, enrolled in Berklee College of Music. I was completely off the radar.

But I ended up getting a call from an old friend. Or rather, an email. My buddy Jason Huggins found me on LinkedIn, and he sent me an email, and I'll read part of it to you. He said, "As you may or may not be aware, I helped late last year with the healthcare.gov website rescue. It's going to take a while before the government gets good at software development and testing, and they're going to need a culture change. But the White House gets it now that fundamental change is needed in how they create and test software, and more importantly, that change imposed top-down is likely to fail. Again, no pressure. I respect your desire to focus on other things, but there's this whole 'your country needs you' thing."

Yeah. So I grew up in a military town, so that was the nuclear option. I was done with music school, and I joined 18F, my current team, in November 2014.

But to understand why he recruited me, let's go back to Google '05. So obviously, it was an extremely successful company already. Everybody wanted to work there. We must have been doing something right, and in a lot of ways, we were doing a lot of great things right. But then there was another side that most people didn't really see.

The threat was we were going to be crushed under the weight of our own success, because obviously, we were hiring more people, writing more code, building more products that became increasingly more complex. And even though we tried to hire the best developers we could and give them as much support as we could, there comes a point where all the brainy heroics in the world cannot overcome sheer scale.

And so we were running the risk of reaching a point where we would eventually hold ourselves back, slow ourselves down, and where fear of change and all the things that might go wrong would stifle our courage and innovation. It would lead us to miss opportunities, to become bureaucratic and petty and mediocre and irrelevant.

And there was no service, no product more visibly susceptible to these forces than the google.com homepage. And that was managed by the Google Web Server team, or GWS for short. You're going to hear me say GWS a lot. And I was not a member of this team, but I was very close to the folks who were on that team, as will become evident.

And so GWS, back in the day, was not the glamorous place to be inside Google, because it was basically a dumping ground for all the unrelated changes that all these other teams developing search features had to dive into all the time. So just imagine if you break google.com, or you're on the team that allows google.com to break. So we're talking about thousands of queries per second that are slow, or return bad results, or who knows what. And those thousands of queries per second quickly build up into billions of broken promises.

And we're not just talking about lost revenue, but damaged trust, and that's very hard to recover.

But things seemed to be mostly working, and a lot of people thought, "What do we need to fix? Look at our stock price." And some people, they lived the Google life, and they believed in themselves. And then there were latecomers like me, who were just living in mortal fear that we were going to be revealed as the frauds that we actually were. And so there was an intense pressure to deliver something and prove yourself.

And bear in mind that also, testing back then wasn't the common experience and the commonsense thing to do that it seems to be to many of us now. A lot of people just had no experience with unit testing or writing testable code, and they didn't have the capacity to really think about it. And the development tools we were using at the time, they were really starting to get crushed under the load. And there was some truth to the proposition that people didn't have time to test.

So the big one, though, that we were up against was trying to basically prove the value of a negative, that there's value in preventing a problem rather than fixing it after it hits.

Now, of course, after the fact, when things like, say, Goto Fail and Heartbleed happen, we can make a case saying, "It's conceivable automated testing could have prevented a case like this." But people still tend to think that it can't happen to them.

And this is especially true when people's value system relies almost entirely on objective measurements. And I use "value system," "priority structure" interchangeably with the concept of a cultural norm of what's most important to an organization, or what I sometimes call a corporate religion.

And I don't want to condemn Google's data-driven management style, decision-making process. It's actually really, really good. It's obviously served the company well, and it served, by extension, society very well. But it did make it extremely difficult for us to make the case inside the company of the value of an investment in automated testing, and we couldn't really blame them.

Because like I said, they didn't have the experience outside of the status quo at Google, where things were very slow and brittle. And they were under constant delivery pressure on top of that. And because we couldn't communicate using the language of data, people just didn't understand why we were so passionate about it. And when people didn't have the time to build their code, they didn't have the time to learn about testing.

So we had to find other ways to achieve our goal.

But first, let's step back a little bit and think about culture change more broadly. And one thing I will say is that I don't believe culture change happens like this, because there is nothing worse than Cartman with authority. No matter how good the idea, how pure the intention, if somebody high up in the organization starts issuing orders, it can backfire. It can actually hinder adoption. And fortunately, many of the Google execs had already learned this lesson at Microsoft and Bell Labs and Digital and Sun.

And it also doesn't happen like this, with some rockstar-guru-ninja-savior type coming in to save the day. And I want to emphasize the point that it's wasteful at best and dangerous at worst to think that change is only possible through the magic and charisma of a selected few.

Now, power and mythology are not bad things, but they require cultivation and care so that they produce something that becomes repeatable, basically a model. Not because there's any way to exactly repeat the steps other people have taken in the past, but you can at least frame it and use those lessons to inspire your present course.

And despite the problems, the limitations that we had inside the company, we actually had a lot going for us by virtue of the existing environment at Google. So we had access to information and tools that supported that. We could see who was working where, anywhere in the company, what they were working on, have this documentation culture. We were empowered to try new ideas and to try to prove their value to the company, most notoriously through the concept of 20% time, where everybody had up to a day a week to experiment.

And we also had a very startup culture for a company that big, and something that we called grouplets, which was basically people pooling together their 20% time to collaborate on an issue that affected the entire company.

And we also, without realizing at the time, we kind of slid into the Crossing the Chasm model, which you might be familiar with from Geoffrey Moore's book. And so the idea here is that getting the right message to the right people, the right way, in the right order, is key. And all the way on the extreme left, you have the innovators and early adopters, or who I like to call the instigators. And so it's up to them to kind of build that bridge across the chasm to reach the majority so that the rest of the population can eventually adopt the initiative.

And I'm going to return. You'll see this little bridge and what it's made of in a few minutes. But before I do that, let's get back to GWS.

So the tech lead of the GWS team, Bharat Mediratta, believed that automated testing could cure a lot of the problems with GWS related to complexity and fragility. And so he had the team take a hard line. They were not going to accept any more changes to GWS without accompanying tests from anybody.

And they set up a continuous build, of course, and were religious about keeping it passing. They set up coverage monitoring. They made sure their numbers were up and to the right. And they actually established a written policy and guidelines for writing tests for GWS that they insisted that everybody abide by, both inside the team and outside.

And this was not the most popular policy at the time in the company. But in the end, it helped. GWS was able to turn a corner. They were able to get to a point where they could confidently integrate a large number of unrelated changes from many different projects coming to them and maintain a brisk release cycle.

And when new members joined the team, they were very quick to actually make productive contributions despite the complexity of what they were working on, because of the confidence that came from this high degree of coverage, and because the code was in such good shape in the process.

So ultimately, this policy, this radical policy, enabled google.com to expand its capabilities very rapidly in the middle of a very fast-moving, competitive landscape. And it goes without saying that obviously, GWS is the model team. You want to learn how to do automated testing, look at GWS.

But the problem was, it was still just one relatively small team within a large and growing company. And so we had to find a way to amplify its voice and its message, and build that bridge across the chasm.

So what happened was Bharat teamed up with another Googler named Nick Lesiecki, and they started the Testing Grouplet. And I eventually became one of the leaders of this grouplet after they handed it off. And we had very little budget, zero authoritah, but we had all of the creativity at our disposal to attack the problems that were hindering automated testing adoption from fresh angles. And we had the GWS experience that we could rely upon as a model.

So we did a whole bunch of stuff that I'm going to run through quickly.

We did a lot of partnering with EngEDU, which was Google's internal training organization that maintains things like the new hire lectures. And so we had a lecture and a lab so that at some point during everybody's first two weeks, they would be introduced to unit testing. They helped us cultivate these self-guided training materials called Codelabs, so people could work through examples and get a feel for the tools and the techniques. And of course, they helped us organize internal tech talks and helped us bring speakers from outside the company as well.

And we worked very closely with our internal build tools and testing tech teams to try to find a way to reduce that friction that created that whole, "I don't have time to test" excuse.

But of course, the biggest thing we're known for: Testing on the Toilet. So if you're not aware, this was a publication we just took the initiative to start putting up in every bathroom in the company every week, and we were able to just incrementally increase the degree of knowledge and sophistication when it came to automated testing all through the company.

And I chose this episode to post here not just because I happened to write this one, but it also encapsulates two other major initiatives of ours.

The Test Certified program was a roadmap that was inspired by GWS to do two things. First, we hacked the culture a little bit. We said, "Do these tasks, you'll achieve these levels, and we will put your name on the ladder." So they had something to measure themselves against, and they thought it was great.

But secondly, it also gave people a path to overcome that big scary obstacle, like, "Where do I get started?" So we told them, "Level one: set up measurements, builds, coverage bundles. Label any tests that you know to be flaky so that they can be singled out. Level two: establish a written policy for your team, no changes without tests, et cetera, and set some low-level goals." And then once the team was really bought in and feeling the rhythm of this, level three was where you set some far-out, high-end goals that really stretched your capabilities.
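The ladder described above can be sketched roughly as a small data structure. This is a minimal illustration only: the task names paraphrase the talk, and the `LEVELS` table and `certified_level` helper are hypothetical, not Google's actual Test Certified tooling.

```python
# Hypothetical model of the Test Certified ladder; task names paraphrase the talk.
LEVELS = [
    (1, {"measurements", "continuous_build", "coverage_bundles", "flaky_tests_labeled"}),
    (2, {"written_test_policy", "initial_goals_set"}),
    (3, {"stretch_goals_set"}),
]


def certified_level(completed):
    """Return the highest level whose tasks, and all lower levels' tasks, are done."""
    done = set(completed)
    level = 0
    for number, tasks in LEVELS:
        if not tasks <= done:
            break  # a level only counts if every level below it is complete
        level = number
    return level


# Example team: all of level one is done, but level two is only half finished.
team_progress = {
    "measurements",
    "continuous_build",
    "coverage_bundles",
    "flaky_tests_labeled",
    "written_test_policy",
}
print(certified_level(team_progress))
```

In this sketch a team cannot skip ahead: finishing a level-two task counts for nothing until every level-one task is complete, which mirrors the "where do I get started?" ordering in the talk.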

And we recruited a lot of volunteers who were passionate about testing across the company, especially our software engineers in test. And these people would act as mentors to the teams participating in the program. And they would provide advice, and they would also be the ones to validate that team X reached level Y on the Test Certified ladder.

And eventually, with this framework, we realized we had a strategy. We could take this framework and say, "We want to get every project in the company to Test Certified level three. Even if they're not actually in the program, we want them to operate as though they are at that level."

So for the projects that needed a little more hands-on help, we created the Test Mercenaries, which is another group that I was a member of. And we were like internal consultants. We would go into a project, spend a few months working on their code, using our tools and techniques to show them how to do these things. And we used Test Certified both as a guide to what we were doing and the goal for the team to achieve.

And again, working closely with the tools team, it was a very tight and productive feedback loop where we would try the latest and greatest, get real-world experience in some challenging projects, feed it back in, and that drove the innovation that produced the toolchain that Google has today, that enables enormous amounts of builds and tests to be executed every day.

And so we also did this other thing called Fixits, which were these kind of informal, "Hey, everybody, let's do this thing today because we really need to address these important but not urgent issues." They were completely grassroots. They were not mandated from on high. And we could give people little tasks, like write more tests for your project, fix the tests that are broken or flaky. And then eventually, we'll do this to climb up the Test Certified ladder. And then when we got to the point where the tools were really in shape, it was a great way to get them out to everybody in a very short period of time.

And the power of this came from the fact that we had these very concrete goals. They were time-boxed. It created a sense of urgency that produced a critical mass of activity. And so every time we did one of these things, we just ratcheted up the state of the art in terms of the tools and the techniques, and our entire culture change mission would reach a new plateau.

And plus, they were fun. There was all this energy, and we gave out free stuff. And if you know anything about Googlers, just know that they love free stuff.

So this whole thing took about five years, and I won't go into the whole chronology, but let's see how the different pieces that we came up with fit together into this bridge that we built across the chasm.

So this model, I borrowed it from a fellow ex-Googler named Albert Wong. He did a two-week sprint with Citizenship and Immigration Services in July 2014 and produced a talk about his experiences, and he introduced this model. I changed the name. I wanted something a little more like a splinter that would stick in your head. But I just thought it was a brilliant way to delineate the different functions that you have to cover in order to get that initiative from one side of the chasm to the other.

And it's phrased in terms of the needs of the majority that you're trying to reach. And it also has this really nice linear quality to it, where you can clearly see that some activities are more dependent on you doing the work, but eventually the goal is to get people to be able to do it for themselves.

So, obviously, things like mercenaries and tools teams were hands-on. Test Certified served many functions, but it especially locked into the validation need. We produced all kinds of materials, saturated the environment with Testing on the Toilet and everything within EngEDU. Fixits were these big, high-energy things. And we'd also give out these build orbs that would sit there and glow green or red, and the whole team could see if the build was broken or not. And I think people started forming interesting attachments to these things.

And then, of course, at our core, we were a community. We were trying to work with people to help them. And then when we enlisted the help of the mentors, that really scaled our scope and our reach.

And then finally, we had two big Fixits. In January 2008, the Revolution, that's the first time we put the modern toolchain in everybody's hands, and that took away the "I don't have time to test" excuse. And two years later, in March of 2010, we rolled out the Test Automation Platform, built upon that toolchain. And it was so fast and so accurate that most build breakages that affected multiple projects would be reported and fixed before most people even noticed.

So what did this enable? I'm going to use some actually old-by-now numbers here from a fellow who I actually did not know, Aaron. But 15,000 devs and 4,000 projects working in one big pile of code, making 5,500 changes a day. And I did the math: that 75 million test cases a day turns out to about 868 tests a second.
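The per-second figure can be checked with a few lines of arithmetic. The daily total is the number quoted in the talk; the constant names are illustrative, and the calculation assumes the load is spread evenly over a 24-hour day.

```python
# Daily figure quoted in the talk; the per-second rate is derived from it.
TEST_CASES_PER_DAY = 75_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

tests_per_second = TEST_CASES_PER_DAY / SECONDS_PER_DAY
print(round(tests_per_second))  # 868
```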

And I know Rachel Potvin from the Build Tools team recently gave another talk that makes these numbers look like child's play.

And so that's the foundation that we laid. And what does that equate to tangibly? Once people had the power and knew what to do with it, doing the right thing with regard to automated testing just became what you did. It wasn't even a question anymore. The only question was how to do it.

And so there was no more of that fear of making a change, that things are going to break. People could stay in that state of flow and focus on the future and exciting new features, and it brought the joy back to programming.

And so I am proud to say that after five years of grassroots teamwork, we'd done the impossible, and that made us mighty.

And I love using Caravaggio's David and Goliath here because it's actually a self-portrait of Caravaggio as Goliath. Because the point here is we didn't have any external forces working against us. And the technical side of the problem, we eventually solved it. It was not easy. It was a challenge, but it was solvable.

But what we had to figure out was to give people the kind of power and knowledge they needed to change their perceptions and have the kind of experiences they needed to have to be persuaded of the value of testing. Because oftentimes, the biggest obstacle to the change we want to see in the world is how we as individuals, teams, or organizations already see it.

And on that note, let's come back to the government.

So I'm going to mention just a little bit of background about 18F, blow through it very quickly. It was founded in the wake of the healthcare.gov recovery as part of the General Services Administration. And the goal is to try to reform the way government builds and buys software.

And so we do everything open source. We're very deeply steeped in agile methodologies. But the goal is not to build all the things and replace the vendors. The goal is just to establish kind of a beachhead within government that proves that this model can work here, and to give procurement officers and vendors alike a new model, a new framework in which contracts can be written and work can be done.

And we believe that by following this model, there will just be millions of dollars, tens of millions, project after project, that just magically isn't getting spent anymore.

And as for the name, yeah, they went through a few dozen ideas, and they were all trademarks, so they looked out the window and said, "Well, what street are we on?" So we're at 18th and F Streets, Northwest DC, in case you were curious.

And so, some of the things we've worked on. We've worked with the United States Digital Service team from the White House to work with the United States Citizenship and Immigration Services to not just reform the software architecture and delivery process, but also improve the user experience for prospective citizens.

We've worked with the Department of the Interior to deliver Every Kid in a Park, and it's notable because we actually did user research with nine-year-olds. So if you're wondering why there's no social media buttons all over this site, it's because nine-year-olds have no use for them because they can't get accounts. Who knew?

We recently worked with the Department of Education to make all of their data as accessible and useful as possible so that prospective students and their families can make informed decisions about their college choices.

And then we recently launched this web design standards project, another joint effort between us and the United States Digital Service team. And the point is to provide a better user experience so that government websites are no longer special snowflakes. You can kind of see all these different buttons from actual government websites. So this design standard work tries to provide design elements and a style guide for a common look and feel.

And then we also have our consulting operation that--

Gene Kim

Take all the time you need.

Mike Bland

Oh, okay. Thank you, Gene.

So the consulting wing, they work directly with our partner agencies to give them kind of the taste of agile in their mouths. They'll go in, they'll do discovery sprints for an agency. They'll provide proposals, recommendations, even prototypes. And then they'll actually run agile workshops and lead them through problems that they're actually trying to work on.

So everything we're doing is off to a fantastic start. But how do we make sure we keep it up? How do we make sure we don't just become, "Oh, that was another great experiment that didn't pan out"?

Well, first, let's go back to the organizational forces that exist within government, and you'll see some parallels in the large here.

So typically in government, there is a premium that's placed on compliance with the existing rules rather than a focus on the quality of products and services. And part of that comes from the fact that in the government, we don't have the same incentives like stock shares or micro kitchens or ski trips or--all right, never mind. Sorry. Actually, I really miss the espresso machines.

But there was also this thing passed in 1883 called the Pendleton Civil Service Reform Act that kind of provides the structure for a lot of this job security. And it's created an environment where I think Jamie Zawinski might characterize it as, where people want to come work for a successful company rather than work to make a company successful.

But at the same time, there's this awareness that while people want to avoid risk and avoid accountability, that the really talented people, they're kind of crazy risk-takers, so they're not really going to come here. But then what does that make us? So there's a little bit of an inferiority complex.

And of course, the waterfall model is still by far the dominant model in the psyche of the government. And particularly when it comes to automated testing, that's way at the end of the pipeline, right? Why should I have to test? That's somebody else's job.

And then of course there's outdated tools. Of course, there's outdated procedures. And the amazing thing is sometimes the government can't even get to the code and the data that actually provides its products and services. They can't physically get to the information and the tools they need to do their job.

And so all these barriers were erected because there's some fear of something going wrong and being held accountable, which, in a lot of people's minds, they think they're going to get fired or dragged before Congress or something like that.

And so again, this fear leads to missed opportunities, pettiness, bureaucracy, all these things that we typically associate with government.

And what also happens is nobody on any side of the situation, the policymakers, the administrators, the developers, nobody has access to the full information they need to actually meet their objectives. And this goes beyond ignorance. It's a full communication breakdown, right? Because these groups are isolated. They don't have a common goal. They don't have a common language. They don't have a common understanding of what the objectives and needs are.

So if we're going to overcome, we 18F now, not we Google, we 18F, if we're going to overcome these kinds of challenges, we need to build an organization at least as robust as Google. And not just to withstand the pressures, but to also then communicate effectively that, yes, this new model can work within the government.

And so we've got a ways to go. And let me do a little A/B with you here. So my first day at Google in 2005, we had all the things already. Any question you needed answered was pretty much at your fingertips. I described it as jumping into the fire to drink from the hose.

But my first day at 18F, however, all these questions, I had to dig for the answers, and I had to run around asking people for them. And that was not just a drag on my time, but on theirs, because certainly they're answering the same things over and over.

And that's when it dawned on me that when I walked into Google, there was not just values around transparency and autonomy and collaboration, but there were the tools to support it. And everything that we did in the Testing Grouplet, it was already there. We took it for granted.

So I figured we need to have that at 18F. We need to start developing not just these values, but some tooling around it.

So I started trying to steal the ideas that I had brought with me from Google. I created this thing that we call the Hub. It's an amalgam of several systems that I was familiar with at Google. It's kind of a prototype that has lived on perhaps past its usefulness, but it's made the point that scaling our documentation is crucial to scaling as an organization and sharing information.

And one of the things that it tries to do is expose a graph of connections between all these different individuals working with projects, with other people, in certain locations, with certain types of tools. And we've extracted that kind of graph engine into something that we call the Team API, so that instead of it just being a feature of the Hub, it's exposed as an independent set of JSON endpoints.

And we've started experimenting with how to keep that current and automate getting that data. And one of the things we've been doing is we started adding these little files to our repositories, just little metadata files called about.yml, that talk about, "This is the project. These are the people working on it. This is the impact. These are the technologies."

And we've already started building up a pipeline to harvest that information automatically, directly out of the code repository, munge it through the API, and then publish it who knows where.
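A minimal sketch of that harvesting step might look like the following. It assumes the about.yml files have already been fetched and parsed into dictionaries. The field names (`name`, `team`, `impact`, `stack`), the sample projects, and the `build_graph` helper are all hypothetical stand-ins, not the actual 18F schema or Team API.

```python
import json
from collections import defaultdict

# Stand-ins for parsed about.yml files; field names and values are hypothetical.
ABOUT_FILES = [
    {
        "name": "example-site",
        "team": ["alice", "bob"],
        "impact": "public information site",
        "stack": ["ruby", "jekyll"],
    },
    {
        "name": "example-api",
        "team": ["alice"],
        "impact": "team data as JSON",
        "stack": ["ruby"],
    },
]


def build_graph(about_files):
    """Index project names by the people and technologies attached to them."""
    by_person = defaultdict(list)
    by_technology = defaultdict(list)
    for about in about_files:
        for person in about["team"]:
            by_person[person].append(about["name"])
        for tech in about["stack"]:
            by_technology[tech].append(about["name"])
    return {"people": dict(by_person), "technologies": dict(by_technology)}


# Serialize the graph the way a JSON endpoint might return it.
graph = build_graph(ABOUT_FILES)
print(json.dumps(graph, indent=2, sort_keys=True))
```

Keeping the metadata in the repository next to the code means the people who change a project are the ones who keep its description current, and the published graph is simply a derived view that can be regenerated at any time.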

And so the point of this is I'm trying to create a space so that these instigators can more easily discover one another and create their own grouplets, or in the parlance of 18F, working groups and guilds. They're basically the same thing, and guilds are a little more official. But the idea is that same model can apply to our own team that we had with the testing grouplet.

And I run three. The first one is the documentation group, because that's the foundation for all of this. Second being testing, and then third being a working group working group, because we want to help working groups.

But to also make it easier for these groups to share their knowledge and information, we've basically copied GitHub Pages, but for our own infrastructure, and started publishing the series of documents we call guides, which kind of is an exposition of: this is how we do our work.

And they're very much works in progress. But we've already gotten not just great discussion and feedback and activity within the team, but from other agencies and the general public through our GitHub issues and pull requests. And then hopefully we'll get an 18F EDU off the ground one day that can emulate Google's EDU and become kind of the permanent custodians of all this training material.

So just quickly, you can see how a lot of these pieces fit into place. And it's not just us doing this work, and it's not just me doing this work. There's so much work that I'm just scratching the surface with my own story here.

But our network across what we call the digital coalition, with the Digital Service team, agencies like the Consumer Financial Protection Bureau, we're all doing great work building a community, trying to find tools to expose this. Because the insights and the methods and the products that are generated by this combination of transparency, autonomy, and collaboration, that's what empowers a team to create products and services that not only satisfy the needs of customers or the society at large, but to actually exceed their expectations.

So can it work? I believe it can, because we've got the right people in the right place at the right time, doing the right things for the right reasons. And I actually had a lot more Beatles slides in here, but, you know, time.

But will it succeed? Well, that's the big question, and I think it can if we really want it to. And I'm going to share a quote from a colleague from the Digital Service, Charles Worthington:

"If you think there is a problem with how government does tech, and that you could help government do it better, then the question is what are you doing to improve it? It's not going to get fixed on its own, and the people in charge have never been more open to new ideas. If we don't try, who else are we expecting to?"

So the ask here, I have to be careful. As a government employee, I cannot directly solicit unpaid labor from the private sector. However, there are things. There are things you can do to help validate the things we're doing, to help inform the government what we're up to, and to inspire change by shining a light on the work we're doing.

And it's the usual thing. It's blogging, tweeting, writing articles, giving us feedback. If you want to jump into a GitHub repo of ours, we're not going to stop you.

But by doing this, by helping amplify the voice of our small team, you empower us to build that bridge to help make government of the people and by the people work better for the people.

Thank you so much.