Getting Faster Answers at Yahoo Answers

San Francisco 2015

I will share the more detailed story of how we moved Yahoo Answers from 4-6 week deploys to daily releases – and all the cultural change that required and created.

Jim Stoneham

Thank you so much for coming today. It's a real pleasure to be here.

This all started actually last year at AWS re:Invent. Gene Kim cornered me in a bar and was asking me about my past life, and I was talking to him a little bit about my time at Yahoo! And he said, "You've got to come to DevOps Enterprise and tell this story." So that's why I'm here. If you don't like it, you can blame Gene. If you love it, you can thank me.

So I'm going to talk about what we did to Yahoo! Answers to make it a more vibrant property and a more growing business. But before I get to that, Gene wants me to talk about something else as well. He wanted me to talk a little bit about the context that was going on at Yahoo! And it is actually pretty important, because I'm sure many of you are operating within larger corporate contexts, and that creates challenges for your DevOps engagements and the kinds of things you're trying to do on that front.

Yahoo! at the time, in 2009, was about 14,000 employees. I'm not sure how that ranks relative to the size of your organizations, but it was actually a pretty big enterprise. And there were some interesting company dynamics going on at that point as well.

We had various product teams. Yahoo! had about 50 different properties, like Yahoo! Sports, Yahoo! Mail, Yahoo! News, Yahoo! Groups, things like that. And every one of those teams had their own CI/CD process, if they even had one, their own way of deploying code, lots of different technology stacks being used. So the result was there were lots of different speeds of deployment.

Obviously, at one end of the spectrum was Flickr, and around this time was the well-known talk by Allspaw and Hammond about 10 deploys a day and loving it. I assume everybody knows what I'm talking about there. So that's one end of the spectrum. The other end of the spectrum was the team that ran membership that released code basically once a quarter. So very different speeds of operation.

We also had a new CEO every couple of quarters in 2009, 2010, creating a lot of management unrest. And there was an initiative across the company to move to CI/CD. It was very much in its infancy. So again, there was very inconsistent tooling across the teams.

And then the last thing, which was kind of vexing as a product team leader, was that, for good reasons, operations as a competency was centralized. The goal was to drive better practices across all of Yahoo!, which was a good thing. The challenge was it created an organizational boundary between core engineering and operations. Kind of like anti-DevOps from a structural perspective. So that was a bit of a challenge for us. But as I'll talk about, we figured out a way.

So we're here in 2009. I was VP of Communities at Yahoo!, which meant I was responsible for Flickr, Yahoo! Answers, Yahoo! Groups, and then this thing called social integrations. So we had structure to deal with Facebook and with Twitter to integrate tools that most of us take for granted today: the ability to share articles on Facebook and Twitter and tweet out and all that. But in '09, I will tell you this was rarefied air to be doing this kind of stuff.

So part of my night job, as I used to call it, was to work with all these different parts of Yahoo! to get them to integrate with Facebook and Twitter and get the lift that you get from integrating with social and social publishing.

So this is my world in 2009. And because this is DevOps Enterprise, the big question of the day is, how are deploys per day looking? That's the important question.

Flickr was doing about 10 deploys per day, which is awesome. And as we've seen by the numbers the last couple of days, that's like a pittance compared to the current level of deploy rates for companies, but it was pretty good in those days.

Answers and Groups. Anybody want to take a guess? Any? Bueller? Bueller? Not quite that bad. One deploy per month. But your answer was correct for the social initiative, which was one deploy per quarter. Now imagine in my shoes trying to move integrations with Facebook and Twitter forward, taking advantage of an ever-changing social platform on the part of Facebook, who's deploying daily, and trying to keep up as Yahoo!, deploying once per quarter. Very exciting times.

So these were some of the challenges that I faced in my role. And one of the things I learned living in this larger organization, working on the social integrations, was it was critical to pick what I call the lighthouse account, if you will, or property at Yahoo!

So we centered on Yahoo! Sports, and later Yahoo! Mail, and focused on that property and did a really deep integration that showed off all the benefits of what publishing to Facebook and Twitter would mean from a traffic-lift perspective. We demonstrated really amazing numbers to the rest of the company, and that started to pull people along. And I think that's a similar theme that we've heard in other talks the last day or so: how you can get one small group being successful and then kind of pull people along with it.

At the end of the day, I had to get some CEO air cover as well to actually get this to happen across all of Yahoo! I am happy to report, at the end, we had like the most traffic from Facebook of any app, basically. So we were doing a lot of business with Facebook, promoting Yahoo! content. It was a success, but it was very painful along the way.

And I think the lesson I learned was definitely start small and grow. Build that grassroots, and also work on getting air cover at the top.

But that's not really what I'm here to talk about today. I want to talk about specifically what we did with Yahoo! Answers to get from that one deploy a month to a much more rapid and much more agile process.

Who here has used Yahoo! Answers or has seen a page on Yahoo! Answers? That's it? Come on, tell the truth. Kids' homework, looking up random trivia. Okay, quite a few of you have.

So it's a place to ask and answer questions. It's actually one of the biggest social games on the internet. Most people who answer questions on Yahoo! Answers are trying to level themselves up, one of seven levels. They get points for answering questions correctly or in an approved way, and they're all vying for level seven, which is like the nirvana of Yahoo! Answers. But then once you get to level seven, if you don't answer questions, you start to decay and drop down levels. So it's a big game, basically. But it's about trying to bring more knowledge to the internet, essentially.

Very large-scale property. As of 2010, 240 million monthly visits. Over 20 million people answering questions. So, very large-scale thing, and available in multiple languages around the world. So pretty significant part of Yahoo!'s traffic. Lots of responsibility sitting on the team to make this work and make it work well.

So I want to talk about how we evolved the way we worked inside the team, and hopefully share some learnings that can be helpful to you. I'm very happy to talk about it afterward as well, if anybody has any questions.

I'm going to tell a short story. In 2009, when I basically arrived at Yahoo!, Answers' growth was flat at about 140 million monthly visits, and user engagement was declining. Revenue was flat. And probably most telling, the team that was working on it, the employees that were working on it, were kind of arguing all the time.

And there was a reason we were only deploying once a month or every six weeks, because there were issues with quality and operations and dev. You've heard all these stories before: people saying no, people obstructing releases, and people not being on the same page. So as a result, we had this waterfall process.

I'm going to contrast that with where it ended up in 2010, about 14 months later, basically. We grew the traffic by 72%, like in less than a year, to 240 million monthly visits. We tripled user engagement. We doubled the revenue coming out of this, even though the traffic only grew 72%. We went from a team of employees to a team of owners: people who cared every minute they were in the office about what was happening with Yahoo! Answers, and frankly, every minute of their waking life, to be honest with you. And daily releases and much better site performance as well.

So that's kind of the contrast of before and after. I'm going to tell a little bit about how that came to be, in the hopes it can be useful to those of you in the audience who are dealing with these kinds of transitions that are looming in front of you.

I was fortunate to have an amazing team to work with. I had a series of four to five directors who were responsible for engineering, operations, product, design. And we sat down together and looked at the numbers and said, "This is not going to work. We can't have a business like this."

So we had a plan that we put together that sounds really considered in retrospect. I will tell you that it was a little more seat-of-the-pants at the time. And I'm going to go through each one of these in detail.

Basically, our five-step plan was: get everybody together, closer together. And I mean closer together in a lot of different ways. Try to focus on the few key metrics that would move the business forward instead of every metric. Recognize we had to do things with the architecture in order to enable a move to a higher velocity of deploying. And in doing that, look at how we can reduce the size of work, the unit of work, and also create a culture where it was okay to screw up, basically, to take risks.

We'll start with getting people closer together. When I arrived in '09, we had this interesting triangle. We had engineering, design, and product management in London. Oh, by the way, this is all in Europe. I was in the US, so I was 5,000 miles away in Sunnyvale, California. We had QA, operations, and program management in Grenoble, France. The two sites were one time zone apart from each other, and I was many hours separated from them as well.

So that's kind of how things started. And again, as you can imagine, in '09, without great video conferencing technology, there was a lot of slow movement and arguments and people missing each other because one site's lunch ran an hour later than the other's, and that kind of thing.

So the first thing we did is we consolidated the team. It was pretty painful because we had some great talent in France. We offered everybody jobs in London who were interested, and we moved all the functional team together in one place, except for me, of course. I agreed to be on a plane pretty much every month, so I was there for a week a month during this period.

And the goal was to get everybody together physically, but importantly, to get everybody on the same page as well. And by being in the same location, that really helped.

I will say these days, like with my current company, Opsmatic, we get people together on the same page by documenting our conversations in Slack and having a video conference cart we can move around for remote folks. So we have lots of ways of achieving this these days that probably weren't as possible back in '09. But really getting everybody together was a really important thing for making this possible. So physical co-location, but getting people aligned around the same set of goals as well.

But I think the most important thing we did actually was getting everybody focused on just the key metrics that mattered.

Well, first, this is our old dashboard, or a facsimile thereof. We tracked every metric. We had graphs upon graphs upon graphs upon graphs. And as a result, nobody paid attention to anything. That's the result of that kind of dashboard.

So we simplified, and we actually did a bunch of research with customers to basically focus on what mattered. If you're running a Q&A site, what matters? What matters is how quickly can I ask a question and get a super high-quality answer back? That's the customer satisfaction metric, if you will.

So for us, it ended up being five items. Time to first answer: how long does it take from when I post it to get the first answer? Time to the best answer: when do I, as an asker, say this is the best answer? And how many upvotes that answer got, which means the community is saying, "Hey, this is a good answer. We think this person did a good job."

So those first three metrics had to do with satisfaction. And again, as you look at your own business processes or your applications that are facing either consumers or business, I think it's really critical to dial back the amount of metrics and just think about the ones that are really going to move the needle.

The fourth metric basically had to do with how vibrant the community was: how many answers per week each answerer was contributing. And that had to do with how much content was created on the site. It actually drove our ranking on Google as well. A lot of our traffic came from Google Search, and Google Search loves fresh content. So if people are answering questions all the time, it helps drive our search rankings, drive more search traffic right to Yahoo! Answers as well.

And then the last metric was one where lower was better. It was the second-search rate. If somebody came to a page on Answers and had to do another search to find the answer to their question, we counted that as a failure. So if that second-search rate was edging up, that was bad news. The goal was to keep the second-search rate at least flat, and ideally trending down.

So these are the five metrics. They were informed by actual data. We actually studied customer behavior very quickly to figure out what really mattered, what really moved the needle. It may all seem like common sense now, but at the time, I have to tell you, we were digging through hundreds of metrics to figure out what really mattered.

You'll notice that revenue isn't one of the key metrics. You'll notice that page views is not one of the key metrics. That's because those will follow if these metrics are doing really well. So understanding how the cascade of your metrics works, I think, is incredibly important.
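As a rough illustration of the five metrics described above, here is a minimal Python sketch of how they might be computed from raw activity data. The `Question` record, the field names, and the choice of the median as the summary statistic are all assumptions made for this example; the talk does not describe how Yahoo! Answers actually stored or aggregated these numbers.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import List, Optional


@dataclass
class Question:
    """One question's timeline, in minutes (hypothetical schema)."""
    asked_at: float
    answer_times: List[float] = field(default_factory=list)
    best_answer_at: Optional[float] = None
    best_answer_upvotes: int = 0


def time_to_first_answer(questions):
    """Median minutes from posting to the first answer (answered questions only)."""
    waits = [min(q.answer_times) - q.asked_at for q in questions if q.answer_times]
    return median(waits) if waits else None


def time_to_best_answer(questions):
    """Median minutes from posting until the asker picks a best answer."""
    waits = [q.best_answer_at - q.asked_at
             for q in questions if q.best_answer_at is not None]
    return median(waits) if waits else None


def upvotes_per_best_answer(questions):
    """Mean upvotes received by answers that were chosen as best."""
    votes = [q.best_answer_upvotes for q in questions if q.best_answer_at is not None]
    return sum(votes) / len(votes) if votes else None


def answers_per_answerer(answer_authors):
    """Answers per active answerer; pass one author id per answer posted that week."""
    if not answer_authors:
        return 0.0
    return len(answer_authors) / len(set(answer_authors))


def second_search_rate(searched_again):
    """Share of visits where the visitor had to search again (lower is better)."""
    if not searched_again:
        return 0.0
    return sum(searched_again) / len(searched_again)
```

The first three functions cover the satisfaction metrics, the fourth the community-vibrancy metric, and the last the failure signal that the team wanted to keep flat or falling.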

The next part of our plan was basically architecting to enable velocity. If you look at what we had in '09, basically it was a mess. Even though the product launched in '06, it was actually initially architected in '04. So we're talking about five-year-old code at this point, all based on top of Oracle RAC. If any of you have used Oracle RAC, it's fantastic, but has its limitations in terms of the kind of applications it can be used for, and that was definitely a challenge for us.

So lots of legacy code, lots of spaghetti, lots of connection between front end, back end. There was really no sense of layered architecture at all. And that created huge problems because as we started to scale, or think about scaling, things would fall apart pretty quickly. And expanding Oracle was not an option from a cost perspective or a time-to-market perspective.

So what we did is we basically re-architected in place. We couldn't shut down the business, obviously. I didn't want to go with that brand-new-system approach, where you basically build a whole new system next door and then magically migrate to it and everything works great, because it never does.

So we started out by re-architecting in place. Yahoo! Answers traffic is roughly 95% reads and 5% writes. So we built a MySQL-based read cache to take the bulk of the traffic and the stress off of the RAC systems in the back, and that made a huge difference. That gave us a lot of breathing room.

We also, at the same time, built a proper data access layer for reads/writes to the core database, because that allowed us a lot of flexibility in what kind of data store we used and how we interacted with it from the application level.
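The combination of a read cache and a data access layer described above can be sketched in a few lines of Python. This is only an illustration: `DictStore` is an in-memory stand-in for both the MySQL cache and the Oracle RAC database, `QuestionDAL` is a hypothetical name, and invalidating the cache on every write is an assumption, since the talk does not say how the real cache was kept in sync.

```python
class DictStore:
    """In-memory stand-in for a database; counts reads to show the offload."""

    def __init__(self):
        self.rows = {}
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)

    def put(self, key, value):
        self.rows[key] = value

    def delete(self, key):
        self.rows.pop(key, None)


class QuestionDAL:
    """Data access layer: callers never talk to either store directly."""

    def __init__(self, primary, read_cache):
        self.primary = primary        # plays the role of the core database
        self.read_cache = read_cache  # plays the role of the read cache

    def read(self, question_id):
        row = self.read_cache.get(question_id)
        if row is None:
            # Cache miss: fall through to the primary and remember the result.
            row = self.primary.get(question_id)
            if row is not None:
                self.read_cache.put(question_id, row)
        return row

    def write(self, question_id, row):
        # Writes go to the primary; dropping the cached copy forces the next
        # read to fetch the fresh row.
        self.primary.put(question_id, row)
        self.read_cache.delete(question_id)
```

With a read-heavy workload like the 95/5 split mentioned above, almost every call to `read` is served by the cache, and because the application only sees `QuestionDAL`, the store behind it can be swapped without touching page code.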

We then started refactoring the entire website one page at a time. And I'll remind you, we had 240 million, or at that time, about 160 million monthly visitors. We had a lot of asking and answering going on, a lot of interaction going on. And if that fell apart, that would be the end of it.

So one page at a time, we started refactoring. We started with less-used pages. It's cool to move fast, but sometimes you have to control risk. So we took some of the pages that were visited less frequently and refactored those. The absolute last page we refactored was the actual question-and-answer page, which was probably 90% of page views on the site. But we'd already done refactoring like 20 times by the time we got to that one, so we kind of understood the risks. I'd like to think that we were very cautious in the way we did it, but we moved pretty quickly.
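The risk-control idea here, refactoring the least-visited pages first and the highest-traffic page last, can be shown with a small Python sketch. The page names and all shares other than the roughly 90% for the question-and-answer page are hypothetical, chosen only to make the example add up to 100.

```python
def refactor_order(page_view_share):
    """Return page names sorted from least to most traffic."""
    return sorted(page_view_share, key=page_view_share.get)


def cumulative_exposure(page_view_share):
    """Pair each page, in refactor order, with the total share of page views
    served by refactored code once that page has shipped."""
    exposed = 0
    schedule = []
    for page in refactor_order(page_view_share):
        exposed += page_view_share[page]
        schedule.append((page, exposed))
    return schedule


# Hypothetical page-view shares in percent; only the ~90% figure for the
# question-and-answer page comes from the talk.
shares = {"question": 90, "home": 5, "profile": 3, "leaderboard": 2}

for page, exposed in cumulative_exposure(shares):
    print(f"{page:12s} refactored -> {exposed}% of page views on new code")
```

Under these made-up numbers, three refactors ship before more than 10% of page views touch new code, which is the sense in which the team had practiced the process many times before taking on the riskiest page.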

The result of that was we broke down what was this big monolithic application, basically, into more of a service-oriented architecture, which paid huge dividends as we tried to approach a more agile way of working as a team. Because basically anybody could interact with any service without fear of taking it down the way the old system used to work.

And again, this is all going on while we had lots of customers coming in. And we really didn't have much outage during this period of time. It was pretty phenomenally successful. We were very pleased with the results.

And this whole transition took roughly four months to pull off. And I would say we spent a good 60 days of planning before we even started writing code. So we spent some time thinking about it, then we spent about four months rebuilding it.

Once we did all that stuff, we focused the team on small units of work. If you want to do agile, this is pretty much the prescription. You've got to make the unit of work get smaller and smaller so you can achieve it quickly and ship features and functions quickly.

Now keep in mind, we were still working waterfall. We were still releasing every four to six weeks. We had to make a dramatic change in the velocity of code being shipped, from conceiving of the feature all the way through to delivery.

And I will tell you that my ops team at the time kind of had this view of agile: this is going to be a whole bunch of people throwing stuff at me, and life's going to suck, basically. So they were very resistant. And again, I'll mention that these folks were in a different functional organization. I had no organizational power over them, if you will.

But recognizing what was important to everybody, and to them especially, since they were being measured on site up and site down and all that, we decided to be very thoughtful about it.

So we together got into a room, every function on the team, and said, "What's the right process for us as a team? If our goal is to start to iterate much more quickly, to drive more experiments, to deliver code more quickly, to own more quality at the developer level, what do we need to do from a process point of view?" And every stakeholder was involved. You can't do DevOps with half the team in the room. It doesn't work.

So we spent a bunch of very thoughtful time looking at the core metrics, looking at where our traffic was going, looking at the business, basically, and saying, "Okay, as owners, what do we want to do from a process perspective?"

So we came up with an answer that wasn't that unique. I assume many of you have seen this kind of approach to releasing code. And again, this is '09, so maybe it was a little more innovative then.

But we agreed that we would do these weekly sprints, that every week there'd be a new topic. We would work on those through the whole week. We would try to deploy code every day to production except for Fridays. Recognizing the true peril of releasing code on a Friday afternoon and the impact it might have on our operations team, and everybody really, we started out doing only four days a week. I believe the team evolved to actually pushing code five to six days a week after I left. But basically started deploying code four days a week.

Metrics were looked at, those five metrics I mentioned, along with a couple of performance metrics. We were looking at those daily, or even more than daily, because when you've got this much velocity going through the system, you can see very quickly if a feature change you've made or a new submodule is performing correctly or performing as expected.

So the attention to metrics became much more acute in people's minds, and this is a big contributor to moving from a team of employees to a team of owners. People held on to key metrics that their project was aimed at and owned those cradle to grave. So we saw an amazing transformation in the culture as we focused people on the metrics and how they were fluctuating based on new releases on a daily basis.

We then obviously, in our case, at the beginning of every week, did weekly iteration planning as well. So we said, "Okay, what are we going to do in the next week? How did we do last week?" So we kept this cadence up of always looking forward a week, looking back a week, planning for releases during the week.

And then to reinforce that ownership as well, we would have monthly business reviews where we'd take those five core metrics, plus things like revenue, page views, and other metrics from the different language versions we had and all that. And as a team, we'd look at that together.

And I would add, it wasn't just the team that was sitting in London. It was also extended ops people around the world. It was our community managers that lived in-country because they knew the language well. So it would be a group of about 80 people, in this case, who were in this all-hands that would all come together around how the business was doing and how we were progressing.

And I will tell you that we had nearly unequivocal support from everybody on the team for this process. It definitely took us at least 60 to 90 days to get this working well. It didn't work perfectly out of the gate by any means. We deployed maybe once a week the first week. But again, we'd been doing monthly deploys prior to that. So by taking the unit of work down in size and planning ahead of time, we were able to make real progress, even in the first few weeks, to getting to more of a daily kind of agile process.

The other critical thing we did is we said it was okay to screw up. If you want to take risks, if you want to have people hit for the fences, you've got to give them the permission to make mistakes, right? That's critical.

And the only way that made sense from a business perspective is make it really easy to recover from mistakes. So when I arrived, this is what rollbacks looked like: total hell. And I assume we've all been there. I've been there. Has everybody else been there, too? "Oh, shit, it's not working. What do we do now?" And then you find out it takes three hours to roll back the site, and you're like, "Ugh."

So the good news is by 2010, it didn't take very long. We were using Hudson for a build system. We built a set of scripts that allowed us to roll back if we needed to. But I would tell you that probably 90% of the time we would roll forward. It was so easy to deploy from trunk that we'd just fix the problem and keep rolling on.

And I think if there's one way to ensure a risk-taking environment, or to get people to really push themselves, it's to give them an environment where they can roll back or roll forward really quickly. I would totally endorse that as a really primary goal for your efforts, because if they know when they make a mistake it has a small impact, they'll take bigger risks. And that was just huge for the team as well.
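To make the difference between rolling back and rolling forward concrete, here is a minimal Python sketch of a release log. It is not the team's Hudson setup or their scripts, which the talk does not describe; `ReleaseLog`, the build ids, and the rule that a rolled-back build is marked bad are all assumptions for illustration.

```python
class ReleaseLog:
    """Tracks which build is live and which builds are known to be bad."""

    def __init__(self):
        self.builds = []   # every build ever deployed, oldest first
        self.bad = set()   # builds that were rolled back
        self.live = None

    def deploy(self, build_id):
        """Roll forward: ship a new build (for example, a fix cut from trunk)."""
        self.builds.append(build_id)
        self.live = build_id
        return build_id

    def roll_back(self):
        """Mark the live build bad and return to the newest known-good build."""
        self.bad.add(self.live)
        for build_id in reversed(self.builds):
            if build_id not in self.bad:
                self.live = build_id
                return build_id
        raise RuntimeError("no known-good build to roll back to")


log = ReleaseLog()
log.deploy("build-41")
log.deploy("build-42")   # suppose the metrics look wrong after this one
log.roll_back()          # option 1: return to build-41
log.deploy("build-43")   # option 2: roll forward with a fix
```

Either path ends with a working site in one step, which is the property that makes small mistakes cheap.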

I didn't put it on the slides, but we also rewarded people for taking a risk and it failing, but recognizing it failed and then killing things. So new features would be shipped. The metrics wouldn't look good. We rewarded teams for shooting that feature in the head and moving on to something else. So we encouraged a culture of experimentation around the kind of risk-taking that people were engaged in.

So there are some other bits, too. It's not as simple as a five-point plan, as always. DevOps is all about people, in my mind, and culture that you're trying to build, and cooperation.

So we had to coach managers on what we all call the soft skills. How do you lead a meeting? How do you work through conflict with teams? How do you negotiate well with other functional teams? How do you grow your own employees and reward them for good and bad behavior? All those things. We spent quite a bit of time mentoring and coaching the managers on the team.

And for me, this was challenging because I was leading the organization, but I was 5,000 miles away for three weeks out of the month. But again, we had fantastic leads on the team, and we had some great human resource support for doing that during that period of time.

And I wouldn't underestimate the value of this. It's really hard to model behavior for an organization if you yourself aren't that good at it. So I would urge you to invest in this as you work through these kinds of transformations.

We also got rid of people who didn't get it, basically. If they weren't on board, if they were fighting us, if they were saying, "I don't want to do this agile thing," or they were passive-aggressively resisting things, we basically helped them find another job very quickly.

The other thing we had to do to really support our testing and experimentation environment was basically build out a whole new A/B testing framework that we didn't have. And it was a bunch of work. There are much better off-the-shelf solutions available now, but it helped us to really learn quickly.

Because at the core of all this activity, I was trying to build a learning organization, an always-curious group of people who wanted to learn more about our customers, how to grow the business, and how to grow themselves as well. And that framework was really important to that, so they could experiment quickly.
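The talk does not say how the in-house A/B testing framework worked. One common building block of such frameworks is deterministic bucketing, where hashing the user and experiment names decides which variant a user sees. The Python sketch below shows that technique only; the function name, the experiment name, and the variant labels are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Assign a user to a variant, stably and without storing any state.

    Hashing the experiment name together with the user id means the same user
    always sees the same variant within one experiment, while different
    experiments split the user base independently.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# Example: decide which version of a (hypothetical) answer form a user gets.
variant = assign_variant("user-1234", "new-answer-form")
```

Comparing a key metric such as time to first answer between the two groups is then what tells the team whether to keep a feature or kill it.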

As I mentioned earlier, we were focused on these five key metrics, and we still reported other metrics upwards. So we tracked revenue by country and page views and all those good things, mostly because Yahoo! corporate needed to know these things. But again, I think it was really important that we created a dashboard for the rest of Yahoo!, but the one that mattered to us as a team was a little bit different. It was focused on things that we knew organically would drive the business.

The last little bit that we put in place is we actually, given how fast we were moving and the state of tooling at the time, had to build some tooling for change monitoring: what was changing on hosts, what were the config settings, things like that. And it was really a foreshadowing of what I'm working on now, and I'll just leave it at that. I might talk about that later.

But basically, we felt it was critical to actually have full visibility on the current live state of all the hosts. And so we invested some actual engineering time on tooling to really understand that well. So when things went poorly, we'd know exactly what had changed and who'd changed it and what the impact was.
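A simple way to picture this kind of change monitoring is a diff between two snapshots of a host's settings. The Python sketch below is an illustration under assumptions, not the tooling the team built: the snapshot format (a flat dictionary of setting names to values) and the example settings are hypothetical, and it does not cover recording who made a change.

```python
def diff_snapshots(before, after):
    """List what changed between two config snapshots of one host.

    Each snapshot is a dict mapping a setting name to its value. The result is
    a list of (kind, setting, old_value, new_value) tuples, sorted by setting.
    """
    changes = []
    for setting in sorted(set(before) | set(after)):
        old, new = before.get(setting), after.get(setting)
        if setting not in before:
            changes.append(("added", setting, None, new))
        elif setting not in after:
            changes.append(("removed", setting, old, None))
        elif old != new:
            changes.append(("changed", setting, old, new))
    return changes


# Hypothetical snapshots taken before and after a deploy.
monday = {"max_connections": 200, "cache_ttl_seconds": 60, "debug": False}
tuesday = {"max_connections": 500, "cache_ttl_seconds": 60, "log_level": "warn"}

for change in diff_snapshots(monday, tuesday):
    print(change)
```

Taking a snapshot on every host at a regular interval and storing the diffs gives a timeline to consult when something goes wrong after a release.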

So just to summarize, we got through this process in about eight months total, and got to a very agile environment from a waterfall environment. We built, I think, at that time, very much a world-class DevOps culture, a very collaborative environment. Owners, not employees. People really wanting to own the business collaboratively.

Grew the business, grew user engagement, and it was one of the strongest teams at Yahoo!, at least when I left in 2010, really kicking ass. And this was really remote from a lot of the support structure that lived in California. This is in London, kind of away from everything else, which maybe was part of the success, but also this team had to do a lot to make up for lack of supporting structure.

So if you want to try this, I'd advise the same five-point plan, modulo some of your own environmental constraints. Focus on getting people close together, and that means the mission statement. That means either geographically or using tools to create the sense of a closely coupled virtual team.

Figure out what those key metrics are that really are going to make or break your business. Don't skimp on architecture to enable rapid movement, because you can't go to agile if you've got spaghetti code and systems that are brittle and fragile to work with. Get people focused on these small units of work, and make it totally okay to make mistakes and build an environment where they can roll back, roll forward easily.

I think if you do these things, or even some of these things, you'll see some great success around moving to an agile environment and also very much a healthy DevOps culture.

So Gene said I could do one promo slide, or he urged me to do one promo slide. So based on this experience and another company after Yahoo!, this config issue kept biting me in the butt, basically. I was fortunate to find another co-founder, and we started a company called Opsmatic. The whole focus of our company is on tracking all this configuration change and drift, and giving you a live state system-of-record view of all your hosts and what they're up to.

So if you're in this fast-moving team or you're going through tooling changes, like our service gets used by customers like Slack, for example, who are going through this rapid growth and need to keep track of what's happening in all their hosts in real time, that's what we do. So if that sounds interesting, go to opsmatic.com or find me afterward. I'll happily pontificate about our current project as much as you want to listen to it, but we've been pretty excited building this service and rolling it out over the last couple of years.

Gene also said that I get to ask for help. One of the things that's come up in talking with our customers, and just from me being connected with my alma mater, is there's a lack of understanding on the part of most universities as to what it takes to be effective in a DevOps culture. So CS grads are coming out of college and not really understanding what it takes to be part of a healthy DevOps team, much less contribute to the health of that team.

So Gene, myself, a couple other folks are starting to put together a study group around this. If you're interested in that, our goal is to basically work with a couple of universities to pilot some curriculum and some recommendations around how do we build a more capable set of graduates coming into our teams to better support our DevOps initiatives.

So if you're interested in that, hit me up at jim@opsmatic.com or find me after the talk, and I would love to get your help on this. It'll take us a while to get it right, I'm sure, but it's definitely a need that we see out in the world.

So I hope this has been useful. Thanks for your time.