User Feedback at the Speed of DevOps
Traditional feedback mechanisms tend to take weeks or months to prepare, gather, and process user feedback for integration into the software life cycle.
In this session, we'll show how the Developer Division at Microsoft has changed the way the engineering team partners with usability research to get feedback at the speed of DevOps. We will share our success (and failure) stories as we have worked to get feedback at a weekly cadence.
Full transcript
The complete talk, organized by section.
Justin Marks
Welcome. I'm Justin Marks. I work at Microsoft, where I work on Visual Studio Team Services. Today I'm going to be talking about our journey, not just moving into the DevOps world, but more importantly, focusing on how we actually bring user research into the work we do as program managers or product owners, to get feedback on a regular basis from our customer base. So when we deliver software, we can deliver it with confidence and make sure that our customers are going to get the features that they've been looking for.
As I mentioned, I'm Justin Marks. I've been at Microsoft for the last 15 years. I've worked in a bunch of different roles and a bunch of different teams. I worked in Windows for a long period of time, working through the Windows Vista cycle, and today I actually work in Visual Studio Team Services.
Miki Konno, my comrade-in-arms from the user research side, was supposed to be here today with me to give this talk. Unfortunately, she's sick, and I'll be filling in for her portion. So she sends her regards and wishes she could be here with you today.
For those that don't know, Visual Studio Team Services is Microsoft's DevOps offering. We're a heterogeneous set of tools that allow you to build, deploy, plan, and manage your projects, no matter what world you're trying to deliver them to, whether those are an iPhone app, or an Android app, or something you're bringing to the Azure or the AWS cloud. Our toolset allows you to manage that entire suite.
Team Services is really a group of about 690 people. We're located around the globe, and we roughly break ourselves down into about 40 feature teams. Each of those feature teams is made up of both engineers and program managers like myself.
The way we actually release our software is essentially a three-week cadence. Every three weeks, the entire product launches to the cloud. We're a cloud-first delivery cycle. Today is actually the third day of Sprint 127, so it's been about seven and a half years that we've been following this sprint cadence. Every three weeks, like clockwork, we ship.
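As a rough check of that arithmetic, here is a minimal Python sketch. It assumes every sprint has been exactly three weeks long with no gaps and uses 52 weeks per year; both are simplifying assumptions, not something stated in the talk.

```python
# Rough check of the sprint arithmetic quoted in the talk.
SPRINT_LENGTH_WEEKS = 3   # stated cadence
CURRENT_SPRINT = 127      # sprint number mentioned in the talk
WEEKS_PER_YEAR = 52       # approximation (assumption)


def years_elapsed(sprint_number: int, sprint_weeks: int = SPRINT_LENGTH_WEEKS) -> float:
    """Approximate years covered by the sprints completed before `sprint_number`."""
    completed_sprints = sprint_number - 1
    return completed_sprints * sprint_weeks / WEEKS_PER_YEAR


print(round(years_elapsed(CURRENT_SPRINT), 1))  # prints 7.3
```

Under those assumptions the result is a little over seven years, which is in the same range as the "about seven and a half years" quoted above.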
Then, at the end of the quarter, we take all the value that we've shipped up to the cloud, which we've hopefully refined and improved based on the feedback we've received there (we'll talk about that in a minute), and we ship it in Team Foundation Server, which is the same product delivered as an on-premises offering. This is what you buy and install on your local servers. Same functionality, just a different package.
What we found as we started to go along this journey to move to this cloud cadence is we had to reevaluate all of the processes and techniques we had for shipping software from the beginning. We came from a world where it took multiple years to ship software, and that didn't really translate well to the cloud cadence.
So I want to give you a little, quick, one-minute type of summary of what it used to look like for us to ship software, and we'll use that as a jumping-off point for understanding some of the problems that we had and why we wanted to make some changes.
Generally, our planning cycles, as I said, were two years. We'd have a large upfront planning cycle, generally about four to six months, where people like myself would get together with other program managers and our marketers. We'd figure out exactly what our customers wanted. We'd have a whole bunch of late nights and a whole bunch of doc-writing activities, and at the end of that six-month period, we'd deliver our specs. This is basically the plan as best as we know it.
We can basically hand this to any engineer and say, "Hey, guess what? Go build a workback schedule. This is everything we need you to deliver over the next 18 months. Good luck to you."
I remember when I worked in the Windows team, one day, I was actually a tester on that team, something called "The Book of Longhorn" landed on my desk. Longhorn was what our product code name was for Windows Vista. It was basically a 120-page document that was this thick that, in theory, told us everything we needed to go deliver to make Windows successful.
What was the likelihood of me reading that document and actually going through the detailed analysis of everything it would take to influence my design decisions? Next to none. No one read that document. I wish I still had it because it would've been a great prop, but that's about all it was good for at this point.
After our planning, we'd move into our execution phase, where we'd break that entire plan down into two milestones. For each one, it was rinse and repeat: hand things to the development team, have them build a workback schedule, and execute.
All the while, we were, as PMs, doing usability research. We were working with our counterparts in UX to bring customers into the lab to get feedback. But often, we already had a clear picture in our mind of what we wanted to go build, and we were going to go build it anyways, no matter what the lab responses told us.
Unfortunately, a lot of those times that we did get really valuable feedback, it was too late. We'd already committed ourselves to a plan, and we weren't able to bring that feedback in to take advantage of it to its fullest.
Of course, at the end of the day, UX is a limited commodity. For 690 people, we weren't going to have dozens and dozens of user researchers. So whenever we did have user research, we had to put them on the highest-priority problems.
We would bring somebody into Windows to test the Start menu and this experience for how a user would interact with that, but we wouldn't be able to test things like the mouse control panel. That was too small of a feature, and we didn't have the resources available to us. During this execution timeframe, we didn't really get the feedback we were looking for.
Finally, at the end of the day, we would ship a milestone, our beta and then our RTM, with the goal of getting feedback. Beta was basically our coming-out party to the world and saying, "Hey, here's all this great new value. Why don't you go give us some feedback on this? We'd love to hear your thoughts."
And we'd have loved to act on it. But the team still had the other half of that Book of Longhorn to go execute on.
So when the feedback would come in from our customers, generally the response from someone like myself was, "That's great feedback. We really appreciate it, but by the way, we're not going to get to that because we're booked solid. Wait another product release cycle. Three years from now, maybe we'll be able to incorporate that feedback."
That's where Microsoft got that notoriety of not really being customer-driven and not being innovative and not being able to listen to our customers with empathy, because our work style, our product base, the way we were building our software, wasn't conducive to being able to not only be empathetic, but to be able to incorporate the feedback on a continuous process.
So when we came down to start working on our cloud cadence, we wanted to look at that front and center and say, how can we be more customer-focused? How can we bring that data that the researchers are providing to us into our daily lives? How can we actually make sure that our PMs, when they don't have a researcher available to them, still have the skill set and tools to be able to get that feedback, elicit that response from our customers, and bring that into the product cycle?
That's what we're going to talk about for the rest of the talk.
The way our teams work is essentially we have two major disciplines. People like myself are program managers. You might call them product owners in your own organization, and we're the ones defining the what and the why. My job is to figure out what are the customer problems and do the end-to-end design for how we're going to go solve them.
Then I work in collaboration with my engineering counterparts to define how they're going to build that with the right level of quality and the right scalability, and with the ability to deliver and execute on that vision.
As I mentioned, as a PM, I'm really empowered. I'm given all of the opportunities to truly own my feature. If I'm the PM working on the agile backlog for Visual Studio Team Services, I'm the one that's interacting with our customers, knowing better than anybody what it is that they're asking for. What are the deltas between us and our competitors? What are the customers telling us back in the forums and in our community outreaches about what they need? And I'm empowered to go make changes there.
Similarly, when it comes to design and research, I'm empowered to go find and do the right thing there. Again, I don't always have a researcher or designer at my disposal to go work with me, so I need to be able to go get that stuff on my own.
Speaking of UX and people in our team, I would guess there are about five UX researchers and probably seven designers. So they're a very small portion of what we're doing, and the PMs have to take a large part of that work.
Similarly, when we do want to have research take care of those really important, gritty problems, we want to make sure we can capitalize on their investment and ensure that what they do is not only heard, but able to be internalized back in the team.
Unfortunately, the way researchers often work doesn't fit that. Miki is from Japan, so this might make a little more sense coming from her, but she often talks about research as being like a Japanese tea ceremony. It's very complicated. It's very time-consuming. Both the people doing the study and the people participating in the study have to be incredibly patient as this long, drawn-out process takes place.
When you start talking about engineers, they don't have the patience for this. They want information now. They want to have a rapid, interactive process where they can collaborate with research, and they can do something right away.
The feedback we heard from them was that research was really slow. It wasn't fast enough; they wanted to get the feedback early and often. Things were also very reactive. We're great as PMs and as engineers at going to the forums and hearing what people have said about the stuff we've already shipped, but the goal of research is to be proactive, to get the information before we go write a line of code.
Similarly, we need this stuff now. We can't wait six, nine, 12 weeks. That's four sprints of time. We can't wait that long for a feature to be designed, and then to be tested, and then to come back and do multiple iterations. We have work we need to get out the door, and we want to have a process that actually matches our cloud cadence or our DevOps cadence.
Finally, we can't test every single design. So when we do this, we better be able to capitalize on our investments.
When we took a step back and we said, "Well, how are we actually doing our usability testing?" we took a complete view of the landscape. We looked at all the different ways that user research can be employed to solve different problems for different contexts.
There really are two different parts or two different axes to the way we look at research. The first axis is the vertical axis, and we start looking at qualitative versus quantitative: the user feedback, the insights versus metrics. Then the horizontal axis is really around what people say versus what they do, their goals and attitudes versus their behaviors.
We have skills and techniques in each of these quadrants. The upper-right quadrant is really the one that is most conducive to rapid iteration, to getting feedback that's very tactical in nature and that we can take action on. The others are important, but they're very strategic in nature, so we weren't really focused on those. We were more focused on how we could take the tactical element in that quadrant and come up with a new technique, or use existing techniques, to go get feedback from our customers.
I'm going to talk about a couple of those techniques that we employ on a regular basis.
The first is heuristic evaluations. The easiest way to think about this is measuring success of a given activity. As an example, I'm able to say, "I want to make sure the setup experience for Visual Studio is as great as possible," and I need to have a baseline. How long does it take for a user to install Visual Studio?
I can get a copy of Visual Studio installed in a box. I can sit with a stopwatch, and I can measure each step of the process because I'm an expert. I know what's going to happen. I know how it should work. I can apply principles to it, and I can actually use measurement to see how well I'm doing. That is one toolset. It's one thing in my toolbox that I can pull out to evaluate an experience.
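The stopwatch approach described above can be sketched in a few lines of Python. This is only an illustration: the step names are hypothetical, and the `time.sleep` calls stand in for an evaluator actually performing each step of the walkthrough.

```python
import time
from contextlib import contextmanager

# Step name -> accumulated seconds. The step names used below are hypothetical.
timings: dict[str, float] = {}


@contextmanager
def stopwatch(step: str):
    """Time one step of a walkthrough and add the result to `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = timings.get(step, 0.0) + (time.perf_counter() - start)


def slowest_first(results: dict[str, float]) -> list[tuple[str, float]]:
    """Order steps from slowest to fastest so the worst offenders surface first."""
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    with stopwatch("launch installer"):
        time.sleep(0.02)  # stand-in for the evaluator performing the step
    with stopwatch("choose components"):
        time.sleep(0.01)
    for step, seconds in slowest_first(timings):
        print(f"{step}: {seconds:.2f}s")
```

Recording the same steps before and after a design change gives the kind of baseline comparison the talk describes.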
Another thing I have at my disposal is remote studies. How many people use usertesting.com or UserZoom? Anybody familiar with that? A few people. They're great websites for me to be able to take some designs, get them out on the web, have the service provide hundreds of people to come and give me word of mouth and evaluations on that in a very rapid period of time.
This is fantastic if I want to basically do A/B testing before I build anything. I can take two designs, put them side by side, get them out to customers, and have them tell me, "What do you think?" And I can get the verbatims from them. But at the same time, it's not very deep. This is kind of throwing it over the wall and seeing whatever I get back.
Another thing I can do, which we do quite often at Microsoft, we've got 120,000 people working for Microsoft, 60,000 of them in Redmond. We can just go take our designs, go to the cafeteria, and go desk by desk or table by table and say, "What do you think? You're exactly our target audience. You're somewhat of a jaded target audience, but you're still developers at the end of the day. Let me put some UI in front of your face while you're eating your spaghetti, and let's find out what you have to say about it."
Again, a really good tool that we can use, but it's not the only tool that we want to be able to use, and it doesn't always fit for every type of scenario. If something's a more intricate flow that takes multiple steps, you're probably not going to want to spend a 30-minute lunch period with somebody and interrupt them. They're not going to be very interested in sticking with you.
Finally, one of the things we've done quite often, and actually we've done it with a lot of success in the game studios, is what we call RITE studies: rapid iterative testing and evaluation. The idea behind this is that you can iterate as you do your studies. So we'll show a study to user number one. We'll get their feedback. We'll incorporate that feedback, and then the second half of the day, we'll show a different UI based on the feedback we received to the next set of users, and do that over a period of days, getting better and better over time. A great toolset, but again, it's a little bit slower in process.
So these were all things we do, and we still do them today, but they really didn't hit home. It wasn't the type of collaborative experience we wanted to have, and it wasn't the proactive experience we wanted to have. We wanted to make sure the program managers were the ones driving this, not the user research team. We wanted to make sure they were able to participate and drive not only the priority, but the focus of what we were doing.
To do that, we actually came up with a new thing, which we call FAST studies, F-A-S-T. Basically, the idea behind this is it's a one-week, end-to-end usability test that we're doing on a consistent basis, and we're just churning it through just like we do software. So I'm going to delve in a bit more into this and show you how things actually work behind the scenes.
Essentially, FAST stands for fast, agile ad hocs with tactical. At Microsoft, we have to have acronyms for everything, so we just made up one that sounded cool. Basically, the idea is we want to get five to six users through the lab with one researcher every week.
Five days. We want to go from the beginning, from our planning stages, to the end, where we can actually have the team have not just feedback, but actual key takeaways that they can incorporate back into their design.
To do this, this is where we bring our A team. Everybody from the team is a participant here. The researcher is the person that's the single voice talking behind the glass to our customer, because we are going to do these things in the usability lab. It's a little bit annoying if you're a usability participant and you hear five different people talking in your ear, especially when you can't see any of them. So the researcher is the person that's the single voice.
As the PM, I'm sitting right over her shoulder, giving her direction: What am I trying to get feedback on? Where do I want to ask the questions? Where do I want to drill deeper with the user?
My designers are going to be there right with me because it's a collaborative experience. We don't want them to hear it secondhand. We want them to be in the lab hearing the feedback and being able to incorporate that into the designs as we go.
Similarly, we want the engineers to participate. We don't want them to be hearing about this six months later when they're going to start working on it. They're going to be starting on this feature in two or three sprints. We want to be able to bring them in there and say, "You know what? It's worth an hour of your time to build that customer empathy, to hear right from the users what their experience is and what their feedback is."
Our process is pretty simple, and it's actually pretty repeatable for us. We have a five-day process where essentially we start on Monday.
Monday morning at 10:00 a.m., we do a scrum every week. We treat this just like any other agile project: we prioritize all the features we want to bring through the FAST lab, and we decide which ones we're going to test this week. We take inputs like what the design deadlines are, when the reviews with leadership are going to be, and what the time constraints are. All of that feeds into our decision-making and how we prioritize this.
On Tuesday, we have both the researchers and the program managers taking their own piece of this and driving it. The researchers are going to collect the study materials. They're going to make sure that we can get the right people into the lab and do the recruiting. The PMs are going to be the ones building the prototypes, the mocks, whatever that medium is that we're going to have the conversation around. They're going to plan and prepare so that they can review that work with the researcher on Wednesday.
On Wednesday, the two groups sit together. They review what they're going to show in the lab on Thursday and make sure that the PM has all their questions enumerated and all their hypotheses ready to test, and that the researcher has all their questions answered so they can present this to the user with their best foot forward.
Then on Thursday, we run the study. We generally sign up five to six people, but somebody's invariably sick or stuck in traffic, so we end up getting four or five users through the lab. We start at 8:30 in the morning, and a new user comes in every hour. We have learned that we need to schedule bathroom breaks, because the researcher shouldn't have to sit there for four and a half hours straight. That is something we learned in our retrospective.
By lunchtime, or a little after, we've gotten all of our users through the lab. We've been able to analyze those results, and we can have a debriefing session with not just the PMs and engineers who participated, but also the larger team.
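The Thursday timing can be laid out with a small Python sketch. The 8:30 start and hourly slots come from the talk; the ten-minute break at the end of each slot is an assumed value, since the talk only says that breaks are needed.

```python
from datetime import datetime, timedelta


def session_schedule(first_start: str = "08:30",
                     participants: int = 5,
                     slot_minutes: int = 60,
                     break_minutes: int = 10) -> list[tuple[str, str]]:
    """Return (start, end) times for back-to-back lab sessions.

    Each participant gets a `slot_minutes` slot; the last `break_minutes`
    of every slot are left free as a break (an assumed value).
    """
    start = datetime.strptime(first_start, "%H:%M")
    schedule = []
    for i in range(participants):
        begin = start + timedelta(minutes=slot_minutes * i)
        end = begin + timedelta(minutes=slot_minutes - break_minutes)
        schedule.append((begin.strftime("%H:%M"), end.strftime("%H:%M")))
    return schedule


for begin, end in session_schedule():
    print(begin, "-", end)
```

With five participants the last slot starts at 12:30, which matches finishing a little after lunchtime.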
I work in identity management. I work on things like people being able to sign in and out of our service, being able to do group management and licensing, and all the stuff that it takes to acquire our product. There are six or seven other PMs that I work with on a regular basis in this space. It's not necessarily valuable for them to spend five hours in the lab in the morning, but it's absolutely valuable for them to come to our little 30-minute recap on Thursday afternoon. So the entire PM team, and also the engineering leads, are invited to participate and hear firsthand how we've done.
Then on Friday, both the PMs and the researchers send out the reports, and we send those out broadly. We want to make sure that everybody in the organization has an opportunity to see and learn from our feedback, and not just the four or five people that were actively involved.
Here's an example of our scrum Kanban board. We're using, obviously, Visual Studio Team Services, and you can see how we're actually moving features from left to right.
Each one of the stories has got a pretty clear description of what it is that we're trying to test, who we actually want to bring into the lab, because each individual feature might need a different persona that we're trying to test for, time constraints, things along those lines. We're also able to link all of our storyboards and work there, and our results at the end of the day.
At the end of the day, when you come to the Kanban board, you can see all of the information about each one of those features in a single one-stop shop, and there's never a question of, "Hey, where did that report go? What was the name of the email? Let me get into Outlook and search for it." It's all right there on the Kanban board.
As I mentioned before, it's a pretty simple five-step process. The first is preparation. The first question is, what are we actually going to test? We're happy to test anything. We've even tested design patterns in the lab before. So we can bring ideas and concepts, we can bring mocks, low-fidelity, high-fidelity images, even live code. All of those are possibilities. None of them are requirements. It's up to the PM to decide what's the medium to test based on what they're looking to get feedback on.
The next question is, who are the right participants? What we've found is that one of the largest costs in doing usability research is finding participants who can actually meet your schedule. To fill five slots in a given week, we usually have to reach out to about 200 people.
To simplify our lives, what we've done is we've designed personas. We say, "Okay, since I'm working on identity management, the three key personas I'm looking for are my team lead, the account administrator, and the end-user developer." As long as I understand who those are and I can describe them, I can go to my recruiter and they can build pools of people that have already been pre-filtered to match those personas.
So when they hear on Monday, "Hey, we need something that's going to be developer-focused," or, "We need developer personas to show up in the lab," they already have a group that they can pull from. It's not starting from scratch.
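To make the recruiting numbers concrete, here is a minimal Python sketch. The 200-to-5 ratio and the three persona names come from the talk; the pool contents and candidate IDs are placeholders, and scaling the ratio linearly to other slot counts is an assumption.

```python
# Figures quoted in the talk: about 200 people contacted to fill 5 slots.
CONTACTED_PER_WEEK = 200
SLOTS_FILLED = 5


def contacts_needed(slots: int) -> int:
    """Scale the quoted ratio to a different number of slots (ceiling division)."""
    return -(-slots * CONTACTED_PER_WEEK // SLOTS_FILLED)


# Hypothetical pre-filtered pools keyed by persona; the IDs are placeholders.
PERSONA_POOLS = {
    "team lead": ["tl-01", "tl-02", "tl-03"],
    "account administrator": ["aa-01", "aa-02"],
    "end-user developer": ["dev-01", "dev-02", "dev-03", "dev-04"],
}


def shortlist(persona: str, slots: int) -> list[str]:
    """Take the first `slots` candidates from an existing pool, if there is one."""
    return PERSONA_POOLS.get(persona, [])[:slots]


print(contacts_needed(6))
print(shortlist("end-user developer", 3))
```

The point of the pools is that `shortlist` starts from people already matched to a persona, rather than repeating the full outreach every Monday.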
Next, we choose the right lab setting to test these in. Typically, we're in our traditional usability lab, which, as anybody who's done a usability study before knows, is a simple behind-the-glass setup: lots of video monitors, eye-tracking software, et cetera, and all of the engineers sitting behind the glass, hopefully not making snarky comments about the users as they go through the experience.
We also have other types of labs. We have a more living-room-style lab. Not really used as much for the developer community, but you can imagine people working on Xbox are going to want to have labs that are a little more conducive to their home environments.
Next is execution. Execution is pretty simple. It's what you've seen in any other usability study if you've participated before. Somebody comes in, you introduce yourself, you get them situated, you go behind the glass, and you basically walk them through tasks.
We found that the optimal number is three tasks. Any more than that, we run out of time, and we don't get to go as in-depth as we want to. Essentially, the usability researcher is that one voice coming over the speaker, and they're the one guiding the user through the experience and through completing those three tasks.
Afterwards, as I mentioned, Thursdays are debriefing. Again, the goal of the debrief is to come away with our takeaways. What are the action items that we're going to go incorporate in the design? This is not, what did users like and what did they dislike? It's not a superficial check-the-checklist type of discussion. It's really an analysis of what we can do differently. How did our spending five hours in the lab change the design? What impact do we want to have? And how are we going to follow up and measure that impact? Those are all the types of things we're looking for in the debrief.
Lastly, the most important part is communication. You're only having five people participating in this. You want the broader community to know, A, that it's happening, and B, to sympathize and learn from what you're doing.
Here's an example of one of those emails that Ally sent back in July when we ran one of the studies. You can see her key takeaways. There's a lot more stuff below the screen, and there's a bunch of links to other documents that she's built, but it only took her a couple of hours to put this email together. She was able to quickly distill her learnings when she was trying to understand how people would invite users into an account. We were able to take that feedback and bring it right back to the engineers, incorporate it in the design, and the next week, they actually started building the feature.
As a quick recap of this, it works really well with specific sprint work. It's less successful for larger strategic plays. It's great for things you want to demo or doing quick UI fixes, simple navigation changes. Paper prototypes work really well here. High fidelity is something we actually try to do a lot more often because it's a little more sympathetic to the user base.
What it doesn't really work well with are really large studies where you need to get tens if not hundreds of users giving you feedback to make sure your interactions are correct. This is not going to work for that study.
With that said, this is only one of the ways that we actually incorporate feedback. We're still a data-driven, or what we like to call a data-informed, culture. We've tried to take the work of Eric Ries in The Lean Startup, the build-measure-learn philosophy, and bring that back to our engineering and our design teams as a regular practice, because the feedback we get after we ship is just as important as what we get in the lab before we ship our software.
Basically, the idea of build-measure-learn is to be able to design a hypothesis before you ship your software. As you're doing your design, think about what you want to learn from it. Ship it. Experiment. Nothing beats actually getting the software out the door. You can't cheat shipping. Once you get it out the door, learn from that, and then incorporate that feedback back into the design for the next iteration.
We like to say, "Once we ship, we're not done." That's kind of the beginning of our journey as engineers and as product owners.
As an example of that, one of the things we started to do is what we call launch-and-learn emails. Every time one of the program managers on our team ships a feature, the expectation is within a couple weeks after shipping the feature, they're going to write a broad email. You can see that it's going to the whole Agile team, so you're talking about 100 people there.
The idea is to share what you're learning. What measurements did you collect? What hypothesis did you draw? More importantly, what are you going to do about it? This is a way we can hold ourselves accountable.
It's one thing to say, "Hey, I've got the feedback. I'm going to keep it in my back pocket so I don't actually look bad or I don't have a growth mindset with it." It's another thing to share that in the open and say, "You know what? I was wrong about the design. I really thought that that button was going to be discoverable, but I was wrong. We need something more concrete. Here's how I'm going to apply impact to this problem. Here's how I'm going to make a change."
Now the entire team has learned something. We do this not only because we want them to learn how to write better features, but also because we want them to get better at writing hypotheses and measuring them. That's equally important to making the software good. We want to build our culture. We want to build that new muscle for our team to have.
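One way to picture a launch-and-learn email is as a small record with the parts the talk lists: the hypothesis, the measurements, what was learned, and what happens next. The sketch below is an assumption about structure, not the team's actual template; the class name, field names, and example values are all hypothetical, and the example deliberately contains no real numbers.

```python
from dataclasses import dataclass, field


@dataclass
class LaunchAndLearn:
    """Hypothetical shape for a launch-and-learn write-up."""
    feature: str
    hypothesis: str
    measurements: list[str] = field(default_factory=list)
    learning: str = ""
    next_steps: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True only when every part of the write-up has been filled in."""
        return all([self.hypothesis, self.measurements, self.learning, self.next_steps])

    def to_email(self) -> str:
        """Render the record as a plain-text email body."""
        lines = [
            f"Launch and learn: {self.feature}",
            f"Hypothesis: {self.hypothesis}",
            "Measurements:",
        ]
        lines += [f"  - {m}" for m in self.measurements]
        lines.append(f"What we learned: {self.learning}")
        lines.append("What we're doing about it:")
        lines += [f"  - {s}" for s in self.next_steps]
        return "\n".join(lines)


# Placeholder example loosely based on the button anecdote in the talk.
report = LaunchAndLearn(
    feature="New action button",
    hypothesis="Users will discover the button without guidance.",
    measurements=["Share of sessions that use the button (placeholder metric)"],
    learning="The button was less discoverable than expected.",
    next_steps=["Try a more prominent design in an upcoming sprint"],
)
print(report.to_email())
```

The `is_complete` check reflects the accountability point: a write-up without a stated next step is not finished.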
Another way we do this is through actual iterative development. This is a little bit of an old example, but I still think it's pretty accurate.
One of the problems that my team found was that we were having a pretty big drop-off as people were acquiring our service. When they would come to sign in, there were definitive points in our conversion funnels where they were dropping off. One of the biggest points they were dropping off is after creating an account with our service, they were not actually using that account.
Originally, one of our hypotheses was maybe it's because of this terrible UI. There's too many things and decisions for them to make. So we brought them through the lab, and we got a bunch of studies done.
Then when we shipped it, we actually took the feedback and the qualitative analysis that we got from the forums, from the bugs that were filed, and we incorporated it into iterative development to make multiple different visualizations over time. We shipped them one after the other, and we were looking and measuring how our actual conversion improved over time.
That was an important learning for us because what we found was that it wasn't a linear progression. Sometimes we made a step forward, sometimes we made a step back, but the goal was constantly looking at how we're going to improve this specific design, sprint over sprint, because it's not done. Just because we shipped it doesn't mean it's as good as it can be.
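The sprint-over-sprint comparison can be sketched as follows. The counts are made-up placeholders chosen only to show a non-linear progression; they are not figures from the talk, and the function names are hypothetical.

```python
def conversion_rate(accounts_created: int, accounts_active: int) -> float:
    """Share of newly created accounts that went on to be used."""
    return accounts_active / accounts_created if accounts_created else 0.0


def sprint_over_sprint(rates: list[float]) -> list[float]:
    """Change in conversion rate between each consecutive pair of iterations."""
    return [later - earlier for earlier, later in zip(rates, rates[1:])]


# Placeholder counts per shipped iteration: (accounts created, accounts later used).
iterations = [(1000, 400), (1000, 450), (1000, 430), (1000, 500)]
rates = [conversion_rate(created, active) for created, active in iterations]

for number, delta in enumerate(sprint_over_sprint(rates), start=2):
    if delta > 0:
        direction = "step forward"
    elif delta < 0:
        direction = "step back"
    else:
        direction = "flat"
    print(f"iteration {number}: {delta:+.1%} ({direction})")
```

Tracking the deltas rather than only the latest rate makes the occasional step back visible, which is the behavior described above.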
We now are starting to leave room in design where once you ship it, you're not off to the races to the next thing. I might have 10 people on my team, and I might have two people allocated to responding purely to feedback for the next one or two sprints after I ship so that I can actually have room in my planning for doing this kind of build, measure, learn.
So that was a quick lap around some of the ways we actually are bringing user feedback from the lab, as well as our build, measure, learn, into our daily environment.
One of the calls to action I wanted to call out was the DevOps assessment. If you go over to our booth, you can learn more about this. I'll be at booth 206 after this if you have any more questions, but we have about three minutes left, so I might as well open it up now.
Q&A
Any questions from the room?
Quiet audience. I know it's the end of the day. Yes.
Q: So you mentioned the one site, usertesting.com, right?
A: So the two sites that we've used for broader user testing are usertesting.com and UserZoom. They're third-party services. They're paid services, and we use them quite extensively. We've actually found it's much cheaper to bring people there than it is to pay somebody 100, 150 bucks to come in the lab for an hour-long study.
Other questions?
Wow, finished with three minutes to spare. All right, I'll be around if there's any questions. Thank you very much for your time, and enjoy the last part of the conference.