Shifting QA Left - Emerging Trends in Code Quality and Security Automation

Las Vegas 2019

This talk will discuss various advances in program analysis technology that enable a larger class of bugs to be detected earlier in development (and even to be automatically fixed in some cases).

The talk will focus in particular on recent developments that enable tight integration of program analysis tools into DevOps processes.

These new techniques have been pioneered by academia and operationalized at scale (billions of lines of code / thousands of commits per day) by large tech companies such as Google and Facebook.

The talk will conclude with best practices for organizations interested in incorporating modern program analysis into their development workflow.

Full transcript

Stephen Magill

Thanks everyone for coming to this session. So I'm Stephen Magill. I'm CEO of MuseDev, and just to tell you a little bit about myself. So I've spent the first part of my career building code analysis tools, tools to find bugs in software, mostly on the research side of that question. First, doing work in academia and then recently in industrial research labs. But over the last couple of years, I've gotten more and more interested in this question of how do we improve the impact of these tools, right? So, the tool is part of it, but it's the workflows and the processes and the integrations that you build around that that have a large impact on how much value these tools bring.

And so how can we improve that value? And in particular, looking at what Google and Facebook are doing in this space in terms of how they apply code analysis at scale in their companies and the impact that that has on QA processes. So that's what I'll be talking about today. And I like to start talks like this with a slide on why code quality and security is important. And slides like that usually have some story about a disaster, like a data breach or an intrusion, or maybe even something blew up because of a coding error. And then some scary graph showing the impact of that, maybe declining stock price or user trust going down. But I imagine we all have stories like that in our minds from our own organizations, right?

There's always those incidents or those close calls that really motivate the need for a code quality process and a focus on security. And so I'm just going to assume that people in the shifting QA left talk care about QA, and talk about some of the best practices that, again, I imagine a lot of people in this room are already following. So things like, revision control, testing, doing a peer-based code review, which is still the best way to improve code quality and catch lots of different types of bugs. And using static analysis tools to find things that testing's not good at or help find errors more efficiently. And then doing instrumentation and monitoring to collect production data and learn from performance and reliability results in the deployment scenario. And then also, in addition to all of this, most organizations layer on top of this a separate QA process. It's usually a separate team.

Maybe it's called the QA team, maybe it's called AppSec, but they're focused on testing the software, looking for errors, looking for security problems, and then sort of launching those back to development as issues in a bug tracker or something like that. And this team is usually separate from the development team. And like I said, they interact by sort of filing tickets. And in particular, this static analysis box that I'm showing here, it usually lives on the QA side. So, often the QA team will use static analysis tools as one of the things in their toolkit to identify issues, and then file tickets for the things that look most impactful. And so really, a large part of the focus of this talk is on that static analysis box and shifting it sort of away from QA and security into a proper member of this stack where developers are interacting with it all the time, and it's not a separate workflow, and there's not the overhead associated with that. And so there's many choices for how you could do that integration.

Obviously, we want to maximize the outcomes of this process while minimizing the actual process required, right? So obviously you could improve code quality by just doubling the size of your QA team and also adding a lot of developers so they can triage all the issues that are coming their way, but that's a horribly inefficient way to maybe get marginal improvements in quality, right? So how can we do this with less process, less communication, and just less slowdown? And so a lot of it, as I mentioned at the beginning, has to do not so much with the tools themselves, but with how they're integrated into the process, how they're orchestrated. So that'll be a large focus. And as I said, this is all motivated by what Google and Facebook are doing in this space. Both companies have published quite a bit on what their developer productivity groups are doing, what they use in terms of tooling, and in particular, how they go about incorporating those tools and what they've found to work and what doesn't.

And so a lot of those lessons learned and best practices I've rolled up into this session. And really, I've distilled it down to four key principles. So, use multiple tools. There's generally not a single tool that will hit all the types of errors that you care about. Integration matters. It's not just about the tools you choose, but actually how you incorporate them into your workflows. Cherish developer trust.

So this point is really that it's all about developers, and if developers aren't getting value from these tools and from this workflow, they won't care, and they'll stop responding. And then principle four is that these tools, when integrated the right way, can actually support productivity. So there's this myth that the more tools you add into your CI process, the more things that you add as checks along the way, the more you slow down development velocity, release velocity, and so forth. And, that's just not true, right? If you do it the right way, you can get additional checks in there, and actually support more productivity or enable new engineering efforts that wouldn't have been possible otherwise. And so I have a story about that. All right.

So, as I said, while these key principles, I think, can be applied in a lot of aspects of DevOps, testing, the instrumentation monitoring I mentioned before, I'm going to focus here on static analysis. That's my space. That's what I know best. And so I'm going to first say, what is static analysis, right? And at its core, it's just saying something about the behavior of the program without actually running the program. And it could be looking for errors, things like this piece of data isn't properly encrypted in this place in the code, or it could be trying to prove the absence of errors. Everywhere in this code, we're properly encrypting customer data.

And different tools will target sort of more or less different sides of this balance. Am I just looking for bugs that I'm very confident in? Or am I trying to find all bugs and maybe I report some things that aren't bugs along the way? Right. Whatever approach they take. There's a number of different domains of interest that you can target, right? So we all know about security.

That gets the most attention, right? Using tools to find security problems automatically. But you can also use these tools to find performance issues, evaluate readability and maintainability, look for problems that can cause reliability issues, and evaluate just overall correctness of the code. Whichever thing you're trying to check, there's then the question of what underlying analysis technology do you use to answer that question? And again, there's a range of options. So the simplest static analysis tool you can go run it yourself right now, grep, right? So grep will tell you things about the code without running the code, right?

It will look for particular syntactic patterns. And so here's an example of the sort of thing you could do with grep, right? I'm saying there's this init connection function here, and I'm assuming this sets up some sort of encrypted communication channel. And the first argument is the cipher to use, how to encrypt things. And so here it's asking for, we're providing 3DES, which I'm assuming is Triple DES, which is still supported by many applications today, but has for a long time been known not to be a secure crypto system. So you might want to make sure that your code doesn't do that. It doesn't initiate a connection with that weak crypto system.

And so you can do that by just searching for that string, maybe making some allowance for whitespace. You'll pick up direct instances like this. You'll miss cases where maybe that constant gets assigned to a variable and then flows into the function, or maybe it flows into the surrounding function via some parameter. And so for those sorts of things, there's more advanced approaches based on graph analysis and looking at the code as a graph, computing various derived graphs, things that talk about, say, how data flows through the program or how control flows through the statements in the program. And those sorts of analyses can pick up those cases I just mentioned where the value indirectly flows into the function. And so you can get more general patterns from that. And so if that's a more advanced approach to analysis, there's a still more advanced approach that there's a variety of tools I'm lumping together here in something that I'm tagging compositional program analysis.
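A minimal sketch of the gap between a textual search and a data-flow-style check, written in Python for illustration. The function name `init_connection`, the `"3DES"` string literal, and the two sample snippets are assumptions standing in for what was on the slide. The "data flow" here is only a tiny constant-tracking pass over top-level statements, far simpler than the graph-based analyses real tools perform.

```python
import ast
import re

# Hypothetical names: `init_connection` and the "3DES" literal are assumptions
# standing in for the function and cipher shown on the slide.
WEAK_CIPHER = "3DES"
GREP_PATTERN = re.compile(r'init_connection\(\s*"3DES"')

# Two sample programs: the weak cipher passed directly, and via a variable.
DIRECT = 'conn = init_connection("3DES", host)\n'
INDIRECT = 'cipher = "3DES"\nconn = init_connection(cipher, host)\n'


def grep_check(source):
    """Purely textual search, like grep with some allowance for whitespace."""
    return GREP_PATTERN.search(source) is not None


def dataflow_check(source):
    """Track string constants assigned to names, then resolve the first
    argument of each init_connection call against that table."""
    constants = {}  # variable name -> string constant last assigned to it
    for stmt in ast.parse(source).body:  # top-level statements, in order
        for node in ast.walk(stmt):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id == "init_connection"
                    and node.args):
                arg = node.args[0]
                if isinstance(arg, ast.Constant):
                    value = arg.value
                elif isinstance(arg, ast.Name):
                    value = constants.get(arg.id)
                else:
                    value = None
                if value == WEAK_CIPHER:
                    return True
        # Record simple `name = "literal"` assignments; forget anything else.
        if (isinstance(stmt, ast.Assign)
                and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            name = stmt.targets[0].id
            if (isinstance(stmt.value, ast.Constant)
                    and isinstance(stmt.value.value, str)):
                constants[name] = stmt.value.value
            else:
                constants.pop(name, None)
    return False


print(grep_check(DIRECT), grep_check(INDIRECT))          # True False
print(dataflow_check(DIRECT), dataflow_check(INDIRECT))  # True True
```

The textual search catches only the direct call, while the constant-tracking pass also catches the case where the value first lands in a variable. It still misses values that arrive through function parameters, which is where the deeper analyses come in.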

But the idea here is you just view the program as a collection of graphs. You compute something over each graph, and then you join the results together, and you use that to answer very deep, very whole program properties about your code. Things like synchronization errors, right? Do the UI thread and the networking thread, are they properly synchronized? Is this class thread safe? Things like that. All right.

So you've got this range of analyses from simple to more advanced deep analyses. And the point I want to make here is it's not just a matter of using more advanced analyses. That's not the right approach. Each tool has its place, right? And so for simple things like simple API misuse errors, looking for deprecated APIs, or looking for a list of authentication tokens that you want to make sure don't leak, simple searches are the way to go, right? For deeper properties like memory safety and thread safety, you need a deeper analysis, which generally takes a bit longer to run. So just that conceptual description of the space probably brings to mind that the multiple tools would be the answer.

But there's also empirical evidence that using multiple tools is helpful. So NIST, the National Institute of Standards and Technology, for 10 years now has been doing periodic evaluations of static analysis tools, both commercial and open source analysis tools. And they recently published a 10-year retrospective sort of summarizing what they've learned in all those evaluations. And one of the takeaways was that the results showed limited overlap between these tool reports, and that the use of multiple tools can increase overall recall and boost confidence in the results. Separately, Habib and Pradel looked at three open source static analysis tools. So Error Prone, Infer, and SpotBugs are all open source tools that you can go and download. And again, they had limited overlap on the set of benchmarks that they looked at.
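To make "limited overlap" concrete, here is a small Python sketch that combines reports from several tools and counts how many findings only one tool caught. The tool names and findings are made up for illustration; they are not results from the NIST evaluations or from any real analyzer.

```python
# Hypothetical data: each tool reports a set of (file, line, bug kind) findings.
reports = {
    "tool_a": {("Auth.java", 12, "null-deref"), ("Db.java", 40, "resource-leak")},
    "tool_b": {("Db.java", 40, "resource-leak"), ("Ui.java", 7, "race")},
    "tool_c": {("Net.java", 88, "weak-cipher")},
}


def combined(reports):
    """Union of all findings across tools."""
    return set().union(*reports.values())


def found_by_one_tool_only(reports):
    """Findings that exactly one tool reported."""
    return {
        finding
        for finding in combined(reports)
        if sum(finding in tool_findings for tool_findings in reports.values()) == 1
    }


print(len(combined(reports)))                # 4
print(len(found_by_one_tool_only(reports)))  # 3
```

In this toy data, the best single tool reports two findings while the three together report four, and three of those four would be lost if you dropped the one tool that found them.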

And so that's sort of results from the academic literature. Google has also found this empirically in production and published about how they use static analysis. So they have this platform called Tricorder, which is a static analysis platform. You can plug multiple tools into it. And as of January 2018, it included 146 different analyzers. The majority of which were actually written by developers, so individual developers on teams looking probably for API specific, application specific bug patterns that have come up in their code review efforts. So I'm not suggesting anyone go run 146 analyzers.

I think probably only Google has 146 analyzers to run. But clearly, using multiple analyses is important. And so which analyses should you go run? Well, when it comes to, there's commercial analyzers of course. But when it comes to open source, there's actually a collection of pretty good analyzers out there now. And so this would be my go-to list, right, if you're looking for things to deploy internally. So Google has released their Error Prone and Clang Tidy analyses.

Error Prone supports Java, Clang Tidy is for C and C++. These are sort of their internal workhorses for their code analysis efforts. Facebook has released their Infer tool, which supports C, C++, Java, and Objective-C. And then the open source community has a variety of tools that are useful. So PMD, SpotBugs, FindSecBugs, those all support Java analysis. And then the Clang static analyzer gives you some good results on C and C++. So already you have a collection of tools that you can go use.

But it's not just about picking the right tools and running multiple tools. The way that they're integrated has a huge impact. So let's talk about integration. And let's talk about how not to do static analysis. So one thing you could do is you could run these tools and never look at the results, right? And that seems silly. Why would you do that?

But, well, here's the answer. Someone bought a license of this tool, and they said that I should run it, and so I ran it on my code, and saved the results, and I'll go deal with them when I have time, right? Happens all the time. You could run the static analysis tools and file bug reports that then get ignored and deprioritized in favor of features. Again, happens all the time. So you might think, "Well, we'll be smart about this. We'll set up a process with incentives to make sure that these reports get acted on." So you can evaluate your QA team now on the bugs that they managed to get fixed. But then the development team inevitably tends to be prioritized and rewarded on features shipped and new products developed.

And so you have this sort of ongoing battle between QA and dev or security and dev. And a lot of, again, process overhead and waste. So much better is present results to developers when they want to see them, which sounds so easy, right? But isn't always done. So what does that look like? Well, from a timing perspective, the best time to display results, like if I'm a developer, when do I want to see results? Right after I wrote the code is the best time.

It's all in my head. I just wrote it. I know what's going on. I understand everything about this piece of code and how it interacts with the rest of the system. And so if you see an error that you think I should address, tell me now. It's very easy for me to fix it. Another good time is maybe it's not code I wrote, but it's code I just modified.

Someone else wrote it, I went in there and learned enough about it to make a code change. Again, if there's an error related to that code change, now is the time to know about it. Basically, as developers are working, they have a certain piece of the system paged into their head. That it's easy for them to focus on, easy for them to make changes in. And that's really the time to report results about that piece of code. Otherwise, they have to go familiarize themselves with some other part of the system, and there's all this context switching overhead. So good integration has to be timely, but there's a couple of other important components.

It has to be part of an existing developer workflow and use existing developer tools. And in all of this, I'm very focused on developers. In here, developer workflow, developer tools. Why? It's because developers are the ones who fix the bugs. Ultimately, it has to be a developer making that change to fix that bug. And so if you can get a process together that works for developers, that presents things to them the way that they like to see them, that's the best for everyone.

So what does existing workflow look like? This is a pretty typical development workflow. You have something like GitHub managing your repositories. The developer pulls the code onto their local machine. They use their favorite IDE to make some code change. And I'm not going to get into which one's better, no vi or Emacs wars here, but there's a variety of them. Things usually standardize again when it comes to the compiler.

So check that the code builds, run some tests locally, and then push the change up to GitHub, or whatever repository manager you're using, and that generally then kicks off a CI process. A CI/CD process. And so there's a CI-based build and test infrastructure. Generally, there's a manual code review step that's sort of the last step in a deploy pipeline. So any of these places are good places to integrate. In the IDE is great. In code review.

In CI, you can have a tool block the builds if you're very confident that that tool is going to be always providing meaningful results. But in general, any of these are good places. What are not good places? Well, follow me to a place well outside your workflow. So this is my awesome bug dashboard. I put this together last week. I spent a lot of time picking the right font and coming up with a great quality scale over there.

I'm really proud of that. And so you can see it's providing meaningful results. Like Thanos clearly doesn't understand what the balance operation means. I just don't think he knows what that word is. And in Brazil, the central services, that method has problems with repair. So these meaningful results that you might want to go fix and quality, got this nice quality graph, it's declining. We're up against a release deadline, we're pushing pretty hard, so quality's taking a hit.

But we know it's Joe's fault. He's the one who's really putting some bad code in there. So we know who to blame. So I'm being kind of facetious, but there's a lot of focus on dashboards like this. And they certainly have their place. A nice code quality dashboard, a list of bug results, that can be a great way to get a sense of where you are from a quality perspective, a great after action report if you've just finished some initiative to try and improve quality. It can be great for management level and higher.

But this is not an interface targeted at a developer. Developers are not going to want to pull this up and start ticking off bugs here. It's not part of their workflow. It's not oriented towards them. It's just not effective. So what's more effective? So here, this is the GitHub pull request workflow.

So the developers are very familiar with this. If you're using GitHub, anytime you submit a code change, it goes through this pull request system. And this is how code review works, peer-based code review. So starts with a description of the code change. Here, this is a performance improvement, so CPU usage was 60% here. After the change, it's down to 35%, so improved performance. Lots of detail on the particular scenario and how performance was impacted.

And then below that is a discussion. So it's conversation among developers on the team about what additional changes should be made before this code is merged. And so this person says, "Would it make sense to immediately return null after each condition check?" "Yes, I think that would be good." And they make the change, and then progresses. So this is a great place to insert tool results. There's already a conversation happening about what's good about the code, what's bad about the code, what needs to change before it's merged. So if there's some tool that knows something that should get fixed. Now's the time to report it.
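One way to picture reporting at code review time is to keep only the findings that land on lines the change actually touched, so the developer sees results about code they already have in their head. The sketch below is a hedged illustration in Python. The `Finding` type, the `changed` mapping, and the sample data are all hypothetical, and this is not a description of how GitHub, Tricorder, or Infer do it. Posting the surviving findings as review comments is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A single analyzer result (hypothetical shape)."""
    path: str
    line: int
    message: str


def findings_on_changed_lines(findings, changed):
    """Keep findings whose line falls inside a range touched by the change.

    `changed` maps a file path to a list of (start, end) inclusive line ranges.
    """
    return [
        f
        for f in findings
        if any(start <= f.line <= end for start, end in changed.get(f.path, []))
    ]


# Hypothetical example: the change touched lines 10-20 of Feed.java only.
all_findings = [
    Finding("Feed.java", 15, "possible null dereference"),
    Finding("Feed.java", 90, "unused variable"),
    Finding("Login.java", 12, "weak cipher"),
]
changed = {"Feed.java": [(10, 20)]}

for finding in findings_on_changed_lines(all_findings, changed):
    print(f"{finding.path}:{finding.line}: {finding.message}")
```

Only the first finding is reported here. The other two may still be real problems, but they concern code the author of this change did not touch.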

And again, it's part of their existing workflow. No one has to go check a separate webpage. No one has to remember to do some other process. And this is exactly what Google does. So here's a screenshot from a paper that Google put out on their Tricorder system. And just like I showed before, it integrates with code review. Their code review, it's an internal tool, so it looks a little bit different.

But you can see these are two analysis results, one from a linter, one from Error Prone, which is an open source tool that they've released. And in each case, there's this Please Fix button that the code reviewer can click on to say, "Hey, I think you should address this issue." If the developer disagrees, they can comment on why. Again, it kicks off that conversation that's already happening. All right. Here's another great story about why integration matters. So at Facebook, there's a team that supports a tool called Infer. So they were an acquisition, actually.

They were a startup building a static analysis tool, got acquired by Facebook, went in to apply their tool inside Facebook, and initially deployed it in that thing I showed on the second slide. Off to the side as part of an overnight run, where then they took the results and filed them as bug reports and associated tickets. And they'd spent a lot of time making sure that the analysis was reporting useful bugs that people should care about. And so they deployed it. They thought it was going to be great. They got almost no fixes. No one responded to these bug reports.

They then later took the same tool and deployed it in this code review workflow, what they call here the diff time deployment. And the same analysis tool with the same results saw a 70% fix rate. Suddenly, 70% of those issues were getting fixed, whereas none of them were before. So again, same tool, just a different integration and workflow. So that's how much integration matters. All right. Principle three is cherish developer trust.

What do I mean by that? Well, let's look at some of the things that can go wrong. Even if you do those two things I've already mentioned, using multiple tools, integrating them the right way, it could be the case that suddenly there are more automated results than developer comments. You're using so many tools, and they're all reporting things, so instead of having this conversation with my development team, now I feel like I'm just triaging tool results. And by the way, most of those tool results are just status updates saying, "Hey, the tool ran, didn't find problems." Or they're style suggestions that may apply to someone else's code base, but we don't follow that practice, so we ignore them. Or the results that say they're errors are not actual errors because the tool doesn't understand the code well enough, and it doesn't understand that actually we're doing this other thing that protects against that. And so like, at this point, if I'm on this team, I start looking for browser extensions to just filter all this noise off.

Or like, where's the off switch? How do I get this out of my pipeline? So there's this danger with a low signal-to-noise ratio. There's this danger that as more analyzers are added, the risk of tool fatigue increases. And so as you grow the set of analyzers you're using, it becomes even more important to maintain a high standard of quality among these tools. Because once this developer trust is eroded, once they decide that these automated results are much more likely to be not useful than useful, it's really hard to regain that trust. And so this potential effectiveness of you can just put the results in front of developers, and they'll just get fixed.

No one else has to be involved, no separate workflow, you've lost your shot at that. So you really want to make sure that you're paying attention to this. So how can you do that? Well, at Google, it's probably not a surprise to learn they're data-driven about it. So again, same screenshot I showed before. Each of these reports has this button called Not Useful. If you click that, their developer productivity team collects that data, notices that you didn't find that particular result useful, and if there's ever more than 10% of reports for a particular analysis that are flagged not useful, it gets pulled.

And it doesn't get re-enabled until the team can fix. Sometimes it's the description of the error isn't precise enough, or it's not understandable. It's actually an error, but it's just not well communicated. Sometimes it's just not a very precise analysis. But they're continually monitoring whether things get fixed and pulling things if they're not effective. So that's how you maintain a high standard of quality. All right.
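The feedback loop described here can be sketched in a few lines of Python. This is an assumption-laden illustration, not Google's implementation: the input format, the analyzer names, and the choice to divide "not useful" clicks by all reports from that analyzer are mine. Only the 10% threshold comes from the talk.

```python
from collections import Counter

NOT_USEFUL_THRESHOLD = 0.10  # the 10% figure mentioned in the talk


def analyzers_to_disable(feedback, threshold=NOT_USEFUL_THRESHOLD):
    """Return analyzers whose share of 'not useful' reports exceeds the threshold.

    `feedback` is an iterable of (analyzer_name, was_flagged_not_useful) pairs,
    one pair per report shown to a developer (hypothetical format).
    """
    total = Counter()
    not_useful = Counter()
    for analyzer, flagged in feedback:
        total[analyzer] += 1
        if flagged:
            not_useful[analyzer] += 1
    return sorted(a for a in total if not_useful[a] / total[a] > threshold)


# Hypothetical feedback: style_check is flagged on 3 of 10 reports,
# null_check on 1 of 20.
feedback = (
    [("style_check", True)] * 3 + [("style_check", False)] * 7
    + [("null_check", True)] * 1 + [("null_check", False)] * 19
)

print(analyzers_to_disable(feedback))  # ['style_check']
```

Whatever the exact formula, the design point is that the decision to pull an analyzer is driven by recorded developer feedback rather than by the tool author's opinion of the tool.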

So these three principles together, if you manage to get this right, all of these three combined, there's this nice synergy between these where you've got multiple tools producing results. You've taken care to make sure that those results are very likely to be useful. You've pulled any tools that aren't performing. And you've integrated into the developer's workflow. And so you're essentially making the most of developer attention. You're making sure that bugs get fixed when they're easiest and cheapest to fix in terms of time and effort. All right.

So, as an example of the outcomes that this enables, here are some sort of statistics from Google and Facebook. So at Google, they've reported that approximately 50,000 code review changes get analyzed per day. That Please Fix button I showed gets clicked more than 5,000 times each day. And so by their estimates, this system running this on a continual basis has prevented hundreds of bugs per day from entering the Google code base. Facebook has had similar results. So static analysis has prevented thousands of vulnerabilities from being introduced over the last two years. And it catches more severe security bugs than either manual security reviews or their bug bounty program.

And again, they collect the data so they know these numbers. What's the percentage of bugs found by each method? And just on the topic of data, data is so important. So think about even if you're already using sort of analysis tools or even testing, right? How much insight do you have into how useful those results are, right? Can you track those results through the whole workflow to see if they get fixed and see if they deliver value? Super important.

All right. So the last principle is that these tools can actually support productivity. And so for this, I want to just relay a story. So Facebook has described this story in a couple of talks and in an article that they published on how they use analysis tools. But basically, it's a story about the news feed component of their Android app. Right? So if you pull up the Facebook Android app, there's this feed that shows you the recent activity relevant to you.

It's called News Feed, and they wanted to move it from a single-threaded architecture to a multi-threaded architecture. So if you know anything about concurrency, just that statement should scare the crap out of you, right? Suddenly you've got all these classes that were assuming they'd only be called sequentially, one at a time in a single-threaded context, and now they're being called in a multi-threaded context, and everything has to be properly synchronized, or you can have these errors that are just very difficult to trace down, right? Multi-threading errors are notoriously hard to debug and stamp out. So hundreds of classes were impacted. And it just so happens that around the same time, the developer tooling group was working on a concurrency analysis. And so they got together, and they collaborated on tweaking that analysis to make sure that it served the needs of this product team, right?

And so they worked together. They deployed the tool and then did the re-architecting. And it was a success. They managed to do it. And at the end of the effort, one of the Android engineers communicated with the team that without Infer, which is the tool that they built this into, multi-threading in News Feed would not have been tenable, right? It was just too much risk that it wouldn't go well, that they'd get halfway through the re-architecture and have all these lingering bugs, right? And so the tool support was really critical to that.

And so I like this story because it goes beyond the reasons I described at the beginning for why doing this integration right and using the right tools can improve productivity, right? It simplifies the workflow. Fewer people are involved. It's more automated and more localized. But this shows how having the right guardrails in place, the right automated tools as part of your process, can even enable new engineering achievements that wouldn't have been possible otherwise. All right. So in terms of shifting QA left, the summary is static analysis tools have always been a way to automate QA, right?

I said that at the beginning. Often, these tools are one of the things in the toolkit of the QA team. But as analysis efficiency has improved, as these tools have gotten better and better and we've learned more about how to do effective DevOps, how to integrate tools into DevOps, these automated tools can really be brought left, closer to developers, and integrated better into the process. And if you manage to do this the right way, you can simultaneously improve code quality, right, what we're all after. You can get your QA stuff done while improving overall productivity. So Gene asked everyone to close with the help they're looking for. And so what I'm looking for here is tell me all the reasons this wouldn't work in your environment, right?

If any of you are thinking, that's great for Google and Facebook, but I'm not at Google or Facebook, this would never work for me, right? I want to know what are the roadblocks, what are the differences that you see? Where do you think this could and couldn't work, and what tweaks would have to be made? So I'll be in the Speakers Corner after this if you want to have more discussion on the topic or just come find me throughout the conference. Thank you very much.