Completing the Promise of CI/CD with Continuous Merge

Virtual US 2022

Pull requests have become the No. 1 undiagnosed bottleneck in engineering… and we have the data to prove it. After analyzing 1,000,000 pull requests, LinearB has uncovered CI/CD’s greatest weakness and - thankfully - a eureka moment that will forever change the developer experience: Continuous Merge.

Here’s the dirty little secret: not all PRs are created equal. They don’t all need the same level of effort and some of them don’t need any human intervention at all--it’s just a matter of using data and context to know which is which and creating intelligent automation workflows to handle PRs appropriately.

In this session, Dan Lines, COO and Co-Founder of LinearB, will dive deep into what companies can do to change their processes and how they are adopting Continuous Merge and other practices to merge and release faster, with higher quality.

We’ll cover:

The metrics and insight you need to identify bottlenecks and take action

The PR paradox and why treating them all the same just doesn’t work

The principle of Continuous Merge (CM) and how it’s enabled with gitStream

How LinearB streamlines code delivery and improves the developer experience

This session is presented by LinearB.

Full transcript

The complete talk, organized by section.

Dan Lines

Hey, what's up, everyone? Welcome to my talk today, which is titled Completing the Promise of CI/CD with Continuous Merge.

First and foremost, to introduce myself: my name is Dan Lines. I'm the co-founder and COO of LinearB. I'm also a former VP of Engineering. A lot of the things that you'll see throughout the talk today are things that we provide and products that we provide within LinearB. I'm also the founder of the Dev Interrupted community, which includes an amazing podcast where we're able to talk and bring on super interesting engineering leaders, totally free and all that. So I highly recommend you check it out if you haven't before.

The other important person that's kind of a character in this talk today is Trong, one of my friends. He's a LinearB user and he's the VP of Engineering at FloSports. For most of the talk today, we're going to be diving into some of the findings that led us to this thing called continuous merge, some of the findings that were happening within Trong's organization. We're going to be looking at the problem discovery: how did we come up with this concept of continuous merge, with some of the analysis, some of the data behind the concepts and experiments that we ran, and also probably the most important thing: how do we improve this process of continuous merge, implementation of that, and the results. So let's dive into it.

So first and foremost, I want to start with engineering metrics. Some of you may have some of these metrics if you're familiar with DORA, which has been out for a long period of time now, but some of you may not. The way that I think about engineering metrics is actually in three categories titled business alignment, engineering efficiency, and of course, probably the hottest topic in the industry right now, developer experience.

I'll go one by one just to give an example of some of the most common. When we think about business alignment, we're thinking about leadership, we're thinking about delivering projects on time. One of the best metrics in that area today, and one that I hope all of you are already tracking or can get to soon, is called planning accuracy. You can think of that as what we committed to do, for example within a sprint, versus what we actually did: plan versus actual. You can see that on the left-hand side here. A lot of the data that we are going to talk about today does come from the community, and usually we see a lot of teams here in the 60% range, and that's something that I hope as a community we can improve.
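As a rough illustration of plan versus actual, here is a minimal Python sketch. It assumes planning accuracy is the share of planned items that were completed in the sprint; the function name, the ticket IDs, and that exact definition are assumptions for illustration, and LinearB's own calculation may differ.

```python
def planning_accuracy(planned: set[str], completed: set[str]) -> float:
    """Share of planned items that were actually completed (0.0 to 1.0)."""
    if not planned:
        raise ValueError("planning accuracy is undefined with no planned items")
    # Only completed items that were part of the plan count toward accuracy.
    return len(planned & completed) / len(planned)


# Hypothetical sprint: five tickets planned, three of them finished,
# plus one unplanned ticket (T-9) that does not raise the score.
planned = {"T-1", "T-2", "T-3", "T-4", "T-5"}
completed = {"T-1", "T-2", "T-3", "T-9"}
accuracy = planning_accuracy(planned, completed)
print(f"planning accuracy: {accuracy:.0%}")
```

With these made-up tickets the result lands at 60%, the range the talk describes as typical.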

In the middle area of engineering efficiency, this is what usually engineering teams themselves are looking at. For example, cycle time is one of the most popular metrics, also one of those DORA metrics. Once you start getting into cycle time, what does it mean? Well, how long does it take on average to go from coding on a branch to release to production? The interesting breakdowns within cycle time, which you can see here, are coding, PR pickup time, review time, and deploy time.
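The four-phase breakdown can be sketched in a few lines of Python. The phase boundaries here are assumptions: coding is taken to run from the first commit to PR open, and deploy from merge to production release. The pickup, review, and deploy durations mirror the FloSports figures quoted later in the talk; the first-commit timestamp is back-computed so the total matches the nine days and five hours mentioned near the end, and is not a figure given in the talk.

```python
from datetime import datetime, timedelta


def cycle_time_breakdown(first_commit, pr_opened, first_review, merged, deployed):
    """Split one PR's cycle time into the four phases named in the talk."""
    return {
        "coding": pr_opened - first_commit,   # assumed start: first commit on the branch
        "pickup": first_review - pr_opened,   # waiting for the first review
        "review": merged - first_review,      # first review until merge
        "deploy": deployed - merged,          # assumed end: release to production
    }


# Hypothetical timestamps for a single PR.
phases = cycle_time_breakdown(
    first_commit=datetime(2022, 2, 26, 10, 0),
    pr_opened=datetime(2022, 3, 2, 9, 0),
    first_review=datetime(2022, 3, 3, 21, 0),
    merged=datetime(2022, 3, 6, 14, 0),
    deployed=datetime(2022, 3, 7, 15, 0),
)
cycle_time = sum(phases.values(), timedelta())

for name, duration in phases.items():
    print(f"{name:>6}: {duration}")
print(f" total: {cycle_time}")
```

Averaging each phase across many PRs, rather than looking at one, is what exposes which phase is the bottleneck.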

The last area that I think is very, very important, and like I say, a hot trending topic right now, is around developer experience: what we're doing to reduce toil, making sure that the developers are able to get their code merged as efficiently as possible with high quality. The metric there that we typically like to look at is called merge frequency, which is how often are we able to get PRs merged.

Of course, if you are using some of the tools in the industry or you're tracking this yourself, the next logical thing that you start saying is, okay, for my organization or for my team or for my project, whatever it is, where are the bottlenecks? That's something that you want to highlight really early on. For example, on the business alignment side: okay, we have a bottleneck with our planning accuracy. It's very low. We're only completing 24-25% of what we want to complete. On the engineering efficiency side, for example within cycle time, we might find, hey, actually that PR review process is what's slowing us down. For the dev experience side, you might see a dip in merge frequency. Hey, what's going on here? We're not doing a good enough job helping our developers get code merged. So of course, bottlenecks are what I think comes to mind for most people.

When Trong and I started working together, we deployed LinearB at FloSports and started looking at these metrics. We found something super interesting, which is what the talk is based on today. What you're looking at right here is cycle time. Again, for cycle time you have your coding time, you have your PR pickup time, you have your PR review time -- how long does it take to actually get the code reviewed and merged -- and then we have deployment time. What we found together is actually coding time looks very, very good. We don't have a bottleneck there. Deployment time actually looks very, very good: one day and one hour. I think a lot of the investments over the last 15 to 17 years in CI/CD have helped most organizations get to where deploy time is pretty good.

But the area that was a little bit surprising, and that may be where a lot of your teams or developers are actually experiencing pain, is actually in the code review process, which is highlighted here in red. We call that the PR lifecycle. For FloSports, the PR pickup time -- the amount of time that the developer was waiting after a PR was opened to get the first piece of review -- was one day and 12 hours, which is obviously pretty high. Then the amount of time to actually complete the review and get the code merged was creeping up on three days, which at first was a little bit surprising.

What we did: we actually have this really nice LinearB Labs team where we're able to do PR research. We analyzed about 1 million pull requests, almost four million reviews across many, many developers. What we found was the issue that was happening at FloSports with Trong's team was actually very, very common. In the community right now, one of the biggest bottlenecks is in this PR process or code review process. That's where code is actually getting stuck.

If we dig into the data a little bit more, what we found is that these pull requests are actually mostly idle. What we mean by idle is they're sitting there. They're getting stale. They're not getting reviewed. They're not getting comments added. They're not getting change requests or updates. The actual data is: 50% of PRs are idle for 50% of their lifespan, so half the time they're just sitting there. And 33% of PRs are idle for almost 78% of their lifespan.
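One way to make "idle for X% of its lifespan" concrete is sketched below in Python. The talk does not give the exact definition used in the research, so this assumes a PR counts as active for a fixed window after each event (open, comment, review, push) and idle for the rest of any gap. The four-hour window and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def idle_fraction(opened, closed, events, active_window=timedelta(hours=4)):
    """Fraction of a PR's lifespan spent idle, under the assumed definition.

    `events` are timestamps of activity between `opened` and `closed`.
    Any gap longer than `active_window` contributes its excess as idle time.
    """
    lifespan = closed - opened
    if lifespan <= timedelta(0):
        raise ValueError("closed must be later than opened")
    points = sorted([opened, *events, closed])
    idle = timedelta(0)
    for previous, current in zip(points, points[1:]):
        gap = current - previous
        if gap > active_window:
            idle += gap - active_window
    return idle / lifespan  # timedelta / timedelta yields a float


# Hypothetical PR: opened Monday 09:00, one comment Tuesday 09:00,
# merged Wednesday 09:00.
opened = datetime(2022, 3, 7, 9, 0)
closed = datetime(2022, 3, 9, 9, 0)
fraction = idle_fraction(opened, closed, [datetime(2022, 3, 8, 9, 0)])
print(f"idle for {fraction:.0%} of its lifespan")
```

A different activity window or event list changes the number, so any comparison across teams needs one shared definition.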

You can see it at the bottom of the slide: if we're not moving the PRs to a merge point and not getting them merged, that's bad for a few reasons. The first reason that comes to mind is from a quality perspective. As code is moving and changing underneath the PR, the PR is getting more and more stale. There are more and more conflicts. Another reason: of course, if I'm a developer and I open a PR, typically I'm not just waiting for that PR to be reviewed. I'm moving on to my next task. But what if there are change requests that come to me? What if there are questions from the reviewer? My cognitive load is starting to increase. I have my mind now on another task and it's hard to come back to old code that should have been merged. These are the types of things that happen as we're not efficient in our code review process.

What we did next at LinearB, in the research that followed, was ask: okay, why is this happening? Why are these PRs just sitting here for such a long period of time? Why are we unable to get through this process? There are actually three reasons that came out of our investigation: awareness, context, and automation.

I'll go one by one and give an example of what each one of these means. When we think about awareness, it's actually kind of just the concept of: hey, I have a lot going on as a developer. I have my own work, for example, that I'm accountable to get delivered, and I don't always even know that someone has assigned a PR to me. And vice versa: if a review has started, and I'm the PR owner and I've moved on to another task, and a change request has come back to me, I don't have that immediate awareness that the PR is now on my plate and the reviewer has requested changes from me. So that's the basic awareness back and forth of what's needed, where, and who is responsible to get the PR pushed through.

The next thing that we found was actually around context. We don't have context about the PR. The reviewer does not have context about the PR to make the best review decisions possible. A few things come to mind when we think about context. One is some of the basics: what is this PR for? Is it fixing a bug? Is that bug highly urgent or is it not urgent? Is it related to a feature? Which feature is it related to? Is it a test configuration change, or security changes versus API changes or core functionality? Another piece of context I think is super interesting that I'll show later is: how long will this review actually take me to complete as a reviewer?

What happens -- and this is kind of a natural human thing -- is that when I have less and less context as a reviewer and a PR is assigned to me, I do not have visibility into what I'm getting into, so I tend to want to put it off. I'm going to put off unknowns. But as we get more and more context about the PR, I can sit down and say, okay, I know what I'm getting into. I know how long this is going to take me to review, and let me set aside 15 minutes, for example, because it's a 10-minute review. I can get that done before my next meeting, something like that. That's what context means.

The last area that we found is that although we've made a lot of automation advances in CI and CD, we actually haven't made that many automation advances in the review process itself. What we saw across lots of organizations is that although each PR is unique -- some of them are large, some of them are small, some of them are touching sensitive areas, some of them are touching non-sensitive areas of code, some of them are actually opened by a bot like Dependabot or something like that -- they're all unique in themselves, but they're actually being handled in exactly the same way. Every PR is going through the same review process: typically one reviewer. I'm waiting for one reviewer to give the looks-good-to-me and that's it. But it's kind of crazy to treat every single PR exactly the same when there's a wide variance in their severity, their urgency, and what it takes to actually get the review completed. So these are the three problems that we found.

What I did working together with Trong, in Trong's organization at FloSports, was try to address each one of these problems of awareness, context, and automation. The first thing that we did with Trong was deploy a bot, a chat bot that we call WorkerB. Those are the examples that you can see here on the left-hand side and the right-hand side. This is a Slack bot called WorkerB that's provided by LinearB. What WorkerB does is address both the awareness side of things and the context side of things, and I'll describe a few ways that it does that.

First and foremost, when a pull request is assigned to a developer, they'll receive a real-time notification: hey, just wanted to let you know that this PR is on your plate, give you a heads up, the reviewer is waiting for you. And vice versa, going through that PR process: if changes are requested, that will be sent back to the PR owner. Hey, changes are on your side now. Let's see if we can get things moving. That's some of the basic awareness.

On the context side, which is pretty cool, every alert that comes through WorkerB will always have the project ticket associated. For example, here you can think of it like a Jira ticket, "Admin can self-configure projects." You can tell if this is a bug or this is a feature, but there are also additional labels with further context, such as estimated review time, which you can see here in the bottom left-hand side.

What does this allow the developer to understand? Immediately: hey, I have this PR that's assigned to me, but it's actually a PR that's going to take me quite a lot of time to review. It's going to take about an hour. So I know that this is a large PR. I know I need to set aside time, probably an hour or an hour and a half, to get this done. By the way, one of the worst things that we found happens is if a developer starts reviewing a PR and then gets interrupted and stops. For example, in this situation, let's say that they did not know it would take about an hour. They set aside 15 minutes, 30 minutes before their standup, retro, next meeting, whatever it is. They start the review and then have to stop. When they go back and start the review again, actually that cognitive reload happens and they're starting over from scratch. This is the type of awareness and context that really helps with that PR pickup time and actually helps complete the review faster, so they're not interrupted.

Another example of context: on the right-hand side here, this type of bot, WorkerB, can say something to the developer such as, hey, this PR has been assigned to you, but it's actually really, really small. In this situation, it was only one commit. It was one file. It's three lines of code that's being changed. It looks like just a modification to a version of a third-party tool, something like that. Could you review this right now and actually review it inline? Again, that will really help the owner get this pushed through and unlock that cycle time bottleneck.

Once we rolled out WorkerB, here's what happened at FloSports. Again, when they started out they had a PR pickup time of one day and 12 hours, and we were actually able to get that down to seven hours just by improving the awareness and the alerting and context with WorkerB. That was a fantastic win.

Now the other side of it is, well, how long is the review actually taking? We can see PR pickup time was improved a lot, but what about PR review time? That's where at FloSports we rolled out a tool called gitStream, which is a review automation tool. Earlier on, what I was saying is there's kind of a lack of automation when it comes to reviewing code. We're treating every PR the same.

What gitStream allows engineering organizations to do is actually start treating PRs uniquely, whether they're small changes, whether they're large changes, whether they're critical changes. For example, very, very small changes that are maybe documentation only, and all the tests have passed, we can get into an auto-approved situation where you empower the developer to decide that they would like to merge that or ask for a review. Certain PRs, for example, you need to find the right reviewer, perhaps an expert reviewer. If we see within the code, hey, this is actually all UI changes, let me find the UI expert or the person that's modified that code the most and automatically assign that expert reviewer.

For example, you can have a really critical change. It could be a large PR. It could be touching sensitive services. It could be changing APIs. In that situation, you might want to assign multiple reviewers, not just go with that standard one reviewer for every single PR. What actually happens is if you have a review process where no matter what the PR is, only one reviewer is always assigned, that's where you'll start to get a lot of those LGTMs: just looks good to me; I didn't really review this. I know I always get assigned regardless of the style of change. You get kind of that review fatigue, where you're not really getting that high-quality review. But if you're able to actually diagnose the type of change and you have a system like gitStream on, and this is all coded in a YAML file to your liking, then when a human is actually called for the review, they know that they're really needed. You get less of that LGTM, quick, cursory review.

What we rolled out with FloSports: there are kind of three categories that gitStream is able to automate, or three different types of information. There's code review automation. That's where gitStream is going to inspect the code, and for example, it can automatically request a change back to the developer before a human even comes in. Reviewer automation has to do with how many reviewers should be on this PR, who is the expert reviewer, and maybe you also want to get a junior reviewer on there as well for knowledge sharing. Then also there is context. gitStream can label every single PR with whatever context you want, like that estimated time for review.

Here are the implementations that we started with at FloSports using gitStream. They had the situation where they had some bots that were making basic configuration changes, bumping minor versions. In those situations, if all tests have passed and everything looks good, we actually were able to do an auto-approve and merge, saving a lot of time for developers not having to get distracted with PRs that really should just be merged automatically.

We were also able to put in auto-approve, again putting that empowerment back on the developers if there are only formatting changes or tests-only changes and all the CI has passed. Yeah, let's empower our developers. So we're able to implement that as well.
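The idea of handling each PR according to what it contains can be sketched as a small rule function in Python. This is not gitStream's `.cm` syntax and not the FloSports configuration; every name, path prefix, and threshold below is a hypothetical stand-in. Formatting-only detection is left out because it needs a diff-level check, so the sketch covers tests-only and docs-only changes by file path instead.

```python
from dataclasses import dataclass

BOT_AUTHORS = {"dependabot[bot]"}                     # assumed bot logins
SENSITIVE_PREFIXES = ("services/payments/", "api/")   # hypothetical sensitive paths
LARGE_PR_LINES = 400                                  # hypothetical size threshold


@dataclass
class PullRequest:
    author: str
    files: list[str]
    lines_changed: int
    ci_passed: bool
    minor_version_bump: bool = False


def is_test_or_doc(path: str) -> bool:
    """Path-based guess at whether a file is a test or documentation."""
    return path.startswith(("tests/", "docs/")) or path.endswith(".md")


def route(pr: PullRequest) -> str:
    """Pick a review path for a PR instead of treating every PR the same."""
    if not pr.ci_passed:
        return "wait_for_ci"
    if pr.author in BOT_AUTHORS and pr.minor_version_bump:
        return "auto_approve_and_merge"
    if pr.files and all(is_test_or_doc(f) for f in pr.files):
        return "auto_approve"  # the author still decides whether to merge
    touches_sensitive = any(f.startswith(SENSITIVE_PREFIXES) for f in pr.files)
    if touches_sensitive or pr.lines_changed > LARGE_PR_LINES:
        return "two_reviewers"
    return "one_reviewer"


bump = PullRequest("dependabot[bot]", ["package.json"], 2, True, minor_version_bump=True)
docs = PullRequest("alice", ["docs/setup.md", "README.md"], 30, True)
risky = PullRequest("bob", ["api/routes.py"], 120, True)
for pr in (bump, docs, risky):
    print(pr.author, "->", route(pr))
```

Rule order matters: the CI gate comes first so nothing is auto-approved on a failing build, and the cheapest automatic paths are tried before a human is pulled in.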

We also implemented request changes. In Trong's case for FloSports, this was on API deprecation. There were a few APIs that we didn't want to be using any longer. If there was a code change and those deprecated APIs were used, the developer would automatically receive a change request back. We don't have to waste human reviewer time to just say, hey, this API we're no longer using; that's not a good use of human review time for each one of the PRs. We also added the ERT, estimated review time, and applied it to every PR. That's a label or comment that goes on the PR and helps the PR owner understand how long the review will take.
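The deprecated-API check can be approximated with a scan over the lines a PR adds. The Python sketch below reads a unified diff and reports added lines that mention a deprecated name; the API names and the sample diff are invented, and a plain substring match is an assumption that a real check would likely tighten.

```python
DEPRECATED_APIS = {"legacy_auth.login", "v1_client.fetch"}  # hypothetical names


def deprecated_usages(diff_text: str, deprecated=DEPRECATED_APIS):
    """Return (api_name, added_line) pairs for added lines using a deprecated API."""
    hits = []
    for line in diff_text.splitlines():
        # Added lines start with "+"; "+++" is the file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name in sorted(deprecated):
            if name in line:
                hits.append((name, line[1:].strip()))
    return hits


sample_diff = """\
--- a/app.py
+++ b/app.py
@@ -1,3 +1,3 @@
 import v2_client
-old = v1_client.fetch(url)
+data = v1_client.fetch(url)
"""

hits = deprecated_usages(sample_diff)
if hits:
    for name, code in hits:
        print(f"request changes: {name} is deprecated ({code})")
```

Removed lines are ignored on purpose, so a PR that deletes a deprecated call is not sent back to its author.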

Lastly on the reviewer automation side, we started assigning expert reviewers. Who was the code owner? Who has made the most commits or changes or reviews? We can automatically assign, for example in that UI change situation, the UI owner.
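A simple version of "who has made the most commits" looks like the Python sketch below. It assumes commit history is available as (committer, file path) pairs and ranks people only by how many commits touched the files in the PR. How gitStream actually determines expertise is not described in the talk, and the names and paths are hypothetical.

```python
from collections import Counter


def expert_reviewer(changed_files, commit_history, pr_author):
    """Pick the person with the most past commits to the changed files.

    `commit_history` is an iterable of (committer, file_path) pairs.
    The PR author is excluded so nobody is asked to review their own change.
    Returns None when nobody else has touched these files.
    """
    changed = set(changed_files)
    counts = Counter(
        committer
        for committer, path in commit_history
        if path in changed and committer != pr_author
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical history for a UI change.
history = [
    ("maya", "ui/button.tsx"),
    ("maya", "ui/button.tsx"),
    ("maya", "ui/modal.tsx"),
    ("omar", "ui/modal.tsx"),
    ("lee", "ui/button.tsx"),
    ("omar", "server/auth.py"),
]
reviewer = expert_reviewer(["ui/button.tsx", "ui/modal.tsx"], history, pr_author="lee")
print("suggested reviewer:", reviewer)
```

Commit count is only one signal; past reviews or code ownership files could be weighted in the same way.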

Just an example to visualize, because it can be a little bit difficult understanding behind the scenes: with gitStream, everything is in this YAML. We actually call it the .cm file. This is where you can write out all of your automations. For example, a PR will come up and it'll say, okay, you do have this review required, but let's request the review from our UI expert. Of course, WorkerB will also ping that person and say, hey, you've been assigned and this is how long it will take. Good news: it's only a three-minute review.

When we rolled that out at FloSports, that's where we were able to see the decrease in PR review time. We were able to go from two days, 17 hours all the way down to about a day and a half, which I think is a very good review time. If we look at the overall cycle time improvement, we started out at about nine days and five hours. We were able to get it down to five days and 12 hours and get that PR pickup and review time from a red into a green area. That was a 40% decrease in cycle time, which I think all development teams and developers will appreciate.
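The percentages behind these before-and-after figures can be checked with a short Python calculation. The durations are the ones quoted in the talk, with "about a day and a half" taken as exactly 36 hours, which is an approximation.

```python
from datetime import timedelta


def percent_decrease(before: timedelta, after: timedelta) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


figures = {
    "PR pickup time": (timedelta(days=1, hours=12), timedelta(hours=7)),
    "PR review time": (timedelta(days=2, hours=17), timedelta(days=1, hours=12)),
    "cycle time": (timedelta(days=9, hours=5), timedelta(days=5, hours=12)),
}

for name, (before, after) in figures.items():
    print(f"{name}: {percent_decrease(before, after):.0f}% decrease")
```

The cycle-time line comes out at roughly 40%, matching the figure in the talk: 221 hours down to 132 hours.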

Again, if you're thinking about, okay, what is this thing continuous merge, and how do we complete the promise of CI and CD? I think CI and CD did a lot of good for the industry. But what we saw is it did not solve all the problems of code getting stuck and not being able to get out to production. If you put CM, CI, and CD together, those three can really unblock all bottlenecks. When we're thinking about CM, we're thinking about PR awareness and PR context and the automation that goes along with it. It's really that practice of improving the code merge efficiency and quality. Awareness plus context plus automation equals happiness. Of course, let's automate the review as much as possible, but also know that we should be using human reviewers when needed, so that they know they really are needed and can be focused. At the end of the day, if you are on the metric side of things, you'll see that the merge frequency for the team will increase, which is always a good thing.

Thanks for listening to my talk on CM. If you're interested in this kind of stuff and you want to automate that PR process a little bit more, check out gitStream. That's going to be linearb.io/dev/gitstream. Totally free, highly recommended to try it. If you're on the metric side of the house and you're saying, hey, those metrics around planning accuracy, cycle time breakdown, and merge frequency -- if that's something that you don't have today, check out linearb.io/get-started, also free to try. You'll get all of those DORA metrics out of the box. I highly recommend using both of these together. Again, thank you for listening today. My name is Dan Lines, founder of LinearB, and I'm very happy to be here. Thank you.