Context Switches in Software Engineering
Chris Hill
I'm Chris Hill. I'm from Jaguar Land Rover, and good afternoon. Today, we're going to talk about context switches in software engineering. This is actually probably one of my favorite topics to talk about.
Just to give you a little bit of background on JLR: we're about 42,000 employees. We make about 22 billion pounds a year in revenue. That's about $30 billion. We've got about 135 software groups across 5,000 software personnel, internal and external.
I currently head up systems engineering for the current generation of infotainment. If you were to sit in one of our Jaguar Land Rover vehicles that were made this year or last year, the last two years, you would use my current product.
Now, this current product lives within an ecosystem of the entire vehicle. The vehicle has what we call ECUs in it. Those are electrical compute units. So imagine you're essentially driving a miniature cloud. Each of the electrical compute units in your vehicle has one function or another.
IVI, or in-vehicle infotainment, is exactly what it sounds like. It not only gives you information, it also gives you your media and your entertainment. You'll also find that in this generation and next-generation vehicles, we'll give you your weather, too. We'll give you your news. We'll give you airline status. We'll let you know how your smart home is doing before you even get there. We'll anticipate whether or not you're on your way home and whether or not you've got a preference set, so that we can increase your temperature by 10 degrees, 15 degrees before you get there.
There are a couple of things that I want to talk about: specifically context switching in infotainment and at Jaguar Land Rover, some penalties that are associated with the different types of context switches, and I've got an analysis of the interactions of the software activities between software specialties. Now, I know that's a mouthful, but what I like to do is analyze why our throughput maybe isn't where it needs to be.
Before we do that, I joined a software startup about 10 years ago. Like many software startups, they were just starting their software product. I happened to be right out of college, and I was the only software personnel that they hired.
Unfortunately, what that means is every part of the software development life cycle, I was it. One benefit of operating this way is that I could operate with full autonomy.
Now, I visualize the software development life cycle in terms of a factory. What you'll notice here on the left is you'll see your inputs. You've got your planning, where your project managers would typically schedule and do your prioritization. You've got your backlog, where your change requests and your features come in, and then you have flow control, whether or not to release WIP or release work into your factory.
On the right side, I look at it like factory stations. You guys are very familiar with these stations: requirements, design, develop, validate, deploy, and ops. So if you can envision all of the software activities that are done, all of the specialties, they probably grab a seat at one of these stations, to one or more of these stations, to output a software product at the end.
So as I'm sitting here in the startup and I'm the only one who can interact or who actually runs the entire factory, one thing I didn't have was the ability to plan on what I worked on. I may come in at the beginning of the day and think that today was going to be an ops day. I may get, after an hour, customer needs and wants, and I need to do requirements authoring. I may have remembered that I'm about 75% of the way done with a bug fix, and I realize that that's higher priority, and I switch to that. I may have some code that wasn't published from last week that I know is ready for release. This was back in the days of manual deployment, and so I actually need to do some deploy work.
And if I'm unlucky enough, since I'm terrible at design, I was asked to do some HMI design work maybe within the next hour. So unfortunately, every day was a different station for me, and I was the bottleneck at every point of the factory.
Now fast-forward to JLR, JLR infotainment specifically. We've got a lot more people. This isn't a representation of how many people we have. This is just a representation that more people sit at each of these individual stations. Now, these people could either just contribute at their own station. They could be proficient enough to review other people's work. They could potentially be proficient enough at going to another station. But typically, more people will allow you to scale.
Now the idea of context switch. Imagine we're all CPUs. We probably all understand what a CPU is, what a processor is. If we're working on a current set of instructions and another higher-priority set of instructions comes into the queue, we need to save the current state of our instructions, address the higher-priority set of instructions, finish it, take it off the stack, resume the state of the lower priority, and now finish executing that.
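The CPU analogy above can be sketched as a tiny simulation. This is a minimal illustration and not something shown in the talk: the `Task` fields, the `simulate` function, and the rule that a larger `priority` number means more urgent are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int        # larger number = more urgent (assumption of this sketch)
    remaining: int       # ticks of work still needed
    started: bool = False


def simulate(arrivals):
    """arrivals: list of (tick, Task). Returns a log of what the single worker did."""
    pending = sorted(arrivals, key=lambda a: a[0])
    ready = []           # heap of (-priority, sequence, task): saved or waiting work
    seq = 0
    current = None
    log = []
    tick = 0
    while pending or ready or current is not None:
        # Queue everything that arrives at this tick.
        while pending and pending[0][0] <= tick:
            _, task = pending.pop(0)
            heapq.heappush(ready, (-task.priority, seq, task))
            seq += 1
        # Preempt: save the current task if something more urgent is waiting.
        if current is not None and ready and -ready[0][0] > current.priority:
            log.append(f"save {current.name}")
            heapq.heappush(ready, (-current.priority, seq, current))
            seq += 1
            current = None
        # Pick up the most urgent task if idle.
        if current is None and ready:
            _, _, current = heapq.heappop(ready)
            log.append(("resume " if current.started else "start ") + current.name)
            current.started = True
        # Do one tick of work on whatever is current.
        if current is not None:
            current.remaining -= 1
            if current.remaining == 0:
                log.append(f"finish {current.name}")
                current = None
        tick += 1
    return log


log = simulate([
    (0, Task("bug fix", priority=1, remaining=3)),
    (1, Task("customer request", priority=5, remaining=2)),
])
for entry in log:
    print(entry)
```

Each "save" followed later by a "resume" in the log is one context switch: the lower-priority work is parked, the urgent work runs to completion, and only then is the parked work picked back up.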
Humans do the same thing. If I'm sitting at the development station or I'm working against the development station, and I've been asked to work on task number two even though I'm in the middle of task number one, if it's the same project, I'm going to consider it a lower penalty. If you look on the right side, I've got a barometer of penalty, if you will.
The next stage up in penalties is if I ask you to work on a different project. Happens to be the same station, happens to be the same type of work, but it's a different project. Now I need to save the state of my previous task and previous project and ramp up on the next one to finish that one if I'm told that it's a higher priority and that's what I need to be working on right away. That's a little bit higher on the penalty scale.
The next one is a different station. If I ask you to work at a different station, or to do a completely different work type or task type, but I keep you on the same project, I'm hoping you'll be a little bit more familiar because you know the project, but this is a completely different type of work. This is a higher penalty on my barometer.
The next one is switching, obviously, both things. We're switching project and we're switching task type.
Now, the last one I have here is if you switch station, which is your task type, project, tool set, maybe computer that you're using, maybe operating system that you're familiar with, et cetera. There are many other variables. You could even be asked to go to a separate building. If you're completely changing your environment and your variables, this is very taxing on the human mind. I'm sure we've all dealt with this at one point in time or another.
But a CPU just has to worry about addition. A human has to worry about all of these other variables. Going from one to six here is almost like asking a CPU to cook me breakfast tomorrow. It has no idea how to do something like that, but at the same time, it's higher priority and I need you to address it right away.
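The five kinds of switch described above can be written down as a small classifier. This is only a sketch under assumptions: the `Context` fields, the `switch_penalty` function, and the 1 to 5 ranks are illustrative and are not the numbering on the slide. The talk does not say where an environment-only change would sit, so the sketch counts the environment only when project and station change too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    project: str
    station: str       # e.g. "develop", "design", "ops"
    environment: str   # tool set, machine, OS, building, lumped together here


def switch_penalty(before: Context, after: Context) -> int:
    """Return an ordinal rank (1 = cheapest, 5 = most taxing) for a task switch."""
    project_changed = before.project != after.project
    station_changed = before.station != after.station
    env_changed = before.environment != after.environment
    if project_changed and station_changed and env_changed:
        return 5   # everything changes at once
    if project_changed and station_changed:
        return 4   # new project and new type of work
    if station_changed:
        return 3   # same project, different type of work
    if project_changed:
        return 2   # same type of work, different project
    return 1       # another task on the same project at the same station


coding = Context("infotainment", "develop", "linux-laptop")
other_project = Context("telematics", "develop", "linux-laptop")
design_work = Context("infotainment", "design", "linux-laptop")
everything = Context("telematics", "ops", "windows-lab-machine")

print(switch_penalty(coding, other_project))
print(switch_penalty(coding, design_work))
print(switch_penalty(coding, everything))
```

The project names and environments are made up for the example; the point is only that the rank rises as more of the context changes.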
So we had a couple of questions based off of those findings on our penalties. Should we eliminate all higher-penalty context switches? The answer is it depends. We found that if you can sustain the capacity and remain efficient in the different specialties, then you can avoid these higher-penalty context switches by having capacity in those specializations.
My favorite question: should we train cross-functional teams or train cross-functional people? And the difference between those is somebody who could work and be proficient at multiple stations, or a team that is capable of sharing empathy, that each one of them can be specialized at their own station. Which one is more of a worthwhile investment?
Are some stations and roles easier to switch to than others? This one piques my interest as well. Do some roles understand other roles better than others?
Here in infotainment, these are the specialties or these are the roles that contribute in our software factory. You'll probably recognize the majority of these because they match typically in the industry.
I'm going to walk you through some of the stations, or some of the value contribution areas within our factory, in terms of the specialties. Now, what you'll see here on the left side are some red and green arrows. I went around and I asked my coworkers, and I asked other people in the software industry, the question that's defining those arrows. And I call those arrows actually empathy and proficiency arrows.
Out of all the product owners that you know, on average, could they step in and be proficient at being a customer for that product? Out of all of the project managers you know, on average, could they step in and be proficient at being a customer?
Now, I know that's a complicated question. However, the green arrow symbolizes a yes. The red arrow symbolizes a no. We found that the relationship in this case is a highly empathetic relationship towards the customer.
Now, these are the primary actors that exist specifically within our flow control station. We're trying to determine whether or not WIP should be released into our factory. Again, I'm not saying these specialties can't do each other's jobs. I'm just saying, on average, these are the results.
Typically, what happens in this particular station are micro-consulting engagements. That's what I call them. And that's where we're actually interrupting all of those other specialties to determine whether or not we should release WIP. All of those interruptions are all context switches on their own as well.
What's interesting is if I'm sitting at my desk and I'm absolutely crushing out code, and I've got my headphones on and I'm completely in the zone, and somebody walks over to my desk and does the thing against the screen, they're automatically assuming that what they have to talk about is of higher priority than what I'm currently working on.
I don't think that's fair. In fact, I think they should rank whatever they're about ready to talk about. In the CPU's case, all they have to worry about is addition and this queue line. And I kid you not, within the last two weeks, I actually had a queue line at my desk full of people who were going to interrupt me.
That prioritization has an equivalent. Had the CEO been in the queue line, I would imagine I'd treat it like a CPU treats a real-time thread. You can come to the front. What do you have to say?
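The idea of ranking an interruption before making it can be sketched as a priority queue. This is a hypothetical illustration: the `InterruptQueue` class, the rank numbers, and the `REAL_TIME` constant are assumptions for the example, with lower numbers served first.

```python
import heapq
import itertools

REAL_TIME = 0   # lower number = served first; 0 is reserved for "come to the front"


class InterruptQueue:
    def __init__(self):
        self._heap = []
        self._order = itertools.count()   # tie-breaker keeps arrival order within a rank

    def join(self, who, topic, rank):
        """Whoever wants to interrupt states a rank for their topic up front."""
        heapq.heappush(self._heap, (rank, next(self._order), who, topic))

    def next_interrupt(self, current_task_rank):
        """Pop the top request only if it outranks the task in progress."""
        if self._heap and self._heap[0][0] < current_task_rank:
            _, _, who, topic = heapq.heappop(self._heap)
            return who, topic
        return None   # nothing in the line justifies a context switch right now


queue = InterruptQueue()
queue.join("colleague", "code review", rank=3)
queue.join("CEO", "launch date", rank=REAL_TIME)
queue.join("tester", "flaky test", rank=3)

# I'm working on a rank-2 task: only the real-time request gets through.
print(queue.next_interrupt(current_task_rank=2))
print(queue.next_interrupt(current_task_rank=2))
```

The second call returns `None` because the remaining requests are ranked below the work in progress; they wait until the current task is done or something less important is on the desk.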
The next station is the requirements station. The same relationship exists. One interesting thing I found is that customers, on average, aren't good at writing requirements or specifications or writing exactly what they want. They're very good at talking about needs, talking about wants, talking about vision. But typically, when it comes to handing over to a developer, most of the time I found it's not enough.
They have the same sort of micro-consulting engagements that the previous station did. Again, all interruptions to ensure that the requirements being written are not going to be impeded further on downstream, essentially.
The next one is an interesting one. This is design. And design to me can be in two different categories. Design is super overloaded. It could be architecture, it could be HMI design. But we have double red arrows here.
And again, I'm just going to ask the same question that I asked my coworkers. Out of all the architects you know, on average, would they be proficient at being an HMI designer? The answer was no. The reverse relationship exists as well, and the same thing exists with the customer.
What this actually can show is there are some automatic friction points that exist between these specialties. This could also show you that maybe we should spend some time to make them a little bit frictionless, or maybe we should spend the time developing a relationship that maybe doesn't have to do specifically with the product or the project, but maybe the people in general, so they understand where each other are coming from more.
Typically, there are now validation engagements that happen that are also interruptions. I find it interesting: one of the UX designers typically has a trail-off based on how much effort they plan to put into a product. When they're finished with their wireframe, or they're finished with the majority of iterations and think that the design has matured far enough, they reserve their remaining capacity for these interruptions. They're adding it into their workflow, which I thought was really smart.
The same consulting engagements exist further on downstream.
This one I think is going to resonate with all of us here. In the develop station, if we ask ourselves the same question: out of all the developers you know, on average, could they fulfill a QA engineer role and be proficient at it? Again, I want to separate from want here. A lot of people in these specialties don't necessarily want to be specialized in one of these areas. However, they could.
And we get double greens here across all three of these. I think this is one reason DevOps is valuable: all three of these specialties understand each other's risk, understand where each other's coming from, understand what they could do to help the other person complete their task.
Validation engagements exist. We've migrated from design or from theoretical, now we're at implementation. Most of these engagements are, "Hey, I went to build the thing you told me to build, or the thing that you wrote out, and it's not going to work for me. It's definitely not working out." Right?
Here's an example of how we exploit the double green. All of our build machine infrastructure is done in Docker containers. It's all defined within Packer. So if one of our developers who is contributing towards a product has some sort of library or some sort of component they need the build machine to have, they can go right to a Packer recipe, create a branch completely in isolation, make their new set of infrastructure, and point their product to build with that new set of infrastructure, without bothering or disrupting anyone else in the entire workflow.
So the ops person has enabled self-service for the developer to completely work on their own, test whatever they need to do. "I've got this new version of Python I need to put on the build machines."
"Okay, there's the Packer repository. Go ahead and do it." Right?
We have CI on that Packer repository. We get an automatic deployment to a Docker registry. That Docker registry is pointed to by the product.
Another way we exploit a double green arrow is we have automated test feedback. You guys are very familiar with this, with CI/CD pipelines. We can now put in test cases into an automated pipeline so that developers can get the results back quicker.
Validation and deploy stations, the same type of relationship. However, your primary actor is typically the QA engineer. There are validation engagements that also exist when you're in the QA phase. Sometimes the validation engagement could be, should we ship this or not? Should we disable this maybe in production before we actually let it out?
One thing that's unique about developing for an embedded device is we can actually put it into a production representative vehicle without officially saying that we've released things. It's very difficult for us to compare to the web world, because in the web world, we can release everything out to millions of customers at scale very quickly.
For us, we contribute toward an embedded device or an OS that runs on an embedded device, and there's a point in time at which we bless that for a release.
One way that we exploit this, specifically for the validation and deploy stations, is to virtualize and simulate the production environments so that we don't have to use hardware. One of the challenges with hardware is it typically doesn't scale, or by the time you've scaled it for what your team demands, it's already outdated.
Here's the ops station. The only surprise here for me is actually the architect. Most of the average architects we found could be proficient at an ops role. Again, it's not necessarily whether they want to be, but they could be.
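The empathy and proficiency arrows can be thought of as a small directed graph. The sketch below records only the arrows stated explicitly in the talk (architect and HMI designer, the developer/QA/ops trio, and architect to ops); the role labels, the `arrows` dictionary, and the helper functions are assumptions made for illustration.

```python
from itertools import permutations

GREEN, RED = True, False

# (from_role, to_role) -> could the average person in from_role be proficient in to_role?
arrows = {
    ("architect", "hmi designer"): RED,
    ("hmi designer", "architect"): RED,
    ("architect", "ops"): GREEN,
}
# Developer, QA engineer and ops were described as double green across all three.
for a, b in permutations(["developer", "qa engineer", "ops"], 2):
    arrows[(a, b)] = GREEN


def double_green(a, b):
    """True only if an arrow is recorded as green in both directions."""
    return arrows.get((a, b)) is GREEN and arrows.get((b, a)) is GREEN


def friction_pairs():
    """Pairs where neither side could step into the other's role."""
    return sorted({tuple(sorted(pair)) for pair, colour in arrows.items()
                   if colour is RED and arrows.get(pair[::-1]) is RED})


print(double_green("developer", "ops"))
print(double_green("architect", "ops"))   # only one direction was reported
print(friction_pairs())
```

Double-green pairs are candidates for self-service and shared tooling, while the pairs returned by `friction_pairs` are where relationship building is likely to pay off.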
So here are the lessons that we've actually learned. We found that if context switches are inevitable, maybe we should factor them into capacity. This is actually really hard for me to swallow. If it's unplanned work and it happens so frequently, it's now become planned work. That's a very hard one for me to deal with.
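Factoring interruptions into capacity comes down to simple arithmetic. The function below is a rough sketch, not a model from the talk: `plannable_hours`, its parameters, and every number in the example are hypothetical, including the assumption that each interruption costs a fixed recovery time on top of the interruption itself.

```python
def plannable_hours(hours_per_week, interruptions_per_week,
                    minutes_per_interruption, recovery_minutes):
    """Hours left for planned work once interruptions are treated as planned work."""
    lost_hours = interruptions_per_week * (minutes_per_interruption + recovery_minutes) / 60
    return max(hours_per_week - lost_hours, 0.0)


# Hypothetical week: 40 hours, 20 interruptions of 10 minutes each,
# plus 15 minutes to get back into the original task every time.
remaining = plannable_hours(40, 20, 10, 15)
print(f"{remaining:.1f} hours available for planned work")
```

With these made-up inputs, a little over eight hours of the week goes to interruptions and recovery, which is capacity a schedule should not promise to anything else.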
The capacity at which each of your stations is staffed depends on the project maturity. You may find that at the beginning of a project, you've got significantly more architects and people doing requirements at those stations than you do at stations further downstream.
We found that some roles are in a perpetual state of interruption. It's always some sort of higher priority that you must be working on, but it never actually ends. This is a very challenging problem for us to solve when we have a due date that we need to deliver vehicles to customers with.
We found that empathy increases if a close team member has an impediment that you could fix. If the person right next to you is struggling because of something that you could actually take care of yourself, you're more likely to take care of that problem when you're next to each other, when you're a cross-functional team.
We typically found that teams that are cross-functional are more fruitful when they're all in the same location, or they at least all bond with each other on a regular basis. I think that stuff's pretty obvious.
We found that using the same or closely coupled tool set will create less friction in the more expensive switches. This means if I'm going to a different task that's a different product, or if I'm going to a different task that's potentially a different station, if it is in the same tool set and I'm very comfortable, then it's easier and has less friction for me to do that context switch.
This is where tools like Tasktop are extremely helpful, because you can replicate an entire ALM tool in another ALM tool so that nobody has to go out of their comfort zone. This can help throughput.
We found that if one of the other stations or one of the other context switches that you're doing has a significant number of manual tasks, it ends up becoming very draining and adds more friction to whether or not you should switch to it.
A culture of mentoring and training typically increases throughput. This one's kind of obvious. From a brain surgeon perspective, I'm pretty sure after an entire eight years, 12 years of education, they don't just walk in and start doing surgery on brains. They probably watched a few.
I find it's very unhealthy if a department doesn't factor training or mentoring into its capacity. I think this is really, really important. I think the only way you're going to train the next generation that will run your company is if you actually treat training as just as important as everything else.
We still need your help. We still have a ton of problems. I know all of these particular positions are needed heavily in the UK. That's where I live now, in England. I know that Portland, Oregon, is where JLR has a research and development campus. I know they're in heavy demand for these positions as well.
So I want to thank everyone for their time, thank Gene Kim, and thank the conference for inviting me to talk. Have a good rest of your conference. Thanks.