Europe 2022

If Development is a Game, How Do I Win?

Rob Moffat, author of Risk-First Software Development and the website riskfirst.org looks at how risk has a part to play in every aspect of our lives, even software development! Given that this is the case, what can we take from game theory, gambling and chance to improve our chances of winning at software development?

Full transcript

Rob Moffat

Hi, my name is Rob Moffat. A bit of background about me. I've worked in software development for many years, building risk and finance systems in banks mainly. I've got an MBA and a degree in computer science, and I've moved around a fair bit within my industry. Most recently, I'm the senior technical architect at FINOS, the Fintech Open Source Foundation. And I like to write and journal about what I've been doing. I'm really interested in the ways in which software projects go wrong and how companies make mistakes in IT. So in about 2018, I was working at HSBC and I kind of had an aha moment, which led me to writing both a website, riskfirst.org, and a book based on the website, "Risk-First Software Development." Selling books isn't very profitable, but it's been a really valuable and useful experience nonetheless, and it's really helped validate that aha idea and led to some interesting things happening.

So I'm going to split this into two main parts. First, I'm going to assert the claim that all work is risk management, which is closely tied into what Risk-First is all about. I've no idea whether I'll be able to persuade you of that, but I'm going to give it a go. And then we'll tackle if dev is a game, how do I win? I'm going to apply those risk management techniques to what we do as developers on a day-to-day basis.

So today, I'm going to defend this statement. All work is risk management. And this is one of the founding philosophies of Risk-First. So for the next 15 minutes or so, I'm going to try and persuade you that this is the case. And while I'm making that case, maybe try and think up some counter examples, bits of work that aren't risk management, and then we can discuss them. So my approach is to do this in three steps.

First, I'm going to make an appeal to authority, and then I'm going to talk about how the UK as a whole does risk management. And then I'll bring it back to talk about the tickets on your backlog and how they're actually just risk management in disguise. A lot of people think risk management is really boring or really technical. I want to get away from that. After all, if all work is risk management, then we're doing it anyway, and it can't be that technical.

So the quote above is actually the first sentence from chapter one of a fairly famous book on software development written in 2000. It's this one, "Extreme Programming Explained" by Kent Beck. And this is the first book on Agile that I read. And it seemed like a breath of fresh air compared to what I'd learnt at university about software development, which was all Rational, waterfall, iterative development, structured programming, and so on.

And Kent Beck had a bunch of ideas in extreme programming that really deviated from the accepted norms of software development, like pair programming. And this is a picture of some guys doing pair programming. Some people love it. A lot of developers grew to hate extreme programming because of this. They didn't want to share a keyboard and mouse with someone and work together.

But what is the point of pair programming? What Kent is trying to avoid by recommending this is key person risk. That is having individuals on a team who are the only people who know about a thing. And if they leave or go on holiday, your project is exposed. That's the idea anyway. Pair programming is actually a risk management technique, specifically trying to address key person risk.

So Kent Beck also invented or co-invented JUnit, which is a library for building unit tests. And as a Java developer, I use this all the time. And if you're used to Ruby, you'll probably be familiar with MiniTest or Jest if you're a JavaScript programmer. They're just basically versions of JUnit for those different languages. And I actually can't imagine coding now without building tests as I go and having tools to at least understand my coverage. It just seems so helpful to have this.

So why do we write tests? Why are unit tests such an integral part of extreme programming and Agile? And I'd say again, they're managing risk.

So what risks does unit testing manage? So it manages a couple of risks. So unit tests are automated, so they can be run all the time. And this reduces the amount of manual testing you need to do. And too much manual testing is a risk because, well, it takes time and the feedback's slow, but you have to rely on people to do it the same every single time. And then also, there's a regression risk. Our code is going to change in the future. Are we going to break the functionality of our existing code when we change it? By having unit tests, we have some more certainty that when we do change it, we haven't broken the features or the functionality that already exists. So the risks that we are managing with unit testing are on the left of the diagram.
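As a minimal sketch of the regression idea described above (added for illustration, not taken from the talk): the function `monthly_payment` and its tests are hypothetical, and Python's standard `unittest` module stands in for JUnit. Once the tests exist, they can be re-run automatically after every change instead of relying on someone to repeat a manual check.

```python
import unittest


def monthly_payment(principal: float, months: int) -> float:
    """Hypothetical function under test: split a principal into equal payments."""
    if months <= 0:
        raise ValueError("months must be positive")
    return round(principal / months, 2)


class MonthlyPaymentTest(unittest.TestCase):
    """Pins down today's behaviour so a future change cannot silently break it."""

    def test_splits_evenly(self):
        self.assertEqual(monthly_payment(1200.0, 12), 100.0)

    def test_rejects_zero_months(self):
        with self.assertRaises(ValueError):
            monthly_payment(1200.0, 0)


def run_tests() -> bool:
    """Run the suite programmatically and report whether everything passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MonthlyPaymentTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

The trade-off from the talk is visible even here: the two tests are extra code to own and maintain, which is the price paid for reducing the regression risk.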

What's the downside of unit testing? That's on the right here. In order to address the risks on the left, I have to own some extra code in my code base. So that's a complexity risk. The code base is more complex. And building those tests and maintaining them, that's going to take up some of my schedule. So there's a risk to the schedule in writing tests.

Now, if you're good at unit testing, this is a great deal. And what I mean by a great deal is that the benefit, those upsides, outweighs the cost, the downsides. The trick is to write just enough tests to address the risks on the left. But if you go crazy and you end up turning it into an industry, that blows up those risks on the right. So unit testing is a trade-off. Being good at unit testing means being good at controlling the risks on both sides of this slide. And so this slide gives you a clue as to what Risk-First is about. It's about exploring and categorizing the risks that affect software projects and pointing out those trade-offs.

So let's move on from extreme programming for a minute. My background is in finance. When I worked at RBS in credit risk, I was working on finance regulations, and one of the most important ones or famous ones is the Basel Accord. And so among other things, it defined three types of banking risk, and it demands that banks measure and report them. So, okay, market risk, that's the risk that financial products you own, like foreign currency, will change in value. So if you bought some dollars wanting to go on holiday and the exchange rate changes, then the dollars might be worth less by the time you actually go. Credit risk is the risk that people who owe you money won't pay it. And finally, operational risks are things that affect the operation of your business, like software systems going down, data being lost, and numbers not adding up. A good way to think about operational risk is to think about the whole firm as a machine. Operational risks are ways in which that machine breaks, and that can result in financial losses or reputation damage. So what the Basel Accord did was force banks to start thinking seriously about measuring and reporting in terms of what are the risks to our business and how are we managing them. And they were able to do a divide and conquer approach. Different departments dealt with different risks.

So banks do risk management seriously now, but also this is done at a national level. Even the UK government is talking about risk management. So this here is the cover of the 2020 UK National Risk Register. So what sort of things do you think the UK government worries about? Well, here are a few of them from the risk register. I'm not going to go through all of these, but it's the kind of disasters you'd expect, right? Storms, heat waves, cyber attacks, and then they have a go at estimating the likelihood of those risks and the costs, where the cost is not just financial, but also in terms of lives. For example, major fires, number 36, this report gives it a probability of between 5 and 25 in 500 of happening each year. And they go on to estimate the costs of that in terms of lives and money.
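To make the "between 5 and 25 in 500" figure concrete, here is a small worked calculation (an illustrative sketch, not part of the talk or the register). The function names are hypothetical, and the multi-year figure assumes each year is independent, which is a simplification.

```python
from fractions import Fraction


def annual_probability(chances: int, out_of: int) -> Fraction:
    """Turn 'N in M' odds into an exact yearly probability."""
    return Fraction(chances, out_of)


def chance_within(years: int, p: Fraction) -> float:
    """Probability of at least one occurrence in `years`.

    Assumes every year is independent with the same probability `p`.
    """
    return float(1 - (1 - p) ** years)


low = annual_probability(5, 500)    # 1/100, i.e. 1% per year
high = annual_probability(25, 500)  # 1/20, i.e. 5% per year
```

Under that independence assumption, `chance_within(10, high)` comes out at roughly 40%, which shows how a risk that looks small in any one year becomes hard to ignore over a decade.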

So my thinking is why can't we do the same thing for software projects? Just as the UK government and banking apply a divide and conquer approach to managing project risks, can't we do the same? So here I set about breaking down the various risks we face on software projects. I've got a few of them on this slide. So first up, communication risk. We see communication risk between people and between systems. Important messages go missing or are misunderstood or ignored. Complexity risk. The more code we own, the more complicated the systems we build, the more likely they are to go wrong or contain bugs. And that's true of software systems and any process just involving people. Coordination risk. When you've got a lot of people working together or processes working together, how do you coordinate them so they don't step on each other's toes? Like maybe two sets of people working on the same file at the same time. And we face this problem with our systems when we have multiple processes trying to write to the same data. And operational risk. We've already talked about this with the banks, but it's a really important problem for all IT. What happens when the servers go down? Maybe a form stops working on your website. Even people being ill or off work can turn into operational risks. How do we catch those and deal with them quickly?

So there are more than just four of these, and we'll see some more as I go along here. And in the book and on the website, there's a chapter on each and what you can do to address them. So the big question is, is that useful?

So what I'm creating here is a pattern language to talk about the problems we face on software projects. And in a way, all of the risks the UK government risk register talks about, they're also patterns. And patterns are kind of a way that we talk about stuff generally. The first book about patterns in terms of software that I read was this one, "Design Patterns: Elements of Reusable Object-Oriented Software," which showed you a load of reusable patterns that you could employ to make your software more extensible. And this was a huge influence in the software world. Things like the decorator pattern and the abstract factory, they're coming straight out of this book. Many of you are going to be familiar with those.

If you've not heard of design patterns, then here's another pattern language. TV Tropes, that's a website that talks about patterns in narrative. So here's an example. A pattern called "Or was it just a dream?" I think everyone's familiar with this tired old trope in TV programs, where at the end of the episode, they say, "Or was it just a dream?" This is a really cool site. This is well worth a look. There's lots of completely obsessive people keeping this up to date, and it's a total black hole in terms of your time, but it's very fun to read.

So at this point, I've made the case that some of the practices in extreme programming are about managing risk. And I've talked about the idea that there's risk management in banking, and the UK itself does risk management, and I've kind of talked about risk management in software development, but I've not really delivered on the original slide's idea that all work is risk management. But I'm ready to give that a go now.

So let's say this piece of text above is a task in your backlog, or a Jira item or a GitHub issue, whatever you use to track those things. Let's break it down in terms of risk. So Debbie needs to visit the client and get them to choose the logo to use on the product, otherwise we can't size the screen areas exactly. What are the risks here?

So first, we can see that there's a dependency. Doing one thing depends on something else. Sizing the screen can't be done till we have the logo. Second, there's a coordination risk here. Who does Debbie need to see? How easy will it be to get them both together? Are they going to have a profitable meeting and come to a conclusion or not? And third, there's a staff risk. Who the hell is Debbie, and why does it have to be her? What if she's sick or leaves the company or something? A staff risk is kind of a dependency risk, but it's a dependency on someone rather than, say, a piece of software or a server or something. Finally, perhaps this is implicit, but if you can't size the screen areas, does that hold up the whole project? Are we going to have issues delivering to clients what they want when they want it? In Risk-First terms, this is a feature risk, the risk of features that the client wants not being available, or that the features that are available are not the ones that the client really values. So I can break this item of work down, and I can look at the risks it addresses and how it addresses them.

Now, the thing is, at some level, we kind of do consider all these risks when we're working on a project, but it tends to be more implicit. So here's a GitHub project that I'm working on at the moment. It's basically a Kanban-type board, although you might be more familiar with backlogs or scrums. Essentially, when we choose what to work on, what we're doing in this board is risk management. We either pick the tasks which will knock out that biggest risk, like maybe fixing security breaches or investigating crashes, or we pick tasks that give us the most bang for the buck. That is, you do a little bit of work and you get a big payoff, like good unit testing where you're reducing the overall level of risk on the project. So just looking at the list above, I can see a few items to do with testing there. They're improving the quality of the product, hopefully reducing those operational risks we talked about. And there's a few to do with making sure we have the right features. So that's addressing those feature risks. We want people who use this to find the functionality they need. And there's a tutorial over here in the done column, right? Hopefully, that's addressing a communication risk, making sure that people understand the project and can use it.

So at this point, you might be thinking, "Well, how many of these risks are there? TV Tropes, that has literally thousands of different tropes. Is it the same for software projects?" Luckily, the answer's no. On the Risk-First website, I break it down into about 50, and some of those are kind of just more specific versions of others. So as this slide shows, there are many different types of dependency risks, such as staff risk or software dependency risk. And I run through these on the website, but we're not going to go into detail on all this today. And so this classification kind of turns out to be about the same sort of size as that UK National Risk Register.

So a good next question is: how does this help? Does thinking about what we have to do in terms of risk make it easier to do the right thing in software development? And this is a big question, and I'm really going to answer it over the entire rest of this talk.

And so Nassim Taleb, who wrote "The Black Swan," said this: "We're good at risk management. We survived 200,000 years as humans. Don't you think there's a good reason why we survived? We're good at risk management." Now, 200,000 years ago, those risks might have been, what if a wolf comes and eats our sheep? Okay. Well, a good risk management strategy might be, let's have a shepherd. A shepherd is risk management. What if we have a particularly long winter? Okay. Well, we should store some grain in a silo or pickle things or store apples somewhere. Storing food is risk management. In fact, agriculture generally is risk management. It's managing the risk of not finding the food just lying around when you need it. The risks were different back then, but it was still a case of preparing for what was going to bite you, sometimes literally. There's lots of things our brains are bad at. We're not good at probability, which we're going to look at in a bit. We've got this weird random dopamine reward system, which means we get addicted to gambling and social media. We're far from perfect. But we're good enough at this. This is one of our skills. Without realizing it, we put it to work when we build software because we're always thinking, "What if the user does this? What if this thing stops working? What about this single point of failure?" And when we build good software, it's because we've done good risk management.

So is all work risk management? Is this correct? Maybe you're not convinced. I don't know. Perhaps check out the riskfirst.org website if you need a bit more convincing. So the next stop is the title of the talk.

If development is a game, how do I win? What I'm going to do here is build on the idea of work being risk management by looking at risk management in games and then applying that to development.

Let's start with gambling games with the caveat that I'm neither a poker player nor a professional gambler. So a naive strategy for, say, betting on a horse race is to just try and pick the winner. The problem with this strategy is that the most likely horse to win is the one with the shortest odds. And that means if you win, you don't win much money. So let me give you a quick example. So let's say there's a horse race with these three famous fictional horses in it. Let's say horse A has a 50% chance of winning, right? You might have odds of three to two, which means you put £2 on the bet and you get £3 back. And clearly, if you bet on this horse over and over and over in lots of successive races, you'll eventually lose all your money because the amount you get back is less than the chance of winning justifies. In fact, that's true of any of these horses, right? The odds always pay back worse than the probabilities. And if these probabilities are correct, you'll always end up losing. So picking the horse you like best just isn't going to work. In order to win money at horse racing, you have to have an edge. And I'm going to build up to describe what that means.
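The horse A arithmetic can be worked through as follows (a sketch added for illustration, not from the talk). It assumes that "get three back" means a total return of £3 including the £2 stake; the function names are hypothetical.

```python
def expected_profit(stake: float, total_return: float, win_probability: float) -> float:
    """Average profit per bet: what comes back on average, minus what was staked."""
    return win_probability * total_return - stake


def break_even_probability(stake: float, total_return: float) -> float:
    """Win probability at which the bet neither makes nor loses money on average."""
    return stake / total_return


# Horse A under the stated assumption: stake 2, total return 3, 50% chance.
horse_a = expected_profit(stake=2.0, total_return=3.0, win_probability=0.5)
# horse_a is -0.5, so each bet loses 50p on average.
```

On these numbers the bet only breaks even if the horse wins two times in three. Having an edge, in this framing, means having good reason to believe the true chance is higher than the break-even probability the payout implies.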

So before we understand how to get the edge, let's talk about internal models. Now, this is a really helpful concept in risk generally and in Risk-First. The idea is the internal model is what you know about the world or what you think you know about the world. An internal model could be the information you have in your head, or it could be data stored on a computer. There's no guarantee that it's correct.

Now, why is this an important concept? How many times have you been in a meeting and had an argument about either what the team should be doing or the approach a team should take to do something? I mean, personally, quite a lot, right? And usually the reason for the argument is not that you have a different idea of the goal of the team. That's usually quite clear. And it's not usually that some person is trying to subvert the team and cause it to fail. They probably want it to succeed as much as you do. And often, the people involved in the argument have access to pretty much the same facts about the situation that I do. So why are we arguing? The reason we're arguing is that our models of the situation are telling us to do different things.

Right. And so chess is a great example of the power of an internal model, right? If I was to play chess against a grandmaster, we both have perfect information about what's going on in the game. It's a perfect information game. The board contains everything we need to know, and there's nothing random that affects it. However, the grandmaster's internal model is much better developed than mine, and they'll see the risks to all of their pieces much more clearly than I will. They'll understand all the risks of the moves and make better moves than me. So chess grandmasters spend their lives improving their internal models of chess by studying other games and players throughout history.

And this is Donald Rumsfeld's famous quote, and he's kind of talking about this. He says, "As we know, there are known knowns. There are things we know we know. And there are also known unknowns. That is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know." And he's directly talking here about internal models, but he was the US Secretary of Defense at the time of 9/11 and the war in Afghanistan. So he's really talking about the internal model of the entire US military operation.

So I've tried to break down what he's talking about on this slide. Here are the three categories of things in an internal model that Rumsfeld suggests, plus a fourth one, the things you know but didn't know you knew, right, which I guess also has to exist. And I guess that might include things like the plots of movies that you forgot you saw, or institutionally, this could happen a lot, with a company having records of things and all the staff have long since forgotten about them.

Getting back to the horse racing example, we're dealing there with known unknowns, right? The box on the top right. We know that one of the horses will win, just not which one. And in the bottom right, the unknown unknowns. So in the chess game, these were the ones that were going to get me, the moves I didn't see coming. And these are the things that wreck our ability to estimate in software development. How many times have you said, "Oh, this will be done next Tuesday," but then been blindsided by a completely unpredictable quirk or bug or facet of the problem that just throws the whole thing out? We're going to look a bit more closely at what to do about the unknown unknowns in a bit.

So the key to being a successful gambler, going back to this, is not to try and pick the winner necessarily, but to play the odds. So you need to have a better internal model of what's going to happen than whoever you're going to make the bet with. So that is making your model of known unknowns better than the person you're up against.

And this guy is Joseph Jagger. He made a ton of money in the 1880s out of roulette because he had a better internal model, right? He sent people to casinos to record the results of each spin of the wheel, and he found that some of the wheels were biased, and so certain numbers came up more often than others. And all he had to do was bet on those numbers to win. And for the casinos, the idea that some wheels were biased was an unknown unknown. They had no conception of it. But Jagger turned it into a known unknown in his model, and so he won. And all he had to do to get this improved internal model was to go out and experiment in the real world and record some observations.
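The record-and-compare approach described above can be sketched roughly like this (an illustration only, not a reconstruction of Jagger's actual method). It assumes a single-zero wheel with 37 pockets, and the threshold of three standard deviations is an arbitrary, hypothetical choice.

```python
from collections import Counter
from math import sqrt

POCKETS = 37  # assumption: single-zero wheel, numbers 0..36


def biased_numbers(spins: list[int], z_threshold: float = 3.0) -> list[int]:
    """Return numbers that came up far more often than a fair wheel predicts.

    Each number's count is compared with the count expected from a fair wheel,
    measured in standard deviations of a binomial distribution.
    """
    n = len(spins)
    p = 1 / POCKETS
    expected = n * p
    spread = sqrt(n * p * (1 - p))
    counts = Counter(spins)
    return sorted(
        number
        for number, count in counts.items()
        if (count - expected) / spread > z_threshold
    )


# Hypothetical observations: every number 100 times, plus 60 extra spins of 7.
observed = list(range(POCKETS)) * 100 + [7] * 60
suspects = biased_numbers(observed)
```

With the made-up data above, only the number 7 stands out. The observations are what turn "this wheel might be biased" from an unknown unknown into something that can be measured and bet on.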

So you might be thinking, "Well, that's all very well, isn't it? But it doesn't really apply to software development, does it?" Well, I'm going to try and persuade you that actually it does.

So in order to win at software development, you have to not just consider the game you're playing, but the meta game. For example, the game might be the current task. Maybe the task is add a login box to your website.

So a really naive understanding of this game might be, I'm just going to knock this out as quickly as possible. And in Agile circles, there's an expression, YAGNI, you aren't going to need it. This captures the idea that you shouldn't over-engineer the solution. So this developer has taken this to heart, and he's done the simplest thing that could possibly work.

So at one level, this wins the game. It fulfills the definition of what was asked for. But it loses on the meta game level, the game of games, because the product owner might look at this and say, "Well, actually, that's not what I wanted. Can we have something that just doesn't look completely terrible?" And so second time around, the developer goes away for a whole week, and he comes back with something like this. And it's pulling in style sheets and graphics and an animated background, and it looks amazing, and it fulfills the brief, but it also fails at that meta game level.

So what's wrong with either of these solutions? Put simply, they're just not managing risk that well, right? The product owner is trying to manage the risk on the product. He or she has an internal model of what those risks look like. So the very simple login is going to leave the users feeling that the product was built by a five-year-old, and that's not what you want to convey. And on the other hand, the super awesome login, look, it took ages to build, and it brings in style sheets and images, but it's adding way more complexity than necessary.

So the game is to implement the requirement, but the meta game is to implement the requirements while managing down the risk of the overall product. And this is something everyone needs to consider, right? Now, an interesting thing does happen once you build either of these login pages, and that is your internal model changes. Some things that were unknown unknowns are now known unknowns. So will users be able to reset their password? How will they be able to sign up? Will we be able to store their credentials? Do we have to worry about being hacked and losing people's logins? Do we have to worry about data protection?

There's a whole workflow around logins that you need to consider, emails and resetting passwords and so on. Now, I'm sure a lot of you are thinking, "Yeah, this is obvious stuff. Wouldn't they have thought about all that beforehand?" And I kind of deliberately picked this example because I knew people would be familiar with it. But so many times in software, we have problems where when you first look at it, it's an easy problem, but the more understanding you get, the more intricacies and details there are to consider.

And so at this point, you might also be thinking, "Well, couldn't they just use a third-party security library?" Or, "Shouldn't they use a social login like OAuth 2?" And these are great solutions perhaps. They probably take a lot less time to build, but again, they come with their own risks attached. With either of those, you end up with a dependency on some third-party software. And dependencies, as we've seen, are risks. We don't know whether the software will be supported forever, or have a security breach, or stop working, or they change the terms of use, charging you more money. Using a social login has its own risks, too. You might be excluding people if they don't have an account with Facebook or whichever provider you choose. So again, all of these types of risks that we're seeing on the screen here, they're all broken down and described in more detail on the Risk-First website.

So obviously, right, we know that different solutions have different trade-offs. In a way, whenever we take an action to try and do something on a product, we're kind of making a bet. Just like the horse racing or the roulette example, we want bets where the payoff is worth what you stake. So you might say a bet on OAuth 2 adds to the software dependency risk, but at least it doesn't tie up one developer for a month pushing back my schedule. Things could go south, but this might be the safest bet out of the ones available.

And so this is what I mean about games and meta-games. At the bottom are those games, the short-term goals that you think you should achieve to "win." However, the person running the project should be concerned about the meta-game. That is the game of games, where you're trying to balance all the risks in play and end up with a successful project, low complexity, happy team, happy customers. And to do this, you have to take shrewd, calculated bets.
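One way to picture "the safest bet out of the ones available" is to put a stake and a downside on each option and compare them. The sketch below is purely illustrative: the `Bet` class, the day counts, and the probabilities are all invented for the example and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bet:
    """A piece of work viewed as a bet (all fields are hypothetical estimates)."""

    name: str
    stake_days: float           # time that will certainly be spent
    failure_probability: float  # estimated chance the bet goes south
    failure_cost_days: float    # extra time lost if it does

    def expected_cost_days(self) -> float:
        return self.stake_days + self.failure_probability * self.failure_cost_days


def safest_bet(bets: list[Bet]) -> Bet:
    """Pick the option with the lowest expected cost."""
    return min(bets, key=lambda bet: bet.expected_cost_days())


# Invented numbers, for illustration only.
options = [
    Bet("build our own login", stake_days=20, failure_probability=0.3, failure_cost_days=10),
    Bet("use OAuth 2", stake_days=3, failure_probability=0.2, failure_cost_days=15),
]
choice = safest_bet(options)
```

The numbers come from someone's internal model, so the output is only as good as those estimates. The point of the sketch is the shape of the reasoning: every option has a stake and a downside, and the comparison is made across all of them rather than per task.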

So let's look at some of the things we do on projects and see how they are like bets. So we've all done refactoring before, the idea that we spot some duplication of functionality in our code or a better way to separate concerns, and we implement it, and we make our lives easier.

What's at stake when we do a refactoring? Well, the main thing is time. Usually, what I find is that I have some idea for a refactoring. I start implementing it, and it touches large tracts of the code base, and I have to fix all that up, and then I have to fix a load of unit tests as well. Whereas in return, the payoff is that I reduce the complexity of the overall project. And in some complex projects, keeping complexity under control is a really big deal, and it saves time in the number of bugs and the amount of testing and so on. So it might be a good bet. If I win and the refactoring achieves its goals, it doesn't take too long, that's great. On the other hand, I might lose if it takes ages, and halfway through, I realize some edge case might ruin the whole idea that I've had. So it's a bet, and it's based on my limited internal model.

Adding a new feature to a piece of software is also a bet. Again, it's a step into the unknown. So let's look at what's at stake. So adding a new feature is going to take time, so that's schedule risk on there. It's going to make your product more complicated, so that's complexity risk. And it might make your product harder to understand, so that's like a conceptual integrity risk. And on the other hand, if the features of your product don't fit the requirements of the client, then they may go off and start using a competitor's product. So that's hopefully this feature fit risk that we're reducing on the other side. Will it work? Well, we don't know. It's a bet, and it's based on your imperfect information.

So on the Risk-First website, I have a short section covering a few more examples of things like this, which you can look at. Personally, I feel it's quite liberating to talk about pieces of work, not as issues or tasks, but as either bets or experiments. And the reason this is liberating is that it means that psychologically, you're not making the delivery of a feature or another piece of work part of your own identity, and it shouldn't be. We're all operating in a world of unknown unknowns with imperfect internal models.

So hopefully, we've covered a couple of things at this point, right? I've talked about how all work is actually risk management. And then in this section, I've talked about how to do good work, i.e., how to win at software development. And it basically comes down not to having the best algorithm or the best webpage, but to doing the right pieces of work to manage down the various risks across the project to levels that you can keep under control. And in order to do that, we need to take bets on the work we do, and in order to win those bets, we need to have a good internal model of the risks we face.

But actually, all of this is really what we do in our everyday lives. Well, ideally anyway. So if you know you're at risk from heart disease, you might choose to do lots of cardiovascular exercise or choose the right diet. And if you have a genetic risk of breast cancer or prostate cancer, you should probably choose to get tested for those things regularly. Essentially, what you're doing is you're managing risks in your life. And I think that we can agree that generally when we're healthy, we're able to do better work and look after our families and so on. So doing all of this risk management stuff I've been talking about isn't crazy complicated. It's a lot like living the healthy life, and vice versa. If you want to live a healthy life, you actually need to develop a good internal model of the risks that face you and do work to keep them under control.

And we use terms like, "Oh, this project is in poor health" to cover things like, oh, there are morale problems here, or it's drowning in its own complexity, or the stakeholders aren't engaging, or deadlines are getting missed.

With Risk-First, I'm trying to apply the ideas of risk management in the world of software. People have ported other ideas before, like Lean and Waterfall and stuff, to the world of software, and I feel this is an important idea for our industry. And as we saw with the UK government and banking, it's been important for other industries as well. So the ask is, do you think you can help? Please come and join me. We have a GitHub team. If you want to get involved in any way, please join the team and say hello. I'd be very happy to hear from you.