Monoliths vs Microservices is Missing the Point

Las Vegas 2019

Manuel Pais

Co-Author · Team Topologies

Matthew Skelton

Co-Author · Team Topologies

The debate on monoliths vs microservices as architectural patterns for modern software systems usually focuses on technological aspects, missing crucial details around organizational strategy and team dynamics.

Should we start with a monolith and extract microservices or start with microservices? How many microservices is the right number? These kinds of questions indicate a confusion that is made worse by the perceived need to adopt lots of new technology in order to make microservices work.

The false dichotomy between monoliths and microservices helps no-one. Instead, switched-on organizations start with the team cognitive load required to build and run a part of the software system. If a team is not able to understand fully the details of a service or subsystem, there is little chance of the team being able to own and support it.

The resulting team-sized services are by definition suitable in size and complexity for a single team to own, develop, and run. No longer do we care how many lines of code there are in a single service or whether it is a "monolith": what we care about is that a team can own and run the software effectively.

Using team cognitive load as the guiding principle - assessed by the team via measures such as supportability, deployability, testability, operability, prioritization difficulties and domain complexity - organizations can optimize for sustainable ownership and evolution of software systems.

This talk draws on research and case studies from the book Team Topologies by Matthew Skelton and Manuel Pais (IT Revolution Press, 2019) together with first-hand consulting experience from the authors with organizations around the world.

Full transcript

The complete talk, organized by section.

Matthew Skelton and Manuel Pais

Matthew Skelton

My name is Matthew Skelton. And I'm Manuel Pais. And we're here today to share with you some thoughts about software architecture and how that relates to teams. So the talk today will look a little bit like this, four sections. First, we'll look at monoliths and microservices.

We'll then look at something we've called team cognitive load and how that relates to building software systems. Manuel will then take us through some case studies that apply some of these ideas. And then finally, we'll have a little look at how to get started with some of these ideas in your organization.

There we go. So we're authors of this book, "Team Topologies", published by IT Revolution Press. There are copies available in the book stand. And this evening, we have a book signing. So at 7:15, I think it is, in the Chelsea Theater there, along with all the other authors from IT Revolution who are signing today.

So if you're interested in what you hear today, come and get your free copy signed by us and take it back with you. So we hear quite a lot about monoliths and microservices at the moment in terms of software architectures for cloud native or to enable teams to deliver very rapidly.

But we think this is a bit of a false distinction. Let me try and explain why. It sometimes feels a bit like Street Fighter or Mortal Kombat or something. So over here, we've got someone like Tammer Saleh saying, "Start with monolith and extract microservices." Then on the other side, we've got Stefan Tilkov saying, "Don't start with a monolith when your goal is a microservices architecture." And then we've got someone in the middle who's like a guru, Simon Brown, saying, "Well, if you can't build a monolith, what makes you think microservices are the answer?"

Something is missing here, right? There's an angle on this problem which we're missing. So where should we focus? What should we focus on in order to make this stuff effective? And I think Daniel Terhorst-North puts it very well when he says we should think about software that fits in your head.

Can we understand the software that we're building ourselves? If it doesn't fit in our head, if it's too big for us, we've got a problem. In the context of the talk today and the context of the book that we've written, "Team Topologies", we like to extend out what Daniel has said and say software that fits in our heads when we're working as a team.

Why is this important? Who has a copy of "Accelerate" or will have a copy by this evening? Every hand in this room should be up. Okay? You need to get yourself a copy of the "Accelerate" book. There are four key metrics in the "Accelerate" book that are indicators for high-performing organizations.

Lead time, deployment frequency, mean time to restore, and change fail percentage. I'm not going to go into these now. That's for Nicole and for you to chat to her this evening. However, the problem we have is if software does not fit in our heads, there's a real danger that each one of these four key indicators is going to get worse.

So if the software is too big for our heads, then the lead time, which is depending on how you measure it, the time from starting to work on a new feature to it being in production, there's a danger that will take longer. That will start to extend. There's a danger that the deployment frequency will decrease rather than increase.

We will deploy less frequently if the software is too big for our heads because we won't have the confidence to deploy more frequently. There's a danger that if the software is too big for our heads, then we'll not be able to restore service in production as quickly because it's too complicated, it's too involved.

And likewise, there's a danger that if the software is too big for our heads, that the percentage of deployments that result in failure will increase. We're trying to drive that down. So that's why we think that the framing around monoliths and microservices is the wrong way to look at it.
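As a rough illustration of the four key metrics mentioned here, the sketch below computes them from a list of deployment records. The `Deployment` record, its field names, and the sample data are hypothetical, and lead time is measured from the start of work to production, which is only one of the possible definitions the talk alludes to.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one change reaching production."""
    started_work: datetime                # when work on the change began (assumed definition)
    deployed: datetime                    # when the change reached production
    failed: bool = False                  # did this deployment cause a failure?
    restored: Optional[datetime] = None   # when service was restored, if it failed


def four_key_metrics(deployments, period_days):
    """Compute the four metrics over a period of `period_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [
        (d.deployed - d.started_work).total_seconds() / 3600 for d in deployments
    ]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        (d.restored - d.deployed).total_seconds() / 3600
        for d in failures
        if d.restored is not None
    ]
    return {
        "lead_time_hours": mean(lead_times),
        "deployments_per_day": len(deployments) / period_days,
        "mean_time_to_restore_hours": mean(restore_times) if restore_times else 0.0,
        "change_fail_percentage": 100 * len(failures) / len(deployments),
    }


# Made-up sample data: four deployments in eight days, one of which failed.
t0 = datetime(2019, 10, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=24)),
    Deployment(t0, t0 + timedelta(hours=48), failed=True,
               restored=t0 + timedelta(hours=50)),
    Deployment(t0, t0 + timedelta(hours=24)),
    Deployment(t0, t0 + timedelta(hours=48)),
]
metrics = four_key_metrics(sample, period_days=8)
```

Tracking these per team over time is one way to notice the drift described above, where software that has outgrown the team shows up as longer lead times and less frequent deployments.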

And a useful way to look at it is this phrase that Daniel Terhorst-North comes up with, which is software that should fit in our heads. Software that is too big for our heads works against organizational agility. If you want to go to sleep for the rest of the talk, just take a picture of this slide. That's the only thing you really need to worry about.

And that's a really key thing, right? So the thing I want you to take away from the session today is we do need to think about the size of software that teams, and I'm talking about teams, not individuals, that teams work with because it has a direct impact on organizational agility.

And so how do we approach this? In the book, we talk about team cognitive load. Let me just talk you through a little bit of background first. Cognitive load is a concept that was defined by John Sweller in 1988, and he defined it as the total amount of mental effort being used in the working memory. So when we're building software systems, working with software systems, we've got a lot of stuff in our working memory as we're juggling kind of concepts and trying to put those into code or working out how to shape a data set or whatever.

Cognitive load comes into play a huge amount as we're working with software systems. And there are three kinds of cognitive load. Intrinsic, which is something fundamental to the way we're working or the problem domain. Extraneous, which is stuff which gets in the way effectively, which prevents us from really thinking too much about the problem at hand. And germane, which is useful stuff about the problem domain that actually helps us to solve a particular problem.

Now, in a software development context, this would be something like this. Intrinsic could be remembering how classes are defined in Java. Extraneous would be, for example, oh, how the hell do I deploy this application again? It's really complicated. We shouldn't have to think about it.

Germane, if we're working in a financial services application, it might be, well, how do bank transfers work? We need to have that cognitive load on people who are working because they need to be thinking about the details of, in this case, how bank transfers work in order to be able to write code effectively.

You could sort of see it like this in a software delivery context. Intrinsic is the fundamental skills that we bring as engineers. Extraneous could be something like the mechanism which we shouldn't really have to think about. And germane is the important stuff about the domain that we're working with, the business domain we're working with.

A bit of a simplification, but you can think of it like that for now. What we're trying to do is, we have to work with the intrinsic cognitive load. That's just the nature of the beast. We have to do it. We're trying to squeeze down as much as possible the extraneous cognitive load, which doesn't add value, which gets in the way. And we're trying to give ourselves as much space as possible for the germane cognitive load, the stuff that is really business differentiating.

Tried to represent that in this slide here. If you want to know more about this, by the way, have a search for talks and slides called "Hacking Your Head" by Jo Pearce. You'll find some interesting talks and slides and blog posts and things around that. What all this means is, if we want to enable organizational agility, we need to explicitly limit the size of software services and products to the cognitive load that the team can handle.

Because as soon as we exceed the cognitive load of the team, there's a danger that those four metrics, if you remember from "Accelerate", get worse: we're going to be driving bad decisions, we're going to be increasing bugs, we're going to be making it more difficult to diagnose and redeploy and so on. And that's not where we want to be.

So this is a very different starting point for software architecture and for ways of thinking about team responsibility boundaries and so on. We've started to think now about what's an effective size of software. Well, the size of software should be no more than the owning team can handle based on the cognitive load.

It's certainly not something that many organizations have been explicitly doing. Many organizations have been thinking about this, but perhaps not exactly in these terms until more recently. So again, this is this software that fits in our heads concept. If it fits in our heads, we're more able to own it as we build and run it in production.

So we're starting with a team. This is very much a team-focused way of thinking about software responsibilities, boundaries, architecture, and so on. When we say team, we mean a long-lived group of people with a shared purpose and backlog, probably fewer than nine. In some organizations with very high trust, you might be able to get away with a team being more like 15 people.

But certainly in our book, in the Team Topologies book, team means something very, very specific, which is this long-lived collection of individuals who work together over a long period of time, multiple months, years, possibly, with a common purpose and work together as a team rather than just a collection of individuals with the same manager.

And the reason for that is because a high-performing team is far more effective than just a collection of individuals. So if we want to be a high-performing organization, we use teams to do the work. So there's a really important point here that each service or application, each part of the software estate, must be fully owned by a team with sufficient cognitive capacity to be able to build and operate it.

There are no applications or services which are shared or which don't have an owner or which only have like a BAU team keeping it ticking over. Every application or service has got full ownership from one team that builds and runs it. And it has sufficient cognitive capacity.
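The ownership rule described here (exactly one owning team per service, with enough capacity) can be sketched as a simple check over a service catalogue. Everything in this sketch is an assumption for illustration: the service and team names, and above all the numeric "load" and "capacity" scores, since the talk does not define a unit for cognitive load.

```python
from collections import defaultdict


def check_ownership(owners_by_service, load_by_service, capacity_by_team):
    """Return a list of problems; an empty list means the rule holds.

    owners_by_service: service name -> list of owning team names
    load_by_service:   service name -> assumed cognitive load score
    capacity_by_team:  team name -> assumed cognitive capacity score
    """
    problems = []
    load_per_team = defaultdict(int)
    for service, owners in sorted(owners_by_service.items()):
        if len(owners) == 0:
            problems.append(f"{service}: no owning team")
        elif len(owners) > 1:
            problems.append(f"{service}: shared by {', '.join(sorted(owners))}")
        else:
            # Exactly one owner: add this service's load to that team's total.
            load_per_team[owners[0]] += load_by_service.get(service, 0)
    for team, load in sorted(load_per_team.items()):
        capacity = capacity_by_team.get(team, 0)
        if load > capacity:
            problems.append(f"{team}: load {load} exceeds capacity {capacity}")
    return problems


# Hypothetical catalogue with one unowned service, one shared service,
# and one team carrying more than its assumed capacity.
owners_by_service = {
    "checkout": ["team-a"],
    "catalog": ["team-a"],
    "search": ["team-a", "team-b"],
    "legacy-billing": [],
}
load_by_service = {"checkout": 5, "catalog": 4, "search": 3, "legacy-billing": 2}
capacity_by_team = {"team-a": 8, "team-b": 8}

problems = check_ownership(owners_by_service, load_by_service, capacity_by_team)
```

In practice the load scores would come from the team's own assessment rather than from a fixed number per service; the point of the sketch is only that unowned services, shared services, and overloaded teams are all detectable conditions.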

We haven't exceeded the cognitive load of that team. So we're not just piling more and more services onto the same team. At some point, that team will have reached its limit of cognitive load. And there are some techniques we can use these days which we know work to help us do all these things.

So whole team techniques like mobbing, where the whole team comes around a single keyboard, brings multiple viewpoints to solving a problem. We solve that problem with very high quality. We've reduced the likelihood of downstream problems and bugs and so on, and then we move on to the next feature.

That's a very whole team approach to getting work done. We can use techniques like domain driven design, DDD, to help us establish effective boundaries between different parts of the business domain, and therefore assign the responsibilities to teams to match those domain boundaries.

We can emphasize developer experience. Developer experience, sometimes called DevEx, where we've got strong emphasis on the experience that developers and other engineers have of using other parts of the software estate, platform, tools, this kind of thing, so that we're making sure there's as little friction as possible in using various tools and parts of the platform and so on.

We also need to focus on operator experience. So whoever is running these systems in production, whether it's the same team or whether it's a separate team, maybe it's SREs or ops people, whoever's running it, we need to understand what their experience should be.

Because if their experience is terrible when there's an outage, we're going to be hurting that mean time to restore from the "Accelerate" metrics. We need to be building in operability as a first-class thing for our software so that the operator experience is excellent.

In the book, we talk about something called a thinnest viable platform. So this is the concept where we are going to need some sort of platform underneath what we're building. We might choose to ignore it, but that will be there. And we're not looking to build a platform which is absolutely huge and all singing, all dancing. We're looking at just the smallest amount of platform to accelerate teams who are building application software and services, and make it safe to do the right thing. Safe and rapid to do the right thing.

We'll come back to this one a little bit later on. In the book, we talk about four fundamental topologies. These are four team types, which as far as we can see, are the only four types of team that we need in a modern organization building and running software systems.

We've tried hard to find more types that are necessary, but we've not yet found them. So if you're sure you've got another team type, please come and tell us. We'd like to hear about it. But based on our experience and so on, this is what we've come up with. And the most important one is the stream-aligned team.

Because we're trying to optimize for a fast flow of change, we want to make sure we've got a team that is aligned to the stream of change coming from the business. And we've used things like DDD to help us get boundaries between these different teams, different streams, so that that team is able to take an idea or a change from concept all the way through to production and running it.

So the stream-aligned teams build and run applications and services. And the other three types of team are there effectively to reduce the cognitive load on the stream-aligned team. So, if the stream-aligned team needs to understand a new kind of technology, let's say, a new kind of database, we might have an enabling team, shown in green there in the second one. The enabling team will come in; perhaps they're database experts.

They would work with the stream-aligned team to help them get to grips, to help them understand this new kind of database technology for a period of time. Perhaps it's two months, perhaps it's just two weeks. At some point, the enabling team will move to a different team and start to help them with this new technology. They're not there permanently.

They're not there as a support permanently. The complicated subsystem team is optional, but if there is a part of the system which is really awkward and requires really highly specialist knowledge, then we might give that particular chunk of work to a team with that extra expertise.

And then at the bottom underneath, we've got a platform. There's always a platform, but we need to define it very well and make sure that the way in which we build this platform is focused on enabling the stream-aligned teams to deliver rapidly and safely. So the people in the platform treat the stream-aligned teams as their customers.

In some organizations, they even use things like net promoter score so that the stream-aligned teams can rank, can rate aspects of the platform as if this were a public kind of SaaS service. So let's say in an organization, we've got three stream-aligned teams.

They're running on a platform. Two of the teams are using a component which is quite complicated, so there's a specialist team looking after that. That's on the left in red. And the top two teams are having some help from an enabling team to get to grips with some new technology.

Perhaps it's databases, perhaps it's machine learning, something else. So we can immediately see that the kinds of interactions between different teams are different depending on what they're doing. We don't have exactly the same kind of interactions, and needs, and dependencies between different teams.

It varies depending on what teams are doing in the organization. And the way in which those teams might interact can also be different, needs to be different. The top two teams there that have this enabling team working with them, that enabling team is going to be facilitating those two teams.

That will feel very different from the way in which the component is being used by these bottom two teams, for example. The bottom two teams just want to consume this component as a service, if you like. So they've got a nice clean interface, a nice easy way to install it, or an easy way to test it and access it.

There's very little additional interaction that's really needed there. And likewise, all these three teams here, stream-aligned teams, they can just consume stuff from the platform in a very straightforward way. There's nice APIs, nice documentation. It's nice and straightforward.

The team at the bottom, the stream-aligned team at the bottom, however, is collaborating with the platform on something new. Perhaps they're moving cloud provider, or perhaps they're changing the way they do infrastructure automation or something. They need to interact with the platform team in a different way.

So we've got different kind of team interactions at different parts of the organization at the same time, depending on what's happening. This is just a snapshot. In six months' time, the interactions will look different because the teams are doing something different.
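The snapshot just described can be written down as data, which makes it easier to see that one team may interact with its neighbours in several ways at once. This is a minimal sketch: the team names are placeholders, and the interaction mode labels (collaboration, X-as-a-Service, facilitating) follow the terminology of the Team Topologies book rather than anything shown on the slide.

```python
from dataclasses import dataclass
from enum import Enum


class TeamType(Enum):
    STREAM_ALIGNED = "stream-aligned"
    ENABLING = "enabling"
    COMPLICATED_SUBSYSTEM = "complicated subsystem"
    PLATFORM = "platform"


class InteractionMode(Enum):
    COLLABORATION = "collaboration"
    X_AS_A_SERVICE = "x-as-a-service"
    FACILITATING = "facilitating"


@dataclass(frozen=True)
class Interaction:
    provider: str            # team offering help, a component, or the platform
    consumer: str            # stream-aligned team on the receiving side
    mode: InteractionMode


# Placeholder names for the teams in the example.
teams = {
    "stream-top": TeamType.STREAM_ALIGNED,
    "stream-middle": TeamType.STREAM_ALIGNED,
    "stream-bottom": TeamType.STREAM_ALIGNED,
    "enabling": TeamType.ENABLING,
    "subsystem": TeamType.COMPLICATED_SUBSYSTEM,
    "platform": TeamType.PLATFORM,
}

interactions = [
    # The enabling team facilitates the top two teams for a limited time.
    Interaction("enabling", "stream-top", InteractionMode.FACILITATING),
    Interaction("enabling", "stream-middle", InteractionMode.FACILITATING),
    # The bottom two teams consume the complicated component as a service.
    Interaction("subsystem", "stream-middle", InteractionMode.X_AS_A_SERVICE),
    Interaction("subsystem", "stream-bottom", InteractionMode.X_AS_A_SERVICE),
    # All three stream-aligned teams consume the platform as a service.
    Interaction("platform", "stream-top", InteractionMode.X_AS_A_SERVICE),
    Interaction("platform", "stream-middle", InteractionMode.X_AS_A_SERVICE),
    Interaction("platform", "stream-bottom", InteractionMode.X_AS_A_SERVICE),
    # The bottom team is also collaborating with the platform on something new.
    Interaction("platform", "stream-bottom", InteractionMode.COLLABORATION),
]


def modes_for(team):
    """Group the providers a consuming team interacts with by interaction mode."""
    grouped = {}
    for interaction in interactions:
        if interaction.consumer == team:
            grouped.setdefault(interaction.mode, []).append(interaction.provider)
    return grouped


bottom = modes_for("stream-bottom")
```

Because this is a snapshot, the `interactions` list is the part expected to change over time, while the team types stay comparatively stable.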

So this is an important point, that the purpose of the platform, the enabling team, the complicated subsystem team, are there to reduce the cognitive load on the stream-aligned teams, to enable them to own their parts of the system effectively. We're expecting to interact differently with other teams in the organization.

And this starts to help us to move towards the concept of environmental scanning. In this case, it's our internal environment within the organization. So Dr Naomi Stanford, who's one of the world's foremost experts on organization design, talks about environmental scanning as a really crucial aspect of how organizations should expect to set themselves up for success.

And the patterns we're talking about today start to touch on that. So let's have a look at some case studies now.

Manuel Pais

Thank you, Matthew. So I'm going to talk about two case studies. The first one is from a large worldwide retailer. They're still growing into new markets, and so they realize we're a traditional enterprise. Our delivery cycles are very slow. So we want to do something different.

So they had a specific market that they wanted to enter, and they said, "We need a new mobile experience, so we're going to create a cross-functional team and give them the autonomy to decide whatever architecture you think is the best to do this." So this team had all these good practices around DevOps, continuous delivery, using public cloud, et cetera, and they had this iterative approach.

So they very quickly were able to deliver something working and then iterate and improve over time. So it's a very concrete success story for this organization. The kind of success stories you'd put in a presentation like this. So what happened next is that because they were successful, they were asked to do another mobile experience for another market. So you can see they started to have a bit more complexity in terms of back ends, and they needed a CMS to control different types of changes to different markets.

And this went on for quite a while. So about a year to a year and a half later, you can see the team has grown considerably and the system around them as well, or the system they're responsible for. So you start to have more back-end services, product catalog, framework with shared services between different mobile applications, et cetera.

The interesting thing here is that a couple of people in this team, the more senior architects, were realizing that actually our delivery cadence is slowing down. We're actually starting to have more dependencies within the team. And what's happening here is, as you can see, this is becoming a little bit of a monolith that the team is working on. And you start having people who are specializing in certain parts of the system.

So those people become bottlenecks. If we needed to change this part of the system, only one or two people know how to do it. Effectively, you start having different work streams within this larger system, and some of them are blocking each other. So what they decided to do, and at first they had a lot of pushback against this decision, was to split the team into two smaller teams.

You can see that on your right side, one of the teams is more focused on the front-end experience and the product catalog, and on the left side, the team is more focused on the back-end services. But because the team was working quite well before, they were not really very happy with the split, but they did it and it turned out quite well, because actually most of the time they could work independently on their part of the backlog, on their features.

But obviously there are some that were cross-cutting across the two teams, and for those, they're represented... Between the two teams, you can see those two blue bars. That means they have a very considerable amount of communication between the two teams. You could almost see it as a pair of teams that come together for specific needs.

So there will be some features, there will be some changes where they need to synchronize and actually work together for a period of time. But this is intentional. It's explicitly designed like that, and the rest of the time they can work more independently. So this worked out quite well for them.

They even went on to further split. I believe now they have front-end teams almost aligned to a single market, so they can go as fast as possible to meet the needs of that specific market. And on the back end, they also split, and they align to almost one service per team.

So what was happening here is that as the team grew and the system grew, it was becoming more monolithic and having flow of work being blocked within the team. But they were able to listen to some of these triggers that, okay, we need to evolve. What was working before and the structure we had before is not working anymore.

So software growing too large, overspecialization. So people like Brent in "The Phoenix Project", who are the only ones who know how to change part of the systems or support it. And just overall increased need for coordination, spending more time coordinating different changes, et cetera, even within the team.

So the other case study I want to talk about is from OutSystems. They are a low-code platform vendor. And they have also grown considerably in the last years. In particular, they had one team which was called the engineering productivity team. So they were helping the product teams get better in terms of these domains of continuous delivery, test automation, build and continuous integration, as well as infrastructure automation. But over time, they were acquiring more responsibilities in these different domains.

But what happened was that, again, people had to specialize in one or at most two domains because it was very difficult. Although they wanted everyone to be able to work on everything, in reality, people had to specialize because it was too much cognitive load.

And so they realized we're actually getting people demotivated and not engaged with the work because there's so much happening, and they were just trying to stay alive and respond to the product teams. That alone was very hard. So again, they also decided to split into smaller teams.

Each of these smaller teams is aligned to a single domain, and they don't have a team lead anymore, so it's a flat structure within the team. And this quickly proved to be very useful, very successful for them. Because if you think about the intrinsic motivators for individuals, so if anyone has read the book "Drive" by Daniel Pink, he talks about three intrinsic motivators: autonomy, mastery, and purpose. So each of these teams were in a better place to have those motivators because they had a shared purpose, a single domain of focus that they were engaged with, interested in.

They had more autonomy to decide, okay, what are the priorities for this domain? Where do we want to go? What are we missing as an organization? And mastery in the sense of, okay, we have the autonomy to allocate effort to improve our knowledge, to learn new techniques, maybe go to conferences, try out new tools, et cetera.

So this worked out quite well for them. And you can see, again, there are cross-cutting concerns. Maybe some requests will need people from different teams to come together because they cross different domains. But that's the exception. And what they do in that case is they create a micro team for a period of time when we're going to work specifically on this request or on this feature that is cross-domain. But most of the time, they're able to work independently as they are aligned to a single domain.

Ironically, this engineering productivity team was created to reduce the cognitive load on the product teams, but they themselves fell victim to too much cognitive load, too many responsibilities. So it's not always just about software size. Think about some teams are more support teams or productivity teams, so they have domains of responsibility that you need to be careful that they're not overbearing for the team.

So if you aim for teams with this kind of high cohesion internally, this shared purpose, autonomy, and mastery that we talked about, that can be quite powerful. And between teams, there will always be a need for coordination and communication, but you can try to keep that to the low-bandwidth, minimal communication that you need.

And for most of the time, they are independent. They can work on their own backlogs. So again, they were listening to triggers for evolution, awkward interactions within the team or people not invested. Some people were at the point of almost burnout and leaving the organization because they didn't like how the work was being done, and frequent context switching. Every time we switch context, we need to upload to our working memory the skills and the domain knowledge that is necessary for that problem or feature we're working on.

And give it back to Matthew. Thanks, Manuel.

Matthew Skelton

So technically, we're out of time. If you're happy to leave, thank you for coming. Otherwise, I'll take about two minutes to run through a few extra things. Here's some ideas for getting started. Go and ask your team how confident they are, or rather, how anxious they are, how much anxiety they have about the software that they're working on.

Try and get a sense for that. Try and get to the point where they feel comfortable giving you an honest answer. Because the anxiety about the software they're working on is a leading indicator for potential problems in production. And we want to use leading indicators rather than lagging indicators, right?

So if we can actually assess the sense of how confident the team is that they understand everything about the software they're working on, that can be a powerful indicator for whether we're likely to get problems later on. Have we exceeded the team's cognitive load? Do we therefore need to pull some things into a platform? Maybe, maybe not. It depends.

Are there skills or capabilities missing within the team? So these things are signals. If we've gone beyond the team cognitive load, that might indicate that there are other things we need to change around that team. These slides are obviously going to be available online.
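One possible way to turn these questions into a leading indicator is a short, anonymous team survey scored per dimension. The sketch below is only an illustration: the dimensions are the measures listed in the talk abstract, while the 1 to 5 anxiety scale, the 3.5 threshold, and the sample answers are assumptions rather than anything prescribed by the speakers.

```python
from statistics import mean

# Dimensions taken from the measures named in the talk abstract.
DIMENSIONS = [
    "supportability",
    "deployability",
    "testability",
    "operability",
    "prioritization",
    "domain complexity",
]


def assess(responses, threshold=3.5):
    """Average each dimension and list the ones at or above the threshold.

    responses: one dict per team member, mapping each dimension to a score
               from 1 (comfortable) to 5 (very anxious). The scale and the
               threshold are assumed values for this sketch.
    """
    averages = {d: mean(r[d] for r in responses) for d in DIMENSIONS}
    hotspots = [d for d in DIMENSIONS if averages[d] >= threshold]
    return averages, hotspots


# Made-up answers from two team members.
baseline = dict.fromkeys(DIMENSIONS, 2)
responses = [
    {**baseline, "deployability": 5, "operability": 4},
    {**baseline, "deployability": 4, "operability": 3},
]
averages, hotspots = assess(responses)
```

Dimensions that show up as hotspots are candidates for the follow-up questions above, for example whether something should move into the platform or whether a skill is missing in the team.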

Think about what is your platform? How is that defined? How do teams understand what they are consuming from that platform? How good is the documentation? How good is the developer experience for using the stuff that's in that platform? Because if any of that stuff is not first class, then you're increasing the cognitive load on the stream-aligned teams that are supposed to be developing software. And why would you do that?

We need to minimize that kind of extraneous cognitive load that has evolved around how do I deploy this component? How do I update the package? Whatever it is, we want to minimize that kind of extraneous load. So work out how easy it is for teams to use that platform to understand how to use a platform and so on. So that's a kind of developer experience.

So here's the book. We're book signing at, I think it's 7:15 this evening in Chelsea. If you're interested in training, get in touch. We have some options available. We are also looking for industry case studies. We're talking to three organizations at the moment who have started to use the patterns and ideas from the Team Topologies book.

We're talking to a global manufacturing company, a large government agency. Well, actually two large government agencies, and a company involved in global financial services. But if you're working in a situation where you think you've got some interesting dynamics in your software delivery challenges, then just do get in touch if you find the material useful.

We've got a newsletter. Sign up if you like. Thank you for coming.