Monoliths vs Microservices is Missing the Point—Start with Team Cognitive Load

London 2019

Matthew Skelton

Author · Team Topologies

Manuel Pais

Author · Team Topologies

The “monoliths vs microservices” debate often focuses on technological aspects, ignoring strategy and team dynamics. Instead of technology, smart-thinking organizations are beginning with team cognitive load as the guiding principle for modern software. In this talk, we explain how and why, illustrated by real case studies.

Matthew Skelton has been building, deploying, and operating commercial software systems since 1998. Head of Consulting at Conflux (http://confluxdigital.net/), he specialises in Continuous Delivery and operability for software in manufacturing, ecommerce, and online services, including cloud, IoT, and embedded software.

Matthew curates the well-known DevOps team topologies patterns at devopstopologies.com and is co-author of the books Continuous Delivery with Windows and .NET (O’Reilly, 2016) and Team Guide to Software Operability (Skelton Thatcher Publications, 2016). He is also co-founder at Skelton Thatcher Publications (http://skeltonthatcher.com/), a specialist publisher of techniques for software teams.

Matthew founded and leads the 2300-member London Continuous Delivery meetup group (http://londoncd.org.uk/), and instigated the first conference in Europe dedicated to Continuous Delivery, PIPELINE Conference (http://pipelineconf.info/). He also leads the CodeMill digital skills initiative in the North of England (http://codemill.tech/), and is a Chartered Engineer (CEng).

Manuel Pais is an independent IT consultant and trainer, focused on team interactions, delivery practices, and accelerating flow. Manuel is co-author of the book "Team Topologies: Organizing Business and Technology Teams for Fast Flow" (IT Revolution Press, 2019). He helps organizations rethink their approach to software delivery, operations and support via strategic assessments, practical workshops, and coaching.

Full transcript

The complete talk, organized by section.

Matthew Skelton and Manuel Pais

Hi. Good afternoon, everyone. It's good to see you here. My name's Matthew Skelton.

And I'm Manuel Pais.

Together we are the co-authors of a new book called Team Topologies. We're here today to share with you some insights, advice, and experience on how to size software services. The focus of that is team cognitive load.

Today's talk will look something like this. We'll have a section where we're looking at monoliths and microservices, different sizes of software. We'll then look at what we mean by team cognitive load. We've actually had this term mentioned a couple of times already today in some of the earlier talks. Manuel will then take us through some case studies: organizations that have used team cognitive load as a way of helping them to evolve their software systems. Right at the end we'll look at a few tips for getting started with this approach.

This is a book published by IT Revolution Press. The publication date is September 2019. These are the early advance copies, and we're doing book signings tonight at half past seven at the Sonatype stand. If you're interested in what you hear today, come along and you'll get a signed copy.

01 Monoliths vs Microservices

In the past few years, many organizations have started to adopt microservices as a way of being able to deploy their software systems more rapidly, with greater focus on specific areas of the system. But there's often lots of debate around what size a microservice should be. Should it be 10 lines of code? Should it be 100 lines of code?

It starts to look a little bit like this, some kind of Mortal Kombat thing. In the blue we've got Tammer Saleh, who says, "Start with monolith and extract microservices." Over the other side of the arena, we've got Stefan Tilkov, who says, "Don't start with a monolith when your goal is microservices." Then the wise words of Simon Brown, who says, "If you can't build a monolith, what makes you think microservices are the answer?"

What's going on here? Where should we actually focus?

I think Daniel Terhorst-North has got it right when, in his phrase, he talks about "software that fits in your head." There's an awful lot of experience and awareness behind that recommendation or that phrase. In the context of teams, if we're thinking about building software within the context of teams owning and running software, we might rephrase this to be "software that fits in our heads" as a team, but the intent is the same.

Who has yet to buy or read a copy of Accelerate? Put your hand up and be shamed. Fine. You need to get yourself a copy from the stand. Very straightforward.

These are the four key metrics from the Accelerate book, based on five years' worth of State of DevOps reports and assessment from many thousands of companies around the world. These are the four key metrics that are strongly indicative of high organizational performance: lead time, deployment frequency, mean time to restore, and change fail percentage.

The problem is: if the software which we're working with does not fit in our heads, these things are going to be very difficult to improve upon. If the lead time is the time from, say, version control to production, and the software is too big, we're likely to distrust the tests and want to take more time to find out what's going on. The lead time is going to extend.

Same with deployment frequency. If we don't understand the software well enough, are we going to have the confidence to deploy more and more frequently? Probably not. We're probably going to restrict how many times we deploy, and so on. If the software we're working with is too complex and too complicated and fails in really awkward ways in production, it's going to be difficult for us to restore that service quickly. Again, our MTTR will extend.

If we want to start to move towards improving these four key metrics as recommended by the Accelerate book, then we need to start thinking about the size of software that we're expecting teams to work with. Software that is too big for our heads works against organizational agility.
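As an illustration only, the four key metrics could be computed from a team's deployment history along the following lines. This is a minimal sketch, not a method from the talk or from Accelerate: the `Deployment` record shape and the `four_key_metrics` function are hypothetical names, and time to restore is approximated as the time from a failed deployment to the restore.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # change reached version control
    deployed_at: datetime                   # change reached production
    failed: bool = False                    # change caused a failure in production
    restored_at: Optional[datetime] = None  # service restored, if it failed


def four_key_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Summarize the four key metrics over a period of `period_days` days."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    # Lead time: version control to production, per deployment.
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Assumption: the failure starts at deployment time.
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "lead_time": median(lead_times),
        "deployments_per_day": len(deployments) / period_days,
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times
            else None
        ),
        "change_fail_percentage": 100 * len(failures) / len(deployments),
    }
```

Tracking these numbers per team, rather than per organization, makes it easier to see which teams are working with software that no longer fits in their heads.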

This is a different starting point compared to how many organizations and many people have started in terms of thinking about software and architecture. Often in the past, we've started with bits of technology. We've started with a database, a message bus, or something else. If we start with the team and the cognitive load for that team, we get some different results.

02 Team Cognitive Load

Let's have a little look at what we mean by team cognitive load. Cognitive load was defined in 1988 by psychologist John Sweller as the total amount of mental effort being used in the working memory.

There are three kinds of cognitive load that John Sweller identified: intrinsic, extraneous, and germane. In the context of software development, we can think of them in these three ways.

Intrinsic is something like how a class is defined in Java. It's a fundamental of how we're working with software systems. We don't have that front and foremost all the time. Once we've spent six months or a year doing Java development, then it becomes an intrinsic part of how we work.

Extraneous is something that works against what we're doing, something that is like a distraction. "How do I deploy this app again? I can't remember. It's really awkward. I've got to set this config property, blah, blah, blah." This is extraneous cognitive load, and it's effectively valueless. We don't want to have this kind of cognitive load on our teams.

Germane cognitive load is load that we have to deal with because this is the part of the business problem we're trying to solve. If we're building an app for online banking, then part of the germane cognitive load of the software developer or tester or whoever is building the application might be: how do bank transfers work? You need that kind of load in your head as you're building the software.

In a software context, intrinsic is like the skills that we bring to the table. Extraneous is stuff to do with the mechanisms of how we do things in a software world. Germane is the domain focus. It's a bit more involved than that, but that's how you could think of it.

What we're really trying to do is give the most space to the germane details, the germane cognitive load. The intrinsic we have to deal with. We can't get rid of it. We're working with software and computers; there's stuff that we just have to know. We're trying to minimize and squeeze the extraneous cognitive load, to get rid of that as much as possible. If possible, just get rid of it entirely so that we've got the most space available for the germane cognitive load, the business focus of the problem we're trying to deal with.

If you want to know more about this in some detail, there are some great presentations by Jo Pearce. If you search for "Hacking Your Head," then you'll find lots of slides, videos, and so on. There's some really good material there.

The slides from today will be online later on today. We'll tweet out the link and you'll be able to download the slides.

03 A Team-First Approach to Software Boundaries

This is the implication of what we've just been talking about: we should be thinking about limiting the size of software services and products to the cognitive load that the team can handle. We're starting to take a sociotechnical approach to building our software systems here. We don't just pretend that we can throw any kind of software architecture or design or technology at a team and they'll just have to deal with it. We're using the constraints or properties of the human systems that we have in our organizations and working with them to produce more effective software delivery and more effective software systems.

This is "software that fits in our heads." This is really quite a different approach to thinking about software boundaries. It will feel very unfamiliar to many people. Not to everyone; there are organizations already doing this, as we'll see very shortly. But it does feel a bit unusual.

When we talk about teams, we're talking about a group of people that's probably less than about nine people in size. There are evolutionary reasons for this. Some organizations have found patterns where you're able to bring two of these teams together in close harmony. If you think about a rugby team, you've effectively got two closely operating teams together: the forwards and the people at the back. I don't play rugby, but I spoke to people who do, and they say it feels a little bit like there are two separate teams, but working really closely together. Some organizations have found ways in which they can do that. But generally speaking, we're talking about a cohesive, long-lived group of people that work together on the same set of business problems for an extended period, and that group of people is less than about nine.

We've heard from many of the talks this morning about ownership of software services and how important that is. We need to move to the point where every service must be fully owned by a team with sufficient cognitive capacity to build and operate it. In the words of Andy Burgin from Sky Betting & Gaming earlier on, it was, "You build it, you run it, you fix it, you support it, you diagnose it," and so on. That's what we're talking about here. There are no services, there are no products which do not have an owner.

We know that there are techniques to help us do this kind of stuff. We've got techniques like mobbing, which apply to the whole team, which help the team to own that service. We've got techniques like domain-driven design, DDD, to help us choose domain boundaries in an effective way that really works for the business context.

We've heard many people talk about the importance of developer experience, particularly when building a platform: making sure that platform is very compelling and very easy and natural for product teams and development teams to use. We're making sure we're explicitly addressing developer experience, particularly when we're building a platform, but to be honest, when we're doing anything inside our organization where other people need to use our software.

We also need to think about the operator experience. What about the people who actually need to run this stuff? People who are on call. How easy is it to diagnose these systems? If we've built a system that's fine for our team, but we've handed it over to another team and it's terrible and really difficult to operate, if the cognitive load is way too high, we're in a bad place. We need to focus on operability to make that stuff work.

Another technique is what in the book we've called the Thinnest Viable Platform, which is an approach where we explicitly define what the platform looks like. Again, from Andy Burgin's talk this morning, there was a really nice slide where he showed the very beginning of their platform evolution. They had a wiki page which defined exactly what that platform was aiming to do and listed the services it provided.

Being super explicit about what our platform is is important. It's also important to make sure that it's not bigger than necessary: hence thinnest viable. If you're a startup and quite small, maybe 10 or 15 people in your organization, the underlying platform is going to be something like AWS or Azure or Google Cloud. You might decide to build an extra platform layer on top of that. But your platform might simply be a wiki page listing the five services that you are going to use from AWS. If you don't need to build anything more, don't build anything more. That's enough. That is your thinnest viable platform: just a wiki page with a list of five services. We're not trying to build a huge great thing.
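To make the "wiki page with a list of five services" idea concrete, here is a minimal sketch of a platform definition kept as data and rendered as a plain-text page. Everything in it is an assumption for illustration: the `PlatformService` type, the `render_platform_page` function, and the particular five AWS services are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformService:
    name: str     # service the teams are expected to use
    purpose: str  # what teams should use it for


def render_platform_page(title: str, services: list[PlatformService]) -> str:
    """Render the platform definition as a plain-text page, one line per service."""
    lines = [title, ""]
    lines += [f"- {s.name}: {s.purpose}" for s in services]
    return "\n".join(lines)


# Illustrative choice of five services; a real list depends on the organization.
THINNEST_VIABLE_PLATFORM = [
    PlatformService("Amazon S3", "static assets and file storage"),
    PlatformService("Amazon RDS", "relational databases"),
    PlatformService("Amazon SQS", "queues between services"),
    PlatformService("AWS Lambda", "small event-driven functions"),
    PlatformService("Amazon CloudWatch", "logs, metrics, and alarms"),
]

page = render_platform_page("Our platform", THINNEST_VIABLE_PLATFORM)
```

The point of the sketch is the size: if the whole platform definition fits in a short list like this, there may be nothing more to build yet.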

We need to make sure that whatever we build is compelling to use and has strong developer experience. We're treating the product teams, or what we call stream-aligned teams, as customers. We're treating them as people whom we need to speak to, to understand what they need, and we need to be set up to meet their needs.

04 Four Team Types and Three Interaction Modes

I've talked about a few different team types. In the book, we've identified four different kinds of team, which, as far as we can see, are really the only types of team that are needed in this kind of context in building modern software systems.

The first team type is the most fundamental: the stream-aligned team, the team that is aligned to part of the value stream for the business. They have end-to-end responsibility for building, deploying, running, supporting, and eventually retiring that slice of the business domain or that slice of service.

The other kinds of teams listed below are effectively there to reduce the cognitive load of the stream-aligned team. If we've chosen our domain boundaries well, the stream-aligned team should have everything they need to deploy changes for that part of the business system. But they can't do everything. They need some supporting services from a platform, for example. We heard a great talk from Tom this morning about the platform at ITV. We need support from the platform so we don't have to think about how to spin up a Kubernetes cluster, because that would be too much increased cognitive load compared to deploying something more business focused.

Likewise, for a complicated subsystem team, if there's part of the system where, let's say, in the case of media streaming, we need to write a specialized video transcoding component, we probably hire some people with PhDs in maths or something like this and get them to work on a complicated subsystem. We're taking the cognitive load off the stream-aligned teams, who can focus on the more customer end-to-end experience.

Enabling teams help to upskill the stream-aligned teams, typically on a temporary basis, and also detect if there are gaps in the platform or gaps in what the stream-aligned teams are expected to do.

This is maybe an organization where we've got three stream-aligned teams. We've got a platform underneath. We've got a complicated subsystem on the left in red. Towards the right-hand side, we've got one of those enabling teams facilitating two of the stream-aligned teams. Perhaps they're moving from one container platform to another, so they're trying to get up to speed.

Another key idea in the book is the need to be much more explicit about the ways in which teams interact. From our experience and what we hear from other people, in many organizations teams don't understand why or how they should interact with other teams.

What we've defined is just three interaction modes. Part of the purpose of these three interaction modes is to help reduce confusion and effectively reduce irrelevant cognitive load so that it's easier for teams to understand how they should operate effectively.

If the complicated subsystem, our transcoding component, is being built by that team, and we set up the expectation that they're simply providing that component as a service to the two teams at the bottom, then all those three teams involved in that interaction have a clear understanding about how they're supposed to interact, provide something, or consume something. We've minimized the cognitive load around how we should operate as a team.

Similarly, the stream-aligned team at the bottom here is currently collaborating with the platform to discover something about logging or a better way of doing Kubernetes. They know that for a period of time their cognitive load is going to be higher because they're working together closely with another team. But perhaps after, say, three months, we finish that discovery and go back to consuming the container platform as a service.
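The four team types and three interaction modes can be written down as a small data model, which is one way of making expectations between teams explicit. The following is a sketch under stated assumptions: the type and mode names come from the talk, but the `Team` and `Interaction` classes, the `collaborations_per_team` helper, and the example team names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class TeamType(Enum):
    STREAM_ALIGNED = "stream-aligned"
    PLATFORM = "platform"
    COMPLICATED_SUBSYSTEM = "complicated subsystem"
    ENABLING = "enabling"


class InteractionMode(Enum):
    COLLABORATION = "collaboration"    # working closely together; higher load
    X_AS_A_SERVICE = "x-as-a-service"  # one team provides, the other consumes
    FACILITATING = "facilitating"      # one team helps the other get up to speed


@dataclass(frozen=True)
class Team:
    name: str
    type: TeamType


@dataclass(frozen=True)
class Interaction:
    first: Team
    second: Team
    mode: InteractionMode


def collaborations_per_team(interactions: list[Interaction]) -> dict[str, int]:
    """Count active collaborations per team, where load is temporarily higher."""
    counts: dict[str, int] = {}
    for i in interactions:
        if i.mode is InteractionMode.COLLABORATION:
            for team in (i.first, i.second):
                counts[team.name] = counts.get(team.name, 0) + 1
    return counts


# Example loosely following the scenario described above (names are made up).
transcoding = Team("transcoding", TeamType.COMPLICATED_SUBSYSTEM)
platform = Team("platform", TeamType.PLATFORM)
web = Team("web", TeamType.STREAM_ALIGNED)
mobile = Team("mobile", TeamType.STREAM_ALIGNED)

interactions = [
    Interaction(transcoding, web, InteractionMode.X_AS_A_SERVICE),
    Interaction(transcoding, mobile, InteractionMode.X_AS_A_SERVICE),
    Interaction(web, platform, InteractionMode.COLLABORATION),
]
```

A team that shows up in several collaborations at once is a candidate for a conversation about cognitive load.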

The mechanism here is that if we define ways of working with other teams much more clearly, we're able to address cognitive load too and minimize it in different parts of the organization. We're now going to look at some case studies from organizations.

Manuel Pais

I'm going to talk about two case studies from the book. The first one is a large worldwide retailer. They're still growing into new markets. Back in 2016, they decided they wanted a new mobile site for one of these new markets. They put a team together from scratch: a cross-functional team with business people directly involved in the team. They had all the technical skills to have this kind of end-to-end ownership that Matthew is talking about. They had good DevOps practices. Everything was in the cloud: the typical success story that you would include in a presentation like this.

Given that success, they were able to quickly release a working version of the mobile website and then iterate frequently. After a while, they were asked to do the same for a new market, a new mobile site. They wanted this to be rather independent, so they could evolve the different sites for different markets more or less independently. But in the back end, they started to need a little bit more complexity. They needed a content management system so they could upload content to different sites. Overall, this was working quite well still.

Over time, they were asked to do even more markets, more sites, and the back end started to get more complicated. They needed a subsystem to handle product management and product catalog, so different markets could have different sets of products, versions, and pricing available. They also started this framework, which is essentially a collection of common services to all the sites: things like searching for a product or uploading static files to a CDN. All the sites would need those, but you wouldn't want to repeat them for every code base.

You can tell probably what's happening here. The system is growing, and the team is growing along with it. By now they had far more people than in the beginning, and it's becoming a little bit of a monolith.

Some of the people on the team started to realize that now they also had different work streams going through the team. You have feature requests for one market site, other feature requests for other markets, and changes that need to be done in the CMS for the content editors. The fact that the system was a little bit monolithic by now meant that these work streams were impeding each other. There were dependencies, and they were slowing down pace of delivery. The thing that had made them so successful in the beginning was now harder to achieve.

In particular, two people in this team who had a senior architect role started to realize this. Even though the team worked quite well together, they were a high-performing team, they noticed these dependencies. People had to start specializing in certain parts of the system. Before, it was pretty fluid: you would get a change request or a feature and know exactly which parts of the system to change and get it out. Now people were starting to specialize in specific parts.

Those two senior architects proposed to split the team in two. They got a lot of pushback because the team members felt that they were working well together, but eventually they did it. They got into this pattern Matthew mentioned, a kind of paired team. Obviously there was a lot of communication going on regularly, but after doing some refactoring and re-architecting, they were able to split into two teams: one team more focused on the customer-facing applications and markets, and the other team focusing more on the CMS and this framework.

This worked quite well for them. The two teams were able to deliver more independently. There was still some correlation between the roadmaps for these two teams, and they had regular communication, but they were much more independent. They realized that at this point there was too much cognitive load. The system was too large to handle as efficiently as before.

From what we've heard, they've gone on to further break down this team. I believe now they have smaller teams aligned to markets on the customer-facing side, and they have split the CMS and the framework, which is a kind of platform team with common services. This worked quite well for them.

The key point was that as they grew and were successful, the system became larger and the team became larger, and things were starting not to work as well. The flow of work was getting blocked, or at least significantly delayed. The critical thing is that some people in the team were listening to the signals that something was not as efficient as it was before. The software was getting too large in this kind of monolithic architecture.

Some people were overspecialized. If you've read The Phoenix Project, it's kind of the Brent syndrome, where only this person or this couple of people know how to change that part of the system. You're introducing this dependency, even inside one team: only when those people are available will we be able to get this out the door. Overall, there was an increasing need to coordinate releases and make sure when one part is done so another part can be done, introducing delays in delivery.

05 OutSystems

It's not always just about the size of the software the team's responsible for. There are other types of responsibilities. In the case of OutSystems, one of the leading low-code platform vendors in the world, a few years ago they started the engineering productivity team.

In the beginning, this team worked as an enabling team around build and continuous integration, and also test automation. Their goal was to reduce cognitive load for the other engineering teams, who are in fact their customers. They were helping them adopt good practices around these areas, set up tooling in a good way, and overall help the engineering teams increase their maturity in these areas. Again, they were quite successful.

What happened was that they took on more domains: in particular, infrastructure automation and continuous delivery enablement. The team grew to cope with that. The interesting fact was that the other engineering teams were getting more mature, more advanced in the way they used test automation, CI/CD, and so on. They were coming back to the engineering productivity team with requests for help that were much more specific and much more domain-specific for those teams.

This productivity team was now facing a large number of requests across different domains and from different teams with specific needs. They were barely able to keep afloat, let's say, and respond on a timely enough basis to these requests. Inside the team, it became very difficult for any one team member to understand all these different domains. People were, in practice, working only on one or perhaps two domains, and motivation went down significantly.

Some people felt like they didn't have enough effort available to master the domains they were supposed to support and understand in detail. At the same time, they were spending a lot of time in planning meetings and standup meetings, where most of the things being discussed were not directly related to the work they were doing.

At this point, quite recently, in late 2018, they made a bold decision to split into smaller teams, almost micro-teams, where any one team was only responsible for one of these domains.

The early results were quite positive. Motivation went up. People felt like they had more autonomy to decide the priorities for their domain of responsibility. They also interacted much more closely with the other engineering teams, their clients, and really understood the problems and the best solutions they could find. They had a little bit of breathing space to master this domain, understand good practices, perhaps come to conferences like this and get to know what other people are doing. Motivation really went up, and there was a feeling of shared purpose inside each of these teams.

There were still issues and requests that were cross-cutting across some of these domains, as they are closely related. But those turned out to be the exception. When that happens, people from different teams come together. If needed, they create a temporary team to work on that specific problem or need, and then go back to their original teams. Before, they were optimizing for the exceptional situation: actual needs and requests across multiple domains.

This has worked quite well for them for now. There is still communication between different teams, but the bandwidth required is much lower. The key is that it's not always just about software size, but actually aligning the number and complexity of the domains that the team is responsible for to their cognitive capacity.

If you aim for this kind of pattern, with smaller teams with high cohesion internally, high communication internally, and shared purpose, then you need some synchronization with other teams, but that can be much lower bandwidth. You don't need to be communicating across all teams all the time. That can work quite well.

Again, they were listening to signals that what worked for them in the beginning was now becoming a problem: awkward interactions, some people not really invested, some people maybe almost burnt out because they were trying to keep up with all these different domains and had to put in a lot of extra time to understand all of this, and definitely frequent context switching inside the team.

06 Sky Betting & Gaming

The last example is not from the book. It's from a recent talk, again from Sky Betting & Gaming. Besides getting a slide of a cat into the presentation, it's here to ask: is it always a good pattern to split into smaller teams? Not necessarily.

In this case, they decided to keep a large team of 12 people because they had different applications: older applications that were making money today, and new applications with more experimentation and trying new markets. Within the same business domain, the demand for working on one part, older applications or newer, would change over time. In one quarter, maybe they needed to increase the resilience of the older systems and spend most of the time on that. Next quarter, maybe they wanted to push out new applications and try new things.

It made sense to keep the same team, but within the team there were clear work streams, and people knew: now we're focusing on this part, the older systems or the newer systems.

Matthew Skelton

How do we get started with this kind of approach? A few ideas here.

Simply speaking, just ask team members. Do a survey of members in a given team: how well do they understand the software they're working on? Give it a score of one to five or something like this. Get a very rough idea of which teams currently are really struggling with the cognitive load of the systems they're being asked to own and develop.
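A rough version of that survey can be summarized in a few lines. This is only a sketch: the `struggling_teams` function, the threshold of 3.0, and the example team names and scores are assumptions, not something prescribed in the talk.

```python
from statistics import mean


def struggling_teams(
    scores_by_team: dict[str, list[int]], threshold: float = 3.0
) -> list[str]:
    """Return teams whose mean self-reported understanding (1 to 5) is below
    `threshold`, lowest average first. Teams with no responses are skipped."""
    averages = {
        team: mean(scores) for team, scores in scores_by_team.items() if scores
    }
    flagged = [team for team, avg in averages.items() if avg < threshold]
    return sorted(flagged, key=lambda team: averages[team])


# Hypothetical survey results: one score per team member.
survey = {
    "checkout": [2, 3, 2, 1],
    "search": [4, 5, 4],
    "cms": [3, 2, 3],
}
```

Here `struggling_teams(survey)` would list `checkout` first and then `cms`. The result is a very rough signal and a starting point for conversations, not a measurement of cognitive load.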

Could there be some things that are candidates for pushing into a platform? Don't rush ahead and do it, but come up with a candidate list to start with and have some conversations.

We're looking for missing skills or capabilities. It could be that within the team there are actually missing skills. It could be that the organization as a whole has missing skills.

If we adopted these three team interaction patterns that we saw earlier on — close collaboration, where we know our cognitive load is going to be higher; X-as-a-Service, where we know we're just supposed to consume something; or facilitating, where we're helping or being helped — what would happen if we adopted these patterns? How would teams actually react and behave in this context?

You need to sense your organizational situation, your maturity, or the dynamics within your organization, as to where to start to apply some of these practices. Don't just rush in and do it.

Is your platform well-defined? If not, go ahead and define it really quite carefully. You'll probably be surprised to find far more services than you expected actually being run by a small group of nearly burnt-out platform engineers, and so it's time to do something about that.

What is the thinnest platform that could work in your context? It doesn't have to be thin, but the thinnest and no more.

As I mentioned, here's the book. We've got the signing at 7:30 this evening. It goes on sale in September. You can pre-order now if you go to teamtopologies.com/book. Bookstores all over the world are currently stocking it, which is great.

We've got some training. If you're interested, give us a shout. You can sign up for some news and tips if you go to teamtopologies.com. Thank you very much, everyone, for attending today. Hopefully it was useful.