How and Why to Design Your Teams for Modern Software Systems
For effective, modern, Cloud-connected software systems we need to organize our teams in certain ways. Taking account of Conway’s Law, we look to match the team structures to the required software architecture, enabling or restricting communication and collaboration for the best outcomes.
This talk will cover the basics of organization design, exploring a selection of key team topologies and how and when to use them in order to make the development and operation of your software systems as effective as possible. The talk is based on experience helping companies around the world with the design of their teams.
Full transcript
Matthew Skelton
Today, I want to share some experiences and some ideas around four things: Conway's Law. If you're playing buzzword bingo, you can tick off Conway's Law now, that's fine. Cognitive load and how that affects teams. Some real-world team patterns, topologies, as we call them. And then, more importantly, some guidelines for team design. So that's what we'll look at today.
The slides will be online later today, so don't worry about taking all the details down. This is me. If you use Twitter, it's `MatthewPSkelton` as the Twitter handle. I'm co-founder of Skelton Thatcher Consulting. I've written one book, Continuous Delivery with Windows and .NET. That's a free download from the O'Reilly site. I have one paper copy with me today for the most interesting question at the end of the session, or most interesting conversation that I have, so bear that in mind.
I'm writing a book on software operability as well. If you like what you hear today, if you like the concepts around team design, then we're running a workshop in London in September. It's actually the 27th of September. So there's a discount code there for you to use and book it online.
We've got clients in different parts of the world, and we help them with stuff to do with DevOps and continuous delivery. Basically, we build modern capabilities by mentoring teams in client organizations. This is how we see ourselves. That's not actually my bottom, but you kind of get the idea. We're in the organization doing stuff, fixing things.
Anyway, today, this is what we're talking about: how and why to design your teams for modern software systems.
Why do we care about this? It allows us to make safer, more rapid changes to software systems. Why do we care about that? Ultimately, it gives us business agility. That's how we sell these kinds of approaches to other people in the organization, because we're aiming for business agility, to be able to change what we're doing, a bit like what Chris was saying just in the previous session.
A really key thing is to understand what we mean by team. Team is not just a collection of individuals who report to the same manager. The team is the unit of delivery for software systems. We build a team. We value how that team works as a single unit. And that has some big implications, really, for how we think about the software itself.
Crucially, we need to think about not just the capabilities the team has, but actually their appetite for doing particular things. What do they want to do? What do they want to be able to do as a team? And what kind of responsibilities should we actually assign? In which way does it make sense to assign certain kinds of responsibilities to this team, or that team, or another team? Where should the boundaries of responsibility lie?
From what we've seen, many of the problems in delivering software come from accidentally unclear boundaries between different teams and their responsibilities.
And I've got a really important assumption here: the team is a stable, slowly changing thing. We're not spinning up and dissolving and then creating a new team every three months for a new three-month or two-month project. We're building a team and investing time and effort and learning into that team.
If a chunk of work only takes three months, and then there's another chunk of work that takes another three months, keep the team and bring the work into the team. So this is something that Allan Kelly, who's done a lot of work in this area, advocates: bring the work to the team, but don't keep destroying the team just because the work is finished.
Okay, a very quick introduction to Conway's Law. Who has heard of Conway's Law before? Most people, so I'm not going to spend too much time on this. But here it is in full: "Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations."
If this is true, and increasingly, as we'll see, there's lots of evidence to suggest it is true, it has massively radical implications for how we build software.
More recently, Ruth Malan, who won an award for her work with team organization design, puts it like this: "If the architecture of the system and the architecture of the organization are at odds, the architecture of the organization wins." There's a force at play here which tends to push back against a dissimilarity. So there's a force that tries to make the organization and the software architecture similar.
And there's actually been some academic research in this. So MacCormack and colleagues, in 2012, analyzed a set of open source and closed source software products: databases, spreadsheet, word processor, and a few other things, CRM, I think. And they said, "We find strong evidence to support this hypothesis that a product's architecture tends to mirror the structure of the organization."
So it's not just a conjecture now. It's not just a heuristic. There's been enough academic research done to say this is a thing. There's something in here that's quite strong. It's not a simple linear one-to-one mapping, but there's definitely something at play here.
Back to Allan Kelly. He calls this the homomorphic force. It's a force which tends to make the organization and the software architecture the same shape.
Now, in very simplistic terms, what we're talking about is this. Say we have four teams, each with a front-end developer in blue and a back-end developer in green, and then a single, lonely database developer, a DBA. It's always a single, lonely database developer, right? Then the natural software that would emerge, just turn that diagram through 90 degrees, would be a UI layer, an application tier, and a single shared core database, because the communication patterns are being matched.
Now, if you want a single shared core database, that's fine. If you don't, then we might have a problem. So then we can use something called reverse Conway to anticipate the architecture. If we set our teams up in a way which anticipates the architecture that we want, we're not fighting against that force.
So if we want an architecture like this, some people might see that as a microservices architecture. It doesn't really matter. If we want our data to be aligned with the business domain, let's say, on the right-hand side, then we might need to bring in a data capability within the product development team. That's the implication.
Again, all we've done here is just turn that diagram through 90 degrees. This is a really, really simplistic version of the explanation of Conway's Law, but the point remains. We're thinking about the design of our teams to anticipate the software architecture that we want.
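The "turn the diagram through 90 degrees" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Team` class, the capability names (`ui`, `app`, `data`) and the rule "a capability held by exactly one team tends to become a single shared component" are hypothetical simplifications, not something taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    """Hypothetical model of a team: a name plus the capabilities it holds."""
    name: str
    capabilities: frozenset


def capability_owners(teams):
    """Map each capability to the sorted names of the teams that hold it."""
    owners = {}
    for team in teams:
        for capability in team.capabilities:
            owners.setdefault(capability, []).append(team.name)
    return {capability: sorted(names) for capability, names in owners.items()}


def shared_components(teams):
    """Capabilities held by exactly one team while other teams exist.

    Assumption: under Conway's Law these are the likely candidates for a
    single shared component that every other team has to go through.
    """
    owners = capability_owners(teams)
    return sorted(
        capability
        for capability, names in owners.items()
        if len(names) == 1 and len(teams) > 1
    )


product_names = ["team-a", "team-b", "team-c", "team-d"]

# Four teams with front-end and back-end skills, plus one lonely DBA.
before = [Team(name, frozenset({"ui", "app"})) for name in product_names]
before.append(Team("dba", frozenset({"data"})))

# Reverse Conway: bring a data capability into each product team.
after = [Team(name, frozenset({"ui", "app", "data"})) for name in product_names]

print(shared_components(before))  # ['data']
print(shared_components(after))   # []
```

In the first arrangement the data capability shows up as the one shared piece, which matches the single shared core database; in the second, nothing is forced into a shared component by the team structure alone.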
And this is heresy, right? So all the traditional software architects can burn me at the stake, because this goes against decades of thinking. But the point here is that there's evidence that this is a thing that we need to think about.
So we need to consider designing the organization architecture to help produce the right software architecture. It's not going to produce it all by itself, but to help to produce it, at least so we're not fighting against that tendency. So we'll come back to Conway in a bit.
Let's just have a little look at cognitive load and what that means for teams and software.
Cognitive load as a concept was formulated by Sweller in 1988, and he characterized it as the total amount of mental effort being used in the working memory. And it's quite important for heavily cognitive tasks like developing software.
There's different kinds of cognitive load. There's intrinsic things, like understanding just in the back of our minds how a computer works: memory, registers, TCP/IP, stuff like this, which we don't really think about all the time, but it's definitely there.
Extraneous or irrelevant things, which we should not have to think about, which get in the way. Things like, "How do I deploy that thing again?" Or, "How do I configure that service?" Stuff that gets in the way.
And then germane or relevant things that, let's say, relate to the business domain. So, in the case of Jaguar from the previous talk, how does this part of the software system interact with the drive control? Something that's really important to the business domain.
There's a really good talk by Jo Pearce called "Hacking Your Head." This was given, I think, at OSCON recently. If you're interested in cognitive load and how it affects individuals, then go and have a look at that talk there. There are some slides online, too.
Now, how this applies to teams is, for me, even more interesting. And thankfully, we've got some science, so we don't even need to invent it. There's been quite a lot of research done on the relationship between the load, or the amount of work and load we place on a team, and the resulting effect on the team.
Some of the research was done around teams in the U.S. Navy, so quite isolated, on a warship in the middle of the ocean. So something like a nice experimental environment. We can't apply it directly to software, but we can get some intuition for the sorts of forces at play.
A study by Driskell and colleagues, 1999, ended up observing that stress impacts team performance by narrowing or weakening the team-level perspective required for effective team behavior. So if we overload the team too much, the team ceases to act like a high-performing unit and starts to disintegrate into lots of individuals. And we don't want that, because teams are incredibly high-performing. If we can get the team to work as a team, it's far more effective than lots of individuals.
By the way, this is not just popular science. There's a lot of that going around. I did do some research into neuroscience a while back. So this is some speech perception research. This was brain-scanning software that I developed a long time ago in Delphi 5. Who uses Delphi 5? Anyway, I'm not an expert in cognition, but I've got a bit more background than just reading New Scientist or what have you.
High-performing teams are hugely effective compared to lots of individuals with the same manager. So we need to optimize for the team when we're building software.
The implication of this is we need to match the team responsibility to the cognitive load that the team can handle. Which in turn means that the maximum component size, or the maximum subsystem size in our software, if we're aligning teams to components, which we should be doing, generally speaking, components or subsystems or product, the maximum size of a subsystem or component should be no bigger than its team can handle.
So the size of bits of our software system starts to be determined by the cognitive load that particular teams are able to deal with. So there's a very human-centric way of thinking about the size and shape of software architecture, which again feels radically different, or feels very different, from the way in which some people think about software. We'll come back to this in a minute as well.
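One rough way to picture "no bigger than its team can handle" is to compare what a team says it is carrying against what it says it can carry. The sketch below assumes each team self-reports a load score per component on a hypothetical 1 to 5 scale and a hypothetical overall capacity; none of these numbers or names come from the talk, and there is no scientific measure behind them.

```python
def overloaded_teams(assignments, reported_load, capacity):
    """Return {team: excess load} for teams whose total reported load
    exceeds their self-assessed capacity.

    assignments:   {team: [component, ...]}
    reported_load: {(team, component): score}  # hypothetical 1-5 self-report
    capacity:      {team: score}               # hypothetical self-assessment
    """
    excess = {}
    for team, components in assignments.items():
        total = sum(reported_load[(team, component)] for component in components)
        if total > capacity[team]:
            excess[team] = total - capacity[team]
    return excess


assignments = {
    "checkout": ["cart", "payments", "fraud"],
    "search": ["indexing"],
}
reported_load = {
    ("checkout", "cart"): 2,
    ("checkout", "payments"): 4,
    ("checkout", "fraud"): 4,
    ("search", "indexing"): 3,
}
capacity = {"checkout": 8, "search": 8}

print(overloaded_teams(assignments, reported_load, capacity))  # {'checkout': 2}
```

A result like this is a prompt for a conversation about splitting a subsystem or moving a responsibility, not a measurement to optimize.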
We've been collecting together, based on the work that we do with our clients and also just some community contributions, different kinds of team configurations. They're online at DevOpsTopologies.com. It's all Creative Commons ShareAlike, so you can send us a pull request and we can evolve it.
It's been going for about four years or so, so there's quite a lot of different... We've had some validation of some of the different patterns. We've been evolving some of them, adding new ones.
There are different kinds of color schemes here. It's not come out so well here. Anyway, green is kind of DevOps-ish, at least some sort of collaboration. Yellow, this is supposed to be yellow next, development, product development, software development. Blue is IT operations or a platform, and so on. Red is something we want to avoid.
I'm not going to go through every single one of these now, because they're online. But I'll just give you a flavor. There is actually an important team type that's missing. See if you can spot which it is.
So some anti-types. A classic anti-type is just dev and ops being completely separate, throwing code over the fence. We know this stuff. This is Patrick Debois and Andrew Clay Shafer right at the beginning back in 2008 when they first started this stuff. We know about this.
This is a classic error where we try and do DevOps by putting a DevOps team right in the middle between dev and ops, and we prevent them talking to each other still. So there are ways to make this kind of team pattern work really well. But if the DevOps team stops the other two teams talking, and they should be talking, then that feels like an anti-pattern.
And there's lots of other anti-patterns as well, like, "We're in the cloud. We don't need ops." So we just have some DevOps sitting somewhere here, and then eventually you have a big car crash.
Lots of other anti-patterns. The sysadmins learn a bit of Puppet and call themselves DevOps, but they still don't talk to developers, and so on. You get the idea. There's lots of different anti-patterns in here.
There are ways of working well between development and operations which seem to be very effective. We can collaborate like this on configuration management, infrastructure automation, logging, metrics, maybe things like that, that sit in the middle. And this works well for lots of organizations.
We might actually have basically blended dev and ops together so that pretty much everyone does a bit of everything. That can work often in a startup or often in a company that's moving at very high speed with a single client base, if you like.
Some places, though, find that they don't need actually very much collaboration. So if all your stuff's running on Amazon, would you expect, as an application developer, to just call up the Amazon engineers and say, "Oh, my application, it keeps restarting or won't deploy properly. Can you help me figure it out?" Of course, they couldn't care less. Unless you're Netflix, on that scale, Amazon couldn't care less.
And that's quite right, because Amazon have got a nice API in the middle, and they've built out a platform which does everything that you should need. And some organizations run with this kind of infrastructure, type-three infrastructure-as-a-service model internally too. So they have an internal platform team that provide a whole lot of stuff. Just an API and API key is all you need. And that works. That works very well for good reasons.
And there's lots of other patterns as well. I'm not going to go through all of these. Here's where that DevOps team in the middle can help. If it helps to, over time, bring dev and ops together, that can work. Perhaps the DevOps team perpetually keeps dev and ops talking, facilitates things. That works well.
There's some more patterns and so on in here. I'm not going to go through these now. The key thing is this: there's no single right team topology. There's no magic DevOps way of working. But there will be many bad team patterns for your organization if you don't understand the context.
Which brings me onto the final section of this talk, which is some guidelines for designing teams. And we think there's basically two fundamental patterns when you think about team interaction in a DevOps context.
One is collaboration, where we deliberately bring two teams with different skills together to collaborate on something, work together.
And there's kind of "as a service," or X-as-a-service, where one team provides and one team consumes. And there doesn't need to be very much collaboration at all.
Neither is better or worse. What we end up with is collaboration is really good for rapid discovery. We avoid handoff. The downside for collaboration is that the cognitive load needed is probably higher than it would be otherwise. Each side of that collaboration needs to understand more about the other side, so we have to retain more in our heads.
But we might be willing to pay for that, pay a tax, if you like, on that because we want to go very quickly. We can innovate very rapidly.
Contrast that with X-as-a-service. There's great clarity about who owns what. The blue team owns this bit and the yellow team owns this bit. There's less context needed, so the cognitive load on each side can be lower than if they had to collaborate. But perhaps as a whole, we're innovating slower here precisely because we've got a nice clean API that's locked things down.
A couple of other patterns that are worth mentioning, too: the supporting or productivity team that helps other teams to do useful stuff. And teams that are aligned to a business domain, so product teams, or sometimes called feature teams. There's a diagram in a few slides which shows all this amalgamated.
Based on what we've seen, and what we've read, and what we've talked to people about, there's four fundamental types, three of which are basically essential. And if your teams don't map to these, then have a think about why and how. One of these is optional.
So the three team types that are essential, in our view, are the three Ps: product, platform, and productivity. So make sure you organize software development, generally speaking, into product or feature teams. Organize your underlying infrastructure-type stuff as a platform on which we can deploy things. And make sure there's a team or a capability that enables productivity across the other development streams, cross-cutting to the business-domain concerns.
You may get some value out of a component team or component teams, but generally speaking, we want to minimize the number of component teams as a rule of thumb.
So what does that look like? If we had four product teams here, the three at the bottom don't need to collaborate with ops at all, or the platform team at all, because they're just consuming the platform, and it works for them. It works very well.
The team at the top here, the product team at the top, is collaborating with the platform team. Maybe they're moving from virtual machines to containers, or moving from one container fabric to another, and there needs to be a very high degree of collaboration to make that stuff work. This is just a snapshot, by the way, a point in time.
We may or may not have a component team. Let's say we're doing some video processing and there's a component which needs people who understand matrix maths. We can't expect to embed that capability within every single product team. It would be too expensive. It would be too difficult to find. So we break out a small piece that really is specialized, and we have that off to the side.
But we definitely also have a productivity team that is cross-cutting to the business-domain-aligned teams to help them with things like build engineering, deployments, deployment pipelines, containerization, other kinds of scripting, test automation, all sorts of stuff like this. Otherwise, if that capability lived only within each individual product team, things would never get done. There wouldn't be enough natural alignment of purpose to make sure that stuff's happening.
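The four team types described here (product, platform, productivity, and the optional component team) can be written down as a small check against a list of real teams. This is a minimal sketch: the `TeamType` enum, the `review_team_types` function, the example team names and the `max_component_teams` threshold are all hypothetical, chosen only to illustrate "three essential types, minimize component teams".

```python
from collections import Counter
from enum import Enum


class TeamType(Enum):
    PRODUCT = "product"
    PLATFORM = "platform"
    PRODUCTIVITY = "productivity"
    COMPONENT = "component"


# The three types the talk treats as essential.
ESSENTIAL = (TeamType.PRODUCT, TeamType.PLATFORM, TeamType.PRODUCTIVITY)


def review_team_types(teams, max_component_teams=1):
    """Return a list of findings for {team name: TeamType or None}.

    None means the team does not map to any of the four types.
    max_component_teams is an arbitrary, hypothetical threshold.
    """
    findings = []
    counts = Counter(kind for kind in teams.values() if kind is not None)
    for essential in ESSENTIAL:
        if counts[essential] == 0:
            findings.append(f"no {essential.value} team")
    for name in sorted(name for name, kind in teams.items() if kind is None):
        findings.append(f"{name} does not map to a known team type")
    if counts[TeamType.COMPONENT] > max_component_teams:
        findings.append(
            f"{counts[TeamType.COMPONENT]} component teams; consider reducing"
        )
    return findings


teams = {
    "checkout": TeamType.PRODUCT,
    "search": TeamType.PRODUCT,
    "infra": TeamType.PLATFORM,
    "video-codec": TeamType.COMPONENT,
    "integration": None,
}

for finding in review_team_types(teams):
    print(finding)
# no productivity team
# integration does not map to a known team type
```

Each finding is a question to ask about the organization rather than a rule that must be satisfied.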
So just to summarize again, we use this kind of collaboration when we're in the context of discovery or rapid learning. We expect to collaborate with another team with quite different skills when we're in that mode. It's a particular mode of interaction.
When we've actually got something that we can deploy onto, then we don't need that collaboration. We can achieve predictable application and software delivery by making sure we've just got the right API. We don't need to have that kind of interaction.
And of course, from a Conway's Law point of view, the implication here is that with this discovery and rapid learning, the responsibilities and the architecture of the software are going to be a bit more blended together at this point. By the time we get to this nice, clean separation, we'll have, in this diagram, two nice components with a nice interface, because that's how Conway's Law would tend to act on the software we're building in these situations.
If you look at how, particularly, Cloud Foundry arranges their security boundaries, they align their security boundaries with application and platform for exactly this reason. Cloud Foundry, Pivotal have done the discovery. They've done that collaborative discovery to work out what the platform needs. So all you need to do is deploy your applications on top.
So the fact that Cloud Foundry advertise a clean segregation of duty between application and platform is for exactly this reason. They've done the discovery already. No need to do it.
So this implies there's a kind of evolution of team interaction over time, depending on what we're doing. If we're in a discovery phase, we'd have some degree of collaboration.
But actually, that kind of collaboration often doesn't scale across organizations. And what we're looking to do is, for many of the development teams, try and establish a nice platform that they can just use. So moving from light discovery to established delivery over time.
At a slightly larger scale, this might look something like this, where perhaps this team one in the middle continues to be a discovery team, working closely with an ops or a platform team. After a while, team two can take that version of the platform and use it. It's good enough for them. They can now start delivering applications in a nice, stable way, and they can continue to use that version or those features, at least at that point.
But actually, teams three and four and five, whatever, they need to wait a little bit longer until some extra features are available before they can actually use the version of that platform.
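The snapshot just described, with one team discovering alongside the platform team, one consuming the platform as it stands, and others waiting for features, can be sketched as a simple classification. Everything named here is an assumption for illustration: the `Interaction` enum, the feature names, and the idea that a team's needs can be listed as a set of platform features.

```python
from enum import Enum


class Interaction(Enum):
    COLLABORATION = "collaboration"
    X_AS_A_SERVICE = "x-as-a-service"
    WAITING = "waiting for platform features"


def interaction_modes(team_needs, platform_features, discovery_teams):
    """Classify each team's interaction with the platform at a point in time.

    team_needs:        {team: set of platform features it needs}
    platform_features: set of features the platform offers today
    discovery_teams:   teams deliberately collaborating with the platform team
    """
    modes = {}
    for team, needs in team_needs.items():
        if team in discovery_teams:
            modes[team] = Interaction.COLLABORATION
        elif needs <= platform_features:  # everything needed already exists
            modes[team] = Interaction.X_AS_A_SERVICE
        else:
            modes[team] = Interaction.WAITING
    return modes


platform_features = {"deploy", "logging"}
team_needs = {
    "team-1": {"deploy", "containers"},
    "team-2": {"deploy", "logging"},
    "team-3": {"deploy", "metrics"},
}

modes = interaction_modes(team_needs, platform_features, discovery_teams={"team-1"})
for team, mode in modes.items():
    print(team, "->", mode.value)
# team-1 -> collaboration
# team-2 -> x-as-a-service
# team-3 -> waiting for platform features
```

Rerunning the same classification as the platform gains features gives a different picture, which is the slow evolution of interactions over months.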
The implication here is different people in different teams, the experience of different developers and operations people in this organization will differ depending on the kind of thing they're working on at that point in time.
So people in this team might say, "Well, why are we not collaborating with the ops people?" Well, think about Conway's Law, and we want to help firm those boundaries. You shouldn't need to collaborate because you should just be able to consume that version of the platform. And we want to make sure that then that API, the clean break, if you like, between those two parts is retained.
So here's the implication: we need to evolve different topologies, different team interactions, for different parts of our organization at different times, based on what we're doing, based on what we're trying to achieve. Are we trying to discover things? How rapidly do we need to discover things?
There might be times when actually we need to co-locate people, like at the same desks. Or maybe sometimes it's fine to have them on the same floor, or just in the same building, and that's enough kind of proximity. And sometimes we might deliberately decide to have people on separate floors, because actually you want to just slightly help enforce an API boundary, and a slight distance in terms of communication might help.
But the team topologies change over time, slowly over months. We're not talking about every day or every week, but slowly over months, you'd see a change in the team interactions, and you'd expect a corresponding change in what the software architecture looks like.
Quick summary. This is Conway's Law, a really simplified version. So if we want to have an architecture like this at the bottom right, then we need to change our team to anticipate it. So design the organization architecture to produce the right software architecture, or to help produce the right software architecture.
This is really about an orientation. We're really kind of pointing north instead of pointing south or east or west. It doesn't take us all the way there, but at least we're not walking in the wrong direction.
We need to think about how cognitive load and stress affect how the team works as a unit, and therefore match the team responsibility to the cognitive load that that team can handle. And that then has an implication for the size of the software subsystems and components that we build, and we maybe chop up a monolith and that kind of thing.
DevOpsTopologies.com has that catalog, if you like, or that list of different kinds of team interactions and where they're suitable and where they're not.
There's no single right team topology, but several bad ones. And these four, certainly these three beginning with P, product, platform, and productivity teams, seem to be very effective. You may or may not need a component team. And if your teams feel like they don't fit into that pattern, it's probably worth thinking, well, what would it take for us to be able to fit them into that pattern?
And we should expect team topology interactions to evolve over time. Slowly, but certainly evolve depending on what we're doing, whether we're discovering new things or whether we're trying to establish a nice delivery cadence. So basically, evolve different team topologies for different parts of the organization at different times, depending on what we're doing.
Now, a little word of caution. These team topologies and this kind of thing will not in themselves produce good software, right? Like I said, this is an orientation, so we're facing in the right direction. We're not pulling against the tendency from Conway's Law and so on.
We also need loads of other things, including strong culture, the right kind of culture. We've heard a lot about that in this conference. Good engineering practices. We need to take the time to invest in proper engineering practice: unit testing, proper TDD applied correctly, continuous integration, deployment pipelines, configuration versioning. Even if all of this stuff feels quite boring, we need to have that discipline in place.
We need to understand how money and funding affect the behavior of teams. We went to one client where we found that the product managers had a monthly financial bonus for delivering 20 features every month, at the same time as the sysadmins, or the people on the support desk, had a monthly financial bonus for closing tickets within, I think it was 10 days. And that was happening at the same time.
And so that's money having a really incredibly negative effect on how the organization was able to do things and fix things quickly.
Ultimately, as well, we need to actually insist from a business or organizational point of view that the vision for what we're doing is strong and clear. If there's a lack of clarity about what the organization itself is trying to achieve, as technologists, we can't be expected to build anything that meets that.
So we should be, once we've got our house in order, or as we're getting our house in order with some of these things around team topologies and practices and culture and things, we should then start to have the confidence to say, "Well, look, your vision is not clear."
And we care about this because safer, more rapid changes to software systems give the organization that agility which is going to be needed. Certainly, in the UK and in Europe, with lots of different regulations coming in, lots of different legal changes, Brexit, whatever, this is where organizational agility is incredibly important.
Again, if you're interested in more depth on this, there's a workshop in September. So there's that discount code, `DOES17`. Plug that in and go to SkeltonThatcher.com. At some point soon, though I'm not going to say when, there'll be a book coming out that talks about this stuff in some more detail.
And I think that's all I've got. Any questions?
Q&A
Q: You said about cognitive load, measuring cognitive load, so you don't put too much on the team. So how would you go about measuring a team's cognitive load? And how would you relate that to an effective software architecture?
A: So, the question was about measuring cognitive load. I think what I said was match the cognitive load. The simple way to think about cognitive load is just to ask the team. Ask, "As a team, do you feel like the bit that you're working on is too big for you to actually work with effectively?" And base your response on that.
It's hard. There's no scientific measure for that, but you'll get a strong sense from the team whether they're overloaded or not. It's the best place to start.
Thanks very much, folks.