Microservice Teams – How the Cloud Changes the Way We Work

Las Vegas 2024

There are a lot of technical challenges and complexity that come with building a cloud-native, distributed architecture. The way we develop backend software has fundamentally changed in the last 10 years. Managing a microservices architecture demands a lot of us to ensure observability and operational resiliency. But did you also change the way you run your development teams? Sven will talk about Atlassian’s journey from a monolith to a multi-tenanted architecture and how it impacted the way the engineering teams work. You will learn how we shifted to service ownership, moved to more autonomous teams (and the challenges that came with that), and established platform teams.

Full transcript

The complete talk, organized by section.

Sven Peters

Let's get started here. I have 100-something slides in just 30 minutes, so be prepared.

Let's talk about microservice teams. This is actually our story from Atlassian. I'm a DevOps advocate at Atlassian, and this is how we actually went to the cloud and thought, "Oh, technically everything goes smooth" — but then realized how we needed to change the teams: how they work, what that means. Because we went from a monolith application to a microservice architecture. But what we realized is that teams needed to change too, and that takes much longer than just the technical migration. Actually, the team migration took much longer. You'll see that through our journey.

Our journey — 2002 to today

I'd start with guiding you a little bit through our journey, from monolith to microservices.

2002 — Jira. It started in the year 2002, when our founder wrote the first lines of code for Jira. At that time, Jira was downloaded by companies, installed on their server, run on their server. We didn't have anything to do with the IT infrastructure. We didn't send out CDs at that time, but people could download the software.

2004 — Confluence. Two years later, they wrote Confluence. Everyone's happy.

2012 — OnDemand. Until some customers came to us and said, "Atlassian, we would love to have your software somehow as a service in the web. Could you provide that?" In the year 2012, we built our so-called OnDemand platform where we ran our software, and people could just buy the software and use it — like what you normally do now with software-as-a-service.

The Unicorn platform — not really cloud-native

At that time, we said: "Okay, we're going to build it on something we called the Unicorn platform." (We are fans — this whole project was called "The Rise of the Machines." I think we should have waited until now, when the AI revolution comes, and called that "the rise of the machines." But hey, we called it that at the time.)

Unicorn was just a stripped-down Linux version where we ran Jira and Confluence for our customers. So you might think, "Hmm, that's not really cloud-native." It wasn't. It was a single-tenant app: one customer got one JVM that we ran Jira in. You might also know that this can be difficult to scale. If you have 5,000 users, probably no problem in one JVM. But if you have 30,000 users, like some customers have, we really ran into problems at that time.

Same codebase, two deployment models

We had these two options for our customers — server option and cloud option. It was actually the same codebase. We were running the same codebase on both, but our devs and product managers really loved how the cloud enabled them to do more releases, much faster release cadence.

Imagine: server-side, we just released once a quarter, a new version. People needed to download it, install it. We couldn't release more often. But in the cloud, we could release on a weekly basis — started weekly, turned into daily. So we could push out the software very fast. We could all of a sudden run growth experiments — does that work out, does that not work out? All these benefits we had all of a sudden in the cloud.

Our server customers also benefited from that. Once in a while we said, "Okay, now we need to make a release for the server customers." So we just branched — created a release branch — and gave the whole thing to the server customers. We could operate with feature flags on the cloud side, which was also awesome. We could release even more, even faster. We didn't have to finish a feature.
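To make the feature-flag point concrete, here is a minimal sketch of how an unfinished feature can ship behind a flag. It is an illustration only, not Atlassian's implementation: the `FeatureFlags` class, the `render_issue_view` function, and the flag name `new-issue-view` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store; a real setup would load flags from a service."""
    enabled: set[str] = field(default_factory=set)

    def is_enabled(self, name: str) -> bool:
        return name in self.enabled


def render_issue_view(flags: FeatureFlags) -> str:
    # The unfinished code path is deployed but only runs when the flag is on.
    if flags.is_enabled("new-issue-view"):  # hypothetical flag name
        return "new issue view"
    return "classic issue view"
```

With the flag off, every user gets the existing path, so the code can be released before the feature is finished.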

Why same-codebase strategy stopped working

But with this model also came some challenges. We were running things on the same codebase. We still had to provide access to different databases. Whatever the server version needed was also in the cloud version. It was really, really hard to maintain these extra things for the cloud. It was just holding us up in speed. When we were saying, "Okay, we're making a release for our server," we had to stop and make sure that everything was packaged for our customers, and then release the server version. That was actually slowing us down totally.

And then also we couldn't use the new cloud-native stuff. It was just running on this Unicorn platform.

2016 — Project Vertigo

So we started a new project — a third project, actually — called Project Vertigo, where we said, "Okay, we now just build a cloud-native app, microservice architecture, on AWS."

We decided on stateless, 12-factor apps. We sliced and diced the monolith. It was more evolution than transformation: we started to slice it here a little bit, there a little bit. We didn't want to build the whole thing from scratch. We said, "Okay, we take what we have." That was also for our customers: they're used to the interfaces and how Jira works. We couldn't just reinvent Jira. We wanted Jira, but just to start running it on AWS.

Also with the teams, we started slowly. We had our best programmers move over to this project so they could set the basis for people to come. Then we moved teams slowly over to that migration project.

The Big Migration — 2017 to 2018

This whole migration of our monolith onto the AWS platform took us around one year. We started out in 2016, and 2017 — great — we were done with the migration to this platform. Now comes the hard part: the customer migration. We needed to migrate the customers over from the Linux OnDemand Unicorn platform to the new Project Vertigo AWS platform.

Migrating 90% of the customers took us half a year. The hard part was the rest: the last 10% of the customers took us another half year. There were some special cases: they had some apps (you can install apps in Jira from other vendors) that didn't work on the other platform. Two or three customers couldn't migrate, so they needed to use the server version. But the rest, we migrated by the end of 2017. So we had zero customers on this Unicorn platform anymore; that was the number on the board that the engineers had in their office. Our developers were happy that the Unicorn was dead.

The developers were really struggling with all the double codebases. We forked the codebase, of course — so we had two versions running, one team still building the server for our server customers, and another big team building all this AWS stuff in the cloud.

We did the tech — what about the teams?

So great — we have this migration, we have a distributed architecture now in the cloud, shared microservices. There was still some work to do — we wanted to slice and dice things more — but generally it was working. Our customers were happy. Great. But what about the teams?

This is really what I want to talk about today.

Back in the days, we had our teams built around Jira — feature teams, value-stream-aligned teams, IT teams responsible for keeping the Unicorn platform up and running, design teams doing the design, QA teams making sure that the quality of the software is all right. That was working.

But now we move to a microservice architecture, not the monolith anymore. So we built teams around microservices. They were owning it end to end — responsible from the idea to the operational part of things, running things in the cloud. So we needed to see and change a lot of stuff about how we run those teams.

Now we have these autonomous teams running things autonomously, responsible end-to-end. How do we do that? Today I'm going to talk about four areas:

1. Tooling — what we create for them. 2. Autonomy — the philosophy of how to make them autonomous. 3. Support — how to support those autonomous teams when they're not experts in everything. 4. Alignment — how to align small, nimble teams running in different directions so they all run the same way.

You won't see the complete picture. I'll give you a couple of examples — but you'll get the philosophy behind it.

Tooling — Micros platform

How do we change the tooling? Now that we've sliced and diced the monolith, we have around 17,000 components that make up all of Jira and Confluence. That means for us 17,000 times having Docker container configuration, 17,000 times having CI/CD configs and pipelines, and 17,000 times needing to add observability.

You can imagine — there's a lot of copy-pasting from one microservice to the other. You copy the things over and change a little bit, then change a little bit here. This is total chaos. You don't want to have that. You want a structured way of doing it.

So one of the first things our engineers realized was: we need to find some standardization. They built a platform they called the Micros platform. Sounds big — but it was a command-line tool. You can just say, "I want to spin up a new service," and then it says, "Okay, I need a Spring Boot service." And it asks you: "Do you want to write Java code or Kotlin code?" You can decide. It was building that — a template engine, basically. A lot of templates in the Micros platform.
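The template-engine idea can be sketched in a few lines. This is only an illustration under assumptions: the talk does not show the real Micros templates or CLI, so the file names, the template contents, and the `scaffold` function below are hypothetical.

```python
from string import Template

# Hypothetical templates; the actual Micros templates are not shown in the talk.
TEMPLATES = {
    "service.yaml": Template(
        "name: $name\nruntime: spring-boot\nlanguage: $language\nobservability: enabled\n"
    ),
    "README.md": Template("# $name\n\nOwner: $owner\n"),
}


def scaffold(name: str, language: str, owner: str) -> dict[str, str]:
    """Render every template for a new service and return {filename: content}."""
    if language not in {"java", "kotlin"}:
        raise ValueError(f"unsupported language: {language}")
    values = {"name": name, "language": language, "owner": owner}
    return {filename: tpl.substitute(values) for filename, tpl in TEMPLATES.items()}
```

Calling `scaffold("billing", "kotlin", "team-payments")` returns the rendered files. Because every service starts from the same templates, defaults such as observability are present from the first commit.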

As we heard this morning, it lowers the cognitive load of the developers. They don't have to understand all the configuration. They can just say, "I want this service" — and what normally takes a day, they could do in a few minutes. So they were up and running very fast. It helped them to create value fast for the new microservices.

But also the services themselves are better services, because they're actually configured by experts — not by someone who just copied over, changed a few parameters, sees it's running, and keeps it like that. They're run by the experts. If we need to change something for all services, we know all the configurations — they're all based on the Micros platform or Micros templates. Our SREs got happy because they knew there's not a service somewhere in the cloud running where we have no idea what it's doing. We have added observability from the get-go with these templates, because it's built in.

We needed, of course, to educate everyone. So first it started out with just a few templates. Now we have a lot of things on the Micros platform. With 17,000 components, we need to educate developers on the platform. So we built a whole portal, training program, videos, self-paced training. There's a group going in and training new developers. We have built that infrastructure for our developers.

Microscope → Compass (the developer-experience platform)

This makes our developers happy because they can create a lot of services, and they're doing that. They have all those small little microservices — service, service, service. But what are all these services doing? What are the incoming and outgoing dependencies of the service? Who's owning the service?

We needed something that documents what the services are doing and who's owning them. We're a technology company, right? Very sophisticated technology company. Our developers are very advanced developers. So what they did is — like every other company — we created a spreadsheet. With the services, ownership, incoming dependencies, outgoing dependencies. Great. It works for 100 services, it works for 200 services. With more than 200, it's getting a little bit out of hand.

So what did we do? We created more spreadsheets. Yeah, that's what you do, right?

At the end, we said, "Okay, we need to do something." One team said, "Let's build something for that." We called it Microscope — our documentation tool for the microservices. You see one service: incoming dependencies, outgoing dependencies, ownership, all the things.

Our customers asked us, "What are you using for a component catalog for your microservices? What are you using for documentation?" We said, "Yeah, we built our own thing." "Alright, can you just give it to us?" We said, "No, it's just built for us, internally" — until we said, "Okay, we make it as a product." Now you can see it's called Compass. It's a product that you can buy from Atlassian for your microservices.

It's not just a documentation tool for microservices — it's more like a communication tool for teams. Because I don't have to tap everyone on the shoulder anymore. I can just go into Microscope and see: "What is this service doing? What are the dependencies?" I can also announce API changes — if I have breaking changes for my microservice, I can announce that, and everyone that depends on the service, every team that depends on the service, gets updated and says, "Hey, something will change here." Also if the functionality inside the service changes, I can inform people through Microscope.
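A service catalog of this kind can be pictured as a small data structure. The sketch below is an assumption for illustration, not the Microscope or Compass data model: the `Service` and `Catalog` classes and their method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    owner: str  # owning team
    depends_on: list[str] = field(default_factory=list)  # outgoing dependencies


class Catalog:
    """Hypothetical in-memory component catalog."""

    def __init__(self) -> None:
        self.services: dict[str, Service] = {}

    def register(self, service: Service) -> None:
        self.services[service.name] = service

    def dependents_of(self, name: str) -> list[Service]:
        """Incoming dependencies: every service that lists `name` as a dependency."""
        return [s for s in self.services.values() if name in s.depends_on]

    def teams_to_notify(self, name: str) -> list[str]:
        """Teams that should hear about a breaking change in `name`."""
        return sorted({s.owner for s in self.dependents_of(name)})
```

Storing only outgoing dependencies and deriving the incoming ones keeps a single source of truth, which is what a pile of spreadsheets lacks.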

We built all of this. You hear this term a lot here — a developer experience platform. This all turned out as a developer experience platform where we have Micros, Microscope, all on the platform. We want to make the standard easy for people to take. They should just pick the standard microservice. But at the same time, also allow people the autonomy to run fast. We don't want to be the blocker for them. If they need something special, they can do it. But the standard is so easy to take. Maybe it's giving them 90%, but they say, "Okay, this 10% I don't care, I just take the 90%, that's way easier."

Autonomy — teams improve themselves

Now: how do we make those teams autonomous? We built it so they should build the microservices autonomously, with end-to-end ownership. But how do we make sure the teams are also improving autonomously? We don't want to have a big engineering effort. They pretty much know what's the problem inside their teams.

First, of course, you can say: "Okay, just track those metrics." You've heard of them: time to restore, change-failure rate, deployment frequency, lead time for changes — the DORA metrics. Great basis. Give them to your teams.
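As a rough sketch of what tracking those four metrics involves, they can be derived from a list of deployment records. The `Deployment` record, its fields, and the `dora_summary` function are assumptions made for this example, and the definitions are deliberately simplified.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only once a failure is fixed


def _hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_summary(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Simplified versions of the four metrics over one reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.failed]
    restore_times = [
        _hours(d.deployed_at, d.restored_at) for d in failures if d.restored_at is not None
    ]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "lead_time_hours": mean(_hours(d.committed_at, d.deployed_at) for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "time_to_restore_hours": mean(restore_times) if restore_times else 0.0,
    }
```

The numbers are easy to compute; as the rest of this section argues, the harder part is whether they describe the problem a given team actually has.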

I recently talked to a CTO and he said, "Okay, I rolled out the DORA metrics to all my development teams." I asked him, "Tell me, how did it go?" He said, "Actually, not so good. We rolled them out, but no one was actually using them, because they didn't fix their problems."

Teams have different challenges. You need to find out what the challenges are, and the teams pretty much know what their challenges are. So every team has different challenges, and we want to acknowledge that.

In the Microscope/Compass platform, we added scorecards. People can put in scorecards to measure their productivity. It's a little bit like DORA metrics, but they have recent incidents, deployment frequency. They want more than five deploys here, and they just did three deploys — so it shows a red line. The team has a problem. Or — do they really have a problem? Maybe they just had three deployments, but they're currently working on a very complicated part of the subsystem, and three deployments are alright for them. They want five, but right now they can't deploy that often because they know, "Okay, when we fix that, we will go back up."

So you always need to add context to those metrics. They don't live on their own. That's why we have a ritual we call CheckOps. CheckOps is a weekly ritual where the teams sit together, watch the metrics, and discuss them. They say, "Yeah, we know that deployment frequency is down right now, but it's not really a problem. When we don't work on this complicated part anymore and add more features again, we'll see deployment go up again." We write everything down. CheckOps is available for the teams.
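The idea of a metric that carries its own context can be sketched as follows. This is a hypothetical model, not the Compass scorecard format: the `Metric` class, its fields, and the status labels are assumptions, and the target is treated as "at least" for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    target: float
    actual: float
    higher_is_better: bool = True
    context: str = ""  # note the team adds during its review ritual

    def status(self) -> str:
        if self.higher_is_better:
            met = self.actual >= self.target
        else:
            met = self.actual <= self.target
        if met:
            return "green"
        # A missed target with an explanation reads differently from an unexplained one.
        return "red (explained)" if self.context else "red"
```

For the example from the talk, `Metric("deployments", target=5, actual=3)` reports `red`. Once the team records a note such as "working on a complicated subsystem", the same numbers report `red (explained)`.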

We put those ways of working into a Team Playbook — practices like how to do standups, how to do retrospectives, good practices we collected from all kinds of teams and spread to all teams. Teams can pick and choose what practice they want.

We talked to customers about it. They said, "Yeah, our teams use the playbook for all kinds of things. Can we have that?" We said, "Yeah, sure" — and just released it. atlassian.com/team-playbook — it's free. Practices we use internally at Atlassian. That works for us.

The Team Playbook makes the standardization very easy, but the teams can pick the plays they actually need right now. Maybe they don't need to do a standup anymore because they're sitting together in the office the whole day. We don't demand that.

Support — Quality Assistance, not Quality Assurance

Going from autonomy to support — these are small microservice teams, responsible end-to-end. We need to help them become better testers, better designers, better product managers. We can't put designers, product managers, and testers in those teams. Our ratio from QA to dev is one to 30. We don't have enough QA people to put on every team. That doesn't work.

But we also don't want to make the QA person the bottleneck — they shouldn't be responsible for five teams, because then they become a bottleneck. So we need to educate our developers to become better testers. We looked at it and said: developers are doing the automated testing; QA people are doing the exploratory, the manual test. So we give the manual test to the developers. And we call QA not "Quality Assurance" but Quality Assistance — they assist the developers in doing the testing part.

We do some rituals like a QA kickoff. The QA kickoff goes before the developer implements the code. The developer says, "I have a user story here. I need to implement that. QA person, can you help me write a test plan?" The QA person comes in, they write a test plan together. The developer learns how to be a better tester. Not every developer requires a QA person coming in, but at the beginning they do. Then the developer goes on, writes the code, takes the test plan, tests the software, and deploys the software. No QA involved anymore.

Well — that's not entirely true. There's a demo, a little bit of a safety net. The flow is: QA kickoff → implementation → testing → demo → deployment. The demo is where the QA person comes in. We want to lower the dependencies on QA — they don't want to be the bottleneck anymore — and increase the developer autonomy, but give them support so they can reach out and get help whenever they need it.

Support — Design system

Same thing for design. Who's been using Jira since 2015? Keep your hands up if you've used Jira since 2010. A few. Since 2005? (Thank you very much for staying with us so long, because Jira wasn't really pretty in the beginning. 2005: this was Jira. We thought we don't need any designers; developers don't need nice designs, they can just deal with it. Functionality is king.)

Until we heard from developers, "Okay, we also appreciate design." This is Jira now — looks much nicer.

But also with designers, we want to have the design dependencies down and the autonomy for our developers up. The developer needs to do some kind of design, with support. If they don't get support, it looks a little bit like: I got these standard bricks and I need to build a castle, and this would be my castle. It's a nice castle… not nice-looking. But what Lego does — and what we do — is give them special bricks, give them instructions, they can build great castles.

That's what our designers are doing. They built a design system, and you probably have a design system too. The design system helps developers pick a component — say, they need a tab. This is also public, because we have app developers that develop apps for Jira and Confluence. So you can find it on the internet. They can copy the code for the tab and put it into their code. We have a lot of education material so developers get trained to become better designers, and know when to reach out to the designer and when they can just take something from the design system.

Also standardization: we want to make it easy to have a design system. They can just pick components. They don't have to build it on their own. But it allows for autonomy.

Alignment — Demo culture

The last thing: alignment. How do we align those microservice teams, the small, nimble teams?

We built this microservice architecture, built teams around microservices. We have a developer platform that supports those teams. We give them guidance with the Team Playbook. We give them support by thinking a lot about how to remove bottlenecks. The philosophy is we want to make our development teams fast and nimble and work autonomously.

The truth is, there is no autonomy for developer teams. There are dependencies all the time. How do these teams keep updated? The design team wants to know, "Let me see what you're building." The marketing team wants to know, "What's the current state of development?" The quality team wants to see how you implemented it. All of these things — there are dependencies in those teams still.

What we do to update all those stakeholders is run regular demo sessions. Back in the days, our demo sessions looked like this — people in a room, someone demoing. Nowadays, we're all distributed. This doesn't work anymore. We're also not doing Zoom calls. We're doing videos. We're recording a demo with a video and sharing it out to those stakeholders.

You might have questions: how do you get reactions from people? Like if you're in a room: "Oh, that's a great idea" — or someone laughs at a joke, or someone has questions. That's why we use a tool called Loom. You can have a lot of reactions, comments in those videos. There's a lot of engagement — or there can be a lot of engagement. We measured it: we have a 50% increase of engagement with those videos. Back in the days, not everyone was attending the Zoom calls, but with videos, we have much more engagement.

We set a goal — we called it the demo culture project. The goal was 5,000 demo videos in 2024. I recently checked the numbers, and we're now at 15,000 demo videos already. So it's a real hit. Everyone loves to see what's going on in the software development. I'm actually following some teams because I want to know how our new AI features are evolving. I'm watching these videos. I would never be in the Zoom call to do this, but I can now see what's going on.

Standardization is made very easy: we say, "You need to do videos." How the teams carry that out is up to them.

Closing — change is the only constant

So: tooling, autonomy, support, and alignment. These were just a few examples to support developers becoming small and nimble.

Now you might say: well, did the move to the cloud — the microservice architecture — really change the way we work? I don't know. Did all the learnings, all the involvement, change the way we work? I don't know. Or did just the whole growth that we had in the last years of customers change the way we work? I don't know. I can't tell. It just happened. It was just — not a question of "what changed the way we work?" but rather "we need to change the way we work — how do we change?"

And the thing that I just presented you is probably three months old. We already have changed again. We will continuously change the way we work.

This is us — this is Atlassian. It's probably different in your company. You need to find your own philosophy. We're different. We have 300,000 customers, 5,000 engineers, a 20-year-old product, and we're operating in a very competitive market. You might have different things — might put more emphasis on other stuff in your teams. But for you, too, one thing is for sure: change is the only constant.

Thank you very much. Thanks for your attention. Enjoy the rest of the summit.