From Crash to World Class - Scaling Agile at Ericsson

London 2017

Paul Madden

Head of Product Development · Ericsson

Gerald Curran

Agile Line Manager · Ericsson

A big part of Ericsson’s research and development is in the network management domain. Network Analytics & Management is a product development unit (PDU) in Ericsson that develops and delivers management solutions to maximize the value of Ericsson networks for our customers.

In 2013, we set out to create a next-generation management system using an Agile approach. Agile had been implemented in previous projects with varying degrees of success. This was an opportunity to apply all our learnings to a new greenfield project.

We started small by deploying a proof of concept in parallel with the legacy system. We learned a lot about the architectural needs and scaled the number of teams to 30. The project was on track and the system went live with a lead customer in trial phase.

Then came the corporate acknowledgment of the PDU’s success, and a large budget was put in place to deliver to our customer’s needs faster. We analysed the situation internally and decided to stay small and build on the existing capabilities. However, weeks later we made a commitment to scaling our 30-team project to 120 teams in a period of 12 months.

Within nine months of growing to 120 teams, the project ground to a halt with major issues in deployment, architecture, development environment and project management skills. A one-year delay was announced globally.

We took some time to retrospect and reflect on the mistakes we made. We put together a strategy to get the program back on its feet. We tackled the deployment issues, invested in our CI infrastructure, created a main track function and invested heavily in our project management capability across the program. We changed our approach to problem solving by tackling big problems in small chunks. Gradually we became more predictable and there was a more continuous flow of software through our system. Every now and then we experienced design stops due to quality issues but they were quickly resolved.

Today, after tackling all these issues, we’ve come out on top. We continue to be a program of 120 teams. We’re continuously delivering shipments every three weeks to our lead customers with best-in-class engineering practices, and we’re structuring our organisation for mass rollout of the product.

We’re still on a journey. The better we get, the more obvious our deficiencies become. This presentation tells our story.

Full transcript

The complete talk, organized by section.

Paul Madden

So we're from Ericsson, and we literally connect the world. Forty percent of the world's mobile traffic is switched through networks provided by Ericsson, or supplied by Ericsson. We're about 110,000 people in total, 25,000 working in R&D, most of it software R&D, and we're about 140 years old.

We're based in Athlone, Ireland. The product development unit we work in is responsible for creating the network management tools to help operators manage their networks. We've got over 1,000 R&D engineers, and we've been doing this for the last 25 years in Athlone.

My name is Paul Madden, and I'm a member of the leadership team of the development unit that's headquartered in Athlone.

Gerald Curran

And my name is Gerald Curran. I'm one of the managers responsible for the development environment in our product development unit.

Paul Madden

We're going to tell you a story. I'll tell a bit of the story from 2009, when we started our transformation, and I'll bring it up to 2015, where we hit a significant challenge with the transformation. Then Gerald will go into more detail on the actual problems that we hit. I'll describe briefly the leadership response. And then finally, Gerald will talk about how we fixed some of those problems back in 2015.

As I was saying, in 2009 we kicked off our transformation. In 2015, two years ago, we brought in a cameraman to take a picture of our transformation, to see if he could capture the transformation in a photo. And he did, very well, actually. That was our transformation.

That was two years ago, not today, to be clear. And if you can't see the picture clearly, that's a steaming pile of manure, and we were up to our necks in it. It's not a nice place to be. I know, I was there. It stinks. But today we smell like roses, and so we'll go through that story.

To give a little bit of context of where we were in 2009: if an engineer, a developer, wrote a line of code, it was about seven weeks later before we ran a feature test case over that piece of code. That was the type of lead time.

Our systems were available to deploy to customers every six months. Most customers only took it once a year, so they took a system update every year. And those project life cycles were about a year and a half to two years long. So we'd kick off, start a project, it would take a year and a half or two years to finish.

And if a customer came with a requirement and said, "We'd really like this feature," we were typically talking about three years before they saw it, and that was on a good day. So that's where we were.

We started off the transformation with a small project back in 2009. We had been reading a lot about Agile and this transformation that was hitting the wider software industry, and we wanted to test a few things with a small project. We had about 20 people, so we said, "This is a good place to start." There wasn't so much commercial pressure on this product, so it gave us a bit of time and freedom to think and see if we could read more and execute on these new concepts.

We wanted to try out small cross-functional teams. Would they work for us? We're used to building large-scale enterprise systems. Could a small cross-functional team work in that context?

Would this idea of product owners work for us, where one person is responsible for prioritizing and breaking down a backlog and signing off that work when it's done? And this new type of leadership, where we have scrum masters who act as servant leaders, where the team is really the whole entity, and you have a servant leader facilitating the team rather than the more traditional approach of a project manager directing a team.

Could these work for us? We found that they could, which was great. That was really good for us. It gave us some confidence to bring this onto a bigger stage.

And we did that. We brought it onto a larger project where we had maybe 100 or 120 people, about 10 cross-functional teams, and we started to try new things.

One of the first things we started looking at was team stability. Now, team stability sounds quite trivial. You have a team, leave it together, and keep it together for a couple of years, you get better performance out of it. There was a lot of research done on team stability and how teams perform better if they're left together.

But culturally, that's a significant change for an organization, especially for us, if you have a little bit of a hero culture. You have a team there, you get a customer escalation over here, you pull Mary or Mark out of the team to deal with the customer escalation. They're over there for six weeks, then they're back in the team.

Or you have team A over here with some competence set. You have team B over here that needs that competence set. So you pull a couple of people out of A and move them into team B, and you're treating people like pawns on a chessboard, just moving them around.

So to try and embed team stability into the organization, that was an actual cultural change for us because you're changing the actions and behaviors of the leaders and managers in that organization. And that's something I think we evolved and did quite well with over time.

One of the next things we started to look at was cycle time. This was the first time we started to bring, say, lean manufacturing principles into our software development processes. We wanted to fundamentally break down our work into smaller chunks so it could move through the system quicker, and it became more about speed rather than resource utilization.

We wanted to get a requirement, break it into a smaller piece, and move it through the system as quickly as possible. We started to bring in some metrics around this, and one of the first metrics we brought in was around how often a team would check their code into main branch.

The reason for that was that at the time, teams were checking in maybe once a week, or more typically every second week. And we knew that if they were checking in more often, they would break the work down more often.

So we put up this metric and publicized it a bit, and next we saw a team that was checking in 60 times a day. And we thought, "Fantastic. We've hit this straight away." That was an immediate response. A couple of weeks after we put up the metric, we've got a team doing this.

Then we take a bit of a deeper dive to see this hyper-productive team and how we can learn from that, and we discover they've got an automated script that just clicks the check button every eight minutes regardless.

And that teaches us a really valuable lesson around metrics. Don't just put up a metric. Explain the purpose, and what are you trying to achieve? Make sure that's communicated clearly. The metric is giving you some sort of indicator, but don't take it as the Bible. And that was another learning for us.

Then we hit this dream scenario, and this was the perfect scenario for everybody working in software. At the time, our development unit had this flagship product, which was responsible for network management and telecoms, and it was a really large enterprise-scale product. We had well over 1,000 engineers working on it. Practically every telecom operator in the world had this system, or multiples of this system, deployed.

It was a large legacy system. Some parts of the system were over 20 years old. Thirty-five million lines of code. It was a behemoth.

And we got the green light, after a couple of years of lobbying and proposing to the CTO, to build a new system, greenfield. So this is great. Everybody loves building greenfield from the start. It's sexier.

We got the green light to kick this off, and we started building it. Over the space of about 18 months to 24 months, we ramped this up to about 30 teams working on that. And that's large scale. This is a really large-scale enterprise system, so it was going to take us several more years to get a system where we could start to retire the legacy system.

But everybody was really happy with this. Our engineers were really happy because they're working with the latest technologies across the board, all the latest technologies, and that's really cool. We had all the engineers looking to get into this new project.

Our senior leaders, our leaders, and our managers were really happy because we have this new way of working we've been trialing out with different projects, and now we get to start it with a greenfield. This is going to be a really big project, and it allows us to shine as pioneers across the Ericsson world with this new way of working.

Our customers were really happy because we had very close engagements with a couple of lead customers, and they had more influence than ever on what the shape of this system was going to look like. So they were really happy.

Our very senior managers, the senior executives within Ericsson, were happy. This is a new cash cow, our new revenue generator down in the pipeline. Our accountants were happy because part of the story was this new system and new technology, it'd be cheaper to build, so we won't need as much money being pumped into R&D.

Everybody was happy. Everybody was happy, and they were so happy that our very senior management and our lead customers asked us to accelerate the project dramatically. So we had a roadmap, and they said, "We need that accelerated dramatically. This is so brilliant. Do it way faster."

And after some negotiation and quite a lot of pressure, we agreed to add people to the project. So we were at 30 teams. We put together a plan to add 90 more teams on top of the 30 over the next nine to 12 months. And we published a new roadmap based on all these extra people working on the project, and published that roadmap out to our customers.

Now, before I go any further, I just wanted to recap briefly on a childhood story. When I was a child, my mother tried to protect me from the harsher realities of life, like mothers everywhere around the world. And one story that epitomizes that brilliantly is when we sat down to watch this movie, Old Yeller.

For those of you who haven't seen the movie, Old Yeller is about a family who are out in the frontier in the Wild West, and the dad goes off on a cattle drive. Travis, the eldest son of the family, has to become the man of the house. And as the man of the house, he finds this stray dog, Old Yeller. The family are attacked by a pack of rabid wolves. Old Yeller saves the day.

And at that stage, my mother interjects and says, "Off to bed now, Paul. They all live happily ever after. Don't worry about the rest of the movie."

Back to today. When we were ramping up this project and were asked to accelerate, it's like we went out and read: how do companies do this? How is it done across the industry? How do you accelerate a project? And the answer was, add people quickly. And how did that work out? Off to bed now. Everyone lived happily ever after. Don't worry about it.

But that, of course, is not the way it works in the real world. For God's sake, Travis had to take a shotgun to Old Yeller.

In 1975, Frederick Brooks wrote the book The Mythical Man-Month. And the main hypothesis, I think as most people will know, is that if you add people to a software project, it will slow it down. And as he quipped himself quite often about it, "The Mythical Man-Month is like the Bible." He says, "Everybody's heard about it. A few people have read it, but hardly anybody goes by it."

So where did it all go wrong?

Gerald Curran

Okay, perfect. Thanks, Paul.

Now I'm going to go into a bit more detail about where it all went wrong for this dream project that we had been assigned to. There were lots of issues that led us to announce a year's delay to our customers, but really it boiled down to four key problems that we identified.

The first one, as Paul mentioned, was rapidly growing our teams. So we went from 30 teams up to 120 teams within 12 months, and very quickly we started to experience growing pains. It was quite clear that we underestimated the effort here. We made the silly mistake of thinking that adding more manpower would speed things up, and the actual result was quite the opposite.

The chart here on the left is what we thought it would look like. We thought that teams would be productive after three sprints. A sprint to us was three weeks long, and we thought that once they were onboarded to the program, their productivity would increase at a steady rate. And of course, once we saw this chart, we were happy, and we committed with our stakeholders.

But the result was quite the opposite. The chart on the right is what it really looked like. We did grow our teams, so that's the orange bar chart. But the velocity, which is the gray line, looked a bit different.

Looking into it, we made two critical mistakes. You've got the normal problems of when you're adding more people. You've got communication issues and so on. But also with our project, we were introducing a whole new suite of technologies. We were introducing Enterprise Java, JBoss, clustering, just to name a few. And we really underestimated the effort that it was going to take new teams to learn these technologies.

Also, we missed the impact that onboarding teams would have on already productive teams. Productive teams needed to provide a lot of support to teams that were onboarding. And as a result, our teams started to spend more time managing dependencies.
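As an editorial illustration of why coordination cost can grow faster than headcount, the short sketch below counts the possible team-to-team communication paths before and after the ramp-up. It assumes the worst case in which any team may need to talk to any other team; that assumption and the function name `pairwise_channels` are illustrative and do not come from the talk.

```python
def pairwise_channels(teams: int) -> int:
    """Count distinct team-to-team communication paths: n * (n - 1) / 2."""
    if teams < 0:
        raise ValueError("teams must be non-negative")
    return teams * (teams - 1) // 2


# Team counts taken from the talk: 30 teams before the ramp-up, 120 after.
before = pairwise_channels(30)
after = pairwise_channels(120)

print(f"30 teams:  {before} possible team-to-team paths")
print(f"120 teams: {after} possible team-to-team paths")
print(f"Teams grew {120 // 30}x, paths grew {after / before:.1f}x")
```

Under that assumption, a fourfold increase in teams produces roughly a sixteenfold increase in possible coordination paths, which is consistent with the teams spending more of their time managing dependencies.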

As we were integrating our software into main track (our main track function is where we integrate all software packages and build an ISO), the quality levels were dropping significantly. As a result, we went into frequent design stops. So we slowed down, and in fact, we were slower than we were before we added all the teams.

The next big problem that we identified was that we needed to make a significant platform change mid-development. Our feedback times had gotten very bad. Upgrade was taking 22 hours. Our applications weren't cloud ready. This was going to be a big problem for us as cloud was coming around the corner. So we decided to swap out the platform. And again, we underestimated the effort. So very quickly, we decided that we needed to make an improvement here.

Next was around our development environment. Our CI engine for 20, 30, 40 teams looks a lot different from one for 120 teams. And as we added more teams, the environment became more and more unstable.

On top of this, the CI practices within the teams were bad. Teams were committing oftentimes once a sprint, and even in some cases, it was once every two to three sprints. So this needed to improve.

Our test environments, we made the mistake of going with a decentralized approach. As teams were looking for hardware, we allocated out the hardware to them. And of course, this caused one big problem, which was we had just introduced a new competency to the teams. Now all teams needed to be competent in deployment. And on top of this, the utilization rate of the hardware was really bad.

On top of this as well, the teams' testware was very brittle. So when test cases were failing in main track, there was always a doubt whether it was an actual software fault or flaky testware. So this needed to be fixed.

All of these factors, again, contributed to us going into frequent design stops. And as it turns out, back then we were closed for deliveries more often than we were actually open. So imagine that: 120 software development teams, and the main track function was closed for deliveries. A really bad place to be.

Next was our organizational structure. As we scaled to 120 teams, we didn't scale our program function, and it became clear very fast that we didn't have the program structure to support all of the teams. On top of this, our visibility and reporting were really bad. At any point in time, we didn't really know how bad things were. We knew they were bad, but we couldn't put a figure on it. So this needed to improve.

So Paul, what was the leadership team's response to these issues?

Paul Madden

If I was to ask the audience, if you could look back over your career and pick out that moment when you were a part of a team that was performing at the highest level, most people can come up with an answer on that fairly quickly. They'd say, "I remember that time a couple of years ago. I was in that team, and that team was doing a great job."

And if I was to look back over my career, I think the pinnacle of that, in terms of being in a team that was fulfilling its potential completely, would've been two years ago as part of that leadership team.

There are a couple of things that you need to have in place, the requirements, whether they're environmental or whatever, to have a team performing at that level. The first one is purpose. For the first time, I'd say in quite a while, we were all completely, absolutely crystal clear on our purpose, what we had to do. We had to get this program back running. We knew that. We all had other responsibilities, other assignments, other pet projects. All of those other things faded into the background. Getting this program back running was the number one priority.

Given the seriousness of this having come to a stop, there was a real risk that if our development unit didn't get this program back running, the development unit would cease to exist.

In terms of autonomy, we had complete freedom to do whatever we needed to get the program back on track. Now, for sure, there was a clock ticking on us, and if that clock got all the way down, well, then it didn't matter anyway. But until it did, we took that autonomy. We acted, and we did whatever we thought was needed to get it back on track.

Some of the people on the team had more than 20 years' senior management experience. Multiple people on the team had been on foreign assignments with different Ericsson offices or with different companies. So we had that expertise and mastery required at that team level to perform.

And psychological safety. There's been a lot of research done over the last couple of years, by Google in particular, on this, and what you're talking about there is that people's concerns for the performance of the team outweigh their own individual concerns about how they themselves are reflected in the team.

So I didn't care whether I looked stupid by asking a question in the team. Nobody did. If someone gave me feedback, I took it on the chin and said, "Yeah, I need to take something from that." And that's because you're in a situation where you don't care about your individual concern. Your individual concern had diminished to nothing. It was all about the collective, getting this program back on track.

And I think because of that, that allowed us to act decisively, quickly, and very effectively. Again, if I was to look back over my career, I'd say that was the moment where I was part of a team that executed at such a high level.

And to go into what we actually did then.

Gerald Curran

Okay. So now I'm just going to talk about the fixes that we implemented for those problems that I mentioned earlier on.

The first thing we needed to fix was our onboarding process. We introduced onboarding 2.0. The fundamental difference here was instead of focusing on time, we focused on quality. Teams needed to prove that their software was at a certain quality level. They needed to prove that their engineering practices were good, and they also needed to prove that they knew the system at a high level. And then once they ticked all those boxes, we let them deliver into main track.

To help them, we provided buddy teams. These were teams that were already onboarded to the program, that knew the system well, and they were able to give the required support. But the difference here was that we allocated 30% of that team's capacity for this help. So as it turns out, it took some teams up to a year to onboard. You can imagine what that team would've done to the system if they were able to deliver after nine weeks.

Next, we fixed our deployment issue. The platform issue that I was referring to earlier was really about deployment. We decided to swap out our in-house deployment tool for KVM. And we can honestly say that looking back, if we hadn't made this change, we wouldn't be where we are today. Our upgrade times reduced from 22 hours down to three hours, and our applications were now running as VMs. Although we hadn't migrated to the cloud at that point, we have since, and it has been a very smooth transition thanks to that.

Next, we fixed our CI infrastructure. We strengthened the CI engine, we implemented a lot of new functionality, we brought in queuing mechanisms, and we implemented new frameworks. One framework in particular was around Docker, to give really fast feedback to teams on the user story acceptance level.

And to fix the CI practices, we created networks, coaching networks, that is. One engineering coaching network and one Agile coaching network. We trained the coaches, and then those coaches worked with the teams to work on their practices and make sure that their software was good quality.

Next, we fixed our test environments. Where before we had the decentralized approach, we centralized that. And we offered it out as managed services. So when teams were looking for hardware, we allocated them time on the hardware for a certain period of time. This meant that they didn't need to upskill on deployment knowledge, and the utilization rates of our hardware were a lot better.

On top of this as well, we created an uplift program to work with teams on their acceptance-level test cases. Once we did this, quality improved, and whenever there was a fault in main track, we were confident that it was a software problem, not flaky testware.

Then next, we fixed the organizational problems. We brought back in project managers, but this time each one worked with a group of two to three teams, and it was really to help them with their release planning, so to identify any risks, issues, and dependencies upfront. And we improved our planning capability across the organization, both top-down and bottom-up.

And we always made sure that everything was factual. We used tools like Jira to make sure that everything was fact-based.

On top of this, we added feature teams, or some organizations call them delivery teams. These teams didn't own the code base. Their main responsibility was to burn through a requirement. And there were no handovers between feature teams for these requirements. So that was a big improvement.

And then finally, we made big improvements around communication. We set up a daily standup on main track issues for the whole management side of the organization to attend. This was to make sure that those issues were given the right attention. And then weekly, we set up project reporting so that the program could see at any given time whether a project was on track or off track, and we could provide the necessary support.

Overall, we wanted to make sure that all our stakeholders, so the funders, the teams, and management, knew at any given time how the project was progressing.

Then to look at the results of this. Previously, it was taking us seven weeks to get confidence from acceptance-level test cases. Now we were getting that within 90 minutes. Before, we were deploying our system twice a year, and now we're deploying it to our lead customers every three weeks, so every sprint. And then finally, we were able to turn around requirements or features every six weeks. Previously, that was taking two to three years to get through our requirement handling process.
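As an editorial aside, the before-and-after figures quoted above can be put side by side as rough improvement ratios. The sketch below assumes calendar time, 52 weeks per year, and reads "twice a year" as one release every 26 weeks; those assumptions and the variable names are illustrative and not from the talk.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # calendar minutes, not working hours
WEEKS_PER_YEAR = 52             # simplifying assumption


def speedup(before: float, after: float) -> float:
    """Return how many times shorter the 'after' duration is."""
    return before / after


# Test feedback: seven weeks down to 90 minutes.
feedback_speedup = speedup(7 * MINUTES_PER_WEEK, 90)

# Release cadence: every 26 weeks (twice a year) down to every 3 weeks.
release_speedup = speedup(WEEKS_PER_YEAR / 2, 3)

# Feature turnaround: two to three years down to six weeks.
feature_speedup_low = speedup(2 * WEEKS_PER_YEAR, 6)
feature_speedup_high = speedup(3 * WEEKS_PER_YEAR, 6)

print(f"Test feedback:      about {feedback_speedup:.0f}x faster")
print(f"Release cadence:    about {release_speedup:.1f}x more frequent")
print(f"Feature turnaround: about {feature_speedup_low:.0f}x to "
      f"{feature_speedup_high:.0f}x faster")
```

These are order-of-magnitude comparisons only; the talk gives the raw durations, not the ratios.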

Paul Madden

And when you're up to your neck in manure, the only thing to do really is to work at it and hope eventually it turns to fertile ground. And it did for us.

As an organization, we're performing now. We're being held up across Ericsson as a development unit that can do large-scale development, large-scale Agile, and large-scale enterprise development. And that's a credit to the mindset of the company as well. We don't plan for mistakes, but we certainly learn from them.

I hope you found something useful from that story today. Thanks very much.

Q&A

Paul Madden: So I don't know, is there a Q&A?

Audience: Yeah, sure.

Q: Daniel Velez from General Electric. So you are delivering solutions to customers. How did you onboard your customers in this transformation? Because in the end, your goal is to deliver a solution to your customer, no? But to the cloud, I think. So how did you onboard them?

A: It was an ongoing dialogue with the lead customers, and some of our customers were pushing for this as well. They see it as a competitive advantage that they can get a delivery every three weeks, that they don't have to wait that three years. So part of that dialogue on how do they get their requirements in quicker, it was just a dialogue between... They were pushing for it as much as we were pushing for it.

Q: So did you push the continuous integration and deployment process up to your customer? Is it that same process, or do you have a discontinuity there?

A: No. Once we have the software ready to ship, the customer takes it and deploys it themselves.

Q: Understood. Okay.

A: They have their own install hardware, yeah. But then with the next steps, we're moving towards cloud, so we're trying to push more and more towards as a service. So that'll enable us for continuous deployment.

Q: Okay. What quality parameters did you use to identify whether a new team is ready to deploy into production?

A: Yeah. So we looked at bug count, make sure that the teams didn't have a lot of bugs before they went in. We used tools like SonarQube. And then a big thing was around engineering practice as well, make sure that they had good code review processes before they delivered into main track. And there would've been a subjective evaluation, say, from an expert engineer as well.
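As an editorial illustration, the onboarding criteria named in this answer (bug count, static analysis, code review, and an expert's evaluation) could be combined into a simple readiness check along the following lines. This is a minimal sketch, not Ericsson's actual gate: the `OnboardingStatus` fields, the `MAX_OPEN_BUGS` threshold, and the function name are all hypothetical, and no real SonarQube API is called.

```python
from dataclasses import dataclass


@dataclass
class OnboardingStatus:
    """Hypothetical snapshot of a team's readiness to deliver into main track."""
    open_bugs: int
    static_analysis_passed: bool   # e.g. the outcome of a SonarQube scan
    code_review_in_place: bool
    expert_signoff: bool           # subjective evaluation by an expert engineer


MAX_OPEN_BUGS = 5  # hypothetical threshold, not a figure from the talk


def unmet_criteria(status: OnboardingStatus,
                   max_open_bugs: int = MAX_OPEN_BUGS) -> list[str]:
    """Return the criteria a team still has to meet; empty means ready."""
    unmet = []
    if status.open_bugs > max_open_bugs:
        unmet.append("open bug count too high")
    if not status.static_analysis_passed:
        unmet.append("static analysis not passing")
    if not status.code_review_in_place:
        unmet.append("code review process missing")
    if not status.expert_signoff:
        unmet.append("expert sign-off missing")
    return unmet


ready_team = OnboardingStatus(2, True, True, True)
new_team = OnboardingStatus(9, True, False, True)

print(unmet_criteria(ready_team))
print(unmet_criteria(new_team))
```

Returning the list of unmet criteria, rather than a single yes or no, reflects the idea from the talk that teams were onboarded on quality rather than on elapsed time.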

Q: Hi. How did you keep the architecture coordinated across so many teams?

A: Yeah. How did we keep the architecture coordinated across the teams? We had the central team of architects who would support the teams. So every time there was a new requirement to be built or delivered, the architects would have input into speccing out what the solution would look like, and would be available to talk to the teams while the teams were implementing it.

So we tried to develop as created, and I think we had the same with the Disney presentation, that there was a central group there available to support the teams. Sometimes the teams had to be told, "Go and ask for this opinion." But in general, we got pull from the teams, which was good.

I think the philosophy helped too: your architecture isn't the PowerPoint slide, what's actually built is your architecture. Once that was embedded, say, with the architects themselves, that you can do all the PowerPoints you like but it doesn't matter if the teams don't pull from you, it made them more open to challenge and to negotiating with teams in terms of the implementation.

Q: How do you balance the different testing levels in this transformation? Because, from the beginning, you probably did manual testing, and right now you are faster, but... And you have probably automated that testing. How do you balance the different unit and end-to-end testing, and how did the developers go about developing their own tests?

A: Yeah. I can take this one.

Previously, when we went with the decentralized approach, where teams were heavily dependent on hardware, it turned out that the classical pyramid that we're always trying to strive towards with unit, integration, and acceptance, a small bit on top, was looking more like an hourglass. So teams were writing a lot of acceptance-level test cases.

Then when we changed it around, we brought in Docker to try to pull in the dependencies and deploy it on their laptops locally, and that helped us get back to more of a 70% unit, 20% integration, and then just a small percentage towards acceptance test cases.

So in terms of the developers' test cases, they're all fully automated. And even our QA department, or the QA testing that's done afterwards, is more or less all automated, just a small amount of manual test cases.
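As an editorial illustration of the pyramid-versus-hourglass distinction in this answer, the sketch below classifies a test suite by its mix of unit, integration, and acceptance tests. The function name `classify_suite_shape`, the classification rules, and the hourglass example numbers are hypothetical; only the 70/20/10 split is taken from the talk.

```python
def classify_suite_shape(unit: int, integration: int, acceptance: int) -> str:
    """Classify a test suite by the relative size of its three layers."""
    if min(unit, integration, acceptance) < 0:
        raise ValueError("test counts must be non-negative")
    if unit + integration + acceptance == 0:
        raise ValueError("the suite has no tests")
    if unit >= integration >= acceptance:
        return "pyramid"      # broad unit base, narrow acceptance top
    if integration < unit and integration < acceptance:
        return "hourglass"    # thin middle, heavy top and bottom
    return "other"


# The split the speakers describe aiming for: 70% unit, 20% integration,
# and a small acceptance layer (10% is mentioned later in the Q&A).
print(classify_suite_shape(70, 20, 10))

# A made-up acceptance-heavy mix, to illustrate the hourglass case.
print(classify_suite_shape(45, 10, 45))
```

The first call reports a pyramid and the second an hourglass, matching the shift the answer describes once dependencies could be run locally in Docker.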

Q: So QA is another team in your case?

A: Well, we've a small unit which is more full system level, end-to-end. It's the last level of stamping before it goes out to the customer. It's, like, what, 20 people out of 1,200. So it's, yeah.

Most of those test cases are rerun, so the developers' test cases, the 10% acceptance-level test cases, we've created a framework where they can be rerun again later on in the pipeline.

Q: So developer team ends at integration testing?

A: No, it ends with acceptance. So the team is responsible for 100% confidence, but just at the very end, we have some integration-type testing just to make sure that the full system is verified and stamped.

Q: Thank you.

A: Thanks.

Thank you. Thanks very much. Perfect. Thanks.