How To Turn the Software Team Around
The toughest failures are the greatest teachers. Through pain, frustration, and experience comes wisdom. Learn to embrace the hardest challenges, for in those comes the greatest growth.
In this talk we will discuss the bright side of the most common problems you face as a leader and developer. By the end of this discussion you may switch your mindset the next time you or your team runs into a nasty, seemingly unsolvable issue. Maybe get a little giddy when you fail. Who knows! Come find out.
Full transcript
Charles Lafferty
Leading a development team as a manager or a tech lead for the first time can induce a lot of anxiety. I can tell you from my experience: you doubt yourself. You question whether you're going to make the right choices. You wonder how people are going to treat you and if they're going to respect you. You wonder how you're going to succeed or fail. Now, in this role, the decisions you make no longer just directly impact you; they impact the lives of the people you lead. That's a big responsibility.
When I get anxious, I learn. I realized a long time ago that the more I learn about a topic, the better I become at it and the more comfortable I feel. I appreciate you all joining today because I'm going to share with you some of the key lessons on how to turn around software development teams. I won't be able to relieve all of your anxiety if one day you choose to lead, but I might be able to take the edge off with these tips. What I'm going to share today is what I learned to get over my own doubts. I'm going to talk about managing teams, culture, and metrics. One thing I really do enjoy doing is talking to software development teams, especially ones that want to improve. My name is Chuck Lafferty, and I'm a senior director at ADP.
Let's talk a little bit about ADP. ADP is one of the world's largest HR providers, with over one million clients worldwide. ADP does HR services, talent management, benefits, HR outsourcing, and payroll. Some really eye-popping stats: ADP moved over $3.1 trillion in client funds worldwide in fiscal year 2022. And there's a good chance ADP might do payroll for your company, because ADP pays one in six U.S. workers.
One thing ADP really wants to know is how employees do their best work. The ADP Research Institute runs surveys to figure that out. One survey I'm going to bring up today is the Workplace Resilience Study done in 2020. ADP surveyed 1,000 people in each of 26 different countries and asked questions around workplace resilience and engagement.
Engagement is a frame of mind you're in to give your team your best. When you bring your whole self to work, it's a proactive frame of mind. Resilience is a team being reactive: do you bend in the face of challenges and bounce back? In this survey, they found some eye-popping results. It turns out that teams, trust, and leadership were a theme, and a really important theme. The survey found that people were 2.6 times more likely to be fully engaged just by being on a team. People who said they completely trust their team leader were 14 times more likely to be fully engaged. So this tells you: be on a team, and make sure that you're working on trust with your leadership team.
In terms of resilience, this bouncing back, there were seven important things that people needed. The first was around agency: people felt they had the freedom to decide how to get their work done. Compartmentalization: people needed to feel that no matter what else was going on around them, they could stay focused on getting their work done. Strength for the work itself: people said that they felt excited to work every day and brought their strengths. A strength is something you love doing and you're good at; if there's something you love doing but you're not very good at, that's called a hobby.
What did people need to see from their team leader? People needed to see psychological safety. This was popularized by Ron Westrum's organizational culture and the DevOps Handbook. Psychological safety means you feel safe saying anything to the team without fear of retribution, provided it's polite. People also needed anticipatory communication from their team leader. I call this the no-surprises rule: the team leader needs to tell people what they need to know before they need to know it. If some change is coming or some work is coming, give a heads-up first. From senior leaders, people needed to see visible follow-through. They needed to believe that when senior leaders said they were going to do something, they followed up and did it. They also needed vivid foresight: leaders able to see around corners and anticipate what others might not have foreseen, enabling people to remain focused on their work without worry. Imagine being a passenger in a car and being asked to do a task in the passenger seat. If you don't trust the driver and keep having to look up to see if you're going to crash, you won't do a very good job on whatever task you're doing. That's what a team leader does: they create that trust so you can do your job without worrying.
How does this relate to teams? Let's talk about turning around a software development team. Some common problems I come across when talking to teams: frequent service interruptions in production; they can't figure out exactly what's wrong in production; they don't really know where they're going in the future; they might not get along as a group and butt heads; and they don't really know if they're making a difference in production and whether releasing these features to prod had an impact.
I want to talk about a five-step process for how we alleviate some of these problems. We're going to deep dive into each one: autonomy, sight, future vision, cooperation, and measurement. These seem cryptic right now, but as we dive in, we'll go on this journey and figure them out together.
Autonomy is how you gain control to do the things you need to do so you can help the development team out. Imagine it's Monday morning. You got into the office at 9:15. You have a hot cup of coffee in your hand and you had a pretty good weekend. You look over at your phone and it's ringing. You see it's from Ranburne, Alabama, and think you know that number but forget where it's from. About five seconds later you get a notification from OpsDuty, a fictitious company that sends you alerts when you have problems in production. OpsDuty says contact service timeouts are higher than normal. Then you realize this is a production outage.
You do what you normally do. The team all got this outage notification too. They jump on the phone together and everybody starts digging into the problem. You bring up your logs and your metrics. The usual suspects are on the phone: a tech lead, a couple of quality assurance people, maybe managers, maybe business users, all trying to figure out what's wrong. In the meantime, you text your boss: hey, listen, there's a problem in production; we're looking into it right now.
As the team digs down, it's been about ten minutes and the tech lead speaks up: I think I figured out the problem. The contact service timeouts are coming from this new query we sent to production yesterday. I have a feature flag for it. I'm going to roll back: I'll set the feature flag back to its previous value, so it uses the old query, and that should restore the service. The tech lead switches the feature flag back, and everything is working again. There was a production outage, but everything's now resolved. You're feeling good. A couple minutes later, the tech lead says, we forgot to add the index to production; we'll make sure we do that tomorrow.
You're feeling good about this one, but about five minutes later you know what phone call is coming next. It's your boss, and your boss wants to know why the heck production had an outage and why you sent that code to production. In this situation, deploy the AAA method. I got this from Ron Mitra online. The first A is that you have to let your boss know you're aware of the situation. You never want your boss to come to you and ask if there is a problem. That's why you texted your boss: there's a problem in production, we know, we're going to jump on it and figure it out.
The next thing is assess the situation. Boss, there's a problem in production; we've assessed it; the contact service is timing out; the team is looking at it right now. The next thing you need to tell your boss is that you've acted: we realized the contact service was timing out, we knew there was a new query, we had a feature flag for it, we flipped it back, and now it's working again. Finally, with your boss, you need to explain it. You need to be competent. You need root cause analysis. We know the index was not added to production; we're going to add that tomorrow to make sure we can use this new query. Problem solved. Your boss now knows that you're competent and have this situation under control. Remember to use the AAA method with any kind of situation like that.
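The rollback in this story can be sketched in a few lines of Python. This is only an illustration under assumptions: the FeatureFlags class, the flag name contacts_new_query, and the two query labels are hypothetical stand-ins, not the actual system from the talk or any particular feature-flag product.

```python
from dataclasses import dataclass, field

# Hypothetical labels standing in for the real queries.
OLD_QUERY = "contacts_lookup_v1"
NEW_QUERY = "contacts_lookup_v2"  # assumed to need an index that production lacks


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store; a real team would use a flag service."""

    flags: dict = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so the old code path is the safe default.
        return self.flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self.flags[name] = enabled


def contact_query(flags: FeatureFlags) -> str:
    """Choose which query the contact service runs, based on the flag."""
    if flags.is_enabled("contacts_new_query"):
        return NEW_QUERY
    return OLD_QUERY


flags = FeatureFlags()
flags.set("contacts_new_query", True)   # yesterday's release turned the new query on
flags.set("contacts_new_query", False)  # the rollback: no redeploy, just flip the flag
```

The point of the design is that acting during the outage is a configuration change rather than a code change, which is what lets the tech lead restore service within minutes.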
Now the ability of sight: what do we focus on first? Some teams might say they have all these problems and all these things to fix, but don't know what to focus on first. The first thing I want you to do is give your team the ability of sight. Let me tell you a quick story about a rocket engine: the Saturn V rocket engine that sent men to the moon, also known as the F-1. It was developed by Rocketdyne in the 1950s, and this rocket engine is massive: 18 feet tall, 12 feet wide, and 18,500 pounds. Still to this day, it's the most powerful single-combustion-chamber liquid propellant rocket engine ever developed.
In early development, tests revealed serious combustion stability problems that caused catastrophic failures in the engine. It would blow up, and initially progress was slow because the problem was intermittent and unpredictable. How does that relate to our software? Those problems in production that you can't reproduce, especially in lower environments, drive you crazy. Oscillations of 4 kilohertz with harmonics to 24 kilohertz were observed. Eventually engineers developed a diagnostic technique of detonating small explosive charges outside the combustion chamber, allowing them to determine exactly how the running chamber responded to variations in pressure and how to nullify those oscillations. These problems were addressed from 1959 to 1961.
Eventually these engineers had the engine combustion so stable it would self-dampen artificially induced instability within one tenth of a second. How cool is that? The 1960s engineers were able to make a self-healing rocket engine. I want that in our software too: self-healing systems. But what I want you to focus on is how they observed the problem. They had sight into the problem. That's how they were able to fix it. That's what I want to give your development teams: the ability of sight.
That's where observability comes in. Before refactorings, architectures, reorgs, and everything else, and even before unit tests, this is what I want you to focus on first: observability, because you don't even know where to look. You don't know how many people are having problems with your stuff in production. You don't even know, when outages are happening, where to look. Observability is going to focus your refactoring, testing, and everything else you need to do to make the team successful.
Observability has three parts. I learned this from Charity Majors and O11ycast. First is logging: Splunk, Elastic. You want to send stuff to a log somewhere. Never write a try-catch that does nothing in your code; always log it somewhere. Logging is super important. Monitoring is the four golden signals from Google's Site Reliability Engineering book: latency, traffic, errors, and saturation. You want to know if your CPU is pegged or your memory is running out, using tools like Dynatrace or Datadog. Finally, tracing: can you map something from a user click all the way through your system, through your networks, to the production database? It's really powerful when you're able to map a click to a query. That gives you so many insights, and you'll be able to fix things so quickly. Give your team the ability of sight in production. That's what I want you to do first.
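As a small illustration of the logging advice, here is a Python sketch of the alternative to an empty except block. It logs the failure with its stack trace and re-raises it, and it records how long the call took, which is a rough stand-in for the latency signal. The function name fetch_contacts and the load callable are hypothetical, chosen only for this example.

```python
import logging
import time

logger = logging.getLogger("contact_service")


def fetch_contacts(load):
    """Run `load` (a hypothetical data-access callable) and log what happened."""
    started = time.perf_counter()
    try:
        result = load()
    except Exception:
        # Instead of swallowing the error: log it with the stack trace, then re-raise.
        logger.exception("fetch_contacts failed")
        raise
    finally:
        # Runs on success and on failure, so every call leaves a timing record.
        elapsed_ms = (time.perf_counter() - started) * 1000
        logger.info("fetch_contacts took %.1f ms", elapsed_ms)
    return result
```

Where these log lines end up (Splunk, Elastic, or plain files) depends on how the logging handlers are configured, which this sketch leaves out.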
Future vision: some teams come to me and say, what's next for our product? Where are we supposed to go? What does good look like? What is the Serena Williams of software development? What is the Michael Jordan of software development? Some people might not have exposure to other teams to understand what good looks like. Here, I'm going to skim over this because it's probably been brought up a lot, especially at a conference like this: the Accelerate State of DevOps 2022 report and DORA metrics, the DevOps Research and Assessment metrics. This is the vision for your team. You can say: we want to be a high-performing team. We want to deploy multiple times per day. We want our lead time for changes to be between a day and a week. We want our time to restore service to be less than a day. We want our change failure rate to be between zero and 15%. These things are exciting to developers and exciting to me because I know it's going to make my job easier and more fulfilling if I'm able to get my changes to production quicker and with fewer incidents. That can be the future vision for your team.
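To make those four targets concrete, here is a rough Python sketch of how a team might compute them from its own deployment history. The Deployment record and its field names are assumptions made for this example, not a format defined by the DORA report. It also simplifies time to restore by measuring from the deploy time, where a real incident log would give the actual start of the outage.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""

    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause an incident or rollback?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def _hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_summary(deploys: list, days: int) -> dict:
    """Summarize the four metrics over a reporting period of `days` days."""
    if not deploys or days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [_hours(d.committed_at, d.deployed_at) for d in deploys]
    failures = [d for d in deploys if d.failed]
    # Simplification: restore time is measured from the deploy, not incident start.
    restores = [
        _hours(d.deployed_at, d.restored_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "median_restore_hours": median(restores) if restores else 0.0,
    }
```

Comparing these numbers against the targets above (multiple deploys per day, lead time between a day and a week, restore under a day, failure rate under 15%) gives the team a simple scoreboard for the vision.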
The next one is cooperation. Maybe people aren't getting along too well, and how are we going to treat each other? Have you ever felt like every day is a battle at work? You come into work and there's this person named Jeff, and everything you say Jeff has to say how it won't work, or everything is a debate, or everything is going to go wrong, or nothing ever works. It's tough to work in that type of environment. Have you ever caught yourself thinking: no one ever listens to me; nobody's going to like my idea; I'm afraid of what others might think of my proposal; I never get a chance to finish my sentence; I don't want to upset anybody? Complete apathy is the worst: why even try? People say we keep talking about doing changes but never actually do any of them, so why am I even saying things? At that point, the team is kind of lost, because you have to have follow-through.
How do we talk about this in terms of teams and team stages? Tuckman's stages of group development: forming is when there is politeness, people are figuring out each other's personalities, and people avoid controversy. Storming is when people are arguing among team members, vying for leadership, lacking role clarity, having power struggles and clashes, and lacking consensus-seeking behavior. Norming is when processes and procedures are agreed upon, people are comfortable with relationships, there is effective conflict resolution, a sincere attempt at consensual decisions, and people develop routines. Performing is when roles are clear, teams develop independence, there is better understanding of strengths and weaknesses, and there is predictability. You can predict what a teammate is going to need or do. The last one is adjourning, when the project is coming to an end and slowing down.
How do you get from storming to norming? Some teams can get stuck in storming for years. Don't feel too bad, because this problem is as old as human history. The Code of Hammurabi is 3,700 years old and tried to solve this exact problem. Developed by King Hammurabi in ancient Babylon, it had 282 rules establishing standards for commercial interactions, fines, and punishments. It was carved into a seven-foot-tall stone stele. Those rules created what you could call a manual of collaboration for society. You can do the same exact thing for your development team to get from storming to norming.
As developers, we can create guides for ourselves. A developer guide can include decisions on coding styles, tool choices, refactoring practices, and how we're going to unit test, with the consensus written down in the guide. That way, you no longer say, this is my problem. You say, this is the problem on the page. If we want to change it, we're going to have a discussion and change the page. It's no longer my idea or your idea; it's what's on the page. It removes ego from the situation. A team guide can cover how we use our story tracker software, what we do during outages, and whether we have a rotation. A deployment guide can cover how we promote code, deployment deadlines, configuration practices, and communication practices. This helps remove ego and puts something on the page that you can argue about.
How do you actually get people to communicate without degenerating into an argument? You want open discussion; discourse is very good for a team because you want to understand where each other is coming from. Focus on rules for debate. Some of these ideas come from Kim Scott in Radical Candor, along with my own things over the years. First, you cannot use "you" or a name, because it attaches a person to an idea. If you remove that and say this idea stands alone, it removes ego from the situation. You're no longer saying the person is bad; you're saying the idea might not be the one we need. Focus on the problem. Build on ideas. Use "yes, and" not "no, but." No work or idea is bad; it's just an idea. A call for more discussion can be a decision. At the end of the conversation, you don't need to decide; you can have another conversation. Focus on that with your teams, start building consensus, build that guide, and write down that manual of collaboration.
The next one is measurement: how do we know we're headed in the right direction? How do I know the thing I just deployed to production actually had an impact? I like the Peter Drucker quote: if you can't measure it, you can't improve it. It's really important to measure things on your development team to understand if you're going in the right direction. This is a great thing for a manager or tech lead to do for their team if they're using some of their non-coding time, because you can help guide the team and give them metrics they haven't seen before. I listed 50 things you can measure on your development team. Things like total number of problem tickets; maybe that's a smell in production. Maybe your slowest calls or most frequent API calls. You want to measure those to see if they're going up or down. Maybe JavaScript bundle size, infrastructure, latency, CPU, memory. Maybe dev metrics like CI/CD pipeline timings, how long each step takes, how long it takes to merge PRs, total work in progress, how long builds take, and of course DORA metrics such as mean time to recovery, along with mean time to detect outages. These are really powerful techniques you can use to help guide your development team along their journey. Once you give them measurements, you're giving them the ability of sight into the future.
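Checking whether a metric is going up or down can be as simple as comparing the most recent few samples with the few before them. The Python sketch below does that for any series of numbers, such as weekly problem-ticket counts. The function name trend and the default window of four samples are arbitrary choices for illustration.

```python
from statistics import mean


def trend(samples: list, window: int = 4) -> str:
    """Compare the mean of the latest `window` samples with the `window` before it."""
    if window <= 0 or len(samples) < 2 * window:
        raise ValueError("need at least two full windows of samples")
    previous = mean(samples[-2 * window:-window])
    recent = mean(samples[-window:])
    if recent > previous:
        return "up"
    if recent < previous:
        return "down"
    return "flat"


# Hypothetical weekly problem-ticket counts, oldest first.
weekly_tickets = [10, 12, 11, 13, 9, 8, 7, 6]
direction = trend(weekly_tickets)
```

Averaging over a window rather than comparing two single data points keeps one noisy week from flipping the answer.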
In conclusion, we've revealed what each one means. Autonomy: handle those outages with the AAA method to show competency and avoid being micromanaged. Give your team sight: create observability first, then you can figure out what the heck to focus on. If you don't even know how many errors are in production, you have to figure that out. Create future vision: find out what good looks like, which is kind of already done with the DevOps Research and Assessment metrics, but you can continue to figure out what good looks like. Cooperation: make that manual of collaboration for your team; find common ground; write it down. Measure what's important and continuously improve on those metrics. For example, number of timeouts: if you don't want any timeouts in production, drive that number to zero. With that, I hope these tips give you the confidence to help turn around that software development team. I want to thank you for your time. Again, my name is Chuck Lafferty. You can reach out to me on Twitter and LinkedIn. Thanks so much for joining.