Chasing the Unicorns at T-Mobile

London 2020

Twelve-hour outage bridges, worn-out headphones, 90% unplanned work, and 25TB of randomly corrupted file systems were normal business for T-Mobile developer platforms.

When the foundation of where software delivery happens is the bottleneck, throughput remains buried under a large pile of debt. Ripe for improvement, T-Mobile has begun to embrace DevOps principles including transparency, telemetry, post-mortems, and continuous experimentation to spark a turnaround of historic proportions.

Listen as Chris Hill, Senior Manager of Developer Platforms, walks through a journey capitalizing on T-Mobile culture and desire to create experiences customers love. The culture, otherwise known as "Team Magenta," led to an appetite for change and now has teams achieving up to 30x throughput gains and decreased deployment pain.

Full transcript

Chris Hill

Hello and good afternoon, London. My name is Chris Hill, and I am the senior manager of developer platforms, and I'm very excited to speak to you all today about T-Mobile's pursuit of unicorn status.

Now, this is with respect to developer platforms, which is my favorite subject, and I'm highly passionate about this. And I'm really excited because all of us can relate to this. We either lead developers, or we've been a developer, or we understand and we empathize with the developer experience. So I hope that it resonates.

I'm going to start with the developer experience and walking us through what a developer experience maybe is like in your enterprise, and how we can improve it. I'm going to go into transformation. I know that word gets overused a lot, but what does it mean to transform and change the way that you work? I'm going to dissect that a little bit. I'll also mention the unicorn playbook and what unicorns -- basically how they operate, and maybe what we need to do to catch up, and some of the lessons learned within T-Mobile, at least on our journey so far.

So I'm going to kick us off in an area that a lot of people don't really like to talk about, and that is the onboarding process. This is where I feel developers first lose their motivation.

And the best analogy is, you know when you get your IKEA furniture, you've got this instruction booklet. All the steps are laid out in front of you, and you have little cartoon versions of yourself that basically just show you either really confused or being told, "Don't do that." These are the designers from IKEA empathizing with the experience you're going through, or that you are going to go through, as you build this table. There's also a nice little tools chart and inventory chart that lays it all out in front of you. These are all of the pieces that you'll need, and these are all the tools that you need to put these pieces together.

I'm still waiting for the day when I inherit a software project or join a software project and it comes with a list of reasonable instructions like this, like I'm putting together an IKEA table. But every software project I've joined as a developer feels like I just stumbled on step number 650 of the IKEA table build cycle, but the instructions are actually missing. Where did those go? I don't know. And every screw is the wrong size, and every screw is completely stripped.

There's no way I can be productive at what I just landed myself in. If I could just get somebody to walk me through how the software's been built before, or someone show me the design documents and why you made those decisions. But instead, I'm stuck raising a flurry of access requests. I'm stuck looking at incomplete architecture drawings, if I can find something, and thousands of people that, well, everyone has their food ready and they're watching me try and build a table and saying, "How come your table's not done yet?"

Maybe after scouring the Earth, I'm able to obtain a hard copy that tells me how or where I need to go to ask for access to code and environments. This would be step-by-step instructions on how to get into service tools: ServiceNow, Remedy, ticket-raising types of tools. And if I'm lucky, when I do make those requests, the people in the approval chain are in the office this week, and I can finally start to actually look at code. I'm really excited.

In a larger enterprise, that SLA for the approval workflow of what you just asked for, oh yeah, that's a five-day SLA. Okay, so I'm sitting on my hands until then? But then maybe when I actually do get access to the systems, they're so fragmented that it's a context switch every time I go from one tool to the next tool. And basically, in order for me to fully understand and fully utilize the tool chain that I have at my disposal to do my job, I need to understand the narratives of every company who builds each one of these tools. I got to jump into why a company decided to put that part of the workflow there, and then I got to log into another tool and understand where that company was coming from just to do my job.

Does this sound familiar to anybody? What a way to welcome a new developer into a company or even into a new project. Like, "Congratulations. Here's the most demotivational and disenfranchising way that you can join our project." I honestly don't know why more developers don't run for the hills. Patience is not a quality that I usually see a lot of in developers.

Let's say your motivation started at 10 out of 10 on day one. New project, new company. I'm really excited. Now I'm going to kick off my career. You're at 10 out of 10, right? After going through a painful onboarding experience, and before you're ever even able to look at code, you're down to, like, a two. You better hope that the code is perfectly written and there's no debt and it's tightly organized because you don't have much motivation left to work with.

Now, at T-Mobile, customer experience is in our blood. We love the customer experience. And just like a better subscriber experience leads to lasting and growing revenue, a better developer experience leads to higher software throughput and higher retention, higher levels of innovation. And the more investment you make to keep the motivation up and the friction down from the beginning, the happier your devs will be, and the more productive.

So why does this make sense? How do I rationalize developer experience to results? How do I reconcile this? Less cognitive load from context switches will decrease your overall cycle time. You will be faster at delivering products the less you have to worry about the fragmentation on the tool chains that deliver your products.

There's also less wait time within your developer value stream. If it takes you 30 seconds just to log in to each one of the tools that exist in your software value stream, you've already lost hundreds of thousands of seconds, depending on how big your enterprise is. We've even calculated that if we save one second in every single CI/CD job right now, so that we get feedback one second faster, it's just like we hired a full-time person.

That's pretty amazing. And it makes somebody think twice when they go, "Oh yeah, that'll add five seconds. Well, it doesn't really bother me. It's five seconds. I'm getting out of it what I want." Yeah, but at the whole enterprise scale, that's a huge tax.
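The one-second claim above can be sanity-checked with some quick arithmetic. The sketch below is only an illustration: the talk does not give T-Mobile's CI/CD job volume, so the function names are hypothetical, the 2,080-hour working year (40 hours times 52 weeks) is an assumption, and it also assumes someone is actually waiting on every job.

```python
# Assumption: one full-time person works about 2,080 hours per year
# (40 hours x 52 weeks). The talk does not state this figure.
WORK_HOURS_PER_YEAR = 40 * 52
SECONDS_PER_FTE_YEAR = WORK_HOURS_PER_YEAR * 3600


def fte_equivalent(jobs_per_year: int, seconds_saved_per_job: float) -> float:
    """Return the waiting time saved per year, in full-time-person years."""
    return jobs_per_year * seconds_saved_per_job / SECONDS_PER_FTE_YEAR


def jobs_needed_for_one_fte(seconds_saved_per_job: float) -> float:
    """Return how many jobs per year make the saving equal one full-time person."""
    return SECONDS_PER_FTE_YEAR / seconds_saved_per_job


if __name__ == "__main__":
    # Jobs per year at which saving one second per job equals one person.
    print(jobs_needed_for_one_fte(1.0))
    # At that same volume, a five-second slowdown costs five people.
    print(fte_equivalent(7_488_000, 5.0))
```

Under these assumptions, one second per job equals one full-time person at roughly 7.5 million jobs a year, and the "it's only five seconds" change costs five times that.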

We'd like to empower rather than impede. We don't necessarily always know that we're going in the direction of impeding, but we do know when we have a loss in creativity and when people are unhappy. It's all about empowering. If a decision is made and you're lowering the empowerment, you will sacrifice throughput and results.

Now, the assumption is, if you're trying to instill change or you're trying to make a difference, that you have confidence in the business. And this isn't just confidence in the leadership. This is confidence in your strategy, how your product works. If you're not, you're not going to get priority cycles to even be effective.

Now I want to dive in a little bit on change. And change for me, and transformation, is really about fear of loss. And I really think of fear of loss because I hear phrases like, "We've been told this is the last time we're making this change," or "We're making a change in this area," or "The last time we're moving. What I already have already works for us. Why are we mixing things up?" Right?

If you're not armed with a why, or if you're not investing in earning confidence in the value of your shared service, if you're trying for economies of scale, are you really doing the best thing for your company?

I've been with T-Mobile on this initiative for at least two years, and I've always asked myself, were we just late in determining how valuable developer experience is? If T-Mobile had started five years ago, would we be further along in our pursuit of unicorn status? Would we be a unicorn right now? If we had invested earlier, would it have even made a difference?

I've concluded that the answer is way more complicated than that. You can't take a unicorn's playbook and just magically become the unicorn overnight. There's a lot of work to make transformation and a transition successful. You've got people to convince, funding to earn, legacy systems to keep running, anti-patterns and behaviors to break, feelings to probably hurt, architectures to rip apart, network firewall rule changes, policies to challenge, a culture to evolve, and unplanned work to compete with.

If your change or your platform or your product or your service that you'd like to be able to transform is at 90% unplanned work, the entire team has no room to even have a thought on how this could be better.

Back about two years ago, when I first joined, we definitely weren't in the state that we are in right now. We were on our feet most of the time. And I went through multiple pairs of headphones due to 10-plus-hour bridges, where the padded black part in my headphone was starting to wear off like I was getting a haircut every day. We used to be in constant crisis. That's where your hair gets matted down because you've been on a bridge for so long. We definitely knew we needed a new way.

One of our guidance points was John Allspaw's phrase, "Incidents are unplanned investments." And this was our biggest headline and was our fuel to how we were going to change this experience.

Now, the interesting part about the experience for our users is that we had to acknowledge that it was currently a poor experience. That was half the battle. We had to basically say, "Yes, we know this isn't ideal, but this is what we're doing today to make it better in the future."

I feel like transformation is intended to be fruitful for all, but it's painful for some and uncomfortable for most. Now, if we acknowledge that the experience isn't ideal, we're leaving ourselves room for opportunity for improvement.

The thing is, what should this actually be? It changes every day in a crisis. At hour number 10, what this should be is really just, man, I wish I never had to dissect this technology again. But ultimately, your postmortem should actually dictate your priority. What you've established is not sustainable. How do we get everyone together to reflect on how to make it more sustainable?

This is my fourth industry going through a digital transformation. I started in semiconductors, then retail, then automotive, and now telecom, and I keep thinking one day this is absolutely going to get easier, but it never does get easier. Who am I really kidding? Telecom has the same crippling legacy debt smothering our ability to ever improve, just like the other industries. It's like you're always in a hole trying to claw your way out, but the bottom is your legacy quicksand.

There's hope to get out of that hole. Here are some things that worked for us. The overall objective was for us to turn unplanned work into planned work.

The book "Making Work Visible" by Dominica DeGrandis talks about not only how we understand where work comes from, but also how we accept work. Do we pull work? Do we push work? How do we ensure that our capacity limits and our WIP gates for individuals aren't being exceeded? How do we know what value is in progress and what value isn't, right? Transparency is a pillar of transformation in my mind.

The blameless postmortems I talked about before, turn them into investments. Take them as an opportunity where you can take an incident and really find comfort in understanding that your assumptions were wrong. Everyone else's assumptions were wrong. How do we ground ourselves on an assumption that we know is actually right? Or how do we prepare ourselves or our systems in case our assumptions are wrong? Architecture safety.

We took an interesting stance with our customers right away. When we were in crisis mode, we basically raised our hand and said, "Hey, you know what? We screwed up. Things are really, really bad." We struggled to even come up with 80% uptime. We're not even having the nines discussion at that point. Maybe unless it was 89.

We told our customers we needed six hours every single week, and yes, it was going to be early morning business hours. We need six hours every week to make sure that the experience improves over time to harden our old systems. That was really hard for customers to swallow, and I don't blame them. Again, we acknowledge it's not ideal, but this is what we're going to do today to make it better for the future.

We were disciplined in our operations. This seems obvious for a unicorn but may not seem obvious for horses. Take a buddy, pre-check all formal runbooks, peer-check all formal runbooks, make sure you have the estimated time allotted. These are things that probably seem natural to a unicorn and might not necessarily seem as natural to horses.

And don't be afraid to back out. This one is probably one of my favorite methods to turn unplanned into planned, and I'm going to go over a little example on how I explain this particular point.

These were the two-hour early morning downtimes on Tuesdays, Wednesdays, and Thursdays. During one of them, I had woken up, and we had about five minutes left in the downtime. I had known from the previous day, because I looked at the runbook, that the downtime was only supposed to take 30 minutes, and we had actually allocated the full two hours. We had five minutes left in the two hours, and I hadn't seen any sort of celebratory messages, and I hadn't seen any sort of indicators or signals that things were going right.

So I joined the bridge, and I think we've all been here before. And when you join the bridge, you hear phrases like, "Well, we'll see how long that takes," or, "I'm going to try this," and, "Man, what the -- why is that working like that?" Right? You're hearing all of these very negative phrases.

And we had five minutes left in our downtime, and I asked the question, I said, "When do we expect to be able to hand our systems back over to our customers in a known state?" And the answer was, "Well, based off of the napkin math, I think we'll be done by noon." That was approximately four or five hours.

And my next question was: how long would it take for us to get back into a known state if we backed out? "Oh, well, we put that into our runbook, and that will only take 20 seconds." And so I said, "Just back out."

Why are we waiting on a judgment call to do something that we know is unknown, and we keep failing forward, patching, trying to figure things out, patching to see if it works? Your trial and error periods aren't meant for your downtimes. Your trial and error periods are meant for before the downtime. So to get into a known state, don't feel bad about a backout. I actually celebrate backouts just as much as I celebrate non-backouts, because the whole thing was planned, and you've planned a backout, and we've made a commitment, and you fulfilled your commitment within those two hours, so we backed the whole thing out. Don't ever be afraid to back out.
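The two questions asked on that bridge (how long until we are in a known state going forward, and how long if we back out) can be written down as a simple decision rule. This is a minimal sketch, not T-Mobile's actual runbook logic. The `WindowStatus` type, its field names, and the `next_step` function are hypothetical, and the numbers in the usage example are taken loosely from the story (five minutes left, roughly four and a half hours to finish, about 20 seconds to back out).

```python
from dataclasses import dataclass


@dataclass
class WindowStatus:
    """Snapshot of a maintenance window (all names are illustrative)."""
    minutes_left_in_window: float
    minutes_to_finish_forward: float  # estimate; often napkin math mid-incident
    minutes_to_back_out: float        # taken from the rehearsed runbook


def next_step(status: WindowStatus) -> str:
    """Pick the path that returns systems to a known state within the commitment."""
    # If the remaining work fits inside the promised window, keep going.
    if status.minutes_to_finish_forward <= status.minutes_left_in_window:
        return "continue"
    # Otherwise the commitment is at risk: back out if that is the faster
    # route to a known state.
    if status.minutes_to_back_out < status.minutes_to_finish_forward:
        return "back out"
    # Assumption: if backing out would take even longer, finishing is quicker.
    return "continue"


if __name__ == "__main__":
    bridge = WindowStatus(
        minutes_left_in_window=5,
        minutes_to_finish_forward=270,
        minutes_to_back_out=20 / 60,
    )
    print(next_step(bridge))
```

The rule only works if the backout time is known in advance, which is why the backout has to be written into the runbook before the downtime starts.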

We also started to reward flawless execution. Now, I really like ensuring that when things go as planned, let's celebrate, let's embrace it, and that is a huge success for us.

Now, I feel when you've turned unplanned into planned, you can start asking the right questions. Do you know what good looks like? What are our measurements to see whether or not we are successful as a platform, as a service? Where are our bottlenecks? What standards should be enforced, and what standards should be flexible or distributed? Are we actively impeding or are we empowering? Do we have a community of support? Do we have a community of experts? Do we even know what an expert is in this area? And do our customers believe in us? This goes back to the confidence question.

If you have the right set of questions that you know you need to ask now, you can transition into finding the right solutions. This is when you can define the best practices, the ones that you can control, and then simultaneously challenge the best practices that you can't control for more refinement. Think unicorn, think ideal, but also think in iterations.

I don't like the phrase, "Yeah, I agree, but that's not really possible here." Fair enough. Okay. What is the path that we can make it not impossible? What is the iteration?

Determine if you have any large-scale directional movements in people, process, and tools to make during this transitionary period. This is your opportunity to change how things work after you've pulled yourself out of crisis. Treat any sort of feedback that you get as gold.

I have a set of pre-transformation metrics. I have a set of post-transformation metrics. Is the transformation and adoption going as we would expect? Has it been successful? And then what is the quality of adoption? And then what is the quality in your post-transformation with how happy your users are? We measure net promoter score. For anyone that you can turn into a net promoter, you've basically just started multiplying your staff, and you've started to build your knowledge economy of this capability.
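For readers unfamiliar with the metric, net promoter score is conventionally computed from 0 to 10 survey ratings as the percentage of promoters (9 or 10) minus the percentage of detractors (0 through 6). The sketch below shows that standard formula only. The talk does not describe how T-Mobile collects or computes its score, and the sample ratings are made up for illustration.

```python
def net_promoter_score(ratings: list[int]) -> float:
    """Standard NPS: percent promoters (9-10) minus percent detractors (0-6)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


if __name__ == "__main__":
    # Hypothetical survey responses from ten platform users.
    sample = [10, 9, 9, 8, 7, 6, 3, 10, 9, 5]
    print(net_promoter_score(sample))  # 5 promoters, 3 detractors -> 20.0
```

Comparing the score before and after a change is one way to track the quality of adoption alongside raw adoption counts.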

What I think is extremely important also is that you start to move to low-context-switch tools. Low cognitive load, low amount of friction, so that way your developers are throughput-focused. One of the changes we made was around source control and CI/CD. Those were two different things and two different narratives. Well, we moved to something like GitLab and GitLab CI, which have the same narrative, and they were built with a minimal context switch from the beginning.

Things like Conway's Law come into play here. When you have the same company with the same narrative that is building and architecting for flow and throughput, you are able to be more successful from a flow perspective. You're able to eliminate waste and create a better experience.

The last one is core versus context. Gene mentions this in his book: it's this idea where you focus on what makes you special and let someone else focus on what makes them special. And there's a good example of us doing this, actually.

We chose not to run GitLab on premise. And I always ask myself this question: Will T-Mobile ever be able to run GitLab better than GitLab can run GitLab? And that answer was always no. So by taking the context and handing it over to somebody else who's really better qualified and better at it in general, we can now focus on what's core to T-Mobile, and this is T-Mobile's implementation of how we use GitLab.

Everyone who would've been involved in babysitting an on-prem GitLab instance, and I use babysitting, it's not that bad, but everyone who would've been involved in doing that for on-prem can now focus on what is core to T-Mobile: automation patterns, automation instruction, reusability, internal runners, our pipeline integrity, our SOX integrity, any of our integrations with T-Mobile's internal ecosystem. And now our implementation becomes something where we can have a direct influence on what the experience is, and we can leverage the fact that another company can ensure that they hold up their end of the bargain and deliver what they're good at.

There are a couple of lessons we learned along the way. Failed transformations can lead to an extremely productive second attempt. Too many times we fear a loss, but it's not actually a loss: it helped inform the decision in the future, in the evolution.

We also learned that there is a great question to ask anytime you're in the middle of a standards discussion: "Is this the best thing for your team, or is this the best thing for your enterprise?" And I love that this is a conversation starter. The conversation will essentially go: what is being requested is a deviation. Okay, fair enough. Is the deviation something that we should now make the new standard, that the rest of T-Mobile should now adopt, so that we have the economies of scale where we can lift and shift very quickly? Or is the deviation not something that T-Mobile should adopt as a standard, and therefore the team should adopt the existing standard themselves?

This gets people's brains working in a different way and understanding that we're all a part of the same ecosystem, and we all have a responsibility to be a steward of that ecosystem, and it's a shared responsibility.

Transformation fatigue is absolutely a thing. If you have, every single day, multiple transformations in flight, you will be in constant chaos. This means be selective on which transformations you want to do simultaneously, or if you want to do any transformation simultaneously. There's a balance here. How can you be productive and also transform at the same time?

I think it's important to focus on the constraints that can't move, and move the ones that you can move. When I hear a no three or four different times, when I say, "Well, how about we do it this way?" Or, "How about we no longer do it this way?" And I continue to get no's, I realize that constraint probably won't budge and I need to come up with a better solution. I also need to know when a constraint is completely unreasonable, and maybe there's iterative steps to eventually get out of the business of maintaining that constraint.

We also feel that obtaining adoption inertia is part of this unlocking of passion. Almost like the rebellion within The Unicorn Project. There are people who are excited, and they have passion. They may not have been able to actually unlock that passion until you partner with them and understand that this transformation is about unlocking that passion, and we want to create and foster that creativity and innovation.

We just merged with Sprint. We have a ton of work to do in terms of taking these behemoth telecom companies and merging them together, and the technology associated with that, and the change required, and those people, process, tools that are required in that. We need a lot of help. So please, if you're interested in joining us, please go to that link that you'll see there.

I do want to thank everyone for listening. I know that it's been a challenge right now that we're doing things virtually, and I would love to be in London in person. Given the circumstances, I just want to thank the IT Revolution staff, who've done a phenomenal job putting on a conference and taking a constraint that they've never had before, which is doing things virtually, and making the best of it, and making it a successful conference.

So with that, I appreciate your time, and have a good rest of your conference. See you later.