Las Vegas 2024

How SiriusXM Built a Next-Gen Streaming Platform for 34 Million Subscribers in Less Than a Year

How do you transform your flagship subscription service when you’ve got 34 million subscribers, and you need to build a next-gen streaming platform from the ground up? How do you drive a culture of engineering excellence when you’ve got less than 12 months until your Wall Street deadline?

SiriusXM is the leading audio entertainment company in North America. We’re a $9 billion business, with 150 million listeners enjoying curated music channels, on-demand talk shows, and live sports play-by-play. We wanted to do more for our subscriber base, and in 2022 we decided to move SiriusXM from satellite radio to an online streaming platform, by the end of 2023.

In 12 months, we ramped up an entirely new engineering organization, distributed across North America and Europe. Our 40 platform services teams built from scratch a cloud technology architecture that’s event-driven, multi-region, and super slick. At the same time, our ecosystem reliability teams leveled up our culture with observability as code, monthly chaos testing, and engineering scorecards. And in November 2023, we successfully launched our next-gen app, on Android, iPhone, Alexa, and more.

We’re Ben and Rachel, the VP of Services Engineering and the VP of Ecosystem Reliability at SiriusXM. We’d like to share with the ETLS community how we approached leadership at scale, our successes, and our lessons learned. You’ll take away from this session the scaling antipatterns we avoided, how we gave our teams technical alignment with maximum autonomy, and how we built reliability into our platform services through cross-team collaboration.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

So the next speakers are from SiriusXM, which is the leading audio company in North America. Included in its portfolio are the SiriusXM subscription network, Pandora, and an extensive podcast network.

So here to tell a story of how they transformed their flagship product that supports over 34 million subscribers in just 12 months — in a high-profile, must-succeed product launch — are Benjamin Manning, VP of Services Engineering; Rachel Uhrig, VP of Software Engineering; and Carol Gastal, Lead Strategy Partner from Equal Experts.

What's so exciting is how they wired their organization to win by ensuring alignment and autonomy and massively leaning into things like observability and chaos testing. So up next are Benjamin, Rachel, and Carol.

Ben Manning

Thank you. Thanks, Gene. As Gene mentioned, I'm Ben Manning, I'm Vice President of Software Engineering for SiriusXM.

By a show of hands, how many of you listen to music? Podcasts? Do we have any podcasters out there?

All right. According to Statista, 93% of the population listens to music, 40% listen to podcasts. The real question I have for you though is: who is that 7% that does not listen to music? I'm not going to be able to answer that for you in the next 20 minutes, but if you have the answer, I would love to hear about that after this.

Our mission at SiriusXM is to shape the future of audio where everyone is effortlessly connected to the podcasts, channels and music they love. Our product portfolio consists of SiriusXM, Stitcher and Simplecast, AdsWizz, and of course Pandora.

For additional context on the conversation today, what are we dealing with here? 33 million subscribers. 150 million listeners. Grossing around 9 billion in annual revenue. Supported by roughly 1,200 engineers.

To help you orient yourself, here is how SiriusXM is structured organizationally. We have our technology organization. Within the technology organization, from a very high level, we have product and engineering. I fall on the engineering side. Within engineering, you have services engineering, which is what I lead, and then you have my peer who leads client engineering. Client engineering builds the mobile, the web, the TVs, etc., the applications that you all use. The services that we build are the platform that powers those front-end experiences.

Now, why undertake such a massive transformation? Why undertake such a big change?

In today's digital landscape, user behavior has fundamentally shifted. Audiences now expect, and I would even say demand, a digital-first experience with deep personalization. We at SiriusXM really needed to transition from being a satellite radio broadcast company to being a digital experience with full personalization capabilities, not merely confined to the car. We needed to make that transition so that people can access us across a multitude of devices, making the car just one of many platforms that you can listen on.

What were we dealing with? Well, we were dealing with 15 years of tech debt. We were dealing with fragmented environments split across AWS, GCP, on-prem. We had monolithic C++ on-prem driving massive functionality that required maintenance windows — we did not have a hundred percent uptime. In other areas, we had monoliths that we migrated to the cloud; however, we didn't change the underlying primitives, and so now we ended up with a decentralized monolith in the cloud, where any service change now has these cascading failures. So to me, that's actually even more challenging — having that decentralized monolith.

This technology baggage was stifling our business innovation. We really needed the inverse. We needed to make technology the driver for business growth, not its obstacle. The modern service landscape really demands the scalability, reliability, and innovation.

This next-generation streaming app became the cornerstone of our SiriusXM strategy to modernize towards digital-first. It was a critical launch that our CEO announced months before we were even ready to launch and go live. I'm sure some of you have been there: the CEO announces "we're going live," and you're like, "all right, we've got to make this date."

The launch simply could not fail, given its strategic importance. The challenge: a very aggressive timeline. How many of you all have faced a similar challenge, or will face a similar challenge?

So really the question then becomes: how did we enable a transformation in such a short period of time? And that's where Carol's going to go into some of how we made this happen.

Carol Gastal

Thank you, Ben. So how did we do it?

As we reflected on how we enabled delivery of this modern and complex piece of technology, with high reliability while constrained by this ambitious timeline, we identified three pillars essential to that journey.

The first is alignment. When we first started, we needed to build technology that was broadly aligned to our vision — stable yet adaptable, flexible enough to evolve with the product. We needed everyone to be on the same page. And it wasn't really about setting individual or team goals. It was about having a north star, creating a shared vision that every team could rally behind.

And it helps, right, that we have leaders who naturally inspire people to rally around this common vision. But we weren't immune to the tensions that arise from trying to hold onto that common vision and the forces that were pulling us away from it. Did we mention the timeline?

We were, or rather we are, a large organization moving at a million miles an hour. Things just weren't always going to line up for us. We were building our tech architecture while in flight. So we leveraged the concept of aligned autonomy. Teams were extremely aligned. They had the autonomy to make decisions and innovate, but always in alignment with our overarching goals.

We described technical alignment in terms of guidance, expectations, and business consequences, and captured it in architectural decision record (ADR) format, which teams were invited to contribute to early. With that in place, everyone knows and has easy access to the broader direction and the principles that are guiding our decisions, which reduces the mental burden on teams and allows them to focus on execution rather than constantly seeking approvals.

We also worked on creating a culture of openness and collaboration, which spans across how we discuss and evolve our practices, processes, and technical decisions.

How do you do that at scale though? How do you help your identity, playback, and personalization teams, for example, to make independent decisions at speed when they need to deliver a business outcome? Together we do that with solution committees, where all teams that are impacted by an outcome have a voice in the room on how to implement the solution. To mitigate the risk of analysis paralysis, we revisit our north star, check our alignment, and remind ourselves to disagree and commit. We are deliberate, sure, but once a decision is made, everyone commits fully to executing it.

As we empowered our teams with that ownership, we also recognized the need for greater visibility into each team's adherence to core technical principles and best practices. It's a critical factor in ensuring a reliable launch.

Without a centralized way to track our progress, we were risking mis-prioritized efforts and overlooking essential non-functional requirements. To address this, we began by reviewing our list of maxims, five or six items such as "deploy to prod every week" and "automated unit testing"; you would have your own lists. And we worked with our teams to listen to their input and guide their efforts. We ran our initial guidelines through implementation cycles and tailored them to what our teams and our business actually needed.

We then automated our checks. That was the most exciting part of last year. It helped us confidently assess our readiness to launch, and now it provides the organization with real-time observability of our tech services, helping the teams to focus on the areas that need the most attention.
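As an illustration only, automated maxim checks of this kind can be sketched as a small scorecard. This is a minimal sketch under stated assumptions, not SiriusXM's actual tooling: the `ServiceFacts` fields, the `scorecard`, `score`, and `failing` helpers, and the "runbook published" maxim are all hypothetical. Only "deploy to prod every week" and "automated unit testing" come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ServiceFacts:
    """Facts a pipeline might collect about one service (all fields hypothetical)."""
    name: str
    days_since_last_prod_deploy: int
    has_automated_unit_tests: bool
    has_runbook: bool


# Each maxim is a name plus a pass/fail predicate over the collected facts.
# The first two names are quoted in the talk; the third is an assumed example.
MAXIMS: Dict[str, Callable[[ServiceFacts], bool]] = {
    "deploy to prod every week": lambda s: s.days_since_last_prod_deploy <= 7,
    "automated unit testing": lambda s: s.has_automated_unit_tests,
    "runbook published": lambda s: s.has_runbook,
}


def scorecard(service: ServiceFacts) -> Dict[str, bool]:
    """Evaluate every maxim against one service."""
    return {maxim: check(service) for maxim, check in MAXIMS.items()}


def score(service: ServiceFacts) -> float:
    """Fraction of maxims the service currently satisfies (0.0 to 1.0)."""
    results = scorecard(service)
    return sum(results.values()) / len(results)


def failing(service: ServiceFacts) -> List[str]:
    """Maxims the team should look at next, in declaration order."""
    return [maxim for maxim, ok in scorecard(service).items() if not ok]


if __name__ == "__main__":
    # Hypothetical service: deployed 3 days ago, has tests, no runbook yet.
    playback = ServiceFacts("playback", 3, True, False)
    print(f"{playback.name}: {score(playback):.0%} of maxims met")
    print("needs attention:", failing(playback))
```

Keeping each maxim as a plain predicate means the list can be tailored per organization, and the list of failing maxims is what points a team at the areas needing attention.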

Recognizing our teams' knowledge and skills and trusting them to apply it allows us to work in an environment of empowerment and trust. That's our second pillar.

From day one, SiriusXM was committed to building teams with deep technical expertise. We took advantage of the new remote-first reality to bring in the most knowledgeable, talented experts possible. Our high trust in their abilities helped foster a culture of openness to failure and collaboration. We want to make sure that we listen to the people we hired and value the expertise they bring to the table.

We know that diverse perspectives always lead to better decisions, so we needed to move away from a top-down approach and empower the teams to make decisions collaboratively and closer to where the actual work was happening. This decentralization of our decision-making process was only possible because we are aligned on our technical vision. And not only does it speed things up, but it also ensures that our decisions are more relevant and are better informed by those who have the most context.

Then we gave our teams more accountability. We embraced the "you build it, you run it" principle. This means that our teams are responsible for the entire lifecycle of their solutions — from design and development to deployment and maintenance. By giving them this ownership, we empower them to take pride in their work and to make decisions that are in the best interest of the company, the platform, and also themselves.

One example of how it all comes together is our organization-wide Friday demos. These sessions were dedicated to showcasing technical experiments and non-functional, non-product-driven implementations that teams had been working on. The value of these sessions was so significant to the community that after launch, teams quickly reorganized to keep the practice alive for themselves.

Finally, knowing that they have their organization's support empowers teams to be forward-thinking — focusing not just on the immediate needs, but also on how their work will sustain and evolve in the future. This was especially critical in the early days of building a new platform, when we were building the foundation to support a product vision that at that time was not yet fully crystallized.

Today, this mutual trust has cemented a culture of resilience and innovation, which in turn drives both immediate successes and long-term growth.

Rachel will tell you about our third pillar now.

Rachel Uhrig

Thank you, Carol. Afternoon and good morning.

Reliability isn't just a feature. It's the core of user trust. As we embarked on building our next-gen app, we made reliability the cornerstone of our culture. Engineering excellence, the third and arguably the most critical pillar for our success, was the bedrock of our approach. It empowered us to do three things: accelerate development, ensure stable releases, and deliver a predictable and resilient user experience.

As Ben and Carol mentioned — and I'll say it louder for the folks in the back — we had an aggressive timeline. We had to build SiriusXM in 12 months. This did not provide space for errors or rework.

While our efforts across teams were largely decoupled, centralizing and standardizing components of the development process allowed us to move with speed and accuracy, to hedge potential implications of teams doing different things in a million different ways. We platformed common tools, processes, and workflows. But we also took it one step further.

This upfront investment in a well-thought-out platform — and adding in the flexibility for what the future might hold for us — continues to pay dividends today. This began with standardizing service profile templates, extended through operational assets such as runbooks, alerting workflows, and observability. And it ends today with a common understanding and set of products for the way that we not only measure, but we benchmark our application health and performance.

In the short term, this suite allowed us to pick up, move quickly, and reduce a lot of cognitive load, which naturally allowed our devs to focus on specific features and functionality. In the longer term, we're able to mature our entire engineering organization. We're moving from binary success criteria (think "did that deployment fail?" or "is our MTTD down this month?") to quantitative, data-driven decision-making that informs how we better support what we're all here for: our end users.

Think about: our user experience score dropped a percent this month. How do we get this back up? How can we reduce time-to-stream to potentially improve our overall user experience and get our UX back to where it was, if not above?

We not only had to ship product, but we also had to shift a legacy culture in how we build software and think about reliability. We stood up a dedicated reliability team to ensure our product was ready for launch, reliable, and resilient when incidents occur. We created a number of products and programs to drive our engineering excellence culture forward.

There are two honorable mentions outside of what Carol mentioned. The first is our production readiness plan. What you see on screen is our earliest version of that plan, which did two things: it outlined what it meant to be production-ready, and it paved a road to how we get there. We took a beg-for-forgiveness approach with our CTO and ran an overnight stealth mission to graffiti what you see on screen onto his conference room. This rallied our entire organization around the culture we were trying to build, and gave us something more tangible to understand the path to how we get to our north star.

The second program worth mentioning was what we call readiness days. This served as our primary vehicle for shifting how we think about reliability, and more specifically for moving it from a top-down to a bottom-up culture. Think about these as monthly chaos days where we were also reenacting and practicing our launch plan.

We set these dates in our standards firmly out of the gate. Our user expectations and our launch date, as Ben mentioned, weren't changing, so why would our readiness days? We held the line not only on our readiness dates, but also on having only three environments. We were no longer asking which prod we needed to deploy to. Today, we use these days to hammer our systems and improve our teams' response processes. They not only served as a forcing function for all of our teams, but also highlighted where we had gaps in workflows, observability, tooling, and training.

All this to say: if these products and programs don't exist in your organization today, don't be afraid to create opportunities to drive new values and ways of working. And to close this slide, don't underestimate the power of just keeping it light and fun.

Ben Manning

All right, Rachel — talk is cheap. Did this actually work?

Our first goal was to improve user experience. You won't believe this one — again, hammering it through — in the 12 months we had to build from scratch, we launched with zero Sev 1 and Sev 2 incidents.

Not only that, we improved our mean time to detection and resolution by 72% for Sev 1s and 69% for Sev 2s.

We needed to optimize for speed and innovation. In the first three weeks post-launch, we completed more deployments — both new features and bug fixes — than the entire previous year.

Last but not least, and something we would not have been able to do with our previous stack, was enhancing our organizational agility. We delivered against three strategic partnerships within six months of launch. The first is Hilton; for the other two, I would love for you to keep watching the headlines to see what they are.

We enjoyed this past year tremendously, and we're truly passionate about what we're doing and what we've accomplished this year. We're always on the lookout for outstanding reliability talent in the engineering space, and we'd love to hear about your journeys as well. If anybody has any questions or wants to share the troubles that they've seen along the way, we'd love to share the lessons that we've learned. Thank you.