Dr. Christof Leng & Jason Cox (Las Vegas 2022)

Dr. Christof Leng

SRE Engagements Engineering Lead · Google

Jason Cox

Director, Platform & SRE · The Walt Disney Company

An exclusive interview from DevOps Enterprise Summit Las Vegas 2022.

Full transcript

The complete talk — auto-generated from the talk's captions.

All right, Dr. Christof. Hey, great. Great to have you here.

Great to talk to you, Jason.

You know, SRE, right? So why don't you give us a little bit: what is your involvement with SRE? Just your story first. Why don't we start with that and let everybody know your founding, if you will, in the SRE space.

And I got some more questions for you.

When a recruiter approached me in 2013 and asked me if I wanted to interview at Google, I was like, sure, why not? Let's see how far I get. They were like, do you want to become an SRE?

And I was like, come again? What is that? Site reliability engineer? Never heard of that, had no idea what it was about. So they explained it to me.

I was like, okay, I'm not sure I understand. So a few more people explained it to me over and over again, and I finally got the job, and I started there and I learned what SRE is, and I think I'm still learning because it is evolving. Back then, more and more companies outside of Google were starting SRE things. And SRE at Google is also constantly evolving.

But the picture has changed a lot with the publication of the SRE books, with lots of conferences about SRE, and a huge online community. So nowadays more and more people have an idea what SRE is, and they're all contributing to this idea. And I'm very excited to learn where it'll go.

That's awesome.

That's great. And I do think that's a great point, which is: how do you define SRE, especially when you move into a space where maybe you had software engineering in your background, or an operations focus, and then moved into an SRE role? Right now, today, how do you define what you do? Somebody says, Hey, Christof, what do you do for a living?

How do you react to that? How do you tell them or explain to them what SRE is all about?

I think it's very hard. SRE is about bringing the software engineering mindset to production problems.

To try to solve them algorithmically, with principles, and to scale it beyond what we previously thought was possible. And at the end of the day, software is part of everybody's lives nowadays, and it's becoming more and more important. And so I think as an industry, we have a responsibility to make this work flawlessly, and to help people just do the things they want to do without technology getting in the way.

Right.

That's good. Excellent. And if you were to say, Hey, this is what a good day looks like for me as an SRE, how would you describe that? What does a winning day look like in your mind?

A great day is when we collaborate across silos, with the developers, with product managers, with all kinds of disciplines, to identify and solve problems before they happen. The best outages are the ones that you were expecting, but they never happened because you were prepared.

Love that. Love that.

So true. It's great. And on the flip side, because we know we're all in technology, what does a bad day look like for an SRE?

A bad day is when we're being Sisyphus; we're not making any progress. You're just doing operational work, you're fighting outages.

You are making recommendations that do not really stick because you do not have that connection with your partners, for example, with the developers. And so you know it's going to be Groundhog Day over and over again, and then you're not really helping, you're not really making a lasting impact.

Right. That's good.

Well said. I think that from a technologist's standpoint, right, we want to be able to contribute to something that you see has this larger impact. And you think SRE has that potential. You're talking about how the good days are about putting into place the things that prevent the outages, the outages that never happened because of the work you've done as an SRE.

I guess another question would be, thinking about SRE as a discipline, how do we prepare people? Because, back to your whole point, I don't think people necessarily understand or gravitate to SRE, maybe out of college, right? How do we prepare people to be really good SREs in the role?

I think that's a lot about dealing with real-world problems. A lot of what we do when we learn at college and in courses, all of these things are small throwaway problems.

Hmm.

So you understand the problem. You write the code, you run the code, the code produces the result, you're done, and you throw away the whole thing. But that is not how products work.

Mm-hmm.

The really interesting problems only start there, when you pile code on top of code over years, over decades, connected with many, many other systems. And you need to keep maintaining it. You need to keep improving it.

Mm-hmm.

So I think learning about that can only be done when you work on something that is bigger than a single exercise, something that will take years and that will go through multiple lifecycle stages. But it's really challenging to integrate into a curriculum.

Yeah, that's a really good point.

And it really is that product mindset, right? And SREs are part of that. And it sounds like, if we're able to somehow bring along this technical talent that's being forged, right, through their desires in the industry or their desires through school, it would be about getting oriented and understanding how products really do work versus, to your point, just solving point problems. You know, just writing solutions for a problem and then walking on. Understood.

I really like that. I think that's profound. The other thing you said, and I don't know, it just fascinated me: a bad day for an SRE is one of those days where you're just in the throes of operational work, maintenance, support, right? Just keeping stuff running.

Maybe it's incidents. In your mind, as you're seeing it across the teams you interact with, other SREs, what percentage of time is that? The ops side, the more traditional ops side, versus the transformation side, which is writing the automation, writing the code, or helping?

So at Google we have a basic rule that no team, excuse me, no SRE team, should do more than 50% operational work. But that's a threshold.

It's not a target.

Yeah.

So in practice we see most teams doing a lot less, but it varies from team to team. Personally, I think something between 10 and 20% is a good target.

Because you actually do need some exposure. If you do not get your hands dirty, you will not see interesting new patterns in production, new problems, connections between different problems. And if you only do on-call once per quarter for a few days, you get rusty. That's bad for the MTTR, but it's also bad for your psychology, because now it becomes scary again. And being on call shouldn't be scary.

It should be a routine thing. And it's also important to always think of on-call as a learning exercise.

Mm-hmm.

On-call is much less about fixing the service.

Yes, that's part of it. But SREs should be on call because they learn something new about the system, and you need that pipeline. If you do not learn anything new from the system, you're just applying textbook principles to the service without really understanding what matters.

Oh, I think that's profound.

I really do. I think that's profound, how things like incident response management are not just learning opportunities to improve the service. They're learning opportunities for our talent, for the SREs, for the tech talent that's in that space. And if you don't have that, you're not learning.

I think that's really profound. But as we do our job well as SREs, we start to see fewer and fewer of those outages. We're driving reliability up. Imagine that: we're engineering reliability into the system.

How do we then find and discover and explore? What are some things that we're doing, or that SREs should be doing, to be able to gain some of that knowledge, those failure domains that you may not even know are there, buried under the surface? What do you see SRE teams do in that space?

I think data can help.

Yeah.

Data can often help to understand: can I query the system to tell me what all the data sets are that I have? Can I query the system to tell me which of these data sets have backups and which ones do not? But then I also need to apply the business context. If a data set doesn't have backups, is that a problem?

Might be, might not be, depending on what the data is and how easy it is to restore it, how business critical it is. So then you need to talk to the partners, and you really need to understand the systems. It's not just plumbing operations through a pipeline. You really need to understand what it is you're dealing with and how you can improve it where it matters to the users and to the developers.

Often SREs can improve systems in a way that makes developers' lives easier.

Hmm. That's good. I really like that.

Along the lines of learning and finding those fault domains, if you will, of our applications, there's the concept of injecting problems into the system, like through chaos engineering. How do you see that factoring in, or do you see SRE teams driving that? What are your thoughts on how valuable something like chaos engineering is in the SRE space?

It is super valuable in a system that's typically reliable but, when it breaks, will fail disastrously, with large impact across your business: to test that out, to find all of these corner cases that no human can enumerate. With this chaos engineering approach, you weed them out. If your system is already failing all the time, well, then you can use real-world data. But when you reach that level of maturity where things do not immediately break when you launch them, then you have to apply a little bit more pressure, and chaos engineering can help.

That's great. Excellent. Another thing, and we talked just briefly about SREs thinking in terms of the product, right? And that's getting closer and closer to understanding it from the business side, right?

Why we run what we run, why we do what we do. And this whole notion of setting service levels ties directly back into what the business needs to succeed, and being able to measure towards that, set objectives towards that. Can you talk a little bit about service levels and SREs' involvement in establishing that, running that?

So generally I think that it's an anti-pattern when the SRE team is too much involved.

That's because service levels are business requirements. They are the non-functional requirements. Unfortunately, a lot of development teams and product managers often think about functional requirements, features, first and foremost.

Yes.

And then the non-functional requirements, like throughput, latency, reliability, availability, are often afterthoughts. And it often then needs an SRE team to actually put these conversations on the table. And then it's often too late: the system has already been designed, implemented, and launched.

And in these cases, there is not a lot that you can do. But if you have an upfront conversation, how many nines do we need, you can design the system accordingly. If you need more nines than what your original idea could deliver, you build a more sophisticated architecture, and that will prevent a lot of headaches later, things that you might not be able to fix afterwards.

Because changing the architecture of an existing system is very hard. On the other hand, if it turns out that you need fewer nines, then why build a complex architecture? Why make things so complicated? Build something simple, build it fast, it's easier to maintain.

It's quicker to launch. It's so much better for the business, right? So the conversation about SLOs should happen in the beginning, and SRE can really help to foster that culture and to explain why SLOs matter and how to properly design them.

I love that.
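The "how many nines do we need?" conversation is easier to have with the downtime each target actually allows. The following is a minimal sketch and not something from the interview: the helper name `allowed_downtime` and the 30-day window are assumptions chosen for illustration, and it treats availability purely as uptime over a fixed period.

```python
from datetime import timedelta


def allowed_downtime(nines: int, period: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime budget for an availability target of `nines` nines.

    Hypothetical helper: 2 nines means 99%, 3 nines means 99.9%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return period * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        availability = 100 * (1 - 10 ** -n)
        print(f"{n} nines ({availability:.3f}%): {allowed_downtime(n)} per 30 days")
```

Under these assumptions, three nines leaves roughly 43 minutes of downtime per 30 days, and every extra nine divides that budget by ten, which is why the target is worth agreeing on before the architecture is chosen.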

I think it's something you're identifying there that, to me, sounds characteristic of where SRE should be, right? So much of the time, I hear this over and over: it's the new ops, right? And it's cast into this ops world way over here.

The reality is, it sounds like to me, and correct me if I'm wrong, the value is early engagement: at the very inception of products, start to think reliability engineering all the way through. Maybe you could talk just briefly about how you're seeing that work or being executed well. Maybe we have some opportunities for improvement, but where do you see it operating well, and what are all the stages where SREs are involved: product development, early inception, build, engineering, through actual operation? Where do you see SRE fitting when it's done well?

I think the biggest opportunity and challenge for SRE is to shift left, to really be part of the conversations when they happen. It's so much easier to fix the design on a whiteboard than in production. And I think other parts of the software industry have shown that it's possible and how impactful it is: things like test-driven development, security by design.

And I really think we need reliability by design, production-driven development. And with these, we need to educate our partners in the different roles to make this part of their culture. Like 20, 25 years ago, writing software tests, executing tests: most software developers wouldn't have considered that part of their job description, and a distraction at best.

Mm-hmm.

Actively harmful to what they want to do, in many cases. And that has changed, and they see the value it has for them: being better developers, producing better outcomes, producing outcomes more quickly. And I think the same is true for production. SRE should not be a laundry service where you drop off your dirty clothes and then you do not look behind the scenes.

You get them back cleaned. SRE should be more like a football coach that helps you learn, helps you grow, but at the end of the day, it depends on everyone on the team to be successful.

Oh, love that. And I really think that's definitely a direction we need to go, right?

And we think about the future of SRE: the shift left, getting more of the business involved and understanding what it's all about. I guess that's another thing: in your mind, casting beyond the horizon, where does SRE go, right? What's our future state? If you could dream into that a little bit, where do we go?

What does it start to look like? What's the life of an SRE like 10 years from now?

Well, 10 years ago I didn't even know that SRE was a thing. So I have absolutely no idea what it will be in 10 years.

I do think an ideal outcome would be that we look at software less as writing code and more as the overall life cycle. And this sounds profoundly boring when I say it, but I think it can be very, very exciting, especially as the software that we build and that we run is powering so many aspects of our lives. And there is something really exciting about seeing systems that millions of users across the planet are using every single day. And you really want to get that right, and this should be part of the identity of software engineering.

That's awesome. Yeah. Well said. I guess in the last bit, talking about the human aspect of it, right?

I've had the opportunity, and I'm sure you have too, in the SRE community to talk to engineers, and across the spectrum you have some that are super excited, like what we're talking about: we're the future, this is why it matters, what we do matters and makes a difference. I also hear from some that are experiencing cognitive load; they're suffering, right? It's very clear they're in inhumane situations where maybe the ops part of the spectrum, or of the pie chart, is too high, but there definitely is a burnout potential in that space. And I hear over and over again that it's hard to find SREs; SREs are a scarce resource in a lot of organizations.

In your mind, is there anything we can do, from a leadership perspective, those of us who are in SRE, to help nudge this in the right direction so that we're not burning out humans in the process?

I'm going to stop you for... Oh, did we miss... Oh, that's funny. No, we didn't miss anything. Right when you started asking that question. That's great, by the way.

Hey, I'm enjoying this. This is good.

Thank you. I would love to talk more with you about the things that you mentioned when you talked earlier today.

Oh, good. Oh, good, good, good, good.

So good. Yeah. What Microsoft, yeah, it's right when your mic's up. Yes.

Yes sir. Yes sir. We'll do that. 1, 2, 3, 4.

That's right. Thank you all. Yeah. So I asked about making SRE humane, right?

So as we try to nudge it into more of that human side of the spectrum, how do we make this a career where, you know, you can make a difference, right? But you're not getting dragged into the abyss of burnout that can occur because you're in such high demand in all these different companies. Any thoughts on that?

I think cognitive load is one of the biggest drivers.

We are probably in a very chaotic phase of production platforms, where lots and lots of new ideas, concepts, tools, ways of doing things emerge, and, well, engineers love new things and they integrate them. So when an SRE team is working with three, five development teams, it's not unusual that there are 10 different technologies, 20 different technologies at play, just for one thing. And to be effective, you need to understand all of them and be able to help developers with each of them, because otherwise the developers would be just as well off on their own. But there is never enough time to really go deep in all of these directions.

So I think converging on fewer technologies, leaving room for innovation and experimentation but having some established standards, can reduce that and also make it easier to learn. Because if you're going to be an SRE today, if you want to teach SRE at the university, which technologies do you teach the students?

That's great.

It'll change every six months, every year.

And that can be very, very scary and keep people from really focusing on the actual ideas behind that: on the architecture, on the key principles of production, which are really more interesting than how this particular tool works and how it does not work.

Right. Ah, that's great. Well, Dr. Christof, this has just been a pleasure. Before we wrap, is there anything you would love to tell that we haven't talked about? Like, this is a key issue in SRE, we should be talking about this. Is there anything like that that sort of keeps you up at night or nags you?

I think it's education.

Yeah.

And education, knowledge sharing, in both directions. And I don't think that we really have good established principles for continuous learning between the SRE and the product development areas.

How do I structurally make sure that the SREs understand the business and keep understanding it as the business keeps changing? And at the same time, how do developers learn the necessary things about production that make them better developers, and keep up with all of the changes in production infrastructure, without being distracted from the stuff that they actually want to do, which is deliver value to users?

Well said. That's well said.

Thank you, sir.