Lesley Cordero & Jason Cox (Amsterdam 2023)

Lesley Cordero

Staff Software Engineer · The New York Times

Jason Cox

Director, Platform & SRE · The Walt Disney Company

An exclusive interview from DevOps Enterprise Summit Amsterdam 2023.

Full transcript

The complete talk — auto-generated from the talk's captions.

Uh, I have with me here, Lesley Cordero from the New York Times. So glad to have you here. Yeah. It's exciting to be here.

Yes. So, I wanna find out a little bit about your role, uh, New York Times, if you don't mind. I think what would be really good for us to hear is, uh, your path there. Tell us a little bit of your story.

Just wanna hear your story. How did you wind up Yeah. In your current role. Yeah.

So, uh, I started off as a data engineer, um, in my career. Uh, before that, I, uh, got a bachelor's at, uh, in computer science. Um, and then started as a data engineer. Um, did that for a few years.

Um, that was where I really got, uh, a, a huge appreciation for like data problems, which mm-hmm. I'll connect, um, as I, you know, finished what I'm about to say. But, um, and then from there, I went to Google for a couple years, um, and was, uh, towards the end of my tenure, I was, uh, there for the switch to remote. Um, and this Was at Google, correct?

This is at Google. Okay, got it. Yeah, yeah, Yeah. Interesting.

And so, yeah, that was more towards the end of my tenure. So the last like six months was me dealing with that shift. Um, and the reason that becomes relevant is because during, uh, I was on the Google Classroom and the Google for Education, uh, team. And so we saw the switch to remote, um, with schools, which was a pretty big deal.

Um, and so that's actually when I really started to get into reliability management, um, because those were, those suddenly became like our biggest issues, right? It was no longer that much about feature development. We were just worried that like when New York City would switch over to, you know, Google Classroom, that things would just go down, right? Oh, right.

It's gotta Stay, right? We just, uh, and like, to give you a sense of the scale, the increased scale that we got, it was from, we were around 45 million to over a hundred million users within, you know, a matter of a month. Um, and then that grew further to, I think, beyond 120 million users. So, you know, it was just like a massive shift.

Um, and just like being there was, uh, one unexpected and I just, I learned so much from it and yeah, definitely gave me appreciation for the perils of, uh, managing production. I, I, for sure. I I love that. I think there's, that's anybody that's in technology, right?

There's, there's that moment, right? You realize, oh, this is real, right? This is gonna be real. And those things become sort of crystal.

Um, I, I would love to hear like, what are some of those lessons? Like you picked up, there should have been some nuggets, right? Oh, sure. I'm sure you have some things that you cared for.

What was that that you learned? Alright, this is, uh, was it how to do, uh, the, the reli reliability aspects of it. Mm-hmm. What were the, the components?

What did kind of Yeah. Component look like? Yeah, I think, uh, a few things. I think the most compelling one though, to me was, uh, like I grew an appreciation for foresight, um, for sure.

Mm-hmm. Um, and, you know, the, there was no world, I think in which any of us predicted the pandemic, but it was interesting to see that the things that I advocated for us to do in anticipation for, you know, scale and whatnot, paid off Nice. Um, and so, you know, it's, uh, and in the, when you're justifying that work, it can be hard to, it's, it can be, it's super hard to get buy-in, right? Right.

And so, uh, it was just like a huge moment of validation for sure. And I think I've taken that definitely throughout the rest of my career, um, is I think like, yeah. Uh, having a strong sense of foresight. Yeah.

Um, and then in terms of more like technical things, uh, we were, we, well, I assume they're still using Spanner, um, so Oh, yes. Which, uh, uh, did not ever expect Spanner to, uh, I mean, Spanner is such a, like a beautiful database technology, and so seeing, like, it, seeing the cases where it doesn't hold up was just something I don't think we expected to see as much as we did. Um, and even still, right, it did, you know, a pretty excellent job. It's still, um, getting us through because especially relative to our competitors, uh, the amount of downtime we saw wasn't, wasn't actually too bad.

Um, I will say New York City, when New York City went, uh, to when they started on that Monday to remote, we did have a little bit of downtime there, but relatively speaking, we did pretty well. Um, and, uh, yeah, I, uh, other challenges I worked on were onboarding, uh, school districts or countries onto Google Classroom. Interesting. Yeah.

Okay. So, Uh, Yeah, the country of Mexico was like one of really, yes. Yeah. Yeah.

That's incredible. Um, and so it was, I was working with, uh, uh, some of the district leaders and like, their IT people, and it was also really interesting to see like the difference in how we approach technology and whatnot. Yeah. Um, so yeah, just, uh, it was definitely a time where I learned a ton.

That's interesting. I mean, what I think is fascinating about that on the, the foresight, there's the, that's the technical, right? You really were understanding and trying to put plans into place to handle the scale. Right?

And you knew that needed to be there. Um, but then when you look at like the, uh, like Mexico, right? Mm-hmm. As coming onto the platform, it sounded like dimensions of that human interaction.

Like, so that change, there was nuances of that. How, so was, uh, the challenges related to that? How did you trace that back to technology? Or was there anything specific you say, this changed how we did this because you had those conversations?

Yeah. I think for us, thinking about, we were already starting to think about, uh, large school districts and what that experience would look like for onboarding and whatnot. Hmm. Um, and a lot of the ways we dealt with that was all through, through partnerships.

Um, so, uh, I was working on integrations at the time, so the Google Classroom API was like, really the, my kind of baby. Um, and so, which was awesome because I also got to work with tens of ed tech companies, um, seeing them how they're dealing with the shift too. Right. Um, and so, you know, everything, the normal rules of social interaction and like how we collaborate, were just kind of out the window, not in a negative way, just like, you know, it was, it was reset.

Yeah. It was just, you know, if we had a phone call that we had to do late night, you know, that was the, that was gonna be the time that we did, you know? Right. Especially with time differences and whatnot.

So, um, and I think like the way it ties to technology is I always approach it from a socio-technical perspective. Mm-hmm. Um, I like that. Yes.

Which, uh, I think Liz Fong-Jones coined that term, or, um, at least in like a tech, like the software engineering world, I believe it was existing, but Yeah, I think she popularized it. Yeah. Would you, would you actually, like, so everybody sort of understands what that Yeah. Do you want to talk about, like, you know, how you interpret that?

Maybe you could Yeah. Give a summary of what that looks like. Yeah. So to me, uh, it really comes down to thinking about or acknowledging that what it means to develop and operate software is inherently a people and a technical problem, and that the two can't really be like separated, right?

Mm-hmm. Um, and so the idea of like socio-technical systems, um, tells us that like, instead of, for example, when we're thinking about incident management postmortems, we're not just thinking about like, the technical aspects of what went wrong. We're thinking about like the systemic, you know, what led us to that problem. Mm-hmm.

Right? Was it, you know, a was it a cultural problem where we didn't prioritize reliability as much? Right? Right.

Um, was it because we didn't have enough safeguards, you know, in place, um, for when we were, uh, pushing out changes? So that's, that's kind of the, the general idea. Um, and I definitely prefer that because I think it's actually the, it's the only way to truly be, we talk about, you know, blameless culture, and I think it's hard to be that without this sort of interpretation. I think that's powerful.

Um, and I can see that like so many areas, right? Mm-hmm. I mean, so to your, to your point, I mean, that, that was a, uh, a very visceral, practical learning, right? That, that reinforce that you can carry for your whole career, really, frankly.

Right? Yep. So I love that. Love that.

Um, so what was the, uh, talk about kind of the, where did that end in terms of like, here you're doing all this work with these classrooms? Oh, I guess there was one technical question I had related to the geography because it expanded. Mm-hmm. What were some of the other technical challenges?

Or was it because Google was already global, that was not a problem? Yeah. Or did you have to make some considerations based on the expansion to like Mexico versus New York? Yeah, so we did, luckily, you know, we, it, we were already global.

Um, so Google Classroom was already being used in other countries and whatnot. So we were thinking about those types of things, luckily, yeah. Um, the thing that I didn't expect for us to encounter was, uh, so, you know, Google overall was kind of like, okay, what do we prioritize in terms of like, reliability, like what products, right? Google Meet was of course, you know, top of the list.

Um, and Google Classroom was next though, right? Mm-hmm. Um, and so, uh, you know, we were prioritized in terms of like compute and whatnot. And, uh, there, you know, I didn't expect there to ever be a question of like, do we have enough compute to support this?

And so seeing that was like arguably still the wildest thing I think I've ever seen. Yes. Right. Um, like, you know, like to, for Google to be concerned about that, it's just like not something I was expecting, and that specifically, like advocating for the compute and whatnot was above my pay grade at the time.

But, uh, it was still interesting to see, see that happen. And, um, yeah. To that point, I, you know, the leaders that I, the tech leads and leaders that I got to interact with, I learned so much from them. And, uh, yeah.

It's, uh, now I get to be those people, which is kind of weird. But Yeah. There's an interesting observation. Tell me if there's anything here, but I mean, just listening to you talk about that, you were leaning on the platform.

Mm-hmm. Right? And you, the assumption is it's just gonna be there. Right?

And I think that's the truth, yet, if we're, we're understanding the business and then responsible for the business mm-hmm. There are those times, even as the abstraction is there, there is a reality under the abstraction that maybe you need to have awareness of, right? Mm-hmm. How do you manage that, even from a platform engineering perspective, when you're responsible for the platform, how much do you make visible?

Because they need to know, especially if there's Yeah. Any risk there that could be exposed later on that you wouldn't be able, Right? Right. Yeah.

Yeah. That's a great question. Um, so, you know, I think I, so yesterday when I had my talk, I talked about platform engineering. Mm-hmm.

Um, and I talked about the three pillars of, you know, principles, support, um, and architecture. And I think we tend to overfocus on the architecture part, um, and not enough on the support portion, um, which, uh, you know, is how we ultimately collaborate with, you know, our users. Um, and, uh, so I think that's where we really, you know, like that's where that work really shines. Mm-hmm.

Um, in terms of architecture perspective or like API design and whatnot, like, definitely believe in like simple interfaces as much as possible. Right. Um, it's just, you know, we have to think about adoption and it's just table stakes, but, you know, I, you have to hold that in tension with what you just said, right? About when is it, how and when is the right time to expose the underlying details.

Right. Um, and so I think this is something I really see with observability, especially because I think there's a huge mind shift going on with observability right now. Like, no longer are we just talking about monitoring and alerting, right? We're talking about distributed tracing a lot.

Um, logging is no longer like the only way we approach, um, we approach debugging systems. And so, you know, uh, at both my last roles, I've definitely had to participate in like, making that shift with everyone. And like, you know, you have to meet people where they are. And so definitely a lot of like, education has to go into, go into that.

And, you know, again, it's like a balance of, you know, how much do you spend doing that versus how much time do you spend, you know, doing, you know, hands to keyboard work. Um, but I think if you have a culture where learning is like a natural part and where collaboration is, you know, also a natural part of your culture, it happens, it happens super fluidly. It's just like a natural consequence of just being a, being a coworker and whatnot. Yeah.

Um, and then, you know, of course there's like the intentional parts that are also very important. Yeah. Um, so I love that. And, and, and, and not to you back up just a little bit mm-hmm.

But maybe, maybe we would want to say, and kinda explain a little bit about what platform engineering is, you hit the pillars, which I thought was great talk, by the way. Thank you. Highly recommend everyone should check out the talk, which I think it was recorded. It is, right?

I, I'm not sure if it was recorded, but 'cause of the power it was, we were, so it should be recorded, so that's wonderful. Okay. Um, but yeah, do you wanna just talk a little bit about platform engineering? Like how do you, uh, not just the elevator speech of it, but Yeah.

What is platform engineering? Yeah. Yeah. So platform engineering, um, is basically like hyper-focused on what it means to build an internal developer platform.

So your users are actually, you know, product engineers within your company. Um, and so, which is, you know, very different from being on product engineering where you have, you know, end users that you probably aren't interacting with as much, um, depending on the type of product you have. But, um, and so it's interesting because again, like your users are your coworkers. Yeah.

And so, um, and so, you know, but the goal of platform engineering, at least to me, and this is what I say in my talk, is like to build a sustainable organization. Um, and what that looks like, again, is like by taking on the cognitive load, similar to like DevOps, where you take on the cognitive load of all these different aspects of what it means to manage software, right? It's not just, you know, writing, uh, API code and sending it out, right? You have to think about testing and you have to think about observability.

You have to think about like, deployment and all these different components. And so, um, the idea is like just acknowledging that it is unreasonable for any singular product engineer to take all that work on. And so in a way, we're like compartmentalizing, you know, who has to think about this all the time? Because ideally product engineers are focused on like business impact.

Um, and so yeah, I think that's like a pretty good summary of what the goal is, is, um, I can also expand, but I think that's great. Yeah. Let me, let me, so, um, I mean the, the whole point, right? In my mind, I think this is real.

You did a great job. It's that there's that socio, um, technical angle of the fact that we take a team that's needing to deliver something for a business, and there's this full stack of knowledge required to be able to deliver that. And we're saying either full stack engineer, which is really inhumane, right? Yeah.

Full stack team, which also is a lot of cognitive load. Mm-hmm. Across that team. How do we remove that?

Well, part of it is you abstract away. Yep. A bunch of it. Yep.

So the platform is that, right? It's sort of mm-hmm. We started with things like, oh, you don't have to know how sort of the chips and, and, and Exactly. And feeds work anymore than it's e ai.

Yeah. That's an API. Slowly we're, we're coming up the stack and layering on these platforms that you leverage, right? Mm-hmm.

And so platform teams are the ones that are building those Exactly. Right. Managing those. Yep.

So that's great. That's great. I, I really like, um, uh, I did like your pillars, and I think you covered a few of those already. Yep.

Is there anything else you wanted to highlight? Like, and sort of going into detail on like, if you're going to do this, which we're recommending, right? Mm-hmm. I think that's a good thing for a lot of people to do, a lot of different organizations.

What are some of the, uh, as they think about platform engineering, what should they be focusing on? Yeah. Yeah. So, uh, you know, I talked a little bit about support, um, a little bit about architecture, but the principles part, I think like, uh, is one of the most important parts because it is what colors the rest of it.

Um, and, uh, you know, since we're here at DevOps Enterprise Summit, um, for me, it's, it's very much influenced by, uh, the DevOps principles, right? I referenced the CALMS framework. Yeah. Um, and, uh, you know, I think there is an unhealthy narrative right now that's going on about how DevOps might be dead, and I just don't buy that, um, at all.

So, um, and I think, you know, DevOps is being evolutionized, but, but it's absolutely here to stay. Um, and, uh, I think platform engineering, one of my favorite things about it is that I do think it's built off of the principles of DevOps. Um, and so the ones I think that I, the ones I highlighted on yesterday were, you know, culture Yeah. Um, were automation and, uh, measurement.

Right? Um, and I could have expanded it further, but of course, time constraints, um, right. But, uh, that culture part is like, I think really, uh, the core of what it means to do platform engineering, what it means to do engineering, just period. Yeah.

Love that. Now, and, and because of that, it's important, the culture, all those aspects, like, and it can change. Um, you've moved from Google to the New York Times. What was the delta there?

Like, what was the different cultures you're going into, new places, anything learnings in that journey that you had? I'm just kinda curious. Yeah. Like what, Yes, definitely.

Um, so I've been at some, at companies. The companies I've been at were different, very different sizes and mm-hmm. Uh, so I was at like a startup where, you know, there was, um, well, my data engineering role was actually, it was just, it was few. I think there was 15 employees.

Mm. Right? Yes. And then, uh, another role I had that I actually didn't mention it was, you know, about 80 engineers, um, so medium startup, whatnot.

And then Google, which like, you know, there's hundreds of thousands of employees I believe now. Oh, yeah. There was, I think, yeah, we had just crept over a hundred thousand I think. But anyway, right.

And then now I'm at an organization that, uh, you know, we have, I think we're nearing 500 engineers, and so it's been interesting to see like what it means to do platform engineering also at each of these. Yes. Yeah. Um, and, uh, which is, uh, something I've been thinking about a lot actually.

Um, and, uh, you know, like when you're at a smaller company right? You don't really need a platform engineering team yet. Um, and, uh, but I think it's about where you start to hit, uh, what is it, Dunbar's number. Oh, Interesting.

That's when you, I think platform engineering really becomes important, but that delta from, you know, the small company to that, uh, you still in a way have to, you know, that foresight I mentioned earlier, right? You have to think about foresight and consider like, is there a world, hopefully there is a world in which we grow enough that we do need hundred, um, you know, tens or hundreds of engineers. Hmm. Um, and so how do you build your architecture, um, in a way that will naturally evolutionize to that, right?

Um, and, you know, modularization, um, you know, don't start off immediately from microservices, right? But Right. You can build a monolith application in a way that will age well. So in case you ever have to do that, that shift, right.

Um, and, uh, I think, you know, along the way that becomes more true. It's like just a, it's a game of anticipation for sure. Um, I think it's like That for foresight again, right? Yeah.

Yeah. That's great. Well, so, um, that's interesting in the sense that you think about our, um, the, the who, maybe even the audience here, right? Mm-hmm.

Could, uh, the DevOps Enterprise Summit, it's gonna vary in the size of the organization, where they're on the maturity. So Yeah. Any advice like you, you made that, that that transition smaller organization now? Yep.

Like if you're at a 400 mm-hmm. Right. Technologist organization, what does that, what, what should you be doing? You're talking about getting the stuff in place for that, but what are some just practical things you've had to do?

Yeah. So right now, so I think because we lacked foresight in the past Got it. Um, we are now reconciling the consequences of that. Mm-hmm.

So that's really a lot of what my role looks like now, actually. Yeah. Okay. Um, we have this, uh, what I think is a strong platform vision, um, but, uh, you know, getting from where we are now to there is definitely going to be a long process.

Oh, yeah. So, um, those are like the more specific technical challenges that I'm seeing. And, and I think it's very common to fall into that. Again, if you don't, if you aren't intentional, which was, I think, a huge statement in my talk yesterday, right?

Um, like the team that I was on where it was only 80 to a hundred, um, engineers, uh, it was very clear that the engineers had been intentional with the platform engineering because yeah, we were at a place where in retrospect, it was very mature for, um, for, uh, you know, the type, the size that we were at. Um, and so definitely like a lesson learned is, uh, just lesson validated. I guess.

It's like intention is very important, um, and, uh, if not, you will like, it will come back to haunt you, you know, that's what tech debt is. Um, and so we don't, I think we don't think about this as a form of tech debt a lot, but it really is, right? Yes. Like the lack of centralization, um, in how you do things, it's gonna lead to so much drift.

And at some point you're either gonna have to deal with it or suffer the consequences of it, which is going to be like lower velocity, for example, because Right. You know, in a microservices architecture especially, you have to, you own your own services, but you have to think about the other services too, indirectly, right. Or directly sometimes. Right.

Um, and so, you know, that drift can really make that collaboration process, uh, so much more difficult. Right. Uh, that's, that's well said. I absolutely, well said.

And I think if, as we're talking to those that are wanting to go down this journey or Perfect, right? So they're discipline, that space. Um, thinking about the different dimensions of using platforms, knowing those that are using your platforms, again, the sociotechnical aspect of like knowing your customer Right. To some degree on that.

Yep. Like, um, what other, is there anything else like we would wanna impart, like to anybody else, or you'd wanna tell, like, to anybody starting this maybe their journey Yep. Out of zero. Like, ah, we need to build, and maybe we're getting told, go build a platform engineering team.

Yeah. What does that, what is it, what would be some of your advice for anything that Yeah. Would be good tips? Uh, one of the most common things I see is, uh, so I talked about this tension of build versus buy versus reuse.

Yeah. And, uh, the first two, I think are obvious reuse, I'm really referring to like using open source, for example. Um, and, uh, it's so tempting to just build your own things from scratch, and it's one of the worst patterns I think we encounter, we encounter as platform engineers, right? Um, just yesterday I was talking to someone and they were like, oh, I really loved your talk.

I really wish we were there, um, you know, organizationally. Um, but the thing we're suffering from right now is like, we've built everything from scratch, and I know this is gonna fall apart at some point, but how do we even go about solving that problem? Um, and so, you know, again, it's, it's very tempting to build your own software, but it's, uh, it's usually not sustainable. Um, and, uh, you know, I think I remember, I forgot which talk it was that it was, but there's a talk where they say like, everyone loves their own software more, right?

Oh, yeah. Um, yeah. Um, and, uh, you know, I think, you know, try and, you know, get away from that mindset, you know, it's, and it's hard because, you know, we want to have like creative freedom. We want to have those spaces where, you know, we're learning and whatnot.

But, you know, I think for me, uh, software engineering I think is such a creative field. I, I've never found it hard to find an opportunity to learn. Right. Um, and so I guess like, you know, my pushback is always like, you know, there are creative ways you can learn.

There are so many ways to learn things. And so, yeah. But I think that's like one of the, that's probably the biggest nugget I would say for people starting off, is like, don't, don't build your own things. That's Right.

Right. Don't bias towards that. Yeah. Out of the gate mm-hmm.

Bias towards looking at using a platform versus that. That's interesting. Yeah. I think that's really good advice.

Really sage advice. So the one thing I would love to hear about, just love to hear about, I know you were involved in sort of the election and sort of setting up New York Times to be able to support, right? Mm-hmm. During the election and some of the system.

Would you talk a little bit about that? Like what was your team's involvement in that? Yes. Those aspects and what was it?

Tech Yeah. Yes. Related to supporting that. Yeah.

So, uh, I'm, I'm a tech lead for the observability team. Mm. Um, and, uh, we were doing midterm elections actually in the beginning, more towards the beginning of my tenure. So at that point, I was more observational, but it was very interesting to see how much time and preparation goes into it.

And now I'm feeling it because, uh, I guess we kind of, in some ways parallel like the ways politicians like will announce a year and a half that they're gonna run, um, because, you know, we are preparing now already. Um, so, you know, part of the, uh, H2, um, strategy, uh, and priorities, and then also, you know, going into 2024 is, I mean, election's gonna be the most important thing for us with respect to, you know, reliability management. Mm-hmm. Um, so we're already prepping for that.

And, you know, what that looks like is, uh, thinking about the things that we need to be prepared for that. Right. Um, especially as we are evolutionizing our platform, um, the idea of our platform is to make things more reliable as well. Right.

That's definitely what I spend my time thinking about. And so, you know, the component I'm really thinking most about is observability, since, you know, I am the tech lead for that domain. Right. Um, and, uh, for us that means we're doing a push to add distributed tracing to all of our services, um, using OpenTelemetry.

So we're like going through a huge shift right now, and it, um, you know, with, again, with the goal of the election, um, and, uh, I, you know, I already know, I told my friends that, uh, that that like quarter before, it's gonna be really interesting as we prep for it. Uhhuh. Yeah. Um, and, uh, yeah, there's so many components that we have to manage, and we do have, um, you know, a lot of, uh, project management and whatnot, support too.
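[Editor's sketch] For readers who have not worked with distributed tracing, the idea discussed here can be illustrated in a few lines. This is a minimal, standard-library-only sketch and is not the OpenTelemetry API or anything the New York Times runs; the `Tracer` and `Span` classes and the span names (`render_article`, `fetch_story`, `fetch_ads`) are hypothetical. It shows the core data a trace carries: one shared trace ID per request, plus parent/child links between timed spans.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One timed unit of work inside a trace."""
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique to this span
    parent_id: Optional[str]  # None for the root span
    name: str
    start: float = field(default_factory=time.perf_counter)
    end: Optional[float] = None


class Tracer:
    """Hypothetical in-memory tracer (illustration only, not OpenTelemetry)."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str):
        # A new span inherits the trace ID of whatever span is currently open.
        parent = self._stack[-1] if self._stack else None
        current = Span(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            name=name,
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer()
with tracer.span("render_article") as root:
    with tracer.span("fetch_story"):
        pass  # stand-in for a call to another service
    with tracer.span("fetch_ads"):
        pass  # stand-in for a second downstream call

for s in tracer.finished:
    print(s.name, "parent:", s.parent_id, "trace:", s.trace_id)
```

In a real system the trace ID also has to travel across process boundaries (for example in request headers), which is the part a standard such as OpenTelemetry takes care of.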

It's really just like a cross cutting, like everyone is thinking about it. It's not just my team. Everyone will be thinking about it at some point. Just to Highlight maybe this, so I understand like the criticality, like, uh, the system, your, your observability will be monitoring.

Uh, what is the expectation in terms of service levels like that the business has placed on it? Yeah, so, so one thing we're actually going through right now is we have SLOs, individual teams have SLOs. Um, our platform has some SLOs. Um, what we're missing is Sloss at like the entire organizational level.

Oh, Interesting. Yeah. Yeah. So that's like one thing we're thinking about a lot, especially in anticipation of the election.
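[Editor's sketch] One way to see why team-level SLOs do not automatically add up to an organizational SLO is to do the arithmetic. The sketch below is an assumption-laden illustration, not the approach described in the interview: it assumes a reader request needs every listed service to succeed and that failures are independent, so availabilities multiply. The service names and targets are made up.

```python
from math import prod
from typing import Iterable


def composite_availability(slos: Iterable[float]) -> float:
    """Availability of a request path that needs every service to succeed.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(slos)


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO target."""
    return (1.0 - slo) * window_days * 24 * 60


# Hypothetical team-level targets, for illustration only.
team_slos = {"cdn": 0.9995, "article_api": 0.999, "ads": 0.995}

org_level = composite_availability(team_slos.values())
print(f"implied end-to-end availability: {org_level:.4%}")
print(f"implied 30-day error budget: {error_budget_minutes(org_level):.0f} minutes")
```

Under these assumptions the end-to-end figure is lower than the weakest individual target, which is one reason an organization-level objective has to be set deliberately rather than read off the team-level ones.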

Right. I'm one of the, one of the things I mentioned is like that foresight, that anticipation, the intention, and that those are the things that I think will really make or break mm-hmm. You know, how we, um, handle this. Uh, I wasn't there for the 2020.

I imagine a similar kind of like chaos was also happening. Um, and so yeah. The, was, uh, yeah. Anything I can elaborate on there.

Yeah, I, I think that's great. Would you, maybe, maybe what would be helpful is like this, um, uh, and you probably can't necessarily go in detail, but this is massive scale. Mm-hmm. We're talking about large scale.

Yep. I'm assuming is this supportive of like, uh, all the workflow, editorial workflow or whatever that is Yeah. And the, all, all the readers that are gonna be hitting this in real time, and as it ramps up Yes. In terms of the news, you're gonna get to see this flood of mm-hmm.

Sort of incoming traffic. Yep. Um, talk a little bit talking about tracing mm-hmm. Traceability piece, like, uh, explain what insights from a business standpoint that provides, right?

Yeah. And how you're, how you're doing, you're talking about OTel and things like that. Mm-hmm. Maybe we even unpack some of that, because I don't know if everyone is up to speed on Yep.

Some what that would mean. Yeah. Right. Yeah.

And where they should be looking at it. Yep. Yeah. Yeah.

So for us, I, so we're embracing the OpenTelemetry standard, um, because, uh, in that like build versus buy versus reuse, right? I mean, we're kind of doing a little bit of each, um, because, uh, you know, on one hand we do have a vendor. We are on Datadog, um, but then we are using OpenTelemetry, but then we need to make sure that standardization is a thing. Um, and so, you know, we're kind of managing all these different things.

Um, and like the business impact of that is if we are experiencing downtime, right? Downtime means lost money, right? Oh, yeah. Um, you know, for us, we have ads, right?

Uh, an hour of, I remember we had an incident that was, I don't even, I think it might've been just a couple hours and we lost like over a hundred K, right? Oh, yeah. That's like, yeah. Um, and so that, those sorts of things are very easy to suddenly justify your job, you know?

Right. Um, and, uh, you know, in general, operational costs is something that my team thinks about a lot. Um, it's so easy to spend on cloud costs and, um, observability costs, which I think is like a really hot topic right now. Um, and so, right.

You know, it's, uh, there's many angles to go from this. Um, and for observability, again, it's like the more observable your systems are, the more equipped you are to actually debug them. Um, and if you can debug them better right then mm-hmm. Ideal, then hopefully that should lead to less downtime.

Right. Which means less money lost. Mm-hmm. Do you have any examples of, like, where it sort of saved the day?

Do you have any of those that have shown up yet? Yeah, so, let's see. Um, for us, I'm struggling to think of a time at the Times because I haven't done as much incident management as I thought I would be, actually. Oh, okay.

Yeah, yeah, yeah. But definitely, uh, in the past I've seen it, you know, cut the amount of time by, like, hours, right? Right. There's been times where, uh, an incident would've taken probably, you know, twice as long, and it clearly cut that because we were able to get that like aha moment.

Right. Um, especially when you connect it to monitoring, actually, this is a good example, right? It's like monitoring, right? Right.

Um, I've talked about observability, right? But with observability, the way to act on it also is through monitoring and alerts, right? Um, and so if you have a good monitoring strategy, um, which is enabled by observability, right, then you're more likely to catch issues. So really, like, honestly, catching issues that you wouldn't even see, that you're not even aware of, is like one of the biggest impacts that you can have with observability.

Oh, that makes sense. That's great. It's hard to know what you don't know. And so, uh, you know, I think that's like one of the most powerful parts of observability.

I love that. That's great. Talk a little bit though, um, I'll wrap up with this. What I think would be really important to hear about is, um, observability as a platform.

Mm-hmm. Right? And that's what it is. And how does that sort of team function, how does your team function?

How do you know what to run after? Like, what sort of is that pull, right, for you to know? Clearly there's the election, and, you know, you have these large-scale applications that have gotta be supported for that, but yeah, how does the team operate? Yeah. The structure.

Yeah. So we have, um, a few, a few sort of priority areas, right? Mm-hmm. We are, um, we fall under operations engineering.

Um, yep. And so part of operations is cost management. We've done a lot of work with cost management, actually. Mm-hmm.

Um, because, uh, you know, we weren't intentional about it in the past. So we were essentially, again, reconciling, I guess, that like technical debt, um, but it's actual debt. Um, and so, uh, that's been a huge focus area for us. Um, now that we're getting to a better spot there, um, we've been thinking a lot about how to support runtime development with observability.

Oh, sure. Yes. So, um, you know, I think for us, especially because, uh, our organization's very infrastructure focused, um, which is absolutely an important part of observability, right? But, um, ultimately, right, you're debugging your applications, and so you need observability at that like application service level.

Um, so for us, we've been thinking about that aspect a lot. Um, and what that looks like is, uh, so standardization is huge for us. Um, partially because, uh, we wanna follow the OpenTelemetry standard. We trust that they know what they're doing, and we highly advocate for their work.

Um, but there's also like, you know, an NYT-opinionated aspect of it where it's like, what attributes make sense for all of our traces to include, right? Um, so that way traces are actually connected to each other and there's like a level of predictability in the data that you're gonna be sifting through. Um, and so we've taken a, like, modular approach, like, uh, inserting standards into the libraries, um, balancing that tension of flexibility and, uh, standardization. But, um, so it requires a lot of intention and design thought.
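
[Editor's note: the idea of an opinionated set of attributes that every trace must carry can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the Times' library and not the OpenTelemetry API. The names `REQUIRED_ATTRIBUTES` and `build_span_attributes`, and the specific attribute keys, are assumptions made up for the example.]

```python
from typing import Dict, Mapping, Optional

# Hypothetical organization-wide attributes; the keys are illustrative only.
REQUIRED_ATTRIBUTES = ("service.name", "team", "deployment.environment")


def build_span_attributes(
    standard: Mapping[str, str],
    custom: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Merge the required attributes with optional service-specific ones."""
    # Standardization: refuse to build attributes if a required key is absent.
    missing = [key for key in REQUIRED_ATTRIBUTES if not standard.get(key)]
    if missing:
        raise ValueError(f"missing required trace attributes: {missing}")

    # Flexibility: teams may add their own attributes on top.
    merged = dict(custom or {})
    # Standard values win on conflict so traces stay comparable across teams.
    merged.update({key: standard[key] for key in REQUIRED_ATTRIBUTES})
    return merged
```

[In this sketch the required keys are enforced while custom keys pass through untouched, which is one simple way to picture the tension between flexibility and standardization described above.]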

Um, and then, you know, the idea is that we take care of like the auto instrumentation, for example, and the engineers can just like import it very easily to their applications. Out of the box, it's done. Um, at the same time, we are also thinking about, uh, you know, how we enable them to actually understand telemetry, um, because there's an interpretation portion, um, and that's where that support layer of platform engineering comes into play. So, um, again, we're like going through the framework shift, um, the mindset shift in terms of like, how do you think about observability?
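
[Editor's note: a rough picture of "the platform team owns the instrumentation, engineers just import it" is sketched below. It uses only the Python standard library and an in-memory list in place of a real exporter. It is not OpenTelemetry's auto-instrumentation, and `traced`, `RECORDED_SPANS`, `render_article`, and the service name `article-api` are all hypothetical names chosen for the example.]

```python
import functools
import time
from typing import Any, Callable, Dict, List

# In-memory stand-in for an exporter; a real setup would ship spans to a backend.
RECORDED_SPANS: List[Dict[str, Any]] = []


def traced(service_name: str) -> Callable:
    """Hypothetical decorator a platform library might expose to app teams."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Record one span per call, whether it succeeded or failed.
                RECORDED_SPANS.append({
                    "name": func.__name__,
                    "service.name": service_name,
                    "status": status,
                    "duration_s": time.perf_counter() - start,
                })

        return wrapper

    return decorator


# What an application engineer would write: one import, one decorator.
@traced("article-api")
def render_article(article_id: int) -> str:
    return f"article {article_id}"
```

[The application code stays unaware of how spans are recorded, so the platform team could change the recording details in one place without touching each service.]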

Definitely seeing the legacy thought of monitoring. Um, so we're kind of like undoing that. Um. Oh, interesting.

I like that. Yeah. That's great. I like the dimensions you've even sort of highlighted as part of that.

It's, uh, making it easy mm-hmm. For them to onboard. Yes. There's the aspect of training or sort of transformation of what should be Yeah.

In their minds. So they're leaning out and there's just running it, right? Mm-hmm. Making sure that it's cost effective.

Yep. So you have those elements that are part of your team, right? Yes, yes. Yes.

That's great. Oh, I think this is great. It's excellent. Is there anything else that you wanna highlight, uh, like at the New York Times specifically?

Um, yes. Uh, let's see. Yeah, I think one of the things, uh, I would highlight is like, technical leadership is super important. Oh, okay.

You know, I think like as a leader at the Times, um, definitely like leaning into that, um, and making sure that we are aligned on the things we need to be working on. Um, you know, I mentioned all the work that we're doing, but, um, we also need to make sure that we are serving the business ultimately. Right? And so, yeah, that's good.

It's our job as leaders to make sure that we are translating the, like, technical impact into business impact. Um, because I think if you're not intentional about that and getting that buy-in, um, it's very easy for the work to be dismissed as like not valuable. Right. Um, and, uh, again, if you're intentional about it, sometimes it isn't as valuable as, you know, you want it to be.

So definitely, I think, uh, intention and, you know, leadership are important for all of this as well. That's awesome.