Christina Yakomin & Jason Cox (Las Vegas 2022)
An exclusive interview from DevOps Enterprise Summit Las Vegas 2022.
Full transcript
The complete talk — auto-generated from the talk's captions.
Well, Christina, I'm so excited to get to talk to you. Um, love your talks. Thank you. And love your story, right, in the SRE space and kind of what you're doing. Um, I thought it'd be a really good opportunity to just, uh, give us kind of your story really.
Sure. How did you get into SRE? What's, you know, your connection to it, uh, maybe some of your perspective, right, and then I have some other questions for you. So... Yeah, I mean, I kind of fell into SRE. I started out as your typical web developer. I was in a rotational program early on in my career, and I was like, well, what can I do for this last eight months of this program that's going to just make me a better developer later?
And this cloud program thing was really new and interesting and emerging at Vanguard. So I jumped over there thinking like, let me go learn this for eight months and then I'll go back and like teach everyone what I learned. And at the end of the eight months, unlike any role I'd had in the past, I still felt like the dumbest person in the room. I felt like I had so much that I still needed to learn in a way that I'd never experienced after eight months in another role.
So I decided to stick around and I have not left the chief technology office and our cloud program ever since. Uh, and I ended up, uh, in my first role as an SRE supporting one of our earliest cloud platforms. Uh, that was a bit of a headache. And so I dedicated the next couple years of my career to making it easier for everyone else trying to do SRE.
That's awesome. That's really cool. Well, so I mean, I, you, we hear this pattern a lot, right? It's mm-hmm.
It's really, SRE is, uh, software engineering applied to operations. Yes. In that space. And so you are coming that path. Absolutely. Coming that path into it.
But how do you describe, like, 'cause this is a question I get all the time, like, mm-hmm. How do you describe SRE? How do you explain "this is what I do" to your friends and family? Um, it's always fun because, especially when I did my stint on our chaos engineering team, my whole family was like, what are you talking about?
Like, why are you engineering chaos? This sounds crazy. Um, but really, I think, uh, what I talk about when I describe site reliability engineering is I am writing software. I'm building automation that helps us keep all of the products and services more available or more performant so that when someone goes onto Vanguard's website and tries to see how much money they've got or make a transaction, they're able to do that and they're able to do that quickly.
And I wanna make sure that I can do as much of that as possible through software and automation, um, either to happen automatically or to do it a little bit easier so that I have to spend less time waking up in the middle of the night to do it myself. And that resonates. As soon as you talk about losing sleep, everyone's like, ah, yes. It's so true.
It's so true. You said something there I think is really interesting. I think all engineers have this understanding, like what they do matters. Mm-hmm.
And they can see a line of sight into how that's gonna affect the customers, right? Mm-hmm. Other people, it sounds like in the SRE space, like what you're talking about, it's next level, you're starting to think about those service levels in a different way because obviously it could impact you, back to the waking up. But let's talk about just the service levels for a minute.
Like, how did you get to that stage, right? Understand what the service levels should be. Is that something you had to introduce to the team, or was it... It was difficult because we were absolutely coming from a space where our business partners didn't have any expectation of availability from our technology organization, aside from keep it up, right? Keep it fast, keep it running. And so to have to start the conversation about availability targets, I think, was very difficult.
And the automatic knee-jerk reaction that I don't think is unique to us was, oh, alright, okay. So five nines. So we'll just be five nines available because this is financial services, that is very critical. But you dig a little deeper into that.
And we had to have some difficult conversations up front of like, all right, well, forget about the target. Let's just talk about what does it mean for some individual request to be successful? And you can have a bit of an easier conversation about that until you then make the calculation of, well then how available were you this past month? And as soon as you show them that it was 99% or 99.5% and they haven't even reached three nines yet, the conversation gets a little bit more difficult and they suddenly don't wanna look under the covers anymore.
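The calculation described here, agreeing on what makes a single request successful and then working out how available the service was over the past month, can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names and the request counts are hypothetical, not figures or tooling from the interview.

```python
import math


def availability(successful: int, total: int) -> float:
    """Fraction of requests that met the agreed definition of success."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successful / total


def nines(avail: float) -> float:
    """Express availability as a number of nines (0.99 -> 2, 0.999 -> 3)."""
    if avail >= 1.0:
        return math.inf
    return -math.log10(1.0 - avail)


# Hypothetical month: 10 million requests, 50,000 of them failed.
monthly = availability(successful=9_950_000, total=10_000_000)
print(f"{monthly:.3%} available, about {nines(monthly):.1f} nines")
```

With counts like these the service lands between two and three nines, which is the kind of result that makes a five-nines target look very different.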
But once the tough love is over and people start to embrace it, what I've had a lot of success with is talking about the cost of additional availability. And when we're working in a financial context, it makes sense. That really resonates and actually helps us because now, whether you consider it increased availability or just, uh, paying down some technical debt, we're able to bring our product owner into the conversation and say, is the, uh, increase of availability that you're looking for more or less important to you and to our end users than this new feature? 'Cause we can get this feature out a little later if we focus on this availability increase, uh, that you're looking for.
And being able to include them in that conversation is not something we would've been able to do five years ago. That's amazing. Right. And that was the tipping point too.
That was really, having the service level conversation. Yes. Really opened that door and sort of, I think that's wonderful. Did you find then from a, uh, tracking standpoint, mm-hmm, like as you started to introduce those, that having that feedback loop of seeing where you are helped drive prioritization, or what was the, uh, both from the team standpoint, but also from the business saying, do we really need that?
And if we do, we're gonna put money behind it? Yeah, it has. Initially, I think a lot of teams wanted to be considered high priority because high priority teams, they get more attention, they get more funding. But then when we were starting from, uh, the baseline of the availability that we had achieved, and suddenly they started to look into what it actually takes from a support perspective, from just a cost of infrastructure perspective, to stand something up that would be able to deliver five nines of availability, then you have some, like, internal service providers that are like, you know, realistically we don't need all of that. So it's made the conversation about availability and service level targets, as well as prioritization and funding and things like that, a little bit more reasonable because everyone is thinking about the accountability and the investment of time and effort and money that it's gonna take to reach any of those levels.
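One way to make the cost of each extra nine concrete is to translate a target into the downtime it allows. The sketch below is an assumption-laden illustration: it treats availability as time-based over a 30-day month, which is a simplification of the request-based view discussed in the interview, and the function name is hypothetical.

```python
def allowed_downtime_minutes(target: float, days: int = 30) -> float:
    """Minutes of full outage a time-based availability target permits."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return (1.0 - target) * days * 24 * 60


# Each additional nine cuts the allowance by a factor of ten.
for label, target in [
    ("two nines", 0.99),
    ("three nines", 0.999),
    ("four nines", 0.9999),
    ("five nines", 0.99999),
]:
    minutes = allowed_downtime_minutes(target)
    print(f"{label:>11}: {minutes:8.3f} minutes per 30 days")
```

Seen this way, five nines leaves well under a minute of outage per month, which helps explain why some internal providers conclude they do not need it.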
And it's not just the idea that everyone wants to be the most important thing, like it might have once been. Right. That makes sense. Did you find, um, examples or some of the service levels that were higher, ultimately higher than actually what the business needed?
Like they took a look at it and said, oh, we don't necessarily need it that high. We're making too much of an investment. Did you actually find those? Absolutely.
I mean, we have conversations all the time where, I mean, if you're building a website, do you really need to be more available than most ISPs? At that point, what's the marginal gain for your client? And so while yes, we have, uh, very, uh, difficult conversations about, like, the cost of even one failed request when you're looking at trying to make a trade at market open or market close, those things, yeah. You need the high level of availability, if not for the satisfaction of our clients, for regulatory reasons. Right.
But, uh, when we're talking about internal service providers, when we're talking about marketing sites, yeah. That initial shift from "just keep it up as much as you can" to "alright, five nines sounds good" to "we don't really need all that, do we?" I mean, we don't need a complete three-region, out-of-region failover strategy just for this marketing site, right. Necessarily.
And so that's been, uh, helpful in driving down cost and also strategic prioritization. Oh, I love that. It's, you know, it's showing how service levels are actually helping the business get insight, right? Yes.
That, um, making decisions that are the right things for the business. I love that. Now, uh, another aspect on just service levels, we're talking about availability. Did you find like other dimensions, like, uh, error rate, uh, I'm sure transactions, like what you're talking about, like we can't fail on a transaction mm-hmm.
That's going through mm-hmm. Things like that. Was there any surprise, uh, in any of that that changed how the team approached it? Yeah, I've been, I've been pushing a lot of teams, challenging teams to think deeper than error rate and latency.
Hmm. With web-based workloads, especially the ones that aren't, we're not talking about like event-driven stuff. If you're just calling an API, everyone can pretty easily come up with, uh, the definition of healthy in terms of status codes, in terms of performance. But what I really want teams to look at is setting service levels related to, like, graceful degradation, which allows you to be a little bit less strict on your availability targets for things like error rate.
As long as you have that higher target around, well, we at least need to be able to load this particular portion of the page, but if this portion of the page isn't available, that's okay. Uh, maybe 95% of the time we want everything to be there, but 99 and a half percent of the time we need this section to be there. And four nines of the time, you gotta have this one number. Something like that.
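The tiered targets just described (everything on the page 95% of the time, one section 99.5% of the time, one critical number four nines of the time) could be checked with something like the following. It is a minimal sketch: the tier names, the `Tier` class, the `evaluate` function, and the observed counts are all hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    target: float  # required fraction of page loads where this tier rendered


# Targets taken from the example in the conversation.
TIERS = [
    Tier("full page", 0.95),
    Tier("key section", 0.995),
    Tier("critical number", 0.9999),
]


def evaluate(rendered: dict[str, int], total: int) -> dict[str, bool]:
    """Return, per tier, whether the observed ratio met its target."""
    if total <= 0:
        raise ValueError("total must be positive")
    return {t.name: rendered[t.name] / total >= t.target for t in TIERS}


# Hypothetical counts out of 100,000 page loads.
observed = {"full page": 96_200, "key section": 99_610, "critical number": 99_985}
results = evaluate(observed, total=100_000)
for name, met in results.items():
    print(f"{name}: {'met' if met else 'missed'}")
```

In this made-up month the page as a whole comfortably meets its looser target while the critical number misses its much stricter one, which is the distinction graceful-degradation service levels are meant to surface.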
That's awesome. Um, and that was the conversations between SRE and, was it predominantly the product side representing business that was helping drive that from a prioritization standpoint on the work? Yeah. Um, or was actually SRE pushing into that?
I mean, kinda where, who was actually making some of those priority calls on, like, this is the work we need to focus on? Prior to the rollout of SRE, we already had, uh, some product ownership, um, that was pretty strong in a lot of areas. So we had product ownership driving prioritization, but it was primarily in the context of feature delivery. Mm.
And so with the rollout of SRE, we started trying to bring the product owners into the conversation about more than just prioritization of feature delivery. Because up until that point, they kind of just left us, depending on the team, anywhere from five to 20% of their time to say, like, that's your technical debt time, you figure it out. And if you have an incident that eats up all your time, no more technical debt pay-down this month. Um, but as a next stage of maturity, one of my most recent roles was on a team that developed a curriculum for reliability engineering for the engineers in the organization.
And then once we rolled that out, we actually created a shorter, higher level and not quite so technical curriculum for the product owners. That's now a part of our mandatory product owner training. And for new product owners coming in, they all get a couple of hours just dedicated to SRE and product operations so that when they go into those prioritization discussions, they're prepared to have conversations about availability and, uh, how to prioritize, uh, changes to service level targets. Oh, I love that.
I love that, the aspect of a whole educational program, right? Mm-hmm. Rolling out, which leads me into kind of a next question, which I think that we have, like in your particular case, we've seen software engineers become SREs, right? Mm-hmm.
And there are cases where we're hiring in SREs. What we're not seeing is people graduating with a degree in SRE, right? Yes. So, and there's a very limited amount of education preparing people or engineers to be SREs. Um, I'm assuming, like, with your program, that's helping develop even some internal talent. Mm-hmm.
But what do you think, right? What advice could we start to give to some of our universities and some of these computer science programs that are focused on delivering talent for the next generation? Right? What's some advice that you would give to say, hey, we need more people to be in this space?
What are some of those check boxes you would like to see? Uh, that's an interesting one. So, I mean, I think to what we have in our curriculum, that, hey, it doesn't need to wait until you're taking on an SRE role internally. We could be teaching these things earlier.
And, uh, some of the things in there that are most important to me are, uh, learning from incidents and preparing for incidents. So just the idea of responding to a production incident and handling operations and diving into a playbook, I think, is something absent from software engineering and computer science curriculums, because you're focusing on the development stage. We almost need to bring the DevOps into those curriculums and incorporate a little bit of that ops mindset. And I think one way that we could do that is by incorporating, in addition to teaching, um, students about automated testing for the code they write, teach 'em about things like performance testing or chaos testing, and let them break test software, um, break test infrastructure and get some experience, uh, with writing out incident response playbooks or even, uh, automated recovery for systems.
Yeah. Love that. Love that. And, and so, which prompts me a couple other things.
Mm-hmm. A couple of areas. One area would be in, uh, the failure. Mm-hmm.
Mm-hmm. Because that's what, like, as SREs, you're gonna be confronted with: incidents, failures, outages. Uh, how do you manage incidents, right, in your space? What advice would you give, okay, from an SRE standpoint, probably part of your training, here's the right way to do it all the way through the postmortem, right? The review of the incident.
What does that look like in your, um, in your space? So for us, I think a passionate grassroots effort sprung up around, uh, learning from incidents and a blameless culture before we even had SREs. And so in pockets, unofficially, you saw teams going that extra mile and doing, uh, more formal post-incident reviews that were more extensive, that weren't root cause focused, and that were looking for contributing factors, but also mitigating factors and highlighting some wins. Uh, and one of the early SRE roles I had, I was on a team that was an early adopter of that practice.
And, um, when we published those post-incident writeups, the responses that we would get from folks all across the IT organization of, I learned so much about the platform from reading this, this is so great, that spawned more and more teams wanting to pick up the practice, wanting to share their post incident reviews. And ultimately, it, it's been a really great example for us of like upward pressure to make change happen because the incident management process that we had was not changing, but the way that teams were handling their incident management and reflection after incidents was changing despite that. And it has influenced change in the formal post incident review process to now incorporate much more like timeline review, tell the story, understand what the individuals were thinking, and do that without pointing fingers, assigning owners of incidents. And, uh, it, it's gone beyond just the traditional problem management.
Uh, and I've been really pleased to see that progress just in the past few years. Ooh, I love that. And I think the one thing you said there that really resonated with me, um, was how the incidents trained somebody on the platform. Yes. Actually brought more information about the platform they didn't have, because of the incident.
Which kinda leads me to this thought and this, um, question for you, and we see this, right? We see this when you get to a point where there's no incidents, like, I mean, let's say that that exists in the promised land, right? There's just no incidents here. You don't have the opportunity, mm-hmm.
To keep that training in place, both on the SRE side as well as sort of those that own the product. Um, how do you keep that discovery in place? Mm-hmm. And that kinda leads a little bit into things like chaos and things like that.
Yes. But would love your thoughts on that, like when we get to those... A couple things come to mind there. One is, so before a team even gets to production with a new application, we have some, like, readiness requirements that need to be met, that include doing some break testing, doing some chaos engineering, practicing rolling back through the pipeline just to make sure that the team is familiar with which buttons they're gonna need to click and when. So that's before they've even had an opportunity to have an incident, before they've had an opportunity to deploy new code to production.
And this is for, uh, just our most critical, most essential services. We're not gonna hold up, you know, uh, a new application that's not quite as critical from going to prod, but we recommend that every team do that. And that gets into just the ongoing continuous learning. I always recommend, at least a couple times a year, practicing that rollback just to keep it fresh.
Making it a part of the onboarding process for new engineers to go through practicing a rollback, and a frequent, uh, chaos engineering practice is good, especially for those teams that have really high availability that get out of practice with responding to incidents. Not just testing, oh, if I inject this fault, can my system respond well? But can I inject this fault that I know is gonna have an adverse effect in my system? And do the people that support the application know how to respond and fix it when that happens? So we call those fire drills, and we've done those, uh, quite a bit since we first enabled chaos engineering.
And, uh, I've had great fun with that specifically as an onboarding tool. That's awesome. Yeah. And I remember your talk on that, right?
Mm-hmm. I thought you did a great job on that. And one of the things you said in that, which, uh, I thought also super important, is sometimes you don't need chaos engineering. If you have problems, solve those first, right.
Those are not the teams that need it. Yeah. If it's broken, fix the broken first, right? Yes.
That's good. Um, so tell me, in your mind, um, we talked a little bit about, like, you know, the, uh, wake up in the middle of the night kind. What does a bad day for an SRE look like? Hmm.
So I don't necessarily think that getting pulled into incident response is a bad day for an SRE. I do think that, uh, anytime you're losing sleep, that's a bad time for an SRE. You want to avoid having to wake someone up in the middle of the night to do something. Um, because if you do that too many times, you're gonna get burnt out engineers, it's not gonna be a role that they wanna be in. You don't wanna be, not even just interrupting sleep, but interrupting time that they wanna be spending with their families, eating dinner together, doing something on the weekends that they really enjoy.
So that's a bad day for an SRE, is when that time gets interrupted. But if you're able to contain things to the business day, keep those life interrupts down to a minimum, some of the best days for an SRE can be quickly responding to an incident, getting it under control, and making sure that you've limited the impact to the client and left your system better than it was before. That's good.
That's actually a good point. And in that regard, right, the, mm-hmm. Some of the best days are exactly that, being somewhat of a hero. Yes.
Right? Because you know the outcomes affect real people, right. Or affect the business you're driving. Um, how much of the time are the SREs on your team focused on things like that, right? More of the operations, restoring service, versus pouring new features in, automation, all the different things that sort of, we would expect more of the software engineering, mm-hmm.
Of operation space. Do you have an idea, like, what that looks like percentage-wise? I don't, and mainly because we're not too prescriptive today about how teams across our organization are implementing SRE. So we have SRE roles.
There's actually three different kinds of SRE job titles that you can have at Vanguard. And we also have plenty of people that are wearing the SRE hat without the SRE title. Sometimes they're working on performance improvements, sometimes they're responding to incidents, other times they're delivering features, and they like to go back and forth in that way, and that's their preference. So it really is going to depend on the person, but I will say we do want to minimize time spent restoring service. But I do see a lot of SREs, especially more recently, very heavily focused in the, um, performance engineering space, always trying to ensure that we are not just able to handle loads that we've seen in the past, mm-hmm, but prepare ourselves, whether that's through proactive provisioning or through elastic capacity, for loads unlike anything we've seen before.
That's an area that I see my SREs focusing on a lot. And then I also see the SREs really digging into, uh, learning from things that have already happened, uh, especially in terms of, uh, regional outages that we may have seen, trying to figure out strategically, uh, what we can do in, uh, major architectural improvements to help our systems, uh, be more resilient, get like additional nines, not just address one-off incidents. Oh, I love that. Yeah.
It's the, which is what you wanna see, right? Mm-hmm. Is that some, uh, minor percentage of your time is really sort of fighting the fires, right? Doing the mm-hmm.
Most of it's taking all of the learnings you've been picking up through, I'm assuming, both the incidents as well as chaos and the, mm-hmm, performance testing, feeding that back into the system to make it more resilient, right? Yes. More available, higher performance.
That's really awesome. Um, and the other thing you said, and I just want to clarify this, that it isn't always a title, like, mm-hmm. SRE is not a title, it's this, uh, it sounds like you've elevated this as a practice that's sort of permeating different titles. I mean, how is it structurally? Are there SREs, are they embedded, or are they part of these different product teams?
Uh, or do software engineers just assume that role? Yeah. The answer is yes. Okay.
And it's because we're working in such a large organization, and also because we haven't been doing SRE for very long. So I don't wanna say that we will never standardize on an approach, but upfront, we left room for experimentation with what an SRE practice looks like. And we've actually seen success in multiple different ways. And so we have not really felt the need to steer people in a single direction just yet. We have areas of the organization that have staffed an SRE team that gets treated more like flexible capacity that goes out on demand to help out.
We have areas where an individual product team has a fully staffed SRE that is responsible for that particular product. And then we have areas where they say, we don't really need a designated SRE because we wanna make sure that we are letting all of our software engineers get experience in all aspects of building and supporting this product. And so you've got different software engineers wearing the SRE hat, depending on what task they've picked up to work on that particular sprint. Ooh, I really like that.
So in that model, like that last model especially, uh, as a leadership team, like, you're putting together these programs to educate, that's going out to everyone, including these software engineers. Mm-hmm. Like, here's what SRE's about, here's what service levels are about. Is that the way that you've been able to get that into these different teams?
And if you were giving advice to somebody else that's trying to go down the SRE journey, okay, is that what you would, is there any other advice you would also give, like how to develop more SRE practice on your team? Yeah, I mean, I think as a leader, looking at those three models, the last one, where the software engineer's gonna do it all, is a super attractive one because it doesn't really require hiring anyone new or giving anyone a new title.
Uh, especially a title like SRE, where I think the market value might be even higher than that of the software engineer right now. So I know that that one can be very attractive. And yes, the way that we've gone about that is making the training available, giving software engineers the time and space that they need to upskill in these areas and making it really clear, like, that reliability engineering is really a set of behaviors. It doesn't have to be, uh, isolated to a role, but if you don't do it well and you don't provide sufficient platform engineering for these teams that are building products, you are gonna end up with cognitive load
that is way, way, way too high for these individuals. So I think in order for that last model to be successful, yeah, you may not be investing in the additional SRE resources on those teams or on a centralized team, but you will need to do some investment into platform engineering to reduce the overall cognitive load of the software engineers, so they've got the capacity to add SRE to the list of jobs that they're doing. Yeah.
Makes complete sense. Reliability engineering, as we've been talking about Yes. Is a lot, right? Yes.
Um, and it has significant impact and benefit to the business. Mm-hmm. So I think this journey's great. Uh, last question and we'll wrap.
So where do you see SRE going? I mean, so, origin story at Google, we see it across different enterprises getting adopted. What's the future five years from now? What do you think SRE's gonna look like?
Oh, I think it's fascinating to speculate about. I think that I at least have not really seen standardization across the industry in how to implement SRE in an organization. I think every, uh, organization does SRE a little bit differently, and I don't expect that to change. But what I do expect to see is, uh, a continued shift toward investment in SRE.
And I do think that ultimately, even with investment in platform engineering, you're gonna see SRE as, at the very least, a job title, if not a designated team, in a lot of these organizations, because it is so extensive what an SRE is gonna be expected to do. And that doesn't mean centralize the on-call, but it does mean that in the day-to-day activities, uh, there's so much there that you can do to focus on making an application more available, more performant, more resilient. Awesome. Well said.
Christina, this has been amazing talking to you. Thank you so much for this. Thank you. I've enjoyed this so much.
Yeah. And thank you for your, your talks. Absolutely. Normal job.
Anybody that has not seen her talks should check 'em out. They're fantastic. So, Christina, thank you again. Thank you all.