The Business Necessity for Platform Engineering


Las Vegas 2023



Andy Domeier

Sr Director of Technology · SPS Commerce

Nathaniel Andersen

Senior Director, Technology · SPS Commerce

The momentum around Platform Engineering as an industry trend continues to increase almost exponentially. A challenge for many technology leaders is making sense of how industry trends might apply to our own business strategies. As with Cloud Computing, Containers, Kubernetes, and even DevOps, trends with this kind of momentum prove time and again to be a critical success factor for an organization's technology strategy. As a result, it's vital we understand how Platform Engineering applies to us and how it can be implemented to help support our organization's success.

In this talk I will share the story of SPS Commerce's evolution to platform engineering and how it supports our strategic business priorities to ensure we enable the continued growth of our business and improve our delivery on the organization's strategic priorities. We will also explore how SPS Commerce uses platform engineering strategies to improve operational confidence in resiliency and security while increasing developer productivity.


Full transcript


Andy Domeier

Thanks a ton for coming to our talk today. We're real excited about this talk. Just to kind of set some expectations, this is really more just sharing our experiences and journey in this topic so far. We think it's a pretty fun story, and hopefully you all can take something away from it or relate to it in some way.

I do want to give a quick props to all of our folks back home at SPS Commerce, where we're from. A lot of what Nate and I are going to talk about today is the culmination of a lot of hard work from a lot of awesome folks, so we wouldn't be here and have all these fun things to talk about without them.

And props to the conference organizers too. I know it's day three. I don't know if you're all as exhausted as I am, but I'm exhausted because the content's been really awesome.

Along with the content, I think one of the things that's really surprised me coming in is I didn't anticipate this level of platform engineering content being here. And I think one of the things that we see with these kinds of events is, usually when you see that kind of momentum, it's pretty meaningful, and we can all build confidence together that there really is something here.

So really quickly, let's jump into intros. Nate, like always, why don't you go first?

Nathaniel Andersen

Yeah, great. Thanks.

The advantage of having a name that starts with A alphabetically: I get to go first. I'm one of Andy's compatriots at SPS. I lead a handful of teams that focus on shipping software and solving customer problems, and I am generally one of Andy's top debating sparring partners.

Often he comes to me with pushes to standardize. So when he said, "I'd like to do a talk with you at DevOps Enterprise," I said, "With me? I'm not sure what you're going for."

But I'm super happy to be here, and I'm excited to talk about something that I actually have been convinced on in the last few years working with Andy.

Andy Domeier

Yeah, sparring partners is a good analogy. We've worked together for 13 years now, and I think one of the things that's been just awesome about getting to the spot where we're at is this really isn't something I think either of us would've thought we'd be on stage talking about together, and to the point where we agree.

I've been working at SPS Commerce for 19 years now. It's a really fun organization. Nate's going to chat a little bit more about it.

I oversee the cloud operations group: network security, access management accounts, the platform teams, deploy and observability teams, and the things that kind of go along in that space with SRE as well.

Our talk today is titled, "The Business Necessity for Platform Engineering."

Nathaniel Andersen

And I added the subtitle. It felt, especially when Andy first pitched it, a little overly prescriptive. And there have been times in my career where I've been like, "I don't need a platform. I just need it to work."

So I think this has been a really good conversation as we built this content, and I think it's probably evocative of conversations that, if you're a platform engineer, you've had with your delivery teams, or if you're on a delivery team, you've had with some of your shared services teams.

Andy Domeier

Yeah, I believe that once, though not that recently, you said the words "out of my way" to me. I think folks can relate to that.

Nathaniel Andersen

But before we get into how platform engineering needs to be treated like a product, and treating delivery teams like customers, I thought it would be good to talk a little bit about the customers that we are collectively trying to serve, and the problems that I'm particularly targeting.

SPS Commerce, you might not know the brand, but it sells solutions that connect members of the supply chain together to exchange assortment or item data, sales data, and fulfillment data. And the solutions we bring to market are somewhat varied.

To explain it, I think it'd be good to just cast ourselves in the lens of one particular type of customer that I have, which is a supplier. Suppliers' business problems are pretty complicated. To get a product to market and sell it and be profitable, they have to solve for a lot of things. So they will bring in companies like SPS to solve for some of their connection details.

The details that they need to solve for, like getting their product produced, manufactured, and then sold, are further complicated by the fact that once they get a deal going, or maybe get five deals where you have retailers of the four different colors buying your product, the complexity just multiplies, because those retailers treat the way that they interact with their suppliers as a business differentiator. So each retailer then requires different things from its suppliers.

So the supplier has a very complicated problem. What my teams are attempting to do is take those data requirements and the rules that they use to be successful with their customers, and we try to roll those up into consolidated rule books with a standard interface, so that when the supplier plugs into SPS, they have a single interface where all those rules are normalized and their data exchange is standardized, so that they have a single set of validation, a core way of understanding how they're supposed to get their shipments done, get their items shared.

Generally speaking, this has been a successful business. It might sound a little niche to you, but we've got over a million connections on our platform, and not just lots of suppliers, but logistics companies and retailers as well, along for the ride.

But the growth has been something that's been organic over time. We've actually had about 90 quarters of consecutive growth, which has meant that we've needed to solve problems and scale our organization as we've gone along.

Andy Domeier

I think the growth part explains a lot of why I've been here for 19 years, but also explains a lot about why we get to have the fun conversations we're having now with the technologies. We're consistently solving for scale and trying to accelerate the business and make sure that our technology has enough runway to meet the market demand and meet the opportunity in front of us.

So the way that we think about that then is we really think about things like being a high-performing technology organization as just a requirement. It's just baseline for us. We have to do this. We have to make sure we're doing these things to produce the runway we need within our tech to meet the needs that our business has in terms of the opportunity in front of us.

Everyone's probably really familiar with this. If you step back, I think DORA has done a great job with a lot of data over time, and these four things continue to lead the way in terms of saying, "Where's the bar? Where are you relative to the bar?" How frequently are you deploying? How fast can you get ideas to production? When you are making change, what's your success rate? I thought the S3 talk this morning was awesome. And then, how fast do you recover?
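
To make those four questions concrete, here is a minimal Python sketch of how they might be computed from a list of deployment records. It is only an illustration, not SPS Commerce's tooling: the `Deployment` record shape, its field names, and the `dora_metrics` helper are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did this change cause a failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    """Summarize the four metrics over a window of `window_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at
        for d in failures
        if d.restored_at is not None
    ]
    return {
        # How frequently are you deploying?
        "deploys_per_day": len(deployments) / window_days,
        # How fast can you get ideas to production?
        "median_lead_time": median(
            d.deployed_at - d.committed_at for d in deployments
        ),
        # When you make a change, how often does it fail?
        "change_failure_rate": len(failures) / len(deployments),
        # How fast do you recover? None if nothing failed in the window.
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used here rather than means so that one unusually slow change does not dominate the summary; that is a choice made for the sketch, not something stated in the talk.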

So everyone knows kind of those four metrics, and Nate, being somebody who's leading product teams, does this every day and it's super easy, right?

Nathaniel Andersen

Yeah. And I like these metrics. They're valuable lenses to have. However, six days out of seven in a week, I'm not thinking about these metrics as my primary driver. I'm thinking about my customer and the problems that they have, as opposed to how frequently I am shipping.

That said, I think they're good aims to have as you layer on how do you succeed at scale. And I actually think the story of the last few years really demonstrates that.

So I'd like to cast you back to 2014. At the time, I was leading an organization at SPS called Release Engineering. I was attempting to implement all the things I'd read in The Phoenix Project and learned at DevOpsDays: DORA-like metrics, shipping frequently, having smaller teams iterating, failing fast, learning.

And I just wasn't able to get traction across the whole organization. So the CTO at the time, the new CTO, was like, "Hey, I've got an opportunity for you, a special project. Why don't you go try and solve this customer problem?"

So I got a focused runway to solve. Unfortunately, at the time, that was my top priority, but the rest of the business had a different top priority: move all of our workloads into the cloud. So the rest of the team was focused on that, which meant I and my two-pizza team were on a bit of an island.

We were able to consult with some of the experts, and I see a few of them actually even in the room, but we did have to learn a lot as we went along. We had to learn how to do CloudFormation, deal with VPC routes and the reasons behind them, and provision our environment into production.

But with a lot of iteration and focus on the customer problems we were trying to solve, and the tech stack we were trying to solve for, we on our island kind of iterated into a nice spot. We became artists, with people taking different slices of the pizza, focusing on different problems. Our dev people got a little more opsy, and our ops people got a lot more devy.

It felt really successful. It was a really successful operating model, and we carried that forward into really adjacent problem spaces. We took a monolith and we decomposed it into a bunch of microservices. We had a need to launch a lot of new workloads, and so we leveraged serverless technologies to get those things done. We knew that we needed a better way of enabling data sharing, and so we enabled some event-sourcing solutions.

All of those things created something beautiful. We felt really proud of the solutions we were able to bring to market. And we just iterated customer problem after customer problem, and we produced something of beauty. And the saying might go, I think, "A thing of beauty is a joy forever." But snowflakes maybe have a slightly different implication for how long that beauty lasts.

Over time, a snowflake becomes its own source of pain. When it came time to adopt security protocols or controls, or some of the shared services benefits that the rest of the organization was implementing, we weren't able to pull them in seamlessly.

And so our system that looked beautiful at one point in time was now a puzzle piece that didn't fit with the rest of the organization. Our serverless workloads ran into problems, and we needed to adopt Kubernetes. And learning and implementing that full stack felt like a heavy lift. Our artistry started to feel more like toil, where we were implementing things that the rest of the organization had already built out.

Andy Domeier

I think one part of this story that always gets me is that we're from Minnesota, so we're supposed to love snow. But in this case, it doesn't necessarily feel that way.

One of the things that I think has been really powerful for me in the journey that I've been on with Nate and everyone else at SPS is, again, that we're chasing this growth opportunity. And as Nate's telling his story and showing some of the toil that comes from moving fast and focusing on that customer problem and delivery with as much tunnel vision as you can, it really presents something that we feel we've grown a good understanding of at SPS, which is the priority friction that these concepts put us in, especially the folks that are doing product development.

At the end of the day, Nate talks a lot about wanting to focus on the customer problem. How can he really spend as much time as possible within his group solving for really meaningful customer problems? At the same time, we have these expectations, and the customers are setting these expectations, right? They expect our services to be available. They expect them to be secure. And they don't expect them, but we expect them to be cost-effective. We're going to come knock on Nate's door if he's spending way too much money.

And so I think that what we're starting to see is this movement of DevOps and understanding how to move quickly is bringing a lot of these delivery teams that are advanced, that can move quickly, into this space where they're like, "Okay, I've done this for a couple years. We've done really great things, but now we have this friction and this burden that's really heavy."

So I think platform engineering is really representing the response to this. Ultimately, the way that I feel like the story comes together really well is to talk about it from a concept of undifferentiated engineering.

If you step back and think about that just as a whole: everybody has to deploy software. Whether you're copy-pasting or dragging or whatever you're doing, you have to get your code to production somehow. It needs to be secure. That's not debatable, right? We need to monitor those things. Even if it's the customer calling you to tell you the site's down, that's a really bad way to monitor, but it's a form of monitoring.

And so thinking about the fact that everybody in our organization has to do those creates this approach that makes a lot of sense. Why wouldn't we share those things? We all have to do them. Why wouldn't we all do them the same way and share them and learn in a way where we can benefit from each other?

That's where platform engineering, from our perspective, has really come into play. And it's been really important to approach it with a product mindset. If you've seen some of the talks this week or even some of the blog posts, you're starting to see this idea of approaching it with the product mindset everywhere, and I would challenge us a little bit there.

What we're really doing is not so much saying, "Hey, we have to productize our undifferentiated engineering," but we have to think about who the customer is. That's what actually matters here. The whole point of productizing is you're creating a new customer relationship that we didn't have before, right? It was just IT operations, and we were there. We were trying to help. We were trying to make sure that we were being resilient, being cost-effective, and enabling our product engineers to move as fast as they can. But really, that customer relationship is what sets us up.

Something that I wanted to share along our journey here, that it's been really fun to hear a lot of the other talks reference as well: something that gets really hard in this space is adoption.

I'd just be really curious, quick: does anybody in the room currently have a platform engineering project going on, and you're currently trying to navigate, how do you get people to move to it, move existing workloads to it? Yeah, quite a few. It's super, super hard.

I'm going to share a story that's been really great for us at SPS, and then share a few more thoughts.

Really generically, as Nate mentioned, we have a growing business. It's moving really quickly, and part of the opportunity for us there is to make sure that we're staying ahead of the curve when it comes to scale and resiliency. One of the things that we started looking at a few years ago is that we really have to be more region-resilient. We have to be able to have an active-active network that's processing supply chain communications with higher availability and less dependency on a single region.

So from the executive management team's perspective, we've just got to copy-paste, right? You can do that, right?

Nathaniel Andersen

Easy. Just copy it over. Control. Easy enough.

Andy Domeier

Really, we have a great leadership team, and we had a chance to step back and think deeply about this. And when we started trying to decompose what it is that made up our network and made up our technologies, some things really came to light.

As you start trying to go through this process of, "Well, how are we going to take this tech? How are we going to make it active-active?" As you start to unpack it, you start to look at the different architectures and the different approaches you've used over time.

Nate had mentioned we did a lot to lift out into the cloud initially. So we had a decent amount of workloads that were pretty standard. They were running EC2 behind an ELB. Nothing too crazy. It worked great. Easy pattern, auto-scaled, cost-effective, no problems there.

As Nate and his teams got new opportunities to go solve customer problems, he referenced the fact that they got into serverless. They started looking at how to move quicker. If we saw a market opportunity or a feature that a customer needed, they could go deliver those things. And so we started seeing this approach and started seeing more serverless workloads come into our environment too. We started seeing those APIs being available for different pathways.

We are a very transformation-heavy product. We're doing a lot of data transformation between file formats. And so sometimes we have heavier Java workloads that run a bit longer. And so we needed to start looking more at the container world. The serverless space didn't make a lot of sense at the time. ECS was for sure the most approachable, kind of safest way to get going in container orchestration. And so we started moving a lot of our containerized workloads to ECS at the time.

That was when we started moving forward with those types of container patterns. Over time, the industry, Kubernetes, the momentum, and all the opportunities that come with that tech stack came in. And so we again then shifted more towards building services inside Kubernetes and really getting good at running that compute platform.

So now we're starting to unpack this, like, "Hey, take this multi-region." Well, okay, I suppose we could. But then, I don't know if you're all thinking this already, you start looking back at those and you're like, "Oh, and by the way, there's a Jenkins server that deploys those things, and we're still using Chef on those, and it works fine." It's not a bad pattern.

But it's a lot different than the CloudFormation that Matt was using to deploy our serverless functions. And I'm sure most of you probably at some point touched on it a little bit. It was pretty cool and worked pretty well for us for a while, right?

Nathaniel Andersen

Yeah. Okay. I thought it was kind of fun for a while.

Andy Domeier

And then there's a lot that we've seen great things with, whether GitHub Actions or, in our case, Azure DevOps pipelines. And if you're playing with Kubernetes, you use Helm.

So as you kind of keep going down this path and you start seeing these things, you're like, "Oh my gosh. Okay, let's really look at this. Let's start to think about some of the basic problems that Nate has to think about if we ask him to go multi-region active-active."

And there's a whole, I mean, the list is way longer than this, right? But there's a lot of basic ones. How are you going to route traffic? How are you going to decide where traffic's going to go? What are you supposed to do with your secrets? What build servers are you using? Do you need them in both environments or both regions? Are you going to copy that? What's your network strategy? Of the things you have today, how many of them are hard-coded to region? We saw quite a bit of that.
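
The last of those questions, how much is hard-coded to a region, is one that can be roughly checked with a script. The sketch below is only an illustration and assumes AWS-style region names; `REGION_PATTERN` and `find_hardcoded_regions` are hypothetical names invented for this example, and the pattern is deliberately loose, so it will both miss and over-match in a real codebase.

```python
import re

# Loose pattern for AWS-style region identifiers such as "us-east-1" or
# "ap-southeast-2". Availability zones like "us-east-1a" are not matched,
# because the trailing \b requires the digit to end the token.
REGION_PATTERN = re.compile(
    r"\b(?:us|eu|ap|sa|ca|me|af)-"
    r"(?:northeast|northwest|southeast|southwest|north|south|east|west|central)"
    r"-\d\b"
)


def find_hardcoded_regions(files: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (file name, line number, region) for every region literal found.

    `files` maps a file name to its text content, which keeps the sketch
    independent of how the sources are actually read from disk.
    """
    hits = []
    for name, text in sorted(files.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in REGION_PATTERN.finditer(line):
                hits.append((name, lineno, match.group(0)))
    return hits
```

A list like this does not fix anything by itself, but it gives a first estimate of how much region-specific configuration a team would have to untangle before going multi-region.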

But I think overall what you see here is a lot of the questions that we're suddenly having to ask Nate and his team about are undifferentiated engineering, right? They're things that everybody has to go figure out. And so we got really, really lucky here, I think, with the timing to a point where in our business, this was our sell for platform adoption, right?

Why would anybody try to take their serverless functions in CloudFormation and go try to figure out how to make that multi-region when we have other ways to approach that? Or take those EC2 patterns. It's a really meaningful business reason to go this route.

You'd think that I asked my eight-year-old to do some of these slides. His were way too nice, so I ended up erasing it all and doing it myself.

But we did go with a platform strategy. We call our platform Atlas, like pretty much 90% of the world decided to call their platforms. But pretty basic stack. I don't think that this is too crazy. We were on an Istio service mesh on Kubernetes, and when we started talking about multi-region, we said, "You know what? We think that there's a really, really obvious path for us to be able to get here with Atlas."

And so we're able to load balance at the edge and create an environment where we can federate the service mesh across regions and have a really nice, consistent, easy-to-use, approachable platform for developers.
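
As a rough illustration of what active-active routing at the edge means, the Python sketch below spreads requests across whichever regions are currently healthy. It is not how Atlas or Istio implements this; `pick_region`, the health map, and the region names used with it are assumptions made purely for the example.

```python
import hashlib


def pick_region(request_key: str, regions: dict[str, bool]) -> str:
    """Choose a healthy region for a request.

    `regions` maps a region name to whether it is currently healthy.
    Requests are spread across the healthy regions by hashing the key, so
    the same key keeps landing in the same region while health is unchanged.
    """
    healthy = sorted(name for name, ok in regions.items() if ok)
    if not healthy:
        raise RuntimeError("no healthy region available")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(healthy)
    return healthy[index]
```

With two healthy regions, both serve traffic; if one is marked unhealthy, every request goes to the other. The point for a platform is that a decision like this lives in one shared place instead of being reimplemented by each delivery team.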

So now in the sense of trying to accomplish that business goal that we have of trying to get active-active multi-region, we have an interface, but we also have a business purpose, right? It's one of those scenarios where we're not necessarily just asking somebody to move because my OCD really wants to shut off that old tech, right? OCD isn't a business value we can necessarily drive a lot of action on.

This has been a really great and a really fun story for us. But I think the thing that I would just kind of want to really highlight here for us is we've had a lot of success in just thinking about what's your business trying to accomplish, and how do the platform technologies that you're trying to provide go solve those things?

One of the traps we got caught in along the way is that it feels really easy to worry about how to reduce the friction for net new services, right? It's kind of a fun process. How few button clicks can get us to an initial container running in an environment?

But have you stopped and thought about how many new services your teams actually start each quarter? For us, that was kind of a distraction at first. We wanted that net new service flow to be really easy to use, really approachable. And when we stepped back and looked, we were deploying a ton, but it was on services that already existed. We're not creating a lot of new services every quarter. We are creating some, but in terms of what we're optimizing for, that wasn't really going to provide the value.
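
One way to check that balance is to look at deployment history and ask what share of each quarter's deployments went to services that first appeared in that same quarter. The sketch below assumes a simple list of (service name, deployment date) pairs; the `quarter` and `new_service_share` helpers are hypothetical names for this example, not something from the talk.

```python
from collections import Counter
from datetime import date


def quarter(day: date) -> tuple[int, int]:
    """Map a date to (year, quarter number)."""
    return (day.year, (day.month - 1) // 3 + 1)


def new_service_share(deploys: list[tuple[str, date]]) -> dict[tuple[int, int], float]:
    """Per quarter, the fraction of deployments that went to services
    whose first recorded deployment falls in that same quarter.

    Caveat: in the earliest quarter of the data, every service looks new.
    """
    first_seen: dict[str, tuple[int, int]] = {}
    for service, day in sorted(deploys, key=lambda item: item[1]):
        first_seen.setdefault(service, quarter(day))

    total: Counter = Counter()
    to_new: Counter = Counter()
    for service, day in deploys:
        q = quarter(day)
        total[q] += 1
        if first_seen[service] == q:
            to_new[q] += 1
    return {q: to_new[q] / total[q] for q in total}
```

If that fraction is small quarter after quarter, it suggests that effort spent polishing the new-service path is optimizing for the minority of the work.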

Speaking of value, we did want to talk and share a little bit of the data that we have. One of the things that makes me really excited about this space and this consistency is the more consistent that your team uses tools, the easier it is to pull data out of them. You can understand things like change rates, understand different aspects and approaches.

This is just some of our story that we thought made a lot of sense to talk through. I'd highlight just the bottom graph there. I had to pull the axis off, but the red line is incidents and the green line is our change rates. So we're deploying thousands of times a month. And in general, we feel pretty good about it. We're generally breaking our deployment frequency records every quarter. We do see a dip: we're very cyclical around the holiday season. It's a busy season for us, so that changes things. Q4 is usually when we're focusing more on being hyper-aware and hypersensitive.

Direct your attention to some of the middle numbers there. We do keep a really close eye on our change rates, on our change success rate, things like that. This system and the predictability that we have gives us an opportunity to do that at scale.

We've had a lot of fun with these numbers and the journeys we've had, and we thought maybe we would wrap up by sharing some of the high-level takeaways that we feel we're taking back.

Nathaniel Andersen

So to summarize, at least from my anecdote, at a certain point, iterating locally hits a velocity ceiling. And there are times that, whether it's going to be platform engineering or another shared service that will spin out, optimizing for a narrower customer set can end up helping with focus and flow.

Andy Domeier

I think, as I mentioned earlier, the DORA metrics are great, but that customer relationship is where things really start to matter. If your deployment frequency is great, but you're not shipping anything that's producing value for a customer, it really doesn't matter, right?

And so this platform engineering product mindset gives this customer relationship that we have now, where my intentions and my motivations and the metrics that we're looking at are all oriented around the value that Nate's receiving. And for us, keeping Nate's team, to kind of steal from Nicole's talk yesterday, in that flow zone as much as possible is a huge part of what we're trying to accomplish.

Nathaniel Andersen

And just a logical argument to this point: generally speaking, the service economy, of which most of us, I think probably all of us, are pretty much a part, is built by abstracted levels of customers or consumers. And so it is valuable to add layers of abstraction or customer abstractions. So internal customers have a lot of value for providing focus.

That may feel a little counterintuitive when you're thinking about DevOps culture. I have a T-shirt that says, "Silos are for grain." As that adage implies, having hard barriers between your teams is something that we've tried to tear down culturally over the last decade.

But silos, or areas of focus that facilitate faster delivery of customer value, can actually accelerate your overall velocity. You can improve your quality and reduce duplicative work. So silos aren't always a bad thing when they're built on top of a platform.

And so you might just strap a platform engineering rocket on that there silo and finish the talk.

Andy Domeier

All right. Thank you so much. We hope our journey was helpful.