Platform Strategies Accelerating AI Initiatives

Las Vegas 2024

Sr Director of Technology · SPS Commerce

For years the Platform Engineering movement has gained momentum demonstrating organizational gains in availability, security, and developer productivity. Platform strategies often naturally encourage movements in other strategic areas of a technology roadmap such as API and data strategies. The consistent interface and accessibility to your data that platform strategies have built are proving to be an accelerator for AI initiatives that benefit from predictable and consistent interfaces to your proprietary data. We will see many of the organizations with effective and scalable cloud and platform strategies be the first to deliver meaningful AI wins to their employees and customers.

Full transcript

The complete talk, organized by section.

Andy Domeier

Thanks for coming to my talk today. Hopefully the conference has been great for you all so far. I thought the content this morning was pretty awesome, so hopefully it hit the mark for you as well.

Really quick: my name's Andy Domeier. I'm a Senior Director of Technology at SPS Commerce. I had the opportunity to speak at this conference last year, and it's been a really interesting journey since then. Last year, my talk was about platform engineering strategies and bringing internal developer platforms to your teams as a product. And then shortly after that, our organization started looking a bit more aggressively into what our strategies around leveraging AI were going to be. And it just became this really interesting intersection. So I thought it might be interesting to come back and tell that story as well.

What I'm going to talk through today is a lot about accomplishments that a lot of folks at our organization have driven. I just get to be the one up here talking about all the cool things we're doing, not necessarily the one doing them all. We'll talk a little bit about the company and where we're at, then talk a little bit about an observation I have about movements from the past and how they've influenced us. Then I'll share a bit about some of the things we've done at SPS in terms of mobilizing AI and how that's been going. Definitely want to make sure that there's good hallway track afterwards. If folks have questions, please catch me in the hallway, or if you've tackled some of these as well and have either success stories or war stories, I'd love to hear those.

About SPS Commerce

Really quick about SPS Commerce. We are a supply chain communications network, and we help retailers, suppliers, and logistics firms all communicate electronically — and ideally seamlessly — with all their supply chain data. Suppliers have a really, really hard job. If you're a supplier in the retail industry today, you have a lot of complex challenges going from manufacturing product to shipping a product to working with the retailers who are trying to get your products on their shelves. It gets really complex really quickly, and that complexity increases really significantly when you start adding more and more retailers. Different retailers have different requirements and different rules — and so you have to start keeping all of the different rules. Shipping directly to a consumer for Amazon is different than Walmart, is different than Target. Those might all be different than if you're shipping to a store. It gets really messy really quickly.

So what we do at SPS is work to really try to create an interface where we can centralize a lot of the rules and logic and make an easier interface for suppliers when working with retailers to be successful with their supply chains. I've had the opportunity to be there for almost 20 years now. We grow like crazy. We solve really meaningful problems. So a lot of the things we get to tackle have to do with growth and scale.

Atlas, our internal platform

I did talk last year about our platform — we call it Atlas. I think almost everybody calls their platform Atlas. It's just a good name. From our standpoint, we are probably four to five years into a centralized developer platform journey. So we have a fairly mature platform. It runs on Kubernetes with the Istio service mesh, and we have deployment patterns and observability tools built into those things. We've had a couple of important business priorities — like getting more region resiliency — that have really driven more adoption to the platform.

So the John Deere folks who were talking this morning had some great stories, but they kind of talked a lot about driving with authority versus just adoption. When we had good business needs, the same kind of thing happened.

I figured it was an AI talk, so I dropped a lot of AI images in along the way. So this is what AI thinks Atlas looks like. It's probably pretty accurate.

Movements from our past

So let's get into it. When I was getting ready for this talk, I was trying to think about how to tee up some observations. I think if you think about where AI's at, and everything you're hearing about AI, we've seen really similar movements in the past. Not the exact same movements, but similar movements.

The first one that comes to mind for me is cloud computing. Think 25 years ago — if somebody were to tell you, "Hey, you're going to be able to stand up an entire business through a UI. You're going to have storage, servers, load balancers, and all sorts of security tools at your fingertips, and you're not going to own a data center." Just the idea of that was absolutely crazy. There was a lot behind that movement. I think we're all pretty familiar with what it meant to go from this not existing at all to the way that pretty much every organization's going to start off.

The next one, I would argue, is fairly similar — microservice architectures. Within the microservice architecture ecosystem, we saw a lot over the years. But at the end of the day, you get into these scenarios with growing organizations where you start to learn you probably shouldn't have 18 different applications share the same username and password and connect to the same database. Doesn't scale really well over time. We learned a lot through that story and through what it meant to operate monoliths, especially in this new cloud environment.

What we learned from past movements

So thinking about those past movements, what did we learn from them? If we step back and say, what kind of influence did these things have on us as technology leaders in the industry? There are a few things that jump out to me.

- Managed services offer just a ton of immediate scalability and availability: an obvious place to start when you're going to go forward with any kind of project if you're already in the cloud.
- Infrastructure as code and repeatable deployment patterns have been strongly influenced by these movements. The ability to quickly deploy and recreate your infrastructure if you're in the cloud, in a consistent way, is so important.
- Observability is a requirement. Happy to debate that if you want to in the hallway. It's such an important part of what we all do every day for the services we're building.
- API strategies: if you start to think about microservice architectures and scaling out services to leverage the data that your business has or make it more available for services and products, that's obviously a really big deal.
- Advanced scheduling and scaling, now that we've moved to this more ephemeral compute model. That's where things like Kubernetes have been coming from.

What did our learnings produce?

If you take those themes, I think a lot of this ties back to platform engineering. I liked the scrolling list in the John Deere presentation today, the one that got a big round of applause: all the things you need to be an expert in to be successful. At SPS, we really look at solving these things centrally, or solving them in the right place, so that we're not requiring folks to go deeper if they don't have to. We can help them be faster.

Cloud, networking, account management, deployment, observability: if we can take some of those things off the list of what our product engineers need to worry about, we can be a lot more successful as an organization. Part of the outcome of that learning is that we are really enabling them to focus on solving the customer problem. The more of our engineers' time that can be focused on solving the customer problem, the more value we're going to ship.

GenAI is coming — it's mostly already here

Let's shift gears really quick and talk a little bit about gen AI. It's coming, it's mostly here. It's not tying my shoes for me yet, but it's coming. What does that mean for all of us if we've seen these movements in the past? Think about those organizations who heard about cloud computing and scoffed at it, or thought about microservice architectures but decided to stay monolith over time, and where they might be at now. Not saying they're out of business — but they're probably stuck. Probably a lot less agile, probably a lot less cost-effective than they could be.

Three perspectives for applying GenAI

A few things that we've done at SPS that I think have been really impactful: we stepped back and we created what we called a mobilization team. We said: what does it mean for us to approach AI technologies, and where may there be areas where we can accelerate really well, and where are there areas where things are going to be a lot harder?

When we talked through that and the different applications associated with AI, we started bucketing the opportunities into three different buckets:

1. Leveraging vendor features. Cheat code or easy button: you have a vendor, maybe Salesforce or GitHub Copilot, and they offer a feature you could just turn on. That doesn't seem like all that big of a deal, but it's a really interesting thing to think about.
2. Corporate or internal services and tooling. What kinds of things do your internal users do today that are maybe repetitive, require a lot of data or documentation, or general business activities that could benefit from gen AI tech?
3. Customer-facing features and functions. What about your services might be an opportunity for you to go to market with something you could sell or just improve your customer experience with?

Bucket 1 — Leverage vendor features

Like I mentioned, I feel like this is a bit of an easy button, but I would encourage you not to underestimate it. In our experience so far, getting this off the ground last fall meant bringing things like GitHub Copilot in as kind of normal and getting folks out there using it. If we're asking our developers to start thinking about how we're going to deliver and ship potential AI capabilities to our customers, and they're not using it themselves, they're really going to be limited in their ability to think about how they might want to apply this technology.

There are a lot of scenarios where there's just low-friction value to be added with something like Slack or Salesforce or those other tools where they're bringing something directly to your end users. And it isn't just tech — customer success, sales organizations, marketing — everyone improving and leveraging AI in different ways is going to help your entire organization be a lot more effective. So while not very techy, I do think this one's a really interesting one to consider, and it was really valuable for us so far.

Bucket 2 — Corporate/internal: the Sparky story

The next one is corporate or internal implementations of gen AI services. We bucket this one as your internal users. I think this is a really interesting opportunity. If you haven't thought about it, I'd encourage you to categorize some items in this way. The reason: if you're going to go into an AI project of any kind, or try to find some AI use cases that are solving some sort of problem, or a POC for your internal users — a lot of the limitations and restrictions that go into access management and governance are not necessarily going to be there. You can start by looking at, okay, I have my internal users, I know how to authenticate them — maybe they're behind your own AD or whatever. You can keep it inside your own network, and you can really govern what data you want to use.

Along with that, if you're just getting started on this road, it's pretty nice to have feedback. It's a lot easier to get feedback from somebody who can Slack you directly or maybe even sits right next to you, as opposed to the added formality that goes with going beta with a customer or bringing something brand-new to market. So it's a lot easier to get iterative, and it makes for a faster project.

The last one: surprisingly, at least for us, there are a lot of little wins. Just a lot of things where it was like, "Oh gosh, now we have this up and running. It was pretty easy to just get this benefit," without a lot of cost.

So on our side, the project that we did was called Sparky at #TeamSPS. We had a couple of developers that were able to step aside and focus on getting this service up and off the ground. A few things we did that were specific to SPS: it started more or less as a re-skin interface to GPT, just letting our users start to think about interacting with it. You certainly see some folks being a lot more inquisitive than others.

Once we got comfortable with the technology, the data, the architecture we built around it, we could start doing things. Really simple things to start with — like the company handbook, company policies, HR benefits, typical help desk questions. All these things that are documented and relatively static, but somebody's going to ask something about it a couple times a day. Great starting point for us to say, "This data isn't crazy proprietary that we'd never want out there, but it's also not zero value."

We did the same thing with the Atlas documentation — if an engineer couldn't quite figure out how to get tracing working or some sort of deployment aspect working. Then we took some of the information we use to help our suppliers work with retailers — to try to condense that and make it easier to ask questions about retailer rules. That puts our consultants in a better position to help answer our customers' questions.

We saw quite a few really interesting use cases with marketing and sales, which was not necessarily anticipated. But our mobilization team had representation from every org across the business, so they were in the room to share interests very specific to theirs. So for example, this deck — I got the template from Sparky, because it has all of our marketing branding loaded in there. Whereas before I would either grab my old deck and rename it and delete everything to start from scratch, or I'd spend 10 minutes looking around SharePoint to find where our brand content was before I could actually start my PowerPoint. That probably saved me five to 10 minutes. And then you can build different personas — we do that for training sales folks.

Bucket 3 — Customer-facing features

I almost didn't use this picture, I think it's so terrible, but I felt like I had to use it as just a demonstration of "we got a ways to go yet."

The idea now is that at a lot of organizations (and I think we're handling this well, but I'm sure you all have similar challenges), people on the product teams or executive teams want to be able to say they have AI in their products. They want to try to get something to market that they can stamp "AI" on. Most of us probably are facing some conversations like that. If we're not, we will be soon. And there's good reason for it: if you're a SaaS or tool provider, offering AI services can increase customer satisfaction, improve market perception, and give you better sales close rates. Ultimately your revenue's going to be better. So it's reasonable to think that most organizations should be planning, or in the process of trying to figure out, the ways they might leverage this technology.

But, and this is the big "but", once you get to that point, you're leaving that internal ecosystem, your corporate users, and now you have customer production-level expectations. You have to ensure it's highly available if you're going to go to market and sell this product. You have to be able to resolve issues really quickly. You have to have the utmost confidence in your data security. And you have to make sure all these things meet all of your legal requirements and contracts. These are all the things you would do for any other product, but we haven't really put a lot of time into thinking about how you would do them for an AI product.

Why this sounds familiar

I grabbed this off the internet just because I thought it was a helpful diagram to illustrate. Not trying to be doom-and-gloomy. The service providers are doing a really great job of making these technologies available and approachable for everybody. They're making them really easy to use, which is great. But parts of these architectures — this is just a generic "what you might do if you use AWS and Bedrock" — include aspects like, "yeah, here's just a couple boxes. Here's my AppSync token, I'm just going to go ahead and talk to your S3 bucket, now I'm going to go talk to your database." And it may or may not have some sort of access management logic. But if you are producing something that is customer-facing, you probably need a little bit more than "this server has access to talk to this S3 bucket, and we're good to go." There's a lot more that can happen behind the scenes when we start working with gen AI.

When we started talking about that, we said: hold on a second. Some of these business problems sound pretty darn familiar to past movements. So let's step back and look at those four bullets from the last slide — high availability, performance issues, security and data, and legal.

How Atlas already solves the customer-facing-AI problems

High availability. Our current platform spans multiple AZs in multiple regions. We leverage cloud-managed services where possible — both OpenAI and Bedrock. So from an availability perspective, there's a lot we've done in our core platform already that puts us in a good position. We're solving that already with other services.

Resolving performance issues quickly. We leverage a lot of our observability tools within our platform. If you deploy, you're going to get metrics out of the box. We're going to centralize your logs for you. We're going to grab your traces, set up really well through an OpenTelemetry deployment.

Really fun success story here. The lead engineer who drove a lot of the Sparky project internally on our side had an issue. It was about a month in. Sparky went down. Nobody could figure out why. It didn't exactly go down — it was like it went to sleep. Like your friend just nodded off while you're trying to have a conversation with him and just quit responding. We did some digging, and he ended up finding through the traces that he got from being on the platform that his vector database had filled up. Now, he should have probably had good instrumentation on that database, but that database was a relatively new implementation in a different tech stack in general — we were still trying to POC and get this thing up and running. So some of those things were really new. But in that scenario, he called me right after he figured it out and was like, "Andy, I just gotta tell you — this thing would've taken me hours. But because I deployed my primary API interface through Atlas, we ended up getting that tracing, and I found this thing in like 15 minutes. I figured it out. We extended the tables, or they deleted some old stuff." Just from being in that ecosystem, they got the ability to respond and perform. They got that observability aspect of things.

Security and data access management. This one is the one that keeps me up at night right now — thinking about how far we have to go as an industry to get really good patterns here. This is something I'd love — if any of you are doing similar things, I'd love to hear about it after the talk in the hallway. I think this is an area where as a community we all really need to start learning from each other.

For us, the way that we've been looking at it is: that API strategy, the API gateway ecosystem we have today in Atlas, really solves a lot of the things that keep us up at night. It requires identity and authentication, whether it's an SPS user, SPS service, or a customer connecting to one of our APIs. Along with that, we have an entitlement service that identifies: "Hey, this user is making this GET request on these APIs. Are you authenticated? Do you have authorization to do that? Yes or no? And then, are you entitled to the data that you're asking for? Yes or no?" I don't know that API gateways are always going to be the answer to those kinds of complexities for AI workloads, but I think for now API gateways are in a really good position to help answer those kinds of questions.
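The three questions in that passage (authenticated, authorized, entitled) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SPS entitlement service: the `Caller` class, the scope strings, and the tenant IDs are all hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    """Identity a gateway might resolve for a request (hypothetical shape)."""
    subject: str                      # user, service, or customer ID
    authenticated: bool               # did the caller's token validate?
    scopes: frozenset = frozenset()   # operations the caller may perform
    tenants: frozenset = frozenset()  # data owners the caller may read


def check_request(caller: Caller, required_scope: str, data_owner: str) -> tuple[bool, str]:
    """Ask the three questions in order and return (allowed, reason)."""
    if not caller.authenticated:
        return False, "unauthenticated"
    if required_scope not in caller.scopes:
        return False, "unauthorized"
    if data_owner not in caller.tenants:
        return False, "not entitled"
    return True, "ok"


# Hypothetical caller: may read orders, but only for supplier-a.
consultant = Caller("user-123", True, frozenset({"orders:read"}), frozenset({"supplier-a"}))
print(check_request(consultant, "orders:read", "supplier-a"))  # (True, 'ok')
print(check_request(consultant, "orders:read", "supplier-b"))  # (False, 'not entitled')
```

Keeping the entitlement question separate from authorization is what stops a caller who is allowed to use an API from reading another customer's data through it, and that is the risk when an LLM is wired straight to a data store.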

This is going to be an area that's a little scary if you get pushed really hard on deadlines to get something out the door. There are so many easy ways to just wire up an LLM to S3 buckets, feed it a database schema. All of the cloud providers are making that as easy as possible, which I think is good. But if you start bypassing some of these core functions, you could really put yourself in a pickle.

Legal requirements. You're all on your own. Everyone's got different business requirements. But the one thing that I would say we were really happy we did early on was, we got the conversation started. You do not have to have a POC. The product team doesn't even have to have come and talked to you yet about the fact that they want some sort of AI product. But you can at least start that conversation with the legal department and others — to kind of work through, "What does your current contract paper look like? Are there concerns with the kind of data you do have?" Those conversations take some time, and there's no reason to wait until you have working software. Just get those knocked out right away. That was really helpful for us.

The story visualized

So I just threw a couple of diagrams together really quick to visualize the Sparky story. The Atlas ecosystem — API gateways, Kubernetes, AWS, all the deployment patterns that are there — you can kind of see the different things we talked about that are solved within those boxes by running Sparky in there. We had a lot of good opportunity to navigate that.

General flow today for a user that might be leveraging Sparky: it's more or less a chat interface today. It is available via API, but it's not the most common use case for us. The user is going to log in — that identity and authentication step that happens with every other UI we're running is part of the process. The request is going to get over to the Sparky API. That request may have some specific SPS data that it's interested in.

I think this is a really important step for us: because our Sparky interface is deployed within our ecosystem, it maintains that API gateway identity. So it can talk and carry around the tokens and metadata necessary to make different internal requests, so that an API can validate that the requests it's making are legit. And then likely a lot of our workloads in Sparky are talking to OpenAI — we're using ChatGPT.
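One way to picture "carrying around the tokens and metadata" is a helper that copies the caller's identity and trace headers onto each internal request. The sketch below is an assumption about how such propagation could look, not a description of Sparky's code; the header list and the `outbound_headers` function are hypothetical, although `traceparent` is the header name defined by W3C Trace Context.

```python
# Headers to carry from the inbound request to internal API calls.
# The list is illustrative; "traceparent" is the W3C Trace Context header.
FORWARDED_HEADERS = ("authorization", "traceparent", "x-request-id")


def outbound_headers(inbound: dict[str, str]) -> dict[str, str]:
    """Copy the caller's identity and trace metadata onto an internal request."""
    lowered = {name.lower(): value for name, value in inbound.items()}
    if "authorization" not in lowered:
        # Refuse to call internal APIs anonymously on a user's behalf.
        raise PermissionError("inbound request carries no caller identity")
    return {name: lowered[name] for name in FORWARDED_HEADERS if name in lowered}


inbound = {"Authorization": "Bearer example-token", "X-Request-Id": "abc-123", "Cookie": "ignored"}
print(outbound_headers(inbound))
# {'authorization': 'Bearer example-token', 'x-request-id': 'abc-123'}
```

Because the downstream API receives the original caller's token rather than a shared service credential, it can apply the same authentication and entitlement checks it applies to any other client.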

An unexpected lesson — region-specific OpenAI capacity

One thing we learned, this is really fascinating. If you go back to the whole "movements" thing — there is, like, a significant chip shortage, all the buzz about chip makers. We have run into so many region-specific capacity issues. We're not doing anything big by any means — but it's like, "Oh, they're out of capacity for the next half an hour." It's the craziest thing. It feels like you're back in 2005 where you can't spin up an EC2 instance because US-East-1 is out of servers. But that's where we're back again. It's an actual legitimate hardware problem. So we ended up having to load balance between regions as a way for us to route requests to a region that had capacity. That was a bit unanticipated, but I wanted to put it out there in case that was a tidbit that would benefit you folks.
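A minimal sketch of routing around a region that is out of capacity might look like the following. It uses simple ordered failover rather than true load balancing, and everything in it is assumed for illustration: the region names, the `CapacityError` type, and the `fake_send` stand-in are hypothetical, and no real provider API is called.

```python
class CapacityError(Exception):
    """Raised by a backend when a region has no capacity right now (hypothetical)."""


def call_with_failover(regions, send, prompt):
    """Try each region in order; return (region, response) from the first with capacity."""
    failures = []
    for region in regions:
        try:
            return region, send(region, prompt)
        except CapacityError as exc:
            failures.append(f"{region}: {exc}")
    raise CapacityError("all regions out of capacity: " + "; ".join(failures))


# Stand-in for a real model client: pretend the first region is full.
def fake_send(region, prompt):
    if region == "region-a":
        raise CapacityError("out of capacity")
    return f"[{region}] answer to: {prompt}"


print(call_with_failover(["region-a", "region-b"], fake_send, "hello"))
# ('region-b', '[region-b] answer to: hello')
```

A production version would likely also need timeouts, backoff, and some memory of which regions failed recently, so that requests do not keep probing a region that is known to be full.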

Wrap-up

Three big themes in our journey so far in the last nine months.

1. Platform and API gateway strategies. Whether you have an existing platform today or are working on a platform, this is maybe another reason to sell the platform mindset. The platform approach is going to help you build, deploy, and operate your AI services quickly and safely. You want to enable your organization to adopt these technologies, but you want them to be able to do it safely.
2. We benefited a ton from categorizing our implementations, from vendor, to internal, to customer-facing. That let us really segment and have more effective communication.
3. Reflect on the idea that this is a movement, and we've seen movements before. How did they affect your business in the past? Because some of the things that you've learned from those, or that influenced your solutions, probably are very similar to things you could apply today.