Architecting for AI: How Platform, Data, and Cloud Strategies Unlock Enterprise AI Success

Las Vegas 2025

Andy Domeier

Sr Director of Technology · SPS Commerce

Scott Brons

Principal Engineer for AI · SPS Commerce

Over the past decade, enterprises have embraced Platform Engineering, Cloud-first strategies, and Data-driven architectures as key enablers of software innovation. However, as AI adoption accelerates, many organizations struggle to scale their AI initiatives beyond isolated experiments to meaningful business impact. The missing link? The intersection of platform, data, and cloud strategies. Organizations that have invested in developer platforms, well-governed cloud strategies, and structured data ecosystems will be the first to unlock scalable AI. AI workloads thrive in environments where APIs, data pipelines, and cloud resources are streamlined, enabling rapid experimentation and trusted operational deployment.

In this talk, we’ll explore:

• Platform Thinking & AI: Why treating AI infrastructure as a product improves velocity and reliability.

• Data as the AI Fuel: How organizations can structure their data pipelines and governance for trustworthy AI outcomes.

• Cloud & Compute Strategies: How cloud architecture decisions impact AI scalability and cost efficiency.

• Lessons from SPS Commerce: Real-world insights into how these strategies have enabled AI initiatives in a large, complex enterprise.

• Iterative Steps to AI Readiness: For organizations still maturing their data strategies, we’ll outline practical steps to start enabling AI in smaller, achievable milestones—ensuring value at every stage.

By the end of this session, attendees will walk away with practical strategies to align their platform, data, and cloud investments with AI acceleration—ensuring their AI initiatives aren’t just experiments but high-impact business enablers.

Full transcript

The complete talk, organized by section.

Host Intro (Gene Kim)

Up next are Andy Domeier and Scott Brons. Andy is Senior Director of Technology at SPS Commerce and has presented at this conference numerous times about his adventures heading up internal platforms to elevate developer productivity. This is especially important for a company that has over 3,000 employees supporting some of the best-known retailers, suppliers, and logistics providers, each with different business rules and ERP systems.

This year, he is co-presenting with Scott Brons, Principal Engineer for AI, on their experiments to use AI not just to elevate developer productivity, but for employees across the enterprise. What I found so especially dazzling about this talk was the variety of use cases they experimented with and explored, and the fact that Scott used to be a people leader but, like so many others because of AI, decided to jump back into an individual contributor role because it is so fun. Here are Andy and Scott.

Andy Domeier and Scott Brons

Andy Domeier: Real quick, shout out to our folks at SPS Commerce. Scott and I get to be up here today, but we have a ton of awesome folks we get to work with. We're just the lucky ones here talking about the great things. A huge shout out to Jeff, Gene, and everyone who put this conference together. Such an awesome community and learning environment. It is just a really special event, and I'm so thankful to get to be a part of it.

I'm Andy Domeier, Senior Director of Technology at SPS Commerce. I oversee our platform engineering organization as well as our cloud engineering, and I've been with SPS for 20 years now.

Scott Brons: And I'm Scott Brons. I'm a Principal Engineer focusing on AI, and I've been with SPS for just under 10 years.

Andy Domeier: Really quick, a little bit about SPS Commerce. We're in kind of an interesting situation. You probably haven't ever heard of us, but it is likely you are either wearing something today or you have something with you that relied on our communication network for its supply chain data.

One of the things that is so fascinating about what we are doing is we are solving really complex problems in an environment that is getting even more and more complex. As a supply chain communications network, we are helping suppliers, retailers, and logistics firms all try to coordinate and communicate in an automated fashion so they can have seamless communication to orchestrate all the complexities.

Suppliers especially have a really hard job just operating their business. On top of that, they are working with lots of different retailers, so they are exposed to a lot of different requirements. Those requirements get really complicated, and it becomes really easy to produce erroneous shipments or to miss a formatting structure or a rule that a retailer has because it operates its business differently.

SPS Commerce approaches this business problem in a way where what we are trying to do is provide a single interface to our network to really help simplify and automate the data transformation coming in and out of different organizations and ERPs, and also automate the business rules associated with that. That helps people find and stay inside those guardrails and make sure their communication can be quick, seamless, and reliable.

A little bit more about SPS: one of my favorite metrics up here is that we are at 98 quarters of consecutive growth. We are two quarters away from triple digits in terms of growing consecutively. We have 480 engineers globally, and we have processed well over a trillion dollars in commerce just in North America. That is through over a million relationships that we are connecting worldwide.

If we think about data as being a really important fuel for AI, you can see where we have a lot of really interesting and meaningful data sets that we could provide a lot of value with when we start thinking about how we might apply AI to these things. We have been really curious and interested in trying to accelerate what we can do to adopt AI and bring it into our products in a way that can help bring more value to our customers. Along the way, we have met a fair number of different barriers that we are hoping to share our experiences on.

Scott Brons: When we started our journey a couple years ago, we really tried to break it down into two different barriers that we wanted to break down to make it so we could really work with AI.

First and foremost, people and culture. How do we enable people? How do people feel comfortable with it? How do people just start using it and know where to start? Then platforms and data. AI is the icing on the data cake, so without having our platforms and our data ready for AI to be able to use and work with, where are we going to start? That is where we started.

Andy Domeier: One thing I have to call out is Laura and Bruno's talk this morning. I was joking with Gene just a minute ago: it is like he knew all these talks were going to be building on top of each other. For some reason, this content is curated. I was really inspired by the way Laura and Bruno started talking about the data and the metrics in their talk.

For those of you in this room who are leaders, we have a really serious obligation right now. There is massive change coming to our teams and to our organizations, and we have an obligation to help manage that change in a way that alleviates human stress and helps our engineers be successful.

What we found along the way is three common themes that really drove a lot of hesitancy internally on our side when it came to the adoption of AI. The first was just that paralyzed, "I don't know how to apply it. I don't know where to start. I'm not sure what to do."

The second was a bit more nuanced, but it was really important because we have seen this more and more in our senior engineers, and I think SRE referenced this a little bit: this sense of loss of authorship. These folks have a lot of pride in their engineering craft. What we noticed is that they feel like they are not the author anymore if they are leveraging AI technologies. They have a hard time seeing that the knowledge, skills, abilities, and prompts they use to get to a particular outcome are something they can have a sense of authorship over, and they are not necessarily seeing AI as a teammate in that sense.

Scott Brons: Similar to loss of authorship, there is the fear of looking inefficient or replaceable. Across all of SPS, not just engineers: if I use AI, am I really good at my job? Am I easily replaceable at that point in time? I have a fear of starting to work with it because if I do work with it, it is going to prove those fears true.

We really needed to knock down the people and culture barrier first. We went in with three different steps.

First, the early adopters, me being one. Like Gene mentioned, I was a manager for nine of my 10 years at SPS. In the last year, I finally realized I cared so much more about applying AI to our customers' problems, or to our internal customers' problems as well, rather than actually growing the careers of folks on my team. That was a problem, so it was time to switch. I jumped fully into working as a principal engineer with a bunch of other early adopters. It was not just me. There were a lot of folks who were really curious, and we knew if we tapped into that curiosity, we would be able to build with speed.

Our first step was building literacy. To build literacy, we first had to make a very safe place for everybody to be able to use our customers' data and our own internal data in a secure location where they felt comfortable actually using it. We built something we called Sparky, to spark your imagination. It is an internal AI agent platform. We built it on top of LangChain. I will talk much more about the internals of that in a little bit. It was a safe place for everybody to experiment and potentially build out production workloads.

But we cannot just put out Sparky and expect everybody to start using it. It is the same as any other AI chat box. It is an empty thing. So, training. We rigorously made a training schedule that everybody across the entire company, not just the engineering department, had to go through. It seemed simplistic at first: how to prompt, how to create an email, how to do certain things. Then it got into how to create an agent inside Sparky, how to assign tools to that thing to be able to actually use it, so people did not just come to an empty box and not know what to do.

Sparky was for experimenting: try out what you do at work safely inside our application.

Finally, we did not stop with training. We went to build a community as well: a community across all of SPS so people could bring up different ways they are using it, problems they are running into, different tools they want to start working with, and all that sort of stuff.

Then we had the AI Guild, which is actually for our builders. If we are building AI capability for our customers or for our internal customers or for anybody at SPS, how should we go about it? What should be our guardrails? How should we use MCP? What model should we be using? What areas should we be working with? We started to form those opinions and have a community to work through them.

Finally, the community of excellence. This is where we start to set guardrails: here are the guardrails about how we build MCP, and these are the guardrails about when you build an agent, this is how we expect it to be built and where we expect it to be built. As Bruno and Laura mentioned, Sparky is great, but it is not the only AI tool out there. There are many. How do we fast-track the security review, the legal review, and the onboarding of all these different products so we can actually try them out? It is really about building that community so everybody felt included.

Andy Domeier: As Gene mentioned, I have had the opportunity to be at this conference, and this is my fourth year in a row. I feel so excited to get the opportunity to be here again. Every year has been so fun because we just keep sharing what we are learning. A lot of our journey has built off of 2022, when we talked about what we are doing to simplify and centralize undifferentiated engineering, ultimately with the goal that our engineers are focusing on solving customer problems and carrying that forward.

One thing we observed as AI began to build momentum is that a lot of the requirements coming from these AI workloads, and the need to incentivize, encourage, or change behavior around AI adoption, follow a ton of the same patterns we learned with platform engineering and adopting the cloud. There are a lot of similarities across these things. That is what brought us here today to share a bit more about that.

I think we will probably hear quite a bit about other people's perspectives on the way platform engineering and AI requirements overlap. But there is really direct overlap: thinking about everything from needing to deploy it, wanting to manage your cost, but most importantly, what are you doing with authentication? How do you ensure that you are routing traffic? How are you managing potential load or unanticipated load within your environment?

A lot of this sits on top of, or is a big part of, enabling access to data. Data is a fuel for AI. It is really important that we govern and manage our data, but it is also really important that it is accessible to everyone who needs it in a safe way so they can be innovative and experiment. How do we fix that barrier at the intersection so folks have the access they need to start exploring this?

One thing that I think is pretty basic, but has been really meaningful for us, is thinking about our data in terms of the authentication pattern to it. Being able to start with just that internal RBAC-controlled item is the most basic and easiest thing to start opening up to the right people safely and building that comfort level before getting to the tier that requires the utmost security and maturity around customer data.
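As a minimal sketch of that tiering idea, the check below opens up the internal, RBAC-controlled tier first and keeps the customer-data tier closed until it is explicitly enabled. This is not SPS Commerce's implementation: the tier names, role names, data source names, and the `can_access` helper are all hypothetical and exist only to illustrate the pattern.

```python
from dataclasses import dataclass
from enum import IntEnum


class DataTier(IntEnum):
    """Hypothetical sensitivity tiers, ordered from least to most sensitive."""
    INTERNAL = 1   # internal, RBAC-controlled data such as developer docs
    CUSTOMER = 2   # customer data, which needs the most security maturity


@dataclass(frozen=True)
class DataSource:
    name: str
    tier: DataTier
    allowed_roles: frozenset[str]


def can_access(user_roles: set[str], source: DataSource, max_tier: DataTier) -> bool:
    """Allow access only if the source's tier has been opened up to AI
    workloads (max_tier) and the user holds at least one permitted role."""
    if source.tier > max_tier:
        return False
    return bool(user_roles & source.allowed_roles)


# Hypothetical data sources for illustration only.
dev_docs = DataSource("dev-portal-docs", DataTier.INTERNAL, frozenset({"engineer"}))
orders = DataSource("customer-orders", DataTier.CUSTOMER, frozenset({"support"}))
```

Starting with `max_tier=DataTier.INTERNAL` lets people experiment with internal data right away, and raising the ceiling later is a single, auditable change.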

One of the first experience reports we wanted to share had to do with DevEx. At the end of the day, our objective was to increase our literacy, and along the way we think that is going to increase our effectiveness. But a hidden motivation was that we knew we wanted to deliver AI features and functionality to our customers and our products. How many of you right now have some feature in your backlog that is "create an MCP server" or "create an AI chatbot," but the developers working on it are not actually using AI themselves yet? How are they supposed to successfully deliver AI features and functions to customers if they are not using it?

We felt very strongly that we needed to make sure we got internal momentum on this with our engineers and built their experience so they could carry that forward into our products.

In this case, it is looking at our production and enterprise data. On the left, you can see the developer documentation and our dev portal data, and that helps us drive some automation downstream with regards to pipelines and things our engineers are using.

Instead of sharing different metrics about team performance, change rates, and things like that, we thought we would share more specific details on projects we did, because they were really cool and meaningful.

The first one on the left: we had a service where we needed to change the way it delivered its payload responses back to the services that were calling it. We had 10 services dependent on that service. Typically, you would feature-flag this and update the 10 downstream services once you have updated the upstream service. Needing to solve the same problem in 10 different services made this a great candidate for AI, and an engineer went after it. Something that would have taken about two weeks, maybe a little more, in terms of updating those dependencies, organizing, and phasing in the feature flags, ended up taking two days from ticket to production. That was a really cool experience. We learned that these repetitive tasks are a huge win for where we can apply AI.

Secondly, on the right, we had an intern on our team who picked up a ticket for adding a new linting step, just a pre-commit hook the team wanted added, but it was across 40 repos. Through iteration, he found a fun, fast approach to deliver this, going from a tedious two weeks down to three days. You will notice the reference to FAAFO there. You will all learn a lot more about what that is on October 21st. I think something is being released that will uncover that.
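The talk does not say how the change was actually rolled out, so the following is only a sketch of why a task like this is mechanical and repeatable: the same idempotent edit applied to every repository. It assumes each repository uses the pre-commit framework with a `.pre-commit-config.yaml` whose `repos:` list is the last top-level key and is written with unindented list items. The hook id `example-lint` and the script path `./scripts/lint.sh` are hypothetical.

```python
from pathlib import Path

HOOK_ID = "example-lint"  # hypothetical hook id

# A "local" pre-commit hook entry; the entry script path is an assumption.
HOOK_SNIPPET = """\
- repo: local
  hooks:
    - id: example-lint
      name: example lint
      entry: ./scripts/lint.sh
      language: system
"""


def add_hook(repo_dir: Path) -> bool:
    """Append the hook to one repo's config. Return True if the file changed."""
    config = repo_dir / ".pre-commit-config.yaml"
    text = config.read_text() if config.exists() else "repos:\n"
    if f"id: {HOOK_ID}" in text:
        return False  # already present, so re-running is a no-op
    if not text.endswith("\n"):
        text += "\n"
    config.write_text(text + HOOK_SNIPPET)
    return True


def roll_out(repo_dirs: list[Path]) -> list[str]:
    """Apply the change to every checkout and report which repos were modified."""
    return [d.name for d in repo_dirs if add_hook(d)]
```

Because `add_hook` is idempotent, the rollout can be re-run safely after a partial failure, which is most of what makes a 40-repo change tedious by hand.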

Scott Brons: We have two more experience reports, but before we get to that, I want to go over what Sparky is. I mentioned it is an application built on the LangGraph and LangChain libraries, and we built it on top of Andy's platform, Atlas. That is where we deploy all of our different workloads.

Sparky is where you can define agents. Those agents are pinned to different models, and we serve those models from Azure or from AWS Bedrock and potentially other spots as well, but ones that we have vetted through security. We also have a list of tools. We started doing this before MCP was a thing, but we have the list of tools we created that reach out to different SPS systems or different data sources. Sparky was our safe place to experiment and build actual production workloads as well for internal use cases.
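To make that structure concrete, here is a small sketch of an agent definition pinned to a vetted model and drawing its tools from a shared catalog. It is plain Python rather than LangChain or LangGraph code, and it is not Sparky's actual design: the model identifiers, the `AgentRegistry` class, and the stub `search_docs` tool are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical identifiers standing in for models that passed security review.
VETTED_MODELS = {"azure:example-model", "bedrock:example-model"}

ToolFn = Callable[[str], str]


@dataclass
class Agent:
    name: str
    model: str
    instructions: str
    tools: dict[str, ToolFn] = field(default_factory=dict)


class AgentRegistry:
    """Holds agent definitions and enforces the model and tool guardrails."""

    def __init__(self, tool_catalog: dict[str, ToolFn]) -> None:
        self._catalog = tool_catalog
        self._agents: dict[str, Agent] = {}

    def define(self, name: str, model: str, instructions: str,
               tool_names: list[str]) -> Agent:
        if model not in VETTED_MODELS:
            raise ValueError(f"model {model!r} has not been vetted")
        unknown = [t for t in tool_names if t not in self._catalog]
        if unknown:
            raise ValueError(f"unknown tools: {unknown}")
        agent = Agent(name, model, instructions,
                      {t: self._catalog[t] for t in tool_names})
        self._agents[name] = agent
        return agent


def search_docs(query: str) -> str:
    """Stub tool; a real one would call an internal system or data source."""
    return f"(stub) documentation results for {query!r}"


registry = AgentRegistry({"search_docs": search_docs})
support = registry.define("support-helper", "bedrock:example-model",
                          "Answer support questions.", ["search_docs"])
```

Keeping the model allow-list and the tool catalog in one place is what lets non-engineers define agents without bypassing the security review.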

The first experience report using Sparky is for one of our internal tech support teams, the MMT team. They work with Microsoft Dynamics. When you work with SPS, there are a lot of different ERPs out there. Microsoft Dynamics is one of them. Sometimes our customers use that ERP to send and receive data from the SPS system and from their customers, which are the retailers.

This Microsoft Dynamics adapter, this MMT area, can be really tricky to support. There are a lot of different requests from our customers and internal customers, and a lot of repeated requests. That team had a lot of great documentation, but it is in places not everybody has easy access to. Having to read all of the documentation before you can apply it to your use case makes it that much more of a barrier.

That team needed to speed up how fast they were able to answer questions and improve the quality of the questions that were coming to them. How can their customers or internal users answer their own questions before escalating?

We paired with that MMT team and built an agent on Sparky with them. We attached the Jira tool to their project, the Confluence tool to their project sites for their knowledge base, and then they had a tool that has been around for a long time. They used Synergy for a lot of their old use cases, so they knew when this problem happened, here was the fix. Now they use Salesforce to manage all that, so we also had to attach in the Salesforce cases.

With just the Jira and Confluence tools, which we released right away, the team was immediately finding huge benefits. Then when we added Salesforce cases and Synergy, it was through the roof. Right away they started adopting it big time. We started this back in January or February, and within the year they are now going to Sparky first to do their work. We are connecting that in through Slack and all that sort of stuff so they can easily stay where they need to be working and continue to answer their questions.

Since the beginning of the year, it is a smaller team, but they have had 112 different users using that agent, thousands of conversations, and 6,500 chats. Actually, that number is a little low because I pulled that a good month ago.

For our next experience report, this one really showed that we are on the right track with community and culture and enabling all of our different teams, because I really had no part in it. I found out about it after the fact. We enabled these teams by building up that community so they could review and onboard different applications and bring in these different things.

The product marketing team creates go-to-market materials for all the different products that release and that the sales team needs to put out. They are usually two- to three-person teams, and they take a little while to put out these different marketing materials.

What that team did is they came up with an assembly line. They created their own agent inside Sparky to get the go-to-market stuff started. They basically built out the scaffold. Once they built that out, they started discovering other applications: Lovable so they could do more ROI calculations, and Gamma to make a better deck. Sparky is good, but Gamma is really specialized at that sort of stuff. Then they used Crayon to build out battle cards and things along those lines.

These two- to three-person teams started operating with the efficiency of an eight- to 10-person team, and they could really start churning out these high-quality go-to-market strategies for the sales teams.

Andy Domeier: To recap, three keys to moving beyond experiments. People and culture: I cannot hammer on that one enough. Really make sure the entire company, not just the engineering team, is confident in their usage of AI and has a safe place to do that, to provide our internal company data or customer data to these AIs and actually work with it. Make sure they are literate with it: not just that they understand that they type into a box and get an answer back, but that they actually understand what happens in the background, where the data is going, what tools are being used, and how AI works.

From a platforms and data side, what we have learned that has been meaningful is: do not reinvent the wheel where you do not have to. If you have gone through these things, whether for microservice or API ecosystems, you have solved a lot of things like data access management. If you are in the cloud, you are probably doing FinOps. A lot of these things are going to apply. Step back and ask where these things intersect. Maybe there are features and functions you need to bring to your platform that will help facilitate MCP workloads or things like that. But at the end of the day, there is so much overlap there. It is a meaningful accelerator for all of us to lean in and leverage those things.

One thing, even just in the talks this morning and what we learned along our way, is that there is a lot going into the adoption right now, and I think we have heard the word safe or safety a lot. Think about your engineers' experience in this space. How can they go be innovative? One way people are going to be innovative is because they feel free or they feel safe in the environment, whether it is environment variables and they know they are not going to bring down production, which was referenced earlier.

What we found is that where we have had pockets within our organization where folks felt safe and it was clear what they could be doing, exploring, and being curious about, we have seen really meaningful outcomes. I think the product marketing one is still my favorite. It is really special.

The help we are looking for: as you can imagine, with the communications network, we have a lot of event-driven or event-stream architectures. A lot of the things we have done so far within our workloads have been very API-centric. We would love to hear more from folks here. If you are doing things with Kafka or anything else and you are trying to bring AI into that ecosystem, or if you already have, we would love to connect. If you have not but you are going to, we would love to connect so we can stay in touch and learn together over time.

Please find us in the hallway. We would love to hear more about what you all are working on or questions you have for us. Thank you so much.