Thou Shalt Become Developers - From Ops to Dev at Wayfair
At Wayfair, we sell furniture and home goods online, all around the world. We’re a technology-first business, and one of the main ways we use supplier data is with Electronic Data Interchange (EDI). Through EDI, we exchange purchase orders, inventory, invoices, and many other documents with suppliers, carriers, and other partners across the business. For years, we’ve had an EDI operations team customizing an EDI COTS platform. We’re now replacing it with a microservices-based document translation system in the cloud, which means turning our EDI operations team into a cross-functional delivery team practicing You Build It You Run It. We’d like to share our approach to cross-skilling our operations engineers into developer roles while tackling cultural challenges and misconceptions, and learning from our mistakes as we go. We’re putting our people first, so we can take control of our own destiny in terms of throughput, reliability, and quality for document exchange within and outside of Wayfair.
Full transcript
The complete talk, organized by section.
Host Intro (Gene Kim)
Welcome back to the last set of talks here at DevOps Enterprise Summit Live 2022. I am super excited about this talk because it's so different than any talk I've ever seen. It came in through the CFP, and I think it caught the attention of virtually everyone on the program committee because it is so unusual and, quite frankly, so beautiful.
It's about an obscure part of the technology landscape. I'm sure every large, complex organization that actually has to transfer data to or from another organization has one of these functions. It's a function that is often taken for granted because it's almost invisible, and it's been running well enough for decades, until, that is, it breaks, and suddenly almost none of the critical business processes of the enterprise work, often escalating to the very highest levels of the organization. And that area is EDI, electronic data interchange.
Michaela Madden is the EDI product manager at Wayfair, and she saw a problem that she thought was very, very important, but most people wouldn't believe her. She is co-presenting with Tommy Hinrichs, technical principal, Equal Experts, and we'll be talking about what they saw and what they did about it. And I can promise you, after this presentation, I'm pretty sure you'll never think about EDI in the same way again. Here is Tommy and Michaela.
Michaela Madden
Hey, everyone. We're here to share with you how we transformed a key Wayfair delivery team with our in-team operations engineers transitioning into developer roles.
I know that we've all heard about developers moving into operational roles, but we're going the opposite direction in our corner of Wayfair, and we think that it's important to share these kinds of stories with the community.
My name is Michaela Madden. I'm originally from Vermont, but I'm currently based in Boston, Massachusetts. As Gene mentioned, I'm the electronic data interchange, or EDI, product manager at Wayfair.
I've been working in the EDI space now for around seven years, but around two years ago I transitioned into the role of the product manager. I kind of wanted to just switch it up, but I also wanted to have more involvement with the planning and also with the teams across Wayfair that I may not have had the opportunity to work with if I had just been in an engineering role.
I've been managing the roadmap and helping the team to balance the new system build-out with our existing operational responsibilities.
Tommy Hinrichs
I'm Tommy Hinrichs. I'm from Idaho, and I'm a technical principal at Equal Experts. We're a global software consultancy of over 3,000 expert practitioners, and we help enterprise organizations achieve long-term innovation and lasting transformation. Our North America office would love to hear from you. Stand up, Sid.
I've got 22-plus years' experience in all areas of the software development lifecycle, guiding teams to realize business value one thin vertical slice at a time. I have done pretty much everything in the SDLC, from introducing agile and Kanban processes, being TPO of a CI/CD system used by thousands, to principal engineer, to an engineering manager. And I started my career in operations in QA way back.
Michaela Madden
So a little bit about Wayfair. It's an e-commerce platform exclusively focused on the home. We offer our 24 million customers a first-class experience by unifying customer service, shopping experience, fulfillment, and supplier services.
We partner with our 23,000 suppliers through a third-party seller model to ensure marketplace access for each vendor. We deliberately have a broad supplier base, and no one supplier makes up more than 2% of our revenue.
We've got around 3,000 engineers that create purpose-built technology which powers everything that we do for our customers.
One of the ways that our suppliers can integrate with Wayfair is through electronic data interchange, or EDI, to exchange documents and transfer information between around 10 to 12 different domain systems across Wayfair.
Our long-standing EDI team is responsible for creating, maintaining, and troubleshooting two different things. The first being the EDI pipelines that all of our teams rely upon for sending orders, translating inventory, and many more. And the second being the EDI documents and non-EDI file exchanges, such as CSV, XML, JSON, that our product and engineering teams use for carrier and supplier interactions.
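As an illustration of the kind of document being described (this example is not from the talk), an X12-style purchase order is a flat string of segments, commonly terminated by `~`, whose elements are commonly separated by `*`. The sketch below is a minimal parser under those assumptions. The sample message, the function names, and the fields pulled out of the `BEG` and `PO1` segments are hypothetical simplifications; a real interchange also carries ISA/GS envelopes and declares its own delimiters.

```python
# Minimal sketch: split an X12-style message into segments and elements.
# Assumes "~" terminates segments and "*" separates elements; real
# interchanges declare their delimiters in the ISA envelope (omitted here).

def parse_x12(raw, segment_terminator="~", element_separator="*"):
    """Return a list of segments, each a list of element strings."""
    segments = []
    for chunk in raw.split(segment_terminator):
        chunk = chunk.strip()
        if chunk:  # skip the empty piece after the final terminator
            segments.append(chunk.split(element_separator))
    return segments


# Hypothetical, heavily simplified 850 (purchase order) transaction set.
SAMPLE_PO = (
    "ST*850*0001~"
    "BEG*00*SA*PO12345**20220101~"
    "PO1*1*10*EA*9.99**VN*SKU123~"
    "SE*4*0001~"
)


def purchase_order_summary(segments):
    """Pull the PO number and line items out of parsed segments."""
    summary = {"po_number": None, "lines": []}
    for seg in segments:
        if seg[0] == "BEG":
            summary["po_number"] = seg[3]  # third element after the segment ID
        elif seg[0] == "PO1":
            summary["lines"].append(
                {
                    "quantity": int(seg[2]),
                    "unit": seg[3],
                    "unit_price": float(seg[4]),
                    "sku": seg[7],
                }
            )
    return summary


if __name__ == "__main__":
    print(purchase_order_summary(parse_x12(SAMPLE_PO)))
```

Translating between a format like this and CSV, XML, or JSON is the "document translation" work the rest of the talk refers to.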
On this slide, you'll see that there's an overview for a purchase order workflow. From inventory to purchase orders to carrier tracking updates, there are multiple transactions all underpinned by EDI messaging. We've got plenty of workflows like this throughout Wayfair, and more than 50% of all of our order volume relies upon EDI messaging, so having a stable platform is pretty important.
For over 15 years, our EDI messaging was driven by a single commercial off-the-shelf, or COTS, as you'll hear us mention multiple times throughout the presentation, monolith. The monolith was responsible for handling all of the carrier and supplier EDI and non-EDI file exchanges for both inbound and outbound traffic through SFTP and AS2 connections.
It was managed by a really small but highly experienced operations team that over time became subject matter experts in the different supplier and carrier domains. In the team view of this slide, you'll see that by the end of 2021 we had a product manager, which is me, an engineering manager, and three operations engineers with between six and 20 years of experience.
We also added three Equal Experts developers to the team to guide our operations engineers over the course of our new platform build.
The COTS monolith managed EDI messaging between the supplier domains, like large appliances, our CastleGate fulfillment program, and drop ship, and the carrier domains, like domestic and international shipments, and drayage and ocean, who ship containers across the world for us.
But despite the team's best efforts, we still faced some significant issues. We couldn't improve our COTS reliability and performance, even after a lift and shift into GCP, which we thought would help.
Yearly audits were getting harder to manage, as there was a real manual process to tracking all of the changes in the monolith.
We couldn't continue to staff the team with more and more subject matter experts. As Wayfair grows and EDI is needed in more domain spaces, it's harder to get more people in those spaces that just know it like the back of their hand or will pick it up really quickly.
And we couldn't keep the carrier domain teams happy, because they wanted us to keep making more and more updates to their EDI messaging formats, and we already had limited capacity because we were babysitting the monolith.
So we put off the decision in previous years, but in 2021 we felt pretty confident that we could build a next-generation EDI platform ourselves. Wayfair has partnered with Equal Experts for a while now, and we brought some Equal Experts folks into the EDI team, including Tommy at one point, Sarah and Jonas, who's hiding over there.
Tommy Hinrichs
Our plan was to build a cloud-native platform and foster a microservices ecosystem. This meant our operations engineers had to transition from supporting a database-centric, desktop-based COTS monolith, architected 20 years ago with no concept of source control, to building and running their own microservices. And it's quite a challenge.
We had the three EE developers, Rajneesh, Jonas, and I, paired up with three Wayfair operations engineers, Randy, Hector, and Devin, for a long term.
Michaela Madden
So you'll notice in this view that the slide of the team has changed a bit, and I can go back just once just to show. And that's because in 2022 our team changed even more.
Two of our operations engineers are now Wayfair developers in training. One decided that he would prefer to actually stay an operations engineer, and that was okay. A new Wayfair developer that was already in the company heard about what we were building and asked if he could join the team to help. And now we're down to two EE developers.
So in the past year, we built our EDI microservices ecosystem, and we'll call it the supplier data translation, or SDT, for short. We're starting to cut over to the SDT platform, and we're in a strong position to take things forward over the next couple of quarters.
Our operations engineers are transitioning into developer roles, and they're building and running EDI microservices for themselves. They're now able to take on tasks like debugging, implementing fixes, and writing tests with limited or no help from EE developers.
Another change you'll notice from the previous slide is some messaging ownership has moved in the carrier domain from 0% to 50%. That's because we're empowering our carrier domain teams across Wayfair to create, maintain, and troubleshoot their own EDI messaging by transferring over ownership of EDI document translation for the document types that are leveraged exclusively by the carrier domain.
Now those teams don't have to wait for a central EDI team to prioritize their work against a limited central capacity, fighting prioritization of everyone else in the org. And the EDI team can focus its expertise on making the best possible version of the SDT platform for the supplier domain specifically.
Looking to the future team structure, we'll have a team with one engineering manager, one product manager, one operations engineer, and three Wayfair developers. The EE developers will move on to different challenges throughout Wayfair.
Our long-term goal is to move all of the carrier domain messaging off of the EDI platform and out of the team's scope, so that the team is no longer responsible for being subject matter experts across multiple different areas of Wayfair. And over the next couple of years, we're going to scale the SDT platform up and out.
We'll incrementally migrate all of the live traffic to the EDI microservices until all of the traffic is on the SDT platform. And once that's completed, we can decommission the COTS monolith. In addition to the reliability and performance outcomes that we're hoping for, we estimate a cost savings of around $400,000 in monolith run costs and licensing fees.
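The talk does not say how traffic is split during the cutover. Purely as an illustration, one common way to migrate incrementally is to route a configurable percentage of partners per document type to the new platform, hashing the partner ID so that a given partner always lands on the same system. Everything in this sketch is an assumption: the names `bucket_for` and `route`, the `"sdt"` and `"cots"` labels, and the per-document-type rollout table.

```python
# Hypothetical sketch of percentage-based cutover routing.
# Nothing here reflects Wayfair's actual implementation.
import hashlib


def bucket_for(partner_id, buckets=100):
    """Map a partner ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(partner_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def route(partner_id, document_type, rollout_percent):
    """Return "sdt" if this partner falls inside the rollout, else "cots".

    rollout_percent maps a document type to the percentage (0-100) of
    partners whose traffic should go to the new platform. Document types
    that are not listed stay on the monolith.
    """
    percent = rollout_percent.get(document_type, 0)
    return "sdt" if bucket_for(partner_id) < percent else "cots"


if __name__ == "__main__":
    rollout = {"850": 25, "856": 0}  # hypothetical rollout table
    for partner in ("supplier-a", "supplier-b", "supplier-c"):
        print(partner, route(partner, "850", rollout))
```

Because the bucket depends only on the partner ID, raising a percentage only ever moves partners from the monolith to the new platform; once every document type reaches 100, nothing routes to the monolith and it can be decommissioned.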
Elsewhere, we've already implemented some compliance as code, and we'll double down on that. Our team has to satisfy the SOX compliance framework because of the financial impact of the EDI documents running through Wayfair systems. We're moving toward automated auditing of all messaging and document translation code, which will hopefully save us a lot of time during audits.
We're also hoping to foster a community of operations engineers across Wayfair who might be thinking of a transition from an operations-based role into a software engineering role. Wayfair needs more developers, as does everyone, and the developer market is really tough right now. And the in-house operations engineers, they already have all the domain expertise anyway.
So with the work and training that the team members are doing now, they'll have more skills liquidity, and there'll be more opportunities for our staff members in the long term.
So how have we come so far in only a year? There are four practices that I'm sure you've seen on many different presentations this week that have worked well for us too: empathy for everyone, staying on track by clearly outlining responsibilities, minimizing cognitive load, and learning by doing.
Something that's been stated in almost every presentation that I've seen this week has been to have empathy for your team members and your coworkers, and we agree. We think that empathy is super important. The EDI team is going through a huge change right now, and a COTS monolith that people have spent blood, sweat, and many tears on, myself included, is being replaced.
So our operations engineers, who have done their jobs for years, are learning that the way that they contribute to Wayfair is fundamentally changing. So we want to be sure that we listen to each other's perspectives, understand different points of view, and work with each individual's strengths, aptitudes, and interests.
We've also set out ways that have emphasized shared learning to help operations engineers to understand development skills and to help developers understand operational skills.
We want to make it as easy as possible to understand and operate our microservices ecosystem. We're very aware it's a whole new ballgame for our operations engineers. When we design our microservices, we think about observability up front and how we can make message flow between our microservices easy to understand at a glance.
We also think about operations up front, thanks to the influence of the operations engineers on the team. We try to ensure our microservices are easy to reliably operate. For example, we've created a custom API for observing document flow in real time. We'll be adding a UI on top of that soon, and it utilizes Wayfair's standard tooling for observability, and we're creating our own custom monitoring on top of it.
We of course had to balance our existing responsibilities with the new system build-out. We had work that we had to do in the COTS legacy monolith because we were committed to completing that by the end of the year. And because of this, we created one combined roadmap for the monolith and new microservices work, and team members would alternate between working on the two systems when they started new tasks.
In the roadmap, we planned for less capacity for monolith changes, and stakeholder management was tricky, and it continues to be, as they've had to live with fewer EDI changes for about a year while we've built out the microservices from scratch.
In the long run, we want to hand over the responsibility for EDI carrier domain messaging to the teams that own the carrier domains. While we started that effort, we had to set clear expectations with our customers that any net-new messaging work was going to be given a lower priority than our planned infrastructure work, and our system build-out, and the transfer of EDI messaging domain ownership to the domain teams that should own them.
Over time this has actually helped our consumers prioritize their EDI work, because less central capacity was being made available to them. They started seeing the EDI work as just part of their work, not part of someone else's work.
Of course, there were deployment problems, live traffic, and stability issues with the COTS monolith, and this caused a delay in our planned work and sometimes required all-hands-on-deck recovery processes. So we had to be sure that there was a good bit of time allocated to keeping the current system up and stable, because half of all orders go through it.
Keeping the team's cognitive load front of mind is also a natural add-on from empathy. If we've got too much work crammed into our heads, then we really can't focus on cross-skilling as well as delivery. We're big on preventing overload, or at least trying to, through work-in-progress limits.
One way that we do that is to assign each team member as a project lead rather than spreading their time across multiple different projects. So they'll run, or they'll design, build, and then run, the deliverables without having to balance multiple projects at once, in an effort to prevent overloading them.
We also have a number of ways to minimize context switching. By limiting the work that needs to be built out in both systems simultaneously, we're minimizing the back and forth for each team member so they aren't overloaded and they don't have to worry about context switching too frequently.
One way to keep cognitive load low is to leverage previous experience. So we've deliberately matched up the current domain knowledge of our operations engineers to the same domains in the new system, in our new microservices ecosystem.
They don't get to work exclusively in those domains. There's not enough of them. You saw there's only three Wayfair employees. It gets to be the bulk of their work though, and it flattens the learning curve and makes the migration efforts much easier.
Plus we haven't had to build all of the SDT platform ourselves. We've purchased a GUI-based EDI transformation definition application by Altova called MapForce. It generates Java code, and that gets merged into our JARs and standard dev processes.
And we have a mass message validator and licenses for the EDI X12 standard. So these things were super exciting for us, and they saved us a lot of time and coding effort.
But the most effective skill that we've come across in cross-skilling is to learn on the job. Each of our Wayfair operations engineers has worked on the microservices build-out alongside developers from Equal Experts. Our operations engineers share their domain experience so that the developers can learn more about the different business domains that are underpinned by EDI messaging, and our developers share their technology experiences so that our operations engineers can avoid some of the gotchas and come up to speed a lot faster.
At first this happened in dedicated training sessions, but later on it naturally became part of every story kickoff. We do a lot of long-lived pair programming. An operations engineer and a software engineer will work together for months at a time. They design, code, and implement features together, with the software engineer explaining the reasons behind the solutions.
It's really an effective way for developers to digest the business domain and for operations engineers to learn more about implementation, architecture, and technology trade-offs. And all of our collaboration is feedback-first. We go as fast or as slow as our operations engineers need.
First we shadow, with an operations engineer watching a developer implement a task with a running commentary of why they're doing what they're doing. Then we build up to driver-navigator pairing, where the operations engineer is doing the work and the developer steering the direction. Then we move into some ping-pong pairing, with both people taking turns running the pair. And the whole idea is to shepherd people through a journey at a comfortable pace for a medium to long term.
But we still need your help. We don't have all the answers, though hopefully, until I explicitly stated that, you were convinced otherwise. We'd like your help in tackling some questions that we've got, so we'd like to have you come and chat with us afterwards.
As a product manager, I'd definitely love some feedback on how other people have balanced stakeholder needs and wants in these types of situations where you're balancing two different systems, especially when they realize that your new platform is going to be great, but it probably won't be the answer to all of their problems.
We'd also like to hear how other people are balancing engineer time between a legacy monolith and a new system build-out. Our team members were sometimes limited in the time they could dedicate to the new system build-out because of high-priority, previously committed projects in the legacy monolith. And consequently, some team members felt like they were being left behind on the new build-out because they had to focus on the old system.
And can anyone here help me out with that? If so, come find me afterwards. Send me an email. It's on the last slide in the deck, or reach out on Slack.
We've also got some takeaways for you. Empathy is the foundation for transitioning operations engineers into developer roles and building an easy-to-operate system.
A COTS monolith may only take you so far if your business keeps growing and evolving. Eventually, you can reach its change and reliability limits, and it can be frustrating. You can build around it, but it still may only take you so far.
And finally, your operations team may contain the functional and technical subject matter experts that you've been looking for. So look to them for guidance first.
And thank you for coming. Please be sure to come find us afterwards. Thank you.
Tommy Hinrichs
I'm always happy to talk. Come and find me afterwards. Reach out on LinkedIn or Slack. Thank you so much.
Michaela Madden
Thank you.