DevOps At Target Year 3: Journey to Microservices and Cloud Native Architecture
DevOps At Target: Year 3
Heather Mickman
This is my third DevOps Enterprise Summit, and I'm honored, actually, to be the first speaker in a really fantastic lineup over the next few days. I learn so much at this conference every year. Probably one of my favorite conferences all year that I go to, because I always meet really interesting people and get to hear a lot about what other large enterprise IT organizations are doing.
So, right. I'm Heather Mickman, senior director for platform engineering at Target. And that means I get to work on a lot of really interesting things, like building the platforms and capabilities used by our software engineers to build, deploy, and manage their apps for Target. I love what I do, and I'm really excited to talk about the innovative work that we're doing at Target and our journey to microservices in the cloud.
But first, just a little bit about Target. So I'm going to assume that everyone here is familiar with Target. Maybe a raise of hands if you've never heard of them. Okay. Oh, one person in the front row. Oh, Ross, my co-presenter for the last two years. Right.
So Target has almost 1,800 stores across the U.S. We have a huge and growing online presence with target.com and our mobile apps, and we've been around for 54 years. So not only do we have massive scale that we provide technology solutions for within our stores and online, but also our supply chain. Target's the second largest importer in the U.S., with more than 38 distribution and fulfillment centers across the country. We also manage three data centers and leverage two leading public cloud providers as well.
So that's a lot, right? It takes a lot of technology to run a modern retailing company, and it's one of the reasons why DevOps is so important to us.
All right. So today what I'm going to talk about is kind of our journey of DevOps and our journey to microservices in the cloud. So I'll talk a bit about why DevOps has been and continues to be important for Target. I'll talk about the timeline over the past four years, so we're going to go back and kind of time machine to 2012. I'll talk about who has played roles along the way and continues to be important in our journey, what we've accomplished, and how we've accomplished it.
And I'm going to do all of that in under 30 minutes, so wish me luck.
So why is DevOps important for Target? Quite simply, it's to keep up with the pace of change and innovation required to be a leading retailer.
So today I'm not going to talk about the complexities that we faced from a technology or organizational perspective. My guess is it's very similar across the crowd and what we see at most enterprise IT shops, and that was we had a lot of organizational silos, a lot of technology debt, and processes that slowed down innovation.
So a really good example of why DevOps is important for Target is as we look forward to our 2016 holiday season. Obviously very important for retailers coming up here in a few weeks as we kick off our holiday season.
As an example, we will bring on 70,000 new team members to our stores to ensure that we delight our guests through more than 170 million store transactions. So imagine the kind of systems, from both a stability perspective as well as an ease-of-use perspective, that 70,000 new team members will need as all of us are out doing our holiday shopping.
We'll do more than three million order pickups within our stores as well. So I'll say just personally, I love being able to manage the chaos of being at my house with my two little kids running around and order my stuff while I'm sitting at the kitchen counter instead of doing it while I'm at Target. And so I can just run in, grab my stuff, and head out.
And we'll also fulfill more than seven million orders from our stores. And so we're doing more and more fulfillment of online orders at each of our stores to ensure that we're able to get orders to our guests more quickly.
So to keep up with this pace of change and scale, it's critical that we have the technology and engineering muscle to make that happen.
So this is the timeline I'm going to talk through today. The story of DevOps at Target probably has its roots in 2011, but we really saw change agents starting to make progress in 2012. And as those change agents started to demonstrate success, we started having kind of more of a grassroots, growing community across our IT organization. And as successes became more clear and DevOps kind of became less mysterious, our senior leadership really started to support and help drive across our large enterprise IT shop.
Probably worth noting when I say large enterprise IT shop, Target's IT organization measures in the thousands. So making any kind of change at that kind of scale is going to take a lot of time. So patience and persistence in driving that kind of change across a large IT organization is critical.
And then in 2015, we had our new CIO join, Mike McNamara, and that's when we really started to build out at scale. He came in and set the expectation that Target would be an agile DevOps shop and accelerated all of the great work that had happened in the prior years.
Now, that positioned Target to have the engineering talent and organization in place to execute on new architecture patterns that we were kind of declaring in 2016. And that was microservices as our design and deployment paradigm, and building everything cloud native.
So first, in the time machine, going back to 2012. This was a small crew that I was working with to build the first APIs at Target to expose a lot of our core retail data for our internal development community. So exposing data for products, store locations. As an example, prior to building those initial APIs, if I was building an app and I just needed to get store locations within my application, it would probably take a good six months, maybe four months, but four to six months of development work just to pull store locations data. That's because it was housed in three different systems of record and also in a spreadsheet.
So it was a lot of work to do that. And now it's easy. You can just call the store locations API, and I don't have to spend six months of development just to get that data.
So a couple interesting things to note. We were a small engineering team that were actually within our enterprise architecture organization. And we started the team because we wanted to demonstrate by doing and actually building APIs so that we could set the standards and best practices, and also build the infrastructure and platforms that we needed to be successful.
Secondly, as we were building these APIs, we were very focused as well on how we were doing our development. And we focused on all of the DevOps-y things like CI/CD, infrastructure as code, agile, having an empowered engineering team, automating and measuring everything that we could. So we had that team that was focused on DevOps and practices that we've come to associate with it, and proving the standards to lay the groundwork for a services-first architecture that I would argue is a requirement to be a successful DevOps shop.
One other thing I want to call out on this slide is that it also demonstrates my amazing ability to take selfies and the courage that I had to volunteer to go into a dunk tank a couple of years ago. I still don't know how I got talked into that, but I did. It was amazing.
Okay, so in 2012, some of the key messages and themes that we spent a lot of time talking about across the organization: first was continuous integration.
So CI, this is a really big shift. To build CI into an engineering team, even small engineering teams, it really takes a lot of time to get it right. And I found it to be very important to stick to the key message of, "We're not deploying to production until we're green." And that was said day after day, week after week.
In fact, at one point, I always remember we'd have these weekly meetings, get the leaders of the team together, and the QA lead came in. Yes, I did have a separate QA team at the time. No more. We can talk about that later if you're interested.
But the QA lead would come in every week with a spreadsheet that had been created manually and be like, "All right, we're ready to deploy to production. Look, all of our tests passed. Here, I'm showing we're ready to go."
I would say, "Well, but is Jenkins green?"
"Well, no, because we ran into..." kind of insert the kind of problems you run into when trying to do CI for the first time. It's hard. It's hard to do. But after a couple of months, we were then doing CI and deploying a lot more frequently.
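The "not until we're green" rule can be pictured as a simple gate in front of the deploy step. The following is a minimal sketch under stated assumptions, not the pipeline described in the talk: `BuildResult`, `can_deploy`, and `blocking_jobs` are hypothetical names, and the results are plain in-memory records rather than anything read from Jenkins.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BuildResult:
    """Outcome of one CI job (hypothetical shape, not a Jenkins API object)."""
    job: str
    passed: bool


def can_deploy(results: List[BuildResult]) -> bool:
    """Allow a production deploy only if at least one job ran and every job passed."""
    return bool(results) and all(r.passed for r in results)


def blocking_jobs(results: List[BuildResult]) -> List[str]:
    """Names of the jobs that are keeping the build red."""
    return [r.job for r in results if not r.passed]


if __name__ == "__main__":
    latest = [BuildResult("unit", True), BuildResult("integration", False)]
    if can_deploy(latest):
        print("green: deploy may proceed")
    else:
        print("red: blocked by", ", ".join(blocking_jobs(latest)))
```

The point of the sketch is that the decision comes from the automated result alone; a manually maintained spreadsheet of passed tests has no input into `can_deploy`.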
So secondly, and because we are deploying more frequently, we have to start thinking about infrastructure as code, because you don't want to manually build and maintain snowflakes. It's a friction point for us as we're doing deployment if we have to think about our infrastructure. So we started using configuration management tools to manage the state of our infrastructure. That was a really big shift for us as well.
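The idea behind configuration management, declaring the desired state and letting a tool converge each machine to it, can be sketched in a few lines. This is an illustration only, assuming a toy model in which a server is a dictionary of package names to versions; `plan_changes` and `converge` are hypothetical helpers and do not correspond to any specific tool named in the talk.

```python
from typing import Dict, List


def plan_changes(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, List[str]]:
    """Compute what must change to move `actual` to `desired`."""
    return {
        "create": sorted(k for k in desired if k not in actual),
        "update": sorted(k for k in desired if k in actual and actual[k] != desired[k]),
        "delete": sorted(k for k in actual if k not in desired),
    }


def converge(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, str]:
    """Return a new state with the plan applied; the input state is not mutated."""
    plan = plan_changes(desired, actual)
    new_state = dict(actual)
    for name in plan["delete"]:
        del new_state[name]
    for name in plan["create"] + plan["update"]:
        new_state[name] = desired[name]
    return new_state


if __name__ == "__main__":
    desired = {"nginx": "1.10", "java": "8"}       # declared in code (made-up values)
    snowflake = {"nginx": "1.8", "telnet": "0.17"}  # hand-built server (made-up values)
    print(plan_changes(desired, snowflake))
    print(converge(desired, snowflake))
```

Running `converge` a second time on its own output produces an empty plan, which is the property that makes hand-maintained snowflakes unnecessary: the declared state, not the history of manual changes, determines what a server looks like.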
Third, social coding, also very, very important. Making work visible and transparent. Of course, doing pull requests for all changes. And this has led to significantly higher quality in our code.
And then lastly, I don't think I could do a DevOps talk without underscoring the importance of culture. So for me, it was all about building an empowered learning environment for my engineering team, because culture is so important to everything.
So this is the culture I strive to create. It's fun and exciting to try new things, and it's important to never get comfortable and rest on your laurels. Try and fail and make it great.
One thing that I do on my team that I like to do is 10% maker time. In fact, I actually block four hours on Fridays from 10:00 to 2:00. Not that I think that everybody should be doing it, like, "Oh, okay, it's 10:00 to 2:00, I'm going to go do maker time," but more to just underscore the importance of taking time over the course of a week to think about something besides your day-to-day work, try a new technology, and just see what you can do to come up with a better way of doing our work.
Because really, at the end of the day, I don't show up to be average. I show up to be awesome, and I want to work with folks that want to be awesome, too.
All right, so back to the timeline. What we were doing in 2012 was starting to get noticed. In 2013, we had continued successes. The number of APIs continued to increase. The scale at which those APIs are being used continues to increase, and we're getting really good at deployments at this point. We also have a growing grassroots community and a lot more energy and excitement amongst the engineers within our IT organization.
At this point, we're probably doing, again, this is on the API team, we're probably doing about 10 deployments per week and doing those faster and faster. And so at that point, we really want APIs to our infrastructure. So the configuration management tools are great, but we want APIs to our infrastructure. So that's when we stood up our initial OpenStack environment.
And the important point to call out there is that we're removing friction points as we find them. Basically, anything that we find that stands between us deploying locally and deploying into production needs to be removed. And every year for the past four years, we've made some kind of significant progress in making it easier and faster to get to production more consistently, securely, and repeatably.
So in 2014, this is when our senior leadership team really starts to see the successes and wants to increase adoption of DevOps across the organization. So some things that we do to continue to build across our org is we start hosting internal DevOps events.
Some of you have probably heard me tell the story before of that first DevOps event that we had, where we had a small crew of organizers, myself, Ross Clanton, Brent Nelson, working to create this first internal kind of DevOpsDays. And we didn't know if we were going to be the only people that showed up, or if anybody else was going to show up. And we ended up actually having 100 people at that first event, and it was fantastic and kind of set the stage for a number of future internal events that helped the rest of the organization kind of share and learn what we were doing.
One other thing that we did in 2014, in addition to those events that were kind of more focused on our engineering community, was a summit for our leadership team. And what we basically did was bring this conference, DevOps Enterprise Summit, into Target and had kind of our middle managers and above attend a daylong event where we had some fantastic folks, many of whom are on the list of speakers today, come in and tell their story to our leadership team to demonstrate what other enterprise IT organizations were doing. We were even lucky enough to have Gene come in and be our keynote.
And I remember one of my peers coming up to me afterwards and just saying, "Thank you. Thank you so much for that day, because prior to actually sitting down and thinking about a new way of doing our work, I didn't think it was possible, but now I see that it is possible."
Also in 2014, we started to talk a lot more about what we were doing within Target externally. Prior to that, we didn't talk much about what our IT organization was up to. So we presented at just kind of local meetups, the DevOpsDays Minneapolis. I think 2014 was the first time that we had one of those in Minneapolis. We hosted a couple hackathons as well.
And I'll say these external events have been really, really important, and kind of selfishly for me, it makes my job easier when Target is known as a great place to do interesting technology work, because then I can recruit and attract engineers easier.
All right. So here's some metrics. I actually pulled these from a talk I did in 2014. We had 30 APIs. At that point we were doing about 80 deployments per week. Monthly volumes through those APIs were about 500 million, and we had less than 10 incidents per month.
Now in 2014, I thought those were amazing. I was super proud of that scale that we were working at. But what's most important about those numbers was the number of deployments-to-incident correlation, and that was a really important story to tell across our leadership team to dispel the myth of, as I do more deployments, I'm actually going to get more and more incidents, right? Because that's what we've all seen in the kind of decades that we've been working in IT, right?
Every time you do a deployment once a quarter, kind of the whole world is on fire and you're trying to figure out how to get things back to normal. But what we're seeing is, because our changes that were going in were a lot smaller and we're doing them more frequently, if we did run into an issue, really easy, we know exactly what the change was, so hopefully we can fix it and move forward, and/or we can roll it back if we need to. But hopefully we roll forward.
And also, I think it's just as important, the frequency at which you're doing deployments is important because now it's just part of my daily job, right? It's not something that I dust off a manual once a quarter and go through these 100 steps. It's something that's just part of my daily job, and so kind of the muscle memory around deployments, and so we're just better at them.
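The deployments-to-incidents comparison behind those 2014 figures can be worked through as simple arithmetic. This is a back-of-the-envelope sketch, assuming an average month of 52/12 weeks and taking "less than 10 incidents" as an upper bound of 10; it also assumes, pessimistically, that every incident was caused by a deployment, which the talk does not claim.

```python
DEPLOYS_PER_WEEK = 80                  # figure quoted in the talk
INCIDENTS_PER_MONTH_UPPER_BOUND = 10   # the talk says "less than 10"
WEEKS_PER_MONTH = 52 / 12              # assumption: average month length

deploys_per_month = DEPLOYS_PER_WEEK * WEEKS_PER_MONTH
max_incident_ratio = INCIDENTS_PER_MONTH_UPPER_BOUND / deploys_per_month

print(f"about {deploys_per_month:.0f} deployments per month")
print(f"at most {max_incident_ratio:.1%} of deployments could be tied to an incident")
```

Even under that worst-case assumption, fewer than 3 in 100 deployments would be associated with an incident, which is the shape of the argument used against the "more deployments means more incidents" myth.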
I'm also going to call out here the scale. I don't do this anywhere else, but I do want to brag a little bit about what these numbers look like today. We have more than 100 APIs now, and our API gateway served 42 terabytes of traffic in October, comprised of about 27 billion requests.
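To put those gateway numbers in perspective, the averages they imply can be computed directly. This is a rough sketch, assuming decimal terabytes (10^12 bytes) and traffic spread evenly across the 31 days of October; real traffic is bursty, so peaks would be well above the average.

```python
TRAFFIC_BYTES = 42 * 10**12             # 42 TB, assuming decimal terabytes
REQUESTS = 27 * 10**9                   # about 27 billion requests
SECONDS_IN_OCTOBER = 31 * 24 * 60 * 60  # 2,678,400 seconds

avg_bytes_per_request = TRAFFIC_BYTES / REQUESTS
avg_requests_per_second = REQUESTS / SECONDS_IN_OCTOBER

print(f"{avg_bytes_per_request:.0f} bytes per request on average")
print(f"{avg_requests_per_second:.0f} requests per second on average")
```

That works out to roughly 1.6 KB per request and about 10,000 requests per second sustained over the month.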
So by the end of 2014, we have a really good foundation. Teams are starting to change how they're working across our organization, but I'd still probably characterize it as having pockets of DevOps within our org.
And that's when we move into 2015 and our new CIO, Mike McNamara, joins. I like this slide because I'm hoping I'm going to get some kind of bonus points with Mike for having one slide dedicated to him.
But it was really, really huge for our IT organization when he joined in 2015. He took a lot of that great work that had been happening and amplified it across our organization, essentially setting the expectation that we will be an agile DevOps shop.
One specific change that he implemented within our organization was to create product teams that were aligned with business capabilities. And this was fantastic, because now I have a team that actually owns a product from start to finish and is owning it in production as well.
The other thing he declared was that all teams would be agile by the end of the year. And I think some of us in the organization were like, "Ooh, are we really going to declare agile?" But it ended up being brilliant, because even for some teams that maybe were more comfortable working in a waterfall fashion, where that's how their business teams liked to work as well, it still set an incentive for them to change and think differently about the way they were doing work. So regardless of whether a team actually moved fully into kind of operating in an agile DevOps mode, we were making incremental change.
The third thing that we focused on was hiring engineers and decreasing the number of contractors that we used in our IT organization. So in 2015, our contractor-to-team-member ratio was probably something like 70% to 30%. Lots and lots of contractors. And over the course of the last year, we've actually flipped that number. So we've got this amazing recruiting team at Target and have hired tremendous talent over the last year.
Lastly, what I would call out that was a big change for us in 2015 was being very clear about what our priorities were so that teams understood what they should be working on, and probably more importantly, what they should not be working on. And this is really, really critical, I think, for teams to have a clear focus on what they're working on, because you can't get everything done.
And one example that I'd like to use here, as I think back about 2015, is as Mike joined, he saw what our error rates were for our point-of-sale registers and said, "You know what? I think we can do better than that." So he reached out to his peer on the store side and said, "Hey, we're going to put some of these new capabilities, this new development work that we're doing on hold, and let's give the team the space to pay down some technology debt and get those error rates to lower than what they've been in the past."
And it worked. It's fantastic. And those error rates have remained really, really low, and that's good for all of us, right? Target guests, we want lower error rates at our point-of-sale registers. And so it was the first time for me that I saw a CIO saying, "Let's stop doing some work so that we can actually focus on the stability of our systems."
Important, right?
So in addition to the organizational changes in 2015, we also had some important architectural and engineering changes for our dotcom site. We were beginning to rebuild our site in 2015. So our digital product team started to embrace microservices, and we started to break apart a monolithic application.
Now, an example here: we worked really closely with the inventory team as they were building 22 services that are handling massive scale for all of our inventory calls. And as an example, just one of those services runs at 87,000 TPS, right? So lots and lots of calls.
And secondly, in addition to those architectural changes, we made the decision to move from a private telco for hosting our dotcom and guest-facing applications to a public cloud provider.
So that's important, right? So now Target, we're going to be responsible for managing our infrastructure. We don't have it outsourced to someone else. So at that point, that represented a shift in what Target needed from my team.
What Target needed now, instead of a team that was centrally managing and building all of our APIs, was for each of our product teams to own their APIs, and a team focused on the platforms to provide the developers what they needed to build, deploy, and manage those applications. So that's where we shifted to being a platform engineering team, providing all of our developers what they needed.
So in 2016, here's what we've been up to. So as the platform engineering team, we want to give all of our developers what they really want, right? And that's an easy way to ship. So we focus on removing friction points and making everything self-service, because I don't want me or my team to be in the way of shipping.
Now, our initial platform and approach abstracted the cloud platform and the work that developers needed to do, but it took about 30 days or so for us to onboard new application teams to our platform. So more on that in just a second.
What we've really been focused on in 2016 is a move to immutable infrastructure, because that was the next maturity point for us, and that's what we've been really busy doing in 2016. So we shifted our builds to RPMs that get baked and deployed as images, and now we're on our way to containers. In fact, we have our first production container workloads that have been running for the last month or so with our first beta teams that we're working with. Some of them are starting to run at scale as well, where we see volumes of up to 5,000 TPS.
So in moving from our initial platform to the current one, we went from an onboarding time of 30 days for an app to get up and running and deploying to production down to five days. And we have a goal to get that down to two hours as we continue to mature our platform and as more and more teams across our enterprise start to use that platform for deployment.
So again, we focus on removing friction points and making everything self-service. So I'll get to some numbers on what that means soon. But first, maybe just an overview.
Some interesting things to call out then about our platform. We use it to manage across our private and public clouds, and there's a number of components to that platform, but the newest is Spinnaker. So Spinnaker works with public cloud providers out of the box, but it didn't work with OpenStack, and OpenStack is our internal private cloud. So we built that driver with other companies like Netflix and Google, and now everybody can use it.
Open source is important to Target, right? We want to work with the community and contribute back. And now I can... Yeah. Anyways, that's all I'll say about that, I guess.
So in 2016, we have all of our non-commerce apps for our dotcom site deployed to this public cloud provider, and the teams love it. It's super fast, and they're doing lots and lots of deploys. We're not doing lots and lots of deploys right now. They do kind of slow down this time of year for us. But leading up to it, we were doing lots.
We also have more. So I mentioned that we're starting to run containers in production as well. So we have about 10 or so product teams that are deploying to Kubernetes, and we do have that first production workload that's been running for about a month or so.
One thing that we're getting better at is self-service logging and telemetry. So we have this running at scale, but we're still kind of learning and getting better at that. You can see up there, we use the ELK Stack and then Graphite and Grafana. And when I say running at scale, we ingest probably six terabytes of logs a day. And so we're still kind of working through what's the best way for us to implement those different clusters so that we don't run into lags as our teams are using those as they're monitoring their apps.
Okay, so now some metrics. So I mentioned that we started with a beta version of our platform in 2015, and that was a homegrown platform that could only manage about two to three deployments a day. Time from commit to production was about a day, or could be about a day or so, because all of those commits would kind of queue up until they got into production. And it wasn't terrible when we first started doing that because there was no real impact to the bottom line, and we were just learning. And no one would really notice if it keeled over because we just had a couple app teams, smallish app teams, that were using it.
But it wasn't a great experience for our developers and certainly wouldn't be acceptable for us as more and more workloads were starting to use that platform. So we had to make a change.
So in 2016, we shifted to the platform that I just described a little bit, and now we're hosting 42 applications across 25 different teams. Time to production is less than five minutes. I also mentioned that we're working to get our kind of onboarding time, so the first time a team's using that platform to deploy, getting that onboarding time down to two hours over the next month or so. We're working on that.
Let's see. We'll also put that initial platform to rest in January. It's important to turn things off as you turn new things on. And we're really excited in 2017 for teams across our entire enterprise IT org to start using this platform. Right now, it's basically limited to our dotcom and guest-facing applications.
So what's next for us? So in 2016, we've learned from our successes, we've learned from our failures as well. It's important to us to have the flexibility of building our own platforms that allow us to scale and focus on things that are specific for Target.
Also, the importance of listening to your customers. Always listening to our customers. I don't want to build something that no one wants to use or where they have to be mandated to use it. And at the end of the day, we're really just focused on Target guests and what's best for our Target guests, and that's providing our devs the ability to innovate and provide new experiences quickly.
Also, I mentioned our architectural direction to microservices in the cloud, in both our public and private cloud. We have lots and lots of work to do here, but we've set the direction, are moving, and making a lot of progress.
I also want to call out Target's Dojo. So hopefully many of you have heard about it, and if not, kind of find me or anyone else from Target, and we can talk a little bit more about it. But it's basically our immersive learning environment where teams come to learn and set aside time to improve their product.
So you can imagine, as we're shifting from enterprise IT org to an agile DevOps shop, it takes time. And the Dojo provides a place where coaches can sit down with these different teams and kind of show them the different ways of doing work. We have three different dedicated spaces for this in each of our technology centers. So we have one in Minneapolis, one in a northern suburb of Minneapolis, and then also in Bangalore as well.
So I've learned a lot in the last few years. I love the work that I do and the people that I get to do it with every day. So I thought I'd end with what I try to do every day, regardless of what it is that I'm working on.
First, to be curious and always look for ways to improve. Empower change, set an example, and do it. As a leader, our actions are so much more important than our words, so show your teams and lead the way with how we do change in our organization.
Culture. An empowered, learning engineering team will do incredible things. Set a culture and set a tone for your team to enable them to do what they're best at.
Stay bold and stay ahead of the curve. Always look for new ways to do things and improve, and always remember to own your failures as much as your successes. I don't know about you, I learn a lot more from my failures than my successes, and it's important to reflect back on those.
So that's what I have to say. You can follow our journey on Twitter. I try and do my best to tweet, but I'll be honest, I'm not the most amazing tweeter in the world, but you can still follow me if you want.
But pull me aside, ask me questions if you're interested in talking anything more and learning more about what we're up to at Target. Thank you very much.