Las Vegas 2024
Fixing Our CI/CD: How Ephemeral Environments Increased Our Cycle Time to Deploy

At PepsiCo, our DevOps and SRE teams were constantly struggling to keep static testing and staging environments up and running. New commits and PRs are by nature buggy and prone to breakage, which jammed up our pipelines. Features were taking too long to get to production, but compromising on testing and reliability was not an option. We couldn't deploy "continuously" with broken pre-production environments. We implemented ephemeral environments and had great success solving all of these problems.

Full transcript

The complete talk, organized by section.

Benjie De Groot (Shipyard) — Opening

Hello, everybody. I'm Benjie, and this is Jason.

Jason Fertel (PepsiCo) — Hello

Hey. How's everybody doing?

Benjie De Groot

I'm with Shipyard. Jason's with PepsiCo's e-Commerce division. We're going to talk about how PepsiCo's e-Commerce division was able to decrease their cycle time inside their CI/CD leveraging ephemeral environments and Shipyard.

So again, I'm Benjie, that's Jason. Why don't you start — how many people here know what ephemeral environments are? Okay, that's pretty good. Does anyone call them anything besides 'ephemeral environments,' just from a hand-raise perspective? Okay, PR environments, that's good.

Jason Fertel — PepsiCo Digital Commerce background

Alright. To give you a little bit of history of what's now called PepsiCo Digital Commerce — but at the time was called PepsiCo E-Commerce — I joined in 2018 and I was the first engineering hire for this group. We had the luxury, frankly, of being able to build everything from scratch.

This was all new ways of operating for PepsiCo as a whole. But we didn't have to deal with some of the legacy that the rest of PepsiCo IT has to deal with. It's effectively a fully integrated organization, which encompasses every part of PepsiCo: all the brands, plus marketing, supply chain, sales — and, as of 2018, the technology organization.

By 2021, we had most of our stack in place. It took a while as we grew exponentially — especially with the sort of explosion of growth starting in 2020. We built up this stack that I think we would comfortably call a modern tech stack. We were able to do what a lot of folks here are doing or striving to do inside their organizations. Granted, it's in this small 200-person organization within PepsiCo.

Benjie De Groot

And PepsiCo is not just Pepsi, for those that don't know.

Jason Fertel

Yeah — Frito-Lay.

Benjie De Groot

For those unfamiliar with PepsiCo — because I've had a lot of questions about this — Baker, Quaker, Captain Crunch.

Jason Fertel

It's a pretty encompassing portfolio. SodaStream (apparently not Coca-Cola), Sabra hummus, with which there have been some issues. By the way, I wasn't allowed to drink any of the…

Benjie De Groot

He's been thirsty.

Jason Fertel — bottlenecks identified

We weren't super formal about this, but we were able to measure some of these things. We were looking — where are our bottlenecks, and what can we do to start to address them so we can be more effective as an organization? A lot of this had to do with our cycle time. And while we were able to deploy on demand, we were still stuck with a lot of bottlenecks and waste that was taking us time to get to production. And on top of that, our defect rate was frankly too high with that move toward production for new features.

Benjie De Groot

So it was a quality issue — probably the biggest thing you were facing.

Jason Fertel

Yes. Effectively, we didn't have the confidence to deploy. Things would wait too long, because you basically had blocking with QA and UAT. The whole staging process was brittle and often broken, and the issues were consistent. We had one part of the CI/CD pretty good, but the other part was causing issues: the guarantee that whatever gets deployed every time somebody merges into master would not take the system down or carry a high rate of unexpected defects.

Benjie De Groot

So it was really about quality control and testing.

Jason Fertel

Yes.

Benjie De Groot — bottlenecks identified, what next?

So, bottlenecks identified — what do we do?

Jason Fertel

Our team has a modern outlook on DevOps, using all the latest techniques: GitOps, etc. There were folks on the team that had built stuff with ephemeral environments. So we did the research.

Benjie De Groot

The actual bottleneck was these mid-stage pre-production environments — and those are long-lived, static, always on, costing money. So you did some research — ephemeralenvironments.io — quick shout-out there.

Disclaimer: ephemeralenvironments.io is an open-source resource that we do sponsor at Shipyard. It talks through what ephemeral environments are. You had some other friends that worked at other companies, and this is kind of a standard modern practice to have on-demand environments, right?

The big thing about ephemeral environments — well, there are six things: collaboration, testing, security, dev-workable, cost control, and on-demand. Obviously there is the ability to build this in-house, and a lot of people do it. Tim, I'm looking at you. I see you. I'm not so happy about Tim, but it's fine. But in all seriousness, there's a lot of stuff there. Ephemeralenvironments.io is a great resource if you're building your own, or if you want to find a vendor — Shipyard.

The big concept here is that as developers are writing code, they should be able to share that feature with essentially a snapshot of production — obviously with the data sanitized, security authentication — but give that to the other stakeholders. That's the key, especially from the testing bottleneck that you guys had. Correct me if I'm wrong, but I believe you'd have these staging environments and UAT environments, and they were very brittle, oftentimes broken, oftentimes occupied. Something that I don't believe PepsiCo dealt with, but some of our other customers did, is: a lot of times it was like, 'we have this really important demo, so don't touch staging for three days.'

Jason Fertel

No, that's always the problem. It's like, we've got an executive that needs to see this one color blue, and… yeah.

Benjie De Groot

So that was a consistent issue you guys were dealing with. So you found ephemeral environments — then what happened?

Jason Fertel — Build vs. buy: PepsiCo built their own first

We built our own. Of course.

It wasn't so much that building our own was that difficult. The problem: once you build something, you have to support it.

Benjie De Groot

But it wasn't as feature-rich.

Jason Fertel

It wasn't as feature-rich. The environments weren't user-friendly; they were really built for developers. You had to connect to our VPN in order to actually access these environments, and that was challenging for non-dev stakeholders.

There was a mix of authorization and data issues. It was purpose-built for specific projects at first; it didn't scale well to other projects and the different needs of different projects. And it was a burden to our DevOps team.

Benjie De Groot

And you have limited resources on the DevOps side, of course.

Jason Fertel

Our job is selling chips, not… you know, not…

Benjie De Groot

…and soda. Wait, I believe you said the rollout was a little flat.

Jason Fertel

Oh, yeah. Sorry, sorry, sorry, sorry, sorry. What was the line? He's been working on this for three days.

Benjie De Groot

It fell a little flat and it needed a bit more fizz. He's been working on that one. Also, I see Smitha in the audience — she gets credit, that was hers. Thank you, Smitha, for workshopping that line with us.

In regards to the build-vs-buy thing — I'm obviously a little biased here, but the big thing is that you can do it. We see it a lot. But over time, the sprawl, and the feature set is just limited. I think the next slide actually talks about some of why it was a bad idea. So tell us, Jason, why was it a bad idea?

Jason Fertel

It was, as I said, purpose-built for one team, lots of bugs. Data seeding, data duplication — instead of actually using your staging database…

Benjie De Groot

So it was a singular — another big thing was, it wasn't truly ephemeral. Because you were using a shared data store.

Jason Fertel

Yeah, and that was a big feature we didn't have. I don't think a product manager was ever able to use this.

Benjie De Groot

It wasn't built for product managers.

Jason Fertel

Which is part of the problem. It was mostly a code review kind of tool at the time.

Benjie De Groot

So the other stakeholders were a little bit left behind. And that makes sense: you weren't building a product.

Jason Fertel

And on top of that, while the pull request was open, the environment was up.

Benjie De Groot

So then magically you found this company — what was the company's name?

Jason Fertel

Uh, I think… what was it?

Benjie De Groot

Shipyard.

Jason Fertel

Ah. Okay.

Benjie De Groot

So then you partnered with Shipyard, and what happened?

Jason Fertel

Effectively, Shipyard provides a lot of tooling that makes it very easy to roll out if you're already containerized. It fits right in.

And on top of that, it has all the things that our security teams like — SSO. And there's authorization that we didn't really have with our own internal system. So this gave us a lot that we didn't have, coupled with things like logging, integration with Datadog, data snapshotting, which to us is one of the core features — the ability to have that state rolled up with every environment that gets set up. And great support.

Benjie De Groot

Slack is a powerful tool. They might not like that they give crazy support. I do not like giving crazy support, especially this morning, when we had a little issue. But of course that's happened. No issues. It was your fault. But anyway.

Benjie De Groot — about Shipyard

I'm Benjie. Thank you, Jason, for talking about your problems. We're going to get back to PepsiCo in a second. Quick background on myself and Shipyard.

I started off as a professional services consultant, helping a bunch of companies. One of the big things, and I think we all know this: there's a unicorn or two in a lot of companies who builds a lot of the infrastructure. It works until it doesn't, until they leave. They get promoted, they're on vacation, and then we get into all kinds of problems. I was on the ground being brought in as a high-priced Kubernetes consultant to fix all these pipelines.

I started building Shipyard as an internal tool for myself. In an early deck, I had a picture of me with my exoskeleton from Aliens on. (JD gets that reference. I don't think anyone else does, but it's okay. I love Ripley.) We built out this tool to help companies deal with these issues. Timeline: raised VC dollars, been around for about four years. Turnkey DevOps automation platform. Definitely the market leader, I feel confident saying, in ephemeral environments. There's a lot of other people out there dealing with environment management, DevOps platforms.

And now I'm speaking at ETLS. I don't know if Gene's in the room, but I gotta say — as a DevOps professional for 10+ years, big fanboy of him and Gregor and all the people. It was really cool to be up here talking to all you guys on the main stage. That's pretty cool.

The 'infrastructure spaghetti' problem

What I saw on the ground every day was the infrastructure spaghetti problem. It's not actually trademarked, but I put that in there. You guys can use that. This might look familiar: you have all these different technologies, all glued together, maybe in Jenkins, maybe in CircleCI, maybe in GitHub Actions. Maybe Legacy Land: Chef, Ansible, Puppet. Hopefully that's as far back as anybody in this room is going in this day and age. You have this amalgamation of a lot of different services tied together in a very deterministic way. So we kind of set out the mission of Shipyard to actually untangle that.

Value of Shipyard

When you untangle those things — what we've seen with PepsiCo and a bunch of other customers — is that we help you scale your engineering teams. We help you prioritize velocity and DORA metrics. That's a big one. I was at Laura's talk, the DX folks, yesterday with Nathan. It was great. And they're helping measure DORA. I am very happy to say to folks: we're not measuring DORA, we're moving DORA metrics. And we're always trying to improve slightly, Nathan.

The big thing Jason didn't touch on (I don't know how significant this is to PepsiCo, maybe we can talk about that) is cloud cost, which matters for a lot of other customers. There are these long-lived environments, and there's a lot of waste. I would like to pretend I'm some eco warrior. I do care a lot about the environment, but the waste from a cost perspective and a cloud perspective is ridiculous. They're on 24×7. If you do the math in your head, I'm pretty sure all of you have a staging, dev, and UAT (maybe testing) environment. They're on but not being used 67% of the time: take out nights and weekends, and they're in use maybe eight hours a day. Maybe at an international company they're being used 12 hours a day, so we'll call it 60%. They're sitting there burning dollars, but also burning the environment.
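The idle-time figures quoted here can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name and the usage patterns are assumptions chosen to bracket the numbers mentioned in the talk, not data from PepsiCo or Shipyard.

```python
HOURS_PER_WEEK = 24 * 7


def idle_fraction(hours_used_per_day: float, days_used_per_week: int = 7) -> float:
    """Share of the week an always-on environment runs without being used."""
    used_hours = hours_used_per_day * days_used_per_week
    return 1 - used_hours / HOURS_PER_WEEK


# Hypothetical usage patterns: (hours used per day, days used per week).
for hours, days in [(8, 7), (8, 5), (12, 7), (12, 5)]:
    print(f"{hours}h/day, {days} days/week: {idle_fraction(hours, days):.0%} idle")
```

Eight hours of use every day leaves the environment idle about 67% of the time, which matches the first figure. Twelve hours a day gives 50% idle if used every day and about 64% if used only on weekdays, so the rounded "call it 60%" sits between those two cases.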

How Shipyard fits in: GitOps

The way we're executing on this mission (and we're going to get back to PepsiCo in a second, I promise): the biggest sticking point with Jason and other design partners at the time was, 'If you give me some new YAML format and configuration thing, and I spend six months doing that, then I should just build it myself.' And there are reasons why you shouldn't do that anyway, but let's just say that's correct.

The big thing here, I think we know, is foot guns and configuration and YAML. I have a saying: NAY — not another YAML. Our whole thing is: how do we fit into your existing lifecycle, get you moving the first day with a crazy amount of features and prove value?

The way we do that is we fit in specifically from a GitOps perspective. We tie into your source control — GitLab, GitHub, Bitbucket. No Perforce — the gaming one. I can never remember the name of Perforce. Perforce. Thank you. No Perforce support. Sorry. Microsoft Source Safe — there is an Azure one that we do support. Won't want to talk about that either.

What we do is: your developers essentially make a PR, make a commit, and we just take care of the turning-on of the environment for you. And then we actually manage the lifecycle of it. We have this thing called SLV — Since Last Visit. It knows when folks are building and using those environments. Jason mentioned our SSO — these are for all the stakeholders. When they're using the environment, we know it. If they're not, it turns itself off. That's a huge cost savings.
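The lifecycle described here (an environment comes up for a PR, and turns itself off when nobody has visited it for a while) can be sketched as a small idle-timeout loop. This is a minimal illustration under stated assumptions, not Shipyard's implementation: the `Environment` class, `record_visit`, `stop_idle`, and the two-hour limit are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Environment:
    """Hypothetical record for one per-PR environment."""
    pr_number: int
    last_visit: datetime
    running: bool = True


def record_visit(env: Environment, now: datetime) -> None:
    """A stakeholder opened the environment: note the visit and wake it up."""
    env.last_visit = now
    env.running = True


def stop_idle(envs: list[Environment], now: datetime,
              idle_limit: timedelta = timedelta(hours=2)) -> list[int]:
    """Stop every running environment not visited within idle_limit."""
    stopped = []
    for env in envs:
        if env.running and now - env.last_visit > idle_limit:
            env.running = False
            stopped.append(env.pr_number)
    return stopped


now = datetime(2024, 1, 1, 12, 0)
envs = [
    Environment(pr_number=101, last_visit=now - timedelta(hours=5)),
    Environment(pr_number=102, last_visit=now - timedelta(minutes=10)),
]
stopped = stop_idle(envs, now)
print(stopped)  # only the environment idle for five hours is stopped
```

The cost saving in this sketch comes entirely from the gap between "PR is open" and "someone is actually looking": an environment tied only to PR state would stay up for the whole review.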

I don't know about PepsiCo exactly, and we're never allowed to talk about numbers with PepsiCo, but a bunch of our other customers — we save about 50–60% on their pre-production cloud cost bills.

Features

I mentioned a bunch of these features already. By the way, I think my eyesight's going, because it's easier to read when I look up there than down here.

- Fits into your existing SDLC.
- Data snapshotting and seeding. Turning on your environments, Tim, it's not too challenging. You already have a lot of this automation: your Helm charts, your Argo. Where it gets challenging is on data and state. If you're trying to give a QA person, without any developer intervention or DevOps person intervention (which is kind of the point), access to these environments, they need to be populated with some type of state or data. If you don't have that, then it's kind of useless. There's no way to log in, no way to use it, no way to set it and forget it. That's our secret sauce, our mojo. We have this instant snapshotting. Think of it as forking data from your golden repo.
- Autoscaling, security, one-click shareable, integrated. We integrate with all the big ones (Jenkins) and all the small ones too. Full API, full CLI. Lots of integrations: Datadog, New Relic.

Here's a quick look at some of our screens. This is the dashboard most of the developers and DevOps folks live in — unified look at all these environments, see the lifecycle, logs, telemetry, velocity, insights. At the bottom you'll see the data portability one. This is again our bread and butter. We not only give you snapshots — you can mix and match, you can take from a child to a parent and sibling environment, and you can do all kinds of cool regression and automated regression testing.

Something that tends to resonate a lot, especially with the testing folks out there, is that when you run a test and it fails, and then you fix it, and then you run it again — the data's now corrupted. Most tests are not deterministic. Leveraging Shipyard, leveraging the data stuff, you can actually reset the data to the beginning of that run again. That's a super powerful thing.
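The reset-to-start idea can be illustrated with SQLite from the Python standard library. This is a sketch of the general technique only (fork a golden copy per environment, throw the fork away and re-fork to reset); it says nothing about how Shipyard's snapshotting works internally, and the `fork` helper and `users` table are made up for the example.

```python
import sqlite3


def fork(source: sqlite3.Connection) -> sqlite3.Connection:
    """Copy a database into a fresh, independent in-memory connection."""
    copy = sqlite3.connect(":memory:")
    source.backup(copy)  # sqlite3.Connection.backup copies source into copy
    return copy


def count_users(db: sqlite3.Connection) -> int:
    return db.execute("SELECT COUNT(*) FROM users").fetchone()[0]


# The "golden" dataset every environment starts from (hypothetical schema).
golden = sqlite3.connect(":memory:")
golden.execute("CREATE TABLE users (name TEXT)")
golden.execute("INSERT INTO users VALUES ('qa-user')")
golden.commit()

# Each environment gets its own fork, so a test can mutate data freely.
env_db = fork(golden)
env_db.execute("DELETE FROM users")  # a failing test run corrupts the state
env_db.commit()
rows_after_test = count_users(env_db)

# Reset: discard the dirty fork and take a new one from the golden copy.
env_db.close()
env_db = fork(golden)
rows_after_reset = count_users(env_db)

print(rows_after_test, rows_after_reset, count_users(golden))
```

Because the test only ever touches its own fork, the golden data is never modified, and a rerun after a fix starts from exactly the same state as the first run.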

End result: get rid of the waiting

We're getting back to PepsiCo in a second. Sorry for the advertisement. The end result of all of this — and I think this will resonate with a lot of folks — is you get rid of the waiting. You're not waiting on a broken environment, you're not waiting on a bunch of folks to do a bunch of things. So you are able to increase that velocity-to-prod, back to the DORA metrics. That's really what it's about — how quickly can you get to prod safely, confidently.

Back to PepsiCo: before and after

Going back to PepsiCo, this is kind of what your developer workflow looked like, Jason. This is the slide we worked on together.

Before: you make a PR, you're waiting for a shared environment. Maybe it's called Dev, maybe it's called Staging, maybe it's called UAT or Testing. I think a year ago it was Staging in your guys' instance. There's a lot of waiting, and then there's some iterating, but then there's waiting again. And as you scale the team — you start off at one, you're 200+ developers, all of a sudden that waiting becomes exponential.

I've talked to a few of you already. There's a statistic that the CEO of CircleCI gave me, Jim Rose: the minute cost of a developer is $1.71 per minute. So if you have 200 developers waiting five minutes for an environment, that's a lot of money. You can imagine how that compounds over time.
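As a rough check on that figure, the cost of a single round of waiting is just developers × minutes × cost per minute. The sketch below uses the $1.71 per minute quoted in the talk; the function name is an assumption, and the result covers one five-minute wait per developer, not a full day.

```python
COST_PER_DEV_MINUTE = 1.71  # dollars per developer-minute, as quoted in the talk


def waiting_cost(developers: int, minutes_each: float,
                 cost_per_minute: float = COST_PER_DEV_MINUTE) -> float:
    """Dollar cost of every developer waiting minutes_each minutes once."""
    return round(developers * minutes_each * cost_per_minute, 2)


print(waiting_cost(200, 5))  # 1710.0
```

So 200 developers each waiting five minutes once costs about $1,710, and that amount is paid again every time the wait repeats.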

After: with Shipyard, on every commit, on every PR, you have these ephemeral environments. Everyone's happy. Brian can get a little — yeah, you got Bobby now too. Bobby, love Bobby. Bobby might be watching this. Hi, Bobby. We love you, Bobby.

The end result at the end of the day — clicker — was that you got your DORA metrics all elite.

Jason Fertel

Yeah.

Benjie De Groot

Be more excited. Elite. You got Nathan here — show him how excited you are.

Jason Fertel

Yeah, no, I'm — we did it. Alright. It's truly been a journey that we've gone on. Bringing ephemeral environments right into our stack has definitely had a direct improvement on velocity, quality, and developer productivity.

Benjie De Groot

And they're expanding. I can say that — more teams?

Jason Fertel

Yeah — more teams. I think we're allowed to say that.

Benjie De Groot — close & resources

Is there anything else we want to cover? We baked in a little time for questions, but because we're in the big room, there's no microphone, so we're not really going to do that. You can find us afterwards, or probably me more than Jason. Leave him alone.

Jason Fertel

No, I'm still here tonight.

Benjie De Groot

Okay — go find him on the blackjack tables. Take him off the blackjack tables, please. Just kidding, he's doing fine.

Few resources. Oh, and this is for you: PepsiCo Careers — PepsiCo's hiring, right?

Jason Fertel

You're not… okay, well, keep an eye. They will be again soon. It's end of quarter or something.

Benjie De Groot

- ephemeralenvironments.io: that's the resource. I highly encourage you guys check that out.
- I don't tweet, but I tweet through proxy on @ShipyardBuild. It's really just like an information thing. Or sorry, X. Sorry, whatever. Can you use that logo anymore, or is that... I'm keeping the logo.
- GitHub: that's actually important. At our GitHub, there's a whole bunch of resources, sample repos for all kinds of greenfield projects. It's not just about Shipyard. There's a lot of really cool, powerful things there.
- Our website is shipyard.build.

I appreciate you all taking the time. I want to say thanks to Gene and everyone, and Perry. Thank you. One thing I will say, and I'm very proud of this: we got this deck in last minute, top five latest of all the presenters. By the way, nobody noticed.

The big thing here is that I'm happy to chat. I do have to get on a plane in a few hours, but I'll be around. I have these fans that say 'Shipyard — make your environments a breeze.' I've got three left if you find me. It's USB-C. They're very good. They're not USB-B. I think — they're not USB-A either. What happened to B? What happened to B? I don't know. But I have these fans, and I'd love to talk to you guys about that. I'd love to see what you're doing.

Honestly, at the end of the day, the ephemeral environments term — it's lovely to see how many folks actually know that term. It's growing. When I said that three years ago, everyone's like, 'I own a Tesla.' I'm like, that's not what that's related to. It's cool to see people like Tim and American Airlines starting to implement that in-house, even if you're not using Shipyard. We just want everyone giving their developers, their product people, their QA people, the ability to move quickly and take the load off DevOps people so they can focus.

Jason Fertel

And for those of you who haven't used ephemeral environments, or shipped code with ephemeral environments: they almost feel magical for the people testing, the fact that on every pull request you're able to do this. To this day, and we've been doing it for years, it's such a positive for our developer workflow. And it's for all the stakeholders. You've got executives that look at these things.

Benjie De Groot

Yeah, I, yeah. Tell me every time just to make sure, before they come on there. But no, I really appreciate everyone's time, and it's been great getting to meet a bunch of you over the last few days. Some great talks. Just want to thank IT Revolution and everybody for having us. Thank you all. Have a great rest of the day — and four minutes to transit to your next talk. Thank you.