DevOps - The Toyota Way

Las Vegas 2022

Director of Engineering · Toyota North America

In 2019 we showed how the tools and automation were streamlined so teams could best ride the DevOps journey. The journey has a third element: the people themselves. This time we share how we used the Toyota Way to change the culture and adopt the CAMS model (culture, automation, measurement, sharing).

The journey shows how we demonstrate thought leadership in providing self-service, fast, secure engineering solutions that cater to a wide spectrum of maturity levels across hundreds of teams. We will also show what it takes to build a solid developer experience platform and how it helps companies and their development teams move fast in a cost-efficient manner.

The road ahead is smooth when there is harmony in CAMS. That is when we have happy team members who deliver solid results and still have a work-life balance!

Full transcript

The complete talk, organized by section.

Kishore Jonnalagedda

How many of you know that car? Raise a hand. There you go. Bing, bing, bing. That's right. It's a Supra. The yellow one has the 2.0 I4 engine. Awesome as your daily drive. The red one, that's a 3.0 I6. I'll let you drive the prototype, which I'm not going to talk about.

So there is no best, only better. Okay, Toyota, awesome. That's the principle behind continuous improvement. There's a lot to cover today, and I'm going to talk you through some of our journey, the principle behind continuous improvement, TPS (the Toyota Production System), and the father of TPS, Taiichi Ohno. I don't know if you know, he was recently inducted into the Auto Hall of Fame. How about a round of applause for him?

Thank you. Core principles: high quality, low cost, shortest lead time. That is three out of three, by the way, just so you know, three stars. And no, that's not our next prototype. Core purpose: eliminating waste and continuous improvement.

So why did we even need that? Well, since Toyota started many, many years ago, market demand has proliferated, software has evolved, and the role of software has grown with richer features. These days, when you go to a dealer, not many people are going to appreciate it if I give you a bunch of coupons and say, "Come back a year later." There's got to be a digital way to do this. So many, many things like that. So the Toyota Way: Genchi Genbutsu and Kaizen. Genchi Genbutsu: go and see firsthand. Kaizen: continuously improve.

When systems started improving, we had a lot of these systems developing in a siloed manner, not necessarily with one cohesive plan. So locally optimized lifecycles. I'm sure you've seen applications locally optimizing for themselves. They do a whole lot of things, and of course the number of applications started increasing as market demand increased. So there was a time when we did some Genchi and saw, well, now we need to look at it from an enterprise point of view.

How do we look at that? There was an existing process. I'm sure most of you have seen this, an application lifecycle process that started off with ITIL and ITSM. Well, then let's put in some DevOps bubbles, and it slowly evolved into this. If you saw the previous talk, some of this is obviously there: some of the tools that we discussed and how we used them. So the DevOps principles and practices: the first, obviously the easiest to think about, is automation, and we're all engineers in the back; then measurement; and then ultimately bring it all the way up to full-stack DevOps.

So we had to slice it, planning the process, the various lifecycle stages, and we didn't want to solve everything under the earth. So here's how we kind of looked at it, and this is just the first cut of many features that come up when you ask teams, when you kind of go talk to them. This is how you're going to look at it.

Along with this, there's also the cloud and on-prem. Many of our systems have been in data centers, and of course the cloud was evolving. So then we had to make a choice. What should we first solve for: cloud? Should we solve for all on-prem? What are the tools that are available and the automation? Naturally, in our choice, we said, let's first optimize for cloud.

Then we had to set a vision too, so we don't deviate from that. Because there are many ways, every time you try to optimize something, people are going to say, "Oh, this one time you're going to do it this way." No. Let's set a vision that says make sure that the velocity of adoption is a critical thing. So the application teams that are trying to adopt the cloud: make it easy for them. Then how do you bring reliability and cost efficiency into that thing? And of course we've got to pick the tools.

On to our journey for tools. This is a glimpse of various tools that we picked. Of course, open source, being engineers, is the first thing we gravitated to. Eventually, when we got the money, sure, we got some commercial products. Scalable build slaves. Like I said, many applications organically evolved, so there are varying things. There is not necessarily a cohesion in how all these things work. So there were a lot of heterogeneous things. For example, .NET applications behaved a little differently than Java ones and several other such frameworks. And then how do we bring in that unified CI? And how do we bring in those metrics that mattered across the board?

This was our landscape, and we started introducing it to the application teams. We said, "Guys, we've got wires into the dashboards. There's a way to measure these things. It's a win if you do this." We had teams say okay. And that's how it looked. Yeah, not one-for-one, but that's how it looked. So we'd ask, "Hey, what happened? Where are the metrics in the dashboard?" And they'd say, "Yeah, we're going to do it. It took a while, so I did my own thing." That's not how it's supposed to work. It's supposed to be straightforward plug-and-play. Well, what happened?

Time to do some Genchi and then Kaizen. Turns out applications have become richer, with more and more frameworks. You don't just write from scratch, Hello World style. There are frameworks that do it, and you build on them. So there is the application learning itself, and there are a lot of moving parts. You're not just integrating with one or two things; you're integrating across a whole slew of things.

If I were to put a developer hat on and know my customer, this is how it looks. The developer is looking at it from the point of view of, well, I have a frontend. I have a backend, maybe other backends, batch jobs, calls to other external APIs. I've got a lot to deal with here already. And now you're telling me to take this on too. I'm not even thinking about how to write a Jenkinsfile. I'll just Google it and probably do it. And Helm charts, not even that. Databases, forget about it. So that doesn't come naturally. They're not necessarily investing in it first. Ultimately, we will solve it as an application team, but not in the most optimal way, like you saw.

So this is when we were like, okay, what do the application teams really want to do? They want to build good products for themselves. They want to build good, high-quality digital products. They're not necessarily looking at it from a platform point of view. So how do we make it better, simple for them, from a platform point of view? That requires not just tools and metrics and automation; that requires a bit of a cultural change.

In that effort, we sliced it into two parts. One is a platform team that has a bunch of experts that can think about platform aspects of it: reliability, cost efficiency, all of that. And then the developers that are writing their developer code. But then there's got to be a contract established between these two.

Some examples of the contract are X as a template. Think of it, if you're using Kubernetes and on Amazon, let's say EKS as a template, or X as code: infrastructure as code, monitoring as code. Basically, ultimately everything is code. So there is no this person who does this or that. Ultimately the application is still only successful or live when everything is working in that picture. But the contracts are well established. And then the SLX: service level agreements, objectives, indicators, all that good SRE aspect of it. This is how we connected those two.
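
To make the SLX part of the contract concrete, here is a minimal sketch of how an objective (SLO) and an indicator (SLI) relate. The class name, the function names, and the numbers are assumptions made for illustration; the talk does not describe how these checks are implemented at Toyota.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevelObjective:
    """A target the app team and platform team agree on (hypothetical shape)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Service level indicator: the measured fraction of successful requests."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def error_budget_remaining(slo: ServiceLevelObjective, sli: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    allowed_failure = 1.0 - slo.target
    if allowed_failure <= 0:
        raise ValueError("a 100% target leaves no error budget to track")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Illustrative numbers only.
slo = ServiceLevelObjective(name="checkout-availability", target=0.999)
sli = availability_sli(good_requests=99_950, total_requests=100_000)
print(f"SLI={sli:.4f} SLO met={sli >= slo.target} "
      f"budget left={error_budget_remaining(slo, sli):.0%}")
```

The point of writing the contract down as code is that both sides can read the same numbers: the platform team publishes the target, and the application team sees how much room it has left.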

I'm going to go into the various responsibilities of those teams. A couple of products I want to highlight here. Again, there are many products; for the purposes of this talk, blueprints is what we call them. What I was talking about: X as templates, accelerating teams to shift left faster, and giving those hooks and things for things like measurements and whatnot.

And teams that provide services, like the SRE team. Again, there is no one central SRE team. It is a shared responsibility model. So there are services that are provided here by the SRE, by the cloud economists. Again, if you don't have a cloud economics team, you're losing big chunks of money. And principal engineers on the platform team that actually understand application code. They're just regular application engineers.

Put together, we brought those two together. So here are some good practices. Well, here's your application, then let's bring them together, with the guiding principles of inner sourcing. Not everything can be solved by the application or the platform team itself. So ours is a model of inner sourcing where application teams go to the platform first, and should they find something good, we have a way for them to contribute back to the platform for the greater good.

Self-service, which is key. I don't want my team or our team to stand there and say, "I'm going to push the button for you." No. Here's a self-service way. It'll go put all the necessary guardrails and give it to you. And community driven: once an application learns some good ways to do it, there's got to be a way for them to teach and contribute back. That is another big thing.

And ultimately API. When I say API, I don't necessarily mean exposing a service like a RESTful service. It is an interface. It's a well-defined interface. If they give a template, there is a known interface that they're going to be coding into, so people know how to use it.

Let's talk a little bit about blueprints. I know there are various things when you Google it. Here's an Amazon S3 example that I was taking. The red ones at the bottom, and as you go up to the green, the value provided to the teams incrementally increases, and it's a multiplication factor. Let's talk about the example of a Terraform module, infrastructure as code, deploying something for an S3. Straightforward. That's the red one.

Now a combination of that. Now we're talking about a specific single-page application (SPA) architecture. That's a combination of these things, maybe S3 with CloudFront, with a firewall, some observability hooks into some known observability dashboard, and failover mechanisms: if something goes bad, you have to switch regions. Another piece could be how you do the CI/CD for that specific SPA.

Combine this, and then the ultimate green one is a reference architecture. Now we're talking about a specific thing: Angular app, specifically maybe using Webpack as its build tool in CI/CD, and going through the necessary checks for security tools using what I've just shown earlier, and with CloudFront and firewall, maybe an authorization module and authentication module, all of this in a can. Ultimately the app team is still writing Angular code. This is what I mean by blueprints, and that requires a level of expertise that you kind of have to iterate multiple times with the teams.
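
The layering described above (a single module, then a pattern that combines modules, then a full reference architecture) can be sketched as nested data. This is only an illustration: the `Blueprint` class and every name in it are hypothetical, not the actual Toyota blueprint catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Blueprint:
    """One layer of reusable platform content (names are illustrative)."""
    name: str
    includes: tuple["Blueprint", ...] = ()

    def flatten(self) -> list[str]:
        """Return every building block this blueprint brings along, depth first."""
        names = [self.name]
        for child in self.includes:
            names.extend(child.flatten())
        return names


# Bottom layer ("red"): a single infrastructure-as-code module.
s3_module = Blueprint("terraform-module-s3")

# Middle layer: a pattern that combines modules for single-page apps.
spa_pattern = Blueprint(
    "spa-hosting-pattern",
    (
        s3_module,
        Blueprint("terraform-module-cloudfront"),
        Blueprint("terraform-module-firewall"),
        Blueprint("observability-hooks"),
        Blueprint("region-failover"),
    ),
)

# Top layer ("green"): a reference architecture for one concrete stack.
angular_reference = Blueprint(
    "angular-reference-architecture",
    (
        spa_pattern,
        Blueprint("webpack-ci-cd-pipeline"),
        Blueprint("security-scans"),
        Blueprint("authn-authz-module"),
    ),
)

print(len(angular_reference.flatten()), "building blocks in one blueprint")
```

Each step up the stack hands the app team more ready-made pieces at once, which is the multiplication effect the speaker refers to.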

Coming back to the self-service model. Creating the blueprint is one thing, and we can't create blueprints in a silo. You have to work, and that's how we did it. We had to work with the teams that are actually using it, and there is a customer at the end of this thing that is going to receive it at a known time. Working with them, we iterated and perfected that thing.

Self-service portal, same way. The customer is going to use the self-service portal. We're not going to just give away the blueprint. Here's the self-service portal exposing it. How do you use it? Minimal inputs, so all they need is a well-defined, "What's your T-shirt size for this thing? How do you get it?" And building that software catalog. So now we know what applications are there, how their lifecycle is, what is deployed in production in the portal, backed by good education and training.
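
As a rough sketch of the "minimal inputs" idea, a portal form could accept only an application name and a T-shirt size and expand them into a full request with guardrails already applied. The sizes, resource values, and field names below are assumptions for illustration, not what Chofer actually uses.

```python
from enum import Enum


class TShirtSize(Enum):
    SMALL = "S"
    MEDIUM = "M"
    LARGE = "L"


# Hypothetical presets that the platform team would own and tune.
SIZE_PRESETS = {
    TShirtSize.SMALL: {"replicas": 2, "cpu": "250m", "memory": "512Mi"},
    TShirtSize.MEDIUM: {"replicas": 3, "cpu": "500m", "memory": "1Gi"},
    TShirtSize.LARGE: {"replicas": 6, "cpu": "1", "memory": "2Gi"},
}


def render_request(app_name: str, size: TShirtSize) -> dict:
    """Expand two user inputs into a complete provisioning request."""
    if not app_name.strip():
        raise ValueError("app_name must not be empty")
    return {
        "app": app_name,
        "size": size.value,
        **SIZE_PRESETS[size],
        # Guardrails are filled in by the platform, never typed by the user.
        "guardrails": {"encryption": True, "public_access": False},
    }


print(render_request("dealer-portal", TShirtSize.MEDIUM))
```

Keeping the form this small is a design choice: everything the user does not type is something the platform team can change later without asking every application team to update their requests.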

That education is not necessarily our team just providing articles, or what is available online. This is necessarily only to the Toyota ecosystem and also community driven. Other teams have done something, they've learned from it, and there's a way for them to contribute back in blogs and whatnot.

I'm going to show you a screenshot of how that might look. Economics and FinOps aspect of it, that is a continuous thing, and the SRE metric. Those need to be transparent for application teams, so they can continuously invest in their applications. There's always that feedback loop, and that needs to be transparent.

What do I mean by that? Here's a screenshot of our portal. We call it Chofer. That's the Spanish spelling for English chauffeur. People found this to be easy to type on a URL, and Chofer comes from Toyota's autonomous driving program, Chauffeur. The intent is people should be able to automatically drive the things they want. And like I said, the blueprints that we're talking about are meant to be rich for the Toyota ecosystem. There is a well-architected score that is published with all these blueprints in collaboration with the necessary expertise teams.

What we're looking for here is, are they being used properly? What is the rate of adoption? How many people are clicking on that thing, and rate of drop, friction points, or delay points? And of course, there's got to be a help icon to say, well, I'm stuck. What do I do?
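
One way to look for adoption rate and drop-off points is a simple funnel over portal events. The step names and sample events below are invented for illustration; the talk does not say how the portal team actually computes these numbers.

```python
# Hypothetical steps a user passes through when using a blueprint.
FUNNEL_STEPS = ["viewed", "form_opened", "submitted", "provisioned"]


def funnel_report(events: list[tuple[str, str]]) -> list[dict]:
    """Count distinct users per step and the drop relative to the previous step.

    `events` is a list of (user, step) pairs; unknown steps are ignored.
    """
    users_by_step = {step: set() for step in FUNNEL_STEPS}
    for user, step in events:
        if step in users_by_step:
            users_by_step[step].add(user)

    report = []
    previous = None
    for step in FUNNEL_STEPS:
        reached = len(users_by_step[step])
        # No drop figure for the first step or when nobody reached the prior one.
        drop = None if not previous else 1 - reached / previous
        report.append({"step": step, "users": reached, "drop_from_previous": drop})
        previous = reached
    return report


events = [
    ("ana", "viewed"), ("ben", "viewed"), ("cy", "viewed"), ("dee", "viewed"),
    ("ana", "form_opened"), ("ben", "form_opened"), ("cy", "form_opened"),
    ("ana", "submitted"), ("ben", "submitted"),
    ("ana", "provisioned"), ("ben", "provisioned"),
]
for row in funnel_report(events):
    print(row)
```

A step with an unusually large drop is a candidate friction point, which is where a help link or a simpler form is likely to pay off first.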

That in itself triggered moving the needle from a quality point of view, cost efficiency. Again, the cloud economics team, the feedback that they brought back automatically got embedded into these new blueprints. So the next version, you would get the best of the breed. That's something that happened throughout the thing. That's what community learning helps. When you contribute back, you're actually getting the whole thing. And ultimately improving lead time by some. But notice they're not full green yet. There's still work to be done.

What do we do? This is where the community-based learning that I was talking about came in. FAQs, blogs, tech articles that are written specifically for our problems. These are all very Toyota-centric. Then a big one that added to this was a series of educational sessions that we call Cloud Design. Again, the intent is to create brand awareness. Everybody, top down from the CIO to the developer, knew what Cloud Design was and knew its regular schedule, and they were like, "Oh, okay, today's Cloud Design session is so-and-so." You would join. There's no need to go separately market this. Once that awareness caught on, people just logged on to the sessions.

Along with surfacing this, one big one is insights on a single pane of glass, for example your security score. This single pane of glass is visible to anybody. Now any team can look at any other team's scores and get a feel for how those teams are doing over time. That brought a gamification aspect to it. Teams started working their way through, and slowly we started seeing those green stars coming out.

High quality: you can see that when there was an outage in the cloud, you wouldn't see as many applications being impacted; they are naturally doing their automatic failovers. Cost efficiency: there's no more low-hanging, obvious thing. Efficient lead time: we could see that from the production release time cycles. Again, these three cards that I'm showing are just a glimpse of all the various cards that are coming out. That's how these things are.

If you guys are engineers, I just want to throw it out there as a big O for this. Like I said, it's a multiplication factor. I know there's a lot of calculation in it, but the primary thing that I want you to take out is: if you're not using what I say as blueprints, I mean the whole gamut of the platform, then you're iterating multiple times, which is day one, and day two gets even more exciting. You're going to iterate again. When you're iterating, of course you're putting money down, you're going to put your developer time down, and lead time increases. Whereas the other side, the adoption rate that we've seen is fairly constant.
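
The big-O argument can be sketched as a toy cost model: without the platform, effort grows with teams times iterations, while with the platform there is a one-time build cost plus a small, roughly constant adoption cost per team. All function names and numbers here are hypothetical and are not figures from the talk.

```python
import math
from typing import Optional


def weeks_without_platform(teams: int, iterations_per_team: int,
                           weeks_per_iteration: int) -> int:
    """Every team iterates on its own: effort scales with teams x iterations."""
    return teams * iterations_per_team * weeks_per_iteration


def weeks_with_platform(teams: int, platform_build_weeks: int,
                        adoption_weeks_per_team: int) -> int:
    """One shared build cost, then a near-constant adoption cost per team."""
    return platform_build_weeks + teams * adoption_weeks_per_team


def break_even_teams(iterations_per_team: int, weeks_per_iteration: int,
                     platform_build_weeks: int,
                     adoption_weeks_per_team: int) -> Optional[int]:
    """Smallest team count at which the platform costs no more than going alone."""
    saved_per_team = iterations_per_team * weeks_per_iteration - adoption_weeks_per_team
    if saved_per_team <= 0:
        return None  # the platform never pays off under these assumptions
    return max(1, math.ceil(platform_build_weeks / saved_per_team))


# Illustrative inputs only.
n = break_even_teams(iterations_per_team=4, weeks_per_iteration=2,
                     platform_build_weeks=40, adoption_weeks_per_team=1)
print("break-even team count:", n)
print("without platform:", weeks_without_platform(n, 4, 2), "weeks")
print("with platform:", weeks_with_platform(n, 40, 1), "weeks")
```

Past the break-even point, each additional team widens the gap, which matches the speaker's claim that the value multiplies as more teams adopt the platform.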

Now, I'm sure you all got the big O, but let's look at a graph. There's another product team, by the way, that looks at these graphs on a periodic basis, that looks at insights on how things are going across the space. What we looked at was those teams that are actually adopting the platform pieces are gaining something in the order of eight weeks. I've done other products in my past life, but this is one of those products that we built that has the fastest ROI on these things.

I didn't mention earlier, but the technology of choice that we did for the Chofer portal is Backstage, backstage.io. We adopted it like two years ago, back when Backstage was still not as mature, but now it's way more mature. So thank you, Spotify, for open sourcing that.

One more interesting thing about teams that go around building their own platform features: it's not easy to get skills for these platforms, especially cloud and all those. If you go Google them, they're pretty much all over the place. I live and breathe this because I'm hiring Google engineers or cloud engineers. The lead time to hire is one aspect; then bringing them up to the maturity level where they can actually produce ecosystem-friendly applications or templates, that itself is on the order of six weeks. Overall, what we saw was real value in putting this platform in place. It was actually resonating in how we're doing this.

We're coming to the end. I want to leave some time for questions. Shared responsibility: like I said, platform team is one aspect of it, but then there's a whole lot that the application team also has to do. Giving a good tool is one thing, but then you have to write it so that you know how you're going to debug at 2 a.m. You've got to put two and two together. The balance is struck only when you have both sides working closely with each other.

Finally, what I like to call it: DevOps bringing balance to life. That is what it is. The day my developers go home at five o'clock and don't have to look at things just because there's an outage, that is bringing balance to life. You don't have to come in on a Saturday to do a release or do something else at midnight. It just happens. That's what that is.

The journey doesn't end for us. Of course, there's more. We've done the outer loop aspect of it, but now the inner loop, while the developers are still working on their laptops: how do we make this more integrated, make it more efficient, bring richer... Now that we've seen a catalog is actually fruitful, well, let's go closer to the business and bring those richer templates and whatnot. Ultimately, we're researching the use of the SPACE framework to measure our developer productivity and see what's going to work best, given the current nature of geographical separation and whatnot. Not everybody's in the same room. That is something that we're exploring.

That's all I had. Thanks. That's my LinkedIn. Drop me a note. We are hiring, by the way, if you're interested. And there were some awards from the past.