Super-charging Software Development at Intel through DevOps

Las Vegas 2020

Madhu Datla

Senior Engineering Manager for DevOps, Global Infrastructure and Systems Engineering Team · Intel

Peter Tiegs

Principal Engineer · Intel

Imagine the functional and integration testing complexities involved when 20,000+ software developers distributed across 10+ countries are all trying to change code, test it, and put a complex system together. At Intel, we re-imagined and transformed how DevOps for complex products should work at scale. We have battled religious wars over tools and processes in a company where every team had already invested in its own well-established, localized workflows and tools.

• How do you drive change, especially in a large and diverse company such as Intel?

• How do you help teams of all shapes, sizes, and maturity levels move toward a faster integration cadence?

After more than half a decade of change management, we have lessons to share on how to approach large-scale modernization using DevOps and analytics solutions. Our solution is now able to integrate changes across the company quickly and output a packaged software kit to our customers with 40X more capacity.

Hardware and software co-development is becoming more and more relevant with the uptick in device development, from wearables to software that needs to be integrated with hardware frequently. You will learn key insights not only into hardware and software dependencies but also into Intel's modernization efforts.

Full transcript

The complete talk, organized by section.

Madhu Datla and Peter Tiegs

Madhu Datla: Hi, I'm Madhu Datla. I'm a senior engineering manager at Intel, responsible for developing DevOps and systems engineering capabilities.

Peter Tiegs: Hi, my name is Peter Tiegs. I'm a principal engineer at Intel focusing on DevOps, and today we're going to talk a little bit about our journey supercharging software development at Intel. Madhu.

Madhu Datla

Intel processors and products are used everywhere, like laptops, mobile devices, servers, autonomous vehicles, and IoT devices.

Within our client organization, in the last five to six years, the number of products that we have been delivering has grown exponentially year over year. Because each of our customers wanted to differentiate their products, we created a segmentation strategy which allowed us to meet growing market demand and support a large number of use cases.

Each of these product SKUs can have a different set of components interacting with the processor, and the software on that product differentiates how the system behaves in different situations. For example, I have an Ultrabook here, which can be in a laptop mode, or a tablet mode, or a tent mode. And in each of these modes, the user experience is different, and the software and hardware interactions are different. Another customer may want to just sell a thin and sleek laptop at a different price point.

There are several thousand products from our partners, like Google, HP, Asus, Lenovo, and Microsoft, each of them offering different price points and different usages. So essentially, the number of products has exploded. Every year, we are integrating 4X more products, and this is happening across multiple operating systems like Windows, Linux, and Chrome, which are themselves being released more frequently.

So our ecosystem has become extremely complex and dynamic, and our hardware had to be validated and released faster with the highest quality standards possible.

In order to provide the best customer experiences on these Intel systems, our development team is working across the globe. We have 15,000 engineers doing software development in a variety of languages and in multiple geographies around the clock. And one thing that is unique about Intel is that our product development is extremely complex. One of the reasons is that we have a large dependency on our hardware, and in the early phases of our product development, the hardware is very unstable. Integrating 30-plus software components that are being developed across the globe, on a daily basis, is not a trivial task. We had to innovate. We had to come up with an enterprise-wide DevOps infrastructure to support the complex development here at Intel.

Peter Tiegs

So, as Madhu was saying, there's 30-plus different software components that go into our platform-level products of the various SKUs that we have. This platform-level continuous delivery mechanism that we've put together is based on common continuous integration and continuous delivery practices, but scaled up and set into a segmented strategy where each of these various software teams, whether it's the graphics driver or the wireless WAN driver, or the audio driver, runs their own CI process and then delivers those in a CI-like process into platform integration, where we bring in not only all of those software stacks, but the OS and the standard software and the drivers.

Historically, before we started to look at this as a platform-level continuous deployment process, we used to do big bang integration, where each of these teams would deliver their software at some arbitrary time. The system integrators would ask people which versions to use, and which share drive or SharePoint site each version was on, and pull them all together. They would put them together for the first time and see if they would work. And nine times out of 10, they did not work.

So what we did was put together this process based on continuous integration, where we would incrementally add a new version against a baseline of all of these ingredients, assemble it into what we call a base SoC kit, and then turn on or enable different features depending on our target customers, whether it was for IoT or for client. As Madhu pointed out, we had a couple of different client SKUs and capabilities, and we are even now going into data center and server-based platforms. All of this is so that we could deliver a BKC, or best known configuration, out to our customers.
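Illustrative sketch (not from the talk): the idea of adding one new version at a time against a known-good baseline can be written in a few lines of Python. The names `Kit`, `integrate_increment`, and `toy_validate`, the component names, and the version strings are all hypothetical, and the validation rule is a stand-in for real platform testing.

```python
from typing import Callable, Dict

Kit = Dict[str, str]  # component name -> version (hypothetical shape)


def integrate_increment(baseline: Kit, component: str, version: str,
                        validate: Callable[[Kit], bool]) -> Kit:
    """Try one new component version against the current baseline.

    Returns the updated kit if validation passes, otherwise the
    unchanged baseline, so one bad delivery does not block the others.
    """
    candidate = dict(baseline)       # copy, so the baseline is never mutated
    candidate[component] = version   # change exactly one ingredient
    return candidate if validate(candidate) else baseline


def toy_validate(kit: Kit) -> bool:
    # Stand-in for real platform validation: reject versions tagged "-bad".
    return not any(v.endswith("-bad") for v in kit.values())


baseline = {"graphics": "31.0", "audio": "1.4", "wwan": "7.2"}
after_good = integrate_increment(baseline, "audio", "1.5", toy_validate)
after_bad = integrate_increment(after_good, "graphics", "32.0-bad", toy_validate)
print(after_good["audio"], after_bad["graphics"])
```

Because only one ingredient changes per step, a failed validation points directly at the delivery that caused it, which is the contrast with big bang integration.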

All of our software for platforms goes through this pipeline now, and we've enabled a common repository for sharing source code so that our upstream validation teams can debug use cases through source.

And as you can imagine, with a diversity of software teams delivering into this system as well as the system itself, we have a diversity of tools within our entire DevOps portfolio. I'm sure you recognize many of these tools operating in many of the different roles that we need in a complex DevOps system. And while it may seem like the right goal is to drive down to a single pipeline and a single set of tools in a single toolchain, we have found that that's not really an achievable goal in the reality of DevOps at a complex enterprise-level system.

One, software teams have legacy that they need to support, and there are certain things that may be tied to specific tools. Two, some of those software teams are coming in from acquisitions, whether we're purchasing a new company or a team is joining with the tools it has used historically, and they're not necessarily using the same tools that we've used historically. And one other really important piece is that we want to stay up to date and modern as tools change and evolve. The DevOps space is incredibly dynamic now, and so we need the ability to have a mix of tools in the system so that we can stay up to date.

We also need to balance both commercial off-the-shelf tools in addition to open source tools, as well as internally developed tools. Like you'll see some of these blue ones, OneKit, OneBKC, and Axon, which are Intel-developed tools to support our own DevOps operations as we need. And having that mix to find the best set of capabilities for our product and to get these systems to work together is really what we need for a toolchain.

So if we look, for example, at one particular pipeline: say one platform team needs to do their builds on Kubernetes, with some physical hardware managed with Ansible. Their source code is stored in GitLab, they're doing the build integration through Jenkins, they need to run Klocwork and Black Duck Binary Analyzer for security scans, and then they need to deliver their binaries out to Artifactory and report through some of our own Intel internal systems.

Similarly, another team might have a different mix in their pipeline of tools that is needed. They might need to use GitHub on top of Protex and TeamCity for their build system instead of Jenkins, and they need to report their test results through Splunk and create a report in Power BI. So these mixes and matches of toolchains that we have within the entire portfolio make it really challenging to make sure that we have a consistent, live, stable DevOps platform for our teams to deliver their software on. The toolchain is constantly evolving and for good reasons.

And feel free to ask questions on whatever the platform is as we go. So one of the things that we did to help us handle and wrangle this diverse set of tools within the toolchain to support this broad ecosystem, as Madhu kind of touched on, is we created a DevOps enterprise program.

And we focused on three areas where we would want to make sure we were consistent. The first area is systems engineering. We wanted to make sure that the data about the software going through our DevOps pipeline was consistent so that we had a virtuous feedback loop. As things came into the system and were processed at the platform level, we could provide data back to the software teams, and the data that the software teams delivered to us was good. Madhu mentioned pre-checks. This is where the data around those pre-check algorithms comes in, to watch our various software components as they arrive.

We also had a team focused specifically on where our source code was and how we could be consistent with our build automation. This team made sure that even if we had a diverse mix of source code systems out there, that the source code was available through common access permissions. And that for building, when there were common behaviors that needed to happen regardless of what pipeline you had, there were tools and capabilities enabled for the teams to draw upon so that we had at least some consistency, regardless of whether we were using the same tools or not.

And finally, the test and release team focused on standardizing and simplifying how we were testing and reporting our test results between the different teams, as well as standardizing the release channels of our platform software out to our customers. This program was a key piece to making sure that we at Intel, with our big enterprise, could handle all the diversity and that explosion of software and SKUs that Madhu mentioned before.

So a little bit more detail into the enterprise DevOps exploration. Knowing that we had a diverse set of tools in the toolchain, we wanted to set some foundational rules to make sure that the tools work together. So one of the key areas, as we touched on in the build and source team, was focusing on inner sourcing, making sure that the source code was available, whether it was for debug purposes or just sharing knowledge between the company. We needed to make sure that we had binary storage consistently available. We have a global enterprise, and some software that's built in Folsom, California, may need to be tested in Bangalore, India.

We needed to make sure that the build infrastructure, the compute capacity that we needed to do this computation, was distributed worldwide so that software teams, regardless of where they were around the world, had a consistent environment. And we deployed a hybrid cloud with Kubernetes on-prem and the ability to go off-prem as needed.

Finally, knowing that we needed to reexamine our toolchain on a regular cadence to stay up to date with the best practices within the DevOps industry, we planned that into our system: we would rotate and reexamine the toolchain on a three-to-five-year cadence. One of the side effects of that is we decided to build reusable libraries that would abstract the specific tools away from the logic that we needed to deliver software for our business. We call this the abstract build interface, and it helps us avoid vendor lock-in, and it allows us to survive in an environment where we have a diverse toolchain.
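Illustrative sketch (not from the talk): the talk does not describe how the abstract build interface is implemented, so the following only shows the general pattern of keeping business logic behind a tool-neutral interface. `BuildBackend`, `StubJenkins`, `StubTeamCity`, and `release_build` are hypothetical names, and the stub backends return made-up identifiers instead of calling any real Jenkins or TeamCity API.

```python
from abc import ABC, abstractmethod
from typing import Dict


class BuildBackend(ABC):
    """Tool-neutral contract that the delivery logic depends on."""

    @abstractmethod
    def trigger(self, project: str, ref: str) -> str:
        """Start a build and return an opaque build identifier."""


class StubJenkins(BuildBackend):
    # Stand-in only: a real adapter would call the tool's own API here.
    def trigger(self, project: str, ref: str) -> str:
        return f"jenkins:{project}@{ref}"


class StubTeamCity(BuildBackend):
    # Second stand-in, to show that callers do not change when tools do.
    def trigger(self, project: str, ref: str) -> str:
        return f"teamcity:{project}@{ref}"


def release_build(backend: BuildBackend, project: str, ref: str) -> Dict[str, str]:
    """Business logic written once, against the interface, not the tool."""
    build_id = backend.trigger(project, ref)
    return {"project": project, "ref": ref, "build_id": build_id}


for backend in (StubJenkins(), StubTeamCity()):
    print(release_build(backend, "audio-driver", "main")["build_id"])
```

Under this pattern, rotating a tool on the three-to-five-year cadence means writing one new adapter rather than rewriting every pipeline that uses it.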

Ultimately, the secret of DevOps at this enterprise was to remove the barriers that keep software engineers from delivering software and value to the customers.

Madhu Datla

Thank you, Peter.

We fully believe that what cannot be measured cannot be improved. There are several commonly used metrics in the DevOps community, like average build time, average number of build failures, total number of test regressions, the nightly regressions that are happening, tool reliability-related metrics like downtime. But in large enterprises, the tools are managed by several teams. Like Peter mentioned, a typical DevOps workflow is probably going to be achieved through a combination of tools.

Developer productivity is one of the important measures that we need to consider when thinking about an enterprise DevOps system. When one of the tools in the stack that you are offering as a DevOps toolchain is down, it impacts the overall reliability of the solution. The developer who is waiting for that build to come out has to wait longer. Their objectives are getting delayed.
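Illustrative sketch (not from the talk): one simple way to see why a single tool outage affects the whole solution is to treat the toolchain as a serial chain in which every tool must be up for a build to complete. Assuming independent failures, which is a simplification, the solution's availability is the product of the individual availabilities. The function name and the 99% figures are hypothetical.

```python
import math
from typing import Iterable


def chain_availability(tool_availabilities: Iterable[float]) -> float:
    """Availability of a workflow that needs every tool in the chain.

    Assumes tools fail independently; each value is a fraction in [0, 1].
    """
    return math.prod(tool_availabilities)


# Five tools, each individually available 99% of the time (made-up numbers).
five_tools = [0.99] * 5
print(f"{chain_availability(five_tools):.3f}")
```

With these made-up numbers the chain as a whole is available about 95.1% of the time, noticeably worse than any single tool, which is the motivation for measuring the solution rather than each tool in isolation.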

So we propose that you have a solution-level objective, which is a measurable criterion for the whole solution. We also need to set a boundary condition for those objectives. A good example is that we expect a build to finish within the expected build time, let's say five minutes, but a 5% variation is okay. We also need a clear guide for the triage engineers so that they can root cause an issue as quickly as possible. That guide includes an escalation path, so the team knows when to call for help or escalate.
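Illustrative sketch (not from the talk): the five-minute build with a 5% allowed variation can be expressed as a small objective check. The class name `SolutionObjective`, its fields, and the "ok" / "escalate" labels are hypothetical; only the five-minute target and 5% tolerance come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SolutionObjective:
    name: str
    target_seconds: float   # expected duration for the whole solution
    tolerance: float        # allowed fractional variation, 0.05 means 5%

    def upper_bound(self) -> float:
        """Longest duration that still counts as meeting the objective."""
        return self.target_seconds * (1 + self.tolerance)

    def classify(self, observed_seconds: float) -> str:
        """Return "ok" inside the boundary, "escalate" outside it."""
        return "ok" if observed_seconds <= self.upper_bound() else "escalate"


build_slo = SolutionObjective("build time", target_seconds=300, tolerance=0.05)
print(build_slo.classify(310))  # within 5% of five minutes
print(build_slo.classify(330))  # beyond the allowed variation
```

The "escalate" result is where a triage guide would take over, telling the engineer who owns the next step.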

We also have something called solution-level mission interrupts, which are systemic failures in the system. Right? These are the interrupts that are interrupting the overall business workflow. When a solution-level mission interrupt happens, you are not able to deliver something that is significant. A failure needs to be classified as a mission interrupt if it has a significant business impact. Conducting systematic retrospectives to root cause these failures, and making sure that we are continuously improving on the issues that we find, is extremely important so that we can avoid recurring failures.

The other important thing to keep in mind is you want to have a way of finding the faults as quickly as possible. Right? So the average time to find the faults in a service needs to come down over a period of time. It is always a good idea to define certain quality gates. We call them pre-checks, and establishing the pre-checks within the pipeline will allow a seamless integration of one software team's deliverables with another software team's deliverables.
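Illustrative sketch (not from the talk): a pre-check gate can be modelled as a list of named predicates that every incoming delivery must pass before platform integration. The specific checks, field names, and the `run_prechecks` helper are hypothetical; the talk does not list which pre-checks Intel runs.

```python
from typing import Callable, Dict, List, Tuple

Delivery = Dict[str, object]                       # metadata about one delivery
PreCheck = Tuple[str, Callable[[Delivery], bool]]  # (name, predicate)

# Hypothetical gates; a real pipeline would define its own.
PRECHECKS: List[PreCheck] = [
    ("has_version", lambda d: bool(d.get("version"))),
    ("unit_tests_passed", lambda d: d.get("unit_tests_passed") is True),
    ("security_scan_clean", lambda d: d.get("open_security_findings") == 0),
]


def run_prechecks(delivery: Delivery, checks: List[PreCheck]) -> List[str]:
    """Return the names of failed pre-checks; an empty list means accept."""
    return [name for name, check in checks if not check(delivery)]


good = {"version": "1.5", "unit_tests_passed": True, "open_security_findings": 0}
bad = {"version": "", "unit_tests_passed": True, "open_security_findings": 2}
print(run_prechecks(good, PRECHECKS))
print(run_prechecks(bad, PRECHECKS))
```

Returning the names of the failed gates, rather than a bare pass or fail, gives the delivering team the feedback it needs and shortens the time to find the fault.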

In summary, the enterprise-wide DevOps toolchain can be quite complex and messy. You should fully expect that the tools need to coexist with other solutions for a long time. Instead of tying your business workflow to the individual tools, invest in standard interfaces so that when new technologies come along, it is easier to migrate to the new technologies. Lastly, think about the solution-level objectives to hold the individual teams accountable.

And the system needs to be tolerant of multiple destabilizing factors. Because these are enterprise-wide systems, there are many things that could go wrong. Defining those tolerance levels for each of the teams to optimize their solutions will be beneficial in the long run.

Feel free to reach out to us with any questions, and we'll be more than happy to answer. And thank you for listening.

Peter Tiegs: Yes. Thank you for listening.