Dell’s Journey of IDP Scaling


Las Vegas 2023


Technical Staff Engineering Technologist · Dell

At Dell, we rolled out an IDP to 15K developers across 100+ products that deliver in multi-faceted ways. In this talk, we will explore how we did this at Dell for its multi-billion-dollar business units, curating developer workflows in a way that is unlike any other out there. Come and listen to our journey!


Full transcript


Henrik Koren

Good morning, everybody. My name is Henrik Koren. I work for Dell, where I'm a distinguished engineer, and I'm going to present our Dell journey: how we scaled our internal developer platform.

So what is our journey and where did it happen? The journey was focused on ISG. ISG is our Infrastructure Solutions Group, the part of the business that covers storage, servers, networking, cloud solutions, data protection and communication solutions, and many more solutions that we have inside.

We have more than a hundred product groups and more than 50,000 people who do things differently. Our focus was to increase IP collaboration and reduce the duplication of work done by each of the groups. We also wanted to make sure that all our product groups are much better aligned on the processes and the services that they consume.

So where did we start this journey? We started by acknowledging that we needed to make a change. And in order to make a change in such a big company, you need management champions to support your journey.

After that, we also understood that we needed to change how we communicate and which language we use internally, focusing on CI/CD processes. The first thing we did was research to understand what was actually going on in all our product groups, find the common usages, understand how much overlap we had between us, and identify which tools we were using across the board.

Afterwards, we went into a planning phase, and in that phase we created substreams to evaluate exactly what we were going to do in each field, such as source control, release management, and how we would use Jenkins. Each of these substreams came up with recommendations of best methods and best practices.

Another thing: we had many, many tools, and we needed to shrink the number of tools used inside our company and provide a baseline of best practices for those tools. We compiled a list. We call it our North Star tools, and this is the recommended tooling that everybody needs to use inside our company.

And the last thing: we started to build this platform, started to interact with our customers, and identified users who were willing to start the journey with us from the beginning, even though we were not perfect or a hundred percent ready.

So what we were able to do here was bring everybody across the board to look at the same system, start collaborating, and focus on a centralized system. So this is how we began: with a lot of things out there, and a lot of things here, that we needed to align.

So one of the main things that we saw, and I think we also saw it during this conference, is the question of development platforms and services versus traditional DevOps groups.

So what we're saying is that the IDP is there to reduce the cognitive load on the developers themselves, provide engineering governance, let them consume the services much more easily, and find a better balance between the development needs and the business needs.

This helps everybody work faster and deliver more reliable software. These are what we see as the fundamentals of platform engineering, and the difference between it and DevOps engineering.

So what does our system look like, and how do we see it? First of all, we have our North Star tools. As you can see up there, we're using GitHub Enterprise, Jenkins, JFrog, qTest for test management, and Jira for defect and task management.

On the other hand, you can see the tools that we need to integrate with, that everybody needs to use in their CI/CD pipeline, like Coverity, SonarQube, McAfee, and Twistlock, and all of those need to be present in the CI/CD. For that to happen without each organization integrating them on its own, we created our IDP. This is what we came up with.

So the base and the glue of our entire system is Kafka. We chose Kafka because it gives us loosely coupled integration. It's an easy and reliable system that makes our system much more flexible and ready for... And so, as I said, Kafka is used for integration not only between our components, but also with outside components.

So anything that interacts with us, or that we answer back to, goes through Kafka messages. Next, we deployed all our services on a Kubernetes cluster, and we're using CyberArk as our identity manager.
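To make the loose coupling concrete, here is a minimal sketch of topic-based publish/subscribe. It is not Dell's code: the bus is an in-memory stand-in for a Kafka cluster (delivery is synchronous and nothing is persisted), and the topic names, message fields, and the `scanner` service are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, DefaultDict, Dict, List

# Hypothetical topic names; the talk does not name the real ones.
SCAN_REQUESTS = "idp.scan.requests"
SCAN_RESULTS = "idp.scan.results"

Message = Dict[str, str]
Handler = Callable[[Message], None]


@dataclass
class InMemoryBus:
    """In-memory stand-in for a message broker: topics map to subscriber callbacks."""

    subscribers: DefaultDict[str, List[Handler]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, message: Message) -> None:
        # The publisher only knows the topic, never who consumes the message.
        for handler in list(self.subscribers[topic]):
            handler(message)


def run_demo() -> List[Message]:
    bus = InMemoryBus()
    received: List[Message] = []

    def scanner(request: Message) -> None:
        # A platform service answers on a results topic instead of calling back.
        bus.publish(SCAN_RESULTS, {**request, "status": "passed"})

    bus.subscribe(SCAN_REQUESTS, scanner)
    bus.subscribe(SCAN_RESULTS, received.append)

    # An outside component (say, a product pipeline) asks for a scan.
    bus.publish(SCAN_REQUESTS, {"product": "storage-a", "tool": "sonarqube"})
    return received
```

Because requester and scanner share only topic names, either side can be replaced or scaled without the other noticing, which is the property the talk attributes to Kafka.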

And now let's go a little bit into what kind of services we provide to our customers. First of all, we have a UI. The UI is a self-service portal where they can enable and disable the services they consume.

So we, as a DevOps team, don't need to be involved in what is running or what they are consuming at any time. If a service is not relevant, maybe they don't need to use it. If it is relevant and they want to block it, they can also block the service. If it's not successful or has issues, they can block its enforcement for the developers.
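One way to picture the self-service toggles is a per-tenant table of service states that the platform consults before running anything. This is only a sketch under assumptions: the three states, the `TenantPortal` class, and the service names are invented for illustration, and treating "block" as its own state is one interpretation of the talk.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class ServiceState(Enum):
    DISABLED = "disabled"  # never requested by the product group
    ENABLED = "enabled"    # runs as part of the group's workflows
    BLOCKED = "blocked"    # explicitly switched off, e.g. a misbehaving check


@dataclass
class TenantPortal:
    """Hypothetical per-tenant view of what the portal stores."""

    tenant: str
    states: Dict[str, ServiceState] = field(default_factory=dict)

    def set_state(self, service: str, state: ServiceState) -> None:
        self.states[service] = state

    def should_run(self, service: str) -> bool:
        # Services the tenant never touched default to DISABLED.
        state = self.states.get(service, ServiceState.DISABLED)
        return state is ServiceState.ENABLED


portal = TenantPortal("storage-a")
portal.set_state("sonarqube-scan", ServiceState.ENABLED)
portal.set_state("container-scan", ServiceState.BLOCKED)
```

The platform team never edits this table; each product group changes its own entries, which is what keeps the DevOps team out of the loop.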

Another thing we have here is a testing flow. Our system can receive test requests from many sources and run the tests on our platform, so they don't need to set up test beds or manage the workflows.

Of course, we also support CI/CD and managing those workflows inside. We have security policies that check what's going on with the product, integrating security tools like Twistlock and Checkmarx, and compliance policies.

We have reports that run on the outputs of all our checkers and systems and provide useful data for our management to understand the status of a product before it is released.
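A report of this kind boils down to folding many checker outputs into one status per product. The sketch below assumes a simple pass/fail result per checker; the `CheckResult` shape and the "ready only if nothing failed" rule are illustrative assumptions, not the actual report logic.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class CheckResult:
    product: str
    checker: str  # e.g. a static-analysis or container-scan step
    passed: bool


def release_status(results: Iterable[CheckResult]) -> Dict[str, Dict[str, object]]:
    """Group checker results by product and flag products with any failure."""
    status: Dict[str, Dict[str, object]] = {}
    for result in results:
        entry = status.setdefault(result.product, {"failed": [], "ready": True})
        if not result.passed:
            failed: List[str] = entry["failed"]  # type: ignore[assignment]
            failed.append(result.checker)
            entry["ready"] = False
    return status


sample = [
    CheckResult("storage-a", "sonarqube", True),
    CheckResult("storage-a", "twistlock", False),
    CheckResult("servers-b", "sonarqube", True),
]
report = release_status(sample)
```

Here `report["storage-a"]` lists `twistlock` as failed and is not ready, while `report["servers-b"]` is ready with no failures.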

We are trying very hard to implement a shift-left policy. Those are all the checkers, the implementations and integrations that we're doing for our customers. We have something like 12 or 15 ready-to-use tools that our customers can in some cases just call from their pipeline, or they can simply enable them through our portal and we will run them without any integration needed on their side. So it's just a free service that we're providing.

All this data is aggregated and stored in a few databases. Each database serves a different use case and a different kind of incoming data: Elasticsearch for our logs and tracing, and Postgres for long-term storage of data that we want to aggregate and build a lot of reports on.

So this is just how it started, very simple, in one view of how we're working.

Because we have so many products, we needed to grow and to be reliable. One of our own requirements was to be able to freeze the configuration for each product group. If they are on a critical path, we cannot change their configuration or upgrade them, so we can freeze that.
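The freeze requirement can be sketched as a rollout that skips any product group marked as frozen. This assumes each product group's deployment carries its own platform version; the `Tenant` record, the `frozen` flag, and the version strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tenant:
    name: str
    version: str          # platform version this product group runs
    frozen: bool = False  # set while the group is on a critical path


def roll_out(tenants: List[Tenant], new_version: str) -> List[str]:
    """Upgrade every tenant that is not frozen; return the names upgraded."""
    upgraded: List[str] = []
    for tenant in tenants:
        if tenant.frozen:
            continue  # leave configuration and version untouched
        tenant.version = new_version
        upgraded.append(tenant.name)
    return upgraded


tenants = [
    Tenant("storage-a", "1.4.0", frozen=True),
    Tenant("servers-b", "1.4.0"),
]
upgraded = roll_out(tenants, "1.5.0")
```

After the call, only `servers-b` has moved to `1.5.0`; `storage-a` stays on `1.4.0` until someone lifts the freeze.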

And because of that requirement, we decided to go with an implementation that looks like this. What does this mean? It means that we are implementing a single-tenant methodology: each customer has its own replicas of all the services that are out there.

This gives us a lot of flexibility and redundancy, and it makes it easy to configure all those services and make small adjustments for each customer. Each one does things a little bit differently, in the business or in how they use the platform, and we even have more secure products that we cannot get into, so they have special permissions for special users.

So all of those need to be adjusted for each of those product groups. And because this runs in the Kubernetes cluster, it's also very easy to manage the resources that everybody gets. In the end, in Kubernetes, it's a namespace. Each tenant gets its own namespace, and we can set the resources it is allowed to consume.

If they have a spike or a rogue process, we can contain it there and it will not impact any of our other customers. So it's business as usual for everybody else, and only that tenant will suffer. They will have a problem, and we'll come and fix whatever is happening in that namespace.
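In Kubernetes terms, namespace-per-tenant containment usually means a `Namespace` plus a `ResourceQuota` object per tenant. The sketch below builds those two manifests as plain dictionaries. The `idp-` name prefix, the label key, and the quota numbers are assumptions for illustration; the talk does not give the actual values.

```python
import json
from typing import Any, Dict, List

Manifest = Dict[str, Any]


def tenant_manifests(tenant: str, cpu: str, memory: str) -> List[Manifest]:
    """Build a Namespace and a ResourceQuota that caps what the tenant can use."""
    namespace_name = f"idp-{tenant}"  # hypothetical naming scheme
    namespace: Manifest = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": namespace_name,
            "labels": {"idp.example.com/tenant": tenant},  # hypothetical label
        },
    }
    quota: Manifest = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota", "namespace": namespace_name},
        "spec": {
            # Caps on the sum of requests and limits across the namespace,
            # so a spike or rogue process cannot starve other tenants.
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
            }
        },
    }
    return [namespace, quota]


manifests = tenant_manifests("storage-a", cpu="8", memory="16Gi")
rendered = json.dumps(manifests, indent=2)  # JSON manifests are accepted by kubectl
```

Keeping the quota in the tenant's own namespace is what makes "business as usual for everybody else" possible: the limit is enforced per namespace, not per cluster.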

So in addition to all this, we are focused on support. We treat our customers like outside customers, real customers, and not like locked-in customers, because they can bypass us. If we don't give them good enough service, they will have their own DevOps. They will go back to the way they did business before.

So we need to focus on that. We have defined SLAs. We practice, "You build it, you own it." So our own developers are the support system. Those are the ones that are on the Slack, on the Teams, responding to any developer need or question.

Most of the time these are small questions, but answering them unblocks developers so they can get on with their day-to-day job. If there is a bug, of course we'll open a Jira ticket, take it offline, and deal with it. Of course, customer bugs have first priority before anything else, because those are blockers for our organization.

So we want happy developers by the end of the day.

Another thing we're doing here, a slightly more technical one, is following GitOps practices and processes. With more than a hundred deployments, we cannot do it manually. So we're using Argo CD to deploy this whole monster, and it has helped us many times in migrations and in growing our deployment.

We will also use it when we need to do a DR (disaster recovery). So for us, a DR, a deployment, or a scale-up is the same thing. Argo CD is there for us and it'll do all the work.
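With Argo CD, each tenant deployment can be described declaratively as an `Application` resource, which is why a new deployment, a scale-up, and a recovery can all reduce to applying the same definitions again. The sketch below generates one such resource per tenant. The repository URL, path layout, project name, and naming scheme are hypothetical; only the field names follow the Argo CD `Application` resource.

```python
from typing import Any, Dict, List

# Hypothetical Git repository holding one directory per tenant.
REPO_URL = "https://github.example.com/idp/deployments.git"


def argo_application(tenant: str, revision: str) -> Dict[str, Any]:
    """Describe one tenant's deployment as an Argo CD Application manifest."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"idp-{tenant}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": REPO_URL,
                "path": f"tenants/{tenant}",
                # Pinning a tag or commit here is one way to hold a tenant
                # at a known state while others move forward.
                "targetRevision": revision,
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": f"idp-{tenant}",
            },
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


def all_applications(revisions: Dict[str, str]) -> List[Dict[str, Any]]:
    """One Application per tenant, in a stable order."""
    return [argo_application(t, rev) for t, rev in sorted(revisions.items())]


apps = all_applications({"storage-a": "v1.4.0", "servers-b": "v1.5.0"})
```

Applying the same list to a fresh cluster recreates every tenant, which matches the talk's point that a DR and a deployment are the same operation.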

One more thing: onboarding is pretty fast. It takes a few days to onboard new members onto the system. And we are trying to make sure that this is also done smoothly by our teams, again by our developers, so they can encounter and fix the problems that they see during the implementation or the onboarding.

Another key factor is that we drink our own champagne. We are customers who also use this platform. So if we're suffering, I bet that our customers are suffering. For our developers it's not optional; they need to use the system, like anyone else. And if someone complains, "We have issues," I tell them, "Go and fix them. It's our system." So we should be customer zero, and we are customer zero for these systems.

I talked about the implementation flexibility. We have meaningful reports, and of course microservices. I think that's it.

So just a few closing remarks about all this: if you are starting such a journey, where do you need to focus?

Identify which developer workflows the organization needs to curate. Start with common tooling, processes, and services, which are often quick wins and easy to see. Focus on balancing developer experience with the business needs. Evaluate the policies that the organization needs to standardize on.

Get your IDP out there. Improve it. Keep a constant cycle of feedback and improvement. Choose technology that can scale and provides an abstraction layer. In our case, it was Kafka.

And the most important one is have fun and don't work so hard.

Thank you very much.