Las Vegas 2023
Experian Engineering

Building a better future with modern technology.

Full transcript

The complete talk, organized by section.

Moied Wahid

Thank you. Thank you. Thank you, Gene. That was really humbling, to be here. Good afternoon, everyone. How's everybody doing? Awesome.

Today I'm going to share how we are building a platform that can run heterogeneous workloads and field learnings on how we did the technology transformation.

Gene already talked about me. Besides working at PayPal, I was leading the technology transformation there. I worked at Yahoo, eBay, Netscape. Netscape was acquired by AOL.

Experian is a purpose-driven company. We are there when people are making life's important decisions, like buying a home, sending kids to college, buying a car. At least these are the very important decisions I made in my lifetime.

As we embark on the modernization journey, why would we build this? There are five key constructs, five key objectives while we are doing this modernization.

We want to build a resilient, scalable platform in the cloud that can support heterogeneous workloads: real-time transactional, batch, and analytics.

The first construct is zero client disruption. What that means is you build feature parity in the new system and make sure the code is backward compatible. And we use something called a strangler design pattern.

In the strangler design pattern, the interface remains the same. Behind the scenes, you modernize the code. And we also did a silent launch. What that means is we run shadow traffic on the new system and ensure the functional behavior of that application remains the same before we route the live traffic. Once we're satisfied with the performance and the functional behavior, we throttle up the traffic.
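To make the silent launch concrete, here is a minimal sketch, not code from the talk. It assumes the legacy and modern systems can both be called as plain functions; `StranglerRouter`, `live_percent`, and `mismatches` are hypothetical names chosen for illustration. The caller always sees one interface, the new system receives a shadow copy of each request, and differences are recorded for review before traffic is throttled up.

```python
import random
from typing import Callable, List, Optional, Tuple

Handler = Callable[[dict], dict]


class StranglerRouter:
    """One stable entry point while traffic moves from legacy to modern."""

    def __init__(self, legacy: Handler, modern: Handler,
                 live_percent: int = 0,
                 rng: Optional[random.Random] = None) -> None:
        self.legacy = legacy
        self.modern = modern
        # Share of requests (0-100) answered by the new system.
        self.live_percent = live_percent
        # (request, legacy response, shadow response) for every difference.
        self.mismatches: List[Tuple[dict, dict, dict]] = []
        self._rng = rng or random.Random()

    def handle(self, request: dict) -> dict:
        # Throttled-up share: the new system answers the caller directly.
        if self._rng.randrange(100) < self.live_percent:
            return self.modern(request)

        legacy_response = self.legacy(request)
        # Shadow call: run the new system too, but never let it affect
        # the response the caller receives.
        try:
            shadow_response = self.modern(request)
        except Exception as exc:
            shadow_response = {"error": repr(exc)}
        if shadow_response != legacy_response:
            self.mismatches.append((request, legacy_response, shadow_response))
        return legacy_response
```

Starting with `live_percent=0` gives pure shadow traffic; raising it step by step corresponds to throttling up once `mismatches` stays empty.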

The second construct is about zero downtime. When we deploy an application, we have a two-by-two pattern, where, at minimum, you deploy in two regions and two availability zones.
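One way to enforce a two-by-two rule is a pre-deployment check. The sketch below is an assumption about how such a check could look, not something described in the talk: it reads "two by two" as at least two regions, each with at least two availability zones, and `meets_two_by_two` and the region and zone names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def meets_two_by_two(placements: Iterable[Tuple[str, str]]) -> bool:
    """Return True if the (region, zone) placements cover at least two
    regions that each have at least two distinct availability zones."""
    zones_by_region: Dict[str, Set[str]] = defaultdict(set)
    for region, zone in placements:
        zones_by_region[region].add(zone)
    qualifying = [r for r, zones in zones_by_region.items() if len(zones) >= 2]
    return len(qualifying) >= 2


plan = [
    ("region-a", "zone-1"), ("region-a", "zone-2"),
    ("region-b", "zone-1"), ("region-b", "zone-2"),
]
print(meets_two_by_two(plan))
```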

For every feature that we roll out, we also have a wire-on, wire-off switch. You can turn it on and off and run the shadow traffic through the new systems.
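A wire-on, wire-off switch can be as small as a lookup consulted at the call site. This is a minimal in-memory sketch under that assumption; `FeatureSwitches` and `run_feature` are hypothetical names, and a production system would typically back the switch with a shared configuration store so it can be flipped without a deploy.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class FeatureSwitches:
    """In-memory on/off switches; unknown features default to off."""

    def __init__(self) -> None:
        self._wired_on: Dict[str, bool] = {}

    def wire_on(self, feature: str) -> None:
        self._wired_on[feature] = True

    def wire_off(self, feature: str) -> None:
        self._wired_on[feature] = False

    def is_on(self, feature: str) -> bool:
        return self._wired_on.get(feature, False)


def run_feature(switches: FeatureSwitches, feature: str,
                new_path: Callable[[], T], old_path: Callable[[], T]) -> T:
    """Run the new code path only while the feature is wired on."""
    return new_path() if switches.is_on(feature) else old_path()
```

Defaulting unknown features to off means a newly shipped code path stays dark until someone deliberately wires it on.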

The third key construct is all about security. First, it's all about encrypting the data at rest, in transit, and in use. We do 2.5 billion encryptions per day.

The fourth construct is cost. We talk about DevSecOps. Since everything is operating in the cloud, it is every developer's responsibility to manage the cost. The total cost of ownership is super important.

The fifth construct is all about leveraging the latest tech. We have an autonomous fleet. We have a configuration-driven infrastructure where you can define what the state of affairs should look like for this application pool with BYE, and then if there is any drift in the system, this fleet will autonomously heal by itself.
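The idea of declaring a desired state and healing drift can be sketched as a reconciliation step. The code below is illustrative only and assumes a pool is described by just an image and a replica count; `PoolSpec`, `detect_drift`, and `reconcile` are hypothetical names, and a real fleet would call infrastructure APIs where this sketch simply returns the desired state.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PoolSpec:
    image: str
    replicas: int


def detect_drift(desired: Dict[str, PoolSpec],
                 actual: Dict[str, PoolSpec]) -> List[str]:
    """List the actions needed to bring `actual` in line with `desired`."""
    actions: List[str] = []
    for name, spec in sorted(desired.items()):
        current = actual.get(name)
        if current is None:
            actions.append(f"create {name}")
        elif current != spec:
            actions.append(f"update {name}")
    # Anything running that is not declared is also drift.
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(f"delete {name}")
    return actions


def reconcile(desired: Dict[str, PoolSpec],
              actual: Dict[str, PoolSpec]) -> Dict[str, PoolSpec]:
    """Return the healed state. Placeholder: a real implementation would
    execute each action from detect_drift against the infrastructure."""
    return dict(desired)
```

Running `detect_drift` on a schedule and calling `reconcile` whenever it returns a non-empty list gives the self-healing loop in miniature.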

When we took stock of everything that we own, the fleet, three patterns emerged. One is rehost and rebuild. The second pattern is re-platform. And the third pattern is refactor and re-architect.

Let me double-click on rehost and rebuild. This pattern is all about the applications and the workloads where there's no code change, the same tech stack, no hardware dependencies, and the applications are cloud ready. You don't have to do an OS upgrade. It's a lift and shift for us.

The second pattern is all about re-platforming. What I mean by that is there's a minimal code change. The lights are really bright here. The tech stack stays the same, but the application server, the database, or the mid-tier software needs to be upgraded, and we need to containerize that and put it in PaaS.

The third pattern is the hardest of all, where major code changes are happening, or you're rewriting the code, or you are re-architecting your infrastructure. You convert the application, containerize it, and deploy it in the new fleet.

This is a snapshot of what the tech stack looks like. When we started five years ago, this is how it looked: a 25-year-old mainframe, a monolithic legacy architecture that depends on proprietary technology.

If you look at the middle stack, the distributed systems, what I mean by distributed systems is the pizza boxes in your data center. These systems are SOA-based applications, and we were doing batch on databases, which is not a great way to do it.

And we started analytics in cloud, and we built the data lake. Imagine, the business of Experian is the data furnishers, the financial institutions. They send us the data, we enrich the data, we curate the data, and we have SOR and the core database, and we replicated the data to cloud.

So let's say somebody asks for Mo's credit report through an API interface. We run hundreds of models to compute the score. It's the same job, the same credit report, if someone wants 100 million users' credit reports. Experian owns 1.2 billion consumer profiles. And when you process hundreds of models to compute a score, and you want to compute that for 100 million users, we process it in a batch system.

And the workload on the cloud is analytics. It's all about finding a needle in a haystack. It runs in a big data stack.

So this is the future state, where we replicated the data using the CDC log, and in under a second we were able to replicate that in the cloud. And the majority of the traffic is served in the cloud right now.
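Change data capture replication comes down to replaying an ordered log of changes on the target. The sketch below is a simplified illustration, not the replication tooling used in the talk; the event shape (`lsn`, `op`, `key`, `row`) and the `Replica` class are assumptions. Tracking the last applied log sequence number makes redelivered events harmless.

```python
from typing import Dict, Iterable, Optional, TypedDict


class ChangeEvent(TypedDict):
    lsn: int              # log sequence number, increasing with each change
    op: str               # "insert", "update", or "delete"
    key: str
    row: Optional[dict]   # new row contents; None for deletes


class Replica:
    """Target copy of a table, kept current by replaying change events."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        self.last_lsn = 0

    def apply(self, events: Iterable[ChangeEvent]) -> int:
        """Apply events in log order and return how many were applied."""
        applied = 0
        for event in sorted(events, key=lambda e: e["lsn"]):
            if event["lsn"] <= self.last_lsn:
                continue  # already applied, so redelivery is safe
            if event["op"] == "delete":
                self.rows.pop(event["key"], None)
            else:
                # Inserts and updates both become an upsert on the replica.
                self.rows[event["key"]] = dict(event["row"] or {})
            self.last_lsn = event["lsn"]
            applied += 1
        return applied
```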

So the transaction workload. Earlier, we had three types of workload. Now we have four workloads. We introduced streaming, where, let's say, if there is a derogatory remark on your credit profile, the consumers can subscribe to the Kafka queue and you'll get a notification so that we can prevent fraud.

So on the batch side, what we did was we rewrote that application in Scala and Spark so that we can leverage at scale. And we had our own implementation of RocksDB, customized, sharded for our application.

In the microservices and software application world, DevOps is table stakes these days. But there is a huge problem in the modeling space, where everything from how you manage your code to how you take it to production, and the infinity loop around that, is really complex.

So I'll go bottom up, in the interest of time.

If you think about how we build a platform, security is a key construct. What I mean by security is four key constructs. One, knowing your user, which means identity management and user access management.

And the second thing is about data. Knowing your data, it's all about data security. It's all about privacy. It's about who has access to the data, how long, and how much.

Experian has hundreds of petabytes of data, so it's very important that we have proper access control there.

And knowing your infra is the third key construct. It's all about vulnerability management, how you're staying ahead of CVEs that you get every single day.

The fourth key construct is knowing the unknown. It's about threat protection, how you're staying ahead of it, DDoS and other things.

The second key construct of the platform is about standardization. What that means is nobody can spin up an EC2 node or a Fargate or EMR cluster by going into the console. Instead, we build a PaaS layer. We wrote hundreds of lines of Terraform code so that it's abstract and there is a standard way of spinning up the cluster, spinning up the fleet.

The reason we did this is that drift detection is super important when things go wrong. And if you have standardization, there is predictability and it's easy to debug.

The third key construct is all about elasticity. How the systems can flex up, flex down based on the workload imposed on it.

The fourth is all about resiliency. Resiliency is about having multiple deployments in different regions.

Observability is a very key thing. If you can't measure, you can't improve. So we collect billions of events per day so that we get a lot of signals, and we stay ahead. We know the problem before the customer finds out about it.

So what we did is we created an MLOps platform where, once you log in, you see all the models that you own or have access to, and if there is model drift, you can jump right in.

The data engineering is all about enrichment, optimization, and publishing. And developers, data scientists can write the code, and we provided an IDE, integrated development environment, using JupyterHub so that they can write the code and, with a single click, they can publish the code into the SCM.

And the code is versioned along with the inputs that went into the model, the training set that went into the model, the datasets that we use for versioning, so that it is auditable, it is reproducible. And the dataset and the source code for the model is versioned along with the artifact of the model.
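Versioning code, data, and model artifact together can be illustrated with content hashes. This is a minimal sketch under the assumption that each of the three inputs is available as bytes; `build_manifest` and the field names are hypothetical, and the platform described in the talk may track far more than this.

```python
import hashlib
import json


def fingerprint(data: bytes) -> str:
    """Content hash, so identical inputs always get the same identifier."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(source_code: bytes, training_data: bytes,
                   model_artifact: bytes) -> dict:
    """Tie the code, dataset, and trained artifact into one versioned record."""
    parts = {
        "source_sha256": fingerprint(source_code),
        "dataset_sha256": fingerprint(training_data),
        "artifact_sha256": fingerprint(model_artifact),
    }
    # The version id is derived from all three hashes, so changing any
    # one input produces a new version.
    canonical = json.dumps(parts, sort_keys=True).encode("utf-8")
    return {**parts, "version_id": fingerprint(canonical)[:12]}
```

Because the version id depends on all three hashes, an auditor can confirm that a deployed artifact came from a specific code revision and training set by recomputing the manifest.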

And we publish this. We have a DTR so that it is all versioned together. There is a universal metastore or feature store where, when data scientists are writing the model, they know what features are available for them to model this.

And we created this polyglot application that can support Python, R, C, and SAS.

This product supports both LLMs and machine learning models, and it also generates an API endpoint through which you can invoke these models, both in real time and in batch.

It allows us to run A/B tests. What I mean by A/B test is it can route partial traffic to different models and you can see how the models are performing, which version is performing better. It also supports champion/challenger, which compares how the new challenger model is performing against the incumbent champion model.
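Partial traffic routing and a champion/challenger comparison can be sketched in a few lines. The example assumes models are plain callables and uses accuracy on labeled examples as the comparison metric; both are simplifications, and `assign_variant`, `accuracy`, and `pick_winner` are hypothetical names rather than the platform's API.

```python
import hashlib
from typing import Callable, Sequence, Tuple

Model = Callable[[int], bool]


def assign_variant(request_id: str, challenger_percent: int) -> str:
    """Deterministically route a fixed share of requests to the challenger.
    Hashing the id keeps the same request on the same variant every time."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "challenger" if bucket < challenger_percent else "champion"


def accuracy(model: Model, examples: Sequence[Tuple[int, bool]]) -> float:
    """Fraction of labeled examples the model predicts correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


def pick_winner(champion: Model, challenger: Model,
                examples: Sequence[Tuple[int, bool]],
                min_gain: float = 0.0) -> str:
    """Promote the challenger only if it beats the champion by more than
    `min_gain`; otherwise the incumbent stays."""
    gain = accuracy(challenger, examples) - accuracy(champion, examples)
    return "challenger" if gain > min_gain else "champion"
```

Requiring a minimum gain guards against replacing the incumbent on a difference that is only noise.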

And we have different data connectors for internal, external data sources.

The model governance and MRM is also built in so that our platform stores and manages the artifacts, the experiments of the models and deployment, and draws a lineage across each of them for transparency and compliance.

And we also have model monitoring that tracks how the model is performing. This is very typical with models. It's a known fact that a model will degrade over time. And we want to maximize the performance of our application in real time. And we provide the feedback loop so that we can retrain, recalibrate, package, validate, deploy, and monitor.
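The talk does not say which drift metric the platform uses. As one illustration, the population stability index (PSI) is a common way to compare a model's current score distribution against a baseline: PSI is the sum over bins of (actual share - expected share) * ln(actual share / expected share). The bin edges, the small floor for empty bins, and the function names below are assumptions made for this sketch.

```python
import math
from typing import List, Sequence


def bin_shares(scores: Sequence[float], edges: Sequence[float],
               floor: float = 1e-6) -> List[float]:
    """Share of scores in each bin; `edges` are ascending cut points.
    Empty bins are floored so the logarithm below stays defined."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = [0] * (len(edges) + 1)
    for score in scores:
        index = sum(1 for edge in edges if score >= edge)
        counts[index] += 1
    return [max(count / len(scores), floor) for count in counts]


def population_stability_index(baseline: Sequence[float],
                               current: Sequence[float],
                               edges: Sequence[float]) -> float:
    """PSI between the baseline and current score distributions."""
    expected = bin_shares(baseline, edges)
    actual = bin_shares(current, edges)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))
```

A PSI of zero means the two distributions fall into the bins identically; larger values indicate more drift, and the level at which to trigger retraining is a judgment call for each model.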

What are the lessons learned from this transformation journey so far?

Transformation is really hard. You really need an irrational optimism, and you need to be relentlessly dissatisfied. And it needs a lot of commitment, a lot of grit to make this happen. Every time you unpack something, there are 200 other things that come up.

So we have to create a blueprint of what the long-term architecture looks like, and we march towards that.

And we fundamentally believe, I come from Netscape background, my code still runs in Mozilla. So we have open architecture. We leverage Spark. There is one incident that we had where Spark was not acknowledging the less for. So we downloaded the Spark code, made the change, and contributed back to open source. I think we remove the vendor dependencies because of the open architecture.

Engineering excellence is super, super important. Writing great code. This is the art. We are artists. We take pride in the code. It's deep engineering. Every time you make a code change, you raise a PR, somebody else reviews the code, and the whole pipeline gets started.

The third key construct is all about talent management. We have a mix of interns, subject matter experts, cloud experts, and people who have battle scars from transformations. We have folks from different, everybody from Silicon Valley in the team.

So the fourth construct is all about creating a culture of experimentation, failing fast, learning faster. This is all about creating a rainforest effect.

What I mean by rainforest effect: during COVID, I was doing some planting in my backyard, doing gardening, and set up the irrigation system. My son and I automated that. It takes a lot of time and a lot of effort. And you have to prune the trees. You have to fine-tune the water supply to the trees.

But if you look at a rainforest, the water flows, the big trees give shelter to the small trees. Everything flows by itself, the whole ecosystem. So we created that kind of a culture.

The last construct is all about customer centricity. Hyper-focused on creating a better customer experience. From the customer's point of view, it's all about commitment to have customer-first approach.

So how does it look after five years?

We went from a monolithic architecture to a microservices architecture. We had a waterfall model. We now have Agile and Kanban. We use the SAFe methodology.

We had scattered prioritization. Now we have streamlined prioritization, where we have voice of the customer. We have product. We have sales engineering. We have a seat at the table. We are building solutions together.

We used to have five releases per year five years ago. Now we can push code every single day.

And our commitment to product and sales is from ideation to live site, we can deliver a product in 90 days.

We went from manual testing to automated, test-driven testing.

We had traditional operations, where developers write code and give it to ops to manage. Today we have shifted that whole thing left. We have CI/CD and DevSecFinOps.

I'm emphasizing FinOps because once you are in the cloud, you can burn money really fast. Every developer today, in all our scrum standup meetings, every product line, every engineering team owns a budget, and they see how they're performing on a daily average.
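Tracking a team's budget on a daily average is simple arithmetic: average the spend so far and project it over the month. The sketch below is illustrative only; `budget_status`, its fields, the 30-day month, and the figures are assumptions, not numbers from the talk.

```python
from typing import Dict, Sequence, Union


def budget_status(monthly_budget: float, daily_spend: Sequence[float],
                  days_in_month: int = 30) -> Dict[str, Union[float, bool]]:
    """Project month-end spend from the daily average observed so far."""
    if not daily_spend:
        raise ValueError("need at least one day of spend")
    spent = sum(daily_spend)
    daily_average = spent / len(daily_spend)
    projected = daily_average * days_in_month
    return {
        "spent": spent,
        "daily_average": daily_average,
        "projected": projected,
        "over_budget": projected > monthly_budget,
    }


# Three days of spend against a hypothetical monthly budget of 3000.
status = budget_status(3000, [100, 120, 140])
print(status)
```

Reviewing the projection at each standup surfaces an overrun early in the month rather than on the invoice.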

And we went from traditional deployments to containerized deployments.

Last, we went from reactive monitoring to SRE. We just heard a really interesting talk on SRE. Super fascinating to learn. And we have an autonomous fleet.

So one ask I have from all of you is we have some really interesting work that we are doing. If you're interested, please check out our job site.

Thank you for your time. Really appreciate it.