MLOps - Accelerating Data Science with DevOps

Las Vegas 2019

The business impact of high-performance DevOps has been widely proven. Unfortunately, as enterprises now try to use big data with machine learning, data science tends to get left behind. The challenges that engineers and data scientists face when developing an ML-based system include reliably deploying models, managing ML assets at scale, and knowing when models are going stale.

Introducing a DevOps approach to machine learning helps solve these challenges by providing structure for collaborating and bringing models into production.

This talk covers real improvements that MLOps has brought to large enterprise customers in maintaining asset integrity across teams, accelerating model operationalization, and enabling a sophisticated AI application lifecycle. We share pointers on how you can follow the same journey in your org.

Jordan Edwards is a Senior Program Manager on the Azure AI Platform team. He has worked on a number of highly performant, globally distributed systems across Bing, Cortana and Microsoft Advertising and is currently working on CI/CD experiences for the next generation of Azure Machine Learning. Jordan has been a key driver of DevOps modernization in AI+R, including but not limited to moving to Git, moving the organization at large to CI/CD, modernizing packaging and build languages, moving from monolithic services to microservice platforms, and driving a culture of friction-free DevOps and flexible engineering. His passion is to continue driving Microsoft towards a culture which enables our engineering talent to do and achieve more.

Shivani Patel is a Program Manager at Microsoft working on MLOps for the Azure Machine Learning Platform team.

Full transcript

Shivani Patel and Jordan Edwards

Shivani Patel: I'm Shivani Patel. I'm a program manager at Microsoft in Azure Machine Learning.

Jordan Edwards: I'm Jordan Edwards. I'm also a program manager at Microsoft, working on Azure Machine Learning.

Shivani Patel: Today we're going to talk about MLOps and basically accelerating your data science workflows with DevOps.

Jordan Edwards: For today, we're going to be covering what is MLOps, what does the MLOps lifecycle look like, and how do customers in the real world use MLOps and get to an MLOps workflow?

Shivani Patel: Show of hands, how many of you have heard of MLOps, used MLOps, or are familiar with it?

Jordan Edwards: About--

Shivani Patel: A little bit. Yeah, like 10% of the room. Some of this might be repeat for some of you, but let's dive into the ML piece of it. What does the machine learning lifecycle generally look like? First you start with getting data, data acquisition, cleaning it up, putting it into a dataset, and then developing, experimenting, and training, and eventually coming to a model that is solving a real business problem. Then you package it up into a format where you can actually run it and get value out of it.

Next you validate it, making sure that it's meeting all accuracy and performance thresholds, and any regulatory compliance that you have in place for that model. Next is the fun part of actually deploying it and making predictions, actually getting the real value out of your model. The last phase is monitoring it. You want to make sure that you're constantly looking at the model, making sure that it is going to perform and give you the value that you expect it to. Once it's degrading and not performing as well, you kick off this training pipeline all over again. That's the machine learning lifecycle in a nutshell.

What is MLOps? That's basically bringing DevOps principles to your machine learning workflow. It's integrating a continuous integration flow into the data science workflow. It's automating the building and testing of your code, creating these repeatable training pipelines, and then providing this continuous deployment workflow, which is automating the validation of the packaged model and deploying it out into your target device, server, or wherever you're deploying your model to. Then it's monitoring not only your pipelines and your infrastructure, but the model performance, as well as the new data that's coming in, and creating this data feedback flow to kick off this pipeline all over again.

Jordan Edwards: Sweet. So how is this different from DevOps? How many of you have heard of DevOps? Okay, just making sure we're at the right conference. As you can see, you bring data and models into your system and it pretty much just implodes it. In a traditional software development lifecycle, you have a small integration test, a tiny unit test, and you're not managing a lot of pieces. When you move into an ML-based lifecycle, you have data tests, model tests, system-level monitoring, model monitoring, and training pipeline monitoring.

Shivani Patel: So it gets a little complex.

Jordan Edwards: A little bit, yeah.

Shivani Patel: What ends up happening is that your ML code is actually such a small piece of your infrastructure. You have so many more assets that you have to manage in a scaled-out machine learning workflow.

Essentially what makes it really different is that you're putting together three different workflows. You have your data engineer who is cleaning up this data and creating these data pipelines; your data science workflow, which is experimenting and creating the model; and then your software engineer who is operationalizing the model. What makes it so different is that the versioning of your assets ends up being very different. You need to be able to version the datasets, the schema, and how it's changing. Along with that, you also need to have the lineage of where your model is coming from, where it's being deployed, and what datasets are being used in your model. You end up having a billion more requirements on additional artifacts that you're creating.

Jordan Edwards: Yeah.

Shivani Patel: The second piece is model reuse. It's very different than reusing software. You need to create a training pipeline where you're caching a bunch of the steps that you're creating, so then you can transfer learn or fine-tune the model that you've created so it stays relevant in that context. Inevitably, models tend to decay over time. Everything in the world is changing around us and data is changing, so you have to go in and update these models over time.

Now, what does the lifecycle look like? Initially, you start by doing this exploratory phase of first getting the data, creating a model, finding an algorithm that works, and then you have this really useful model. You have your data engineers hoping, "Okay, maybe this data acquisition works. I'm going to actually scale it out." Data scientists create the model, and then you have your software engineer or your DevOps engineer saying, "How the heck do I deploy this out? Where do I deploy it? How do I package it?" There tends to be a lot of friction between the personas here.

The next phase is actually reproducing that process of creating your model. This is where the continuous integration piece comes in: turning your training process and your training pipeline into this frozen pipeline that you can reproduce over and over again from wherever. For those here who aren't familiar with data science and data scientists, how do they normally work today? Right now, data scientists are working on their own laptops. They're training in their own environment and tracking their own things. There's no standardized approach to creating your model. They're doing it in their own context, which is really hard to share across a huge team.

Jordan Edwards: A lot of them are researchers or have PhDs. They're not classically trained software engineers, and thus they don't understand things like version control. Taking any of the work they do and bringing it to production is tricky.

Shivani Patel: Exactly. You need to be able to capture all these pieces that we talked about: datasets, the environment that you're working in, the code that you're creating, and all the metrics that come along with creating that model.

The next phase is, let's get this model running somewhere so you're actually getting value from it. Once you push this model into a centralized store, forcing the data scientists to standardize a little bit of their approach and push what they've created into a central store, you kick off the training process of packaging up the model in the context of its deployment environment. Your training environment and deployment environment may be different, so you need to make sure you have that environment, test it with sample input data, and make sure it's behaving the way that you expect it to in the package format. Then you release the model, triggering it and pushing it out into a device, a server, or wherever your deployment target is. That's where your ML engineer comes in.

The last phase is actually automating it and reaching that happy state of MLOps. We start with a data engineer who has a centralized data store. They create a pipeline that's constantly pushing cleaned-up data into a centralized data catalog. That's where your data scientists can pick up the new data and trigger this pipeline that will go through standardized steps of creating a model and compare it to the last known model that was out there, the last known good model. Then they push this new model that's been certified by the data scientist into the centralized model registry. That's where your ML engineer or software engineer comes in, picks it up, and triggers a deployment pipeline that will go through the same steps as before: package it, certify it, and give you a standardized way of testing the model, so you're deploying your model out with confidence. Then maybe you do some A/B testing, which is testing different versions of the model, and monitor the feedback in the outputs that are coming out of the different versions of the model.
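
The "compare it to the last known good model" step can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline shown in the talk: the `ModelCandidate` record, the `accuracy` metric, and the `min_gain` margin are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCandidate:
    """Hypothetical record for a trained model and its validation score."""
    version: str
    accuracy: float  # assumed to be measured on a shared hold-out set


def should_promote(candidate: ModelCandidate,
                   last_known_good: ModelCandidate,
                   min_gain: float = 0.0) -> bool:
    """Promote only if the candidate beats the last known good model by more than min_gain."""
    return candidate.accuracy - last_known_good.accuracy > min_gain


# Made-up scores, for illustration only.
baseline = ModelCandidate(version="v7", accuracy=0.91)
challenger = ModelCandidate(version="v8", accuracy=0.93)

if should_promote(challenger, baseline, min_gain=0.01):
    print(f"register {challenger.version} in the model registry")
else:
    print(f"keep {baseline.version}")
```

In an automated pipeline a check like this would sit between training and registration, so only models that clear the bar reach the deployment stage.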

Jordan Edwards: How do I do A/B testing with machine learning models?

Shivani Patel: With A/B testing, you'll have a scoring endpoint, basically the way that you call the model, and you'll have a bunch of different versions of the model behind that same scoring endpoint. When you call it, you'll configure the traffic of data going to each of those versions, so then you can ramp up and down according to the performance of those models.
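
As a rough sketch of that traffic configuration, the Python below routes each request to one of several model versions according to configurable weights. The version names and the 90/10 split are assumptions made for the example; a real scoring endpoint would do this routing in its serving layer.

```python
import random
from collections import Counter


def pick_version(traffic_split: dict[str, float], rng: random.Random) -> str:
    """Choose a model version with probability proportional to its traffic weight."""
    versions = list(traffic_split)
    weights = [traffic_split[v] for v in versions]
    return rng.choices(versions, weights=weights, k=1)[0]


# Hypothetical split: 90% of requests to the current model, 10% to the new one.
traffic_split = {"model-v1": 0.9, "model-v2": 0.1}

rng = random.Random(0)  # seeded so the simulation is repeatable
counts = Counter(pick_version(traffic_split, rng) for _ in range(10_000))
print(counts)
```

Ramping a version up or down then amounts to changing the weights in `traffic_split`, with no change to how callers reach the endpoint.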

Jordan Edwards: So the model behaves like any other microservice. Okay, cool.

Shivani Patel: Exactly. You want to bring in the data that you're getting, pushing it back into your central data store, so then you have fresh, new, current data that you can train your models on.

Jordan Edwards: Can you describe what the data analysis services are?

Shivani Patel: Yes. Basically, for models that are out in production, the data that's coming into the model can change over time. That will change the performance of the model because it's getting these unexpected inputs. You need to be able to monitor those pieces, catch the performance degrading right away, and kick off retraining pipelines as you see the model performance decaying over time with the new data.

Jordan Edwards: Let's say you've got a model running on an offshore oil rig, detecting if a valve or a compressor is about to blow. Temperature goes from spring to summer. It's now much hotter. The sensor values are higher. You're going to get a lot of false positives that there's an incident. Those types of signals are what looking at drift on the model is all about.

Shivani Patel: You want to make sure it's staying relevant in that context as time and the world change around us.
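
One very simple way to flag the kind of seasonal shift described in the oil rig example is to compare recent sensor readings with the readings the model was trained on. The sketch below measures how far the recent mean has moved, in units of the training data's standard deviation. The temperature values and the threshold of 3.0 are made up for illustration, and production drift monitoring typically uses richer statistics than a mean shift.

```python
from statistics import mean, stdev


def drift_score(baseline: list[float], recent: list[float]) -> float:
    """Shift of the recent mean from the baseline mean, in baseline standard deviations."""
    shift = abs(mean(recent) - mean(baseline))
    spread = stdev(baseline)
    if spread == 0:
        # Constant baseline: any shift at all counts as unbounded drift.
        return 0.0 if shift == 0 else float("inf")
    return shift / spread


def has_drifted(baseline: list[float], recent: list[float],
                threshold: float = 3.0) -> bool:
    """Flag drift when the mean shift exceeds the chosen threshold."""
    return drift_score(baseline, recent) > threshold


# Illustrative sensor temperatures (degrees Celsius), not real data.
spring_temps = [18.0, 19.5, 21.0, 20.0, 18.5, 19.0, 20.5, 21.5]
summer_temps = [29.0, 31.5, 30.0, 32.0, 30.5, 31.0, 29.5, 33.0]

print(has_drifted(spring_temps, spring_temps))
print(has_drifted(spring_temps, summer_temps))
```

A positive result would be the signal to kick off the retraining pipeline rather than to trust the model's alerts as they are.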

Jordan Edwards: I'm going to talk about how Microsoft at a macro level tries to address the issues in the MLOps space, and then run through a few customer examples to give you a flavor for what MLOps looks like in the wild.

Shivani touched on the key problems of MLOps. How do you reproduce models when data scientists are largely researchers coding on their laptops? How do you reproduce predictions? Say a bank is running a model that is determining if they're going to approve or deny a person a loan. They need to be able to demonstrate exactly why the model gave that prediction for approved or rejected. You need to be able to trace back exactly which data and which features were used to train this model in the first place, and prove that things like fairness testing and bias testing were done, especially when you're dealing with these highly regulated industries.
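
To make the traceability requirement concrete, here is a minimal Python sketch of a lineage record kept alongside each registered model. The `ModelRecord` fields, the in-memory `ModelRegistry`, and every value in the example are hypothetical; a real registry would persist this metadata and link it to the stored artifacts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical lineage record stored alongside a registered model."""
    name: str
    version: int
    dataset_version: str  # which data snapshot was used for training
    features: tuple       # which input features the model saw
    code_commit: str      # which revision of the training code ran


class ModelRegistry:
    """In-memory stand-in for a central model registry."""

    def __init__(self):
        self._records = {}

    def register(self, record: ModelRecord) -> None:
        self._records[(record.name, record.version)] = record

    def lineage(self, name: str, version: int) -> ModelRecord:
        """Look up exactly what went into a given model version."""
        return self._records[(name, version)]


registry = ModelRegistry()
registry.register(ModelRecord(
    name="loan-approval",
    version=3,
    dataset_version="applications-2019-09",  # made-up identifier
    features=("income", "loan_amount", "credit_history_months"),
    code_commit="abc1234",                   # made-up commit id
))

record = registry.lineage("loan-approval", 3)
print(record.dataset_version, record.features, record.code_commit)
```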

When it comes to operationalization and automation of the ML lifecycle, that's about more than just the data scientist, the software engineer, or the data engineer in isolation. It's about how all three of them work together in a collaborative fashion. Often, the data you have when the data scientist trains a model isn't available when you're trying to make predictions, so how do you ensure you can have that same data, quality of data, availability of the data, and also that the model is performant? If you're trying to have a real-time system detecting if there's a giant spill on a factory floor, that model can't take 10, 15 minutes to run in a real-time context. How do you operationalize and automate three different personas, oftentimes using different tools and technologies together? The answer is, hopefully, with DevOps.

There's also how you do collaboration within and across teams on your ML workflow. As Shivani mentioned earlier, you can't share models around like you can share normal software packages. You need to actually share a pipeline along with the model that can be used to reproduce it and tune it based on the data specific to that scenario. One example is a model detecting anomalies in video feeds, like the workplace safety one I mentioned earlier. That model might be trained for a factory in Beijing, but if you want to use that same model for a factory in the U.S., you need to customize and tune that model based on footage of what the video feeds in that factory in the U.S. are going to look like.

Another thing we've seen that is really common with large enterprises is that each of their organizations has independent data science teams running projects and experiments right now, and lots of them are working on very similar workflows, maybe 80% or 90% the same. How do you collaborate and share on, "I'm using these datasets to train these types of models and solve these types of problems"?

When it comes to enterprise readiness, there are highly regulated industries and questions of governance when dealing with data and machine learning, especially with deep learning, where models are so convoluted and deep that even people with PhDs can't fully explain what the models are doing. How do you do compliance and infrastructure as code when you start to deal with specialized types of hardware? Cost management comes into play if you're using compute with large amounts of GPU, large amounts of memory, and jobs that take a really long time to run. It's not like a normal software build where it takes a few minutes. Some of these jobs can take days or even weeks to run, potentially, to get a good model. So how do you trust that?

From a Microsoft and Azure point of view, we have this recommended flow, which involves using these three different technologies together. Data Factory, Machine Learning, and DevOps services each have pipelines that are optimized for the different personas that are involved in this flow. You have Data Factory for your data engineers, Azure Machine Learning for your data scientists, and DevOps for your DevOps professionals. Those are the three personas, hopefully working happily together to get the end-to-end flow going.

As for what Azure, or any platform, needs in order to manage MLOps flows effectively, it starts with keeping track of your infrastructure, your code, and your datasets. Data versioning is super important. You can't dump all your data in a Git repository, turns out, and you can't really diff large binaries in Git. You need ways to track metadata, profile, hash, and compare changes in your data over time to determine: Has my data changed enough to warrant training a new model? Has my data changed so much that my training pipeline is now useless because half the columns used as features have been dropped from the tabular data?
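
The hash, profile, and compare idea can be illustrated with the Python standard library alone. In this sketch a dataset snapshot is reduced to a content hash, its column list, and a row count, and two snapshots are compared to spot dropped feature columns. The CSV contents and function names are invented for the example, and a real data versioning system would track far more metadata than this.

```python
import csv
import hashlib
import io


def fingerprint(csv_text: str) -> dict:
    """Return a content hash, column list, and row count for a CSV snapshot."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    return {
        "sha256": hashlib.sha256(csv_text.encode("utf-8")).hexdigest(),
        "columns": header,
        "row_count": len(body),
    }


def dropped_columns(old: dict, new: dict) -> list[str]:
    """Columns present in the old snapshot but missing from the new one."""
    return [c for c in old["columns"] if c not in new["columns"]]


# Two made-up snapshots of the same tabular dataset.
v1 = "sensor_id,temperature,pressure\n1,20.5,101.3\n2,21.0,100.9\n"
v2 = "sensor_id,temperature\n1,20.5\n2,21.0\n3,19.8\n"

old, new = fingerprint(v1), fingerprint(v2)
print("content changed:", old["sha256"] != new["sha256"])
print("dropped columns:", dropped_columns(old, new))
print("row count:", old["row_count"], "->", new["row_count"])
```

Storing only the fingerprint in version control, and the data itself elsewhere, sidesteps the problem of diffing large binaries in Git.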

Tracking your environments is also important. Say I train a model on my laptop. I want to know the exact state of the world when I trained it there. What were all the Python packages? Was I using Docker? If so, what Docker image was I running? How do you easily shift that from the training to deployment side of the house?
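
As a small illustration of capturing "the exact state of the world", the Python sketch below records the interpreter version, the operating system, and the installed package versions using only the standard library. It is one possible approach under simple assumptions: it does not capture Docker image details, and the function name and output layout are chosen for the example.

```python
import json
import platform
from importlib import metadata


def capture_environment() -> dict:
    """Snapshot the interpreter, OS, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that report no name
            packages[name] = dist.metadata.get("Version")
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = capture_environment()
# Print only the start of the snapshot; the package list can be long.
print(json.dumps(snapshot, indent=2)[:400])
```

Saving a snapshot like this next to each training run gives the deployment side something concrete to reproduce.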

You also need to track all your runs and experiments. Another common thing we'll see from enterprises is, "This data scientist left my team. I have no idea what they did. All their work is gone," because it was just sitting in a Jupyter Notebook on the laptop. Having an easy way to track your runs in the cloud should not impede the agility of your data scientist, but should give you a way to centrally track and understand the work that's going on, the types of experiments that are being run, and the types of models that are being produced. And of course, there are the models themselves. It's important to be able to share and reuse those and integrate those models and events around those models into your end-to-end ML-infused application lifecycle.

When it comes to making models less of a black box, you need ways to explain how models are behaving. We have a few different approaches that we try inside Microsoft, mostly using open-source explainers called SHAP and LIME. They analyze, for a given machine learning model, the features it uses and the relative importance of each feature to the prediction the model is making. It's super useful if you're talking about any highly regulated industry, whether it's financial services or healthcare.

You need to profile the model, determine how long the model takes to run, and deploy it to a variety of contexts, whether that's as a real-time API, to an edge device, or as part of a larger data pipeline. Most customers we've seen right now are mostly working on getting their models integrated into data pipelines in batch and running them in more of an offline fashion to help build trust around the model, with an eventual goal of getting those models running in real time on the edge, closer to the devices, making predictions faster, and adding more business value. For enterprise lifecycle management, there are data pipelines, training pipelines, release pipelines, and eventing to connect everything together end to end.
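
Profiling how long a model takes to run can start with something as simple as timing repeated calls. The Python sketch below reports median and 95th-percentile latency for any callable; `toy_model` is a stand-in invented for the example, and real profiling would use representative inputs on the target hardware.

```python
import statistics
import time


def profile_latency(predict, sample, runs: int = 50) -> dict:
    """Call predict(sample) repeatedly and report latency in milliseconds."""
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    # Nearest-rank style index for the 95th percentile.
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return {
        "runs": runs,
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
    }


def toy_model(features):
    """Stand-in for a real model: a weighted sum of the inputs."""
    weights = [0.4, 0.3, 0.3]
    return sum(w * x for w, x in zip(weights, features))


report = profile_latency(toy_model, [1.0, 2.0, 3.0])
print(report)
```

Comparing the reported numbers against the latency budget of the target context (real-time API, edge device, or batch pipeline) helps decide where a model can be deployed.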

As far as documented best practices around how to do this, there's actually a repository on GitHub you can look at that has this whole flow set up, including ARM templates to deploy everything on Azure. Basically, you have your data engineers working in tools they're comfortable with, dropping data into blob storage or a SQL database. You have your data scientists checking code into Git with built-in experiment tracking. When they're ready to say, "I think this model is good," all they do is submit a pull request to the master branch of the Git repository. That will run code quality checks, data checks, unit tests, and build and publish the reproducible training pipeline that's then shared along with the model. From the model registry, you tie into your DevOps release pipeline, whether that's Azure DevOps or anything else. It doesn't really matter. The whole goal is to show how this flow can work end to end and can go from a data scientist doing something in a notebook locally to the code checked in, a reusable training pipeline published as an official model, and full end-to-end lineage to figure out where everything came from in that flow. That GitHub repo at the bottom shows how you can actually do it.

To bring ML workflows to production, you need scalable compute and storage to train your models, the ability to manage all the assets, collaborate and share on them, package and validate your workflows before bringing them out, and deploy and serve them at scale. Our general recommendation from a Microsoft point of view is to use containers whenever you can. It's easy to encapsulate all your dependencies and easy to build it once and run it anywhere, whether it's in the cloud on a Kubernetes cluster or running on that offshore oil rig I mentioned earlier. Then monitor the models when they're deployed and know, as Shivani mentioned, when you should retrain them.

Now, just a few customer stories about how our customers are actually doing MLOps. This is a transportation company in Canada, and they're using 16,000 models at scale, one model for each of the bus stations to determine bus departure times more accurately. They're called TransLink; you can go look them up. Basically, they use DevOps and Azure Machine Learning together to submit a request to pull all the data in, train the model, and do the scoring in batch. Based on the results of that, they'll tune their calculations around estimated delays and how long it's going to take for the buses to arrive or depart. This is an example for more classical government industries, how you can take a pipeline, train it for one bus station, and then scale it out to 16,000. It's how you take that flow and bring it up and out and really help do production ML. We've seen the same thing with customers who will train a model for one retail store, then take that same training pipeline and apply it against all the data for all their stores in the country or around the world. This is how you get that value.
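
The pattern of building one training pipeline and fanning it out across many entities can be sketched as follows. This is a toy illustration in Python, not TransLink's actual system: the "model" is just the average observed delay per stop, and the stop identifiers and delay values are made up.

```python
from collections import defaultdict
from statistics import mean


def train_station_model(delays_minutes: list[float]) -> float:
    """Toy 'training' step: the model is just the average observed delay."""
    return mean(delays_minutes)


def train_all(records: list[tuple[str, float]]) -> dict[str, float]:
    """Run the same training step once per station."""
    by_station = defaultdict(list)
    for station_id, delay in records:
        by_station[station_id].append(delay)
    return {station: train_station_model(delays)
            for station, delays in by_station.items()}


# Made-up (station, delay in minutes) observations.
records = [
    ("stop-001", 2.0), ("stop-001", 4.0),
    ("stop-002", 1.0), ("stop-002", 0.0), ("stop-002", 2.0),
]

models = train_all(records)
print(models)
```

Because the training step is written once and parameterized by station, scaling from one station to thousands is a matter of feeding in more data, not writing more pipelines.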

Another customer in the retail sector is using MLOps to ship recommender systems. In this case, they use Azure Machine Learning and DevOps together to customize, train, and publish models. They deploy the model as a real-time API on Kubernetes, and they also use the model inside a Spark pipeline to generate static recommendations on products, which they push into Cosmos DB. Those are served from the website, the mobile app, or whatever. The whole point is that it's the combination of machine learning services and DevOps services that allows them to do this at production scale. And because I ran out of space, the data engineer is an important part of that: making sure the Spark pipelines are working properly, making sure all that data in the data lake is available and clean. Those are all super important roles.

Another example is MLOps for running predictive maintenance at the edge, again taking machine learning and DevOps together. In this case, they have the maintenance models deployed and running on IoT Edge devices, taking in all the sensor data as it arrives and determining if they want to send an operator to the offshore oil rig to take a look and see what's going on. The same feedback collection process applies: the operator can say, "That was a good prediction" or "That was a bad prediction." Those signals are fed back in to help improve the model over time.

Shivani Patel: A quick recap. The key takeaway is that it's not just using a bunch of technologies together. It's bringing this workflow of three different personas to collaborate together and scale out. Essentially, it's a lifestyle that we want to introduce with MLOps. It's not about impeding existing workflows, but about bringing all these workflows together to collaborate and push their data and artifacts, and what they're creating, into centralized stores that other personas can go in and pick up.

Jordan Edwards: If you're looking at how to do digital transformation in your organization and how to bring IT into the fold for data science, the recommendation is to not try to force a ton of existing software engineering practices onto data scientists. Instead, give them tools to make it easier for them to get compute in the cloud, make it seamless for them to track their work, and make it more apparent to them that when they're training these models, someone is actually going to use them. One of the most important pieces that we bring up when talking to customers is that 88% of models that are trained in the enterprise never actually make it to production. Unless you put these MLOps processes into place, your organizations are just never going to get there, really.

That's all our content, so thank you.