DevOps for AI
Because the AI field is young compared to traditional software development, best practices and solutions around life cycle management for these AI systems have yet to solidify. This talk will discuss how we did this at Microsoft in different departments (one of them being Bing).
Gabrielle Davelaar is a Data Platform Solution Architect specialized in Artificial Intelligence solutions at Microsoft. She was originally trained as a computational neuroscientist. Currently she helps Microsoft’s top 15 Fortune 500 customers build trustworthy and scalable platforms able to create the next generation of A.I. applications.
While helping customers with their digital A.I. transformation, she started working with engineering to tackle one key issue: A.I. maturity. The demand for this work is high, and Gabrielle is now working on bringing together the right people to create a full offering.
Her aspiration is to be a technical leader in the healthcare digital transformation, empowering people to find new treatments using A.I. while ensuring privacy and taking data governance into consideration.
Jordan Edwards is a Senior Program Manager on the Azure AI Platform team. He has worked on a number of highly performant, globally distributed systems across Bing, Cortana and Microsoft Advertising and is currently working on CI/CD experiences for the next generation of Azure Machine Learning.
Jordan has been a key driver of DevOps modernization in AI+R, including but not limited to: moving to Git, moving the organization at large to CI/CD, modernizing packaging and build languages, moving from monolithic services to microservice platforms, and driving a culture of friction-free DevOps and flexible engineering.
His passion is to continue driving Microsoft towards a culture which enables our engineering talent to do and achieve more.
Full transcript
Gabrielle Davelaar
What does DevOps for AI mean? Well, I don't have to explain what DevOps is. I'm assuming everyone knows this. But it's a bit different when we're talking about DevOps for AI, because the traditional way of doing DevOps doesn't work that well for AI.
What we see for this is that a CI/CD solution requires reproducibility, validation, storage and versioning, deployment tracking, and data collection, and all these things because eventually, if you have this really cool, awesome model, you want to bring this into production. And this is where we see a lot of customers fail big time. We actually call it the disappointment valley. Why? Because people start very enthusiastically with this model. They hire a data scientist, and the data scientist is going to work on some dataset and then figures out the model and then says, "Okay, I'm done. Here you go. You can have it in production." And a typical software developer will say, "Yeah, not so much. It definitely doesn't work this way. We have to start all over again." So you get two annoyed people. You get the software developer saying, "What the f**k are you actually presenting here?" And I'm saying, as the traditional data scientist, "Well, I did my job. I gave you the model. That's what you were asking me."
Jordan Edwards: Right. Or, "Here's a Jupyter Notebook. Figure out how to make it work in production."
Exactly. So, how to make this work?
These are also the trends that we saw. People are dealing with suboptimal knowledge: not knowing exactly what they're doing, whether that's the business delivery manager, the data scientist, or the engineer. And that also becomes a problem when you have to show a return on investment. People want to see their money coming back to them, especially those business owners that are saying, "Yeah, I'm going to give you that million." They want to see their return on investment.
And then what we also see is people just start at random with a dataset. They think, "Oh, this works. Definitely going to use that," put it in, and then at some stage are actually unable to replicate it.
Then we have another problem, the evil black box. GDPR, everyone knows that. It's a major pain point for a lot of big customers, especially the customers that I'm working with. They have to show regulators how they built their models and how they can replicate them. If a model decides whether I should get my mortgage, then if I got rejected, I at least want to know why I got rejected. And here you have a pain point, because how am I going to show that if I don't actually know much about how I built my model, how I can reproduce it, et cetera?
And then we have this problem that we're all familiar with: "I've always done this this way." And this problem actually is with a lot of data scientists, and I am one of them. Previously, coming out of college, I ran my model, and then I got into a huge fight, actually, with one of the engineers saying, "Well, I don't know what you're doing, but I cannot put this in production." So my model actually ended up in a PowerPoint presentation. And well, we all know what happens with PowerPoint presentations.
So yeah, that was definitely a disappointment, but I also soon realized that actually the data engineer that I was talking to and the software developer were actually making a point, and that also got me to the stage of saying, "Well, we have to do something about this." So I moved to Microsoft actually to work on this and see how we can bridge this gap.
So how can DevOps for AI help in this space? Well, it can provide you an overview. It can provide you an overview on the resources, the ability to know more on how much resources are actually used. Is this really helping, or do we have to shut down? Because again, the business owner wants to know, do I get my return on investment?
And now we're getting to the digital audit trail. The kind of trail that you can run back along, saying, "If I start all over again, I can reproduce this whole model, and I know exactly which dataset I've been using and who has been working on it and where it all started." And this is very important, actually, for regulators. They want to know these kinds of things. They want to know how you came to a certain decision. And if you can show that regulator, especially in the financial sector, "This is how I got to my model," then even if there is a mistake in there, they won't be as harsh on you as they otherwise would, because you are able to replicate it. Everyone makes mistakes, but if you can at least track it down and follow what happened, it will definitely help save your... How to put it politely?
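A minimal sketch of what one entry in such an audit trail could record, assuming the dataset is available as bytes and the code version is a Git commit hash. The function names, field names, and sample values are illustrative assumptions, not something shown in the talk.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw dataset bytes."""
    return hashlib.sha256(data).hexdigest()


def audit_record(dataset: bytes, code_version: str, author: str) -> dict:
    """Build one audit-trail entry linking a model run to its inputs."""
    return {
        "dataset_sha256": fingerprint(dataset),
        "code_version": code_version,  # for example, a Git commit hash
        "author": author,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical inputs, for illustration only.
record = audit_record(
    b"age,income,approved\n34,52000,1\n",
    "abc1234",
    "data.scientist@example.com",
)
print(json.dumps(record, indent=2))
```

Because the fingerprint depends only on the dataset contents, two runs on the same data produce the same hash, which is what lets a later reader confirm which dataset a model came from.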
Okay. And then building transparent models. You also want to know what your model is doing, and you want to know if there's an algorithmic bias in there. Algorithmic bias can arise if your dataset is skewed to one side and there's no equal distribution in it.
And then the data science unicorns. Everyone knows that. At least I hope so. Everyone has probably had the experience of working with a data scientist fresh out of university who works the way you do in university, because obviously they have not been trained in traditional software development. So if you start talking about CI/CD or unit testing, you will see blank faces.
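One simple way to spot the kind of skew mentioned here is to look at how the labels are distributed before training. This is only a sketch: the 80% threshold and the label names are assumptions, and a balanced label column does not by itself rule out bias.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of examples that carry each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def is_skewed(labels, max_share=0.8):
    """Flag the dataset if any single label exceeds max_share of the rows."""
    return max(class_shares(labels).values()) > max_share


# Hypothetical mortgage decisions: 9 approvals, 1 rejection.
labels = ["approved"] * 9 + ["rejected"]
print(class_shares(labels))
print(is_skewed(labels))
```

The same check can be run per subgroup (for example, per region or age band) to see whether one group is under-represented.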
Jordan Edwards: Even Git.
Even Git. I even got a question like, "Is this an acronym? And what kind of acronym is this? Is this a new model that I have to know of?" No, it's actually not a model, but it will definitely help you to get there. So yeah.
That brings us to how to do it in a much more mature way. You want to do model control, model validation, model versioning, model storage. Storage is so important because ultimately it will help you trace back how you came to that model. And you have to do that in quite an organized way, because otherwise you will have a swamp of so many models and so many datasets, and if you're unable to connect them, you still have a swamp, and you are still unable to create a digital audit trail and a model deployment.
So now we're getting somewhere. We're going to go a little bit more into the tech side. What does it mean if you are combining DevOps with AI? You want to kind of have three phases: experiment, develop, operate. And you want to prove feasibility and build it and then operate it and eventually scale it, because ultimately, that's also the goal that we want to achieve. We want to make it possible for people to have a successful AI production that can scale on a large scale.
So during Ignite, we actually had a very awesome case that we have been working on. Shell has been showing a very cool case where they showed using an AI model at retail stations to kind of predict behavior, and they want to scale this to 45,000 retail stations. That means that you have 45,000 retail stations that all have their own models and all have their different situations, but still wanting to bring that together and still want to learn from that. That's a lot.
So now we're getting to an awesome stage where I'm going to hand it over to my fellow engineer, who's actually going to show you how we do this. Do I click it? Okay.
Jordan Edwards
Cool. So again, from a data science point of view, these are sort of the three major steps that we have in creating models. There's the data preparation step, where you're going and taking data out of your lake, shaping it, extracting the features that you care about to go and build the model. There's the experimentation step, where you use the IDE of your choice, submit a job on some type of compute, and try to find a model that actually solves your problem. Then we get to, once I have a good enough model, I want to be able to register it and track it and allow my developers to be able to use it in production scenarios. And so that's sort of the three major steps we have there.
The model lifecycle. So from a data scientist point of view, this is really what they care about. They are taking data in, creating a model, publishing it. They may be customizing an existing model. But from their point of view, all they really care about is the model asset itself. The deploying of the model is sort of a black box to them. And the model may be used in a variety of different places. You could be deploying it to the cloud. It could be running as part of a larger data pipeline, like inside of a Spark pipeline, or it could be deployed on edge devices, as Gabrielle mentioned is the case with Shell. And so you have this whole funnel. This is the data scientist point of view. Now we're going to talk about when we introduce the developers, what changes.
So when we talk about breaking the wall and lifecycle convergence, we have the app developer flow, which I'm sure you're all very familiar with. So you have IDEs, source control, CI/CD, going to the cloud. Data scientists, again, it's a bit different, more simplified. From their point of view, they're just building a model that's going to solve a problem. They might have a little test app or a couple of cells in their Jupyter Notebook that's showing that the model works, or they may write a paper about showing how the model works, but that's all they really care about.
When we talk about normal applications versus AI-infused applications, you'll see two new assets which enter the domain. One of them is data, and the other is the model. So before you get started with creating a new version of a model or building your first model, you need to analyze and see, has the data changed? Has the profile of the data changed? Is one of the columns not there anymore or all zeroed out? Do I have enough data to go ahead and build a new model? Then you have to analyze the model itself, comparing to the previous version. It's not like with software where your unit tests pass, and you're good to ship. You might have, say, higher precision but lower recall on what you're trying to do. Then you talk about testing the model in the application. Again, the model is usually built on a different stack than the rest of the app. So how do I make sure that I have those same features available in my application so I can actually use the model to predict? And so all these things just make it tricky.
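The data questions above (is a column missing, is it all zeroed out, is there enough data) can be sketched as a pre-training check. The rows-as-dictionaries layout and the column names are assumptions made for illustration.

```python
def profile_issues(rows, required_columns, min_rows):
    """Return a list of human-readable problems found in the training data."""
    issues = []
    if len(rows) < min_rows:
        issues.append(f"only {len(rows)} rows, need at least {min_rows}")
    for column in required_columns:
        values = [row.get(column) for row in rows]
        if any(value is None for value in values):
            issues.append(f"column '{column}' is missing in some rows")
        elif rows and all(value == 0 for value in values):
            issues.append(f"column '{column}' is all zeros")
    return issues


# Hypothetical rows: 'income' is zeroed out and 'tenure' has disappeared.
rows = [
    {"age": 34, "income": 0},
    {"age": 51, "income": 0},
]
print(profile_issues(rows, required_columns=["age", "income", "tenure"], min_rows=2))
```

An empty list means the data passed these particular checks; a pipeline could refuse to start training otherwise.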
So step one, this is what I call the nasty handoff phase, where the data scientist throws the model over the wall to the developer, says, "Hey, make this work in the application. I think I've got something that's better than your conditional statement you have in your code." That usually takes on the order of a couple of months today to be able to actually use the model in a real application in production, even longer for some teams, depending on the compliance concerns that are involved, what type of data is in the model.
After that, after a little bit of back and forth between the developers and data scientists, we usually get to this step here, where the developers say, "Hey, at least put the code you're using to generate this model in source control somewhere so I can be able to reproduce the model." And so most of the customers we're working with today are just at this phase now, where they're trying to be able to automate the training process and have reproducible models, not just a path to something in a data lake somewhere saying, "Hey, here's the model. I don't know where it came from or what data's inside of it." That's very dangerous.
And along with that, you can now throw things like unit tests on your code. So before I go and waste eight hours of GPU compute time training a new model, how about I make sure the code actually passes all the tests first? It's surprisingly common, we'll see somebody burn hours or days of compute time, and there was a bug in their code. So how do you short-circuit that?
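As a sketch of that short-circuit, the snippet below runs unit tests on a featurization function and only proceeds to training if they pass. The `scale_to_unit` function is a hypothetical stand-in for real training code, and the final print statements stand in for submitting or skipping the job.

```python
import unittest


def scale_to_unit(values):
    """Hypothetical featurization step: min-max scale values into [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # avoid dividing by zero on constant input
    return [(value - low) / (high - low) for value in values]


class ScaleToUnitTest(unittest.TestCase):
    def test_range(self):
        self.assertEqual(scale_to_unit([10, 20, 30]), [0.0, 0.5, 1.0])

    def test_constant_input(self):
        self.assertEqual(scale_to_unit([5, 5]), [0.0, 0.0])


def tests_pass():
    """Run the suite in-process and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ScaleToUnitTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


if tests_pass():
    print("tests passed, safe to submit the training job")
else:
    print("tests failed, skipping the training job")
```

In a CI pipeline the same gate is usually expressed by ordering the steps, so the training step never runs when the test step fails.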
Next, we talk about, now that I have this process to get the model trained automatically, where do I store the model? How do I version the model? Do we apply standard semantic versioning concepts like we do in a packaging environment? From a lineage point of view, do I need to trace what dataset went into my model, what code went into my model, which compute was used to train my model? Because that also has potential concerns on the compliance side. The second your data or your customer's data enters a different compute context, how do you handle that? And so that's where we start to talk about lifecycle management as well. The goal from a model CI/CD point of view is to give you the controls and knobs to be able to effectively manage that lifecycle.
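To make the versioning and lineage questions concrete, here is a minimal in-memory sketch of a registry that stores a semantic version together with the dataset, code, and compute that produced each model. Every name and value is hypothetical, and when to bump major, minor, or patch for a model is a team decision this sketch does not settle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: str         # MAJOR.MINOR.PATCH
    dataset_id: str      # which data went in
    code_commit: str     # which code went in
    compute_target: str  # where it was trained


def bump(version: str, part: str) -> str:
    """Return the next semantic version for part in {'major', 'minor', 'patch'}."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part}")


registry = {}  # maps (name, version) to its lineage record


def register(model: ModelVersion) -> None:
    """Store a model version, refusing to overwrite an existing one."""
    key = (model.name, model.version)
    if key in registry:
        raise ValueError(f"{model.name} {model.version} is already registered")
    registry[key] = model


register(ModelVersion("churn", "1.2.0", "customers-2018-10", "abc1234", "gpu-cluster"))
register(ModelVersion("churn", bump("1.2.0", "minor"), "customers-2018-11", "def5678", "gpu-cluster"))
print(sorted(version for _, version in registry))
```

Refusing to overwrite an existing version keeps each entry immutable, so a version number always points at the same data, code, and compute.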
The final step we get to, the happiest path, is when you actually have feedback flowing from your model deployed across a variety of targets, and that data can go back to your data scientist, and they can actually use live information from your app to improve their model. So they can see, was it helping the users or not? How was the model behaving when it was being used in a real application? And they can have a healthy relationship now with the developer, where they can actually say, "Hey, if you instrument and add this extra telemetry into the app, then I can improve the model for you." So now you get to this healthy and productive flow where the friction's gone, the model's being used in the app, the developers are happy, and the data scientist's happy because their work's being used for a real production system.
And so this is just, again, a different pivot on the happy path flow, where you have a company-wide model store or model catalog. Developers can browse the catalog, figure out which models they want to use in their application. They can ask data scientists to say, "Hey, can you customize this model for me?" "Sure. Okay. I'll take it off the store. I know how it was produced before. I can feed different data into it or try different features on it." Be able to take that, publish it out, and then seamlessly consume it on the development side. So even giving the developer something like, "Oh, here's a Swagger spec where you can just generate a client library to call my model, or package it up into a DLL, or make it easy to use in my modules I'm trying to deploy out to the edge." All those things are just sugar and help to make this a happy relationship.
So pain points. I think I touched on a few of these already, but on the ML stack, the code is often R or Python, or some variant of Spark with Java or Scala code. It's usually not the same as the rest of the application stack. So again, that featurization logic needs to be rewritten. There may be lots of glue you have to wire up. It's also hard to track breaking changes when you're dealing with different languages.
And then on the model side, testing accuracy of models is not easy for developers. They don't really understand it. How do you design tests that can float and have some variance on them, instead of just being like, "Okay, was this the exact metric produced that I expected?" Where do you set that barrier from one version of a model to another? How much float can you actually support? And you need to be flexible there and work with the data scientists to figure out what's an acceptable accuracy loss.
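A test that "floats" can be sketched as a comparison against the previous version with an allowed drop per metric, rather than an exact expected value. The metric values and tolerances below are made up for illustration; the real tolerances are exactly what has to be agreed with the data scientists.

```python
def acceptable(candidate: dict, baseline: dict, max_drop: dict) -> bool:
    """True if no metric falls more than its allowed drop below the baseline."""
    return all(
        candidate[metric] >= baseline[metric] - allowed
        for metric, allowed in max_drop.items()
    )


baseline = {"precision": 0.90, "recall": 0.80}
candidate = {"precision": 0.93, "recall": 0.79}

# Hypothetical tolerances agreed with the data scientists.
max_drop = {"precision": 0.01, "recall": 0.02}

print(acceptable(candidate, baseline, max_drop))
```

Here the candidate has higher precision and slightly lower recall, which an exact-match test would reject but a tolerance-based test can accept.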
Also on the performance side, how do I compare and contrast the improved accuracy of a model, but it takes three times as long to run now as it did before? So these are all things that you need to think about when you're bringing models into production systems.
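One way to put numbers on that trade-off is to measure latency alongside accuracy and apply an explicit rule. The sketch below is an assumption-laden illustration: the `worth_shipping` rule and its 1.5x slowdown limit are invented here, not a policy from the talk.

```python
import statistics
import time


def median_latency(predict, sample, repeats=20):
    """Median wall-clock seconds for one call to predict(sample)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(sample)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def worth_shipping(accuracy_gain, old_latency, new_latency, max_slowdown=1.5):
    """Hypothetical rule: accept only a real gain within a bounded slowdown."""
    return accuracy_gain > 0 and new_latency <= old_latency * max_slowdown


# A model that is 2 points more accurate but takes three times as long.
print(worth_shipping(accuracy_gain=0.02, old_latency=0.010, new_latency=0.030))

# Measuring a stand-in predict function; real numbers depend on the machine.
print(median_latency(lambda rows: [sum(row) for row in rows], [[1.0, 2.0]] * 100) >= 0)
```

Using the median rather than the mean keeps a single slow call from dominating the comparison.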
When we talk about traditional applications, I'm not going to waste too much time on this. You're all familiar with it. So traditional CI/CD pipelines, you have build and release in place. In AI applications, again, now you have two personas, working in two different contexts or environments. Normally, they're working in different repositories as well. So as I'm working in these repos, I go and build my code and test it. Either the model gets directly integrated into my app and deployed, or the model's deployed as a separate service and I call out to it over a REST API or something like that. But in either case, you need to have integration testing in place to make sure that it works.
So, what we have now is, this is a proposed process for doing CI/CD for models, and we have services in Azure that help support doing this.
This is from a feature branch point of view. Every time I commit to my data science repo, I want to actually create a sandbox environment. It says conda here, but we encapsulate everything we can in Docker containers to clean it up. Make sure you have all the requirements there, lint the code, run unit tests on it, publish those test results, and also look at your code coverage to make sure that you're not adding in a bunch of extra functions and features into your model that aren't being covered. Sorry.
So from a PR point of view, whenever I kick off a PR, I want to actually do testing and validation and go and train the model. However, I may not always want to train the model on the full set of data, because again, that could take time to do and be expensive, so I may want to have a smaller sample set I can use. Also in the case of compliance, when I'm doing pull requests, I may want to use, say, public data I've scraped off the internet and not my customer's data, because then I can debug and figure out what's actually going on in the training process if I'm having issues. This example here is talking about deploying to ACI. That's an Azure Container Instance for those of you who don't know what it is. But basically, you deploy it onto a container you can spin up quickly and tear down, make sure that it actually works when you've put it in the container, and put the service endpoint on top of it.
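The smaller sample set for pull requests can be sketched as a seeded random subset, so every PR build trains on the same rows. The 5% fraction, the seed, and the stand-in dataset are assumptions.

```python
import random


def sample_rows(rows, fraction, seed=42):
    """Return a reproducible random subset containing about `fraction` of rows."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, round(len(rows) * fraction))
    return random.Random(seed).sample(rows, count)


full_dataset = list(range(1000))  # stand-in for real training rows
pr_dataset = sample_rows(full_dataset, fraction=0.05)
print(len(pr_dataset))
```

Fixing the seed matters for debugging: if a PR build fails, rerunning it uses the same sample rather than a new one. Note that this sketch expects a non-empty list of rows.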
Then here's just the example flows for when I'm actually going to production. I've got my model artifact, packaging it up and deploying it out to a production cluster. Today, our happy path for real-time inference of models is to use AKS, which is the Azure Kubernetes Service. So we deploy it out to there, put an autoscaler on it, and then your applications can start calling into it.
One second, sorry. Okay. So other pipelines we're going to be talking about here. One of the things we haven't talked about in this talk yet is converting and quantizing models. So I may have trained a model on a set of data, but to make it run on this edge device, I need to shrink it. It's three and a half or four gigs of a deep neural net. But my device may not have that much memory on it, especially when you're talking about edge devices, or I may not want to wait for the time to propagate that model out to everything. So we have converting and quantizing of the model where we shrink it, and then analyze how much accuracy it lost when we pruned layers of the graph out.
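To illustrate the shrinking step, here is a toy sketch of quantizing floating-point weights to 8-bit integers with a single shared scale and then measuring how much precision was lost. It covers quantization only, not pruning, and it measures weight reconstruction error rather than model accuracy; real toolchains do considerably more. The weights are hypothetical.

```python
def quantize(weights):
    """Map floats to int8 values in [-127, 127] with a single shared scale."""
    largest = max(abs(weight) for weight in weights)
    scale = largest / 127 if largest else 1.0
    return [round(weight / scale) for weight in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [value * scale for value in quantized]


weights = [0.82, -1.27, 0.003, 0.5]  # hypothetical layer weights
quantized, scale = quantize(weights)
restored = dequantize(quantized, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))

print(quantized)
print(f"scale={scale:.4f} worst_error={worst_error:.4f}")
```

Each weight now fits in one byte instead of four or eight, at the cost of an error of at most about half the scale per weight. Whether that error is acceptable has to be checked by re-evaluating the model itself.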
Okay. Then there's retraining. This is sort of the happy nirvana I showed in phase four there, where now that I have my model that is reproducible, I have validation on top of it, I have a proper place to store it and version it, and I have deployment pipelines that allow me to do a safe and controlled rollout of it. I can do real retraining, and then I can get it out to customers and do A/B testing on it. So I can have both models in my service at the same time and flight a subset of the customers with the new version to see how it works. This is how we do it today in all of our big AI-infused products like Bing and Office.
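Flighting a subset of customers to the new model can be sketched with a stable hash of the customer ID, so the same customer always lands on the same version. This is an illustrative assumption about how such routing might be done, not a description of how Bing or Office implement it.

```python
import hashlib


def assigned_model(customer_id: str, new_model_percent: int) -> str:
    """Deterministically route a customer to the 'new' or 'current' model."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return "new" if bucket < new_model_percent else "current"


customers = [f"customer-{n}" for n in range(1000)]
flighted = sum(assigned_model(c, new_model_percent=10) == "new" for c in customers)
print(f"{flighted} of {len(customers)} customers see the new model")
```

Raising `new_model_percent` widens the rollout without moving customers who were already on the new model back to the old one.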
So, a couple of demos here. I'm just going to jump to the CI/CD pipeline because I have that one popped up already. Could you flip to my computer? Yep. Awesome.
So, this is an example of build pipeline showing basically how to train a model. In this case, I'm using Azure DevOps. For those of you who aren't familiar, it's an awesome CI/CD solution from Microsoft. Here, the first thing I'm doing is installing the ML extension for Azure Machine Learning. I'm unit testing my code before I go and submit the training job, analyzing the quality of my code, storing those test results along with the build. Then I'm basically creating my log for the run with the name of my experiment, submitting the job to go run. In this case, we're running on a data science VM, but you can also submit a job to go and run on a batch compute cluster that will spin up and tear down on demand for you. You can also submit jobs to go and run against a Spark cluster. So the intent here is to keep the code sort of agnostic of the compute. We also have data stores as a concept, so you can mount and unmount data stores as part of your training job. I'm downloading the model, I'm putting the model in the model registry here, and then I'm basically preparing my dependencies I need. So in this case, I have a score file and a conda file that I'm going to use to create my container. Copying the dependencies and then preparing the artifacts for deployment, and just show you what one of those actually looks like.
This is a live demo. So, drill into the logs for this one here. Right. So I can say, oh, here's my test that ran. I've got 40 tests sitting inside of here. I can see how long they took to run. I've got the artifacts. So if I look at these, pop them open here. You'll see, okay, I've got my score file. I've got an example model file I could use that's not the one in the registry, but I have one stored in the registry as well. Then I have some other dependencies that I care about, like classes I may need when I bring my model to production.
And then you can see from a logs point of view about how long everything took to run. I can see when I go and run my experiment. I can drill into here, basically say, "Oh, okay, here's all the dependencies." Authenticating with Azure, and here's the actual training of the model. We also give you a link where you can go and click and see your experiment run results live, and then here's the metrics for my model. So we pin these as well. And then I can also track all the individual runs.
On the release side, so here's an example release pipeline. Again, using similar steps to what I had over here, if we look at what these commands are actually doing. So if I drill into this, this is that test deployment step I was talking about. So basically, I'm creating the container image, which has my model packaged up inside of it. I'm creating a service, testing it, and then tearing it down. And I can see these commands are pretty simple. I'm creating a container. Here is my model artifact, my score file. The code's in Python. Go ahead and deploy it, create the service. Again, it's a dev test service, so we're just creating a container instance, making sure that it actually works, running this test query against it. I have this as an inline script, but you could also put this in a repo if you want to treat it more as an infra as code. And then I tear it down when I'm done because I don't want to keep my ephemeral resources running. There's no reason to do that.
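The create, test, tear down pattern described here can be sketched with a context manager that guarantees cleanup even when the test query fails. Everything below is a local stand-in: `running_services` simulates cloud resources and the lambda simulates the deployed model, so no real Azure commands are shown.

```python
from contextlib import contextmanager

running_services = set()  # stand-in for real cloud resources


@contextmanager
def ephemeral_service(name, predict):
    """Bring a test service up, yield it, and always tear it down."""
    running_services.add(name)  # "deploy"
    try:
        yield predict
    finally:
        running_services.discard(name)  # "tear down", even if the test failed


def smoke_test(predict):
    """Send one known query and check the response has the expected shape."""
    response = predict({"features": [1.0, 2.0]})
    assert "prediction" in response, "service returned no prediction"


def fake_model(request):
    """Hypothetical model: returns the sum of the input features."""
    return {"prediction": sum(request["features"])}


with ephemeral_service("model-test", fake_model) as service:
    smoke_test(service)

print(running_services)
```

The production deployment deliberately skips the teardown half of this pattern, since that service has to stay up.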
Then I have the production AKS deployment. If I look into here, again, similar steps as before. I've already created that image, so I don't need to make a new one. I'm just going to take that same image from the previous phase. That's this image ID parameter here. Deploying to... You can pick which AKS cluster you want to deploy to. And I also make sure that it works over here. I don't want to tear down the production one because that's a production service.
And just to give you a quick insight into the repo as well, so you can see the actual code. Right. So here's some example training code. So this is what's actually getting submitted. Then on the scoring side, I can say, "Okay, here's my conda dependencies that I care about, and then my score file."
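For readers who have not seen a score file, here is a hypothetical sketch, assuming the common entry-point convention of an `init()` function that loads the model once and a `run()` function that handles each request. The model, request format, and field names are invented for illustration and are not the code shown in the demo.

```python
import json

model = None  # populated once by init()


def init():
    """Load the model once when the container starts."""
    global model
    # Hypothetical stand-in: a real score file would load the registered
    # model artifact from disk here.
    model = {"weights": [0.5, -0.25], "bias": 0.1}


def run(raw_data: str) -> str:
    """Score one JSON request of the form {"data": [[x1, x2], ...]}."""
    try:
        rows = json.loads(raw_data)["data"]
        predictions = [
            sum(w * x for w, x in zip(model["weights"], row)) + model["bias"]
            for row in rows
        ]
        return json.dumps({"predictions": predictions})
    except (KeyError, TypeError, ValueError) as error:
        # Return the problem to the caller instead of crashing the service.
        return json.dumps({"error": str(error)})


init()
print(run(json.dumps({"data": [[2.0, 4.0]]})))
```

Keeping the load in `init()` means the cost of reading the model is paid once per container rather than once per request.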
Gabrielle Davelaar
Cool. And I think it's also important to note that what we're now also trying to develop at this moment is that it is going to be agnostic. So what we find really important, obviously, we would love people to use our products, but we find it way more important for the industry to change in this, because this is really necessary for AI to grow to a large scale. So if you would like to use Jenkins, or if you would like to use Databricks in those pipelines, fine by us. It's really more about standardizing a way of working rather than saying, "Please use Azure." I do say please use Azure, but on a side note, it is also possible if you want to use other products.
And it is really important to have this standardization because companies are running into these problems, and that's what we call the valley of disappointment. And it can actually help a lot of customers with this.
Also, one note, if you would like to have the presentation, because I saw a lot of people taking photos, feel free to reach out and we can send it to you. We're also posting it on--
Jordan Edwards: I think the decks are uploaded, right?
Yeah.
Jordan Edwards: Yeah.
I think so. And otherwise, we have it on LinkedIn now also available. Okay. So, thank you so much for being here. If there are any questions, please feel free to ask them.
Jordan Edwards: All right. Thanks.