Seven Steps to Move a DevOps Team into the ML & AI World

Las Vegas 2018

Managing Research Director, Hybrid Cloud, Software Defined Infrastructure and Machine Learning · EMA Research

Nobody questions the potential for machine learning and artificial intelligence to revolutionize DevOps productivity and business efficiency as a whole. However, organizations leveraging ML and AI to their full potential are few and far between. Done right, ML/AI and DevOps can turn enterprises into “digital attackers” that can release higher-quality software faster and at lower cost. Torsten Volk will outline the seven basic rules to follow to implement ML and AI in real-life DevOps situations. This session will cover: Getting started with AI and ML. The role of parallel pipelines. Which metrics to use and how to use them effectively. Why patience and experimentation are key to making it work.

With over 15 years of enterprise IT experience, including a two-and-a-half-year stint leading ASG Technologies' cloud business unit, Torsten returns to EMA to help end users and vendors leverage the opportunities presented by today's hybrid cloud and software-defined infrastructure environments in combination with advanced machine learning.

Full transcript

The complete talk, organized by section.

Torsten Volk

Can everybody hear me okay? There we go. Or do you hear me with a German accent? Yes. Oh. Yeah, that's the microphone. What we want to talk about today is AI in DevOps, and believe me, it is difficult for me. I have a hard time putting the term artificial intelligence, AI, in there. I think we all know what we are really talking about, and I want to start and set the frame a little bit to have that joint understanding because Enterprise Management Associates, we are an analyst firm, and we are all about pragmatic solutions and trying to get people from point A to point B with the least possible pain. And artificial intelligence, what you can see here, and I'm going to turn all of those cool videos on, can help us there.

It can help us produce better software. It can help us do that cheaper. It can help us do it faster and eventually get to a continuous release process. And what you see here are all of the different expressions of AI, if you will. So at the top left, we see different types of cars that are all learning to go around a track. And they couldn't learn that in traditional code because there is no-- They can, but then they would only work for this track. And if you change the track, the cars, the little blue and green and red cars, would have no idea where to go and how to drive around that track efficiently. And that comes down to the whole challenge of artificial intelligence, and that is I have three components.

I have features, and those are the input variables. Do I drive left? Do I drive right, for example, for those cars? Then I have actions. The actions are basically-- Sorry. I have features where my car can drive to the left and it can drive to the right, and I can detect what it's exactly doing currently. And then I have actions where I turn the wheel to the left and to the right and steer around that course, and then I have a reward. If I get around the course, I get a higher reward. Just like in a Mario game, if I get further down in the level in Mario, I get a higher reward for my artificial intelligence model. And those are things that I can really only do if I use artificial intelligence models instead of structured coding.
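
The features, actions, and reward described above can be sketched in a few lines. This is a minimal illustration, not the model shown on the slides: it assumes a hypothetical one-dimensional "level" where the only feature is the current position, the actions are "left" and "right", and the reward is higher for reaching the end of the level. The learning rule is standard tabular Q-learning, and all names and numbers are assumptions chosen for illustration.

```python
import random

# Hypothetical toy level: positions 0..GOAL on a line, like moving right in a platform game.
GOAL = 5
ACTIONS = ["left", "right"]  # the actions the agent can take


def step(position, action):
    """Apply an action and return (new_position, reward, done)."""
    if action == "right":
        new_position = position + 1
    else:
        new_position = max(0, position - 1)
    done = new_position == GOAL
    # Reward: a small penalty for every move, a bonus for finishing the level.
    reward = 10.0 if done else -1.0
    return new_position, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning: learn action values purely by trial and error."""
    rng = random.Random(seed)
    # The only feature is the current position, so the table is (position, action) -> value.
    q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        position, done, steps = 0, False, 0
        while not done and steps < 100:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore: try something random
            else:
                action = max(ACTIONS, key=lambda a: q[(position, a)])  # exploit
            new_position, reward, done = step(position, action)
            best_next = 0.0 if done else max(q[(new_position, a)] for a in ACTIONS)
            # Move the estimate toward the reward plus the discounted future value.
            q[(position, action)] += alpha * (reward + gamma * best_next - q[(position, action)])
            position = new_position
            steps += 1
    return q


def greedy_policy(q):
    """Best known action for each non-terminal position."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
```

Calling `greedy_policy(train())` returns the action the agent has learned to prefer at each position. Nothing in the code tells the agent that "right" leads to the goal; it only finds out through the rewards it collects.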

And what you can see, for example, in the Mario model, there are a lot of individual decisions, basically clicking the individual buttons, pushing the individual buttons on the joypad, A, B, X, Y, up, down, left, right. And I get a reward for pushing them in a specific situation, and the situation you see in the Mario example at the bottom or at the left in that gray box there, where it shows the obstacles and it shows basically how Mario sees the world. And then Mario gets rewarded for responding to that limited worldview differently. And what you can see on this next example is, it's a computer game that probably a lot of you recognize. It's Asteroids, and this player is a neural network that is playing Asteroids.

And you can see it just lost very quickly. So it changed its strategy, and it's now driving around much more and shooting and driving and turning and shooting and is checking out if that is a strategy that works out or not. And once that one is hit by an asteroid, it'll try the next thing and the next thing, and it will get rewarded the more of those asteroids it hits, and it will get basically punished the quicker it dies and the less it achieves in this level. And the reason why I put this example there is this guy actually almost plays like a human being, even though he has no idea what Asteroids is. This is just purely a pattern-based trial and error effort, and if you get hit after a certain behavior, or if you hit more asteroids, then you get rewarded more, and if you don't, you get rewarded less.

And to end the history or the science part of this talk, this is from 1993, and this is the first convolutional neural network that really changed the economics in how checks were processed. Before, you had no way of automatically reading people's handwriting. You got 80%, 85% hit rate when you tried to do that, and that was just not enough. And in 1993, this Yann LeCun, he's now, I think, Chief Technology Officer at Facebook, came up with this convolutional neural network algorithm and model that could, for the first time, do this. And that was the turning point that makes us compare artificial intelligence and machine learning to the Industrial Revolution, to everything that really fundamentally changed the economics of how we do business.

For example, instead of having lots and lots of people transcribing checks into computers, we now scan them in, and they go. They're all done, and that's the same thing in a lot of other disciplines as well. So the reason why we're having this talk today is AI is not trivial, even though it seems trivial in many ways. But we can have a large number of different kinds of neural networks, for example, and those are just neural networks. And then for every single neural network, I have a lot of parameters, hyperparameters. I have a lot of data that's attached to it. Each one of them has its limitations, performance requirements, hardware requirements. You need to configure it properly, otherwise it's not going to work.

My favorite example always is, I always get very, I don't know, almost emotional. I've been doing this stuff for a long time, and you talk to data scientists, and they talk you through the individual steps that it takes to set something like this up. And this is a very much simplified overview. But every time you do a step wrong, for example, for the hyperparameters, you do it too deep, or you do it not deep enough, or you pick just the wrong configuration, that thing will still run for a week or for three days and cost you $50,000, and you get nothing out of it, basically zero. Which is why we are often so dependent on data scientists helping with these things, which then prevents everybody from doing it.

And that is really the big goal. Everybody needs to be able to leverage this, and here's why. You can see, again, this car, it's a little bit of a different model. It has three sensors in the front, one in the middle, one in the left, one in the right, and it just randomly tries out how to get through that maze. It doesn't know anything about driving. It just learns by crashing, and then the reward function has a low output, and it tries it again with a different strategy and a successful behavior, speed and where it should stay on the street and how it should turn, how much it should brake before turning. All of this is rewarded versus if it goes too fast in a corner or if it does crazy things or hits the wall immediately, then that gets punished.

And at a certain point, you will see that car starts driving around the corner. But in the beginning, it looks very hopeless. It looks like this car will never do anything. And the reason for that is it doesn't know anything about the world. It doesn't know that it's a car. It just looks at sensor data. How far is the wall away in comparison to what I do to the gas pedal and what I do to the steering wheel? That's all it knows. And so it sees, for example, that it's a bad idea if I'm very close to the wall to push the throttle through all the way, but it will try that anyway because it doesn't know anything about reality. It explores basically all the different options that it has and goes through that whole evolutionary process over and over again.

It doesn't have a memory from another car or from a different model that gives it a head start. And that's one of the issues that makes this a little bit tricky, but what is important, and that's really my rule number one, and we've been playing with these models since '99, basically. And the interesting piece is they're still pretty much the same. They are just a little bit more accessible to everybody. But to get them into the enterprise and to get people to benefit from it, they have to have a basic understanding and not be intimidated by data scientists saying, "You can't do any of this. You have to pre-process. You have to do all of this configuration, and it takes a few months. And at the end, I don't even know if it'll work." So that's why at the end of the day, you have to really be able to understand the advantages of this whole feature, action, and rewards model.

And that's not all that difficult, but the difficult part is in the next step, or I have actually an example here for this step. We did some research, and we looked at, this is called Driverless AI. This is software from H2O.ai, and that is really interesting because it builds the model, and you can see it does all of the features. It parameterizes how it measures the errors. It does basically everything iteratively by itself. All it needs is resources, and then it shows you in real time how important the variables are that were the input variables that it found, for example. And that is something that lets you then try things out, like, you download datasets randomly, and you see what you find.

For example, what kind of person likes a certain kind of microbrewery is what I ran with 60,000 examples. That is public domain data. You can get all kinds of interesting data from websites like Kaggle and run them through a model like that, and you just get a feel for what you can achieve. And then at the end stands a predictive model. If this worked and if the error is acceptable, then you can provide this as an API, and you can start using this for any kind of software, like a microservice. Number two is start with narrow challenges, and that goes hand in hand with number one.
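
Providing a finished predictive model as an API, as described above, could look roughly like the sketch below. It uses only the Python standard library. The `predict` function is a hand-written stand-in, not a trained model and not H2O.ai output, and the feature names `visits_per_month` and `age`, the scoring formula, and the port are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a trained model: a hypothetical hand-written score."""
    score = 0.3 * features.get("visits_per_month", 0) + 0.1 * features.get("age", 0)
    return {"likes_microbrewery": score > 5.0, "score": round(score, 2)}


def handle_prediction(raw_body):
    """Turn a JSON request body into (status_code, JSON response bytes)."""
    try:
        features = json.loads(raw_body)
        if not isinstance(features, dict):
            raise ValueError("expected a JSON object")
        return 200, json.dumps(predict(features)).encode("utf-8")
    except (ValueError, TypeError) as exc:
        # Malformed JSON or non-numeric feature values end up here.
        return 400, json.dumps({"error": str(exc)}).encode("utf-8")


class PredictionHandler(BaseHTTPRequestHandler):
    """Answers POST requests with the model's prediction."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_prediction(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Run the prediction service until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictionHandler).serve_forever()
```

Calling `serve()` starts the service locally. Keeping `handle_prediction` separate from the HTTP handler means the prediction logic can be tested without starting a server.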

There is a lot of things. We have played with a lot of those technologies and experimented with a lot of those technologies. And this is the most prominent, and to me, upsetting example, where there was IBM Watson for oncology in the news a few months ago. And they basically said, "Yeah, the thing doesn't work. It's all crap. We should have never started with it. All the hospitals, all the doctors, nobody likes it. It doesn't recognize cancer. It's terrible." But in the beginning when it started, everybody thought, "Wow, it's the biggest breakthrough. It changes the way we cure cancer. It'll change everything," right? And the interesting part is it works perfectly within the parameters that I would expect it to work in.

I haven't dug into the individual details, but what happened here was there was a scandal made, or there was a whole big problem made out of this software, this AI-driven software, basically not independently learning to come up with a cancer treatment, right? And to be compliant and to be secure and to just work the way it should work and like a doctor works almost. And that is just a very high ambition, and there was no checks and balances and no limitations of AI considered. And this is basically the opposite of a narrow approach. And rule number three, treat AI as an experiment. That's my own experiment. That's for my aquarium. And I used a technology that I'm not going to name, but the product managers of this technology thought it should be able to read out this industrial display here and just transcribe it into a JSON API.

And you see a preview of that to the right, but you can see the numbers don't match. It doesn't recognize a lot of the text. Those three numbers are the most important thing. They just don't work, right? And the funny thing is that on the example, it works just fine, and the example looks at least every bit as difficult or easy as what I did in my own little lab when I tried to automate my aquarium. But it shows that artificial intelligence is not intelligent. It cannot read. It cannot know, look, the 62 here, that's not a 6.2 or this 5.1, it's a pH, so it has to be a .1. pH cannot be 51. Things like that I would expect if it was truly intelligent, but at the end of the day, it's finding patterns of pixels on a high-resolution photo and is basically transcribing those into API output.
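
Because the recognition service has no notion that a pH of 51 is impossible, that knowledge has to be added around it. A minimal sketch of such a plausibility check follows. The field names and ranges are assumptions for illustration, not values from the talk or from any vendor API.

```python
# Assumed plausible ranges for an aquarium controller; illustrative values only.
PLAUSIBLE_RANGES = {
    "ph": (0.0, 14.0),
    "temperature_c": (0.0, 40.0),
}


def validate_reading(name, value):
    """Return the value if it is physically plausible, otherwise raise ValueError."""
    low, high = PLAUSIBLE_RANGES[name]
    if not (low <= value <= high):
        raise ValueError(f"{name}={value} is outside the plausible range {low}..{high}")
    return value
```

With this in place, `validate_reading("ph", 5.1)` passes the value through, while `validate_reading("ph", 51)` raises an error instead of handing a nonsensical reading to the controller. The check deliberately rejects the value rather than guessing where the decimal point belongs.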

And it does it exclusively on correlations. It doesn't know what it's talking about, if that's a number, if that's a letter, or if that's a picture. That's not at all what it's doing. But if I had done this-- This was just my own project. If I had done this as a production project for a customer, for example, there's a lot of similar thinkable scenarios for industrial controllers that have these displays, I would've said, "Yes, absolutely. That's easier than most of the stuff that they do in the demos, so this will not be a problem." But it didn't work. It couldn't be made to work. It was just a dead end, and in the end, we had to use a standard library that had nothing to do with AI and machine learning and was a lot more labor-intensive to implement.

And there's a couple of interesting examples that we always need to remember when we talk about what to do with AI in building our own software. We can see up to the top left, one of the image recognition tools totally mischaracterized basically weapons here. At the bottom right, somebody managed to build a mask that was fooling a facial recognition software. Top right, there were two computers that were learning how to chat together and having fun in their own language, which didn't do anything. And here we have a couple of other issues that showed the difference in expectations versus what can really be achieved, and that is, from a project management perspective, a really important piece. And that's why I really always recommend to start with turnkey APIs.

If you talk about AI/ML artifacts, turnkey APIs are really, really cool, and they do a ton of stuff. And if they work, just like one of the guys in the keynote said, it's not core for us to configure TensorFlow and to build the model and train it ourselves. If we can get something out of the box like this one here-- So I want to show also something from IBM that was positively in the headlines. This is Visual Insights here. And what this does is it basically replaces human quality controllers. It finds issues that it hasn't seen before, and what it requires, obviously, is that you give it some examples in the beginning. But the whole overall AI modeling process and software is out of the box, and that is an excellent way to start.

And you can start at a few different types of starting points, basically. You have pre-trained APIs where you can do a whole bunch of commodity stuff. You can get them from Google, from Amazon, from Microsoft, from IBM, from a ton of other vendors. They're kind of commodity. Then in the middle, there is, to me, very interesting stuff like this H2O.ai Driverless AI, IBM Visual Insights, Azure Machine Learning Toolkit. There are a lot of interesting things that you can use turnkey without a ton of implementation. And then to the right, we have things like Algorithmia and SageMaker from Amazon, where it's basically a bunch of APIs where you at least don't have to deploy your own tooling. And next point, rule five: make it modular.

And I'll keep this short because we don't have a ton of time, I think. Make it modular just means that use a different color on the board for AI, because you really don't know if your model will actually produce the results that you think it will. And believe me, I've done it so many times when I was still doing work and not just talking about work as an analyst. But where we really thought, "Oh, yeah, that's a slam dunk. That will absolutely 100% work." Like with that aquarium thing. I would've bet a lot of money that that would've worked. And so would have the guys who did the API, but in the end, it didn't work. So we have to have a different color for this so that we see, yeah, that's AI.

We have to have some contingency planning in place, in case something doesn't work immediately. Right? And funnily enough, that aquarium thing now works or would work. I haven't redone it yet, but it was fixed. Right? The vendor retrained it. It took months to retrain it for such a relatively simple thing, and it is now fixed. And now I could use this API very reliably for automating my aquarium controller. Number six: treat it as code.

And that's actually, I said this this morning for the first time, AI as code. And it truly, I think, is absolutely critical that we think of AI as code because those issues that we had with the cancer, with the Tesla, with all of those things, it comes down to the same thing as having infrastructure as code. If we have it as code, we can version control it, we can reproduce issues and basically turn it into artifacts, let other people use it as well, and let everybody benefit from what we found out. And ideally, we have a pre-trained model that people can use. So yeah, it's basically a service, right? We have a REST API, and we deploy to a Docker container, a serverless function, or a streaming framework like Apache Spark.
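
One way to make a model version-controllable and reproducible in the sense described above is to record a fingerprint of everything that determined it. The sketch below is one possible approach, not a standard tool. The function name and its three inputs (hyperparameters, training rows, and a code version string) are assumptions for illustration.

```python
import hashlib
import json


def model_fingerprint(hyperparameters, training_rows, code_version):
    """Hash everything that determines a trained model so a run can be reproduced."""
    digest = hashlib.sha256()
    # sort_keys makes the hash independent of dictionary ordering.
    digest.update(json.dumps(hyperparameters, sort_keys=True).encode("utf-8"))
    digest.update(b"\n")
    for row in training_rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    digest.update(code_version.encode("utf-8"))
    return digest.hexdigest()
```

Storing this fingerprint next to the model artifact means that two models with the same fingerprint were built from the same configuration, data, and code, and a change in any of the three shows up as a different fingerprint.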

We can do anything with it. And what this "it's just another service" little piece should really show is that it's a service that my main service relies on for a certain thing. Like in my case, read out a pH of my aquarium and open up the controller to raise or lower it if it's off, right? So that's a capability that depends on AI, but I should have a plan B where I have a static library or something that I can, in the meantime, use to make sure that I have at least that capability available to some degree. And then there's another thing that we as an analyst firm see a lot, and that is the project metrics. If you have an AI project, there's always a lot of-- It's very easy to justify overruns.
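
The plan B described above can be expressed as a thin wrapper around the AI-backed capability. This is a sketch under assumptions: `ai_reader`, `fallback_reader`, and `is_plausible` are hypothetical callables supplied by the caller and do not refer to any real library.

```python
def read_with_fallback(image, ai_reader, fallback_reader, is_plausible):
    """Try the AI service first; use the static reader if it fails or looks wrong.

    Returns (value, source) so callers can track how often plan B was needed.
    """
    try:
        value = ai_reader(image)
        if is_plausible(value):
            return value, "ai"
    except Exception:
        # Any failure of the AI service is treated as a signal to use plan B.
        pass
    return fallback_reader(image), "fallback"
```

Returning the source alongside the value makes it possible to measure how often the AI path actually delivers, which is useful input for the project metrics.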

It's very easy to justify that the ROI was not there yet in the first six months or at milestone one or two because it's so cool and the implications are so large. But that's how AI is currently losing a lot of trust and a lot of enthusiasm with a lot of people whose budget it is that they're spending on this, right? So, it's absolutely critical to manage expectations, to have the same milestones, the same metrics that we would have for other projects as well. And another interesting piece now is rule eight: the advantage of public data.

And I just looked at the Kaggle website, which has all this public data available. You can basically get a data set for most problems that you can solve with data. So in this case, if you go and explore this and combine it with your corporate data, you can fill in a lot of gaps that you have where you don't know something about reality, and you can't really train your model properly. But if you add this open-source data, public domain data in a lot of cases, then you can actually predict something really, really well. So that's another thing to consider. If you say, "Oh, there is no data available." There is data available. I just reviewed a whole bunch of exploration tools. Literally, you can get all the Yelp reviews for all of the US, everything, with all the columns, with all the comments, with absolutely everything.

You can get that and learn a lot about people within your corporate context. If you want to achieve a certain goal, you can use Amazon. You can do sentiment analysis on Amazon reviews. They're all available as a text file, and you can add them into your own models. So that is number eight.

And number nine. This is really interesting in that you build basically with AI. You can build a virtual reality because you perceive the world through that AI. The AI basically pre-sorts what you're seeing. And there's a lot of entry points that if you want to exploit that as a competitor, that can be very detrimental because the character of AI is generally entirely opaque, so you can't see into the model and see why it made a decision. So if somebody takes a memory stick and feeds a whole bunch of data into that model that is racist, that follows a certain agenda, that does a certain thing, that is there to discredit my company or to make my products work less well, you will not see that anymore.

You will not be able to prove that anymore once the memory stick is removed and somewhere in the trash, unless you find it, because that stuff is then all in the training process, and you have to basically continuously test your model to see what it's doing. In two months, will my aquarium example still work? I said it worked now. Well, that means it worked four weeks ago when I tried it last, and I thought, "Wow, great. That worked." But I have no guarantee that in four weeks from now, it will still work. So there's a lot that I can inject. There's dictionaries that, for example, show the algorithm alternative words for a certain term that if I manipulate those dictionaries, I can get you into big trouble because instead of a nice balanced evaluation of a text or of anything that you want to do, you might get something entirely biased and skewed.
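
Continuously testing a model, as recommended above, can be as simple as re-running it against a fixed set of examples with known answers and failing loudly when accuracy drops. The sketch below assumes the model is any callable and that a labeled "golden" set exists. Both the function names and the threshold are illustrative.

```python
def accuracy(model, labeled_examples):
    """Fraction of (inputs, expected) pairs for which the model output matches."""
    correct = sum(1 for inputs, expected in labeled_examples if model(inputs) == expected)
    return correct / len(labeled_examples)


def check_model(model, labeled_examples, minimum_accuracy):
    """Return (passed, score) so the check can run on a schedule or in a pipeline."""
    score = accuracy(model, labeled_examples)
    return score >= minimum_accuracy, score
```

Run on a schedule against a hosted API, a check like this shows within one cycle when a service that worked four weeks ago has stopped working, whether through retraining, manipulated inputs, or a vendor-side change.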

So feedback loops is another thing. A lot of people, when Microsoft, I think it was, launched their chatbot, they had a lot of fun manipulating their chatbot through external inputs so that it would be very politically incorrect. And rule number 10, and that goes really in line with all of this, is it's a strategic investment into AI.

You can continuously improve your product. You can affect your entire business. You are definitely going to positively impact your competitive position with it. But it's a little bit like quantum computing, like other things that are today a little bit of a hype, quite visionary, and you think, "Oh, yeah, I'm almost there." But in a lot of cases, 80, 90% is fantastic and makes for a fantastic demo, and I've done those. Incredible demos, and then you want to use it in production, they say, "Yeah, 85%, no. Can't do that. We need 93%." And you have no way to get from 85 to 93%. There is just no way that you can see today, and it may end up that you have to throw away your whole software. And I've thrown away a lot of that software sadly in the past.

So today what we have is we have a lot better systems. We have a lot easier to use systems, and it's all much more accessible. So at the end of the day, the takeaway is features, actions, rewards versus instruction-based coding. Teach your guys to think in those dimensions. What are the features that I want to measure? What are the actions that I want to take? And how do I feed back information if something worked or not? I asked VMware, why does vSphere not come with an AI-driven administrator built in? Why do I have to have my own administrator? It's all technology that I could do, but the answer is, and that's the truth, and it tells you a lot about how AI works: you have no way of reliably feeding back reinforcements for your actions.

So you add a storage volume and you configure that thing, and you do 100,000 things at the same time, and at the end of the day, maybe in six months, you have a horrible disaster because of that, but could be because of something else. So you have no real easy way to say, "Oh, yeah, this belongs to that," and train that algorithm by feeding back the outcome to the model. So yeah. At the end of the day, it requires creativity, discipline.

It's a transition, almost like the transition to DevOps. It's a painful transition because you can spend a lot of hours and a lot of money on not achieving all that much, and that's why those rules really apply in terms of if you want to use AI and optimize your DevOps process, you don't optimize the whole DevOps process. You look at one metric. You look to find a certain thing that can help you become a little bit better and learn something from it, and do that over and over again. And then you get to a point where you solve a lot of individual problems with AI, but you're not solving the global problem with AI because if you try and do that, then again, there's a lot of headlines that show what happens when you try and do that because we are just not there from a tech perspective.

And the key here is the AI thinks fundamentally differently from humans. A human reasons. If you ask me a question and I don't know what I'm talking about, if it's at least in my field, chances are I can extrapolate. The AI can't do that. The AI just will give you garbage and not even know that. So yeah. That's really the key, and it's really a gradual transition to AI just like it is to DevOps. And sadly, I didn't leave any time for questions, but that's it. Thank you.