Building Scalable Orchestration
This session will share the challenges faced and lessons learnt in solving large-scale IT orchestration problems. It will hopefully provide best patterns and lessons so people solving similar problems don’t make the same mistakes.
We will cover the evolution from a monolithic model to asynchronous microservices over 4 generations of an orchestration engine, what real-life problems were solved and the challenges that we encountered.
This orchestration engine has been in production for over 6 years across multiple BFSI, SIs and other enterprises in Asia-Pac, including one of the world's largest stock exchanges. The company and team behind this technology were acquired by Nutanix in 2016.
Jasnoor Gill
Okay, so technically we are a sponsored session, but this will not be a vendor pitch, so you can be thankful for that. I might even finish early. One of the things my boss once told me was, "No one ever complains at a conference if something gets over early and they have some more time to head out."
So we'll try to run through this. Most of this is historical. It's how we built certain tech that we built. Some of it has lessons for those of you who are trying to build stuff in-house in the enterprise. We can do questions at the end. I think we'll have enough time for it.
How many of you are IT or ops, not really DevOps yet? Some of you, at least. How many of you are trying to build internal tooling and automation and connect to multiple tools? Okay, fair amount. Good. So this is hopefully in that direction.
Like I said, most of this is based on our experience, so your experience can be different. One of the things we did as a part of this was we changed databases, and that tends to bring out the fans in the audience sometimes, right? So I'm not against any particular database. It's just what we saw and what we did and the results that we observed. And of course, some of this was six years ago, so obviously every database has evolved a lot in that time.
I think we'll get started.
Hey, guys. My name is Jasnoor. I work as a product manager at Nutanix. Nutanix as a company is based around letting you get elastic private clouds with a public-cloud-like experience. We do a lot of automation and management to let you quickly scale out commodity x86 servers and get an AWS-like experience, as we like to call it, on top of that.
Where the orchestration bit comes in is that sometime last year, Nutanix acquired an orchestration company called Calm.io. I actually came in via that transfer. And so what I'm talking about is some of the tech that we built at Calm.io that's getting integrated into Nutanix.
We were primarily an orchestration engine. So we did a lot of automation for large IT enterprises. And we spent about seven to eight years. So this was about 30 people, primarily based out of India. Most of our customers were in Asia-Pac before we got funding and tried to do a US marketing thing and tried to go worldwide. But the primary focus was on building effective and scalable orchestration. Our customers were mostly people like banks who were trying to do custom orchestration, who were trying to get databases, failovers, HA stuff working.
What we're going to try to do is just an introduction to orchestration and what we tried to do, multiple generations of what we built. The first parts of our code base were probably built around 2009, 2010 timeframe, prototypes. Then we kind of evolved it and had multiple generations of our orchestration engine.
Obviously, there were lessons learned. There was stuff that blew up at customer sites. There was stuff that didn't scale as much as you would like. Some of it I can talk about. Some customers I cannot name. And that's just how things are.
But to start with: orchestration, right? Everyone needs to do it. The minute you start trying to manage IT, the minute you start trying to plan, figure out how you do a release, this is what the dictionary says orchestration is. But all of that is around how do you coordinate? How do you make sure, when you have a large team of people, how do you orchestrate different activities that you need to do on a daily or a weekly or a monthly basis, whatever you have?
Every problem that you need to do in IT is at its core, or at least that was our thesis, orchestration. We started off saying, "How hard can this be, a problem to solve? It's orchestration. We'll be able to build something in three months and get it working, and hopefully it solves every problem for us, for our customers."
But as you can imagine, these things snowball. So what you think is a simple problem and will get solved in three months turns into a six-year project and multiple iterations. And before you know it, you have 30 people trying to solve that particular problem.
And this is just computer science stuff, right? Anything, any C or Fortran program, at the end of it goes to Lisp. So we were fans of Lisp, and that will come into the story as a part of the whole orchestration engine stuff. But anything that you try to do and solve eventually comes down to doing a series of steps in a particular sequence, which is what orchestration is at the end.
You can build as many layers of abstraction on top of it. You can build automation for it. But at its base, at its core, your algorithm or your custom stuff, for those of you who are struggling with orchestration, is going to be: how do I build a flowchart? How do I do decision trees? How do I do if/else loops? How do I say I can do a for or while loop to iterate over 100 servers that I have? That's the problem essentially every one of us hopefully is trying to solve.
What does orchestration give to you? Orchestration lets you define primitives. And I'll stick to the flowchart example. Every one of us knows those. It gives you a particular way of saying, "These are my base building blocks." If I can have a simple language to describe orchestration or to describe things I need to do in IT, I can just go and I can call those functions.
Hopefully, if I can string together enough of these functions, I will get a workflow or a runbook or a series of scripts that you might have that will go ahead and solve my problem for me. But what you want to do is you want to make sure that the implementation of stuff you need to do is different from what you want to do.
So if, as a part of orchestration, you need to log into a server and run a script, that part should not ideally have to be rebuilt every time you want to log into a server. You want a standard script or a standard way of saying, "This is how I go log into a server and run particular tasks."
And of course, we had the whole DSL hang-up. You should be able to just define your orchestration flow in a flat file and say, "This is what I want to achieve." And people like Chef and Puppet pretty much do similar stuff. Some of it works, some of it doesn't work, but that's the basic principle.
In an ideal world, I should have a single file. I maybe learn a custom language or maybe use a subset of an existing language like Ruby or Python. But I want to be able to describe everything I need to do. I want to pass it on to a separate engine that interprets it for me and says, "I already understand how to log into a machine. I already understand how to do error handling," which becomes very important. I want to know how to retry if there's a failure. I hopefully want to be able to handle stuff like network partitions.
But at the end of the day, you want to separate out your orchestration code and your implementation code from what you would like to achieve. Because you want to achieve different things, but you want a common base or a common engine that gets you wherever you want to go, whatever you're trying to do in IT.
So that's what you want. And this is important: you do not want a failure to log into a box to affect the rest of your flow. That's something that should be solved. And once it's solved, you want that to always just go ahead and work.
So it becomes important to separate those out into two layers, and then you can just go ahead and fix your algorithms. Because the important part is what you're trying to accomplish. The low-level mechanics are things you don't want to have to worry about. If there's an issue in your flow, you go fix that issue, versus trying to figure out why you're not able to log into a box. I use that example because it's one of the most common things you'll end up doing, and it can fail for any number of reasons.
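As a rough illustration of that two-layer split (a sketch only, not the engine described in the talk), the "what" can be a flat list of steps and the "how" a small engine that owns retries. The names `TASKS`, `task`, `run_script` and `run_workflow` are hypothetical, and `run_script` is a stand-in that does not open a real SSH session.

```python
import time

# Hypothetical task registry: the "how" lives here and is written once.
TASKS = {}

def task(name):
    """Register a function under a task name that workflows can refer to."""
    def register(fn):
        TASKS[name] = fn
        return fn
    return register

@task("run_script")
def run_script(host, script):
    # Stand-in for a real remote call; it only reports what it would do.
    return f"{host}: ran {script}"

def run_workflow(steps, retries=3, delay=0.0):
    """Run each step in order, retrying a failing step before giving up."""
    results = []
    for step in steps:
        fn = TASKS[step["task"]]
        for attempt in range(1, retries + 1):
            try:
                results.append(fn(**step["args"]))
                break
            except Exception:
                if attempt == retries:
                    raise          # out of retries: surface the failure
                time.sleep(delay)  # wait before the next attempt
    return results

# The "what": a flat description with no implementation detail.
workflow = [
    {"task": "run_script", "args": {"host": "db01", "script": "backup.sh"}},
    {"task": "run_script", "args": {"host": "web01", "script": "restart.sh"}},
]

print(run_workflow(workflow))
```

The point of the split is that retry policy changes in one place, while workflow authors only ever edit the list.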
What's the solution? We'll go ahead and we'll build an orchestration engine of our own because, well, this is way before you had things like Ansible and other tooling that came after. So how hard could it be? Some people have done it. But we'll just go ahead and we'll build orchestration, and we'll sell it to people.
2009 is when the first prototypes were written. E1 is what we called it because we also had a Greek hangover, so we started with alpha and beta versions of the code. Epsilon is what finally worked enough that it got us to a customer, and then we just kept that same name. So we never ended up naming it. It was just version Epsilon that went to customers.
So E1 was the first version of our engine. We designed a separate workflow language, so we wanted to go in the DSL direction. Our CTO, who I will not name here, was a big fan of Lisp, so we went with tree data structures. We'll just recurse over stuff. We'll just create a tree of everything you want to do in IT and orchestrate it, and we'll go ahead and solve the orchestration problem.
We wrote an engine so you could supply a DSL to it, and you could say, "Read this, interpret it, and just go ahead and run these steps in the correct order." The assumption, of course, being that if you have a well-written DSL, you don't really need to worry about the engine. You write the engine once, you pass it different things you want to do in a DSL format, and the system goes and executes those, and it solves all your orchestration problems, hopefully. It abstracts away all the actual things you need to do, especially the error handling and all that stuff.
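To make the tree idea concrete, here is a minimal sketch of a recursive interpreter over a Lisp-style workflow tree. The node types (`seq`, `foreach`, `if`, `run`) and the tuple layout are assumptions made up for this example; the talk does not describe the actual DSL syntax.

```python
def evaluate(node, env, log):
    """Recursively walk a workflow tree, appending executed actions to log."""
    op = node[0]
    if op == "seq":                      # run children in order
        for child in node[1:]:
            evaluate(child, env, log)
    elif op == "foreach":                # bind var to each item, run the body
        _, var, items, body = node
        for item in items:
            evaluate(body, {**env, var: item}, log)
    elif op == "if":                     # branch on a value in the environment
        _, key, then_branch, else_branch = node
        evaluate(then_branch if env.get(key) else else_branch, env, log)
    elif op == "run":                    # leaf: record the action to perform
        _, template = node
        log.append(template.format(**env))
    else:
        raise ValueError(f"unknown node type: {op}")
    return log

tree = ("seq",
    ("foreach", "host", ["web01", "web02"],
        ("run", "restart nginx on {host}")),
    ("if", "take_backup",
        ("run", "backup database"),
        ("run", "skip backup")),
)

for action in evaluate(tree, {"take_backup": True}, []):
    print(action)
```

Note that the position in the workflow lives only on the Python call stack here, which is the property that makes resuming after a crash awkward for a design like this.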
So that's, I think, one of the biggest advantages we got. We focused on creating a UI because, well, when you go to large enterprises, they want to know that it's easy to create these workflows, it's easy to manage these workflows. So you have to build some sort of UI versus having people learn the DSL format.
Now, you can still do it. And for those of you who've tried out, let's say, Chef and Puppet, it takes a certain amount of time to get used to something like that. You have to learn what's the format Puppet expects, what's the format a Chef cookbook expects. And we tried to sidestep some of it by saying that we'll give you a UI which will essentially let you create flowcharts. You just add a for loop, add whatever you want, give input parameters to a task, say I want to run a script, and be able to specify all of that. And I have some screenshots later that show what we tried to build.
In terms of what our stack looked like, we went with MySQL because it's fairly popular and usually works. Most of our code base ended up being in Python, primarily because most of our developers were already aware of Python. It seemed like a good language. It scales. It worked for us. For crypto and performance-critical parts, we chose C because it just runs much faster.
The system was built to scale, or at least that was the initial design, and we'll get to what blew up and what didn't work for us. We had NGINX. We had a master and multiple slave model. We used WSGI because, again, it's easy to get multi-process communication going. And I'm sorry, some of this is more dev-heavy than some of you might expect, but this is just the back end of how we ended up doing stuff.
This was a single monolithic app. This is way before we had microservices. Well, the first version of your product, you don't expect that someone will try to do 10,000 servers with it the first month. So you built an app and you said, "We'll build a single monolithic app. It'll take your DSL. It'll walk the series of steps it needs to do."
We used Paramiko, so you could talk over SSH to different servers and go ahead and run your scripts and different orchestration you wanted. Worked fine till someone tried to run 3,000 servers on it, right? And then the whole system blew up. Because of course, you are a startup. You've never tested on 3,000 servers. You tested on 100 and you said, "This works fine." QA passed, and it got shipped to a customer.
With MySQL, queries ended up taking too much time. Nothing would load. The more servers you had to manage, the more tables you had, the more rows you had, and this went nuts.
Interestingly enough, one of our first customers, someone we convinced to try out a prototype product as a startup, turned out to be a stock exchange. Primarily because they had a lot of pain around existing tooling they had tried, so they were willing to take a chance on something. So they put us in a small, non-important department and said, "Why don't you try it?" It worked on 30 servers, so they decided to really go up and say, "Before we pay you money, let's try you out on a couple of thousand servers," and we had a lot of fun stuff.
But primarily, we had so much data to parse and so many logs and so many things we were doing that we couldn't scale. We couldn't cache stuff because it was all in real time. We had to show people what we were doing, what actions were running. So that was all real-time stuff. We tried doing a cache. It didn't work out.
So this was E1. We had maybe one customer who was paying us some money, but they were not very happy because they ended up actually having to run multiple instances of this. So your single pane of glass was actually multiple panes of glass across different business units, each doing what they wanted to do at a smaller scale.
And of course, how do you solve it? Let's figure out how to fix that problem. We decided to rewrite the engine, put five people on it. Well, three dev guys on it. We tried tuning MySQL. It didn't work for us. And this is obviously where I have people who object, saying MySQL is awesome and it works. Well, this was 2010, '11. Didn't work for us.
We tried Postgres, ran a POC. It seemed to give us better performance. So we decided to move over to Postgres. Again, how hard can it be? We had SQLAlchemy in use, so we thought we were abstracted away from the database. It turns out whenever you have developers who are writing code to a database, they will always end up using stuff which is not as abstracted as you thought it was, even though you would imagine that MySQL and Postgres would be relatively compatible. They were not. So it was a fair amount of rework, more than expected, of course. Which IT project finished exactly on predicted time? So it ended up taking us a lot longer.
The biggest mistake that we made was around feature requests. If you are an internal IT team, you get requests from your business all the time. "Can I also have this feature, that feature? This shiny new thing that looks like Docker, this thing that looks like Kubernetes or Hadoop that I want to try out." So we also had incoming requests from two or three customers at this point. And of course, we decided we would also take on their requests and build in new features while trying to do the port.
Not a good idea. If you know exactly what you're building and if you're just trying to translate that into a slightly different system, relatively easy to do. You know what your inputs are, what your outputs are. You have hopefully test cases that work, that tell you what that system is supposed to do.
The minute you try to build new features into that, it doesn't work very well. You need more test cases. You need to figure out how that feature is going to work at the same time as you're trying to port it over. So this cost us a fair amount of time and, in hindsight, was maybe not the way we should have done it.
Interestingly enough, we spent time on it and it worked. We got awesome performance out of this. So just queries and reporting improved, and this is purely Postgres at that point. Now again, tuning and stuff aside, for us, this worked out well.
We could do HA setups. We could do recovery if we had hardware failure. If you had a backup, the system could resume in some ways from crashes.
And of course, state becomes important in orchestration engines. For those of you who have tried to do it, you need to make sure that if you have an ongoing process which has made five mutable... Well, some of you are on immutable infrastructure and you don't need to worry about state, so leave that aside. But all of you who are on legacy stuff and stuff that breaks, you're on step five of an orchestration process and your system goes down, you need to be able to come up. You need to be able to make sure that your system can resume from step five onwards. It shouldn't blow up when trying to do step one to three again because it realizes that something has changed. The system is not what it expects. So those are things that are slightly important when you're trying to do something in production, for example.
And we had our own challenges around that. So with the second generation, the backend couldn't scale beyond a certain point. And that's simply because when you have a database and you're running a lot of workflows and a lot of orchestration processes, you start hitting locks with the DB. There are a lot of things that are commonly accessed, so you ended up with deadlocks. You ended up with multiple workflows that try to use the same information. Because with 50 workflows, it's fine. But the minute the customer tries to run 500 workflows at the same time and stress-test it, your system hangs. Nothing can get past it. So that was one problem.
The second major problem that we had was because we had the whole workflow tree structure, which was our Lisp hangover in some sense. State is complicated. State becomes a part of your program. You're storing stuff inside your program. It's a long-running task that you have that understands where it is.
In orchestration in IT, you can't really say that I will not resume from the previous stuff. And I think I already went over some of that, but resumability becomes one of those features that you cannot give up. You can't say, "I'll just blow up and it's your job to go clean up and start again." Because maybe it took three hours before the system failed. Are you going to ask the user to go back and spend another three hours? You don't want to. So you have to figure out, how do I resume? How do I restore state for the system?
And if it's a tree structure, in our case, I need to recreate that exact flow till the point the system went down, and then I need to resume from that point on. It's a hack. I don't recommend this to anyone. It doesn't work very well because now you have essentially two code paths. You have one code path that says, "This is a brand new structure and this is how I should go execute." But you have a second code path that says, "If you detect that this step has already been executed, you must only do this, skip all the other things."
So my whole code base is essentially a list of if/elses trying to handle resume scenarios for everything. So it didn't work very well. If you have to add any new feature to this system, you have to add it twice in every single place where it's going to be referenced, because you need to handle the resume part for everything. So not something that scaled well.
And we decided to do another rewrite because that's what developers do. If something doesn't work, we'll try and we'll fix it. If it takes time, it takes time. You want perfection, you have to wait for it.
This was interesting. So this is pretty much what we finally ended up with at the base orchestration layer. And we tried to do certain things, but this took a lot longer because we ended up essentially rewriting, throwing away most of what we had built, and saying, "We know what problem we are solving. We know what input/outputs we have. But we need a new system that will not have the previous problems." And while I don't think at that point we had microservices and stuff being talked about much, this is what we ended up building.
You want asynchronous communication between your processes. You do not want stuff waiting on other things happening in your system. You want message queues. You want to be able to just send a message and then go to sleep till you get a response back saying that a task is done.
If you end up blocking, it will impact your ability to scale. It will impact your ability to resume and restart your systems. You do not want multiple places where you are just waiting for other things in your system to return. You want to be able to just pass a message and say, "I could get killed at this point, and my system does not care," because I've just thrown the state across.
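A small sketch of that send-and-sleep pattern, using in-process `asyncio` queues as a stand-in for a real message bus (the talk's engine used ZeroMQ; this example does not). The message shape and the function names `worker` and `orchestrate` are assumptions for illustration.

```python
import asyncio

async def worker(requests, replies):
    """Consume task messages and post a reply for each one."""
    while True:
        message = await requests.get()
        if message is None:              # sentinel: shut down
            break
        await asyncio.sleep(0)           # stand-in for real work (SSH, API call)
        await replies.put({"id": message["id"], "status": "done"})

async def orchestrate(task_ids):
    requests, replies = asyncio.Queue(), asyncio.Queue()
    worker_task = asyncio.create_task(worker(requests, replies))
    for task_id in task_ids:
        # Hand the message over; the orchestrator does not wait on the work.
        await requests.put({"id": task_id})
    # Sleep until each reply arrives instead of polling or blocking a thread.
    finished = [await replies.get() for _ in task_ids]
    await requests.put(None)
    await worker_task
    return finished

results = asyncio.run(orchestrate(["t1", "t2", "t3"]))
print(results)
```

Because the orchestrator holds no state beyond the messages themselves, more workers can be attached to the same queue without changing the sender.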
This is the other interesting part, especially if you're trying to do orchestration in our case, but I guess it holds for a lot of places, for a lot of architectures in IT. Wherever possible, whenever you have a long-running process, you need to save state. And we didn't do it for a while, which makes resume hard in our case. But this, at the end of the day, is what will let you say, I can kill my system. I can do an HA or I can have a hardware failure. I can do a restore. I can read from the database.
Now, of course, database corruption is one of those things that you still need to handle. But at the end of the day, if you know exactly where you were in a workflow or in a system, you can pick up from that point onwards without worrying about recreating state and what else I did and building complicated if/else code paths in your code. You look at the current in-progress state that's saved and you say, "This is where I was, and this is from where I will start." So that becomes a very important part.
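The saved in-progress state can be sketched roughly as follows, assuming a single `progress` table in SQLite that records the index of the next step per workflow. The table layout and the function name `run_with_checkpoints` are hypothetical, not the schema used in the talk's engine.

```python
import sqlite3

def run_with_checkpoints(conn, workflow_id, steps):
    """Run steps in order, persisting the index of the next step after each one."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress "
        "(workflow_id TEXT PRIMARY KEY, next_step INTEGER)"
    )
    row = conn.execute(
        "SELECT next_step FROM progress WHERE workflow_id = ?", (workflow_id,)
    ).fetchone()
    start = row[0] if row else 0         # resume point, or 0 for a fresh run
    for index in range(start, len(steps)):
        steps[index]()
        conn.execute(
            "INSERT OR REPLACE INTO progress VALUES (?, ?)",
            (workflow_id, index + 1),
        )
        conn.commit()                    # checkpoint is durable before moving on

conn = sqlite3.connect(":memory:")
steps = [lambda n=n: print(f"step {n}") for n in range(3)]
run_with_checkpoints(conn, "wf-demo", steps)
```

There is one code path for fresh and resumed runs; the only difference is the starting index. A step that crashes halfway will be run again on resume, so this sketch assumes individual steps are safe to repeat.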
Recursion is one of the other things, and there are arguments against and for, and some people use it. But we decided to go with pure messages, saying instead of trying to call our own selves and building code for it, push to a message bus, read the message later, and just act on it. And again, that helps when you want to scale your system.
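One way to picture the difference: instead of a handler calling the next handler directly, it publishes a message and returns. The sketch below uses a `deque` as a stand-in for the message bus, and the message types `deploy`, `configure` and `verify` are invented for the example.

```python
from collections import deque

def run_message_loop(bus, handlers):
    """Pop one message at a time; handlers may publish follow-up messages."""
    handled = []
    while bus:
        message = bus.popleft()
        handlers[message["type"]](message, bus)
        handled.append(message["type"])
    return handled

def on_deploy(message, bus):
    # Rather than calling on_configure() directly, publish a message for it.
    for host in message["hosts"]:
        bus.append({"type": "configure", "host": host})

def on_configure(message, bus):
    bus.append({"type": "verify", "host": message["host"]})

def on_verify(message, bus):
    pass  # leaf step: nothing further to publish

handlers = {"deploy": on_deploy, "configure": on_configure, "verify": on_verify}
bus = deque([{"type": "deploy", "hosts": ["web01", "web02"]}])
order = run_message_loop(bus, handlers)
print(order)
```

Each handler finishes quickly and keeps nothing on the call stack, so the pending work is fully described by what is sitting on the bus.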
There are still places where you will need to block if you are an orchestrator. You're calling Jenkins. Jenkins needs to go and finish a build. You're trying to execute a script. You need to wait for the script to finish, and that's fine. You can't get around waiting for those components to finish, at least if your workflow calls for it. Until a script finishes successfully, you can't go ahead. So in those cases you'll still need to block. But you want to minimize how much of your system is actually going to be locked at any given point waiting for the next step or the next success message.
Because again, if you want to run 500 activities in parallel on your system, and if you're trying to block, you will end up running hundreds of processes. You will end up eating probably 64 gig or 128 gig-plus of RAM in a large system. And at some point, based on how you're trying to scale, the system will fail because it will depend on how many processes you can run and how much RAM you can consume. So that scaling-up model is not going to work for you.
For the stack we ended up with, we pretty much used Pyramid. We used ZeroMQ for our transport. We built something and open sourced it. It's called ZWSGI. It's on GitHub for those of you who are interested. It lets you use ZeroMQ with HTTP WSGI stacks. So you don't need to change your HTTP message passing. But it actually converts it to ZeroMQ, uses ZeroMQ for the transport, and it gives you a lot more of the advantages of using the queues and the messaging. But you still have HTTP at both ends.
And what that means is if you want to go into debug mode, it's a lot easier to debug HTTP than to try and debug message queues.
So this gave us a large amount of scalability in the system. It got us to microservices. You could run multiple copies of each service. You could scale out only a component that is blocked or only a single component that is running low on resources, for example. If I only need to go ahead and open SSH connections, I only call the SSH multiplexer. I don't go ahead and create multiple copies of the complete code base.
And most of you would get this already. Microservices has been around for a while. We all know that you want each component to only do things that it should. You don't want to mix different activities. And so you can scale out only that component when required.
This gave us a fair amount of scalability. We could run a lot more workflows. We actually benchmarked to 5,000 connections per process. But what that also means is you can run multiple processes. So that gives you a lot of connectivity if you want to, for example, talk to 30,000 VMs and go and orchestrate stuff.
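The arithmetic behind that last example is simple enough to write down. This is only a back-of-the-envelope helper using the two figures quoted above; the function name is made up, and it ignores headroom, retries and uneven load.

```python
import math

def processes_needed(targets, connections_per_process):
    """Smallest number of processes whose combined connections cover all targets."""
    if connections_per_process <= 0:
        raise ValueError("connections_per_process must be positive")
    return math.ceil(targets / connections_per_process)

# Figures from the talk: 30,000 VMs at 5,000 connections per process.
print(processes_needed(30_000, 5_000))  # prints 6
```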
Some of the things, I think some of these we covered, but this state management: what part of your problem is the state of the program versus the state of the problem that you're trying to solve? That's important. Push state out. Have some sort of storage, have some database where you can store where you are in the program or where you are in an orchestration process. If you store it in memory, the minute your program crashes, you're done. You cannot move forward from there.
The only way I would say to avoid that is if you are maybe Docker and you don't care if your system blows up before a container comes up, because you'll just go spin up another container. But in real life, most of your applications will care about state. Most of the programs you run will care about what data is saved where in the system. So you want to make sure that this is saved. And it might be obvious to you, but it took us a couple of tries to get to a point where it was useful to us.
And of course, this is a corollary of the third point. If it's a long-running transaction and you don't save your state, you'll get in trouble. It will blow up at some point. It will cost you time. It will cost you effort to get back to that particular state in the system.
Microservices, asynchronous design. As far as possible, just go ahead and push messages out to other services. Don't try to tightly couple everything. It makes scalability a lot easier. And this is applicable to any system. It is not specific to orchestration. Any system you build, as asynchronous as possible is the way you should ideally go.
Now, this is the other problem. While it sounds very good, if you don't have good debugging infrastructure in place, if you decide to go microservices, now you have a lot more places where things can blow up. You have a lot more places where things can go wrong and where you can have bugs. And now instead of a single monolithic app where you are tracking what my error is, you actually have to track across, let's say, seven different services, saying what data got passed between each service and at what point did I have a bug in the code. And that means you need a lot more instrumentation and a lot better logging than you would have for a monolithic app.
Now, this sounds very good, like microservices and everything being scalable. But when you have developers who are trying to debug stuff, if you have not built instrumentation in time before you go there, you're going to suffer for it. It becomes infinitely harder to go and debug a microservices-based architecture.
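One common piece of that instrumentation is a correlation ID that every service copies onto the messages it sends and the log lines it writes. The sketch below is an assumption about how that could look, not a description of the logging in the talk's system; the three service names and the message shape are hypothetical, and the services are plain functions so the example runs in one process.

```python
import json
import uuid

def new_message(payload, correlation_id=None):
    """Wrap a payload; reuse the caller's correlation_id if one exists."""
    return {"correlation_id": correlation_id or uuid.uuid4().hex, "payload": payload}

def log_event(log, service, message, event):
    # One structured line per event, always carrying the correlation ID.
    log.append(json.dumps({
        "service": service,
        "correlation_id": message["correlation_id"],
        "event": event,
    }))

def api_service(log, payload):
    message = new_message(payload)       # the ID is minted once, at the edge
    log_event(log, "api", message, "received")
    return scheduler_service(log, message)

def scheduler_service(log, message):
    log_event(log, "scheduler", message, "scheduled")
    downstream = new_message({"cmd": "uptime"}, message["correlation_id"])
    return ssh_service(log, downstream)

def ssh_service(log, message):
    log_event(log, "ssh", message, "executed")
    return message["correlation_id"]

log = []
trace_id = api_service(log, {"workflow": "restart-web"})
trace = [json.loads(line)["service"] for line in log if trace_id in line]
print(trace)
```

Filtering the combined logs on one ID reconstructs the path of a single request across services, which is the question a developer is usually trying to answer when something breaks.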
Message passing. You want replies. You want to just pass a message and not worry about what happened to it. You want to make sure you just have any message pass. Now, this doesn't mean that you have to build your own. There is enough tooling out there that you can just go ahead and use. But it makes the system much more reliable and much more scalable. At least from an orchestration perspective, it did.
You have queues from a message-passing perspective. The other important thing with orchestration or with any system is that if you have a tightly coupled system, it will blow up at some point. If you have more requests than your system can handle, it will go and it will say, "The process crashed and I can't do anything about it."
If you have a good queuing system built, you can have graceful degradation. And what that means is you queue your system up so that it'll be slower. It will still process those requests at some point, and that point might be 15 minutes later. But the system is not going down because it's under too much load. That's the other advantage of doing it this way.
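A toy simulation of that behaviour, assuming a worker that can finish a fixed number of requests per tick. The numbers and the function name `simulate` are invented for illustration; the point is that a burst turns into a backlog that drains over time rather than a crash.

```python
from collections import deque

def simulate(arrivals_per_tick, capacity_per_tick):
    """Return (tick, completed, backlog) rows for a worker with fixed capacity."""
    if capacity_per_tick <= 0:
        raise ValueError("capacity_per_tick must be positive")
    queue, rows, next_id, tick = deque(), [], 0, 0
    while tick < len(arrivals_per_tick) or queue:
        arrivals = arrivals_per_tick[tick] if tick < len(arrivals_per_tick) else 0
        for _ in range(arrivals):
            queue.append(next_id)        # accept every request, never reject
            next_id += 1
        batch = min(capacity_per_tick, len(queue))
        for _ in range(batch):
            queue.popleft()              # process oldest requests first
        rows.append((tick, batch, len(queue)))
        tick += 1
    return rows

# A burst of 10 requests against a capacity of 4 per tick.
for row in simulate(arrivals_per_tick=[10, 0, 0], capacity_per_tick=4):
    print(row)
```

In this run the burst is fully processed after three ticks: later requests wait longer, but none are lost.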
Well, while it was nice from an orchestration perspective, we had more problems. And these are not really orchestration problems, but these are more DevOps and more process problems.
You can build orchestration, but as we like to say, orchestration is assembly. You can do whatever you want with it, but someone needs to sit and hand-code assembly language. It's not pretty. It's not fun. The more workflows you have, the more DevOps you are, the more agility you want in your system, the more change you have to deal with on a day-to-day basis, which means that someone needs to go and fix these workflows on a regular basis. And that frankly does not scale.
If you have an orchestrator today in-house at a large enterprise, it might be a BMC or a CA or an HP or a vRA or one of those orchestrators. Every time you need to make a change to the orchestrator, you need a specialist to work on it. Or maybe it's an implementation project where you have to outsource it to a consultant. So you have a tender, figure out how much they'll cost you, and then figure out how they'll come and change it. By then you're probably three apps in and you're doing something completely different.
Well, at least with DevOps, you're hoping that you're much faster and you're not going to have to build individual workflows in your system. So we ended up realizing that orchestration is not really the best solution in these cases. It can't keep up with the speed of change in your system. You want more abstraction. Of course, every problem can be solved by one more layer of abstraction on top.
What we did next was not really a rewrite. This was fresh stuff that we built, and this is what became the Calm.io platform. What we ended up doing was saying that we want the DSL to describe things not at a workflow level but at an application level.
It should be able to describe what VMs I have, what dependencies I have between these VMs and my system, what packages I need to install on my system. And you could compare it to something like Terraform today, for example, for those of you who are on the AWS side of things. It lets you define what your application looks like.
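As a hedged illustration of an application-level description, the sketch below models an application as a dictionary of services with packages and dependencies, then derives a deployment order from it. The blueprint format and the service names are invented for this example and are not the Calm.io DSL; `graphlib` is in the Python standard library from 3.9 onward.

```python
from graphlib import TopologicalSorter

# Hypothetical blueprint: what the application is, not the steps to build it.
app = {
    "db":    {"packages": ["postgresql"], "depends_on": []},
    "app":   {"packages": ["python3"],    "depends_on": ["db"]},
    "proxy": {"packages": ["nginx"],      "depends_on": ["app"]},
}

def deployment_order(blueprint):
    """Order services so that each one comes after everything it depends on."""
    graph = {name: set(spec["depends_on"]) for name, spec in blueprint.items()}
    return list(TopologicalSorter(graph).static_order())

order = deployment_order(app)
print(order)
```

The workflow (install, start, wait for dependencies) is derived from the description, so changing the application means editing the blueprint rather than hand-editing a sequence of steps.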
The thing that it doesn't do, and the thing that we tried to do, is that a challenge with IT is how do I manage my applications? You can go ahead and launch an application, which can be, let's say, 15 VMs or 20 bare metal servers. That's not what IT does day in and day out. 80% of the time, IT is trying to keep existing infrastructure running.
Once I have an application set up, or once I launch any system, any DSL to set up an application, my problems are just beginning. I need to go ahead and figure out how I'm going to upgrade this application, how I'm going to scale this application, how I'll take a backup every day. And all of those problems also need to be described somewhere for them to be solved, whether you do it in scripts or whether you do it as a part of the same system, which is what we tried to do.
So you want to automate everything, not just the ability to say deploy and then throw it over the fence to a junior ops guy and say, "You have to figure out how to keep this application running day in and day out." That's also a problem that you need to think about.
And of course, it's not easy, is it? You have enterprises. Anyone here who's completely on Docker? One. Anyone who's partially got some Docker running, maybe production, maybe staging? Some more. And everyone else is, I'm hoping, well, at least on VMs at some point.
But the problem is, unless and until you are a five-person or a very small startup or very fast-moving company, which is slightly rare at that level, you will end up with a little bit of everything. If you're a bank or a large enterprise, you have legacy that you still need to keep running, which means you have physical servers, you have VMs of different types. You will have to start looking at Docker, and you might even have multiple container management platforms depending on who wins and who gets where in the next two years.
And this is that same multiple panes of glass problem. For each of these separate systems that you have, you need to figure out how you're going to go ahead and manage them. And if you end up managing every system separately, you end up with blind spots on what's actually running in your infrastructure. So try and do all of that in one place, or at least be able to talk to multiple systems. That was one of the other things that we focused on.
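A common way to talk to multiple systems from one place is a thin adapter per platform behind a shared interface. The sketch below assumes that pattern; it is not how the engine in this talk is implemented. `Provider`, `StaticProvider` and `inventory` are hypothetical names, and the stand-in provider returns fixed data instead of calling any real API.

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Hypothetical adapter interface; one subclass per platform."""

    name: str

    @abstractmethod
    def list_workloads(self) -> list[str]:
        """Return identifiers of everything this platform is running."""


class StaticProvider(Provider):
    """Stand-in that returns a fixed list instead of calling a real API."""

    def __init__(self, name: str, workloads: list[str]) -> None:
        self.name = name
        self._workloads = list(workloads)

    def list_workloads(self) -> list[str]:
        return list(self._workloads)


def inventory(providers: list[Provider]) -> dict[str, list[str]]:
    """Collect workloads from every platform into a single view."""
    return {p.name: p.list_workloads() for p in providers}


view = inventory([
    StaticProvider("bare-metal", ["db-01"]),
    StaticProvider("vms", ["app-01", "app-02"]),
    StaticProvider("containers", ["web-7f3a"]),
])
```

Adding a new platform then means writing one more adapter, while the combined view stays in one place.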
I think I'm almost done. So this is what our workflow engine looked like. It's a much more traditional enterprise UI that we built. So it had a list of workflows that were running, and it would show you state of the system. This is the assembly-language build. So you had to go and you had to code your workflows into the system.
This was the first bit of our application automation thing. So we still had a wizard-based interface. We said we'll go and we'll ask you information about your infrastructure, and then we'll be able to design your application for you at some point. Worked somewhat. Well, it works for a single VM, which is what every console does. Take, I don't know, the vCenter console: if you say Launch VM, it asks you five questions and says, "These are the properties of the VM I will launch." And it works well.
Where these systems fail is when I want three VMs brought up in a particular order, and I want certain things to happen on one VM, not on the others. It just becomes very, very hairy and complicated.
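The ordering problem can be expressed as a dependency graph and solved with a topological sort. The sketch below uses Python's standard `graphlib` module; the three VMs, their dependencies and the per-VM steps are made-up examples, and `boot_plan` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Hypothetical three-VM application: each key lists the VMs it must wait for.
depends_on = {
    "db": set(),
    "app": {"db"},
    "web": {"app"},
}

# Steps that should happen on one VM only, not on the others.
per_vm_steps = {
    "db": ["create schema"],
    "app": ["run migrations"],
}


def boot_plan(graph: dict, extra_steps: dict) -> list[tuple[str, str]]:
    """Return (vm, step) pairs with dependencies always ahead of dependents."""
    plan = []
    for vm in TopologicalSorter(graph).static_order():
        plan.append((vm, "boot"))
        for step in extra_steps.get(vm, []):
            plan.append((vm, step))
    return plan


plan = boot_plan(depends_on, per_vm_steps)
```

A wizard that asks five questions per VM has nowhere to put `depends_on` or `per_vm_steps`, which is why it stops working once there is more than one machine.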
We said we will do drag-drop, and how hard can it be? Again, you can do it in Visio. You create a diagram of what your stuff looks like. If you have a new joinee in your organization, you go to a blackboard and you draw a bunch of boxes and you say, "This box talks to this box. This box has a database. This box has an application server. Maybe I have 20 of these."
So we tried to do that. And it took multiple iterations before we actually got something we were happy with. So the UX improved. This is pretty much where we ended up as a final version. So this is sometime early last year, where you had a proper drag-drop interface, and you could actually go and define what your VMs look like, how they talk to each other, how the orchestration works across the system.
And that's the end of the presentation. I think that was pretty fast. Not too bad. This was in production as of August 2016. And then the company got bought, and so we had a lot more roadmap stuff to think about. And we're now integrating a bunch of this orchestration into Nutanix, so we can talk to the Nutanix platform. We can talk to the hybrid-cloud use cases.
What else do I have? Oh, yeah. This is the part of the vendor pitch that you have. So we have our user conference end of June in Washington. I don't know how many of you will actually be able to travel there. Feel free to look it up online. We will have live streams. We'll have a bunch of announcements around automation at the conference. So follow us, have queries, let us know.
And that's about it. Not much of a vendor pitch, man. You guys should be happy.
Q&A
Questions, arguments? What did we do totally wrong, and what part of this do you not agree with?
Q: What data stack did you use to store the state?
A: Come again?
Q: What data stack did you use to store the state?
A: What data stack? Okay, we went with Postgres, and we ended up putting everything in Postgres because what we realized is that with orchestration, we wanted transactions. We had the NoSQL thing, and we tried to create a bunch of JSON and put it as documents in MongoDB, for example. We actually still do it for some parts, but it doesn't work very well because the minute you have an orchestration flow, you want to keep track of a transaction.
Because you have five steps, and you want to make sure that till all five are done, you're not committing that particular transaction. And if your system fails in the middle of a transaction, you know exactly where you were when you have to rebuild or restore that state.
So that ability to do transactions becomes very important in orchestration, especially when you're doing long-running stuff. You want to be able to track at a micro level what transaction completed. Because again, it's that immutable versus mutable stuff, right? If I know that I can run this again without worrying about it failing, then it can be another transaction, versus an existing one where I know that if this fails, I didn't do something properly.
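As a rough illustration of tracking step completion transactionally, here is a sketch that records each finished step in a relational table and resumes from the last recorded step after a failure. It uses Python's built-in `sqlite3` as a stand-in so that it runs anywhere; the talk's engine uses Postgres, and the table layout, `run_workflow` and the simulated crash are all assumptions made for the example.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps ("
        "run_id TEXT, step_no INTEGER, name TEXT, "
        "PRIMARY KEY (run_id, step_no))"
    )
    conn.commit()
    return conn


def completed_steps(conn: sqlite3.Connection, run_id: str) -> int:
    row = conn.execute(
        "SELECT COUNT(*) FROM steps WHERE run_id = ?", (run_id,)
    ).fetchone()
    return row[0]


def run_workflow(conn: sqlite3.Connection, run_id: str, steps: list) -> int:
    """Run steps in order, skipping those already recorded as done."""
    start = completed_steps(conn, run_id)  # assumes steps finish in order
    for step_no in range(start, len(steps)):
        name, action = steps[step_no]
        # `with conn` commits on success and rolls back if anything raises,
        # so a step is recorded only if its action returned normally.
        with conn:
            action()
            conn.execute(
                "INSERT INTO steps VALUES (?, ?, ?)", (run_id, step_no, name)
            )
    return completed_steps(conn, run_id)


conn = open_store()
log = []
fail_once = {"armed": True}


def configure() -> None:
    if fail_once["armed"]:
        fail_once["armed"] = False
        raise RuntimeError("simulated crash")
    log.append("configure")


steps = [
    ("provision", lambda: log.append("provision")),
    ("configure", configure),
    ("start", lambda: log.append("start")),
]

try:
    run_workflow(conn, "run-1", steps)
except RuntimeError:
    pass  # the first attempt dies after step 0 was committed

done = run_workflow(conn, "run-1", steps)  # resumes at step 1
```

Note that the database only protects the record of what completed. The side effects of a step on real infrastructure are not rolled back, which is why it matters whether a step is safe to run again.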
So we still stick to the traditional RDBMS model there. At some point it will cost us in scale, I would imagine. Maybe. But we're not there yet, so it's worked out fine for us today.
At Nutanix, internally, we also use a scale-out Cassandra architecture. We have a fork of Cassandra that we've worked on. So that is something that we're looking at using for a lot of documents and JSON stuff that we need to store, because it gives us a lot more HA and it's scale-out. So if a node goes down, your data is still there, that kind of stuff. You don't get that with Postgres. Yes.
With Postgres, you end up with an active-passive setup where you have a master-slave pair, and you're hoping you don't lose two nodes at the same time. And if the primary goes down, you hope you have time to bring up the secondary. So that's the challenge with doing that.
Q: Are you using Postgres for your Postgres cluster?
A: Come again?
Q: Are you using Postgres?
A: No, I don't believe so. Actually, we don't even run a separate Postgres cluster. Scalability-wise, we actually just put it in one VM with the rest of the engine. And we've not had issues with scale yet, so we're fairly happy with Postgres in general.
Anything else?
Cool. I think we have some goodies left at the back, probably. So feel free to grab some on your way out.
Thanks, folks, for your time. And you can go to nutanix.com to learn more about the Nutanix platform. See, there's my vendor pitch.