Staying Calm While Being Eaten by a Unicorn

London 2019

Senior Technical Marketing Engineer · Nutanix

Acquisitions are common in our industry: sometimes your company eats others, sometimes you get eaten. When this happens, there is an opportunity to merge the best of both sets of operations practices.

Early in its history, Nutanix bought a small company called Calm.io. Calm brought with it strong DevOps practices that have been infused into the Nutanix culture and have transformed the way products are brought to market. We are not sure if the unicorn ate Calm or if the unicorn was calmly devoured.

Learn how to identify which best practices should be brought into the fold, and how to manage that change for a successful transition to a DevOps-focused organization.

Michael is a Technical Marketing Engineer for Nutanix, based out of Durham, NC. His responsibilities include creating technical content and speaking to customers about Nutanix Calm, and Nutanix Cloud Native. He has a strong passion for learning new technologies, and figuring out how they can be automated.

In previous roles Michael has worked as a developer, systems administrator, and systems reliability engineer, giving him a unique view on the challenges faced by IT, and a strong belief in how they can be solved through distributed systems and automation. Michael studied Computer Science and Industrial Engineering at the University of Michigan.

Full transcript

Michael Haigh

Hello. My name is Michael Haigh. I'm a technical marketing engineer at Nutanix. In my day-to-day job, I get to speak with Nutanix customers or potential customers and create technical content for some of our cloud-native and automation products.

I appreciate you coming to Staying Calm While Being Eaten by a Unicorn. The title should make sense, hopefully, by the end of the talk. Before we get started, let's do a brief overview of what I'm going to be discussing today.

I'll talk about my company, Nutanix, so you get a feel for what we do. I'll also be talking about the company that we acquired a couple of years back, Calm.io. Next, I'll talk about how we integrated our products together, and then the architectural decisions that we made to help us achieve DevOps outcomes. Then I'll talk about how Calm has influenced the rest of Nutanix since the acquisition and the release of the new product. Depending on the amount of time we have, I'll talk a bit about how Nutanix customers implement DevOps. We'll definitely finish up with our top lessons learned.

All right. If you don't know about Nutanix, we pioneered hyper-converged infrastructure in 2009. It's essentially converging your compute, storage, and virtualization down into one platform to give you the agility and simplicity of the public cloud in your own private data center.

We're also very committed to customer choice, whether that's hardware options, hypervisor options, consumption models, you name it.

Our mission is to make data center infrastructure invisible, elevating IT to focus on the applications and services. We were founded in 2009. We have over 12,000 customers currently, almost 5,000 employees. In 2016, we went public with a valuation of about $2.2 billion at that time, officially making us a unicorn.

I won't spend too much time on this, but how do we actually make the data center invisible and allow our customers to focus on the applications rather than the infrastructure? You'll see the slide. In traditional data centers, you'll have your storage array, you'll have your compute with a virtualization or a hypervisor on top of it, and then a complex storage area network in the middle. To provision a new application, it often has to go between three or four different teams. Handoff, handoff, handoff, and we all know that is not ideal.

Rather than having that sort of architecture, Nutanix essentially converges all three of those things down into one platform. We take software and allow the local attached disk drives on the servers to get aggregated into one large storage pool, and essentially virtualize the storage controller. Now, rather than managing three or four different components in your data center, you have a single pane of glass to manage your data center. All that's required is a top-of-rack 10-gig switch, and these clusters scale really well as your demand or your needs grow.

We also have, as I mentioned, a single pane of glass for management. This is Nutanix Prism, essentially allowing you to manage your compute, your virtualization, your storage, all from that single UI. This will come into play in just a little bit. When you grow outside of a single Nutanix cluster, whether that's several in a data center or several scattered across the world, we have a product called Prism Central, which allows you to manage all of these clusters, again, from that single UI.

Switching gears a little bit, let's talk about Calm.io. This is the company that Nutanix acquired in 2016, just before our IPO. They were also founded in 2009. However, they were much smaller, with only about 35 employees at the time of acquisition. Their mission was to enable enterprises to achieve DevOps outcomes through hybrid cloud application lifecycle management and automation.

What does that mean exactly? Here's a screenshot of what we kind of internally refer to as old Calm, the pre-integration version of Calm. This is an application blueprint. You can model your bare-metal, VM, or container-based applications in a blueprint, which allows various teams within your company to collaborate on that blueprint, depending on which team they're on. You can then launch that blueprint onto any of the public clouds or your on-prem data center, and then manage that application throughout its entire lifecycle. Also, it has an integrated self-service portal to enable your end users to go in and deploy and manage their own applications, all while adhering to IT governance and quotas and budgets.

Obviously, on the third day of the conference, hopefully the terms automation, governance, and self-service are pretty familiar to you. Calm.io was essentially a platform to help our customers achieve DevOps outcomes.

But not only that, Calm internally was a very DevOps-heavy shop. The Calm software was microservice-architected. They performed extensive continuous integration and automated testing, and they also used Calm internally to enable their own developers for self-service. Nutanix was really fortunate. I don't know that we realized at the time how fortunate we were to acquire a company that had been living and breathing DevOps for years and years.

As I mentioned, in 2016 we made that acquisition, but we had some difficult choices. I've been talking about how Nutanix has this single pane of glass. You have Prism Central, which is managing your entire data center infrastructure around the world. But now we have a new product. We could certainly opt for speed to market, do a simple rebranding of the Calm software, and release it. But then we would lose a lot of the benefits of having that one pane of glass.

Alternatively, if we did integrate the two, how could we make it so the Calm.io team that we just acquired doesn't lose all of their proven DevOps methodologies that they've been implementing for years and years? Also, if we did integrate it, it would take quite a bit of time. We actually did make that decision to integrate it natively. It took us about 16 months to rewrite the software, and we were fortunate enough to be able to take that time and rewrite most of the Calm software and integrate it natively within Prism Central.

How does that look? What sort of decisions did we make to be able to help keep Calm's DevOps roots? As I mentioned, Prism Central is our multi-cluster manager. It could be deployed as a single virtual machine, or scaled out to multiple virtual machines for redundancy. But it is still more or less your traditional monolithic application.

What we decided to do was run the Calm software as a couple of Docker containers within Prism Central. This allowed for a variety of things like independent releases and microservice architecture that we're going to talk about.

How successful has this been? Traditionally, we release new Prism Central software every three to four months, which at this conference might sound like a while. I do want to note, we're providing infrastructure for large enterprises like banks and hospitals. Those sorts of corporations cannot generally upgrade every week or every day, and stability is crucial. But the architectural decisions and the way the team is run have allowed the Calm team to release every one to two months, so two to three times faster. How have we made this happen?

First up, we have a product called Lifecycle Manager in Prism Central. This allows you to upgrade the various components of your Nutanix stack independently. It could be disk or node firmware from a physical perspective, it could be the virtualized storage controller, or it could be Calm. In this screenshot, you see we'll have an available Calm update, and all the end user has to do is select the Calm containers, click update, and it'll take about 10 minutes. We'll refresh those Calm containers, all without affecting any of the rest of the software.

We also make use of extensive microservices. On the left, we have the Calm container. That's essentially the front end. It's made up of four microservices. Then the back end, the orchestration engine, which is Epsilon, is made up of seven to eight microservices. We split up our teams in this manner to allow them to work on new features where they don't impact other teams or other microservices, as long as they adhere to the published API requirements.

Just a high-level overview, I won't spend much time here, but this is kind of a general workflow of launching a Calm application from an API perspective. Calm interacts with Aplos, which is the Prism Central API engine, and all the way through, everything is API-based from start to finish within Nutanix Calm.

The next topic is from our team perspective and our new feature perspective. If we have a new feature request in Calm, we essentially create an agile, small working team. Internally, we call it a pod. It's made up of a product manager, a UI engineer, a back-end engineer, and then a software development engineer in test, so essentially a QA engineer.

A new branch will be created off of our master branch and named appropriately according to our internal naming schema. That work, depending on the feature, might take a couple of hours, it might take a couple of days, it might take a couple of weeks. Once all of the members of the pod are in agreement that the code is in a good spot, however long that takes, we kick off a Phabricator code review.

If you haven't heard of Phabricator, it's a code review tool similar to Gerrit. I know a lot of companies use Gerrit. It was originally created by Facebook and then spun out. It's a really powerful tool. Definitely recommend looking into it.

The Phabricator code review has a lot of automation built into it. It does several things, including adding a Jenkins reviewer, and then also other members of the Calm team based on some predefined rules. The Jenkins code review pipeline will get kicked off prior to any other member, any actual person reviewing the code. It takes about three hours to run through all of the automated tests.

There are a lot of unit tests at the beginning. They build all the RPMs. They even build the containers. We'll build the new Calm and Epsilon containers. They will get placed in a test instance of Prism Central, and then we do a large amount of integration testing as well. Assuming all three hours' worth of those automated tests are successful, then the rest of the actual people in that code review will have to sign off on that release.

Once it's signed off, we'll merge it back into master in an automated fashion. A different Jenkins pipeline is kicked off. It's going to look pretty similar from a screenshot perspective. The main difference is that the Prism Central, essentially the virtual machine, the monolithic app that I was talking to you guys about, is running master code, so the latest and greatest Prism Central code as well. Again, it's about three hours of automated tests, and when successful, then everything is good. If it's not successful, then we report back to the pod that created that branch, and they can figure out what went wrong and take steps to remediate it.
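The gating order described above (automated pipeline first, human sign-off second, merge last) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Review` class, its field names, and the reviewer names are hypothetical and do not reflect Phabricator's or Jenkins's actual data models or APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    """Hypothetical model of one code review; not Phabricator's data model."""
    branch: str
    automated_tests_passed: bool = False
    required_reviewers: set[str] = field(default_factory=set)
    approvals: set[str] = field(default_factory=set)


def can_merge(review: Review) -> bool:
    """Allow a merge only after the automated run and every human sign-off."""
    if not review.automated_tests_passed:
        # The automated pipeline runs first; people review only after it passes.
        return False
    # Every required reviewer must appear in the set of approvals.
    return review.required_reviewers <= review.approvals


review = Review(branch="feature/example", required_reviewers={"ui", "backend"})
print(can_merge(review))  # False: the automated tests have not passed yet

review.automated_tests_passed = True
review.approvals.update({"ui", "backend"})
print(can_merge(review))  # True: tests passed and both reviewers approved
```

Putting the automated check ahead of the human one means reviewers only spend time on changes that already build and pass the test suite.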

Okay, switching gears. Let's talk about how Calm has influenced Nutanix. I mentioned we were really fortunate to acquire a company like Calm. To a certain extent, we just had to get out of their way and allow them to do the things that they were doing really well. Since then, we've seen the success of the Calm software, the release model, and the testing model. That has influenced the rest of Nutanix.

Pretty soon here, we're releasing a microservices platform (MSP) built on Kubernetes to run all of our services, so we'll be slowly migrating services over. We have a product called Nutanix Buckets, which is an S3 object storage solution that will be the first product to run on MSP. It should be released in about a month.

We also have a customer-facing version of MSP called Nutanix Carbon. Essentially, it allows you to deploy Kubernetes clusters in your on-prem environment, and then manage them throughout their entire lifecycle. That also, like Calm, runs as Docker containers within Prism Central. We've seen how well the container architecture works and allows for those independent upgrades.

Lastly, we have a product called Xi IoT. If you were here for the lightning talks last night, my colleague Dave Hocking talked about implementing this using a Lego tool that sorts marbles. Essentially, it allows our customers to run containers at the edge at scale.

All right. Let's spend a bit of time on how Nutanix customers implement DevOps. I don't want this to be a product pitch, so I'm going to be talking about competitors as well, just general things that we've seen that have worked well, whether it's with our products, with some competitors, or public cloud providers.

Obviously, the core infrastructure needs some sort of compute and virtualization, and then some sort of storage services. Doing this, whether it's in the public cloud or a private cloud in your own data center, is critical. But you need to have some sort of cloud-like functionality where you really can easily spin new workloads up and tear them down at will.

I mentioned Buckets earlier. Object storage, if you haven't used object storage at all, like Amazon S3, is a fantastic way to provide storage to your Kubernetes-based applications or provide backup. It's massively scalable and works really simply with HTTP requests, so it's very friendly for developers.
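To make the "simple HTTP requests" point concrete, here is a minimal sketch using only Python's standard library. The endpoint, bucket, and object key are hypothetical placeholders, and the request is built but never sent. Real S3-compatible stores normally also require signed requests, which an SDK would usually handle; signing is left out here.

```python
import urllib.request

# Hypothetical endpoint, bucket, and key; replace with your own values.
ENDPOINT = "https://objects.example.internal"
BUCKET = "shop-images"
KEY = "products/logo.png"


def build_put_request(data: bytes) -> urllib.request.Request:
    """Build (but do not send) an HTTP PUT that would upload one object."""
    url = f"{ENDPOINT}/{BUCKET}/{KEY}"
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": "image/png"},
    )


request = build_put_request(b"example image bytes")
print(request.get_method(), request.full_url)
```

An upload is a PUT to a URL and a download is a GET from the same URL, which is why object storage tends to be easy to script against.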

You can also use, for other Kubernetes workloads, most storage companies have something called a CSI driver, Container Storage Interface. Essentially, that allows you to use block-based or file-based storage for your Kubernetes workloads in a persistent manner.
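As a rough illustration, a workload typically asks for persistent storage through a PersistentVolumeClaim that names a storage class backed by the CSI driver. The sketch below builds such a claim as a Python dictionary and prints it as JSON. The claim name, the storage class name `csi-block`, and the size are assumptions for the example; the real class name depends on the driver installed in the cluster.

```python
import json

# Hypothetical names: "web-data" and the storage class "csi-block" are
# placeholders; the class name depends on the CSI driver installed.
persistent_volume_claim = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "web-data"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "csi-block",
        "resources": {"requests": {"storage": "10Gi"}},
    },
}

print(json.dumps(persistent_volume_claim, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts manifests in JSON as well as YAML.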

For the more traditional back ends, a database-as-a-service product is hugely critical. If you have developers that have to go to the database team to request a new dev or test database, and they have to wait a week, we know that's not going to be very successful. The ability to spin up new databases on the fly via API is hugely critical. Something like Amazon RDS or Aurora works really well. We have a product called Era that does a similar thing in your on-prem environment, and it also does some advanced copy data management.

I've mentioned Carbon already, a Kubernetes platform. Kubernetes has a ton of advantages. But if you have ever run a team that runs Kubernetes, or if you yourself run it, you know there are a lot of challenges involved as well, between maintenance and upgrades, and just the general day-two operations are often very difficult. Unless you're an enormous behemoth company, some sort of managed offering like GKE or EKS is a great idea.

Automation: I've been talking about how much we've used automation internally with Jenkins, and we also use Calm internally. A lot of our customers also use Terraform, for instance, so you can stand up new Nutanix clusters or use Terraform to stand up new public cloud workloads. It's hugely critical. Again, we don't want to have our developers waiting around on our operations teams to stand up new environments.

Lastly, some sort of application observability and monitoring tool is hugely critical. If you're switching over from a traditional application to a cloud-native application, you often can't use the same observability and monitoring tool. They're just not really built for the distributed nature of cloud-native apps.

I'm going to show a short video, which I'll walk through. We'll actually show off Nutanix Calm a little bit and then just kind of show a general workflow. Excuse me, let me advance the slide one more. This is the general workflow of what we'll see in the video. This allows your developers, through automation, to just do a simple Git commit and Git push, and at the end of the 10 minutes or so, as the environment gets stood up, they'll have a fully functional application they can perform additional tests on, or ideally do automated testing.

Let's take a look at how that works in action. This is Nutanix. This is Prism Central, as I've been mentioning, our multi-cluster manager. We'll go ahead and navigate down to the Calm section. This is a list of our application blueprints, and we'll select a particular application blueprint that is using some of the products that I just mentioned.

First up, we're talking to Era through APIs to provide a production-grade PostgreSQL database. The cool thing about Calm also is it deploys onto the public clouds. If you are looking to have your end users be able to choose, do I want to deploy this on-prem or in AWS, for instance, now you could build this same blueprint with a new application profile, and we could talk to RDS and call RDS APIs instead of Era.

We're also tying into Nutanix Buckets to store our images for our application. Again, that's our S3-compliant object storage solution, so something like Amazon S3 works really well.

Lastly, we have a container, essentially our web app. We'll see it running momentarily. Essentially, it's built on Oscar, which is an open-source e-commerce website. If you Google it, you'll see it. It's built on the Django Python framework. There's a deployment of six containers. You see their normal Kubernetes selectors and labels. Finally, it's being exposed via a load balancer service on port 8000.
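A Deployment and Service along these lines might look like the sketch below, written as Python dictionaries and printed as JSON. Only the port (8000) and the load balancer service type come from the talk. The example reads "six containers" as six replicas, and the resource names, labels, and image path are hypothetical.

```python
import json

# Hypothetical labels, names, and image; the replica count assumes that
# "six containers" in the talk means six replicas of one pod.
LABELS = {"app": "oscar-web"}

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "oscar-web"},
    "spec": {
        "replicas": 6,
        # The selector must match the labels on the pod template below.
        "selector": {"matchLabels": LABELS},
        "template": {
            "metadata": {"labels": LABELS},
            "spec": {
                "containers": [
                    {
                        "name": "web",
                        "image": "registry.example.internal/oscar-web:latest",
                        "ports": [{"containerPort": 8000}],
                    }
                ]
            },
        },
    },
}

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "oscar-web"},
    "spec": {
        "type": "LoadBalancer",
        # The service finds its pods through the same labels.
        "selector": LABELS,
        "ports": [{"port": 8000, "targetPort": 8000}],
    },
}

print(json.dumps(deployment, indent=2))
print(json.dumps(service, indent=2))
```

The shared labels are what tie the pieces together: the Deployment manages pods carrying them, and the Service routes traffic to whichever pods currently match.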

The next part, depending on time, we'll be able to see Epoch, perhaps not. Essentially, Epoch is that application observability monitoring tool. Then we also have a load generator. If you are subjecting this to automated testing, you probably want load on the application.

To see a normal developer workflow, we'll do a Git status, and we see we have a couple uncommitted changes, and then we'll do a Git push. Those are things your developers are doing every day. That's going to trigger the Jenkins build. In this instance, we see we have a failure. We're going to get the fast feedback that Jenkins provides, which is just fantastic. We're going to see I have a simple typo in my commit. We can go ahead and get, for instance, an RSS notification for a developer that our build failed, and we can very quickly and easily make that change while it's fresh in our minds.

We'll fix that and do another commit and push. This time, we should see it build successfully. What happens is Jenkins is going to do a lot of unit testing beforehand. If it builds a container successfully, it gets pushed to Docker Hub or an internal container registry. Then it'll launch that Calm blueprint that we just went over. This will take about 10 minutes, so the video is going to fast-forward. But we can do a full audit trail of exactly how that application is getting deployed and if it runs into any errors.
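The sequence in the demo (unit tests, container build, registry push, blueprint launch) follows a common fail-fast pattern. The sketch below shows that pattern in plain Python. The stage names mirror the talk, but the stage functions are hypothetical stand-ins and this is not how Jenkins or Calm is actually configured.

```python
from typing import Callable

# A stage is a name plus a function that reports success or failure.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order; stop at the first failure and return a log."""
    log = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break  # fast feedback: later stages never run
    return log


# Hypothetical stand-ins for the real steps; each one simply succeeds.
stages = [
    ("unit tests", lambda: True),
    ("build container", lambda: True),
    ("push to registry", lambda: True),
    ("launch blueprint", lambda: True),
]

for line in run_pipeline(stages):
    print(line)
```

Stopping at the first failed stage is what gives the developer a quick, specific signal, as with the typo in the first commit of the demo.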

Once it's actually running, now we have this application that our developers can test. We're going to go ahead and access the application in a second, and we'll see our new storefront. This can then be subjected to further automated testing or manual testing. We'll grab that IP and enter it in our web browser, and we'll see we have our e-commerce website with some Nutanix products. We see that we have our new shop tagline: Built for DevOps Enterprise Summit.

I'm going to pause there. I'm probably going to skip the Epoch portion from a time perspective. I definitely want to leave plenty of time for our lessons learned here at Nutanix.

These lessons come from two places. One is the Calm CEO, who's now a general manager at Nutanix. He started out as a developer, and he's been doing some sort of continuous integration and continuous delivery since the late 1990s, long before the rest of us came around to this. The other is just from talking to customers day in, day out.

First up, in our opinion, your automated test suite coverage is absolutely critical, if not the most important thing. These numbers are not hard-and-fast rules. Everybody's environment is a little different. But anywhere between 60% and 70% of automated test coverage we feel is very good. Above 70% is absolutely excellent. On that note, don't try for 90%-plus, 100% coverage. You're going to be spending all of your time actually writing test cases and not developing new features.

Really the only way to do this is require coverage at commit. When I talked about the agile small pod earlier with the QA engineer, they have to have their test cases written and done before that code can get checked in to merge to the master branch. If you say, I'll get around to this tomorrow, tomorrow you're going to be working on some brand-new feature and not actually get around to it.
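One way to enforce this mechanically, which the talk does not describe, is a threshold check that runs before a change can be merged. The sketch below encodes the rough bands mentioned above. The "below target" label, the 60% default minimum, and the function names are assumptions for illustration, not rules from the talk.

```python
def rate_coverage(percent: float) -> str:
    """Map a coverage percentage onto the rough bands from the talk."""
    if not 0 <= percent <= 100:
        raise ValueError("coverage must be between 0 and 100")
    if percent > 70:
        return "excellent"
    if percent >= 60:
        return "very good"
    return "below target"  # hypothetical label for anything under 60%


def allow_commit(percent: float, minimum: float = 60.0) -> bool:
    """Hypothetical commit gate: reject changes that fall under the minimum."""
    return percent >= minimum


for value in (45.0, 65.0, 82.5):
    print(value, rate_coverage(value), allow_commit(value))
```

Since the numbers are not hard-and-fast rules, the minimum is a parameter that each team can set for its own environment.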

If possible, from a code perspective, do not use multiple repositories. We often see our customers think, if I want to do microservices, then that means I have to do multiple repositories. Ideally, it's not required to do multiple repositories for microservices, and avoid it if at all possible. There are certain cases where you have to do multiple repositories, but it's going to make your automation a good deal more difficult.

Next, Kubernetes is a means to an end, but it is not required to do DevOps. You can certainly do DevOps through extensive automation in virtual-machine or bare-metal-based instances as well, automation being the key.

Lastly, kind of similar in that vein, do not simply containerize. What I mean by that is don't take your monolithic application, containerize it, and go run it via Docker or run it through Kubernetes and expect you're going to see some sort of savings. The true savings of containers comes through microservice-based architecture. That often requires rewriting your application. Whether that's through an acquisition or a small new feature of your new product, that's a great place to start. But again, don't just containerize your existing app and expect to see any efficiencies.

We're supposed to finish with, here's what I'm looking for help with. This is more of a personal thing. I really want to learn Go. If anyone is familiar with Go, I've never used it, and I know we're using it a lot more internally at Nutanix, so it's something on my to-do list. If you have any recommendations as far as tutorials or any Go recommendations, let me know.

Again, thank you for your time today. My name is Michael Haigh. I welcome feedback, so please come ask me any questions or provide any feedback. Thank you.