DevOps for 24 x 7 SaaS and Disaster Recovery: Know What It Takes Before You Leap Into It

Europe 2022

Amitabh Prasad

Software Architect · IBM India

Shikha Srivastava

Distinguished Engineer and Master Inventor · IBM

We've all heard of and practiced variations of disaster recovery methodologies for achieving 24 x 7 availability for SaaS applications. Optimal RPO (recovery point objective) and RTO (recovery time objective) depend on a range of factors, including where the application is running, interdependencies among microservices, and dependencies on datastores and storage, among others.

Based on our experience of SaaS-ifying applications to run on hyperscalers, we are identifying patterns that our applications can adopt to optimize RPO (recovery point objective) and RTO (recovery time objective).

We will describe our challenges, learning, and patterns for next-gen SaaS applications, and guidance on how to adopt them in the delivery of your next SaaS applications running in single or multiple clouds.

Full transcript

Shikha Srivastava

Hello, everybody. Thanks for joining our session on DevOps for 24/7 SaaS and Disaster Recovery.

My name is Shikha Srivastava. I'm a distinguished engineer at IBM. I'm working on getting IBM's SaaS running on multiple clouds. And with me is Amitabh.

Amitabh Prasad

Hi, guys. My name is Amitabh Prasad. I am working as a software architect with IBM India Software Labs, and I'm also working with Shikha on SaaS enablement of different IBM products and services.

Shikha Srivastava

And the topic we chose today is in the context of SaaS. As you can tell, we both are working on getting SaaS running on multiple clouds. And the agenda that we thought of is: start with SaaS, and what are the challenges with getting SaaS up and running? One of those challenges is business continuity, and how critical disaster recovery is.

What are the different disaster recovery models? Based on our learning as we've been working with multiple clouds, what we have learned, we want to bring that to you. And as we are working on getting our SaaS to multiple clouds, there's a pattern that we are establishing for our SaaS for the disaster recovery, and those are the patterns that we are planning to bring to you today.

So, starting with SaaS. SaaS is the business application that runs on top of the layers provided by the cloud provider. First is the infrastructure layer: virtual machines, compute resources, network, and storage. Then there is the platform on top of it, which is all the services that the SaaS would need to run. This includes things like container orchestration with Kubernetes, CI/CD pipelines, auto-scaling of the underlying compute resources, and other application development tools.

And the SaaS is the business application that provides the business capability or the user capability that businesses are looking for to include in their mission-critical needs or to use the SaaS directly for their users.

As we look at SaaS, the most critical thing that gets talked about is the SLA, the service level agreement. A customer needs to see how stable a SaaS is, meaning consistent, acceptable performance; what the availability is, the 24 by seven availability of the service itself; and how reliable it is for customers to leverage that particular SaaS in their own mission-critical business applications.

Resiliency: SaaS should be able to upgrade and update with security fixes, features, and bug fixes, with zero downtime. Security and compliance go hand in hand with SaaS. SaaS requires some data to be saved. SaaS requires the services to be compliant, no CVEs, et cetera, and all of that is security and compliance. And there are a lot of regulatory compliance requirements that SaaS has to adhere to. All of this and more constitutes the SLA, the service level agreement that any SaaS consumer looks for before signing up to use that SaaS.

And then, for that service to be delivered with the right SLA, there are a lot of things that happen at the ground level to bring the SaaS to multiple clouds. Collaboration and communication is the one that we think is the most important. It also includes the aspect of shift left for our SREs. So the SRE aspects, which include disaster recovery along with observability and others, should start early in SaaS development rather than later, which is where collaboration and communication come in.

Continuous integration of code into the SaaS while it is running: bug fixes, upgrades, CVE fixes, security fixes, et cetera. And continuous delivery, which is deploying the fixes to the SaaS where it's running. Observability includes the logging, monitoring, and audit capabilities that are all required to make sure that the SaaS can be continuously monitored and continuously logged for 24 by seven availability.

Change and incident management. Changes need to be applied. Incidents will happen. The question is how that management happens while assuring a secure service and 24 by seven availability. Security and compliance is a key aspect. SaaS needs to be secure and compliant, and shifting this left is also critical. Doing a lot of the compliance checks in the CI/CD pipeline is critical for a SaaS to be secure and compliant in multiple clouds.

Business continuity. This is the topic that we're going to discuss today. SaaS can be running in a single region, and when running in a single region, it can be in multiple zones. That assures that, as long as the region doesn't go down, the service is available 24 by seven, assuming all the other characteristics of SaaS are available and stable. But if the region goes down, then what happens to the service? That's where a lot of the disaster recovery topics come into play. And if you're running your service in a region with only a single zone, then you have to worry about disaster recovery as well. And test, quality, and scale are all critical as well for the SaaS as it gets provisioned and launched.

As I mentioned, disaster recovery is the topic for today. We have been deep into this topic as we're bringing SaaS to multiple clouds. What is it the cloud provides? What is it that we have to make sure the application considers when running as a SaaS for disaster recovery? All of that, we will go through in this topic today.

So, why think about disaster recovery? Disaster recovery is one of the key aspects of the SLA. When a customer signs up to use the SaaS, they look at the SLA. The SLA covers a lot of the characteristics that I mentioned before, and it also includes disaster recovery: how soon you can recover the service if a disaster happens.

In SaaS, service recovery is more critical than the parallel debugging, root cause analysis, and fixing of the bugs that caused the outage. Making sure that the service can be recovered instantaneously, or within the agreed-upon SLA, is the most critical item to think about. And as we thought about it, there were two key metrics that we had to look into.

The first is the recovery point objective, or RPO. It is the amount of data that a service provider can afford to lose. The second is the recovery time objective, or RTO, which is the amount of downtime a business can tolerate while still providing a good SaaS. So those are the two critical metrics we have kept in mind as we are delivering SaaS across multiple clouds.
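The two metrics can be made concrete with a small calculation. The following is a minimal sketch, not something shown in the talk: the function name, the timestamps, and the targets are all illustrative assumptions. It measures the data-loss window and the downtime for one incident and compares them against the agreed RPO and RTO.

```python
from datetime import datetime, timedelta

def check_recovery(last_backup, outage_start, service_restored, rpo, rto):
    """Return (data_loss_window, downtime, rpo_met, rto_met) for one incident."""
    # Anything written after the last good backup is lost: this is what RPO bounds.
    data_loss_window = outage_start - last_backup
    # Time the service was unavailable: this is what RTO bounds.
    downtime = service_restored - outage_start
    return data_loss_window, downtime, data_loss_window <= rpo, downtime <= rto

# Hypothetical incident: last backup 8 minutes before the outage, service back after 6 hours.
outage = datetime(2022, 5, 1, 12, 0)
loss, down, rpo_ok, rto_ok = check_recovery(
    last_backup=outage - timedelta(minutes=8),
    outage_start=outage,
    service_restored=outage + timedelta(hours=6),
    rpo=timedelta(minutes=10),  # assumed target
    rto=timedelta(hours=5),     # assumed target
)
print(f"data loss window: {loss}, downtime: {down}, RPO met: {rpo_ok}, RTO met: {rto_ok}")
```

In this made-up incident the RPO is met (8 minutes of data at risk against a 10 minute target) while the RTO is missed (6 hours of downtime against a 5 hour target), which shows that the two objectives have to be tracked separately.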

As we looked into disaster recovery, one thing is for sure: disaster recovery is expensive. It's critical to be clear on why you back up, and the why is really driven by business continuity and the SLA that is signed between the provider and the consumer of the SaaS.

What to back up. Do you need to back up all the data? Not necessarily. As we looked at our SaaS services, we have a range of them. There are some management services, and there are some services that provide direct value to the customers. How much management data you need to store, versus what you can recreate as the service comes back up, is a key thing to look into. And if it's runtime data and the service is built upon that data, then it's critical to back up that data. So really, critical thinking on what you need to back up is required.

Where to back up. The key point there is security and privacy for the customer data. If your service is running in an EU region, for example, and you're storing some of the customer data as a part of your SaaS service, then it's critical that that data is stored or backed up in the same region and not in another region, like a US region.

How to back up. Backup is done iteratively and continuously, should be done at frequent intervals, and has to be well thought through with automation put in place. How much data to back up, and how to delete the previous data based on the agreed-upon SLA, should all be automated. Doing these steps manually is waiting for a disaster to happen. So putting automation in place is critical for both backup and the next point, recovery.
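As one illustration of the deletion automation mentioned here, the sketch below selects backups that have aged out of a retention window. It is a hypothetical helper, not the speakers' tooling; the retention period and the rule of always keeping the newest backup are assumptions made for the example.

```python
from datetime import datetime, timedelta

def backups_to_delete(backup_times, now, retention):
    """Return backups older than the retention window, oldest first.

    The newest backup is always kept, even if it has aged out, so that
    pruning can never leave the service without a restore point.
    """
    if not backup_times:
        return []
    newest = max(backup_times)
    return sorted(t for t in backup_times if now - t > retention and t != newest)

# Hypothetical backup history and a 7-day retention agreed in the SLA.
history = [datetime(2022, 5, 1), datetime(2022, 5, 5), datetime(2022, 5, 9)]
expired = backups_to_delete(history, now=datetime(2022, 5, 10), retention=timedelta(days=7))
print(expired)
```

A scheduled job could run a check like this after each backup cycle and hand the result to whatever deletes objects from the backup store.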

The recovery plan should be put in place as the SaaS goes live. An algorithm for recovery with automated steps, well-rehearsed and well-practiced, is critical for meeting the SLA for the SaaS.

Now, as we looked into our SaaS disaster recovery models for different clouds, we also ran into some challenges around what the cloud provider provides and how we should do it. That led us to the different models we can think of, where we should start, and what is critical for us.

Amitabh, do you want to go through the different models?

Amitabh Prasad

Sure. Thanks, Shikha.

So, moving on to the disaster recovery models, right? These are the four models that are generally talked about. And the point to note here is that these models are not new and have been around for a while. The only thing that changes in the cloud-native world is how and where you want to back up the data.

So, to start with, the simplest one is backup and recovery, or backup and restore, whereby we back up system data, or all the required data, in some kind of cold storage, and then bring it up as part of a cold recovery. Obviously, the RTO and RPO will be at their maximum in this case, and we'll touch on this a little later.

The second one is pilot light. There, some part of your back-end service is running. A simple example could be that your database replication is happening in a DR region, while your compute resources or your servers are not active. And in case you want to bring up your cluster in the disaster recovery region, you get your database up and running, bring up your services, and you can start serving client apps once again.

The third one is warm standby, whereby a scaled-down version of your application is running, and you have certain automation in place so that when you need to scale up your application to recover from a disaster, you can quickly bring it up. And the fourth one is multi-site active-active. As the name suggests, there are multiple regions serving clients, and this is the model you use when you have a requirement of 24/7 availability.

So one thing to note here is that between these four models, the critical part is the RTOs and RPOs. When we talk about backup and restore, the RTOs and RPOs can go up to hours, say maybe eight or 10 hours or maybe more. For pilot light, you may bring it down with certain automation in place, maybe to 30 to 45 minutes. With warm standby, maybe 10 minutes.

So one thing I would like to point out here is that in each of the first three models, there will always be some recovery time, and there will be some time when the system is down. Only with multi-site active-active are you available 24/7. But at the same time, it is the most complex technically, and it's quite expensive. And if your service has a 24/7 requirement, this should be thought through right before taking it to SaaS, because your 24/7 requirement will drive what kind of back-end services you can have.
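The trade-off between the four models can be sketched as a simple lookup. The recovery times below are the rough figures quoted in the talk (the upper end of each range); the cost ordering of the first three models and the helper name `cheapest_model` are assumptions made for illustration. The talk itself only states that active-active is the most complex and expensive.

```python
from datetime import timedelta

# Rough recovery times quoted in the talk, listed in an assumed order of increasing cost.
DR_MODELS = [
    ("backup and restore", timedelta(hours=10)),
    ("pilot light", timedelta(minutes=45)),
    ("warm standby", timedelta(minutes=10)),
    ("multi-site active-active", timedelta(0)),
]

def cheapest_model(rto_target):
    """Pick the first (assumed cheapest) model whose typical recovery time fits the target."""
    for name, typical_rto in DR_MODELS:
        if typical_rto <= rto_target:
            return name
    # No model fits (for example, a negative target): fall back to the strongest one.
    return DR_MODELS[-1][0]

for target in (timedelta(hours=12), timedelta(hours=1), timedelta(minutes=15), timedelta(0)):
    print(target, "->", cheapest_model(target))
```

The point of the sketch is that the RTO target agreed in the SLA, rather than a technology preference, is what selects the model.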

Next slide. Next slide, Shikha.

Shikha Srivastava and Amitabh Prasad

Sorry, my computer froze.

No problem.

Yeah.

Amitabh Prasad

So before we go further, let's go ahead and look at a typical SaaS architecture. I think Shikha touched on this earlier: you will have some kind of management layer data and some client-specific data. But we'll go a little deeper here to understand it better. This is also important for understanding the different places where an application can hold data and what is needed to protect them.

So, as we talked about, any SaaS deployment should be highly available, and if at all possible, you deploy it across different availability zones, so that automatically takes care of any zone failure or data center failure, along with any hardware failure in that zone. And you'll always have multiple replicas of your application deployed. When we talk about an application in the Kubernetes world, we are talking about pods, PVs, CRs, and the different Kubernetes objects that get created. And since we have multiple replicas, that automatically gives a degree of redundancy and also helps manage a certain amount of load.

Now, the other piece to look-- No, go back to the previous slide. Yeah. A couple of other things to note: a SaaS deployment should always support multi-tenancy, and this is to share infrastructure cost while maintaining the desired level of isolation. Now, isolation in Kubernetes can be achieved in multiple different ways, but one example is namespace-level isolation.

So what happens is, when different tenants sign up for your system, you spin up a new namespace and you deploy certain network policies so that application data cannot be shared across different namespaces. So the point we are trying to make here is that there will be certain runtime data, like network policies, maybe some secrets and ConfigMaps, that may contain the client's information, whereas there may be some management objects that can be recreated. So it's imperative to understand which is which.
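Namespace-level isolation of this kind is commonly expressed as a Kubernetes NetworkPolicy that admits ingress only from pods in the same namespace. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. The policy name and the tenant namespace are hypothetical, and the talk does not say which policies the speakers actually deploy.

```python
import json

def tenant_isolation_policy(namespace):
    """Build a NetworkPolicy that only admits traffic from pods in the same namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            # An empty podSelector applies the policy to every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # A bare podSelector peer matches pods in the policy's own namespace only,
            # so traffic from other tenants' namespaces is not admitted.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }

policy = tenant_isolation_policy("tenant-a")  # hypothetical tenant namespace
print(json.dumps(policy, indent=2))
```

A per-tenant object like this is exactly the kind of runtime configuration the speakers describe: it is created at sign-up time, so it has to be either backed up or reproducible from Git.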

And the other two are cloud databases, where any application will end up storing certain state, and volumes. Any serious application these days has some kind of stateful sets, which is where PVs and PVCs come into the picture and the data gets stored in volumes.

So, yeah. Next slide, please.

Now, based on this, because we have an HA deployment, we decided to go ahead with a robust backup and restore for business continuity. The HA deployed architecture automatically gives us 99.99% availability per year. So this strategy ensures that we have the required availability as per the service requirement. At the same time, we are protecting ourselves against any natural or man-made disaster.
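To put the 99.99% figure in perspective, the downtime budget it implies can be computed directly. This is plain arithmetic rather than anything specific to the speakers' setup; the function name is illustrative and a 365-day year is assumed.

```python
def allowed_downtime_minutes(availability_percent, days=365):
    """Minutes of downtime permitted per period at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100

budget = allowed_downtime_minutes(99.99)
print(f"99.99% per year allows about {budget:.2f} minutes of downtime")
```

At 99.99%, the budget is about 52.56 minutes per year, which is why a regional outage with a multi-hour recovery time has to be handled as a separate disaster recovery case rather than absorbed by the availability target.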

So our current technique is giving us an RTO and RPO of approximately 12 hours when we are talking about regional failure. Within the same region, it is less than 10 minutes of RPO and about five to six hours of RTO. We are working to bring those down further. The reason we have 10 minutes of RPO within the same region and a higher one for regional failure is certain technical challenges in backing up or moving the data across regions.

So, as we discussed on our previous slide, we have three different locations where runtime data can be created. One is runtime configuration, which may contain certain client-related information. The second one is volumes. Again, volumes are the way stateful sets store their data, so we have to protect those. In our case, we use volumes for in-cluster databases as well as for storing certain files, the files that get processed, which are huge in our case. And the third category is the cloud databases. Again, those contain mainly runtime data, client data, and application data.

So how do we protect these, right? This slide gives a brief idea of what our backup design looks like. As we have discussed, we have three categories of resources, and we'll go over the different techniques we use to protect all three. At the top are the Kubernetes objects. There we use Velero, an open source tool used for protecting Kubernetes-related resources. We'll drill in a little more as we move ahead.

And the next category is volumes. The way we take our volume backups is with volume snapshots. We tag the required volumes, the ones that are mapped to a persistent volume and persistent volume claim, to identify the volumes that need to be backed up. Then we use the snapshot technique made available by the cloud provider to take the snapshot and move it across regions.

And the third category is the cloud services, mainly databases in our case. Since we consume these as a service, we rely completely on the service provider's APIs to take backups. What we do is identify the cloud services based on the product that is using them, and configure them so that there is a continuous backup using the cloud provider's backup agent. That backup gets moved to the S3 bucket, and then subsequently moved to the DR region.

Yeah, next slide, please.

Shikha Srivastava

So, as we were doing all of the work to get our SaaS onto multiple clouds, with the different options that Amitabh talked through and what exactly we are doing in each of the different use cases of how SaaS is built, we came up with a set of patterns.

There are five of them. As any of our SaaS offerings goes to multiple clouds, it looks at these five patterns. The first is to back up runtime Kubernetes objects. There are CRs that SaaS services create when they run on Kubernetes, and that's an example of a runtime artifact that needs to be backed up. Now, whether it is critical to back that up versus regenerating it through GitOps is another question.

The second pattern is around configuring cloud data stores or cloud resources for backup. Do not try to back up the cloud data stores on your own. Rely on the cloud provider's capability and configuration for this. The third is to enable volume replication, and Amitabh will drill into this one a lot more. The fourth is to handle non-cloud data stores. Try not to use a lot of non-cloud data stores. If you do, then there are techniques to apply to handle those. And the last one here is GitOps: use GitOps to redeploy static and runtime objects wherever you can. That is a very reliable way of recovery. So those are the five patterns we have established as we look at disaster recovery for our SaaS.

And with that, Amitabh, can you drill down into each one of them as to what the pattern is?

Amitabh Prasad

Sure. So to start with, we are talking about Velero. In our case, we deploy Velero using OADP, the OpenShift API for Data Protection. The advantage we get through OADP is that Velero can be deployed like any other operator, right? Basically, it supports operator deployment.

And the other thing is, since it's a wrapper over Velero, it allows you to call Velero APIs in a cloud-native way, as if you're calling any other Kubernetes API. So that's the advantage we get through OADP.

Apart from that, as far as Velero goes, right, Velero is an open source tool. We talked about that. It has the capability of backing up an entire namespace, or it can back up only a subset of objects. In our case, we use the second technique. What we do is label all the resources that are runtime, and then we use Velero to pick up only the runtime resources and put them into an S3 bucket. It can be any S3-compliant object store. It doesn't have to be an AWS S3 bucket as such. And once that data is in the S3 bucket, it gets replicated across regions based on the frequency and our requirement.
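A label-scoped backup of this kind can be expressed as a Velero `Backup` custom resource with a `labelSelector`. The sketch below builds one as a Python dictionary and prints it as JSON. The label `backup-scope: runtime`, the backup name, the tenant namespaces, the `openshift-adp` namespace, the storage location name, and the TTL are all assumptions made for the example; the talk does not give the actual values.

```python
import json

def runtime_backup(name, namespaces, label_key="backup-scope", label_value="runtime"):
    """Build a Velero Backup resource that captures only labeled runtime objects."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Backup",
        # Assumed: Velero installed through OADP in the openshift-adp namespace.
        "metadata": {"name": name, "namespace": "openshift-adp"},
        "spec": {
            "includedNamespaces": namespaces,
            # Only objects carrying this (hypothetical) label are included.
            "labelSelector": {"matchLabels": {label_key: label_value}},
            # Assumed name of a BackupStorageLocation pointing at an S3-compatible bucket.
            "storageLocation": "default",
            # Assumed retention: Velero expires the backup after 30 days.
            "ttl": "720h0m0s",
        },
    }

backup = runtime_backup("tenant-runtime-backup", ["tenant-a", "tenant-b"])
print(json.dumps(backup, indent=2))
```

Because the resource is an ordinary Kubernetes object, it can be applied and scheduled with the same tooling as the rest of the deployment, which matches the advantage the speakers attribute to OADP.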

So, the next slide, please.

The second category is the cloud resources. Among cloud resources, what we mainly have to back up are the data stores, because that is where the application state is. In our case, we try to make as much use of cloud databases as possible, so cloud-native database as a service. As far as backing up those cloud services goes, all we did was identify the cloud resources that need to be backed up for a particular service. And again, we went with a tagging approach.

Whenever the backup agent runs, it identifies all the resources for a particular service that need to be backed up, creates a snapshot, puts that into an S3 bucket, and then replicates it across regions. Because we are using cloud-native data stores and relying completely on the cloud provider's backup techniques, we automatically get incremental backups, and we really do not have to pause our application to take any backups here.

So next slide.

And the third one in this category is volumes. We'll spend a little time on volume backup, because most of the applications we are getting these days have some kind of STS in them, that is, stateful sets. A stateful set stores information using a persistent volume and persistent volume claim. This PV then gets mapped to a particular volume, or a particular volume ID, on the infrastructure side.

So what we did was create a small controller that runs on each Kubernetes cluster and monitors all the application namespaces. Whenever a volume is created, it identifies the set of volumes that need to be backed up. There could be certain management-related information stored in a volume that doesn't need to be backed up. So we are not blindly going and backing up every volume, but rather only what is needed for our application to be restored and whatever cannot be recreated.
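The selection step of such a controller can be sketched in a few lines. This is not the speakers' controller: the record layout, the `backup-scope` label, and the volume IDs are hypothetical, and a real controller would watch PersistentVolumeClaims through the Kubernetes API and call the cloud provider's snapshot API instead of working on in-memory dictionaries.

```python
def volumes_to_snapshot(pvcs, label_key="backup-scope", label_value="runtime"):
    """Return (namespace, claim name, volume ID) for claims that must be backed up.

    A claim is selected only if it carries the runtime label and is already
    bound to an infrastructure volume; everything else is skipped.
    """
    selected = []
    for pvc in pvcs:
        labeled = pvc["labels"].get(label_key) == label_value
        if labeled and pvc.get("volume_id"):
            selected.append((pvc["namespace"], pvc["name"], pvc["volume_id"]))
    return selected

# Hypothetical claims as a controller might see them.
claims = [
    {"namespace": "tenant-a", "name": "db-data", "labels": {"backup-scope": "runtime"}, "volume_id": "vol-001"},
    {"namespace": "tenant-a", "name": "mgmt-cache", "labels": {"backup-scope": "management"}, "volume_id": "vol-002"},
    {"namespace": "tenant-b", "name": "uploads", "labels": {"backup-scope": "runtime"}, "volume_id": None},
]
print(volumes_to_snapshot(claims))
```

Only the bound, runtime-labeled claim is selected. Keeping the namespace and claim name alongside the volume ID is one way to make it possible to rebuild the PV and PVC around a restored volume in the DR region.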

So with this technique, we take a snapshot of those volumes using the infrastructure or cloud provider APIs. Now, Velero also provides a couple of ways to take volume backups, including snapshots. But what we realized is that when we move those snapshots across regions for a disaster recovery requirement, they were not adequate. It was always a challenge to identify the volume, convert that volume into PVs and PVCs, and put that back into the cluster.

For this particular technique, we have a link in our resources section. It gives a very neat way of taking a backup of the volume and converting it into PVs and PVCs in the DR region. Yeah, next slide, please.

So the last pattern we want to talk about is GitOps. Now, GitOps is used primarily for application deployment. In our case, we use Argo CD as our GitOps tool. It sits between Git and all your registered clusters, and it continuously monitors for any drift between what is on the cluster and what is currently in the Git repo. If it identifies any drift, it can resync automatically, or it can raise an alarm so that an admin can come and synchronize the data.

So basically, we use this identification of drift to our advantage. During restore, what we do is recreate our databases, convert volumes into PVs and PVCs, and restore any of the runtime config. Once that is done, we point Argo CD to the DR cluster and let Argo CD identify what is missing and recreate the rest of the resources. So this is how we are addressing our RTO requirement: it allows us to meet the RTO that is defined by the product team.
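The idea of using drift for recovery can be illustrated as a set difference between what Git declares and what already exists on the DR cluster after the data restore. The sketch below is a conceptual illustration only. It does not call Argo CD, and the resource names are invented.

```python
def missing_resources(desired, present):
    """Resources declared in Git that do not yet exist on the cluster."""
    return sorted(desired - present)

# Desired state, as it might be declared in the Git repo (hypothetical names).
desired_state = {
    ("Deployment", "api"),
    ("Service", "api"),
    ("StatefulSet", "db"),
    ("PersistentVolumeClaim", "db-data"),
    ("ConfigMap", "tenant-config"),
}

# State of the DR cluster after volumes and runtime config were restored from backup.
restored_state = {
    ("PersistentVolumeClaim", "db-data"),
    ("ConfigMap", "tenant-config"),
}

for kind, name in missing_resources(desired_state, restored_state):
    print(f"to be recreated by the GitOps sync: {kind}/{name}")
```

Everything that is not runtime state falls into the difference and is recreated from Git, so only the objects that cannot be regenerated need to travel through the backup path.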

Shikha Srivastava

Yes. All right. That brings us to the last slide. Thank you for listening to us. We really would appreciate hearing what you are doing in this space and learn from you as well. Let your questions come in or comments come in, and here are some of the links to the references that we mentioned in the talk. Thank you. Have a good rest of the conference. Bye.