Introducing Chaos Engineering to SAP DevOps
Because SAP is complex to install and critical to any organization that runs it, it remains a slow-moving piece of every large business's IT. Join this presentation to understand how to start changing that, and how to apply chaos engineering techniques to test your SAP installation's high-availability readiness with 12 different failure scenarios.
Full transcript
The complete talk, organized by section.
Guilherme Sesterheim
Hello. Hello, everyone. I'm Gil, and I'm part of ProServe at AWS. Nice to meet you all.
I don't know how you guys feel, but I feel like, with this badge from the conference, I feel like a cow walking in thinking here, right? Let's mind that. Nice to meet you all.
Today we're going to talk about introducing some chaos engineering practices into SAP. And come on, looking at the conference agenda, I was looking at the time slot and I found this other guy talking about DevOps at Netflix. Come on, guys, I want to complain about that. Come on. I want to be at his session as well.
I do understand that this is a very important topic for us that are SAP administrators. I come from a very strong open source technology background, and I am part of an SAP team inside AWS. So if your companies eventually want to engage AWS to help you better run SAP on AWS, you'll probably be engaging, I don't know, somebody, me or somebody else on my team, right?
Let me first start with two questions here. One of them is more for the open source side, and the other one is more for the SAP side. Does anybody want to try? What is chaos engineering? I have the concept in a few slides, so no wrong answers. What is chaos engineering? Why do we do that?
I'm going to grab a meeting. Yep.
Connection. So no problem.
Awesome. Thank you for that. Yeah. Yep. Nice.
Yeah, it makes sense. And both of you, you remind me of the speech we heard from Mr. Kim yesterday, I guess, about Netflix, right? The practices that they did and how confident they were. "Bring it on," when they had the notice from AWS about some EC2 maintenance that affected their Cassandra databases. Yeah, thank you for that.
Now, one more, probably for the folks here who are closer to SAP. Why don't we do chaos engineering for SAP?
Different products and large scale to manage all those things, test those applications.
Awesome. Cool. Thank you.
Typically have, or things people don't want to, we don't want to break that, so we won't try to break it.
Thank you for that. Just somebody else. Yep.
Awesome. Thank you for that. And I want to stick to this point where we have something in there. We have an EC2 instance, and we don't want to touch that, because come on, we don't want to touch that.
This is important. And mainly this is, I don't know, what we spend most of our time discussing with customers. So I'm going to stick to that point. This is an instance. We don't want to touch that because that's working, and we don't want to mess anything that's working, right?
This is Bert. He's my dog. He's back there in Tampa. He's the most gracious, friendly thing that ever existed. He's never bitten anyone. He hardly barks. He's never bitten, I don't know, any dog, any other animal. He's, again, extremely friendly. Love him so much. He's there right now with my wife.
This guy, he's always happy. Whenever I do something, whenever I'm in a bad mood, he's close to me. They are feeling, sharing the feeling with me. So I'm very close to this guy. And let's say when this guy goes, oh my God, I cannot even think about that right now. A lot of crying, a lot of grief, a lot of sentiment involved, right?
Now, let's change this guy. Let's remove Bert and replace it with a server.
That's the biggest challenge for us right now, why we don't do chaos engineering in SAP. Maybe this is a metaphor that you guys have already heard, but basically, let's say my HANA server goes down. There's a lot of crying, there's a lot of sentiment, there's a lot of grief involved as well, because I raised this guy and I do care about him. I installed everything. I baked my golden AMI. I installed SAP. I migrated my data from on-premises or from wherever I had it. So if this guy, if this server goes down, a lot of sentiment, right?
On the other side, we have the cattle, right? So this is the metaphor we want to do here. For typical other non-SAP applications, okay, we can do the cattle mindset to a few SAP applications. Hybris is one example. But when we talk about ECC, like NetWeaver-based and HANA, that's harder, right?
But when you have the cattle, basically we can more easily inject faults into that environment. And yeah, we can see that going bad, then, okay, talking about Hybris just as one example, another pod is going to come up if you're running on Kubernetes, of course. That, for me, is the main difference.
So whenever I start to talk to some company, "Oh no, I don't want to touch my HANA server," and I'm talking about HANA here because this is the example I'm going to approach, right? "No, I don't want to touch my PAS. I don't want to touch my ASCS," et cetera, et cetera. So this is what we're going to talk about, right?
What is SAP, just very brief for those who don't know the importance of SAP for an organization here. What is chaos engineering, just bringing in the concept. Then chaos engineering for SAP and what I'm proposing here, challenges and benefits, and the help I'm looking for.
So what is SAP, for those that might not be familiar with SAP? SAP is pretty important, right? The company SAP is one of the world's leading producers of software for the management of business processes, developing solutions that facilitate effective data processing and information flow across organizations. So this is their own definition. It's on SAP's website.
But what I want you guys to stick with is this number. So nowadays, 77% of world revenue transactions go through at least one SAP system. Okay? This is massive. We are talking about the spine of our organizations in many cases. Okay? So this is SAP.
Hopefully that's not new. When we talk about chaos engineering, this is a concept from Thoughtworks' site. Found it pretty interesting to bring the concept here. Chaos engineering is the process of testing a distributed computing system to ensure that it can withstand unexpected disruptions.
And this goes to your answers from previously, saying, okay, let's see what's going to happen when something fails. I don't know, my networking, let's say that one storage disk gets corrupted, the server goes down, I don't know, somebody kicks the instance and it reboots. So let's see what happens to my environment. This is all about chaos engineering.
And we do want to do that in our servers so we don't get caught again by surprises once they happen.
Bringing in some very brief background on chaos engineering as well. Back in 2003, Amazon started with what they called GameDays. Basically, they used to inject faults manually into the servers, into the data centers to, again, see what happens. Of course, it's not that simple. There was a lot of planning, running things in non-prod first and then going to prod, of course.
In 2006, Google started with the DiRT. I'm sorry, I don't remember the D, but the RT is resiliency testing. So basically, they were doing the same thing. They started injecting failures on their environments to see what's going to come from that.
But then in 2011 is when this really got famous, right? And we all understood the importance of chaos engineering to our environments. Netflix started first with the Chaos Monkey. Again, a concept we heard. I see there's not much context there, but we all heard from Mr. Kim yesterday, and probably we will hear this same concept throughout the conference.
In 2011, Netflix started with this open source project, first with the Chaos Monkey, that covered a few items on chaos engineering. And I'm going to approach one of them right now. That evolved into the Chaos Kong, Chaos Gorilla story. And nowadays they call it the Simian Army, and we can find all of that on their GitHub pages. It's open source.
Right after that we have some new players. So even from Cloud Native Foundation, the Days of Chaos. This is something that they advocate for, for giving more tools and empowering people to run that on their environments. And later on we have Facebook Storm and even companies that started offering that as a service, as these guys I found from Gremlin while doing the research to present this to you here today.
Cool, we are good on chaos engineering. You're good on what is SAP. Now let's talk about chaos engineering for SAP, right?
What I'm about to present to you guys today is, again, the outcome of a project. We had an internal project there on AWS ProServe and some of the outcomes we had, right?
In terms of business impacts, these are the metrics we evaluated. These are probably the main guys that we want to test for SAP. And I'm focusing here right now just on HANA, okay? So this is what we might want to test on HANA, because this is what usually brings us the biggest headaches when they happen to our environments.
Network reliability and security, latency. So latency is a very bad Gremlin, using the last word, that happens to our environments. Because when latency happens, wow, it's so hard to troubleshoot that and to narrow down to, "Hey, you're having some latency problem." And last, the server resiliency.
So I'm going to focus today on the last one, the server resiliency, because this is the one that we were able to test the most, right?
Just something that I wanted to mention. Some of these items can be covered nowadays using AWS Fault Injection. Nowadays with Fault Injection, we can even simulate an entire AZ going down. So that's pretty nice to test in our environments. This is not what I'm doing here, but I wanted to mention it, because this might be handy if you are using that in a future project, let's say.
Cool, so chaos engineering for SAP. These items on the right, they are the items that we typically approach when we are working with a customer. Let's say when, again, when you bring AWS to better run your SAP on AWS, you have some folks from ProServe that will help you there. So these are the typical scenarios that we run in the end of, let's say, a migration or an evolution of your SAP environment.
Basically running those commands manually on the instances just to make sure, okay, we have some high availability, we have some layers of resiliency on this environment, right?
In my own experience, this can take up to three months if we have a big SAP installation, right? I cannot say the company name, but with a big company in the entertainment area, it took us almost three months to run this.
So some of these commands, they might be well known for us all. And the one I want to focus on today is the system crash, because this is the one most interesting, let's say. Because if you do `hdb stop`, that's fine, talking technical stuff here, but your Pacemaker will pick up. That's probably the easiest one to spot. `pcs node standby`, killing the process, okay, that's kind of more interesting. Rebooting, that's fine too. But the system crash basically doesn't tell anyone what's happening.
So you basically just crash an instance and, oh, let's see what either Pacemaker is going to do, what AWS is going to do. This is the one I'm approaching right now. There's going to be a QR code in the next slides where you guys can see a blog post describing all of these, and we have some sample code to use as well. Okay? But again, let's focus then on the system crash.
Cool. Starting this story, we have our users, again focusing just on HANA, okay? We have two servers: Availability Zone 1, Availability Zone 2. Our users, they are connecting to the first one. That's fine. That's working. Cool.
So let's do the real chaos engineering here, right? We basically crash the system. This is a way to basically crash the Linux server without giving any notice to any server, any process that is running on there. We basically go into the instance and run this command and let's see what happens. So this won't tell Pacemaker, "Hey, I'm going out." This won't tell even AWS, "Hey, the server has a problem."
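The command shown on the slide is not captured in this transcript. One common way to get this kind of no-warning crash on Linux is the kernel's magic SysRq trigger, so here is a minimal sketch under that assumption. The helper name `inject_crash` and its `dry_run` guard are hypothetical and are not taken from the talk's sample code.

```python
from pathlib import Path

# Linux magic SysRq trigger file. Writing "c" asks the kernel to crash
# immediately, so no process on the host gets a chance to shut down cleanly.
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def inject_crash(trigger: Path = SYSRQ_TRIGGER, dry_run: bool = True) -> str:
    """Crash the local host via SysRq, or only describe the action.

    Hypothetical helper for illustration. It needs root, and what the host
    does after the crash (hang, reboot, kdump) depends on its panic settings.
    Keep dry_run=True anywhere you are not ready to lose the node.
    """
    action = f"write 'c' to {trigger}"
    if dry_run:
        return f"DRY RUN: would {action}"
    trigger.write_text("c")  # on a real host, nothing runs after this line
    return f"done: {action}"


if __name__ == "__main__":
    print(inject_crash())  # safe: dry run by default
```

The dry-run default is a deliberate design choice: the destructive path has to be requested explicitly, which fits the one-scenario-at-a-time caution described in this talk.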
As soon as this command is run, the instance freezes and goes down. AWS will see that eventually, in a minute or two, and the status of your instance is going to go from running to stopping and then finally stopped, and the node in the second Availability Zone is going to pick up, right?
Pacemaker from the second one is going to realize that, hey, I lost communication to the first one, so now I'm the leader. Cool. So that's it.
Once this guy then is back, oh, it's very nice. So now we focus on this guy, right? So we bring him up. Okay, what happened? Let's do some troubleshooting. Let's see what went wrong with this guy. Let's see the logs. Once we bring this guy up, the data starts synchronizing back from the secondary into this old primary here, right?
Basically this is the blue-sky scenario, right? This is what's supposed to happen. And this is what, again, there's the QR code, this is what we were able to automate, and that's why we are calling it chaos engineering for SAP.
So I'm going to talk about the benefits in the next couple of slides, but let me make a pause here. Questions? How many of you have already tried something like that? How many of you, I don't know, do you get excited? Do you get scared? How do you feel about these questions?
All right. So after this is done, and again, this is a project we developed, in the end you get this report. It is also generated by the sample code. The same process that made all of these changes to the environment for us, that crashed the instance and monitored everything that happened, is going to tell you whether the secondary instance picked up or not.
In the end of that, after everything is run, this HTML report is generated, and it also tells you which of those steps did run well. So did `hdb stop` run well? If not, it's going to say fail over there. And it's going to tell also the detailed report for your guys back home, for the Basis team to troubleshoot this.
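To make the idea of a per-step pass/fail report concrete, here is a minimal sketch of how such an HTML page could be rendered. It is not the project's actual report code; `StepResult`, `render_report`, and the sample rows are hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class StepResult:
    name: str          # scenario label, e.g. "hdb stop on primary"
    passed: bool       # True if the environment behaved as expected
    details: str = ""  # free text for the Basis team to troubleshoot


def render_report(results: list[StepResult]) -> str:
    """Return a small standalone HTML page with one table row per step."""
    rows = []
    for r in results:
        status = "PASS" if r.passed else "FAIL"
        rows.append(
            f"<tr><td>{escape(r.name)}</td>"
            f"<td>{status}</td>"
            f"<td><pre>{escape(r.details)}</pre></td></tr>"
        )
    return (
        "<html><body><h1>HA test report</h1>"
        "<table><tr><th>Step</th><th>Status</th><th>Details</th></tr>"
        + "".join(rows)
        + "</table></body></html>"
    )


if __name__ == "__main__":
    # Illustrative data only, not results from a real run.
    demo = [
        StepResult("hdb stop on primary", True),
        StepResult("system crash on primary", False,
                   "PAS not connected to the expected node"),
    ]
    print(render_report(demo))
```

Escaping the details with `html.escape` matters here because raw command output often contains `<` and `&`, which would otherwise break the page.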
So what were the commands before each failure got injected, right? What was the status of `crm_mon -A1`? What was the status of that? What is the status of the Python system replication? What was the status of the Pacemaker, right? All of that is captured under this.
And if we go ahead, we have the detailed explanation for each of these tasks. In this example, I'm showing a failed one, right? So we had a failure on the crash node scenario, exactly the one we are approaching here. And why did it fail? The server crashed, and the solution waited for the server to be in the stopped state, okay? Then the solution brought the instance back up, and still the PAS server is not connected to the node it was supposed to be connected to.
So this is what this guy is showing us. And if we scroll up into this report, we are going to get, again, all the details for all of those commands. If we were to run them manually, how was the system looking exactly before each of those commands ran? Once `hdb stop` ran, the first one, not this example, how was my Pacemaker configuration before the system crash command ran? How was my system replication, Python system replication, working?
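The pre-injection capture described here can be sketched as a small function that runs a list of status commands and stores their output next to a timestamp. This is an assumption about how one might do it, not the project's code. Only `crm_mon -A1` comes from the talk; `capture_snapshot`, `run_shell`, and the `runner` parameter are hypothetical.

```python
import subprocess
from datetime import datetime, timezone
from typing import Callable, Sequence

# Commands to record before each injection. `crm_mon -A1` is from the talk;
# extend the list with whatever your Basis team wants to see in the report.
SNAPSHOT_COMMANDS = ["crm_mon -A1"]


def run_shell(command: str) -> str:
    """Run a command locally and return stdout followed by stderr."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=60
    )
    return proc.stdout + proc.stderr


def capture_snapshot(
    label: str,
    commands: Sequence[str] = SNAPSHOT_COMMANDS,
    runner: Callable[[str], str] = run_shell,
) -> dict:
    """Record the output of each status command under a descriptive label."""
    return {
        "label": label,
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "outputs": {cmd: runner(cmd) for cmd in commands},
    }


if __name__ == "__main__":
    # A fake runner keeps the demo runnable on a machine without Pacemaker.
    snap = capture_snapshot("before system crash", runner=lambda c: f"(output of {c})")
    print(snap["label"], snap["outputs"])
```

Passing the runner in as a parameter keeps the capture logic testable off-cluster; on a real node the default `run_shell` would be used instead.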
This is the challenges and benefits for the... oh, let me make a quick pause here because I guess I have a few minutes. Questions so far? How does it look? This is an important asset for auditing as well, right? Sorry, not compliance, auditing. Yep.
Q&A
Q: Are you running this?
A: We have some customers.
Q: How often are you running this?
A: We have some customers interested in running this every 15 days. This was the most aggressive one I could find, right? But the majority of the customers, they want to run it every three months. Some of them six, depends on the industry, right? How regulated they are. But yeah, these guys from like every two weeks, those were the most aggressive I've ever heard.
Cool. Oh, yep.
Q: What's an average time you take to bring it back up? Is the time reducing, or how it is?
A: Yes, it is reduced. I'm going to approach that, sharing the data that I can share so far. But yeah, good question. Let me just get to the next few slides here. Any more questions?
Cool.
Guilherme Sesterheim
So challenges and benefits. Usually the challenge we have, to your point, time. So I wrote here can take up to a month, but again, I know from my own experience that can take up to three months to run all of that list, right? `hdb stop`, Pacemaker, system crash, et cetera. All of those scenarios on the first one, on the primary, and then on the secondary, waiting for the failovers to happen, et cetera, in a very big, large-scale HANA installation can take up to three months. I mean it.
Nowadays, most of the companies I've interacted with, they run that manually. So, okay, we have a big Word document saying each step that we have to run. Okay, now the Basis team logs into HANA primary and runs `hdb stop` as user `<sid>adm`, whatever. Big lists, right?
These lists are very manual and error prone because they take months or maybe even years to get a list very mature for an embed, if you're doing that. And you still don't trust 100% your list, which is some of the cases. You are getting very specialized people just watching for things to happen. So you're getting probably the most expensive guys on your team just to watch high availability testing, chaos engineering testing.
And also it's not something that is very friendly to audit, right? All you have is the Word document that says, "Hey, this is what we did. And yeah, believe me, this is what we did."
So the time for an empty installation, fresh HANA, the entire environment, right? So HANA, ASCS, ERS, and PAS, and a couple of AAS. For, again, an empty database, this runs in 45 minutes. For larger installations, a couple of days, okay? One or two days is what I've seen.
And all of this, so you have some parameters there that you can configure and say, "Hey, I don't know, the timeout that you want to wait. What is an acceptable timeout for my instance to come back up? If that instance doesn't come back up in that amount of time, hey, we have a problem. Let me know." So the automation will do that for you.
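The configurable timeout can be pictured as a simple polling loop. This is a minimal sketch under the assumption that some function reports the instance state as a string; `wait_for_state` and all of its parameters are hypothetical names, and the state lookup is passed in so no AWS call is assumed here.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],
    wanted: str,
    timeout_s: float,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_state() until it returns `wanted` or the timeout expires.

    Returns True if the wanted state was seen in time, False otherwise,
    so the caller can flag the scenario as failed and alert someone.
    """
    deadline = clock() + timeout_s
    while True:
        if get_state() == wanted:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


if __name__ == "__main__":
    # Simulated state sequence standing in for a real instance lookup.
    states = iter(["stopped", "pending", "running"])
    came_back = wait_for_state(lambda: next(states), "running",
                               timeout_s=10, poll_s=0.01)
    print("instance came back in time:", came_back)
```

Returning a boolean instead of raising keeps the decision about what a timeout means (fail the scenario, page someone, retry) with the caller.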
The results are automatically documented, and it's a repeatable process. The majority of the code is running in Ansible. So you can automate that in a CI/CD pipeline if you want to.
Cool. Questions. And I'm going to wrap with the help I'm looking for.
Yep. Hope I made up to that guy on DevOps for Netflix. Come on.
Q&A
Q: So when you're injecting this chaos into the system, going with your AZ 1 case, is the failure mode the same every time, or does the reason the system goes down change each time?
A: The failure injection, you mean?
Q: Yes.
A: Oh yeah. This is the list of what's automated so far. So these are all the failures. If you really stretch it, you are going to have 13 scenarios. So `hdb stop` on primary, then `hdb stop` on secondary after failover, right? Same thing for Pacemaker. Same thing for `kill -9`. Yeah, this is it. You have 13 because the number... oh, you have `hdb stop` as well on the guy that is not the primary. So even if you're powering down the secondary, you want to make sure that it's still working, right? So yeah, these are the scenarios covered.
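The way the scenario count grows from pairing each failure type with each target node can be sketched as a simple cross product. This is an illustration only: the failure labels are the ones mentioned in the talk, the names `FAILURES`, `TARGETS`, and `build_scenarios` are hypothetical, and the result is not the project's actual list of 13. The extra cases described above, such as stopping the node that is not the primary, would be added on top.

```python
from itertools import product

# Labels as used in the talk; they are names for the report, not commands
# that this sketch executes.
FAILURES = ["hdb stop", "pcs node standby", "kill -9", "reboot", "system crash"]
TARGETS = ["primary", "secondary after failover"]


def build_scenarios(failures=FAILURES, targets=TARGETS) -> list[str]:
    """Pair every failure type with every target node, failure-major order."""
    return [f"{failure} on {target}" for failure, target in product(failures, targets)]


if __name__ == "__main__":
    for number, scenario in enumerate(build_scenarios(), start=1):
        print(number, scenario)
```

Keeping the list data-driven makes it easy to run one scenario at a time at first and widen the set as confidence grows.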
Q: And do you give your team the scenarios how to handle recovery?
A: Well, we are ProServe, so we usually work with customers, right? So the customers decide that. But no, usually it's just one of these scenarios at a time. So again, we are very cautious with SAP servers. Okay, let's start with `hdb stop`. This is the most common one. This is the one we are most comfortable with because this is not new. So we usually go one by one, not doing the full shot at once, right?
Guilherme Sesterheim
And for the last minute. So the help I'm looking for. This is something we've been discussing a lot, because self-resilient actions, whenever we are touching some SAP environment, it is very likely that we are going to have a change request going on. We already described everything we are doing, if we are running `hdb stop`, if we are running a system crash, et cetera.
So self-resilient actions is a hot topic, because if we are doing some self-resilient action, so let's say my system is telling that I'm not, HANA is telling me I'm not ready to be tested because I don't have a backup up to date, right? Can I go ahead and trigger a backup? This is a question, right?
So we definitely can dig more into self-healing actions for the entire thing. And we are cautious about that because of the change requests. We don't want to cross any border.
So we tested this scenario, triggering database backups. Automating even more self-resilient actions like that is the help we are looking for.
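The backup question can be written down as a small decision gate. This is a minimal sketch, assuming the change request states whether self-healing is allowed; `Decision`, `readiness_decision`, and the `self_healing_approved` flag are hypothetical names and are not part of the project described in the talk.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Decision(Enum):
    PROCEED = "backup is fresh, run the test"
    TRIGGER_BACKUP = "backup is stale, trigger one first"
    BLOCK = "backup is stale, stop and ask a human"


def readiness_decision(
    last_backup: datetime,
    now: datetime,
    max_age: timedelta,
    self_healing_approved: bool,
) -> Decision:
    """Decide what to do before injecting a failure.

    self_healing_approved models whether the change request explicitly
    covers automatic remediation such as triggering a backup.
    """
    if now - last_backup <= max_age:
        return Decision.PROCEED
    if self_healing_approved:
        return Decision.TRIGGER_BACKUP
    return Decision.BLOCK


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)  # illustrative date
    stale = now - timedelta(days=3)
    print(readiness_decision(stale, now, timedelta(days=1), self_healing_approved=False))
```

Making the approval an explicit input keeps the automation from crossing the change-request border on its own: without it, a stale backup blocks the run instead of being fixed silently.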
Questions? Oh, your time, Tamara. Thank you very much.