"Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system's capability to withstand turbulent and unexpected conditions."
Chaos engineering is an exciting discipline whose goal is to surface evidence of weaknesses in a system before those weaknesses become critical issues.
By running controlled experiments, you gain insight into how your system responds to the kinds of turbulent conditions that occur in production, letting you catch and fix problems before they cause downtime or outages.
Litmus is an open-source chaos engineering platform that helps SREs and developers find weaknesses in platforms and applications, both on and off Kubernetes, by providing a complete chaos engineering framework and a set of associated chaos experiments.
Getting Started with Litmus Chaos Tests in Kubernetes
In this blog post, we will use a GKE cluster and run a memory chaos experiment against it at the node level.
You must have a Kubernetes cluster (v1.17 or later) up and running
You must have kubectl access to your Kubernetes cluster
Make sure Helm is installed
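You can quickly sanity-check these prerequisites from your shell (assuming kubectl and helm are already on your PATH):

```shell
# Confirm kubectl can reach the cluster and check the server version (v1.17+)
kubectl version
kubectl cluster-info
# Confirm Helm 3 is installed
helm version
```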
Assuming all prerequisites are in place, you are ready for installation.
Once the Kubernetes cluster is ready, install Litmus using its Helm chart.
Add the Litmus Helm repository
$ helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
$ helm repo update
Create a namespace for the installation
$ kubectl create ns litmuschaos
Install the Litmus Chaos Center
$ helm install chaos litmuschaos/litmus --namespace=litmuschaos --set portal.frontend.service.type=NodePort
Verify your installation
$ kubectl get po -n litmuschaos
NAME                                        READY   STATUS    RESTARTS
chaos-litmus-auth-server-65bbc55ddf-ssckq   1/1     Running   0
chaos-litmus-frontend-588f4c66cf-4zbw8      1/1     Running   0
chaos-litmus-server-7fb7954696-cwvhl        1/1     Running   0
chaos-mongodb-5db7f895dd-xqlzf              1/1     Running   0
$ kubectl get svc -n litmuschaos
NAME                               TYPE        CLUSTER-IP     PORT(S)
chaos-litmus-auth-server-service   ClusterIP   10.64.4.23     9003/TCP,3030/TCP
chaos-litmus-frontend-service      NodePort    10.64.5.219    9091:30213/TCP
chaos-litmus-server-service        ClusterIP   10.64.11.232   9002/TCP,8000/TCP
chaos-mongodb                      ClusterIP   10.64.6.22     27017/TCP
Accessing the Chaos Center
To log in to the Chaos Center, copy the NodePort of the chaos-litmus-frontend-service (30213 in the output above) and open <NODE-IP>:<PORT> in your browser. The default credentials are admin/litmus, and you can change the password on first login.
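If the node IPs are not reachable from your machine (common with private GKE nodes), you can port-forward the frontend service instead and browse to localhost:

```shell
# Forward local port 9091 to the Chaos Center frontend service
kubectl port-forward -n litmuschaos svc/chaos-litmus-frontend-service 9091:9091
# Then open http://localhost:9091 in your browser
```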
Once you have logged in and a project has been created, Litmus automatically registers the Self Chaos Delegate. To verify, run the command below:
$ kubectl get po -n litmuschaos
NAME                                        READY   STATUS    RESTARTS   AGE
chaos-exporter-d767fcf5-cjr5w               1/1     Running   0          32m
chaos-litmus-auth-server-65bbc55ddf-ssckq   1/1     Running   0          88m
chaos-litmus-frontend-588f4c66cf-4zbw8      1/1     Running   0          88m
chaos-litmus-server-7fb7954696-cwvhl        1/1     Running   0          88m
chaos-mongodb-5db7f895dd-xqlzf              1/1     Running   0          88m
chaos-operator-ce-7cf6cc79b4-99jgr          1/1     Running   0          32m
event-tracker-5ddb594676-q2sgk              1/1     Running   0          32m
subscriber-5cbcb4df94-kzvcl                 1/1     Running   0          32m
workflow-controller-8c548f686-h7ql4         1/1     Running   0          32m
You should now see the self-agent in the Active state under Chaos Delegates.
Node-Memory-Hog Litmus Chaos Experiment on a GKE Node
This experiment verifies the resiliency of applications whose replicas may be evicted when a node becomes unschedulable due to memory pressure. We will check the reliability of a Kubernetes node by injecting memory stress into it.
Since the self-agent is in an active state, the next step is to select Chaos Scenarios and click Schedule a Chaos Scenario. On the scenario creation page, choose the self-agent as the delegate and click Next.
After that, you will choose a chaos scenario. You can pick a pre-defined scenario, but for this blog we will pick one from a Chaos Hub.
At this stage, you can set the scenario name and description.
Now it's time to tune the chaos scenario; since we are testing a memory hog, select generic/node-memory-hog.
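The same experiment can also be expressed declaratively as a ChaosEngine resource, which is roughly what the Chaos Center generates behind the scenes. The sketch below is illustrative: the engine name, service account, target node name, and env values are assumptions, so adjust them to your cluster before applying:

```shell
# Sketch of a node-memory-hog ChaosEngine (names and values are illustrative)
kubectl apply -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-memory-hog-engine    # hypothetical name
  namespace: litmuschaos
spec:
  engineState: "active"
  chaosServiceAccount: litmus-admin   # assumes this SA exists with node-level permissions
  experiments:
    - name: node-memory-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"              # seconds of memory stress
            - name: MEMORY_CONSUMPTION_PERCENTAGE
              value: "90"               # percent of node memory to consume
            - name: TARGET_NODES
              value: "node-1"           # hypothetical node name
EOF
```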
Adjusting the weight of an experiment in the chaos scenario assigns it a weightage, a way of expressing how important that experiment is to your workflow. The resilience score is calculated from each experiment's weightage and its probe success percentage.
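As a rough illustration of that calculation (assuming the score is the weighted average of probe success percentages, with hypothetical weights and probe results):

```shell
# Weighted resilience score sketch: two hypothetical experiments,
# one weighted 10 with all probes passing, one weighted 5 at 80% success
awk 'BEGIN {
  w1 = 10; p1 = 100;   # experiment 1: weight, probe success %
  w2 = 5;  p2 = 80;    # experiment 2: weight, probe success %
  score = (w1*p1 + w2*p2) / (w1 + w2)
  printf "%.1f\n", score   # prints 93.3
}'
```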
Once you have verified and confirmed all the details, click Finish. The chaos experiment is now scheduled!
Once all steps have been completed successfully, the workflow graph looks like this:
The experiment is complete!
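While the scenario runs, you can also observe the memory pressure from the command line (assuming the metrics-server addon is enabled on the GKE cluster):

```shell
# Watch node memory utilisation during the experiment
kubectl top nodes
# Inspect chaos-related events emitted by the operator
kubectl get events -n litmuschaos --sort-by=.lastTimestamp
```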
We can also check the Chaos Analytics report for the experiment:
In this chaos experiment, the resilience score is 100% (10 out of 10 points) for the memory stress process, which shows that the Kubernetes node has good resilience for the defined configuration.
In this blog post, we have seen how to perform the node-memory-hog chaos experiment using the LitmusChaos tooling. With the same tooling you can run many other experiments, such as azure-disk-loss, gcp-vm-instance-stop, container-kill, pod-network-latency, node-drain, pod-network-loss, pod-autoscaler, node-restart, etc.
Comment below to learn more about LitmusChaos and how it can help you prevent downtime or production disruptions.