Sometimes you have to break a couple of eggs to make an omelet
Chaos engineering means simulating unexpected events on your systems and testing how they handle them. In short, you could compare it to code testing, but for systems. It can help you identify points of failure, harden your systems, train employees, and so on.
Chaos engineering is used by some of the major tech companies, and a lot of tooling is in development at the moment. In this blog post I'll take a look at ChaosBlade, an open source chaos engineering toolkit developed and used by Alibaba. For this example we'll run the chaos experiments through Puppet, so we can easily launch experiments on the systems in our environment.
The steady state
The steady state is the state of your machines when they are running as they should. This can cover a number of things, such as configuration files, services, endpoints, and so on.
ChaosBlade does not generate such a steady state, and you can't really define one in it either: you just start and stop experiments. This is where Puppet comes into play, as most of these components are things you already define in your Puppet code. In this blog I'll write a small piece of code that gathers these definitions. We'll run an experiment and see if anything is impacted by it, then break off the experiment and evaluate our state again.
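To make that concrete: in Puppet terms the steady state is simply the set of resources you already declare. A minimal sketch of what such definitions might look like (the resource names and paths below are purely illustrative, not taken from the example repo):
# Purely illustrative steady-state definitions; names and paths are hypothetical.
service { 'apache2':
  ensure => running,   # the service should be up
  enable => true,      # and start on boot
}
file { '/etc/apache2/apache2.conf':
  ensure => file,                                     # the config file should exist
  source => 'puppet:///modules/profile/apache2.conf', # with managed content
  notify => Service['apache2'],                       # and restart Apache on change
}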
Setup
We’ll run a few services inside a docker container and unleash an experiment on it. The complete code can be found inside this github repo: https://github.com/negast/puppet-steady-state
Next we’ll launch the compose file and install some modules.
docker-compose up -d --build
docker exec -it puppetserver /bin/bash
puppet module install puppetlabs-apt --version 8.3.0
puppet module install puppet-nginx --version 3.3.0
puppet module install puppet-archive --version 6.0.2
puppet module install negast-chaosblade --version 0.1.4
exit
docker exec -it ubuntu1 /bin/bash
apt-get update
puppet agent -t
After the Puppet run, the Apache instance should be available at http://localhost:8085 in your browser.
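If you prefer the command line over the browser, a quick check of the endpoint (assuming curl is installed on your host) could look like this:
curl -I http://localhost:8085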
Benchmarking
First we'll benchmark the instance without CPU load. Luckily Apache ships with a benchmarking tool, ab, that we can use.
docker exec -it ubuntu1 /bin/bash
ab -n 90000 -c 100 http://localhost/
First benchmarks
Time taken for tests: 18.008 seconds
Connection Times (ms)
| | min | mean | [+/-sd] | median | max |
| Connect: | 0 | 6 | 3.2 | 5 | 37 |
| Processing: | 3 | 14 | 6.6 | 12 | 74 |
| Waiting: | 1 | 9 | 6.0 | 7 | 71 |
| Total: | 8 | 20 | 7.2 | 18 | 86 |
Time taken for tests: 18.798 seconds
Connection Times (ms)
| | min | mean | [+/-sd] | median | max |
| Connect: | 0 | 6 | 3.9 | 5 | 50 |
| Processing: | 3 | 15 | 8.2 | 12 | 114 |
| Waiting: | 1 | 9 | 6.6 | 7 | 74 |
| Total: | 4 | 21 | 9.6 | 18 | 114 |
Time taken for tests: 17.998 seconds
Connection Times (ms)
| | min | mean | [+/-sd] | median | max |
| Connect: | 0 | 5 | 3.2 | 5 | 39 |
| Processing: | 0 | 14 | 7.1 | 12 | 128 |
| Waiting: | 0 | 9 | 6.4 | 7 | 99 |
| Total: | 1 | 20 | 7.5 | 18 | 129 |
Time taken for tests: 19.682 seconds
Connection Times (ms)
| | min | mean | [+/-sd] | median | max |
| Connect: | 0 | 6 | 4.2 | 5 | 37 |
| Processing: | 2 | 16 | 8.7 | 13 | 100 |
| Waiting: | 0 | 10 | 6.9 | 7 | 86 |
| Total: | 4 | 22 | 10.1 | 18 | 105 |
Time taken for tests: 15.650 seconds
Connection Times (ms)
| | min | mean | [+/-sd] | median | max |
| Connect: | 0 | 5 | 2.7 | 4 | 45 |
| Processing: | 3 | 12 | 5.2 | 11 | 65 |
| Waiting: | 1 | 7 | 4.5 | 6 | 59 |
| Total: | 5 | 17 | 5.7 | 16 | 68 |
Uncomment the Puppet code in site.pp to enable the experiment on the ubuntu1 container.
chaosexperiment_cpu { 'cpuload1':
  ensure  => 'present',
  load    => 99,
  climb   => 60,
  timeout => 600,
}
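Under the hood this resource maps to a blade CLI call roughly like the one below; the flags are reconstructed from the experiment status output shown further down:
blade create cpu fullload --cpu-percent 99 --climb-time 60 --timeout 600 --uid cpuload1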
Now run the Puppet agent again and you should see the chaos experiment process pop up in the process list.
You can also consult the blade CLI to view the experiments you created:
root@ubuntu1:/# blade status --type create
{
  "code": 200,
  "success": true,
  "result": [
    {
      "Uid": "cpuload1",
      "Command": "cpu",
      "SubCommand": "fullload",
      "Flag": " --cpu-percent=99 --timeout=600 --climb-time=60 --uid=cpuload1",
      "Status": "Success",
      "Error": "",
      "CreateTime": "2022-01-03T09:59:57.3833659Z",
      "UpdateTime": "2022-01-03T09:59:58.517777Z"
    }
  ]
}
Next we'll run a couple of benchmarks again.
Time taken for tests: 23.633 seconds
Connection Times (ms)
| | min | mean | [+/-sd] | median | max |
| Connect: | 0 | 8 | 5.5 | 6 | 135 |
| Processing: | 1 | 18 | 9.5 | 16 | 149 |
| Waiting: | 1 | 11 | 7.4 | 9 | 145 |
| Total: | 2 | 26 | 11.8 | 23 | 190 |
Time taken for tests: 38.229 seconds
Connection Times (ms)
| | min | mean | [+/-sd] | median | max |
| Connect: | 0 | 13 | 9.4 | 12 | 108 |
| Processing: | 2 | 29 | 16.1 | 25 | 171 |
| Waiting: | 0 | 17 | 11.8 | 15 | 139 |
| Total: | 2 | 42 | 20.4 | 38 | 191 |
Time taken for tests: 52.768 seconds
Connection Times (ms)
| | min | mean | [+/-sd] | median | max |
| Connect: | 0 | 19 | 13.5 | 16 | 177 |
| Processing: | 0 | 39 | 22.9 | 34 | 338 |
| Waiting: | 0 | 22 | 15.9 | 18 | 259 |
| Total: | 1 | 58 | 29.7 | 52 | 369 |
Time taken for tests: 40.652 seconds
Connection Times (ms)
| | min | mean | [+/-sd] | median | max |
| Connect: | 0 | 15 | 11.5 | 12 | 127 |
| Processing: | 3 | 30 | 18.7 | 24 | 299 |
| Waiting: | 0 | 16 | 12.7 | 13 | 267 |
| Total: | 4 | 45 | 25.2 | 38 | 302 |
Time taken for tests: 47.876 seconds
Connection Times (ms)
| | min | mean | [+/-sd] | median | max |
| Connect: | 0 | 18 | 12.3 | 16 | 111 |
| Processing: | 4 | 35 | 20.3 | 30 | 159 |
| Waiting: | 0 | 19 | 13.9 | 16 | 142 |
| Total: | 9 | 53 | 27.1 | 47 | 224 |
So we can see that the CPU load does have an impact, but the service remains operational.
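If you don't want to wait for the timeout, you can break off the experiment by destroying it via its uid with the blade CLI (setting ensure => 'absent' on the Puppet resource presumably achieves the same through the module):
blade destroy cpuload1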
Stop service experiment
Next, let's try an experiment that stops the Apache process completely:
blade create process stop --process-cmd apache2 --timeout 60 --uid stopp1
or in Puppet code:
chaosexperiment_process { 'stopp1':
  ensure      => 'present',
  type        => 'process_stop',
  process_cmd => 'apache2',
  timeout     => 60,
}
This experiment effectively stops the Apache process for 60 seconds. Even additional Puppet runs do not correct the service state. After the timeout the process resumes as normal, and we return to the steady state where the endpoint is available.
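You can watch this happen by polling the endpoint while the experiment runs; assuming curl is available on the host, something like the loop below will print HTTP 200 in the steady state and an error code while the process is stopped:
while true; do curl -s -o /dev/null --max-time 2 -w "%{http_code}\n" http://localhost:8085; sleep 5; done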
Results
So what can we learn from these two experiments? Firstly, the CPU experiment shows that when load gets too high it impacts the speed of our requests, but it does not necessarily bring down the whole system. The next step would be to think about how we can react to unexpected loads to make a more durable environment, and then test the system again using benchmarking and chaos testing. We could for instance add a load balancer, launch additional containers, and so on.
From the process interruption we can see that if the process freezes, our site is totally unavailable. Again we'll have to think about how we could enhance the environment to react to this. We could add monitoring that starts or restarts services, restarts systems, and so on.
Afterthoughts
We ran only some minor experiments, and just one at a time, but you can launch multiple experiments at once and really stress the system. For example, we could simulate someone tampering with files and see if Puppet fixes it, restarts services, etc. We also learned more about the steady state and how running these experiments can help us think about expanding it. We noticed that Puppet does not react well to stopped services, so one idea would be to add a custom resource that also tests endpoints and lets Puppet react to the result (a rough sketch of that idea follows below). This could make Puppet solve a lot of problems in your environment, leaving you more time for testing and a more robust system. We could also discover additional services that are missing from our Puppet code, expanding our knowledge of our steady state.
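As a very rough sketch of that idea (not part of the example repo; the command paths and check URL are assumptions), an exec resource could probe the endpoint on every Puppet run and restart Apache whenever the check fails:
# Hypothetical endpoint check: restart Apache whenever the HTTP probe fails.
exec { 'restart_apache_if_endpoint_down':
  command => '/usr/sbin/service apache2 restart',
  unless  => '/usr/bin/curl -fsS -o /dev/null --max-time 5 http://localhost/',
}
A proper custom resource type could report the endpoint state instead of restarting blindly, but even a simple check like this might let Puppet recover from situations like the process stop experiment above.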