The road towards self-healing systems, part one

You reach the highest possible level of IT systems management when systems regulate themselves. But the road to self-healing systems isn’t always an easy one. Let’s focus on the first step: from monitoring to observability.

More than ever, IT systems are mission-critical. When these systems fail, production lines come to a screeching halt, you lose data, and you miss business opportunities. And that’s not all. Recovering from a system failure is a daunting and expensive task, which adds further costs on top of the damage already done.


How do you keep business-critical IT systems up and running and avoid unplanned downtime? Through monitoring platforms that keep an eye on the systems’ performance. Traditional monitoring solutions track parameters such as available disk space, CPU usage, available memory, response times, etc. And for all the parameters that you measure, you define thresholds.

Once a threshold is crossed, for example when free disk space drops to 10% or less, the system sends out an alert. Alternatively, you can program the system to add extra disk space automatically as soon as the threshold is crossed, until more than 10% of the disk is free again.
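The threshold rule described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular monitoring platform works: the function names and the "ALERT"/"OK" return values are made up for the example, and only the 10% figure comes from the text.

```python
import shutil

# Alert when 10% or less of the disk is free (the figure used in the text).
FREE_THRESHOLD = 0.10


def free_fraction(total_bytes: int, free_bytes: int) -> float:
    """Return the share of the disk that is still free, between 0 and 1."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def check_disk(total_bytes: int, free_bytes: int,
               threshold: float = FREE_THRESHOLD) -> str:
    """Classic threshold rule: "ALERT" at or below the threshold, else "OK".

    Hypothetical helper: a real platform would send a notification or
    trigger an automated action instead of returning a string.
    """
    if free_fraction(total_bytes, free_bytes) <= threshold:
        return "ALERT"
    return "OK"


if __name__ == "__main__":
    # shutil.disk_usage returns total, used and free bytes for a path.
    usage = shutil.disk_usage("/")
    print(check_disk(usage.total, usage.free))
```

Note that the rule only sees one number at one moment in time. It has no idea why the disk is filling up, which is exactly the limitation discussed next.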

But here’s the thing with traditional monitoring platforms. They are perfectly capable of tracking how a system is doing on several parameters, and you can link a series of actions to the values that you measure. But the monitoring platform doesn’t know what’s going on inside that system.

From monitoring to observability

So, to be really on the ball, monitoring isn’t enough. A simple example explains why not. On the last day of the month, a payroll system generates large amounts of PDF files. It sends out these documents and then automatically deletes all the temp files that were created. Due to this peak in activity, free disk space may drop below the threshold. A traditional monitoring system will send out an alert or automatically add extra disk space. Why? Because it doesn’t know that it is dealing with a payroll application and, as such, a temporary situation.

Hence the need for observability! For this, we need the system to send information from within to an aggregator. You find that information in log files, application metrics, API endpoints, etc. The aggregator makes correlations and helps you find patterns in all that information, patterns that you would otherwise not spot as a person. In the above example, it is only after a few months that we learn that the disk space issue recurs every month. We see that because the pattern becomes clear in the correlations made by the aggregator.
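To make the idea of a recurring pattern concrete, here is a rough sketch of one check an aggregator could run on the dates of past disk space alerts. It groups alerts by how many days before the end of the month they occurred and reports the offsets that come back in several different months. The function names and the `min_months` parameter are assumptions for this example, not part of any specific product.

```python
import calendar
from collections import defaultdict
from datetime import date


def days_before_month_end(d: date) -> int:
    """Return 0 for the last day of the month, 1 for the day before, etc."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return last_day - d.day


def recurring_offsets(alert_dates, min_months=3):
    """Find month-end offsets on which alerts fired in several months.

    Returns a dict mapping offset -> number of distinct months in which an
    alert fired at that offset, keeping only offsets seen in at least
    `min_months` months (a hypothetical cut-off for this sketch).
    """
    months_by_offset = defaultdict(set)
    for d in alert_dates:
        months_by_offset[days_before_month_end(d)].add((d.year, d.month))
    return {
        offset: len(months)
        for offset, months in months_by_offset.items()
        if len(months) >= min_months
    }
```

Fed with alerts from the last day of three consecutive months plus one stray alert mid-month, this returns only offset 0, the last day of the month. That is the kind of hint that points towards the payroll run rather than a disk that is genuinely too small.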

On top of finding repeating events over a longer period of time, we will also learn that other processes run at an increased level at the same time. This allows us to find out what exactly is happening inside the system, which is exactly what observability means!
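One simple way to check whether two signals rise and fall together is a Pearson correlation between their time series, for instance daily disk usage against the number of PDF jobs per day. The sketch below is a plain-Python illustration of that idea. Real aggregators use their own, usually more advanced, correlation methods, and the metric names mentioned here are only examples.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series.

    Close to 1: the series move up and down together.
    Close to -1: one goes up when the other goes down.
    Close to 0: no linear relationship.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

A high correlation between disk usage and payroll activity does not prove that one causes the other, but it tells you where to look first.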

Learn more about self-healing systems in our next blog!
