Adjust cluster health states to be more meaningful #159
Currently the cluster and storage health can be one of two states: Optimal or Degraded. This is quite limiting: many different things can cause a Degraded state, some of them trivial, some of them cluster-threatening. Especially if additional checks are added in #154, we would want the ability to delineate overall cluster health in a more meaningful way.
I can think of a few useful things to change and implement (roughly sketched below):

- Eliminate the distinction between cluster and storage health: storage issues affect the cluster, so these can be combined into a single health report.
- Use a percentage/out-of-100 based system where different levels of degradation reduce health by a given amount depending on the severity of the problem. Individual health plugins in #154 could state how much they affect the health of the cluster on a per-node or per-cluster basis.
- Use the overall out-of-100 health to calculate an overall cluster health status using a standard "monitoring" trio of ok/warning/critical levels.
- Adjust the example monitoring plugins to read and use this level and act accordingly.
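To make the idea concrete, here's a minimal sketch of how the out-of-100 accounting and the ok/warning/critical mapping could fit together. The `HealthDelta` structure, the function names, and the 90/50 thresholds are all hypothetical placeholders for illustration, not part of #154 or any existing code:

```python
# Sketch only: plugin deltas, baseline-100 accounting, and threshold mapping.
from dataclasses import dataclass
from typing import List


@dataclass
class HealthDelta:
    """A single health deduction reported by a monitoring plugin."""
    plugin: str    # plugin name, e.g. "node" or "ceph"
    delta: int     # points subtracted from 100, weighted by severity
    message: str   # human-readable detail for display


def cluster_health(deltas: List[HealthDelta]) -> int:
    """Sum all plugin deductions against a baseline of 100, floored at 0."""
    return max(0, 100 - sum(d.delta for d in deltas))


def health_state(health: int, warning: int = 90, critical: int = 50) -> str:
    """Map the out-of-100 value onto the ok/warning/critical trio.
    The 90/50 thresholds are illustrative defaults, not decided values."""
    if health >= warning:
        return "ok"
    if health >= critical:
        return "warning"
    return "critical"


# Example: one trivial issue plus one serious one -> 45, "critical"
deltas = [
    HealthDelta("node", 5, "node hv3 has high load"),
    HealthDelta("ceph", 50, "Ceph reports HEALTH_ERR"),
]
print(cluster_health(deltas), health_state(cluster_health(deltas)))
```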
One big complexity to note would be working the Ceph health messages into this, but luckily their levels are fairly consistent (WARN and ERR being roughly -10 and -50 respectively) and they output useful messages that can be collected and displayed. I'd probably just make "Ceph" a monitoring "plugin", but a hardcoded one that always exists, similar to the various node states, etc.
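For illustration, a rough sketch of what that hardcoded Ceph "plugin" might look like, assuming the JSON emitted by `ceph health detail --format json` (a top-level `status` field plus per-check `summary` messages); the exact field handling would need verifying against real output:

```python
# Sketch only: map Ceph's health status to a deduction and collect its messages.
import json
import subprocess
from typing import List, Tuple

# Rough severity weights from the discussion above; OK costs nothing.
CEPH_DELTAS = {"HEALTH_OK": 0, "HEALTH_WARN": 10, "HEALTH_ERR": 50}


def ceph_health_delta() -> Tuple[int, List[str]]:
    """Return the health deduction and the Ceph messages to display."""
    raw = subprocess.run(
        ["ceph", "health", "detail", "--format", "json"],
        capture_output=True, check=True,
    ).stdout
    status = json.loads(raw)
    delta = CEPH_DELTAS.get(status.get("status", "HEALTH_OK"), 50)
    # Each check carries a human-readable summary we can surface directly.
    messages = [
        check.get("summary", {}).get("message", name)
        for name, check in status.get("checks", {}).items()
    ]
    return delta, messages
```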