Adjust cluster health states to be more meaningful #159

joshuaboniface · 2022-10-08T00:25:10-04:00

joshuaboniface commented

2022-10-08 00:25:10 -04:00

Currently the cluster and storage health can each be one of two states: Optimal or Degraded. This is quite limiting: many different things can cause a Degraded state, some of them trivial, some of them cluster-threatening. Especially if additional checks are added in #154, we would want the ability to delineate overall cluster health in a more meaningful way.

I can think of a couple useful things to change and implement:

1. Eliminate the distinction between cluster and storage health: storage issues affect the cluster, so these can be combined into a single health report.
2. Use a percentage/out-of-100 based system where different levels of degradation reduce health by a given amount depending on the severity of the problem. Individual health plugins in #154 could state how much they affect the health of the cluster on a per-node or per-cluster basis.
3. Use the overall out-of-100 health to calculate an overall cluster health status using a standard "monitoring" trio of ok/warning/critical levels.
4. Adjust the example monitoring plugins to read and use this level and act accordingly.
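
As a rough illustration of the out-of-100 idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration, not an existing PVC API: the `HealthFault` type, the function names, and especially the `warning_below`/`critical_below` thresholds, which the proposal does not specify.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class HealthFault:
    """One problem reported by a (hypothetical) health plugin."""
    source: str   # plugin or check name, e.g. "ceph" or "node:hv1"
    message: str  # human-readable description of the problem
    delta: int    # points this fault subtracts from the 100-point health


def cluster_health(faults: Iterable[HealthFault]) -> int:
    """Start from 100 and subtract every fault's delta, clamping at 0."""
    total = sum(fault.delta for fault in faults)
    return max(0, 100 - total)


def health_status(health: int, warning_below: int = 95, critical_below: int = 50) -> str:
    """Map the 0-100 health onto ok/warning/critical.

    The thresholds are placeholder assumptions, not values from the proposal.
    """
    if health < critical_below:
        return "critical"
    if health < warning_below:
        return "warning"
    return "ok"


faults = [
    HealthFault("ceph", "1 osds down", 10),
    HealthFault("node:hv2", "node not responding", 50),
]
health = cluster_health(faults)
print(health, health_status(health))
```

A monitoring check could then read only the final status string, while the individual `message` values remain available for display.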

joshuaboniface commented

2022-10-08 00:27:08 -04:00

One big complexity to note would be working the Ceph health messages into this. Luckily their levels are fairly consistent (WARN and ERR being roughly -10 and -50 respectively), and they output useful messages that can be collected and displayed. I'd probably just make "Ceph" a monitoring "plugin", but a hardcoded one that always exists, similar to the various node states, etc.
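
A minimal sketch of what such a hardcoded Ceph "plugin" might return, assuming the rough -10/-50 weights above. The input shape (a list of `(severity, message)` pairs), the result dictionary, and the choice to deduct per check rather than once per overall status are all hypothetical; only the `HEALTH_WARN` and `HEALTH_ERR` severity names come from Ceph itself.

```python
# Points deducted per Ceph severity, using the rough values from the comment above.
CEPH_DELTAS = {"HEALTH_WARN": 10, "HEALTH_ERR": 50}


def ceph_plugin_result(checks):
    """Turn Ceph health checks into a health deduction plus display messages.

    `checks` is assumed to be a list of (severity, message) tuples that were
    already extracted from Ceph's health output; that extraction is not shown.
    """
    delta = 0
    messages = []
    for severity, message in checks:
        # Unknown severities (including HEALTH_OK) deduct nothing.
        delta += CEPH_DELTAS.get(severity, 0)
        messages.append(f"{severity}: {message}")
    # Cap the deduction so a single plugin never removes more than 100 points.
    return {"plugin": "ceph", "health_delta": min(delta, 100), "messages": messages}


result = ceph_plugin_result([
    ("HEALTH_WARN", "1 osds down"),
    ("HEALTH_ERR", "1 full osd(s)"),
])
print(result["health_delta"], result["messages"])
```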

joshuaboniface added the improvement label 2022-11-23 02:06:08 -05:00

joshuaboniface referenced this issue

2023-02-13 00:38:09 -05:00

Move monitoring in keepalives to plugin system #161

joshuaboniface closed this issue

2023-02-13 00:38:43 -05:00

joshuaboniface referenced this issue from a commit

2023-02-22 18:13:51 -05:00

Merge branch 'revamp-health'

joshuaboniface referenced this issue from a commit