Adjust cluster health states to be more meaningful #159

Closed
opened 2022-10-08 00:25:10 -04:00 by joshuaboniface · 1 comment

Currently the cluster and storage health can be one of two states: Optimal or Degraded. This can be quite limiting: many different things can cause a Degraded state, some of them trivial, some of them cluster-threatening. Especially if additional checks are added in #154, we would want the ability to delineate overall cluster health in a more meaningful way.

I can think of a few useful things to change and implement:

  1. Eliminate the distinction between cluster and storage health: storage issues affect the cluster, so these can be combined into a single health report.

  2. Use a percentage/out-of-100 based system where different levels of degradation reduce health by a given amount depending on the severity of the problem. Individual health plugins in #154 could state how much they affect the health of the cluster on a per-node or per-cluster basis.

  3. Use the overall out-of-100 health to calculate an overall cluster health status using a standard "monitoring" trio of ok/warning/critical levels.

  4. Adjust the example monitoring plugins to read and use this level and act accordingly.
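
Points 2 and 3 could be sketched roughly as follows. This is only an illustration of the idea, not an implementation from PVC: the `HealthFault` class, the function names, and especially the warning/critical thresholds (90 and 50) are assumptions picked for the example.

```python
from dataclasses import dataclass


@dataclass
class HealthFault:
    """One problem reported by a (hypothetical) health plugin."""
    plugin: str   # name of the reporting plugin
    message: str  # human-readable description of the problem
    delta: int    # points this problem subtracts from 100


def cluster_health(faults):
    """Return overall health out of 100, clamped so it never drops below 0."""
    return max(0, 100 - sum(f.delta for f in faults))


def health_status(health, warning_below=90, critical_below=50):
    """Map out-of-100 health to ok/warning/critical.

    The default thresholds are assumed values, not taken from the issue.
    """
    if health < critical_below:
        return "critical"
    if health < warning_below:
        return "warning"
    return "ok"


# Two example faults with made-up severities.
faults = [
    HealthFault("node", "a node is not running", 25),
    HealthFault("disk", "a disk reports errors", 10),
]
health = cluster_health(faults)
print(health, health_status(health))
```

With these two faults the health is 100 - 35 = 65, which falls in the warning band under the assumed thresholds. Keeping the per-fault delta on the plugin side means new checks can be added without touching the status calculation.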

Author
Owner

One big complexity to note would be working the Ceph health messages into this, but luckily their levels are fairly consistent (WARN and ERR being roughly -10 and -50 respectively) and they output useful messages that can be collected and displayed. I'd probably just make "Ceph" a monitoring "plugin" but a hardcoded one that always exists similar to the various node states, etc.
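
A minimal sketch of what such a hardcoded Ceph "plugin" might look like, assuming the Ceph status string and its messages have already been collected elsewhere. The function name, the returned dictionary shape, and the choice to treat an unknown status like an error are all assumptions for illustration; the 10 and 50 point values are the rough figures mentioned above.

```python
# Points subtracted from cluster health per Ceph status (rough values from above).
CEPH_DELTAS = {
    "HEALTH_OK": 0,
    "HEALTH_WARN": 10,
    "HEALTH_ERR": 50,
}


def ceph_health_entry(status, messages):
    """Build a health entry for the always-present Ceph pseudo-plugin.

    status:   Ceph overall status string, e.g. "HEALTH_WARN"
    messages: list of Ceph health messages to collect and display
    """
    # Assumption: an unrecognized status is treated as severely as HEALTH_ERR.
    delta = CEPH_DELTAS.get(status, CEPH_DELTAS["HEALTH_ERR"])
    return {
        "plugin": "ceph",
        "delta": delta,
        # No messages are shown when Ceph is healthy.
        "messages": list(messages) if delta > 0 else [],
    }


entry = ceph_health_entry("HEALTH_WARN", ["1 osds down"])
print(entry["delta"], entry["messages"])
```

The resulting entry could then be fed into the same out-of-100 calculation as any other plugin result.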

joshuaboniface added the improvement label 2022-11-23 02:06:08 -05:00
Reference: parallelvirtualcluster/pvc#159