Add persistent cluster fault messages #164
Add the ability to define persistent (cluster-level) health messages, a la the `ceph crash` system.
Messages can be added to this persistent list in order to indicate states that might be transient but for which a persistent notification is important. Most notably, node fences should use this, as the fence may clear with no further indication of what happened besides a `flushed` node.
Persistent messages could then be cleared by a specific command which would acknowledge and clear them. Acknowledgement could possibly extend to other states as well.
Planned as the next major feature (0.9.82 assuming no intervening bugfix releases).
Key features will include:
a. View persistent messages (current `cluster status`).
b. Acknowledge (show but eliminate health impact, with notes) or delete (fully clear along with notes) persistent messages.
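The difference between acknowledging and deleting can be sketched roughly as follows. This is only an illustration, not the PVC implementation: the `Fault` and `FaultList` classes, their field names, and the assumption that cluster health starts at 100 and each unacknowledged message subtracts its delta are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Fault:
    """Hypothetical persistent message record; field names are illustrative."""
    message: str
    health_delta: int           # assumed: percentage points removed from health
    acknowledged: bool = False
    notes: str = ""


@dataclass
class FaultList:
    """Hypothetical in-memory list of persistent messages keyed by ID."""
    faults: Dict[str, Fault] = field(default_factory=dict)

    def acknowledge(self, fault_id: str, notes: str = "") -> None:
        # Keep the message visible, but stop it from affecting health.
        fault = self.faults[fault_id]
        fault.acknowledged = True
        fault.notes = notes

    def delete(self, fault_id: str) -> None:
        # Fully clear the message along with its notes.
        del self.faults[fault_id]

    def health(self) -> int:
        # Assumed model: start at 100, subtract unacknowledged deltas, clamp at 0.
        lost = sum(f.health_delta for f in self.faults.values() if not f.acknowledged)
        return max(0, 100 - lost)
```

With this model an acknowledged message still appears in the list, with its notes, while the reported health recovers; a deleted message disappears entirely.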
After thinking a bit about naming, I'd choose `fault` here. This lends itself well both to a frontend command (`pvc cluster fault ...`) as well as to a separate backend in Zookeeper, which would then be combined with the remaining monitoring information in `cluster status`.
Off the top of my head the following would be `fault` events:
Faults should cause a health degradation of at least 20%.
Can also add the ability for node monitoring to register events as `fault`s, and ensure they're persistent.
Might also look to move health monitoring out into its own daemon, `pvchealthd`, as it is quite decoupled from the rest of what `pvcnoded` does, and this would also allow it to "see" failures of `pvcnoded` itself. Moved into #171.
Title changed from "Add persistent cluster health messages" to "Add persistent cluster fault messages".
With the completion of #171, adding more advanced functionality like this can be accomplished in the `pvchealthd` daemon instead of complicating the main node daemon further.
Started implementation but have realized that this system should really be used for any cluster health degradation issue.
Thus all cluster health-decreasing events would become faults. All faults would thus have a time of first notice available, and be acknowledgeable.
This would also include check plugins.
The fault ID can be a hash of the delta and message, which will prevent re-firing the same issue multiple times. Thus the "time" would be the last time the fault was in that condition.
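A minimal sketch of that ID scheme, under stated assumptions: SHA-256, the truncated ID length, the dictionary-based `store`, and the `first_seen`/`last_seen` field names are illustrative choices, not details taken from PVC.

```python
import hashlib
from datetime import datetime, timezone


def fault_id(health_delta: int, message: str) -> str:
    """Derive a stable ID from the delta and message (hash choice is an assumption)."""
    digest = hashlib.sha256(f"{health_delta}:{message}".encode("utf-8")).hexdigest()
    return digest[:16]


def record_fault(store: dict, health_delta: int, message: str, now=None) -> str:
    """Insert a new fault, or refresh the timestamp of an identical existing one."""
    now = now or datetime.now(timezone.utc)
    fid = fault_id(health_delta, message)
    if fid in store:
        # Same delta and message: do not re-fire, just update the last-seen time.
        store[fid]["last_seen"] = now
    else:
        store[fid] = {
            "message": message,
            "health_delta": health_delta,
            "first_seen": now,   # time of first notice
            "last_seen": now,    # last time the fault was in this condition
        }
    return fid
```

Reporting the same condition twice leaves a single entry with an updated `last_seen`, while a different message or delta hashes to a new ID and creates a new entry.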
This basically entirely eliminates individual node health, and moves the entire thing to a cluster level.
Persistent faults have been added and will be included in 0.9.84.
All node monitoring plugins will generate a fault at 1/2 of their rated health delta. The cluster-wide faults are added with various values.
Frontend support is complete, with the ability to acknowledge or delete a single, or all, faults.
Faults use a full hash of the message text, health delta, and name (for plugins) to ensure that identical faults are updated but different faults generate new fault messages. This might be a little noisy in some cases (e.g. node crash) but this is desired.