Add persistent cluster fault messages #164

joshuaboniface · 2023-10-23T02:07:10-04:00

joshuaboniface commented

2023-10-23 02:07:10 -04:00

Add the ability to define persistent (cluster-level) health messages, a la the `ceph crash` system.

Messages can be added to this persistent list to indicate states that might be transient but for which a persistent notification is important. Most notably, node fences should use this, since a fence may clear and leave no indication of what happened beyond a `flushed` node.

Persistent messages could then be acknowledged and cleared by a specific command. Acknowledgement could possibly extend to other states as well.

joshuaboniface added the improvement, Daemon, and Client labels 2023-10-23 02:07:10 -04:00

joshuaboniface added this to the 1.0 milestone 2023-11-17 01:37:16 -05:00

joshuaboniface commented

2023-11-17 01:46:03 -05:00

Planned as the next major feature (0.9.82, assuming no intervening bugfix releases).

Key features will include:

1. Zookeeper support for persistent messages and responses.
2. Allocation of persistent messages during certain events.
3. CLI interface to:
   a. View persistent messages (current `cluster status`).
   b. Acknowledge (show but eliminate health impact, with notes) or delete (fully clear along with notes) persistent messages.
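
The acknowledge and delete semantics in the list above can be sketched as follows. This is only an illustration, not the PVC implementation: the `Fault` and `FaultStore` names, the field names, the in-memory dictionary standing in for Zookeeper, and the "100 minus the sum of unacknowledged deltas" scoring are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Fault:
    """Hypothetical persistent message record; field names are illustrative."""

    message: str
    health_delta: int  # percentage points removed from cluster health
    acknowledged: bool = False
    notes: list[str] = field(default_factory=list)


class FaultStore:
    """In-memory stand-in for the Zookeeper-backed message list (assumption)."""

    def __init__(self) -> None:
        self.faults: dict[str, Fault] = {}

    def add(self, fault_id: str, fault: Fault) -> None:
        self.faults[fault_id] = fault

    def acknowledge(self, fault_id: str, note: str) -> None:
        # An acknowledged message stays visible but no longer affects health.
        fault = self.faults[fault_id]
        fault.acknowledged = True
        fault.notes.append(note)

    def delete(self, fault_id: str) -> None:
        # Deleting clears the message together with its notes.
        del self.faults[fault_id]

    def health(self) -> int:
        # Assumed scoring: start at 100 and subtract unacknowledged deltas.
        lost = sum(
            f.health_delta for f in self.faults.values() if not f.acknowledged
        )
        return max(0, 100 - lost)
```

In this sketch, acknowledging a message restores the health value while keeping the record listed, whereas deleting it removes the record entirely.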

joshuaboniface commented

2023-11-26 03:32:36 -05:00

After thinking a bit about naming, I'd choose `fault` here. This lends itself well both to a frontend command (`pvc cluster fault ...`) and to a separate backend in Zookeeper, which would then be combined with the remaining monitoring information in `cluster status`.

Off the top of my head the following would be `fault` events:

* Node fence (record the fenced node, who fenced it, and the power states).
* VM crash + restart (record the active node and VM state).
* Service failures (possibly?)
* Any Ceph HEALTH_ERR event.
* Backup or DR failures (after #169 is implemented)
* ...

Faults should cause a health degradation of at least 20%.

Can also add the ability for node monitoring to register events as `fault`s, and ensure they're persistent.

~~Might also look to move health monitoring out into its own daemon, `pvchealthd`, as it is quite decoupled from the rest of what `pvcnoded` does, and this would also allow it to "see" failures of `pvcnoded` itself.~~ Moved into #171.

joshuaboniface changed title from ~~Add persistent cluster health messages~~ to Add persistent cluster fault messages

2023-11-26 03:36:06 -05:00

joshuaboniface referenced this issue

2023-11-26 03:44:17 -05:00

Separate out pvcworkerd and pvchealthd into discrete packages #171

joshuaboniface commented

2023-12-01 02:14:41 -05:00

With the completion of #171, more advanced functionality like this can be implemented in the `pvchealthd` daemon instead of further complicating the main node daemon.

joshuaboniface commented

2023-12-01 04:22:13 -05:00

Started implementation but have realized that this system *should* really be used for any cluster health degradation issue.

Thus all cluster health-decreasing events would become faults. All faults would then have a time of first notice available, and be acknowledgeable.

This would also include check plugins.

The fault ID can be a hash of the delta and message, which will prevent re-firing the same issue multiple times. Thus the "time" would be the last time the fault was in that condition.

This essentially eliminates individual node health and moves the entire thing to the cluster level.

joshuaboniface referenced this issue from a commit

2023-12-07 21:36:29 -05:00

Add initial fault generation in pvchealthd

joshuaboniface commented

2023-12-08 09:33:40 -05:00

Persistent faults have been added and will be included in 0.9.84.

All node monitoring plugins will generate a fault at half of their rated health delta. The cluster-wide faults are added with various values.

Frontend support is complete, with the ability to acknowledge or delete a single fault or all faults.

Faults use a full hash of the message text, health delta, and name (for plugins) to ensure that identical faults are updated but different faults generate new fault messages. This might be a little noisy in some cases (e.g. node crash), but this is desired.
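
A minimal sketch of the hash-based deduplication described above. The choice of SHA-256, the `|`-joined encoding, the dictionary layout, and the `fault_id` and `record_fault` names are assumptions for illustration only; the thread does not specify which hash or storage format `pvchealthd` actually uses.

```python
import hashlib
from datetime import datetime, timezone


def fault_id(message: str, health_delta: int, name: str = "") -> str:
    """Derive a stable ID from the fields that define a fault.

    SHA-256 and the '|' separator are assumptions, not the real scheme.
    """
    payload = f"{name}|{health_delta}|{message}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def record_fault(
    faults: dict, message: str, health_delta: int, name: str = ""
) -> str:
    """Insert a new fault, or refresh the timestamp of an identical one."""
    fid = fault_id(message, health_delta, name)
    now = datetime.now(timezone.utc)
    if fid in faults:
        # Identical fault: update the last-seen time instead of re-firing.
        faults[fid]["last_time"] = now
    else:
        faults[fid] = {
            "name": name,
            "message": message,
            "health_delta": health_delta,
            "first_time": now,
            "last_time": now,
        }
    return fid
```

Because the message text is part of the hash, two fences of different nodes produce two separate entries, which matches the "a little noisy" behaviour noted above.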

joshuaboniface closed this issue

2023-12-08 09:33:40 -05:00

Reference: parallelvirtualcluster/pvc#164