pvc/node-daemon
Joshua Boniface c6d552ae57 Rework success checks for IPMI fencing
Previously, if the node failed to restart, it was declared a "bad fence"
and no further action would be taken. However, in some situations, for
instance critical hardware failures, intelligent systems will not
attempt (or will not succeed at) starting the node back up, which would
leave dead, known-offline nodes without recovery.

Tweak this behaviour somewhat. The main path of Reboot -> Check On ->
Success + fence-flush is retained, but some additional side-paths are
now defined:

1. We attempt to power "on" the chassis 1 second after the reboot, just
in case it is off and can be recovered. We then wait another 2 seconds
and check the power status (as we did before).

2. If the reboot succeeded, follow this series of choices:

    a. If the chassis is on, the fence succeeded.

    b. If the chassis is off, the fence "succeeded" as well.

    c. If the chassis is in some other state, the fence failed.

3. If the reboot failed, follow this series of choices:

    a. If the chassis is off, the fence itself failed, but we can treat
    it as "succeeded" since the chassis is in a known-offline state.
    This is the most likely situation when there is a critical hardware
    failure and the server's IPMI does not allow it to start back up
    again.

    b. If the chassis is in any other state ("on" or unknown), the fence
    itself failed and we must treat this as a fence failure.
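
The decision paths above can be sketched roughly as follows. This is a
minimal illustration, not the code from this commit: the names
classify_fence, parse_power_state, and fence_node are hypothetical, the
IPMI calls are assumed to go through an injected runner function, and
the ipmitool-style subcommands ("chassis power reset", "on", "status")
and the "Chassis Power is on/off" output format are assumptions about
how the IPMI layer is driven.

```python
import time
from typing import Callable, List, Tuple

# Hypothetical runner: takes IPMI subcommand arguments, returns (return code, stdout).
Runner = Callable[[List[str]], Tuple[int, str]]


def classify_fence(reboot_ok: bool, chassis_state: str) -> bool:
    """Return True if the node can be treated as fenced (safe to fence-flush)."""
    if reboot_ok:
        # 2a/2b: "on" or "off" both count as success; 2c: anything else fails.
        return chassis_state in ("on", "off")
    # 3a: reboot failed but the chassis is known-offline; 3b: anything else fails.
    return chassis_state == "off"


def parse_power_state(stdout: str) -> str:
    """Map assumed 'Chassis Power is on/off' output to 'on', 'off' or 'unknown'."""
    text = stdout.strip().lower()
    if text.endswith("is on"):
        return "on"
    if text.endswith("is off"):
        return "off"
    return "unknown"


def fence_node(run: Runner, sleep: Callable[[float], None] = time.sleep) -> bool:
    """Reboot, try to power on after 1s, check status after another 2s."""
    reboot_rc, _ = run(["chassis", "power", "reset"])
    sleep(1)
    # Step 1: attempt to power on in case the chassis is off but recoverable.
    run(["chassis", "power", "on"])
    sleep(2)
    status_rc, status_out = run(["chassis", "power", "status"])
    state = parse_power_state(status_out) if status_rc == 0 else "unknown"
    return classify_fence(reboot_rc == 0, state)
```

Keeping the classification separate from the IPMI calls means each of
the five side-paths can be checked without touching real hardware.
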

Overall, this should alleviate the aforementioned issue of a critical
failure that leaves the node persistently "off" not triggering a
fence-flush, and should make fencing more robust.
2021-07-13 17:54:41 -04:00
monitoring Improve Munin check with extinfo 2020-10-19 11:01:00 -04:00
pvcnoded Rework success checks for IPMI fencing 2021-07-13 17:54:41 -04:00
daemon_lib Add daemon_lib symlink to pvcnoded 2021-05-30 00:00:07 -04:00
pvc-flush.service Increase start delay of flush service 2020-08-11 14:17:35 -04:00
pvc.target Correct name of systemd target 2020-02-08 20:39:07 -05:00
pvcnoded.py Update copyright year in headers 2021-03-25 17:01:55 -04:00
pvcnoded.sample.yaml Add initial SR-IOV support to node daemon 2021-06-15 22:56:09 -04:00
pvcnoded.service Remove systemd deps on zookeeper and libvirt 2021-01-28 16:25:02 -05:00