pvc/node-daemon/pvcnoded
Joshua Boniface c6d552ae57 Rework success checks for IPMI fencing
Previously, if the node failed to restart, it was declared a "bad fence"
and no further action was taken. However, in some situations, for
instance critical hardware failures, intelligent systems will not
attempt (or will not succeed at) starting the node back up. This left
dead, known-offline nodes without recovery.

Tweak this behaviour somewhat. The main path of Reboot -> Check On ->
Success + fence-flush is retained, but some additional side-paths are
now defined:

1. We attempt to power "on" the chassis 1 second after the reboot, just
in case it is off and can be recovered. We then wait another 2 seconds
and check the power status (as we did before).

2. If the reboot succeeded, follow this series of choices:

    a. If the chassis is on, the fence succeeded.

    b. If the chassis is off, the fence "succeeded" as well.

    c. If the chassis is in some other state, the fence failed.

3. If the reboot failed, follow this series of choices:

    a. If the chassis is off, the fence itself failed, but we can treat
    it as "succeeded" since the chassis is in a known-offline state.
    This is the most likely situation when there is a critical hardware
    failure, and the server's IPMI does not allow itself to start back
    up again.

    b. If the chassis is in any other state ("on" or unknown), the fence
    itself failed and we must treat this as a fence failure.
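The choices in steps 2 and 3 can be sketched as a small decision
function. This is only an illustration of the logic described above, not
the code in fencing.py; the function name, its parameters, and the
state strings are assumptions made for the example.

```python
def fence_treated_as_success(reboot_ok: bool, chassis_state: str) -> bool:
    """Decide whether a fence attempt counts as successful.

    Hypothetical helper: `reboot_ok` is whether the IPMI reboot command
    succeeded, and `chassis_state` is the power status read afterwards
    ("on", "off", or anything else for an unknown state).
    """
    if reboot_ok:
        # 2a/2b: on or off are both acceptable; 2c: anything else fails.
        return chassis_state in ("on", "off")
    # 3a: the reboot failed but the chassis is known-offline, so treat
    # the fence as succeeded; 3b: "on" or unknown is a fence failure.
    return chassis_state == "off"
```

Only a result of True would lead to a fence-flush; a failed reboot with
the chassis still "on" is the case that must never be flushed.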

Overall, this should resolve the aforementioned issue, where a critical
failure leaving the node persistently "off" did not trigger a
fence-flush, and should make fencing more robust.
2021-07-13 17:54:41 -04:00
..
CephInstance.py Fix typo in CephInstance path 2021-06-10 00:36:02 -04:00
DNSAggregatorInstance.py Handle an additional exception case 2021-06-14 17:15:40 -04:00
Daemon.py Bump version to 0.9.25 2021-07-11 23:19:09 -04:00
MetadataAPIInstance.py Ensure we don't grab None data 2021-06-13 16:43:25 -04:00
NodeInstance.py Fix typo in schema path name 2021-07-05 23:23:23 -04:00
SRIOVVFInstance.py Ensure MTU is set on VF when starting up 2021-06-22 02:26:14 -04:00
VMConsoleWatcherInstance.py Use more compatible is_alive in thread 2021-06-13 14:36:27 -04:00
VMInstance.py Make extra sure VMs terminate when told 2021-07-02 11:40:34 -04:00
VXNetworkInstance.py Fix name of schema element 2021-06-13 20:56:17 -04:00
__init__.py Use consistent naming of components 2020-02-08 19:34:07 -05:00
dnsmasq-zookeeper-leases.py Update copyright year in headers 2021-03-25 17:01:55 -04:00
fencing.py Rework success checks for IPMI fencing 2021-07-13 17:54:41 -04:00