Joshua M. Boniface
c6d552ae57
Previously, if the node failed to restart, it was declared a "bad fence" and no further action was taken. However, there are some situations, for instance critical hardware failures, in which intelligent systems will not attempt (or will not succeed at) starting the node back up, resulting in dead, known-offline nodes with no path to recovery. Tweak this behaviour somewhat. The main path of Reboot -> Check On -> Success + fence-flush is retained, but some additional side-paths are now defined:

1. We attempt to power "on" the chassis 1 second after the reboot, just in case it is off and can be recovered. We then wait another 2 seconds and check the power status (as we did before).

2. If the reboot succeeded, follow this series of choices:

   a. If the chassis is on, the fence succeeded.
   b. If the chassis is off, the fence "succeeded" as well.
   c. If the chassis is in some other state, the fence failed.

3. If the reboot failed, follow this series of choices:

   a. If the chassis is off, the fence itself failed, but we can treat it as "succeeded", since the chassis is in a known-offline state. This is the most likely situation when there is a critical hardware failure and the server's IPMI will not allow it to start back up again.
   b. If the chassis is in any other state ("on" or unknown), the fence itself failed and we must treat this as a fence failure.

Overall, this should alleviate the aforementioned issue of a critical failure leaving the node persistently "off" without triggering a fence-flush, and should make fencing more robust.
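The decision tree above could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `set_power_on` and `get_power_state` are hypothetical stand-ins for the IPMI chassis power helpers, and the delays are parameterized only so the sketch is testable.

```python
import time


def follow_up_fence(reboot_ok, set_power_on, get_power_state,
                    on_delay=1, check_delay=2):
    """Return True if the fence can be considered successful, i.e. the
    node is in a known state: running again, or verifiably powered off.

    set_power_on and get_power_state are hypothetical IPMI helpers;
    get_power_state returns "on", "off", or None for an unknown state.
    """
    # Side-path 1: try to power the chassis on shortly after the reboot,
    # in case it is off and can be recovered, then re-check its state.
    time.sleep(on_delay)
    set_power_on()
    time.sleep(check_delay)
    state = get_power_state()

    if reboot_ok:
        # 2a/2b: "on" is a clean fence; "off" is a known-offline state,
        # so it also counts as success. 2c: any other state is a failure.
        return state in ("on", "off")
    else:
        # 3a: the reboot failed, but the chassis is verifiably off, so
        # the node is known-offline and the fence is treated as a success.
        # 3b: any other state ("on" or unknown) is a hard fence failure.
        return state == "off"
```

The key design point is that "off" is acceptable on both branches: fencing only needs the node to be in a *known* state, not necessarily a running one, before a fence-flush is safe.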