From daddf13b01e9436a35e00774ca582b6df127655e Mon Sep 17 00:00:00 2001
From: "Joshua M. Boniface" <joshua@boniface.me>
Date: Sat, 16 Sep 2023 23:46:18 -0400
Subject: [PATCH] Add fencing documentation

---
 docs/deployment/cluster-architecture.md  |  4 +-
 docs/deployment/fencing.md               | 97 ++++++++++++++++++++++++
 docs/deployment/hardware-requirements.md |  2 +-
 3 files changed, 100 insertions(+), 3 deletions(-)
 create mode 100644 docs/deployment/fencing.md

diff --git a/docs/deployment/cluster-architecture.md b/docs/deployment/cluster-architecture.md
index 95e5817..9ec00c5 100644
--- a/docs/deployment/cluster-architecture.md
+++ b/docs/deployment/cluster-architecture.md
@@ -78,7 +78,7 @@ Many PVC daemons, as discussed below, leverage a majority quorum to function. A
 
 This is an important consideration when deciding the number of coordinators to allocate: a 3-coordinator system can tolerate the loss of a single coordinator without impacting the cluster, but losing 2 would render the cluster inoperable; similarly, a 5-coordinator system can tolerate the loss of 2 coordinators, but losing 3 would render the cluster inoperable. In addition, these coordinators must be located in such a way that a majority can communicate in outage events, in order for the cluster to remain operational. This affects the network and physical design of a cluster and must be carefully considered during deployment; for instance, network switches and links, and power, should be redundant.
 
-For more details on this, see the [Fencing & Georedundancy](/deployment/fencing-and-georedundancy) documentation. This document also covers the node fencing process, which allows automatic recovery from a node failure in certain outage events.
+For more details on this, see the [Fencing](/deployment/fencing) and [Georedundancy](/deployment/georedundancy) documentation. The first also covers the node fencing process, which allows automatic recovery from a node failure in certain outage events.
 
 Hypervisors are not affected by the coordinator quorum: a cluster can lose any number of non-coordinator hypervisors without impacting core services, though compute resources (CPU and memory) must be available on the remaining nodes for VMs to function properly, and any OSDs on these hypervisors, if applicable, would become unavailable, potentially impacting storage availability.
 
@@ -130,7 +130,7 @@ The "upstream" network requires outbound Internet access, as it will be used to
 
 This network, though it requires Internet access, should not be exposed directly to the Internet or to other untrusted local networks for security reasons. PVC itself makes no attempt to hinder access to nodes from within this network. At a minimum, an upstream firewall should prevent external access to this network, and only trusted hosts or on-cluster VMs should be added to it.
 
-In addition to all other functions, server IPMI interfaces should reside either directly in this network, or in a network directly reachable from this network, to provide fencing and auto-recovery functionality. For more details, see the [Fencing & Georedundancy](/deployment/fencing-and-georedundancy) documentation.
+In addition to all other functions, server IPMI interfaces should reside either directly in this network, or in a network directly reachable from this network, to provide fencing and auto-recovery functionality. For more details, see the [Fencing](/deployment/fencing) documentation.
 
 #### Cluster
 
diff --git a/docs/deployment/fencing.md b/docs/deployment/fencing.md
new file mode 100644
index 0000000..fb5f13f
--- /dev/null
+++ b/docs/deployment/fencing.md
@@ -0,0 +1,97 @@
+---
+title: Fencing
+---
+
+PVC features a fencing system to provide automatic recovery of nodes from certain failure scenarios. This document details the fencing process, limitations, and expectations, as well as how this factors into georedundant designs.
+
+[TOC]
+
+## Overview
+
+Fencing in PVC provides a mechanism for a cluster's nodes to determine if one of their peers has stopped responding, take action to ensure the failed node is fully power-cycled, and then, if successful, automatically bring up the affected VMs from the dead node on other nodes while awaiting its return to service.
+
+Properly configured fencing can thus help ensure the maximum uptime for VMs in the case of a faulty node.
+
+Fencing is enabled by default for all nodes that have the `fence_intervals` configuration key set and whose IPMI is reachable and usable via `ipmitool` from their peers. Nodes check their own IPMI at daemon startup to validate this and print a warning if the check fails; in addition, a regular health check monitors the IPMI interface and will degrade the node health if it is not reachable or not responding.
+
+Fencing can be temporarily disabled by setting the cluster maintenance mode to `on` and resumed by setting it to `off`. This can be useful during maintenance events; however, the administrator should be careful to `flush` any affected nodes of running VMs first to avoid trouble.
+
+## IPMI Configuration
+
+For fencing to be enabled, several configurations must be correctly set.
+
+* The node must have a proper IPMI interface, as detailed in the [Hardware Requirements](/deployment/hardware-requirements/#ipmilights-out-management) documentation.
+* The IPMI interface must be either in the [cluster "upstream" network](/deployment/cluster-architecture/#upstream), or in another network reachable by it. The former is strongly recommended, because the latter is potentially susceptible to network faults in the routing between the networks, which might cause fencing to fail in otherwise valid scenarios.
+* The IPMI BMC must be configured with an `Administrator`-level user with IPMI-over-LAN privileges enabled.
+* The IPMI interface (IP or hostname) and aforementioned user of each node must be configured in the `fencing` -> `ipmi` section of the `pvcnoded.yaml` file of that node.
+
+PVC will automatically check the reachability of its IPMI and its functionality early during node startup. The functionality can also be tested via the `ipmitool -I lanplus` command from a node.
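+
+As a rough illustration of such a manual test, the sketch below builds an `ipmitool -I lanplus` invocation that queries a node's chassis power status. The helper name, host, and credentials are hypothetical placeholders and are not part of PVC; only the `ipmitool` options shown (`-I`, `-H`, `-U`, `-P`) and the `chassis power status` subcommand are standard.
+
+```python
+from typing import List
+
+
+def ipmi_power_status_cmd(host: str, user: str, password: str) -> List[str]:
+    """Build an ipmitool argv that queries chassis power status over IPMI-over-LAN.
+
+    Hypothetical helper for manual testing; not PVC's own code.
+    """
+    return [
+        "ipmitool", "-I", "lanplus",       # IPMI v2.0 over LAN
+        "-H", host,                        # BMC IP or hostname
+        "-U", user,                        # Administrator-level IPMI user
+        "-P", password,                    # that user's password
+        "chassis", "power", "status",      # read-only query, safe to run
+    ]
+
+
+if __name__ == "__main__":
+    # Placeholder values; substitute the node's real IPMI details.
+    print(" ".join(ipmi_power_status_cmd("ipmi-hv1.example.tld", "pvc", "CHANGEME")))
+```
+
+The resulting list can be passed to `subprocess.run()` from a peer node; a reply reporting the chassis power state suggests that the address, credentials, and IPMI-over-LAN privileges are all usable for fencing.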
+
+The [PVC Ansible framework](/deployment/getting-started/) will automatically configure most aspects of this IPMI setup, though some might require manual configuration. Ensure you test before putting the cluster into production.
+
+## Fencing Process
+
+### Dead Node Detection
+
+Node fencing is handled during regular node keepalive events. Keepalives occur every 5 seconds (default `keepalive_interval`), during which each node checks into the cluster by providing the current UNIX epoch timestamp in a configuration key.
+
+At the end of each keepalive event, all nodes check their peers' timestamps and compare them against the current time. If the peers detect that a node has not checked in for 6 intervals (default `fence_intervals`), or 30 seconds by default, one node at random will begin the fencing process as the watching node. First, a timer is started for 6 more `keepalive_intervals` (hardcoded), during which a checkin from the dead node will cancel the fence (a "saving throw").
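+
+The timing described above can be sketched as simple arithmetic. This is a minimal illustration using the default values from this section; the function names are hypothetical, and the exact comparison PVC uses (`>=` versus `>`) is an assumption.
+
+```python
+KEEPALIVE_INTERVAL = 5      # seconds, default `keepalive_interval`
+FENCE_INTERVALS = 6         # default `fence_intervals`
+SAVING_THROW_INTERVALS = 6  # hardcoded per this document
+
+
+def is_presumed_dead(last_keepalive: float, now: float,
+                     keepalive_interval: int = KEEPALIVE_INTERVAL,
+                     fence_intervals: int = FENCE_INTERVALS) -> bool:
+    """True once a peer has missed `fence_intervals` keepalives in a row."""
+    return (now - last_keepalive) >= keepalive_interval * fence_intervals
+
+
+def earliest_fence_delay(keepalive_interval: int = KEEPALIVE_INTERVAL,
+                         fence_intervals: int = FENCE_INTERVALS,
+                         saving_throws: int = SAVING_THROW_INTERVALS) -> int:
+    """Seconds from the last valid keepalive until fencing may begin."""
+    return keepalive_interval * (fence_intervals + saving_throws)
+
+
+if __name__ == "__main__":
+    print(is_presumed_dead(last_keepalive=1000, now=1029))  # 29 s of silence
+    print(is_presumed_dead(last_keepalive=1000, now=1030))  # 30 s of silence
+    print(earliest_fence_delay())
+```
+
+With the defaults, a peer is presumed dead after 30 seconds of silence, and fencing begins no earlier than 60 seconds after the last valid keepalive.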
+
+### Dead Node Fencing
+
+If all 6 saving throw intervals pass without further updates to the dead node's timestamp, actual fencing will begin; by default this will be 60-65 seconds after the last valid keepalive. The exact process is as follows, all run from the selected watching node:
+
+1. The dead node is issued a `chassis power off` via IPMI-over-LAN to trigger an immediate power off.
+1. Wait 1 second.
+1. The `chassis power state` of the dead node is checked and recorded.
+1. The dead node is issued a `chassis power on` via IPMI-over-LAN to trigger a power on.
+1. Wait 2 seconds.
+1. The `chassis power state` of the dead node is checked and recorded.
+
+With these 6 steps and the 2 saved results of the `chassis power state`, PVC can determine with near certainty that the dead node was actually powered off, and thus that any VMs that were potentially running on it were terminated. Specifically, if the first result was `Off` and the second was any valid value, the node was definitely shut down (either on its own, or by the first `chassis power off` command). If it cannot determine this, for instance because IPMI was unreachable or neither power state result was `Off`, no action is taken.
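+
+The sequence and decision rule above can be sketched as follows. This is an illustrative outline rather than PVC's actual implementation: `run_ipmi` is a hypothetical callable that issues the given chassis power action and returns the reported power state (or `None` on failure), and treating only `on` and `off` as valid states is an assumption.
+
+```python
+import time
+from typing import Callable, Optional
+
+VALID_STATES = {"on", "off"}  # assumed set of valid power states
+
+
+def _normalize(state: Optional[str]) -> Optional[str]:
+    """Lower-case a reported power state; None means the query failed."""
+    return state.strip().lower() if isinstance(state, str) else None
+
+
+def fence_node(run_ipmi: Callable[[str], Optional[str]],
+               sleep: Callable[[float], None] = time.sleep) -> bool:
+    """Run the six fencing steps; return True only if power-off was confirmed."""
+    run_ipmi("off")                        # 1. chassis power off
+    sleep(1)                               # 2. wait 1 second
+    first = _normalize(run_ipmi("status")) # 3. record power state
+    run_ipmi("on")                         # 4. chassis power on
+    sleep(2)                               # 5. wait 2 seconds
+    second = _normalize(run_ipmi("status"))# 6. record power state
+    # Rule stated above: first result `Off`, second result any valid value.
+    return first == "off" and second in VALID_STATES
+```
+
+Passing `sleep` as a parameter keeps the sketch testable without real delays. If either status query fails, for instance because IPMI is unreachable, the function returns `False`, matching the "no action is taken" behaviour described above.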
+
+### VM Recovery
+
+Once a dead node has been successfully fenced and at least 1 more `keepalive_interval` has passed, the watching node will begin fencing recovery.
+
+The action taken during fencing recovery depends on the `successful_fence` configuration key, which can be either `migrate`, which will perform the steps below, or `none`, which will perform no recovery action and stop here.
+
+First, the node is put into a special `fencing-flush` domain state, to indicate that it is undergoing a forced flush after fencing. Then, for each VM which was running on the dead node:
+
+1. The RBD locks on all VM storage volumes are cleared.
+1. The VM is temporarily `migrate`d to one active peer node based on the node's configured `target_selector` (default `mem`).
+1. The VM is started up.
+
+If, at a later time, the dead node successfully recovers and resumes normal operation, it can be put back into service. This **will not** occur automatically, as the node could still be in a bad state and only barely operating; an administrator must closely inspect the node and restore it to service manually after confirming correct operation.
+
+### Failures
+
+If a fence fails for any reason (for instance, the IPMI of the dead node is not reachable), by default no action is taken, as this could be unsafe for the integrity of VM data. This can be overridden by adjusting the `failed_fence` configuration key in conjunction with the node suicide discussed below; however, this is strongly discouraged.
+
+### Node Suicide
+
+As an alternative to remote fencing, nodes can be configured to kill themselves by adjusting the `suicide_intervals` configuration key to a non-zero value. If the node itself does not check in for this many intervals, it will trigger a self restart via the `reboot -f` command. However, this is not reliable, and the other nodes will have no way of accurately determining the state of the node and whether VMs are safe to migrate, so this is strongly discouraged.
+
+## Valid Fencing Conditions
+
+The conditions in which a node can be successfully fenced are limited, and thus, autorecovery is limited only to those situations where a fence can succeed. In short, any situation whereby a node's OS is not responding normally, but its IPMI interface is still up and available, should succeed in a fence; in contrast, those where the IPMI interface is also unavailable will fail.
+
+The following table covers some common scenarios, and whether fencing and automatic recovery can be expected to occur.
+
+| Situation | Fence Possible? | Autorecovery Possible? | Notes |
+| --------- | --------------- | ---------------------- | ----- |
+| Node OS lockup (load, OOM, etc.) | ✅ | ✅ | A key design situation for the fencing system |
+| Node OS kernel panic | ✅ | ✅ | A key design situation for the fencing system |
+| Node primary network cut | ✅ | ✅ | Only affecting primary links, not IPMI (see below); a key design situation |
+| Node full network cut | ❌ | ❌ | All links are down, e.g. full network failure including IPMI |
+| Node power loss | ❌ | ❌ | Impossible to determine if this is a transient network cut or actual power loss without IPMI |
+| Node hardware failure (CPU, memory, etc.) | ✅ | ✅ | IPMI interface should remain up in these scenarios; a key design situation |
+| Node hardware failure (motherboard) | ✅ | ✅ | If IPMI is **online** after failure |
+| Node hardware failure (motherboard) | ❌ | ❌ | If IPMI is **offline** after failure |
+| Node hardware failure (full chassis) | ❌ | ❌ | Full power loss, etc. if IPMI is offline |
+
+Care should be taken to understand these scenarios and which situations can be recovered from automatically, and which require manual human intervention to confirm the situation ("is the node actually physically off?") and manual recovery.
+
+## Future Development
+
+Future versions of PVC may add support for additional fencing modes, for instance the ability for a fence to trigger a remote power device (switched PDU, etc.) or to detect more esoteric situations with the node power state via IPMI, as need requires. The author however believes that the current implementation satisfies the vast majority of potential situations for which autorecovery is beneficial and thus such work would not see much benefit, though he is open to changing his mind.
diff --git a/docs/deployment/hardware-requirements.md b/docs/deployment/hardware-requirements.md
index 6dc2370..a97dabd 100644
--- a/docs/deployment/hardware-requirements.md
+++ b/docs/deployment/hardware-requirements.md
@@ -30,7 +30,7 @@ All aforementioned server vendors support some form of IPMI Lights-out Managemen
 
 * It is **recommended** for a redundant, production PVC node to feature IPMI Lights-out Management, on a dedicated Ethernet port, with support for IPMI-over-LAN functionality, reachable from or in the [cluster "upstream" network](/deployment/cluster-architecture/#upstream).
 
-This feature is not strictly required, however it is required for the [PVC fencing system](/deployment/fencing-and-georedundancy) to function properly, which is required for auto-recovery from node failures. PVC will detect the lack of a reachable IPMI interface at startup and disable fencing and auto-recovery in such a case.
+This feature is not strictly required, however it is required for the [PVC fencing system](/deployment/fencing) to function properly, which is required for auto-recovery from node failures. PVC will detect the lack of a reachable IPMI interface at startup and disable fencing and auto-recovery in such a case.
 
 ## CPU