Not sure how this didn't cause an issue until now, but the wrong key
path was used and this was getting unexpected data with the newly-added
version string instead of the proper mode string.
Should correct issues on cold start as well as if a VM crashes
uncleanly, which would prevent the VM from starting due to stale RBD
locks.
This implementation has four parts:
1. Update how IP addresses are handled, specifically by replacing all
previous instances of "vni_ipaddr" with "vni_floatingipaddr", and then
adding the "vni_ipaddr" with the real data for this node's IPs. Also
include the storage IPs in this where they weren't before, so each
this_node actually has the local IPs plus floating IPs. This enables
the next two steps.
2. Modify flush_locks to take this_node as an argument, and update the
run_command function to only operate against this node, rather than on
the primary coordinator.
3. Have the flush_locks check each lock against the current node, to
verify that the lock is actually held by the current node. This is the
only way to do this safely. During fencing, we override this by not
passing a this_node which bypasses this check.
4. Have the VM start do the check for VM failure/startup and execute a
flush_locks before actually starting the VM.
Previously, contention could occasionally cause a flap/dual primary
contention state due to the lack of checking within this function. This
could cause a state where a node transitions to primary than is almost
immediately shifted away, which could cause undefined behaviour in the
cluster.
The solution includes several elements:
* Implement an exclusive lock operation in zkhandler
* Switch the become_primary function to use this exclusive lock
* Implement exclusive locking during the contention process
* As a failsafe, check stat versions before setting the node as the
primary node, in case another node already has
* Delay the start of takeover/relinquish operations by slightly
longer than the lock timeout
* Make the current router_state conditions more explicit (positive
conditionals rather than negative conditionals)
The new scenario ensures that during contention, only one secondary will
ever succeed at acquiring the lock. Ideally, the other would then grab
the lock and pass, but in testing this does not seem to be the case -
the lock always times out, so the failsafe check is technically not
needed but has been left as an added safety mechanism. With this setup,
the node that fails the contention will never block the switchover nor
will it try to force itself onto the cluster after another node has
successfully won contention.
Timeouts may need to be adjusted in the future, but the base timeout of
0.4 seconds (and transition delay of 0.5 seconds) seems to work reliably
during preliminary tests.
Use a pair of transitional states, "takeover" and "relinquish", when
transitioning between primary and secondary coordinator states. This
provides a clsuter-wide record that the nodes are still working during
their synchronous transition states, and should allow clients to
determine when the node(s) have fully switched over. Also add an
additional 2 seconds of wait at the end of the transition jobs to ensure
everything has had a chance to start before proceeding.
References #72
Rename "pvcd" to "pvcnoded", and "pvc-api" to "pvcapid" so names for the
daemons are fully consistent. Update the names of the configuration
files as well to match this new formatting.
References #79