Improve synchronization of primary/secondary coordinator transitions #170
This is a bug that I've encountered exclusively on my testing cluster, but not on any production systems so far.
Sometimes the transition suddenly jumps ahead on one side, skipping critical steps. This leaves one or both sides "stuck" in the "relinquish"/"takeover" states with the floating IPs bound on both sides at once, which in turn causes general problems.
I have not found an obvious cause, though backgrounding service stop events in c76a5afd04 seemed to help somewhat. Ultimately I think relying on sleep events and the read/write locks is just not sufficient.
Instead, I think it's worthwhile to redesign the node transition lockstep process to reduce the number of steps and ensure more synchronous locking between steps.
This will have to be considered carefully, as any change will break cluster functionality for at least one transition during an upgrade. The new solution must also be very robust, to avoid needing to rebuild it again in the future.
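To illustrate what "more synchronous locking between steps" could look like, here is a minimal single-process sketch, not the actual cluster code. It assumes two threads standing in for the two coordinators and uses Python's `threading.Barrier` so that neither side can start a step until both have finished the previous one. The phase names, `run_side`, and `transition` are all hypothetical.

```python
import threading

# Hypothetical phase names; the real transition steps are not listed in this issue.
PHASES = ["stop_services", "move_floating_ips", "start_services"]


def run_side(name, barrier, log, fail_at=None):
    """Run every phase for one side, rendezvousing with the peer after each."""
    try:
        for phase in PHASES:
            if phase == fail_at:
                # Simulated failure: break the barrier so the peer is released too.
                barrier.abort()
                raise threading.BrokenBarrierError
            log.append((name, phase))
            # Neither side proceeds until both reach this point; a timeout
            # breaks the barrier instead of leaving one side stuck.
            barrier.wait(timeout=1.0)
        log.append((name, "done"))
    except threading.BrokenBarrierError:
        log.append((name, "aborted"))


def transition(fail_at=None):
    """Drive both sides through the phases; only the secondary may fail."""
    barrier = threading.Barrier(2)
    log = []
    threads = [
        threading.Thread(
            target=run_side,
            args=(name, barrier, log, fail_at if name == "secondary" else None),
        )
        for name in ("primary", "secondary")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Calling `transition()` runs both sides through every phase in lockstep, while `transition(fail_at="move_floating_ips")` ends with both sides recording "aborted" rather than one side waiting forever. A real implementation would need an equivalent primitive that works across nodes.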
This was mostly solved in 4c0d90b517 by implementing read-side timeouts.
While more improvements to the state transition code could definitely happen in the future, this at least takes care of the pressing issue.
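For anyone unfamiliar with the idea, a read-side timeout can be sketched roughly as below. This is a generic in-process illustration using `threading.Condition`, not the code from the commit above; the `TransitionState` class and its method names are hypothetical. The point is that the waiting side gets `False` back after a bounded time and can recover, instead of blocking indefinitely on a state the peer never publishes.

```python
import threading


class TransitionState:
    """Hypothetical shared state that one side writes and the other reads."""

    def __init__(self):
        self._cond = threading.Condition()
        self._phase = None

    def publish(self, phase):
        # Writer side: record the new phase and wake any waiting readers.
        with self._cond:
            self._phase = phase
            self._cond.notify_all()

    def wait_for_phase(self, phase, timeout):
        # Reader side: block until the phase is seen or the timeout expires.
        # Returns True if the phase was observed, False on timeout.
        with self._cond:
            return self._cond.wait_for(lambda: self._phase == phase, timeout=timeout)


state = TransitionState()
# Simulate the peer publishing its state shortly after we start waiting.
threading.Timer(0.05, state.publish, args=("relinquish",)).start()
saw_relinquish = state.wait_for_phase("relinquish", timeout=1.0)
# A phase that is never published times out instead of hanging.
saw_takeover = state.wait_for_phase("takeover", timeout=0.1)
```

Here `saw_relinquish` ends up true and `saw_takeover` false, so the caller can decide to roll back or retry when the peer never reaches the expected state.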