Improve synchronization of primary/secondary coordinator transitions #170

Closed
opened 2023-11-26 03:21:06 -05:00 by joshuaboniface · 1 comment

This is a bug that I've encountered only on my testing cluster, not on any production systems so far.

Sometimes the transition jumps ahead suddenly on one side, skipping critical steps. This leaves one or both sides "stuck" in the "relinquish"/"takeover" states with the floating IPs bound on both sides, which in turn causes general issues.

I have not found an obvious cause, though backgrounding service stop events in c76a5afd04 seemed to help somewhat. Ultimately I think relying on `sleep` events and the read/write locks is just not sufficient.

Instead I think it's worthwhile to redesign the node transition lockstep process to reduce the number of steps and ensure more synchronous locking between steps.

This will have to be designed carefully, as any change will break cluster functionality for at least one transition during an upgrade. The new solution must also be very robust, to avoid needing to rebuild it again in the future.
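As a rough illustration of what "more synchronous locking between steps" could look like, here is a minimal sketch using two local threads and a `threading.Barrier` in place of the real cluster-wide locks. It is not the PVC implementation: the step list, the `run_side` function, and the timeout values are all hypothetical, chosen only to show both sides waiting for each other after every step so that neither can jump ahead.

```python
import threading

# Hypothetical steps: (side that performs the step, action). Not PVC's real step list.
STEPS = [
    ("relinquish", "stop primary services"),
    ("relinquish", "release floating IPs"),
    ("takeover", "bind floating IPs"),
    ("takeover", "start primary services"),
]


def run_side(side, barrier, log, timeout):
    """Walk the steps in lockstep; return False if the peer never arrives."""
    for owner, action in STEPS:
        if owner == side:
            log.append(action)  # stand-in for doing the real work
        try:
            # Both sides must reach this point before either moves on.
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            return False  # peer missing or too slow: abort, do not skip ahead
    return True


if __name__ == "__main__":
    barrier = threading.Barrier(2)
    log, results = [], {}

    def worker(side):
        results[side] = run_side(side, barrier, log, timeout=2.0)

    threads = [threading.Thread(target=worker, args=(s,)) for s in ("relinquish", "takeover")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(log, results)
```

Because each action is only logged after the previous barrier has been passed by both sides, the floating IPs in this sketch cannot be bound on the new primary before the old primary has released them. If one side never shows up, the barrier wait times out and the other side aborts.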

joshuaboniface added this to the 1.0 milestone 2023-11-26 03:21:06 -05:00
joshuaboniface added the bug label 2023-11-26 03:21:06 -05:00
joshuaboniface added the Daemon, debt labels 2023-11-28 16:47:52 -05:00
Author
Owner

This was mostly solved in 4c0d90b517 by implementing read-side timeouts.

While more improvements to the state transition code could definitely happen in the future, this at least takes care of the pressing issue.
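For readers unfamiliar with the idea, a read-side timeout means the side waiting on a lock gives up after a bounded time instead of blocking forever when its peer stalls. The following is a minimal sketch of that idea using a plain `threading.Lock`. It is not the code from 4c0d90b517: the `wait_for_peer` and `run_step` names and the timeout values are assumptions made for illustration.

```python
import threading


def wait_for_peer(lock, timeout):
    """Wait for the peer to release `lock`, but for at most `timeout` seconds."""
    acquired = lock.acquire(timeout=timeout)
    if acquired:
        lock.release()  # we only needed to observe that the peer let go
    return acquired


def run_step(name, lock, timeout, log):
    """Continue the (hypothetical) transition step only if the peer finished in time."""
    if wait_for_peer(lock, timeout):
        log.append(f"{name}: peer finished, continuing")
        return True
    log.append(f"{name}: timed out waiting for peer, aborting")
    return False


if __name__ == "__main__":
    peer_lock = threading.Lock()
    log = []

    peer_lock.acquire()                        # simulate a peer stuck holding the lock
    run_step("takeover", peer_lock, 0.1, log)  # gives up after 0.1 s instead of hanging

    peer_lock.release()                        # peer finally lets go
    run_step("takeover", peer_lock, 0.1, log)  # now proceeds immediately

    print("\n".join(log))
```

The point of the design is that a stalled peer turns into a detectable, recoverable failure on the reading side rather than a node left "stuck" mid-transition.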

Reference: parallelvirtualcluster/pvc#170