Improve synchronization of primary/secondary coordinator transitions #170

joshuaboniface · 2023-11-26T03:21:06-05:00

joshuaboniface commented 2023-11-26 03:21:06 -05:00

This is a bug that I've encountered exclusively on my testing cluster, but not on any production systems so far.

Sometimes the transition jumps ahead suddenly on one side, skipping critical steps. This leaves one or both sides "stuck" in the "relinquish"/"takeover" states with the floating IPs bound on both sides, and general problems follow.

I have not found an obvious cause, though backgrounding service stop events in c76a5afd04 seemed to help somewhat. Ultimately, I think relying on `sleep` events and the read/write locks is just not sufficient.

Instead I think it's worthwhile to redesign the node transition lockstep process to reduce the number of steps and ensure more synchronous locking between steps.

This will have to be planned carefully, as any change will break cluster functionality for at least one transition during an upgrade. The new solution must also be very robust, to avoid needing to rebuild it again in the future.
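
As a rough illustration of what "synchronous locking between steps" could look like, here is a minimal in-process sketch using Python's standard `threading.Barrier`. It is not the PVC implementation: the phase names, `run_transition`, and `simulate` are all hypothetical, and a real cluster would need a distributed equivalent of the barrier rather than threads in one process. The point is only that neither side starts a phase until both have finished the previous one, and that a side which waits too long aborts instead of hanging.

```python
import threading

# Hypothetical phase names; the real transition steps in PVC may differ.
PHASES = ["stop_services", "move_floating_ips", "start_services"]


def run_transition(role, barrier, log, timeout=5.0):
    """Run each phase, then wait for the peer before starting the next one."""
    for phase in PHASES:
        log.append((role, phase))  # stand-in for the real work of this phase
        try:
            # Both sides must arrive here before either may continue.
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            # The peer never arrived (or aborted): stop rather than run ahead.
            log.append((role, "aborted"))
            return False
    return True


def simulate():
    """Run a relinquishing and a taking-over side as two local threads."""
    barrier = threading.Barrier(2)
    log = []
    results = {}

    def worker(role):
        results[role] = run_transition(role, barrier, log)

    threads = [
        threading.Thread(target=worker, args=(role,))
        for role in ("relinquish", "takeover")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, log
```

With both sides healthy, `simulate()` reports success for both roles, and the log never shows one side in a later phase while the other is still in an earlier one.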

joshuaboniface added this to the 1.0 milestone 2023-11-26 03:21:06 -05:00

joshuaboniface added the bug label 2023-11-26 03:21:06 -05:00

joshuaboniface added the Daemon, debt labels 2023-11-28 16:47:52 -05:00

joshuaboniface commented 2024-10-14 13:17:22 -04:00

This was mostly solved in 4c0d90b517 by implementing read-side timeouts.

While more improvements to the state transition code could definitely happen in the future, this at least takes care of the pressing issue.
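
The commit itself is not reproduced here. As a general illustration of a read-side timeout, the sketch below uses a standard-library `threading.Lock` as a stand-in for the cluster's read lock; `wait_for_peer` and its return values are hypothetical names chosen for this example. The idea is that the waiting side gives up after a bounded time instead of blocking forever when the other side never releases.

```python
import threading


def wait_for_peer(lock, timeout):
    """Wait for the writer to release `lock`, but only up to `timeout` seconds.

    Returns "proceed" if the lock became available in time, or "timed_out"
    if the wait expired. What the caller does on timeout is its own policy.
    """
    acquired = lock.acquire(timeout=timeout)
    if acquired:
        lock.release()  # we only needed to observe that the writer is done
        return "proceed"
    return "timed_out"
```

A caller that gets `"timed_out"` can then recover explicitly, for example by retrying or reverting, rather than remaining stuck in an intermediate state.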

joshuaboniface closed this issue 2024-10-14 13:17:22 -04:00

Reference: parallelvirtualcluster/pvc#170