Add better locking during node primary transitions #59
As the number of tasks assigned to the primary coordinator grows, better locking is needed to ensure a smooth transition. This qualifies as a bug because of several failure cases.
Failure case
One observed failure case occurs when the Patroni database fails to switch. When this happens, both nodes proceed regardless after a short delay, leaving the database leader in an incorrect and undefined state (not corresponding to the active primary node), which can in turn cause the DNS Aggregator or Provisioner daemons to fail.
Ideal solution
The ideal solution would be to implement some sort of token passing via Zookeeper between the two nodes, letting each one know when the other's switchover step has finished, and thus allowing error handling, recovery, and repetition in the switchover. For the above failure example, the current primary could pass a token saying that it is ready to switch the database primary; the candidate primary waits until this token exists, then passes a token back saying that it is taking control, and the current primary waits until this changes. Then, if the switchover fails, the candidate coordinator can pause, try again, and only pass the token back when successful. This could continue through all the various stages of primary node transition, ensuring that the candidate primary succeeds at its startup task before the current primary continues on. The token in this case could be implemented as read/write locks against a particular Zookeeper key (e.g. /primary_node/token), which are exchanged as the steps progress.
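The handshake described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation that fixed this issue: an in-memory `TokenStore` stands in for the Zookeeper key, and the token values `ready` and `taken`, the functions `relinquish` and `take_over`, and the `switch_database` callable are all hypothetical names chosen for the example.

```python
import threading
import time


class TokenStore:
    """In-memory stand-in (assumption) for one Zookeeper key, e.g. /primary_node/token."""

    def __init__(self):
        self._value = None
        self._cond = threading.Condition()

    def put(self, value):
        # Write the token and wake any node waiting on it.
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def wait_for(self, expected, timeout=5.0):
        # Block until the token equals `expected`; returns False on timeout.
        with self._cond:
            return self._cond.wait_for(lambda: self._value == expected, timeout)


def relinquish(store, log):
    """Current primary: announce readiness, then wait for the candidate to succeed."""
    log.append("current: ready to switch database")
    store.put("ready")
    if not store.wait_for("taken"):
        raise TimeoutError("candidate never reported taking control")
    log.append("current: continuing to next stage")


def take_over(store, switch_database, log, max_attempts=3, pause=0.01):
    """Candidate primary: wait for the token, retry the switch, reply only on success."""
    if not store.wait_for("ready"):
        raise TimeoutError("current primary never signalled readiness")
    for attempt in range(1, max_attempts + 1):
        if switch_database():  # hypothetical callable; True means the switch worked
            log.append(f"candidate: switched on attempt {attempt}")
            store.put("taken")
            return
        time.sleep(pause)  # pause before trying again
    raise RuntimeError("database switchover failed after retries")


def demo():
    store, log = TokenStore(), []
    outcomes = iter([False, True])  # first switch attempt fails, second succeeds
    candidate = threading.Thread(
        target=take_over, args=(store, lambda: next(outcomes), log)
    )
    candidate.start()
    relinquish(store, log)
    candidate.join()
    return log
```

In this sketch the current primary cannot move on to its next stage until the candidate has actually succeeded, which is the property missing in the failure case above. A real version would presumably repeat the same exchange for each stage of the transition and would need to decide what the current primary does if the candidate never replies.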
Fixed by
8c252aeecc
closed