Add better locking during node primary transitions #59

Closed
opened 2019-12-13 00:31:49 -05:00 by JoshuaBoniface · 3 comments
JoshuaBoniface commented 2019-12-13 00:31:49 -05:00 (Migrated from git.bonifacelabs.ca)

With the increasing number of tasks assigned to the primary coordinator, better locking should be implemented to ensure a smooth transition between primaries. This qualifies as a bug due to several observed failure cases.

## Failure case

One observed failure case is the Patroni database failing to switch over. If this happens, both nodes plow ahead after a trivial delay, leaving the database leader in an incorrect and undefined state (not corresponding to the active primary node), which can in turn cause the DNS Aggregator or Provisioner daemons to fail.

## Ideal solution

The ideal solution would be to implement some form of token passing via Zookeeper between the two nodes, letting each node know when the other's switchover step has finished, and thus allowing error handling, recovery, and retries during the switchover. For the failure example above, the current primary could pass a token saying that it is ready to switch the database primary; the candidate primary waits until this token exists, then passes a token back saying that it is taking control, and the current primary waits until this changes. If the switchover fails, the candidate coordinator can pause, try again, and only pass the token back once it succeeds. This exchange could continue through all the stages of the primary node transition, ensuring that the candidate primary succeeds at each startup task before the current primary continues on. The token in this case can be implemented as read/write locks against a particular Zookeeper key (e.g. `/primary_node/token`) which are exchanged as the steps progress.
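As a rough sketch of how this handshake could look, assuming a kazoo-based Zookeeper client; the token values, retry intervals, and the `switch_database_primary()` callable are illustrative placeholders, not the actual PVC implementation:

```python
import time

from kazoo.client import KazooClient

TOKEN_KEY = '/primary_node/token'


def relinquish_primary(zk: KazooClient):
    # Current primary: announce that it is ready for the database switchover,
    # then block until the candidate confirms it has taken control.
    zk.ensure_path(TOKEN_KEY)
    zk.set(TOKEN_KEY, b'ready-to-switch')
    while zk.get(TOKEN_KEY)[0] != b'takeover-complete':
        time.sleep(0.5)
    # Only now is it safe to continue tearing down primary-only services.


def assume_primary(zk: KazooClient, switch_database_primary):
    # Candidate primary: wait for the 'ready' token, retry the (hypothetical)
    # Patroni switchover helper until it succeeds, then pass the token back.
    while zk.exists(TOKEN_KEY) is None or zk.get(TOKEN_KEY)[0] != b'ready-to-switch':
        time.sleep(0.5)
    while not switch_database_primary():
        time.sleep(2)
    zk.set(TOKEN_KEY, b'takeover-complete')
```

Because the candidate only flips the token after the database switchover actually reports success, the relinquishing node can no longer plow ahead with the database leader left in an undefined state.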

JoshuaBoniface commented 2019-12-13 00:33:31 -05:00 (Migrated from git.bonifacelabs.ca)

changed the description
JoshuaBoniface commented 2019-12-19 20:07:11 -05:00 (Migrated from git.bonifacelabs.ca)

Fixed by 8c252aeecc2cd325d4ef57f3dbe578c547ad5f5b
JoshuaBoniface commented 2019-12-19 20:07:11 -05:00 (Migrated from git.bonifacelabs.ca)

closed
Reference: parallelvirtualcluster/pvc#59