[BUG] Handle the whole Zookeeper cluster going down gracefully #4

Closed
opened 2018-06-21 22:38:37 -04:00 by JoshuaBoniface · 17 comments
JoshuaBoniface commented 2018-06-21 22:38:37 -04:00 (Migrated from git.bonifacelabs.ca)

As-is, if the whole ZK cluster goes down, each node will go down as well and possibly fence one another. Handle this gracefully (perhaps via an all-VMs-stop?). Needs research: how can a node know the state of the ZK cluster? Kazoo doesn't seem to expose this.
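
For what it's worth, Kazoo does report the state of its *own* session via a connection listener, though not the health of the whole ensemble. A minimal sketch of that hook:

```python
from kazoo.client import KazooClient, KazooState

zk = KazooClient(hosts='127.0.0.1:2181')

def zk_listener(state):
    # Called by Kazoo from its own thread on session state changes.
    if state == KazooState.LOST:
        # Session expired: any ephemeral nodes (e.g. keepalives) are gone.
        print('ZK session lost')
    elif state == KazooState.SUSPENDED:
        # Connection dropped; Kazoo keeps retrying in the background.
        print('ZK connection suspended')
    else:  # KazooState.CONNECTED
        print('ZK (re)connected')

zk.add_listener(zk_listener)
zk.start()
```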

JoshuaBoniface commented 2018-06-22 12:32:09 -04:00 (Migrated from git.bonifacelabs.ca)

As far as I can tell there isn't an easy way to do this. However, this also ties into another potential feature - attempting to connect to other nodes' Zookeeper instances to reconnect. I'm very torn on this - I do like the tight coupling of PVC to its own node's Zookeeper instance, especially given that switching instances would mean re-handling the ZK connections.

I suppose a valid solution is to simply attempt a connection to the other nodes' instances and, if all are unsuccessful, assume the cluster is dead and disable fencing; otherwise just hold on until the local instance starts again (without actually switching connections).
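
A rough sketch of that probe, assuming the other nodes' ZK addresses are known to the daemon (the hostnames below are placeholders):

```python
from kazoo.client import KazooClient
from kazoo.handlers.threading import KazooTimeoutError

# Placeholder addresses for the other nodes' local ZK instances.
OTHER_INSTANCES = ['node2:2181', 'node3:2181']

def any_peer_zk_alive(timeout=5):
    """Return True if at least one other node's ZK instance accepts a session."""
    for host in OTHER_INSTANCES:
        probe = KazooClient(hosts=host)
        try:
            probe.start(timeout=timeout)
            probe.stop()
            probe.close()
            return True
        except KazooTimeoutError:
            continue
    # No instance reachable: treat the whole cluster as dead and disable fencing.
    return False
```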

JoshuaBoniface commented 2018-06-22 12:52:23 -04:00 (Migrated from git.bonifacelabs.ca)

Actually, I'm realizing that this is already "solved" in the architecture.

If the entire ZK cluster goes down, ALL nodes will stop their keepalive threads. They will therefore never try to do fencing (since the `update_zookeeper` function will no longer be called).

This still leaves the failure scenario where the ZK cluster is dead and all nodes are in a disconnected state; however, I'm not sure that taking any action at all at that point is desirable. PVC is basically always acting on the assumption that the ZK instance will come back, and it will wait indefinitely for that to happen. However, at what threshold is it worthwhile to consider the whole cluster dead and take the very drastic action of shutting everything down? I'm not sure that will ever make sense. Comments @michal?
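
A minimal sketch of that behaviour (`update_zookeeper` is the real function name from the daemon; the listener and loop below are illustrative):

```python
import time
from kazoo.client import KazooState

zk_connected = False

def zk_listener(state):
    """Registered via zk.add_listener(); tracks whether the local session is usable."""
    global zk_connected
    zk_connected = (state == KazooState.CONNECTED)

def keepalive_loop(update_zookeeper, interval=5):
    """Run the periodic keepalive, skipping cluster updates while ZK is unreachable."""
    while True:
        if zk_connected:
            # Only touch the cluster (and thus ever evaluate fencing of peers)
            # while our own ZK session is alive.
            update_zookeeper()
        # If the whole ZK cluster is down, every node sits in this idle branch,
        # so no node can initiate a fence against another.
        time.sleep(interval)
```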

JoshuaBoniface commented 2018-06-26 12:11:41 -04:00 (Migrated from git.bonifacelabs.ca)

Won't fix due to logic above.

JoshuaBoniface commented 2018-06-26 12:11:42 -04:00 (Migrated from git.bonifacelabs.ca)

closed

MichalKozanecki commented 2018-06-26 21:35:38 -04:00 (Migrated from git.bonifacelabs.ca)

Don't assume; allow it to be configurable and set a sane default.

> However, at what threshold is it worthwhile to consider the whole cluster dead and take the very drastic action of shutting everything down?

I would say this as the default is fine (if the state is unknown, stay alive), but the admin should be able to configure it based on their use case. For example, in some situations the admin might prefer to "commit suicide" if the state is unknown for longer than 60 seconds - unknown states might be untenable for certain classes of users.
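
A sketch of what that knob could look like (the option and key names here are hypothetical, not existing PVC config keys):

```python
# Hypothetical daemon config knobs:
#   zk_unknown_action:  'stay-alive' (default) or 'suicide'
#   zk_unknown_timeout: seconds of unknown state before acting
def action_on_unknown_state(config, unknown_seconds):
    """Return what the daemon should do after the ZK state has been unknown for a while."""
    if (config.get('zk_unknown_action', 'stay-alive') == 'suicide'
            and unknown_seconds > config.get('zk_unknown_timeout', 60)):
        return 'suicide'      # opt-in: forcibly stop and wait for recovery
    return 'stay-alive'       # sane default: keep waiting for ZK to return
```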

MichalKozanecki commented 2018-06-26 21:35:38 -04:00 (Migrated from git.bonifacelabs.ca)

reopened

JoshuaBoniface commented 2018-06-26 21:38:10 -04:00 (Migrated from git.bonifacelabs.ca)

Right, but there is also zero way to determine *cluster* health from within Kazoo (because even if every node is unreachable, that means nothing), so the point is moot - I can't actually implement this in any reasonably clean way. Fencing is 100% a remote-host operation, not a local one.

MichalKozanecki commented 2018-06-26 21:52:18 -04:00 (Migrated from git.bonifacelabs.ca)

I think you misunderstood. If the cluster is unreachable, it is by definition unhealthy - perhaps not in reality, but from that node's point of view.

This isn't so much about fencing decisions (though I suppose you could view it that way) as about what the daemon should do if it can't determine the current state. An admin should be able to control, via a config option, what they want to do:

  • do nothing
  • kill itself (after 5 failed 10-second retries, a simple timeout, or whatever)

The sane default, I agree, is `do nothing`, but in certain cluster designs, where the storage network and the control/system networks are fully gapped, you can run into situations where the preferred response to an isolation event is to commit suicide.

Situation 1: A node gets isolated on its control/system network, but the rest of the cluster is healthy and the storage network is healthy. Since the cluster is healthy, it tries to fence the node, but due to the system network being down, the fence fails. The node stays up, the cluster restarts its VMs, and now you have a split brain at the storage level.

Situation 2: A node gets isolated on its control/system/storage networks, but the cluster overall is healthy. Since the cluster is healthy, it tries to fence the node, but fails. The node stays up, and the cluster restarts its VMs. The network isolation gets repaired (either by manual intervention or automatically, e.g. a transient event) and now the VMs are alive in two locations - split brain.

JoshuaBoniface commented 2018-06-26 21:58:23 -04:00 (Migrated from git.bonifacelabs.ca)

OK, I think the real solution is then to make sure a node is actually fenced before trying to recover the VMs - which I planned to implement eventually once I could test fencing on physical hardware 😆

In that case, neither of those can happen without manual intervention, because fencing won't try to migrate VMs away if the fence fails. Only when a fence succeeds (as defined by the ipmitool command) will it say "OK, I killed the node, let's start the VMs elsewhere".

I have to look at it as a fencing problem, because if ZK is unreachable then it's impossible for a node to self-fence and migrate anyway. A node has to be dead from the perspective of another alive node.
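
A hedged sketch of that ordering, assuming IPMI-based fencing via the standard `ipmitool chassis power off` command (the credentials and the `migrate_vms_from` callback are placeholders):

```python
import subprocess

def fence_node(ipmi_host, ipmi_user, ipmi_password):
    """Power off the target node via IPMI; success is defined by ipmitool's exit code."""
    result = subprocess.run(
        ['ipmitool', '-I', 'lanplus',
         '-H', ipmi_host, '-U', ipmi_user, '-P', ipmi_password,
         'chassis', 'power', 'off'],
        capture_output=True,
    )
    return result.returncode == 0

def handle_dead_node(node, ipmi_host, ipmi_user, ipmi_password, migrate_vms_from):
    """Recover a dead node's VMs only after a confirmed fence."""
    if fence_node(ipmi_host, ipmi_user, ipmi_password):
        # "OK, I killed the node, let's start the VMs elsewhere."
        migrate_vms_from(node)
    # Otherwise: leave the VMs alone and require manual intervention, to avoid
    # the split-brain scenarios described above.
```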

JoshuaBoniface commented 2018-06-26 22:01:23 -04:00 (Migrated from git.bonifacelabs.ca)

What I CAN do is implement a "kill my VMs forcibly" timeout - that stays node-local, but without an external fence event there will be no migration; the VMs would restart on the same node once it becomes healthy again.
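
A rough sketch of that node-local kill switch, assuming libvirt-managed VMs (the timeout value and helper names are illustrative):

```python
import time
import libvirt

SELF_KILL_TIMEOUT = 60   # illustrative: seconds of ZK unreachability before acting
_last_connected = time.time()

def force_kill_local_vms():
    """Forcibly stop every running VM on this node; no migration is attempted."""
    conn = libvirt.open('qemu:///system')
    try:
        for dom in conn.listAllDomains():
            if dom.isActive():
                # Hard power-off; the VM can be started again on this same
                # node once ZK connectivity returns and the node is healthy.
                dom.destroy()
    finally:
        conn.close()

def self_kill_watchdog(zk_is_connected):
    """Call once per keepalive tick; fires the kill switch after the timeout."""
    global _last_connected
    if zk_is_connected:
        _last_connected = time.time()
    elif time.time() - _last_connected > SELF_KILL_TIMEOUT:
        force_kill_local_vms()
```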

JoshuaBoniface commented 2018-06-26 22:12:06 -04:00 (Migrated from git.bonifacelabs.ca)

From discussion - add config options for:

  1. Self-kill - implement self-VM-kill at some multiple of the keepalive interval, but before fencing, to allow a node to free up its resources
  2. Fence-fail-override - allow the post-fence migration to continue even if a fence operation fails; would require option 1 to be turned on to be safe (see the sketch below)
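
A small sketch of how those two options might be validated against each other (the key names are hypothetical, not necessarily the ones used in the commits below):

```python
def validate_fencing_config(config):
    """Sanity-check the interaction between the two options."""
    self_kill_enabled = config.get('suicide_intervals', 0) > 0       # option 1
    fence_fail_override = config.get('fence_fail_override', False)   # option 2
    if fence_fail_override and not self_kill_enabled:
        # Migrating VMs after a *failed* fence is only safe if the unreachable
        # node is guaranteed to have killed its own VMs first.
        raise ValueError('fence_fail_override requires suicide_intervals > 0')
    return config
```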
JoshuaBoniface commented 2018-06-28 00:11:50 -04:00 (Migrated from git.bonifacelabs.ca)

mentioned in commit ad4d3d794b0c6d232625707453978a312b65109d
JoshuaBoniface commented 2018-06-28 21:30:17 -04:00 (Migrated from git.bonifacelabs.ca)

mentioned in commit 9ef5fcb836c7827b5b19480bc958c2e80c9b01d9
JoshuaBoniface commented 2018-06-28 21:30:17 -04:00 (Migrated from git.bonifacelabs.ca)

mentioned in commit f5054b1bc774565443c94f1537b9604daac0fa0a
JoshuaBoniface commented 2018-06-28 21:30:17 -04:00 (Migrated from git.bonifacelabs.ca)

mentioned in commit 8052dce50d335e859ec02659e6b03f055ccde179
JoshuaBoniface commented 2018-07-17 14:50:08 -04:00 (Migrated from git.bonifacelabs.ca)

This configurability was added in the above commits but will need testing. Closing!

JoshuaBoniface commented 2018-07-17 14:50:08 -04:00 (Migrated from git.bonifacelabs.ca)

closed

Reference: parallelvirtualcluster/pvc#4