Support cross-cluster replication of VMs #122

Closed
opened 2021-04-09 11:53:42 -04:00 by JoshuaBoniface · 3 comments
JoshuaBoniface commented 2021-04-09 11:53:42 -04:00 (Migrated from git.bonifacelabs.ca)

Making use of RBD mirroring (https://docs.ceph.com/en/nautilus/rbd/rbd-mirroring/), add support for replicating VMs between PVC clusters on a per-VM (and per-Pool) basis for disaster recovery only (no live migration between clusters).

Currently-envisioned setup:

  1. Activate RBD mirroring on the pool(s) on the primary cluster in cross-cluster mode
  2. Activate RBD mirroring on the secondary cluster
  3. Support a VM metadata flag recording that the VM is replicated and which cluster is its primary; prevent the flag from being changed unless the VM is in the "stop" state.
  4. Make the VM permanently disabled on the non-primary cluster
  5. To fail over, stop the VM and change the "primary cluster" metadata field to the secondary cluster. The VM can then be started on the secondary cluster. (See the sketch after this list for the underlying Ceph commands.)
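For reference, here is a minimal sketch (not actual PVC code) of the Ceph CLI operations that steps 1, 2, and 5 would wrap, assuming the journal-based mirroring available in Nautilus. All pool, image, peer, and cluster names are placeholders, and an rbd-mirror daemon would also need to be running against each cluster:

```python
#!/usr/bin/env python3
# Sketch only: the rbd commands behind steps 1, 2, and 5, run via subprocess.
# All names (pool, image, peer, cluster) are placeholders for illustration.
import subprocess


def run(cmd):
    """Run an rbd command, echoing it and raising on failure."""
    print("+ " + " ".join(cmd))
    subprocess.run(cmd, check=True)


def enable_pool_mirroring(pool, peer, cluster):
    """Steps 1/2: enable per-image mirroring on a pool and register the
    remote cluster as a peer. Run once against each cluster for two-way mode."""
    run(["rbd", "--cluster", cluster, "mirror", "pool", "enable", pool, "image"])
    run(["rbd", "--cluster", cluster, "mirror", "pool", "peer", "add", pool, peer])


def enable_image_mirroring(pool, image, cluster):
    """Per-VM-volume setup: journal-based mirroring requires the
    exclusive-lock and journaling image features before enabling."""
    run(["rbd", "--cluster", cluster, "feature", "enable", f"{pool}/{image}", "exclusive-lock"])
    run(["rbd", "--cluster", cluster, "feature", "enable", f"{pool}/{image}", "journaling"])
    run(["rbd", "--cluster", cluster, "mirror", "image", "enable", f"{pool}/{image}"])


def fail_over(pool, image, old_cluster, new_cluster, forced=False):
    """Step 5: demote the image on the old primary (when it is still
    reachable), then promote it on the secondary; a dead primary would
    require a forced promotion instead."""
    if not forced:
        run(["rbd", "--cluster", old_cluster, "mirror", "image", "demote", f"{pool}/{image}"])
    promote = ["rbd", "--cluster", new_cluster, "mirror", "image", "promote"]
    if forced:
        promote.append("--force")
    promote.append(f"{pool}/{image}")
    run(promote)


if __name__ == "__main__":
    # Hypothetical usage: two clusters named "site-a" and "site-b", with
    # matching /etc/ceph/<cluster>.conf files on the admin host.
    enable_pool_mirroring("vms", "client.rbd-mirror-peer@site-b", "site-a")
    enable_pool_mirroring("vms", "client.rbd-mirror-peer@site-a", "site-b")
    enable_image_mirroring("vms", "vm-disk-0", "site-a")
    fail_over("vms", "vm-disk-0", "site-a", "site-b")
```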

I'm not sure exactly how feasible DR migration is with this, however. We will have to handle split-brain scenarios safely, and ensure that VMs will not accidentally run on the old primary if it comes back. This will require a fair bit of reworking on the VM handler side.
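To make the split-brain concern concrete, the guard in the VM handler might look something like the following; this is purely hypothetical, with all field names invented for illustration:

```python
# Hypothetical guard in the VM start path (field names invented): a mirrored
# VM may only start on the cluster its metadata currently names as primary,
# so a returning old primary cannot accidentally boot a stale copy.
def can_start_vm(vm_metadata: dict, local_cluster: str) -> bool:
    """Return True only if this cluster is allowed to start the VM."""
    if not vm_metadata.get("mirrored", False):
        return True  # unreplicated VMs are unaffected
    # "primary_cluster" is the flag from step 3 above; it may only be
    # rewritten while the VM is in the "stop" state.
    return vm_metadata.get("primary_cluster") == local_cluster
```

A returning old primary would presumably also need its images resynced (`rbd mirror image resync`) to discard the divergent journal before it could act as a secondary again.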

JoshuaBoniface commented 2021-06-17 02:53:53 -04:00 (Migrated from git.bonifacelabs.ca)

After much more careful consideration of the use cases, I don't think this is going to be a very valuable feature.

There is a distinct split between "service availability" and "disaster recovery" that needs to be thoroughly defined. This would not be a "service availability" feature since it would require manual intervention.

There is still a case to be made for this in disaster recovery situations, but considering how intensive the implementation would be, and the fact that there are many alternative disaster recovery options (e.g. system backups), it might not truly be worth the effort. Especially since mirroring would live-replicate all changes during normal operation, backups would still be needed anyway.

That said, after the full implementation of SR-IOV and some refactoring of the classes and other cleanup, this might be worth revisiting.

JoshuaBoniface commented 2021-07-18 23:15:33 -04:00 (Migrated from git.bonifacelabs.ca)

Nope, still definitely not worth the hassle. The implementation would be obscenely complex.

JoshuaBoniface commented 2021-07-18 23:15:34 -04:00 (Migrated from git.bonifacelabs.ca)

closed
Reference: parallelvirtualcluster/pvc#122