Add cluster peering for DR/replication #169

Closed
opened 2023-11-26 03:11:17 -05:00 by joshuaboniface · 6 comments

Add the ability to do cross-cluster asynchronous replication/DR.

The original plans in #122 were for synchronous replication. The problem there is sheer complexity, requiring numerous daemons to be running on both sides.

However, time-based asynchronous replication is much more feasible: we already have most of the scaffolding in place with VM backups in recent versions, so most of the remaining work would be in handling the "VM" part of it in a more consistent way.

I'm envisioning a new "replica" state for VMs, which allows them to be tracked on the remote side. For the storage, there can be backup-like full or incremental sends on regular schedules or on manual intervention, which can then be used to bring VMs online on the remote side after they are stopped on the source side (whether deliberately or by a disaster).

Part of implementing this would be to move the existing backup/autobackup/restore functionality out of the CLI client and into Celery workers, which also eliminates the need for the triggering commands to run on the primary coordinator. This will also allow backups to be tracked a bit more nicely (likely in Postgres rather than Zookeeper), which will improve the backup system as well. The choice could then be to write "backups" out to either disk images (as they are now) or to remote clusters.
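As a rough illustration of that move, here is a minimal Celery task sketch; the app name, broker URL, task signature, and helper functions are all assumptions for illustration, not PVC's actual worker code:

```python
# Hypothetical sketch: triggering a backup from a pvcworkerd-style Celery worker,
# so the CLI client no longer has to run the backup itself on the primary coordinator.
# Broker URL, task name, and helpers are placeholders, not PVC's real API.
from celery import Celery

app = Celery("pvcworkerd", broker="redis://localhost:6379/0")  # placeholder broker


def run_vm_backup(vm_name, backup_path, incremental):
    # Stand-in for the existing backup logic currently living in the CLI client.
    ...


def record_backup_result(vm_name, result):
    # Stand-in for persisting backup state, e.g. to Postgres instead of Zookeeper.
    ...


@app.task(bind=True)
def vm_backup(self, vm_name, backup_path, incremental=False):
    """Run a full or incremental backup of a VM on whichever worker picks it up."""
    self.update_state(state="RUNNING", meta={"vm": vm_name})
    result = run_vm_backup(vm_name, backup_path, incremental)
    record_backup_result(vm_name, result)
    return result
```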

The main complexity would be the ability of the second cluster to receive the inputs: this is going to bring in some API-to-API communication that does not currently exist, but shouldn't be too hard.
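To make the "receive" side a bit more concrete, here is a sketch of what the sending cluster's half of that API-to-API exchange could look like using `requests`; the endpoint paths, auth header, and payload shapes are purely hypothetical:

```python
# Hypothetical sketch of the sending cluster pushing a "backup" to a peer cluster's API.
# Endpoint paths, the auth header, and the payload layout are assumptions for illustration.
import requests


def send_backup_to_peer(peer_api_url, api_token, vm_name, vm_config, image_path):
    headers = {"X-Api-Key": api_token}  # assumed auth scheme

    # Push the VM configuration first so the peer can define the VM in its "replica" state.
    resp = requests.post(
        f"{peer_api_url}/api/v1/receive/vm/{vm_name}",  # hypothetical endpoint
        headers=headers,
        json=vm_config,
    )
    resp.raise_for_status()

    # Stream the exported disk image; passing a file object lets requests stream it
    # instead of loading the whole image into memory.
    with open(image_path, "rb") as img:
        resp = requests.post(
            f"{peer_api_url}/api/v1/receive/volume/{vm_name}",  # hypothetical endpoint
            headers=headers,
            data=img,
        )
    resp.raise_for_status()
```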

Adding to the 1.0 milestone as this would be a very nice feature to have for that release.

joshuaboniface added this to the 1.0 milestone 2023-11-26 03:11:17 -05:00
joshuaboniface added the Client, Daemon, feature labels 2023-11-28 16:48:07 -05:00
Author
Owner

Thoughts - the core scaffolding shouldn't be too difficult here.

  1. We need to introduce the notion of peer clusters. A cluster can have exactly one peer, bidirectional. Peers exchange information about their capabilities and resources (VMs, networks, etc.) and establish a VPN connection on port 7373. Perhaps bi-directional Wireguard instances with one-way routes? Will need some investigation there as I don't really want any one side to be a "leader" per se.

  2. We must move backups/autobackups into the `pvcworkerd` workers as planned.

  3. We add the ability to export a backup to the peer node. The system would handle the logic of creating receiving disks and performing the RBD export/import over the VPN connection (see the sketch after this list), defining the VM in a custom state (`replica`?) on the peer, and handling snapshot logic.

  4. At this point we have a POC of this issue, but we could then look into extending this to synchronous replication as well.
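As a rough sketch of the RBD transfer mentioned in step 3, a full image can be streamed by piping `rbd export` on the source into `rbd import` on the peer; SSH is used here only as a stand-in for whatever tunnel is ultimately chosen, and the pool, volume, and host names are placeholders:

```python
# Hypothetical sketch: stream a full RBD snapshot from the source cluster to the peer.
# SSH stands in for the eventual tunnel; pool, volume, and host names are placeholders.
import subprocess


def send_rbd_snapshot_full(pool, volume, snapshot, peer_host, peer_pool):
    spec = f"{pool}/{volume}@{snapshot}"
    exporter = subprocess.Popen(
        ["rbd", "export", spec, "-"],  # "-" writes the image data to stdout
        stdout=subprocess.PIPE,
    )
    importer = subprocess.Popen(
        ["ssh", peer_host, "rbd", "import", "-", f"{peer_pool}/{volume}"],
        stdin=exporter.stdout,         # consume the export stream on the peer
    )
    exporter.stdout.close()            # let the exporter see SIGPIPE if the importer dies
    importer.communicate()
    if exporter.wait() != 0 or importer.returncode != 0:
        raise RuntimeError(f"RBD send of {spec} to {peer_host} failed")
```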

The main problem is testing and validating assumptions as I need two clusters to work with. Will have to investigate the feasibility of this.

joshuaboniface changed title from Add asynchronous cross-cluster replication/DR to Add cluster peering for DR/replication 2024-04-26 00:19:27 -04:00
Author
Owner

More digging has revealed that this current plan is not possible using Wireguard as long as we support Debian 10, since Wireguard packages are not available for Debian 10. This would leave only OpenVPN, whose performance characteristics are suboptimal, and I would not like to go down that route just for legacy support.

The alternative is to drop Debian 10, which would be prudent, but would require practical work, as numerous deployed clusters still run Debian 10. That would have to be done first, and Debian 10 support officially dropped, before this plan could proceed.

Author
Owner

With the implementation of snapshot import/export functionality in #185, much of the scaffolding for this is now in place.

The current plan is to implement a `pvc vm snapshot send` command, which will take a peer cluster as an argument. The command will handle all aspects of sending the VM to the peer, exporting the RBD images and VM configuration via the API to the peer cluster. No work will be needed on the destination. These sends can be, like `pvc vm snapshot export` commands, either full or incremental. On the peer, the VM would be in the special `replica` state, to denote that it is a snapshot replica.
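For the incremental case, the natural Ceph-level mechanism would be `rbd export-diff`/`rbd import-diff` between two snapshots, which transfers only the blocks changed since the previous send; a minimal sketch, again with SSH and all names standing in for the real transport and objects:

```python
# Hypothetical sketch: send only the delta between two RBD snapshots to the peer.
# SSH stands in for the eventual transport; pool, volume, and host names are placeholders.
import subprocess


def send_rbd_snapshot_incremental(pool, volume, from_snap, to_snap, peer_host, peer_pool):
    spec = f"{pool}/{volume}@{to_snap}"
    exporter = subprocess.Popen(
        ["rbd", "export-diff", "--from-snap", from_snap, spec, "-"],  # delta to stdout
        stdout=subprocess.PIPE,
    )
    importer = subprocess.Popen(
        ["ssh", peer_host, "rbd", "import-diff", "-", f"{peer_pool}/{volume}"],
        stdin=exporter.stdout,  # apply the delta on top of the peer's copy of from_snap
    )
    exporter.stdout.close()
    importer.communicate()
    if exporter.wait() != 0 or importer.returncode != 0:
        raise RuntimeError(f"Incremental RBD send of {spec} to {peer_host} failed")
```

Note that `import-diff` requires the peer's copy of the image to already contain `from_snap`, which is exactly what a prior full send provides.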

This completely removes the need for any sort of persistent connectivity, or any long-running process on either side. I do believe that some sort of tunnel between the two clusters is ideal for the RBD send, but I will try a few different implementations to see how well they work.

Since snapshots are already crash-consistent, as long as the snapshot send is recent enough, failover would be as simple as bringing up the VM on the destination.

For DR purposes, it would be important to ensure the VM won't start back up on the source cluster, so that would have to be carefully controlled; but I believe such an event would be rare enough that requiring manual intervention is acceptable.

For migration purposes, this works well to keep downtime minimal: a full snapshot can be sent, the VM stopped on the source, a final incremental sent, and then the VM started on the peer. This could also be automatically integrated with a command like `pvc vm cluster-move`.
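To make the ordering explicit, here is a sketch of how such a `cluster-move`-style orchestration might sequence those steps; every helper here is a hypothetical placeholder rather than an existing PVC function:

```python
# Hypothetical orchestration of a minimal-downtime cluster move.
# All helpers (snapshot_vm, send_full, send_incremental, stop_vm, start_vm_on_peer)
# are placeholders for illustration, not existing PVC functions.
def cluster_move(vm, peer):
    snap1 = snapshot_vm(vm)                  # snapshot and send in full while the VM keeps running
    send_full(vm, snap1, peer)
    stop_vm(vm)                              # downtime begins
    snap2 = snapshot_vm(vm)                  # capture the small delta accumulated since snap1
    send_incremental(vm, snap1, snap2, peer)
    start_vm_on_peer(vm, peer)               # downtime ends; the VM now runs on the peer
```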

Currently all this is planned for version 0.9.100.

Author
Owner

Initial implementation of "vm snapshot send" has been completed and will be present in 0.9.101. While there are very few guardrails on the receiving side, the implementation is working as expected. The target cluster uses the state `mirror` instead of `replica` as I felt this more accurately captured the expected intention.

Continuing to debate whether fully implementing a "cluster move" of a VM is worthwhile, or whether the steps should be left to the administrator to control. From my experience, Nutanix does not automate this, so the current implementation is at least at feature parity there.

Author
Owner

Cluster move has been implemented via two new commands, `vm mirror create` and `vm mirror promote`, which automate the steps around `vm snapshot send`. To be released in 0.9.101.

Author
Owner

Considering this finished barring any major bugs.

Reference: parallelvirtualcluster/pvc#169