Compare commits


77 Commits

Author SHA1 Message Date
Joshua Boniface 9441cb3b2e Bump version to 0.9.103 2024-11-01 17:23:24 -04:00
Joshua Boniface b16542c8fc Fix double-appending domain bug
Since storage_hosts entries are now FQDNs that already include the storage
domain, don't re-append it within vmbuilder.
2024-11-01 17:18:51 -04:00
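A minimal sketch of the post-fix behaviour (illustrative only; the function name and the vmbuilder internals are assumptions):

    def qualify_storage_hosts(storage_hosts, storage_domain):
        # storage_hosts entries now arrive as FQDNs; only a bare hostname
        # still needs the storage domain appended.
        return [
            host if host.endswith(storage_domain) else f"{host}.{storage_domain}"
            for host in storage_hosts
        ]

    # qualify_storage_hosts(["ceph1.pvc.local", "ceph2"], "pvc.local")
    # -> ["ceph1.pvc.local", "ceph2.pvc.local"]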
Joshua Boniface de0c7e37f2 Allow environment setting for Munin 2024-10-30 13:12:08 -04:00
Joshua Boniface ae26a071c7 Fix bugs with Munin plugin 2024-10-30 12:53:29 -04:00
Joshua Boniface 49a34acd14 Fix README images 2024-10-25 23:51:08 -04:00
Joshua Boniface 82365ea539 Update README badge order 2024-10-25 23:47:33 -04:00
Joshua Boniface 86f0c5c3ae Update README 2024-10-25 23:43:57 -04:00
Joshua Boniface 83294298e1 Update README to match GitHub 2024-10-25 23:37:32 -04:00
Joshua Boniface 4187aacc5b Correct formatting of OpenAPI Swagger specs 2024-10-19 02:23:46 -04:00
Joshua Boniface 35c82b5249 Bump version to 0.9.102 2024-10-17 10:48:31 -04:00
Joshua Boniface e80b797e3a Add missing sorter for detail parser 2024-10-17 10:09:49 -04:00
Joshua Boniface 7c8c71dff7 Improve handling of local connections in CLI
1. Ensure the local connection is always present when it exists, and that
it is stored in the store file.

2. Remove any invalid "local" store entries if present (i.e.
pvcapid.yaml entries from legacy versions).

3. Order the connection lists such that "local" is always first.

4. Improve the pretty list output format so that all fields are widened if
needed.
2024-10-17 09:56:54 -04:00
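A rough sketch of the ordering described in point 3 (hypothetical helper; the real store handling lives in the CLI client):

    def order_connections(connections):
        # Sort connection names alphabetically, then pin "local" to the front.
        names = sorted(connections)
        if "local" in names:
            names.remove("local")
            names.insert(0, "local")
        return names

    # order_connections({"prod": {}, "local": {}, "dev": {}})
    # -> ["local", "dev", "prod"]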
Joshua Boniface 861fef91e3 Add modification of Monitor hosts on XML import
Without this, clusters with different storage hosts would silently fail to
start. Ensure these are updated on import, just as the secret UUID is.
2024-10-16 16:00:54 -04:00
Joshua Boniface d1fcac1f0a Bump version to 0.9.101 2024-10-15 11:39:11 -04:00
Joshua Boniface 6ace2ebf6a Set expected PVC version for mirroring 2024-10-15 11:31:50 -04:00
Joshua Boniface 962fba7621 Bump up startup waits slightly
Ensures there's more time for daemons (specifically Zookeeper) to start
up and synchronize between nodes.
2024-10-15 11:10:23 -04:00
Joshua Boniface 49bf51da38 Fix indentation of previous fix 2024-10-15 10:57:33 -04:00
Joshua Boniface 1293e8ae7e Fix bugs in lock freeing function
1. The destination state on an error was invalid; it should be "stop".

2. If a lock was listed but removing it failed (because it had somehow
already been cleared), this would error. In turn, the VM would fail to
migrate and be left in an undefined state. Fix that case when unlocking
is forced.
2024-10-15 10:43:52 -04:00
Joshua Boniface ae2cf8a070 Add some time for Zookeeper to synchronize 2024-10-15 10:43:44 -04:00
Joshua Boniface ab5bd3c57d Fix handling of invalid nets in list
Ensure we add the difference in length between the visible output and the
ANSI-coded output, so the format handler does not miscalculate field widths.
2024-10-14 12:51:02 -04:00
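The idea, as a minimal sketch (the regex and helper name are assumptions, not the actual formatter code):

    import re

    ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")

    def pad_ansi(text, width):
        # ljust() counts invisible ANSI escape bytes, so widen the target
        # by the difference between the raw and visible lengths.
        visible_length = len(ANSI_RE.sub("", text))
        return text.ljust(width + (len(text) - visible_length))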
Joshua Boniface 35153cd6b6 Fix path handling for zkhandler
Using full paths broke the local schema generator, so convert these to
proper class instance methods and use them along with a new default +
settable override.
2024-10-11 16:03:40 -04:00
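The shape of the change, sketched (illustrative only; the real ZKSchema class is more involved):

    class ZKSchema:
        def __init__(self, root_path=None):
            # Default to the installed location, but allow an override such as
            # root_path="." so the local schema generator works from a checkout.
            self.schema_path = root_path if root_path is not None else "/usr/share/pvc"

        def write(self):
            # Write the schema out relative to self.schema_path rather than
            # a hardcoded absolute path.
            print(f"writing schema under {self.schema_path}")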
Joshua Boniface 7f7047dd52 Add one more instance of mirror as purple 2024-10-11 14:44:14 -04:00
Joshua Boniface 9a91767405 Add proper return codes to API handlers 2024-10-11 14:43:44 -04:00
Joshua Boniface bcfa6851e1 Use purple for mirror state colour 2024-10-11 10:44:39 -04:00
Joshua Boniface 28b8b3bb44 Use proper response parsing instead of raise_for 2024-10-11 10:32:15 -04:00
Joshua Boniface 02425159ef Update Grafana graphs 2024-10-11 09:47:19 -04:00
Joshua Boniface a6f8500309 Improve fence handling to prevent anomalies
1. Move fence monitoring to its own thread rather than doing the listing
and triggering within the main keepalive thread.
2. Add a global lock key at /config/fence_lock and use this lock key to
prevent multiple nodes from trying to run fences simultaneously.
3. Run the fencing monitor for each node sequentially within the context
of the main fence monitoring thread, to ensure that fences of multiple
nodes happen sequentially rather than in parallel.

All of these should help to prevent any anomalies where one node can try
to fence multiple nodes at once without recourse.
2024-10-10 16:42:57 -04:00
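The serialization idea, sketched with kazoo's lock recipe (kazoo is an assumption for illustration; PVC uses its own zkhandler rather than kazoo directly):

    from kazoo.client import KazooClient

    def fence_monitor(dead_nodes, fence_node):
        zk = KazooClient(hosts="127.0.0.1:2181")
        zk.start()
        try:
            # The global lock key ensures only one node runs fences at a time...
            with zk.Lock("/config/fence_lock"):
                # ...and iterating here keeps per-node fences sequential.
                for node in dead_nodes:
                    fence_node(node)
        finally:
            zk.stop()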
Joshua Boniface ebec1332e9 Return to relative paths for SCHEMA_ROOT_PATH 2024-10-10 16:20:02 -04:00
Joshua Boniface c08c3b2d7d Improve thread timeouts in keepalive
Avoids various parts of the keepalive deadlocking while waiting on data that
will never come when internal processes fail. Based on testing, this should
ensure that the keepalive always finishes in under 5 seconds.
2024-10-10 15:33:47 -04:00
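One way to picture the timeout pattern (a sketch, not the actual keepalive code; the 4.9-second figure is an assumption chosen to fit the under-5-second target):

    import threading

    def collect_with_timeout(collector, timeout=4.9):
        # Run a collector in its own thread and give up after the timeout,
        # so a hung collector can no longer stall the whole keepalive.
        result = {}

        def _target():
            result["data"] = collector()

        thread = threading.Thread(target=_target, daemon=True)
        thread.start()
        thread.join(timeout)
        return result.get("data")  # None if the collector never finished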
Joshua Boniface 4c0d90b517 Add read lock timeouts to prevent deadlocks 2024-10-10 15:19:05 -04:00
Joshua Boniface 70c588d3a8 Add confirmation option for mirror promote 2024-10-10 01:57:06 -04:00
Joshua Boniface 214e7f835a Properly preserve state on promotion
Ensure if the state is start, stop, or disable, that state is preserved;
if it's anything else, the remote side will be started.
2024-10-10 01:21:05 -04:00
Joshua Boniface 96cebfb42a Handle cross-cluster Ceph storage secrets 2024-10-10 00:47:50 -04:00
Joshua Boniface c4763ac596 Fix invalid responses during promote 2024-10-09 01:14:19 -04:00
Joshua Boniface ea5512e3d8 Only shut down VM if it is running 2024-10-09 01:10:42 -04:00
Joshua Boniface ac00f7c4c8 Fix boolean state of remove_on_source 2024-10-09 01:04:08 -04:00
Joshua Boniface 6d31bf439e Update error text 2024-10-09 01:00:51 -04:00
Joshua Boniface c714093a2e Ensure VM start is forced 2024-10-09 00:58:43 -04:00
Joshua Boniface 04a09b9269 Fix invalid data in state change 2024-10-09 00:55:13 -04:00
Joshua Boniface 3ede0c7d38 Name mirror snapshots like autobackup snapshots 2024-10-09 00:49:22 -04:00
Joshua Boniface ab9390fdb8 Fix another bad stage counting instance 2024-10-09 00:44:20 -04:00
Joshua Boniface 1c83584788 Set correct verbiage 2024-10-09 00:38:59 -04:00
Joshua Boniface 7f3ab4e119 Fix stage counting in tasks 2024-10-09 00:37:13 -04:00
Joshua Boniface 16eb09dc22 Fix ordering bug with vm_detail 2024-10-09 00:33:00 -04:00
Joshua Boniface 7ba75adef4 Fix bug if destination is missing 2024-10-09 00:27:42 -04:00
Joshua Boniface a691d26c30 Add check for scheme in destination
Allows handling invalid cluster names properly.
2024-10-09 00:25:13 -04:00
Joshua Boniface 1d90b066bc Add guard rails against manipulating mirrors
Snapshot mirrors should normally be promoted using "mirror promote", and
not started manually. This adds guard rails against that to the "start",
"stop", and "disable" state commands to prevent changing mirror states
without an explicit "--force" option.
2024-10-08 23:51:48 -04:00
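The guard-rail logic, roughly (hypothetical helper for illustration):

    def check_mirror_guard(current_state, force=False):
        # "start", "stop", and "disable" refuse to act on a snapshot mirror
        # unless --force is given; "mirror promote" is the supported path.
        if current_state == "mirror" and not force:
            return (False, 'VM is a snapshot mirror; use "mirror promote" or pass "--force"')
        return (True, "")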
Joshua Boniface 3ea7421f09 Implement friendlier VM mirror commands
Adds two helper commands which automate sending and promoting VM
snapshots as "vm mirror" commands.

"vm mirror create" replicates the functionality of "snapshot create" and
"snapshot send", performing both in one single task using an
autogenerated dated snapshot name for automatic cross-cluster
replication.

"vm mirror promote" replicates the functionality of "vm shutdown",
"snapshot create", "snapshot send", "vm start" (remote), and,
optionally, "vm remove", performing in one single task an entire
cross-cluster VM move with or without retaining the copy on the local
cluster (if retained, the local copy becomes a snapshot mirror of the
remote, flipping their statuses).
2024-10-08 23:51:39 -04:00
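The promote sequence, sketched in Python (illustrative; the client objects and method names are assumptions, and the real implementation is a single Celery task with error handling):

    def mirror_promote(source, destination, vm, remove_on_source=False):
        if source.vm_state(vm) == "start":
            source.vm_shutdown(vm, wait=True)        # "vm shutdown"
        snapshot = source.snapshot_create(vm)        # autogenerated dated name
        source.snapshot_send(vm, snapshot, destination)
        destination.vm_start(vm, force=True)         # "vm start" (remote)
        if remove_on_source:
            source.vm_remove(vm)                     # full cross-cluster move
        else:
            source.vm_set_state(vm, "mirror")        # statuses flip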
Joshua Boniface df4d437d31 Update the description of VM define endpoint 2024-10-01 13:30:44 -04:00
Joshua Boniface 8295e2089d Add proper response schema for 202 responses 2024-10-01 13:25:11 -04:00
Joshua Boniface 4ccb570762 Enhance documentation of snapshot send command 2024-09-30 23:54:53 -04:00
Joshua Boniface 235299942a Add volume resize if changed 2024-09-30 20:51:59 -04:00
Joshua Boniface 9aa32134a9 Fix bug in API specification 2024-09-30 20:51:49 -04:00
Joshua Boniface 75eac356d5 Increase send blocksize and add total speed
It's much faster and seems to cause no issues.
2024-09-30 20:11:12 -04:00
Joshua Boniface fb8561cc5d Actually fix incremental sending 2024-09-30 17:00:18 -04:00
Joshua Boniface 5f7aa0b2d6 Improve incremental send speed 2024-09-30 04:15:17 -04:00
Joshua Boniface 7fac7a62cf Clean up debug print statements 2024-09-30 03:51:39 -04:00
Joshua Boniface b19642aa2e Fix bug where snapshot rollback was never called 2024-09-30 03:04:35 -04:00
Joshua Boniface 974e0d6ac2 Shorten progress bars to 20 characters
They were needlessly long and this limited the message size.
2024-09-30 03:04:10 -04:00
Joshua Boniface 7785166a7e Finish working implementation of send/receive
Required some significant refactoring due to issues with the diff send,
but it works.
2024-09-30 02:53:23 -04:00
Joshua Boniface 34f0a2f388 Add mostly complete implementation of VM send 2024-09-29 01:31:13 -04:00
Joshua Boniface 8fa37d21c0 Fix handling of invalid network lengths 2024-09-29 00:39:53 -04:00
Joshua Boniface f462ebbc6b Add VM snapshot send (initial) 2024-09-28 10:49:35 -04:00
Joshua Boniface 0d533f3658 Rework task output bar operation
Allows sending constant updates including changes to the message within
the same task.
2024-09-28 10:48:39 -04:00
Joshua Boniface 792d135950 Update responses for Celery tasks 2024-09-28 02:01:56 -04:00
Joshua Boniface a64e0c1985 Fix incorrect default value typos 2024-09-28 02:01:56 -04:00
Joshua Boniface 1cbadb1172 Add "mirror" VM state 2024-09-28 02:01:56 -04:00
Joshua Boniface b1c4b2e928 Add Ceph block receive (initial) 2024-09-28 02:01:56 -04:00
Joshua Boniface 7fe1262887 Fix indentation in faults 2024-09-28 02:01:33 -04:00
Joshua Boniface 0e389ba1f4 Fix bug when setting split count = 1
Would set the OSD as split in Zookeeper, even though it wasn't.
2024-09-23 13:06:05 -04:00
Joshua Boniface 41cd34ba4d Allow specifying job names for benchmarks 2024-09-18 14:55:12 -04:00
Joshua Boniface 736762901c Update benchmarks to include resource utilization
Adds polled information on CPU, memory, and network bandwidth for the node
running the test. This should provide useful additional context about the
results of the test.

Also bumps the test format to 2 to ensure clients can handle the changes
properly.
2024-09-18 14:32:03 -04:00
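A sketch of the polling side (psutil here is an assumption for illustration, not necessarily what the benchmark worker uses):

    import threading
    import psutil

    def poll_node_resources(stop_event, samples, interval=1.0):
        # Sample CPU, memory, and network counters while the benchmark job
        # runs; the samples are attached to the format-2 result.
        while not stop_event.is_set():
            samples.append({
                "cpu_pct": psutil.cpu_percent(interval=interval),
                "mem_pct": psutil.virtual_memory().percent,
                "net": psutil.net_io_counters()._asdict(),
            })

    # Usage: run in a thread for the duration of the benchmark job.
    # samples = []; stop = threading.Event()
    # threading.Thread(target=poll_node_resources, args=(stop, samples)).start()
    # ... run benchmark ...; stop.set()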
Joshua Boniface ecb812ccac Update linting for pvcapid recent changes 2024-09-18 10:18:50 -04:00
Joshua Boniface a2e5df9f6d Add support for Gunicorn execution
Modifies pvcapid to run under Gunicorn when in non-debug mode, instead of
the Flask development server. This is proper practice, and it also slightly
improves performance in some workloads (mainly file uploads).
2024-09-09 13:20:03 -04:00
Joshua Boniface 73c0834f85 Remove headers and add util to short output 2024-09-06 11:40:39 -04:00
Joshua Boniface 2de999c700 Add total cluster utilization stats
Useful for evaluating the cluster resources as a whole.
2024-09-05 16:05:33 -04:00
Joshua Boniface 7543eb839d Add dedicated volume scan endpoint
Allows an imported volume to be scanned for stats independently.

Designed to be used as part of a snapshot import via the API, allowing the
"create" to happen before the real import (to check for available space,
etc.) and this scan to run afterwards, once the RBD volume actually exists.
2024-09-03 20:32:27 -04:00
39 changed files with 11652 additions and 6765 deletions

View File

@ -1 +1 @@
-0.9.100
+0.9.103

View File

@ -1,5 +1,39 @@
## PVC Changelog
###### [v0.9.103](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.103)
* [Provisioner] Fixes a bug with the change in `storage_hosts` to FQDNs affecting the VM Builder
* [Monitoring] Fixes the Munin plugin to work properly with sudo
###### [v0.9.102](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.102)
* [API Daemon] Ensures that received config snapshots update storage hosts in addition to secret UUIDs
* [CLI Client] Fixes several bugs around local connection handling and connection listings
###### [v0.9.101](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.101)
**New Feature**: Adds VM snapshot sending (`vm snapshot send`), VM mirroring (`vm mirror create`), and (offline) mirror promotion (`vm mirror promote`). Permits transferring VM snapshots to remote clusters, individually or repeatedly, and promoting them to active status, for disaster recovery and migration between clusters.
**Breaking Change**: Migrates the API daemon into Gunicorn when in production mode. Permits more scalable and performant operation of the API. **Requires additional dependency packages on all coordinator nodes** (`gunicorn`, `python3-gunicorn`, `python3-setuptools`); upgrade via `pvc-ansible` is strongly recommended.
**Enhancement**: Provides whole cluster utilization stats in the cluster status data. Permits better observability into the overall resource utilization of the cluster.
**Enhancement**: Adds a new storage benchmark format (v2) which includes additional resource utilization statistics. This allows for better evaluation of storage performance impact on the cluster as a whole. The updated format also permits arbitrary benchmark job names for easier parsing and tracking.
* [API Daemon] Allows scanning of new volumes added manually via other commands
* [API Daemon/CLI Client] Adds whole cluster utilization statistics to cluster status
* [API Daemon] Moves production API execution into Gunicorn
* [API Daemon] Adds a new storage benchmark format (v2) with additional resource tracking
* [API Daemon] Adds support for named storage benchmark jobs
* [API Daemon] Fixes a bug in OSD creation which would create `split` OSDs if `--osd-count` was set to 1
* [API Daemon] Adds support for the `mirror` VM state used by snapshot mirrors
* [CLI Client] Fixes several output display bugs in various commands and in Worker task outputs
* [CLI Client] Improves and shrinks the status progress bar output to support longer messages
* [API Daemon] Adds support for sending snapshots to remote clusters
* [API Daemon] Adds support for updating and promoting snapshot mirrors to remote clusters
* [Node Daemon] Improves timeouts during primary/secondary coordinator transitions to avoid deadlocks
* [Node Daemon] Improves timeouts during keepalive updates to avoid deadlocks
* [Node Daemon] Refactors fencing thread structure to ensure a single fencing task per cluster and sequential node fences to avoid potential anomalies (e.g. fencing 2 nodes simultaneously)
* [Node Daemon] Fixes a bug in fencing if VM locks were already freed, leaving VMs in an invalid state
* [Node Daemon] Increases the wait time during system startup to ensure Zookeeper has more time to synchronize
###### [v0.9.100](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.100)
* [API Daemon] Improves the handling of "detect:" disk strings on newer systems by leveraging the "nvme" command

View File

@ -1,10 +1,11 @@
<p align="center">
-<img alt="Logo banner" src="images/pvc_logo_black.png"/>
+<img alt="Logo banner" src="https://docs.parallelvirtualcluster.org/en/latest/images/pvc_logo_black.png"/>
<br/><br/>
+<a href="https://www.parallelvirtualcluster.org"><img alt="Website" src="https://img.shields.io/badge/visit-website-blue"/></a>
+<a href="https://github.com/parallelvirtualcluster/pvc/releases"><img alt="Latest Release" src="https://img.shields.io/github/release-pre/parallelvirtualcluster/pvc"/></a>
+<a href="https://docs.parallelvirtualcluster.org/en/latest/?badge=latest"><img alt="Documentation Status" src="https://readthedocs.org/projects/parallelvirtualcluster/badge/?version=latest"/></a>
<a href="https://github.com/parallelvirtualcluster/pvc"><img alt="License" src="https://img.shields.io/github/license/parallelvirtualcluster/pvc"/></a>
<a href="https://github.com/psf/black"><img alt="Code style: Black" src="https://img.shields.io/badge/code%20style-black-000000.svg"/></a>
-<a href="https://github.com/parallelvirtualcluster/pvc/releases"><img alt="Release" src="https://img.shields.io/github/release-pre/parallelvirtualcluster/pvc"/></a>
-<a href="https://docs.parallelvirtualcluster.org/en/latest/?badge=latest"><img alt="Documentation Status" src="https://readthedocs.org/projects/parallelvirtualcluster/badge/?version=latest"/></a>
</p>
## What is PVC?
@ -23,62 +24,64 @@ Installation of PVC is accomplished by two main components: a [Node installer IS
Just give it physical servers, and it will run your VMs without you having to think about it, all in just an hour or two of setup time.
+More information about PVC, its motivations, the hardware requirements, and setting up and managing a cluster [can be found over at our docs page](https://docs.parallelvirtualcluster.org).
## Getting Started
To get started with PVC, please see the [About](https://docs.parallelvirtualcluster.org/en/latest/about-pvc/) page for general information about the project, and the [Getting Started](https://docs.parallelvirtualcluster.org/en/latest/deployment/getting-started/) page for details on configuring your first cluster.
## Changelog
-View the changelog in [CHANGELOG.md](CHANGELOG.md). **Please note that any breaking changes are announced here; ensure you read the changelog before upgrading!**
+View the changelog in [CHANGELOG.md](https://github.com/parallelvirtualcluster/pvc/blob/master/CHANGELOG.md). **Please note that any breaking changes are announced here; ensure you read the changelog before upgrading!**
## Screenshots
These screenshots show some of the available functionality of the PVC system and CLI as of PVC v0.9.85.
-<p><img alt="0. Integrated help" src="images/0-integrated-help.png"/><br/>
+<p><img alt="0. Integrated help" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/0-integrated-help.png"/><br/>
<i>The CLI features an integrated, fully-featured help system to show details about every possible command.</i>
</p>
-<p><img alt="1. Connection management" src="images/1-connection-management.png"/><br/>
+<p><img alt="1. Connection management" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/1-connection-management.png"/><br/>
<i>A single CLI instance can manage multiple clusters, including a quick detail view, and will default to a "local" connection if an "/etc/pvc/pvc.conf" file is found; sensitive API keys are hidden by default.</i>
</p>
-<p><img alt="2. Cluster details and output formats" src="images/2-cluster-details-and-output-formats.png"/><br/>
+<p><img alt="2. Cluster details and output formats" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/2-cluster-details-and-output-formats.png"/><br/>
<i>PVC can show the key details of your cluster at a glance, including health, persistent fault events, and key resources; the CLI can output both in pretty human format and JSON for easier machine parsing in scripts.</i>
</p>
-<p><img alt="3. Node information" src="images/3-node-information.png"/><br/>
+<p><img alt="3. Node information" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/3-node-information.png"/><br/>
<i>PVC can show details about the nodes in the cluster, including their live health and resource utilization.</i>
</p>
-<p><img alt="4. VM information" src="images/4-vm-information.png"/><br/>
+<p><img alt="4. VM information" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/4-vm-information.png"/><br/>
<i>PVC can show details about the VMs in the cluster, including their state, resource allocations, current hosting node, and metadata.</i>
</p>
-<p><img alt="5. VM details" src="images/5-vm-details.png"/><br/>
+<p><img alt="5. VM details" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/5-vm-details.png"/><br/>
<i>In addition to the above basic details, PVC can also show extensive information about a running VM's devices and other resource utilization.</i>
</p>
-<p><img alt="6. Network information" src="images/6-network-information.png"/><br/>
+<p><img alt="6. Network information" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/6-network-information.png"/><br/>
<i>PVC has two major client network types, and ensures a consistent configuration of client networks across the entire cluster; managed networks can feature DHCP, DNS, firewall, and other functionality including DHCP reservations.</i>
</p>
-<p><img alt="7. Storage information" src="images/7-storage-information.png"/><br/>
+<p><img alt="7. Storage information" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/7-storage-information.png"/><br/>
<i>PVC provides a convenient abstracted view of the underlying Ceph system and can manage all core aspects of it.</i>
</p>
-<p><img alt="8. VM and node logs" src="images/8-vm-and-node-logs.png"/><br/>
+<p><img alt="8. VM and node logs" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/8-vm-and-node-logs.png"/><br/>
<i>PVC can display logs from VM serial consoles (if properly configured) and nodes in-client to facilitate quick troubleshooting.</i>
</p>
-<p><img alt="9. VM and worker tasks" src="images/9-vm-and-worker-tasks.png"/><br/>
+<p><img alt="9. VM and worker tasks" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/9-vm-and-worker-tasks.png"/><br/>
<i>PVC provides full VM lifecycle management, as well as long-running worker-based commands (in this example, clearing a VM's storage locks).</i>
</p>
-<p><img alt="10. Provisioner" src="images/10-provisioner.png"/><br/>
+<p><img alt="10. Provisioner" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/10-provisioner.png"/><br/>
<i>PVC features an extensively customizable and configurable VM provisioner system, including EC2-compatible CloudInit support, allowing you to define flexible VM profiles and provision new VMs with a single command.</i>
</p>
-<p><img alt="11. Prometheus and Grafana dashboard" src="images/11-prometheus-grafana.png"/><br/>
+<p><img alt="11. Prometheus and Grafana dashboard" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/11-prometheus-grafana.png"/><br/>
<i>PVC features several monitoring integration examples under "node-daemon/monitoring", including CheckMK, Munin, and, most recently, Prometheus, including an example Grafana dashboard for cluster monitoring and alerting.</i>
</p>

View File

@ -21,4 +21,5 @@
from daemon_lib.zkhandler import ZKSchema
-ZKSchema.write()
+schema = ZKSchema(root_path=".")
+schema.write()

View File

@ -19,6 +19,13 @@
#
###############################################################################
-import pvcapid.Daemon # noqa: F401
+import sys
+from os import path
+# Ensure current directory (/usr/share/pvc) is in the system path for Gunicorn
+current_dir = path.dirname(path.abspath(__file__))
+sys.path.append(current_dir)
+import pvcapid.Daemon # noqa: F401, E402
pvcapid.Daemon.entrypoint()

View File

@ -19,15 +19,13 @@
#
###############################################################################
+import subprocess
from ssl import SSLContext, TLSVersion
from distutils.util import strtobool as dustrtobool
import daemon_lib.config as cfg
# Daemon version
-version = "0.9.100"
+version = "0.9.100~git-73c0834f"
# API version
API_VERSION = 1.0
@ -53,7 +51,6 @@ def strtobool(stringv):
# Configuration Parsing
##########################################################
# Get our configuration
config = cfg.get_configuration()
config["daemon_name"] = "pvcapid"
@ -61,22 +58,16 @@ config["daemon_version"] = version
##########################################################
-# Entrypoint
+# Flask App Creation for Gunicorn
##########################################################
-def entrypoint():
-import pvcapid.flaskapi as pvc_api # noqa: E402
-if config["api_ssl_enabled"]:
-context = SSLContext()
-context.minimum_version = TLSVersion.TLSv1
-context.get_ca_certs()
-context.load_cert_chain(
-config["api_ssl_cert_file"], keyfile=config["api_ssl_key_file"]
-)
-else:
-context = None
+def create_app():
+"""
+Create and return the Flask app and SSL context if necessary.
+"""
+# Import the Flask app from pvcapid.flaskapi after adjusting the path
+import pvcapid.flaskapi as pvc_api
# Print our startup messages
print("")
@ -102,9 +93,69 @@ def entrypoint():
print("") print("")
pvc_api.celery_startup() pvc_api.celery_startup()
pvc_api.app.run(
config["api_listen_address"], return pvc_api.app
config["api_listen_port"],
threaded=True,
ssl_context=context, ##########################################################
) # Entrypoint
##########################################################
def entrypoint():
if config["debug"]:
app = create_app()
if config["api_ssl_enabled"]:
ssl_context = SSLContext()
ssl_context.minimum_version = TLSVersion.TLSv1
ssl_context.get_ca_certs()
ssl_context.load_cert_chain(
config["api_ssl_cert_file"], keyfile=config["api_ssl_key_file"]
)
else:
ssl_context = None
app.run(
config["api_listen_address"],
config["api_listen_port"],
threaded=True,
ssl_context=ssl_context,
)
else:
# Build the command to run Gunicorn
gunicorn_cmd = [
"gunicorn",
"--workers",
"1",
"--threads",
"8",
"--timeout",
"86400",
"--bind",
"{}:{}".format(config["api_listen_address"], config["api_listen_port"]),
"pvcapid.Daemon:create_app()",
"--log-level",
"info",
"--access-logfile",
"-",
"--error-logfile",
"-",
]
if config["api_ssl_enabled"]:
gunicorn_cmd += [
"--certfile",
config["api_ssl_cert_file"],
"--keyfile",
config["api_ssl_key_file"],
]
# Run Gunicorn
try:
subprocess.run(gunicorn_cmd)
except KeyboardInterrupt:
exit(0)
except Exception as e:
print(e)
exit(1)

File diff suppressed because it is too large

View File

@ -21,7 +21,9 @@
import flask
import json
import logging
import lxml.etree as etree
import sys
from re import match
from requests import get
@ -40,6 +42,15 @@ import daemon_lib.network as pvc_network
import daemon_lib.ceph as pvc_ceph
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)
logger.addHandler(handler)
#
# Cluster base functions
#
@ -1142,11 +1153,11 @@ def vm_remove(zkhandler, name):
@ZKConnection(config)
-def vm_start(zkhandler, name):
+def vm_start(zkhandler, name, force=False):
"""
Start a VM in the PVC cluster.
"""
-retflag, retdata = pvc_vm.start_vm(zkhandler, name)
+retflag, retdata = pvc_vm.start_vm(zkhandler, name, force=force)
if retflag:
retcode = 200
@ -1190,11 +1201,11 @@ def vm_shutdown(zkhandler, name, wait):
@ZKConnection(config)
-def vm_stop(zkhandler, name):
+def vm_stop(zkhandler, name, force=False):
"""
Forcibly stop a VM in the PVC cluster.
"""
-retflag, retdata = pvc_vm.stop_vm(zkhandler, name)
+retflag, retdata = pvc_vm.stop_vm(zkhandler, name, force=force)
if retflag:
retcode = 200
@ -1208,7 +1219,7 @@ def vm_stop(zkhandler, name):
@ZKConnection(config)
def vm_disable(zkhandler, name, force=False):
"""
-Disable (shutdown or force stop if required)a VM in the PVC cluster.
+Disable (shutdown or force stop if required) a VM in the PVC cluster.
"""
retflag, retdata = pvc_vm.disable_vm(zkhandler, name, force=force)
@ -1280,7 +1291,7 @@ def vm_flush_locks(zkhandler, vm):
zkhandler, None, None, None, vm, is_fuzzy=False, negate=False
)
-if retdata[0].get("state") not in ["stop", "disable"]:
+if retdata[0].get("state") not in ["stop", "disable", "mirror"]:
return {"message": "VM must be stopped to flush locks"}, 400
retflag, retdata = pvc_vm.flush_locks(zkhandler, vm)
@ -1294,6 +1305,342 @@ def vm_flush_locks(zkhandler, vm):
return output, retcode
@ZKConnection(config)
def vm_snapshot_receive_block_full(zkhandler, pool, volume, snapshot, size, request):
"""
Receive an RBD volume from a remote system
"""
import rados
import rbd
_, rbd_detail = pvc_ceph.get_list_volume(
zkhandler, pool, limit=volume, is_fuzzy=False
)
if len(rbd_detail) > 0:
volume_exists = True
else:
volume_exists = False
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(pool)
if not volume_exists:
rbd_inst = rbd.RBD()
rbd_inst.create(ioctx, volume, size)
retflag, retdata = pvc_ceph.add_volume(
zkhandler, pool, volume, str(size) + "B", force_flag=True, zk_only=True
)
if not retflag:
ioctx.close()
cluster.shutdown()
if retflag:
retcode = 200
else:
retcode = 400
output = {"message": retdata.replace('"', "'")}
return output, retcode
image = rbd.Image(ioctx, volume)
last_chunk = 0
chunk_size = 1024 * 1024 * 1024
logger.info(f"Importing full snapshot {pool}/{volume}@{snapshot}")
while True:
chunk = request.stream.read(chunk_size)
if not chunk:
break
image.write(chunk, last_chunk)
last_chunk += len(chunk)
image.close()
ioctx.close()
cluster.shutdown()
return {"message": "Successfully received RBD block device"}, 200
@ZKConnection(config)
def vm_snapshot_receive_block_diff(
zkhandler, pool, volume, snapshot, source_snapshot, request
):
"""
Receive an RBD volume from a remote system
"""
import rados
import rbd
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(pool)
image = rbd.Image(ioctx, volume)
if len(request.files) > 0:
logger.info(f"Applying {len(request.files)} RBD diff chunks for {snapshot}")
for i in range(len(request.files)):
object_key = f"object_{i}"
if object_key in request.files:
object_data = request.files[object_key].read()
offset = int.from_bytes(object_data[:8], "big")
length = int.from_bytes(object_data[8:16], "big")
data = object_data[16 : 16 + length]
logger.info(f"Applying RBD diff chunk at {offset} ({length} bytes)")
image.write(data, offset)
else:
return {"message": "No data received"}, 400
image.close()
ioctx.close()
cluster.shutdown()
return {
"message": f"Successfully received {len(request.files)} RBD diff chunks"
}, 200
@ZKConnection(config)
def vm_snapshot_receive_block_createsnap(zkhandler, pool, volume, snapshot):
"""
Create the snapshot of a remote volume
"""
import rados
import rbd
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(pool)
image = rbd.Image(ioctx, volume)
image.create_snap(snapshot)
image.close()
ioctx.close()
cluster.shutdown()
retflag, retdata = pvc_ceph.add_snapshot(
zkhandler, pool, volume, snapshot, zk_only=True
)
if not retflag:
if retflag:
retcode = 200
else:
retcode = 400
output = {"message": retdata.replace('"', "'")}
return output, retcode
return {"message": "Successfully received RBD snapshot"}, 200
@ZKConnection(config)
def vm_snapshot_receive_config(zkhandler, snapshot, vm_config, source_snapshot=None):
"""
Receive a VM configuration snapshot from a remote system, and modify it to work on our system
"""
def parse_unified_diff(diff_text, original_text):
"""
Take a unified diff and apply it to an original string
"""
# Split the original string into lines
original_lines = original_text.splitlines(keepends=True)
patched_lines = []
original_idx = 0 # Track position in original lines
diff_lines = diff_text.splitlines(keepends=True)
for line in diff_lines:
if line.startswith("---") or line.startswith("+++"):
# Ignore prefix lines
continue
if line.startswith("@@"):
# Extract line numbers from the diff hunk header
hunk_header = line
parts = hunk_header.split(" ")
original_range = parts[1]
# Get the starting line number and range length for the original file
original_start, _ = map(int, original_range[1:].split(","))
# Adjust for zero-based indexing
original_start -= 1
# Add any lines between the current index and the next hunk's start
while original_idx < original_start:
patched_lines.append(original_lines[original_idx])
original_idx += 1
elif line.startswith("-"):
# This line should be removed from the original, skip it
original_idx += 1
elif line.startswith("+"):
# This line should be added to the patched version, removing the '+'
patched_lines.append(line[1:])
else:
# Context line (unchanged), it has no prefix, add from the original
patched_lines.append(original_lines[original_idx])
original_idx += 1
# Add any remaining lines from the original file after the last hunk
patched_lines.extend(original_lines[original_idx:])
return "".join(patched_lines).strip()
# Get our XML configuration for this snapshot
# We take the main XML configuration, then apply the diff for this particular incremental
current_snapshot = [s for s in vm_config["snapshots"] if s["name"] == snapshot][0]
vm_xml = vm_config["xml"]
vm_xml_diff = "\n".join(current_snapshot["xml_diff_lines"])
snapshot_vm_xml = parse_unified_diff(vm_xml_diff, vm_xml)
xml_data = etree.fromstring(snapshot_vm_xml)
# Replace the Ceph storage secret UUID with this cluster's
our_ceph_secret_uuid = config["ceph_secret_uuid"]
ceph_secrets = xml_data.xpath("//secret[@type='ceph']")
for ceph_secret in ceph_secrets:
ceph_secret.set("uuid", our_ceph_secret_uuid)
# Replace the Ceph source hosts with this cluster's
our_ceph_storage_hosts = config["storage_hosts"]
our_ceph_storage_port = str(config["ceph_monitor_port"])
ceph_sources = xml_data.xpath("//source[@protocol='rbd']")
for ceph_source in ceph_sources:
for host in ceph_source.xpath("host"):
ceph_source.remove(host)
for ceph_storage_host in our_ceph_storage_hosts:
new_host = etree.Element("host")
new_host.set("name", ceph_storage_host)
new_host.set("port", our_ceph_storage_port)
ceph_source.append(new_host)
# Regenerate the VM XML
snapshot_vm_xml = etree.tostring(xml_data, pretty_print=True).decode("utf8")
if (
source_snapshot is not None
or pvc_vm.searchClusterByUUID(zkhandler, vm_config["uuid"]) is not None
):
logger.info(
f"Receiving incremental VM configuration for {vm_config['name']}@{snapshot}"
)
# Modify the VM based on our passed detail
retcode, retmsg = pvc_vm.modify_vm(
zkhandler,
vm_config["uuid"],
False,
snapshot_vm_xml,
)
if not retcode:
retcode = 400
retdata = {"message": retmsg}
return retdata, retcode
retcode, retmsg = pvc_vm.modify_vm_metadata(
zkhandler,
vm_config["uuid"],
None, # Node limits are left unchanged
vm_config["node_selector"],
vm_config["node_autostart"],
vm_config["profile"],
vm_config["migration_method"],
vm_config["migration_max_downtime"],
)
if not retcode:
retcode = 400
retdata = {"message": retmsg}
return retdata, retcode
current_vm_tags = zkhandler.children(("domain.meta.tags", vm_config["uuid"]))
new_vm_tags = [t["name"] for t in vm_config["tags"]]
remove_tags = []
add_tags = []
for tag in vm_config["tags"]:
if tag["name"] not in current_vm_tags:
add_tags.append((tag["name"], tag["protected"]))
for tag in current_vm_tags:
if tag not in new_vm_tags:
remove_tags.append(tag)
for tag in add_tags:
name, protected = tag
pvc_vm.modify_vm_tag(
zkhandler, vm_config["uuid"], "add", name, protected=protected
)
for tag in remove_tags:
pvc_vm.modify_vm_tag(zkhandler, vm_config["uuid"], "remove", name)
else:
logger.info(
f"Receiving full VM configuration for {vm_config['name']}@{snapshot}"
)
# Define the VM based on our passed detail
retcode, retmsg = pvc_vm.define_vm(
zkhandler,
snapshot_vm_xml,
None, # Target node is autoselected
None, # Node limits are invalid here so ignore them
vm_config["node_selector"],
vm_config["node_autostart"],
vm_config["migration_method"],
vm_config["migration_max_downtime"],
vm_config["profile"],
vm_config["tags"],
"mirror",
)
if not retcode:
retcode = 400
retdata = {"message": retmsg}
return retdata, retcode
# Add this snapshot to the VM manually in Zookeeper
zkhandler.write(
[
(
(
"domain.snapshots",
vm_config["uuid"],
"domain_snapshot.name",
snapshot,
),
snapshot,
),
(
(
"domain.snapshots",
vm_config["uuid"],
"domain_snapshot.timestamp",
snapshot,
),
current_snapshot["timestamp"],
),
(
(
"domain.snapshots",
vm_config["uuid"],
"domain_snapshot.xml",
snapshot,
),
snapshot_vm_xml,
),
(
(
"domain.snapshots",
vm_config["uuid"],
"domain_snapshot.rbd_snapshots",
snapshot,
),
",".join(current_snapshot["rbd_snapshots"]),
),
]
)
return {"message": "Successfully received VM configuration snapshot"}, 200
#
# Network functions
#
@ -1996,6 +2343,22 @@ def ceph_volume_list(zkhandler, pool=None, limit=None, is_fuzzy=True):
return retdata, retcode
@ZKConnection(config)
def ceph_volume_scan(zkhandler, pool, name):
"""
(Re)scan a Ceph RBD volume for stats in the PVC Ceph storage cluster.
"""
retflag, retdata = pvc_ceph.scan_volume(zkhandler, pool, name)
if retflag:
retcode = 200
else:
retcode = 400
output = {"message": retdata.replace('"', "'")}
return output, retcode
@ZKConnection(config)
def ceph_volume_add(zkhandler, pool, name, size, force_flag=False):
"""

View File

@ -1517,12 +1517,21 @@ def cli_vm_remove(domain):
@click.command(name="start", short_help="Start up a defined virtual machine.") @click.command(name="start", short_help="Start up a defined virtual machine.")
@connection_req @connection_req
@click.argument("domain") @click.argument("domain")
def cli_vm_start(domain): @click.option(
"--force",
"force_flag",
is_flag=True,
default=False,
help="Force a snapshot mirror state change.",
)
def cli_vm_start(domain, force_flag):
""" """
Start virtual machine DOMAIN on its configured node. DOMAIN may be a UUID or name. Start virtual machine DOMAIN on its configured node. DOMAIN may be a UUID or name.
If the VM is a snapshot mirror, "--force" allows a manual state change to the mirror.
""" """
retcode, retmsg = pvc.lib.vm.vm_state(CLI_CONFIG, domain, "start") retcode, retmsg = pvc.lib.vm.vm_state(CLI_CONFIG, domain, "start", force=force_flag)
finish(retcode, retmsg) finish(retcode, retmsg)
@ -1582,13 +1591,22 @@ def cli_vm_shutdown(domain, wait):
@click.command(name="stop", short_help="Forcibly halt a running virtual machine.") @click.command(name="stop", short_help="Forcibly halt a running virtual machine.")
@connection_req @connection_req
@click.argument("domain") @click.argument("domain")
@click.option(
"--force",
"force_flag",
is_flag=True,
default=False,
help="Force a snapshot mirror state change.",
)
@confirm_opt("Forcibly stop virtual machine {domain}") @confirm_opt("Forcibly stop virtual machine {domain}")
def cli_vm_stop(domain): def cli_vm_stop(domain, force_flag):
""" """
Forcibly halt (destroy) running virtual machine DOMAIN. DOMAIN may be a UUID or name. Forcibly halt (destroy) running virtual machine DOMAIN. DOMAIN may be a UUID or name.
If the VM is a snapshot mirror, "--force" allows a manual state change to the mirror.
""" """
retcode, retmsg = pvc.lib.vm.vm_state(CLI_CONFIG, domain, "stop") retcode, retmsg = pvc.lib.vm.vm_state(CLI_CONFIG, domain, "stop", force=force_flag)
finish(retcode, retmsg) finish(retcode, retmsg)
@ -1603,14 +1621,14 @@ def cli_vm_stop(domain):
"force_flag", "force_flag",
is_flag=True, is_flag=True,
default=False, default=False,
help="Forcibly stop the VM instead of waiting for shutdown.", help="Forcibly stop VM without shutdown and/or force a snapshot mirror state change.",
) )
@confirm_opt("Shut down and disable virtual machine {domain}") @confirm_opt("Shut down and disable virtual machine {domain}")
def cli_vm_disable(domain, force_flag): def cli_vm_disable(domain, force_flag):
""" """
Shut down virtual machine DOMAIN and mark it as disabled. DOMAIN may be a UUID or name. Shut down virtual machine DOMAIN and mark it as disabled. DOMAIN may be a UUID or name.
Disabled VMs will not be counted towards a degraded cluster health status, unlike stopped VMs. Use this option for a VM that will remain off for an extended period. If "--force" is specified, and the VM is running, it will be forcibly stopped instead of waiting for a graceful ACPI shutdown. If the VM is a snapshot mirror, "--force" allows a manual state change to the mirror.
""" """
retcode, retmsg = pvc.lib.vm.vm_state( retcode, retmsg = pvc.lib.vm.vm_state(
@ -2018,6 +2036,308 @@ def cli_vm_snapshot_import(
finish(retcode, retmsg)
###############################################################################
# > pvc vm snapshot send
###############################################################################
@click.command(
name="send",
short_help="Send a snapshot of a virtual machine to another PVC cluster.",
)
@connection_req
@click.argument("domain")
@click.argument("snapshot_name")
@click.argument("destination")
@click.option(
"-k",
"--destination-api-key",
"destination_api_key",
default=None,
help="The API key of the destination cluster when specifying an API URI.",
)
@click.option(
"-p",
"--destination-pool",
"destination_storage_pool",
default=None,
help="The target storage pool on the destination cluster, if it differs from the source pool.",
)
@click.option(
"-i",
"--incremental",
"incremental_parent",
default=None,
help="Perform an incremental volume send from this parent snapshot.",
)
@click.option(
"--wait/--no-wait",
"wait_flag",
is_flag=True,
default=True,
show_default=True,
help="Wait or don't wait for task to complete, showing progress if waiting",
)
def cli_vm_snapshot_send(
domain,
snapshot_name,
destination,
destination_api_key,
destination_storage_pool,
incremental_parent,
wait_flag,
):
"""
Send the (existing) snapshot SNAPSHOT_NAME of virtual machine DOMAIN to the remote PVC cluster DESTINATION.
DOMAIN may be a UUID or name. DESTINATION may be either a configured PVC connection name in this CLI instance (i.e. a valid argument to "--connection"), or a full API URI, including the scheme, port and API prefix; if using the latter, an API key can be specified with the "-k"/"--destination-api-key" option.
The send will include the VM configuration, metainfo, and a point-in-time snapshot of all attached RBD volumes.
By default, the storage pool of the sending cluster will be used at the destination cluster as well. If a pool of that name does not exist, specify one with the "-p"/"--destination-pool" option.
Incremental sends are possible by specifying the "-i"/"--incremental-parent" option along with a parent snapshot name. To correctly receive, that parent snapshot must exist on DESTINATION. Subsequent sends after the first do not have to be incremental, but an incremental send is likely to perform better than a full send if the VM experiences few writes.
WARNING: Once sent, the VM will be in the state "mirror" on the destination cluster. If it is subsequently started, for instance for disaster recovery, a new snapshot must be taken on the destination cluster and sent back or data will be inconsistent between the instances. Only VMs in the "mirror" state can accept new sends.
WARNING: This functionality has no automatic backout on the remote side. While a properly configured cluster should not fail any step in the process, a situation like an intermittent network connection might cause a failure which would have to be manually corrected on that side, usually by removing the mirrored VM and retrying, or rolling back to a previous snapshot and retrying. Future versions may enhance automatic recovery, but for now this would be up to the administrator.
"""
connections_config = get_store(CLI_CONFIG["store_path"])
if destination in connections_config.keys():
destination_cluster_config = connections_config[destination]
destination_api_uri = "{}://{}:{}{}".format(
destination_cluster_config["scheme"],
destination_cluster_config["host"],
destination_cluster_config["port"],
CLI_CONFIG["api_prefix"],
)
destination_api_key = destination_cluster_config["api_key"]
else:
if "http" not in destination:
finish(
False, "ERROR: A valid destination cluster or URI must be specified!"
)
destination_api_uri = destination
destination_api_key = destination_api_key
retcode, retmsg = pvc.lib.vm.vm_send_snapshot(
CLI_CONFIG,
domain,
snapshot_name,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=CLI_CONFIG.get("verify_ssl"),
destination_storage_pool=destination_storage_pool,
incremental_parent=incremental_parent,
wait_flag=wait_flag,
)
if retcode and wait_flag:
retmsg = wait_for_celery_task(CLI_CONFIG, retmsg)
finish(retcode, retmsg)
###############################################################################
# > pvc vm mirror
###############################################################################
@click.group(
name="mirror",
short_help="Manage snapshot mirrors for PVC VMs.",
context_settings=CONTEXT_SETTINGS,
)
def cli_vm_mirror():
"""
Manage snapshot mirrors of VMs in a PVC cluster.
"""
pass
###############################################################################
# > pvc vm mirror create
###############################################################################
@click.command(
name="create",
short_help="Create a snapshot mirror of a virtual machine to another PVC cluster.",
)
@connection_req
@click.argument("domain")
@click.argument("destination")
@click.option(
"-k",
"--destination-api-key",
"destination_api_key",
default=None,
help="The API key of the destination cluster when specifying an API URI.",
)
@click.option(
"-p",
"--destination-pool",
"destination_storage_pool",
default=None,
help="The target storage pool on the destination cluster, if it differs from the source pool.",
)
@click.option(
"--wait/--no-wait",
"wait_flag",
is_flag=True,
default=True,
show_default=True,
help="Wait or don't wait for task to complete, showing progress if waiting",
)
def cli_vm_mirror_create(
domain,
destination,
destination_api_key,
destination_storage_pool,
wait_flag,
):
"""
For the virtual machine DOMAIN: create a new snapshot (dated), and send snapshot to the remote PVC cluster DESTINATION; creates a cross-cluster snapshot mirror of the VM.
DOMAIN may be a UUID or name. DESTINATION may be either a configured PVC connection name in this CLI instance (i.e. a valid argument to "--connection"), or a full API URI, including the scheme, port and API prefix; if using the latter, an API key can be specified with the "-k"/"--destination-api-key" option.
The send will include the VM configuration, metainfo, and a point-in-time snapshot of all attached RBD volumes.
This command may be used repeatedly to send new updates for a remote VM mirror. If a valid shared snapshot is found on the destination cluster, block device transfers will be incremental based on that snapshot.
By default, the storage pool of the sending cluster will be used at the destination cluster as well. If a pool of that name does not exist, specify one with the "-p"/"--destination-pool" option.
WARNING: Once sent, the VM will be in the state "mirror" on the destination cluster. If it is subsequently started, for instance for disaster recovery, a new snapshot must be taken on the destination cluster and sent back or data will be inconsistent between the instances. Only VMs in the "mirror" state can accept new sends. Consider using "mirror promote" instead of any manual promotion attempts.
WARNING: This functionality has no automatic backout on the remote side. While a properly configured cluster should not fail any step in the process, a situation like an intermittent network connection might cause a failure which would have to be manually corrected on that side, usually by removing the mirrored VM and retrying, or rolling back to a previous snapshot and retrying. Future versions may enhance automatic recovery, but for now this would be up to the administrator.
"""
connections_config = get_store(CLI_CONFIG["store_path"])
if destination in connections_config.keys():
destination_cluster_config = connections_config[destination]
destination_api_uri = "{}://{}:{}{}".format(
destination_cluster_config["scheme"],
destination_cluster_config["host"],
destination_cluster_config["port"],
CLI_CONFIG["api_prefix"],
)
destination_api_key = destination_cluster_config["api_key"]
else:
if "http" not in destination:
finish(
False, "ERROR: A valid destination cluster or URI must be specified!"
)
destination_api_uri = destination
destination_api_key = destination_api_key
retcode, retmsg = pvc.lib.vm.vm_create_mirror(
CLI_CONFIG,
domain,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=CLI_CONFIG.get("verify_ssl"),
destination_storage_pool=destination_storage_pool,
wait_flag=wait_flag,
)
if retcode and wait_flag:
retmsg = wait_for_celery_task(CLI_CONFIG, retmsg)
finish(retcode, retmsg)
###############################################################################
# > pvc vm mirror promote
###############################################################################
@click.command(
name="promote",
short_help="Shut down, create a snapshot mirror, and promote a virtual machine to another PVC cluster.",
)
@connection_req
@click.argument("domain")
@click.argument("destination")
@click.option(
"-k",
"--destination-api-key",
"destination_api_key",
default=None,
help="The API key of the destination cluster when specifying an API URI.",
)
@click.option(
"-p",
"--destination-pool",
"destination_storage_pool",
default=None,
help="The target storage pool on the destination cluster, if it differs from the source pool.",
)
@click.option(
"--remove/--no-remove",
"remove_flag",
is_flag=True,
default=False,
show_default=True,
help="Remove or don't remove the local VM after promoting (if set, performs a cross-cluster move).",
)
@click.option(
"--wait/--no-wait",
"wait_flag",
is_flag=True,
default=True,
show_default=True,
help="Wait or don't wait for task to complete, showing progress if waiting",
)
@confirm_opt("Promote VM {domain} on cluster {destination} (will shut down VM)")
def cli_vm_mirror_promote(
domain,
destination,
destination_api_key,
destination_storage_pool,
remove_flag,
wait_flag,
):
"""
For the virtual machine DOMAIN: shut down on this cluster, create a new snapshot (dated), send snapshot to the remote PVC cluster DESTINATION, start on DESTINATION, and optionally remove from this cluster; performs a cross-cluster move of the VM, with or without retaining the source as a snapshot mirror.
DOMAIN may be a UUID or name. DESTINATION may be either a configured PVC connection name in this CLI instance (i.e. a valid argument to "--connection"), or a full API URI, including the scheme, port and API prefix; if using the latter, an API key can be specified with the "-k"/"--destination-api-key" option.
The send will include the VM configuration, metainfo, and a point-in-time snapshot of all attached RBD volumes.
If a valid shared snapshot is found on the destination cluster, block device transfers will be incremental based on that snapshot.
By default, the storage pool of the sending cluster will be used at the destination cluster as well. If a pool of that name does not exist, specify one with the "-p"/"--destination-pool" option.
WARNING: Once promoted, if the "--remove" flag is not set, the VM will be in the state "mirror" on this cluster. This effectively flips which cluster is the "primary" for this VM, and subsequent mirror management commands must be run against the destination cluster instead of this cluster. If the "--remove" flag is set, the VM will be removed from this cluster entirely once successfully started on the destination cluster.
WARNING: This functionality has no automatic backout on the remote side. While a properly configured cluster should not fail any step in the process, a situation like an intermittent network connection might cause a failure which would have to be manually corrected on that side, usually by removing the mirrored VM and retrying, or rolling back to a previous snapshot and retrying. Future versions may enhance automatic recovery, but for now this would be up to the administrator.
"""
connections_config = get_store(CLI_CONFIG["store_path"])
if destination in connections_config.keys():
destination_cluster_config = connections_config[destination]
destination_api_uri = "{}://{}:{}{}".format(
destination_cluster_config["scheme"],
destination_cluster_config["host"],
destination_cluster_config["port"],
CLI_CONFIG["api_prefix"],
)
destination_api_key = destination_cluster_config["api_key"]
else:
if "http" not in destination:
finish(
False, "ERROR: A valid destination cluster or URI must be specified!"
)
destination_api_uri = destination
destination_api_key = destination_api_key
retcode, retmsg = pvc.lib.vm.vm_promote_mirror(
CLI_CONFIG,
domain,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=CLI_CONFIG.get("verify_ssl"),
destination_storage_pool=destination_storage_pool,
remove_on_source=remove_flag,
wait_flag=wait_flag,
)
if retcode and wait_flag:
retmsg = wait_for_celery_task(CLI_CONFIG, retmsg)
finish(retcode, retmsg)
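# A minimal illustrative sketch (not part of this changeset) of the
# destination-resolution logic above: a destination matching a stored
# connection name is expanded into a full API URI, while anything containing
# "http" is passed through as a URI. All names and values here are
# hypothetical.
def resolve_destination(destination, connections, api_prefix="/api/v1"):
    # Known connection name: build the URI from the stored connection details
    if destination in connections:
        c = connections[destination]
        return f"{c['scheme']}://{c['host']}:{c['port']}{api_prefix}", c["api_key"]
    # Otherwise require something that at least looks like a URI
    if "http" not in destination:
        raise ValueError("A valid destination cluster or URI must be specified")
    return destination, None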
###############################################################################
# > pvc vm backup
###############################################################################
@@ -3755,6 +4075,13 @@ def cli_storage_benchmark():
@click.command(name="run", short_help="Run a storage benchmark.")
@connection_req
@click.argument("pool")
@click.option(
"--name",
"name",
default=None,
show_default=False,
help="Use a custom name for the job",
)
@click.option(
"--wait/--no-wait",
"wait_flag",
@@ -3766,12 +4093,14 @@ def cli_storage_benchmark():
@confirm_opt(
"Storage benchmarks take approximately 10 minutes to run and generate significant load on the cluster; they should be run sparingly. Continue"
)
def cli_storage_benchmark_run(pool, name, wait_flag):
"""
Run a storage benchmark on POOL in the background.
"""
retcode, retmsg = pvc.lib.storage.ceph_benchmark_run(
CLI_CONFIG, pool, name, wait_flag
)
if retcode and wait_flag:
retmsg = wait_for_celery_task(CLI_CONFIG, retmsg)
@@ -6579,7 +6908,11 @@ cli_vm_snapshot.add_command(cli_vm_snapshot_remove)
cli_vm_snapshot.add_command(cli_vm_snapshot_rollback)
cli_vm_snapshot.add_command(cli_vm_snapshot_export)
cli_vm_snapshot.add_command(cli_vm_snapshot_import)
cli_vm_snapshot.add_command(cli_vm_snapshot_send)
cli_vm.add_command(cli_vm_snapshot)
cli_vm_mirror.add_command(cli_vm_mirror_create)
cli_vm_mirror.add_command(cli_vm_mirror_promote)
cli_vm.add_command(cli_vm_mirror)
cli_vm_backup.add_command(cli_vm_backup_create)
cli_vm_backup.add_command(cli_vm_backup_restore)
cli_vm_backup.add_command(cli_vm_backup_remove)

View File

@@ -83,6 +83,37 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
total_volumes = data.get("volumes", 0)
total_snapshots = data.get("snapshots", 0)
total_cpu_total = data.get("resources", {}).get("cpu", {}).get("total", 0)
total_cpu_load = data.get("resources", {}).get("cpu", {}).get("load", 0)
total_cpu_utilization = (
data.get("resources", {}).get("cpu", {}).get("utilization", 0)
)
total_cpu_string = (
f"{total_cpu_utilization:.1f}% ({total_cpu_load:.1f} / {total_cpu_total})"
)
total_memory_total = (
data.get("resources", {}).get("memory", {}).get("total", 0) / 1024
)
total_memory_used = (
data.get("resources", {}).get("memory", {}).get("used", 0) / 1024
)
total_memory_utilization = (
data.get("resources", {}).get("memory", {}).get("utilization", 0)
)
total_memory_string = f"{total_memory_utilization:.1f}% ({total_memory_used:.1f} GB / {total_memory_total:.1f} GB)"
total_disk_total = (
data.get("resources", {}).get("disk", {}).get("total", 0) / 1024 / 1024
)
total_disk_used = (
data.get("resources", {}).get("disk", {}).get("used", 0) / 1024 / 1024
)
total_disk_utilization = round(
data.get("resources", {}).get("disk", {}).get("utilization", 0)
)
total_disk_string = f"{total_disk_utilization:.1f}% ({total_disk_used:.1f} GB / {total_disk_total:.1f} GB)"
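# The divisors above are an assumption worth noting: memory totals appear to
# arrive in MiB (one division by 1024 yields GB) while OSD space appears to
# arrive in KiB (two divisions by 1024 yield GB). Worked example:
# 262144 MiB / 1024 == 256.0 GB ; 1073741824 KiB / 1024 / 1024 == 1024.0 GB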
if maintenance == "true" or health == -1:
health_colour = ansii["blue"]
elif health > 90:
@@ -94,12 +125,9 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
output = list()
output.append(f"{ansii['purple']}Primary node:{ansii['end']} {primary_node}")
output.append(f"{ansii['purple']}PVC version:{ansii['end']} {pvc_version}")
output.append(f"{ansii['purple']}Upstream IP:{ansii['end']} {upstream_ip}")
output.append("")
if health != "-1":
@@ -111,7 +139,7 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
health = f"{health} (maintenance on)"
output.append(
f"{ansii['purple']}Health:{ansii['end']} {health_colour}{health}{ansii['end']}"
)
if messages is not None and len(messages) > 0:
@@ -135,8 +163,18 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
)
)
messages = "\n ".join(message_list)
else:
messages = "None"
output.append(f"{ansii['purple']}Active faults:{ansii['end']} {messages}")
output.append(f"{ansii['purple']}Total CPU:{ansii['end']} {total_cpu_string}")
output.append(
f"{ansii['purple']}Total memory:{ansii['end']} {total_memory_string}"
)
output.append(f"{ansii['purple']}Total disk:{ansii['end']} {total_disk_string}")
output.append("") output.append("")
@ -166,14 +204,14 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
nodes_string = ", ".join(nodes_strings) nodes_string = ", ".join(nodes_strings)
output.append(f"{ansii['purple']}Nodes:{ansii['end']} {nodes_string}") output.append(f"{ansii['purple']}Nodes:{ansii['end']} {nodes_string}")
vm_states = ["start", "disable"] vm_states = ["start", "disable", "mirror"]
vm_states.extend( vm_states.extend(
[ [
state state
for state in data.get("vms", {}).keys() for state in data.get("vms", {}).keys()
if state not in ["total", "start", "disable"] if state not in ["total", "start", "disable", "mirror"]
] ]
) )
@@ -183,8 +221,10 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
continue
if state in ["start"]:
state_colour = ansii["green"]
elif state in ["migrate", "disable", "provision", "mirror"]:
state_colour = ansii["blue"]
elif state in ["mirror"]:
state_colour = ansii["purple"]
elif state in ["stop", "fail"]:
state_colour = ansii["red"]
else:
@@ -196,7 +236,7 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
vms_string = ", ".join(vms_strings)
output.append(f"{ansii['purple']}VMs:{ansii['end']} {vms_string}")
osd_states = ["up,in"]
osd_states.extend(
@@ -222,15 +262,15 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
osds_string = " ".join(osds_strings)
output.append(f"{ansii['purple']}OSDs:{ansii['end']} {osds_string}")
output.append(f"{ansii['purple']}Pools:{ansii['end']} {total_pools}")
output.append(f"{ansii['purple']}Volumes:{ansii['end']} {total_volumes}")
output.append(f"{ansii['purple']}Snapshots:{ansii['end']} {total_snapshots}")
output.append(f"{ansii['purple']}Networks:{ansii['end']} {total_networks}")
output.append("")
@@ -258,9 +298,6 @@ def cli_cluster_status_format_short(CLI_CONFIG, data):
output = list()
if health != "-1":
health = f"{health}%"
else:
@@ -270,7 +307,7 @@ def cli_cluster_status_format_short(CLI_CONFIG, data):
health = f"{health} (maintenance on)"
output.append(
f"{ansii['purple']}Health:{ansii['end']} {health_colour}{health}{ansii['end']}"
)
if messages is not None and len(messages) > 0:
@@ -295,7 +332,48 @@ def cli_cluster_status_format_short(CLI_CONFIG, data):
)
messages = "\n ".join(message_list)
else:
messages = "None"
output.append(f"{ansii['purple']}Active faults:{ansii['end']} {messages}")
total_cpu_total = data.get("resources", {}).get("cpu", {}).get("total", 0)
total_cpu_load = data.get("resources", {}).get("cpu", {}).get("load", 0)
total_cpu_utilization = (
data.get("resources", {}).get("cpu", {}).get("utilization", 0)
)
total_cpu_string = (
f"{total_cpu_utilization:.1f}% ({total_cpu_load:.1f} / {total_cpu_total})"
)
total_memory_total = (
data.get("resources", {}).get("memory", {}).get("total", 0) / 1024
)
total_memory_used = (
data.get("resources", {}).get("memory", {}).get("used", 0) / 1024
)
total_memory_utilization = (
data.get("resources", {}).get("memory", {}).get("utilization", 0)
)
total_memory_string = f"{total_memory_utilization:.1f}% ({total_memory_used:.1f} GB / {total_memory_total:.1f} GB)"
total_disk_total = (
data.get("resources", {}).get("disk", {}).get("total", 0) / 1024 / 1024
)
total_disk_used = (
data.get("resources", {}).get("disk", {}).get("used", 0) / 1024 / 1024
)
total_disk_utilization = round(
data.get("resources", {}).get("disk", {}).get("utilization", 0)
)
total_disk_string = f"{total_disk_utilization:.1f}% ({total_disk_used:.1f} GB / {total_disk_total:.1f} GB)"
output.append(f"{ansii['purple']}CPU usage:{ansii['end']} {total_cpu_string}")
output.append(
f"{ansii['purple']}Memory usage:{ansii['end']} {total_memory_string}"
)
output.append(f"{ansii['purple']}Disk usage:{ansii['end']} {total_disk_string}")
output.append("") output.append("")
@ -827,7 +905,7 @@ def cli_connection_list_format_pretty(CLI_CONFIG, data):
# Parse each connection and adjust field lengths # Parse each connection and adjust field lengths
for connection in data: for connection in data:
for field, length in [(f, fields[f]["length"]) for f in fields]: for field, length in [(f, fields[f]["length"]) for f in fields]:
_length = len(str(connection[field])) _length = len(str(connection[field])) + 1
if _length > length: if _length > length:
length = len(str(connection[field])) + 1 length = len(str(connection[field])) + 1
@ -927,7 +1005,7 @@ def cli_connection_detail_format_pretty(CLI_CONFIG, data):
# Parse each connection and adjust field lengths # Parse each connection and adjust field lengths
for connection in data: for connection in data:
for field, length in [(f, fields[f]["length"]) for f in fields]: for field, length in [(f, fields[f]["length"]) for f in fields]:
_length = len(str(connection[field])) _length = len(str(connection[field])) + 1
if _length > length: if _length > length:
length = len(str(connection[field])) + 1 length = len(str(connection[field])) + 1

View File

@@ -167,9 +167,17 @@ def get_store(store_path):
with open(store_file) as fh:
try:
store_data = jload(fh)
except Exception:
store_data = dict()
if path.exists(DEFAULT_STORE_DATA["cfgfile"]):
if store_data.get("local", None) != DEFAULT_STORE_DATA:
del store_data["local"]
if "local" not in store_data.keys():
store_data["local"] = DEFAULT_STORE_DATA
update_store(store_path, store_data)
return store_data
def update_store(store_path, store_data):

View File

@@ -68,7 +68,8 @@ def cli_connection_list_parser(connections_config, show_keys_flag):
}
)
# Return, ensuring local is always first
return sorted(connections_data, key=lambda x: (x.get("name") != "local"))
def cli_connection_detail_parser(connections_config):
@@ -121,4 +122,5 @@ def cli_connection_detail_parser(connections_config):
}
)
# Return, ensuring local is always first
return sorted(connections_data, key=lambda x: (x.get("name") != "local"))
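# The sort key relies on Python's False < True ordering: the "local" entry
# yields False and sorts first, while sorted()'s stability preserves the
# relative order of the rest. A standalone illustration with made-up data:
# sorted([{"name": "prod"}, {"name": "local"}], key=lambda x: (x.get("name") != "local"))
# -> [{"name": "local"}, {"name": "prod"}]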

View File

@@ -19,6 +19,8 @@
#
###############################################################################
import sys
from click import progressbar
from time import sleep, time
@@ -105,7 +107,7 @@ def wait_for_celery_task(CLI_CONFIG, task_detail, start_late=False):
# Start following the task state, updating progress as we go
total_task = task_status.get("total")
with progressbar(length=total_task, width=20, show_eta=False) as bar:
last_task = 0
maxlen = 21
echo(
@@ -115,30 +117,39 @@ def wait_for_celery_task(CLI_CONFIG, task_detail, start_late=False):
)
while True:
sleep(0.5)
task_status = pvc.lib.common.task_status(
CLI_CONFIG, task_id=task_id, is_watching=True
)
if isinstance(task_status, tuple):
continue
if task_status.get("state") != "RUNNING":
break
if task_status.get("current") == 0:
continue
current_task = int(task_status.get("current"))
total_task = int(task_status.get("total"))
bar.length = total_task
if current_task > last_task:
bar.update(current_task - last_task)
last_task = current_task
# The extensive spaces at the end cause this to overwrite longer previous messages
curlen = len(str(task_status.get("status")))
if curlen > maxlen:
maxlen = curlen
lendiff = maxlen - curlen
overwrite_whitespace = " " * lendiff
percent_complete = (current_task / total_task) * 100
bar_output = f"[{bar.format_bar()}] {percent_complete:3.0f}%"
sys.stdout.write(
f"\r {bar_output} {task_status['status']}{overwrite_whitespace}"
)
sys.stdout.flush()
if task_status.get("state") == "SUCCESS":
bar.update(total_task - last_task)
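# The redraw above combines click's bar with a manual "\r" status line; the
# trailing whitespace must cover the longest message printed so far, or
# residue from a longer previous message is left on screen. A self-contained
# sketch of the same technique (the messages are illustrative):
import sys
from time import sleep

maxlen = 0
for status in ["starting", "copying volumes", "done"]:
    maxlen = max(maxlen, len(status))
    # Pad out to the longest message seen so far to fully overwrite it
    sys.stdout.write("\r " + status + " " * (maxlen - len(status)))
    sys.stdout.flush()
    sleep(0.1)
sys.stdout.write("\n")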

View File

@@ -83,7 +83,7 @@ class UploadProgressBar(object):
else:
self.end_suffix = ""
self.bar = click.progressbar(length=self.length, width=20, show_eta=True)
def update(self, monitor):
bytes_cur = monitor.bytes_read

View File

@@ -30,6 +30,7 @@ from requests_toolbelt.multipart.encoder import (
import pvc.lib.ansiprint as ansiprint
from pvc.lib.common import UploadProgressBar, call_api, get_wait_retdata
from pvc.cli.helpers import MAX_CONTENT_WIDTH
#
# Supplemental functions
@@ -1724,15 +1725,17 @@ def format_list_snapshot(config, snapshot_list):
#
# Benchmark functions
#
def ceph_benchmark_run(config, pool, name, wait_flag):
"""
Run a storage benchmark against {pool}
API endpoint: POST /api/v1/storage/ceph/benchmark
API arguments: pool={pool}, name={name}
API schema: {message}
"""
params = {"pool": pool}
if name:
params["name"] = name
response = call_api(config, "post", "/storage/ceph/benchmark", params=params)
return get_wait_retdata(response, wait_flag)
@@ -1804,7 +1807,7 @@ def get_benchmark_list_results(benchmark_format, benchmark_data):
benchmark_bandwidth, benchmark_iops = get_benchmark_list_results_legacy(
benchmark_data
)
elif benchmark_format == 1 or benchmark_format == 2:
benchmark_bandwidth, benchmark_iops = get_benchmark_list_results_json(
benchmark_data
)
@@ -2006,6 +2009,7 @@ def format_info_benchmark(config, benchmark_information):
benchmark_matrix = {
0: format_info_benchmark_legacy,
1: format_info_benchmark_json,
2: format_info_benchmark_json,
}
benchmark_version = benchmark_information[0]["test_format"]
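# Dispatching on test_format through a dict means old result formats remain
# renderable forever, and format 2 can reuse the format-1 renderer since it
# only adds fields. A generic sketch of the pattern (names hypothetical):
# handlers = {1: render_v1, 2: render_v1}  # v2 is a superset of v1
# output = handlers[record["test_format"]](record)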
@@ -2340,12 +2344,15 @@ def format_info_benchmark_json(config, benchmark_information):
if benchmark_information["benchmark_result"] == "Running":
return "Benchmark test is still running."
benchmark_format = benchmark_information["test_format"]
benchmark_details = benchmark_information["benchmark_result"]
# Format a nice output; do this line-by-line then concat the elements at the end
ainformation = []
ainformation.append(
"{}Storage Benchmark details (format {}):{}".format(
ansiprint.bold(), benchmark_format, ansiprint.end()
)
)
nice_test_name_map = {
@@ -2393,7 +2400,7 @@ def format_info_benchmark_json(config, benchmark_information):
if element[1] != 0:
useful_latency_tree.append(element)
max_rows = 5
if len(useful_latency_tree) > 9:
max_rows = len(useful_latency_tree)
elif len(useful_latency_tree) < 9:
@@ -2402,15 +2409,10 @@ def format_info_benchmark_json(config, benchmark_information):
# Format the static data
overall_label = [
"BW/s:",
"IOPS:",
"I/O:",
"Time:",
]
while len(overall_label) < max_rows:
overall_label.append("")
@@ -2419,68 +2421,149 @@ def format_info_benchmark_json(config, benchmark_information):
format_bytes_tohuman(int(job_details[io_class]["bw_bytes"])),
format_ops_tohuman(int(job_details[io_class]["iops"])),
format_bytes_tohuman(int(job_details[io_class]["io_bytes"])),
str(job_details["job_runtime"] / 1000) + "s",
]
while len(overall_data) < max_rows:
overall_data.append("")
cpu_label = [
"Total:",
"User:",
"Sys:",
"OSD:",
"MON:",
]
while len(cpu_label) < max_rows:
cpu_label.append("")
cpu_data = [
(
benchmark_details[test]["avg_cpu_util_percent"]["total"]
if benchmark_format > 1
else "N/A"
),
round(job_details["usr_cpu"], 2),
round(job_details["sys_cpu"], 2),
(
benchmark_details[test]["avg_cpu_util_percent"]["ceph-osd"]
if benchmark_format > 1
else "N/A"
),
(
benchmark_details[test]["avg_cpu_util_percent"]["ceph-mon"]
if benchmark_format > 1
else "N/A"
),
]
while len(cpu_data) < max_rows:
cpu_data.append("")
memory_label = [
"Total:",
"OSD:",
"MON:",
]
while len(memory_label) < max_rows:
memory_label.append("")
memory_data = [
(
benchmark_details[test]["avg_memory_util_percent"]["total"]
if benchmark_format > 1
else "N/A"
),
(
benchmark_details[test]["avg_memory_util_percent"]["ceph-osd"]
if benchmark_format > 1
else "N/A"
),
(
benchmark_details[test]["avg_memory_util_percent"]["ceph-mon"]
if benchmark_format > 1
else "N/A"
),
]
while len(memory_data) < max_rows:
memory_data.append("")
network_label = [
"Total:",
"Sent:",
"Recv:",
]
while len(network_label) < max_rows:
network_label.append("")
network_data = [
(
format_bytes_tohuman(
int(benchmark_details[test]["avg_network_util_bps"]["total"])
)
if benchmark_format > 1
else "N/A"
),
(
format_bytes_tohuman(
int(benchmark_details[test]["avg_network_util_bps"]["sent"])
)
if benchmark_format > 1
else "N/A"
),
(
format_bytes_tohuman(
int(benchmark_details[test]["avg_network_util_bps"]["recv"])
)
if benchmark_format > 1
else "N/A"
),
]
while len(network_data) < max_rows:
network_data.append("")
bandwidth_label = [
"Min:",
"Max:",
"Mean:",
"StdDev:",
"Samples:",
]
while len(bandwidth_label) < max_rows:
bandwidth_label.append("")
bandwidth_data = [
format_bytes_tohuman(int(job_details[io_class]["bw_min"]) * 1024)
+ " / "
+ format_ops_tohuman(int(job_details[io_class]["iops_min"])),
format_bytes_tohuman(int(job_details[io_class]["bw_max"]) * 1024)
+ " / "
+ format_ops_tohuman(int(job_details[io_class]["iops_max"])),
format_bytes_tohuman(int(job_details[io_class]["bw_mean"]) * 1024)
+ " / "
+ format_ops_tohuman(int(job_details[io_class]["iops_mean"])),
format_bytes_tohuman(int(job_details[io_class]["bw_dev"]) * 1024)
+ " / "
+ format_ops_tohuman(int(job_details[io_class]["iops_stddev"])),
str(job_details[io_class]["bw_samples"])
+ " / "
+ str(job_details[io_class]["iops_samples"]),
]
while len(bandwidth_data) < max_rows:
bandwidth_data.append("")
lat_label = [
"Min:",
"Max:",
"Mean:",
"StdDev:",
]
while len(lat_label) < max_rows:
lat_label.append("")
lat_data = [
int(job_details[io_class]["lat_ns"]["min"]) / 1000,
int(job_details[io_class]["lat_ns"]["max"]) / 1000,
int(job_details[io_class]["lat_ns"]["mean"]) / 1000,
int(job_details[io_class]["lat_ns"]["stddev"]) / 1000,
]
while len(lat_data) < max_rows:
lat_data.append("")
@@ -2489,98 +2572,119 @@ def format_info_benchmark_json(config, benchmark_information):
lat_bucket_label = list()
lat_bucket_data = list()
for element in useful_latency_tree:
lat_bucket_label.append(element[0] + ":" if element[0] else "")
lat_bucket_data.append(round(float(element[1]), 2) if element[1] else "")
while len(lat_bucket_label) < max_rows:
lat_bucket_label.append("")
while len(lat_bucket_data) < max_rows:
lat_bucket_data.append("")
# Column default widths
overall_label_length = 5
overall_column_length = 0
cpu_label_length = 6
cpu_column_length = 0
memory_label_length = 6
memory_column_length = 0
network_label_length = 6
network_column_length = 6
bandwidth_label_length = 8
bandwidth_column_length = 0
latency_label_length = 7
latency_column_length = 0
latency_bucket_label_length = 0
latency_bucket_column_length = 0
# Column layout:
# Overall CPU Memory Network Bandwidth/IOPS Latency Percentiles
# --------- ----- ------- -------- -------------- -------- ---------------
# BW Total Total Total Min Min A
# IOPS Usr OSD Send Max Max B
# Time Sys MON Recv Mean Mean ...
# Size OSD StdDev StdDev Z
# MON Samples
# Set column widths
for item in overall_data:
_item_length = len(str(item))
if _item_length > overall_column_length:
overall_column_length = _item_length
for item in cpu_data:
_item_length = len(str(item))
if _item_length > cpu_column_length:
cpu_column_length = _item_length
for item in memory_data:
_item_length = len(str(item))
if _item_length > memory_column_length:
memory_column_length = _item_length
for item in network_data:
_item_length = len(str(item))
if _item_length > network_column_length:
network_column_length = _item_length
for item in bandwidth_data:
_item_length = len(str(item))
if _item_length > bandwidth_column_length:
bandwidth_column_length = _item_length
for item in lat_data:
_item_length = len(str(item))
if _item_length > latency_column_length:
latency_column_length = _item_length
for item in lat_bucket_data:
_item_length = len(str(item))
if _item_length > latency_bucket_column_length:
latency_bucket_column_length = _item_length
# Top row (Headers)
ainformation.append(
"{bold}{overall_label: <{overall_label_length}} {header_fill}{end_bold}".format(
bold=ansiprint.bold(),
end_bold=ansiprint.end(),
overall_label=nice_test_name_map[test],
overall_label_length=overall_label_length,
header_fill="-"
* (
(MAX_CONTENT_WIDTH if MAX_CONTENT_WIDTH <= 120 else 120)
- len(nice_test_name_map[test])
- 4
),
)
)
ainformation.append(
"{bold}\
{overall_label: <{overall_label_length}} \
{cpu_label: <{cpu_label_length}} \
{memory_label: <{memory_label_length}} \
{network_label: <{network_label_length}} \
{bandwidth_label: <{bandwidth_label_length}} \
{latency_label: <{latency_label_length}} \
{latency_bucket_label: <{latency_bucket_label_length}}\
{end_bold}".format(
bold=ansiprint.bold(),
end_bold=ansiprint.end(),
overall_label="Overall",
overall_label_length=overall_label_length + overall_column_length + 1,
cpu_label="CPU (%)",
cpu_label_length=cpu_label_length + cpu_column_length + 1,
memory_label="Memory (%)",
memory_label_length=memory_label_length + memory_column_length + 1,
network_label="Network (bps)",
network_label_length=network_label_length + network_column_length + 1,
bandwidth_label="Bandwidth / IOPS",
bandwidth_label_length=bandwidth_label_length
+ bandwidth_column_length
+ 1,
latency_label="Latency (μs)",
latency_label_length=latency_label_length + latency_column_length + 1,
latency_bucket_label="Buckets (μs/%)",
latency_bucket_label_length=latency_bucket_label_length
+ latency_bucket_column_length,
)
)
@@ -2588,14 +2692,20 @@ def format_info_benchmark_json(config, benchmark_information):
# Top row (Headers)
ainformation.append(
"{bold}\
{overall_label: <{overall_label_length}} \
{overall: <{overall_length}} \
{cpu_label: <{cpu_label_length}} \
{cpu: <{cpu_length}} \
{memory_label: <{memory_label_length}} \
{memory: <{memory_length}} \
{network_label: <{network_label_length}} \
{network: <{network_length}} \
{bandwidth_label: <{bandwidth_label_length}} \
{bandwidth: <{bandwidth_length}} \
{latency_label: <{latency_label_length}} \
{latency: <{latency_length}} \
{latency_bucket_label: <{latency_bucket_label_length}} \
{latency_bucket}\
{end_bold}".format(
bold="",
end_bold="",
@@ -2603,12 +2713,24 @@ def format_info_benchmark_json(config, benchmark_information):
overall_label_length=overall_label_length,
overall=overall_data[idx],
overall_length=overall_column_length,
cpu_label=cpu_label[idx],
cpu_label_length=cpu_label_length,
cpu=cpu_data[idx],
cpu_length=cpu_column_length,
memory_label=memory_label[idx],
memory_label_length=memory_label_length,
memory=memory_data[idx],
memory_length=memory_column_length,
network_label=network_label[idx],
network_label_length=network_label_length,
network=network_data[idx],
network_length=network_column_length,
bandwidth_label=bandwidth_label[idx],
bandwidth_label_length=bandwidth_label_length,
bandwidth=bandwidth_data[idx],
bandwidth_length=bandwidth_column_length,
latency_label=lat_label[idx],
latency_label_length=latency_label_length,
latency=lat_data[idx],
latency_length=latency_column_length,
latency_bucket_label=lat_bucket_label[idx],
@@ -2617,4 +2739,4 @@ def format_info_benchmark_json(config, benchmark_information):
)
)
return "\n".join(ainformation) + "\n"

View File

@@ -383,8 +383,8 @@ def vm_state(config, vm, target_state, force=False, wait=False):
"""
params = {
"state": target_state,
"force": force,
"wait": wait,
}
response = call_api(config, "post", "/vm/{vm}/state".format(vm=vm), params=params)
@@ -595,6 +595,107 @@ def vm_import_snapshot(
return get_wait_retdata(response, wait_flag)
def vm_send_snapshot(
config,
vm,
snapshot_name,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=True,
destination_storage_pool=None,
incremental_parent=None,
wait_flag=True,
):
"""
Send an (existing) snapshot of a VM's disks and configuration to a destination PVC cluster, optionally
incremental with incremental_parent
API endpoint: POST /vm/{vm}/snapshot/send
API arguments: snapshot_name=snapshot_name, destination_api_uri=destination_api_uri, destination_api_key=destination_api_key, destination_api_verify_ssl=destination_api_verify_ssl, incremental_parent=incremental_parent, destination_storage_pool=destination_storage_pool
API schema: {"message":"{data}"}
"""
params = {
"snapshot_name": snapshot_name,
"destination_api_uri": destination_api_uri,
"destination_api_key": destination_api_key,
"destination_api_verify_ssl": destination_api_verify_ssl,
}
if destination_storage_pool is not None:
params["destination_storage_pool"] = destination_storage_pool
if incremental_parent is not None:
params["incremental_parent"] = incremental_parent
response = call_api(
config, "post", "/vm/{vm}/snapshot/send".format(vm=vm), params=params
)
return get_wait_retdata(response, wait_flag)
def vm_create_mirror(
config,
vm,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=True,
destination_storage_pool=None,
wait_flag=True,
):
"""
Create a new snapshot and send the snapshot to a destination PVC cluster, with automatic incremental handling
API endpoint: POST /vm/{vm}/mirror/create
API arguments: destination_api_uri=destination_api_uri, destination_api_key=destination_api_key, destination_api_verify_ssl=destination_api_verify_ssl, destination_storage_pool=destination_storage_pool
API schema: {"message":"{data}"}
"""
params = {
"destination_api_uri": destination_api_uri,
"destination_api_key": destination_api_key,
"destination_api_verify_ssl": destination_api_verify_ssl,
}
if destination_storage_pool is not None:
params["destination_storage_pool"] = destination_storage_pool
response = call_api(
config, "post", "/vm/{vm}/mirror/create".format(vm=vm), params=params
)
return get_wait_retdata(response, wait_flag)
def vm_promote_mirror(
config,
vm,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=True,
destination_storage_pool=None,
remove_on_source=False,
wait_flag=True,
):
"""
Shut down a VM, create a new snapshot, send the snapshot to a destination PVC cluster, start the VM on the remote cluster, and optionally remove the local VM, with automatic incremental handling
API endpoint: POST /vm/{vm}/mirror/promote
API arguments: destination_api_uri=destination_api_uri, destination_api_key=destination_api_key, destination_api_verify_ssl=destination_api_verify_ssl, destination_storage_pool=destination_storage_pool, remove_on_source=remove_on_source
API schema: {"message":"{data}"}
"""
params = {
"destination_api_uri": destination_api_uri,
"destination_api_key": destination_api_key,
"destination_api_verify_ssl": destination_api_verify_ssl,
"remove_on_source": remove_on_source,
}
if destination_storage_pool is not None:
params["destination_storage_pool"] = destination_storage_pool
response = call_api(
config, "post", "/vm/{vm}/mirror/promote".format(vm=vm), params=params
)
return get_wait_retdata(response, wait_flag)
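# A hypothetical invocation of the helper above; the connection details and
# VM name are assumptions for illustration only.
# retcode, retdata = vm_promote_mirror(
#     config,
#     "webvm",
#     "https://cluster2.example.com:7370/api/v1",
#     "0123456789abcdef",
#     destination_storage_pool="vms",
#     remove_on_source=True,
#     wait_flag=True,
# )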
def vm_autobackup(config, email_recipients=None, force_full_flag=False, wait_flag=True):
"""
Perform a cluster VM autobackup
@@ -1760,6 +1861,7 @@ def format_info(config, domain_information, long_output):
"provision": ansiprint.blue(),
"restore": ansiprint.blue(),
"import": ansiprint.blue(),
"mirror": ansiprint.purple(),
}
ainformation.append(
"{}State:{} {}{}{}".format(
@@ -2269,16 +2371,14 @@ def format_list(config, vm_list):
# Format the string (elements)
for domain_information in sorted(vm_list, key=lambda v: v["name"]):
if domain_information["state"] in ["start"]:
vm_state_colour = ansiprint.green()
elif domain_information["state"] in ["restart", "shutdown"]:
vm_state_colour = ansiprint.yellow()
elif domain_information["state"] in ["stop", "fail"]:
vm_state_colour = ansiprint.red()
elif domain_information["state"] in ["mirror"]:
vm_state_colour = ansiprint.purple()
else:
vm_state_colour = ansiprint.blue()
@@ -2302,8 +2402,10 @@ def format_list(config, vm_list):
else:
net_invalid_list.append(False)
display_net_string_list = []
net_string_list = []
for net_idx, net_vni in enumerate(net_list):
display_net_string_list.append(net_vni)
if net_invalid_list[net_idx]:
net_string_list.append(
"{}{}{}".format(
@@ -2312,9 +2414,6 @@ def format_list(config, vm_list):
ansiprint.end(),
)
)
else:
net_string_list.append(net_vni)
@@ -2331,7 +2430,9 @@ def format_list(config, vm_list):
vm_state_length=vm_state_length,
vm_tags_length=vm_tags_length,
vm_snapshots_length=vm_snapshots_length,
vm_nets_length=vm_nets_length
+ len(",".join(net_string_list))
- len(",".join(display_net_string_list)),
vm_ram_length=vm_ram_length,
vm_vcpu_length=vm_vcpu_length,
vm_node_length=vm_node_length,
@@ -2344,7 +2445,8 @@ def format_list(config, vm_list):
vm_state=domain_information["state"],
vm_tags=",".join(tag_list),
vm_snapshots=len(domain_information.get("snapshots", list())),
vm_networks=",".join(net_string_list)
+ ("" if all(net_invalid_list) else " "),
vm_memory=domain_information["memory"],
vm_vcpu=domain_information["vcpu"],
vm_node=domain_information["node"],

View File

@@ -2,7 +2,7 @@ from setuptools import setup
setup(
name="pvc",
version="0.9.103",
packages=["pvc.cli", "pvc.lib"],
install_requires=[
"Click",

View File

@@ -19,31 +19,34 @@
#
###############################################################################
import os
import psutil
import psycopg2
import psycopg2.extras
import subprocess
from datetime import datetime
from json import loads, dumps
from time import sleep
from daemon_lib.celery import start, fail, log_info, update, finish
import daemon_lib.common as pvc_common
import daemon_lib.ceph as pvc_ceph
# Define the current test format
TEST_FORMAT = 2
# We run a total of 8 tests, to give a generalized idea of performance on the cluster:
# 1. A sequential read test of 64GB with a 4M block size
# 2. A sequential write test of 64GB with a 4M block size
# 3. A random read test of 64GB with a 4M block size
# 4. A random write test of 64GB with a 4M block size
# 5. A random read test of 64GB with a 256k block size
# 6. A random write test of 64GB with a 256k block size
# 7. A random read test of 64GB with a 4k block size
# 8. A random write test of 64GB with a 4k block size
# Taken together, these 8 results should give a very good indication of the overall storage performance
# for a variety of workloads.
test_matrix = {
@@ -100,7 +103,7 @@ test_matrix = {
# Specify the benchmark volume name and size
benchmark_volume_name = "pvcbenchmark"
benchmark_volume_size = "64G"
#
@@ -226,7 +229,7 @@ def cleanup_benchmark_volume(
def run_benchmark_job(
config, test, pool, job_name=None, db_conn=None, db_cur=None, zkhandler=None
):
test_spec = test_matrix[test]
log_info(None, f"Running test '{test}'")
@@ -256,31 +259,165 @@ def run_benchmark_job(
)
log_info(None, "Running fio job: {}".format(" ".join(fio_cmd.split())))
# Run the fio command manually instead of using our run_os_command wrapper
# This will help us gather statistics about this node while it's running
process = subprocess.Popen(
fio_cmd.split(),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
)
# Wait 15 seconds for the test to start
log_info(None, "Waiting 15 seconds for test resource stabilization")
sleep(15)
# Set up function to get process CPU utilization by name
def get_cpu_utilization_by_name(process_name):
cpu_usage = 0
for proc in psutil.process_iter(["name", "cpu_percent"]):
if proc.info["name"] == process_name:
cpu_usage += proc.info["cpu_percent"]
return cpu_usage
# Set up function to get process memory utilization by name
def get_memory_utilization_by_name(process_name):
memory_usage = 0
for proc in psutil.process_iter(["name", "memory_percent"]):
if proc.info["name"] == process_name:
memory_usage += proc.info["memory_percent"]
return memory_usage
# Set up function to get network traffic utilization in bps
def get_network_traffic_bps(interface, duration=1):
# Get initial network counters
net_io_start = psutil.net_io_counters(pernic=True)
if interface not in net_io_start:
# Return a 3-tuple to match the normal return shape below
return None, None, None
stats_start = net_io_start[interface]
bytes_sent_start = stats_start.bytes_sent
bytes_recv_start = stats_start.bytes_recv
# Wait for the specified duration
sleep(duration)
# Get final network counters
net_io_end = psutil.net_io_counters(pernic=True)
stats_end = net_io_end[interface]
bytes_sent_end = stats_end.bytes_sent
bytes_recv_end = stats_end.bytes_recv
# Calculate bytes per second
bytes_sent_per_sec = (bytes_sent_end - bytes_sent_start) / duration
bytes_recv_per_sec = (bytes_recv_end - bytes_recv_start) / duration
# Convert to bits per second (bps)
bits_sent_per_sec = bytes_sent_per_sec * 8
bits_recv_per_sec = bytes_recv_per_sec * 8
bits_total_per_sec = bits_sent_per_sec + bits_recv_per_sec
return bits_sent_per_sec, bits_recv_per_sec, bits_total_per_sec
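# Usage sketch for the helper above (the interface name is an assumption):
# psutil's net_io_counters(pernic=True) exposes monotonically increasing byte
# counters per NIC, so the rate is simply the delta over the sample window.
# sent_bps, recv_bps, total_bps = get_network_traffic_bps("ens4", duration=1)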
log_info(None, f"Starting system resource polling for test '{test}'")
storage_interface = config["storage_dev"]
total_cpus = psutil.cpu_count(logical=True)
ticks = 1
osd_cpu_utilization = 0
osd_memory_utilization = 0
mon_cpu_utilization = 0
mon_memory_utilization = 0
total_cpu_utilization = 0
total_memory_utilization = 0
storage_sent_bps = 0
storage_recv_bps = 0
storage_total_bps = 0
while process.poll() is None:
# Do collection of statistics like network bandwidth and cpu utilization
current_osd_cpu_utilization = get_cpu_utilization_by_name("ceph-osd")
current_osd_memory_utilization = get_memory_utilization_by_name("ceph-osd")
current_mon_cpu_utilization = get_cpu_utilization_by_name("ceph-mon")
current_mon_memory_utilization = get_memory_utilization_by_name("ceph-mon")
current_total_cpu_utilization = psutil.cpu_percent(interval=1)
current_total_memory_utilization = psutil.virtual_memory().percent
(
current_storage_sent_bps,
current_storage_recv_bps,
current_storage_total_bps,
) = get_network_traffic_bps(storage_interface)
# Recheck if the process is done yet; if it's not, we add the values and increase the ticks
# This helps ensure that if the process finishes earlier than the longer polls above,
# this particular tick isn't counted which can skew the average
if process.poll() is None:
osd_cpu_utilization += current_osd_cpu_utilization
osd_memory_utilization += current_osd_memory_utilization
mon_cpu_utilization += current_mon_cpu_utilization
mon_memory_utilization += current_mon_memory_utilization
total_cpu_utilization += current_total_cpu_utilization
total_memory_utilization += current_total_memory_utilization
storage_sent_bps += current_storage_sent_bps
storage_recv_bps += current_storage_recv_bps
storage_total_bps += current_storage_total_bps
ticks += 1
# Get the 1-minute load average and CPU utilization, which covers the test duration
load1, _, _ = os.getloadavg()
load1 = round(load1, 2)
# Calculate the average CPU utilization values over the runtime
# Divide the OSD and MON CPU utilization by the total number of CPU cores, because
# the total is divided this way
avg_osd_cpu_utilization = round(osd_cpu_utilization / ticks / total_cpus, 2)
avg_osd_memory_utilization = round(osd_memory_utilization / ticks, 2)
avg_mon_cpu_utilization = round(mon_cpu_utilization / ticks / total_cpus, 2)
avg_mon_memory_utilization = round(mon_memory_utilization / ticks, 2)
avg_total_cpu_utilization = round(total_cpu_utilization / ticks, 2)
avg_total_memory_utilization = round(total_memory_utilization / ticks, 2)
avg_storage_sent_bps = round(storage_sent_bps / ticks, 2)
avg_storage_recv_bps = round(storage_recv_bps / ticks, 2)
avg_storage_total_bps = round(storage_total_bps / ticks, 2)
stdout, stderr = process.communicate()
retcode = process.returncode
resource_data = {
"avg_cpu_util_percent": {
"total": avg_total_cpu_utilization,
"ceph-mon": avg_mon_cpu_utilization,
"ceph-osd": avg_osd_cpu_utilization,
},
"avg_memory_util_percent": {
"total": avg_total_memory_utilization,
"ceph-mon": avg_mon_memory_utilization,
"ceph-osd": avg_osd_memory_utilization,
},
"avg_network_util_bps": {
"sent": avg_storage_sent_bps,
"recv": avg_storage_recv_bps,
"total": avg_storage_total_bps,
},
}
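# The averages above are arithmetic means of the per-tick samples, with the
# per-process CPU figures further divided by the core count so they are
# comparable to psutil.cpu_percent()'s whole-system percentage. Worked
# example with made-up samples: three ticks of 12.0, 18.0 and 15.0 percent
# average to round(45.0 / 3, 2) == 15.0.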
try:
jstdout = loads(stdout)
if retcode:
raise
except Exception:
return None, None
return resource_data, jstdout
def worker_run_benchmark(zkhandler, celery, config, pool, name):
# Phase 0 - connect to databases
if not name:
cur_time = datetime.now().isoformat(timespec="seconds")
cur_primary = zkhandler.read("base.config.primary_node")
job_name = f"{cur_time}_{cur_primary}"
else:
job_name = name
current_stage = 0
total_stages = 13
@@ -358,7 +495,8 @@ def worker_run_benchmark(zkhandler, celery, config, pool):
total=total_stages,
)
resource_data, fio_data = run_benchmark_job(
config,
test,
pool,
job_name=job_name,
@@ -366,6 +504,25 @@ def worker_run_benchmark(zkhandler, celery, config, pool):
db_cur=db_cur,
zkhandler=zkhandler,
)
if resource_data is None or fio_data is None:
cleanup_benchmark_volume(
pool,
job_name=job_name,
db_conn=db_conn,
db_cur=db_cur,
zkhandler=zkhandler,
)
cleanup(
job_name,
db_conn=db_conn,
db_cur=db_cur,
zkhandler=zkhandler,
)
fail(
None,
f"Failed to run fio test '{test}'",
)
results[test] = {**resource_data, **fio_data}
# Phase 3 - cleanup
current_stage += 1

View File

@@ -560,7 +560,21 @@ def getVolumeInformation(zkhandler, pool, volume):
return volume_information
def scan_volume(zkhandler, pool, name):
retcode, stdout, stderr = common.run_os_command(
"rbd info --format json {}/{}".format(pool, name)
)
volstats = stdout
# Update the volume stats in Zookeeper
zkhandler.write(
[
(("volume.stats", f"{pool}/{name}"), volstats),
]
)
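# Hypothetical usage right after a volume exists; "rbd info --format json"
# emits the volume metadata, which is stored verbatim under volume.stats:
# scan_volume(zkhandler, "vms", "myvolume")
# volstats = loads(zkhandler.read(("volume.stats", "vms/myvolume")))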
def add_volume(zkhandler, pool, name, size, force_flag=False, zk_only=False):
# 1. Verify the size of the volume
pool_information = getPoolInformation(zkhandler, pool)
size_bytes = format_bytes_fromhuman(size)
@@ -592,27 +606,28 @@ def add_volume(zkhandler, pool, name, size, force_flag=False):
)
# 2. Create the volume
# zk_only flag skips actually creating the volume - this would be done by some other mechanism
if not zk_only:
retcode, stdout, stderr = common.run_os_command(
"rbd create --size {}B {}/{}".format(size_bytes, pool, name)
)
if retcode:
return False, 'ERROR: Failed to create RBD volume "{}": {}'.format(
name, stderr
)
# 3. Add the new volume to Zookeeper
zkhandler.write(
[
(("volume", f"{pool}/{name}"), ""),
(("volume.stats", f"{pool}/{name}"), ""),
(("snapshot", f"{pool}/{name}"), ""),
]
)
# 4. Scan the volume stats
scan_volume(zkhandler, pool, name)
return True, 'Created RBD volume "{}" of size "{}" in pool "{}".'.format(
name, format_bytes_tohuman(size_bytes), pool
)
@@ -662,21 +677,18 @@ def clone_volume(zkhandler, pool, name_src, name_new, force_flag=False):
),
)
# 3. Add the new volume to Zookeeper
zkhandler.write(
[
(("volume", f"{pool}/{name_new}"), ""),
(("volume.stats", f"{pool}/{name_new}"), ""),
(("snapshot", f"{pool}/{name_new}"), ""),
]
)
# 4. Scan the volume stats
scan_volume(zkhandler, pool, name_new)
return True, 'Cloned RBD volume "{}" to "{}" in pool "{}"'.format(
name_src, name_new, pool
)
@@ -761,20 +773,8 @@ def resize_volume(zkhandler, pool, name, size, force_flag=False):
except Exception:
pass
# 4. Scan the volume stats
scan_volume(zkhandler, pool, name)
return True, 'Resized RBD volume "{}" to size "{}" in pool "{}".'.format(
name, format_bytes_tohuman(size_bytes), pool
@@ -807,18 +807,8 @@ def rename_volume(zkhandler, pool, name, new_name):
]
)
# 3. Scan the volume stats
scan_volume(zkhandler, pool, new_name)
return True, 'Renamed RBD volume "{}" to "{}" in pool "{}".'.format(
name, new_name, pool
@@ -1102,17 +1092,17 @@ def rollback_snapshot(zkhandler, pool, volume, name):
),
)
# 1. Roll back the snapshot
retcode, stdout, stderr = common.run_os_command(
"rbd snap rollback {}/{}@{}".format(pool, volume, name)
)
if retcode:
return (
False,
'ERROR: Failed to roll back RBD volume "{}" in pool "{}" to snapshot "{}": {}'.format(
volume, pool, name, stderr
),
)
return True, 'Rolled back RBD volume "{}" in pool "{}" to snapshot "{}".'.format(
volume, pool, name
@@ -1178,11 +1168,14 @@ def get_list_snapshot(zkhandler, target_pool, target_volume, limit=None, is_fuzz
                 continue
             if target_volume and volume_name != target_volume:
                 continue
-            snapshot_stats = json.loads(
-                zkhandler.read(
-                    ("snapshot.stats", f"{pool_name}/{volume_name}/{snapshot_name}")
+            try:
+                snapshot_stats = json.loads(
+                    zkhandler.read(
+                        ("snapshot.stats", f"{pool_name}/{volume_name}/{snapshot_name}")
+                    )
                 )
-            )
+            except Exception:
+                snapshot_stats = []
             if limit:
                 try:
                     if re.fullmatch(limit, snapshot_name):
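
The new try/except makes stats parsing defensive, so a snapshot whose stats key is empty or missing no longer aborts the whole listing. In isolation, the pattern is (sample data, not PVC code):

    import json

    raw = None  # e.g. a snapshot whose stats key was never populated
    try:
        snapshot_stats = json.loads(raw)
    except Exception:
        snapshot_stats = []
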
@@ -1238,16 +1231,16 @@ def osd_worker_add_osd(
     current_stage = 0
     total_stages = 5
     if split_count is None:
-        _split_count = 1
+        split_count = 1
     else:
-        _split_count = split_count
-    total_stages = total_stages + 3 * int(_split_count)
+        split_count = int(split_count)
+    total_stages = total_stages + 3 * int(split_count)
     if ext_db_ratio is not None or ext_db_size is not None:
-        total_stages = total_stages + 3 * int(_split_count) + 1
+        total_stages = total_stages + 3 * int(split_count) + 1
 
     start(
         celery,
-        f"Adding {_split_count} new OSD(s) on device {device} with weight {weight}",
+        f"Adding {split_count} new OSD(s) on device {device} with weight {weight}",
         current=current_stage,
         total=total_stages,
     )
@@ -1288,7 +1281,7 @@ def osd_worker_add_osd(
     else:
         ext_db_flag = False
 
-    if split_count is not None:
+    if split_count > 1:
         split_flag = f"--osds-per-device {split_count}"
         is_split = True
         log_info(
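
Two fixes are visible here: split_count is normalized to a real integer (defaulting to 1), and the split path now triggers only when more than one OSD per device is requested, so an --osd-count of 1 no longer creates split OSDs (the bug noted in the 0.9.101 changelog below). A worked example of the stage arithmetic, assuming an external DB is configured (illustrative values only):

    split_count = 2
    total_stages = 5 + 3 * split_count        # 11: base stages plus per-OSD stages
    ext_db_configured = True                   # hypothetical flag for this example
    if ext_db_configured:
        total_stages = total_stages + 3 * split_count + 1
    assert total_stages == 18
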
@@ -262,6 +262,22 @@ def getClusterInformation(zkhandler):
     # Get cluster maintenance state
     maintenance_state = zkhandler.read("base.config.maintenance")
 
+    # Prepare cluster total values
+    cluster_total_node_memory = 0
+    cluster_total_used_memory = 0
+    cluster_total_free_memory = 0
+    cluster_total_allocated_memory = 0
+    cluster_total_provisioned_memory = 0
+    cluster_total_average_memory_utilization = 0
+    cluster_total_cpu_cores = 0
+    cluster_total_cpu_load = 0
+    cluster_total_average_cpu_utilization = 0
+    cluster_total_allocated_cores = 0
+    cluster_total_osd_space = 0
+    cluster_total_used_space = 0
+    cluster_total_free_space = 0
+    cluster_total_average_osd_utilization = 0
+
     # Get primary node
     maintenance_state, primary_node = zkhandler.read_many(
         [
@@ -276,19 +292,36 @@ def getClusterInformation(zkhandler):
     # Get the list of Nodes
     node_list = zkhandler.children("base.node")
     node_count = len(node_list)
 
-    # Get the daemon and domain states of all Nodes
+    # Get the information of all Nodes
     node_state_reads = list()
+    node_memory_reads = list()
+    node_cpu_reads = list()
     for node in node_list:
         node_state_reads += [
             ("node.state.daemon", node),
             ("node.state.domain", node),
         ]
+        node_memory_reads += [
+            ("node.memory.total", node),
+            ("node.memory.used", node),
+            ("node.memory.free", node),
+            ("node.memory.allocated", node),
+            ("node.memory.provisioned", node),
+        ]
+        node_cpu_reads += [
+            ("node.data.static", node),
+            ("node.vcpu.allocated", node),
+            ("node.cpu.load", node),
+        ]
     all_node_states = zkhandler.read_many(node_state_reads)
+    all_node_memory = zkhandler.read_many(node_memory_reads)
+    all_node_cpu = zkhandler.read_many(node_cpu_reads)
 
     # Parse out the Node states
     node_data = list()
     formatted_node_states = {"total": node_count}
     for nidx, node in enumerate(node_list):
-        # Split the large list of return values by the IDX of this node
+        # Split the large list of return values by the IDX of this node (states)
         # Each node result is 2 fields long
         pos_start = nidx * 2
         pos_end = nidx * 2 + 2
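
The memory and CPU values are fetched in the same single-round-trip style as the states: one bulk read_many() call per category, then fixed-stride slicing per node, as the next hunk shows. A self-contained sketch of that slicing pattern (stand-in data, not PVC code):

    node_list = ["hv1", "hv2", "hv3"]   # hypothetical node names
    FIELDS = 5                           # memory fields read per node, as above
    all_node_memory = [str(n) for n in range(len(node_list) * FIELDS)]  # stand-in for read_many()
    for nidx, node in enumerate(node_list):
        total, used, free, allocated, provisioned = tuple(
            all_node_memory[nidx * FIELDS : nidx * FIELDS + FIELDS]
        )
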
@@ -308,6 +341,46 @@ def getClusterInformation(zkhandler):
         else:
             formatted_node_states[node_state] = 1
 
+        # Split the large list of return values by the IDX of this node (memory)
+        # Each node result is 5 fields long
+        pos_start = nidx * 5
+        pos_end = nidx * 5 + 5
+        (
+            node_memory_total,
+            node_memory_used,
+            node_memory_free,
+            node_memory_allocated,
+            node_memory_provisioned,
+        ) = tuple(all_node_memory[pos_start:pos_end])
+        cluster_total_node_memory += int(node_memory_total)
+        cluster_total_used_memory += int(node_memory_used)
+        cluster_total_free_memory += int(node_memory_free)
+        cluster_total_allocated_memory += int(node_memory_allocated)
+        cluster_total_provisioned_memory += int(node_memory_provisioned)
+
+        # Split the large list of return values by the IDX of this node (cpu)
+        # Each node result is 3 fields long
+        pos_start = nidx * 3
+        pos_end = nidx * 3 + 3
+        node_static_data, node_vcpu_allocated, node_cpu_load = tuple(
+            all_node_cpu[pos_start:pos_end]
+        )
+        cluster_total_cpu_cores += int(node_static_data.split()[0])
+        cluster_total_cpu_load += round(float(node_cpu_load), 2)
+        cluster_total_allocated_cores += int(node_vcpu_allocated)
+
+    cluster_total_average_memory_utilization = (
+        (round((cluster_total_used_memory / cluster_total_node_memory) * 100, 2))
+        if cluster_total_node_memory > 0
+        else 0.00
+    )
+
+    cluster_total_average_cpu_utilization = (
+        (round((cluster_total_cpu_load / cluster_total_cpu_cores) * 100, 2))
+        if cluster_total_cpu_cores > 0
+        else 0.00
+    )
+
     # Get the list of VMs
     vm_list = zkhandler.children("base.domain")
     vm_count = len(vm_list)
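
The utilization figures are plain percentages guarded against division by zero on empty clusters. For example, with 96 GiB used of 128 GiB total (hypothetical numbers, in MB):

    cluster_total_used_memory = 98304      # MB, hypothetical
    cluster_total_node_memory = 131072     # MB, hypothetical
    utilization = (
        round((cluster_total_used_memory / cluster_total_node_memory) * 100, 2)
        if cluster_total_node_memory > 0
        else 0.00
    )
    assert utilization == 75.0
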
@@ -380,6 +453,18 @@ def getClusterInformation(zkhandler):
         else:
             formatted_osd_states[osd_state] = 1
 
+        # Add the OSD utilization
+        cluster_total_osd_space += int(osd_stats["kb"])
+        cluster_total_used_space += int(osd_stats["kb_used"])
+        cluster_total_free_space += int(osd_stats["kb_avail"])
+        cluster_total_average_osd_utilization += float(osd_stats["utilization"])
+
+    cluster_total_average_osd_utilization = (
+        (round(cluster_total_average_osd_utilization / len(ceph_osd_list), 2))
+        if ceph_osd_list
+        else 0.00
+    )
+
     # Get the list of Networks
     network_list = zkhandler.children("base.network")
     network_count = len(network_list)
@@ -424,6 +509,28 @@ def getClusterInformation(zkhandler):
         "pools": ceph_pool_count,
         "volumes": ceph_volume_count,
         "snapshots": ceph_snapshot_count,
+        "resources": {
+            "memory": {
+                "total": cluster_total_node_memory,
+                "free": cluster_total_free_memory,
+                "used": cluster_total_used_memory,
+                "allocated": cluster_total_allocated_memory,
+                "provisioned": cluster_total_provisioned_memory,
+                "utilization": cluster_total_average_memory_utilization,
+            },
+            "cpu": {
+                "total": cluster_total_cpu_cores,
+                "load": cluster_total_cpu_load,
+                "allocated": cluster_total_allocated_cores,
+                "utilization": cluster_total_average_cpu_utilization,
+            },
+            "disk": {
+                "total": cluster_total_osd_space,
+                "used": cluster_total_used_space,
+                "free": cluster_total_free_space,
+                "utilization": cluster_total_average_osd_utilization,
+            },
+        },
         "detail": {
             "node": node_data,
             "vm": vm_data,
@@ -1053,6 +1160,7 @@ def get_resource_metrics(zkhandler):
             "fail": 8,
             "import": 9,
             "restore": 10,
+            "mirror": 99,
         }
         state = vm["state"]
         output_lines.append(
@@ -85,6 +85,7 @@ vm_state_combinations = [
     "provision",
     "import",
     "restore",
+    "mirror",
 ]
 ceph_osd_state_combinations = [
     "up,in",
@@ -375,8 +375,11 @@ def get_parsed_configuration(config_file):
     config = {**config, **config_api_ssl}
 
     # Use coordinators as storage hosts if not explicitly specified
+    # These are added as FQDNs in the storage domain
     if not config["storage_hosts"] or len(config["storage_hosts"]) < 1:
-        config["storage_hosts"] = config["coordinators"]
+        config["storage_hosts"] = []
+        for host in config["coordinators"]:
+            config["storage_hosts"].append(f"{host}.{config['storage_domain']}")
 
     # Set up our token list if specified
     if config["api_auth_source"] == "token":
@@ -0,0 +1 @@
+{"version": "15", "root": "", "base": {"root": "", "schema": "/schema", "schema.version": "/schema/version", "config": "/config", "config.maintenance": "/config/maintenance", "config.fence_lock": "/config/fence_lock", "config.primary_node": "/config/primary_node", "config.primary_node.sync_lock": "/config/primary_node/sync_lock", "config.upstream_ip": "/config/upstream_ip", "config.migration_target_selector": "/config/migration_target_selector", "logs": "/logs", "faults": "/faults", "node": "/nodes", "domain": "/domains", "network": "/networks", "storage": "/ceph", "storage.health": "/ceph/health", "storage.util": "/ceph/util", "osd": "/ceph/osds", "pool": "/ceph/pools", "volume": "/ceph/volumes", "snapshot": "/ceph/snapshots"}, "logs": {"node": "", "messages": "/messages"}, "faults": {"id": "", "last_time": "/last_time", "first_time": "/first_time", "ack_time": "/ack_time", "status": "/status", "delta": "/delta", "message": "/message"}, "node": {"name": "", "keepalive": "/keepalive", "mode": "/daemonmode", "data.active_schema": "/activeschema", "data.latest_schema": "/latestschema", "data.static": "/staticdata", "data.pvc_version": "/pvcversion", "running_domains": "/runningdomains", "count.provisioned_domains": "/domainscount", "count.networks": "/networkscount", "state.daemon": "/daemonstate", "state.router": "/routerstate", "state.domain": "/domainstate", "cpu.load": "/cpuload", "vcpu.allocated": "/vcpualloc", "memory.total": "/memtotal", "memory.used": "/memused", "memory.free": "/memfree", "memory.allocated": "/memalloc", "memory.provisioned": "/memprov", "ipmi.hostname": "/ipmihostname", "ipmi.username": "/ipmiusername", "ipmi.password": "/ipmipassword", "sriov": "/sriov", "sriov.pf": "/sriov/pf", "sriov.vf": "/sriov/vf", "monitoring.plugins": "/monitoring_plugins", "monitoring.data": "/monitoring_data", "monitoring.health": "/monitoring_health", "network.stats": "/network_stats"}, "monitoring_plugin": {"name": "", "last_run": "/last_run", "health_delta": "/health_delta", "message": "/message", "data": "/data", "runtime": "/runtime"}, "sriov_pf": {"phy": "", "mtu": "/mtu", "vfcount": "/vfcount"}, "sriov_vf": {"phy": "", "pf": "/pf", "mtu": "/mtu", "mac": "/mac", "phy_mac": "/phy_mac", "config": "/config", "config.vlan_id": "/config/vlan_id", "config.vlan_qos": "/config/vlan_qos", "config.tx_rate_min": "/config/tx_rate_min", "config.tx_rate_max": "/config/tx_rate_max", "config.spoof_check": "/config/spoof_check", "config.link_state": "/config/link_state", "config.trust": "/config/trust", "config.query_rss": "/config/query_rss", "pci": "/pci", "pci.domain": "/pci/domain", "pci.bus": "/pci/bus", "pci.slot": "/pci/slot", "pci.function": "/pci/function", "used": "/used", "used_by": "/used_by"}, "domain": {"name": "", "xml": "/xml", "state": "/state", "profile": "/profile", "stats": "/stats", "node": "/node", "last_node": "/lastnode", "failed_reason": "/failedreason", "storage.volumes": "/rbdlist", "console.log": "/consolelog", "console.vnc": "/vnc", "meta.autostart": "/node_autostart", "meta.migrate_method": "/migration_method", "meta.migrate_max_downtime": "/migration_max_downtime", "meta.node_selector": "/node_selector", "meta.node_limit": "/node_limit", "meta.tags": "/tags", "migrate.sync_lock": "/migrate_sync_lock", "snapshots": "/snapshots"}, "tag": {"name": "", "type": "/type", "protected": "/protected"}, "domain_snapshot": {"name": "", "timestamp": "/timestamp", "xml": "/xml", "rbd_snapshots": "/rbdsnaplist"}, "network": {"vni": "", "type": "/nettype", "mtu": "/mtu", "rule": "/firewall_rules", "rule.in": "/firewall_rules/in", "rule.out": "/firewall_rules/out", "nameservers": "/name_servers", "domain": "/domain", "reservation": "/dhcp4_reservations", "lease": "/dhcp4_leases", "ip4.gateway": "/ip4_gateway", "ip4.network": "/ip4_network", "ip4.dhcp": "/dhcp4_flag", "ip4.dhcp_start": "/dhcp4_start", "ip4.dhcp_end": "/dhcp4_end", "ip6.gateway": "/ip6_gateway", "ip6.network": "/ip6_network", "ip6.dhcp": "/dhcp6_flag"}, "reservation": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname"}, "lease": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname", "expiry": "/expiry", "client_id": "/clientid"}, "rule": {"description": "", "rule": "/rule", "order": "/order"}, "osd": {"id": "", "node": "/node", "device": "/device", "db_device": "/db_device", "fsid": "/fsid", "ofsid": "/fsid/osd", "cfsid": "/fsid/cluster", "lvm": "/lvm", "vg": "/lvm/vg", "lv": "/lvm/lv", "is_split": "/is_split", "stats": "/stats"}, "pool": {"name": "", "pgs": "/pgs", "tier": "/tier", "stats": "/stats"}, "volume": {"name": "", "stats": "/stats"}, "snapshot": {"name": "", "stats": "/stats"}}

File diff suppressed because it is too large

@@ -336,11 +336,7 @@ def worker_create_vm(
     retcode, stdout, stderr = pvc_common.run_os_command("uname -m")
     vm_data["system_architecture"] = stdout.strip()
 
-    monitor_list = list()
-    monitor_names = config["storage_hosts"]
-    for monitor in monitor_names:
-        monitor_list.append("{}.{}".format(monitor, config["storage_domain"]))
-    vm_data["ceph_monitor_list"] = monitor_list
+    vm_data["ceph_monitor_list"] = config["storage_hosts"]
     vm_data["ceph_monitor_port"] = config["ceph_monitor_port"]
     vm_data["ceph_monitor_secret"] = config["ceph_secret_uuid"]
@@ -30,7 +30,8 @@ from kazoo.client import KazooClient, KazooState
 from kazoo.exceptions import NoNodeError
 
-SCHEMA_ROOT_PATH = "/usr/share/pvc/daemon_lib/migrations/versions"
+DEFAULT_ROOT_PATH = "/usr/share/pvc"
+SCHEMA_PATH = "daemon_lib/migrations/versions"
 
 #
@@ -576,7 +577,7 @@ class ZKHandler(object):
 #
 class ZKSchema(object):
     # Current version
-    _version = 14
+    _version = 15
 
     # Root for doing nested keys
     _schema_root = ""
@@ -592,6 +593,7 @@
         "schema.version": f"{_schema_root}/schema/version",
         "config": f"{_schema_root}/config",
         "config.maintenance": f"{_schema_root}/config/maintenance",
+        "config.fence_lock": f"{_schema_root}/config/fence_lock",
         "config.primary_node": f"{_schema_root}/config/primary_node",
         "config.primary_node.sync_lock": f"{_schema_root}/config/primary_node/sync_lock",
         "config.upstream_ip": f"{_schema_root}/config/upstream_ip",
@@ -831,8 +833,8 @@
     def schema(self, schema):
         self._schema = schema
 
-    def __init__(self):
-        pass
+    def __init__(self, root_path=DEFAULT_ROOT_PATH):
+        self.schema_path = f"{root_path}/{SCHEMA_PATH}"
 
     def __repr__(self):
         return f"ZKSchema({self.version})"
@@ -872,7 +874,7 @@
         if not quiet:
             print(f"Loading schema version {version}")
 
-        with open(f"{SCHEMA_ROOT_PATH}/{version}.json", "r") as sfh:
+        with open(f"{self.schema_path}/{version}.json", "r") as sfh:
             self.schema = json.load(sfh)
             self.version = self.schema.get("version")
@@ -1134,7 +1136,7 @@
     # Migrate from older to newer schema
     def migrate(self, zkhandler, new_version):
         # Determine the versions in between
-        versions = ZKSchema.find_all(start=self.version, end=new_version)
+        versions = self.find_all(start=self.version, end=new_version)
         if versions is None:
             return
@@ -1150,7 +1152,7 @@
     # Rollback from newer to older schema
     def rollback(self, zkhandler, old_version):
         # Determine the versions in between
-        versions = ZKSchema.find_all(start=old_version - 1, end=self.version - 1)
+        versions = self.find_all(start=old_version - 1, end=self.version - 1)
         if versions is None:
             return
@@ -1165,6 +1167,12 @@
         # Apply those changes
         self.run_migrate(zkhandler, changes)
 
+    # Write the latest schema to a file
+    def write(self):
+        schema_file = f"{self.schema_path}/{self._version}.json"
+        with open(schema_file, "w") as sfh:
+            json.dump(self._schema, sfh)
+
     @classmethod
     def key_diff(cls, schema_a, schema_b):
         # schema_a = current
@@ -1210,26 +1218,10 @@
         return {"add": diff_add, "remove": diff_remove, "rename": diff_rename}
 
-    # Load in the schema of the current cluster
-    @classmethod
-    def load_current(cls, zkhandler):
-        new_instance = cls()
-        version = new_instance.get_version(zkhandler)
-        new_instance.load(version)
-        return new_instance
-
-    # Write the latest schema to a file
-    @classmethod
-    def write(cls):
-        schema_file = f"{SCHEMA_ROOT_PATH}/{cls._version}.json"
-        with open(schema_file, "w") as sfh:
-            json.dump(cls._schema, sfh)
-
     # Static methods for reading information from the files
-    @staticmethod
-    def find_all(start=0, end=None):
+    def find_all(self, start=0, end=None):
         versions = list()
-        for version in os.listdir(SCHEMA_ROOT_PATH):
+        for version in os.listdir(self.schema_path):
             sequence_id = int(version.split(".")[0])
             if end is None:
                 if sequence_id > start:
@@ -1242,11 +1234,18 @@
             else:
                 return None
 
-    @staticmethod
-    def find_latest():
+    def find_latest(self):
         latest_version = 0
-        for version in os.listdir(SCHEMA_ROOT_PATH):
+        for version in os.listdir(self.schema_path):
            sequence_id = int(version.split(".")[0])
            if sequence_id > latest_version:
                latest_version = sequence_id
         return latest_version
+
+    # Load in the schema of the current cluster
+    @classmethod
+    def load_current(cls, zkhandler):
+        new_instance = cls()
+        version = new_instance.get_version(zkhandler)
+        new_instance.load(version)
+        return new_instance
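
The net effect of these hunks is that schema paths become per-instance state instead of a module constant, so tooling (such as the local schema generator the commit message mentions) can point a ZKSchema at an alternate tree. A hedged usage sketch, assuming daemon_lib is on the import path and the alternate directory exists:

    from daemon_lib.zkhandler import ZKSchema  # module path assumed from context

    schema = ZKSchema(root_path="/tmp/pvc-dev")  # schema_path -> /tmp/pvc-dev/daemon_lib/migrations/versions
    latest = schema.find_latest()                # now scans schema_path, not a fixed constant
    schema.load(latest)
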
debian/changelog
@@ -1,3 +1,43 @@
+pvc (0.9.103-0) unstable; urgency=high
+
+  * [Provisioner] Fixes a bug with the change in `storage_hosts` to FQDNs affecting the VM Builder
+  * [Monitoring] Fixes the Munin plugin to work properly with sudo
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Fri, 01 Nov 2024 17:19:44 -0400
+
+pvc (0.9.102-0) unstable; urgency=high
+
+  * [API Daemon] Ensures that received config snapshots update storage hosts in addition to secret UUIDs
+  * [CLI Client] Fixes several bugs around local connection handling and connection listings
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Thu, 17 Oct 2024 10:48:31 -0400
+
+pvc (0.9.101-0) unstable; urgency=high
+
+  **New Feature**: Adds VM snapshot sending (`vm snapshot send`), VM mirroring (`vm mirror create`), and (offline) mirror promotion (`vm mirror promote`). Permits transferring VM snapshots to remote clusters, individually or repeatedly, and promoting them to active status, for disaster recovery and migration between clusters.
+
+  **Breaking Change**: Migrates the API daemon into Gunicorn when in production mode. Permits more scalable and performant operation of the API. **Requires additional dependency packages on all coordinator nodes** (`gunicorn`, `python3-gunicorn`, `python3-setuptools`); upgrade via `pvc-ansible` is strongly recommended.
+
+  **Enhancement**: Provides whole cluster utilization stats in the cluster status data. Permits better observability into the overall resource utilization of the cluster.
+
+  **Enhancement**: Adds a new storage benchmark format (v2) which includes additional resource utilization statistics. This allows for better evaluation of storage performance impact on the cluster as a whole. The updated format also permits arbitrary benchmark job names for easier parsing and tracking.
+
+  * [API Daemon] Allows scanning of new volumes added manually via other commands
+  * [API Daemon/CLI Client] Adds whole cluster utilization statistics to cluster status
+  * [API Daemon] Moves production API execution into Gunicorn
+  * [API Daemon] Adds a new storage benchmark format (v2) with additional resource tracking
+  * [API Daemon] Adds support for named storage benchmark jobs
+  * [API Daemon] Fixes a bug in OSD creation which would create `split` OSDs if `--osd-count` was set to 1
+  * [API Daemon] Adds support for the `mirror` VM state used by snapshot mirrors
+  * [CLI Client] Fixes several output display bugs in various commands and in Worker task outputs
+  * [CLI Client] Improves and shrinks the status progress bar output to support longer messages
+  * [API Daemon] Adds support for sending snapshots to remote clusters
+  * [API Daemon] Adds support for updating and promoting snapshot mirrors to remote clusters
+  * [Node Daemon] Improves timeouts during primary/secondary coordinator transitions to avoid deadlocks
+  * [Node Daemon] Improves timeouts during keepalive updates to avoid deadlocks
+  * [Node Daemon] Refactors fencing thread structure to ensure a single fencing task per cluster and sequential node fences to avoid potential anomalies (e.g. fencing 2 nodes simultaneously)
+  * [Node Daemon] Fixes a bug in fencing if VM locks were already freed, leaving VMs in an invalid state
+  * [Node Daemon] Increases the wait time during system startup to ensure Zookeeper has more time to synchronize
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Tue, 15 Oct 2024 11:39:11 -0400
+
 pvc (0.9.100-0) unstable; urgency=high
 
   * [API Daemon] Improves the handling of "detect:" disk strings on newer systems by leveraging the "nvme" command
debian/control
@@ -32,7 +32,7 @@ Description: Parallel Virtual Cluster worker daemon
 Package: pvc-daemon-api
 Architecture: all
-Depends: systemd, pvc-daemon-common, python3-yaml, python3-flask, python3-flask-restful, python3-celery, python3-distutils, python3-redis, python3-lxml, python3-flask-migrate
+Depends: systemd, pvc-daemon-common, gunicorn, python3-gunicorn, python3-yaml, python3-flask, python3-flask-restful, python3-celery, python3-distutils, python3-redis, python3-lxml, python3-flask-migrate
 Description: Parallel Virtual Cluster API daemon
  A KVM/Zookeeper/Ceph-based VM and private cloud manager
  .
@@ -33,7 +33,7 @@ import os
 import signal
 
 # Daemon version
-version = "0.9.100"
+version = "0.9.103"
 
 ##########################################################
@@ -34,7 +34,7 @@ warning=0.99
 critical=1.99
 
 export PVC_CLIENT_DIR="/run/shm/munin-pvc"
-PVC_CMD="/usr/bin/pvc --quiet --cluster local status --format json-pretty"
+PVC_CMD="/usr/bin/sudo -E /usr/bin/pvc --quiet cluster status --format json-pretty"
 JQ_CMD="/usr/bin/jq"
 
 output_usage() {
@@ -126,7 +126,7 @@ output_values() {
     is_maintenance="$( $JQ_CMD ".maintenance" <<<"${PVC_OUTPUT}" | tr -d '"' )"
     cluster_health="$( $JQ_CMD ".cluster_health.health" <<<"${PVC_OUTPUT}" | tr -d '"' )"
-    cluster_health_messages="$( $JQ_CMD -r ".cluster_health.messages | @csv" <<<"${PVC_OUTPUT}" | tr -d '"' | sed 's/,/, /g' )"
+    cluster_health_messages="$( $JQ_CMD -r ".cluster_health.messages | map(.text) | join(\", \")" <<<"${PVC_OUTPUT}" )"
 
     echo 'multigraph pvc_cluster_health'
     echo "pvc_cluster_health.value ${cluster_health}"
     echo "pvc_cluster_health.extinfo ${cluster_health_messages}"
@@ -142,7 +142,7 @@ output_values() {
     echo "pvc_cluster_alert.value ${cluster_health_alert}"
 
     node_health="$( $JQ_CMD ".node_health.${HOST}.health" <<<"${PVC_OUTPUT}" | tr -d '"' )"
-    node_health_messages="$( $JQ_CMD -r ".node_health.${HOST}.messages | @csv" <<<"${PVC_OUTPUT}" | tr -d '"' | sed 's/,/, /g' )"
+    node_health_messages="$( $JQ_CMD -r ".node_health.${HOST}.messages | join(\", \")" <<<"${PVC_OUTPUT}" )"
 
     echo 'multigraph pvc_node_health'
     echo "pvc_node_health.value ${node_health}"
     echo "pvc_node_health.extinfo ${node_health_messages}"

File diff suppressed because it is too large

File diff suppressed because it is too large

@@ -49,7 +49,7 @@ import re
 import json
 
 # Daemon version
-version = "0.9.100"
+version = "0.9.103"
 
 ##########################################################
@@ -438,8 +438,11 @@ class NodeInstance(object):
         # Synchronize nodes B (I am reader)
         lock = self.zkhandler.readlock("base.config.primary_node.sync_lock")
         self.logger.out("Acquiring read lock for synchronization phase B", state="i")
-        lock.acquire()
-        self.logger.out("Acquired read lock for synchronization phase B", state="o")
+        try:
+            lock.acquire(timeout=5)  # Don't wait forever and completely block us
+            self.logger.out("Acquired read lock for synchronization phase B", state="o")
+        except Exception:
+            pass
         self.logger.out("Releasing read lock for synchronization phase B", state="i")
         lock.release()
         self.logger.out("Released read lock for synchronization phase B", state="o")
@@ -648,8 +651,11 @@
         # Synchronize nodes A (I am reader)
         lock = self.zkhandler.readlock("base.config.primary_node.sync_lock")
         self.logger.out("Acquiring read lock for synchronization phase A", state="i")
-        lock.acquire()
-        self.logger.out("Acquired read lock for synchronization phase A", state="o")
+        try:
+            lock.acquire(timeout=5)  # Don't wait forever and completely block us
+            self.logger.out("Acquired read lock for synchronization phase A", state="o")
+        except Exception:
+            pass
         self.logger.out("Releasing read lock for synchronization phase A", state="i")
         lock.release()
         self.logger.out("Released read lock for synchronization phase A", state="o")
@@ -682,8 +688,11 @@
         # Synchronize nodes C (I am reader)
         lock = self.zkhandler.readlock("base.config.primary_node.sync_lock")
         self.logger.out("Acquiring read lock for synchronization phase C", state="i")
-        lock.acquire()
-        self.logger.out("Acquired read lock for synchronization phase C", state="o")
+        try:
+            lock.acquire(timeout=5)  # Don't wait forever and completely block us
+            self.logger.out("Acquired read lock for synchronization phase C", state="o")
+        except Exception:
+            pass
         # 5. Remove Upstream floating IP
         self.logger.out(
             "Removing floating upstream IP {}/{} from interface {}".format(
@@ -701,8 +710,11 @@
         # Synchronize nodes D (I am reader)
         lock = self.zkhandler.readlock("base.config.primary_node.sync_lock")
         self.logger.out("Acquiring read lock for synchronization phase D", state="i")
-        lock.acquire()
-        self.logger.out("Acquired read lock for synchronization phase D", state="o")
+        try:
+            lock.acquire(timeout=5)  # Don't wait forever and completely block us
+            self.logger.out("Acquired read lock for synchronization phase D", state="o")
+        except Exception:
+            pass
         # 6. Remove Cluster & Storage floating IP
         self.logger.out(
             "Removing floating management IP {}/{} from interface {}".format(
@@ -729,8 +741,11 @@
         # Synchronize nodes E (I am reader)
         lock = self.zkhandler.readlock("base.config.primary_node.sync_lock")
         self.logger.out("Acquiring read lock for synchronization phase E", state="i")
-        lock.acquire()
-        self.logger.out("Acquired read lock for synchronization phase E", state="o")
+        try:
+            lock.acquire(timeout=5)  # Don't wait forever and completely block us
+            self.logger.out("Acquired read lock for synchronization phase E", state="o")
+        except Exception:
+            pass
         # 7. Remove Metadata link-local IP
         self.logger.out(
             "Removing Metadata link-local IP {}/{} from interface {}".format(
@@ -746,8 +761,11 @@
         # Synchronize nodes F (I am reader)
         lock = self.zkhandler.readlock("base.config.primary_node.sync_lock")
         self.logger.out("Acquiring read lock for synchronization phase F", state="i")
-        lock.acquire()
-        self.logger.out("Acquired read lock for synchronization phase F", state="o")
+        try:
+            lock.acquire(timeout=5)  # Don't wait forever and completely block us
+            self.logger.out("Acquired read lock for synchronization phase F", state="o")
+        except Exception:
+            pass
         # 8. Remove gateway IPs
         for network in self.d_network:
             self.d_network[network].removeGateways()
@@ -759,7 +777,7 @@
         lock = self.zkhandler.readlock("base.config.primary_node.sync_lock")
         self.logger.out("Acquiring read lock for synchronization phase G", state="i")
         try:
-            lock.acquire(timeout=60)  # Don't wait forever and completely block us
+            lock.acquire(timeout=5)  # Don't wait forever and completely block us
             self.logger.out("Acquired read lock for synchronization phase G", state="o")
         except Exception:
             pass
@@ -21,15 +21,72 @@
 import time
 
+from kazoo.exceptions import LockTimeout
+
 import daemon_lib.common as common
 
 from daemon_lib.vm import vm_worker_flush_locks
 
 #
-# Fence thread entry function
+# Fence monitor thread entrypoint
 #
-def fence_node(node_name, zkhandler, config, logger):
+def fence_monitor(zkhandler, config, logger):
+    # Attempt to acquire an exclusive lock on the fence_lock key
+    # If it is already held, we'll abort since another node is processing fences
+    lock = zkhandler.exclusivelock("base.config.fence_lock")
+
+    try:
+        lock.acquire(timeout=config["keepalive_interval"] - 1)
+
+        for node_name in zkhandler.children("base.node"):
+            try:
+                node_daemon_state = zkhandler.read(("node.state.daemon", node_name))
+                node_keepalive = int(zkhandler.read(("node.keepalive", node_name)))
+            except Exception:
+                node_daemon_state = "unknown"
+                node_keepalive = 0
+
+            node_deadtime = int(time.time()) - (
+                int(config["keepalive_interval"]) * int(config["fence_intervals"])
+            )
+            if node_keepalive < node_deadtime and node_daemon_state == "run":
+                logger.out(
+                    f"Node {node_name} seems dead; starting monitor for fencing",
+                    state="w",
+                )
+                zk_lock = zkhandler.writelock(("node.state.daemon", node_name))
+                with zk_lock:
+                    # Ensures that, if we lost the lock race and come out of waiting,
+                    # we won't try to trigger our own fence thread.
+                    if zkhandler.read(("node.state.daemon", node_name)) != "dead":
+                        # Write the updated data after we start the fence thread
+                        zkhandler.write([(("node.state.daemon", node_name), "dead")])
+                        # Start the fence monitoring task for this node
+                        # NOTE: This is not a subthread and is designed to block this for loop
+                        # This ensures that only one node is ever being fenced at a time
+                        fence_node(zkhandler, config, logger, node_name)
+            else:
+                logger.out(
+                    f"Node {node_name} is OK; last checkin is {node_deadtime - node_keepalive}s from threshold, node state is '{node_daemon_state}'",
+                    state="d",
+                    prefix="fence-thread",
+                )
+    except LockTimeout:
+        logger.out(
+            "Fence monitor thread failed to acquire exclusive lock; skipping", state="i"
+        )
+    except Exception as e:
+        logger.out(f"Fence monitor thread failed: {e}", state="w")
+    finally:
+        # We're finished, so release the global lock
+        lock.release()
+
+
+#
+# Fence action function
+#
+def fence_node(zkhandler, config, logger, node_name):
     # We allow exactly 6 saving throws (30 seconds) for the host to come back online or we kill it
     failcount_limit = 6
     failcount = 0
@@ -190,7 +247,7 @@ def migrateFromFencedNode(zkhandler, node_name, config, logger):
             )
             zkhandler.write(
                 {
-                    (("domain.state", dom_uuid), "stopped"),
+                    (("domain.state", dom_uuid), "stop"),
                     (("domain.meta.autostart", dom_uuid), "True"),
                 }
             )
@@ -202,6 +259,9 @@ def migrateFromFencedNode(zkhandler, node_name, config, logger):
     # Loop through the VMs
     for dom_uuid in dead_node_running_domains:
+        if dom_uuid in ["0", 0]:
+            # Skip the invalid "0" UUID we sometimes get
+            continue
         try:
             fence_migrate_vm(dom_uuid)
         except Exception as e:
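
The exclusive fence_lock is what serializes fencing cluster-wide: whichever coordinator grabs it walks all nodes and fences the dead ones one at a time, while the other coordinators hit LockTimeout and skip the cycle. A minimal standalone sketch of that pattern using a raw kazoo lock; the connection string and node path are hypothetical, while PVC itself goes through its zkhandler wrappers:

    from kazoo.client import KazooClient
    from kazoo.exceptions import LockTimeout

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    fence_lock = zk.Lock("/config/fence_lock")  # mirrors the new config.fence_lock key
    try:
        fence_lock.acquire(timeout=4)
        # ... check keepalives and fence dead nodes sequentially ...
    except LockTimeout:
        pass  # another coordinator holds the lock; try again next interval
    finally:
        fence_lock.release()
        zk.stop()
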
@@ -756,29 +756,21 @@ def node_keepalive(logger, config, zkhandler, this_node, netstats):
     # Join against running threads
     if config["enable_hypervisor"]:
-        vm_stats_thread.join(timeout=config["keepalive_interval"])
+        vm_stats_thread.join(timeout=config["keepalive_interval"] - 1)
         if vm_stats_thread.is_alive():
             logger.out("VM stats gathering exceeded timeout, continuing", state="w")
     if config["enable_storage"]:
-        ceph_stats_thread.join(timeout=config["keepalive_interval"])
+        ceph_stats_thread.join(timeout=config["keepalive_interval"] - 1)
         if ceph_stats_thread.is_alive():
             logger.out("Ceph stats gathering exceeded timeout, continuing", state="w")
 
     # Get information from thread queues
     if config["enable_hypervisor"]:
         try:
-            this_node.domains_count = vm_thread_queue.get(
-                timeout=config["keepalive_interval"]
-            )
-            this_node.memalloc = vm_thread_queue.get(
-                timeout=config["keepalive_interval"]
-            )
-            this_node.memprov = vm_thread_queue.get(
-                timeout=config["keepalive_interval"]
-            )
-            this_node.vcpualloc = vm_thread_queue.get(
-                timeout=config["keepalive_interval"]
-            )
+            this_node.domains_count = vm_thread_queue.get(timeout=0.1)
+            this_node.memalloc = vm_thread_queue.get(timeout=0.1)
+            this_node.memprov = vm_thread_queue.get(timeout=0.1)
+            this_node.vcpualloc = vm_thread_queue.get(timeout=0.1)
         except Exception:
             logger.out("VM stats queue get exceeded timeout, continuing", state="w")
     else:
@@ -789,9 +781,7 @@ def node_keepalive(logger, config, zkhandler, this_node, netstats):
     if config["enable_storage"]:
         try:
-            osds_this_node = ceph_thread_queue.get(
-                timeout=(config["keepalive_interval"] - 1)
-            )
+            osds_this_node = ceph_thread_queue.get(timeout=0.1)
         except Exception:
             logger.out("Ceph stats queue get exceeded timeout, continuing", state="w")
             osds_this_node = "?"
@@ -887,44 +877,12 @@ def node_keepalive(logger, config, zkhandler, this_node, netstats):
     )
 
     # Look for dead nodes and fence them
-    if not this_node.maintenance:
+    if not this_node.maintenance and config["daemon_mode"] == "coordinator":
         logger.out(
             "Look for dead nodes and fence them", state="d", prefix="main-thread"
         )
-        if config["daemon_mode"] == "coordinator":
-            for node_name in zkhandler.children("base.node"):
-                try:
-                    node_daemon_state = zkhandler.read(("node.state.daemon", node_name))
-                    node_keepalive = int(zkhandler.read(("node.keepalive", node_name)))
-                except Exception:
-                    node_daemon_state = "unknown"
-                    node_keepalive = 0
-
-                # Handle deadtime and fencing if needed
-                # (A node is considered dead when its keepalive timer is >6*keepalive_interval seconds
-                # out-of-date while in 'start' state)
-                node_deadtime = int(time.time()) - (
-                    int(config["keepalive_interval"]) * int(config["fence_intervals"])
-                )
-                if node_keepalive < node_deadtime and node_daemon_state == "run":
-                    logger.out(
-                        "Node {} seems dead - starting monitor for fencing".format(
-                            node_name
-                        ),
-                        state="w",
-                    )
-                    zk_lock = zkhandler.writelock(("node.state.daemon", node_name))
-                    with zk_lock:
-                        # Ensures that, if we lost the lock race and come out of waiting,
-                        # we won't try to trigger our own fence thread.
-                        if zkhandler.read(("node.state.daemon", node_name)) != "dead":
-                            fence_thread = Thread(
-                                target=pvcnoded.util.fencing.fence_node,
-                                args=(node_name, zkhandler, config, logger),
-                                kwargs={},
-                            )
-                            fence_thread.start()
-                            # Write the updated data after we start the fence thread
-                            zkhandler.write(
-                                [(("node.state.daemon", node_name), "dead")]
-                            )
+        fence_monitor_thread = Thread(
+            target=pvcnoded.util.fencing.fence_monitor,
+            args=(zkhandler, config, logger),
+        )
+        fence_monitor_thread.start()
@@ -102,5 +102,5 @@ def start_system_services(logger, config):
     start_workerd(logger, config)
     start_healthd(logger, config)
 
-    logger.out("Waiting 5 seconds for daemons to start", state="s")
-    sleep(5)
+    logger.out("Waiting 10 seconds for daemons to start", state="s")
+    sleep(10)
@@ -188,3 +188,6 @@ def setup_node(logger, config, zkhandler):
             (("node.count.networks", config["node_hostname"]), "0"),
         ]
     )
+
+    logger.out("Waiting 5 seconds for Zookeeper to synchronize", state="s")
+    time.sleep(5)
@@ -33,6 +33,9 @@ from daemon_lib.vm import (
     vm_worker_rollback_snapshot,
     vm_worker_export_snapshot,
     vm_worker_import_snapshot,
+    vm_worker_send_snapshot,
+    vm_worker_create_mirror,
+    vm_worker_promote_mirror,
 )
 
 from daemon_lib.ceph import (
     osd_worker_add_osd,
@@ -52,7 +55,7 @@ from daemon_lib.autobackup import (
 )
 
 # Daemon version
-version = "0.9.100"
+version = "0.9.103"
 
 config = cfg.get_configuration()
@@ -96,12 +99,12 @@ def create_vm(
 @celery.task(name="storage.benchmark", bind=True, routing_key="run_on")
-def storage_benchmark(self, pool=None, run_on="primary"):
+def storage_benchmark(self, pool=None, name=None, run_on="primary"):
     @ZKConnection(config)
-    def run_storage_benchmark(zkhandler, self, pool):
-        return worker_run_benchmark(zkhandler, self, config, pool)
+    def run_storage_benchmark(zkhandler, self, pool, name):
+        return worker_run_benchmark(zkhandler, self, config, pool, name)
 
-    return run_storage_benchmark(self, pool)
+    return run_storage_benchmark(self, pool, name)
 
 @celery.task(name="cluster.autobackup", bind=True, routing_key="run_on")
@@ -227,6 +230,138 @@ def vm_import_snapshot(
     )
 
@celery.task(name="vm.send_snapshot", bind=True, routing_key="run_on")
def vm_send_snapshot(
self,
domain=None,
snapshot_name=None,
destination_api_uri="",
destination_api_key="",
destination_api_verify_ssl=True,
incremental_parent=None,
destination_storage_pool=None,
run_on="primary",
):
@ZKConnection(config)
def run_vm_send_snapshot(
zkhandler,
self,
domain,
snapshot_name,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=True,
incremental_parent=None,
destination_storage_pool=None,
):
return vm_worker_send_snapshot(
zkhandler,
self,
domain,
snapshot_name,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=destination_api_verify_ssl,
incremental_parent=incremental_parent,
destination_storage_pool=destination_storage_pool,
)
return run_vm_send_snapshot(
self,
domain,
snapshot_name,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=destination_api_verify_ssl,
incremental_parent=incremental_parent,
destination_storage_pool=destination_storage_pool,
)
@celery.task(name="vm.create_mirror", bind=True, routing_key="run_on")
def vm_create_mirror(
self,
domain=None,
destination_api_uri="",
destination_api_key="",
destination_api_verify_ssl=True,
destination_storage_pool=None,
run_on="primary",
):
@ZKConnection(config)
def run_vm_create_mirror(
zkhandler,
self,
domain,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=True,
destination_storage_pool=None,
):
return vm_worker_create_mirror(
zkhandler,
self,
domain,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=destination_api_verify_ssl,
destination_storage_pool=destination_storage_pool,
)
return run_vm_create_mirror(
self,
domain,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=destination_api_verify_ssl,
destination_storage_pool=destination_storage_pool,
)
@celery.task(name="vm.promote_mirror", bind=True, routing_key="run_on")
def vm_promote_mirror(
self,
domain=None,
destination_api_uri="",
destination_api_key="",
destination_api_verify_ssl=True,
destination_storage_pool=None,
remove_on_source=False,
run_on="primary",
):
@ZKConnection(config)
def run_vm_promote_mirror(
zkhandler,
self,
domain,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=True,
destination_storage_pool=None,
remove_on_source=False,
):
return vm_worker_promote_mirror(
zkhandler,
self,
domain,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=destination_api_verify_ssl,
destination_storage_pool=destination_storage_pool,
remove_on_source=remove_on_source,
)
return run_vm_promote_mirror(
self,
domain,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=destination_api_verify_ssl,
destination_storage_pool=destination_storage_pool,
remove_on_source=remove_on_source,
)
@celery.task(name="osd.add", bind=True, routing_key="run_on") @celery.task(name="osd.add", bind=True, routing_key="run_on")
def osd_add( def osd_add(
self, self,
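
The three new tasks follow the existing pattern: a thin Celery wrapper resolves a Zookeeper connection via @ZKConnection and delegates to the daemon_lib worker function. A hedged sketch of enqueueing one of them from a client; the broker URL and all values are placeholders, and PVC's "run_on" routing normally happens through the API client rather than a raw send_task:

    from celery import Celery

    app = Celery("pvcworkerd", broker="redis://10.0.0.1:6379/0")  # hypothetical broker
    app.send_task(
        "vm.send_snapshot",
        kwargs={
            "domain": "test1",
            "snapshot_name": "snap-20241101",
            "destination_api_uri": "https://cluster-b:7370/api/v1",
            "destination_api_key": "secret-key",
            "run_on": "primary",
        },
    )
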