Allow specifying job names for benchmarks

Update benchmarks to include resource utilization
Adds polled information on CPU, memory, and network bandwidth utilization for the node running the test, giving more context for interpreting the results. Also bumps the test format to 2 so that clients can detect and handle the change properly.
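
As a rough illustration of why the format bump matters, a client could branch on the format number before reading the new utilization data. This is only a sketch: the key names `test_format` and `resource_utilization`, the `cpu`/`memory`/`network` sub-keys, and the helper `summarize_benchmark` are assumptions made for the example, not the schema this change actually writes.

```python
from typing import Any

# Hypothetical key names; the real benchmark result schema may differ.
FORMAT_KEY = "test_format"
UTILIZATION_KEY = "resource_utilization"


def summarize_benchmark(result: dict[str, Any]) -> dict[str, Any]:
    """Summarize a benchmark result, tolerating both old and new formats."""
    # Results stored before the bump are assumed to lack the key or carry 1.
    fmt = int(result.get(FORMAT_KEY, 1))
    summary: dict[str, Any] = {"format": fmt, "utilization": None}
    if fmt >= 2:
        # Only format 2+ results are expected to carry node utilization data.
        util = result.get(UTILIZATION_KEY) or {}
        summary["utilization"] = {
            name: util.get(name) for name in ("cpu", "memory", "network")
        }
    return summary
```

Older results then summarize with `utilization` set to `None` instead of raising a `KeyError`, which is the kind of graceful handling the format number is meant to enable.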
2024-09-18 14:55:12 -04:00 · 2024-09-18 14:32:03 -04:00 · 2024-09-18 10:18:50 -04:00 · 2024-09-09 13:20:03 -04:00 · 2024-09-06 11:40:39 -04:00 · 2024-09-05 16:05:33 -04:00
141 changed files with 21471 additions and 6671 deletions
--- a/.bbuilder-tasks.yaml
+++ b/.bbuilder-tasks.yaml
@ -4,4 +4,4 @@ bbuilder:
    published:
      - git submodule update --init
      - /bin/bash build-stable-deb.sh
-      - sudo /usr/local/bin/deploy-package -C pvc
+      - sudo /usr/local/bin/deploy-package -C pvc -D bookworm
--- a/.file-header
+++ b/.file-header
@ -3,7 +3,7 @@
 # <Filename> - <Description>
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/.flake8
+++ b/.flake8
@ -8,7 +8,7 @@
 ignore = W503, E501, F403, F405
 extend-ignore = E203
 # We exclude the Debian, migrations, and provisioner examples
-exclude = debian,api-daemon/migrations/versions,api-daemon/provisioner/examples,node-daemon/monitoring
+exclude = debian,monitoring,api-daemon/migrations/versions,api-daemon/provisioner/examples
 # Set the max line length to 88 for Black
 max-line-length = 88

--- a/.version
+++ b/.version
@ -1 +1 @@
-0.9.83
+0.9.100
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -1,5 +1,140 @@
 ## PVC Changelog

+###### [v0.9.100](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.100)
+
+  * [API Daemon] Improves the handling of "detect:" disk strings on newer systems by leveraging the "nvme" command
+  * [Client CLI] Update help text about "detect:" disk strings
+  * [Meta] Updates deprecation warnings and updates builder to only add this version for Debian 12 (Bookworm)
+
+###### [v0.9.99](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.99)
+
+  **Deprecation Warning**: `pvc vm backup` commands are now deprecated and will be removed in a future version. Use `pvc vm snapshot` commands instead.
+  **Breaking Change**: The on-disk format of VM snapshot exports differs from backup exports, and the PVC autobackup system now leverages these. It is recommended to start fresh with a new tree of backups for `pvc autobackup` for maximum compatibility.
+  **Breaking Change**: VM autobackups now run in `pvcworkerd` instead of the CLI client directly, allowing them to be triggered from any node (or externally). It is important to apply the timer unit changes from the `pvc-ansible` role after upgrading to 0.9.99 to avoid duplicate runs.
+  **Usage Note**: VM snapshots are displayed in the `pvc vm list` and `pvc vm info` outputs, not in a unique "list" endpoint.
+
+  * [API Daemon] Adds a proper error when an invalid provisioner profile is specified
+  * [Node Daemon] Sorts Ceph pools properly in node keepalive to avoid incorrect ordering
+  * [Health Daemon] Improves handling of IPMI checks by adding multiple tries but a shorter timeout
+  * [API Daemon] Improves handling of XML parsing errors in VM configurations
+  * [ALL] Adds support for whole VM snapshots, including configuration XML details, and direct rollback to snapshots
+  * [ALL] Adds support for exporting and importing whole VM snapshots
+  * [Client CLI] Removes vCPU topology from short VM info output
+  * [Client CLI] Improves output format of VM info output
+  * [API Daemon] Adds an endpoint to get the current primary node
+  * [Client CLI] Fixes a bug where API requests were made 3 times
+  * [Other] Improves the build-and-deploy.sh script
+  * [API Daemon] Improves the "vm rename" command to avoid redefining VM, preserving history etc.
+  * [API Daemon] Adds an indication when a task is run on the primary node
+  * [API Daemon] Fixes a bug where the ZK schema relative path didn't work sometimes
+
+###### [v0.9.98](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.98)
+
+  * [CLI Client] Fixed output when API call times out
+  * [Node Daemon] Improves the handling of fence states
+  * [API Daemon/CLI Client] Adds support for storage snapshot rollback
+  * [CLI Client] Adds additional warning messages about snapshot consistency to help output
+  * [API Daemon] Fixes a bug listing snapshots by pool/volume
+  * [Node Daemon] Adds a --version flag for information gathering by update-motd.sh
+
+###### [v0.9.97](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.97)
+
+  * [Client CLI] Ensures --lines is always an integer value
+  * [Node Daemon] Fixes a bug if d_network changes during iteration
+  * [Node Daemon] Moves to using allocated instead of free memory for node reporting
+  * [API Daemon] Fixes a bug if lingering RBD snapshots exist when removing a volume (#180)
+
+###### [v0.9.96](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.96)
+
+  * [API Daemon] Fixes a bug when reporting node stats
+  * [API Daemon] Fixes a bug deleting successful benchmark results
+
+###### [v0.9.95](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.95)
+
+  * [API Daemon/CLI Client] Adds a flag to allow duplicate VNIs in network templates
+  * [API Daemon] Ensures that storage template disks are returned in disk ID order
+  * [Client CLI] Fixes a display bug showing all OSDs as split
+
+###### [v0.9.94](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.94)
+
+  * [CLI Client] Fixes an incorrect ordering issue with autobackup summary emails
+  * [API Daemon/CLI Client] Adds an additional safety check for 80% cluster fullness when doing volume adds or resizes
+  * [API Daemon/CLI Client] Adds safety checks to volume clones as well
+  * [API Daemon] Fixes a few remaining memory bugs for stopped/disabled VMs
+
+###### [v0.9.93](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.93)
+
+  * [API Daemon] Fixes a bug where stuck zkhandler threads were not cleaned up on error
+
+###### [v0.9.92](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.92)
+
+  * [CLI Client] Adds the new restore state to the colours list for VM status
+  * [API Daemon] Fixes an incorrect variable assignment
+  * [Provisioner] Improves the error handling of various steps in the debootstrap and rinse example scripts
+  * [CLI Client] Fixes two bugs around missing keys that were added recently (uses get() instead of direct dictionary refs)
+  * [CLI Client] Improves API error handling via GET retries (x3) and better server status code handling
+
+###### [v0.9.91](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.91)
+
+  * [Client CLI] Fixes a bug and improves output during cluster task events.
+  * [Client CLI] Improves the output of the task list display.
+  * [Provisioner] Fixes some missing cloud-init modules in the default debootstrap script.
+  * [Client CLI] Fixes a bug with a missing argument to the vm_define helper function.
+  * [All] Fixes inconsistent package find + rm commands to avoid errors in dpkg.
+
+###### [v0.9.90](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.90)
+
+  * [Client CLI/API Daemon] Adds additional backup metainfo and an emailed report option to autobackups.
+  * [All] Adds a live migration maximum downtime selector to help with busy VM migrations.
+  * [API Daemon] Fixes a database migration bug on Debian 10/11.
+  * [Node Daemon] Fixes a race condition when applying Zookeeper schema changes.
+
+###### [v0.9.89](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.89)
+
+  * [API/Worker Daemons] Fixes a bug with the Celery result backends not being properly initialized on Debian 10/11.
+  * [API Daemon] Fixes a bug if VM CPU stats are missing on Debian 10.
+
+###### [v0.9.88](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.88)
+
+  * [API Daemon] Adds an additional Prometheus metrics proxy for Zookeeper stats.
+  * [API Daemon] Adds a new configuration to enable or disable metric endpoints if desired, defaulting to enabled.
+  * [API Daemon] Alters and adjusts the metrics output for VMs to complement new dashboard.
+  * [CLI Client] Adds a "json-prometheus" output format to "pvc connection list" to auto-generate file SD configs.
+  * [Monitoring] Adds a new VM dashboard, updates the Cluster dashboard, and adds a README.
+
+###### [v0.9.87](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.87)
+
+  * [API Daemon] Adds cluster Prometheus resource utilization metrics and an updated Grafana dashboard.
+  * [Node Daemon] Adds network traffic rate calculation subsystem.
+  * [All Daemons] Fixes a printing bug where newlines were not added atomically.
+  * [CLI Client] Fixes a bug listing connections if no default is specified.
+  * [All Daemons] Simplifies debug logging conditionals by moving into the Logger instance itself.
+
+###### [v0.9.86](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.86)
+
+  * [API Daemon] Significantly improves the performance of several commands via async Zookeeper calls and removal of superfluous backend calls.
+  * [Docs] Improves the project README and updates screenshot images to show the current output and more functionality.
+  * [API Daemon/CLI] Corrects some bugs in VM metainformation output.
+  * [Node Daemon] Fixes resource reporting bugs from 0.9.81 and properly clears node resource numbers on a fence.
+  * [Health Daemon] Adds a wait during pvchealthd startup until the node is in run state, to avoid erroneous faults during node bootup.
+  * [API Daemon] Fixes an incorrect reference to legacy pvcapid.yaml file in migration script.
+
+###### [v0.9.85](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.85)
+
+  * [Packaging] Fixes a dependency bug introduced in 0.9.84
+  * [Node Daemon] Fixes an output bug during keepalives
+  * [Node Daemon] Fixes a bug in the example Prometheus Grafana dashboard
+
+###### [v0.9.84](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.84)
+
+  **Breaking Changes:** This release features a major reconfiguration to how monitoring and reporting of the cluster health works. Node health plugins now report "faults", as do several other issues which were previously manually checked for in the "cluster" daemon library for the "/status" endpoint, from within the Health daemon. These faults are persistent, and under each given identifier can be triggered once and subsequent triggers simply update the "last reported" time. An additional set of API endpoints and commands are added to manage these faults, either by "ack"(nowledging) them (keeping the alert around to be further updated but setting its health delta to 0%), or "delete"ing them (completely removing the fault unless it retriggers), both individually, to (from the CLI) multiple, or all. Cluster health reporting is now done based on these faults instead of anything else, and the default interval for health checks is reduced to 15 seconds to accommodate this. In addition to this, Prometheus metrics have been added, along with an example Grafana dashboard, for the PVC cluster itself, as well as a proxy to the Ceph cluster metrics. This release also fixes some bugs in the VM provisioner that were introduced in 0.9.83; these fixes require a **reimport or reconfiguration of any provisioner scripts**; reference the updated examples for details.
+
+  * [All] Adds persistent fault reporting to clusters, replacing the old cluster health calculations.
+  * [API Daemon] Adds cluster-level Prometheus metric exporting as well as a Ceph Prometheus proxy to the API.
+  * [CLI Client] Improves formatting output of "pvc cluster status".
+  * [Node Daemon] Fixes several bugs and enhances the working of the psql health check plugin.
+  * [Worker Daemon] Fixes several bugs in the example provisioner scripts, and moves the libvirt_schema library into the daemon common libraries.
+
 ###### [v0.9.83](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.83)

   **Breaking Changes:** This release features a breaking change for the daemon config. A new unified "pvc.conf" file is required for all daemons (and the CLI client for Autobackup and API-on-this-host functionality), which will be written by the "pvc" role in the PVC Ansible framework. Using the "update-pvc-daemons" oneshot playbook from PVC Ansible is **required** to update to this release, as it will ensure this file is written to the proper place before deploying the new package versions, and also ensures that the old entries are cleaned up afterwards. In addition, this release fully splits the node worker and health subsystems into discrete daemons ("pvcworkerd" and "pvchealthd") and packages ("pvc-daemon-worker" and "pvc-daemon-health") respectively. The "pvc-daemon-node" package also now depends on both packages, and the "pvc-daemon-api" package can now be reliably used outside of the PVC nodes themselves (for instance, in a VM) without any strange cross-dependency issues.
--- a/README.md
+++ b/README.md
@ -1,5 +1,5 @@
 <p align="center">
-<img alt="Logo banner" src="docs/images/pvc_logo_black.png"/>
+<img alt="Logo banner" src="images/pvc_logo_black.png"/>
 <br/><br/>
 <a href="https://github.com/parallelvirtualcluster/pvc"><img alt="License" src="https://img.shields.io/github/license/parallelvirtualcluster/pvc"/></a>
 <a href="https://github.com/psf/black"><img alt="Code style: Black" src="https://img.shields.io/badge/code%20style-black-000000.svg"/></a>
@ -19,41 +19,66 @@ As a consequence of its features, PVC makes administrating very high-uptime VMs

 PVC also features an optional, fully customizable VM provisioning framework, designed to automate and simplify VM deployments using custom provisioning profiles, scripts, and CloudInit userdata API support.

-Installation of PVC is accomplished by two main components: a [Node installer ISO](https://github.com/parallelvirtualcluster/pvc-installer) which creates on-demand installer ISOs, and an [Ansible role framework](https://github.com/parallelvirtualcluster/pvc-ansible) to configure, bootstrap, and administrate the nodes. Installation can also be fully automated with a companion [cluster bootstrapping system](https://github.com/parallelvirtualcluster/pvc-bootstrap). Once up, the cluster is managed via an HTTP REST API, accessible via a Python Click CLI client or WebUI.
+Installation of PVC is accomplished by two main components: a [Node installer ISO](https://github.com/parallelvirtualcluster/pvc-installer) which creates on-demand installer ISOs, and an [Ansible role framework](https://github.com/parallelvirtualcluster/pvc-ansible) to configure, bootstrap, and administrate the nodes. Installation can also be fully automated with a companion [cluster bootstrapping system](https://github.com/parallelvirtualcluster/pvc-bootstrap). Once up, the cluster is managed via an HTTP REST API, accessible via a Python Click CLI client ~~or WebUI~~ (eventually).

 Just give it physical servers, and it will run your VMs without you having to think about it, all in just an hour or two of setup time.

-
-## What is it based on?
-
-The core node and API daemons, as well as the CLI API client, are written in Python 3 and are fully Free Software (GNU GPL v3). In addition to these, PVC makes use of the following software tools to provide a holistic hyperconverged infrastructure solution:
-
-  * Debian GNU/Linux as the base OS.
-  * Linux KVM, QEMU, and Libvirt for VM management.
-  * Linux `ip`, FRRouting, NFTables, DNSMasq, and PowerDNS for network management.
-  * Ceph for storage management.
-  * Apache Zookeeper for the primary cluster state database.
-  * Patroni PostgreSQL manager for the secondary relation databases (DNS aggregation, Provisioner configuration).
-
-
 ## Getting Started

-To get started with PVC, please see the [About](https://docs.parallelvirtualcluster.org/en/latest/about/) page for general information about the project, and the [Getting Started](https://docs.parallelvirtualcluster.org/en/latest/getting-started/) page for details on configuring your first cluster.
-
+To get started with PVC, please see the [About](https://docs.parallelvirtualcluster.org/en/latest/about-pvc/) page for general information about the project, and the [Getting Started](https://docs.parallelvirtualcluster.org/en/latest/deployment/getting-started/) page for details on configuring your first cluster.

 ## Changelog

-View the changelog in [CHANGELOG.md](CHANGELOG.md).
-
+View the changelog in [CHANGELOG.md](CHANGELOG.md). **Please note that any breaking changes are announced here; ensure you read the changelog before upgrading!**

 ## Screenshots

-While PVC's API and internals aren't very screenshot-worthy, here is some example output of the CLI tool.
+These screenshots show some of the available functionality of the PVC system and CLI as of PVC v0.9.85.

-<p><img alt="Node listing" src="docs/images/pvc-nodes.png"/><br/><i>Listing the nodes in a cluster</i></p>
+<p><img alt="0. Integrated help" src="images/0-integrated-help.png"/><br/>
+<i>The CLI features an integrated, fully-featured help system to show details about every possible command.</i>
+</p>

-<p><img alt="Network listing" src="docs/images/pvc-networks.png"/><br/><i>Listing the networks in a cluster, showing 3 bridged and 1 IPv4-only managed networks</i></p>
+<p><img alt="1. Connection management" src="images/1-connection-management.png"/><br/>
+<i>A single CLI instance can manage multiple clusters, including a quick detail view, and will default to a "local" connection if an "/etc/pvc/pvc.conf" file is found; sensitive API keys are hidden by default.</i>
+</p>

-<p><img alt="VM listing and migration" src="docs/images/pvc-migration.png"/><br/><i>Listing a limited set of VMs and migrating one with status updates</i></p>
+<p><img alt="2. Cluster details and output formats" src="images/2-cluster-details-and-output-formats.png"/><br/>
+<i>PVC can show the key details of your cluster at a glance, including health, persistent fault events, and key resources; the CLI can output both in pretty human format and JSON for easier machine parsing in scripts.</i>
+</p>

-<p><img alt="Node logs" src="docs/images/pvc-nodelog.png"/><br/><i>Viewing the logs of a node (keepalives and VM [un]migration)</i></p>
+<p><img alt="3. Node information" src="images/3-node-information.png"/><br/>
+<i>PVC can show details about the nodes in the cluster, including their live health and resource utilization.</i>
+</p>
+
+<p><img alt="4. VM information" src="images/4-vm-information.png"/><br/>
+<i>PVC can show details about the VMs in the cluster, including their state, resource allocations, current hosting node, and metadata.</i>
+</p>
+
+<p><img alt="5. VM details" src="images/5-vm-details.png"/><br/>
+<i>In addition to the above basic details, PVC can also show extensive information about a running VM's devices and other resource utilization.</i>
+</p>
+
+<p><img alt="6. Network information" src="images/6-network-information.png"/><br/>
+<i>PVC has two major client network types, and ensures a consistent configuration of client networks across the entire cluster; managed networks can feature DHCP, DNS, firewall, and other functionality including DHCP reservations.</i>
+</p>
+
+<p><img alt="7. Storage information" src="images/7-storage-information.png"/><br/>
+<i>PVC provides a convenient abstracted view of the underlying Ceph system and can manage all core aspects of it.</i>
+</p>
+
+<p><img alt="8. VM and node logs" src="images/8-vm-and-node-logs.png"/><br/>
+<i>PVC can display logs from VM serial consoles (if properly configured) and nodes in-client to facilitate quick troubleshooting.</i>
+</p>
+
+<p><img alt="9. VM and worker tasks" src="images/9-vm-and-worker-tasks.png"/><br/>
+<i>PVC provides full VM lifecycle management, as well as long-running worker-based commands (in this example, clearing a VM's storage locks).</i>
+</p>
+
+<p><img alt="10. Provisioner" src="images/10-provisioner.png"/><br/>
+<i>PVC features an extensively customizable and configurable VM provisioner system, including EC2-compatible CloudInit support, allowing you to define flexible VM profiles and provision new VMs with a single command.</i>
+</p>
+
+<p><img alt="11. Prometheus and Grafana dashboard" src="images/11-prometheus-grafana.png"/><br/>
+<i>PVC features several monitoring integration examples under "node-daemon/monitoring", including CheckMK, Munin, and, most recently, Prometheus, including an example Grafana dashboard for cluster monitoring and alerting.</i>
+</p>
--- a/api-daemon/migrations/versions/977e7b4d3497_pvc_version_0_9_89.py
+++ b/api-daemon/migrations/versions/977e7b4d3497_pvc_version_0_9_89.py
@ -0,0 +1,28 @@
+"""PVC version 0.9.89
+
+Revision ID: 977e7b4d3497
+Revises: 88fa0d88a9f8
+Create Date: 2024-01-10 16:09:44.659027
+
+"""
+from alembic import op
+import sqlalchemy as sa
+
+
+# revision identifiers, used by Alembic.
+revision = '977e7b4d3497'
+down_revision = '88fa0d88a9f8'
+branch_labels = None
+depends_on = None
+
+
+def upgrade():
+    # ### commands auto generated by Alembic - please adjust! ###
+    op.add_column('system_template', sa.Column('migration_max_downtime', sa.Integer(), default="300", server_default="300", nullable=True))
+    # ### end Alembic commands ###
+
+
+def downgrade():
+    # ### commands auto generated by Alembic - please adjust! ###
+    op.drop_column('system_template', 'migration_max_downtime')
+    # ### end Alembic commands ###
--- a/api-daemon/provisioner/examples/script/1-noop.py
+++ b/api-daemon/provisioner/examples/script/1-noop.py
@ -3,7 +3,7 @@
 # 1-noop.py - PVC Provisioner example script for noop install
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -147,7 +147,7 @@


 # This import is always required here, as VMBuilder is used by the VMBuilderScript class.
-from pvcapid.vmbuilder import VMBuilder
+from daemon_lib.vmbuilder import VMBuilder


 # The VMBuilderScript class must be named as such, and extend VMBuilder.
@ -174,7 +174,7 @@ class VMBuilderScript(VMBuilder):
        """

        # Run any imports first
-        import pvcapid.libvirt_schema as libvirt_schema
+        import daemon_lib.libvirt_schema as libvirt_schema
        import datetime
        import random

--- a/api-daemon/provisioner/examples/script/2-ova.py
+++ b/api-daemon/provisioner/examples/script/2-ova.py
@ -3,7 +3,7 @@
 # 2-ova.py - PVC Provisioner example script for OVA profile install
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -148,7 +148,7 @@


 # This import is always required here, as VMBuilder is used by the VMBuilderScript class.
-from pvcapid.vmbuilder import VMBuilder
+from daemon_lib.vmbuilder import VMBuilder


 # The VMBuilderScript class must be named as such, and extend VMBuilder.
@ -177,7 +177,7 @@ class VMBuilderScript(VMBuilder):
        """

        # Run any imports first
-        import pvcapid.libvirt_schema as libvirt_schema
+        import daemon_lib.libvirt_schema as libvirt_schema
        import datetime
        import random

@ -289,8 +289,8 @@ class VMBuilderScript(VMBuilder):
        """

        # Run any imports first
-        from pvcapid.vmbuilder import open_zk
-        from pvcapid.Daemon import config
+        from daemon_lib.vmbuilder import open_zk
+        from pvcworkerd.Daemon import config
        import daemon_lib.common as pvc_common
        import daemon_lib.ceph as pvc_ceph
        import os
@ -383,8 +383,8 @@ class VMBuilderScript(VMBuilder):
        """

        # Run any imports first
-        from pvcapid.vmbuilder import open_zk
-        from pvcapid.Daemon import config
+        from daemon_lib.vmbuilder import open_zk
+        from pvcworkerd.Daemon import config
        import daemon_lib.ceph as pvc_ceph

        for volume in list(reversed(self.vm_data["volumes"])):
--- a/api-daemon/provisioner/examples/script/3-debootstrap.py
+++ b/api-daemon/provisioner/examples/script/3-debootstrap.py
@ -3,7 +3,7 @@
 # 3-debootstrap.py - PVC Provisioner example script for debootstrap install
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -147,7 +147,11 @@


 # This import is always required here, as VMBuilder is used by the VMBuilderScript class.
-from pvcapid.vmbuilder import VMBuilder
+from daemon_lib.vmbuilder import VMBuilder
+
+
+# These are some global variables used below
+default_root_password = "test123"


 # The VMBuilderScript class must be named as such, and extend VMBuilder.
@ -186,7 +190,7 @@ class VMBuilderScript(VMBuilder):
        """

        # Run any imports first
-        import pvcapid.libvirt_schema as libvirt_schema
+        import daemon_lib.libvirt_schema as libvirt_schema
        import datetime
        import random

@ -301,16 +305,16 @@ class VMBuilderScript(VMBuilder):
        This function should use the various exposed PVC commands as indicated to create
        RBD block devices and map them to the host as required.

-        open_zk is exposed from pvcapid.vmbuilder to provide a context manager for opening
+        open_zk is exposed from daemon_lib.vmbuilder to provide a context manager for opening
        connections to the PVC Zookeeper cluster; ensure you also import (and pass it)
-        the config object from pvcapid.Daemon as well. This context manager then allows
+        the config object from pvcworkerd.Daemon as well. This context manager then allows
        the use of various common daemon library functions, without going through the API.
        """

        # Run any imports first
        import os
-        from pvcapid.vmbuilder import open_zk
-        from pvcapid.Daemon import config
+        from daemon_lib.vmbuilder import open_zk
+        from pvcworkerd.Daemon import config
        import daemon_lib.common as pvc_common
        import daemon_lib.ceph as pvc_ceph

@ -446,7 +450,7 @@ class VMBuilderScript(VMBuilder):

        # Run any imports first
        import os
-        from pvcapid.vmbuilder import chroot
+        from daemon_lib.vmbuilder import chroot

        # The directory we mounted things on earlier during prepare(); this could very well
        # be exposed as a module-level variable if you so choose
@ -498,11 +502,15 @@ class VMBuilderScript(VMBuilder):
        ret = os.system(
            f"debootstrap --include={','.join(deb_packages)} {deb_release} {temp_dir} {deb_mirror}"
        )
+        ret = int(ret >> 8)
        if ret > 0:
-            self.fail("Failed to run debootstrap")
+            self.fail(f"Debootstrap failed with exit code {ret}")

        # Bind mount the devfs so we can grub-install later
-        os.system("mount --bind /dev {}/dev".format(temp_dir))
+        ret = os.system("mount --bind /dev {}/dev".format(temp_dir))
+        ret = int(ret >> 8)
+        if ret > 0:
+            self.fail(f"/dev bind mount failed with exit code {ret}")

        # Create an fstab entry for each volume
        fstab_file = "{}/etc/fstab".format(temp_dir)
@ -589,11 +597,13 @@ After=multi-user.target
                 - migrator
                 - bootcmd
                 - write-files
+                 - growpart
                 - resizefs
                 - set_hostname
                 - update_hostname
                 - update_etc_hosts
                 - ca-certs
+                 - users-groups
                 - ssh
                
                cloud_config_modules:
@ -686,23 +696,36 @@ GRUB_DISABLE_LINUX_UUID=false
        # Do some tasks inside the chroot using the provided context manager
        with chroot(temp_dir):
            # Install and update GRUB
-            os.system(
+            ret = os.system(
                "grub-install --force /dev/rbd/{}/{}_{}".format(
                    root_volume["pool"], vm_name, root_volume["disk_id"]
                )
            )
-            os.system("update-grub")
+            ret = int(ret >> 8)
+            if ret > 0:
+                self.fail(f"GRUB install failed with exit code {ret}")
+
+            ret = os.system("update-grub")
+            ret = int(ret >> 8)
+            if ret > 0:
+                self.fail(f"GRUB update failed with exit code {ret}")

            # Set a really dumb root password so the VM can be debugged
            # EITHER CHANGE THIS YOURSELF, here or in Userdata, or run something after install
            # to change the root password: don't leave it like this on an Internet-facing machine!
-            os.system("echo root:test123 | chpasswd")
+            ret = os.system(f"echo root:{default_root_password} | chpasswd")
+            ret = int(ret >> 8)
+            if ret > 0:
+                self.fail(f"Root password change failed with exit code {ret}")

            # Enable cloud-init target on (first) boot
            # Your user-data should handle this and disable it once done, or things get messy.
            # That cloud-init won't run without this hack seems like a bug... but even the official
            # Debian cloud images are affected, so who knows.
-            os.system("systemctl enable cloud-init.target")
+            ret = os.system("systemctl enable cloud-init.target")
+            ret = int(ret >> 8)
+            if ret > 0:
+                self.fail(f"Enable of cloud-init failed with exit code {ret}")

    def cleanup(self):
        """
@ -718,8 +741,8 @@ GRUB_DISABLE_LINUX_UUID=false

        # Run any imports first
        import os
-        from pvcapid.vmbuilder import open_zk
-        from pvcapid.Daemon import config
+        from daemon_lib.vmbuilder import open_zk
+        from pvcworkerd.Daemon import config
        import daemon_lib.common as pvc_common
        import daemon_lib.ceph as pvc_ceph

@ -727,7 +750,7 @@ GRUB_DISABLE_LINUX_UUID=false
        temp_dir = "/tmp/target"

        # Unmount the bound devfs
-        os.system("umount {}/dev".format(temp_dir))
+        os.system("umount -f {}/dev".format(temp_dir))

        # Use this construct for reversing the list, as the normal reverse() messes with the list
        for volume in list(reversed(self.vm_data["volumes"])):
@ -744,7 +767,7 @@ GRUB_DISABLE_LINUX_UUID=false
            ):
                # Unmount filesystem
                retcode, stdout, stderr = pvc_common.run_os_command(
-                    f"umount {mount_path}"
+                    f"umount -f {mount_path}"
                )
                if retcode:
                    self.log_err(
--- a/api-daemon/provisioner/examples/script/4-rinse.py
+++ b/api-daemon/provisioner/examples/script/4-rinse.py
@ -3,7 +3,7 @@
 # 4-rinse.py - PVC Provisioner example script for rinse install
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -147,7 +147,12 @@


 # This import is always required here, as VMBuilder is used by the VMBuilderScript class.
-from pvcapid.vmbuilder import VMBuilder
+from daemon_lib.vmbuilder import VMBuilder
+
+
+# These are some global variables used below
+default_root_password = "test123"
+default_local_time = "UTC"


 # The VMBuilderScript class must be named as such, and extend VMBuilder.
@ -186,7 +191,7 @@ class VMBuilderScript(VMBuilder):
        """

        # Run any imports first
-        import pvcapid.libvirt_schema as libvirt_schema
+        import daemon_lib.libvirt_schema as libvirt_schema
        import datetime
        import random

@ -301,16 +306,16 @@ class VMBuilderScript(VMBuilder):
        This function should use the various exposed PVC commands as indicated to create
        RBD block devices and map them to the host as required.

-        open_zk is exposed from pvcapid.vmbuilder to provide a context manager for opening
+        open_zk is exposed from daemon_lib.vmbuilder to provide a context manager for opening
        connections to the PVC Zookeeper cluster; ensure you also import (and pass it)
-        the config object from pvcapid.Daemon as well. This context manager then allows
+        the config object from pvcworkerd.Daemon as well. This context manager then allows
        the use of various common daemon library functions, without going through the API.
        """

        # Run any imports first
        import os
-        from pvcapid.vmbuilder import open_zk
-        from pvcapid.Daemon import config
+        from daemon_lib.vmbuilder import open_zk
+        from pvcworkerd.Daemon import config
        import daemon_lib.common as pvc_common
        import daemon_lib.ceph as pvc_ceph

@ -446,7 +451,7 @@ class VMBuilderScript(VMBuilder):

        # Run any imports first
        import os
-        from pvcapid.vmbuilder import chroot
+        from daemon_lib.vmbuilder import chroot
        import daemon_lib.common as pvc_common

        # The directory we mounted things on earlier during prepare(); this could very well
@ -524,13 +529,23 @@ class VMBuilderScript(VMBuilder):
        ret = os.system(
            f"rinse --arch {rinse_architecture} --directory {temporary_directory} --distribution {rinse_release} --cache-dir {rinse_cache} --add-pkg-list /tmp/addpkg --verbose {mirror_arg}"
        )
+        ret = int(ret >> 8)
        if ret > 0:
-            self.fail("Failed to run rinse")
+            self.fail(f"Rinse failed with exit code {ret}")

        # Bind mount the devfs, sysfs, and procfs so we can grub-install later
-        os.system("mount --bind /dev {}/dev".format(temporary_directory))
-        os.system("mount --bind /sys {}/sys".format(temporary_directory))
-        os.system("mount --bind /proc {}/proc".format(temporary_directory))
+        ret = os.system("mount --bind /dev {}/dev".format(temporary_directory))
+        ret = int(ret >> 8)
+        if ret > 0:
+            self.fail(f"/dev bind mount failed with exit code {ret}")
+        ret = os.system("mount --bind /sys {}/sys".format(temporary_directory))
+        ret = int(ret >> 8)
+        if ret > 0:
+            self.fail(f"/sys bind mount failed with exit code {ret}")
+        ret = os.system("mount --bind /proc {}/proc".format(temporary_directory))
+        ret = int(ret >> 8)
+        if ret > 0:
+            self.fail(f"/proc bind mount failed with exit code {ret}")

        # Create an fstab entry for each volume
        fstab_file = "{}/etc/fstab".format(temporary_directory)
@ -642,41 +657,76 @@ GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=
        # Do some tasks inside the chroot using the provided context manager
        with chroot(temporary_directory):
            # Fix the broken kernel from rinse by setting a systemd machine ID and running the post scripts
-            os.system("systemd-machine-id-setup")
-            os.system(
+            ret = os.system("systemd-machine-id-setup")
+            ret = int(ret >> 8)
+            if ret > 0:
+                self.fail(f"Machine ID setup failed with exit code {ret}")
+
+            ret = os.system(
                "rpm -q --scripts kernel-core | grep -A20  'posttrans scriptlet' | tail -n+2 | bash -x"
            )
+            ret = int(ret >> 8)
+            if ret > 0:
+                self.fail(f"RPM kernel reinstall failed with exit code {ret}")

            # Install any post packages
-            os.system(f"dnf install -y {' '.join(post_packages)}")
+            if len(post_packages) > 0:
+                ret = os.system(f"dnf install -y {' '.join(post_packages)}")
+                ret = int(ret >> 8)
+                if ret > 0:
+                    self.fail(f"DNF install failed with exit code {ret}")

            # Install and update GRUB config
-            os.system(
+            ret = os.system(
                "grub2-install --force /dev/rbd/{}/{}_{}".format(
                    root_volume["pool"], vm_name, root_volume["disk_id"]
                )
            )
+            ret = int(ret >> 8)
+            if ret > 0:
+                self.fail(f"GRUB install failed with exit code {ret}")
+
-            os.system("grub2-mkconfig -o /boot/grub2/grub.cfg")
+            ret = os.system("grub2-mkconfig -o /boot/grub2/grub.cfg")
+            ret = int(ret >> 8)
+            if ret > 0:
+                self.fail(f"GRUB update failed with exit code {ret}")

            # Set a really dumb root password so the VM can be debugged
            # EITHER CHANGE THIS YOURSELF, here or in Userdata, or run something after install
            # to change the root password: don't leave it like this on an Internet-facing machine!
-            os.system("echo root:test123 | chpasswd")
+            ret = os.system(f"echo root:{default_root_password} | chpasswd")
+            ret = int(ret >> 8)
+            if ret > 0:
+                self.fail(f"Root password change failed with exit code {ret}")

            # Enable dbus-broker
-            os.system("systemctl enable dbus-broker.service")
+            ret = os.system("systemctl enable dbus-broker.service")
+            ret = int(ret >> 8)
+            if ret > 0:
+                self.fail(f"Enable of dbus-broker failed with exit code {ret}")

            # Enable NetworkManager
-            os.system("systemctl enable NetworkManager.service")
+            ret = os.system("systemctl enable NetworkManager.service")
+            ret = int(ret >> 8)
+            if ret > 0:
+                self.fail(f"Enable of NetworkManager failed with exit code {ret}")

            # Enable cloud-init target on (first) boot
            # Your user-data should handle this and disable it once done, or things get messy.
            # That cloud-init won't run without this hack seems like a bug... but even the official
            # Debian cloud images are affected, so who knows.
-            os.system("systemctl enable cloud-init.target")
+            ret = os.system("systemctl enable cloud-init.target")
+            ret = int(ret >> 8)
+            if ret > 0:
+                self.fail(f"Enable of cloud-init failed with exit code {ret}")

            # Set the timezone to UTC
-            os.system("ln -sf ../usr/share/zoneinfo/UTC /etc/localtime")
+            ret = os.system(
+                f"ln -sf ../usr/share/zoneinfo/{default_local_time} /etc/localtime"
+            )
+            ret = int(ret >> 8)
+            if ret > 0:
+                self.fail(f"Localtime update failed with exit code {ret}")

    def cleanup(self):
        """
@ -692,8 +742,8 @@ GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=

        # Run any imports first
        import os
-        from pvcapid.vmbuilder import open_zk
-        from pvcapid.Daemon import config
+        from daemon_lib.vmbuilder import open_zk
+        from pvcworkerd.Daemon import config
        import daemon_lib.common as pvc_common
        import daemon_lib.ceph as pvc_ceph

--- a/api-daemon/provisioner/examples/script/5-pfsense.py
+++ b/api-daemon/provisioner/examples/script/5-pfsense.py
@ -3,7 +3,7 @@
-# 6-pfsense.py - PVC Provisioner example script for pfSense install
+# 5-pfsense.py - PVC Provisioner example script for pfSense install
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -173,7 +173,7 @@


 # This import is always required here, as VMBuilder is used by the VMBuilderScript class.
-from pvcapid.vmbuilder import VMBuilder
+from daemon_lib.vmbuilder import VMBuilder


 # Set up some variables for later; if you frequently use these tools, you might benefit from
@ -243,7 +243,7 @@ class VMBuilderScript(VMBuilder):
        """

        # Run any imports first
-        import pvcapid.libvirt_schema as libvirt_schema
+        import daemon_lib.libvirt_schema as libvirt_schema
        import datetime
        import random

@ -358,8 +358,8 @@ class VMBuilderScript(VMBuilder):

        # Run any imports first; as shown here, you can import anything from the PVC
        # namespace, as well as (of course) the main Python namespaces
-        from pvcapid.vmbuilder import open_zk
-        from pvcapid.Daemon import config
+        from daemon_lib.vmbuilder import open_zk
+        from pvcworkerd.Daemon import config
        import daemon_lib.common as pvc_common
        import daemon_lib.ceph as pvc_ceph
        import json
@ -902,8 +902,8 @@ class VMBuilderScript(VMBuilder):
        """

        # Run any imports first
-        from pvcapid.vmbuilder import open_zk
-        from pvcapid.Daemon import config
+        from daemon_lib.vmbuilder import open_zk
+        from pvcworkerd.Daemon import config
        import daemon_lib.ceph as pvc_ceph

        # Use this construct for reversing the list, as the normal reverse() messes with the list
--- a/api-daemon/pvc-api-db-upgrade
+++ b/api-daemon/pvc-api-db-upgrade
@ -3,7 +3,7 @@
 # Apply PVC database migrations
 # Part of the Parallel Virtual Cluster (PVC) system

-export PVC_CONFIG_FILE="/etc/pvc/pvcapid.yaml"
+export PVC_CONFIG_FILE="/etc/pvc/pvc.conf"

 if [[ ! -f ${PVC_CONFIG_FILE} ]]; then
    echo "Create a configuration file at ${PVC_CONFIG_FILE} before upgrading the database."
@ -12,15 +12,7 @@ fi

 pushd /usr/share/pvc

-case "$( cat /etc/debian_version )" in
-    10.*|11.*)
-        # Debian 10 & 11
-        ./pvcapid-manage_legacy.py db upgrade
-    ;;
-    *)
-        # Debian 12+
-        flask --app ./pvcapid-manage_flask.py db upgrade
-    ;;
-esac
+export FLASK_APP=./pvcapid-manage-flask.py
+flask db upgrade

 popd
--- a/api-daemon/pvcapid-manage-flask.py
+++ b/api-daemon/pvcapid-manage-flask.py
@ -3,7 +3,7 @@
-# pvcapid-manage_flask.py - PVC Database management tasks (via Flask CLI)
+# pvcapid-manage-flask.py - PVC Database management tasks (via Flask CLI)
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/api-daemon/pvcapid-manage-zk.py
+++ b/api-daemon/pvcapid-manage-zk.py
@ -3,7 +3,7 @@
 # pvcapid-manage-zk.py - PVC Zookeeper migration generator
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/api-daemon/pvcapid-manage_legacy.py
+++ b/api-daemon/pvcapid-manage_legacy.py
@ -1,33 +0,0 @@
-#!/usr/bin/env python3
-
-# pvcapid-manage_legacy.py - PVC Database management tasks (Legacy)
-# Part of the Parallel Virtual Cluster (PVC) system
-#
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
-#
-#    This program is free software: you can redistribute it and/or modify
-#    it under the terms of the GNU General Public License as published by
-#    the Free Software Foundation, version 3.
-#
-#    This program is distributed in the hope that it will be useful,
-#    but WITHOUT ANY WARRANTY; without even the implied warranty of
-#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-#    GNU General Public License for more details.
-#
-#    You should have received a copy of the GNU General Public License
-#    along with this program.  If not, see <https://www.gnu.org/licenses/>.
-#
-###############################################################################
-
-from flask_migrate import Migrate, MigrateCommand, Manager
-
-from pvcapid.flaskapi import app, db
-from pvcapid.models import *  # noqa F401,F403
-
-migrate = Migrate(app, db)
-manager = Manager(app)
-
-manager.add_command("db", MigrateCommand)
-
-if __name__ == "__main__":
-    manager.run()
--- a/api-daemon/pvcapid.py
+++ b/api-daemon/pvcapid.py
@ -3,7 +3,7 @@
 # pvcapid.py - API daemon startup stub
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -19,6 +19,13 @@
 #
 ###############################################################################

-import pvcapid.Daemon  # noqa: F401
+import sys
+from os import path
+
+# Ensure current directory (/usr/share/pvc) is in the system path for Gunicorn
+current_dir = path.dirname(path.abspath(__file__))
+sys.path.append(current_dir)
+
+import pvcapid.Daemon  # noqa: F401, E402

 pvcapid.Daemon.entrypoint()
--- a/api-daemon/pvcapid/Daemon.py
+++ b/api-daemon/pvcapid/Daemon.py
@ -3,7 +3,7 @@
 # Daemon.py - PVC HTTP API daemon
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -19,15 +19,13 @@
 #
 ###############################################################################

-
+import subprocess
 from ssl import SSLContext, TLSVersion
-
 from distutils.util import strtobool as dustrtobool
-
 import daemon_lib.config as cfg

 # Daemon version
-version = "0.9.83"
+version = "0.9.100~git-73c0834f"

 # API version
 API_VERSION = 1.0
@ -53,7 +51,6 @@ def strtobool(stringv):
 # Configuration Parsing
 ##########################################################

-
 # Get our configuration
 config = cfg.get_configuration()
 config["daemon_name"] = "pvcapid"
@ -61,22 +58,16 @@ config["daemon_version"] = version


 ##########################################################
-# Entrypoint
+# Flask App Creation for Gunicorn
 ##########################################################


-def entrypoint():
-    import pvcapid.flaskapi as pvc_api  # noqa: E402
-
-    if config["api_ssl_enabled"]:
-        context = SSLContext()
-        context.minimum_version = TLSVersion.TLSv1
-        context.get_ca_certs()
-        context.load_cert_chain(
-            config["api_ssl_cert_file"], keyfile=config["api_ssl_key_file"]
-        )
-    else:
-        context = None
+def create_app():
+    """
+    Create and return the Flask app (for Gunicorn or the built-in debug server).
+    """
+    # Import the Flask app from pvcapid.flaskapi after adjusting the path
+    import pvcapid.flaskapi as pvc_api

    # Print our startup messages
    print("")
@ -102,9 +93,69 @@ def entrypoint():
    print("")

    pvc_api.celery_startup()
-    pvc_api.app.run(
-        config["api_listen_address"],
-        config["api_listen_port"],
-        threaded=True,
-        ssl_context=context,
-    )
+
+    return pvc_api.app
+
+
+##########################################################
+# Entrypoint
+##########################################################
+
+
+def entrypoint():
+    if config["debug"]:
+        app = create_app()
+
+        if config["api_ssl_enabled"]:
+            ssl_context = SSLContext()
+            ssl_context.minimum_version = TLSVersion.TLSv1
+            ssl_context.get_ca_certs()
+            ssl_context.load_cert_chain(
+                config["api_ssl_cert_file"], keyfile=config["api_ssl_key_file"]
+            )
+        else:
+            ssl_context = None
+
+        app.run(
+            config["api_listen_address"],
+            config["api_listen_port"],
+            threaded=True,
+            ssl_context=ssl_context,
+        )
+    else:
+        # Build the command to run Gunicorn
+        gunicorn_cmd = [
+            "gunicorn",
+            "--workers",
+            "1",
+            "--threads",
+            "8",
+            "--timeout",
+            "86400",
+            "--bind",
+            "{}:{}".format(config["api_listen_address"], config["api_listen_port"]),
+            "pvcapid.Daemon:create_app()",
+            "--log-level",
+            "info",
+            "--access-logfile",
+            "-",
+            "--error-logfile",
+            "-",
+        ]
+
+        if config["api_ssl_enabled"]:
+            gunicorn_cmd += [
+                "--certfile",
+                config["api_ssl_cert_file"],
+                "--keyfile",
+                config["api_ssl_key_file"],
+            ]
+
+        # Run Gunicorn
+        try:
+            subprocess.run(gunicorn_cmd)
+        except KeyboardInterrupt:
+            exit(0)
+        except Exception as e:
+            print(e)
+            exit(1)
--- a/api-daemon/pvcapid/flaskapi.py
+++ b/api-daemon/pvcapid/flaskapi.py
--- a/api-daemon/pvcapid/helper.py
+++ b/api-daemon/pvcapid/helper.py
@ -3,7 +3,7 @@
 # helper.py - PVC HTTP API helper functions
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -23,6 +23,8 @@ import flask
 import json
 import lxml.etree as etree

+from re import match
+from requests import get
 from werkzeug.formparser import parse_form_data

 from pvcapid.Daemon import config, strtobool
@ -31,6 +33,7 @@ from daemon_lib.zkhandler import ZKConnection

 import daemon_lib.common as pvc_common
 import daemon_lib.cluster as pvc_cluster
+import daemon_lib.faults as pvc_faults
 import daemon_lib.node as pvc_node
 import daemon_lib.vm as pvc_vm
 import daemon_lib.network as pvc_network
@ -118,6 +121,211 @@ def cluster_maintenance(zkhandler, maint_state="false"):
    return retdata, retcode


+#
+# Metrics functions
+#
+@pvc_common.Profiler(config)
+@ZKConnection(config)
+def cluster_health_metrics(zkhandler):
+    """
+    Get cluster-wide Prometheus metrics for health
+    """
+
+    retflag, retdata = pvc_cluster.get_health_metrics(zkhandler)
+    if retflag:
+        retcode = 200
+    else:
+        retcode = 400
+    return retdata, retcode
+
+
+@pvc_common.Profiler(config)
+@ZKConnection(config)
+def cluster_resource_metrics(zkhandler):
+    """
+    Get cluster-wide Prometheus metrics for resource utilization
+    """
+
+    retflag, retdata = pvc_cluster.get_resource_metrics(zkhandler)
+    if retflag:
+        retcode = 200
+    else:
+        retcode = 400
+    return retdata, retcode
+
+
+@pvc_common.Profiler(config)
+@ZKConnection(config)
+def ceph_metrics(zkhandler):
+    """
+    Obtain current Ceph Prometheus metrics from the active MGR
+    """
+    # We have to parse out the *name* of the currently active MGR
+    # While the JSON version of the "ceph status" output provides a
+    # URL, this URL is in the backend (i.e. storage) network, which
+    # the API might not have access to. This way, we can connect to
+    # the node by name, which can be resolved however the environment allows.
+    retcode, retdata = pvc_ceph.get_status(zkhandler)
+    if not retcode:
+        ceph_mgr_node = None
+    else:
+        ceph_data = retdata["ceph_data"]
+        try:
+            ceph_mgr_line = [
+                n for n in ceph_data.split("\n") if match(r"^mgr:", n.strip())
+            ][0]
+            ceph_mgr_node = ceph_mgr_line.split()[1].split("(")[0]
+        except Exception:
+            ceph_mgr_node = None
+
+    if ceph_mgr_node is not None:
+        # Get the data from the endpoint
+        # We use the default port of 9283
+        ceph_prometheus_uri = f"http://{ceph_mgr_node}:9283/metrics"
+        response = get(ceph_prometheus_uri)
+
+        if response.status_code == 200:
+            output = response.text
+            status_code = 200
+        else:
+            output = (
+                f"Error: Failed to obtain metric data from {ceph_mgr_node} MGR daemon\n"
+            )
+            status_code = 400
+    else:
+        output = "Error: Failed to find an active MGR node\n"
+        status_code = 400
+
+    return output, status_code
+
+
+@pvc_common.Profiler(config)
+@ZKConnection(config)
+def zookeeper_metrics(zkhandler):
+    """
+    Obtain current Zookeeper Prometheus metrics from the active coordinator node
+    """
+    primary_node = zkhandler.read("base.config.primary_node")
+    if primary_node is not None:
+        # Get the data from the endpoint
+        # We use the default port of 9141
+        zookeeper_prometheus_uri = f"http://{primary_node}:9141/metrics"
+        response = get(zookeeper_prometheus_uri)
+
+        if response.status_code == 200:
+            output = response.text
+            # Parse the text to remove annoying ports (":2181")
+            output = output.replace(":2181", "")
+            # Sort the output text
+            output_lines = output.split("\n")
+            output_lines.sort()
+            output = "\n".join(output_lines) + "\n"
+            status_code = 200
+        else:
+            output = f"Error: Failed to obtain metric data from {primary_node} primary node daemon\n"
+            status_code = 400
+    else:
+        output = "Error: Failed to find an active primary node\n"
+        status_code = 400
+
+    return output, status_code
+
+
+#
+# Fault functions
+#
+@pvc_common.Profiler(config)
+@ZKConnection(config)
+def fault_list(zkhandler, limit=None, sort_key="last_reported"):
+    """
+    Return a list of all faults sorted by SORT_KEY.
+    """
+    retflag, retdata = pvc_faults.get_list(zkhandler, limit=limit, sort_key=sort_key)
+
+    if retflag and limit is not None and len(retdata) < 1:
+        retcode = 404
+        retdata = {"message": f"No fault with ID {limit} found"}
+    elif retflag:
+        retcode = 200
+    else:
+        retcode = 400
+        retdata = {"message": retdata}
+
+    return retdata, retcode
+
+
+@pvc_common.Profiler(config)
+@ZKConnection(config)
+def fault_acknowledge(zkhandler, fault_id):
+    """
+    Acknowledge a fault of FAULT_ID.
+    """
+    retflag, retdata = pvc_faults.acknowledge(zkhandler, fault_id=fault_id)
+
+    if retflag:
+        retcode = 200
+    else:
+        retcode = 404
+
+    retdata = {"message": retdata}
+
+    return retdata, retcode
+
+
+@pvc_common.Profiler(config)
+@ZKConnection(config)
+def fault_acknowledge_all(zkhandler):
+    """
+    Acknowledge all faults.
+    """
+    retflag, retdata = pvc_faults.acknowledge(zkhandler)
+
+    if retflag:
+        retcode = 200
+    else:
+        retcode = 404
+
+    retdata = {"message": retdata}
+
+    return retdata, retcode
+
+
+@pvc_common.Profiler(config)
+@ZKConnection(config)
+def fault_delete(zkhandler, fault_id):
+    """
+    Delete a fault of FAULT_ID.
+    """
+    retflag, retdata = pvc_faults.delete(zkhandler, fault_id=fault_id)
+
+    if retflag:
+        retcode = 200
+    else:
+        retcode = 404
+
+    retdata = {"message": retdata}
+
+    return retdata, retcode
+
+
+@pvc_common.Profiler(config)
+@ZKConnection(config)
+def fault_delete_all(zkhandler):
+    """
+    Delete all faults.
+    """
+    retflag, retdata = pvc_faults.delete(zkhandler)
+
+    if retflag:
+        retcode = 200
+    else:
+        retcode = 404
+
+    retdata = {"message": retdata}
+
+    return retdata, retcode
+
+
 #
 # Node functions
 #
@ -433,6 +641,7 @@ def vm_define(
    selector,
    autostart,
    migration_method,
+    migration_max_downtime=300,
    user_tags=[],
    protected_tags=[],
 ):
@ -460,6 +669,7 @@ def vm_define(
        selector,
        autostart,
        migration_method,
+        migration_max_downtime,
        profile=None,
        tags=tags,
    )
@ -555,6 +765,134 @@ def vm_restore(
    return output, retcode


+@ZKConnection(config)
+def create_vm_snapshot(
+    zkhandler,
+    domain,
+    snapshot_name=None,
+):
+    """
+    Take a snapshot of a VM.
+    """
+    retflag, retdata = pvc_vm.create_vm_snapshot(
+        zkhandler,
+        domain,
+        snapshot_name,
+    )
+
+    if retflag:
+        retcode = 200
+    else:
+        retcode = 400
+
+    output = {"message": retdata.replace('"', "'")}
+    return output, retcode
+
+
+@ZKConnection(config)
+def remove_vm_snapshot(
+    zkhandler,
+    domain,
+    snapshot_name,
+):
+    """
+    Remove a snapshot of a VM.
+    """
+    retflag, retdata = pvc_vm.remove_vm_snapshot(
+        zkhandler,
+        domain,
+        snapshot_name,
+    )
+
+    if retflag:
+        retcode = 200
+    else:
+        retcode = 400
+
+    output = {"message": retdata.replace('"', "'")}
+    return output, retcode
+
+
+@ZKConnection(config)
+def rollback_vm_snapshot(
+    zkhandler,
+    domain,
+    snapshot_name,
+):
+    """
+    Roll back to a snapshot of a VM.
+    """
+    retflag, retdata = pvc_vm.rollback_vm_snapshot(
+        zkhandler,
+        domain,
+        snapshot_name,
+    )
+
+    if retflag:
+        retcode = 200
+    else:
+        retcode = 400
+
+    output = {"message": retdata.replace('"', "'")}
+    return output, retcode
+
+
+@ZKConnection(config)
+def export_vm_snapshot(
+    zkhandler,
+    domain,
+    snapshot_name,
+    export_path,
+    incremental_parent=None,
+):
+    """
+    Export a snapshot of a VM to files.
+    """
+    retflag, retdata = pvc_vm.export_vm_snapshot(
+        zkhandler,
+        domain,
+        snapshot_name,
+        export_path,
+        incremental_parent,
+    )
+
+    if retflag:
+        retcode = 200
+    else:
+        retcode = 400
+
+    output = {"message": retdata.replace('"', "'")}
+    return output, retcode
+
+
+@ZKConnection(config)
+def import_vm_snapshot(
+    zkhandler,
+    domain,
+    snapshot_name,
+    export_path,
+    retain_snapshot=False,
+):
+    """
+    Import a snapshot of a VM from files.
+    """
+    retflag, retdata = pvc_vm.import_vm_snapshot(
+        zkhandler,
+        domain,
+        snapshot_name,
+        export_path,
+        retain_snapshot,
+    )
+
+    if retflag:
+        retcode = 200
+    else:
+        retcode = 400
+
+    output = {"message": retdata.replace('"', "'")}
+    return output, retcode
+
+
@ZKConnection(config)
 def vm_attach_device(zkhandler, vm, device_spec_xml):
    """
@ -618,6 +956,7 @@ def get_vm_meta(zkhandler, vm):
        domain_node_selector,
        domain_node_autostart,
        domain_migrate_method,
+        domain_migrate_max_downtime,
    ) = pvc_common.getDomainMetadata(zkhandler, dom_uuid)

    retcode = 200
@ -627,6 +966,7 @@ def get_vm_meta(zkhandler, vm):
        "node_selector": domain_node_selector.lower(),
        "node_autostart": domain_node_autostart,
        "migration_method": domain_migrate_method.lower(),
+        "migration_max_downtime": int(domain_migrate_max_downtime),
    }

    return retdata, retcode
@ -634,7 +974,14 @@ def get_vm_meta(zkhandler, vm):

@ZKConnection(config)
 def update_vm_meta(
-    zkhandler, vm, limit, selector, autostart, provisioner_profile, migration_method
+    zkhandler,
+    vm,
+    limit,
+    selector,
+    autostart,
+    provisioner_profile,
+    migration_method,
+    migration_max_downtime,
 ):
    """
    Update metadata of a VM.
@ -650,7 +997,14 @@ def update_vm_meta(
            autostart = False

    retflag, retdata = pvc_vm.modify_vm_metadata(
-        zkhandler, vm, limit, selector, autostart, provisioner_profile, migration_method
+        zkhandler,
+        vm,
+        limit,
+        selector,
+        autostart,
+        provisioner_profile,
+        migration_method,
+        migration_max_downtime,
    )

    if retflag:
@ -1643,11 +1997,29 @@ def ceph_volume_list(zkhandler, pool=None, limit=None, is_fuzzy=True):


@ZKConnection(config)
-def ceph_volume_add(zkhandler, pool, name, size):
+def ceph_volume_scan(zkhandler, pool, name):
+    """
+    (Re)scan a Ceph RBD volume for stats in the PVC Ceph storage cluster.
+    """
+    retflag, retdata = pvc_ceph.scan_volume(zkhandler, pool, name)
+
+    if retflag:
+        retcode = 200
+    else:
+        retcode = 400
+
+    output = {"message": retdata.replace('"', "'")}
+    return output, retcode
+
+
+@ZKConnection(config)
+def ceph_volume_add(zkhandler, pool, name, size, force_flag=False):
    """
    Add a Ceph RBD volume to the PVC Ceph storage cluster.
    """
-    retflag, retdata = pvc_ceph.add_volume(zkhandler, pool, name, size)
+    retflag, retdata = pvc_ceph.add_volume(
+        zkhandler, pool, name, size, force_flag=force_flag
+    )

    if retflag:
        retcode = 200
@ -1659,11 +2031,13 @@ def ceph_volume_add(zkhandler, pool, name, size):


@ZKConnection(config)
-def ceph_volume_clone(zkhandler, pool, name, source_volume):
+def ceph_volume_clone(zkhandler, pool, name, source_volume, force_flag):
    """
    Clone a Ceph RBD volume to a new volume on the PVC Ceph storage cluster.
    """
-    retflag, retdata = pvc_ceph.clone_volume(zkhandler, pool, source_volume, name)
+    retflag, retdata = pvc_ceph.clone_volume(
+        zkhandler, pool, source_volume, name, force_flag=force_flag
+    )

    if retflag:
        retcode = 200
@ -1675,11 +2049,13 @@ def ceph_volume_clone(zkhandler, pool, name, source_volume):


@ZKConnection(config)
-def ceph_volume_resize(zkhandler, pool, name, size):
+def ceph_volume_resize(zkhandler, pool, name, size, force_flag):
    """
    Resize an existing Ceph RBD volume in the PVC Ceph storage cluster.
    """
-    retflag, retdata = pvc_ceph.resize_volume(zkhandler, pool, name, size)
+    retflag, retdata = pvc_ceph.resize_volume(
+        zkhandler, pool, name, size, force_flag=force_flag
+    )

    if retflag:
        retcode = 200
@ -1951,6 +2327,22 @@ def ceph_volume_snapshot_rename(zkhandler, pool, volume, name, new_name):
    return output, retcode


+@ZKConnection(config)
+def ceph_volume_snapshot_rollback(zkhandler, pool, volume, name):
+    """
+    Roll back a Ceph RBD volume to a given snapshot in the PVC Ceph storage cluster.
+    """
+    retflag, retdata = pvc_ceph.rollback_snapshot(zkhandler, pool, volume, name)
+
+    if retflag:
+        retcode = 200
+    else:
+        retcode = 400
+
+    output = {"message": retdata.replace('"', "'")}
+    return output, retcode
+
+
@ZKConnection(config)
 def ceph_volume_snapshot_remove(zkhandler, pool, volume, name):
    """
--- a/api-daemon/pvcapid/models.py
+++ b/api-daemon/pvcapid/models.py
@ -3,7 +3,7 @@
 # models.py - PVC Database models
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -36,6 +36,7 @@ class DBSystemTemplate(db.Model):
    node_selector = db.Column(db.Text)
    node_autostart = db.Column(db.Boolean, nullable=False)
    migration_method = db.Column(db.Text)
+    migration_max_downtime = db.Column(db.Integer, default=300, server_default="300")
    ova = db.Column(db.Integer, db.ForeignKey("ova.id"), nullable=True)

    def __init__(
@ -50,6 +51,7 @@ class DBSystemTemplate(db.Model):
        node_selector,
        node_autostart,
        migration_method,
+        migration_max_downtime,
        ova=None,
    ):
        self.name = name
@ -62,6 +64,7 @@ class DBSystemTemplate(db.Model):
        self.node_selector = node_selector
        self.node_autostart = node_autostart
        self.migration_method = migration_method
+        self.migration_max_downtime = migration_max_downtime
        self.ova = ova

    def __repr__(self):
--- a/api-daemon/pvcapid/ova.py
+++ b/api-daemon/pvcapid/ova.py
@ -3,7 +3,7 @@
 # ova.py - PVC OVA parser library
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -50,7 +50,7 @@ def open_database(config):
    conn = psycopg2.connect(
        host=config["api_postgresql_host"],
        port=config["api_postgresql_port"],
-        dbname=config["api_postgresql_name"],
+        dbname=config["api_postgresql_dbname"],
        user=config["api_postgresql_user"],
        password=config["api_postgresql_password"],
    )
--- a/api-daemon/pvcapid/provisioner.py
+++ b/api-daemon/pvcapid/provisioner.py
@ -3,7 +3,7 @@
 # provisioner.py - PVC API Provisioner functions
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -125,7 +125,7 @@ def list_template(limit, table, is_fuzzy=True):
            args = (template_data["id"],)
            cur.execute(query, args)
            disks = cur.fetchall()
-            data[template_id]["disks"] = disks
+            data[template_id]["disks"] = sorted(disks, key=lambda x: x["disk_id"])

    close_database(conn, cur)

@ -221,6 +221,7 @@ def create_template_system(
    node_selector=None,
    node_autostart=False,
    migration_method=None,
+    migration_max_downtime=None,
    ova=None,
 ):
    if list_template_system(name, is_fuzzy=False)[-1] != 404:
@ -231,7 +232,7 @@ def create_template_system(
    if node_selector == "none":
        node_selector = None

-    query = "INSERT INTO system_template (name, vcpu_count, vram_mb, serial, vnc, vnc_bind, node_limit, node_selector, node_autostart, migration_method, ova) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s);"
+    query = "INSERT INTO system_template (name, vcpu_count, vram_mb, serial, vnc, vnc_bind, node_limit, node_selector, node_autostart, migration_method, migration_max_downtime, ova) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s);"
    args = (
        name,
        vcpu_count,
@ -243,6 +244,7 @@ def create_template_system(
        node_selector,
        node_autostart,
        migration_method,
+        migration_max_downtime,
        ova,
    )

@ -282,27 +284,28 @@ def create_template_network(name, mac_template=None):
    return retmsg, retcode


-def create_template_network_element(name, vni):
+def create_template_network_element(name, vni, permit_duplicate=False):
    if list_template_network(name, is_fuzzy=False)[-1] != 200:
        retmsg = {"message": 'The network template "{}" does not exist.'.format(name)}
        retcode = 400
        return retmsg, retcode

-    networks, code = list_template_network_vnis(name)
-    if code != 200:
-        networks = []
-    found_vni = False
-    for network in networks:
-        if network["vni"] == vni:
-            found_vni = True
-    if found_vni:
-        retmsg = {
-            "message": 'The VNI "{}" in network template "{}" already exists.'.format(
-                vni, name
-            )
-        }
-        retcode = 400
-        return retmsg, retcode
+    if not permit_duplicate:
+        networks, code = list_template_network_vnis(name)
+        if code != 200:
+            networks = []
+        found_vni = False
+        for network in networks:
+            if network["vni"] == vni:
+                found_vni = True
+        if found_vni:
+            retmsg = {
+                "message": 'The VNI "{}" in network template "{}" already exists.'.format(
+                    vni, name
+                )
+            }
+            retcode = 400
+            return retmsg, retcode

    conn, cur = open_database(config)
    try:
@ -438,6 +441,7 @@ def modify_template_system(
    node_selector=None,
    node_autostart=None,
    migration_method=None,
+    migration_max_downtime=None,
 ):
    if list_template_system(name, is_fuzzy=False)[-1] != 200:
        retmsg = {"message": 'The system template "{}" does not exist.'.format(name)}
@ -505,6 +509,11 @@ def modify_template_system(
    if migration_method is not None:
        fields.append({"field": "migration_method", "data": migration_method})

+    if migration_max_downtime is not None:
+        fields.append(
+            {"field": "migration_max_downtime", "data": int(migration_max_downtime)}
+        )
+
    conn, cur = open_database(config)
    try:
        for field in fields:
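
Note (illustrative only, not part of the patch): the `permit_duplicate` guard added to `create_template_network_element` above can be read as a small predicate. The sketch below assumes that each entry returned by `list_template_network_vnis` is a mapping with a `"vni"` key, as the loop in the hunk suggests. The helper names `vni_already_present` and `may_add_vni` are hypothetical and do not exist in PVC.

```python
def vni_already_present(networks, vni):
    """Return True if any existing template entry already uses this VNI."""
    return any(network["vni"] == vni for network in networks)


def may_add_vni(networks, vni, permit_duplicate=False):
    """Mirror the guard in the hunk: reject duplicates unless permitted."""
    if permit_duplicate:
        # Caller opted in to duplicates, so the lookup is skipped entirely
        return True
    return not vni_already_present(networks, vni)


existing = [{"vni": "100"}, {"vni": "200"}]
print(may_add_vni(existing, "100"))
print(may_add_vni(existing, "100", permit_duplicate=True))
print(may_add_vni(existing, "300"))
```

As in the patched function, the existing-VNI lookup is only consulted when `permit_duplicate` is false, so a caller that permits duplicates never pays for the check.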
--- a/api-daemon/swagger.html
+++ b/api-daemon/swagger.html
@ -1,13 +0,0 @@
-<!DOCTYPE html>
-<html>
-  <head>
-    <title>PVC Client API Documentation</title>
-    <meta charset="utf-8"/>
-    <meta name="viewport" content="width=device-width, initial-scale=1">
-    <style> body { margin: 0; padding: 0; } </style>
-  </head>
-  <body>
-    <redoc spec-url='./swagger.json' hide-loading></redoc>
-    <script src="https://rebilly.github.io/ReDoc/releases/latest/redoc.min.js"> </script>
-  </body>
-</html>
--- a/api-daemon/swagger.json
+++ b/api-daemon/swagger.json
--- a/build-and-deploy.sh
+++ b/build-and-deploy.sh
@ -13,6 +13,8 @@ else
 fi

 KEEP_ARTIFACTS=""
+API_ONLY=""
+PRIMARY_NODE=""
 if [[ -n ${1} ]]; then
    for arg in ${@}; do
        case ${arg} in
@ -20,33 +22,45 @@ if [[ -n ${1} ]]; then
                KEEP_ARTIFACTS="y"
                shift
            ;;
+            -a|--api-only)
+                API_ONLY="y"
+                shift
+            ;;
+            -p=*|--become-primary=*)
+                PRIMARY_NODE=$( awk -F'=' '{ print $NF }' <<<"${arg}" )
+                shift
+            ;;
        esac
    done
 fi

 HOSTS=( ${@} )
-echo "> Deploying to host(s): ${HOSTS[@]}"
+echo "Deploying to host(s): ${HOSTS[@]}"
+if [[ -n ${PRIMARY_NODE} ]]; then
+    echo "Will become primary on ${PRIMARY_NODE} after updating it"
+fi

 # Move to repo root if we're not
 pushd $( git rev-parse --show-toplevel ) &>/dev/null

 # Prepare code
-echo "Preparing code (format and lint)..."
+echo "> Preparing code (format and lint)..."
 ./format || exit 1
 ./lint || exit 1

 # Build the packages
-echo -n "Building packages..."
+echo -n "> Building packages..."
 version="$( ./build-unstable-deb.sh 2>/dev/null )"
 echo " done. Package version ${version}."

 # Install the client(s) locally
-echo -n "Installing client packages locally..."
+echo -n "> Installing client packages locally..."
 $SUDO dpkg -i --force-all ../pvc-client*_${version}*.deb &>/dev/null
 echo " done".

+echo "> Copying packages..."
 for HOST in ${HOSTS[@]}; do
-    echo -n "Copying packages to host ${HOST}..."
+    echo -n ">>> Copying packages to host ${HOST}..."
    ssh $HOST $SUDO rm -rf /tmp/pvc &>/dev/null
    ssh $HOST mkdir /tmp/pvc &>/dev/null
    scp ../pvc-*_${version}*.deb $HOST:/tmp/pvc/ &>/dev/null
@ -57,26 +71,34 @@ if [[ -z ${KEEP_ARTIFACTS} ]]; then
 fi

 for HOST in ${HOSTS[@]}; do
-    echo "> Deploying packages to host ${HOST}"
-    echo -n "Installing packages..."
+    echo "> Deploying packages on host ${HOST}"
+    echo -n ">>> Installing packages..."
    ssh $HOST $SUDO dpkg -i --force-all /tmp/pvc/*.deb &>/dev/null
    ssh $HOST rm -rf /tmp/pvc &>/dev/null
    echo " done."
-    echo -n "Restarting PVC daemons..."
+    echo -n ">>> Restarting PVC daemons..."
    ssh $HOST $SUDO systemctl restart pvcapid &>/dev/null
    sleep 2
    ssh $HOST $SUDO systemctl restart pvcworkerd &>/dev/null
+    if [[ -z ${API_ONLY} ]]; then
    sleep 2
    ssh $HOST $SUDO systemctl restart pvchealthd &>/dev/null
    sleep 2
    ssh $HOST $SUDO systemctl restart pvcnoded &>/dev/null
    echo " done."
-    echo -n "Waiting for node daemon to be running..."
+    echo -n ">>> Waiting for node daemon to be running..."
    while [[ $( ssh $HOST "pvc -q node list -f json ${HOST%%.*} | jq -r '.[].daemon_state'" 2>/dev/null ) != "run" ]]; do
        sleep 5
        echo -n "."
    done
+    fi
    echo " done."
+    if [[ -n ${PRIMARY_NODE} && ${PRIMARY_NODE} == ${HOST} ]]; then
+        echo -n ">>> Setting node $HOST to primary coordinator state... "
+        ssh $HOST pvc -q node primary --wait &>/dev/null
+        ssh $HOST $SUDO systemctl restart pvcworkerd &>/dev/null
+        echo "done."
+    fi
 done

 popd &>/dev/null
--- a/client-cli/pvc.py
+++ b/client-cli/pvc.py
@ -3,7 +3,7 @@
 # pvc.py - PVC client command-line interface (stub testing interface)
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
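
Note (illustrative only, not part of the patch): the short fault list formatter in `client-cli/pvc/cli/formatters.py` wraps long fault messages greedily, one word at a time, and indents continuation lines to the start of the message column. A minimal standalone sketch of that approach follows. The function names `wrap_words` and `indent_continuations` are hypothetical, and the strict `<` comparison is copied from the diff rather than being a recommendation.

```python
def wrap_words(message, width):
    """Greedily pack whitespace-separated words into lines shorter than width."""
    words = message.split()
    if not words:
        return []
    lines = []
    current = words[0]
    # Start from the second word: the first one already seeds the line
    for word in words[1:]:
        # The +1 accounts for the space that joins the two pieces
        if len(current) + len(word) + 1 < width:
            current = f"{current} {word}"
        else:
            lines.append(current)
            current = word
    lines.append(current)
    return lines


def indent_continuations(lines, prefix_len):
    """Join lines so that every line after the first is indented by prefix_len."""
    return ("\n" + " " * prefix_len).join(lines)


wrapped = wrap_words("aaa bbb ccc", 8)
print(indent_continuations(wrapped, 2))
```

The standard library's `textwrap.wrap` offers similar behaviour with more edge-case handling. The hand-rolled loop is shown only because it matches the shape of the code in the diff.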
--- a/client-cli/pvc/cli/cli.py
+++ b/client-cli/pvc/cli/cli.py
--- a/client-cli/pvc/cli/formatters.py
+++ b/client-cli/pvc/cli/formatters.py
@ -3,7 +3,7 @@
 # formatters.py - PVC Click CLI output formatters library
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2023 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -19,6 +19,7 @@
 #
 ###############################################################################

+from pvc.cli.helpers import MAX_CONTENT_WIDTH
 from pvc.lib.node import format_info as node_format_info
 from pvc.lib.node import format_list as node_format_list
 from pvc.lib.vm import format_vm_tags as vm_format_tags
@ -82,6 +83,37 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
    total_volumes = data.get("volumes", 0)
    total_snapshots = data.get("snapshots", 0)

+    total_cpu_total = data.get("resources", {}).get("cpu", {}).get("total", 0)
+    total_cpu_load = data.get("resources", {}).get("cpu", {}).get("load", 0)
+    total_cpu_utilization = (
+        data.get("resources", {}).get("cpu", {}).get("utilization", 0)
+    )
+    total_cpu_string = (
+        f"{total_cpu_utilization:.1f}% ({total_cpu_load:.1f} / {total_cpu_total})"
+    )
+
+    total_memory_total = (
+        data.get("resources", {}).get("memory", {}).get("total", 0) / 1024
+    )
+    total_memory_used = (
+        data.get("resources", {}).get("memory", {}).get("used", 0) / 1024
+    )
+    total_memory_utilization = (
+        data.get("resources", {}).get("memory", {}).get("utilization", 0)
+    )
+    total_memory_string = f"{total_memory_utilization:.1f}% ({total_memory_used:.1f} GB / {total_memory_total:.1f} GB)"
+
+    total_disk_total = (
+        data.get("resources", {}).get("disk", {}).get("total", 0) / 1024 / 1024
+    )
+    total_disk_used = (
+        data.get("resources", {}).get("disk", {}).get("used", 0) / 1024 / 1024
+    )
+    total_disk_utilization = round(
+        data.get("resources", {}).get("disk", {}).get("utilization", 0)
+    )
+    total_disk_string = f"{total_disk_utilization:.1f}% ({total_disk_used:.1f} GB / {total_disk_total:.1f} GB)"
+
    if maintenance == "true" or health == -1:
        health_colour = ansii["blue"]
    elif health > 90:
@ -93,7 +125,9 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):

    output = list()

-    output.append(f"{ansii['bold']}PVC cluster status:{ansii['end']}")
+    output.append(f"{ansii['purple']}Primary node:{ansii['end']}   {primary_node}")
+    output.append(f"{ansii['purple']}PVC version:{ansii['end']}    {pvc_version}")
+    output.append(f"{ansii['purple']}Upstream IP:{ansii['end']}    {upstream_ip}")
    output.append("")

    if health != "-1":
@ -105,18 +139,43 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
        health = f"{health} (maintenance on)"

    output.append(
-        f"{ansii['purple']}Cluster health:{ansii['end']}   {health_colour}{health}{ansii['end']}"
+        f"{ansii['purple']}Health:{ansii['end']}         {health_colour}{health}{ansii['end']}"
    )

    if messages is not None and len(messages) > 0:
-        messages = "\n                  ".join(sorted(messages))
-        output.append(f"{ansii['purple']}Health messages:{ansii['end']}  {messages}")
+        message_list = list()
+        for message in messages:
+            if message["health_delta"] >= 50:
+                message_colour = ansii["red"]
+            elif message["health_delta"] >= 10:
+                message_colour = ansii["yellow"]
+            else:
+                message_colour = ansii["green"]
+            message_delta = (
+                f"({message_colour}-{message['health_delta']}%{ansii['end']})"
+            )
+            message_list.append(
+                # Width of 15 includes the invisible ANSI colour sequences
+                "{id} {delta:<15} {text}".format(
+                    id=message["id"],
+                    delta=message_delta,
+                    text=message["text"],
+                )
+            )

-    output.append("")
+        messages = "\n               ".join(message_list)
+    else:
+        messages = "None"
+    output.append(f"{ansii['purple']}Active faults:{ansii['end']}  {messages}")
+
+    output.append(f"{ansii['purple']}Total CPU:{ansii['end']}      {total_cpu_string}")
+
+    output.append(
+        f"{ansii['purple']}Total memory:{ansii['end']}   {total_memory_string}"
+    )
+
+    output.append(f"{ansii['purple']}Total disk:{ansii['end']}     {total_disk_string}")

-    output.append(f"{ansii['purple']}Primary node:{ansii['end']}     {primary_node}")
-    output.append(f"{ansii['purple']}PVC version:{ansii['end']}      {pvc_version}")
-    output.append(f"{ansii['purple']}Upstream IP:{ansii['end']}      {upstream_ip}")
    output.append("")

    node_states = ["run,ready"]
@ -145,7 +204,7 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):

    nodes_string = ", ".join(nodes_strings)

-    output.append(f"{ansii['purple']}Nodes:{ansii['end']}            {nodes_string}")
+    output.append(f"{ansii['purple']}Nodes:{ansii['end']}          {nodes_string}")

    vm_states = ["start", "disable"]
    vm_states.extend(
@ -175,7 +234,7 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):

    vms_string = ", ".join(vms_strings)

-    output.append(f"{ansii['purple']}VMs:{ansii['end']}              {vms_string}")
+    output.append(f"{ansii['purple']}VMs:{ansii['end']}            {vms_string}")

    osd_states = ["up,in"]
    osd_states.extend(
@ -201,15 +260,15 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):

    osds_string = " ".join(osds_strings)

-    output.append(f"{ansii['purple']}OSDs:{ansii['end']}             {osds_string}")
+    output.append(f"{ansii['purple']}OSDs:{ansii['end']}           {osds_string}")

-    output.append(f"{ansii['purple']}Pools:{ansii['end']}            {total_pools}")
+    output.append(f"{ansii['purple']}Pools:{ansii['end']}          {total_pools}")

-    output.append(f"{ansii['purple']}Volumes:{ansii['end']}          {total_volumes}")
+    output.append(f"{ansii['purple']}Volumes:{ansii['end']}        {total_volumes}")

-    output.append(f"{ansii['purple']}Snapshots:{ansii['end']}        {total_snapshots}")
+    output.append(f"{ansii['purple']}Snapshots:{ansii['end']}      {total_snapshots}")

-    output.append(f"{ansii['purple']}Networks:{ansii['end']}         {total_networks}")
+    output.append(f"{ansii['purple']}Networks:{ansii['end']}       {total_networks}")

    output.append("")

@ -237,9 +296,6 @@ def cli_cluster_status_format_short(CLI_CONFIG, data):

    output = list()

-    output.append(f"{ansii['bold']}PVC cluster status:{ansii['end']}")
-    output.append("")
-
    if health != "-1":
        health = f"{health}%"
    else:
@ -249,18 +305,375 @@ def cli_cluster_status_format_short(CLI_CONFIG, data):
        health = f"{health} (maintenance on)"

    output.append(
-        f"{ansii['purple']}Cluster health:{ansii['end']}   {health_colour}{health}{ansii['end']}"
+        f"{ansii['purple']}Health:{ansii['end']}         {health_colour}{health}{ansii['end']}"
    )

    if messages is not None and len(messages) > 0:
-        messages = "\n                  ".join(sorted(messages))
-        output.append(f"{ansii['purple']}Health messages:{ansii['end']}  {messages}")
+        message_list = list()
+        for message in messages:
+            if message["health_delta"] >= 50:
+                message_colour = ansii["red"]
+            elif message["health_delta"] >= 10:
+                message_colour = ansii["yellow"]
+            else:
+                message_colour = ansii["green"]
+            message_delta = (
+                f"({message_colour}-{message['health_delta']}%{ansii['end']})"
+            )
+            message_list.append(
+                # Width of 15 includes the invisible ANSI colour sequences
+                "{id} {delta:<15} {text}".format(
+                    id=message["id"],
+                    delta=message_delta,
+                    text=message["text"],
+                )
+            )
+
+        messages = "\n               ".join(message_list)
+    else:
+        messages = "None"
+    output.append(f"{ansii['purple']}Active faults:{ansii['end']}  {messages}")
+
+    total_cpu_total = data.get("resources", {}).get("cpu", {}).get("total", 0)
+    total_cpu_load = data.get("resources", {}).get("cpu", {}).get("load", 0)
+    total_cpu_utilization = (
+        data.get("resources", {}).get("cpu", {}).get("utilization", 0)
+    )
+    total_cpu_string = (
+        f"{total_cpu_utilization:.1f}% ({total_cpu_load:.1f} / {total_cpu_total})"
+    )
+
+    total_memory_total = (
+        data.get("resources", {}).get("memory", {}).get("total", 0) / 1024
+    )
+    total_memory_used = (
+        data.get("resources", {}).get("memory", {}).get("used", 0) / 1024
+    )
+    total_memory_utilization = (
+        data.get("resources", {}).get("memory", {}).get("utilization", 0)
+    )
+    total_memory_string = f"{total_memory_utilization:.1f}% ({total_memory_used:.1f} GB / {total_memory_total:.1f} GB)"
+
+    total_disk_total = (
+        data.get("resources", {}).get("disk", {}).get("total", 0) / 1024 / 1024
+    )
+    total_disk_used = (
+        data.get("resources", {}).get("disk", {}).get("used", 0) / 1024 / 1024
+    )
+    total_disk_utilization = round(
+        data.get("resources", {}).get("disk", {}).get("utilization", 0)
+    )
+    total_disk_string = f"{total_disk_utilization:.1f}% ({total_disk_used:.1f} GB / {total_disk_total:.1f} GB)"
+
+    output.append(f"{ansii['purple']}CPU usage:{ansii['end']}      {total_cpu_string}")
+
+    output.append(
+        f"{ansii['purple']}Memory usage:{ansii['end']}   {total_memory_string}"
+    )
+
+    output.append(f"{ansii['purple']}Disk usage:{ansii['end']}     {total_disk_string}")

    output.append("")

    return "\n".join(output)


+def cli_cluster_fault_list_format_short(CLI_CONFIG, fault_data):
+    """
+    Pretty format the output of cli_cluster_fault_list (short form)
+    """
+
+    fault_list_output = []
+
+    # Determine optimal column widths
+    fault_id_length = 3  # "ID"
+    fault_status_length = 7  # "Status"
+    fault_last_reported_length = 14  # "Last Reported"
+    fault_health_delta_length = 7  # "Health"
+    fault_message_length = 8  # "Message"
+
+    for fault in fault_data:
+        # fault_id column
+        _fault_id_length = len(str(fault["id"])) + 1
+        if _fault_id_length > fault_id_length:
+            fault_id_length = _fault_id_length
+
+        # status column
+        _fault_status_length = len(str(fault["status"])) + 1
+        if _fault_status_length > fault_status_length:
+            fault_status_length = _fault_status_length
+
+        # health_delta column
+        _fault_health_delta_length = len(str(fault["health_delta"])) + 1
+        if _fault_health_delta_length > fault_health_delta_length:
+            fault_health_delta_length = _fault_health_delta_length
+
+        # last_reported column
+        _fault_last_reported_length = len(str(fault["last_reported"])) + 1
+        if _fault_last_reported_length > fault_last_reported_length:
+            fault_last_reported_length = _fault_last_reported_length
+
+    message_prefix_len = (
+        fault_id_length
+        + 1
+        + fault_status_length
+        + 1
+        + fault_health_delta_length
+        + 1
+        + fault_last_reported_length
+        + 1
+    )
+    message_length = MAX_CONTENT_WIDTH - message_prefix_len
+
+    if fault_message_length > message_length:
+        fault_message_length = message_length + 1
+
+    # Handle splitting fault messages into separate lines based on width
+    formatted_messages = dict()
+    for fault in fault_data:
+        split_message = list()
+        if len(fault["message"]) > message_length:
+            words = fault["message"].split()
+            current_line = words[0]
+            for word in words[1:]:
+                if len(current_line) + len(word) + 1 < message_length:
+                    current_line = f"{current_line} {word}"
+                else:
+                    split_message.append(current_line)
+                    current_line = word
+            split_message.append(current_line)
+
+            for line in split_message:
+                # message column
+                _fault_message_length = len(line) + 1
+                if _fault_message_length > fault_message_length:
+                    fault_message_length = _fault_message_length
+
+            message = f"\n{' ' * message_prefix_len}".join(split_message)
+        else:
+            message = fault["message"]
+
+            # message column
+            _fault_message_length = len(message) + 1
+            if _fault_message_length > fault_message_length:
+                fault_message_length = _fault_message_length
+
+        formatted_messages[fault["id"]] = message
+
+    meta_header_length = (
+        fault_id_length + fault_status_length + fault_health_delta_length + 2
+    )
+    detail_header_length = (
+        fault_id_length
+        + fault_health_delta_length
+        + fault_status_length
+        + fault_last_reported_length
+        + fault_message_length
+        + 3
+        - meta_header_length
+    )
+
+    # Format the string (header)
+    fault_list_output.append(
+        "{bold}Meta {meta_dashes}  Fault {detail_dashes}{end_bold}".format(
+            bold=ansii["bold"],
+            end_bold=ansii["end"],
+            meta_dashes="-" * (meta_header_length - len("Meta  ")),
+            detail_dashes="-" * (detail_header_length - len("Fault  ")),
+        )
+    )
+
+    fault_list_output.append(
+        "{bold}{fault_id: <{fault_id_length}} {fault_status: <{fault_status_length}} {fault_health_delta: <{fault_health_delta_length}} {fault_last_reported: <{fault_last_reported_length}} {fault_message}{end_bold}".format(
+            bold=ansii["bold"],
+            end_bold=ansii["end"],
+            fault_id_length=fault_id_length,
+            fault_status_length=fault_status_length,
+            fault_health_delta_length=fault_health_delta_length,
+            fault_last_reported_length=fault_last_reported_length,
+            fault_id="ID",
+            fault_status="Status",
+            fault_health_delta="Health",
+            fault_last_reported="Last Reported",
+            fault_message="Message",
+        )
+    )
+
+    for fault in sorted(
+        fault_data,
+        key=lambda x: (x["health_delta"], x["last_reported"]),
+        reverse=True,
+    ):
+        health_delta = fault["health_delta"]
+        if fault["acknowledged_at"] != "":
+            health_colour = ansii["blue"]
+        elif health_delta >= 50:
+            health_colour = ansii["red"]
+        elif health_delta >= 10:
+            health_colour = ansii["yellow"]
+        else:
+            health_colour = ansii["green"]
+
+        if len(fault["message"]) > message_length:
+            words = fault["message"].split()
+            split_message = list()
+            current_line = words[0]
+            for word in words[1:]:
+                if len(current_line) + len(word) + 1 < message_length:
+                    current_line = f"{current_line} {word}"
+                else:
+                    split_message.append(current_line)
+                    current_line = word
+            split_message.append(current_line)
+
+            message = f"\n{' ' * message_prefix_len}".join(split_message)
+        else:
+            message = fault["message"]
+
+        fault_list_output.append(
+            "{bold}{fault_id: <{fault_id_length}} {fault_status: <{fault_status_length}} {health_colour}{fault_health_delta: <{fault_health_delta_length}}{end_colour} {fault_last_reported: <{fault_last_reported_length}} {fault_message}{end_bold}".format(
+                bold="",
+                end_bold="",
+                health_colour=health_colour,
+                end_colour=ansii["end"],
+                fault_id_length=fault_id_length,
+                fault_status_length=fault_status_length,
+                fault_health_delta_length=fault_health_delta_length,
+                fault_last_reported_length=fault_last_reported_length,
+                fault_id=fault["id"],
+                fault_status=fault["status"],
+                fault_health_delta=f"-{fault['health_delta']}%",
+                fault_last_reported=fault["last_reported"],
+                fault_message=formatted_messages[fault["id"]],
+            )
+        )
+
+    return "\n".join(fault_list_output)
+
+
+def cli_cluster_fault_list_format_long(CLI_CONFIG, fault_data):
+    """
+    Pretty format the output of cli_cluster_fault_list (long form)
+    """
+
+    fault_list_output = []
+
+    # Determine optimal column widths
+    fault_id_length = 3  # "ID"
+    fault_status_length = 7  # "Status"
+    fault_health_delta_length = 7  # "Health"
+    fault_acknowledged_at_length = 9  # "Ack'd On"
+    fault_last_reported_length = 14  # "Last Reported"
+    fault_first_reported_length = 15  # "First Reported"
+    # Message goes on its own line
+
+    for fault in fault_data:
+        # fault_id column
+        _fault_id_length = len(str(fault["id"])) + 1
+        if _fault_id_length > fault_id_length:
+            fault_id_length = _fault_id_length
+
+        # status column
+        _fault_status_length = len(str(fault["status"])) + 1
+        if _fault_status_length > fault_status_length:
+            fault_status_length = _fault_status_length
+
+        # health_delta column
+        _fault_health_delta_length = len(str(fault["health_delta"])) + 1
+        if _fault_health_delta_length > fault_health_delta_length:
+            fault_health_delta_length = _fault_health_delta_length
+
+        # acknowledged_at column
+        _fault_acknowledged_at_length = len(str(fault["acknowledged_at"])) + 1
+        if _fault_acknowledged_at_length > fault_acknowledged_at_length:
+            fault_acknowledged_at_length = _fault_acknowledged_at_length
+
+        # last_reported column
+        _fault_last_reported_length = len(str(fault["last_reported"])) + 1
+        if _fault_last_reported_length > fault_last_reported_length:
+            fault_last_reported_length = _fault_last_reported_length
+
+        # first_reported column
+        _fault_first_reported_length = len(str(fault["first_reported"])) + 1
+        if _fault_first_reported_length > fault_first_reported_length:
+            fault_first_reported_length = _fault_first_reported_length
+
+    # Format the string (header)
+    fault_list_output.append(
+        "{bold}{fault_id: <{fault_id_length}} {fault_status: <{fault_status_length}} {fault_health_delta: <{fault_health_delta_length}} {fault_acknowledged_at: <{fault_acknowledged_at_length}} {fault_last_reported: <{fault_last_reported_length}} {fault_first_reported: <{fault_first_reported_length}}{end_bold}".format(
+            bold=ansii["bold"],
+            end_bold=ansii["end"],
+            fault_id_length=fault_id_length,
+            fault_status_length=fault_status_length,
+            fault_health_delta_length=fault_health_delta_length,
+            fault_acknowledged_at_length=fault_acknowledged_at_length,
+            fault_last_reported_length=fault_last_reported_length,
+            fault_first_reported_length=fault_first_reported_length,
+            fault_id="ID",
+            fault_status="Status",
+            fault_health_delta="Health",
+            fault_acknowledged_at="Ack'd On",
+            fault_last_reported="Last Reported",
+            fault_first_reported="First Reported",
+        )
+    )
+    fault_list_output.append(
+        "{bold}> {fault_message}{end_bold}".format(
+            bold=ansii["bold"],
+            end_bold=ansii["end"],
+            fault_message="Message",
+        )
+    )
+
+    for fault in sorted(
+        fault_data,
+        key=lambda x: (x["status"], x["health_delta"], x["last_reported"]),
+        reverse=True,
+    ):
+        health_delta = fault["health_delta"]
+        if fault["acknowledged_at"] != "":
+            health_colour = ansii["blue"]
+        elif health_delta >= 50:
+            health_colour = ansii["red"]
+        elif health_delta >= 10:
+            health_colour = ansii["yellow"]
+        else:
+            health_colour = ansii["green"]
+
+        fault_list_output.append("")
+        fault_list_output.append(
+            "{bold}{fault_id: <{fault_id_length}} {health_colour}{fault_status: <{fault_status_length}} {fault_health_delta: <{fault_health_delta_length}}{end_colour} {fault_acknowledged_at: <{fault_acknowledged_at_length}} {fault_last_reported: <{fault_last_reported_length}} {fault_first_reported: <{fault_first_reported_length}}{end_bold}".format(
+                bold="",
+                end_bold="",
+                health_colour=health_colour,
+                end_colour=ansii["end"],
+                fault_id_length=fault_id_length,
+                fault_status_length=fault_status_length,
+                fault_health_delta_length=fault_health_delta_length,
+                fault_acknowledged_at_length=fault_acknowledged_at_length,
+                fault_last_reported_length=fault_last_reported_length,
+                fault_first_reported_length=fault_first_reported_length,
+                fault_id=fault["id"],
+                fault_status=fault["status"].title(),
+                fault_health_delta=f"-{fault['health_delta']}%",
+                fault_acknowledged_at=(
+                    fault["acknowledged_at"]
+                    if fault["acknowledged_at"] != ""
+                    else "N/A"
+                ),
+                fault_last_reported=fault["last_reported"],
+                fault_first_reported=fault["first_reported"],
+            )
+        )
+        fault_list_output.append(
+            "> {fault_message}".format(
+                fault_message=fault["message"],
+            )
+        )
+
+    return "\n".join(fault_list_output)
+
+
 def cli_cluster_task_format_pretty(CLI_CONFIG, task_data):
    """
    Pretty format the output of cli_cluster_task
@ -310,6 +723,24 @@ def cli_cluster_task_format_pretty(CLI_CONFIG, task_data):
        if _task_type_length > task_type_length:
            task_type_length = _task_type_length

+        for arg_name, arg_data in task["kwargs"].items():
+            # Skip the "run_on" argument
+            if arg_name == "run_on":
+                continue
+
+            # task_arg_name column
+            _task_arg_name_length = len(str(arg_name)) + 1
+            if _task_arg_name_length > task_arg_name_length:
+                task_arg_name_length = _task_arg_name_length
+
+    task_header_length = (
+        task_id_length + task_name_length + task_type_length + task_worker_length + 3
+    )
+    max_task_data_length = (
+        MAX_CONTENT_WIDTH - task_header_length - task_arg_name_length - 2
+    )
+
+    for task in task_data:
        updated_kwargs = list()
        for arg_name, arg_data in task["kwargs"].items():
            # Skip the "run_on" argument
@ -321,15 +752,30 @@ def cli_cluster_task_format_pretty(CLI_CONFIG, task_data):
            if _task_arg_name_length > task_arg_name_length:
                task_arg_name_length = _task_arg_name_length

-            if len(str(arg_data)) > 17:
-                arg_data = arg_data[:17] + "..."
+            if isinstance(arg_data, list):
+                for subarg_data in arg_data:
+                    if len(str(subarg_data)) > max_task_data_length:
+                        subarg_data = (
+                            str(subarg_data)[: max_task_data_length - 4] + " ..."
+                        )

-            # task_arg_data column
-            _task_arg_data_length = len(str(arg_data)) + 1
-            if _task_arg_data_length > task_arg_data_length:
-                task_arg_data_length = _task_arg_data_length
+                    # task_arg_data column
+                    _task_arg_data_length = len(str(subarg_data)) + 1
+                    if _task_arg_data_length > task_arg_data_length:
+                        task_arg_data_length = _task_arg_data_length
+
+                    updated_kwargs.append({"name": arg_name, "data": subarg_data})
+            else:
+                if len(str(arg_data)) > 24:
+                    arg_data = str(arg_data)[:24] + " ..."
+
+                # task_arg_data column
+                _task_arg_data_length = len(str(arg_data)) + 1
+                if _task_arg_data_length > task_arg_data_length:
+                    task_arg_data_length = _task_arg_data_length
+
+                updated_kwargs.append({"name": arg_name, "data": arg_data})

-            updated_kwargs.append({"name": arg_name, "data": arg_data})
        task["kwargs"] = updated_kwargs
        tasks.append(task)

@ -511,6 +957,28 @@ def cli_connection_list_format_pretty(CLI_CONFIG, data):
    return "\n".join(output)


+def cli_connection_list_format_prometheus_json(CLI_CONFIG, data):
+    """
+    Format the output of cli_connection_list as Prometheus file service discovery JSON
+    """
+
+    from json import dumps
+
+    output = list()
+    for connection in data:
+        output_obj = {
+            "targets": [f"{connection['address']}:{connection['port']}"],
+            "labels": {
+                "job": "pvc",
+                "pvc_cluster_name": f"{connection['name']}: {connection['description']}",
+                "pvc_cluster_id": connection["name"],
+            },
+        }
+        output.append(output_obj)
+
+    return dumps(output, indent=2)
+
+
 def cli_connection_detail_format_pretty(CLI_CONFIG, data):
    """
    Pretty format the output of cli_connection_detail
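
The new `cli_connection_list_format_prometheus_json` formatter emits one target group per connection in the shape Prometheus expects for file-based service discovery. A minimal standalone sketch of the same mapping follows; the function name `connections_to_file_sd`, the `job` parameter, and the sample connection are illustrative assumptions, not part of this change.

```python
import json


def connections_to_file_sd(connections, job="pvc"):
    """Build Prometheus file service discovery JSON from connection dicts."""
    groups = []
    for connection in connections:
        groups.append(
            {
                # One scrape target per connection, as "host:port"
                "targets": [f"{connection['address']}:{connection['port']}"],
                "labels": {
                    "job": job,
                    "pvc_cluster_name": f"{connection['name']}: {connection['description']}",
                    "pvc_cluster_id": connection["name"],
                },
            }
        )
    return json.dumps(groups, indent=2)


# Hypothetical connection entry, for illustration only
sample = [
    {
        "name": "cluster1",
        "description": "Test cluster",
        "address": "10.0.0.1",
        "port": 7370,
    }
]
print(connections_to_file_sd(sample))
```

The output is a JSON list, so it can be written to a file that a `file_sd_configs` entry points at.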
--- a/client-cli/pvc/cli/helpers.py
+++ b/client-cli/pvc/cli/helpers.py
@ -3,7 +3,7 @@
 # helpers.py - PVC Click CLI helper function library
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2023 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -20,32 +20,29 @@
 ###############################################################################

 from click import echo as click_echo
-from click import confirm
-from datetime import datetime
 from distutils.util import strtobool
-from getpass import getuser
 from json import load as jload
 from json import dump as jdump
-from os import chmod, environ, getpid, path, makedirs
-from re import findall
+from os import chmod, environ, getpid, path, get_terminal_size
 from socket import gethostname
-from subprocess import run, PIPE
 from sys import argv
 from syslog import syslog, openlog, closelog, LOG_AUTH
 from yaml import load as yload
 from yaml import SafeLoader

-import pvc.lib.provisioner
-import pvc.lib.vm
-import pvc.lib.node
-

 DEFAULT_STORE_DATA = {"cfgfile": "/etc/pvc/pvc.conf"}
 DEFAULT_STORE_FILENAME = "pvc.json"
 DEFAULT_API_PREFIX = "/api/v1"
 DEFAULT_NODE_HOSTNAME = gethostname().split(".")[0]
 DEFAULT_AUTOBACKUP_FILENAME = "/etc/pvc/pvc.conf"
-MAX_CONTENT_WIDTH = 120
+
+try:
+    # Define the content width to be the maximum terminal size
+    MAX_CONTENT_WIDTH = get_terminal_size().columns - 1
+except OSError:
+    # Fall back to 80 columns if "Inappropriate ioctl for device"
+    MAX_CONTENT_WIDTH = 80


 def echo(config, message, newline=True, stderr=False):
@ -189,322 +186,3 @@ def update_store(store_path, store_data):

    with open(store_file, "w") as fh:
        jdump(store_data, fh, sort_keys=True, indent=4)
-
-
-def get_autobackup_config(CLI_CONFIG, cfgfile):
-    try:
-        config = dict()
-        with open(cfgfile) as fh:
-            backup_config = yload(fh, Loader=SafeLoader)["autobackup"]
-
-        config["backup_root_path"] = backup_config["backup_root_path"]
-        config["backup_root_suffix"] = backup_config["backup_root_suffix"]
-        config["backup_tags"] = backup_config["backup_tags"]
-        config["backup_schedule"] = backup_config["backup_schedule"]
-        config["auto_mount_enabled"] = backup_config["auto_mount"]["enabled"]
-        if config["auto_mount_enabled"]:
-            config["mount_cmds"] = list()
-            _mount_cmds = backup_config["auto_mount"]["mount_cmds"]
-            for _mount_cmd in _mount_cmds:
-                if "{backup_root_path}" in _mount_cmd:
-                    _mount_cmd = _mount_cmd.format(
-                        backup_root_path=backup_config["backup_root_path"]
-                    )
-                config["mount_cmds"].append(_mount_cmd)
-
-            config["unmount_cmds"] = list()
-            _unmount_cmds = backup_config["auto_mount"]["unmount_cmds"]
-            for _unmount_cmd in _unmount_cmds:
-                if "{backup_root_path}" in _unmount_cmd:
-                    _unmount_cmd = _unmount_cmd.format(
-                        backup_root_path=backup_config["backup_root_path"]
-                    )
-                config["unmount_cmds"].append(_unmount_cmd)
-
-    except FileNotFoundError:
-        echo(CLI_CONFIG, "ERROR: Specified backup configuration does not exist!")
-        exit(1)
-    except KeyError as e:
-        echo(CLI_CONFIG, f"ERROR: Backup configuration is invalid: {e}")
-        exit(1)
-
-    return config
-
-
-def vm_autobackup(
-    CLI_CONFIG,
-    autobackup_cfgfile=DEFAULT_AUTOBACKUP_FILENAME,
-    force_full_flag=False,
-    cron_flag=False,
-):
-    """
-    Perform automatic backups of VMs based on an external config file.
-    """
-
-    # Validate that we are running on the current primary coordinator of the 'local' cluster connection
-    real_connection = CLI_CONFIG["connection"]
-    CLI_CONFIG["connection"] = "local"
-    retcode, retdata = pvc.lib.node.node_info(CLI_CONFIG, DEFAULT_NODE_HOSTNAME)
-    if not retcode or retdata.get("coordinator_state") != "primary":
-        if cron_flag:
-            echo(
-                CLI_CONFIG,
-                "Current host is not the primary coordinator of the local cluster and running in cron mode. Exiting cleanly.",
-            )
-            exit(0)
-        else:
-            echo(
-                CLI_CONFIG,
-                f"ERROR: Current host is not the primary coordinator of the local cluster; got connection '{real_connection}', host '{DEFAULT_NODE_HOSTNAME}'.",
-            )
-            echo(
-                CLI_CONFIG,
-                "Autobackup MUST be run from the cluster active primary coordinator using the 'local' connection. See '-h'/'--help' for details.",
-            )
-            exit(1)
-
-    # Ensure we're running as root, or show a warning & confirmation
-    if getuser() != "root":
-        confirm(
-            "WARNING: You are not running this command as 'root'. This command should be run under the same user as the API daemon, which is usually 'root'. Are you sure you want to continue?",
-            prompt_suffix=" ",
-            abort=True,
-        )
-
-    # Load our YAML config
-    autobackup_config = get_autobackup_config(CLI_CONFIG, autobackup_cfgfile)
-
-    # Get a list of all VMs on the cluster
-    # We don't do tag filtering here, because we could match an arbitrary number of tags; instead, we
-    # parse the list after
-    retcode, retdata = pvc.lib.vm.vm_list(CLI_CONFIG, None, None, None, None, None)
-    if not retcode:
-        echo(CLI_CONFIG, f"ERROR: Failed to fetch VM list: {retdata}")
-        exit(1)
-    cluster_vms = retdata
-
-    # Parse the list to match tags; too complex for list comprehension alas
-    backup_vms = list()
-    for vm in cluster_vms:
-        vm_tag_names = [t["name"] for t in vm["tags"]]
-        matching_tags = (
-            True
-            if len(
-                set(vm_tag_names).intersection(set(autobackup_config["backup_tags"]))
-            )
-            > 0
-            else False
-        )
-        if matching_tags:
-            backup_vms.append(vm["name"])
-
-    if len(backup_vms) < 1:
-        echo(CLI_CONFIG, "Found no suitable VMs for autobackup.")
-        exit(0)
-
-    # Pretty print the names of the VMs we'll back up (to stderr)
-    maxnamelen = max([len(n) for n in backup_vms]) + 2
-    cols = 1
-    while (cols * maxnamelen + maxnamelen + 2) <= MAX_CONTENT_WIDTH:
-        cols += 1
-    rows = len(backup_vms) // cols
-    vm_list_rows = list()
-    for row in range(0, rows + 1):
-        row_start = row * cols
-        row_end = (row * cols) + cols
-        row_str = ""
-        for x in range(row_start, row_end):
-            if x < len(backup_vms):
-                row_str += "{:<{}}".format(backup_vms[x], maxnamelen)
-        vm_list_rows.append(row_str)
-
-    echo(CLI_CONFIG, f"Found {len(backup_vms)} suitable VM(s) for autobackup.")
-    echo(CLI_CONFIG, "Full VM list:", stderr=True)
-    echo(CLI_CONFIG, "  {}".format("\n  ".join(vm_list_rows)), stderr=True)
-    echo(CLI_CONFIG, "", stderr=True)
-
-    if autobackup_config["auto_mount_enabled"]:
-        # Execute each mount_cmds command in sequence
-        for cmd in autobackup_config["mount_cmds"]:
-            echo(
-                CLI_CONFIG,
-                f"Executing mount command '{cmd.split()[0]}'... ",
-                newline=False,
-            )
-            tstart = datetime.now()
-            ret = run(
-                cmd.split(),
-                stdout=PIPE,
-                stderr=PIPE,
-            )
-            tend = datetime.now()
-            ttot = tend - tstart
-            if ret.returncode != 0:
-                echo(
-                    CLI_CONFIG,
-                    f"failed. [{ttot.seconds}s]",
-                )
-                echo(
-                    CLI_CONFIG,
-                    f"Exiting; command reports: {ret.stderr.decode().strip()}",
-                )
-                exit(1)
-            else:
-                echo(CLI_CONFIG, f"done. [{ttot.seconds}s]")
-
-    # For each VM, perform the backup
-    for vm in backup_vms:
-        backup_suffixed_path = f"{autobackup_config['backup_root_path']}{autobackup_config['backup_root_suffix']}"
-        if not path.exists(backup_suffixed_path):
-            makedirs(backup_suffixed_path)
-
-        backup_path = f"{backup_suffixed_path}/{vm}"
-        autobackup_state_file = f"{backup_path}/.autobackup.json"
-        if not path.exists(backup_path) or not path.exists(autobackup_state_file):
-            # There are no new backups so the list is empty
-            state_data = dict()
-            tracked_backups = list()
-        else:
-            with open(autobackup_state_file) as fh:
-                state_data = jload(fh)
-            tracked_backups = state_data["tracked_backups"]
-
-        full_interval = autobackup_config["backup_schedule"]["full_interval"]
-        full_retention = autobackup_config["backup_schedule"]["full_retention"]
-
-        full_backups = [b for b in tracked_backups if b["type"] == "full"]
-        if len(full_backups) > 0:
-            last_full_backup = full_backups[0]
-            last_full_backup_idx = tracked_backups.index(last_full_backup)
-            if force_full_flag:
-                this_backup_type = "forced-full"
-                this_backup_incremental_parent = None
-                this_backup_retain_snapshot = True
-            elif last_full_backup_idx >= full_interval - 1:
-                this_backup_type = "full"
-                this_backup_incremental_parent = None
-                this_backup_retain_snapshot = True
-            else:
-                this_backup_type = "incremental"
-                this_backup_incremental_parent = last_full_backup["datestring"]
-                this_backup_retain_snapshot = False
-        else:
-            # The very first backup must be full to start the tree
-            this_backup_type = "full"
-            this_backup_incremental_parent = None
-            this_backup_retain_snapshot = True
-
-        # Perform the backup
-        echo(
-            CLI_CONFIG,
-            f"Backing up VM '{vm}' ({this_backup_type})... ",
-            newline=False,
-        )
-        tstart = datetime.now()
-        retcode, retdata = pvc.lib.vm.vm_backup(
-            CLI_CONFIG,
-            vm,
-            backup_suffixed_path,
-            incremental_parent=this_backup_incremental_parent,
-            retain_snapshot=this_backup_retain_snapshot,
-        )
-        tend = datetime.now()
-        ttot = tend - tstart
-        if not retcode:
-            echo(CLI_CONFIG, f"failed. [{ttot.seconds}s]")
-            echo(CLI_CONFIG, f"Skipping cleanups; command reports: {retdata}")
-            continue
-        else:
-            backup_datestring = findall(r"[0-9]{14}", retdata)[0]
-            echo(
-                CLI_CONFIG,
-                f"done. Backup '{backup_datestring}' created. [{ttot.seconds}s]",
-            )
-
-        # Read backup file to get details
-        backup_json_file = f"{backup_path}/{backup_datestring}/pvcbackup.json"
-        with open(backup_json_file) as fh:
-            backup_json = jload(fh)
-        backup = {
-            "datestring": backup_json["datestring"],
-            "type": backup_json["type"],
-            "parent": backup_json["incremental_parent"],
-            "retained_snapshot": backup_json["retained_snapshot"],
-        }
-        tracked_backups.insert(0, backup)
-
-        # Delete any full backups that are expired
-        marked_for_deletion = list()
-        found_full_count = 0
-        for backup in tracked_backups:
-            if backup["type"] == "full":
-                found_full_count += 1
-                if found_full_count > full_retention:
-                    marked_for_deletion.append(backup)
-
-        # Delete any incremental backups that depend on marked parents
-        for backup in tracked_backups:
-            if backup["type"] == "incremental" and backup["parent"] in [
-                b["datestring"] for b in marked_for_deletion
-            ]:
-                marked_for_deletion.append(backup)
-
-        # Execute deletes
-        for backup_to_delete in marked_for_deletion:
-            echo(
-                CLI_CONFIG,
-                f"Removing old VM '{vm}' backup '{backup_to_delete['datestring']}' ({backup_to_delete['type']})... ",
-                newline=False,
-            )
-            tstart = datetime.now()
-            retcode, retdata = pvc.lib.vm.vm_remove_backup(
-                CLI_CONFIG,
-                vm,
-                backup_suffixed_path,
-                backup_to_delete["datestring"],
-            )
-            tend = datetime.now()
-            ttot = tend - tstart
-            if not retcode:
-                echo(CLI_CONFIG, f"failed. [{ttot.seconds}s]")
-                echo(
-                    CLI_CONFIG,
-                    f"Skipping removal from tracked backups; command reports: {retdata}",
-                )
-                continue
-            else:
-                tracked_backups.remove(backup_to_delete)
-                echo(CLI_CONFIG, f"done. [{ttot.seconds}s]")
-
-        # Update tracked state information
-        state_data["tracked_backups"] = tracked_backups
-        with open(autobackup_state_file, "w") as fh:
-            jdump(state_data, fh)
-
-    if autobackup_config["auto_mount_enabled"]:
-        # Execute each unmount_cmds command in sequence
-        for cmd in autobackup_config["unmount_cmds"]:
-            echo(
-                CLI_CONFIG,
-                f"Executing unmount command '{cmd.split()[0]}'... ",
-                newline=False,
-            )
-            tstart = datetime.now()
-            ret = run(
-                cmd.split(),
-                stdout=PIPE,
-                stderr=PIPE,
-            )
-            tend = datetime.now()
-            ttot = tend - tstart
-            if ret.returncode != 0:
-                echo(
-                    CLI_CONFIG,
-                    f"failed. [{ttot.seconds}s]",
-                )
-                echo(
-                    CLI_CONFIG,
-                    f"Continuing; command reports: {ret.stderr.decode().strip()}",
-                )
-            else:
-                echo(CLI_CONFIG, f"done. [{ttot.seconds}s]")
--- a/client-cli/pvc/cli/parsers.py
+++ b/client-cli/pvc/cli/parsers.py
@ -3,7 +3,7 @@
 # parsers.py - PVC Click CLI data parser function library
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2023 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/client-cli/pvc/cli/waiters.py
+++ b/client-cli/pvc/cli/waiters.py
@ -3,7 +3,7 @@
 # waiters.py - PVC Click CLI output waiters library
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2023 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -115,10 +115,14 @@ def wait_for_celery_task(CLI_CONFIG, task_detail, start_late=False):
        )
        while True:
            sleep(0.5)
+            if isinstance(task_status, tuple):
+                continue
            if task_status.get("state") != "RUNNING":
                break
            if task_status.get("current") > last_task:
                current_task = int(task_status.get("current"))
+                total_task = int(task_status.get("total"))
+                bar.length = total_task
                bar.update(current_task - last_task)
                last_task = current_task
                # The extensive spaces at the end cause this to overwrite longer previous messages
--- a/client-cli/pvc/lib/ansiprint.py
+++ b/client-cli/pvc/lib/ansiprint.py
@ -3,7 +3,7 @@
 # ansiprint.py - Printing function for formatted messages
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/client-cli/pvc/lib/cluster.py
+++ b/client-cli/pvc/lib/cluster.py
@ -3,7 +3,7 @@
 # cluster.py - PVC CLI client function library, cluster management
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -21,6 +21,8 @@

 import json

+from time import sleep
+
 from pvc.lib.common import call_api


@ -114,3 +116,22 @@ def get_info(config):
        return True, response.json()
    else:
        return False, response.json().get("message", "")
+
+
+def get_primary_node(config):
+    """
+    Get the current primary node of the PVC cluster
+
+    API endpoint: GET /api/v1/status/primary_node
+    API arguments:
+    API schema: {json_data_object}
+    """
+    while True:
+        response = call_api(config, "get", "/status/primary_node")
+        resp_code = response.status_code
+        if resp_code == 200:
+            break
+        else:
+            sleep(1)
+
+    return True, response.json()["primary_node"]
--- a/client-cli/pvc/lib/common.py
+++ b/client-cli/pvc/lib/common.py
@ -3,7 +3,7 @@
 # common.py - PVC CLI client function library, Common functions
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -108,9 +108,10 @@ class UploadProgressBar(object):


 class ErrorResponse(requests.Response):
-    def __init__(self, json_data, status_code):
+    def __init__(self, json_data, status_code, headers):
        self.json_data = json_data
        self.status_code = status_code
+        self.headers = headers

    def json(self):
        return self.json_data
@ -140,15 +141,32 @@ def call_api(
    # Determine the request type and hit the API
    disable_warnings()
    try:
+        response = None
        if operation == "get":
-            response = requests.get(
-                uri,
-                timeout=timeout,
-                headers=headers,
-                params=params,
-                data=data,
-                verify=config["verify_ssl"],
-            )
+            retry_on_code = [429, 500, 502, 503, 504]
+            for i in range(3):
+                failed = False
+                try:
+                    response = requests.get(
+                        uri,
+                        timeout=timeout,
+                        headers=headers,
+                        params=params,
+                        data=data,
+                        verify=config["verify_ssl"],
+                    )
+                    if response.status_code in retry_on_code:
+                        failed = True
+                        continue
+                    break
+                except requests.exceptions.ConnectionError:
+                    failed = True
+                    continue
+            if failed:
+                error = (
+                    f"Code {response.status_code}"
+                    if response is not None
+                    else "Timeout"
+                )
+                raise requests.exceptions.ConnectionError(
+                    f"Failed to connect after 3 tries ({error})"
+                )
        if operation == "post":
            response = requests.post(
                uri,
@ -189,7 +207,8 @@ def call_api(
            )
    except Exception as e:
        message = "Failed to connect to the API: {}".format(e)
-        response = ErrorResponse({"message": message}, 500)
+        code = response.status_code if response is not None else 504
+        response = ErrorResponse({"message": message}, code, None)

    # Display debug output
    if config["debug"]:
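
One detail of the GET retry loop in `call_api` is worth spelling out: a `requests.Response` is falsy whenever its status code is 400 or above, so a truthiness check cannot tell "no response at all" apart from "an error response". The sketch below reproduces the retry logic without touching the network. `FakeResponse`, `get_with_retries`, and `make_fetch` are hypothetical names, and the built-in `ConnectionError` stands in for `requests.exceptions.ConnectionError`.

```python
RETRY_ON_CODE = {429, 500, 502, 503, 504}


class FakeResponse:
    """Hypothetical stand-in for requests.Response; falsy for 4xx/5xx like the real class."""

    def __init__(self, status_code):
        self.status_code = status_code

    def __bool__(self):
        return self.status_code < 400


def get_with_retries(fetch, tries=3):
    """Call fetch() up to `tries` times, retrying on connection errors and retryable codes."""
    response = None
    for _ in range(tries):
        try:
            response = fetch()
        except ConnectionError:
            # Built-in ConnectionError stands in for requests.exceptions.ConnectionError
            continue
        if response.status_code not in RETRY_ON_CODE:
            return response
    # Compare against None: an error response is falsy but still carries a status code
    error = f"Code {response.status_code}" if response is not None else "Timeout"
    raise ConnectionError(f"Failed to connect after {tries} tries ({error})")


def make_fetch(codes):
    """Return a fetch callable that yields one FakeResponse per call from `codes`."""
    remaining = iter(codes)
    return lambda: FakeResponse(next(remaining))


print(get_with_retries(make_fetch([503, 502, 200])).status_code)
```

With `if response` in place of `if response is not None`, three consecutive 503 replies would be reported as "Timeout" rather than "Code 503".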
--- a/client-cli/pvc/lib/faults.py
+++ b/client-cli/pvc/lib/faults.py
@ -0,0 +1,127 @@
+#!/usr/bin/env python3
+
+# faults.py - PVC CLI client function library, faults management
+# Part of the Parallel Virtual Cluster (PVC) system
+#
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
+#
+#    This program is free software: you can redistribute it and/or modify
+#    it under the terms of the GNU General Public License as published by
+#    the Free Software Foundation, version 3.
+#
+#    This program is distributed in the hope that it will be useful,
+#    but WITHOUT ANY WARRANTY; without even the implied warranty of
+#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#    GNU General Public License for more details.
+#
+#    You should have received a copy of the GNU General Public License
+#    along with this program.  If not, see <https://www.gnu.org/licenses/>.
+#
+###############################################################################
+
+from pvc.lib.common import call_api
+
+
+def get_list(config, limit=None, sort_key="last_reported"):
+    """
+    Get list of PVC faults
+
+    API endpoint: GET /api/v1/faults
+    API arguments: sort_key={sort_key}
+    API schema: {json_data_object}
+    """
+    if limit is not None:
+        params = {}
+        endpoint = f"/faults/{limit}"
+    else:
+        params = {"sort_key": sort_key}
+        endpoint = "/faults"
+
+    response = call_api(config, "get", endpoint, params=params)
+
+    if response.status_code == 200:
+        return True, response.json()
+    else:
+        return False, response.json().get("message", "")
+
+
+def acknowledge(config, faults):
+    """
+    Acknowledge one or more PVC faults
+
+    API endpoint: PUT /api/v1/faults/<fault_id> for fault_id in faults
+    API arguments:
+    API schema: {json_message}
+    """
+    status_codes = list()
+    bad_msgs = list()
+    for fault_id in faults:
+        response = call_api(config, "put", f"/faults/{fault_id}")
+
+        if response.status_code == 200:
+            status_codes.append(True)
+        else:
+            status_codes.append(False)
+            bad_msgs.append(response.json().get("message", ""))
+
+    if all(status_codes):
+        return True, f"Successfully acknowledged fault(s) {', '.join(faults)}"
+    else:
+        return False, ", ".join(bad_msgs)
+
+
+def acknowledge_all(config):
+    """
+    Acknowledge all PVC faults
+
+    API endpoint: PUT /api/v1/faults
+    API arguments:
+    API schema: {json_message}
+    """
+    response = call_api(config, "put", "/faults")
+
+    if response.status_code == 200:
+        return True, response.json().get("message", "")
+    else:
+        return False, response.json().get("message", "")
+
+
+def delete(config, faults):
+    """
+    Delete one or more PVC faults
+
+    API endpoint: DELETE /api/v1/faults/<fault_id> for fault_id in faults
+    API arguments:
+    API schema: {json_message}
+    """
+    status_codes = list()
+    bad_msgs = list()
+    for fault_id in faults:
+        response = call_api(config, "delete", f"/faults/{fault_id}")
+
+        if response.status_code == 200:
+            status_codes.append(True)
+        else:
+            status_codes.append(False)
+            bad_msgs.append(response.json().get("message", ""))
+
+    if all(status_codes):
+        return True, f"Successfully deleted fault(s) {', '.join(faults)}"
+    else:
+        return False, ", ".join(bad_msgs)
+
+
+def delete_all(config):
+    """
+    Delete all PVC faults
+
+    API endpoint: DELETE /api/v1/faults
+    API arguments:
+    API schema: {json_message}
+    """
+    response = call_api(config, "delete", "/faults")
+
+    if response.status_code == 200:
+        return True, response.json().get("message", "")
+    else:
+        return False, response.json().get("message", "")
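
`acknowledge` and `delete` in `faults.py` share the same shape: call the API once per fault ID, then collapse the per-fault outcomes into a single `(success, message)` pair. One possible way to factor out that collapse step is sketched here; `summarize_fault_results` and the sample IDs are hypothetical and do not appear in this change.

```python
def summarize_fault_results(action, results):
    """Collapse per-fault (fault_id, ok, message) results into one (success, message) pair."""
    failures = [message for _, ok, message in results if not ok]
    if failures:
        # Any single failure makes the whole operation fail, as in the library functions
        return False, ", ".join(failures)
    # str() guards against IDs that are not already strings
    fault_ids = ", ".join(str(fault_id) for fault_id, _, _ in results)
    return True, f"Successfully {action} fault(s) {fault_ids}"


# Hypothetical fault IDs and messages, for illustration only
print(summarize_fault_results("acknowledged", [("abc123", True, ""), ("def456", True, "")]))
print(summarize_fault_results("deleted", [("abc123", True, ""), ("zzz999", False, "No fault found")]))
```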
--- a/client-cli/pvc/lib/network.py
+++ b/client-cli/pvc/lib/network.py
@ -3,7 +3,7 @@
 # network.py - PVC CLI client function library, Network functions
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/client-cli/pvc/lib/node.py
+++ b/client-cli/pvc/lib/node.py
@ -3,7 +3,7 @@
 # node.py - PVC CLI client function library, node management
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -249,6 +249,8 @@ def getOutputColours(node_information):
        daemon_state_colour = ansiprint.yellow()
    elif node_information["daemon_state"] == "dead":
        daemon_state_colour = ansiprint.red() + ansiprint.bold()
+    elif node_information["daemon_state"] == "fenced":
+        daemon_state_colour = ansiprint.red()
    else:
        daemon_state_colour = ansiprint.blue()

--- a/client-cli/pvc/lib/provisioner.py
+++ b/client-cli/pvc/lib/provisioner.py
@ -3,7 +3,7 @@
 # provisioner.py - PVC CLI client function library, Provisioner functions
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -779,7 +779,8 @@ def format_list_template_system(template_data):
    template_node_limit_length = 6
    template_node_selector_length = 9
    template_node_autostart_length = 10
-    template_migration_method_length = 10
+    template_migration_method_length = 12
+    template_migration_max_downtime_length = 13

    for template in template_data:
        # template_name column
@ -826,6 +827,17 @@ def format_list_template_system(template_data):
        _template_migration_method_length = len(str(template["migration_method"])) + 1
        if _template_migration_method_length > template_migration_method_length:
            template_migration_method_length = _template_migration_method_length
+        # template_migration_max_downtime column
+        _template_migration_max_downtime_length = (
+            len(str(template["migration_max_downtime"])) + 1
+        )
+        if (
+            _template_migration_max_downtime_length
+            > template_migration_max_downtime_length
+        ):
+            template_migration_max_downtime_length = (
+                _template_migration_max_downtime_length
+            )

    # Format the string (header)
    template_list_output.append(
@ -842,7 +854,8 @@ def format_list_template_system(template_data):
            + template_node_selector_length
            + template_node_autostart_length
            + template_migration_method_length
-            + 3,
+            + template_migration_max_downtime_length
+            + 4,
            template_header="System Templates "
            + "".join(
                ["-" for _ in range(17, template_name_length + template_id_length)]
@ -874,7 +887,8 @@ def format_list_template_system(template_data):
                        + template_node_selector_length
                        + template_node_autostart_length
                        + template_migration_method_length
-                        + 2,
+                        + template_migration_max_downtime_length
+                        + 3,
                    )
                ]
            ),
@ -891,7 +905,8 @@ def format_list_template_system(template_data):
 {template_node_limit: <{template_node_limit_length}} \
 {template_node_selector: <{template_node_selector_length}} \
 {template_node_autostart: <{template_node_autostart_length}} \
-{template_migration_method: <{template_migration_method_length}}{end_bold}".format(
+{template_migration_method: <{template_migration_method_length}} \
+{template_migration_max_downtime: <{template_migration_max_downtime_length}}{end_bold}".format(
            bold=ansiprint.bold(),
            end_bold=ansiprint.end(),
            template_name_length=template_name_length,
@ -905,6 +920,7 @@ def format_list_template_system(template_data):
            template_node_selector_length=template_node_selector_length,
            template_node_autostart_length=template_node_autostart_length,
            template_migration_method_length=template_migration_method_length,
+            template_migration_max_downtime_length=template_migration_max_downtime_length,
            template_name="Name",
            template_id="ID",
            template_vcpu="vCPUs",
@ -915,7 +931,8 @@ def format_list_template_system(template_data):
            template_node_limit="Limit",
            template_node_selector="Selector",
            template_node_autostart="Autostart",
-            template_migration_method="Migration",
+            template_migration_method="Mig. Method",
+            template_migration_max_downtime="Max Downtime",
        )
    )

@ -931,7 +948,8 @@ def format_list_template_system(template_data):
 {template_node_limit: <{template_node_limit_length}} \
 {template_node_selector: <{template_node_selector_length}} \
 {template_node_autostart: <{template_node_autostart_length}} \
-{template_migration_method: <{template_migration_method_length}}{end_bold}".format(
+{template_migration_method: <{template_migration_method_length}} \
+{template_migration_max_downtime: <{template_migration_max_downtime_length}}{end_bold}".format(
                template_name_length=template_name_length,
                template_id_length=template_id_length,
                template_vcpu_length=template_vcpu_length,
@ -943,6 +961,7 @@ def format_list_template_system(template_data):
                template_node_selector_length=template_node_selector_length,
                template_node_autostart_length=template_node_autostart_length,
                template_migration_method_length=template_migration_method_length,
+                template_migration_max_downtime_length=template_migration_max_downtime_length,
                bold="",
                end_bold="",
                template_name=str(template["name"]),
@ -956,6 +975,7 @@ def format_list_template_system(template_data):
                template_node_selector=str(template["node_selector"]),
                template_node_autostart=str(template["node_autostart"]),
                template_migration_method=str(template["migration_method"]),
+                template_migration_max_downtime=f"{str(template['migration_max_downtime'])} ms",
            )
        )
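
Note on the column sizing above: the width for the new "Max Downtime" column is measured from the raw `migration_max_downtime` value, while the row renders that value with a ` ms` suffix. The following is a minimal sketch, not the project's code, of measuring the rendered string instead; `column_width` and the sample template values are hypothetical names and data used only for illustration.

```python
def column_width(header_min, rendered_values, padding=1):
    """Return a column width: at least header_min, and wide enough
    for the longest rendered value plus padding."""
    width = header_min
    for value in rendered_values:
        candidate = len(str(value)) + padding
        if candidate > width:
            width = candidate
    return width


# Hypothetical template data; only the key name comes from the diff.
templates = [
    {"migration_max_downtime": 300},
    {"migration_max_downtime": 12000000000000},
]

# Measure what will actually be printed, including the " ms" suffix.
rendered = [f"{t['migration_max_downtime']} ms" for t in templates]
width = column_width(13, rendered)

rows = ["{value: <{width}}|".format(value=r, width=width) for r in rendered]
```

With short values the default width of 13 still wins, so the output is unchanged in the common case.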

--- a/client-cli/pvc/lib/storage.py
+++ b/client-cli/pvc/lib/storage.py
@ -3,7 +3,7 @@
 # ceph.py - PVC CLI client function library, Ceph cluster functions
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -30,6 +30,7 @@ from requests_toolbelt.multipart.encoder import (

 import pvc.lib.ansiprint as ansiprint
 from pvc.lib.common import UploadProgressBar, call_api, get_wait_retdata
+from pvc.cli.helpers import MAX_CONTENT_WIDTH

 #
 # Supplemental functions
@ -430,7 +431,9 @@ def format_list_osd(config, osd_list):
            )
            continue

-        if osd_information["is_split"]:
+        if osd_information.get("is_split") is not None and osd_information.get(
+            "is_split"
+        ):
            osd_information["device"] = f"{osd_information['device']} [s]"

        # Deal with the size to human readable
@ -1172,15 +1175,15 @@ def ceph_volume_list(config, limit, pool):
        return False, response.json().get("message", "")


-def ceph_volume_add(config, pool, volume, size):
+def ceph_volume_add(config, pool, volume, size, force_flag=False):
    """
    Add new Ceph volume

    API endpoint: POST /api/v1/storage/ceph/volume
-    API arguments: volume={volume}, pool={pool}, size={size}
+    API arguments: volume={volume}, pool={pool}, size={size}, force={force_flag}
    API schema: {"message":"{data}"}
    """
-    params = {"volume": volume, "pool": pool, "size": size}
+    params = {"volume": volume, "pool": pool, "size": size, "force": force_flag}
    response = call_api(config, "post", "/storage/ceph/volume", params=params)

    if response.status_code == 200:
@ -1261,12 +1264,14 @@ def ceph_volume_remove(config, pool, volume):
    return retstatus, response.json().get("message", "")


-def ceph_volume_modify(config, pool, volume, new_name=None, new_size=None):
+def ceph_volume_modify(
+    config, pool, volume, new_name=None, new_size=None, force_flag=False
+):
    """
    Modify Ceph volume

    API endpoint: PUT /api/v1/storage/ceph/volume/{pool}/{volume}
-    API arguments:
+    API arguments: [new_name={new_name}], [new_size={new_size}], [force={force_flag}]
    API schema: {"message":"{data}"}
    """

@ -1275,6 +1280,7 @@ def ceph_volume_modify(config, pool, volume, new_name=None, new_size=None):
        params["new_name"] = new_name
    if new_size:
        params["new_size"] = new_size
+        params["force"] = force_flag

    response = call_api(
        config,
@ -1291,15 +1297,15 @@ def ceph_volume_modify(config, pool, volume, new_name=None, new_size=None):
    return retstatus, response.json().get("message", "")


-def ceph_volume_clone(config, pool, volume, new_volume):
+def ceph_volume_clone(config, pool, volume, new_volume, force_flag=False):
    """
    Clone Ceph volume

    API endpoint: POST /api/v1/storage/ceph/volume/{pool}/{volume}
-    API arguments: new_volume={new_volume
+    API arguments: new_volume={new_volume}, force_flag={force_flag}
    API schema: {"message":"{data}"}
    """
-    params = {"new_volume": new_volume}
+    params = {"new_volume": new_volume, "force_flag": force_flag}
    response = call_api(
        config,
        "post",
@ -1539,6 +1545,30 @@ def ceph_snapshot_add(config, pool, volume, snapshot):
    return retstatus, response.json().get("message", "")


+def ceph_snapshot_rollback(config, pool, volume, snapshot):
+    """
+    Roll back Ceph volume to snapshot
+
+    API endpoint: POST /api/v1/storage/ceph/snapshot/{pool}/{volume}/{snapshot}/rollback
+    API arguments:
+    API schema: {"message":"{data}"}
+    """
+    response = call_api(
+        config,
+        "post",
+        "/storage/ceph/snapshot/{pool}/{volume}/{snapshot}/rollback".format(
+            snapshot=snapshot, volume=volume, pool=pool
+        ),
+    )
+
+    if response.status_code == 200:
+        retstatus = True
+    else:
+        retstatus = False
+
+    return retstatus, response.json().get("message", "")
+
+
 def ceph_snapshot_remove(config, pool, volume, snapshot):
    """
    Remove Ceph snapshot
@ -1695,15 +1725,17 @@ def format_list_snapshot(config, snapshot_list):
 #
 # Benchmark functions
 #
-def ceph_benchmark_run(config, pool, wait_flag):
+def ceph_benchmark_run(config, pool, name, wait_flag):
    """
    Run a storage benchmark against {pool}

    API endpoint: POST /api/v1/storage/ceph/benchmark
-    API arguments: pool={pool}
+    API arguments: pool={pool}, name={name}
    API schema: {message}
    """
    params = {"pool": pool}
+    if name:
+        params["name"] = name
    response = call_api(config, "post", "/storage/ceph/benchmark", params=params)

    return get_wait_retdata(response, wait_flag)
@ -1775,7 +1807,7 @@ def get_benchmark_list_results(benchmark_format, benchmark_data):
        benchmark_bandwidth, benchmark_iops = get_benchmark_list_results_legacy(
            benchmark_data
        )
-    elif benchmark_format == 1:
+    elif benchmark_format == 1 or benchmark_format == 2:
        benchmark_bandwidth, benchmark_iops = get_benchmark_list_results_json(
            benchmark_data
        )
@ -1977,6 +2009,7 @@ def format_info_benchmark(config, benchmark_information):
    benchmark_matrix = {
        0: format_info_benchmark_legacy,
        1: format_info_benchmark_json,
+        2: format_info_benchmark_json,
    }

    benchmark_version = benchmark_information[0]["test_format"]
@ -2311,12 +2344,15 @@ def format_info_benchmark_json(config, benchmark_information):
    if benchmark_information["benchmark_result"] == "Running":
        return "Benchmark test is still running."

+    benchmark_format = benchmark_information["test_format"]
    benchmark_details = benchmark_information["benchmark_result"]

    # Format a nice output; do this line-by-line then concat the elements at the end
    ainformation = []
    ainformation.append(
-        "{}Storage Benchmark details:{}".format(ansiprint.bold(), ansiprint.end())
+        "{}Storage Benchmark details (format {}):{}".format(
+            ansiprint.bold(), benchmark_format, ansiprint.end()
+        )
    )

    nice_test_name_map = {
@ -2364,7 +2400,7 @@ def format_info_benchmark_json(config, benchmark_information):
            if element[1] != 0:
                useful_latency_tree.append(element)

-        max_rows = 9
+        max_rows = 5
        if len(useful_latency_tree) > 9:
            max_rows = len(useful_latency_tree)
        elif len(useful_latency_tree) < 9:
@ -2373,15 +2409,10 @@ def format_info_benchmark_json(config, benchmark_information):

        # Format the static data
        overall_label = [
-            "Overall BW/s:",
-            "Overall IOPS:",
-            "Total I/O:",
-            "Runtime (s):",
-            "User CPU %:",
-            "System CPU %:",
-            "Ctx Switches:",
-            "Major Faults:",
-            "Minor Faults:",
+            "BW/s:",
+            "IOPS:",
+            "I/O:",
+            "Time:",
        ]
        while len(overall_label) < max_rows:
            overall_label.append("")
@ -2390,68 +2421,149 @@ def format_info_benchmark_json(config, benchmark_information):
            format_bytes_tohuman(int(job_details[io_class]["bw_bytes"])),
            format_ops_tohuman(int(job_details[io_class]["iops"])),
            format_bytes_tohuman(int(job_details[io_class]["io_bytes"])),
-            job_details["job_runtime"] / 1000,
-            job_details["usr_cpu"],
-            job_details["sys_cpu"],
-            job_details["ctx"],
-            job_details["majf"],
-            job_details["minf"],
+            str(job_details["job_runtime"] / 1000) + "s",
        ]
        while len(overall_data) < max_rows:
            overall_data.append("")

+        cpu_label = [
+            "Total:",
+            "User:",
+            "Sys:",
+            "OSD:",
+            "MON:",
+        ]
+        while len(cpu_label) < max_rows:
+            cpu_label.append("")
+
+        cpu_data = [
+            (
+                benchmark_details[test]["avg_cpu_util_percent"]["total"]
+                if benchmark_format > 1
+                else "N/A"
+            ),
+            round(job_details["usr_cpu"], 2),
+            round(job_details["sys_cpu"], 2),
+            (
+                benchmark_details[test]["avg_cpu_util_percent"]["ceph-osd"]
+                if benchmark_format > 1
+                else "N/A"
+            ),
+            (
+                benchmark_details[test]["avg_cpu_util_percent"]["ceph-mon"]
+                if benchmark_format > 1
+                else "N/A"
+            ),
+        ]
+        while len(cpu_data) < max_rows:
+            cpu_data.append("")
+
+        memory_label = [
+            "Total:",
+            "OSD:",
+            "MON:",
+        ]
+        while len(memory_label) < max_rows:
+            memory_label.append("")
+
+        memory_data = [
+            (
+                benchmark_details[test]["avg_memory_util_percent"]["total"]
+                if benchmark_format > 1
+                else "N/A"
+            ),
+            (
+                benchmark_details[test]["avg_memory_util_percent"]["ceph-osd"]
+                if benchmark_format > 1
+                else "N/A"
+            ),
+            (
+                benchmark_details[test]["avg_memory_util_percent"]["ceph-mon"]
+                if benchmark_format > 1
+                else "N/A"
+            ),
+        ]
+        while len(memory_data) < max_rows:
+            memory_data.append("")
+
+        network_label = [
+            "Total:",
+            "Sent:",
+            "Recv:",
+        ]
+        while len(network_label) < max_rows:
+            network_label.append("")
+
+        network_data = [
+            (
+                format_bytes_tohuman(
+                    int(benchmark_details[test]["avg_network_util_bps"]["total"])
+                )
+                if benchmark_format > 1
+                else "N/A"
+            ),
+            (
+                format_bytes_tohuman(
+                    int(benchmark_details[test]["avg_network_util_bps"]["sent"])
+                )
+                if benchmark_format > 1
+                else "N/A"
+            ),
+            (
+                format_bytes_tohuman(
+                    int(benchmark_details[test]["avg_network_util_bps"]["recv"])
+                )
+                if benchmark_format > 1
+                else "N/A"
+            ),
+        ]
+        while len(network_data) < max_rows:
+            network_data.append("")
+
        bandwidth_label = [
            "Min:",
            "Max:",
            "Mean:",
            "StdDev:",
            "Samples:",
-            "",
-            "",
-            "",
-            "",
        ]
        while len(bandwidth_label) < max_rows:
            bandwidth_label.append("")

        bandwidth_data = [
-            format_bytes_tohuman(int(job_details[io_class]["bw_min"]) * 1024),
-            format_bytes_tohuman(int(job_details[io_class]["bw_max"]) * 1024),
-            format_bytes_tohuman(int(job_details[io_class]["bw_mean"]) * 1024),
-            format_bytes_tohuman(int(job_details[io_class]["bw_dev"]) * 1024),
-            job_details[io_class]["bw_samples"],
-            "",
-            "",
-            "",
-            "",
+            format_bytes_tohuman(int(job_details[io_class]["bw_min"]) * 1024)
+            + " / "
+            + format_ops_tohuman(int(job_details[io_class]["iops_min"])),
+            format_bytes_tohuman(int(job_details[io_class]["bw_max"]) * 1024)
+            + " / "
+            + format_ops_tohuman(int(job_details[io_class]["iops_max"])),
+            format_bytes_tohuman(int(job_details[io_class]["bw_mean"]) * 1024)
+            + " / "
+            + format_ops_tohuman(int(job_details[io_class]["iops_mean"])),
+            format_bytes_tohuman(int(job_details[io_class]["bw_dev"]) * 1024)
+            + " / "
+            + format_ops_tohuman(int(job_details[io_class]["iops_stddev"])),
+            str(job_details[io_class]["bw_samples"])
+            + " / "
+            + str(job_details[io_class]["iops_samples"]),
        ]
        while len(bandwidth_data) < max_rows:
            bandwidth_data.append("")

-        iops_data = [
-            format_ops_tohuman(int(job_details[io_class]["iops_min"])),
-            format_ops_tohuman(int(job_details[io_class]["iops_max"])),
-            format_ops_tohuman(int(job_details[io_class]["iops_mean"])),
-            format_ops_tohuman(int(job_details[io_class]["iops_stddev"])),
-            job_details[io_class]["iops_samples"],
-            "",
-            "",
-            "",
-            "",
+        lat_label = [
+            "Min:",
+            "Max:",
+            "Mean:",
+            "StdDev:",
        ]
-        while len(iops_data) < max_rows:
-            iops_data.append("")
+        while len(lat_label) < max_rows:
+            lat_label.append("")

        lat_data = [
            int(job_details[io_class]["lat_ns"]["min"]) / 1000,
            int(job_details[io_class]["lat_ns"]["max"]) / 1000,
            int(job_details[io_class]["lat_ns"]["mean"]) / 1000,
            int(job_details[io_class]["lat_ns"]["stddev"]) / 1000,
-            "",
-            "",
-            "",
-            "",
-            "",
        ]
        while len(lat_data) < max_rows:
            lat_data.append("")
@ -2460,98 +2572,119 @@ def format_info_benchmark_json(config, benchmark_information):
        lat_bucket_label = list()
        lat_bucket_data = list()
        for element in useful_latency_tree:
-            lat_bucket_label.append(element[0])
-            lat_bucket_data.append(element[1])
+            lat_bucket_label.append(element[0] + ":" if element[0] else "")
+            lat_bucket_data.append(round(float(element[1]), 2) if element[1] else "")
+        while len(lat_bucket_label) < max_rows:
+            lat_bucket_label.append("")
+        while len(lat_bucket_data) < max_rows:
+            lat_bucket_data.append("")

        # Column default widths
-        overall_label_length = 0
+        overall_label_length = 5
        overall_column_length = 0
-        bandwidth_label_length = 0
-        bandwidth_column_length = 11
-        iops_column_length = 4
-        latency_column_length = 12
+        cpu_label_length = 6
+        cpu_column_length = 0
+        memory_label_length = 6
+        memory_column_length = 0
+        network_label_length = 6
+        network_column_length = 6
+        bandwidth_label_length = 8
+        bandwidth_column_length = 0
+        latency_label_length = 7
+        latency_column_length = 0
        latency_bucket_label_length = 0
+        latency_bucket_column_length = 0

        # Column layout:
-        #    General    Bandwidth   IOPS      Latency   Percentiles
-        #    ---------  ----------  --------  --------  ---------------
-        #    Size       Min         Min       Min       A
-        #    BW         Max         Max       Max       B
-        #    IOPS       Mean        Mean      Mean      ...
-        #    Runtime    StdDev      StdDev    StdDev    Z
-        #    UsrCPU     Samples     Samples
-        #    SysCPU
-        #    CtxSw
-        #    MajFault
-        #    MinFault
+        #    Overall    CPU   Memory  Network  Bandwidth/IOPS  Latency   Percentiles
+        #    ---------  ----- ------- -------- --------------  --------  ---------------
+        #    BW         Total Total   Total    Min             Min       A
+        #    IOPS       User  OSD     Sent     Max             Max       B
+        #    I/O        Sys   MON     Recv     Mean            Mean      ...
+        #    Time       OSD                    StdDev          StdDev    Z
+        #               MON                    Samples

        # Set column widths
-        for item in overall_label:
-            _item_length = len(str(item))
-            if _item_length > overall_label_length:
-                overall_label_length = _item_length
-
        for item in overall_data:
            _item_length = len(str(item))
            if _item_length > overall_column_length:
                overall_column_length = _item_length

-        test_name_length = len(nice_test_name_map[test])
-        if test_name_length > overall_label_length + overall_column_length:
-            _diff = test_name_length - (overall_label_length + overall_column_length)
-            overall_column_length += _diff
-
-        for item in bandwidth_label:
+        for item in cpu_data:
            _item_length = len(str(item))
-            if _item_length > bandwidth_label_length:
-                bandwidth_label_length = _item_length
+            if _item_length > cpu_column_length:
+                cpu_column_length = _item_length
+
+        for item in memory_data:
+            _item_length = len(str(item))
+            if _item_length > memory_column_length:
+                memory_column_length = _item_length
+
+        for item in network_data:
+            _item_length = len(str(item))
+            if _item_length > network_column_length:
+                network_column_length = _item_length

        for item in bandwidth_data:
            _item_length = len(str(item))
            if _item_length > bandwidth_column_length:
                bandwidth_column_length = _item_length

-        for item in iops_data:
-            _item_length = len(str(item))
-            if _item_length > iops_column_length:
-                iops_column_length = _item_length
-
        for item in lat_data:
            _item_length = len(str(item))
            if _item_length > latency_column_length:
                latency_column_length = _item_length

-        for item in lat_bucket_label:
+        for item in lat_bucket_data:
            _item_length = len(str(item))
-            if _item_length > latency_bucket_label_length:
-                latency_bucket_label_length = _item_length
+            if _item_length > latency_bucket_column_length:
+                latency_bucket_column_length = _item_length

        # Top row (Headers)
        ainformation.append(
-            "{bold}\
-{overall_label: <{overall_label_length}}    \
-{bandwidth_label: <{bandwidth_label_length}} \
-{bandwidth: <{bandwidth_length}}   \
-{iops: <{iops_length}}   \
-{latency: <{latency_length}}   \
-{latency_bucket_label: <{latency_bucket_label_length}} \
-{latency_bucket} \
-{end_bold}".format(
+            "{bold}{overall_label: <{overall_label_length}} {header_fill}{end_bold}".format(
                bold=ansiprint.bold(),
                end_bold=ansiprint.end(),
                overall_label=nice_test_name_map[test],
                overall_label_length=overall_label_length,
-                bandwidth_label="",
-                bandwidth_label_length=bandwidth_label_length,
-                bandwidth="Bandwidth/s",
-                bandwidth_length=bandwidth_column_length,
-                iops="IOPS",
-                iops_length=iops_column_length,
-                latency="Latency (μs)",
-                latency_length=latency_column_length,
-                latency_bucket_label="Latency Buckets (μs/%)",
-                latency_bucket_label_length=latency_bucket_label_length,
-                latency_bucket="",
+                header_fill="-"
+                * (
+                    (MAX_CONTENT_WIDTH if MAX_CONTENT_WIDTH <= 120 else 120)
+                    - len(nice_test_name_map[test])
+                    - 4
+                ),
+            )
+        )
+
+        ainformation.append(
+            "{bold}\
+{overall_label: <{overall_label_length}}  \
+{cpu_label: <{cpu_label_length}}  \
+{memory_label: <{memory_label_length}}  \
+{network_label: <{network_label_length}}  \
+{bandwidth_label: <{bandwidth_label_length}}  \
+{latency_label: <{latency_label_length}}  \
+{latency_bucket_label: <{latency_bucket_label_length}}\
+{end_bold}".format(
+                bold=ansiprint.bold(),
+                end_bold=ansiprint.end(),
+                overall_label="Overall",
+                overall_label_length=overall_label_length + overall_column_length + 1,
+                cpu_label="CPU (%)",
+                cpu_label_length=cpu_label_length + cpu_column_length + 1,
+                memory_label="Memory (%)",
+                memory_label_length=memory_label_length + memory_column_length + 1,
+                network_label="Network (bps)",
+                network_label_length=network_label_length + network_column_length + 1,
+                bandwidth_label="Bandwidth / IOPS",
+                bandwidth_label_length=bandwidth_label_length
+                + bandwidth_column_length
+                + 1,
+                latency_label="Latency (μs)",
+                latency_label_length=latency_label_length + latency_column_length + 1,
+                latency_bucket_label="Buckets (μs/%)",
+                latency_bucket_label_length=latency_bucket_label_length
+                + latency_bucket_column_length,
            )
        )

@ -2559,14 +2692,20 @@ def format_info_benchmark_json(config, benchmark_information):
            # Top row (Headers)
            ainformation.append(
                "{bold}\
-{overall_label: >{overall_label_length}} \
-{overall: <{overall_length}}   \
-{bandwidth_label: >{bandwidth_label_length}} \
-{bandwidth: <{bandwidth_length}}   \
-{iops: <{iops_length}}   \
-{latency: <{latency_length}}   \
-{latency_bucket_label: >{latency_bucket_label_length}} \
-{latency_bucket} \
+{overall_label: <{overall_label_length}} \
+{overall: <{overall_length}}  \
+{cpu_label: <{cpu_label_length}} \
+{cpu: <{cpu_length}}  \
+{memory_label: <{memory_label_length}} \
+{memory: <{memory_length}}  \
+{network_label: <{network_label_length}} \
+{network: <{network_length}}  \
+{bandwidth_label: <{bandwidth_label_length}} \
+{bandwidth: <{bandwidth_length}}  \
+{latency_label: <{latency_label_length}} \
+{latency: <{latency_length}}  \
+{latency_bucket_label: <{latency_bucket_label_length}} \
+{latency_bucket}\
 {end_bold}".format(
                    bold="",
                    end_bold="",
@ -2574,12 +2713,24 @@ def format_info_benchmark_json(config, benchmark_information):
                    overall_label_length=overall_label_length,
                    overall=overall_data[idx],
                    overall_length=overall_column_length,
+                    cpu_label=cpu_label[idx],
+                    cpu_label_length=cpu_label_length,
+                    cpu=cpu_data[idx],
+                    cpu_length=cpu_column_length,
+                    memory_label=memory_label[idx],
+                    memory_label_length=memory_label_length,
+                    memory=memory_data[idx],
+                    memory_length=memory_column_length,
+                    network_label=network_label[idx],
+                    network_label_length=network_label_length,
+                    network=network_data[idx],
+                    network_length=network_column_length,
                    bandwidth_label=bandwidth_label[idx],
                    bandwidth_label_length=bandwidth_label_length,
                    bandwidth=bandwidth_data[idx],
                    bandwidth_length=bandwidth_column_length,
-                    iops=iops_data[idx],
-                    iops_length=iops_column_length,
+                    latency_label=lat_label[idx],
+                    latency_label_length=latency_label_length,
                    latency=lat_data[idx],
                    latency_length=latency_column_length,
                    latency_bucket_label=lat_bucket_label[idx],
@ -2588,4 +2739,4 @@ def format_info_benchmark_json(config, benchmark_information):
                )
            )

-    return "\n".join(ainformation)
+    return "\n".join(ainformation) + "\n"
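
Note on the format handling above: formats 1 and 2 share one formatter, and the utilization fields fall back to "N/A" when the result predates format 2. The following is a minimal sketch of that pattern under stated assumptions, not the project's implementation; `pad_column` and `cpu_column` are hypothetical helpers and the sample values are invented.

```python
def pad_column(items, rows):
    """Pad a column with empty strings so every column has `rows` entries."""
    return list(items) + [""] * (rows - len(items))


def cpu_column(test_result, job_details, benchmark_format):
    """Build the CPU column; utilization keys only exist for format >= 2."""
    if benchmark_format > 1:
        util = test_result["avg_cpu_util_percent"]
        total, osd, mon = util["total"], util["ceph-osd"], util["ceph-mon"]
    else:
        total = osd = mon = "N/A"
    return [
        total,
        round(job_details["usr_cpu"], 2),
        round(job_details["sys_cpu"], 2),
        osd,
        mon,
    ]


# Invented sample data; only the key names come from the diff.
job = {"usr_cpu": 1.23456, "sys_cpu": 4.5}
v2_result = {"avg_cpu_util_percent": {"total": 12.5, "ceph-osd": 8.1, "ceph-mon": 0.4}}

cpu_v1 = pad_column(cpu_column({}, job, 1), 7)
cpu_v2 = pad_column(cpu_column(v2_result, job, 2), 7)
```

Checking the format once per column keeps the row list short and avoids a `KeyError` on format 1 results.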
--- a/client-cli/pvc/lib/vm.py
+++ b/client-cli/pvc/lib/vm.py
@ -3,7 +3,7 @@
 # vm.py - PVC CLI client function library, VM functions
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -89,6 +89,7 @@ def vm_define(
    node_selector,
    node_autostart,
    migration_method,
+    migration_max_downtime,
    user_tags,
    protected_tags,
 ):
@ -96,7 +97,7 @@ def vm_define(
    Define a new VM on the cluster

    API endpoint: POST /vm
-    API arguments: xml={xml}, node={node}, limit={node_limit}, selector={node_selector}, autostart={node_autostart}, migration_method={migration_method}, user_tags={user_tags}, protected_tags={protected_tags}
+    API arguments: xml={xml}, node={node}, limit={node_limit}, selector={node_selector}, autostart={node_autostart}, migration_method={migration_method}, migration_max_downtime={migration_max_downtime}, user_tags={user_tags}, protected_tags={protected_tags}
    API schema: {"message":"{data}"}
    """
    params = {
@ -105,6 +106,7 @@ def vm_define(
        "selector": node_selector,
        "autostart": node_autostart,
        "migration_method": migration_method,
+        "migration_max_downtime": migration_max_downtime,
        "user_tags": user_tags,
        "protected_tags": protected_tags,
    }
@ -205,6 +207,7 @@ def vm_metadata(
    node_selector,
    node_autostart,
    migration_method,
+    migration_max_downtime,
    provisioner_profile,
 ):
    """
@ -229,6 +232,9 @@ def vm_metadata(
    if migration_method is not None:
        params["migration_method"] = migration_method

+    if migration_max_downtime is not None:
+        params["migration_max_downtime"] = migration_max_downtime
+
    if provisioner_profile is not None:
        params["profile"] = provisioner_profile

@ -415,7 +421,7 @@ def vm_node(config, vm, target_node, action, force=False, wait=False, force_live
    return retstatus, response.json().get("message", "")


-def vm_locks(config, vm, wait_flag):
+def vm_locks(config, vm, wait_flag=True):
    """
    Flush RBD locks of (stopped) VM

@ -492,6 +498,121 @@ def vm_restore(config, vm, backup_path, backup_datestring, retain_snapshot=False
        return True, response.json().get("message", "")


+def vm_create_snapshot(config, vm, snapshot_name=None, wait_flag=True):
+    """
+    Take a snapshot of a VM's disks and configuration
+
+    API endpoint: POST /vm/{vm}/snapshot
+    API arguments: [snapshot_name={snapshot_name}]
+    API schema: {"message":"{data}"}
+    """
+    params = dict()
+    if snapshot_name is not None:
+        params["snapshot_name"] = snapshot_name
+    response = call_api(
+        config, "post", "/vm/{vm}/snapshot".format(vm=vm), params=params
+    )
+
+    return get_wait_retdata(response, wait_flag)
+
+
+def vm_remove_snapshot(config, vm, snapshot_name, wait_flag=True):
+    """
+    Remove a snapshot of a VM's disks and configuration
+
+    API endpoint: DELETE /vm/{vm}/snapshot
+    API arguments: snapshot_name={snapshot_name}
+    API schema: {"message":"{data}"}
+    """
+    params = {"snapshot_name": snapshot_name}
+    response = call_api(
+        config, "delete", "/vm/{vm}/snapshot".format(vm=vm), params=params
+    )
+
+    return get_wait_retdata(response, wait_flag)
+
+
+def vm_rollback_snapshot(config, vm, snapshot_name, wait_flag=True):
+    """
+    Roll back to a snapshot of a VM's disks and configuration
+
+    API endpoint: POST /vm/{vm}/snapshot/rollback
+    API arguments: snapshot_name={snapshot_name}
+    API schema: {"message":"{data}"}
+    """
+    params = {"snapshot_name": snapshot_name}
+    response = call_api(
+        config, "post", "/vm/{vm}/snapshot/rollback".format(vm=vm), params=params
+    )
+
+    return get_wait_retdata(response, wait_flag)
+
+
+def vm_export_snapshot(
+    config, vm, snapshot_name, export_path, incremental_parent=None, wait_flag=True
+):
+    """
+    Export an (existing) snapshot of a VM's disks and configuration to export_path, optionally
+    incremental with incremental_parent
+
+    API endpoint: POST /vm/{vm}/snapshot/export
+    API arguments: snapshot_name=snapshot_name, export_path=export_path, incremental_parent=incremental_parent
+    API schema: {"message":"{data}"}
+    """
+    params = {
+        "snapshot_name": snapshot_name,
+        "export_path": export_path,
+    }
+    if incremental_parent is not None:
+        params["incremental_parent"] = incremental_parent
+
+    response = call_api(
+        config, "post", "/vm/{vm}/snapshot/export".format(vm=vm), params=params
+    )
+
+    return get_wait_retdata(response, wait_flag)
+
+
+def vm_import_snapshot(
+    config, vm, snapshot_name, import_path, retain_snapshot=False, wait_flag=True
+):
+    """
+    Import a snapshot of {vm} and its volumes from a local primary coordinator filesystem path
+
+    API endpoint: POST /vm/{vm}/snapshot/import
+    API arguments: snapshot_name={snapshot_name}, import_path={import_path}, retain_snapshot={retain_snapshot}
+    API schema: {"message":"{data}"}
+    """
+    params = {
+        "snapshot_name": snapshot_name,
+        "import_path": import_path,
+        "retain_snapshot": retain_snapshot,
+    }
+    response = call_api(
+        config, "post", "/vm/{vm}/snapshot/import".format(vm=vm), params=params
+    )
+
+    return get_wait_retdata(response, wait_flag)
+
+
+def vm_autobackup(config, email_recipients=None, force_full_flag=False, wait_flag=True):
+    """
+    Perform a cluster VM autobackup
+
+    API endpoint: POST /vm/autobackup
+    API arguments: email_recipients=email_recipients, force_full=force_full_flag
+    API schema: {"message":"{data}"}
+    """
+    params = {
+        "email_recipients": email_recipients,
+        "force_full": force_full_flag,
+    }
+
+    response = call_api(config, "post", "/vm/autobackup", params=params)
+
+    return get_wait_retdata(response, wait_flag)
+
+
 def vm_vcpus_set(config, vm, vcpus, topology, restart):
    """
    Set the vCPU count of the VM with topology
@ -1516,29 +1637,40 @@ def format_info(config, domain_information, long_output):
            ansiprint.purple(), ansiprint.end(), domain_information["vcpu"]
        )
    )
-    ainformation.append(
-        "{}Topology (S/C/T):{}   {}".format(
-            ansiprint.purple(), ansiprint.end(), domain_information["vcpu_topology"]
+    if long_output:
+        ainformation.append(
+            "{}Topology (S/C/T):{}   {}".format(
+                ansiprint.purple(), ansiprint.end(), domain_information["vcpu_topology"]
+            )
        )
-    )

    if (
-        domain_information["vnc"].get("listen", "None") != "None"
-        and domain_information["vnc"].get("port", "None") != "None"
-    ):
+        domain_information["vnc"].get("listen")
+        and domain_information["vnc"].get("port")
+    ) or long_output:
+        listen = (
+            domain_information["vnc"]["listen"]
+            if domain_information["vnc"].get("listen")
+            else "N/A"
+        )
+        port = (
+            domain_information["vnc"]["port"]
+            if domain_information["vnc"].get("port")
+            else "N/A"
+        )
        ainformation.append("")
        ainformation.append(
            "{}VNC listen:{}         {}".format(
-                ansiprint.purple(), ansiprint.end(), domain_information["vnc"]["listen"]
+                ansiprint.purple(), ansiprint.end(), listen
            )
        )
        ainformation.append(
            "{}VNC port:{}           {}".format(
-                ansiprint.purple(), ansiprint.end(), domain_information["vnc"]["port"]
+                ansiprint.purple(), ansiprint.end(), port
            )
        )

-    if long_output is True:
+    if long_output:
        # Virtualization information
        ainformation.append("")
        ainformation.append(
@ -1626,6 +1758,8 @@ def format_info(config, domain_information, long_output):
        "migrate": ansiprint.blue(),
        "unmigrate": ansiprint.blue(),
        "provision": ansiprint.blue(),
+        "restore": ansiprint.blue(),
+        "import": ansiprint.blue(),
    }
    ainformation.append(
        "{}State:{}              {}{}{}".format(
@ -1637,14 +1771,14 @@ def format_info(config, domain_information, long_output):
        )
    )
    ainformation.append(
-        "{}Current Node:{}       {}".format(
+        "{}Current node:{}       {}".format(
            ansiprint.purple(), ansiprint.end(), domain_information["node"]
        )
    )
    if not domain_information["last_node"]:
        domain_information["last_node"] = "N/A"
    ainformation.append(
-        "{}Previous Node:{}      {}".format(
+        "{}Previous node:{}      {}".format(
            ansiprint.purple(), ansiprint.end(), domain_information["last_node"]
        )
    )
@ -1658,46 +1792,70 @@ def format_info(config, domain_information, long_output):
            )
        )

-    if not domain_information.get("node_selector"):
-        formatted_node_selector = "False"
+    if (
+        not domain_information.get("node_selector")
+        or domain_information.get("node_selector") == "None"
+    ):
+        formatted_node_selector = "Default"
    else:
-        formatted_node_selector = domain_information["node_selector"]
+        formatted_node_selector = str(domain_information["node_selector"]).title()

-    if not domain_information.get("node_limit"):
-        formatted_node_limit = "False"
+    if (
+        not domain_information.get("node_limit")
+        or domain_information.get("node_limit") == "None"
+    ):
+        formatted_node_limit = "Any"
    else:
        formatted_node_limit = ", ".join(domain_information["node_limit"])

    if not domain_information.get("node_autostart"):
+        autostart_colour = ansiprint.blue()
        formatted_node_autostart = "False"
    else:
-        formatted_node_autostart = domain_information["node_autostart"]
+        autostart_colour = ansiprint.green()
+        formatted_node_autostart = "True"

-    if not domain_information.get("migration_method"):
-        formatted_migration_method = "any"
+    if (
+        not domain_information.get("migration_method")
+        or domain_information.get("migration_method") == "None"
+    ):
+        formatted_migration_method = "Live, Shutdown"
    else:
-        formatted_migration_method = domain_information["migration_method"]
-
-    ainformation.append(
-        "{}Migration selector:{} {}".format(
-            ansiprint.purple(), ansiprint.end(), formatted_node_selector
+        formatted_migration_method = (
+            f"{str(domain_information['migration_method']).title()} only"
        )
-    )
+
    ainformation.append(
        "{}Node limit:{}         {}".format(
            ansiprint.purple(), ansiprint.end(), formatted_node_limit
        )
    )
    ainformation.append(
-        "{}Autostart:{}          {}".format(
-            ansiprint.purple(), ansiprint.end(), formatted_node_autostart
+        "{}Autostart:{}          {}{}{}".format(
+            ansiprint.purple(),
+            ansiprint.end(),
+            autostart_colour,
+            formatted_node_autostart,
+            ansiprint.end(),
        )
    )
    ainformation.append(
-        "{}Migration Method:{}   {}".format(
+        "{}Migration method:{}   {}".format(
            ansiprint.purple(), ansiprint.end(), formatted_migration_method
        )
    )
+    ainformation.append(
+        "{}Migration selector:{} {}".format(
+            ansiprint.purple(), ansiprint.end(), formatted_node_selector
+        )
+    )
+    ainformation.append(
+        "{}Max live downtime:{}  {}".format(
+            ansiprint.purple(),
+            ansiprint.end(),
+            f"{domain_information.get('migration_max_downtime')} ms",
+        )
+    )

    # Tag list
    tags_name_length = 5
@ -1736,13 +1894,17 @@ def format_info(config, domain_information, long_output):
            domain_information["tags"], key=lambda t: t["type"] + t["name"]
        ):
            ainformation.append(
-                "                    {tags_name: <{tags_name_length}} {tags_type: <{tags_type_length}} {tags_protected: <{tags_protected_length}}".format(
+                "                    {tags_name: <{tags_name_length}} {tags_type: <{tags_type_length}} {tags_protected_colour}{tags_protected: <{tags_protected_length}}{end}".format(
                    tags_name_length=tags_name_length,
                    tags_type_length=tags_type_length,
                    tags_protected_length=tags_protected_length,
                    tags_name=tag["name"],
                    tags_type=tag["type"],
                    tags_protected=str(tag["protected"]),
+                    tags_protected_colour=(
+                        ansiprint.green() if tag["protected"] else ansiprint.blue()
+                    ),
+                    end=ansiprint.end(),
                )
            )
    else:
@ -1754,6 +1916,78 @@ def format_info(config, domain_information, long_output):
            )
        )

+    # Snapshot list
+    snapshots_name_length = 5
+    snapshots_age_length = 4
+    snapshots_xml_changes_length = 12
+    for snapshot in domain_information.get("snapshots", list()):
+        xml_diff_plus = 0
+        xml_diff_minus = 0
+        for line in snapshot["xml_diff_lines"]:
+            if re.match(r"^\+ ", line):
+                xml_diff_plus += 1
+            elif re.match(r"^- ", line):
+                xml_diff_minus += 1
+        xml_diff_counts = f"+{xml_diff_plus}/-{xml_diff_minus}"
+
+        _snapshots_name_length = len(snapshot["name"]) + 1
+        if _snapshots_name_length > snapshots_name_length:
+            snapshots_name_length = _snapshots_name_length
+
+        _snapshots_age_length = len(snapshot["age"]) + 1
+        if _snapshots_age_length > snapshots_age_length:
+            snapshots_age_length = _snapshots_age_length
+
+        _snapshots_xml_changes_length = len(xml_diff_counts) + 1
+        if _snapshots_xml_changes_length > snapshots_xml_changes_length:
+            snapshots_xml_changes_length = _snapshots_xml_changes_length
+
+    if len(domain_information.get("snapshots", list())) > 0:
+        ainformation.append("")
+        ainformation.append(
+            "{purple}Snapshots:{end}          {bold}{snapshots_name: <{snapshots_name_length}} {snapshots_age: <{snapshots_age_length}} {snapshots_xml_changes: <{snapshots_xml_changes_length}}{end}".format(
+                purple=ansiprint.purple(),
+                bold=ansiprint.bold(),
+                end=ansiprint.end(),
+                snapshots_name_length=snapshots_name_length,
+                snapshots_age_length=snapshots_age_length,
+                snapshots_xml_changes_length=snapshots_xml_changes_length,
+                snapshots_name="Name",
+                snapshots_age="Age",
+                snapshots_xml_changes="XML Changes",
+            )
+        )
+
+        for snapshot in domain_information.get("snapshots", list()):
+            xml_diff_plus = 0
+            xml_diff_minus = 0
+            for line in snapshot["xml_diff_lines"]:
+                if re.match(r"^\+ ", line):
+                    xml_diff_plus += 1
+                elif re.match(r"^- ", line):
+                    xml_diff_minus += 1
+            xml_diff_counts = f"{ansiprint.green()}+{xml_diff_plus}{ansiprint.end()}/{ansiprint.red()}-{xml_diff_minus}{ansiprint.end()}"
+
+            ainformation.append(
+                "                    {snapshots_name: <{snapshots_name_length}} {snapshots_age: <{snapshots_age_length}} {snapshots_xml_changes: <{snapshots_xml_changes_length}}{end}".format(
+                    snapshots_name_length=snapshots_name_length,
+                    snapshots_age_length=snapshots_age_length,
+                    snapshots_xml_changes_length=snapshots_xml_changes_length,
+                    snapshots_name=snapshot["name"],
+                    snapshots_age=snapshot["age"],
+                    snapshots_xml_changes=xml_diff_counts,
+                    end=ansiprint.end(),
+                )
+            )
+    else:
+        ainformation.append("")
+        ainformation.append(
+            "{purple}Snapshots:{end}          N/A".format(
+                purple=ansiprint.purple(),
+                end=ansiprint.end(),
+            )
+        )
+
    # Network list
    net_list = []
    cluster_net_list = call_api(config, "get", "/network").json()
@ -1780,7 +2014,7 @@ def format_info(config, domain_information, long_output):
        )
    )

-    if long_output is True:
+    if long_output:
        # Disk list
        ainformation.append("")
        name_length = 0
@ -1916,6 +2150,7 @@ def format_list(config, vm_list):
    vm_name_length = 5
    vm_state_length = 6
    vm_tags_length = 5
+    vm_snapshots_length = 10
    vm_nets_length = 9
    vm_ram_length = 8
    vm_vcpu_length = 6
@ -1936,6 +2171,12 @@ def format_list(config, vm_list):
        _vm_tags_length = len(",".join(tag_list)) + 1
        if _vm_tags_length > vm_tags_length:
            vm_tags_length = _vm_tags_length
+        # vm_snapshots column
+        _vm_snapshots_length = (
+            len(str(len(domain_information.get("snapshots", list())))) + 1
+        )
+        if _vm_snapshots_length > vm_snapshots_length:
+            vm_snapshots_length = _vm_snapshots_length
        # vm_nets column
        _vm_nets_length = len(",".join(net_list)) + 1
        if _vm_nets_length > vm_nets_length:
@ -1952,7 +2193,11 @@ def format_list(config, vm_list):
    # Format the string (header)
    vm_list_output.append(
        "{bold}{vm_header: <{vm_header_length}} {resource_header: <{resource_header_length}} {node_header: <{node_header_length}}{end_bold}".format(
-            vm_header_length=vm_name_length + vm_state_length + vm_tags_length + 2,
+            vm_header_length=vm_name_length
+            + vm_state_length
+            + vm_tags_length
+            + vm_snapshots_length
+            + 3,
            resource_header_length=vm_nets_length + vm_ram_length + vm_vcpu_length + 2,
            node_header_length=vm_node_length + vm_migrated_length + 1,
            bold=ansiprint.bold(),
@ -1962,7 +2207,12 @@ def format_list(config, vm_list):
                [
                    "-"
                    for _ in range(
-                        4, vm_name_length + vm_state_length + vm_tags_length + 1
+                        4,
+                        vm_name_length
+                        + vm_state_length
+                        + vm_tags_length
+                        + vm_snapshots_length
+                        + 2,
                    )
                ]
            ),
@ -1984,6 +2234,7 @@ def format_list(config, vm_list):
        "{bold}{vm_name: <{vm_name_length}} \
 {vm_state_colour}{vm_state: <{vm_state_length}}{end_colour} \
 {vm_tags: <{vm_tags_length}} \
+{vm_snapshots: <{vm_snapshots_length}} \
 {vm_networks: <{vm_nets_length}} \
 {vm_memory: <{vm_ram_length}} {vm_vcpu: <{vm_vcpu_length}} \
 {vm_node: <{vm_node_length}} \
@ -1991,6 +2242,7 @@ def format_list(config, vm_list):
            vm_name_length=vm_name_length,
            vm_state_length=vm_state_length,
            vm_tags_length=vm_tags_length,
+            vm_snapshots_length=vm_snapshots_length,
            vm_nets_length=vm_nets_length,
            vm_ram_length=vm_ram_length,
            vm_vcpu_length=vm_vcpu_length,
@ -2003,6 +2255,7 @@ def format_list(config, vm_list):
            vm_name="Name",
            vm_state="State",
            vm_tags="Tags",
+            vm_snapshots="Snapshots",
            vm_networks="Networks",
            vm_memory="RAM (M)",
            vm_vcpu="vCPUs",
@ -2069,6 +2322,7 @@ def format_list(config, vm_list):
            "{bold}{vm_name: <{vm_name_length}} \
 {vm_state_colour}{vm_state: <{vm_state_length}}{end_colour} \
 {vm_tags: <{vm_tags_length}} \
+{vm_snapshots: <{vm_snapshots_length}} \
 {vm_networks: <{vm_nets_length}} \
 {vm_memory: <{vm_ram_length}} {vm_vcpu: <{vm_vcpu_length}} \
 {vm_node: <{vm_node_length}} \
@ -2076,6 +2330,7 @@ def format_list(config, vm_list):
                vm_name_length=vm_name_length,
                vm_state_length=vm_state_length,
                vm_tags_length=vm_tags_length,
+                vm_snapshots_length=vm_snapshots_length,
                vm_nets_length=vm_nets_length,
                vm_ram_length=vm_ram_length,
                vm_vcpu_length=vm_vcpu_length,
@ -2088,6 +2343,7 @@ def format_list(config, vm_list):
                vm_name=domain_information["name"],
                vm_state=domain_information["state"],
                vm_tags=",".join(tag_list),
+                vm_snapshots=len(domain_information.get("snapshots", list())),
                vm_networks=",".join(net_string_list),
                vm_memory=domain_information["memory"],
                vm_vcpu=domain_information["vcpu"],
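
Note on the snapshot "XML Changes" column added to format_info above: the counting step can be sketched in isolation as below. This is a minimal illustration, not the project's code; the helper name count_xml_diff_lines and the sample input are hypothetical, and it assumes xml_diff_lines carries diff-style lines where changed lines start with "+ " or "- " (sign followed by a space), as the regular expressions in the patch imply.

```python
import re


def count_xml_diff_lines(xml_diff_lines):
    """Return (added, removed) counts for diff-style lines.

    Only lines starting with "+ " or "- " are counted, so file
    headers such as "+++ b" or "--- a" are ignored.
    """
    added = 0
    removed = 0
    for line in xml_diff_lines:
        if re.match(r"^\+ ", line):
            added += 1
        elif re.match(r"^- ", line):
            removed += 1
    return added, removed


# Hypothetical sample input, for illustration only
sample = [
    "--- a",
    "+++ b",
    "+ <vcpu>4</vcpu>",
    "- <vcpu>2</vcpu>",
    "  <name>test</name>",
]
added, removed = count_xml_diff_lines(sample)
summary = f"+{added}/-{removed}"
```

Factoring the count into one helper like this would also avoid repeating the loop once for the column-width pass and again for the output pass.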
--- a/client-cli/pvc/lib/zkhandler.py
+++ b/client-cli/pvc/lib/zkhandler.py
@ -3,7 +3,7 @@
 # zkhandler.py - Secure versioned ZooKeeper updates
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/client-cli/setup.py
+++ b/client-cli/setup.py
@ -2,7 +2,7 @@ from setuptools import setup

 setup(
    name="pvc",
-    version="0.9.83",
+    version="0.9.100",
    packages=["pvc.cli", "pvc.lib"],
    install_requires=[
        "Click",
--- a/daemon-common/autobackup.py
+++ b/daemon-common/autobackup.py
@ -0,0 +1,695 @@
+#!/usr/bin/env python3
+
+# autobackup.py - PVC API Autobackup functions
+# Part of the Parallel Virtual Cluster (PVC) system
+#
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
+#
+#    This program is free software: you can redistribute it and/or modify
+#    it under the terms of the GNU General Public License as published by
+#    the Free Software Foundation, version 3.
+#
+#    This program is distributed in the hope that it will be useful,
+#    but WITHOUT ANY WARRANTY; without even the implied warranty of
+#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#    GNU General Public License for more details.
+#
+#    You should have received a copy of the GNU General Public License
+#    along with this program.  If not, see <https://www.gnu.org/licenses/>.
+#
+###############################################################################
+
+from datetime import datetime
+from json import load as jload
+from json import dump as jdump
+from os import popen, makedirs, path, scandir
+from shutil import rmtree
+from subprocess import run, PIPE
+
+from daemon_lib.common import run_os_command
+from daemon_lib.config import get_autobackup_configuration
+from daemon_lib.celery import start, fail, log_info, log_err, update, finish
+
+import daemon_lib.ceph as ceph
+import daemon_lib.vm as vm
+
+
+def send_execution_failure_report(
+    celery_conf, config, recipients=None, total_time=0, error=None
+):
+    if recipients is None:
+        return
+
+    from email.utils import formatdate
+    from socket import gethostname
+
+    log_message = f"Sending email failure report to {', '.join(recipients)}"
+    log_info(celery_conf[0], log_message)
+    update(
+        celery_conf[0],
+        log_message,
+        current=celery_conf[1] + 1,
+        total=celery_conf[2],
+    )
+
+    current_datetime = datetime.now()
+    email_datetime = formatdate(float(current_datetime.strftime("%s")))
+
+    email = list()
+    email.append(f"Date: {email_datetime}")
+    email.append(
+        f"Subject: PVC Autobackup execution failure for cluster '{config['cluster']}'"
+    )
+
+    email_to = list()
+    for recipient in recipients:
+        email_to.append(f"<{recipient}>")
+
+    email.append(f"To: {', '.join(email_to)}")
+    email.append(f"From: PVC Autobackup System <pvc@{gethostname()}>")
+    email.append("")
+
+    email.append(
+        f"A PVC autobackup has FAILED at {current_datetime} in {total_time}s due to an execution error."
+    )
+    email.append("")
+    email.append("The reported error message is:")
+    email.append(f"  {error}")
+
+    try:
+        with popen("/usr/sbin/sendmail -t", "w") as p:
+            p.write("\n".join(email))
+    except Exception as e:
+        log_err(celery_conf[0], f"Failed to send report email: {e}")
+
+
+def send_execution_summary_report(
+    celery_conf, config, recipients=None, total_time=0, summary=dict()
+):
+    if recipients is None:
+        return
+
+    from email.utils import formatdate
+    from socket import gethostname
+
+    log_message = f"Sending email summary report to {', '.join(recipients)}"
+    log_info(celery_conf[0], log_message)
+    update(
+        celery_conf[0],
+        log_message,
+        current=celery_conf[1] + 1,
+        total=celery_conf[2],
+    )
+
+    current_datetime = datetime.now()
+    email_datetime = formatdate(float(current_datetime.strftime("%s")))
+
+    email = list()
+    email.append(f"Date: {email_datetime}")
+    email.append(f"Subject: PVC Autobackup report for cluster '{config['cluster']}'")
+
+    email_to = list()
+    for recipient in recipients:
+        email_to.append(f"<{recipient}>")
+
+    email.append(f"To: {', '.join(email_to)}")
+    email.append(f"From: PVC Autobackup System <pvc@{gethostname()}>")
+    email.append("")
+
+    email.append(
+        f"A PVC autobackup has been completed at {current_datetime} in {total_time}."
+    )
+    email.append("")
+    email.append(
+        "The following is a summary of all current VM backups after cleanups, most recent first:"
+    )
+    email.append("")
+
+    for vm_name in summary.keys():
+        email.append(f"VM: {vm_name}:")
+        for backup in summary[vm_name]:
+            datestring = backup.get("datestring")
+            backup_date = datetime.strptime(datestring, "%Y%m%d%H%M%S")
+            if backup.get("result", False):
+                email.append(
+                    f"    {backup_date}: Success in {backup.get('runtime_secs', 0)} seconds, ID {backup.get('snapshot_name')}, type {backup.get('type', 'unknown')}"
+                )
+                email.append(
+                    f"                         Backup contains {len(backup.get('export_files'))} files totaling {ceph.format_bytes_tohuman(backup.get('export_size_bytes', 0))} ({backup.get('export_size_bytes', 0)} bytes)"
+                )
+            else:
+                email.append(
+                    f"    {backup_date}: Failure in {backup.get('runtime_secs', 0)} seconds, ID {backup.get('snapshot_name')}, type {backup.get('type', 'unknown')}"
+                )
+                email.append(f"                         {backup.get('message')}")
+
+    try:
+        with popen("/usr/sbin/sendmail -t", "w") as p:
+            p.write("\n".join(email))
+    except Exception as e:
+        log_err(celery_conf[0], f"Failed to send report email: {e}")
+
+
+def run_vm_backup(zkhandler, celery, config, vm_detail, force_full=False):
+    vm_name = vm_detail["name"]
+    dom_uuid = vm_detail["uuid"]
+    backup_suffixed_path = f"{config['backup_root_path']}{config['backup_root_suffix']}"
+    vm_backup_path = f"{backup_suffixed_path}/{vm_name}"
+    autobackup_state_file = f"{vm_backup_path}/.autobackup.json"
+    full_interval = config["backup_schedule"]["full_interval"]
+    full_retention = config["backup_schedule"]["full_retention"]
+
+    if not path.exists(vm_backup_path) or not path.exists(autobackup_state_file):
+        # There are no existing backups so the list is empty
+        state_data = dict()
+        tracked_backups = list()
+    else:
+        with open(autobackup_state_file) as fh:
+            state_data = jload(fh)
+        tracked_backups = state_data["tracked_backups"]
+
+    full_backups = [b for b in tracked_backups if b["type"] == "full"]
+    if len(full_backups) > 0:
+        last_full_backup = full_backups[0]
+        last_full_backup_idx = tracked_backups.index(last_full_backup)
+        if force_full:
+            this_backup_incremental_parent = None
+            this_backup_retain_snapshot = True
+        elif last_full_backup_idx >= full_interval - 1:
+            this_backup_incremental_parent = None
+            this_backup_retain_snapshot = True
+        else:
+            this_backup_incremental_parent = last_full_backup["snapshot_name"]
+            this_backup_retain_snapshot = False
+    else:
+        # The very first backup must be full to start the tree
+        this_backup_incremental_parent = None
+        this_backup_retain_snapshot = True
+
+    export_type = (
+        "incremental" if this_backup_incremental_parent is not None else "full"
+    )
+
+    now = datetime.now()
+    datestring = now.strftime("%Y%m%d%H%M%S")
+    snapshot_name = f"ab{datestring}"
+
+    # Take the VM snapshot (vm.vm_worker_create_snapshot)
+    snap_list = list()
+
+    failure = False
+    export_files = None
+    export_files_size = 0
+
+    def update_tracked_backups():
+        # Read export file to get details
+        backup_json_file = (
+            f"{backup_suffixed_path}/{vm_name}/{snapshot_name}/snapshot.json"
+        )
+        try:
+            with open(backup_json_file) as fh:
+                backup_json = jload(fh)
+            tracked_backups.insert(0, backup_json)
+        except Exception as e:
+            log_err(celery, f"Could not open export JSON: {e}")
+            return list()
+
+        state_data["tracked_backups"] = tracked_backups
+        with open(autobackup_state_file, "w") as fh:
+            jdump(state_data, fh)
+
+        return tracked_backups
+
+    def write_backup_summary(success=False, message=""):
+        ttotal = (datetime.now() - now).total_seconds()
+        export_details = {
+            "type": export_type,
+            "result": success,
+            "message": message,
+            "datestring": datestring,
+            "runtime_secs": ttotal,
+            "snapshot_name": snapshot_name,
+            "incremental_parent": this_backup_incremental_parent,
+            "vm_detail": vm_detail,
+            "export_files": export_files,
+            "export_size_bytes": export_files_size,
+        }
+        try:
+            with open(
+                f"{backup_suffixed_path}/{vm_name}/{snapshot_name}/snapshot.json",
+                "w",
+            ) as fh:
+                jdump(export_details, fh)
+        except Exception as e:
+            log_err(celery, f"Error exporting snapshot details: {e}")
+            return False, e
+
+        return True, ""
+
+    def cleanup_failure():
+        for snapshot in snap_list:
+            rbd, snapshot_name = snapshot.split("@")
+            pool, volume = rbd.split("/")
+            # We capture no output here, because if this fails too we're in a deep
+            # error chain and will just ignore it
+            ceph.remove_snapshot(zkhandler, pool, volume, snapshot_name)
+
+    rbd_list = zkhandler.read(("domain.storage.volumes", dom_uuid)).split(",")
+
+    for rbd in rbd_list:
+        pool, volume = rbd.split("/")
+        ret, msg = ceph.add_snapshot(
+            zkhandler, pool, volume, snapshot_name, zk_only=False
+        )
+        if not ret:
+            cleanup_failure()
+            error_message = msg.replace("ERROR: ", "")
+            log_err(celery, error_message)
+            failure = True
+            break
+        else:
+            snap_list.append(f"{pool}/{volume}@{snapshot_name}")
+
+    if failure:
+        error_message = f"[{vm_name}] Error in snapshot export, skipping"
+        write_backup_summary(message=error_message)
+        tracked_backups = update_tracked_backups()
+        return tracked_backups
+
+    # Get the current domain XML
+    vm_config = zkhandler.read(("domain.xml", dom_uuid))
+
+    # Add the snapshot entry to Zookeeper
+    ret = zkhandler.write(
+        [
+            (
+                (
+                    "domain.snapshots",
+                    dom_uuid,
+                    "domain_snapshot.name",
+                    snapshot_name,
+                ),
+                snapshot_name,
+            ),
+            (
+                (
+                    "domain.snapshots",
+                    dom_uuid,
+                    "domain_snapshot.timestamp",
+                    snapshot_name,
+                ),
+                now.strftime("%s"),
+            ),
+            (
+                (
+                    "domain.snapshots",
+                    dom_uuid,
+                    "domain_snapshot.xml",
+                    snapshot_name,
+                ),
+                vm_config,
+            ),
+            (
+                (
+                    "domain.snapshots",
+                    dom_uuid,
+                    "domain_snapshot.rbd_snapshots",
+                    snapshot_name,
+                ),
+                ",".join(snap_list),
+            ),
+        ]
+    )
+    if not ret:
+        error_message = f"[{vm_name}] Error in snapshot export, skipping"
+        log_err(celery, error_message)
+        write_backup_summary(message=error_message)
+        tracked_backups = update_tracked_backups()
+        return tracked_backups
+
+    # Export the snapshot (vm.vm_worker_export_snapshot)
+    export_target_path = f"{backup_suffixed_path}/{vm_name}/{snapshot_name}/images"
+
+    try:
+        makedirs(export_target_path)
+    except Exception as e:
+        error_message = (
+            f"[{vm_name}] Failed to create target directory '{export_target_path}': {e}"
+        )
+        log_err(celery, error_message)
+        return tracked_backups
+
+    def export_cleanup():
+        from shutil import rmtree
+
+        rmtree(f"{backup_suffixed_path}/{vm_name}/{snapshot_name}")
+
+    # Set the export filetype
+    if this_backup_incremental_parent is not None:
+        export_fileext = "rbddiff"
+    else:
+        export_fileext = "rbdimg"
+
+    snapshot_volumes = list()
+    for rbdsnap in snap_list:
+        pool, _volume = rbdsnap.split("/")
+        volume, name = _volume.split("@")
+        ret, snapshots = ceph.get_list_snapshot(
+            zkhandler, pool, volume, limit=name, is_fuzzy=False
+        )
+        if ret:
+            snapshot_volumes += snapshots
+
+    export_files = list()
+    for snapshot_volume in snapshot_volumes:
+        snap_pool = snapshot_volume["pool"]
+        snap_volume = snapshot_volume["volume"]
+        snap_snapshot_name = snapshot_volume["snapshot"]
+        snap_size = snapshot_volume["stats"]["size"]
+
+        if this_backup_incremental_parent is not None:
+            retcode, stdout, stderr = run_os_command(
+                f"rbd export-diff --from-snap {this_backup_incremental_parent} {snap_pool}/{snap_volume}@{snap_snapshot_name} {export_target_path}/{snap_pool}.{snap_volume}.{export_fileext}"
+            )
+            if retcode:
+                error_message = (
+                    f"[{vm_name}] Failed to export snapshot for volume(s) '{snap_pool}/{snap_volume}'"
+                )
+                failure = True
+                break
+            else:
+                export_files.append(
+                    (
+                        f"images/{snap_pool}.{snap_volume}.{export_fileext}",
+                        snap_size,
+                    )
+                )
+        else:
+            retcode, stdout, stderr = run_os_command(
+                f"rbd export --export-format 2 {snap_pool}/{snap_volume}@{snap_snapshot_name} {export_target_path}/{snap_pool}.{snap_volume}.{export_fileext}"
+            )
+            if retcode:
+                error_message = (
+                    f"[{vm_name}] Failed to export snapshot for volume(s) '{snap_pool}/{snap_volume}'"
+                )
+                failure = True
+                break
+            else:
+                export_files.append(
+                    (
+                        f"images/{snap_pool}.{snap_volume}.{export_fileext}",
+                        snap_size,
+                    )
+                )
+
+    if failure:
+        log_err(celery, error_message)
+        write_backup_summary(message=error_message)
+        tracked_backups = update_tracked_backups()
+        return tracked_backups
+
+    def get_dir_size(pathname):
+        total = 0
+        with scandir(pathname) as it:
+            for entry in it:
+                if entry.is_file():
+                    total += entry.stat().st_size
+                elif entry.is_dir():
+                    total += get_dir_size(entry.path)
+        return total
+
+    export_files_size = get_dir_size(export_target_path)
+
+    ret, e = write_backup_summary(success=True)
+    if not ret:
+        error_message = f"[{vm_name}] Failed to export configuration snapshot: {e}"
+        log_err(celery, error_message)
+        write_backup_summary(message=error_message)
+        tracked_backups = update_tracked_backups()
+        return tracked_backups
+
+    # Clean up the snapshot (vm.vm_worker_remove_snapshot)
+    if not this_backup_retain_snapshot:
+        for snap in snap_list:
+            rbd, name = snap.split("@")
+            pool, volume = rbd.split("/")
+            ret, msg = ceph.remove_snapshot(zkhandler, pool, volume, name)
+            if not ret:
+                error_message = msg.replace("ERROR: ", f"[{vm_name}] ")
+                failure = True
+                break
+
+        if failure:
+            log_err(celery, error_message)
+            write_backup_summary(message=error_message)
+            tracked_backups = update_tracked_backups()
+            return tracked_backups
+
+        ret = zkhandler.delete(
+            ("domain.snapshots", dom_uuid, "domain_snapshot.name", snapshot_name)
+        )
+        if not ret:
+            error_message = f"[{vm_name}] Failed to remove VM snapshot; continuing"
+            log_err(celery, error_message)
+
+    marked_for_deletion = list()
+    # Find any full backups that are expired
+    found_full_count = 0
+    for backup in tracked_backups:
+        if backup["type"] == "full":
+            found_full_count += 1
+            if found_full_count > full_retention:
+                marked_for_deletion.append(backup)
+    # Find any incremental backups that depend on marked parents
+    for backup in tracked_backups:
+        if backup["type"] == "incremental" and backup["incremental_parent"] in [
+            b["snapshot_name"] for b in marked_for_deletion
+        ]:
+            marked_for_deletion.append(backup)
+
+    if len(marked_for_deletion) > 0:
+        for backup_to_delete in marked_for_deletion:
+            ret = vm.vm_worker_remove_snapshot(
+                zkhandler, None, vm_name, backup_to_delete["snapshot_name"]
+            )
+            if ret is False:
+                error_message = f"Failed to remove obsolete backup snapshot '{backup_to_delete['snapshot_name']}', leaving in tracked backups"
+                log_err(celery, error_message)
+            else:
+                rmtree(f"{vm_backup_path}/{backup_to_delete['snapshot_name']}")
+                tracked_backups.remove(backup_to_delete)
+
+    tracked_backups = update_tracked_backups()
+    return tracked_backups
+
+
+def worker_cluster_autobackup(
+    zkhandler, celery, force_full=False, email_recipients=None
+):
+    config = get_autobackup_configuration()
+
+    backup_summary = dict()
+
+    current_stage = 0
+    total_stages = 1
+    if email_recipients is not None:
+        total_stages += 1
+
+    start(
+        celery,
+        f"Starting cluster '{config['cluster']}' VM autobackup",
+        current=current_stage,
+        total=total_stages,
+    )
+
+    if not config["autobackup_enabled"]:
+        message = "Autobackups are not configured on this cluster."
+        log_info(celery, message)
+        return finish(
+            celery,
+            message,
+            current=total_stages,
+            total=total_stages,
+        )
+
+    autobackup_start_time = datetime.now()
+
+    retcode, vm_list = vm.get_list(zkhandler)
+    if not retcode:
+        error_message = f"Failed to fetch VM list: {vm_list}"
+        log_err(celery, error_message)
+        send_execution_failure_report(
+            (celery, current_stage, total_stages),
+            config,
+            recipients=email_recipients,
+            error=error_message,
+        )
+        fail(celery, error_message)
+        return False
+
+    backup_suffixed_path = f"{config['backup_root_path']}{config['backup_root_suffix']}"
+    if not path.exists(backup_suffixed_path):
+        makedirs(backup_suffixed_path)
+
+    full_interval = config["backup_schedule"]["full_interval"]
+
+    backup_vms = list()
+    for vm_detail in vm_list:
+        vm_tag_names = [t["name"] for t in vm_detail["tags"]]
+        matching_tags = (
+            True
+            if len(set(vm_tag_names).intersection(set(config["backup_tags"]))) > 0
+            else False
+        )
+        if matching_tags:
+            backup_vms.append(vm_detail)
+
+    if len(backup_vms) < 1:
+        message = "Found no VMs tagged for autobackup."
+        log_info(celery, message)
+        return finish(
+            celery,
+            message,
+            current=total_stages,
+            total=total_stages,
+        )
+
+    if config["auto_mount_enabled"]:
+        total_stages += len(config["mount_cmds"])
+        total_stages += len(config["unmount_cmds"])
+
+    total_stages += len(backup_vms)
+
+    log_info(
+        celery,
+        f"Found {len(backup_vms)} suitable VM(s) for autobackup: {', '.join([b['name'] for b in backup_vms])}",
+    )
+
+    # Handle automount mount commands
+    if config["auto_mount_enabled"]:
+        for cmd in config["mount_cmds"]:
+            current_stage += 1
+            update(
+                celery,
+                f"Executing mount command '{cmd.split()[0]}'",
+                current=current_stage,
+                total=total_stages,
+            )
+
+            ret = run(
+                cmd.split(),
+                stdout=PIPE,
+                stderr=PIPE,
+            )
+
+            if ret.returncode != 0:
+                error_message = f"Failed to execute mount command '{cmd.split()[0]}': {ret.stderr.decode().strip()}"
+                log_err(celery, error_message)
+                send_execution_failure_report(
+                    (celery, current_stage, total_stages),
+                    config,
+                    recipients=email_recipients,
+                    total_time=datetime.now() - autobackup_start_time,
+                    error=error_message,
+                )
+                fail(celery, error_message)
+                return False
+
+    # Execute the backup: take a snapshot, then export the snapshot
+    for vm_detail in backup_vms:
+        vm_backup_path = f"{backup_suffixed_path}/{vm_detail['name']}"
+        autobackup_state_file = f"{vm_backup_path}/.autobackup.json"
+        if not path.exists(vm_backup_path) or not path.exists(autobackup_state_file):
+            # There are no existing backups so the list is empty
+            state_data = dict()
+            tracked_backups = list()
+        else:
+            with open(autobackup_state_file) as fh:
+                state_data = jload(fh)
+            tracked_backups = state_data["tracked_backups"]
+
+        full_backups = [b for b in tracked_backups if b["type"] == "full"]
+        if len(full_backups) > 0:
+            last_full_backup = full_backups[0]
+            last_full_backup_idx = tracked_backups.index(last_full_backup)
+            if force_full:
+                this_backup_incremental_parent = None
+            elif last_full_backup_idx >= full_interval - 1:
+                this_backup_incremental_parent = None
+            else:
+                this_backup_incremental_parent = last_full_backup["snapshot_name"]
+        else:
+            # The very first backup must be full to start the tree
+            this_backup_incremental_parent = None
+
+        export_type = (
+            "incremental" if this_backup_incremental_parent is not None else "full"
+        )
+
+        current_stage += 1
+        update(
+            celery,
+            f"Performing autobackup of VM {vm_detail['name']} ({export_type})",
+            current=current_stage,
+            total=total_stages,
+        )
+
+        summary = run_vm_backup(
+            zkhandler,
+            celery,
+            config,
+            vm_detail,
+            force_full=force_full,
+        )
+        backup_summary[vm_detail["name"]] = summary
+
+    # Handle automount unmount commands
+    if config["auto_mount_enabled"]:
+        for cmd in config["unmount_cmds"]:
+            current_stage += 1
+            update(
+                celery,
+                f"Executing unmount command '{cmd.split()[0]}'",
+                current=current_stage,
+                total=total_stages,
+            )
+
+            ret = run(
+                cmd.split(),
+                stdout=PIPE,
+                stderr=PIPE,
+            )
+
+            if ret.returncode != 0:
+                error_message = f"Failed to execute unmount command '{cmd.split()[0]}': {ret.stderr.decode().strip()}"
+                log_err(celery, error_message)
+                send_execution_failure_report(
+                    (celery, current_stage, total_stages),
+                    config,
+                    recipients=email_recipients,
+                    total_time=datetime.now() - autobackup_start_time,
+                    error=error_message,
+                )
+                fail(celery, error_message)
+                return False
+
+    autobackup_end_time = datetime.now()
+    autobackup_total_time = autobackup_end_time - autobackup_start_time
+
+    if email_recipients is not None:
+        send_execution_summary_report(
+            (celery, current_stage, total_stages),
+            config,
+            recipients=email_recipients,
+            total_time=autobackup_total_time,
+            summary=backup_summary,
+        )
+        current_stage += 1
+
+    current_stage += 1
+    return finish(
+        celery,
+        f"Successfully completed cluster '{config['cluster']}' VM autobackup",
+        current=current_stage,
+        total=total_stages,
+    )
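Review note on the retention pass above: the expiry logic counts full backups in list order, marks every full backup beyond `full_retention`, and then marks any incremental whose parent was marked. The following is a minimal standalone sketch of that rule, not the project's code. It assumes `tracked_backups` is ordered newest-first (suggested by `full_backups[0]` being treated as the last full backup, but not confirmed here), and `mark_expired_backups` is a hypothetical helper name.

```python
def mark_expired_backups(tracked_backups, full_retention):
    """Return the backups to delete (assumes a newest-first list)."""
    marked = []
    found_full_count = 0
    # Pass 1: any full backup beyond the retention count is expired
    for backup in tracked_backups:
        if backup["type"] == "full":
            found_full_count += 1
            if found_full_count > full_retention:
                marked.append(backup)
    # Pass 2: incrementals are useless without their parent, so expire them too
    expired_parents = {b["snapshot_name"] for b in marked}
    for backup in tracked_backups:
        if (
            backup["type"] == "incremental"
            and backup["incremental_parent"] in expired_parents
        ):
            marked.append(backup)
    return marked


# Illustrative data only; the snapshot names are made up
tracked = [
    {"snapshot_name": "s5", "type": "incremental", "incremental_parent": "s4"},
    {"snapshot_name": "s4", "type": "full", "incremental_parent": None},
    {"snapshot_name": "s3", "type": "incremental", "incremental_parent": "s2"},
    {"snapshot_name": "s2", "type": "full", "incremental_parent": None},
    {"snapshot_name": "s1", "type": "full", "incremental_parent": None},
]
expired = mark_expired_backups(tracked, full_retention=1)
expired_names = [b["snapshot_name"] for b in expired]
```

With one full backup retained, the newest full (`s4`) and its incremental (`s5`) survive, while both older fulls and the incremental that depends on `s2` are marked.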
--- a/daemon-common/benchmark.py
+++ b/daemon-common/benchmark.py
@ -3,7 +3,7 @@
 # benchmark.py - PVC API Benchmark functions
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -19,31 +19,34 @@
 #
 ###############################################################################

+import os
+import psutil
 import psycopg2
 import psycopg2.extras
+import subprocess

 from datetime import datetime
 from json import loads, dumps
+from time import sleep

 from daemon_lib.celery import start, fail, log_info, update, finish

-import daemon_lib.common as pvc_common
 import daemon_lib.ceph as pvc_ceph


 # Define the current test format
-TEST_FORMAT = 1
+TEST_FORMAT = 2


 # We run a total of 8 tests, to give a generalized idea of performance on the cluster:
-#   1. A sequential read test of 8GB with a 4M block size
-#   2. A sequential write test of 8GB with a 4M block size
-#   3. A random read test of 8GB with a 4M block size
-#   4. A random write test of 8GB with a 4M block size
-#   5. A random read test of 8GB with a 256k block size
-#   6. A random write test of 8GB with a 256k block size
-#   7. A random read test of 8GB with a 4k block size
-#   8. A random write test of 8GB with a 4k block size
+#   1. A sequential read test of 64GB with a 4M block size
+#   2. A sequential write test of 64GB with a 4M block size
+#   3. A random read test of 64GB with a 4M block size
+#   4. A random write test of 64GB with a 4M block size
+#   5. A random read test of 64GB with a 256k block size
+#   6. A random write test of 64GB with a 256k block size
+#   7. A random read test of 64GB with a 4k block size
+#   8. A random write test of 64GB with a 4k block size
 # Taken together, these 8 results should give a very good indication of the overall storage performance
 # for a variety of workloads.
 test_matrix = {
@ -100,7 +103,7 @@ test_matrix = {

 # Specify the benchmark volume name and size
 benchmark_volume_name = "pvcbenchmark"
-benchmark_volume_size = "8G"
+benchmark_volume_size = "64G"


 #
@ -115,12 +118,13 @@ class BenchmarkError(Exception):
 #


-def cleanup(job_name, db_conn=None, db_cur=None, zkhandler=None):
+def cleanup(job_name, db_conn=None, db_cur=None, zkhandler=None, final=False):
    if db_conn is not None and db_cur is not None:
-        # Clean up our dangling result
-        query = "DELETE FROM storage_benchmarks WHERE job = %s;"
-        args = (job_name,)
-        db_cur.execute(query, args)
+        if not final:
+            # Clean up our dangling result (non-final runs only)
+            query = "DELETE FROM storage_benchmarks WHERE job = %s;"
+            args = (job_name,)
+            db_cur.execute(query, args)
        db_conn.commit()
        # Close the database connections cleanly
        close_database(db_conn, db_cur)
@ -225,7 +229,7 @@ def cleanup_benchmark_volume(


 def run_benchmark_job(
-    test, pool, job_name=None, db_conn=None, db_cur=None, zkhandler=None
+    config, test, pool, job_name=None, db_conn=None, db_cur=None, zkhandler=None
 ):
    test_spec = test_matrix[test]
    log_info(None, f"Running test '{test}'")
@ -255,31 +259,165 @@ def run_benchmark_job(
    )

    log_info(None, "Running fio job: {}".format(" ".join(fio_cmd.split())))
-    retcode, stdout, stderr = pvc_common.run_os_command(fio_cmd)
+
+    # Run the fio command manually instead of using our run_os_command wrapper
+    # This will help us gather statistics about this node while it's running
+    process = subprocess.Popen(
+        fio_cmd.split(),
+        stdout=subprocess.PIPE,
+        stderr=subprocess.PIPE,
+        text=True,
+    )
+
+    # Wait 15 seconds for the test to start
+    log_info(None, "Waiting 15 seconds for test resource stabilization")
+    sleep(15)
+
+    # Set up function to get process CPU utilization by name
+    def get_cpu_utilization_by_name(process_name):
+        cpu_usage = 0
+        for proc in psutil.process_iter(["name", "cpu_percent"]):
+            if proc.info["name"] == process_name:
+                cpu_usage += proc.info["cpu_percent"]
+        return cpu_usage
+
+    # Set up function to get process memory utilization by name
+    def get_memory_utilization_by_name(process_name):
+        memory_usage = 0
+        for proc in psutil.process_iter(["name", "memory_percent"]):
+            if proc.info["name"] == process_name:
+                memory_usage += proc.info["memory_percent"]
+        return memory_usage
+
+    # Set up function to get network traffic utilization in bps
+    def get_network_traffic_bps(interface, duration=1):
+        # Get initial network counters
+        net_io_start = psutil.net_io_counters(pernic=True)
+        if interface not in net_io_start:
+            return 0, 0, 0  # Interface not found; report no traffic
+
+        stats_start = net_io_start[interface]
+        bytes_sent_start = stats_start.bytes_sent
+        bytes_recv_start = stats_start.bytes_recv
+
+        # Wait for the specified duration
+        sleep(duration)
+
+        # Get final network counters
+        net_io_end = psutil.net_io_counters(pernic=True)
+        stats_end = net_io_end[interface]
+        bytes_sent_end = stats_end.bytes_sent
+        bytes_recv_end = stats_end.bytes_recv
+
+        # Calculate bytes per second
+        bytes_sent_per_sec = (bytes_sent_end - bytes_sent_start) / duration
+        bytes_recv_per_sec = (bytes_recv_end - bytes_recv_start) / duration
+
+        # Convert to bits per second (bps)
+        bits_sent_per_sec = bytes_sent_per_sec * 8
+        bits_recv_per_sec = bytes_recv_per_sec * 8
+        bits_total_per_sec = bits_sent_per_sec + bits_recv_per_sec
+
+        return bits_sent_per_sec, bits_recv_per_sec, bits_total_per_sec
+
+    log_info(None, f"Starting system resource polling for test '{test}'")
+    storage_interface = config["storage_dev"]
+    total_cpus = psutil.cpu_count(logical=True)
+    ticks = 1
+    osd_cpu_utilization = 0
+    osd_memory_utilization = 0
+    mon_cpu_utilization = 0
+    mon_memory_utilization = 0
+    total_cpu_utilization = 0
+    total_memory_utilization = 0
+    storage_sent_bps = 0
+    storage_recv_bps = 0
+    storage_total_bps = 0
+
+    while process.poll() is None:
+        # Do collection of statistics like network bandwidth and cpu utilization
+        current_osd_cpu_utilization = get_cpu_utilization_by_name("ceph-osd")
+        current_osd_memory_utilization = get_memory_utilization_by_name("ceph-osd")
+        current_mon_cpu_utilization = get_cpu_utilization_by_name("ceph-mon")
+        current_mon_memory_utilization = get_memory_utilization_by_name("ceph-mon")
+        current_total_cpu_utilization = psutil.cpu_percent(interval=1)
+        current_total_memory_utilization = psutil.virtual_memory().percent
+        (
+            current_storage_sent_bps,
+            current_storage_recv_bps,
+            current_storage_total_bps,
+        ) = get_network_traffic_bps(storage_interface)
+        # Recheck if the process is done yet; if it's not, we add the values and increase the ticks
+        # This helps ensure that if the process finishes earlier than the longer polls above,
+        # this particular tick isn't counted which can skew the average
+        if process.poll() is None:
+            osd_cpu_utilization += current_osd_cpu_utilization
+            osd_memory_utilization += current_osd_memory_utilization
+            mon_cpu_utilization += current_mon_cpu_utilization
+            mon_memory_utilization += current_mon_memory_utilization
+            total_cpu_utilization += current_total_cpu_utilization
+            total_memory_utilization += current_total_memory_utilization
+            storage_sent_bps += current_storage_sent_bps
+            storage_recv_bps += current_storage_recv_bps
+            storage_total_bps += current_storage_total_bps
+            ticks += 1
+
+    # Get the 1-minute load average, which covers the test duration
+    load1, _, _ = os.getloadavg()
+    load1 = round(load1, 2)
+
+    # Calculate the average CPU utilization values over the runtime
+    # Divide the OSD and MON CPU utilization by the total number of CPU cores, because
+    # psutil reports per-process CPU relative to a single core, unlike the system total
+    avg_osd_cpu_utilization = round(osd_cpu_utilization / ticks / total_cpus, 2)
+    avg_osd_memory_utilization = round(osd_memory_utilization / ticks, 2)
+    avg_mon_cpu_utilization = round(mon_cpu_utilization / ticks / total_cpus, 2)
+    avg_mon_memory_utilization = round(mon_memory_utilization / ticks, 2)
+    avg_total_cpu_utilization = round(total_cpu_utilization / ticks, 2)
+    avg_total_memory_utilization = round(total_memory_utilization / ticks, 2)
+    avg_storage_sent_bps = round(storage_sent_bps / ticks, 2)
+    avg_storage_recv_bps = round(storage_recv_bps / ticks, 2)
+    avg_storage_total_bps = round(storage_total_bps / ticks, 2)
+
+    stdout, stderr = process.communicate()
+    retcode = process.returncode
+
+    resource_data = {
+        "avg_cpu_util_percent": {
+            "total": avg_total_cpu_utilization,
+            "ceph-mon": avg_mon_cpu_utilization,
+            "ceph-osd": avg_osd_cpu_utilization,
+        },
+        "avg_memory_util_percent": {
+            "total": avg_total_memory_utilization,
+            "ceph-mon": avg_mon_memory_utilization,
+            "ceph-osd": avg_osd_memory_utilization,
+        },
+        "avg_network_util_bps": {
+            "sent": avg_storage_sent_bps,
+            "recv": avg_storage_recv_bps,
+            "total": avg_storage_total_bps,
+        },
+    }
+
    try:
        jstdout = loads(stdout)
        if retcode:
            raise
    except Exception:
-        cleanup(
-            job_name,
-            db_conn=db_conn,
-            db_cur=db_cur,
-            zkhandler=zkhandler,
-        )
-        fail(
-            None,
-            f"Failed to run fio test '{test}': {stderr}",
-        )
+        return None, None

-    return jstdout
+    return resource_data, jstdout


-def worker_run_benchmark(zkhandler, celery, config, pool):
+def worker_run_benchmark(zkhandler, celery, config, pool, name):
    # Phase 0 - connect to databases
-    cur_time = datetime.now().isoformat(timespec="seconds")
-    cur_primary = zkhandler.read("base.config.primary_node")
-    job_name = f"{cur_time}_{cur_primary}"
+    if not name:
+        cur_time = datetime.now().isoformat(timespec="seconds")
+        cur_primary = zkhandler.read("base.config.primary_node")
+        job_name = f"{cur_time}_{cur_primary}"
+    else:
+        job_name = name

    current_stage = 0
    total_stages = 13
@ -357,7 +495,8 @@ def worker_run_benchmark(zkhandler, celery, config, pool):
            total=total_stages,
        )

-        results[test] = run_benchmark_job(
+        resource_data, fio_data = run_benchmark_job(
+            config,
            test,
            pool,
            job_name=job_name,
@ -365,6 +504,25 @@ def worker_run_benchmark(zkhandler, celery, config, pool):
            db_cur=db_cur,
            zkhandler=zkhandler,
        )
+        if resource_data is None or fio_data is None:
+            cleanup_benchmark_volume(
+                pool,
+                job_name=job_name,
+                db_conn=db_conn,
+                db_cur=db_cur,
+                zkhandler=zkhandler,
+            )
+            cleanup(
+                job_name,
+                db_conn=db_conn,
+                db_cur=db_cur,
+                zkhandler=zkhandler,
+            )
+            fail(
+                None,
+                f"Failed to run fio test '{test}'",
+            )
+        results[test] = {**resource_data, **fio_data}

    # Phase 3 - cleanup
    current_stage += 1
@ -410,6 +568,7 @@ def worker_run_benchmark(zkhandler, celery, config, pool):
        db_conn=db_conn,
        db_cur=db_cur,
        zkhandler=zkhandler,
+        final=True,
    )

    current_stage += 1
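Review note on the resource polling added to `run_benchmark_job`: the network figure is the delta between two byte-counter samples, divided by the sampling interval and multiplied by 8 to get bits per second, and the reported values are averages over the polling ticks. The following is a small dependency-free sketch of that arithmetic. `Counters` is a hypothetical stand-in for one interface's entry from `psutil.net_io_counters(pernic=True)`, and `traffic_bps` and `average` are illustrative names, not functions from this changeset.

```python
from collections import namedtuple

# Hypothetical stand-in for a per-interface psutil counter sample
Counters = namedtuple("Counters", ["bytes_sent", "bytes_recv"])


def traffic_bps(start, end, duration):
    """Convert two counter samples into (sent, recv, total) bits per second."""
    sent = (end.bytes_sent - start.bytes_sent) / duration * 8
    recv = (end.bytes_recv - start.bytes_recv) / duration * 8
    return sent, recv, sent + recv


def average(samples):
    """Mean of the collected samples, rounded as in the diff; 0.0 if empty."""
    if not samples:
        return 0.0
    return round(sum(samples) / len(samples), 2)


start = Counters(bytes_sent=1_000, bytes_recv=2_000)
end = Counters(bytes_sent=126_000, bytes_recv=252_000)
sent_bps, recv_bps, total_bps = traffic_bps(start, end, duration=1)
avg_cpu = average([10.0, 20.0, 30.0])
```

One design difference worth flagging: the diff initializes `ticks = 1` and increments it once per accepted sample, so its divisor is one more than the number of samples. That avoids a division by zero but biases the averages slightly low. The sketch divides by the actual sample count and guards the empty case instead.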
--- a/daemon-common/celery.py
+++ b/daemon-common/celery.py
@ -3,7 +3,7 @@
 # celery.py - PVC client function library, Celery helper fuctions
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2023 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
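Review note on the `ceph.py` changes in this changeset: `add_volume` and `clone_volume` gain the same two-step capacity check. A request is always rejected if it meets or exceeds the pool's free bytes, and it is rejected without the force flag if it would push the pool to 80% or more of its total (used plus free) capacity. The following is a minimal sketch of that arithmetic as a shared helper. `safe_free_bytes` and `check_volume_fits` are hypothetical names; the changeset itself duplicates the calculation inline in both functions.

```python
def safe_free_bytes(used_bytes, free_bytes, safe_ratio=0.80):
    """Bytes that can still be allocated before the pool reaches safe_ratio full."""
    total_bytes = used_bytes + free_bytes
    return int(total_bytes * safe_ratio) - used_bytes


def check_volume_fits(size_bytes, used_bytes, free_bytes, force=False):
    """Mirror the two checks: hard limit first, then the soft 80% limit."""
    if size_bytes >= free_bytes:
        # Hard limit: force cannot override running out of space
        return False, "exceeds free space"
    if size_bytes >= safe_free_bytes(used_bytes, free_bytes) and not force:
        # Soft limit: can be overridden with force
        return False, "exceeds safe free space"
    return True, "ok"


# A pool with 600 bytes used and 400 free has 200 bytes of safe headroom
headroom = safe_free_bytes(600, 400)
small = check_volume_fits(100, 600, 400)
large = check_volume_fits(250, 600, 400)
large_forced = check_volume_fits(250, 600, 400, force=True)
too_big = check_volume_fits(400, 600, 400, force=True)
```

Note that the safe headroom can be negative when a pool is already above 80% used, in which case every unforced request is rejected.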
--- a/daemon-common/ceph.py
+++ b/daemon-common/ceph.py
@ -3,7 +3,7 @@
 # ceph.py - PVC client function library, Ceph cluster fuctions
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -123,13 +123,13 @@ def format_bytes_tohuman(databytes):
 def format_bytes_fromhuman(datahuman):
    if not re.search(r"[A-Za-z]+", datahuman):
        dataunit = "B"
-        datasize = int(datahuman)
+        datasize = float(datahuman)
    else:
-        dataunit = str(re.match(r"[0-9]+([A-Za-z])[iBb]*", datahuman).group(1))
-        datasize = int(re.match(r"([0-9]+)[A-Za-z]+", datahuman).group(1))
+        dataunit = str(re.match(r"[0-9\.]+([A-Za-z])[iBb]*", datahuman).group(1))
+        datasize = float(re.match(r"([0-9\.]+)[A-Za-z]+", datahuman).group(1))

-    if byte_unit_matrix.get(dataunit):
-        databytes = datasize * byte_unit_matrix[dataunit]
+    if byte_unit_matrix.get(dataunit.upper()):
+        databytes = int(datasize * byte_unit_matrix[dataunit.upper()])
        return databytes
    else:
        return None
@ -155,7 +155,7 @@ def format_ops_fromhuman(datahuman):
    # Trim off human-readable character
    dataunit = datahuman[-1]
    datasize = int(datahuman[:-1])
-    dataops = datasize * ops_unit_matrix[dataunit]
+    dataops = datasize * ops_unit_matrix[dataunit.upper()]
    return "{}".format(dataops)


@ -215,14 +215,26 @@ def getClusterOSDList(zkhandler):


 def getOSDInformation(zkhandler, osd_id):
-    # Get the devices
-    osd_fsid = zkhandler.read(("osd.ofsid", osd_id))
-    osd_node = zkhandler.read(("osd.node", osd_id))
-    osd_device = zkhandler.read(("osd.device", osd_id))
-    osd_is_split = bool(strtobool(zkhandler.read(("osd.is_split", osd_id))))
-    osd_db_device = zkhandler.read(("osd.db_device", osd_id))
+    (
+        osd_fsid,
+        osd_node,
+        osd_device,
+        _osd_is_split,
+        osd_db_device,
+        osd_stats_raw,
+    ) = zkhandler.read_many(
+        [
+            ("osd.ofsid", osd_id),
+            ("osd.node", osd_id),
+            ("osd.device", osd_id),
+            ("osd.is_split", osd_id),
+            ("osd.db_device", osd_id),
+            ("osd.stats", osd_id),
+        ]
+    )
+
+    osd_is_split = bool(strtobool(_osd_is_split))
    # Parse the stats data
-    osd_stats_raw = zkhandler.read(("osd.stats", osd_id))
    osd_stats = dict(json.loads(osd_stats_raw))

    osd_information = {
@ -279,7 +291,7 @@ def unset_osd(zkhandler, option):
    return True, 'Unset OSD property "{}".'.format(option)


-def get_list_osd(zkhandler, limit, is_fuzzy=True):
+def get_list_osd(zkhandler, limit=None, is_fuzzy=True):
    osd_list = []
    full_osd_list = zkhandler.children("base.osd")

@ -308,13 +320,22 @@ def get_list_osd(zkhandler, limit, is_fuzzy=True):
 #
 def getPoolInformation(zkhandler, pool):
    # Parse the stats data
-    pool_stats_raw = zkhandler.read(("pool.stats", pool))
+    (
+        pool_stats_raw,
+        tier,
+        pgs,
+    ) = zkhandler.read_many(
+        [
+            ("pool.stats", pool),
+            ("pool.tier", pool),
+            ("pool.pgs", pool),
+        ]
+    )
+
    pool_stats = dict(json.loads(pool_stats_raw))
    volume_count = len(getCephVolumes(zkhandler, pool))
-    tier = zkhandler.read(("pool.tier", pool))
    if tier is None:
        tier = "default"
-    pgs = zkhandler.read(("pool.pgs", pool))

    pool_information = {
        "name": pool,
@ -472,7 +493,7 @@ def set_pgs_pool(zkhandler, name, pgs):
    return True, f'Set PGs count to {pgs} for RBD pool "{name}".'


-def get_list_pool(zkhandler, limit, is_fuzzy=True):
+def get_list_pool(zkhandler, limit=None, is_fuzzy=True):
    full_pool_list = zkhandler.children("base.pool")

    if is_fuzzy and limit:
@ -519,7 +540,10 @@ def getCephVolumes(zkhandler, pool):
        pool_list = [pool]

    for pool_name in pool_list:
-        for volume_name in zkhandler.children(("volume", pool_name)):
+        children = zkhandler.children(("volume", pool_name))
+        if children is None:
+            continue
+        for volume_name in children:
            volume_list.append("{}/{}".format(pool_name, volume_name))

    return volume_list
@ -536,7 +560,21 @@ def getVolumeInformation(zkhandler, pool, volume):
    return volume_information


-def add_volume(zkhandler, pool, name, size):
+def scan_volume(zkhandler, pool, name):
+    retcode, stdout, stderr = common.run_os_command(
+        "rbd info --format json {}/{}".format(pool, name)
+    )
+    volstats = stdout
+
+    # Write the refreshed volume stats to Zookeeper
+    zkhandler.write(
+        [
+            (("volume.stats", f"{pool}/{name}"), volstats),
+        ]
+    )
+
+
+def add_volume(zkhandler, pool, name, size, force_flag=False, zk_only=False):
    # 1. Verify the size of the volume
    pool_information = getPoolInformation(zkhandler, pool)
    size_bytes = format_bytes_fromhuman(size)
@ -546,46 +584,88 @@ def add_volume(zkhandler, pool, name, size):
            f"ERROR: Requested volume size '{size}' does not have a valid SI unit",
        )

-    if size_bytes >= int(pool_information["stats"]["free_bytes"]):
+    pool_total_free_bytes = int(pool_information["stats"]["free_bytes"])
+    if size_bytes >= pool_total_free_bytes:
        return (
            False,
            f"ERROR: Requested volume size '{format_bytes_tohuman(size_bytes)}' is greater than the available free space in the pool ('{format_bytes_tohuman(pool_information['stats']['free_bytes'])}')",
        )

-    # 2. Create the volume
-    retcode, stdout, stderr = common.run_os_command(
-        "rbd create --size {}B {}/{}".format(size_bytes, pool, name)
+    # Check if we're greater than 80% utilization after the create; error if so unless we have the force flag
+    pool_total_bytes = (
+        int(pool_information["stats"]["used_bytes"]) + pool_total_free_bytes
    )
-    if retcode:
-        return False, 'ERROR: Failed to create RBD volume "{}": {}'.format(name, stderr)
+    pool_safe_total_bytes = int(pool_total_bytes * 0.80)
+    pool_safe_free_bytes = pool_safe_total_bytes - int(
+        pool_information["stats"]["used_bytes"]
+    )
+    if size_bytes >= pool_safe_free_bytes and not force_flag:
+        return (
+            False,
+            f"ERROR: Requested volume size '{format_bytes_tohuman(size_bytes)}' is greater than the safe free space in the pool ('{format_bytes_tohuman(pool_safe_free_bytes)}' for 80% full); retry with force to ignore this error",
+        )

-    # 2. Get volume stats
-    retcode, stdout, stderr = common.run_os_command(
-        "rbd info --format json {}/{}".format(pool, name)
-    )
-    volstats = stdout
+    # 2. Create the volume
+    # zk_only flag skips actually creating the volume - this would be done by some other mechanism
+    if not zk_only:
+        retcode, stdout, stderr = common.run_os_command(
+            "rbd create --size {}B {}/{}".format(size_bytes, pool, name)
+        )
+        if retcode:
+            return False, 'ERROR: Failed to create RBD volume "{}": {}'.format(
+                name, stderr
+            )

    # 3. Add the new volume to Zookeeper
    zkhandler.write(
        [
            (("volume", f"{pool}/{name}"), ""),
-            (("volume.stats", f"{pool}/{name}"), volstats),
+            (("volume.stats", f"{pool}/{name}"), ""),
            (("snapshot", f"{pool}/{name}"), ""),
        ]
    )

+    # 4. Scan the volume stats
+    scan_volume(zkhandler, pool, name)
+
    return True, 'Created RBD volume "{}" of size "{}" in pool "{}".'.format(
        name, format_bytes_tohuman(size_bytes), pool
    )


-def clone_volume(zkhandler, pool, name_src, name_new):
+def clone_volume(zkhandler, pool, name_src, name_new, force_flag=False):
+    # 1. Verify the volume
    if not verifyVolume(zkhandler, pool, name_src):
        return False, 'ERROR: No volume with name "{}" is present in pool "{}".'.format(
            name_src, pool
        )

-    # 1. Clone the volume
+    volume_stats_raw = zkhandler.read(("volume.stats", f"{pool}/{name_src}"))
+    volume_stats = dict(json.loads(volume_stats_raw))
+    size_bytes = volume_stats["size"]
+    pool_information = getPoolInformation(zkhandler, pool)
+    pool_total_free_bytes = int(pool_information["stats"]["free_bytes"])
+    if size_bytes >= pool_total_free_bytes:
+        return (
+            False,
+            f"ERROR: Clone volume size '{format_bytes_tohuman(size_bytes)}' is greater than the available free space in the pool ('{format_bytes_tohuman(pool_information['stats']['free_bytes'])}')",
+        )
+
+    # Check if we're greater than 80% utilization after the clone; error if so unless we have the force flag
+    pool_total_bytes = (
+        int(pool_information["stats"]["used_bytes"]) + pool_total_free_bytes
+    )
+    pool_safe_total_bytes = int(pool_total_bytes * 0.80)
+    pool_safe_free_bytes = pool_safe_total_bytes - int(
+        pool_information["stats"]["used_bytes"]
+    )
+    if size_bytes >= pool_safe_free_bytes and not force_flag:
+        return (
+            False,
+            f"ERROR: Clone volume size '{format_bytes_tohuman(size_bytes)}' is greater than the safe free space in the pool ('{format_bytes_tohuman(pool_safe_free_bytes)}' for 80% full); retry with force to ignore this error",
+        )
+
+    # 2. Clone the volume
    retcode, stdout, stderr = common.run_os_command(
        "rbd copy {}/{} {}/{}".format(pool, name_src, pool, name_new)
    )
@ -597,27 +677,24 @@ def clone_volume(zkhandler, pool, name_src, name_new):
            ),
        )

-    # 2. Get volume stats
-    retcode, stdout, stderr = common.run_os_command(
-        "rbd info --format json {}/{}".format(pool, name_new)
-    )
-    volstats = stdout
-
    # 3. Add the new volume to Zookeeper
    zkhandler.write(
        [
            (("volume", f"{pool}/{name_new}"), ""),
-            (("volume.stats", f"{pool}/{name_new}"), volstats),
+            (("volume.stats", f"{pool}/{name_new}"), ""),
            (("snapshot", f"{pool}/{name_new}"), ""),
        ]
    )

+    # 4. Scan the volume stats
+    scan_volume(zkhandler, pool, name_new)
+
    return True, 'Cloned RBD volume "{}" to "{}" in pool "{}"'.format(
        name_src, name_new, pool
    )


-def resize_volume(zkhandler, pool, name, size):
+def resize_volume(zkhandler, pool, name, size, force_flag=False):
    if not verifyVolume(zkhandler, pool, name):
        return False, 'ERROR: No volume with name "{}" is present in pool "{}".'.format(
            name, pool
@ -632,12 +709,27 @@ def resize_volume(zkhandler, pool, name, size):
            f"ERROR: Requested volume size '{size}' does not have a valid SI unit",
        )

-    if size_bytes >= int(pool_information["stats"]["free_bytes"]):
+    pool_total_free_bytes = int(pool_information["stats"]["free_bytes"])
+    if size_bytes >= pool_total_free_bytes:
        return (
            False,
            f"ERROR: Requested volume size '{format_bytes_tohuman(size_bytes)}' is greater than the available free space in the pool ('{format_bytes_tohuman(pool_information['stats']['free_bytes'])}')",
        )

+    # Check if we're greater than 80% utilization after the resize; error if so unless we have the force flag
+    pool_total_bytes = (
+        int(pool_information["stats"]["used_bytes"]) + pool_total_free_bytes
+    )
+    pool_safe_total_bytes = int(pool_total_bytes * 0.80)
+    pool_safe_free_bytes = pool_safe_total_bytes - int(
+        pool_information["stats"]["used_bytes"]
+    )
+    if size_bytes >= pool_safe_free_bytes and not force_flag:
+        return (
+            False,
+            f"ERROR: Requested volume size '{format_bytes_tohuman(size_bytes)}' is greater than the safe free space in the pool ('{format_bytes_tohuman(pool_safe_free_bytes)}' for 80% full); retry with force to ignore this error",
+        )
+
    # 2. Resize the volume
    retcode, stdout, stderr = common.run_os_command(
        "rbd resize --size {} {}/{}".format(
@ -681,20 +773,8 @@ def resize_volume(zkhandler, pool, name, size):
        except Exception:
            pass

-    # 4. Get volume stats
-    retcode, stdout, stderr = common.run_os_command(
-        "rbd info --format json {}/{}".format(pool, name)
-    )
-    volstats = stdout
-
-    # 5. Update the volume in Zookeeper
-    zkhandler.write(
-        [
-            (("volume", f"{pool}/{name}"), ""),
-            (("volume.stats", f"{pool}/{name}"), volstats),
-            (("snapshot", f"{pool}/{name}"), ""),
-        ]
-    )
+    # 4. Scan the volume stats
+    scan_volume(zkhandler, pool, name)

    return True, 'Resized RBD volume "{}" to size "{}" in pool "{}".'.format(
        name, format_bytes_tohuman(size_bytes), pool
@ -727,18 +807,8 @@ def rename_volume(zkhandler, pool, name, new_name):
        ]
    )

-    # 3. Get volume stats
-    retcode, stdout, stderr = common.run_os_command(
-        "rbd info --format json {}/{}".format(pool, new_name)
-    )
-    volstats = stdout
-
-    # 4. Update the volume stats in Zookeeper
-    zkhandler.write(
-        [
-            (("volume.stats", f"{pool}/{new_name}"), volstats),
-        ]
-    )
+    # 3. Scan the volume stats
+    scan_volume(zkhandler, pool, new_name)

    return True, 'Renamed RBD volume "{}" to "{}" in pool "{}".'.format(
        name, new_name, pool
@ -751,10 +821,22 @@ def remove_volume(zkhandler, pool, name):
            name, pool
        )

-    # 1. Remove volume snapshots
+    # 1a. Remove PVC-managed volume snapshots
    for snapshot in zkhandler.children(("snapshot", f"{pool}/{name}")):
        remove_snapshot(zkhandler, pool, name, snapshot)

+    # 1b. Purge any remaining volume snapshots
+    retcode, stdout, stderr = common.run_os_command(
+        "rbd snap purge {}/{}".format(pool, name)
+    )
+    if retcode:
+        return (
+            False,
+            'ERROR: Failed to purge snapshots from RBD volume "{}" in pool "{}": {}'.format(
+                name, pool, stderr
+            ),
+        )
+
    # 2. Remove the volume
    retcode, stdout, stderr = common.run_os_command("rbd rm {}/{}".format(pool, name))
    if retcode:
@ -830,7 +912,7 @@ def unmap_volume(zkhandler, pool, name):
    return True, 'Unmapped RBD volume at "{}".'.format(mapped_volume)


-def get_list_volume(zkhandler, pool, limit, is_fuzzy=True):
+def get_list_volume(zkhandler, pool, limit=None, is_fuzzy=True):
    if pool and not verifyPool(zkhandler, pool):
        return False, 'ERROR: No pool with name "{}" is present in the cluster.'.format(
            pool
@ -923,23 +1005,27 @@ def add_snapshot(zkhandler, pool, volume, name, zk_only=False):
                ),
            )

-    # 2. Add the snapshot to Zookeeper
+    # 2. Get snapshot stats
+    retcode, stdout, stderr = common.run_os_command(
+        "rbd info --format json {}/{}@{}".format(pool, volume, name)
+    )
+    snapstats = stdout
+
+    # 3. Add the snapshot to Zookeeper
    zkhandler.write(
        [
            (("snapshot", f"{pool}/{volume}/{name}"), ""),
-            (("snapshot.stats", f"{pool}/{volume}/{name}"), "{}"),
+            (("snapshot.stats", f"{pool}/{volume}/{name}"), snapstats),
        ]
    )

-    # 3. Update the count of snapshots on this volume
+    # 4. Update the count of snapshots on this volume
    volume_stats_raw = zkhandler.read(("volume.stats", f"{pool}/{volume}"))
    volume_stats = dict(json.loads(volume_stats_raw))
-    # Format the size to something nicer
    volume_stats["snapshot_count"] = volume_stats["snapshot_count"] + 1
-    volume_stats_raw = json.dumps(volume_stats)
    zkhandler.write(
        [
-            (("volume.stats", f"{pool}/{volume}"), volume_stats_raw),
+            (("volume.stats", f"{pool}/{volume}"), json.dumps(volume_stats)),
        ]
    )

@ -993,6 +1079,36 @@ def rename_snapshot(zkhandler, pool, volume, name, new_name):
    )


+def rollback_snapshot(zkhandler, pool, volume, name):
+    if not verifyVolume(zkhandler, pool, volume):
+        return False, 'ERROR: No volume with name "{}" is present in pool "{}".'.format(
+            volume, pool
+        )
+    if not verifySnapshot(zkhandler, pool, volume, name):
+        return (
+            False,
+            'ERROR: No snapshot with name "{}" is present for volume "{}" in pool "{}".'.format(
+                name, volume, pool
+            ),
+        )
+
+    # 1. Roll back the snapshot
+    retcode, stdout, stderr = common.run_os_command(
+        "rbd snap rollback {}/{}@{}".format(pool, volume, name)
+    )
+    if retcode:
+        return (
+            False,
+            'ERROR: Failed to roll back RBD volume "{}" in pool "{}" to snapshot "{}": {}'.format(
+                volume, pool, name, stderr
+            ),
+        )
+
+    return True, 'Rolled back RBD volume "{}" in pool "{}" to snapshot "{}".'.format(
+        volume, pool, name
+    )
+
+
 def remove_snapshot(zkhandler, pool, volume, name):
    if not verifyVolume(zkhandler, pool, volume):
        return False, 'ERROR: No volume with name "{}" is present in pool "{}".'.format(
@ -1034,20 +1150,9 @@ def remove_snapshot(zkhandler, pool, volume, name):
    )


-def get_list_snapshot(zkhandler, pool, volume, limit, is_fuzzy=True):
+def get_list_snapshot(zkhandler, target_pool, target_volume, limit=None, is_fuzzy=True):
    snapshot_list = []
-    if pool and not verifyPool(zkhandler, pool):
-        return False, 'ERROR: No pool with name "{}" is present in the cluster.'.format(
-            pool
-        )
-
-    if volume and not verifyPool(zkhandler, volume):
-        return (
-            False,
-            'ERROR: No volume with name "{}" is present in the cluster.'.format(volume),
-        )
-
-    full_snapshot_list = getCephSnapshots(zkhandler, pool, volume)
+    full_snapshot_list = getCephSnapshots(zkhandler, target_pool, target_volume)

    if is_fuzzy and limit:
        # Implicitly assume fuzzy limits
@ -1059,6 +1164,15 @@ def get_list_snapshot(zkhandler, pool, volume, limit, is_fuzzy=True):
    for snapshot in full_snapshot_list:
        volume, snapshot_name = snapshot.split("@")
        pool_name, volume_name = volume.split("/")
+        if target_pool and pool_name != target_pool:
+            continue
+        if target_volume and volume_name != target_volume:
+            continue
+        snapshot_stats = json.loads(
+            zkhandler.read(
+                ("snapshot.stats", f"{pool_name}/{volume_name}/{snapshot_name}")
+            )
+        )
        if limit:
            try:
                if re.fullmatch(limit, snapshot_name):
@ -1067,13 +1181,19 @@ def get_list_snapshot(zkhandler, pool, volume, limit, is_fuzzy=True):
                            "pool": pool_name,
                            "volume": volume_name,
                            "snapshot": snapshot_name,
+                            "stats": snapshot_stats,
                        }
                    )
            except Exception as e:
                return False, "Regex Error: {}".format(e)
        else:
            snapshot_list.append(
-                {"pool": pool_name, "volume": volume_name, "snapshot": snapshot_name}
+                {
+                    "pool": pool_name,
+                    "volume": volume_name,
+                    "snapshot": snapshot_name,
+                    "stats": snapshot_stats,
+                }
            )

    return True, sorted(snapshot_list, key=lambda x: str(x["snapshot"]))
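
Review note: the free-space checks added to the volume create, clone, and resize paths above all repeat the same arithmetic. The following is a minimal sketch of that calculation as a standalone function, not code from this change; the name `check_pool_capacity` and the `safe_ratio` parameter are hypothetical and exist only for illustration.

```python
def check_pool_capacity(size_bytes, used_bytes, free_bytes, force=False, safe_ratio=0.80):
    """Return (ok, reason) for a request of size_bytes against a pool.

    Hypothetical helper mirroring the checks in the diff above.
    """
    # Hard limit: the request must fit within the raw free space.
    # The force flag never overrides this check.
    if size_bytes >= free_bytes:
        return False, "exceeds free space"

    # Soft limit: the pool should stay below safe_ratio (80%) utilization
    total_bytes = used_bytes + free_bytes
    safe_total_bytes = int(total_bytes * safe_ratio)
    safe_free_bytes = safe_total_bytes - used_bytes

    if size_bytes >= safe_free_bytes and not force:
        return False, "exceeds safe free space"

    return True, "ok"
```

With a pool of 500 bytes used and 500 bytes free, the safe free space is 300 bytes: a 350-byte request is rejected unless forced, while a 500-byte request is rejected regardless.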
--- a/daemon-common/cluster.py
+++ b/daemon-common/cluster.py
--- a/daemon-common/common.py
+++ b/daemon-common/common.py
@ -3,7 +3,7 @@
 # common.py - PVC client function library, common fuctions
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -26,14 +26,74 @@ import subprocess
 import signal
 from json import loads
 from re import match as re_match
+from re import search as re_search
 from re import split as re_split
 from re import sub as re_sub
+from difflib import unified_diff
 from distutils.util import strtobool
 from threading import Thread
 from shlex import split as shlex_split
 from functools import wraps


+###############################################################################
+# Global Variables
+###############################################################################
+
+
+# State lists
+fault_state_combinations = [
+    "new",
+    "ack",
+]
+node_state_combinations = [
+    "run,ready",
+    "run,flush",
+    "run,flushed",
+    "run,unflush",
+    "init,ready",
+    "init,flush",
+    "init,flushed",
+    "init,unflush",
+    "shutdown,ready",
+    "shutdown,flush",
+    "shutdown,flushed",
+    "shutdown,unflush",
+    "stop,ready",
+    "stop,flush",
+    "stop,flushed",
+    "stop,unflush",
+    "dead,ready",
+    "dead,flush",
+    "dead,fence-flush",
+    "dead,flushed",
+    "dead,unflush",
+    "fenced,ready",
+    "fenced,flush",
+    "fenced,flushed",
+    "fenced,unflush",
+]
+vm_state_combinations = [
+    "start",
+    "restart",
+    "shutdown",
+    "stop",
+    "disable",
+    "fail",
+    "migrate",
+    "unmigrate",
+    "provision",
+    "import",
+    "restore",
+]
+ceph_osd_state_combinations = [
+    "up,in",
+    "up,out",
+    "down,in",
+    "down,out",
+]
+
+
 ###############################################################################
 # Performance Profiler decorator
 ###############################################################################
@ -349,18 +409,118 @@ def getDomainTags(zkhandler, dom_uuid):
    """
    tags = list()

-    for tag in zkhandler.children(("domain.meta.tags", dom_uuid)):
-        tag_type = zkhandler.read(("domain.meta.tags", dom_uuid, "tag.type", tag))
-        protected = bool(
-            strtobool(
-                zkhandler.read(("domain.meta.tags", dom_uuid, "tag.protected", tag))
-            )
-        )
+    all_tags = zkhandler.children(("domain.meta.tags", dom_uuid))
+
+    tag_reads = list()
+    for tag in all_tags:
+        tag_reads += [
+            ("domain.meta.tags", dom_uuid, "tag.type", tag),
+            ("domain.meta.tags", dom_uuid, "tag.protected", tag),
+        ]
+    all_tag_data = zkhandler.read_many(tag_reads)
+
+    for tidx, tag in enumerate(all_tags):
+        # Split the large list of return values by the IDX of this tag
+        # Each tag result is 2 fields long
+        pos_start = tidx * 2
+        pos_end = tidx * 2 + 2
+        tag_type, protected = tuple(all_tag_data[pos_start:pos_end])
+        protected = bool(strtobool(protected))
        tags.append({"name": tag, "type": tag_type, "protected": protected})

    return tags


+#
+# Get a list of domain snapshots
+#
+def getDomainSnapshots(zkhandler, dom_uuid):
+    """
+    Get a list of snapshots for domain dom_uuid
+
+    The UUID must be validated before calling this function!
+    """
+    snapshots = list()
+
+    all_snapshots = zkhandler.children(("domain.snapshots", dom_uuid))
+
+    current_timestamp = time.time()
+    current_dom_xml = zkhandler.read(("domain.xml", dom_uuid))
+
+    snapshots = list()
+    for snapshot in all_snapshots:
+        (
+            snap_name,
+            snap_timestamp,
+            _snap_rbd_snapshots,
+            snap_dom_xml,
+        ) = zkhandler.read_many(
+            [
+                ("domain.snapshots", dom_uuid, "domain_snapshot.name", snapshot),
+                ("domain.snapshots", dom_uuid, "domain_snapshot.timestamp", snapshot),
+                (
+                    "domain.snapshots",
+                    dom_uuid,
+                    "domain_snapshot.rbd_snapshots",
+                    snapshot,
+                ),
+                ("domain.snapshots", dom_uuid, "domain_snapshot.xml", snapshot),
+            ]
+        )
+
+        snap_rbd_snapshots = _snap_rbd_snapshots.split(",")
+
+        snap_dom_xml_diff = list(
+            unified_diff(
+                current_dom_xml.split("\n"),
+                snap_dom_xml.split("\n"),
+                fromfile="current",
+                tofile="snapshot",
+                fromfiledate="",
+                tofiledate="",
+                n=1,
+                lineterm="",
+            )
+        )
+
+        _snap_timestamp = float(snap_timestamp)
+        snap_age_secs = int(current_timestamp) - int(_snap_timestamp)
+        snap_age = f"{snap_age_secs} seconds"
+        snap_age_minutes = int(snap_age_secs / 60)
+        if snap_age_minutes > 0:
+            if snap_age_minutes > 1:
+                s = "s"
+            else:
+                s = ""
+            snap_age = f"{snap_age_minutes} minute{s}"
+        snap_age_hours = int(snap_age_secs / 3600)
+        if snap_age_hours > 0:
+            if snap_age_hours > 1:
+                s = "s"
+            else:
+                s = ""
+            snap_age = f"{snap_age_hours} hour{s}"
+        snap_age_days = int(snap_age_secs / 86400)
+        if snap_age_days > 0:
+            if snap_age_days > 1:
+                s = "s"
+            else:
+                s = ""
+            snap_age = f"{snap_age_days} day{s}"
+
+        snapshots.append(
+            {
+                "name": snap_name,
+                "timestamp": snap_timestamp,
+                "age": snap_age,
+                "xml_diff_lines": snap_dom_xml_diff,
+                "rbd_snapshots": snap_rbd_snapshots,
+            }
+        )
+
+    return sorted(snapshots, key=lambda s: s["timestamp"], reverse=True)
+
+
 #
 # Get a set of domain metadata
 #
@ -370,24 +530,45 @@ def getDomainMetadata(zkhandler, dom_uuid):

    The UUID must be validated before calling this function!
    """
-    domain_node_limit = zkhandler.read(("domain.meta.node_limit", dom_uuid))
-    domain_node_selector = zkhandler.read(("domain.meta.node_selector", dom_uuid))
-    domain_node_autostart = zkhandler.read(("domain.meta.autostart", dom_uuid))
-    domain_migration_method = zkhandler.read(("domain.meta.migrate_method", dom_uuid))
+    (
+        domain_node_limit,
+        domain_node_selector,
+        domain_node_autostart,
+        domain_migration_method,
+        domain_migration_max_downtime,
+    ) = zkhandler.read_many(
+        [
+            ("domain.meta.node_limit", dom_uuid),
+            ("domain.meta.node_selector", dom_uuid),
+            ("domain.meta.autostart", dom_uuid),
+            ("domain.meta.migrate_method", dom_uuid),
+            ("domain.meta.migrate_max_downtime", dom_uuid),
+        ]
+    )

    if not domain_node_limit:
        domain_node_limit = None
    else:
        domain_node_limit = domain_node_limit.split(",")

+    if not domain_node_selector or domain_node_selector == "none":
+        domain_node_selector = None
+
    if not domain_node_autostart:
        domain_node_autostart = None

+    if not domain_migration_method or domain_migration_method == "none":
+        domain_migration_method = None
+
+    if not domain_migration_max_downtime or domain_migration_max_downtime == "none":
+        domain_migration_max_downtime = 300
+
    return (
        domain_node_limit,
        domain_node_selector,
        domain_node_autostart,
        domain_migration_method,
+        domain_migration_max_downtime,
    )


@ -399,30 +580,45 @@ def getInformationFromXML(zkhandler, uuid):
    Gather information about a VM from the Libvirt XML configuration in the Zookeper database
    and return a dict() containing it.
    """
-    domain_state = zkhandler.read(("domain.state", uuid))
-    domain_node = zkhandler.read(("domain.node", uuid))
-    domain_lastnode = zkhandler.read(("domain.last_node", uuid))
-    domain_failedreason = zkhandler.read(("domain.failed_reason", uuid))
+    (
+        domain_state,
+        domain_node,
+        domain_lastnode,
+        domain_failedreason,
+        domain_profile,
+        domain_vnc,
+        stats_data,
+    ) = zkhandler.read_many(
+        [
+            ("domain.state", uuid),
+            ("domain.node", uuid),
+            ("domain.last_node", uuid),
+            ("domain.failed_reason", uuid),
+            ("domain.profile", uuid),
+            ("domain.console.vnc", uuid),
+            ("domain.stats", uuid),
+        ]
+    )

    (
        domain_node_limit,
        domain_node_selector,
        domain_node_autostart,
        domain_migration_method,
+        domain_migration_max_downtime,
    ) = getDomainMetadata(zkhandler, uuid)
-    domain_tags = getDomainTags(zkhandler, uuid)
-    domain_profile = zkhandler.read(("domain.profile", uuid))

-    domain_vnc = zkhandler.read(("domain.console.vnc", uuid))
+    domain_tags = getDomainTags(zkhandler, uuid)
+    domain_snapshots = getDomainSnapshots(zkhandler, uuid)
+
    if domain_vnc:
        domain_vnc_listen, domain_vnc_port = domain_vnc.split(":")
    else:
-        domain_vnc_listen = "None"
-        domain_vnc_port = "None"
+        domain_vnc_listen = None
+        domain_vnc_port = None

    parsed_xml = getDomainXML(zkhandler, uuid)

-    stats_data = zkhandler.read(("domain.stats", uuid))
    if stats_data is not None:
        try:
            stats_data = loads(stats_data)
@ -439,6 +635,7 @@ def getInformationFromXML(zkhandler, uuid):
        domain_vcpu,
        domain_vcputopo,
    ) = getDomainMainDetails(parsed_xml)
+
    domain_networks = getDomainNetworks(parsed_xml, stats_data)

    (
@ -470,7 +667,9 @@ def getInformationFromXML(zkhandler, uuid):
        "node_selector": domain_node_selector,
        "node_autostart": bool(strtobool(domain_node_autostart)),
        "migration_method": domain_migration_method,
+        "migration_max_downtime": int(domain_migration_max_downtime),
        "tags": domain_tags,
+        "snapshots": domain_snapshots,
        "description": domain_description,
        "profile": domain_profile,
        "memory": int(domain_memory),
@ -875,7 +1074,7 @@ def sortInterfaceNames(interface_names):
 #
 # Parse a "detect" device into a real block device name
 #
-def get_detect_device(detect_string):
+def get_detect_device_lsscsi(detect_string):
    """
    Parses a "detect:" string into a normalized block device path using lsscsi.

@ -942,3 +1141,96 @@ def get_detect_device(detect_string):
            break

    return blockdev
+
+
+def get_detect_device_nvme(detect_string):
+    """
+    Parses a "detect:" string into a normalized block device path using nvme.
+
+    A detect string is formatted "detect:<NAME>:<SIZE>:<ID>", where
+    NAME is some unique identifier in the nvme model number, SIZE is a human-readable
+    size value to within +/- 3% of the real size of the device, and
+    ID is the Nth (0-indexed) matching entry of that NAME and SIZE.
+    """
+
+    unit_map = {
+        "kB": 1000,
+        "MB": 1000 * 1000,
+        "GB": 1000 * 1000 * 1000,
+        "TB": 1000 * 1000 * 1000 * 1000,
+        "PB": 1000 * 1000 * 1000 * 1000 * 1000,
+        "EB": 1000 * 1000 * 1000 * 1000 * 1000 * 1000,
+    }
+
+    _, name, _size, idd = detect_string.split(":")
+    if _ != "detect":
+        return None
+
+    size_re = re_search(r"([\d.]+)([kKMGTP]B)", _size)
+    size_val = float(size_re.group(1))
+    size_unit = size_re.group(2)
+    size_bytes = int(size_val * unit_map[size_unit])
+
+    retcode, stdout, stderr = run_os_command("nvme list --output-format json")
+    if retcode:
+        print(f"Failed to run nvme: {stderr}")
+        return None
+
+    # Parse the output with json
+    nvme_data = loads(stdout).get("Devices", list())
+
+    # Handle size determination (+/- 3%)
+    size = None
+    nvme_sizes = set()
+    for entry in nvme_data:
+        nvme_sizes.add(entry["PhysicalSize"])
+    for l_size in nvme_sizes:
+        plusthreepct = size_bytes * 1.03
+        minusthreepct = size_bytes * 0.97
+
+        if l_size > minusthreepct and l_size < plusthreepct:
+            size = l_size
+            break
+    if size is None:
+        return None
+
+    blockdev = None
+    matches = list()
+    for entry in nvme_data:
+        # Skip if name is not contained in the model number (case-insensitive)
+        if name.lower() not in entry["ModelNumber"].lower():
+            continue
+        # Skip if the size does not match
+        if size != entry["PhysicalSize"]:
+            continue
+        # Get our blockdev and append to the list
+        matches.append(entry["DevicePath"])
+
+    blockdev = None
+    # Find the blockdev at index {idd}
+    for idx, _blockdev in enumerate(matches):
+        if int(idx) == int(idd):
+            blockdev = _blockdev
+            break
+
+    return blockdev
+
+
+def get_detect_device(detect_string):
+    """
+    Parses a "detect:" string into a normalized block device path.
+
+    First tries to parse using "lsscsi" (get_detect_device_lsscsi). If this returns an invalid
+    block device name, then tries to parse using "nvme" (get_detect_device_nvme). This works around
+    issues with more recent devices (e.g. the Dell R6615 series) not properly reporting block
+    device paths for NVMe devices with "lsscsi".
+    """
+
+    device = get_detect_device_lsscsi(detect_string)
+    if device is None or not re_match(r"^/dev", device):
+        device = get_detect_device_nvme(detect_string)
+
+    if device is not None and re_match(r"^/dev", device):
+        return device
+    else:
+        return None
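
Review note: the "detect:" handling above splits the string, converts the human-readable size to bytes, and accepts devices within +/- 3% of that size. The sketch below isolates those two steps under stated assumptions; `parse_detect_string` and `size_matches` are hypothetical names, and the extra guards (field count, unmatched size) are stricter than the code in this change.

```python
import re

# SI (decimal) unit multipliers, as in the unit_map of the diff above
UNIT_MAP = {
    "kB": 1000,
    "MB": 1000**2,
    "GB": 1000**3,
    "TB": 1000**4,
    "PB": 1000**5,
    "EB": 1000**6,
}


def parse_detect_string(detect_string):
    """Split "detect:<NAME>:<SIZE>:<ID>" into (name, size_bytes, index), or None."""
    parts = detect_string.split(":")
    if len(parts) != 4 or parts[0] != "detect":
        return None
    _, name, size, index = parts

    # Assumption: only the units present in UNIT_MAP are accepted
    size_match = re.search(r"([\d.]+)([kMGTPE]B)", size)
    if size_match is None:
        return None

    size_bytes = int(float(size_match.group(1)) * UNIT_MAP[size_match.group(2)])
    return name, size_bytes, int(index)


def size_matches(actual_bytes, requested_bytes, tolerance=0.03):
    """True if actual_bytes is strictly within +/- tolerance of requested_bytes."""
    lower = requested_bytes * (1 - tolerance)
    upper = requested_bytes * (1 + tolerance)
    return lower < actual_bytes < upper
```

The tolerance exists because the size a device reports rarely equals its marketed capacity exactly, so an "800GB" detect string should still match a device reporting slightly more or less.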
--- a/daemon-common/config.py
+++ b/daemon-common/config.py
@ -3,7 +3,7 @@
 # config.py - Utility functions for pvcnoded configuration parsing
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -176,6 +176,7 @@ def get_parsed_configuration(config_file):
            "enable_storage": o_subsystem.get("enable_storage", True),
            "enable_worker": o_subsystem.get("enable_worker", True),
            "enable_api": o_subsystem.get("enable_api", True),
+            "enable_prometheus": o_subsystem.get("enable_prometheus", True),
        }
        config = {**config, **config_subsystem}

@ -243,9 +244,9 @@ def get_parsed_configuration(config_file):
                    ]
                ][0]

-            config_cluster_networks_specific[
-                f"{network_type}_dev_ip"
-            ] = f"{list(network.hosts())[address_id]}/{network.prefixlen}"
+            config_cluster_networks_specific[f"{network_type}_dev_ip"] = (
+                f"{list(network.hosts())[address_id]}/{network.prefixlen}"
+            )

            config = {**config, **config_cluster_networks_specific}

@ -284,7 +285,7 @@ def get_parsed_configuration(config_file):
        config_timer = {
            "vm_shutdown_timeout": int(o_timer.get("vm_shutdown_timeout", 180)),
            "keepalive_interval": int(o_timer.get("keepalive_interval", 5)),
-            "monitoring_interval": int(o_timer.get("monitoring_interval", 60)),
+            "monitoring_interval": int(o_timer.get("monitoring_interval", 15)),
        }
        config = {**config, **config_timer}

@ -405,6 +406,78 @@ def get_configuration():
    return config


+def get_parsed_autobackup_configuration(config_file):
+    """
+    Load the configuration; this is the same main pvc.conf that the daemons read
+    """
+    print('Loading configuration from file "{}"'.format(config_file))
+
+    with open(config_file, "r") as cfgfh:
+        try:
+            o_config = yaml.load(cfgfh, Loader=yaml.SafeLoader)
+        except Exception as e:
+            print(f"ERROR: Failed to parse configuration file: {e}")
+            os._exit(1)
+
+    config = dict()
+
+    try:
+        o_cluster = o_config["cluster"]
+        config_cluster = {
+            "cluster": o_cluster["name"],
+            "autobackup_enabled": True,
+        }
+        config = {**config, **config_cluster}
+
+        o_autobackup = o_config["autobackup"]
+        if o_autobackup is None:
+            config["autobackup_enabled"] = False
+            return config
+
+        config_autobackup = {
+            "backup_root_path": o_autobackup["backup_root_path"],
+            "backup_root_suffix": o_autobackup["backup_root_suffix"],
+            "backup_tags": o_autobackup["backup_tags"],
+            "backup_schedule": o_autobackup["backup_schedule"],
+        }
+        config = {**config, **config_autobackup}
+
+        o_automount = o_autobackup["auto_mount"]
+        config_automount = {
+            "auto_mount_enabled": o_automount["enabled"],
+        }
+        config = {**config, **config_automount}
+        if config["auto_mount_enabled"]:
+            config["mount_cmds"] = list()
+            for _mount_cmd in o_automount["mount_cmds"]:
+                if "{backup_root_path}" in _mount_cmd:
+                    _mount_cmd = _mount_cmd.format(
+                        backup_root_path=config["backup_root_path"]
+                    )
+                config["mount_cmds"].append(_mount_cmd)
+            config["unmount_cmds"] = list()
+            for _unmount_cmd in o_automount["unmount_cmds"]:
+                if "{backup_root_path}" in _unmount_cmd:
+                    _unmount_cmd = _unmount_cmd.format(
+                        backup_root_path=config["backup_root_path"]
+                    )
+                config["unmount_cmds"].append(_unmount_cmd)
+
+    except Exception as e:
+        raise MalformedConfigurationError(e)
+
+    return config
+
+
+def get_autobackup_configuration():
+    """
+    Get the configuration.
+    """
+    pvc_config_file = get_configuration_path()
+    config = get_parsed_autobackup_configuration(pvc_config_file)
+    return config
+
+
 def validate_directories(config):
    if not os.path.exists(config["dynamic_directory"]):
        os.makedirs(config["dynamic_directory"])
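
Review note: the autobackup parser above substitutes `{backup_root_path}` into each mount and unmount command using `str.format`. A minimal sketch of that substitution follows; `render_commands` is a hypothetical name, and the example commands are made up for illustration rather than taken from any real pvc.conf.

```python
def render_commands(commands, backup_root_path):
    """Substitute the backup root path into each command that references it."""
    rendered = []
    for command in commands:
        # Only format commands that contain the placeholder, as in the diff above
        if "{backup_root_path}" in command:
            command = command.format(backup_root_path=backup_root_path)
        rendered.append(command)
    return rendered


# Made-up example commands
example = render_commands(
    ["/usr/sbin/mount.nfs backuphost:/export {backup_root_path}", "/bin/true"],
    "/srv/backups",
)
```

One consequence of this approach is that a command containing both the placeholder and other literal braces would make `str.format` raise an exception, so such commands would need their braces doubled.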
--- a/daemon-common/faults.py
+++ b/daemon-common/faults.py
@ -0,0 +1,261 @@
+#!/usr/bin/env python3
+
+# faults.py - PVC client function library, faults management
+# Part of the Parallel Virtual Cluster (PVC) system
+#
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
+#
+#    This program is free software: you can redistribute it and/or modify
+#    it under the terms of the GNU General Public License as published by
+#    the Free Software Foundation, version 3.
+#
+#    This program is distributed in the hope that it will be useful,
+#    but WITHOUT ANY WARRANTY; without even the implied warranty of
+#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#    GNU General Public License for more details.
+#
+#    You should have received a copy of the GNU General Public License
+#    along with this program.  If not, see <https://www.gnu.org/licenses/>.
+#
+###############################################################################
+
+from datetime import datetime
+
+
+def generate_fault(
+    zkhandler,
+    logger,
+    fault_name,
+    fault_time,
+    fault_delta,
+    fault_message,
+    fault_details=None,
+):
+    # Strip the microseconds off of the fault time; we don't care about that precision
+    fault_time = str(fault_time).split(".")[0]
+
+    if fault_details is not None:
+        fault_message = f"{fault_message}: {fault_details}"
+
+    # Skip fault reporting entirely if the Zookeeper schema does not yet include faults
+    if not zkhandler.exists("base.faults"):
+        logger.out(
+            f"Skipping fault reporting for {fault_name} due to missing Zookeeper schemas",
+            state="w",
+        )
+        return
+
+    existing_faults = zkhandler.children("base.faults")
+    if fault_name in existing_faults:
+        logger.out(
+            f"Updating fault {fault_name}: {fault_message} @ {fault_time}", state="i"
+        )
+    else:
+        logger.out(
+            f"Generating fault {fault_name}: {fault_message} @ {fault_time}",
+            state="i",
+        )
+
+    if zkhandler.read("base.config.maintenance") == "true":
+        logger.out(
+            f"Skipping fault reporting for {fault_name} due to maintenance mode",
+            state="w",
+        )
+        return
+
+    # Update an existing fault
+    if fault_name in existing_faults:
+        zkhandler.write(
+            [
+                (("faults.last_time", fault_name), fault_time),
+                (("faults.delta", fault_name), fault_delta),
+                (("faults.message", fault_name), fault_message),
+            ]
+        )
+    # Generate a new fault
+    else:
+        zkhandler.write(
+            [
+                (("faults.id", fault_name), ""),
+                (("faults.first_time", fault_name), fault_time),
+                (("faults.last_time", fault_name), fault_time),
+                (("faults.ack_time", fault_name), ""),
+                (("faults.status", fault_name), "new"),
+                (("faults.delta", fault_name), fault_delta),
+                (("faults.message", fault_name), fault_message),
+            ]
+        )
+
+
+def getFault(zkhandler, fault_id):
+    """
+    Get the details of a fault based on the fault ID
+    """
+    if not zkhandler.exists(("faults.id", fault_id)):
+        return None
+
+    # Read every field of this fault in a single batched call
+
+    (
+        fault_last_time,
+        fault_first_time,
+        fault_ack_time,
+        fault_status,
+        fault_delta,
+        fault_message,
+    ) = zkhandler.read_many(
+        [
+            ("faults.last_time", fault_id),
+            ("faults.first_time", fault_id),
+            ("faults.ack_time", fault_id),
+            ("faults.status", fault_id),
+            ("faults.delta", fault_id),
+            ("faults.message", fault_id),
+        ]
+    )
+
+    # Acknowledged faults have a delta of 0
+    if fault_ack_time != "":
+        fault_delta = 0
+
+    fault = {
+        "id": fault_id,
+        "last_reported": fault_last_time,
+        "first_reported": fault_first_time,
+        "acknowledged_at": fault_ack_time,
+        "status": fault_status,
+        "health_delta": int(fault_delta),
+        "message": fault_message,
+    }
+
+    return fault
+
+
+def getAllFaults(zkhandler, sort_key="last_reported"):
+    """
+    Get the details of all registered faults
+    """
+
+    all_faults = zkhandler.children("base.faults")
+
+    faults_reads = list()
+    for fault_id in all_faults:
+        faults_reads += [
+            ("faults.last_time", fault_id),
+            ("faults.first_time", fault_id),
+            ("faults.ack_time", fault_id),
+            ("faults.status", fault_id),
+            ("faults.delta", fault_id),
+            ("faults.message", fault_id),
+        ]
+    all_faults_data = list(zkhandler.read_many(faults_reads))
+
+    faults_detail = list()
+    for fidx, fault_id in enumerate(all_faults):
+        # Slice this fault's values out of the flat result list by its index;
+        # each fault contributes exactly 6 consecutive fields
+        pos_start = fidx * 6
+        pos_end = fidx * 6 + 6
+        (
+            fault_last_time,
+            fault_first_time,
+            fault_ack_time,
+            fault_status,
+            fault_delta,
+            fault_message,
+        ) = tuple(all_faults_data[pos_start:pos_end])
+        fault_output = {
+            "id": fault_id,
+            "last_reported": fault_last_time,
+            "first_reported": fault_first_time,
+            "acknowledged_at": fault_ack_time,
+            "status": fault_status,
+            "health_delta": int(fault_delta),
+            "message": fault_message,
+        }
+        faults_detail.append(fault_output)
+
+    sorted_faults = sorted(faults_detail, key=lambda x: x[sort_key])
+    # Sort newest-first for time-based sorts
+    if sort_key in ["first_reported", "last_reported", "acknowledged_at"]:
+        sorted_faults.reverse()
+
+    return sorted_faults
+
+
+def get_list(zkhandler, limit=None, sort_key="last_reported"):
+    """
+    Get a list of all known faults, sorted by sort_key and optionally limited to a single fault ID
+    """
+    if sort_key not in [
+        "first_reported",
+        "last_reported",
+        "acknowledged_at",
+        "status",
+        "health_delta",
+        "message",
+    ]:
+        return False, f"Invalid sort key {sort_key} provided"
+
+    all_faults = getAllFaults(zkhandler, sort_key=sort_key)
+
+    if limit is not None:
+        all_faults = [fault for fault in all_faults if fault["id"] == limit]
+
+    return True, all_faults
+
+
+def acknowledge(zkhandler, fault_id=None):
+    """
+    Acknowledge a fault or all faults
+    """
+    if fault_id is None:
+        faults = getAllFaults(zkhandler)
+    else:
+        fault = getFault(zkhandler, fault_id)
+
+        if fault is None:
+            return False, f"No fault with ID {fault_id} found"
+
+        faults = [fault]
+
+    for fault in faults:
+        # Don't reacknowledge already-acknowledged faults
+        if fault["status"] != "ack":
+            zkhandler.write(
+                [
+                    (
+                        ("faults.ack_time", fault["id"]),
+                        str(datetime.now()).split(".")[0],
+                    ),
+                    (("faults.status", fault["id"]), "ack"),
+                ]
+            )
+
+    return (
+        True,
+        f"Successfully acknowledged fault(s) {', '.join([fault['id'] for fault in faults])}",
+    )
+
+
+def delete(zkhandler, fault_id=None):
+    """
+    Delete a fault or all faults
+    """
+    if fault_id is None:
+        faults = getAllFaults(zkhandler)
+    else:
+        fault = getFault(zkhandler, fault_id)
+
+        if fault is None:
+            return False, f"No fault with ID {fault_id} found"
+
+        faults = [fault]
+
+    for fault in faults:
+        zkhandler.delete(("faults.id", fault["id"]), recursive=True)
+
+    return (
+        True,
+        f"Successfully deleted fault(s) {', '.join([fault['id'] for fault in faults])}",
+    )
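The batching in `getAllFaults` above depends on `read_many` returning values in the same order as the requested keys, six per fault. A minimal sketch of that slicing follows; `FakeZK` and `chunk_faults` are hypothetical names used only for illustration, and the in-memory handler is an assumption rather than the real PVC `zkhandler`.

```python
FIELDS = ("last_time", "first_time", "ack_time", "status", "delta", "message")


class FakeZK:
    """Hypothetical in-memory stand-in for the PVC zkhandler (illustration only)."""

    def __init__(self, data):
        self.data = data  # {fault_id: {field: value}}

    def children(self, key):
        # Return the known fault IDs; the key argument is ignored in this sketch
        return list(self.data)

    def read_many(self, keys):
        # Assumption: values come back in the same order as the requested keys
        return [self.data[fault_id][key.split(".", 1)[1]] for key, fault_id in keys]


def chunk_faults(zk):
    fault_ids = zk.children("base.faults")
    # Build one flat list of reads: len(FIELDS) entries per fault
    reads = [(f"faults.{field}", fid) for fid in fault_ids for field in FIELDS]
    flat = zk.read_many(reads)

    width = len(FIELDS)
    faults = []
    for idx, fid in enumerate(fault_ids):
        # Slice out the consecutive block of values belonging to this fault
        values = flat[idx * width : (idx + 1) * width]
        record = dict(zip(FIELDS, values))
        record["id"] = fid
        faults.append(record)
    return faults


zk = FakeZK(
    {
        "F1": {
            "last_time": "2024-09-01 10:00:00",
            "first_time": "2024-09-01 09:00:00",
            "ack_time": "",
            "status": "new",
            "delta": "25",
            "message": "disk failure",
        },
        "F2": {
            "last_time": "2024-09-02 10:00:00",
            "first_time": "2024-09-02 09:00:00",
            "ack_time": "",
            "status": "new",
            "delta": "10",
            "message": "OSD out",
        },
    }
)
result = chunk_faults(zk)
```

Deriving the slice width from `len(FIELDS)` instead of a literal `6` keeps the read list and the unpacking in step if a field is added later.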
--- a/api-daemon/pvcapid/libvirt_schema.py
+++ b/api-daemon/pvcapid/libvirt_schema.py
@ -3,7 +3,7 @@
 # libvirt_schema.py - Libvirt schema elements
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/daemon-common/log.py
+++ b/daemon-common/log.py
@ -3,7 +3,7 @@
 # log.py - PVC daemon logger functions
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -115,6 +115,10 @@ class Logger(object):

    # Output function
    def out(self, message, state=None, prefix=""):
+        # Only handle d-state (debug) messages if we're in debug mode
+        if state in ["d"] and not self.config["debug"]:
+            return
+
        # Get the date
        if self.config["log_dates"]:
            date = "{} ".format(datetime.now().strftime("%Y/%m/%d %H:%M:%S.%f"))
@ -146,7 +150,7 @@ class Logger(object):
        if self.config["stdout_logging"]:
            # Assemble output string
            output = colour + prompt + endc + date + prefix + message
-            print(output)
+            print(output + "\n", end="")

        # Log to file
        if self.config["file_logging"]:
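The switch from `print(output)` to `print(output + "\n", end="")` in the hunk above is presumably meant to emit each log line as a single write, so that lines from concurrent threads are less likely to interleave; the diff itself does not state the motivation, so treat that as an assumption. The sketch below uses a hypothetical `RecordingStream` to show how many non-empty writes each form produces in CPython.

```python
class RecordingStream:
    """Hypothetical file-like object that records each non-empty write call."""

    def __init__(self):
        self.calls = []

    def write(self, text):
        # Ignore empty writes so only writes carrying data are recorded
        if text:
            self.calls.append(text)
        return len(text)


# Default form: the message and the trailing newline are written separately
default_stream = RecordingStream()
print("message", file=default_stream)

# Form used in the diff: the newline is part of the string, end is empty
single_stream = RecordingStream()
print("message" + "\n", end="", file=single_stream)

default_writes = len(default_stream.calls)
single_writes = len(single_stream.calls)
```

Both forms produce the same text; only the number of underlying write calls differs.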
--- a/daemon-common/migrations/versions/11.json
+++ b/daemon-common/migrations/versions/11.json
@ -0,0 +1 @@
+{"version": "11", "root": "", "base": {"root": "", "schema": "/schema", "schema.version": "/schema/version", "config": "/config", "config.maintenance": "/config/maintenance", "config.primary_node": "/config/primary_node", "config.primary_node.sync_lock": "/config/primary_node/sync_lock", "config.upstream_ip": "/config/upstream_ip", "config.migration_target_selector": "/config/migration_target_selector", "logs": "/logs", "faults": "/faults", "node": "/nodes", "domain": "/domains", "network": "/networks", "storage": "/ceph", "storage.health": "/ceph/health", "storage.util": "/ceph/util", "osd": "/ceph/osds", "pool": "/ceph/pools", "volume": "/ceph/volumes", "snapshot": "/ceph/snapshots"}, "logs": {"node": "", "messages": "/messages"}, "faults": {"id": "", "last_time": "/last_time", "first_time": "/first_time", "ack_time": "/ack_time", "status": "/status", "delta": "/delta", "message": "/message"}, "node": {"name": "", "keepalive": "/keepalive", "mode": "/daemonmode", "data.active_schema": "/activeschema", "data.latest_schema": "/latestschema", "data.static": "/staticdata", "data.pvc_version": "/pvcversion", "running_domains": "/runningdomains", "count.provisioned_domains": "/domainscount", "count.networks": "/networkscount", "state.daemon": "/daemonstate", "state.router": "/routerstate", "state.domain": "/domainstate", "cpu.load": "/cpuload", "vcpu.allocated": "/vcpualloc", "memory.total": "/memtotal", "memory.used": "/memused", "memory.free": "/memfree", "memory.allocated": "/memalloc", "memory.provisioned": "/memprov", "ipmi.hostname": "/ipmihostname", "ipmi.username": "/ipmiusername", "ipmi.password": "/ipmipassword", "sriov": "/sriov", "sriov.pf": "/sriov/pf", "sriov.vf": "/sriov/vf", "monitoring.plugins": "/monitoring_plugins", "monitoring.data": "/monitoring_data", "monitoring.health": "/monitoring_health"}, "monitoring_plugin": {"name": "", "last_run": "/last_run", "health_delta": "/health_delta", "message": "/message", "data": "/data", "runtime": 
"/runtime"}, "sriov_pf": {"phy": "", "mtu": "/mtu", "vfcount": "/vfcount"}, "sriov_vf": {"phy": "", "pf": "/pf", "mtu": "/mtu", "mac": "/mac", "phy_mac": "/phy_mac", "config": "/config", "config.vlan_id": "/config/vlan_id", "config.vlan_qos": "/config/vlan_qos", "config.tx_rate_min": "/config/tx_rate_min", "config.tx_rate_max": "/config/tx_rate_max", "config.spoof_check": "/config/spoof_check", "config.link_state": "/config/link_state", "config.trust": "/config/trust", "config.query_rss": "/config/query_rss", "pci": "/pci", "pci.domain": "/pci/domain", "pci.bus": "/pci/bus", "pci.slot": "/pci/slot", "pci.function": "/pci/function", "used": "/used", "used_by": "/used_by"}, "domain": {"name": "", "xml": "/xml", "state": "/state", "profile": "/profile", "stats": "/stats", "node": "/node", "last_node": "/lastnode", "failed_reason": "/failedreason", "storage.volumes": "/rbdlist", "console.log": "/consolelog", "console.vnc": "/vnc", "meta.autostart": "/node_autostart", "meta.migrate_method": "/migration_method", "meta.node_selector": "/node_selector", "meta.node_limit": "/node_limit", "meta.tags": "/tags", "migrate.sync_lock": "/migrate_sync_lock"}, "tag": {"name": "", "type": "/type", "protected": "/protected"}, "network": {"vni": "", "type": "/nettype", "mtu": "/mtu", "rule": "/firewall_rules", "rule.in": "/firewall_rules/in", "rule.out": "/firewall_rules/out", "nameservers": "/name_servers", "domain": "/domain", "reservation": "/dhcp4_reservations", "lease": "/dhcp4_leases", "ip4.gateway": "/ip4_gateway", "ip4.network": "/ip4_network", "ip4.dhcp": "/dhcp4_flag", "ip4.dhcp_start": "/dhcp4_start", "ip4.dhcp_end": "/dhcp4_end", "ip6.gateway": "/ip6_gateway", "ip6.network": "/ip6_network", "ip6.dhcp": "/dhcp6_flag"}, "reservation": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname"}, "lease": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname", "expiry": "/expiry", "client_id": "/clientid"}, "rule": {"description": "", "rule": "/rule", "order": "/order"}, "osd": {"id": 
"", "node": "/node", "device": "/device", "db_device": "/db_device", "fsid": "/fsid", "ofsid": "/fsid/osd", "cfsid": "/fsid/cluster", "lvm": "/lvm", "vg": "/lvm/vg", "lv": "/lvm/lv", "is_split": "/is_split", "stats": "/stats"}, "pool": {"name": "", "pgs": "/pgs", "tier": "/tier", "stats": "/stats"}, "volume": {"name": "", "stats": "/stats"}, "snapshot": {"name": "", "stats": "/stats"}}
--- a/daemon-common/migrations/versions/12.json
+++ b/daemon-common/migrations/versions/12.json
@ -0,0 +1 @@
+{"version": "12", "root": "", "base": {"root": "", "schema": "/schema", "schema.version": "/schema/version", "config": "/config", "config.maintenance": "/config/maintenance", "config.primary_node": "/config/primary_node", "config.primary_node.sync_lock": "/config/primary_node/sync_lock", "config.upstream_ip": "/config/upstream_ip", "config.migration_target_selector": "/config/migration_target_selector", "logs": "/logs", "faults": "/faults", "node": "/nodes", "domain": "/domains", "network": "/networks", "storage": "/ceph", "storage.health": "/ceph/health", "storage.util": "/ceph/util", "osd": "/ceph/osds", "pool": "/ceph/pools", "volume": "/ceph/volumes", "snapshot": "/ceph/snapshots"}, "logs": {"node": "", "messages": "/messages"}, "faults": {"id": "", "last_time": "/last_time", "first_time": "/first_time", "ack_time": "/ack_time", "status": "/status", "delta": "/delta", "message": "/message"}, "node": {"name": "", "keepalive": "/keepalive", "mode": "/daemonmode", "data.active_schema": "/activeschema", "data.latest_schema": "/latestschema", "data.static": "/staticdata", "data.pvc_version": "/pvcversion", "running_domains": "/runningdomains", "count.provisioned_domains": "/domainscount", "count.networks": "/networkscount", "state.daemon": "/daemonstate", "state.router": "/routerstate", "state.domain": "/domainstate", "cpu.load": "/cpuload", "vcpu.allocated": "/vcpualloc", "memory.total": "/memtotal", "memory.used": "/memused", "memory.free": "/memfree", "memory.allocated": "/memalloc", "memory.provisioned": "/memprov", "ipmi.hostname": "/ipmihostname", "ipmi.username": "/ipmiusername", "ipmi.password": "/ipmipassword", "sriov": "/sriov", "sriov.pf": "/sriov/pf", "sriov.vf": "/sriov/vf", "monitoring.plugins": "/monitoring_plugins", "monitoring.data": "/monitoring_data", "monitoring.health": "/monitoring_health", "network.stats": "/network_stats"}, "monitoring_plugin": {"name": "", "last_run": "/last_run", "health_delta": "/health_delta", "message": "/message", 
"data": "/data", "runtime": "/runtime"}, "sriov_pf": {"phy": "", "mtu": "/mtu", "vfcount": "/vfcount"}, "sriov_vf": {"phy": "", "pf": "/pf", "mtu": "/mtu", "mac": "/mac", "phy_mac": "/phy_mac", "config": "/config", "config.vlan_id": "/config/vlan_id", "config.vlan_qos": "/config/vlan_qos", "config.tx_rate_min": "/config/tx_rate_min", "config.tx_rate_max": "/config/tx_rate_max", "config.spoof_check": "/config/spoof_check", "config.link_state": "/config/link_state", "config.trust": "/config/trust", "config.query_rss": "/config/query_rss", "pci": "/pci", "pci.domain": "/pci/domain", "pci.bus": "/pci/bus", "pci.slot": "/pci/slot", "pci.function": "/pci/function", "used": "/used", "used_by": "/used_by"}, "domain": {"name": "", "xml": "/xml", "state": "/state", "profile": "/profile", "stats": "/stats", "node": "/node", "last_node": "/lastnode", "failed_reason": "/failedreason", "storage.volumes": "/rbdlist", "console.log": "/consolelog", "console.vnc": "/vnc", "meta.autostart": "/node_autostart", "meta.migrate_method": "/migration_method", "meta.node_selector": "/node_selector", "meta.node_limit": "/node_limit", "meta.tags": "/tags", "migrate.sync_lock": "/migrate_sync_lock"}, "tag": {"name": "", "type": "/type", "protected": "/protected"}, "network": {"vni": "", "type": "/nettype", "mtu": "/mtu", "rule": "/firewall_rules", "rule.in": "/firewall_rules/in", "rule.out": "/firewall_rules/out", "nameservers": "/name_servers", "domain": "/domain", "reservation": "/dhcp4_reservations", "lease": "/dhcp4_leases", "ip4.gateway": "/ip4_gateway", "ip4.network": "/ip4_network", "ip4.dhcp": "/dhcp4_flag", "ip4.dhcp_start": "/dhcp4_start", "ip4.dhcp_end": "/dhcp4_end", "ip6.gateway": "/ip6_gateway", "ip6.network": "/ip6_network", "ip6.dhcp": "/dhcp6_flag"}, "reservation": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname"}, "lease": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname", "expiry": "/expiry", "client_id": "/clientid"}, "rule": {"description": "", "rule": "/rule", 
"order": "/order"}, "osd": {"id": "", "node": "/node", "device": "/device", "db_device": "/db_device", "fsid": "/fsid", "ofsid": "/fsid/osd", "cfsid": "/fsid/cluster", "lvm": "/lvm", "vg": "/lvm/vg", "lv": "/lvm/lv", "is_split": "/is_split", "stats": "/stats"}, "pool": {"name": "", "pgs": "/pgs", "tier": "/tier", "stats": "/stats"}, "volume": {"name": "", "stats": "/stats"}, "snapshot": {"name": "", "stats": "/stats"}}
--- a/daemon-common/migrations/versions/13.json
+++ b/daemon-common/migrations/versions/13.json
@ -0,0 +1 @@
+{"version": "13", "root": "", "base": {"root": "", "schema": "/schema", "schema.version": "/schema/version", "config": "/config", "config.maintenance": "/config/maintenance", "config.primary_node": "/config/primary_node", "config.primary_node.sync_lock": "/config/primary_node/sync_lock", "config.upstream_ip": "/config/upstream_ip", "config.migration_target_selector": "/config/migration_target_selector", "logs": "/logs", "faults": "/faults", "node": "/nodes", "domain": "/domains", "network": "/networks", "storage": "/ceph", "storage.health": "/ceph/health", "storage.util": "/ceph/util", "osd": "/ceph/osds", "pool": "/ceph/pools", "volume": "/ceph/volumes", "snapshot": "/ceph/snapshots"}, "logs": {"node": "", "messages": "/messages"}, "faults": {"id": "", "last_time": "/last_time", "first_time": "/first_time", "ack_time": "/ack_time", "status": "/status", "delta": "/delta", "message": "/message"}, "node": {"name": "", "keepalive": "/keepalive", "mode": "/daemonmode", "data.active_schema": "/activeschema", "data.latest_schema": "/latestschema", "data.static": "/staticdata", "data.pvc_version": "/pvcversion", "running_domains": "/runningdomains", "count.provisioned_domains": "/domainscount", "count.networks": "/networkscount", "state.daemon": "/daemonstate", "state.router": "/routerstate", "state.domain": "/domainstate", "cpu.load": "/cpuload", "vcpu.allocated": "/vcpualloc", "memory.total": "/memtotal", "memory.used": "/memused", "memory.free": "/memfree", "memory.allocated": "/memalloc", "memory.provisioned": "/memprov", "ipmi.hostname": "/ipmihostname", "ipmi.username": "/ipmiusername", "ipmi.password": "/ipmipassword", "sriov": "/sriov", "sriov.pf": "/sriov/pf", "sriov.vf": "/sriov/vf", "monitoring.plugins": "/monitoring_plugins", "monitoring.data": "/monitoring_data", "monitoring.health": "/monitoring_health", "network.stats": "/network_stats"}, "monitoring_plugin": {"name": "", "last_run": "/last_run", "health_delta": "/health_delta", "message": "/message", 
"data": "/data", "runtime": "/runtime"}, "sriov_pf": {"phy": "", "mtu": "/mtu", "vfcount": "/vfcount"}, "sriov_vf": {"phy": "", "pf": "/pf", "mtu": "/mtu", "mac": "/mac", "phy_mac": "/phy_mac", "config": "/config", "config.vlan_id": "/config/vlan_id", "config.vlan_qos": "/config/vlan_qos", "config.tx_rate_min": "/config/tx_rate_min", "config.tx_rate_max": "/config/tx_rate_max", "config.spoof_check": "/config/spoof_check", "config.link_state": "/config/link_state", "config.trust": "/config/trust", "config.query_rss": "/config/query_rss", "pci": "/pci", "pci.domain": "/pci/domain", "pci.bus": "/pci/bus", "pci.slot": "/pci/slot", "pci.function": "/pci/function", "used": "/used", "used_by": "/used_by"}, "domain": {"name": "", "xml": "/xml", "state": "/state", "profile": "/profile", "stats": "/stats", "node": "/node", "last_node": "/lastnode", "failed_reason": "/failedreason", "storage.volumes": "/rbdlist", "console.log": "/consolelog", "console.vnc": "/vnc", "meta.autostart": "/node_autostart", "meta.migrate_method": "/migration_method", "meta.migrate_max_downtime": "/migration_max_downtime", "meta.node_selector": "/node_selector", "meta.node_limit": "/node_limit", "meta.tags": "/tags", "migrate.sync_lock": "/migrate_sync_lock"}, "tag": {"name": "", "type": "/type", "protected": "/protected"}, "network": {"vni": "", "type": "/nettype", "mtu": "/mtu", "rule": "/firewall_rules", "rule.in": "/firewall_rules/in", "rule.out": "/firewall_rules/out", "nameservers": "/name_servers", "domain": "/domain", "reservation": "/dhcp4_reservations", "lease": "/dhcp4_leases", "ip4.gateway": "/ip4_gateway", "ip4.network": "/ip4_network", "ip4.dhcp": "/dhcp4_flag", "ip4.dhcp_start": "/dhcp4_start", "ip4.dhcp_end": "/dhcp4_end", "ip6.gateway": "/ip6_gateway", "ip6.network": "/ip6_network", "ip6.dhcp": "/dhcp6_flag"}, "reservation": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname"}, "lease": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname", "expiry": "/expiry", "client_id": 
"/clientid"}, "rule": {"description": "", "rule": "/rule", "order": "/order"}, "osd": {"id": "", "node": "/node", "device": "/device", "db_device": "/db_device", "fsid": "/fsid", "ofsid": "/fsid/osd", "cfsid": "/fsid/cluster", "lvm": "/lvm", "vg": "/lvm/vg", "lv": "/lvm/lv", "is_split": "/is_split", "stats": "/stats"}, "pool": {"name": "", "pgs": "/pgs", "tier": "/tier", "stats": "/stats"}, "volume": {"name": "", "stats": "/stats"}, "snapshot": {"name": "", "stats": "/stats"}}
--- a/daemon-common/migrations/versions/14.json
+++ b/daemon-common/migrations/versions/14.json
@ -0,0 +1 @@
+{"version": "14", "root": "", "base": {"root": "", "schema": "/schema", "schema.version": "/schema/version", "config": "/config", "config.maintenance": "/config/maintenance", "config.primary_node": "/config/primary_node", "config.primary_node.sync_lock": "/config/primary_node/sync_lock", "config.upstream_ip": "/config/upstream_ip", "config.migration_target_selector": "/config/migration_target_selector", "logs": "/logs", "faults": "/faults", "node": "/nodes", "domain": "/domains", "network": "/networks", "storage": "/ceph", "storage.health": "/ceph/health", "storage.util": "/ceph/util", "osd": "/ceph/osds", "pool": "/ceph/pools", "volume": "/ceph/volumes", "snapshot": "/ceph/snapshots"}, "logs": {"node": "", "messages": "/messages"}, "faults": {"id": "", "last_time": "/last_time", "first_time": "/first_time", "ack_time": "/ack_time", "status": "/status", "delta": "/delta", "message": "/message"}, "node": {"name": "", "keepalive": "/keepalive", "mode": "/daemonmode", "data.active_schema": "/activeschema", "data.latest_schema": "/latestschema", "data.static": "/staticdata", "data.pvc_version": "/pvcversion", "running_domains": "/runningdomains", "count.provisioned_domains": "/domainscount", "count.networks": "/networkscount", "state.daemon": "/daemonstate", "state.router": "/routerstate", "state.domain": "/domainstate", "cpu.load": "/cpuload", "vcpu.allocated": "/vcpualloc", "memory.total": "/memtotal", "memory.used": "/memused", "memory.free": "/memfree", "memory.allocated": "/memalloc", "memory.provisioned": "/memprov", "ipmi.hostname": "/ipmihostname", "ipmi.username": "/ipmiusername", "ipmi.password": "/ipmipassword", "sriov": "/sriov", "sriov.pf": "/sriov/pf", "sriov.vf": "/sriov/vf", "monitoring.plugins": "/monitoring_plugins", "monitoring.data": "/monitoring_data", "monitoring.health": "/monitoring_health", "network.stats": "/network_stats"}, "monitoring_plugin": {"name": "", "last_run": "/last_run", "health_delta": "/health_delta", "message": "/message", 
"data": "/data", "runtime": "/runtime"}, "sriov_pf": {"phy": "", "mtu": "/mtu", "vfcount": "/vfcount"}, "sriov_vf": {"phy": "", "pf": "/pf", "mtu": "/mtu", "mac": "/mac", "phy_mac": "/phy_mac", "config": "/config", "config.vlan_id": "/config/vlan_id", "config.vlan_qos": "/config/vlan_qos", "config.tx_rate_min": "/config/tx_rate_min", "config.tx_rate_max": "/config/tx_rate_max", "config.spoof_check": "/config/spoof_check", "config.link_state": "/config/link_state", "config.trust": "/config/trust", "config.query_rss": "/config/query_rss", "pci": "/pci", "pci.domain": "/pci/domain", "pci.bus": "/pci/bus", "pci.slot": "/pci/slot", "pci.function": "/pci/function", "used": "/used", "used_by": "/used_by"}, "domain": {"name": "", "xml": "/xml", "state": "/state", "profile": "/profile", "stats": "/stats", "node": "/node", "last_node": "/lastnode", "failed_reason": "/failedreason", "storage.volumes": "/rbdlist", "console.log": "/consolelog", "console.vnc": "/vnc", "meta.autostart": "/node_autostart", "meta.migrate_method": "/migration_method", "meta.migrate_max_downtime": "/migration_max_downtime", "meta.node_selector": "/node_selector", "meta.node_limit": "/node_limit", "meta.tags": "/tags", "migrate.sync_lock": "/migrate_sync_lock", "snapshots": "/snapshots"}, "tag": {"name": "", "type": "/type", "protected": "/protected"}, "domain_snapshot": {"name": "", "timestamp": "/timestamp", "xml": "/xml", "rbd_snapshots": "/rbdsnaplist"}, "network": {"vni": "", "type": "/nettype", "mtu": "/mtu", "rule": "/firewall_rules", "rule.in": "/firewall_rules/in", "rule.out": "/firewall_rules/out", "nameservers": "/name_servers", "domain": "/domain", "reservation": "/dhcp4_reservations", "lease": "/dhcp4_leases", "ip4.gateway": "/ip4_gateway", "ip4.network": "/ip4_network", "ip4.dhcp": "/dhcp4_flag", "ip4.dhcp_start": "/dhcp4_start", "ip4.dhcp_end": "/dhcp4_end", "ip6.gateway": "/ip6_gateway", "ip6.network": "/ip6_network", "ip6.dhcp": "/dhcp6_flag"}, "reservation": {"mac": "", "ip": 
"/ipaddr", "hostname": "/hostname"}, "lease": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname", "expiry": "/expiry", "client_id": "/clientid"}, "rule": {"description": "", "rule": "/rule", "order": "/order"}, "osd": {"id": "", "node": "/node", "device": "/device", "db_device": "/db_device", "fsid": "/fsid", "ofsid": "/fsid/osd", "cfsid": "/fsid/cluster", "lvm": "/lvm", "vg": "/lvm/vg", "lv": "/lvm/lv", "is_split": "/is_split", "stats": "/stats"}, "pool": {"name": "", "pgs": "/pgs", "tier": "/tier", "stats": "/stats"}, "volume": {"name": "", "stats": "/stats"}, "snapshot": {"name": "", "stats": "/stats"}}
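The schema files above map dotted keys to Zookeeper path fragments. As a rough sketch of how a key such as `("faults.last_time", fault_id)` could resolve to a full path, the example below joins the base path, the item name, and the sub-path. `resolve_path` is a hypothetical helper and `SCHEMA` is a trimmed excerpt; the real `zkhandler` resolution logic is not shown in this diff and may differ (for instance in how it applies the `root` prefix).

```python
import json

# Trimmed excerpt of the schema shown above (faults section only)
SCHEMA = json.loads(
    """
{
  "base": {"root": "", "faults": "/faults", "config.maintenance": "/config/maintenance"},
  "faults": {"id": "", "last_time": "/last_time", "status": "/status"}
}
"""
)


def resolve_path(schema, key, item=None):
    """Hypothetical resolver: turn a dotted schema key into a Zookeeper path."""
    group, _, subkey = key.partition(".")
    if group == "base":
        # Base keys are absolute paths and take no item
        return schema["base"][subkey]
    # Other keys hang below the group's base path and a specific item
    base = schema["base"][group]
    return f"{base}/{item}{schema[group][subkey]}"


base_path = resolve_path(SCHEMA, "base.faults")
fault_root = resolve_path(SCHEMA, "faults.id", "F1")
fault_field = resolve_path(SCHEMA, "faults.last_time", "F1")
```

Under these assumptions, the empty string stored for `faults.id` makes that key resolve to the fault's own node, which is why deleting `("faults.id", fault_id)` recursively removes the whole fault.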
--- a/daemon-common/network.py
+++ b/daemon-common/network.py
@ -3,7 +3,7 @@
 # network.py - PVC client function library, Network fuctions
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -142,19 +142,37 @@ def getNetworkACLs(zkhandler, vni, _direction):


 def getNetworkInformation(zkhandler, vni):
-    description = zkhandler.read(("network", vni))
-    nettype = zkhandler.read(("network.type", vni))
-    mtu = zkhandler.read(("network.mtu", vni))
-    domain = zkhandler.read(("network.domain", vni))
-    name_servers = zkhandler.read(("network.nameservers", vni))
-    ip6_network = zkhandler.read(("network.ip6.network", vni))
-    ip6_gateway = zkhandler.read(("network.ip6.gateway", vni))
-    dhcp6_flag = zkhandler.read(("network.ip6.dhcp", vni))
-    ip4_network = zkhandler.read(("network.ip4.network", vni))
-    ip4_gateway = zkhandler.read(("network.ip4.gateway", vni))
-    dhcp4_flag = zkhandler.read(("network.ip4.dhcp", vni))
-    dhcp4_start = zkhandler.read(("network.ip4.dhcp_start", vni))
-    dhcp4_end = zkhandler.read(("network.ip4.dhcp_end", vni))
+    (
+        description,
+        nettype,
+        mtu,
+        domain,
+        name_servers,
+        ip6_network,
+        ip6_gateway,
+        dhcp6_flag,
+        ip4_network,
+        ip4_gateway,
+        dhcp4_flag,
+        dhcp4_start,
+        dhcp4_end,
+    ) = zkhandler.read_many(
+        [
+            ("network", vni),
+            ("network.type", vni),
+            ("network.mtu", vni),
+            ("network.domain", vni),
+            ("network.nameservers", vni),
+            ("network.ip6.network", vni),
+            ("network.ip6.gateway", vni),
+            ("network.ip6.dhcp", vni),
+            ("network.ip4.network", vni),
+            ("network.ip4.gateway", vni),
+            ("network.ip4.dhcp", vni),
+            ("network.ip4.dhcp_start", vni),
+            ("network.ip4.dhcp_end", vni),
+        ]
+    )

    # Construct a data structure to represent the data
    network_information = {
@ -818,31 +836,45 @@ def getSRIOVVFInformation(zkhandler, node, vf):
    if not zkhandler.exists(("node.sriov.vf", node, "sriov_vf", vf)):
        return []

-    pf = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pf", vf))
-    mtu = zkhandler.read(("node.sriov.vf", node, "sriov_vf.mtu", vf))
-    mac = zkhandler.read(("node.sriov.vf", node, "sriov_vf.mac", vf))
-    vlan_id = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.vlan_id", vf))
-    vlan_qos = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.vlan_qos", vf))
-    tx_rate_min = zkhandler.read(
-        ("node.sriov.vf", node, "sriov_vf.config.tx_rate_min", vf)
+    (
+        pf,
+        mtu,
+        mac,
+        vlan_id,
+        vlan_qos,
+        tx_rate_min,
+        tx_rate_max,
+        link_state,
+        spoof_check,
+        trust,
+        query_rss,
+        pci_domain,
+        pci_bus,
+        pci_slot,
+        pci_function,
+        used,
+        used_by_domain,
+    ) = zkhandler.read_many(
+        [
+            ("node.sriov.vf", node, "sriov_vf.pf", vf),
+            ("node.sriov.vf", node, "sriov_vf.mtu", vf),
+            ("node.sriov.vf", node, "sriov_vf.mac", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.vlan_id", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.vlan_qos", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.tx_rate_min", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.tx_rate_max", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.link_state", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.spoof_check", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.trust", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.query_rss", vf),
+            ("node.sriov.vf", node, "sriov_vf.pci.domain", vf),
+            ("node.sriov.vf", node, "sriov_vf.pci.bus", vf),
+            ("node.sriov.vf", node, "sriov_vf.pci.slot", vf),
+            ("node.sriov.vf", node, "sriov_vf.pci.function", vf),
+            ("node.sriov.vf", node, "sriov_vf.used", vf),
+            ("node.sriov.vf", node, "sriov_vf.used_by", vf),
+        ]
    )
-    tx_rate_max = zkhandler.read(
-        ("node.sriov.vf", node, "sriov_vf.config.tx_rate_max", vf)
-    )
-    link_state = zkhandler.read(
-        ("node.sriov.vf", node, "sriov_vf.config.link_state", vf)
-    )
-    spoof_check = zkhandler.read(
-        ("node.sriov.vf", node, "sriov_vf.config.spoof_check", vf)
-    )
-    trust = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.trust", vf))
-    query_rss = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.query_rss", vf))
-    pci_domain = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.domain", vf))
-    pci_bus = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.bus", vf))
-    pci_slot = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.slot", vf))
-    pci_function = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.function", vf))
-    used = zkhandler.read(("node.sriov.vf", node, "sriov_vf.used", vf))
-    used_by_domain = zkhandler.read(("node.sriov.vf", node, "sriov_vf.used_by", vf))

    vf_information = {
        "phy": vf,
--- a/daemon-common/node.py
+++ b/daemon-common/node.py
@ -3,7 +3,7 @@
 # node.py - PVC client function library, node management
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -26,69 +26,143 @@ import json
 import daemon_lib.common as common


-def getNodeInformation(zkhandler, node_name):
-    """
-    Gather information about a node from the Zookeeper database and return a dict() containing it.
-    """
-    node_daemon_state = zkhandler.read(("node.state.daemon", node_name))
-    node_coordinator_state = zkhandler.read(("node.state.router", node_name))
-    node_domain_state = zkhandler.read(("node.state.domain", node_name))
-    node_static_data = zkhandler.read(("node.data.static", node_name)).split()
-    node_pvc_version = zkhandler.read(("node.data.pvc_version", node_name))
-    node_cpu_count = int(node_static_data[0])
-    node_kernel = node_static_data[1]
-    node_os = node_static_data[2]
-    node_arch = node_static_data[3]
-    node_vcpu_allocated = int(zkhandler.read(("node.vcpu.allocated", node_name)))
-    node_mem_total = int(zkhandler.read(("node.memory.total", node_name)))
-    node_mem_allocated = int(zkhandler.read(("node.memory.allocated", node_name)))
-    node_mem_provisioned = int(zkhandler.read(("node.memory.provisioned", node_name)))
-    node_mem_used = int(zkhandler.read(("node.memory.used", node_name)))
-    node_mem_free = int(zkhandler.read(("node.memory.free", node_name)))
-    node_load = float(zkhandler.read(("node.cpu.load", node_name)))
-    node_domains_count = int(
-        zkhandler.read(("node.count.provisioned_domains", node_name))
-    )
-    node_running_domains = zkhandler.read(("node.running_domains", node_name)).split()
-    try:
-        node_health = int(zkhandler.read(("node.monitoring.health", node_name)))
-    except Exception:
-        node_health = "N/A"
-    try:
-        node_health_plugins = zkhandler.read(
-            ("node.monitoring.plugins", node_name)
-        ).split()
-    except Exception:
-        node_health_plugins = list()
-
-    node_health_details = list()
+def getNodeHealthDetails(zkhandler, node_name, node_health_plugins):
+    plugin_reads = list()
    for plugin in node_health_plugins:
-        plugin_last_run = zkhandler.read(
-            ("node.monitoring.data", node_name, "monitoring_plugin.last_run", plugin)
-        )
-        plugin_health_delta = zkhandler.read(
+        plugin_reads += [
+            (
+                "node.monitoring.data",
+                node_name,
+                "monitoring_plugin.last_run",
+                plugin,
+            ),
            (
                "node.monitoring.data",
                node_name,
                "monitoring_plugin.health_delta",
                plugin,
-            )
-        )
-        plugin_message = zkhandler.read(
-            ("node.monitoring.data", node_name, "monitoring_plugin.message", plugin)
-        )
-        plugin_data = zkhandler.read(
-            ("node.monitoring.data", node_name, "monitoring_plugin.data", plugin)
-        )
+            ),
+            (
+                "node.monitoring.data",
+                node_name,
+                "monitoring_plugin.message",
+                plugin,
+            ),
+            (
+                "node.monitoring.data",
+                node_name,
+                "monitoring_plugin.data",
+                plugin,
+            ),
+        ]
+    all_plugin_data = list(zkhandler.read_many(plugin_reads))
+
+    node_health_details = list()
+    for pidx, plugin in enumerate(node_health_plugins):
+        # Split the large list of return values by the IDX of this plugin
+        # Each plugin result is 4 fields long
+        pos_start = pidx * 4
+        pos_end = pidx * 4 + 4
+        (
+            plugin_last_run,
+            plugin_health_delta,
+            plugin_message,
+            plugin_data,
+        ) = tuple(all_plugin_data[pos_start:pos_end])
+        if plugin_data is None:
+            continue
        plugin_output = {
            "name": plugin,
-            "last_run": int(plugin_last_run),
+            "last_run": int(plugin_last_run) if plugin_last_run is not None else None,
            "health_delta": int(plugin_health_delta),
            "message": plugin_message,
            "data": json.loads(plugin_data),
        }
        node_health_details.append(plugin_output)

+    return node_health_details
+
+
+def getNodeInformation(zkhandler, node_name):
+    """
+    Gather information about a node from the Zookeeper database and return a dict() containing it.
+    """
+
+    (
+        node_daemon_state,
+        node_coordinator_state,
+        node_domain_state,
+        node_pvc_version,
+        _node_static_data,
+        _node_vcpu_allocated,
+        _node_mem_total,
+        _node_mem_allocated,
+        _node_mem_provisioned,
+        _node_mem_used,
+        _node_mem_free,
+        _node_load,
+        _node_domains_count,
+        _node_running_domains,
+        _node_health,
+        _node_health_plugins,
+        _node_network_stats,
+    ) = zkhandler.read_many(
+        [
+            ("node.state.daemon", node_name),
+            ("node.state.router", node_name),
+            ("node.state.domain", node_name),
+            ("node.data.pvc_version", node_name),
+            ("node.data.static", node_name),
+            ("node.vcpu.allocated", node_name),
+            ("node.memory.total", node_name),
+            ("node.memory.allocated", node_name),
+            ("node.memory.provisioned", node_name),
+            ("node.memory.used", node_name),
+            ("node.memory.free", node_name),
+            ("node.cpu.load", node_name),
+            ("node.count.provisioned_domains", node_name),
+            ("node.running_domains", node_name),
+            ("node.monitoring.health", node_name),
+            ("node.monitoring.plugins", node_name),
+            ("node.network.stats", node_name),
+        ]
+    )
+
+    node_static_data = _node_static_data.split()
+    node_cpu_count = int(node_static_data[0])
+    node_kernel = node_static_data[1]
+    node_os = node_static_data[2]
+    node_arch = node_static_data[3]
+
+    node_vcpu_allocated = int(_node_vcpu_allocated)
+    node_mem_total = int(_node_mem_total)
+    node_mem_allocated = int(_node_mem_allocated)
+    node_mem_provisioned = int(_node_mem_provisioned)
+    node_mem_used = int(_node_mem_used)
+    node_mem_free = int(_node_mem_free)
+    node_load = float(_node_load)
+    node_domains_count = int(_node_domains_count)
+    node_running_domains = _node_running_domains.split()
+
+    try:
+        node_health = int(_node_health)
+    except Exception:
+        node_health = "N/A"
+
+    try:
+        node_health_plugins = _node_health_plugins.split()
+    except Exception:
+        node_health_plugins = list()
+
+    node_health_details = getNodeHealthDetails(
+        zkhandler, node_name, node_health_plugins
+    )
+
+    try:
+        node_network_stats = json.loads(_node_network_stats)
+    except Exception:
+        node_network_stats = dict()
+
    # Construct a data structure to represent the data
    node_information = {
        "name": node_name,
@ -117,6 +191,7 @@ def getNodeInformation(zkhandler, node_name):
            "used": node_mem_used,
            "free": node_mem_free,
        },
+        "interfaces": node_network_stats,
    }
    return node_information

@ -261,7 +336,7 @@ def get_info(zkhandler, node):

 def get_list(
    zkhandler,
-    limit,
+    limit=None,
    daemon_state=None,
    coordinator_state=None,
    domain_state=None,
@ -269,6 +344,8 @@ def get_list(
 ):
    node_list = []
    full_node_list = zkhandler.children("base.node")
+    if full_node_list is None:
+        full_node_list = list()
    full_node_list.sort()

    if is_fuzzy and limit:
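
A minimal, standalone sketch of the slicing step used in getNodeHealthDetails above: a flat, ordered list of batched read results is cut into fixed-width groups of four, one group per plugin. The helper name split_plugin_results and the sample values are hypothetical and not part of PVC; only the four-fields-per-plugin layout and the skip on missing data are taken from the diff.

```python
import json

FIELDS_PER_PLUGIN = 4  # last_run, health_delta, message, data


def split_plugin_results(plugins, flat_results):
    """Hypothetical helper: regroup a flat result list into per-plugin dicts."""
    details = []
    for idx, plugin in enumerate(plugins):
        # Each plugin owns one contiguous, fixed-width slice of the results
        start = idx * FIELDS_PER_PLUGIN
        last_run, health_delta, message, data = flat_results[
            start : start + FIELDS_PER_PLUGIN
        ]
        if data is None:
            # No stored data for this plugin yet; skip it, as the diff does
            continue
        details.append(
            {
                "name": plugin,
                "last_run": int(last_run) if last_run is not None else None,
                "health_delta": int(health_delta),
                "message": message,
                "data": json.loads(data),
            }
        )
    return details


# Sample input (made up): "disk" has data, "load" has none
plugins = ["disk", "load"]
flat = ("1700000000", "0", "OK", '{"free": 10}', None, None, None, None)
details = split_plugin_results(plugins, flat)
```

The grouping relies on the batched read returning values in the same order as the requested keys; if that order were not preserved, the slices would pair values with the wrong plugin.
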
--- a/daemon-common/vm.py
+++ b/daemon-common/vm.py
--- a/daemon-common/vmbuilder.py
+++ b/daemon-common/vmbuilder.py
@ -167,7 +167,7 @@ def open_db(config):
        conn = psycopg2.connect(
            host=config["api_postgresql_host"],
            port=config["api_postgresql_port"],
-            dbname=config["api_postgresql_name"],
+            dbname=config["api_postgresql_dbname"],
            user=config["api_postgresql_user"],
            password=config["api_postgresql_password"],
        )
@ -258,6 +258,13 @@ def worker_create_vm(
        args = (vm_profile,)
        db_cur.execute(query, args)
        profile_data = db_cur.fetchone()
+        if profile_data is None:
+            fail(
+                celery,
+                f'Provisioner profile "{vm_profile}" is not present on the cluster',
+                exception=ClusterError,
+            )
+
        if profile_data.get("arguments"):
            vm_data["script_arguments"] = profile_data.get("arguments").split("|")
        else:
@ -335,7 +342,7 @@ def worker_create_vm(
        monitor_list.append("{}.{}".format(monitor, config["storage_domain"]))
    vm_data["ceph_monitor_list"] = monitor_list
    vm_data["ceph_monitor_port"] = config["ceph_monitor_port"]
-    vm_data["ceph_monitor_secret"] = config["ceph_storage_secret_uuid"]
+    vm_data["ceph_monitor_secret"] = config["ceph_secret_uuid"]

    # Parse the script arguments
    script_arguments = dict()
@ -744,6 +751,7 @@ def worker_create_vm(
        node_selector = vm_data["system_details"]["node_selector"]
        node_autostart = vm_data["system_details"]["node_autostart"]
        migration_method = vm_data["system_details"]["migration_method"]
+        migration_max_downtime = vm_data["system_details"]["migration_max_downtime"]
        with open_zk(config) as zkhandler:
            retcode, retmsg = pvc_vm.define_vm(
                zkhandler,
@ -753,6 +761,7 @@ def worker_create_vm(
                node_selector,
                node_autostart,
                migration_method,
+                migration_max_downtime,
                vm_profile,
                initial_state="provision",
            )
--- a/daemon-common/zkhandler.py
+++ b/daemon-common/zkhandler.py
@ -3,7 +3,7 @@
 # zkhandler.py - Secure versioned ZooKeeper updates
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -19,6 +19,7 @@
 #
 ###############################################################################

+import asyncio
 import os
 import time
 import uuid
@ -29,6 +30,9 @@ from kazoo.client import KazooClient, KazooState
 from kazoo.exceptions import NoNodeError


+SCHEMA_ROOT_PATH = "/usr/share/pvc/daemon_lib/migrations/versions"
+
+
 #
 # Function decorators
 #
@ -56,10 +60,11 @@ class ZKConnection(object):
                schema_version = 0
            zkhandler.schema.load(schema_version, quiet=True)

-            ret = function(zkhandler, *args, **kwargs)
-
-            zkhandler.disconnect()
-            del zkhandler
+            try:
+                ret = function(zkhandler, *args, **kwargs)
+            finally:
+                zkhandler.disconnect()
+                del zkhandler

            return ret

@ -239,10 +244,41 @@ class ZKHandler(object):
                # This path is invalid; this is likely due to missing schema entries, so return None
                return None

-            return self.zk_conn.get(path)[0].decode(self.encoding)
+            res = self.zk_conn.get(path)
+            return res[0].decode(self.encoding)
        except NoNodeError:
            return None

+    async def read_async(self, key):
+        """
+        Read data from a key asynchronously
+        """
+        try:
+            path = self.get_schema_path(key)
+            if path is None:
+                # This path is invalid; this is likely due to missing schema entries, so return None
+                return None
+
+            val = self.zk_conn.get_async(path)
+            data = val.get()
+            return data[0].decode(self.encoding)
+        except NoNodeError:
+            return None
+
+    async def _read_many(self, keys):
+        """
+        Async runner for read_many
+        """
+        res = await asyncio.gather(*(self.read_async(key) for key in keys))
+        return tuple(res)
+
+    def read_many(self, keys):
+        """
+        Read data from several keys, asynchronously. Returns a tuple of all key values once all
+        reads are complete.
+        """
+        return asyncio.run(self._read_many(keys))
+
    def write(self, kvpairs):
        """
        Create or update one or more keys' data
@ -335,11 +371,11 @@ class ZKHandler(object):
        try:
            path = self.get_schema_path(key)
            if path is None:
-                # This path is invalid; this is likely due to missing schema entries, so return None
-                return None
+                raise NoNodeError

            return self.zk_conn.get_children(path)
        except NoNodeError:
+            # This path is invalid; this is likely due to missing schema entries, so return None
            return None

    def rename(self, kkpairs):
@ -540,7 +576,7 @@ class ZKHandler(object):
 #
 class ZKSchema(object):
    # Current version
-    _version = 10
+    _version = 14

    # Root for doing nested keys
    _schema_root = ""
@ -560,7 +596,8 @@ class ZKSchema(object):
            "config.primary_node.sync_lock": f"{_schema_root}/config/primary_node/sync_lock",
            "config.upstream_ip": f"{_schema_root}/config/upstream_ip",
            "config.migration_target_selector": f"{_schema_root}/config/migration_target_selector",
-            "logs": "/logs",
+            "logs": f"{_schema_root}/logs",
+            "faults": f"{_schema_root}/faults",
            "node": f"{_schema_root}/nodes",
            "domain": f"{_schema_root}/domains",
            "network": f"{_schema_root}/networks",
@ -577,6 +614,16 @@ class ZKSchema(object):
            "node": "",  # The root key
            "messages": "/messages",
        },
+        # The schema of an individual faults entry (/faults/{id})
+        "faults": {
+            "id": "",  # The root key
+            "last_time": "/last_time",
+            "first_time": "/first_time",
+            "ack_time": "/ack_time",
+            "status": "/status",
+            "delta": "/delta",
+            "message": "/message",
+        },
        # The schema of an individual node entry (/nodes/{node_name})
        "node": {
            "name": "",  # The root key
@ -608,6 +655,7 @@ class ZKSchema(object):
            "monitoring.plugins": "/monitoring_plugins",
            "monitoring.data": "/monitoring_data",
            "monitoring.health": "/monitoring_health",
+            "network.stats": "/network_stats",
        },
        # The schema of an individual monitoring plugin data entry (/nodes/{node_name}/monitoring_data/{plugin})
        "monitoring_plugin": {
@ -619,7 +667,11 @@ class ZKSchema(object):
            "runtime": "/runtime",
        },
        # The schema of an individual SR-IOV PF entry (/nodes/{node_name}/sriov/pf/{pf})
-        "sriov_pf": {"phy": "", "mtu": "/mtu", "vfcount": "/vfcount"},  # The root key
+        "sriov_pf": {
+            "phy": "",  # The root key
+            "mtu": "/mtu",
+            "vfcount": "/vfcount",
+        },
        # The schema of an individual SR-IOV VF entry (/nodes/{node_name}/sriov/vf/{vf})
        "sriov_vf": {
            "phy": "",  # The root key
@ -659,13 +711,26 @@ class ZKSchema(object):
            "console.vnc": "/vnc",
            "meta.autostart": "/node_autostart",
            "meta.migrate_method": "/migration_method",
+            "meta.migrate_max_downtime": "/migration_max_downtime",
            "meta.node_selector": "/node_selector",
            "meta.node_limit": "/node_limit",
            "meta.tags": "/tags",
            "migrate.sync_lock": "/migrate_sync_lock",
+            "snapshots": "/snapshots",
        },
        # The schema of an individual domain tag entry (/domains/{domain}/tags/{tag})
-        "tag": {"name": "", "type": "/type", "protected": "/protected"},  # The root key
+        "tag": {
+            "name": "",  # The root key
+            "type": "/type",
+            "protected": "/protected",
+        },
+        # The schema of an individual domain snapshot entry (/domains/{domain}/snapshots/{snapshot})
+        "domain_snapshot": {
+            "name": "",  # The root key
+            "timestamp": "/timestamp",
+            "xml": "/xml",
+            "rbd_snapshots": "/rbdsnaplist",
+        },
        # The schema of an individual network entry (/networks/{vni})
        "network": {
            "vni": "",  # The root key
@ -702,7 +767,11 @@ class ZKSchema(object):
            "client_id": "/clientid",
        },
        # The schema for an individual network ACL entry (/networks/{vni}/firewall_rules/(in|out)/{acl}
-        "rule": {"description": "", "rule": "/rule", "order": "/order"},  # The root key
+        "rule": {
+            "description": "",  # The root key
+            "rule": "/rule",
+            "order": "/order",
+        },
        # The schema of an individual OSD entry (/ceph/osds/{osd_id})
        "osd": {
            "id": "",  # The root key
@ -726,9 +795,15 @@ class ZKSchema(object):
            "stats": "/stats",
        },  # The root key
        # The schema of an individual volume entry (/ceph/volumes/{pool_name}/{volume_name})
-        "volume": {"name": "", "stats": "/stats"},  # The root key
+        "volume": {
+            "name": "",  # The root key
+            "stats": "/stats",
+        },
        # The schema of an individual snapshot entry (/ceph/volumes/{pool_name}/{volume_name}/{snapshot_name})
-        "snapshot": {"name": "", "stats": "/stats"},  # The root key
+        "snapshot": {
+            "name": "",  # The root key
+            "stats": "/stats",
+        },
    }

    # Properties
@ -797,7 +872,7 @@ class ZKSchema(object):
        if not quiet:
            print(f"Loading schema version {version}")

-        with open(f"daemon_lib/migrations/versions/{version}.json", "r") as sfh:
+        with open(f"{SCHEMA_ROOT_PATH}/{version}.json", "r") as sfh:
            self.schema = json.load(sfh)
            self.version = self.schema.get("version")

@ -964,6 +1039,8 @@ class ZKSchema(object):
                            default_data = "False"
                        elif elem == "pool" and ikey == "tier":
                            default_data = "default"
+                        elif elem == "domain" and ikey == "meta.migrate_max_downtime":
+                            default_data = "300"
                        else:
                            default_data = ""
                        zkhandler.zk_conn.create(
@ -1144,7 +1221,7 @@ class ZKSchema(object):
    # Write the latest schema to a file
    @classmethod
    def write(cls):
-        schema_file = "daemon_lib/migrations/versions/{}.json".format(cls._version)
+        schema_file = f"{SCHEMA_ROOT_PATH}/{cls._version}.json"
        with open(schema_file, "w") as sfh:
            json.dump(cls._schema, sfh)

@ -1152,7 +1229,7 @@ class ZKSchema(object):
    @staticmethod
    def find_all(start=0, end=None):
        versions = list()
-        for version in os.listdir("daemon_lib/migrations/versions"):
+        for version in os.listdir(SCHEMA_ROOT_PATH):
            sequence_id = int(version.split(".")[0])
            if end is None:
                if sequence_id > start:
@ -1168,7 +1245,7 @@ class ZKSchema(object):
    @staticmethod
    def find_latest():
        latest_version = 0
-        for version in os.listdir("daemon_lib/migrations/versions"):
+        for version in os.listdir(SCHEMA_ROOT_PATH):
            sequence_id = int(version.split(".")[0])
            if sequence_id > latest_version:
                latest_version = sequence_id
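
A rough sketch of the read_many pattern added to ZKHandler in this file: a synchronous wrapper that drives asyncio.gather over one coroutine per key and returns the values as a tuple in request order. FakeStore and its in-memory dict are hypothetical stand-ins for a ZooKeeper connection; none of this is PVC or kazoo code, and the sample keys are made up.

```python
import asyncio


class FakeStore:
    """Hypothetical in-memory stand-in for a key/value backend."""

    def __init__(self, data):
        self._data = data

    async def read_async(self, key):
        # Yield to the event loop once, standing in for real I/O
        await asyncio.sleep(0)
        # Missing keys return None, mirroring the None-on-missing behaviour
        return self._data.get(key)

    async def _read_many(self, keys):
        # gather() returns results in the order the awaitables were passed
        results = await asyncio.gather(*(self.read_async(key) for key in keys))
        return tuple(results)

    def read_many(self, keys):
        # Synchronous entry point: runs a fresh event loop to completion
        return asyncio.run(self._read_many(keys))


store = FakeStore(
    {
        ("node.state.daemon", "hv1"): "run",
        ("node.cpu.load", "hv1"): "0.42",
    }
)
daemon_state, load, missing = store.read_many(
    [
        ("node.state.daemon", "hv1"),
        ("node.cpu.load", "hv1"),
        ("node.network.stats", "hv1"),
    ]
)
```

Because the wrapper calls asyncio.run, it has to be called from synchronous code; calling it from inside an already running event loop raises RuntimeError.
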
--- a/debian/changelog
+++ b/debian/changelog
@ -1,3 +1,172 @@
+pvc (0.9.100-0) unstable; urgency=high
+
+  * [API Daemon] Improves the handling of "detect:" disk strings on newer systems by leveraging the "nvme" command
+  * [Client CLI] Update help text about "detect:" disk strings
+  * [Meta] Updates deprecation warnings and updates builder to only add this version for Debian 12 (Bookworm)
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Fri, 30 Aug 2024 11:03:33 -0400
+
+pvc (0.9.99-0) unstable; urgency=high
+
+  **Deprecation Warning**: `pvc vm backup` commands are now deprecated and will be removed in **0.9.100**. Use `pvc vm snapshot` commands instead.
+  **Breaking Change**: The on-disk format of VM snapshot exports differs from backup exports, and the PVC autobackup system now leverages these. It is recommended to start fresh with a new tree of backups for `pvc autobackup` for maximum compatibility.
+  **Breaking Change**: VM autobackups now run in `pvcworkerd` instead of the CLI client directly, allowing them to be triggered from any node (or externally). It is important to apply the timer unit changes from the `pvc-ansible` role after upgrading to 0.9.99 to avoid duplicate runs.
+  **Usage Note**: VM snapshots are displayed in the `pvc vm list` and `pvc vm info` outputs, not in a unique "list" endpoint.
+
+  * [API Daemon] Adds a proper error when an invalid provisioner profile is specified
+  * [Node Daemon] Sorts Ceph pools properly in node keepalive to avoid incorrect ordering
+  * [Health Daemon] Improves handling of IPMI checks by adding multiple tries but a shorter timeout
+  * [API Daemon] Improves handling of XML parsing errors in VM configurations
+  * [ALL] Adds support for whole VM snapshots, including configuration XML details, and direct rollback to snapshots
+  * [ALL] Adds support for exporting and importing whole VM snapshots
+  * [Client CLI] Removes vCPU topology from short VM info output
+  * [Client CLI] Improves output format of VM info output
+  * [API Daemon] Adds an endpoint to get the current primary node
+  * [Client CLI] Fixes a bug where API requests were made 3 times
+  * [Other] Improves the build-and-deploy.sh script
+  * [API Daemon] Improves the "vm rename" command to avoid redefining VM, preserving history etc.
+  * [API Daemon] Adds an indication when a task is run on the primary node
+  * [API Daemon] Fixes a bug where the ZK schema relative path didn't work sometimes
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Wed, 28 Aug 2024 11:15:55 -0400
+
+pvc (0.9.98-0) unstable; urgency=high
+
+  * [CLI Client] Fixed output when API call times out
+  * [Node Daemon] Improves the handling of fence states
+  * [API Daemon/CLI Client] Adds support for storage snapshot rollback
+  * [CLI Client] Adds additional warning messages about snapshot consistency to help output
+  * [API Daemon] Fixes a bug listing snapshots by pool/volume
+  * [Node Daemon] Adds a --version flag for information gathering by update-motd.sh
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Wed, 05 Jun 2024 12:01:31 -0400
+
+pvc (0.9.97-0) unstable; urgency=high
+
+  * [Client CLI] Ensures --lines is always an integer value
+  * [Node Daemon] Fixes a bug if d_network changes during iteration
+  * [Node Daemon] Moves to using allocated instead of free memory for node reporting
+  * [API Daemon] Fixes a bug if lingering RBD snapshots exist when removing a volume (#180)
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Fri, 19 Apr 2024 10:32:16 -0400
+
+pvc (0.9.96-0) unstable; urgency=high
+
+  * [API Daemon] Fixes a bug when reporting node stats
+  * [API Daemon] Fixes a bug deleting successful benchmark results
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Fri, 08 Mar 2024 14:23:06 -0500
+
+pvc (0.9.95-0) unstable; urgency=high
+
+  * [API Daemon/CLI Client] Adds a flag to allow duplicate VNIs in network templates
+  * [API Daemon] Ensures that storage template disks are returned in disk ID order
+  * [Client CLI] Fixes a display bug showing all OSDs as split
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Fri, 09 Feb 2024 12:42:00 -0500
+
+pvc (0.9.94-0) unstable; urgency=high
+
+  * [CLI Client] Fixes an incorrect ordering issue with autobackup summary emails
+  * [API Daemon/CLI Client] Adds an additional safety check for 80% cluster fullness when doing volume adds or resizes
+  * [API Daemon/CLI Client] Adds safety checks to volume clones as well
+  * [API Daemon] Fixes a few remaining memory bugs for stopped/disabled VMs
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Mon, 05 Feb 2024 09:58:07 -0500
+
+pvc (0.9.93-0) unstable; urgency=high
+
+  * [API Daemon] Fixes a bug where stuck zkhandler threads were not cleaned up on error
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Tue, 30 Jan 2024 09:51:21 -0500
+
+pvc (0.9.92-0) unstable; urgency=high
+
+  * [CLI Client] Adds the new restore state to the colours list for VM status
+  * [API Daemon] Fixes an incorrect variable assignment
+  * [Provisioner] Improves the error handling of various steps in the debootstrap and rinse example scripts
+  * [CLI Client] Fixes two bugs around missing keys that were added recently (uses get() instead of direct dictionary refs)
+  * [CLI Client] Improves API error handling via GET retries (x3) and better server status code handling
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Mon, 29 Jan 2024 09:39:10 -0500
+
+pvc (0.9.91-0) unstable; urgency=high
+
+  * [Client CLI] Fixes a bug and improves output during cluster task events.
+  * [Client CLI] Improves the output of the task list display.
+  * [Provisioner] Fixes some missing cloud-init modules in the default debootstrap script.
+  * [Client CLI] Fixes a bug with a missing argument to the vm_define helper function.
+  * [All] Fixes inconsistent package find + rm commands to avoid errors in dpkg.
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Tue, 23 Jan 2024 10:02:19 -0500
+
+pvc (0.9.90-0) unstable; urgency=high
+
+  * [Client CLI/API Daemon] Adds additional backup metainfo and an emailed report option to autobackups.
+  * [All] Adds a live migration maximum downtime selector to help with busy VM migrations.
+  * [API Daemon] Fixes a database migration bug on Debian 10/11.
+  * [Node Daemon] Fixes a race condition when applying Zookeeper schema changes.
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Thu, 11 Jan 2024 00:14:49 -0500
+
+pvc (0.9.89-0) unstable; urgency=high
+
+  * [API/Worker Daemons] Fixes a bug with the Celery result backends not being properly initialized on Debian 10/11.
+  * [API Daemon] Fixes a bug if VM CPU stats are missing on Debian 10.
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Tue, 09 Jan 2024 12:15:53 -0500
+
+pvc (0.9.88-0) unstable; urgency=high
+
+  * [API Daemon] Adds an additional Prometheus metrics proxy for Zookeeper stats.
+  * [API Daemon] Adds a new configuration to enable or disable metric endpoints if desired, defaulting to enabled.
+  * [API Daemon] Alters and adjusts the metrics output for VMs to complement new dashboard.
+  * [CLI Client] Adds a "json-prometheus" output format to "pvc connection list" to auto-generate file SD configs.
+  * [Monitoring] Adds a new VM dashboard, updates the Cluster dashboard, and adds a README.
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Fri, 29 Dec 2023 14:50:40 -0500
+
+pvc (0.9.87-0) unstable; urgency=high
+
+  * [API Daemon] Adds cluster Prometheus resource utilization metrics and an updated Grafana dashboard.
+  * [Node Daemon] Adds network traffic rate calculation subsystem.
+  * [All Daemons] Fixes a printing bug where newlines were not added atomically.
+  * [CLI Client] Fixes a bug listing connections if no default is specified.
+  * [All Daemons] Simplifies debug logging conditionals by moving into the Logger instance itself.
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Wed, 27 Dec 2023 13:40:51 -0500
+
+pvc (0.9.86-0) unstable; urgency=high
+
+  * [API Daemon] Significantly improves the performance of several commands via async Zookeeper calls and removal of superfluous backend calls.
+  * [Docs] Improves the project README and updates screenshot images to show the current output and more functionality.
+  * [API Daemon/CLI] Corrects some bugs in VM metainformation output.
+  * [Node Daemon] Fixes resource reporting bugs from 0.9.81 and properly clears node resource numbers on a fence.
+  * [Health Daemon] Adds a wait during pvchealthd startup until the node is in run state, to avoid erroneous faults during node bootup.
+  * [API Daemon] Fixes an incorrect reference to legacy pvcapid.yaml file in migration script.
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Thu, 14 Dec 2023 14:46:29 -0500
+
+pvc (0.9.85-0) unstable; urgency=high
+
+  * [Packaging] Fixes a dependency bug introduced in 0.9.84
+  * [Node Daemon] Fixes an output bug during keepalives
+  * [Node Daemon] Fixes a bug in the example Prometheus Grafana dashboard
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Sun, 10 Dec 2023 01:00:33 -0500
+
+pvc (0.9.84-0) unstable; urgency=high
+
+  **Breaking Changes:** This release features a major reconfiguration to how monitoring and reporting of the cluster health works. Node health plugins now report "faults", as do several other issues which were previously manually checked for in "cluster" daemon library for the "/status" endpoint, from within the Health daemon. These faults are persistent, and under each given identifier can be triggered once and subsequent triggers simply update the "last reported" time. An additional set of API endpoints and commands are added to manage these faults, either by "ack"(nowledging) them (keeping the alert around to be further updated but setting its health delta to 0%), or "delete"ing them (completely removing the fault unless it retriggers), both individually, to (from the CLI) multiple, or all. Cluster health reporting is now done based on these faults instead of anything else, and the default interval for health checks is reduced to 15 seconds to accommodate this. In addition to this, Prometheus metrics have been added, along with an example Grafana dashboard, for the PVC cluster itself, as well as a proxy to the Ceph cluster metrics. This release also fixes some bugs in the VM provisioner that were introduced in 0.9.83; these fixes require a **reimport or reconfiguration of any provisioner scripts**; reference the updated examples for details.
+
+  * [All] Adds persistent fault reporting to clusters, replacing the old cluster health calculations.
+  * [API Daemon] Adds cluster-level Prometheus metric exporting as well as a Ceph Prometheus proxy to the API.
+  * [CLI Client] Improves formatting output of "pvc cluster status".
+  * [Node Daemon] Fixes several bugs and enhances the working of the psql health check plugin.
+  * [Worker Daemon] Fixes several bugs in the example provisioner scripts, and moves the libvirt_schema library into the daemon common libraries.
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Sat, 09 Dec 2023 23:05:40 -0500
+
 pvc (0.9.83-0) unstable; urgency=high

  **Breaking Changes:** This release features a breaking change for the daemon config. A new unified "pvc.conf" file is required for all daemons (and the CLI client for Autobackup and API-on-this-host functionality), which will be written by the "pvc" role in the PVC Ansible framework. Using the "update-pvc-daemons" oneshot playbook from PVC Ansible is **required** to update to this release, as it will ensure this file is written to the proper place before deploying the new package versions, and also ensures that the old entires are cleaned up afterwards. In addition, this release fully splits the node worker and health subsystems into discrete daemons ("pvcworkerd" and "pvchealthd") and packages ("pvc-daemon-worker" and "pvc-daemon-health") respectively. The "pvc-daemon-node" package also now depends on both packages, and the "pvc-daemon-api" package can now be reliably used outside of the PVC nodes themselves (for instance, in a VM) without any strange cross-dependency issues.
--- a/debian/control
+++ b/debian/control
@ -32,7 +32,7 @@ Description: Parallel Virtual Cluster worker daemon

 Package: pvc-daemon-api
 Architecture: all
-Depends: systemd, pvc-daemon-common, python3-yaml, python3-flask, python3-flask-restful, python3-celery, python3-distutils, python3-redis, python3-lxml, python3-flask-migrate
+Depends: systemd, pvc-daemon-common, gunicorn, python3-gunicorn, python3-yaml, python3-flask, python3-flask-restful, python3-celery, python3-distutils, python3-redis, python3-lxml, python3-flask-migrate
 Description: Parallel Virtual Cluster API daemon
 A KVM/Zookeeper/Ceph-based VM and private cloud manager
 .
--- a/debian/pvc-client-cli.postinst
+++ b/debian/pvc-client-cli.postinst
@ -2,7 +2,12 @@

 # Generate the bash completion configuration
 if [ -d /etc/bash_completion.d ]; then
+    echo "Installing BASH completion configuration"
    _PVC_COMPLETE=source_bash pvc > /etc/bash_completion.d/pvc
 fi

+# Remove any cached CPython directories or files
+echo "Cleaning up CPython caches"
+find /usr/lib/python3/dist-packages/pvc -type d -name "__pycache__" -exec rm -fr {} + &>/dev/null || true
+
 exit 0
--- a/debian/pvc-daemon-api.postinst
+++ b/debian/pvc-daemon-api.postinst
@ -9,11 +9,6 @@ if systemctl is-active --quiet pvcapid.service; then
    /usr/share/pvc/pvc-api-db-upgrade
    systemctl start pvcapid.service
 fi
-# Restart the worker daemon
-if systemctl is-active --quiet pvcworkerd.service; then
-    systemctl stop pvcworkerd.service
-    systemctl start pvcworkerd.service
-fi

 if [ ! -f /etc/pvc/pvc.conf ]; then
    echo "NOTE: The PVC client API daemon (pvcapid.service) and the PVC Worker daemon (pvcworkerd.service) have not been started; create a config file at /etc/pvc/pvc.conf, then run the database configuration (/usr/share/pvc/pvc-api-db-upgrade) and start them manually."
--- a/debian/pvc-daemon-api.preinst
+++ b/debian/pvc-daemon-api.preinst
@ -1,5 +1,5 @@
 #!/bin/sh

 # Remove any cached CPython directories or files
-echo "Cleaning up existing CPython files"
-find /usr/share/pvc/pvcapid -type d -name "__pycache__" -exec rm -rf {} \; &>/dev/null || true
+echo "Cleaning up CPython caches"
+find /usr/share/pvc/pvcapid -type d -name "__pycache__" -exec rm -fr {} + &>/dev/null || true
--- a/debian/pvc-daemon-common.preinst
+++ b/debian/pvc-daemon-common.preinst
@ -0,0 +1,5 @@
+#!/bin/sh
+
+# Remove any cached CPython directories or files
+echo "Cleaning up CPython caches"
+find /usr/share/pvc/daemon_lib -type d -name "__pycache__" -exec rm -fr {} + &>/dev/null || true
--- a/debian/pvc-daemon-health.preinst
+++ b/debian/pvc-daemon-health.preinst
@ -1,6 +1,6 @@
 #!/bin/sh

 # Remove any cached CPython directories or files
-echo "Cleaning up existing CPython files"
-find /usr/share/pvc/pvchealthd -type d -name "__pycache__" -exec rm -rf {} \; &>/dev/null || true
-find /usr/share/pvc/plugins -type d -name "__pycache__" -exec rm -rf {} \; &>/dev/null || true
+echo "Cleaning up CPython caches"
+find /usr/share/pvc/pvchealthd -type d -name "__pycache__" -exec rm -fr {} + &>/dev/null || true
+find /usr/share/pvc/plugins -type d -name "__pycache__" -exec rm -fr {} + &>/dev/null || true
--- a/debian/pvc-daemon-node.install
+++ b/debian/pvc-daemon-node.install
@ -3,4 +3,4 @@ node-daemon/pvcnoded usr/share/pvc
 node-daemon/pvcnoded.service lib/systemd/system
 node-daemon/pvc.target lib/systemd/system
 node-daemon/pvcautoready.service lib/systemd/system
-node-daemon/monitoring usr/share/pvc
+monitoring usr/share/pvc
--- a/debian/pvc-daemon-node.preinst
+++ b/debian/pvc-daemon-node.preinst
@ -1,5 +1,5 @@
 #!/bin/sh

 # Remove any cached CPython directories or files
-echo "Cleaning up existing CPython files"
-find /usr/share/pvc/pvcnoded -type d -name "__pycache__" -exec rm -rf {} \; &>/dev/null || true
+echo "Cleaning up CPython caches"
+find /usr/share/pvc/pvcnoded -type d -name "__pycache__" -exec rm -fr {} + &>/dev/null || true
--- a/debian/pvc-daemon-worker.preinst
+++ b/debian/pvc-daemon-worker.preinst
@ -1,5 +1,5 @@
 #!/bin/sh

 # Remove any cached CPython directories or files
-echo "Cleaning up existing CPython files"
-find /usr/share/pvc/pvcworkerd -type d -name "__pycache__" -exec rm -rf {} \; &>/dev/null || true
+echo "Cleaning up CPython caches"
+find /usr/share/pvc/pvcworkerd -type d -name "__pycache__" -exec rm -fr {} + &>/dev/null || true
--- a/debian/rules
+++ b/debian/rules
@ -13,7 +13,7 @@ override_dh_python3:
 	rm -r $(CURDIR)/client-cli/.pybuild $(CURDIR)/client-cli/pvc.egg-info

 override_dh_auto_clean:
-	find . -name "__pycache__" -o -name ".pybuild" -exec rm -r {} \; || true
+	find . \( -name "__pycache__" -o -name ".pybuild" \) -exec rm -fr {} + || true

 # If you need to rebuild the Sphinx documentation
 # Add sphinxdoc to the dh --with line
--- a/docs/images/pvc-migration.png
+++ b/docs/images/pvc-migration.png
--- a/docs/images/pvc-networks.png
+++ b/docs/images/pvc-networks.png
--- a/docs/images/pvc-nodelog.png
+++ b/docs/images/pvc-nodelog.png
--- a/docs/images/pvc-nodes.png
+++ b/docs/images/pvc-nodes.png
--- a/13
+++ b/13
@ -2,12 +2,19 @@

 # Generate the database migration files

+set -o xtrace
+
 VERSION="$( head -1 debian/changelog | awk -F'[()-]' '{ print $2 }' )"

+sudo ip addr add 10.0.1.250/32 dev lo
+
 pushd $( git rev-parse --show-toplevel ) &>/dev/null
 pushd api-daemon &>/dev/null
-export PVC_CONFIG_FILE="./pvcapid.sample.yaml"
-./pvcapid-manage_flask.py db migrate -m "PVC version ${VERSION}"
-./pvcapid-manage_flask.py db upgrade
+export PVC_CONFIG_FILE="../pvc.sample.conf"
+export FLASK_APP=./pvcapid-manage_flask.py
+flask db migrate -m "PVC version ${VERSION}"
+flask db upgrade
 popd &>/dev/null
 popd &>/dev/null
+
+sudo ip addr del 10.0.1.250/32 dev lo
--- a/health-daemon/plugins/disk
+++ b/health-daemon/plugins/disk
@ -3,7 +3,7 @@
 # disk.py - PVC Monitoring example plugin for disk (system + OSD)
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/health-daemon/plugins/dpkg
+++ b/health-daemon/plugins/dpkg
@ -3,7 +3,7 @@
 # dpkg.py - PVC Monitoring example plugin for dpkg status
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/health-daemon/plugins/edac
+++ b/health-daemon/plugins/edac
@ -3,7 +3,7 @@
 # edac.py - PVC Monitoring example plugin for EDAC
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/health-daemon/plugins/hwrd
+++ b/health-daemon/plugins/hwrd
@ -3,7 +3,7 @@
 # hwrd.py - PVC Monitoring example plugin for hardware RAID Arrays
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2023 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/health-daemon/plugins/ipmi
+++ b/health-daemon/plugins/ipmi
@ -3,7 +3,7 @@
 # ipmi.py - PVC Monitoring example plugin for IPMI
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -69,26 +69,33 @@ class MonitoringPluginScript(MonitoringPlugin):

        # Run any imports first
        from daemon_lib.common import run_os_command
+        from time import sleep

        # Check the node's IPMI interface
        ipmi_hostname = self.config["ipmi_hostname"]
        ipmi_username = self.config["ipmi_username"]
        ipmi_password = self.config["ipmi_password"]
-        retcode, _, _ = run_os_command(
-            f"/usr/bin/ipmitool -I lanplus -H {ipmi_hostname} -U {ipmi_username} -P {ipmi_password} chassis power status",
-            timeout=5
-        )
+        retcode = 1
+        trycount = 0
+        while retcode > 0 and trycount < 3:
+            retcode, _, _ = run_os_command(
+                f"/usr/bin/ipmitool -I lanplus -H {ipmi_hostname} -U {ipmi_username} -P {ipmi_password} chassis power status",
+                timeout=2
+            )
+            trycount += 1
+            if retcode > 0 and trycount < 3:
+                sleep(trycount)

        if retcode > 0:
            # Set the health delta to 10 (subtract 10 from the total of 100)
            health_delta = 10
            # Craft a message that can be used by the clients
-            message = f"IPMI via {ipmi_username}@{ipmi_hostname} is NOT responding"
+            message = f"IPMI via {ipmi_username}@{ipmi_hostname} is NOT responding after 3 attempts"
        else:
            # Set the health delta to 0 (no change)
            health_delta = 0
            # Craft a message that can be used by the clients
-            message = f"IPMI via {ipmi_username}@{ipmi_hostname} is responding"
+            message = f"IPMI via {ipmi_username}@{ipmi_hostname} is responding after {trycount} attempts"

        # Set the health delta in our local PluginResult object
        self.plugin_result.set_health_delta(health_delta)
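
The retry loop added to the IPMI plugin above can be read in isolation as a bounded retry with a linearly growing pause. The following is only a sketch of that pattern, not the plugin's code: `run_with_retries`, its `command` callable and the injectable `pause` parameter are hypothetical names introduced here so the loop can be exercised without `ipmitool` or real sleeping.

```python
from time import sleep


def run_with_retries(command, max_tries=3, pause=sleep):
    """Call command() until it returns 0 or max_tries attempts are used.

    Returns (retcode, trycount). After each failed attempt except the last,
    pauses for `trycount` seconds, mirroring the loop in the diff above.
    """
    retcode = 1
    trycount = 0
    while retcode > 0 and trycount < max_tries:
        retcode = command()
        trycount += 1
        if retcode > 0 and trycount < max_tries:
            pause(trycount)
    return retcode, trycount


# Hypothetical stand-in for the ipmitool call: fails twice, then succeeds.
results = iter([1, 1, 0])
pauses = []  # records the requested pauses instead of actually sleeping
retcode, trycount = run_with_retries(lambda: next(results), pause=pauses.append)
print(retcode, trycount, pauses)  # 0 3 [1, 2]
```

With a 2 second command timeout and pauses of 1 and 2 seconds, three failed attempts take roughly 9 seconds in the worst case, which stays inside the 15 second check interval mentioned in the changelog.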
--- a/health-daemon/plugins/kydb
+++ b/health-daemon/plugins/kydb
@ -3,7 +3,7 @@
 # kydb.py - PVC Monitoring example plugin for KeyDB/Redis
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2023 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/health-daemon/plugins/lbvt
+++ b/health-daemon/plugins/lbvt
@ -3,7 +3,7 @@
 # lbvt.py - PVC Monitoring example plugin for Libvirtd
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/health-daemon/plugins/load
+++ b/health-daemon/plugins/load
@ -3,7 +3,7 @@
 # load.py - PVC Monitoring example plugin for load
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/health-daemon/plugins/nics
+++ b/health-daemon/plugins/nics
@ -3,7 +3,7 @@
 # nics.py - PVC Monitoring example plugin for NIC interfaces
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/health-daemon/plugins/psql
+++ b/health-daemon/plugins/psql
@ -3,7 +3,7 @@
 # psql.py - PVC Monitoring example plugin for Postgres/Patroni
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -55,7 +55,8 @@ class MonitoringPluginScript(MonitoringPlugin):
        This step is optional and should be used sparingly.
        """

-        pass
+        # Prepare the last coordinator state
+        self.last_coordinator_state = None

    def run(self, coordinator_state=None):
        """
@ -66,6 +67,8 @@ class MonitoringPluginScript(MonitoringPlugin):

        # Run any imports first
        from psycopg2 import connect
+        from json import loads as jloads
+        from daemon_lib.common import run_os_command

        conn_api = None
        cur_api = None
@ -77,7 +80,7 @@ class MonitoringPluginScript(MonitoringPlugin):
        # Craft a message that can be used by the clients
        message = "Successfully connected to PostgreSQL databases on localhost"

-        # Check the Metadata database (primary)
+        # Check the API database
        try:
            conn_api = connect(
                host=self.this_node.name,
@ -99,34 +102,38 @@ class MonitoringPluginScript(MonitoringPlugin):
            if conn_api is not None:
                conn_api.close()

-        if health_delta == 0:
-            # Check the PowerDNS database (secondary)
-            try:
-                conn_pdns = connect(
-                    host=self.this_node.name,
-                    port=self.config["pdns_postgresql_port"],
-                    dbname=self.config["pdns_postgresql_dbname"],
-                    user=self.config["pdns_postgresql_user"],
-                    password=self.config["pdns_postgresql_password"],
-                )
-                cur_pdns = conn_pdns.cursor()
-                cur_pdns.execute("""SELECT * FROM supermasters""")
-                data = cur_pdns.fetchone()
-            except Exception as e:
-                health_delta = 50
-                err = str(e).split('\n')[0]
-                message = f"Failed to connect to PostgreSQL database {self.config['pdns_postgresql_dbname']}: {err}"
-            finally:
-                if cur_pdns is not None:
-                    cur_pdns.close()
-                if conn_pdns is not None:
-                    conn_pdns.close()
+        # Check for Patroni status
+        _, stdout, _ = run_os_command("patronictl --config-file /etc/patroni/config.yml list --format json")
+        patronictl_status = jloads(stdout)
+        this_node_patronictl_status = next((p for p in patronictl_status if p["Member"] == self.this_node.name), None)
+        self.logger.out(f"{this_node_patronictl_status}, last node state: {self.last_coordinator_state}, current node state: {coordinator_state}", state="d")
+
+        # Invalid state, nothing returned; this is a fault
+        if health_delta == 0 and not this_node_patronictl_status:
+            health_delta = 10
+            message = "Unable to determine Patroni PostgreSQL node state"
+        # We want to check for a non-running Patroni, but not during or immediately after a coordinator
+        # transition. So we wait until 2 runs with the same coordinator state have been completed.
+        elif health_delta == 0 and self.last_coordinator_state == coordinator_state and this_node_patronictl_status["State"] != "running":
+            health_delta = 10
+            message = "Patroni PostgreSQL state is not running"
+
+        # Handle some exceptional cases
+        if health_delta > 0:
+            if coordinator_state in ["takeover", "relinquish"]:
+                # This scenario occurs if this plugin run catches a node transitioning from primary to
+                # secondary coordinator. We can ignore it.
+                health_delta = 0
+                message = "Patroni PostgreSQL error reported but currently transitioning coordinator state; ignoring."

        # Set the health delta in our local PluginResult object
        self.plugin_result.set_health_delta(health_delta)

        # Set the message in our local PluginResult object
        self.plugin_result.set_message(message)
+    
+        # Update the last coordinator state
+        self.last_coordinator_state = coordinator_state

        # Return our local PluginResult object
        return self.plugin_result
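
The Patroni portion of the psql plugin change above amounts to a small decision table: report a fault only when the coordinator state has been stable for two consecutive runs, and suppress it entirely during a transition. A minimal sketch of that table follows; `patroni_health_delta` is a hypothetical helper, it takes the Patroni state as a plain string (or `None` when no entry was found for this node), and it deliberately ignores the earlier API database check, which in the real plugin can already have set a non-zero delta.

```python
def patroni_health_delta(patroni_state, last_coordinator_state, coordinator_state):
    """Return the health delta for one plugin run (0 means healthy)."""
    health_delta = 0

    if patroni_state is None:
        # Nothing was returned for this node; treat it as a fault
        health_delta = 10
    elif last_coordinator_state == coordinator_state and patroni_state != "running":
        # Only trust a non-running state once the coordinator state has
        # been the same for two consecutive runs
        health_delta = 10

    # Ignore any fault seen while the node is changing coordinator state
    if health_delta > 0 and coordinator_state in ["takeover", "relinquish"]:
        health_delta = 0

    return health_delta


print(patroni_health_delta("running", "primary", "primary"))      # 0
print(patroni_health_delta("stopped", "primary", "secondary"))    # 0
print(patroni_health_delta("stopped", "secondary", "secondary"))  # 10
print(patroni_health_delta("stopped", "takeover", "takeover"))    # 0
```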
--- a/health-daemon/plugins/psur
+++ b/health-daemon/plugins/psur
@ -3,7 +3,7 @@
 # psur.py - PVC Monitoring example plugin for PSU Redundancy
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/health-daemon/plugins/zkpr
+++ b/health-daemon/plugins/zkpr
@ -3,7 +3,7 @@
 # zkpr.py - PVC Monitoring example plugin for Zookeeper
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/health-daemon/pvchealthd.py
+++ b/health-daemon/pvchealthd.py
@ -3,7 +3,7 @@
 # pvchealthd.py - Health daemon startup stub
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/health-daemon/pvchealthd/Daemon.py
+++ b/health-daemon/pvchealthd/Daemon.py
@ -3,7 +3,7 @@
 # Daemon.py - Health daemon main entrypoint
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -33,7 +33,7 @@ import os
 import signal

 # Daemon version
-version = "0.9.83"
+version = "0.9.100"


 ##########################################################
@ -80,6 +80,11 @@ def entrypoint():
    # Connect to Zookeeper and return our handler and current schema version
    zkhandler, _ = pvchealthd.util.zookeeper.connect(logger, config)

+    logger.out("Waiting for node daemon to be operating", state="s")
+    while zkhandler.read(("node.state.daemon", config["node_hostname"])) != "run":
+        sleep(5)
+    logger.out("Node daemon in run state, continuing health daemon startup", state="s")
+
    # Define a cleanup function
    def cleanup(failure=False):
        nonlocal logger, zkhandler, monitoring_instance
--- a/health-daemon/pvchealthd/objects/MonitoringInstance.py
+++ b/health-daemon/pvchealthd/objects/MonitoringInstance.py
@ -3,7 +3,7 @@
 # MonitoringInstance.py - Class implementing a PVC monitor in pvchealthd
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
@ -25,9 +25,11 @@ import importlib.util

 from os import walk
 from datetime import datetime
-from json import dumps
+from json import dumps, loads
 from apscheduler.schedulers.background import BackgroundScheduler

+from daemon_lib.faults import generate_fault
+

 class PluginError(Exception):
    """
@ -155,9 +157,6 @@ class MonitoringPlugin(object):
            "w": warning
            "e": error
        """
-        if state == "d" and not self.config["debug"]:
-            return
-
        self.logger.out(message, state=state, prefix=self.plugin_name)

    #
@ -196,6 +195,131 @@ class MonitoringInstance(object):
        self.config = config
        self.logger = logger
        self.this_node = this_node
+        self.faults = 0
+
+        # Create functions for each fault type
+        def get_node_daemon_states():
+            node_daemon_states = [
+                {
+                    "entry": node,
+                    "check": self.zkhandler.read(("node.state.daemon", node)),
+                    "details": None,
+                }
+                for node in self.zkhandler.children("base.node")
+            ]
+            return node_daemon_states
+
+        def get_osd_in_states():
+            osd_in_states = [
+                {
+                    "entry": osd,
+                    "check": loads(self.zkhandler.read(("osd.stats", osd))).get(
+                        "in", 0
+                    ),
+                    "details": None,
+                }
+                for osd in self.zkhandler.children("base.osd")
+            ]
+            return osd_in_states
+
+        def get_ceph_health_entries():
+            ceph_health_entries = [
+                {
+                    "entry": key,
+                    "check": value["severity"],
+                    "details": value["summary"]["message"],
+                }
+                for key, value in loads(zkhandler.read("base.storage.health"))[
+                    "checks"
+                ].items()
+            ]
+            return ceph_health_entries
+
+        def get_vm_states():
+            vm_states = [
+                {
+                    "entry": self.zkhandler.read(("domain.name", domain)),
+                    "check": self.zkhandler.read(("domain.state", domain)),
+                    "details": self.zkhandler.read(("domain.failed_reason", domain)),
+                }
+                for domain in self.zkhandler.children("base.domain")
+            ]
+            return vm_states
+
+        def get_overprovisioned_memory():
+            all_nodes = self.zkhandler.children("base.node")
+            current_memory_provisioned = sum(
+                [
+                    int(self.zkhandler.read(("node.memory.allocated", node)))
+                    for node in all_nodes
+                ]
+            )
+            node_memory_totals = [
+                int(self.zkhandler.read(("node.memory.total", node)))
+                for node in all_nodes
+            ]
+            total_node_memory = sum(node_memory_totals)
+            most_node_memory = sorted(node_memory_totals)[-1]
+            available_node_memory = total_node_memory - most_node_memory
+
+            if current_memory_provisioned >= available_node_memory:
+                op_str = "overprovisioned"
+            else:
+                op_str = "ok"
+            overprovisioned_memory = [
+                {
+                    "entry": "Cluster memory was overprovisioned",
+                    "check": op_str,
+                    "details": f"{current_memory_provisioned}MB > {available_node_memory}MB (N-1)",
+                }
+            ]
+            return overprovisioned_memory
+
+        # This is a list of all possible faults (cluster error messages) and their corresponding details
+        self.cluster_faults_map = {
+            "dead_or_fenced_node": {
+                "name": "DEAD_NODE_{entry}",
+                "entries": get_node_daemon_states,
+                "conditions": ["dead", "fenced"],
+                "delta": 50,
+                "message": "Node {entry} was dead and/or fenced",
+            },
+            "ceph_osd_out": {
+                "name": "CEPH_OSD_OUT_{entry}",
+                "entries": get_osd_in_states,
+                "conditions": ["0"],
+                "delta": 50,
+                "message": "OSD {entry} was marked out",
+            },
+            "ceph_warn": {
+                "name": "CEPH_WARN_{entry}",
+                "entries": get_ceph_health_entries,
+                "conditions": ["HEALTH_WARN"],
+                "delta": 10,
+                "message": "{entry} reported by Ceph cluster",
+            },
+            "ceph_err": {
+                "name": "CEPH_ERR_{entry}",
+                "entries": get_ceph_health_entries,
+                "conditions": ["HEALTH_ERR"],
+                "delta": 50,
+                "message": "{entry} reported by Ceph cluster",
+            },
+            "vm_failed": {
+                "name": "VM_FAILED_{entry}",
+                "entries": get_vm_states,
+                "conditions": ["fail"],
+                "delta": 10,
+                "message": "VM {entry} was failed",
+            },
+            "memory_overprovisioned": {
+                "name": "MEMORY_OVERPROVISIONED",
+                "entries": get_overprovisioned_memory,
+                "conditions": ["overprovisioned"],
+                "delta": 50,
+                "message": "{entry}",
+            },
+        }

        # Get a list of plugins from the plugin_directory
        plugin_files = next(walk(self.config["plugin_directory"]), (None, None, []))[
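
The `get_overprovisioned_memory` helper in the hunk above applies an N-1 rule: the cluster counts as overprovisioned when the memory allocated across all nodes would no longer fit if the largest node were lost. A standalone sketch of that arithmetic, with plain lists in place of the Zookeeper reads (`check_overprovisioned` and the sample figures are hypothetical):

```python
def check_overprovisioned(allocated_mb, total_mb):
    """Apply the N-1 memory rule to per-node values given in MB.

    Returns (state, provisioned, available), where available is the total
    cluster memory minus the memory of the single largest node.
    """
    provisioned = sum(allocated_mb)
    available = sum(total_mb) - max(total_mb)
    state = "overprovisioned" if provisioned >= available else "ok"
    return state, provisioned, available


# Hypothetical three-node cluster with 65536 MB of memory per node
print(check_overprovisioned([40000, 50000, 45000], [65536, 65536, 65536]))
# ('overprovisioned', 135000, 131072)
print(check_overprovisioned([30000, 30000, 30000], [65536, 65536, 65536]))
# ('ok', 90000, 131072)
```

Subtracting the largest node rather than an average one is the conservative choice, since it covers the worst single-node failure.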
@ -344,38 +468,84 @@ class MonitoringInstance(object):
                        )
                    )

-        self.run_plugins()
-        self.start_check_timer()
+        self.start_timer()

    def __del__(self):
        self.shutdown()

    def shutdown(self):
-        self.stop_check_timer()
+        self.stop_timer()
        self.run_cleanups()
        return

-    def start_check_timer(self):
-        check_interval = self.config["monitoring_interval"]
+    def start_timer(self):
+        check_interval = int(self.config["monitoring_interval"])
+
+        self.timer = BackgroundScheduler()
+        self.timer.add_job(
+            self.run_checks,
+            trigger="interval",
+            seconds=check_interval,
+        )
+
        self.logger.out(
            f"Starting monitoring check timer ({check_interval} second interval)",
            state="s",
        )
-        self.check_timer = BackgroundScheduler()
-        self.check_timer.add_job(
-            self.run_plugins,
-            trigger="interval",
-            seconds=check_interval,
-        )
-        self.check_timer.start()
+        self.timer.start()

-    def stop_check_timer(self):
+        self.run_checks()
+
+    def stop_timer(self):
        try:
-            self.check_timer.shutdown()
            self.logger.out("Stopping monitoring check timer", state="s")
+            self.timer.shutdown()
        except Exception:
            self.logger.out("Failed to stop monitoring check timer", state="w")

+    def run_faults(self, coordinator_state=None):
+        self.logger.out(
+            f"Starting cluster fault check run at {datetime.now()}",
+            state="t",
+        )
+
+        for fault_type in self.cluster_faults_map.keys():
+            fault_data = self.cluster_faults_map[fault_type]
+
+            if self.config["log_monitoring_details"] or self.config["debug"]:
+                self.logger.out(
+                    f"Running fault check {fault_type}",
+                    state="t",
+                )
+
+            entries = fault_data["entries"]()
+
+            self.logger.out(
+                f"Entries for fault check {fault_type}: {dumps(entries)}",
+                state="d",
+            )
+
+            for _entry in entries:
+                entry = _entry["entry"]
+                check = _entry["check"]
+                details = _entry["details"]
+                for condition in fault_data["conditions"]:
+                    if str(condition) == str(check):
+                        fault_time = datetime.now()
+                        fault_delta = fault_data["delta"]
+                        fault_name = fault_data["name"].format(entry=entry.upper())
+                        fault_message = fault_data["message"].format(entry=entry)
+                        generate_fault(
+                            self.zkhandler,
+                            self.logger,
+                            fault_name,
+                            fault_time,
+                            fault_delta,
+                            fault_message,
+                            fault_details=details,
+                        )
+                        self.faults += 1
+
    def run_plugin(self, plugin):
        time_start = datetime.now()
        try:
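
To make the interaction between `cluster_faults_map` and `run_faults` above easier to follow, here is a reduced sketch of the matching step. It assumes the same map layout as the diff ("name", "entries", "conditions", "delta", "message") but collects plain dictionaries instead of calling `generate_fault`; `match_faults` and the sample node and OSD entries are hypothetical.

```python
def match_faults(fault_map):
    """Return one fault dict for every entry whose check matches a condition."""
    faults = []
    for fault_type, fault_data in fault_map.items():
        for item in fault_data["entries"]():
            for condition in fault_data["conditions"]:
                # Both sides are compared as strings, so an integer 0 read
                # from OSD stats still matches the condition "0"
                if str(condition) == str(item["check"]):
                    faults.append(
                        {
                            "name": fault_data["name"].format(entry=item["entry"].upper()),
                            "delta": fault_data["delta"],
                            "message": fault_data["message"].format(entry=item["entry"]),
                            "details": item["details"],
                        }
                    )
    return faults


# Hypothetical sample data standing in for the Zookeeper-backed entry functions
fault_map = {
    "dead_or_fenced_node": {
        "name": "DEAD_NODE_{entry}",
        "entries": lambda: [
            {"entry": "hv1", "check": "run", "details": None},
            {"entry": "hv2", "check": "dead", "details": None},
        ],
        "conditions": ["dead", "fenced"],
        "delta": 50,
        "message": "Node {entry} was dead and/or fenced",
    },
    "ceph_osd_out": {
        "name": "CEPH_OSD_OUT_{entry}",
        "entries": lambda: [{"entry": "3", "check": 0, "details": None}],
        "conditions": ["0"],
        "delta": 50,
        "message": "OSD {entry} was marked out",
    },
}

faults = match_faults(fault_map)
print([f["name"] for f in faults])  # ['DEAD_NODE_HV2', 'CEPH_OSD_OUT_3']
```

Because the fault name is derived from the entry, a repeated detection of the same node or OSD maps to the same identifier, which is what lets a persistent fault be updated rather than duplicated.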
@ -394,19 +564,9 @@ class MonitoringInstance(object):
        result.to_zookeeper()
        return result

-    def run_plugins(self):
-        if self.this_node.coordinator_state == "primary":
-            cst_colour = self.logger.fmt_green
-        elif self.this_node.coordinator_state == "secondary":
-            cst_colour = self.logger.fmt_blue
-        else:
-            cst_colour = self.logger.fmt_cyan
-
-        active_coordinator_state = self.this_node.coordinator_state
-
-        runtime_start = datetime.now()
+    def run_plugins(self, coordinator_state=None):
        self.logger.out(
-            "Starting monitoring healthcheck run",
+            f"Starting node plugin check run at {datetime.now()}",
            state="t",
        )

@ -427,7 +587,33 @@ class MonitoringInstance(object):
                    state="t",
                    prefix=f"{result.plugin_name} ({result.runtime}s)",
                )
-            total_health -= result.health_delta
+
+            # Generate a cluster fault if the plugin is in a suboptimal state
+            if result.health_delta > 0:
+                fault_name = f"NODE_PLUGIN_{result.plugin_name.upper()}_{self.this_node.name.upper()}"
+                fault_time = datetime.now()
+
+                # Map our check results to fault results
+                # These are not 1-to-1, as faults are cluster-wide.
+                # We divide the delta by two since 2 nodes with the same problem
+                # should equal what the result says.
+                fault_delta = int(result.health_delta / 2)
+
+                fault_message = (
+                    f"{self.this_node.name} {result.plugin_name}: {result.message}"
+                )
+                generate_fault(
+                    self.zkhandler,
+                    self.logger,
+                    fault_name,
+                    fault_time,
+                    fault_delta,
+                    fault_message,
+                    fault_details=None,
+                )
+                self.faults += 1
+
+                total_health -= result.health_delta

        if total_health < 0:
            total_health = 0
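
The plugin-to-fault mapping in the hunk above can be summarised as: a plugin result with a positive delta becomes a cluster fault named after the plugin and node, carrying half of the plugin's delta. A short sketch under that reading (`plugin_fault` is a hypothetical helper returning a dictionary; the real code passes these values to `generate_fault`):

```python
def plugin_fault(node_name, plugin_name, health_delta, message):
    """Map one plugin result to a cluster fault, or None if it is healthy."""
    if health_delta <= 0:
        return None
    return {
        "name": f"NODE_PLUGIN_{plugin_name.upper()}_{node_name.upper()}",
        # Halved so that two nodes reporting the same problem together
        # add up to the delta the plugin itself reported
        "delta": int(health_delta / 2),
        "message": f"{node_name} {plugin_name}: {message}",
    }


fault = plugin_fault("hv1", "ipmi", 10, "IPMI is NOT responding after 3 attempts")
print(fault["name"], fault["delta"])  # NODE_PLUGIN_IPMI_HV1 5
print(plugin_fault("hv1", "load", 0, "Load is within limits"))  # None
```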
@ -441,38 +627,6 @@ class MonitoringInstance(object):
            ]
        )

-        runtime_end = datetime.now()
-        runtime_delta = runtime_end - runtime_start
-        runtime = "{:0.02f}".format(runtime_delta.total_seconds())
-        time.sleep(0.2)
-
-        if isinstance(self.this_node.health, int):
-            if self.this_node.health > 90:
-                health_colour = self.logger.fmt_green
-            elif self.this_node.health > 50:
-                health_colour = self.logger.fmt_yellow
-            else:
-                health_colour = self.logger.fmt_red
-            health_text = str(self.this_node.health) + "%"
-        else:
-            health_colour = self.logger.fmt_blue
-            health_text = "N/A"
-
-        self.logger.out(
-            "{start_colour}{hostname} healthcheck @ {starttime}{nofmt} [{cst_colour}{costate}{nofmt}] result is {health_colour}{health}{nofmt} in {runtime} seconds".format(
-                start_colour=self.logger.fmt_purple,
-                cst_colour=self.logger.fmt_bold + cst_colour,
-                health_colour=health_colour,
-                nofmt=self.logger.fmt_end,
-                hostname=self.config["node_hostname"],
-                starttime=runtime_start,
-                costate=active_coordinator_state,
-                health=health_text,
-                runtime=runtime,
-            ),
-            state="t",
-        )
-
    def run_cleanup(self, plugin):
        return plugin.cleanup()

@ -494,3 +648,66 @@ class MonitoringInstance(object):
                ),
            ]
        )
+
+    def run_checks(self):
+        self.faults = 0
+        runtime_start = datetime.now()
+
+        coordinator_state = self.this_node.coordinator_state
+
+        if coordinator_state == "primary":
+            cst_colour = self.logger.fmt_green
+        elif coordinator_state == "secondary":
+            cst_colour = self.logger.fmt_blue
+        else:
+            cst_colour = self.logger.fmt_cyan
+
+        self.run_plugins(coordinator_state=coordinator_state)
+
+        if coordinator_state in ["primary", "takeover"]:
+            self.run_faults(coordinator_state=coordinator_state)
+
+        runtime_end = datetime.now()
+        runtime_delta = runtime_end - runtime_start
+        runtime = "{:0.02f}".format(runtime_delta.total_seconds())
+
+        result_text = list()
+
+        if coordinator_state in ["primary", "secondary", "takeover", "relinquish"]:
+            if self.faults > 0:
+                fault_colour = self.logger.fmt_red
+            else:
+                fault_colour = self.logger.fmt_green
+            if self.faults != 1:
+                s = "s"
+            else:
+                s = ""
+            fault_text = f"{fault_colour}{self.faults}{self.logger.fmt_end} fault{s}"
+            result_text.append(fault_text)
+
+        if isinstance(self.this_node.health, int):
+            if self.this_node.health > 90:
+                health_colour = self.logger.fmt_green
+            elif self.this_node.health > 50:
+                health_colour = self.logger.fmt_yellow
+            else:
+                health_colour = self.logger.fmt_red
+            health_text = f"{health_colour}{self.this_node.health}%{self.logger.fmt_end} node health"
+            result_text.append(health_text)
+        else:
+            health_text = f"{self.logger.fmt_blue}N/A{self.logger.fmt_end} node health"
+            result_text.append(health_text)
+
+        self.logger.out(
+            "{start_colour}{hostname} health check @ {starttime}{nofmt} [{cst_colour}{costate}{nofmt}] result is {result_text} in {runtime} seconds".format(
+                start_colour=self.logger.fmt_purple,
+                cst_colour=self.logger.fmt_bold + cst_colour,
+                nofmt=self.logger.fmt_end,
+                hostname=self.config["node_hostname"],
+                starttime=runtime_start,
+                costate=coordinator_state,
+                runtime=runtime,
+                result_text=", ".join(result_text),
+            ),
+            state="t",
+        )
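
Reviewer note: the new `run_checks` method above builds its log summary from a list of parts, so the fault count only appears for coordinator states and the health figure falls back to "N/A" when it is not an integer. The following is a minimal standalone sketch of that assembly logic, with the logger colour codes left out. The function name `summarize_health_check` and its parameters are hypothetical and are not part of the PVC codebase.

```python
from typing import Union


def summarize_health_check(
    coordinator_state: str,
    faults: int,
    health: Union[int, str, None],
) -> str:
    """Build the plain-text result summary (hypothetical helper, no colours)."""
    parts = []

    # The fault count is only reported for coordinator-type states,
    # mirroring the state list used in the diff above.
    if coordinator_state in ("primary", "secondary", "takeover", "relinquish"):
        plural = "" if faults == 1 else "s"
        parts.append(f"{faults} fault{plural}")

    # Health is reported as a percentage only when it is an integer;
    # any other value is shown as N/A.
    if isinstance(health, int):
        parts.append(f"{health}% node health")
    else:
        parts.append("N/A node health")

    return ", ".join(parts)


if __name__ == "__main__":
    print(summarize_health_check("primary", 2, 80))
```

Collecting the parts in a list and joining them once keeps the format string unchanged when a state contributes no fault count, which appears to be the motivation for `result_text` in the diff.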
--- a/health-daemon/pvchealthd/objects/NodeInstance.py
+++ b/health-daemon/pvchealthd/objects/NodeInstance.py
@ -3,7 +3,7 @@
 # NodeInstance.py - Class implementing a PVC node in pvchealthd
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/health-daemon/pvchealthd/util/zookeeper.py
+++ b/health-daemon/pvchealthd/util/zookeeper.py
@ -4,7 +4,7 @@
 # zookeeper.py - Utility functions for pvcnoded Zookeeper connections
 # Part of the Parallel Virtual Cluster (PVC) system
 #
-#    Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
+#    Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
--- a/images/0-integrated-help.png
+++ b/images/0-integrated-help.png
--- a/images/1-connection-management.png
+++ b/images/1-connection-management.png
--- a/images/10-provisioner.png
+++ b/images/10-provisioner.png
--- a/images/11-prometheus-grafana.png
+++ b/images/11-prometheus-grafana.png
--- a/images/2-cluster-details-and-output-formats.png
+++ b/images/2-cluster-details-and-output-formats.png
				`@ -0,0 +1 @@`
				{"version": "11", "root": "", "base": {"root": "", "schema": "/schema", "schema.version": "/schema/version", "config": "/config", "config.maintenance": "/config/maintenance", "config.primary_node": "/config/primary_node", "config.primary_node.sync_lock": "/config/primary_node/sync_lock", "config.upstream_ip": "/config/upstream_ip", "config.migration_target_selector": "/config/migration_target_selector", "logs": "/logs", "faults": "/faults", "node": "/nodes", "domain": "/domains", "network": "/networks", "storage": "/ceph", "storage.health": "/ceph/health", "storage.util": "/ceph/util", "osd": "/ceph/osds", "pool": "/ceph/pools", "volume": "/ceph/volumes", "snapshot": "/ceph/snapshots"}, "logs": {"node": "", "messages": "/messages"}, "faults": {"id": "", "last_time": "/last_time", "first_time": "/first_time", "ack_time": "/ack_time", "status": "/status", "delta": "/delta", "message": "/message"}, "node": {"name": "", "keepalive": "/keepalive", "mode": "/daemonmode", "data.active_schema": "/activeschema", "data.latest_schema": "/latestschema", "data.static": "/staticdata", "data.pvc_version": "/pvcversion", "running_domains": "/runningdomains", "count.provisioned_domains": "/domainscount", "count.networks": "/networkscount", "state.daemon": "/daemonstate", "state.router": "/routerstate", "state.domain": "/domainstate", "cpu.load": "/cpuload", "vcpu.allocated": "/vcpualloc", "memory.total": "/memtotal", "memory.used": "/memused", "memory.free": "/memfree", "memory.allocated": "/memalloc", "memory.provisioned": "/memprov", "ipmi.hostname": "/ipmihostname", "ipmi.username": "/ipmiusername", "ipmi.password": "/ipmipassword", "sriov": "/sriov", "sriov.pf": "/sriov/pf", "sriov.vf": "/sriov/vf", "monitoring.plugins": "/monitoring_plugins", "monitoring.data": "/monitoring_data", "monitoring.health": "/monitoring_health"}, "monitoring_plugin": {"name": "", "last_run": "/last_run", "health_delta": "/health_delta", "message": "/message", "data": "/data", "runtime": 
"/runtime"}, "sriov_pf": {"phy": "", "mtu": "/mtu", "vfcount": "/vfcount"}, "sriov_vf": {"phy": "", "pf": "/pf", "mtu": "/mtu", "mac": "/mac", "phy_mac": "/phy_mac", "config": "/config", "config.vlan_id": "/config/vlan_id", "config.vlan_qos": "/config/vlan_qos", "config.tx_rate_min": "/config/tx_rate_min", "config.tx_rate_max": "/config/tx_rate_max", "config.spoof_check": "/config/spoof_check", "config.link_state": "/config/link_state", "config.trust": "/config/trust", "config.query_rss": "/config/query_rss", "pci": "/pci", "pci.domain": "/pci/domain", "pci.bus": "/pci/bus", "pci.slot": "/pci/slot", "pci.function": "/pci/function", "used": "/used", "used_by": "/used_by"}, "domain": {"name": "", "xml": "/xml", "state": "/state", "profile": "/profile", "stats": "/stats", "node": "/node", "last_node": "/lastnode", "failed_reason": "/failedreason", "storage.volumes": "/rbdlist", "console.log": "/consolelog", "console.vnc": "/vnc", "meta.autostart": "/node_autostart", "meta.migrate_method": "/migration_method", "meta.node_selector": "/node_selector", "meta.node_limit": "/node_limit", "meta.tags": "/tags", "migrate.sync_lock": "/migrate_sync_lock"}, "tag": {"name": "", "type": "/type", "protected": "/protected"}, "network": {"vni": "", "type": "/nettype", "mtu": "/mtu", "rule": "/firewall_rules", "rule.in": "/firewall_rules/in", "rule.out": "/firewall_rules/out", "nameservers": "/name_servers", "domain": "/domain", "reservation": "/dhcp4_reservations", "lease": "/dhcp4_leases", "ip4.gateway": "/ip4_gateway", "ip4.network": "/ip4_network", "ip4.dhcp": "/dhcp4_flag", "ip4.dhcp_start": "/dhcp4_start", "ip4.dhcp_end": "/dhcp4_end", "ip6.gateway": "/ip6_gateway", "ip6.network": "/ip6_network", "ip6.dhcp": "/dhcp6_flag"}, "reservation": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname"}, "lease": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname", "expiry": "/expiry", "client_id": "/clientid"}, "rule": {"description": "", "rule": "/rule", "order": "/order"}, "osd": {"id": 
"", "node": "/node", "device": "/device", "db_device": "/db_device", "fsid": "/fsid", "ofsid": "/fsid/osd", "cfsid": "/fsid/cluster", "lvm": "/lvm", "vg": "/lvm/vg", "lv": "/lvm/lv", "is_split": "/is_split", "stats": "/stats"}, "pool": {"name": "", "pgs": "/pgs", "tier": "/tier", "stats": "/stats"}, "volume": {"name": "", "stats": "/stats"}, "snapshot": {"name": "", "stats": "/stats"}}
				`@ -0,0 +1 @@`
				{"version": "12", "root": "", "base": {"root": "", "schema": "/schema", "schema.version": "/schema/version", "config": "/config", "config.maintenance": "/config/maintenance", "config.primary_node": "/config/primary_node", "config.primary_node.sync_lock": "/config/primary_node/sync_lock", "config.upstream_ip": "/config/upstream_ip", "config.migration_target_selector": "/config/migration_target_selector", "logs": "/logs", "faults": "/faults", "node": "/nodes", "domain": "/domains", "network": "/networks", "storage": "/ceph", "storage.health": "/ceph/health", "storage.util": "/ceph/util", "osd": "/ceph/osds", "pool": "/ceph/pools", "volume": "/ceph/volumes", "snapshot": "/ceph/snapshots"}, "logs": {"node": "", "messages": "/messages"}, "faults": {"id": "", "last_time": "/last_time", "first_time": "/first_time", "ack_time": "/ack_time", "status": "/status", "delta": "/delta", "message": "/message"}, "node": {"name": "", "keepalive": "/keepalive", "mode": "/daemonmode", "data.active_schema": "/activeschema", "data.latest_schema": "/latestschema", "data.static": "/staticdata", "data.pvc_version": "/pvcversion", "running_domains": "/runningdomains", "count.provisioned_domains": "/domainscount", "count.networks": "/networkscount", "state.daemon": "/daemonstate", "state.router": "/routerstate", "state.domain": "/domainstate", "cpu.load": "/cpuload", "vcpu.allocated": "/vcpualloc", "memory.total": "/memtotal", "memory.used": "/memused", "memory.free": "/memfree", "memory.allocated": "/memalloc", "memory.provisioned": "/memprov", "ipmi.hostname": "/ipmihostname", "ipmi.username": "/ipmiusername", "ipmi.password": "/ipmipassword", "sriov": "/sriov", "sriov.pf": "/sriov/pf", "sriov.vf": "/sriov/vf", "monitoring.plugins": "/monitoring_plugins", "monitoring.data": "/monitoring_data", "monitoring.health": "/monitoring_health", "network.stats": "/network_stats"}, "monitoring_plugin": {"name": "", "last_run": "/last_run", "health_delta": "/health_delta", "message": "/message", 
"data": "/data", "runtime": "/runtime"}, "sriov_pf": {"phy": "", "mtu": "/mtu", "vfcount": "/vfcount"}, "sriov_vf": {"phy": "", "pf": "/pf", "mtu": "/mtu", "mac": "/mac", "phy_mac": "/phy_mac", "config": "/config", "config.vlan_id": "/config/vlan_id", "config.vlan_qos": "/config/vlan_qos", "config.tx_rate_min": "/config/tx_rate_min", "config.tx_rate_max": "/config/tx_rate_max", "config.spoof_check": "/config/spoof_check", "config.link_state": "/config/link_state", "config.trust": "/config/trust", "config.query_rss": "/config/query_rss", "pci": "/pci", "pci.domain": "/pci/domain", "pci.bus": "/pci/bus", "pci.slot": "/pci/slot", "pci.function": "/pci/function", "used": "/used", "used_by": "/used_by"}, "domain": {"name": "", "xml": "/xml", "state": "/state", "profile": "/profile", "stats": "/stats", "node": "/node", "last_node": "/lastnode", "failed_reason": "/failedreason", "storage.volumes": "/rbdlist", "console.log": "/consolelog", "console.vnc": "/vnc", "meta.autostart": "/node_autostart", "meta.migrate_method": "/migration_method", "meta.node_selector": "/node_selector", "meta.node_limit": "/node_limit", "meta.tags": "/tags", "migrate.sync_lock": "/migrate_sync_lock"}, "tag": {"name": "", "type": "/type", "protected": "/protected"}, "network": {"vni": "", "type": "/nettype", "mtu": "/mtu", "rule": "/firewall_rules", "rule.in": "/firewall_rules/in", "rule.out": "/firewall_rules/out", "nameservers": "/name_servers", "domain": "/domain", "reservation": "/dhcp4_reservations", "lease": "/dhcp4_leases", "ip4.gateway": "/ip4_gateway", "ip4.network": "/ip4_network", "ip4.dhcp": "/dhcp4_flag", "ip4.dhcp_start": "/dhcp4_start", "ip4.dhcp_end": "/dhcp4_end", "ip6.gateway": "/ip6_gateway", "ip6.network": "/ip6_network", "ip6.dhcp": "/dhcp6_flag"}, "reservation": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname"}, "lease": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname", "expiry": "/expiry", "client_id": "/clientid"}, "rule": {"description": "", "rule": "/rule", 
"order": "/order"}, "osd": {"id": "", "node": "/node", "device": "/device", "db_device": "/db_device", "fsid": "/fsid", "ofsid": "/fsid/osd", "cfsid": "/fsid/cluster", "lvm": "/lvm", "vg": "/lvm/vg", "lv": "/lvm/lv", "is_split": "/is_split", "stats": "/stats"}, "pool": {"name": "", "pgs": "/pgs", "tier": "/tier", "stats": "/stats"}, "volume": {"name": "", "stats": "/stats"}, "snapshot": {"name": "", "stats": "/stats"}}
				`@ -0,0 +1 @@`
				{"version": "13", "root": "", "base": {"root": "", "schema": "/schema", "schema.version": "/schema/version", "config": "/config", "config.maintenance": "/config/maintenance", "config.primary_node": "/config/primary_node", "config.primary_node.sync_lock": "/config/primary_node/sync_lock", "config.upstream_ip": "/config/upstream_ip", "config.migration_target_selector": "/config/migration_target_selector", "logs": "/logs", "faults": "/faults", "node": "/nodes", "domain": "/domains", "network": "/networks", "storage": "/ceph", "storage.health": "/ceph/health", "storage.util": "/ceph/util", "osd": "/ceph/osds", "pool": "/ceph/pools", "volume": "/ceph/volumes", "snapshot": "/ceph/snapshots"}, "logs": {"node": "", "messages": "/messages"}, "faults": {"id": "", "last_time": "/last_time", "first_time": "/first_time", "ack_time": "/ack_time", "status": "/status", "delta": "/delta", "message": "/message"}, "node": {"name": "", "keepalive": "/keepalive", "mode": "/daemonmode", "data.active_schema": "/activeschema", "data.latest_schema": "/latestschema", "data.static": "/staticdata", "data.pvc_version": "/pvcversion", "running_domains": "/runningdomains", "count.provisioned_domains": "/domainscount", "count.networks": "/networkscount", "state.daemon": "/daemonstate", "state.router": "/routerstate", "state.domain": "/domainstate", "cpu.load": "/cpuload", "vcpu.allocated": "/vcpualloc", "memory.total": "/memtotal", "memory.used": "/memused", "memory.free": "/memfree", "memory.allocated": "/memalloc", "memory.provisioned": "/memprov", "ipmi.hostname": "/ipmihostname", "ipmi.username": "/ipmiusername", "ipmi.password": "/ipmipassword", "sriov": "/sriov", "sriov.pf": "/sriov/pf", "sriov.vf": "/sriov/vf", "monitoring.plugins": "/monitoring_plugins", "monitoring.data": "/monitoring_data", "monitoring.health": "/monitoring_health", "network.stats": "/network_stats"}, "monitoring_plugin": {"name": "", "last_run": "/last_run", "health_delta": "/health_delta", "message": "/message", 
"data": "/data", "runtime": "/runtime"}, "sriov_pf": {"phy": "", "mtu": "/mtu", "vfcount": "/vfcount"}, "sriov_vf": {"phy": "", "pf": "/pf", "mtu": "/mtu", "mac": "/mac", "phy_mac": "/phy_mac", "config": "/config", "config.vlan_id": "/config/vlan_id", "config.vlan_qos": "/config/vlan_qos", "config.tx_rate_min": "/config/tx_rate_min", "config.tx_rate_max": "/config/tx_rate_max", "config.spoof_check": "/config/spoof_check", "config.link_state": "/config/link_state", "config.trust": "/config/trust", "config.query_rss": "/config/query_rss", "pci": "/pci", "pci.domain": "/pci/domain", "pci.bus": "/pci/bus", "pci.slot": "/pci/slot", "pci.function": "/pci/function", "used": "/used", "used_by": "/used_by"}, "domain": {"name": "", "xml": "/xml", "state": "/state", "profile": "/profile", "stats": "/stats", "node": "/node", "last_node": "/lastnode", "failed_reason": "/failedreason", "storage.volumes": "/rbdlist", "console.log": "/consolelog", "console.vnc": "/vnc", "meta.autostart": "/node_autostart", "meta.migrate_method": "/migration_method", "meta.migrate_max_downtime": "/migration_max_downtime", "meta.node_selector": "/node_selector", "meta.node_limit": "/node_limit", "meta.tags": "/tags", "migrate.sync_lock": "/migrate_sync_lock"}, "tag": {"name": "", "type": "/type", "protected": "/protected"}, "network": {"vni": "", "type": "/nettype", "mtu": "/mtu", "rule": "/firewall_rules", "rule.in": "/firewall_rules/in", "rule.out": "/firewall_rules/out", "nameservers": "/name_servers", "domain": "/domain", "reservation": "/dhcp4_reservations", "lease": "/dhcp4_leases", "ip4.gateway": "/ip4_gateway", "ip4.network": "/ip4_network", "ip4.dhcp": "/dhcp4_flag", "ip4.dhcp_start": "/dhcp4_start", "ip4.dhcp_end": "/dhcp4_end", "ip6.gateway": "/ip6_gateway", "ip6.network": "/ip6_network", "ip6.dhcp": "/dhcp6_flag"}, "reservation": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname"}, "lease": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname", "expiry": "/expiry", "client_id": 
"/clientid"}, "rule": {"description": "", "rule": "/rule", "order": "/order"}, "osd": {"id": "", "node": "/node", "device": "/device", "db_device": "/db_device", "fsid": "/fsid", "ofsid": "/fsid/osd", "cfsid": "/fsid/cluster", "lvm": "/lvm", "vg": "/lvm/vg", "lv": "/lvm/lv", "is_split": "/is_split", "stats": "/stats"}, "pool": {"name": "", "pgs": "/pgs", "tier": "/tier", "stats": "/stats"}, "volume": {"name": "", "stats": "/stats"}, "snapshot": {"name": "", "stats": "/stats"}}
				`@ -0,0 +1 @@`
				{"version": "14", "root": "", "base": {"root": "", "schema": "/schema", "schema.version": "/schema/version", "config": "/config", "config.maintenance": "/config/maintenance", "config.primary_node": "/config/primary_node", "config.primary_node.sync_lock": "/config/primary_node/sync_lock", "config.upstream_ip": "/config/upstream_ip", "config.migration_target_selector": "/config/migration_target_selector", "logs": "/logs", "faults": "/faults", "node": "/nodes", "domain": "/domains", "network": "/networks", "storage": "/ceph", "storage.health": "/ceph/health", "storage.util": "/ceph/util", "osd": "/ceph/osds", "pool": "/ceph/pools", "volume": "/ceph/volumes", "snapshot": "/ceph/snapshots"}, "logs": {"node": "", "messages": "/messages"}, "faults": {"id": "", "last_time": "/last_time", "first_time": "/first_time", "ack_time": "/ack_time", "status": "/status", "delta": "/delta", "message": "/message"}, "node": {"name": "", "keepalive": "/keepalive", "mode": "/daemonmode", "data.active_schema": "/activeschema", "data.latest_schema": "/latestschema", "data.static": "/staticdata", "data.pvc_version": "/pvcversion", "running_domains": "/runningdomains", "count.provisioned_domains": "/domainscount", "count.networks": "/networkscount", "state.daemon": "/daemonstate", "state.router": "/routerstate", "state.domain": "/domainstate", "cpu.load": "/cpuload", "vcpu.allocated": "/vcpualloc", "memory.total": "/memtotal", "memory.used": "/memused", "memory.free": "/memfree", "memory.allocated": "/memalloc", "memory.provisioned": "/memprov", "ipmi.hostname": "/ipmihostname", "ipmi.username": "/ipmiusername", "ipmi.password": "/ipmipassword", "sriov": "/sriov", "sriov.pf": "/sriov/pf", "sriov.vf": "/sriov/vf", "monitoring.plugins": "/monitoring_plugins", "monitoring.data": "/monitoring_data", "monitoring.health": "/monitoring_health", "network.stats": "/network_stats"}, "monitoring_plugin": {"name": "", "last_run": "/last_run", "health_delta": "/health_delta", "message": "/message", 
"data": "/data", "runtime": "/runtime"}, "sriov_pf": {"phy": "", "mtu": "/mtu", "vfcount": "/vfcount"}, "sriov_vf": {"phy": "", "pf": "/pf", "mtu": "/mtu", "mac": "/mac", "phy_mac": "/phy_mac", "config": "/config", "config.vlan_id": "/config/vlan_id", "config.vlan_qos": "/config/vlan_qos", "config.tx_rate_min": "/config/tx_rate_min", "config.tx_rate_max": "/config/tx_rate_max", "config.spoof_check": "/config/spoof_check", "config.link_state": "/config/link_state", "config.trust": "/config/trust", "config.query_rss": "/config/query_rss", "pci": "/pci", "pci.domain": "/pci/domain", "pci.bus": "/pci/bus", "pci.slot": "/pci/slot", "pci.function": "/pci/function", "used": "/used", "used_by": "/used_by"}, "domain": {"name": "", "xml": "/xml", "state": "/state", "profile": "/profile", "stats": "/stats", "node": "/node", "last_node": "/lastnode", "failed_reason": "/failedreason", "storage.volumes": "/rbdlist", "console.log": "/consolelog", "console.vnc": "/vnc", "meta.autostart": "/node_autostart", "meta.migrate_method": "/migration_method", "meta.migrate_max_downtime": "/migration_max_downtime", "meta.node_selector": "/node_selector", "meta.node_limit": "/node_limit", "meta.tags": "/tags", "migrate.sync_lock": "/migrate_sync_lock", "snapshots": "/snapshots"}, "tag": {"name": "", "type": "/type", "protected": "/protected"}, "domain_snapshot": {"name": "", "timestamp": "/timestamp", "xml": "/xml", "rbd_snapshots": "/rbdsnaplist"}, "network": {"vni": "", "type": "/nettype", "mtu": "/mtu", "rule": "/firewall_rules", "rule.in": "/firewall_rules/in", "rule.out": "/firewall_rules/out", "nameservers": "/name_servers", "domain": "/domain", "reservation": "/dhcp4_reservations", "lease": "/dhcp4_leases", "ip4.gateway": "/ip4_gateway", "ip4.network": "/ip4_network", "ip4.dhcp": "/dhcp4_flag", "ip4.dhcp_start": "/dhcp4_start", "ip4.dhcp_end": "/dhcp4_end", "ip6.gateway": "/ip6_gateway", "ip6.network": "/ip6_network", "ip6.dhcp": "/dhcp6_flag"}, "reservation": {"mac": "", "ip": 
"/ipaddr", "hostname": "/hostname"}, "lease": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname", "expiry": "/expiry", "client_id": "/clientid"}, "rule": {"description": "", "rule": "/rule", "order": "/order"}, "osd": {"id": "", "node": "/node", "device": "/device", "db_device": "/db_device", "fsid": "/fsid", "ofsid": "/fsid/osd", "cfsid": "/fsid/cluster", "lvm": "/lvm", "vg": "/lvm/vg", "lv": "/lvm/lv", "is_split": "/is_split", "stats": "/stats"}, "pool": {"name": "", "pgs": "/pgs", "tier": "/tier", "stats": "/stats"}, "volume": {"name": "", "stats": "/stats"}, "snapshot": {"name": "", "stats": "/stats"}}