Compare commits

..

49 Commits

Author SHA1 Message Date
d0de4f1825 Update Grafana dashboard to overview
Adds resource utilization in addition to health.
2023-12-27 11:38:39 -05:00
494c20263d Move monitoring folder to top level 2023-12-27 11:37:49 -05:00
431ee69620 Use proper percentage for pool util 2023-12-27 10:03:00 -05:00
88f4d79d5a Handle invalid values on older Libvirt versions 2023-12-27 09:51:24 -05:00
84d22751d8 Fix bad JSON data handler 2023-12-27 09:43:37 -05:00
40ff005a09 Fix handling of Ceph OSD bytes 2023-12-26 12:43:51 -05:00
ab4ec7a5fa Remove WebUI from README 2023-12-25 02:48:44 -05:00
9604f655d0 Improve node utilization metrics and fix bugs 2023-12-25 02:47:41 -05:00
3e4cc53fdd Add node network statistics and utilization values
Adds a new physical network interface stats parser to the node
keepalives, and leverages this information to provide a network
utilization overview in the Prometheus metrics.
2023-12-21 15:45:01 -05:00
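The utilization overview described above boils down to turning two interface byte-counter samples into a percentage of link speed. A minimal sketch of that arithmetic (the function name and parameters are illustrative, not the actual PVC parser):

```python
def interface_utilization(prev_bytes, cur_bytes, interval_s, link_speed_bps):
    # Two samples of a monotonically increasing byte counter, taken
    # interval_s seconds apart, give a throughput in bits per second;
    # dividing by the link speed yields a utilization percentage.
    bps = (cur_bytes - prev_bytes) * 8 / interval_s
    return 100 * bps / link_speed_bps
```

For example, moving 125 MB in one second on a 1 Gbps link is 100% utilization.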
d2d2a9c617 Include our newline atomically
Sometimes clashing log entries would print on the same line, likely due
to some sort of race condition in Python's print() built-in.

Instead, add a newline to our actual message and print without an end
character. This ensures atomic printing of our log messages.
2023-12-21 13:12:43 -05:00
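The fix the message describes can be sketched in a few lines: instead of relying on `print()` to append the newline (a separate write that can interleave with another thread's output), the newline is made part of the message itself and `end` is suppressed, so the whole line goes out in one write. This is a minimal illustration, not the daemon's actual logger:

```python
def log_print(message):
    # Build the newline into the message and suppress print()'s own end
    # character so the entire line is emitted in a single write; concurrent
    # writers then cannot interleave their output mid-line.
    print(message + "\n", end="", flush=True)
```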
6ed4efad33 Add new network.stats key to nodes 2023-12-21 12:48:48 -05:00
39f9f3640c Rename health metrics and add resource metrics 2023-12-21 09:40:49 -05:00
c64e888d30 Fix incorrect cast of None 2023-12-14 16:00:53 -05:00
f1249452e5 Fix bug if no nodes are present 2023-12-14 15:32:18 -05:00
0a93f526e0 Bump version to 0.9.86 2023-12-14 14:46:29 -05:00
7c9512fb22 Fix broken config file in API migration script 2023-12-14 14:45:58 -05:00
e88b97f3a9 Print fenced state in red 2023-12-13 15:02:18 -05:00
709c9cb73e Pause pvchealthd startup until node daemon is run
If the health daemon starts too soon during a node bootup, it will
generate tons of erroneous faults while the node starts up.
Adds a conditional wait for the current node daemon to be in "run"
state before the health daemon really starts up.
2023-12-13 14:53:54 -05:00
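The conditional wait described here amounts to polling the node daemon's state key until it reads "run" before letting the health daemon proceed. A hedged sketch, assuming a `zkhandler.read()` interface and a hypothetical key path (the real PVC schema key differs):

```python
import time

def wait_for_node_run_state(zkhandler, node_name, poll_interval=5):
    # Block pvchealthd startup until the node daemon reports "run", so
    # health checks do not fire erroneous faults during node bootup.
    # ("node.state.daemon", node_name) is an illustrative key path only.
    while zkhandler.read(("node.state.daemon", node_name)) != "run":
        time.sleep(poll_interval)
```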
f41c5176be Ensure health value is an int properly 2023-12-13 14:34:02 -05:00
38e43b46c3 Update health detail messages format 2023-12-13 03:17:47 -05:00
ed9c37982a Move metric collection into daemon library 2023-12-11 19:20:30 -05:00
0f24184b78 Explicitly clear resources of fenced node
This actually solves the bug originally "fixed" in
5f1432ccdd without breaking VM resource
allocations for working nodes.
2023-12-11 12:14:56 -05:00
1ba37fe33d Restore VM resource allocation location
Commit 5f1432ccdd changed where these
happen due to a bug after fencing. However this completely broke node
resource reporting as only the final instance will be queried here.

Revert this change and look further into the original bug.
2023-12-11 11:52:59 -05:00
1a05077b10 Fix missing fstring 2023-12-11 11:29:49 -05:00
57c28376a6 Port one final Ceph function to read_many 2023-12-11 10:25:36 -05:00
e781d742e6 Fix bug with volume and snapshot listing 2023-12-11 10:21:46 -05:00
6c6d1508a1 Add VNC info to screenshots 2023-12-11 03:40:49 -05:00
741dafb26b Port VM functions to read_many 2023-12-11 03:34:36 -05:00
032d3ebf18 Remove debug output from image 2023-12-11 03:23:10 -05:00
5d9e83e8ed Fix output bugs in VM information 2023-12-11 03:04:46 -05:00
ad0bd8649f Finish missing sentence 2023-12-11 02:39:39 -05:00
9b5e53e4b6 Add Grafana dashboard screenshot 2023-12-11 00:39:24 -05:00
9617660342 Update Prometheus Grafana dashboard 2023-12-11 00:23:08 -05:00
ab0a1e0946 Update and streamline README and update images 2023-12-10 23:57:01 -05:00
7c116b2fbc Ensure node health value is an int 2023-12-10 23:56:50 -05:00
1023c55087 Fix bug in VM state list 2023-12-10 23:44:01 -05:00
9235187c6f Port Ceph functions to read_many
Only ports getOSDInformation, as all the others feature three or fewer
reads, which is acceptable sequentially.
2023-12-10 22:24:38 -05:00
0c94f1b4f8 Port Network functions to read_many 2023-12-10 22:19:21 -05:00
44a4f0e1f7 Use new info detail output instead of new lists
Avoids multiple additional ZK calls by using data that is now in the
status detail output.
2023-12-10 22:19:09 -05:00
5d53a3e529 Add state and faults detail to cluster information
We already parse this information out anyways, so might as well add it
to the API output JSON. This can be leveraged by the Prometheus endpoint
as well to avoid duplicate listings.
2023-12-10 17:29:32 -05:00
35e22cb50f Simplify cluster status handling
This significantly simplifies cluster state handling by removing most of
the superfluous get_list() calls, replacing them with basic child reads
since most of them are just for a count anyways. The ones that require
states simplify this down to a child read plus direct reads for the
exact items required while leveraging the new read_many() function.
2023-12-10 17:05:46 -05:00
a3171b666b Split node health into separate function 2023-12-10 16:52:10 -05:00
48e41d7b05 Port Faults getFault and getAllFaults to read_many 2023-12-10 16:05:16 -05:00
d6aecf195e Port Node getNodeInformation to read_many 2023-12-10 15:53:28 -05:00
9329784010 Implement async ZK read function
Adds a function, "read_many", which can take in multiple ZK keys and
return the values from all of them, using asyncio to avoid reading
sequentially.

Initial tests show a marked improvement in read performance of multiple
read()-heavy functions (e.g. "get_list()" functions) with this method.
2023-12-10 15:35:40 -05:00
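The idea behind `read_many` can be sketched as follows: dispatch each blocking ZK `read()` onto the event loop's thread pool and gather the results in key order, rather than issuing the reads one after another. This is an illustrative stand-in, not PVC's actual `zkhandler` implementation:

```python
import asyncio

class ZKReader:
    """Illustrative stand-in for a ZK client whose read() blocks."""
    def __init__(self, store):
        self.store = store
    def read(self, key):
        return self.store.get(key)

def read_many(client, keys):
    # Run every blocking read() concurrently in the default thread-pool
    # executor; gather() preserves the order of the input keys.
    async def _read_many():
        loop = asyncio.get_running_loop()
        tasks = [loop.run_in_executor(None, client.read, key) for key in keys]
        return await asyncio.gather(*tasks)
    return asyncio.run(_read_many())
```

With many small reads per call (as in `get_list()`-style functions), the wall-clock cost approaches that of the slowest single read instead of the sum of all of them.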
9dc5097dbc Bump version to 0.9.85 2023-12-10 01:00:33 -05:00
5776cb3a09 Remove Prometheus client dependencies
We don't actually use this (yet!) so remove the dependency for now.
2023-12-10 00:58:09 -05:00
53d632f283 Fix bug in example PVC Grafana dashboard 2023-12-10 00:50:05 -05:00
7bc0760b78 Add time to "starting keepalive" message
Matches the pvchealthd output and provides a useful message detail to
this otherwise contextless message.
2023-12-10 00:40:32 -05:00
55 changed files with 8979 additions and 3032 deletions

View File

@@ -8,7 +8,7 @@
ignore = W503, E501, F403, F405
extend-ignore = E203
# We exclude the Debian, migrations, and provisioner examples
exclude = debian,api-daemon/migrations/versions,api-daemon/provisioner/examples,node-daemon/monitoring
exclude = debian,monitoring,api-daemon/migrations/versions,api-daemon/provisioner/examples
# Set the max line length to 88 for Black
max-line-length = 88

View File

@@ -1 +1 @@
0.9.84
0.9.86

View File

@@ -1,5 +1,20 @@
## PVC Changelog
###### [v0.9.86](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.86)
* [API Daemon] Significantly improves the performance of several commands via async Zookeeper calls and removal of superfluous backend calls.
* [Docs] Improves the project README and updates screenshot images to show the current output and more functionality.
* [API Daemon/CLI] Corrects some bugs in VM metainformation output.
* [Node Daemon] Fixes resource reporting bugs from 0.9.81 and properly clears node resource numbers on a fence.
* [Health Daemon] Adds a wait during pvchealthd startup until the node is in run state, to avoid erroneous faults during node bootup.
* [API Daemon] Fixes an incorrect reference to legacy pvcapid.yaml file in migration script.
###### [v0.9.85](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.85)
* [Packaging] Fixes a dependency bug introduced in 0.9.84
* [Node Daemon] Fixes an output bug during keepalives
* [Node Daemon] Fixes a bug in the example Prometheus Grafana dashboard
###### [v0.9.84](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.84)
**Breaking Changes:** This release features a major reconfiguration of how monitoring and reporting of cluster health works. Node health plugins now report "faults", as do several other issues which were previously checked for manually in the "cluster" daemon library for the "/status" endpoint; these checks now live within the Health daemon. Faults are persistent: each given identifier is triggered once, and subsequent triggers simply update its "last reported" time. An additional set of API endpoints and commands is added to manage these faults, either by "ack"(nowledging) them (keeping the alert around to be further updated but setting its health delta to 0%) or "delete"ing them (completely removing the fault unless it retriggers), applied individually, to multiple faults at once (from the CLI), or to all. Cluster health reporting is now based entirely on these faults, and the default interval for health checks is reduced to 15 seconds to accommodate this. In addition, Prometheus metrics have been added, along with an example Grafana dashboard, for the PVC cluster itself, as well as a proxy to the Ceph cluster metrics. This release also fixes some bugs in the VM provisioner that were introduced in 0.9.83; these fixes require a **reimport or reconfiguration of any provisioner scripts**; reference the updated examples for details.

View File

@@ -1,5 +1,5 @@
<p align="center">
<img alt="Logo banner" src="docs/images/pvc_logo_black.png"/>
<img alt="Logo banner" src="images/pvc_logo_black.png"/>
<br/><br/>
<a href="https://github.com/parallelvirtualcluster/pvc"><img alt="License" src="https://img.shields.io/github/license/parallelvirtualcluster/pvc"/></a>
<a href="https://github.com/psf/black"><img alt="Code style: Black" src="https://img.shields.io/badge/code%20style-black-000000.svg"/></a>
@@ -19,41 +19,66 @@ As a consequence of its features, PVC makes administrating very high-uptime VMs
PVC also features an optional, fully customizable VM provisioning framework, designed to automate and simplify VM deployments using custom provisioning profiles, scripts, and CloudInit userdata API support.
Installation of PVC is accomplished by two main components: a [Node installer ISO](https://github.com/parallelvirtualcluster/pvc-installer) which creates on-demand installer ISOs, and an [Ansible role framework](https://github.com/parallelvirtualcluster/pvc-ansible) to configure, bootstrap, and administrate the nodes. Installation can also be fully automated with a companion [cluster bootstrapping system](https://github.com/parallelvirtualcluster/pvc-bootstrap). Once up, the cluster is managed via an HTTP REST API, accessible via a Python Click CLI client or WebUI.
Installation of PVC is accomplished by two main components: a [Node installer ISO](https://github.com/parallelvirtualcluster/pvc-installer) which creates on-demand installer ISOs, and an [Ansible role framework](https://github.com/parallelvirtualcluster/pvc-ansible) to configure, bootstrap, and administrate the nodes. Installation can also be fully automated with a companion [cluster bootstrapping system](https://github.com/parallelvirtualcluster/pvc-bootstrap). Once up, the cluster is managed via an HTTP REST API, accessible via a Python Click CLI client ~~or WebUI~~ (eventually).
Just give it physical servers, and it will run your VMs without you having to think about it, all in just an hour or two of setup time.
## What is it based on?
The core node and API daemons, as well as the CLI API client, are written in Python 3 and are fully Free Software (GNU GPL v3). In addition to these, PVC makes use of the following software tools to provide a holistic hyperconverged infrastructure solution:
* Debian GNU/Linux as the base OS.
* Linux KVM, QEMU, and Libvirt for VM management.
* Linux `ip`, FRRouting, NFTables, DNSMasq, and PowerDNS for network management.
* Ceph for storage management.
* Apache Zookeeper for the primary cluster state database.
* Patroni PostgreSQL manager for the secondary relation databases (DNS aggregation, Provisioner configuration).
## Getting Started
To get started with PVC, please see the [About](https://docs.parallelvirtualcluster.org/en/latest/about/) page for general information about the project, and the [Getting Started](https://docs.parallelvirtualcluster.org/en/latest/getting-started/) page for details on configuring your first cluster.
To get started with PVC, please see the [About](https://docs.parallelvirtualcluster.org/en/latest/about-pvc/) page for general information about the project, and the [Getting Started](https://docs.parallelvirtualcluster.org/en/latest/deployment/getting-started/) page for details on configuring your first cluster.
## Changelog
View the changelog in [CHANGELOG.md](CHANGELOG.md).
View the changelog in [CHANGELOG.md](CHANGELOG.md). **Please note that any breaking changes are announced here; ensure you read the changelog before upgrading!**
## Screenshots
While PVC's API and internals aren't very screenshot-worthy, here is some example output of the CLI tool.
These screenshots show some of the available functionality of the PVC system and CLI as of PVC v0.9.85.
<p><img alt="Node listing" src="docs/images/pvc-nodes.png"/><br/><i>Listing the nodes in a cluster</i></p>
<p><img alt="0. Integrated help" src="images/0-integrated-help.png"/><br/>
<i>The CLI features an integrated, fully-featured help system to show details about every possible command.</i>
</p>
<p><img alt="Network listing" src="docs/images/pvc-networks.png"/><br/><i>Listing the networks in a cluster, showing 3 bridged and 1 IPv4-only managed networks</i></p>
<p><img alt="1. Connection management" src="images/1-connection-management.png"/><br/>
<i>A single CLI instance can manage multiple clusters, including a quick detail view, and will default to a "local" connection if an "/etc/pvc/pvc.conf" file is found; sensitive API keys are hidden by default.</i>
</p>
<p><img alt="VM listing and migration" src="docs/images/pvc-migration.png"/><br/><i>Listing a limited set of VMs and migrating one with status updates</i></p>
<p><img alt="2. Cluster details and output formats" src="images/2-cluster-details-and-output-formats.png"/><br/>
<i>PVC can show the key details of your cluster at a glance, including health, persistent fault events, and key resources; the CLI can output both in pretty human format and JSON for easier machine parsing in scripts.</i>
</p>
<p><img alt="Node logs" src="docs/images/pvc-nodelog.png"/><br/><i>Viewing the logs of a node (keepalives and VM [un]migration)</i></p>
<p><img alt="3. Node information" src="images/3-node-information.png"/><br/>
<i>PVC can show details about the nodes in the cluster, including their live health and resource utilization.</i>
</p>
<p><img alt="4. VM information" src="images/4-vm-information.png"/><br/>
<i>PVC can show details about the VMs in the cluster, including their state, resource allocations, current hosting node, and metadata.</i>
</p>
<p><img alt="5. VM details" src="images/5-vm-details.png"/><br/>
<i>In addition to the above basic details, PVC can also show extensive information about a running VM's devices and other resource utilization.</i>
</p>
<p><img alt="6. Network information" src="images/6-network-information.png"/><br/>
<i>PVC has two major client network types, and ensures a consistent configuration of client networks across the entire cluster; managed networks can feature DHCP, DNS, firewall, and other functionality including DHCP reservations.</i>
</p>
<p><img alt="7. Storage information" src="images/7-storage-information.png"/><br/>
<i>PVC provides a convenient abstracted view of the underlying Ceph system and can manage all core aspects of it.</i>
</p>
<p><img alt="8. VM and node logs" src="images/8-vm-and-node-logs.png"/><br/>
<i>PVC can display logs from VM serial consoles (if properly configured) and nodes in-client to facilitate quick troubleshooting.</i>
</p>
<p><img alt="9. VM and worker tasks" src="images/9-vm-and-worker-tasks.png"/><br/>
<i>PVC provides full VM lifecycle management, as well as long-running worker-based commands (in this example, clearing a VM's storage locks).</i>
</p>
<p><img alt="10. Provisioner" src="images/10-provisioner.png"/><br/>
<i>PVC features an extensively customizable and configurable VM provisioner system, including EC2-compatible CloudInit support, allowing you to define flexible VM profiles and provision new VMs with a single command.</i>
</p>
<p><img alt="11. Prometheus and Grafana dashboard" src="images/11-prometheus-grafana.png"/><br/>
<i>PVC features several monitoring integration examples under "node-daemon/monitoring", including CheckMK, Munin, and, most recently, Prometheus, including an example Grafana dashboard for cluster monitoring and alerting.</i>
</p>

View File

@@ -3,7 +3,7 @@
# Apply PVC database migrations
# Part of the Parallel Virtual Cluster (PVC) system
export PVC_CONFIG_FILE="/etc/pvc/pvcapid.yaml"
export PVC_CONFIG_FILE="/etc/pvc/pvc.conf"
if [[ ! -f ${PVC_CONFIG_FILE} ]]; then
echo "Create a configuration file at ${PVC_CONFIG_FILE} before upgrading the database."

View File

@@ -27,7 +27,7 @@ from distutils.util import strtobool as dustrtobool
import daemon_lib.config as cfg
# Daemon version
version = "0.9.84"
version = "0.9.86"
# API version
API_VERSION = 1.0

View File

@@ -640,14 +640,15 @@ class API_Metrics(Resource):
400:
description: Bad request
"""
cluster_output, cluster_retcode = api_helper.cluster_metrics()
health_output, health_retcode = api_helper.cluster_health_metrics()
resource_output, resource_retcode = api_helper.cluster_resource_metrics()
ceph_output, ceph_retcode = api_helper.ceph_metrics()
if cluster_retcode != 200 or ceph_retcode != 200:
if health_retcode != 200 or resource_retcode != 200 or ceph_retcode != 200:
output = "Error: Failed to obtain data"
retcode = 400
else:
output = cluster_output + ceph_output
output = health_output + resource_output + ceph_output
retcode = 200
response = flask.make_response(output, retcode)
@@ -658,11 +659,11 @@
api.add_resource(API_Metrics, "/metrics")
# /metrics/pvc
class API_Metrics_PVC(Resource):
# /metrics/health
class API_Metrics_Health(Resource):
def get(self):
"""
Return the current PVC cluster status in Prometheus-compatible metrics format
Return the current PVC cluster health status in Prometheus-compatible metrics format
Endpoint is unauthenticated to allow metrics exfiltration without having to deal
with the Prometheus compatibility later.
@@ -675,13 +676,13 @@ class API_Metrics_PVC(Resource):
400:
description: Bad request
"""
cluster_output, cluster_retcode = api_helper.cluster_metrics()
health_output, health_retcode = api_helper.cluster_health_metrics()
if cluster_retcode != 200:
if health_retcode != 200:
output = "Error: Failed to obtain data"
retcode = 400
else:
output = cluster_output
output = health_output
retcode = 200
response = flask.make_response(output, retcode)
@@ -689,7 +690,41 @@ class API_Metrics_PVC(Resource):
return response
api.add_resource(API_Metrics_PVC, "/metrics/pvc")
api.add_resource(API_Metrics_Health, "/metrics/health")
# /metrics/resource
class API_Metrics_Resource(Resource):
def get(self):
"""
Return the current PVC cluster resource utilizations in Prometheus-compatible metrics format
Endpoint is unauthenticated to allow metrics exfiltration without having to deal
with the Prometheus compatibility later.
---
tags:
- root
responses:
200:
description: OK
400:
description: Bad request
"""
resource_output, resource_retcode = api_helper.cluster_resource_metrics()
if resource_retcode != 200:
output = "Error: Failed to obtain data"
retcode = 400
else:
output = resource_output
retcode = 200
response = flask.make_response(output, retcode)
response.mimetype = "text/plain"
return response
api.add_resource(API_Metrics_Resource, "/metrics/resource")
# /metrics/ceph
@@ -1133,6 +1168,9 @@ class API_Node_Root(Resource):
provisioned:
type: integer
description: The total amount of RAM provisioned to all domains (regardless of state) on this node in MB
interfaces:
type: object
description: Details on speed, bytes, and packets per second of each node physical network interface
parameters:
- in: query
name: limit

View File

@@ -126,170 +126,32 @@ def cluster_maintenance(zkhandler, maint_state="false"):
#
@pvc_common.Profiler(config)
@ZKConnection(config)
def cluster_metrics(zkhandler):
def cluster_health_metrics(zkhandler):
"""
Format status data from cluster_status into Prometheus-compatible metrics
Get cluster-wide Prometheus metrics for health
"""
# Get general cluster information
status_retflag, status_data = pvc_cluster.get_info(zkhandler)
if not status_retflag:
return "Error: Status data threw error", 400
retflag, retdata = pvc_cluster.get_health_metrics(zkhandler)
if retflag:
retcode = 200
else:
retcode = 400
return retdata, retcode
faults_retflag, faults_data = pvc_faults.get_list(zkhandler)
if not faults_retflag:
return "Error: Faults data threw error", 400
node_retflag, node_data = pvc_node.get_list(zkhandler)
if not node_retflag:
return "Error: Node data threw error", 400
@pvc_common.Profiler(config)
@ZKConnection(config)
def cluster_resource_metrics(zkhandler):
"""
Get cluster-wide Prometheus metrics for resource utilization
"""
vm_retflag, vm_data = pvc_vm.get_list(zkhandler)
if not vm_retflag:
return "Error: VM data threw error", 400
osd_retflag, osd_data = pvc_ceph.get_list_osd(zkhandler)
if not osd_retflag:
return "Error: OSD data threw error", 400
output_lines = list()
output_lines.append("# HELP pvc_info PVC cluster information")
output_lines.append("# TYPE pvc_info gauge")
output_lines.append(
f"pvc_info{{primary_node=\"{status_data['primary_node']}\", version=\"{status_data['pvc_version']}\", upstream_ip=\"{status_data['upstream_ip']}\"}} 1"
)
output_lines.append("# HELP pvc_cluster_maintenance PVC cluster maintenance state")
output_lines.append("# TYPE pvc_cluster_maintenance gauge")
output_lines.append(
f"pvc_cluster_maintenance {1 if bool(strtobool(status_data['maintenance'])) else 0}"
)
output_lines.append("# HELP pvc_cluster_health PVC cluster health status")
output_lines.append("# TYPE pvc_cluster_health gauge")
output_lines.append(f"pvc_cluster_health {status_data['cluster_health']['health']}")
output_lines.append("# HELP pvc_cluster_faults PVC cluster new faults")
output_lines.append("# TYPE pvc_cluster_faults gauge")
fault_map = dict()
for fault_type in pvc_common.fault_state_combinations:
fault_map[fault_type] = 0
for fault in faults_data:
fault_map[fault["status"]] += 1
for fault_type in fault_map:
output_lines.append(
f'pvc_cluster_faults{{status="{fault_type}"}} {fault_map[fault_type]}'
)
# output_lines.append("# HELP pvc_cluster_faults PVC cluster health faults")
# output_lines.append("# TYPE pvc_cluster_faults gauge")
# for fault_msg in status_data["cluster_health"]["messages"]:
# output_lines.append(
# f"pvc_cluster_faults{{id=\"{fault_msg['id']}\", message=\"{fault_msg['text']}\"}} {fault_msg['health_delta']}"
# )
output_lines.append("# HELP pvc_node_health PVC cluster node health status")
output_lines.append("# TYPE pvc_node_health gauge")
for node in status_data["node_health"]:
if isinstance(status_data["node_health"][node]["health"], int):
output_lines.append(
f"pvc_node_health{{node=\"{node}\"}} {status_data['node_health'][node]['health']}"
)
output_lines.append("# HELP pvc_node_daemon_states PVC Node daemon state counts")
output_lines.append("# TYPE pvc_node_daemon_states gauge")
node_daemon_state_map = dict()
for state in set([s.split(",")[0] for s in pvc_common.node_state_combinations]):
node_daemon_state_map[state] = 0
for node in node_data:
node_daemon_state_map[node["daemon_state"]] += 1
for state in node_daemon_state_map:
output_lines.append(
f'pvc_node_daemon_states{{state="{state}"}} {node_daemon_state_map[state]}'
)
output_lines.append("# HELP pvc_node_domain_states PVC Node domain state counts")
output_lines.append("# TYPE pvc_node_domain_states gauge")
node_domain_state_map = dict()
for state in set([s.split(",")[1] for s in pvc_common.node_state_combinations]):
node_domain_state_map[state] = 0
for node in node_data:
node_domain_state_map[node["domain_state"]] += 1
for state in node_domain_state_map:
output_lines.append(
f'pvc_node_domain_states{{state="{state}"}} {node_domain_state_map[state]}'
)
output_lines.append("# HELP pvc_vm_states PVC VM state counts")
output_lines.append("# TYPE pvc_vm_states gauge")
vm_state_map = dict()
for state in set(pvc_common.vm_state_combinations):
vm_state_map[state] = 0
for vm in vm_data:
vm_state_map[vm["state"]] += 1
for state in vm_state_map:
output_lines.append(f'pvc_vm_states{{state="{state}"}} {vm_state_map[state]}')
output_lines.append("# HELP pvc_osd_up_states PVC OSD up state counts")
output_lines.append("# TYPE pvc_osd_up_states gauge")
osd_up_state_map = dict()
for state in set([s.split(",")[0] for s in pvc_common.ceph_osd_state_combinations]):
osd_up_state_map[state] = 0
for osd in osd_data:
if osd["stats"]["up"] > 0:
osd_up_state_map["up"] += 1
else:
osd_up_state_map["down"] += 1
for state in osd_up_state_map:
output_lines.append(
f'pvc_osd_up_states{{state="{state}"}} {osd_up_state_map[state]}'
)
output_lines.append("# HELP pvc_osd_in_states PVC OSD in state counts")
output_lines.append("# TYPE pvc_osd_in_states gauge")
osd_in_state_map = dict()
for state in set([s.split(",")[1] for s in pvc_common.ceph_osd_state_combinations]):
osd_in_state_map[state] = 0
for osd in osd_data:
if osd["stats"]["in"] > 0:
osd_in_state_map["in"] += 1
else:
osd_in_state_map["out"] += 1
for state in osd_in_state_map:
output_lines.append(
f'pvc_osd_in_states{{state="{state}"}} {osd_in_state_map[state]}'
)
output_lines.append("# HELP pvc_nodes PVC Node count")
output_lines.append("# TYPE pvc_nodes gauge")
output_lines.append(f"pvc_nodes {status_data['nodes']['total']}")
output_lines.append("# HELP pvc_vms PVC VM count")
output_lines.append("# TYPE pvc_vms gauge")
output_lines.append(f"pvc_vms {status_data['vms']['total']}")
output_lines.append("# HELP pvc_osds PVC OSD count")
output_lines.append("# TYPE pvc_osds gauge")
output_lines.append(f"pvc_osds {status_data['osds']['total']}")
output_lines.append("# HELP pvc_networks PVC Network count")
output_lines.append("# TYPE pvc_networks gauge")
output_lines.append(f"pvc_networks {status_data['networks']}")
output_lines.append("# HELP pvc_pools PVC Storage Pool count")
output_lines.append("# TYPE pvc_pools gauge")
output_lines.append(f"pvc_pools {status_data['pools']}")
output_lines.append("# HELP pvc_volumes PVC Storage Volume count")
output_lines.append("# TYPE pvc_volumes gauge")
output_lines.append(f"pvc_volumes {status_data['volumes']}")
output_lines.append("# HELP pvc_snapshots PVC Storage Snapshot count")
output_lines.append("# TYPE pvc_snapshots gauge")
output_lines.append(f"pvc_snapshots {status_data['snapshots']}")
return "\n".join(output_lines) + "\n", 200
retflag, retdata = pvc_cluster.get_resource_metrics(zkhandler)
if retflag:
retcode = 200
else:
retcode = 400
return retdata, retcode
@pvc_common.Profiler(config)

View File

@@ -249,6 +249,8 @@ def getOutputColours(node_information):
daemon_state_colour = ansiprint.yellow()
elif node_information["daemon_state"] == "dead":
daemon_state_colour = ansiprint.red() + ansiprint.bold()
elif node_information["daemon_state"] == "fenced":
daemon_state_colour = ansiprint.red()
else:
daemon_state_colour = ansiprint.blue()

View File

@@ -1659,24 +1659,26 @@ def format_info(config, domain_information, long_output):
)
if not domain_information.get("node_selector"):
formatted_node_selector = "False"
formatted_node_selector = "Default"
else:
formatted_node_selector = domain_information["node_selector"]
formatted_node_selector = str(domain_information["node_selector"]).title()
if not domain_information.get("node_limit"):
formatted_node_limit = "False"
formatted_node_limit = "Any"
else:
formatted_node_limit = ", ".join(domain_information["node_limit"])
if not domain_information.get("node_autostart"):
autostart_colour = ansiprint.blue()
formatted_node_autostart = "False"
else:
formatted_node_autostart = domain_information["node_autostart"]
autostart_colour = ansiprint.green()
formatted_node_autostart = "True"
if not domain_information.get("migration_method"):
formatted_migration_method = "any"
formatted_migration_method = "Any"
else:
formatted_migration_method = domain_information["migration_method"]
formatted_migration_method = str(domain_information["migration_method"]).title()
ainformation.append(
"{}Migration selector:{} {}".format(
@@ -1689,8 +1691,12 @@ def format_info(config, domain_information, long_output):
)
)
ainformation.append(
"{}Autostart:{} {}".format(
ansiprint.purple(), ansiprint.end(), formatted_node_autostart
"{}Autostart:{} {}{}{}".format(
ansiprint.purple(),
ansiprint.end(),
autostart_colour,
formatted_node_autostart,
ansiprint.end(),
)
)
ainformation.append(
@@ -1736,13 +1742,17 @@ def format_info(config, domain_information, long_output):
domain_information["tags"], key=lambda t: t["type"] + t["name"]
):
ainformation.append(
" {tags_name: <{tags_name_length}} {tags_type: <{tags_type_length}} {tags_protected: <{tags_protected_length}}".format(
" {tags_name: <{tags_name_length}} {tags_type: <{tags_type_length}} {tags_protected_colour}{tags_protected: <{tags_protected_length}}{end}".format(
tags_name_length=tags_name_length,
tags_type_length=tags_type_length,
tags_protected_length=tags_protected_length,
tags_name=tag["name"],
tags_type=tag["type"],
tags_protected=str(tag["protected"]),
tags_protected_colour=ansiprint.green()
if tag["protected"]
else ansiprint.blue(),
end=ansiprint.end(),
)
)
else:

View File

@@ -2,7 +2,7 @@ from setuptools import setup
setup(
name="pvc",
version="0.9.84",
version="0.9.86",
packages=["pvc.cli", "pvc.lib"],
install_requires=[
"Click",

View File

@@ -123,13 +123,13 @@ def format_bytes_tohuman(databytes):
def format_bytes_fromhuman(datahuman):
if not re.search(r"[A-Za-z]+", datahuman):
dataunit = "B"
datasize = int(datahuman)
datasize = float(datahuman)
else:
dataunit = str(re.match(r"[0-9]+([A-Za-z])[iBb]*", datahuman).group(1))
datasize = int(re.match(r"([0-9]+)[A-Za-z]+", datahuman).group(1))
dataunit = str(re.match(r"[0-9\.]+([A-Za-z])[iBb]*", datahuman).group(1))
datasize = float(re.match(r"([0-9\.]+)[A-Za-z]+", datahuman).group(1))
if byte_unit_matrix.get(dataunit):
databytes = datasize * byte_unit_matrix[dataunit]
if byte_unit_matrix.get(dataunit.upper()):
databytes = int(datasize * byte_unit_matrix[dataunit.upper()])
return databytes
else:
return None
@@ -155,7 +155,7 @@ def format_ops_fromhuman(datahuman):
# Trim off human-readable character
dataunit = datahuman[-1]
datasize = int(datahuman[:-1])
dataops = datasize * ops_unit_matrix[dataunit]
dataops = datasize * ops_unit_matrix[dataunit.upper()]
return "{}".format(dataops)
@@ -215,14 +215,26 @@ def getClusterOSDList(zkhandler):
def getOSDInformation(zkhandler, osd_id):
# Get the devices
osd_fsid = zkhandler.read(("osd.ofsid", osd_id))
osd_node = zkhandler.read(("osd.node", osd_id))
osd_device = zkhandler.read(("osd.device", osd_id))
osd_is_split = bool(strtobool(zkhandler.read(("osd.is_split", osd_id))))
osd_db_device = zkhandler.read(("osd.db_device", osd_id))
(
osd_fsid,
osd_node,
osd_device,
_osd_is_split,
osd_db_device,
osd_stats_raw,
) = zkhandler.read_many(
[
("osd.ofsid", osd_id),
("osd.node", osd_id),
("osd.device", osd_id),
("osd.is_split", osd_id),
("osd.db_device", osd_id),
("osd.stats", osd_id),
]
)
osd_is_split = bool(strtobool(_osd_is_split))
# Parse the stats data
osd_stats_raw = zkhandler.read(("osd.stats", osd_id))
osd_stats = dict(json.loads(osd_stats_raw))
osd_information = {
@@ -308,13 +320,18 @@ def get_list_osd(zkhandler, limit=None, is_fuzzy=True):
#
def getPoolInformation(zkhandler, pool):
# Parse the stats data
pool_stats_raw = zkhandler.read(("pool.stats", pool))
(pool_stats_raw, tier, pgs,) = zkhandler.read_many(
[
("pool.stats", pool),
("pool.tier", pool),
("pool.pgs", pool),
]
)
pool_stats = dict(json.loads(pool_stats_raw))
volume_count = len(getCephVolumes(zkhandler, pool))
tier = zkhandler.read(("pool.tier", pool))
if tier is None:
tier = "default"
pgs = zkhandler.read(("pool.pgs", pool))
pool_information = {
"name": pool,

File diff suppressed because it is too large

View File

@@ -401,13 +401,23 @@ def getDomainTags(zkhandler, dom_uuid):
     """
     tags = list()
-    for tag in zkhandler.children(("domain.meta.tags", dom_uuid)):
-        tag_type = zkhandler.read(("domain.meta.tags", dom_uuid, "tag.type", tag))
-        protected = bool(
-            strtobool(
-                zkhandler.read(("domain.meta.tags", dom_uuid, "tag.protected", tag))
-            )
-        )
+    all_tags = zkhandler.children(("domain.meta.tags", dom_uuid))
+
+    tag_reads = list()
+    for tag in all_tags:
+        tag_reads += [
+            ("domain.meta.tags", dom_uuid, "tag.type", tag),
+            ("domain.meta.tags", dom_uuid, "tag.protected", tag),
+        ]
+    all_tag_data = zkhandler.read_many(tag_reads)
+
+    for tidx, tag in enumerate(all_tags):
+        # Split the large list of return values by the IDX of this tag
+        # Each tag result is 2 fields long
+        pos_start = tidx * 2
+        pos_end = tidx * 2 + 2
+        tag_type, protected = tuple(all_tag_data[pos_start:pos_end])
+        protected = bool(strtobool(protected))
         tags.append({"name": tag, "type": tag_type, "protected": protected})

     return tags
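This fixed-stride slicing recurs throughout the changeset (2 fields per tag here, 6 per fault, 4 per monitoring plugin): one flat batched-read result is carved into per-item records by index. A standalone sketch with dummy data in place of the Zookeeper reads:

```python
# Simulated flat result of a batched read: 2 fields (type, protected) per tag
all_tags = ["web", "db"]
all_tag_data = ["user", "True", "system", "False"]

FIELDS_PER_TAG = 2

tags = list()
for tidx, tag in enumerate(all_tags):
    # Each tag's fields occupy a contiguous slice of the flat result list
    pos_start = tidx * FIELDS_PER_TAG
    pos_end = pos_start + FIELDS_PER_TAG
    tag_type, protected = tuple(all_tag_data[pos_start:pos_end])
    tags.append({"name": tag, "type": tag_type, "protected": protected == "True"})

print(tags)
```

The trade-off is one round of parallel reads instead of two sequential reads per tag, at the cost of the results arriving as an untyped flat list that must be sliced carefully.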
@@ -422,19 +432,34 @@ def getDomainMetadata(zkhandler, dom_uuid):
     The UUID must be validated before calling this function!
     """
-    domain_node_limit = zkhandler.read(("domain.meta.node_limit", dom_uuid))
-    domain_node_selector = zkhandler.read(("domain.meta.node_selector", dom_uuid))
-    domain_node_autostart = zkhandler.read(("domain.meta.autostart", dom_uuid))
-    domain_migration_method = zkhandler.read(("domain.meta.migrate_method", dom_uuid))
+    (
+        domain_node_limit,
+        domain_node_selector,
+        domain_node_autostart,
+        domain_migration_method,
+    ) = zkhandler.read_many(
+        [
+            ("domain.meta.node_limit", dom_uuid),
+            ("domain.meta.node_selector", dom_uuid),
+            ("domain.meta.autostart", dom_uuid),
+            ("domain.meta.migrate_method", dom_uuid),
+        ]
+    )

     if not domain_node_limit:
         domain_node_limit = None
     else:
         domain_node_limit = domain_node_limit.split(",")

     if not domain_node_selector or domain_node_selector == "none":
         domain_node_selector = None

     if not domain_node_autostart:
         domain_node_autostart = None

     if not domain_migration_method or domain_migration_method == "none":
         domain_migration_method = None

     return (
         domain_node_limit,
         domain_node_selector,
@@ -451,10 +476,25 @@ def getInformationFromXML(zkhandler, uuid):
     Gather information about a VM from the Libvirt XML configuration in the Zookeeper database
     and return a dict() containing it.
     """
-    domain_state = zkhandler.read(("domain.state", uuid))
-    domain_node = zkhandler.read(("domain.node", uuid))
-    domain_lastnode = zkhandler.read(("domain.last_node", uuid))
-    domain_failedreason = zkhandler.read(("domain.failed_reason", uuid))
+    (
+        domain_state,
+        domain_node,
+        domain_lastnode,
+        domain_failedreason,
+        domain_profile,
+        domain_vnc,
+        stats_data,
+    ) = zkhandler.read_many(
+        [
+            ("domain.state", uuid),
+            ("domain.node", uuid),
+            ("domain.last_node", uuid),
+            ("domain.failed_reason", uuid),
+            ("domain.profile", uuid),
+            ("domain.console.vnc", uuid),
+            ("domain.stats", uuid),
+        ]
+    )

     (
         domain_node_limit,
@@ -462,19 +502,17 @@ def getInformationFromXML(zkhandler, uuid):
         domain_node_autostart,
         domain_migration_method,
     ) = getDomainMetadata(zkhandler, uuid)
-    domain_tags = getDomainTags(zkhandler, uuid)
-    domain_profile = zkhandler.read(("domain.profile", uuid))
-    domain_vnc = zkhandler.read(("domain.console.vnc", uuid))
+    domain_tags = getDomainTags(zkhandler, uuid)

     if domain_vnc:
         domain_vnc_listen, domain_vnc_port = domain_vnc.split(":")
     else:
-        domain_vnc_listen = "None"
-        domain_vnc_port = "None"
+        domain_vnc_listen = None
+        domain_vnc_port = None

     parsed_xml = getDomainXML(zkhandler, uuid)

-    stats_data = zkhandler.read(("domain.stats", uuid))
     if stats_data is not None:
         try:
             stats_data = loads(stats_data)
@@ -491,6 +529,7 @@ def getInformationFromXML(zkhandler, uuid):
         domain_vcpu,
         domain_vcputopo,
     ) = getDomainMainDetails(parsed_xml)
+    domain_networks = getDomainNetworks(parsed_xml, stats_data)

     (


@@ -95,12 +95,24 @@ def getFault(zkhandler, fault_id):
         return None

     fault_id = fault_id

-    fault_last_time = zkhandler.read(("faults.last_time", fault_id))
-    fault_first_time = zkhandler.read(("faults.first_time", fault_id))
-    fault_ack_time = zkhandler.read(("faults.ack_time", fault_id))
-    fault_status = zkhandler.read(("faults.status", fault_id))
-    fault_delta = int(zkhandler.read(("faults.delta", fault_id)))
-    fault_message = zkhandler.read(("faults.message", fault_id))
+    (
+        fault_last_time,
+        fault_first_time,
+        fault_ack_time,
+        fault_status,
+        fault_delta,
+        fault_message,
+    ) = zkhandler.read_many(
+        [
+            ("faults.last_time", fault_id),
+            ("faults.first_time", fault_id),
+            ("faults.ack_time", fault_id),
+            ("faults.status", fault_id),
+            ("faults.delta", fault_id),
+            ("faults.message", fault_id),
+        ]
+    )

     # Acknowledged faults have a delta of 0
     if fault_ack_time != "":
@@ -112,7 +124,7 @@ def getFault(zkhandler, fault_id):
         "first_reported": fault_first_time,
         "acknowledged_at": fault_ack_time,
         "status": fault_status,
-        "health_delta": fault_delta,
+        "health_delta": int(fault_delta),
         "message": fault_message,
     }
@@ -126,11 +138,42 @@ def getAllFaults(zkhandler, sort_key="last_reported"):
     all_faults = zkhandler.children(("base.faults"))

-    faults_detail = list()
-
+    faults_reads = list()
     for fault_id in all_faults:
-        fault_detail = getFault(zkhandler, fault_id)
-        faults_detail.append(fault_detail)
+        faults_reads += [
+            ("faults.last_time", fault_id),
+            ("faults.first_time", fault_id),
+            ("faults.ack_time", fault_id),
+            ("faults.status", fault_id),
+            ("faults.delta", fault_id),
+            ("faults.message", fault_id),
+        ]
+    all_faults_data = list(zkhandler.read_many(faults_reads))
+
+    faults_detail = list()
+    for fidx, fault_id in enumerate(all_faults):
+        # Split the large list of return values by the IDX of this fault
+        # Each fault result is 6 fields long
+        pos_start = fidx * 6
+        pos_end = fidx * 6 + 6
+        (
+            fault_last_time,
+            fault_first_time,
+            fault_ack_time,
+            fault_status,
+            fault_delta,
+            fault_message,
+        ) = tuple(all_faults_data[pos_start:pos_end])
+        fault_output = {
+            "id": fault_id,
+            "last_reported": fault_last_time,
+            "first_reported": fault_first_time,
+            "acknowledged_at": fault_ack_time,
+            "status": fault_status,
+            "health_delta": int(fault_delta),
+            "message": fault_message,
+        }
+        faults_detail.append(fault_output)

     sorted_faults = sorted(faults_detail, key=lambda x: x[sort_key])
     # Sort newest-first for time-based sorts


@@ -146,7 +146,7 @@ class Logger(object):
         if self.config["stdout_logging"]:
             # Assemble output string
             output = colour + prompt + endc + date + prefix + message
-            print(output)
+            print(output + "\n", end="")

         # Log to file
         if self.config["file_logging"]:

@@ -0,0 +1 @@
{"version": "12", "root": "", "base": {"root": "", "schema": "/schema", "schema.version": "/schema/version", "config": "/config", "config.maintenance": "/config/maintenance", "config.primary_node": "/config/primary_node", "config.primary_node.sync_lock": "/config/primary_node/sync_lock", "config.upstream_ip": "/config/upstream_ip", "config.migration_target_selector": "/config/migration_target_selector", "logs": "/logs", "faults": "/faults", "node": "/nodes", "domain": "/domains", "network": "/networks", "storage": "/ceph", "storage.health": "/ceph/health", "storage.util": "/ceph/util", "osd": "/ceph/osds", "pool": "/ceph/pools", "volume": "/ceph/volumes", "snapshot": "/ceph/snapshots"}, "logs": {"node": "", "messages": "/messages"}, "faults": {"id": "", "last_time": "/last_time", "first_time": "/first_time", "ack_time": "/ack_time", "status": "/status", "delta": "/delta", "message": "/message"}, "node": {"name": "", "keepalive": "/keepalive", "mode": "/daemonmode", "data.active_schema": "/activeschema", "data.latest_schema": "/latestschema", "data.static": "/staticdata", "data.pvc_version": "/pvcversion", "running_domains": "/runningdomains", "count.provisioned_domains": "/domainscount", "count.networks": "/networkscount", "state.daemon": "/daemonstate", "state.router": "/routerstate", "state.domain": "/domainstate", "cpu.load": "/cpuload", "vcpu.allocated": "/vcpualloc", "memory.total": "/memtotal", "memory.used": "/memused", "memory.free": "/memfree", "memory.allocated": "/memalloc", "memory.provisioned": "/memprov", "ipmi.hostname": "/ipmihostname", "ipmi.username": "/ipmiusername", "ipmi.password": "/ipmipassword", "sriov": "/sriov", "sriov.pf": "/sriov/pf", "sriov.vf": "/sriov/vf", "monitoring.plugins": "/monitoring_plugins", "monitoring.data": "/monitoring_data", "monitoring.health": "/monitoring_health", "network.stats": "/network_stats"}, "monitoring_plugin": {"name": "", "last_run": "/last_run", "health_delta": "/health_delta", "message": "/message", 
"data": "/data", "runtime": "/runtime"}, "sriov_pf": {"phy": "", "mtu": "/mtu", "vfcount": "/vfcount"}, "sriov_vf": {"phy": "", "pf": "/pf", "mtu": "/mtu", "mac": "/mac", "phy_mac": "/phy_mac", "config": "/config", "config.vlan_id": "/config/vlan_id", "config.vlan_qos": "/config/vlan_qos", "config.tx_rate_min": "/config/tx_rate_min", "config.tx_rate_max": "/config/tx_rate_max", "config.spoof_check": "/config/spoof_check", "config.link_state": "/config/link_state", "config.trust": "/config/trust", "config.query_rss": "/config/query_rss", "pci": "/pci", "pci.domain": "/pci/domain", "pci.bus": "/pci/bus", "pci.slot": "/pci/slot", "pci.function": "/pci/function", "used": "/used", "used_by": "/used_by"}, "domain": {"name": "", "xml": "/xml", "state": "/state", "profile": "/profile", "stats": "/stats", "node": "/node", "last_node": "/lastnode", "failed_reason": "/failedreason", "storage.volumes": "/rbdlist", "console.log": "/consolelog", "console.vnc": "/vnc", "meta.autostart": "/node_autostart", "meta.migrate_method": "/migration_method", "meta.node_selector": "/node_selector", "meta.node_limit": "/node_limit", "meta.tags": "/tags", "migrate.sync_lock": "/migrate_sync_lock"}, "tag": {"name": "", "type": "/type", "protected": "/protected"}, "network": {"vni": "", "type": "/nettype", "mtu": "/mtu", "rule": "/firewall_rules", "rule.in": "/firewall_rules/in", "rule.out": "/firewall_rules/out", "nameservers": "/name_servers", "domain": "/domain", "reservation": "/dhcp4_reservations", "lease": "/dhcp4_leases", "ip4.gateway": "/ip4_gateway", "ip4.network": "/ip4_network", "ip4.dhcp": "/dhcp4_flag", "ip4.dhcp_start": "/dhcp4_start", "ip4.dhcp_end": "/dhcp4_end", "ip6.gateway": "/ip6_gateway", "ip6.network": "/ip6_network", "ip6.dhcp": "/dhcp6_flag"}, "reservation": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname"}, "lease": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname", "expiry": "/expiry", "client_id": "/clientid"}, "rule": {"description": "", "rule": "/rule", 
"order": "/order"}, "osd": {"id": "", "node": "/node", "device": "/device", "db_device": "/db_device", "fsid": "/fsid", "ofsid": "/fsid/osd", "cfsid": "/fsid/cluster", "lvm": "/lvm", "vg": "/lvm/vg", "lv": "/lvm/lv", "is_split": "/is_split", "stats": "/stats"}, "pool": {"name": "", "pgs": "/pgs", "tier": "/tier", "stats": "/stats"}, "volume": {"name": "", "stats": "/stats"}, "snapshot": {"name": "", "stats": "/stats"}}


@@ -142,19 +142,37 @@ def getNetworkACLs(zkhandler, vni, _direction):
 def getNetworkInformation(zkhandler, vni):
-    description = zkhandler.read(("network", vni))
-    nettype = zkhandler.read(("network.type", vni))
-    mtu = zkhandler.read(("network.mtu", vni))
-    domain = zkhandler.read(("network.domain", vni))
-    name_servers = zkhandler.read(("network.nameservers", vni))
-    ip6_network = zkhandler.read(("network.ip6.network", vni))
-    ip6_gateway = zkhandler.read(("network.ip6.gateway", vni))
-    dhcp6_flag = zkhandler.read(("network.ip6.dhcp", vni))
-    ip4_network = zkhandler.read(("network.ip4.network", vni))
-    ip4_gateway = zkhandler.read(("network.ip4.gateway", vni))
-    dhcp4_flag = zkhandler.read(("network.ip4.dhcp", vni))
-    dhcp4_start = zkhandler.read(("network.ip4.dhcp_start", vni))
-    dhcp4_end = zkhandler.read(("network.ip4.dhcp_end", vni))
+    (
+        description,
+        nettype,
+        mtu,
+        domain,
+        name_servers,
+        ip6_network,
+        ip6_gateway,
+        dhcp6_flag,
+        ip4_network,
+        ip4_gateway,
+        dhcp4_flag,
+        dhcp4_start,
+        dhcp4_end,
+    ) = zkhandler.read_many(
+        [
+            ("network", vni),
+            ("network.type", vni),
+            ("network.mtu", vni),
+            ("network.domain", vni),
+            ("network.nameservers", vni),
+            ("network.ip6.network", vni),
+            ("network.ip6.gateway", vni),
+            ("network.ip6.dhcp", vni),
+            ("network.ip4.network", vni),
+            ("network.ip4.gateway", vni),
+            ("network.ip4.dhcp", vni),
+            ("network.ip4.dhcp_start", vni),
+            ("network.ip4.dhcp_end", vni),
+        ]
+    )

     # Construct a data structure to represent the data
     network_information = {
@@ -818,31 +836,45 @@ def getSRIOVVFInformation(zkhandler, node, vf):
     if not zkhandler.exists(("node.sriov.vf", node, "sriov_vf", vf)):
         return []

-    pf = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pf", vf))
-    mtu = zkhandler.read(("node.sriov.vf", node, "sriov_vf.mtu", vf))
-    mac = zkhandler.read(("node.sriov.vf", node, "sriov_vf.mac", vf))
-    vlan_id = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.vlan_id", vf))
-    vlan_qos = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.vlan_qos", vf))
-    tx_rate_min = zkhandler.read(
-        ("node.sriov.vf", node, "sriov_vf.config.tx_rate_min", vf)
-    )
-    tx_rate_max = zkhandler.read(
-        ("node.sriov.vf", node, "sriov_vf.config.tx_rate_max", vf)
-    )
-    link_state = zkhandler.read(
-        ("node.sriov.vf", node, "sriov_vf.config.link_state", vf)
-    )
-    spoof_check = zkhandler.read(
-        ("node.sriov.vf", node, "sriov_vf.config.spoof_check", vf)
-    )
-    trust = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.trust", vf))
-    query_rss = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.query_rss", vf))
-    pci_domain = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.domain", vf))
-    pci_bus = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.bus", vf))
-    pci_slot = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.slot", vf))
-    pci_function = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.function", vf))
-    used = zkhandler.read(("node.sriov.vf", node, "sriov_vf.used", vf))
-    used_by_domain = zkhandler.read(("node.sriov.vf", node, "sriov_vf.used_by", vf))
+    (
+        pf,
+        mtu,
+        mac,
+        vlan_id,
+        vlan_qos,
+        tx_rate_min,
+        tx_rate_max,
+        link_state,
+        spoof_check,
+        trust,
+        query_rss,
+        pci_domain,
+        pci_bus,
+        pci_slot,
+        pci_function,
+        used,
+        used_by_domain,
+    ) = zkhandler.read_many(
+        [
+            ("node.sriov.vf", node, "sriov_vf.pf", vf),
+            ("node.sriov.vf", node, "sriov_vf.mtu", vf),
+            ("node.sriov.vf", node, "sriov_vf.mac", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.vlan_id", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.vlan_qos", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.tx_rate_min", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.tx_rate_max", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.link_state", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.spoof_check", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.trust", vf),
+            ("node.sriov.vf", node, "sriov_vf.config.query_rss", vf),
+            ("node.sriov.vf", node, "sriov_vf.pci.domain", vf),
+            ("node.sriov.vf", node, "sriov_vf.pci.bus", vf),
+            ("node.sriov.vf", node, "sriov_vf.pci.slot", vf),
+            ("node.sriov.vf", node, "sriov_vf.pci.function", vf),
+            ("node.sriov.vf", node, "sriov_vf.used", vf),
+            ("node.sriov.vf", node, "sriov_vf.used_by", vf),
+        ]
+    )

     vf_information = {
         "phy": vf,


@@ -26,69 +26,141 @@ import json
 import daemon_lib.common as common

-def getNodeInformation(zkhandler, node_name):
-    """
-    Gather information about a node from the Zookeeper database and return a dict() containing it.
-    """
-    node_daemon_state = zkhandler.read(("node.state.daemon", node_name))
-    node_coordinator_state = zkhandler.read(("node.state.router", node_name))
-    node_domain_state = zkhandler.read(("node.state.domain", node_name))
-    node_static_data = zkhandler.read(("node.data.static", node_name)).split()
-    node_pvc_version = zkhandler.read(("node.data.pvc_version", node_name))
-    node_cpu_count = int(node_static_data[0])
-    node_kernel = node_static_data[1]
-    node_os = node_static_data[2]
-    node_arch = node_static_data[3]
-    node_vcpu_allocated = int(zkhandler.read(("node.vcpu.allocated", node_name)))
-    node_mem_total = int(zkhandler.read(("node.memory.total", node_name)))
-    node_mem_allocated = int(zkhandler.read(("node.memory.allocated", node_name)))
-    node_mem_provisioned = int(zkhandler.read(("node.memory.provisioned", node_name)))
-    node_mem_used = int(zkhandler.read(("node.memory.used", node_name)))
-    node_mem_free = int(zkhandler.read(("node.memory.free", node_name)))
-    node_load = float(zkhandler.read(("node.cpu.load", node_name)))
-    node_domains_count = int(
-        zkhandler.read(("node.count.provisioned_domains", node_name))
-    )
-    node_running_domains = zkhandler.read(("node.running_domains", node_name)).split()
-    try:
-        node_health = int(zkhandler.read(("node.monitoring.health", node_name)))
-    except Exception:
-        node_health = "N/A"
-    try:
-        node_health_plugins = zkhandler.read(
-            ("node.monitoring.plugins", node_name)
-        ).split()
-    except Exception:
-        node_health_plugins = list()
-    node_health_details = list()
+def getNodeHealthDetails(zkhandler, node_name, node_health_plugins):
+    plugin_reads = list()
     for plugin in node_health_plugins:
-        plugin_last_run = zkhandler.read(
-            ("node.monitoring.data", node_name, "monitoring_plugin.last_run", plugin)
-        )
-        plugin_health_delta = zkhandler.read(
+        plugin_reads += [
+            (
+                "node.monitoring.data",
+                node_name,
+                "monitoring_plugin.last_run",
+                plugin,
+            ),
+            (
+                "node.monitoring.data",
+                node_name,
+                "monitoring_plugin.health_delta",
+                plugin,
-            )
-        )
-        plugin_message = zkhandler.read(
-            ("node.monitoring.data", node_name, "monitoring_plugin.message", plugin)
-        )
-        plugin_data = zkhandler.read(
-            ("node.monitoring.data", node_name, "monitoring_plugin.data", plugin)
-        )
+            ),
+            (
+                "node.monitoring.data",
+                node_name,
+                "monitoring_plugin.message",
+                plugin,
+            ),
+            (
+                "node.monitoring.data",
+                node_name,
+                "monitoring_plugin.data",
+                plugin,
+            ),
+        ]
+    all_plugin_data = list(zkhandler.read_many(plugin_reads))
+
+    node_health_details = list()
+    for pidx, plugin in enumerate(node_health_plugins):
+        # Split the large list of return values by the IDX of this plugin
+        # Each plugin result is 4 fields long
+        pos_start = pidx * 4
+        pos_end = pidx * 4 + 4
+        (
+            plugin_last_run,
+            plugin_health_delta,
+            plugin_message,
+            plugin_data,
+        ) = tuple(all_plugin_data[pos_start:pos_end])
         plugin_output = {
             "name": plugin,
-            "last_run": int(plugin_last_run),
+            "last_run": int(plugin_last_run) if plugin_last_run is not None else None,
             "health_delta": int(plugin_health_delta),
             "message": plugin_message,
             "data": json.loads(plugin_data),
         }
         node_health_details.append(plugin_output)
+
+    return node_health_details
+
+def getNodeInformation(zkhandler, node_name):
+    """
+    Gather information about a node from the Zookeeper database and return a dict() containing it.
+    """
+    (
+        node_daemon_state,
+        node_coordinator_state,
+        node_domain_state,
+        node_pvc_version,
+        _node_static_data,
+        _node_vcpu_allocated,
+        _node_mem_total,
+        _node_mem_allocated,
+        _node_mem_provisioned,
+        _node_mem_used,
+        _node_mem_free,
+        _node_load,
+        _node_domains_count,
+        _node_running_domains,
+        _node_health,
+        _node_health_plugins,
+        _node_network_stats,
+    ) = zkhandler.read_many(
+        [
+            ("node.state.daemon", node_name),
+            ("node.state.router", node_name),
+            ("node.state.domain", node_name),
+            ("node.data.pvc_version", node_name),
+            ("node.data.static", node_name),
+            ("node.vcpu.allocated", node_name),
+            ("node.memory.total", node_name),
+            ("node.memory.allocated", node_name),
+            ("node.memory.provisioned", node_name),
+            ("node.memory.used", node_name),
+            ("node.memory.free", node_name),
+            ("node.cpu.load", node_name),
+            ("node.count.provisioned_domains", node_name),
+            ("node.running_domains", node_name),
+            ("node.monitoring.health", node_name),
+            ("node.monitoring.plugins", node_name),
+            ("node.network.stats", node_name),
+        ]
+    )
+
+    node_static_data = _node_static_data.split()
+    node_cpu_count = int(node_static_data[0])
+    node_kernel = node_static_data[1]
+    node_os = node_static_data[2]
+    node_arch = node_static_data[3]
+
+    node_vcpu_allocated = int(_node_vcpu_allocated)
+    node_mem_total = int(_node_mem_total)
+    node_mem_allocated = int(_node_mem_allocated)
+    node_mem_provisioned = int(_node_mem_provisioned)
+    node_mem_used = int(_node_mem_used)
+    node_mem_free = int(_node_mem_free)
+    node_load = float(_node_load)
+    node_domains_count = int(_node_domains_count)
+    node_running_domains = _node_running_domains.split()
+
+    try:
+        node_health = int(_node_health)
+    except Exception:
+        node_health = "N/A"
+
+    try:
+        node_health_plugins = _node_health_plugins.split()
+    except Exception:
+        node_health_plugins = list()
+
+    node_health_details = getNodeHealthDetails(
+        zkhandler, node_name, node_health_plugins
+    )
+
+    if _node_network_stats is not None:
+        node_network_stats = json.loads(_node_network_stats)
+    else:
+        node_network_stats = dict()

     # Construct a data structure to represent the data
     node_information = {
         "name": node_name,
@@ -117,6 +189,7 @@ def getNodeInformation(zkhandler, node_name):
             "used": node_mem_used,
             "free": node_mem_free,
         },
+        "interfaces": node_network_stats,
     }

     return node_information
@@ -269,6 +342,8 @@ def get_list(
 ):
     node_list = []
     full_node_list = zkhandler.children("base.node")
+    if full_node_list is None:
+        full_node_list = list()
     full_node_list.sort()

     if is_fuzzy and limit:


@@ -19,6 +19,7 @@
 #
 ###############################################################################

+import asyncio
 import os
 import time
 import uuid
@@ -239,10 +240,41 @@ class ZKHandler(object):
                 # This path is invalid; this is likely due to missing schema entries, so return None
                 return None

-            return self.zk_conn.get(path)[0].decode(self.encoding)
+            res = self.zk_conn.get(path)
+            return res[0].decode(self.encoding)
         except NoNodeError:
             return None

+    async def read_async(self, key):
+        """
+        Read data from a key asynchronously
+        """
+        try:
+            path = self.get_schema_path(key)
+            if path is None:
+                # This path is invalid; this is likely due to missing schema entries, so return None
+                return None
+
+            val = self.zk_conn.get_async(path)
+            data = val.get()
+
+            return data[0].decode(self.encoding)
+        except NoNodeError:
+            return None
+
+    async def _read_many(self, keys):
+        """
+        Async runner for read_many
+        """
+        res = await asyncio.gather(*(self.read_async(key) for key in keys))
+        return tuple(res)
+
+    def read_many(self, keys):
+        """
+        Read data from several keys, asynchronously. Returns a tuple of all key values once all
+        reads are complete.
+        """
+        return asyncio.run(self._read_many(keys))
+
     def write(self, kvpairs):
         """
         Create or update one or more keys' data
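This `read_many` is the core primitive of the changeset: fan out one coroutine per key, `asyncio.gather` them, and return the results in argument order so callers can tuple-unpack. The same shape can be sketched self-contained, with a plain dict and a simulated delay standing in for the Zookeeper connection (`read_async` below is a stand-in, not kazoo's API):

```python
import asyncio

# Stand-in for the Zookeeper tree
FAKE_STORE = {
    "/nodes/hv1/memtotal": "65536",
    "/nodes/hv1/memused": "12288",
    "/nodes/hv1/cpuload": "0.45",
}


async def read_async(store, key):
    # Simulate a per-key network round-trip; missing keys read as None
    await asyncio.sleep(0.01)
    return store.get(key)


async def _read_many(store, keys):
    # gather() preserves argument order, so results line up with keys
    res = await asyncio.gather(*(read_async(store, key) for key in keys))
    return tuple(res)


def read_many(store, keys):
    # Synchronous facade, mirroring ZKHandler.read_many
    return asyncio.run(_read_many(store, keys))


mem_total, mem_used, load = read_many(
    FAKE_STORE,
    ["/nodes/hv1/memtotal", "/nodes/hv1/memused", "/nodes/hv1/cpuload"],
)
print(mem_total, mem_used, load)  # 65536 12288 0.45
```

Because the simulated round-trips overlap, N reads cost roughly one round-trip instead of N, which is where the changelog's performance claim comes from.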
@@ -540,7 +572,7 @@
 #
 class ZKSchema(object):
     # Current version
-    _version = 11
+    _version = 12

     # Root for doing nested keys
     _schema_root = ""
@@ -619,6 +651,7 @@ class ZKSchema(object):
             "monitoring.plugins": "/monitoring_plugins",
             "monitoring.data": "/monitoring_data",
             "monitoring.health": "/monitoring_health",
+            "network.stats": "/network_stats",
         },
         # The schema of an individual monitoring plugin data entry (/nodes/{node_name}/monitoring_data/{plugin})
         "monitoring_plugin": {

debian/changelog

@@ -1,3 +1,22 @@
+pvc (0.9.86-0) unstable; urgency=high
+
+  * [API Daemon] Significantly improves the performance of several commands via async Zookeeper calls and removal of superfluous backend calls.
+  * [Docs] Improves the project README and updates screenshot images to show the current output and more functionality.
+  * [API Daemon/CLI] Corrects some bugs in VM metainformation output.
+  * [Node Daemon] Fixes resource reporting bugs from 0.9.81 and properly clears node resource numbers on a fence.
+  * [Health Daemon] Adds a wait during pvchealthd startup until the node is in run state, to avoid erroneous faults during node bootup.
+  * [API Daemon] Fixes an incorrect reference to the legacy pvcapid.yaml file in the migration script.
+
+ -- Joshua M. Boniface <joshua@boniface.me>  Thu, 14 Dec 2023 14:46:29 -0500
+
 pvc (0.9.85-0) unstable; urgency=high

   * [Packaging] Fixes a dependency bug introduced in 0.9.84
   * [Node Daemon] Fixes an output bug during keepalives
   * [Node Daemon] Fixes a bug in the example Prometheus Grafana dashboard

  -- Joshua M. Boniface <joshua@boniface.me>  Sun, 10 Dec 2023 01:00:33 -0500

 pvc (0.9.84-0) unstable; urgency=high

   **Breaking Changes:** This release features a major reconfiguration to how monitoring and reporting of the cluster health works. Node health plugins now report "faults", as do several other issues which were previously manually checked for in "cluster" daemon library for the "/status" endpoint, from within the Health daemon. These faults are persistent, and under each given identifier can be triggered once and subsequent triggers simply update the "last reported" time. An additional set of API endpoints and commands are added to manage these faults, either by "ack"(nowledging) them (keeping the alert around to be further updated but setting its health delta to 0%), or "delete"ing them (completely removing the fault unless it retriggers), both individually, to (from the CLI) multiple, or all. Cluster health reporting is now done based on these faults instead of anything else, and the default interval for health checks is reduced to 15 seconds to accommodate this. In addition to this, Prometheus metrics have been added, along with an example Grafana dashboard, for the PVC cluster itself, as well as a proxy to the Ceph cluster metrics. This release also fixes some bugs in the VM provisioner that were introduced in 0.9.83; these fixes require a **reimport or reconfiguration of any provisioner scripts**; reference the updated examples for details.

debian/control

@@ -8,7 +8,7 @@ X-Python3-Version: >= 3.7
 Package: pvc-daemon-node
 Architecture: all
-Depends: systemd, pvc-daemon-common, pvc-daemon-health, pvc-daemon-worker, python3-kazoo, python3-psutil, python3-apscheduler, python3-libvirt, python3-psycopg2, python3-dnspython, python3-yaml, python3-distutils, python3-rados, python3-gevent, python3-prometheus-client, ipmitool, libvirt-daemon-system, arping, vlan, bridge-utils, dnsmasq, nftables, pdns-server, pdns-backend-pgsql
+Depends: systemd, pvc-daemon-common, pvc-daemon-health, pvc-daemon-worker, python3-kazoo, python3-psutil, python3-apscheduler, python3-libvirt, python3-psycopg2, python3-dnspython, python3-yaml, python3-distutils, python3-rados, python3-gevent, ipmitool, libvirt-daemon-system, arping, vlan, bridge-utils, dnsmasq, nftables, pdns-server, pdns-backend-pgsql
 Description: Parallel Virtual Cluster node daemon
  A KVM/Zookeeper/Ceph-based VM and private cloud manager
 .
@@ -16,7 +16,7 @@ Description: Parallel Virtual Cluster node daemon
 Package: pvc-daemon-health
 Architecture: all
-Depends: systemd, pvc-daemon-common, python3-kazoo, python3-psutil, python3-apscheduler, python3-yaml, python3-prometheus-client
+Depends: systemd, pvc-daemon-common, python3-kazoo, python3-psutil, python3-apscheduler, python3-yaml
 Description: Parallel Virtual Cluster health daemon
  A KVM/Zookeeper/Ceph-based VM and private cloud manager
 .
@@ -24,7 +24,7 @@ Description: Parallel Virtual Cluster health daemon
 Package: pvc-daemon-worker
 Architecture: all
-Depends: systemd, pvc-daemon-common, python3-kazoo, python3-celery, python3-redis, python3-yaml, python3-prometheus-client, python-celery-common, fio
+Depends: systemd, pvc-daemon-common, python3-kazoo, python3-celery, python3-redis, python3-yaml, python-celery-common, fio
 Description: Parallel Virtual Cluster worker daemon
  A KVM/Zookeeper/Ceph-based VM and private cloud manager
 .
@@ -32,7 +32,7 @@ Description: Parallel Virtual Cluster worker daemon
 Package: pvc-daemon-api
 Architecture: all
-Depends: systemd, pvc-daemon-common, python3-yaml, python3-flask, python3-flask-restful, python3-celery, python3-distutils, python3-redis, python3-lxml, python3-flask-migrate, python3-prometheus-client
+Depends: systemd, pvc-daemon-common, python3-yaml, python3-flask, python3-flask-restful, python3-celery, python3-distutils, python3-redis, python3-lxml, python3-flask-migrate
 Description: Parallel Virtual Cluster API daemon
  A KVM/Zookeeper/Ceph-based VM and private cloud manager
 .


@@ -3,4 +3,4 @@ node-daemon/pvcnoded usr/share/pvc
 node-daemon/pvcnoded.service lib/systemd/system
 node-daemon/pvc.target lib/systemd/system
 node-daemon/pvcautoready.service lib/systemd/system
-node-daemon/monitoring usr/share/pvc
+monitoring usr/share/pvc


@@ -33,7 +33,7 @@ import os
 import signal

 # Daemon version
-version = "0.9.84"
+version = "0.9.86"

 ##########################################################
@@ -80,6 +80,11 @@ def entrypoint():
     # Connect to Zookeeper and return our handler and current schema version
     zkhandler, _ = pvchealthd.util.zookeeper.connect(logger, config)

+    logger.out("Waiting for node daemon to be operating", state="s")
+    while zkhandler.read(("node.state.daemon", config["node_hostname"])) != "run":
+        sleep(5)
+    logger.out("Node daemon in run state, continuing health daemon startup", state="s")
+
     # Define a cleanup function
     def cleanup(failure=False):
         nonlocal logger, zkhandler, monitoring_instance
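The startup gate added here is a plain poll-until-state loop. The same pattern can be sketched against any state source; the helper name and the timeout below are my additions for safety in the sketch, not part of the daemon, which deliberately waits forever:

```python
import time


def wait_for_state(read_state, wanted="run", interval=0.1, timeout=5.0):
    # Poll the state source until it reports the wanted state or we time out
    deadline = time.monotonic() + timeout
    while read_state() != wanted:
        if time.monotonic() > deadline:
            return False
        time.sleep(interval)
    return True


# Mock daemon state that flips to "run" after a few polls
states = iter(["init", "init", "run"])
current = {"state": "init"}


def read_state():
    try:
        current["state"] = next(states)
    except StopIteration:
        pass
    return current["state"]


print(wait_for_state(read_state))  # True
```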

(README screenshot images added or updated under images/, including images/4-vm-information.png, images/5-vm-details.png, and images/10-provisioner.png; binary files not shown.)

@@ -70,7 +70,7 @@ def check_pvc(item, params, section):
     summary = f"Cluster health is {cluster_health}% (maintenance {maintenance})"

     if len(cluster_messages) > 0:
-        details = ", ".join(cluster_messages)
+        details = ", ".join([m["text"] for m in cluster_messages])

     if cluster_health <= 50 and maintenance == "off":
         state = State.CRIT
File diff suppressed because it is too large

File diff suppressed because it is too large

@@ -31,6 +31,7 @@ import pvcnoded.objects.MetadataAPIInstance as MetadataAPIInstance
 import pvcnoded.objects.VMInstance as VMInstance
 import pvcnoded.objects.NodeInstance as NodeInstance
 import pvcnoded.objects.VXNetworkInstance as VXNetworkInstance
+import pvcnoded.objects.NetstatsInstance as NetstatsInstance
 import pvcnoded.objects.SRIOVVFInstance as SRIOVVFInstance
 import pvcnoded.objects.CephInstance as CephInstance
@@ -48,7 +49,7 @@ import re
 import json

 # Daemon version
-version = "0.9.84"
+version = "0.9.86"

 ##########################################################
@@ -200,9 +201,9 @@ def entrypoint():
     # Define a cleanup function
     def cleanup(failure=False):
-        nonlocal logger, zkhandler, keepalive_timer, d_domain
+        nonlocal logger, zkhandler, keepalive_timer, d_domain, netstats

-        logger.out("Terminating pvcnoded and cleaning up", state="s")
+        logger.out("Terminating pvcnoded", state="s")

         # Set shutdown state in Zookeeper
         zkhandler.write([(("node.state.daemon", config["node_hostname"]), "shutdown")])
@@ -249,12 +250,20 @@ def entrypoint():
         except Exception:
             pass

-        # Set stop state in Zookeeper
-        zkhandler.write([(("node.state.daemon", config["node_hostname"]), "stop")])
+        logger.out("Cleaning up", state="s")
+
+        # Stop netstats instance
+        try:
+            netstats.shutdown()
+        except Exception:
+            pass

         # Forcibly terminate dnsmasq because it gets stuck sometimes
         common.run_os_command("killall dnsmasq")

+        # Set stop state in Zookeeper
+        zkhandler.write([(("node.state.daemon", config["node_hostname"]), "stop")])
+
         # Close the Zookeeper connection
         try:
             zkhandler.disconnect(persistent=True)
@@ -1000,9 +1009,12 @@ def entrypoint():
         state="s",
     )

+    # Set up netstats
+    netstats = NetstatsInstance.NetstatsInstance(logger, config, zkhandler, this_node)
+
     # Start keepalived thread
     keepalive_timer = pvcnoded.util.keepalive.start_keepalive_timer(
-        logger, config, zkhandler, this_node
+        logger, config, zkhandler, this_node, netstats
     )

     # Tick loop; does nothing since everything is async
View File

@@ -0,0 +1,293 @@
#!/usr/bin/env python3
# NetstatsInstance.py - Class implementing a PVC network stats gatherer and run by pvcnoded
# Part of the Parallel Virtual Cluster (PVC) system
#
# Copyright (C) 2018-2023 Joshua M. Boniface <joshua@boniface.me>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, version 3.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
###############################################################################
from apscheduler.schedulers.background import BackgroundScheduler
from collections import deque
from json import dumps
from os import walk
from os.path import exists
class NetstatsIfaceInstance(object):
"""
NetstatsIfaceInstance
This class implements a rolling statistics poller for a network interface,
collecting stats on the bits and packets per second in both directions every
second.
Via the get_stats() function, it returns the rolling average of all 4 values,
as well as totals, over the last 5 seconds (self.avg_samples) as a tuple of:
(rx_bps, rx_pps, tx_bps, tx_pps, total_bps, total_pps, link_speed, state)
"""
def __init__(self, logger, iface, avg_samples):
"""
Initialize the class instance, creating our BackgroundScheduler, setting
the average sample rate, and creating the deques and average values.
"""
self.logger = logger
self.iface = iface
self.data_valid = False
self.data_polls = 0
self.timer = BackgroundScheduler()
self.timer.add_job(self.gather_stats, trigger="interval", seconds=1)
self.avg_samples = avg_samples
self.link_speed = 0
self.state = "down"
self.rx_bits_rolling = deque(list(), self.avg_samples + 1)
self.rx_bps = 0
self.rx_packets_rolling = deque(list(), self.avg_samples + 1)
self.rx_pps = 0
self.tx_bits_rolling = deque(list(), self.avg_samples + 1)
self.tx_bps = 0
self.tx_packets_rolling = deque(list(), self.avg_samples + 1)
self.tx_pps = 0
self.total_bps = 0
self.total_pps = 0
def get_iface_stats(self):
"""
Reads the interface statistics from the sysfs for the interface.
"""
iface_state_path = f"/sys/class/net/{self.iface}/operstate"
with open(iface_state_path) as stfh:
self.state = stfh.read().strip()
iface_speed_path = f"/sys/class/net/{self.iface}/speed"
try:
with open(iface_speed_path) as spfh:
# The speed key is always in Mbps so multiply by 1000*1000 to get bps
self.link_speed = int(spfh.read()) * 1000 * 1000
except OSError:
self.link_speed = 0
iface_stats_path = f"/sys/class/net/{self.iface}/statistics"
with open(f"{iface_stats_path}/rx_bytes") as rxbfh:
self.rx_bits_rolling.append(int(rxbfh.read()) * 8)
with open(f"{iface_stats_path}/tx_bytes") as txbfh:
self.tx_bits_rolling.append(int(txbfh.read()) * 8)
# Packet counters are raw counts; only the byte counters are converted to bits
with open(f"{iface_stats_path}/rx_packets") as rxpfh:
self.rx_packets_rolling.append(int(rxpfh.read()))
with open(f"{iface_stats_path}/tx_packets") as txpfh:
self.tx_packets_rolling.append(int(txpfh.read()))
def calculate_averages(self):
"""
Calculates the bps/pps values from the rolling values.
"""
rx_bits_diffs = list()
for sample_idx in range(self.avg_samples, 0, -1):
rx_bits_diffs.append(
self.rx_bits_rolling[sample_idx] - self.rx_bits_rolling[sample_idx - 1]
)
self.rx_bps = int(sum(rx_bits_diffs) / self.avg_samples)
rx_packets_diffs = list()
for sample_idx in range(self.avg_samples, 0, -1):
rx_packets_diffs.append(
self.rx_packets_rolling[sample_idx]
- self.rx_packets_rolling[sample_idx - 1]
)
self.rx_pps = int(sum(rx_packets_diffs) / self.avg_samples)
tx_bits_diffs = list()
for sample_idx in range(self.avg_samples, 0, -1):
tx_bits_diffs.append(
self.tx_bits_rolling[sample_idx] - self.tx_bits_rolling[sample_idx - 1]
)
self.tx_bps = int(sum(tx_bits_diffs) / self.avg_samples)
tx_packets_diffs = list()
for sample_idx in range(self.avg_samples, 0, -1):
tx_packets_diffs.append(
self.tx_packets_rolling[sample_idx]
- self.tx_packets_rolling[sample_idx - 1]
)
self.tx_pps = int(sum(tx_packets_diffs) / self.avg_samples)
self.total_bps = self.rx_bps + self.tx_bps
self.total_pps = self.rx_pps + self.tx_pps
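The rolling average above works on avg_samples + 1 samples: the deque keeps the last avg_samples + 1 counter readings, and the mean of the avg_samples successive deltas gives the per-second rate. A minimal standalone sketch of that arithmetic (the counter values are invented for illustration):

```python
from collections import deque

avg_samples = 5

# Simulated monotonic byte counters sampled once per second, converted to
# bits (x8) as get_iface_stats() does: a steady 1000 bytes/s of traffic.
rx_bits_rolling = deque(
    [c * 8 for c in (0, 1000, 2000, 3000, 4000, 5000)],
    maxlen=avg_samples + 1,
)

# Same logic as calculate_averages(): sum the deltas between adjacent
# samples, then divide by the number of deltas.
rx_bits_diffs = [
    rx_bits_rolling[i] - rx_bits_rolling[i - 1]
    for i in range(avg_samples, 0, -1)
]
rx_bps = int(sum(rx_bits_diffs) / avg_samples)
print(rx_bps)  # 8000 bits/s
```

Because the deque is bounded at avg_samples + 1, each new reading silently evicts the oldest, so the window always covers exactly the last avg_samples seconds.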
def gather_stats(self):
"""
Gathers the current stats and then calculates the averages.
Runs via the BackgroundScheduler timer every 1 second.
"""
self.get_iface_stats()
if self.data_valid:
self.calculate_averages()
# Handle data validity: our data is invalid until we hit enough polls
# to make a valid average (avg_samples plus 1).
if not self.data_valid:
self.data_polls += 1
if self.data_polls > self.avg_samples:
self.data_valid = True
def start(self):
"""
Starts the timer.
"""
self.timer.start()
def stop(self):
"""
Stops the timer.
"""
self.timer.shutdown()
def get_stats(self):
"""
Returns a tuple of the current statistics.
"""
if not self.data_valid:
return None
return (
self.rx_bps,
self.rx_pps,
self.tx_bps,
self.tx_pps,
self.total_bps,
self.total_pps,
self.link_speed,
self.state,
)
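get_stats() returning None until data_valid flips is what keeps callers from reading a half-filled window. The gate itself can be sketched in isolation (the same avg_samples of 5 is assumed):

```python
avg_samples = 5

# gather_stats() counts polls and flips data_valid only once avg_samples + 1
# samples exist, since avg_samples deltas need avg_samples + 1 readings.
data_valid = False
data_polls = 0
seen = []
for _ in range(avg_samples + 1):
    if not data_valid:
        data_polls += 1
        if data_polls > avg_samples:
            data_valid = True
    seen.append(data_valid)
print(seen)  # five False polls, then True
```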
class NetstatsInstance(object):
"""
NetstatsInstance
This class implements a rolling statistics poller for all PHYSICAL network interfaces
on the system, initializing a NetstatsIfaceInstance for each, as well as handling
value updates into Zookeeper.
"""
def __init__(self, logger, config, zkhandler, this_node):
"""
Initialize the class instance.
"""
self.logger = logger
self.config = config
self.zkhandler = zkhandler
self.node_name = this_node.name
self.interfaces = dict()
self.logger.out(
f"Starting netstats collector ({self.config['keepalive_interval']} second interval)",
state="s",
)
self.set_interfaces()
def shutdown(self):
"""
Stop all pollers and delete the NetstatsIfaceInstance objects
"""
# Empty the network stats object
self.zkhandler.write([(("node.network.stats", self.node_name), dumps({}))])
for iface in self.interfaces.keys():
self.interfaces[iface].stop()
def set_interfaces(self):
"""
Sets the list of interfaces on the system, and then ensures that each
interface has a NetstatsIfaceInstance assigned to it and polling.
"""
# Get a list of all active interfaces
net_root_path = "/sys/class/net"
all_ifaces = list()
for (_, dirnames, _) in walk(net_root_path):
all_ifaces.extend(dirnames)
all_ifaces.sort()
self.logger.out(
f"Parsing network list: {all_ifaces}", state="d", prefix="netstats-thread"
)
# Add any missing interfaces
for iface in all_ifaces:
if not exists(f"{net_root_path}/{iface}/device"):
# This is not a physical interface; skip it
continue
if iface not in self.interfaces.keys():
# Set the number of samples to be equal to the keepalive interval, so that each
# keepalive has a fresh set of data from the last keepalive_interval seconds.
self.interfaces[iface] = NetstatsIfaceInstance(
self.logger, iface, self.config["keepalive_interval"]
)
self.interfaces[iface].start()
# Remove any superfluous interfaces (snapshot the keys, since we delete during iteration)
for iface in list(self.interfaces.keys()):
if iface not in all_ifaces:
self.interfaces[iface].stop()
del self.interfaces[iface]
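The physical-interface filter above relies on sysfs: only real devices expose a device entry under /sys/class/net/<iface>/, while bridges, bonds, VLANs, and lo do not. A self-contained sketch of the same check (the helper name and the configurable root path are illustrative, so it can be exercised against a stand-in directory tree):

```python
from os import walk
from os.path import exists


def list_physical_interfaces(net_root_path="/sys/class/net"):
    # Collect only the top-level entries, then keep those with a backing
    # "device" entry, mirroring the filter in set_interfaces().
    all_ifaces = []
    for _, dirnames, _ in walk(net_root_path):
        all_ifaces.extend(dirnames)
        break
    return sorted(i for i in all_ifaces if exists(f"{net_root_path}/{i}/device"))
```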
def set_data(self):
data = dict()
for iface in self.interfaces.keys():
self.logger.out(
f"Getting data for interface {iface}",
state="d",
prefix="netstats-thread",
)
iface_stats = self.interfaces[iface].get_stats()
if iface_stats is None:
continue
(
iface_rx_bps,
iface_rx_pps,
iface_tx_bps,
iface_tx_pps,
iface_total_bps,
iface_total_pps,
iface_link_speed,
iface_state,
) = iface_stats
data[iface] = {
"rx_bps": iface_rx_bps,
"rx_pps": iface_rx_pps,
"tx_bps": iface_tx_bps,
"tx_pps": iface_tx_pps,
"total_bps": iface_total_bps,
"total_pps": iface_total_pps,
"link_speed": iface_link_speed,
"state": iface_state,
}
self.zkhandler.write([(("node.network.stats", self.node_name), dumps(data))])
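set_data() above serializes one dict per interface into a single JSON document stored at the node's network.stats key. A sketch of the resulting payload shape, using a hypothetical stand-in for the zkhandler write interface (FakeZKHandler, the node name, and the sample values are all illustrative, not PVC API):

```python
from json import dumps, loads


# Records (key, value) writes the way set_data() issues them, so the
# stored JSON shape can be inspected without a live Zookeeper.
class FakeZKHandler:
    def __init__(self):
        self.writes = []

    def write(self, entries):
        self.writes.extend(entries)


zk = FakeZKHandler()
data = {
    "ens1": {
        "rx_bps": 8000, "rx_pps": 10, "tx_bps": 4000, "tx_pps": 5,
        "total_bps": 12000, "total_pps": 15,
        "link_speed": 10**9, "state": "up",
    }
}
zk.write([(("node.network.stats", "hv1"), dumps(data))])

key, payload = zk.writes[0]
print(loads(payload)["ens1"]["total_bps"])  # 12000
```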

View File

@@ -115,6 +115,27 @@ def fence_node(node_name, zkhandler, config, logger):
):
migrateFromFencedNode(zkhandler, node_name, config, logger)
# Reset all node resource values
logger.out(
f"Resetting all resource values for dead node {node_name} to zero",
state="i",
prefix=f"fencing {node_name}",
)
zkhandler.write(
[
(("node.running_domains", node_name), "0"),
(("node.count.provisioned_domains", node_name), "0"),
(("node.cpu.load", node_name), "0"),
(("node.vcpu.allocated", node_name), "0"),
(("node.memory.total", node_name), "0"),
(("node.memory.used", node_name), "0"),
(("node.memory.free", node_name), "0"),
(("node.memory.allocated", node_name), "0"),
(("node.memory.provisioned", node_name), "0"),
(("node.monitoring.health", node_name), None),
]
)
# Migrate hosts away from a fenced node
def migrateFromFencedNode(zkhandler, node_name, config, logger):

View File

@@ -51,7 +51,7 @@ libvirt_vm_states = {
}
def start_keepalive_timer(logger, config, zkhandler, this_node):
def start_keepalive_timer(logger, config, zkhandler, this_node, netstats):
keepalive_interval = config["keepalive_interval"]
logger.out(
f"Starting keepalive timer ({keepalive_interval} second interval)", state="s"
@@ -59,7 +59,7 @@ def start_keepalive_timer(logger, config, zkhandler, this_node):
keepalive_timer = BackgroundScheduler()
keepalive_timer.add_job(
node_keepalive,
args=(logger, config, zkhandler, this_node),
args=(logger, config, zkhandler, this_node, netstats),
trigger="interval",
seconds=keepalive_interval,
)
@@ -477,6 +477,10 @@ def collect_vm_stats(logger, config, zkhandler, this_node, queue):
fixed_d_domain = this_node.d_domain.copy()
for domain, instance in fixed_d_domain.items():
if domain in this_node.domain_list:
# Add the allocated memory to our memalloc value
memalloc += instance.getmemory()
memprov += instance.getmemory()
vcpualloc += instance.getvcpus()
if instance.getstate() == "start" and instance.getnode() == this_node.name:
if instance.getdom() is not None:
try:
@@ -532,11 +536,6 @@ def collect_vm_stats(logger, config, zkhandler, this_node, queue):
continue
domain_memory_stats = domain.memoryStats()
domain_cpu_stats = domain.getCPUStats(True)[0]
# Add the allocated memory to our memalloc value
memalloc += instance.getmemory()
memprov += instance.getmemory()
vcpualloc += instance.getvcpus()
except Exception as e:
if debug:
try:
@@ -685,7 +684,7 @@ def collect_vm_stats(logger, config, zkhandler, this_node, queue):
# Keepalive update function
def node_keepalive(logger, config, zkhandler, this_node):
def node_keepalive(logger, config, zkhandler, this_node, netstats):
debug = config["debug"]
# Display node information to the terminal
@@ -701,7 +700,7 @@ def node_keepalive(logger, config, zkhandler, this_node):
runtime_start = datetime.now()
logger.out(
"Starting node keepalive run",
f"Starting node keepalive run at {datetime.now()}",
state="t",
)
@@ -794,6 +793,10 @@ def node_keepalive(logger, config, zkhandler, this_node):
this_node.memfree = int(psutil.virtual_memory().free / 1024 / 1024)
this_node.cpuload = round(os.getloadavg()[0], 2)
# Get node network statistics via netstats instance
netstats.set_interfaces()
netstats.set_data()
# Join against running threads
if config["enable_hypervisor"]:
vm_stats_thread.join(timeout=config["keepalive_interval"])

View File

@@ -44,7 +44,7 @@ from daemon_lib.vmbuilder import (
)
# Daemon version
version = "0.9.84"
version = "0.9.86"
config = cfg.get_configuration()