Compare commits
41 Commits
e8552b471b
...
v0.9.86
Author | SHA1 | Date | |
---|---|---|---|
c64e888d30 | |||
f1249452e5 | |||
0a93f526e0 | |||
7c9512fb22 | |||
e88b97f3a9 | |||
709c9cb73e | |||
f41c5176be | |||
38e43b46c3 | |||
ed9c37982a | |||
0f24184b78 | |||
1ba37fe33d | |||
1a05077b10 | |||
57c28376a6 | |||
e781d742e6 | |||
6c6d1508a1 | |||
741dafb26b | |||
032d3ebf18 | |||
5d9e83e8ed | |||
ad0bd8649f | |||
9b5e53e4b6 | |||
9617660342 | |||
ab0a1e0946 | |||
7c116b2fbc | |||
1023c55087 | |||
9235187c6f | |||
0c94f1b4f8 | |||
44a4f0e1f7 | |||
5d53a3e529 | |||
35e22cb50f | |||
a3171b666b | |||
48e41d7b05 | |||
d6aecf195e | |||
9329784010 | |||
9dc5097dbc | |||
5776cb3a09 | |||
53d632f283 | |||
7bc0760b78 | |||
9aee2a9075 | |||
8f0ae3e2dd | |||
946d3eaf43 | |||
1f6347d24b |
25
CHANGELOG.md
@@ -1,5 +1,30 @@
 ## PVC Changelog

+###### [v0.9.86](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.86)
+
+* [API Daemon] Significantly improves the performance of several commands via async Zookeeper calls and removal of superfluous backend calls.
+* [Docs] Improves the project README and updates screenshot images to show the current output and more functionality.
+* [API Daemon/CLI] Corrects some bugs in VM metainformation output.
+* [Node Daemon] Fixes resource reporting bugs from 0.9.81 and properly clears node resource numbers on a fence.
+* [Health Daemon] Adds a wait during pvchealthd startup until the node is in run state, to avoid erroneous faults during node bootup.
+* [API Daemon] Fixes an incorrect reference to legacy pvcapid.yaml file in migration script.
+
+###### [v0.9.85](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.85)
+
+* [Packaging] Fixes a dependency bug introduced in 0.9.84
+* [Node Daemon] Fixes an output bug during keepalives
+* [Node Daemon] Fixes a bug in the example Prometheus Grafana dashboard
+
+###### [v0.9.84](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.84)
+
+**Breaking Changes:** This release features a major reconfiguration of how monitoring and reporting of the cluster health works. Node health plugins now report "faults", as do several other issues that were previously checked manually in the "cluster" daemon library for the "/status" endpoint; these are now reported from within the Health daemon. These faults are persistent: under each given identifier, a fault is triggered once, and subsequent triggers simply update its "last reported" time. An additional set of API endpoints and commands is added to manage these faults, either by "ack"(nowledging) them (keeping the alert around to be further updated but setting its health delta to 0%) or by "delete"ing them (completely removing the fault unless it retriggers); this can be done individually, for multiple faults (from the CLI), or for all faults. Cluster health reporting is now based solely on these faults, and the default interval for health checks is reduced to 15 seconds to accommodate this. In addition, Prometheus metrics have been added, along with an example Grafana dashboard, for the PVC cluster itself, as well as a proxy to the Ceph cluster metrics. This release also fixes some bugs in the VM provisioner that were introduced in 0.9.83; these fixes require a **reimport or reconfiguration of any provisioner scripts**; reference the updated examples for details.
+
+* [All] Adds persistent fault reporting to clusters, replacing the old cluster health calculations.
+* [API Daemon] Adds cluster-level Prometheus metric exporting as well as a Ceph Prometheus proxy to the API.
+* [CLI Client] Improves formatting output of "pvc cluster status".
+* [Node Daemon] Fixes several bugs and enhances the working of the psql health check plugin.
+* [Worker Daemon] Fixes several bugs in the example provisioner scripts, and moves the libvirt_schema library into the daemon common libraries.
+
 ###### [v0.9.83](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.83)

 **Breaking Changes:** This release features a breaking change for the daemon config. A new unified "pvc.conf" file is required for all daemons (and the CLI client for Autobackup and API-on-this-host functionality), which will be written by the "pvc" role in the PVC Ansible framework. Using the "update-pvc-daemons" oneshot playbook from PVC Ansible is **required** to update to this release, as it will ensure this file is written to the proper place before deploying the new package versions, and also ensures that the old entries are cleaned up afterwards. In addition, this release fully splits the node worker and health subsystems into discrete daemons ("pvcworkerd" and "pvchealthd") and packages ("pvc-daemon-worker" and "pvc-daemon-health") respectively. The "pvc-daemon-node" package also now depends on both packages, and the "pvc-daemon-api" package can now be reliably used outside of the PVC nodes themselves (for instance, in a VM) without any strange cross-dependency issues.
71
README.md
@ -1,5 +1,5 @@
|
||||
<p align="center">
|
||||
<img alt="Logo banner" src="docs/images/pvc_logo_black.png"/>
|
||||
<img alt="Logo banner" src="images/pvc_logo_black.png"/>
|
||||
<br/><br/>
|
||||
<a href="https://github.com/parallelvirtualcluster/pvc"><img alt="License" src="https://img.shields.io/github/license/parallelvirtualcluster/pvc"/></a>
|
||||
<a href="https://github.com/psf/black"><img alt="Code style: Black" src="https://img.shields.io/badge/code%20style-black-000000.svg"/></a>
|
||||
@ -23,37 +23,62 @@ Installation of PVC is accomplished by two main components: a [Node installer IS
|
||||
|
||||
Just give it physical servers, and it will run your VMs without you having to think about it, all in just an hour or two of setup time.
|
||||
|
||||
|
||||
## What is it based on?
|
||||
|
||||
The core node and API daemons, as well as the CLI API client, are written in Python 3 and are fully Free Software (GNU GPL v3). In addition to these, PVC makes use of the following software tools to provide a holistic hyperconverged infrastructure solution:
|
||||
|
||||
* Debian GNU/Linux as the base OS.
|
||||
* Linux KVM, QEMU, and Libvirt for VM management.
|
||||
* Linux `ip`, FRRouting, NFTables, DNSMasq, and PowerDNS for network management.
|
||||
* Ceph for storage management.
|
||||
* Apache Zookeeper for the primary cluster state database.
|
||||
* Patroni PostgreSQL manager for the secondary relational databases (DNS aggregation, Provisioner configuration).
|
||||
|
||||
|
||||
## Getting Started
|
||||
|
||||
To get started with PVC, please see the [About](https://docs.parallelvirtualcluster.org/en/latest/about/) page for general information about the project, and the [Getting Started](https://docs.parallelvirtualcluster.org/en/latest/getting-started/) page for details on configuring your first cluster.
|
||||
|
||||
To get started with PVC, please see the [About](https://docs.parallelvirtualcluster.org/en/latest/about-pvc/) page for general information about the project, and the [Getting Started](https://docs.parallelvirtualcluster.org/en/latest/deployment/getting-started/) page for details on configuring your first cluster.
|
||||
|
||||
## Changelog
|
||||
|
||||
View the changelog in [CHANGELOG.md](CHANGELOG.md).
|
||||
|
||||
View the changelog in [CHANGELOG.md](CHANGELOG.md). **Please note that any breaking changes are announced here; ensure you read the changelog before upgrading!**
|
||||
|
||||
## Screenshots
|
||||
|
||||
While PVC's API and internals aren't very screenshot-worthy, here is some example output of the CLI tool.
|
||||
These screenshots show some of the available functionality of the PVC system and CLI as of PVC v0.9.85.
|
||||
|
||||
<p><img alt="Node listing" src="docs/images/pvc-nodes.png"/><br/><i>Listing the nodes in a cluster</i></p>
|
||||
<p><img alt="0. Integrated help" src="images/0-integrated-help.png"/><br/>
|
||||
<i>The CLI features an integrated, fully-featured help system to show details about every possible command.</i>
|
||||
</p>
|
||||
|
||||
<p><img alt="Network listing" src="docs/images/pvc-networks.png"/><br/><i>Listing the networks in a cluster, showing 3 bridged and 1 IPv4-only managed networks</i></p>
|
||||
<p><img alt="1. Connection management" src="images/1-connection-management.png"/><br/>
|
||||
<i>A single CLI instance can manage multiple clusters, including a quick detail view, and will default to a "local" connection if an "/etc/pvc/pvc.conf" file is found; sensitive API keys are hidden by default.</i>
|
||||
</p>
|
||||
|
||||
<p><img alt="VM listing and migration" src="docs/images/pvc-migration.png"/><br/><i>Listing a limited set of VMs and migrating one with status updates</i></p>
|
||||
<p><img alt="2. Cluster details and output formats" src="images/2-cluster-details-and-output-formats.png"/><br/>
|
||||
<i>PVC can show the key details of your cluster at a glance, including health, persistent fault events, and key resources; the CLI can output both in pretty human format and JSON for easier machine parsing in scripts.</i>
|
||||
</p>
|
||||
|
||||
<p><img alt="Node logs" src="docs/images/pvc-nodelog.png"/><br/><i>Viewing the logs of a node (keepalives and VM [un]migration)</i></p>
|
||||
<p><img alt="3. Node information" src="images/3-node-information.png"/><br/>
|
||||
<i>PVC can show details about the nodes in the cluster, including their live health and resource utilization.</i>
|
||||
</p>
|
||||
|
||||
<p><img alt="4. VM information" src="images/4-vm-information.png"/><br/>
|
||||
<i>PVC can show details about the VMs in the cluster, including their state, resource allocations, current hosting node, and metadata.</i>
|
||||
</p>
|
||||
|
||||
<p><img alt="5. VM details" src="images/5-vm-details.png"/><br/>
|
||||
<i>In addition to the above basic details, PVC can also show extensive information about a running VM's devices and other resource utilization.</i>
|
||||
</p>
|
||||
|
||||
<p><img alt="6. Network information" src="images/6-network-information.png"/><br/>
|
||||
<i>PVC has two major client network types, and ensures a consistent configuration of client networks across the entire cluster; managed networks can feature DHCP, DNS, firewall, and other functionality including DHCP reservations.</i>
|
||||
</p>
|
||||
|
||||
<p><img alt="7. Storage information" src="images/7-storage-information.png"/><br/>
|
||||
<i>PVC provides a convenient abstracted view of the underlying Ceph system and can manage all core aspects of it.</i>
|
||||
</p>
|
||||
|
||||
<p><img alt="8. VM and node logs" src="images/8-vm-and-node-logs.png"/><br/>
|
||||
<i>PVC can display logs from VM serial consoles (if properly configured) and nodes in-client to facilitate quick troubleshooting.</i>
|
||||
</p>
|
||||
|
||||
<p><img alt="9. VM and worker tasks" src="images/9-vm-and-worker-tasks.png"/><br/>
|
||||
<i>PVC provides full VM lifecycle management, as well as long-running worker-based commands (in this example, clearing a VM's storage locks).</i>
|
||||
</p>
|
||||
|
||||
<p><img alt="10. Provisioner" src="images/10-provisioner.png"/><br/>
|
||||
<i>PVC features an extensively customizable and configurable VM provisioner system, including EC2-compatible CloudInit support, allowing you to define flexible VM profiles and provision new VMs with a single command.</i>
|
||||
</p>
|
||||
|
||||
<p><img alt="11. Prometheus and Grafana dashboard" src="images/11-prometheus-grafana.png"/><br/>
|
||||
<i>PVC features several monitoring integration examples under "node-daemon/monitoring", including CheckMK, Munin, and, most recently, Prometheus, including an example Grafana dashboard for cluster monitoring and alerting.</i>
|
||||
</p>
|
||||
|
@@ -3,7 +3,7 @@
 # Apply PVC database migrations
 # Part of the Parallel Virtual Cluster (PVC) system

-export PVC_CONFIG_FILE="/etc/pvc/pvcapid.yaml"
+export PVC_CONFIG_FILE="/etc/pvc/pvc.conf"

 if [[ ! -f ${PVC_CONFIG_FILE} ]]; then
     echo "Create a configuration file at ${PVC_CONFIG_FILE} before upgrading the database."
@@ -27,7 +27,7 @@ from distutils.util import strtobool as dustrtobool
 import daemon_lib.config as cfg

 # Daemon version
-version = "0.9.83"
+version = "0.9.86"

 # API version
 API_VERSION = 1.0
@ -131,165 +131,12 @@ def cluster_metrics(zkhandler):
|
||||
Format status data from cluster_status into Prometheus-compatible metrics
|
||||
"""
|
||||
|
||||
# Get general cluster information
|
||||
status_retflag, status_data = pvc_cluster.get_info(zkhandler)
|
||||
if not status_retflag:
|
||||
return "Error: Status data threw error", 400
|
||||
|
||||
faults_retflag, faults_data = pvc_faults.get_list(zkhandler)
|
||||
if not faults_retflag:
|
||||
return "Error: Faults data threw error", 400
|
||||
|
||||
node_retflag, node_data = pvc_node.get_list(zkhandler)
|
||||
if not node_retflag:
|
||||
return "Error: Node data threw error", 400
|
||||
|
||||
vm_retflag, vm_data = pvc_vm.get_list(zkhandler)
|
||||
if not vm_retflag:
|
||||
return "Error: VM data threw error", 400
|
||||
|
||||
osd_retflag, osd_data = pvc_ceph.get_list_osd(zkhandler)
|
||||
if not osd_retflag:
|
||||
return "Error: OSD data threw error", 400
|
||||
|
||||
output_lines = list()
|
||||
|
||||
output_lines.append("# HELP pvc_info PVC cluster information")
|
||||
output_lines.append("# TYPE pvc_info gauge")
|
||||
output_lines.append(
|
||||
f"pvc_info{{primary_node=\"{status_data['primary_node']}\", version=\"{status_data['pvc_version']}\", upstream_ip=\"{status_data['upstream_ip']}\"}} 1"
|
||||
)
|
||||
|
||||
output_lines.append("# HELP pvc_cluster_maintenance PVC cluster maintenance state")
|
||||
output_lines.append("# TYPE pvc_cluster_maintenance gauge")
|
||||
output_lines.append(
|
||||
f"pvc_cluster_maintenance {1 if bool(strtobool(status_data['maintenance'])) else 0}"
|
||||
)
|
||||
|
||||
output_lines.append("# HELP pvc_cluster_health PVC cluster health status")
|
||||
output_lines.append("# TYPE pvc_cluster_health gauge")
|
||||
output_lines.append(f"pvc_cluster_health {status_data['cluster_health']['health']}")
|
||||
|
||||
output_lines.append("# HELP pvc_cluster_faults PVC cluster new faults")
|
||||
output_lines.append("# TYPE pvc_cluster_faults gauge")
|
||||
fault_map = dict()
|
||||
for fault_type in pvc_common.fault_state_combinations:
|
||||
fault_map[fault_type] = 0
|
||||
for fault in faults_data:
|
||||
fault_map[fault["status"]] += 1
|
||||
for fault_type in fault_map:
|
||||
output_lines.append(
|
||||
f'pvc_cluster_faults{{status="{fault_type}"}} {fault_map[fault_type]}'
|
||||
)
|
||||
|
||||
# output_lines.append("# HELP pvc_cluster_faults PVC cluster health faults")
|
||||
# output_lines.append("# TYPE pvc_cluster_faults gauge")
|
||||
# for fault_msg in status_data["cluster_health"]["messages"]:
|
||||
# output_lines.append(
|
||||
# f"pvc_cluster_faults{{id=\"{fault_msg['id']}\", message=\"{fault_msg['text']}\"}} {fault_msg['health_delta']}"
|
||||
# )
|
||||
|
||||
output_lines.append("# HELP pvc_node_health PVC cluster node health status")
|
||||
output_lines.append("# TYPE pvc_node_health gauge")
|
||||
for node in status_data["node_health"]:
|
||||
if isinstance(status_data["node_health"][node]["health"], int):
|
||||
output_lines.append(
|
||||
f"pvc_node_health{{node=\"{node}\"}} {status_data['node_health'][node]['health']}"
|
||||
)
|
||||
|
||||
output_lines.append("# HELP pvc_node_daemon_states PVC Node daemon state counts")
|
||||
output_lines.append("# TYPE pvc_node_daemon_states gauge")
|
||||
node_daemon_state_map = dict()
|
||||
for state in set([s.split(",")[0] for s in pvc_common.node_state_combinations]):
|
||||
node_daemon_state_map[state] = 0
|
||||
for node in node_data:
|
||||
node_daemon_state_map[node["daemon_state"]] += 1
|
||||
for state in node_daemon_state_map:
|
||||
output_lines.append(
|
||||
f'pvc_node_daemon_states{{state="{state}"}} {node_daemon_state_map[state]}'
|
||||
)
|
||||
|
||||
output_lines.append("# HELP pvc_node_domain_states PVC Node domain state counts")
|
||||
output_lines.append("# TYPE pvc_node_domain_states gauge")
|
||||
node_domain_state_map = dict()
|
||||
for state in set([s.split(",")[1] for s in pvc_common.node_state_combinations]):
|
||||
node_domain_state_map[state] = 0
|
||||
for node in node_data:
|
||||
node_domain_state_map[node["domain_state"]] += 1
|
||||
for state in node_domain_state_map:
|
||||
output_lines.append(
|
||||
f'pvc_node_domain_states{{state="{state}"}} {node_domain_state_map[state]}'
|
||||
)
|
||||
|
||||
output_lines.append("# HELP pvc_vm_states PVC VM state counts")
|
||||
output_lines.append("# TYPE pvc_vm_states gauge")
|
||||
vm_state_map = dict()
|
||||
for state in set(pvc_common.vm_state_combinations):
|
||||
vm_state_map[state] = 0
|
||||
for vm in vm_data:
|
||||
vm_state_map[vm["state"]] += 1
|
||||
for state in vm_state_map:
|
||||
output_lines.append(f'pvc_vm_states{{state="{state}"}} {vm_state_map[state]}')
|
||||
|
||||
output_lines.append("# HELP pvc_osd_up_states PVC OSD up state counts")
|
||||
output_lines.append("# TYPE pvc_osd_up_states gauge")
|
||||
osd_up_state_map = dict()
|
||||
for state in set([s.split(",")[0] for s in pvc_common.ceph_osd_state_combinations]):
|
||||
osd_up_state_map[state] = 0
|
||||
for osd in osd_data:
|
||||
if osd["stats"]["up"] > 0:
|
||||
osd_up_state_map["up"] += 1
|
||||
else:
|
||||
osd_up_state_map["down"] += 1
|
||||
for state in osd_up_state_map:
|
||||
output_lines.append(
|
||||
f'pvc_osd_up_states{{state="{state}"}} {osd_up_state_map[state]}'
|
||||
)
|
||||
|
||||
output_lines.append("# HELP pvc_osd_in_states PVC OSD in state counts")
|
||||
output_lines.append("# TYPE pvc_osd_in_states gauge")
|
||||
osd_in_state_map = dict()
|
||||
for state in set([s.split(",")[1] for s in pvc_common.ceph_osd_state_combinations]):
|
||||
osd_in_state_map[state] = 0
|
||||
for osd in osd_data:
|
||||
if osd["stats"]["in"] > 0:
|
||||
osd_in_state_map["in"] += 1
|
||||
else:
|
||||
osd_in_state_map["out"] += 1
|
||||
for state in osd_in_state_map:
|
||||
output_lines.append(
|
||||
f'pvc_osd_in_states{{state="{state}"}} {osd_in_state_map[state]}'
|
||||
)
|
||||
|
||||
output_lines.append("# HELP pvc_nodes PVC Node count")
|
||||
output_lines.append("# TYPE pvc_nodes gauge")
|
||||
output_lines.append(f"pvc_nodes {status_data['nodes']['total']}")
|
||||
|
||||
output_lines.append("# HELP pvc_vms PVC VM count")
|
||||
output_lines.append("# TYPE pvc_vms gauge")
|
||||
output_lines.append(f"pvc_vms {status_data['vms']['total']}")
|
||||
|
||||
output_lines.append("# HELP pvc_osds PVC OSD count")
|
||||
output_lines.append("# TYPE pvc_osds gauge")
|
||||
output_lines.append(f"pvc_osds {status_data['osds']['total']}")
|
||||
|
||||
output_lines.append("# HELP pvc_networks PVC Network count")
|
||||
output_lines.append("# TYPE pvc_networks gauge")
|
||||
output_lines.append(f"pvc_networks {status_data['networks']}")
|
||||
|
||||
output_lines.append("# HELP pvc_pools PVC Storage Pool count")
|
||||
output_lines.append("# TYPE pvc_pools gauge")
|
||||
output_lines.append(f"pvc_pools {status_data['pools']}")
|
||||
|
||||
output_lines.append("# HELP pvc_volumes PVC Storage Volume count")
|
||||
output_lines.append("# TYPE pvc_volumes gauge")
|
||||
output_lines.append(f"pvc_volumes {status_data['volumes']}")
|
||||
|
||||
output_lines.append("# HELP pvc_snapshots PVC Storage Snapshot count")
|
||||
output_lines.append("# TYPE pvc_snapshots gauge")
|
||||
output_lines.append(f"pvc_snapshots {status_data['snapshots']}")
|
||||
|
||||
return "\n".join(output_lines) + "\n", 200
|
||||
retflag, retdata = pvc_cluster.get_metrics(zkhandler)
|
||||
if retflag:
|
||||
retcode = 200
|
||||
else:
|
||||
retcode = 400
|
||||
return retdata, retcode
|
||||
|
||||
|
||||
@pvc_common.Profiler(config)
|
||||
|
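The removed block above builds Prometheus text-format output by hand: a `# HELP` line, a `# TYPE` line, then one sample per label value, with every known state pre-seeded to zero. A minimal sketch of that pattern follows; the helper names `format_gauge` and `count_states` are hypothetical and are not part of PVC.

```python
def format_gauge(name, help_text, samples):
    """Render one gauge metric family in the Prometheus text exposition format.

    samples is a list of (labels, value) pairs; an empty labels dict means
    the sample is written without a label set.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            label_text = ",".join(
                f'{key}="{val}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return lines


def count_states(known_states, items, key):
    """Count items per state, keeping a zero entry for every known state."""
    counts = {state: 0 for state in known_states}
    for item in items:
        counts[item[key]] = counts.get(item[key], 0) + 1
    return counts


# Hypothetical sample data, shaped like the vm_data entries in the diff
vm_data = [{"state": "start"}, {"state": "start"}, {"state": "stop"}]
vm_counts = count_states(["start", "stop", "migrate"], vm_data, "state")
vm_lines = format_gauge(
    "pvc_vm_states",
    "PVC VM state counts",
    [({"state": state}, count) for state, count in vm_counts.items()],
)
node_lines = format_gauge("pvc_nodes", "PVC Node count", [({}, 3)])
metrics_text = "\n".join(vm_lines + node_lines) + "\n"
```

Emitting a zero for states with no members keeps each time series continuous, which tends to make dashboards and alert rules simpler than having series appear and disappear.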
@@ -249,6 +249,8 @@ def getOutputColours(node_information):
         daemon_state_colour = ansiprint.yellow()
     elif node_information["daemon_state"] == "dead":
         daemon_state_colour = ansiprint.red() + ansiprint.bold()
+    elif node_information["daemon_state"] == "fenced":
+        daemon_state_colour = ansiprint.red()
     else:
         daemon_state_colour = ansiprint.blue()

@@ -1659,24 +1659,26 @@ def format_info(config, domain_information, long_output):
     )

     if not domain_information.get("node_selector"):
-        formatted_node_selector = "False"
+        formatted_node_selector = "Default"
     else:
-        formatted_node_selector = domain_information["node_selector"]
+        formatted_node_selector = str(domain_information["node_selector"]).title()

     if not domain_information.get("node_limit"):
-        formatted_node_limit = "False"
+        formatted_node_limit = "Any"
     else:
         formatted_node_limit = ", ".join(domain_information["node_limit"])

     if not domain_information.get("node_autostart"):
+        autostart_colour = ansiprint.blue()
         formatted_node_autostart = "False"
     else:
-        formatted_node_autostart = domain_information["node_autostart"]
+        autostart_colour = ansiprint.green()
+        formatted_node_autostart = "True"

     if not domain_information.get("migration_method"):
-        formatted_migration_method = "any"
+        formatted_migration_method = "Any"
     else:
-        formatted_migration_method = domain_information["migration_method"]
+        formatted_migration_method = str(domain_information["migration_method"]).title()

     ainformation.append(
         "{}Migration selector:{} {}".format(
@@ -1689,8 +1691,12 @@ def format_info(config, domain_information, long_output):
         )
     )
     ainformation.append(
-        "{}Autostart:{} {}".format(
-            ansiprint.purple(), ansiprint.end(), formatted_node_autostart
+        "{}Autostart:{} {}{}{}".format(
+            ansiprint.purple(),
+            ansiprint.end(),
+            autostart_colour,
+            formatted_node_autostart,
+            ansiprint.end(),
         )
     )
     ainformation.append(
@@ -1736,13 +1742,17 @@ def format_info(config, domain_information, long_output):
             domain_information["tags"], key=lambda t: t["type"] + t["name"]
         ):
             ainformation.append(
-                " {tags_name: <{tags_name_length}} {tags_type: <{tags_type_length}} {tags_protected: <{tags_protected_length}}".format(
+                " {tags_name: <{tags_name_length}} {tags_type: <{tags_type_length}} {tags_protected_colour}{tags_protected: <{tags_protected_length}}{end}".format(
                     tags_name_length=tags_name_length,
                     tags_type_length=tags_type_length,
                     tags_protected_length=tags_protected_length,
                     tags_name=tag["name"],
                     tags_type=tag["type"],
                     tags_protected=str(tag["protected"]),
+                    tags_protected_colour=ansiprint.green()
+                    if tag["protected"]
+                    else ansiprint.blue(),
+                    end=ansiprint.end(),
                 )
             )
     else:
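The `format_info` changes above follow two small patterns: falsy metadata values are shown as a readable default ("Default", "Any") instead of "False", and boolean fields are wrapped in a colour code followed by a reset. The sketch below illustrates both in isolation. The escape-code constants and the helper names are assumptions for illustration; PVC's own `ansiprint` module is not reproduced here.

```python
# Hypothetical ANSI escape codes standing in for ansiprint.green()/blue()/end()
GREEN = "\033[92m"
BLUE = "\033[94m"
END = "\033[0m"


def format_selector(value, default):
    """Return a title-cased value, or a readable default when it is unset."""
    if not value:
        return default
    return str(value).title()


def colourise_flag(value):
    """Render a truthy/falsy flag as coloured 'True' or 'False' text."""
    colour = GREEN if value else BLUE
    return f"{colour}{bool(value)}{END}"


autostart_line = "Autostart: " + colourise_flag(True)
selector_line = "Migration selector: " + format_selector(None, "Default")
method_line = "Migration method: " + format_selector("live", "Any")
```

Always closing the coloured span with the reset code keeps the colour from leaking into whatever is printed next.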
@@ -2,7 +2,7 @@ from setuptools import setup

 setup(
     name="pvc",
-    version="0.9.83",
+    version="0.9.86",
     packages=["pvc.cli", "pvc.lib"],
     install_requires=[
         "Click",
@@ -215,14 +215,26 @@ def getClusterOSDList(zkhandler):


 def getOSDInformation(zkhandler, osd_id):
-    # Get the devices
-    osd_fsid = zkhandler.read(("osd.ofsid", osd_id))
-    osd_node = zkhandler.read(("osd.node", osd_id))
-    osd_device = zkhandler.read(("osd.device", osd_id))
-    osd_is_split = bool(strtobool(zkhandler.read(("osd.is_split", osd_id))))
-    osd_db_device = zkhandler.read(("osd.db_device", osd_id))
+    (
+        osd_fsid,
+        osd_node,
+        osd_device,
+        _osd_is_split,
+        osd_db_device,
+        osd_stats_raw,
+    ) = zkhandler.read_many(
+        [
+            ("osd.ofsid", osd_id),
+            ("osd.node", osd_id),
+            ("osd.device", osd_id),
+            ("osd.is_split", osd_id),
+            ("osd.db_device", osd_id),
+            ("osd.stats", osd_id),
+        ]
+    )
+
+    osd_is_split = bool(strtobool(_osd_is_split))
     # Parse the stats data
-    osd_stats_raw = zkhandler.read(("osd.stats", osd_id))
     osd_stats = dict(json.loads(osd_stats_raw))

     osd_information = {
@@ -308,13 +320,18 @@ def get_list_osd(zkhandler, limit=None, is_fuzzy=True):
 #
 def getPoolInformation(zkhandler, pool):
     # Parse the stats data
-    pool_stats_raw = zkhandler.read(("pool.stats", pool))
+    (pool_stats_raw, tier, pgs,) = zkhandler.read_many(
+        [
+            ("pool.stats", pool),
+            ("pool.tier", pool),
+            ("pool.pgs", pool),
+        ]
+    )
+
     pool_stats = dict(json.loads(pool_stats_raw))
     volume_count = len(getCephVolumes(zkhandler, pool))
-    tier = zkhandler.read(("pool.tier", pool))
     if tier is None:
         tier = "default"
-    pgs = zkhandler.read(("pool.pgs", pool))

     pool_information = {
         "name": pool,
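Both hunks above replace a series of single `read` calls with one `read_many` call whose results are unpacked in request order. How `read_many` is implemented is not shown in this diff. The sketch below only illustrates the general idea of issuing several reads concurrently while preserving order. `FakeStore` and its thread-pool approach are assumptions made for the example, not the PVC `zkhandler`.

```python
import json
from concurrent.futures import ThreadPoolExecutor


class FakeStore:
    """Hypothetical in-memory key/value store standing in for a real backend."""

    def __init__(self, data):
        self._data = dict(data)

    def read(self, key):
        # A missing key reads as None so that callers can apply a default
        return self._data.get(key)

    def read_many(self, keys):
        # Submit every read at once; map() yields results in the order of keys
        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(self.read, keys))


store = FakeStore(
    {
        ("pool.stats", "vms"): '{"used_bytes": 1024}',
        ("pool.pgs", "vms"): "64",
    }
)

# One batched call, unpacked positionally like the new getPoolInformation code
pool_stats_raw, tier, pgs = store.read_many(
    [("pool.stats", "vms"), ("pool.tier", "vms"), ("pool.pgs", "vms")]
)
pool_stats = json.loads(pool_stats_raw)
if tier is None:
    tier = "default"
```

Because the caller unpacks positionally, the batched call must return exactly one result per requested key, in the same order, including a placeholder for keys that do not exist.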
@ -19,14 +19,12 @@
|
||||
#
|
||||
###############################################################################
|
||||
|
||||
from distutils.util import strtobool
|
||||
from json import loads
|
||||
|
||||
import daemon_lib.common as common
|
||||
import daemon_lib.faults as faults
|
||||
import daemon_lib.vm as pvc_vm
|
||||
import daemon_lib.node as pvc_node
|
||||
import daemon_lib.network as pvc_network
|
||||
import daemon_lib.ceph as pvc_ceph
|
||||
|
||||
|
||||
def set_maintenance(zkhandler, maint_state):
|
||||
@@ -45,9 +43,7 @@ def set_maintenance(zkhandler, maint_state):
         return True, "Successfully set cluster in normal mode"


-def getClusterHealthFromFaults(zkhandler):
-    faults_list = faults.getAllFaults(zkhandler)
-
+def getClusterHealthFromFaults(zkhandler, faults_list):
     unacknowledged_faults = [fault for fault in faults_list if fault["status"] != "ack"]

     # Generate total cluster health numbers
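The hunk above changes `getClusterHealthFromFaults` to receive the fault list from its caller and then filters out acknowledged faults. The rest of the calculation is not visible in this diff. As a rough illustration only, one plausible way to turn unacknowledged faults into a health percentage is to subtract each fault's delta from 100 and floor the result at zero. The `health_delta` key and the clamping rule are assumptions for the example, not confirmed PVC behaviour.

```python
def health_from_faults(faults_list):
    """Compute a 0-100 health value from a list of fault dictionaries.

    Assumes each fault has a "status" string and a numeric "health_delta";
    acknowledged faults are ignored, matching the filter in the diff.
    """
    unacknowledged = [fault for fault in faults_list if fault["status"] != "ack"]
    health = 100
    for fault in unacknowledged:
        health -= fault["health_delta"]
    # Never report a negative health percentage
    return max(health, 0)


# Hypothetical fault entries
example_faults = [
    {"status": "new", "health_delta": 25},
    {"status": "ack", "health_delta": 50},
]
example_health = health_from_faults(example_faults)
```

Passing the fault list in as an argument, as the new signature does, lets a caller that already fetched the faults reuse them instead of triggering a second backend read.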
@@ -217,20 +213,40 @@ def getClusterHealth(zkhandler, node_list, vm_list, ceph_osd_list):


 def getNodeHealth(zkhandler, node_list):
+    # Get the health state of all nodes
+    node_health_reads = list()
+    for node in node_list:
+        node_health_reads += [
+            ("node.monitoring.health", node),
+            ("node.monitoring.plugins", node),
+        ]
+    all_node_health_details = zkhandler.read_many(node_health_reads)
+    # Parse out the Node health details
     node_health = dict()
-    for index, node in enumerate(node_list):
+    for nidx, node in enumerate(node_list):
+        # Split the large list of return values by the IDX of this node
+        # Each node result is 2 fields long
+        pos_start = nidx * 2
+        pos_end = nidx * 2 + 2
+        node_health_value, node_health_plugins = tuple(
+            all_node_health_details[pos_start:pos_end]
+        )
+        node_health_details = pvc_node.getNodeHealthDetails(
+            zkhandler, node, node_health_plugins.split()
+        )
+
         node_health_messages = list()
-        node_health_value = node["health"]
-        for entry in node["health_details"]:
+        for entry in node_health_details:
             if entry["health_delta"] > 0:
                 node_health_messages.append(f"'{entry['name']}': {entry['message']}")

         node_health_entry = {
-            "health": node_health_value,
+            "health": int(node_health_value)
+            if isinstance(node_health_value, int)
+            else node_health_value,
             "messages": node_health_messages,
         }
-
-        node_health[node["name"]] = node_health_entry
+        node_health[node] = node_health_entry

     return node_health

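The new `getNodeHealth` code queues two keys per node, reads them all in one batch, and then slices the flat result list back into per-node pairs using `nidx * 2` offsets. The same bookkeeping can be expressed with a small chunking helper, sketched below. The `chunked` helper and the sample data are hypothetical and only illustrate the slicing technique.

```python
def chunked(values, size):
    """Split a flat list into consecutive tuples of exactly `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    if len(values) % size != 0:
        raise ValueError("flat result list is not a multiple of the chunk size")
    return [tuple(values[pos : pos + size]) for pos in range(0, len(values), size)]


# Hypothetical batch result: (health, plugins) for each node, in request order
node_list = ["hv1", "hv2"]
flat_results = [100, "psql load", 90, "psql"]

node_health = {
    node: {"health": health, "plugins": plugins.split()}
    for node, (health, plugins) in zip(node_list, chunked(flat_results, 2))
}
```

The length check is a cheap guard against a mismatch between the number of keys requested and the number of results returned, which would otherwise shift every later node's fields silently.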
@ -239,78 +255,156 @@ def getClusterInformation(zkhandler):
|
||||
# Get cluster maintenance state
|
||||
maintenance_state = zkhandler.read("base.config.maintenance")
|
||||
|
||||
# Get node information object list
|
||||
retcode, node_list = pvc_node.get_list(zkhandler, None)
|
||||
|
||||
# Get primary node
|
||||
primary_node = common.getPrimaryNode(zkhandler)
|
||||
|
||||
# Get PVC version of primary node
|
||||
pvc_version = "0.0.0"
|
||||
for node in node_list:
|
||||
if node["name"] == primary_node:
|
||||
pvc_version = node["pvc_version"]
|
||||
|
||||
# Get vm information object list
|
||||
retcode, vm_list = pvc_vm.get_list(zkhandler, None, None, None, None)
|
||||
|
||||
# Get network information object list
|
||||
retcode, network_list = pvc_network.get_list(zkhandler, None, None)
|
||||
|
||||
# Get storage information object list
|
||||
retcode, ceph_osd_list = pvc_ceph.get_list_osd(zkhandler, None)
|
||||
retcode, ceph_pool_list = pvc_ceph.get_list_pool(zkhandler, None)
|
||||
retcode, ceph_volume_list = pvc_ceph.get_list_volume(zkhandler, None, None)
|
||||
retcode, ceph_snapshot_list = pvc_ceph.get_list_snapshot(
|
||||
zkhandler, None, None, None
|
||||
maintenance_state, primary_node = zkhandler.read_many(
|
||||
[
|
||||
("base.config.maintenance"),
|
||||
("base.config.primary_node"),
|
||||
]
|
||||
)
|
||||
|
||||
# Determine, for each subsection, the total count
|
||||
# Get PVC version of primary node
|
||||
pvc_version = zkhandler.read(("node.data.pvc_version", primary_node))
|
||||
|
||||
# Get the list of Nodes
|
||||
node_list = zkhandler.children("base.node")
|
||||
node_count = len(node_list)
|
||||
vm_count = len(vm_list)
|
||||
network_count = len(network_list)
|
||||
ceph_osd_count = len(ceph_osd_list)
|
||||
ceph_pool_count = len(ceph_pool_list)
|
||||
ceph_volume_count = len(ceph_volume_list)
|
||||
ceph_snapshot_count = len(ceph_snapshot_list)
|
||||
|
||||
# Format the Node states
|
||||
# Get the daemon and domain states of all Nodes
|
||||
node_state_reads = list()
|
||||
for node in node_list:
|
||||
node_state_reads += [
|
||||
("node.state.daemon", node),
|
||||
("node.state.domain", node),
|
||||
]
|
||||
all_node_states = zkhandler.read_many(node_state_reads)
|
||||
# Parse out the Node states
|
||||
node_data = list()
|
||||
formatted_node_states = {"total": node_count}
|
||||
for state in common.node_state_combinations:
|
||||
state_count = 0
|
||||
for node in node_list:
|
||||
node_state = f"{node['daemon_state']},{node['domain_state']}"
|
||||
if node_state == state:
|
||||
state_count += 1
|
||||
if state_count > 0:
|
||||
formatted_node_states[state] = state_count
|
||||
for nidx, node in enumerate(node_list):
|
||||
# Split the large list of return values by the IDX of this node
|
||||
# Each node result is 2 fields long
|
||||
pos_start = nidx * 2
|
||||
pos_end = nidx * 2 + 2
|
||||
node_daemon_state, node_domain_state = tuple(all_node_states[pos_start:pos_end])
|
||||
node_data.append(
|
||||
{
|
||||
"name": node,
|
||||
"daemon_state": node_daemon_state,
|
||||
"domain_state": node_domain_state,
|
||||
}
|
||||
)
|
||||
node_state = f"{node_daemon_state},{node_domain_state}"
|
||||
# Add to the count for this node's state
|
||||
if node_state in common.node_state_combinations:
|
||||
if formatted_node_states.get(node_state) is not None:
|
||||
formatted_node_states[node_state] += 1
|
||||
else:
|
||||
formatted_node_states[node_state] = 1
|
||||
|
||||
# Format the VM states
|
||||
# Get the list of VMs
|
||||
vm_list = zkhandler.children("base.domain")
|
||||
vm_count = len(vm_list)
|
||||
# Get the states of all VMs
|
||||
vm_state_reads = list()
|
||||
for vm in vm_list:
|
||||
vm_state_reads += [
|
||||
("domain", vm),
|
||||
("domain.state", vm),
|
||||
]
|
||||
all_vm_states = zkhandler.read_many(vm_state_reads)
|
||||
# Parse out the VM states
|
||||
vm_data = list()
|
||||
formatted_vm_states = {"total": vm_count}
|
||||
for state in common.vm_state_combinations:
|
||||
state_count = 0
|
||||
for vm in vm_list:
|
||||
if vm["state"] == state:
|
||||
state_count += 1
|
||||
if state_count > 0:
|
||||
formatted_vm_states[state] = state_count
|
||||
for vidx, vm in enumerate(vm_list):
|
||||
# Split the large list of return values by the IDX of this VM
|
||||
# Each VM result is 2 fields long
|
||||
pos_start = vidx * 2
|
||||
pos_end = vidx * 2 + 2
|
||||
vm_name, vm_state = tuple(all_vm_states[pos_start:pos_end])
|
||||
vm_data.append(
|
||||
{
|
||||
"uuid": vm,
|
||||
"name": vm_name,
|
||||
"state": vm_state,
|
||||
}
|
||||
)
|
||||
# Add to the count for this VM's state
|
||||
if vm_state in common.vm_state_combinations:
|
||||
if formatted_vm_states.get(vm_state) is not None:
|
||||
formatted_vm_states[vm_state] += 1
|
||||
else:
|
||||
formatted_vm_states[vm_state] = 1
|
||||
|
||||
# Format the OSD states
|
||||
# Get the list of Ceph OSDs
|
||||
ceph_osd_list = zkhandler.children("base.osd")
|
||||
ceph_osd_count = len(ceph_osd_list)
|
||||
# Get the states of all OSDs ("stat" is not a typo since we're reading stats; states are in
|
||||
# the stats JSON object)
|
||||
osd_stat_reads = list()
|
||||
for osd in ceph_osd_list:
|
||||
osd_stat_reads += [("osd.stats", osd)]
|
||||
all_osd_stats = zkhandler.read_many(osd_stat_reads)
|
||||
# Parse out the OSD states
|
||||
osd_data = list()
|
||||
formatted_osd_states = {"total": ceph_osd_count}
|
||||
up_texts = {1: "up", 0: "down"}
|
||||
in_texts = {1: "in", 0: "out"}
|
||||
formatted_osd_states = {"total": ceph_osd_count}
|
||||
for state in common.ceph_osd_state_combinations:
|
||||
state_count = 0
|
||||
for ceph_osd in ceph_osd_list:
|
||||
ceph_osd_state = f"{up_texts[ceph_osd['stats']['up']]},{in_texts[ceph_osd['stats']['in']]}"
|
||||
if ceph_osd_state == state:
|
||||
state_count += 1
|
||||
if state_count > 0:
|
||||
formatted_osd_states[state] = state_count
|
||||
for oidx, osd in enumerate(ceph_osd_list):
|
||||
# Split the large list of return values by the IDX of this OSD
|
||||
# Each OSD result is 1 field long, so just use the IDX
|
||||
_osd_stats = all_osd_stats[oidx]
|
||||
# We have to load this JSON object and get our up/in states from it
|
||||
osd_stats = loads(_osd_stats)
|
||||
# Get our states
|
||||
osd_up = up_texts[osd_stats["up"]]
|
||||
osd_in = in_texts[osd_stats["in"]]
|
||||
osd_data.append(
|
||||
{
|
||||
"id": osd,
|
||||
"up": osd_up,
|
||||
"in": osd_in,
|
||||
}
|
||||
)
|
||||
osd_state = f"{osd_up},{osd_in}"
|
||||
# Add to the count for this OSD's state
|
||||
if osd_state in common.ceph_osd_state_combinations:
|
||||
if formatted_osd_states.get(osd_state) is not None:
|
||||
formatted_osd_states[osd_state] += 1
|
||||
else:
|
||||
formatted_osd_states[osd_state] = 1
|
||||
|
||||
# Get the list of Networks
|
||||
network_list = zkhandler.children("base.network")
|
||||
network_count = len(network_list)
|
||||
|
||||
# Get the list of Ceph pools
|
||||
ceph_pool_list = zkhandler.children("base.pool")
|
||||
ceph_pool_count = len(ceph_pool_list)
|
||||
|
||||
# Get the list of Ceph volumes
|
||||
ceph_volume_list = list()
|
||||
for pool in ceph_pool_list:
|
||||
ceph_volume_list_pool = zkhandler.children(("volume", pool))
|
||||
if ceph_volume_list_pool is not None:
|
||||
ceph_volume_list += [f"{pool}/{volume}" for volume in ceph_volume_list_pool]
|
||||
ceph_volume_count = len(ceph_volume_list)
|
||||
|
||||
# Get the list of Ceph snapshots
|
||||
ceph_snapshot_list = list()
|
||||
for volume in ceph_volume_list:
|
||||
ceph_snapshot_list_volume = zkhandler.children(("snapshot", volume))
|
||||
if ceph_snapshot_list_volume is not None:
|
||||
ceph_snapshot_list += [
|
||||
f"{volume}@{snapshot}" for snapshot in ceph_snapshot_list_volume
|
||||
]
|
||||
ceph_snapshot_count = len(ceph_snapshot_list)
|
||||
|
||||
# Get the list of faults
|
||||
faults_data = faults.getAllFaults(zkhandler)
|
||||
|
||||
# Format the status data
|
||||
cluster_information = {
|
||||
"cluster_health": getClusterHealthFromFaults(zkhandler),
|
||||
"cluster_health": getClusterHealthFromFaults(zkhandler, faults_data),
|
||||
"node_health": getNodeHealth(zkhandler, node_list),
|
||||
"maintenance": maintenance_state,
|
||||
"primary_node": primary_node,
|
||||
@ -323,6 +417,12 @@ def getClusterInformation(zkhandler):
|
||||
"pools": ceph_pool_count,
|
||||
"volumes": ceph_volume_count,
|
||||
"snapshots": ceph_snapshot_count,
|
||||
"detail": {
|
||||
"node": node_data,
|
||||
"vm": vm_data,
|
||||
"osd": osd_data,
|
||||
"faults": faults_data,
|
||||
},
|
||||
}
|
||||
|
||||
return cluster_information
|
||||
@ -337,6 +437,157 @@ def get_info(zkhandler):
|
||||
return False, "ERROR: Failed to obtain cluster information!"
|
||||
|
||||
|
||||
def get_metrics(zkhandler):
|
||||
# Get general cluster information
|
||||
status_retflag, status_data = get_info(zkhandler)
|
||||
if not status_retflag:
|
||||
return False, "Error: Status data threw error"
|
||||
|
||||
faults_data = status_data["detail"]["faults"]
|
||||
node_data = status_data["detail"]["node"]
|
||||
vm_data = status_data["detail"]["vm"]
|
||||
osd_data = status_data["detail"]["osd"]
|
||||
|
||||
output_lines = list()
|
||||
|
||||
output_lines.append("# HELP pvc_info PVC cluster information")
|
||||
output_lines.append("# TYPE pvc_info gauge")
|
||||
output_lines.append(
|
||||
f"pvc_info{{primary_node=\"{status_data['primary_node']}\", version=\"{status_data['pvc_version']}\", upstream_ip=\"{status_data['upstream_ip']}\"}} 1"
|
||||
)
|
||||
|
||||
output_lines.append("# HELP pvc_cluster_maintenance PVC cluster maintenance state")
|
||||
output_lines.append("# TYPE pvc_cluster_maintenance gauge")
|
||||
output_lines.append(
|
||||
f"pvc_cluster_maintenance {1 if bool(strtobool(status_data['maintenance'])) else 0}"
|
||||
)
|
||||
|
||||
output_lines.append("# HELP pvc_cluster_health PVC cluster health status")
|
||||
output_lines.append("# TYPE pvc_cluster_health gauge")
|
||||
output_lines.append(f"pvc_cluster_health {status_data['cluster_health']['health']}")
|
||||
|
||||
output_lines.append("# HELP pvc_cluster_faults PVC cluster new faults")
|
||||
output_lines.append("# TYPE pvc_cluster_faults gauge")
|
||||
fault_map = dict()
|
||||
for fault_type in common.fault_state_combinations:
|
||||
fault_map[fault_type] = 0
|
||||
for fault in faults_data:
|
||||
fault_map[fault["status"]] += 1
|
||||
for fault_type in fault_map:
|
||||
output_lines.append(
|
||||
f'pvc_cluster_faults{{status="{fault_type}"}} {fault_map[fault_type]}'
|
||||
)
|
||||
|
||||
# output_lines.append("# HELP pvc_cluster_faults PVC cluster health faults")
|
||||
# output_lines.append("# TYPE pvc_cluster_faults gauge")
|
||||
# for fault_msg in status_data["cluster_health"]["messages"]:
|
||||
# output_lines.append(
|
||||
# f"pvc_cluster_faults{{id=\"{fault_msg['id']}\", message=\"{fault_msg['text']}\"}} {fault_msg['health_delta']}"
|
||||
# )
|
||||
|
||||
output_lines.append("# HELP pvc_node_health PVC cluster node health status")
|
||||
output_lines.append("# TYPE pvc_node_health gauge")
|
||||
for node in status_data["node_health"]:
|
||||
if isinstance(status_data["node_health"][node]["health"], int):
|
||||
output_lines.append(
|
||||
f"pvc_node_health{{node=\"{node}\"}} {status_data['node_health'][node]['health']}"
|
||||
)
|
||||
|
||||
output_lines.append("# HELP pvc_node_daemon_states PVC Node daemon state counts")
|
||||
output_lines.append("# TYPE pvc_node_daemon_states gauge")
|
||||
node_daemon_state_map = dict()
|
||||
for state in set([s.split(",")[0] for s in common.node_state_combinations]):
|
||||
node_daemon_state_map[state] = 0
|
||||
for node in node_data:
|
||||
node_daemon_state_map[node["daemon_state"]] += 1
|
||||
for state in node_daemon_state_map:
|
||||
output_lines.append(
|
||||
f'pvc_node_daemon_states{{state="{state}"}} {node_daemon_state_map[state]}'
|
||||
)
|
||||
|
||||
output_lines.append("# HELP pvc_node_domain_states PVC Node domain state counts")
|
||||
output_lines.append("# TYPE pvc_node_domain_states gauge")
|
||||
node_domain_state_map = dict()
|
||||
for state in set([s.split(",")[1] for s in common.node_state_combinations]):
|
||||
node_domain_state_map[state] = 0
|
||||
for node in node_data:
|
||||
node_domain_state_map[node["domain_state"]] += 1
|
||||
for state in node_domain_state_map:
|
||||
output_lines.append(
|
||||
f'pvc_node_domain_states{{state="{state}"}} {node_domain_state_map[state]}'
|
||||
)
|
||||
|
||||
output_lines.append("# HELP pvc_vm_states PVC VM state counts")
|
||||
output_lines.append("# TYPE pvc_vm_states gauge")
|
||||
vm_state_map = dict()
|
||||
for state in set(common.vm_state_combinations):
|
||||
vm_state_map[state] = 0
|
||||
for vm in vm_data:
|
||||
vm_state_map[vm["state"]] += 1
|
||||
for state in vm_state_map:
|
||||
output_lines.append(f'pvc_vm_states{{state="{state}"}} {vm_state_map[state]}')
|
||||
|
||||
output_lines.append("# HELP pvc_osd_up_states PVC OSD up state counts")
|
||||
output_lines.append("# TYPE pvc_osd_up_states gauge")
|
||||
osd_up_state_map = dict()
|
||||
for state in set([s.split(",")[0] for s in common.ceph_osd_state_combinations]):
|
||||
osd_up_state_map[state] = 0
|
||||
for osd in osd_data:
|
||||
if osd["up"] == "up":
|
||||
osd_up_state_map["up"] += 1
|
||||
else:
|
||||
osd_up_state_map["down"] += 1
|
||||
for state in osd_up_state_map:
|
||||
output_lines.append(
|
||||
f'pvc_osd_up_states{{state="{state}"}} {osd_up_state_map[state]}'
|
||||
)
|
||||
|
||||
output_lines.append("# HELP pvc_osd_in_states PVC OSD in state counts")
|
||||
output_lines.append("# TYPE pvc_osd_in_states gauge")
|
||||
osd_in_state_map = dict()
|
||||
for state in set([s.split(",")[1] for s in common.ceph_osd_state_combinations]):
|
||||
osd_in_state_map[state] = 0
|
||||
for osd in osd_data:
|
||||
if osd["in"] == "in":
|
||||
osd_in_state_map["in"] += 1
|
||||
else:
|
||||
osd_in_state_map["out"] += 1
|
||||
for state in osd_in_state_map:
|
||||
output_lines.append(
|
||||
f'pvc_osd_in_states{{state="{state}"}} {osd_in_state_map[state]}'
|
||||
)
|
||||
|
||||
output_lines.append("# HELP pvc_nodes PVC Node count")
|
||||
output_lines.append("# TYPE pvc_nodes gauge")
|
||||
output_lines.append(f"pvc_nodes {status_data['nodes']['total']}")
|
||||
|
||||
output_lines.append("# HELP pvc_vms PVC VM count")
|
||||
output_lines.append("# TYPE pvc_vms gauge")
|
||||
output_lines.append(f"pvc_vms {status_data['vms']['total']}")
|
||||
|
||||
output_lines.append("# HELP pvc_osds PVC OSD count")
|
||||
output_lines.append("# TYPE pvc_osds gauge")
|
||||
output_lines.append(f"pvc_osds {status_data['osds']['total']}")
|
||||
|
||||
output_lines.append("# HELP pvc_networks PVC Network count")
|
||||
output_lines.append("# TYPE pvc_networks gauge")
|
||||
output_lines.append(f"pvc_networks {status_data['networks']}")
|
||||
|
||||
output_lines.append("# HELP pvc_pools PVC Storage Pool count")
|
||||
output_lines.append("# TYPE pvc_pools gauge")
|
||||
output_lines.append(f"pvc_pools {status_data['pools']}")
|
||||
|
||||
output_lines.append("# HELP pvc_volumes PVC Storage Volume count")
|
||||
output_lines.append("# TYPE pvc_volumes gauge")
|
||||
output_lines.append(f"pvc_volumes {status_data['volumes']}")
|
||||
|
||||
output_lines.append("# HELP pvc_snapshots PVC Storage Snapshot count")
|
||||
output_lines.append("# TYPE pvc_snapshots gauge")
|
||||
output_lines.append(f"pvc_snapshots {status_data['snapshots']}")
|
||||
|
||||
return True, "\n".join(output_lines) + "\n"
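The `get_metrics` function above repeats the same three-step pattern for every metric: a `# HELP` line, a `# TYPE` line, then one sample line per label set. As a minimal sketch of that pattern, the helper below renders one gauge in the Prometheus text format. The name `render_gauge` and its signature are hypothetical and not part of the PVC codebase.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge metric as a list of Prometheus text-format lines.

    `samples` is a list of (labels_dict, value) pairs; an empty dict
    produces a sample line without a label block.
    NOTE: hypothetical helper for illustration, not a PVC function.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            # Build the label block, e.g. state="start"
            label_text = ",".join(f'{k}="{v}"' for k, v in labels.items())
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return lines


vm_lines = render_gauge(
    "pvc_vm_states", "PVC VM state counts", [({"state": "start"}, 3)]
)
node_lines = render_gauge("pvc_nodes", "PVC Node count", [({}, 3)])
metrics_text = "\n".join(vm_lines + node_lines) + "\n"
```

Unlike the diff, this sketch joins multiple labels with a bare comma, whereas `pvc_info` uses a comma followed by a space. Label values are not escaped here, so it assumes values contain no quotes, backslashes, or newlines.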
|
||||
|
||||
|
||||
def cluster_initialize(zkhandler, overwrite=False):
|
||||
# Abort if we've initialized the cluster before
|
||||
if zkhandler.exists("base.config.primary_node") and not overwrite:
|
||||
|
@ -401,13 +401,23 @@ def getDomainTags(zkhandler, dom_uuid):
|
||||
"""
|
||||
tags = list()
|
||||
|
||||
for tag in zkhandler.children(("domain.meta.tags", dom_uuid)):
|
||||
tag_type = zkhandler.read(("domain.meta.tags", dom_uuid, "tag.type", tag))
|
||||
protected = bool(
|
||||
strtobool(
|
||||
zkhandler.read(("domain.meta.tags", dom_uuid, "tag.protected", tag))
|
||||
)
|
||||
)
|
||||
all_tags = zkhandler.children(("domain.meta.tags", dom_uuid))
|
||||
|
||||
tag_reads = list()
|
||||
for tag in all_tags:
|
||||
tag_reads += [
|
||||
("domain.meta.tags", dom_uuid, "tag.type", tag),
|
||||
("domain.meta.tags", dom_uuid, "tag.protected", tag),
|
||||
]
|
||||
all_tag_data = zkhandler.read_many(tag_reads)
|
||||
|
||||
for tidx, tag in enumerate(all_tags):
|
||||
# Split the large list of return values by the IDX of this tag
|
||||
# Each tag result is 2 fields long
|
||||
pos_start = tidx * 2
|
||||
pos_end = tidx * 2 + 2
|
||||
tag_type, protected = tuple(all_tag_data[pos_start:pos_end])
|
||||
protected = bool(strtobool(protected))
|
||||
tags.append({"name": tag, "type": tag_type, "protected": protected})
|
||||
|
||||
return tags
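The `pos_start`/`pos_end` arithmetic in this hunk, which recurs in the node, VM, fault, and plugin loops, slices one flat result tuple into fixed-width groups. A minimal sketch of the same idea as a reusable generator follows. `chunk_results` is a hypothetical name, the sample data is invented for illustration, and the boolean parsing is a simplification of the `strtobool` call used in the diff.

```python
def chunk_results(flat_results, fields_per_item):
    """Yield consecutive fixed-width tuples from a flat sequence of results.

    NOTE: hypothetical helper for illustration, not a PVC function.
    """
    for start in range(0, len(flat_results), fields_per_item):
        yield tuple(flat_results[start:start + fields_per_item])


# Invented sample data: two tags, each with a (type, protected) pair,
# laid out flat in the same order the reads were requested.
all_tags = ["web", "db"]
all_tag_data = ("user", "False", "system", "True")

tags = []
for tag, (tag_type, protected) in zip(all_tags, chunk_results(all_tag_data, 2)):
    # Simplified boolean parsing; the diff uses strtobool() instead
    tags.append({"name": tag, "type": tag_type, "protected": protected == "True"})
```

This only works because the results come back in request order, so the field count per item has to match the number of keys queued for each item.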
|
||||
@ -422,19 +432,34 @@ def getDomainMetadata(zkhandler, dom_uuid):
|
||||
|
||||
The UUID must be validated before calling this function!
|
||||
"""
|
||||
domain_node_limit = zkhandler.read(("domain.meta.node_limit", dom_uuid))
|
||||
domain_node_selector = zkhandler.read(("domain.meta.node_selector", dom_uuid))
|
||||
domain_node_autostart = zkhandler.read(("domain.meta.autostart", dom_uuid))
|
||||
domain_migration_method = zkhandler.read(("domain.meta.migrate_method", dom_uuid))
|
||||
(
|
||||
domain_node_limit,
|
||||
domain_node_selector,
|
||||
domain_node_autostart,
|
||||
domain_migration_method,
|
||||
) = zkhandler.read_many(
|
||||
[
|
||||
("domain.meta.node_limit", dom_uuid),
|
||||
("domain.meta.node_selector", dom_uuid),
|
||||
("domain.meta.autostart", dom_uuid),
|
||||
("domain.meta.migrate_method", dom_uuid),
|
||||
]
|
||||
)
|
||||
|
||||
if not domain_node_limit:
|
||||
domain_node_limit = None
|
||||
else:
|
||||
domain_node_limit = domain_node_limit.split(",")
|
||||
|
||||
if not domain_node_selector or domain_node_selector == "none":
|
||||
domain_node_selector = None
|
||||
|
||||
if not domain_node_autostart:
|
||||
domain_node_autostart = None
|
||||
|
||||
if not domain_migration_method or domain_migration_method == "none":
|
||||
domain_migration_method = None
|
||||
|
||||
return (
|
||||
domain_node_limit,
|
||||
domain_node_selector,
|
||||
@ -451,10 +476,25 @@ def getInformationFromXML(zkhandler, uuid):
|
||||
Gather information about a VM from the Libvirt XML configuration in the Zookeeper database
|
||||
and return a dict() containing it.
|
||||
"""
|
||||
domain_state = zkhandler.read(("domain.state", uuid))
|
||||
domain_node = zkhandler.read(("domain.node", uuid))
|
||||
domain_lastnode = zkhandler.read(("domain.last_node", uuid))
|
||||
domain_failedreason = zkhandler.read(("domain.failed_reason", uuid))
|
||||
(
|
||||
domain_state,
|
||||
domain_node,
|
||||
domain_lastnode,
|
||||
domain_failedreason,
|
||||
domain_profile,
|
||||
domain_vnc,
|
||||
stats_data,
|
||||
) = zkhandler.read_many(
|
||||
[
|
||||
("domain.state", uuid),
|
||||
("domain.node", uuid),
|
||||
("domain.last_node", uuid),
|
||||
("domain.failed_reason", uuid),
|
||||
("domain.profile", uuid),
|
||||
("domain.console.vnc", uuid),
|
||||
("domain.stats", uuid),
|
||||
]
|
||||
)
|
||||
|
||||
(
|
||||
domain_node_limit,
|
||||
@ -462,19 +502,17 @@ def getInformationFromXML(zkhandler, uuid):
|
||||
domain_node_autostart,
|
||||
domain_migration_method,
|
||||
) = getDomainMetadata(zkhandler, uuid)
|
||||
domain_tags = getDomainTags(zkhandler, uuid)
|
||||
domain_profile = zkhandler.read(("domain.profile", uuid))
|
||||
|
||||
domain_vnc = zkhandler.read(("domain.console.vnc", uuid))
|
||||
domain_tags = getDomainTags(zkhandler, uuid)
|
||||
|
||||
if domain_vnc:
|
||||
domain_vnc_listen, domain_vnc_port = domain_vnc.split(":")
|
||||
else:
|
||||
domain_vnc_listen = "None"
|
||||
domain_vnc_port = "None"
|
||||
domain_vnc_listen = None
|
||||
domain_vnc_port = None
|
||||
|
||||
parsed_xml = getDomainXML(zkhandler, uuid)
|
||||
|
||||
stats_data = zkhandler.read(("domain.stats", uuid))
|
||||
if stats_data is not None:
|
||||
try:
|
||||
stats_data = loads(stats_data)
|
||||
@ -491,6 +529,7 @@ def getInformationFromXML(zkhandler, uuid):
|
||||
domain_vcpu,
|
||||
domain_vcputopo,
|
||||
) = getDomainMainDetails(parsed_xml)
|
||||
|
||||
domain_networks = getDomainNetworks(parsed_xml, stats_data)
|
||||
|
||||
(
|
||||
|
@ -95,12 +95,24 @@ def getFault(zkhandler, fault_id):
|
||||
return None
|
||||
|
||||
fault_id = fault_id
|
||||
fault_last_time = zkhandler.read(("faults.last_time", fault_id))
|
||||
fault_first_time = zkhandler.read(("faults.first_time", fault_id))
|
||||
fault_ack_time = zkhandler.read(("faults.ack_time", fault_id))
|
||||
fault_status = zkhandler.read(("faults.status", fault_id))
|
||||
fault_delta = int(zkhandler.read(("faults.delta", fault_id)))
|
||||
fault_message = zkhandler.read(("faults.message", fault_id))
|
||||
|
||||
(
|
||||
fault_last_time,
|
||||
fault_first_time,
|
||||
fault_ack_time,
|
||||
fault_status,
|
||||
fault_delta,
|
||||
fault_message,
|
||||
) = zkhandler.read_many(
|
||||
[
|
||||
("faults.last_time", fault_id),
|
||||
("faults.first_time", fault_id),
|
||||
("faults.ack_time", fault_id),
|
||||
("faults.status", fault_id),
|
||||
("faults.delta", fault_id),
|
||||
("faults.message", fault_id),
|
||||
]
|
||||
)
|
||||
|
||||
# Acknowledged faults have a delta of 0
|
||||
if fault_ack_time != "":
|
||||
@ -112,7 +124,7 @@ def getFault(zkhandler, fault_id):
|
||||
"first_reported": fault_first_time,
|
||||
"acknowledged_at": fault_ack_time,
|
||||
"status": fault_status,
|
||||
"health_delta": fault_delta,
|
||||
"health_delta": int(fault_delta),
|
||||
"message": fault_message,
|
||||
}
|
||||
|
||||
@ -126,11 +138,42 @@ def getAllFaults(zkhandler, sort_key="last_reported"):
|
||||
|
||||
all_faults = zkhandler.children(("base.faults"))
|
||||
|
||||
faults_detail = list()
|
||||
|
||||
faults_reads = list()
|
||||
for fault_id in all_faults:
|
||||
fault_detail = getFault(zkhandler, fault_id)
|
||||
faults_detail.append(fault_detail)
|
||||
faults_reads += [
|
||||
("faults.last_time", fault_id),
|
||||
("faults.first_time", fault_id),
|
||||
("faults.ack_time", fault_id),
|
||||
("faults.status", fault_id),
|
||||
("faults.delta", fault_id),
|
||||
("faults.message", fault_id),
|
||||
]
|
||||
all_faults_data = list(zkhandler.read_many(faults_reads))
|
||||
|
||||
faults_detail = list()
|
||||
for fidx, fault_id in enumerate(all_faults):
|
||||
# Split the large list of return values by the IDX of this fault
|
||||
# Each fault result is 6 fields long
|
||||
pos_start = fidx * 6
|
||||
pos_end = fidx * 6 + 6
|
||||
(
|
||||
fault_last_time,
|
||||
fault_first_time,
|
||||
fault_ack_time,
|
||||
fault_status,
|
||||
fault_delta,
|
||||
fault_message,
|
||||
) = tuple(all_faults_data[pos_start:pos_end])
|
||||
fault_output = {
|
||||
"id": fault_id,
|
||||
"last_reported": fault_last_time,
|
||||
"first_reported": fault_first_time,
|
||||
"acknowledged_at": fault_ack_time,
|
||||
"status": fault_status,
|
||||
"health_delta": int(fault_delta),
|
||||
"message": fault_message,
|
||||
}
|
||||
faults_detail.append(fault_output)
|
||||
|
||||
sorted_faults = sorted(faults_detail, key=lambda x: x[sort_key])
|
||||
# Sort newest-first for time-based sorts
|
||||
|
@ -142,19 +142,37 @@ def getNetworkACLs(zkhandler, vni, _direction):
|
||||
|
||||
|
||||
def getNetworkInformation(zkhandler, vni):
|
||||
description = zkhandler.read(("network", vni))
|
||||
nettype = zkhandler.read(("network.type", vni))
|
||||
mtu = zkhandler.read(("network.mtu", vni))
|
||||
domain = zkhandler.read(("network.domain", vni))
|
||||
name_servers = zkhandler.read(("network.nameservers", vni))
|
||||
ip6_network = zkhandler.read(("network.ip6.network", vni))
|
||||
ip6_gateway = zkhandler.read(("network.ip6.gateway", vni))
|
||||
dhcp6_flag = zkhandler.read(("network.ip6.dhcp", vni))
|
||||
ip4_network = zkhandler.read(("network.ip4.network", vni))
|
||||
ip4_gateway = zkhandler.read(("network.ip4.gateway", vni))
|
||||
dhcp4_flag = zkhandler.read(("network.ip4.dhcp", vni))
|
||||
dhcp4_start = zkhandler.read(("network.ip4.dhcp_start", vni))
|
||||
dhcp4_end = zkhandler.read(("network.ip4.dhcp_end", vni))
|
||||
(
|
||||
description,
|
||||
nettype,
|
||||
mtu,
|
||||
domain,
|
||||
name_servers,
|
||||
ip6_network,
|
||||
ip6_gateway,
|
||||
dhcp6_flag,
|
||||
ip4_network,
|
||||
ip4_gateway,
|
||||
dhcp4_flag,
|
||||
dhcp4_start,
|
||||
dhcp4_end,
|
||||
) = zkhandler.read_many(
|
||||
[
|
||||
("network", vni),
|
||||
("network.type", vni),
|
||||
("network.mtu", vni),
|
||||
("network.domain", vni),
|
||||
("network.nameservers", vni),
|
||||
("network.ip6.network", vni),
|
||||
("network.ip6.gateway", vni),
|
||||
("network.ip6.dhcp", vni),
|
||||
("network.ip4.network", vni),
|
||||
("network.ip4.gateway", vni),
|
||||
("network.ip4.dhcp", vni),
|
||||
("network.ip4.dhcp_start", vni),
|
||||
("network.ip4.dhcp_end", vni),
|
||||
]
|
||||
)
|
||||
|
||||
# Construct a data structure to represent the data
|
||||
network_information = {
|
||||
@ -818,31 +836,45 @@ def getSRIOVVFInformation(zkhandler, node, vf):
|
||||
if not zkhandler.exists(("node.sriov.vf", node, "sriov_vf", vf)):
|
||||
return []
|
||||
|
||||
pf = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pf", vf))
|
||||
mtu = zkhandler.read(("node.sriov.vf", node, "sriov_vf.mtu", vf))
|
||||
mac = zkhandler.read(("node.sriov.vf", node, "sriov_vf.mac", vf))
|
||||
vlan_id = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.vlan_id", vf))
|
||||
vlan_qos = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.vlan_qos", vf))
|
||||
tx_rate_min = zkhandler.read(
|
||||
("node.sriov.vf", node, "sriov_vf.config.tx_rate_min", vf)
|
||||
(
|
||||
pf,
|
||||
mtu,
|
||||
mac,
|
||||
vlan_id,
|
||||
vlan_qos,
|
||||
tx_rate_min,
|
||||
tx_rate_max,
|
||||
link_state,
|
||||
spoof_check,
|
||||
trust,
|
||||
query_rss,
|
||||
pci_domain,
|
||||
pci_bus,
|
||||
pci_slot,
|
||||
pci_function,
|
||||
used,
|
||||
used_by_domain,
|
||||
) = zkhandler.read_many(
|
||||
[
|
||||
("node.sriov.vf", node, "sriov_vf.pf", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.mtu", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.mac", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.config.vlan_id", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.config.vlan_qos", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.config.tx_rate_min", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.config.tx_rate_max", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.config.link_state", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.config.spoof_check", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.config.trust", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.config.query_rss", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.pci.domain", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.pci.bus", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.pci.slot", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.pci.function", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.used", vf),
|
||||
("node.sriov.vf", node, "sriov_vf.used_by", vf),
|
||||
]
|
||||
)
|
||||
tx_rate_max = zkhandler.read(
|
||||
("node.sriov.vf", node, "sriov_vf.config.tx_rate_max", vf)
|
||||
)
|
||||
link_state = zkhandler.read(
|
||||
("node.sriov.vf", node, "sriov_vf.config.link_state", vf)
|
||||
)
|
||||
spoof_check = zkhandler.read(
|
||||
("node.sriov.vf", node, "sriov_vf.config.spoof_check", vf)
|
||||
)
|
||||
trust = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.trust", vf))
|
||||
query_rss = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.query_rss", vf))
|
||||
pci_domain = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.domain", vf))
|
||||
pci_bus = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.bus", vf))
|
||||
pci_slot = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.slot", vf))
|
||||
pci_function = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.function", vf))
|
||||
used = zkhandler.read(("node.sriov.vf", node, "sriov_vf.used", vf))
|
||||
used_by_domain = zkhandler.read(("node.sriov.vf", node, "sriov_vf.used_by", vf))
|
||||
|
||||
vf_information = {
|
||||
"phy": vf,
|
||||
|
@ -26,69 +26,134 @@ import json
|
||||
import daemon_lib.common as common
|
||||
|
||||
|
||||
def getNodeInformation(zkhandler, node_name):
|
||||
"""
|
||||
Gather information about a node from the Zookeeper database and return a dict() containing it.
|
||||
"""
|
||||
node_daemon_state = zkhandler.read(("node.state.daemon", node_name))
|
||||
node_coordinator_state = zkhandler.read(("node.state.router", node_name))
|
||||
node_domain_state = zkhandler.read(("node.state.domain", node_name))
|
||||
node_static_data = zkhandler.read(("node.data.static", node_name)).split()
|
||||
node_pvc_version = zkhandler.read(("node.data.pvc_version", node_name))
|
||||
node_cpu_count = int(node_static_data[0])
|
||||
node_kernel = node_static_data[1]
|
||||
node_os = node_static_data[2]
|
||||
node_arch = node_static_data[3]
|
||||
node_vcpu_allocated = int(zkhandler.read(("node.vcpu.allocated", node_name)))
|
||||
node_mem_total = int(zkhandler.read(("node.memory.total", node_name)))
|
||||
node_mem_allocated = int(zkhandler.read(("node.memory.allocated", node_name)))
|
||||
node_mem_provisioned = int(zkhandler.read(("node.memory.provisioned", node_name)))
|
||||
node_mem_used = int(zkhandler.read(("node.memory.used", node_name)))
|
||||
node_mem_free = int(zkhandler.read(("node.memory.free", node_name)))
|
||||
node_load = float(zkhandler.read(("node.cpu.load", node_name)))
|
||||
node_domains_count = int(
|
||||
zkhandler.read(("node.count.provisioned_domains", node_name))
|
||||
)
|
||||
node_running_domains = zkhandler.read(("node.running_domains", node_name)).split()
|
||||
try:
|
||||
node_health = int(zkhandler.read(("node.monitoring.health", node_name)))
|
||||
except Exception:
|
||||
node_health = "N/A"
|
||||
try:
|
||||
node_health_plugins = zkhandler.read(
|
||||
("node.monitoring.plugins", node_name)
|
||||
).split()
|
||||
except Exception:
|
||||
node_health_plugins = list()
|
||||
|
||||
node_health_details = list()
|
||||
def getNodeHealthDetails(zkhandler, node_name, node_health_plugins):
|
||||
plugin_reads = list()
|
||||
for plugin in node_health_plugins:
|
||||
plugin_last_run = zkhandler.read(
|
||||
("node.monitoring.data", node_name, "monitoring_plugin.last_run", plugin)
|
||||
)
|
||||
plugin_health_delta = zkhandler.read(
|
||||
plugin_reads += [
|
||||
(
|
||||
"node.monitoring.data",
|
||||
node_name,
|
||||
"monitoring_plugin.last_run",
|
||||
plugin,
|
||||
),
|
||||
(
|
||||
"node.monitoring.data",
|
||||
node_name,
|
||||
"monitoring_plugin.health_delta",
|
||||
plugin,
|
||||
)
|
||||
)
|
||||
plugin_message = zkhandler.read(
|
||||
("node.monitoring.data", node_name, "monitoring_plugin.message", plugin)
|
||||
)
|
||||
plugin_data = zkhandler.read(
|
||||
("node.monitoring.data", node_name, "monitoring_plugin.data", plugin)
|
||||
)
|
||||
),
|
||||
(
|
||||
"node.monitoring.data",
|
||||
node_name,
|
||||
"monitoring_plugin.message",
|
||||
plugin,
|
||||
),
|
||||
(
|
||||
"node.monitoring.data",
|
||||
node_name,
|
||||
"monitoring_plugin.data",
|
||||
plugin,
|
||||
),
|
||||
]
|
||||
all_plugin_data = list(zkhandler.read_many(plugin_reads))
|
||||
|
||||
node_health_details = list()
|
||||
for pidx, plugin in enumerate(node_health_plugins):
|
||||
# Split the large list of return values by the IDX of this plugin
|
||||
# Each plugin result is 4 fields long
|
||||
pos_start = pidx * 4
|
||||
pos_end = pidx * 4 + 4
|
||||
(
|
||||
plugin_last_run,
|
||||
plugin_health_delta,
|
||||
plugin_message,
|
||||
plugin_data,
|
||||
) = tuple(all_plugin_data[pos_start:pos_end])
|
||||
plugin_output = {
|
||||
"name": plugin,
|
||||
"last_run": int(plugin_last_run),
|
||||
"last_run": int(plugin_last_run) if plugin_last_run is not None else None,
|
||||
"health_delta": int(plugin_health_delta),
|
||||
"message": plugin_message,
|
||||
"data": json.loads(plugin_data),
|
||||
}
|
||||
node_health_details.append(plugin_output)
|
||||
|
||||
return node_health_details
|
||||
|
||||
|
||||
def getNodeInformation(zkhandler, node_name):
|
||||
"""
|
||||
Gather information about a node from the Zookeeper database and return a dict() containing it.
|
||||
"""
|
||||
|
||||
(
|
||||
node_daemon_state,
|
||||
node_coordinator_state,
|
||||
node_domain_state,
|
||||
node_pvc_version,
|
||||
_node_static_data,
|
||||
_node_vcpu_allocated,
|
||||
_node_mem_total,
|
||||
_node_mem_allocated,
|
||||
_node_mem_provisioned,
|
||||
_node_mem_used,
|
||||
_node_mem_free,
|
||||
_node_load,
|
||||
_node_domains_count,
|
||||
_node_running_domains,
|
||||
_node_health,
|
||||
_node_health_plugins,
|
||||
) = zkhandler.read_many(
|
||||
[
|
||||
("node.state.daemon", node_name),
|
||||
("node.state.router", node_name),
|
||||
("node.state.domain", node_name),
|
||||
("node.data.pvc_version", node_name),
|
||||
("node.data.static", node_name),
|
||||
("node.vcpu.allocated", node_name),
|
||||
("node.memory.total", node_name),
|
||||
("node.memory.allocated", node_name),
|
||||
("node.memory.provisioned", node_name),
|
||||
("node.memory.used", node_name),
|
||||
("node.memory.free", node_name),
|
||||
("node.cpu.load", node_name),
|
||||
("node.count.provisioned_domains", node_name),
|
||||
("node.running_domains", node_name),
|
||||
("node.monitoring.health", node_name),
|
||||
("node.monitoring.plugins", node_name),
|
||||
]
|
||||
)
|
||||
|
||||
node_static_data = _node_static_data.split()
|
||||
node_cpu_count = int(node_static_data[0])
|
||||
node_kernel = node_static_data[1]
|
||||
node_os = node_static_data[2]
|
||||
node_arch = node_static_data[3]
|
||||
|
||||
node_vcpu_allocated = int(_node_vcpu_allocated)
|
||||
node_mem_total = int(_node_mem_total)
|
||||
node_mem_allocated = int(_node_mem_allocated)
|
||||
node_mem_provisioned = int(_node_mem_provisioned)
|
||||
node_mem_used = int(_node_mem_used)
|
||||
node_mem_free = int(_node_mem_free)
|
||||
node_load = float(_node_load)
|
||||
node_domains_count = int(_node_domains_count)
|
||||
node_running_domains = _node_running_domains.split()
|
||||
|
||||
try:
|
||||
node_health = int(_node_health)
|
||||
except Exception:
|
||||
node_health = "N/A"
|
||||
|
||||
try:
|
||||
node_health_plugins = _node_health_plugins.split()
|
||||
except Exception:
|
||||
node_health_plugins = list()
|
||||
|
||||
node_health_details = getNodeHealthDetails(
|
||||
zkhandler, node_name, node_health_plugins
|
||||
)
|
||||
|
||||
# Construct a data structure to represent the data
|
||||
node_information = {
|
||||
"name": node_name,
|
||||
@ -269,6 +334,8 @@ def get_list(
|
||||
):
|
||||
node_list = []
|
||||
full_node_list = zkhandler.children("base.node")
|
||||
if full_node_list is None:
|
||||
full_node_list = list()
|
||||
full_node_list.sort()
|
||||
|
||||
if is_fuzzy and limit:
|
||||
|
@ -19,6 +19,7 @@
|
||||
#
|
||||
###############################################################################
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
import time
|
||||
import uuid
|
||||
@ -239,10 +240,41 @@ class ZKHandler(object):
|
||||
# This path is invalid; this is likely due to missing schema entries, so return None
|
||||
return None
|
||||
|
||||
return self.zk_conn.get(path)[0].decode(self.encoding)
|
||||
res = self.zk_conn.get(path)
|
||||
return res[0].decode(self.encoding)
|
||||
except NoNodeError:
|
||||
return None
|
||||
|
||||
async def read_async(self, key):
|
||||
"""
|
||||
Read data from a key asynchronously
|
||||
"""
|
||||
try:
|
||||
path = self.get_schema_path(key)
|
||||
if path is None:
|
||||
# This path is invalid; this is likely due to missing schema entries, so return None
|
||||
return None
|
||||
|
||||
val = self.zk_conn.get_async(path)
|
||||
data = val.get()
|
||||
return data[0].decode(self.encoding)
|
||||
except NoNodeError:
|
||||
return None
|
||||
|
||||
async def _read_many(self, keys):
|
||||
"""
|
||||
Async runner for read_many
|
||||
"""
|
||||
res = await asyncio.gather(*(self.read_async(key) for key in keys))
|
||||
return tuple(res)
|
||||
|
||||
def read_many(self, keys):
|
||||
"""
|
||||
Read data from several keys, asynchronously. Returns a tuple of all key values once all
|
||||
reads are complete.
|
||||
"""
|
||||
return asyncio.run(self._read_many(keys))
|
||||
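The `read_many` docstring promises a tuple of all key values once all reads are complete. A minimal sketch of that contract, assuming a dictionary-backed stand-in rather than a live Zookeeper connection: `StubHandler` and the sample keys and values are hypothetical, and only the `_read_many`/`read_many` shape is taken from the diff. It shows that results come back in the same order as the requested keys, with `None` for a key that does not exist.

```python
import asyncio


class StubHandler:
    """Hypothetical stand-in for ZKHandler, backed by a dict instead of Zookeeper."""

    def __init__(self, data):
        self.data = data

    async def read_async(self, key):
        # A missing key yields None, mirroring the NoNodeError branch in the diff
        return self.data.get(key)

    async def _read_many(self, keys):
        # gather() returns results in the order the awaitables were passed in
        res = await asyncio.gather(*(self.read_async(key) for key in keys))
        return tuple(res)

    def read_many(self, keys):
        return asyncio.run(self._read_many(keys))


# Sample data is invented for illustration only
handler = StubHandler(
    {
        ("node.cpu.load", "hv1"): "0.5",
        ("node.memory.used", "hv1"): "2048",
    }
)
load, missing, used = handler.read_many(
    [
        ("node.cpu.load", "hv1"),
        ("node.does.not.exist", "hv1"),
        ("node.memory.used", "hv1"),
    ]
)
print(load, missing, used)
```

Because `read_many` wraps the coroutine in `asyncio.run`, it is meant to be called from synchronous code; calling it from inside an already running event loop raises `RuntimeError`.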
|
||||
def write(self, kvpairs):
|
||||
"""
|
||||
Create or update one or more keys' data
|
||||
|
31
debian/changelog
vendored
@ -1,3 +1,34 @@
|
||||
pvc (0.9.86-0) unstable; urgency=high
|
||||
|
||||
* [API Daemon] Significantly improves the performance of several commands via async Zookeeper calls and removal of superfluous backend calls.
|
||||
* [Docs] Improves the project README and updates screenshot images to show the current output and more functionality.
|
||||
* [API Daemon/CLI] Corrects some bugs in VM metainformation output.
|
||||
* [Node Daemon] Fixes resource reporting bugs from 0.9.81 and properly clears node resource numbers on a fence.
|
||||
* [Health Daemon] Adds a wait during pvchealthd startup until the node is in run state, to avoid erroneous faults during node bootup.
|
||||
* [API Daemon] Fixes an incorrect reference to legacy pvcapid.yaml file in migration script.
|
||||
|
||||
-- Joshua M. Boniface <joshua@boniface.me> Thu, 14 Dec 2023 14:46:29 -0500
|
||||
|
||||
pvc (0.9.85-0) unstable; urgency=high
|
||||
|
||||
* [Packaging] Fixes a dependency bug introduced in 0.9.84
|
||||
* [Node Daemon] Fixes an output bug during keepalives
|
||||
* [Node Daemon] Fixes a bug in the example Prometheus Grafana dashboard
|
||||
|
||||
-- Joshua M. Boniface <joshua@boniface.me> Sun, 10 Dec 2023 01:00:33 -0500
|
||||
|
||||
pvc (0.9.84-0) unstable; urgency=high
|
||||
|
||||
**Breaking Changes:** This release features a major reconfiguration to how monitoring and reporting of the cluster health works. Node health plugins now report "faults", as do several other issues which were previously manually checked for in "cluster" daemon library for the "/status" endpoint, from within the Health daemon. These faults are persistent, and under each given identifier can be triggered once and subsequent triggers simply update the "last reported" time. An additional set of API endpoints and commands are added to manage these faults, either by "ack"(nowledging) them (keeping the alert around to be further updated but setting its health delta to 0%), or "delete"ing them (completely removing the fault unless it retriggers), both individually, to (from the CLI) multiple, or all. Cluster health reporting is now done based on these faults instead of anything else, and the default interval for health checks is reduced to 15 seconds to accommodate this. In addition to this, Prometheus metrics have been added, along with an example Grafana dashboard, for the PVC cluster itself, as well as a proxy to the Ceph cluster metrics. This release also fixes some bugs in the VM provisioner that were introduced in 0.9.83; these fixes require a **reimport or reconfiguration of any provisioner scripts**; reference the updated examples for details.
|
||||
|
||||
* [All] Adds persistent fault reporting to clusters, replacing the old cluster health calculations.
|
||||
* [API Daemon] Adds cluster-level Prometheus metric exporting as well as a Ceph Prometheus proxy to the API.
|
||||
* [CLI Client] Improves formatting output of "pvc cluster status".
|
||||
* [Node Daemon] Fixes several bugs and enhances the working of the psql health check plugin.
|
||||
* [Worker Daemon] Fixes several bugs in the example provisioner scripts, and moves the libvirt_schema library into the daemon common libraries.
|
||||
|
||||
-- Joshua M. Boniface <joshua@boniface.me> Sat, 09 Dec 2023 23:05:40 -0500
|
||||
|
||||
pvc (0.9.83-0) unstable; urgency=high
|
||||
|
||||
**Breaking Changes:** This release features a breaking change for the daemon config. A new unified "pvc.conf" file is required for all daemons (and the CLI client for Autobackup and API-on-this-host functionality), which will be written by the "pvc" role in the PVC Ansible framework. Using the "update-pvc-daemons" oneshot playbook from PVC Ansible is **required** to update to this release, as it will ensure this file is written to the proper place before deploying the new package versions, and also ensures that the old entries are cleaned up afterwards. In addition, this release fully splits the node worker and health subsystems into discrete daemons ("pvcworkerd" and "pvchealthd") and packages ("pvc-daemon-worker" and "pvc-daemon-health") respectively. The "pvc-daemon-node" package also now depends on both packages, and the "pvc-daemon-api" package can now be reliably used outside of the PVC nodes themselves (for instance, in a VM) without any strange cross-dependency issues.
|
||||
|
8
debian/control
vendored
@ -8,7 +8,7 @@ X-Python3-Version: >= 3.7
|
||||
|
||||
Package: pvc-daemon-node
|
||||
Architecture: all
|
||||
Depends: systemd, pvc-daemon-common, pvc-daemon-health, pvc-daemon-worker, python3-kazoo, python3-psutil, python3-apscheduler, python3-libvirt, python3-psycopg2, python3-dnspython, python3-yaml, python3-distutils, python3-rados, python3-gevent, python3-prometheus-client, ipmitool, libvirt-daemon-system, arping, vlan, bridge-utils, dnsmasq, nftables, pdns-server, pdns-backend-pgsql
|
||||
Depends: systemd, pvc-daemon-common, pvc-daemon-health, pvc-daemon-worker, python3-kazoo, python3-psutil, python3-apscheduler, python3-libvirt, python3-psycopg2, python3-dnspython, python3-yaml, python3-distutils, python3-rados, python3-gevent, ipmitool, libvirt-daemon-system, arping, vlan, bridge-utils, dnsmasq, nftables, pdns-server, pdns-backend-pgsql
|
||||
Description: Parallel Virtual Cluster node daemon
|
||||
A KVM/Zookeeper/Ceph-based VM and private cloud manager
|
||||
.
|
||||
@ -16,7 +16,7 @@ Description: Parallel Virtual Cluster node daemon
|
||||
|
||||
Package: pvc-daemon-health
|
||||
Architecture: all
|
||||
Depends: systemd, pvc-daemon-common, python3-kazoo, python3-psutil, python3-apscheduler, python3-yaml, python3-prometheus-client
|
||||
Depends: systemd, pvc-daemon-common, python3-kazoo, python3-psutil, python3-apscheduler, python3-yaml
|
||||
Description: Parallel Virtual Cluster health daemon
|
||||
A KVM/Zookeeper/Ceph-based VM and private cloud manager
|
||||
.
|
||||
@ -24,7 +24,7 @@ Description: Parallel Virtual Cluster health daemon
|
||||
|
||||
Package: pvc-daemon-worker
|
||||
Architecture: all
|
||||
Depends: systemd, pvc-daemon-common, python3-kazoo, python3-celery, python3-redis, python3-yaml, python3-prometheus-client, python-celery-common, fio
|
||||
Depends: systemd, pvc-daemon-common, python3-kazoo, python3-celery, python3-redis, python3-yaml, python-celery-common, fio
|
||||
Description: Parallel Virtual Cluster worker daemon
|
||||
A KVM/Zookeeper/Ceph-based VM and private cloud manager
|
||||
.
|
||||
@ -32,7 +32,7 @@ Description: Parallel Virtual Cluster worker daemon
|
||||
|
||||
Package: pvc-daemon-api
|
||||
Architecture: all
|
||||
Depends: systemd, pvc-daemon-common, python3-yaml, python3-flask, python3-flask-restful, python3-celery, python3-distutils, python3-redis, python3-lxml, python3-flask-migrate, python3-prometheus-client
|
||||
Depends: systemd, pvc-daemon-common, python3-yaml, python3-flask, python3-flask-restful, python3-celery, python3-distutils, python3-redis, python3-lxml, python3-flask-migrate
|
||||
Description: Parallel Virtual Cluster API daemon
|
||||
A KVM/Zookeeper/Ceph-based VM and private cloud manager
|
||||
.
|
||||
|
@ -6,7 +6,7 @@ VERSION="$( head -1 debian/changelog | awk -F'[()-]' '{ print $2 }' )"
|
||||
|
||||
pushd $( git rev-parse --show-toplevel ) &>/dev/null
|
||||
pushd api-daemon &>/dev/null
|
||||
export PVC_CONFIG_FILE="./pvcapid.sample.yaml"
|
||||
export PVC_CONFIG_FILE="../pvc.sample.conf"
|
||||
./pvcapid-manage_flask.py db migrate -m "PVC version ${VERSION}"
|
||||
./pvcapid-manage_flask.py db upgrade
|
||||
popd &>/dev/null
|
||||
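The `VERSION` line in the hunk header above takes the first line of `debian/changelog` and splits it on `(`, `)` and `-`, keeping the second field. A rough Python equivalent, as a sketch only: `changelog_version` is a hypothetical helper name, and the sample line is the 0.9.86 changelog header shown earlier in this diff.

```python
import re


def changelog_version(first_line):
    """Return the upstream version from a Debian changelog header line."""
    # Same separator set as awk -F'[()-]': split on "(", ")" and "-"
    fields = re.split(r"[()-]", first_line)
    # awk's $2 is the second field, which is index 1 here
    return fields[1]


version = changelog_version("pvc (0.9.86-0) unstable; urgency=high")
print(version)
```

Like the awk expression, this assumes that neither the package name nor the upstream version contains a hyphen.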
|
@ -33,7 +33,7 @@ import os
|
||||
import signal
|
||||
|
||||
# Daemon version
|
||||
version = "0.9.83"
|
||||
version = "0.9.86"
|
||||
|
||||
|
||||
##########################################################
|
||||
@ -80,6 +80,11 @@ def entrypoint():
|
||||
# Connect to Zookeeper and return our handler and current schema version
|
||||
zkhandler, _ = pvchealthd.util.zookeeper.connect(logger, config)
|
||||
|
||||
logger.out("Waiting for node daemon to be operating", state="s")
|
||||
while zkhandler.read(("node.state.daemon", config["node_hostname"])) != "run":
|
||||
sleep(5)
|
||||
logger.out("Node daemon in run state, continuing health daemon startup", state="s")
|
||||
|
||||
# Define a cleanup function
|
||||
def cleanup(failure=False):
|
||||
nonlocal logger, zkhandler, monitoring_instance
|
||||
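The startup wait added in this hunk is a simple polling loop: read the node daemon state, sleep five seconds, and repeat until the state is `run`. A minimal sketch of the same pattern with the read and sleep functions injected so it can be exercised without a cluster: `wait_for_run_state` and the `"init"` placeholder states are hypothetical, and only the `"run"` target state and the 5 second interval come from the diff.

```python
import time


def wait_for_run_state(read_state, interval=5, sleep=time.sleep):
    """Poll read_state() until it returns "run"; return how many polls did not."""
    polls = 0
    while read_state() != "run":
        polls += 1
        sleep(interval)
    return polls


# Simulated states standing in for
# zkhandler.read(("node.state.daemon", config["node_hostname"]))
states = iter(["init", "init", "run"])
waited = wait_for_run_state(lambda: next(states), sleep=lambda _: None)
print(waited)
```

As in the diff, the loop has no upper bound, so it only returns once the state is reported as `run`.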
|
BIN
images/0-integrated-help.png
Normal file
After Width: | Height: | Size: 100 KiB |
BIN
images/1-connection-management.png
Normal file
After Width: | Height: | Size: 50 KiB |
BIN
images/10-provisioner.png
Normal file
After Width: | Height: | Size: 124 KiB |
BIN
images/11-prometheus-grafana.png
Normal file
After Width: | Height: | Size: 168 KiB |
BIN
images/2-cluster-details-and-output-formats.png
Normal file
After Width: | Height: | Size: 140 KiB |
BIN
images/3-node-information.png
Normal file
After Width: | Height: | Size: 97 KiB |
BIN
images/4-vm-information.png
Normal file
After Width: | Height: | Size: 109 KiB |
BIN
images/5-vm-details.png
Normal file
After Width: | Height: | Size: 136 KiB |
BIN
images/6-network-information.png
Normal file
After Width: | Height: | Size: 118 KiB |
BIN
images/7-storage-information.png
Normal file
After Width: | Height: | Size: 166 KiB |
BIN
images/8-vm-and-node-logs.png
Normal file
After Width: | Height: | Size: 177 KiB |
BIN
images/9-vm-and-worker-tasks.png
Normal file
After Width: | Height: | Size: 67 KiB |
@ -2,6 +2,14 @@
|
||||
|
||||
This directory contains several monitoring resources that can be used with various monitoring systems to track and alert on a PVC cluster system.
|
||||
|
||||
## Prometheus + Grafana
|
||||
|
||||
The included example Prometheus configuration and Grafana dashboard can be used to query the PVC API for Prometheus data and display it with a consistent dashboard.
|
||||
|
||||
Note that the default configuration here also includes Ceph cluster information; a Ceph dashboard can be found externally.
|
||||
|
||||
Note too that this does not include node export examples from individual PVC nodes; those must be set up separately.
|
||||
|
||||
## Munin
|
||||
|
||||
The included Munin plugins can be activated by linking to them from `/etc/munin/plugins/`. Two plugins are provided:
|
||||
|
@ -70,7 +70,7 @@ def check_pvc(item, params, section):
|
||||
summary = f"Cluster health is {cluster_health}% (maintenance {maintenance})"
|
||||
|
||||
if len(cluster_messages) > 0:
|
||||
details = ", ".join(cluster_messages)
|
||||
details = ", ".join([m["text"] for m in cluster_messages])
|
||||
|
||||
if cluster_health <= 50 and maintenance == "off":
|
||||
state = State.CRIT
|
||||
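The changed `details` line implies that each entry in `cluster_messages` is now a mapping with a `text` key rather than a plain string. A small sketch of the difference, assuming made-up messages: the `id` field and the message contents are hypothetical, and only the `"text"` key is taken from the diff.

```python
# Hypothetical sample of the message list; only the "text" key comes from the diff
cluster_messages = [
    {"id": "NODE_DOWN", "text": "Node hv2 is not running"},
    {"id": "OSD_OUT", "text": "OSD 3 is out"},
]

# The previous form fails on dict entries, since str.join() accepts only strings
try:
    ", ".join(cluster_messages)
    old_form_failed = False
except TypeError:
    old_form_failed = True

# The updated form extracts each message's text before joining
details = ", ".join([m["text"] for m in cluster_messages])
print(old_form_failed, details)
```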
|
2599
node-daemon/monitoring/prometheus/grafana-pvc-dashboard.json
Normal file
8
node-daemon/monitoring/prometheus/prometheus.yml
Normal file
@ -0,0 +1,8 @@
|
||||
# Other configuration omitted
|
||||
scrape_configs:
|
||||
- job_name: "pvc_cluster"
|
||||
metrics_path: /api/v1/metrics
|
||||
scheme: "http"
|
||||
file_sd_configs:
|
||||
- files:
|
||||
- 'targets-pvc_cluster.json'
|
11
node-daemon/monitoring/prometheus/targets-pvc_cluster.json
Normal file
@ -0,0 +1,11 @@
|
||||
[
|
||||
{
|
||||
"targets": [
|
||||
"pvc.upstream.floating.address.tld:7370"
|
||||
],
|
||||
"labels": {
|
||||
"cluster": "cluster1"
|
||||
}
|
||||
}
|
||||
]
|
||||
|
@ -48,7 +48,7 @@ import re
|
||||
import json
|
||||
|
||||
# Daemon version
|
||||
version = "0.9.83"
|
||||
version = "0.9.86"
|
||||
|
||||
|
||||
##########################################################
|
||||
|
@ -115,6 +115,27 @@ def fence_node(node_name, zkhandler, config, logger):
|
||||
):
|
||||
migrateFromFencedNode(zkhandler, node_name, config, logger)
|
||||
|
||||
# Reset all node resource values
|
||||
logger.out(
|
||||
f"Resetting all resource values for dead node {node_name} to zero",
|
||||
state="i",
|
||||
prefix=f"fencing {node_name}",
|
||||
)
|
||||
zkhandler.write(
|
||||
[
|
||||
(("node.running_domains", node_name), "0"),
|
||||
(("node.count.provisioned_domains", node_name), "0"),
|
||||
(("node.cpu.load", node_name), "0"),
|
||||
(("node.vcpu.allocated", node_name), "0"),
|
||||
(("node.memory.total", node_name), "0"),
|
||||
(("node.memory.used", node_name), "0"),
|
||||
(("node.memory.free", node_name), "0"),
|
||||
(("node.memory.allocated", node_name), "0"),
|
||||
(("node.memory.provisioned", node_name), "0"),
|
||||
(("node.monitoring.health", node_name), None),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
# Migrate hosts away from a fenced node
|
||||
def migrateFromFencedNode(zkhandler, node_name, config, logger):
|
||||
|
@ -477,6 +477,10 @@ def collect_vm_stats(logger, config, zkhandler, this_node, queue):
|
||||
fixed_d_domain = this_node.d_domain.copy()
|
||||
for domain, instance in fixed_d_domain.items():
|
||||
if domain in this_node.domain_list:
|
||||
# Add the allocated memory to our memalloc value
|
||||
memalloc += instance.getmemory()
|
||||
memprov += instance.getmemory()
|
||||
vcpualloc += instance.getvcpus()
|
||||
if instance.getstate() == "start" and instance.getnode() == this_node.name:
|
||||
if instance.getdom() is not None:
|
||||
try:
|
||||
@ -532,11 +536,6 @@ def collect_vm_stats(logger, config, zkhandler, this_node, queue):
|
||||
continue
|
||||
domain_memory_stats = domain.memoryStats()
|
||||
domain_cpu_stats = domain.getCPUStats(True)[0]
|
||||
|
||||
# Add the allocated memory to our memalloc value
|
||||
memalloc += instance.getmemory()
|
||||
memprov += instance.getmemory()
|
||||
vcpualloc += instance.getvcpus()
|
||||
except Exception as e:
|
||||
if debug:
|
||||
try:
|
||||
@ -701,7 +700,7 @@ def node_keepalive(logger, config, zkhandler, this_node):
|
||||
|
||||
runtime_start = datetime.now()
|
||||
logger.out(
|
||||
"Starting node keepalive run",
|
||||
f"Starting node keepalive run at {datetime.now()}",
|
||||
state="t",
|
||||
)
|
||||
|
||||
|
@ -167,6 +167,7 @@ _pvc storage pool remove --yes testing
|
||||
|
||||
# Remove the VM
|
||||
_pvc vm stop --yes testx
|
||||
sleep 5
|
||||
_pvc vm remove --yes testx
|
||||
|
||||
_pvc provisioner profile remove --yes test
|
||||
|
@ -44,7 +44,7 @@ from daemon_lib.vmbuilder import (
|
||||
)
|
||||
|
||||
# Daemon version
|
||||
version = "0.9.83"
|
||||
version = "0.9.86"
|
||||
|
||||
|
||||
config = cfg.get_configuration()
|
||||
|