Compare commits


16 Commits

Author SHA1 Message Date
ab0a1e0946 Update and streamline README and update images 2023-12-10 23:57:01 -05:00
7c116b2fbc Ensure node health value is an int 2023-12-10 23:56:50 -05:00
1023c55087 Fix bug in VM state list 2023-12-10 23:44:01 -05:00
9235187c6f Port Ceph functions to read_many
Only ports getOSDInformation, as all the others feature 3 or fewer reads
which is acceptable sequentially.
2023-12-10 22:24:38 -05:00
0c94f1b4f8 Port Network functions to read_many 2023-12-10 22:19:21 -05:00
44a4f0e1f7 Use new info detail output instead of new lists
Avoids multiple additional ZK calls by using data that is now in the
status detail output.
2023-12-10 22:19:09 -05:00
5d53a3e529 Add state and faults detail to cluster information
We already parse this information out anyway, so might as well add it
to the API output JSON. This can be leveraged by the Prometheus endpoint
as well to avoid duplicate listings.
2023-12-10 17:29:32 -05:00
35e22cb50f Simplify cluster status handling
This significantly simplifies cluster state handling by removing most of
the superfluous get_list() calls, replacing them with basic child reads
since most of them are just for a count anyway. Those that require
states are simplified down to a child read plus direct reads for the
exact items required, leveraging the new read_many() function.
2023-12-10 17:05:46 -05:00
a3171b666b Split node health into separate function 2023-12-10 16:52:10 -05:00
48e41d7b05 Port Faults getFault and getAllFaults to read_many 2023-12-10 16:05:16 -05:00
d6aecf195e Port Node getNodeInformation to read_many 2023-12-10 15:53:28 -05:00
9329784010 Implement async ZK read function
Adds a function, "read_many", which can take in multiple ZK keys and
return the values from all of them, using asyncio to avoid reading
sequentially.

Initial tests show a marked improvement in read performance of multiple
read()-heavy functions (e.g. "get_list()" functions) with this method.
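(A minimal sketch of this pattern follows the commit list below.)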
2023-12-10 15:35:40 -05:00
9dc5097dbc Bump version to 0.9.85 2023-12-10 01:00:33 -05:00
5776cb3a09 Remove Prometheus client dependencies
We don't actually use this (yet!), so remove the dependency for now.
2023-12-10 00:58:09 -05:00
53d632f283 Fix bug in example PVC Grafana dashboard 2023-12-10 00:50:05 -05:00
7bc0760b78 Add time to "starting keepalive" message
Matches the pvchealthd output and adds a useful detail to this otherwise
contextless message.
2023-12-10 00:40:32 -05:00
35 changed files with 524 additions and 229 deletions
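As a point of reference for the read_many commits above, here is a minimal, self-contained sketch of the fan-out pattern they describe, using plain asyncio with simulated reads standing in for real Zookeeper calls (the actual implementation appears in the zkhandler diff further down):

import asyncio

async def read_one(key):
    # Simulated network read; the real code awaits a Zookeeper client here
    await asyncio.sleep(0.01)
    return f"value-of-{key}"

async def _read_many(keys):
    # Dispatch every read concurrently and gather results in request order
    return tuple(await asyncio.gather(*(read_one(key) for key in keys)))

def read_many(keys):
    # Synchronous wrapper, mirroring the style of the new ZKHandler method
    return asyncio.run(_read_many(keys))

print(read_many(["node.state.daemon", "node.state.domain"]))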

View File

@@ -1 +1 @@
0.9.84
0.9.85

View File

@@ -1,5 +1,11 @@
## PVC Changelog
###### [v0.9.85](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.85)
* [Packaging] Fixes a dependency bug introduced in 0.9.84
* [Node Daemon] Fixes an output bug during keepalives
* [Node Daemon] Fixes a bug in the example Prometheus Grafana dashboard
###### [v0.9.84](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.84)
**Breaking Changes:** This release features a major reconfiguration of how monitoring and reporting of cluster health works. Node health plugins now report "faults" from within the Health daemon, as do several other issues which were previously checked for manually in the "cluster" daemon library for the "/status" endpoint. These faults are persistent: each identifier can be triggered once, and subsequent triggers simply update its "last reported" time. An additional set of API endpoints and commands is added to manage these faults, either by "ack"(nowledging) them (keeping the alert around to be further updated but setting its health delta to 0%) or "delete"ing them (completely removing the fault unless it retriggers), individually, (from the CLI) in groups, or all at once. Cluster health reporting is now based solely on these faults, and the default interval for health checks is reduced to 15 seconds to accommodate this. In addition, Prometheus metrics have been added for the PVC cluster itself, along with an example Grafana dashboard, as well as a proxy to the Ceph cluster metrics. This release also fixes some bugs in the VM provisioner that were introduced in 0.9.83; these fixes require a **reimport or reconfiguration of any provisioner scripts**; reference the updated examples for details.

View File

@@ -1,5 +1,5 @@
<p align="center">
<img alt="Logo banner" src="docs/images/pvc_logo_black.png"/>
<img alt="Logo banner" src="images/pvc_logo_black.png"/>
<br/><br/>
<a href="https://github.com/parallelvirtualcluster/pvc"><img alt="License" src="https://img.shields.io/github/license/parallelvirtualcluster/pvc"/></a>
<a href="https://github.com/psf/black"><img alt="Code style: Black" src="https://img.shields.io/badge/code%20style-black-000000.svg"/></a>
@@ -23,37 +23,58 @@ Installation of PVC is accomplished by two main components: a [Node installer IS
Just give it physical servers, and it will run your VMs without you having to think about it, all in just an hour or two of setup time.
## What is it based on?
The core node and API daemons, as well as the CLI API client, are written in Python 3 and are fully Free Software (GNU GPL v3). In addition to these, PVC makes use of the following software tools to provide a holistic hyperconverged infrastructure solution:
* Debian GNU/Linux as the base OS.
* Linux KVM, QEMU, and Libvirt for VM management.
* Linux `ip`, FRRouting, NFTables, DNSMasq, and PowerDNS for network management.
* Ceph for storage management.
* Apache Zookeeper for the primary cluster state database.
* Patroni PostgreSQL manager for the secondary relation databases (DNS aggregation, Provisioner configuration).
## Getting Started
To get started with PVC, please see the [About](https://docs.parallelvirtualcluster.org/en/latest/about/) page for general information about the project, and the [Getting Started](https://docs.parallelvirtualcluster.org/en/latest/getting-started/) page for details on configuring your first cluster.
To get started with PVC, please see the [About](https://docs.parallelvirtualcluster.org/en/latest/about-pvc/) page for general information about the project, and the [Getting Started](https://docs.parallelvirtualcluster.org/en/latest/deployment/getting-started/) page for details on configuring your first cluster.
## Changelog
View the changelog in [CHANGELOG.md](CHANGELOG.md).
View the changelog in [CHANGELOG.md](CHANGELOG.md). **Please note that any breaking changes are announced here; ensure you read the changelog before upgrading!**
## Screenshots
While PVC's API and internals aren't very screenshot-worthy, here is some example output of the CLI tool.
These screenshots show some of the available functionality of the PVC system and CLI as of PVC v0.9.85.
<p><img alt="Node listing" src="docs/images/pvc-nodes.png"/><br/><i>Listing the nodes in a cluster</i></p>
<p><img alt="0. Integrated help" src="images/0-integrated-help.png"/><br/>
<i>The CLI features an integrated, fully-featured help system to show details about every possible command.</i>
</p>
<p><img alt="Network listing" src="docs/images/pvc-networks.png"/><br/><i>Listing the networks in a cluster, showing 3 bridged and 1 IPv4-only managed networks</i></p>
<p><img alt="1. Connection management" src="images/1-connection-management.png"/><br/>
<i>A single CLI instance can manage multiple clusters, including a quick detail view, and will default to a "local" connection if an "/etc/pvc/pvc.conf" file is found; sensitive API keys are hidden by default.</i>
</p>
<p><img alt="VM listing and migration" src="docs/images/pvc-migration.png"/><br/><i>Listing a limited set of VMs and migrating one with status updates</i></p>
<p><img alt="2. Cluster details and output formats" src="images/2-cluster-details-and-output-formats.png"/><br/>
<i>PVC can show the key details of your cluster at a glance, including health, persistent fault events, and key resources; the CLI can output both in pretty human format and JSON for easier machine parsing in scripts.</i>
</p>
<p><img alt="Node logs" src="docs/images/pvc-nodelog.png"/><br/><i>Viewing the logs of a node (keepalives and VM [un]migration)</i></p>
<p><img alt="3. Node information" src="images/3-node-information.png"/><br/>
<i>PVC can show details about the nodes in the cluster, including their live health and resource utilization.</i>
</p>
<p><img alt="4. VM information" src="images/4-vm-information.png"/><br/>
<i>PVC can show details about the VMs in the cluster, including their state and resource allocations.</i>
</p>
<p><img alt="5. VM details" src="images/5-vm-details.png"/><br/>
<i>In addition to the above basic details, PVC can also show extensive information about a running VM's devices and other resource utilization.</i>
</p>
<p><img alt="6. Network information" src="images/6-network-information.png"/><br/>
<i>PVC has two major client network types, and ensures a consistent configuration of client networks across the entire cluster; managed networks can feature DHCP, DNS, firewall, and other functionality including DHCP reservations.</i>
</p>
<p><img alt="7. Storage information" src="images/7-storage-information.png"/><br/>
<i>PVC provides a convenient abstracted view of the underlying Ceph system and can manage all core aspects of it.</i>
</p>
<p><img alt="8. VM and node logs" src="images/8-vm-and-node-logs.png"/><br/>
<i>PVC can display logs from VM serial consoles (if properly configured) and nodes in-client to facilitate quick troubleshooting.</i>
</p>
<p><img alt="9. VM and worker tasks" src="images/9-vm-and-worker-tasks.png"/><br/>
<i>PVC provides full VM lifecycle management, as well as long-running worker-based commands (in this example, clearing a VM's storage locks).</i>
</p>
<p><img alt="10. Provisioner" src="images/10-provisioner.png"/><br/>
<i>PVC features an extensively customizable and configurable VM provisioner system, including EC2-compatible CloudInit support, allowing you to define flexible VM profiles and provision new VMs with a single command.</i>
</p>

View File

@@ -27,7 +27,7 @@ from distutils.util import strtobool as dustrtobool
import daemon_lib.config as cfg
# Daemon version
version = "0.9.84"
version = "0.9.85"
# API version
API_VERSION = 1.0

View File

@@ -136,21 +136,10 @@ def cluster_metrics(zkhandler):
if not status_retflag:
return "Error: Status data threw error", 400
faults_retflag, faults_data = pvc_faults.get_list(zkhandler)
if not faults_retflag:
return "Error: Faults data threw error", 400
node_retflag, node_data = pvc_node.get_list(zkhandler)
if not node_retflag:
return "Error: Node data threw error", 400
vm_retflag, vm_data = pvc_vm.get_list(zkhandler)
if not vm_retflag:
return "Error: VM data threw error", 400
osd_retflag, osd_data = pvc_ceph.get_list_osd(zkhandler)
if not osd_retflag:
return "Error: OSD data threw error", 400
faults_data = status_data["detail"]["faults"]
node_data = status_data["detail"]["node"]
vm_data = status_data["detail"]["vm"]
osd_data = status_data["detail"]["osd"]
output_lines = list()
@@ -237,7 +226,7 @@ def cluster_metrics(zkhandler):
for state in set([s.split(",")[0] for s in pvc_common.ceph_osd_state_combinations]):
osd_up_state_map[state] = 0
for osd in osd_data:
if osd["stats"]["up"] > 0:
if osd["up"] == "up":
osd_up_state_map["up"] += 1
else:
osd_up_state_map["down"] += 1
@@ -252,7 +241,7 @@ def cluster_metrics(zkhandler):
for state in set([s.split(",")[1] for s in pvc_common.ceph_osd_state_combinations]):
osd_in_state_map[state] = 0
for osd in osd_data:
if osd["stats"]["in"] > 0:
if osd["in"] == "in":
osd_in_state_map["in"] += 1
else:
osd_in_state_map["out"] += 1

View File

@@ -2,7 +2,7 @@ from setuptools import setup
setup(
name="pvc",
version="0.9.84",
version="0.9.85",
packages=["pvc.cli", "pvc.lib"],
install_requires=[
"Click",

View File

@@ -215,14 +215,26 @@ def getClusterOSDList(zkhandler):
def getOSDInformation(zkhandler, osd_id):
# Get the devices
osd_fsid = zkhandler.read(("osd.ofsid", osd_id))
osd_node = zkhandler.read(("osd.node", osd_id))
osd_device = zkhandler.read(("osd.device", osd_id))
osd_is_split = bool(strtobool(zkhandler.read(("osd.is_split", osd_id))))
osd_db_device = zkhandler.read(("osd.db_device", osd_id))
(
osd_fsid,
osd_node,
osd_device,
_osd_is_split,
osd_db_device,
osd_stats_raw,
) = zkhandler.read_many(
[
("osd.ofsid", osd_id),
("osd.node", osd_id),
("osd.device", osd_id),
("osd.is_split", osd_id),
("osd.db_device", osd_id),
("osd.stats", osd_id),
]
)
osd_is_split = bool(strtobool(_osd_is_split))
# Parse the stats data
osd_stats_raw = zkhandler.read(("osd.stats", osd_id))
osd_stats = dict(json.loads(osd_stats_raw))
osd_information = {

View File

@@ -23,10 +23,7 @@ from json import loads
import daemon_lib.common as common
import daemon_lib.faults as faults
import daemon_lib.vm as pvc_vm
import daemon_lib.node as pvc_node
import daemon_lib.network as pvc_network
import daemon_lib.ceph as pvc_ceph
def set_maintenance(zkhandler, maint_state):
@@ -45,9 +42,7 @@ def set_maintenance(zkhandler, maint_state):
return True, "Successfully set cluster in normal mode"
def getClusterHealthFromFaults(zkhandler):
faults_list = faults.getAllFaults(zkhandler)
def getClusterHealthFromFaults(zkhandler, faults_list):
unacknowledged_faults = [fault for fault in faults_list if fault["status"] != "ack"]
# Generate total cluster health numbers
@@ -217,20 +212,38 @@ def getClusterHealth(zkhandler, node_list, vm_list, ceph_osd_list):
def getNodeHealth(zkhandler, node_list):
# Get the health state of all nodes
node_health_reads = list()
for node in node_list:
node_health_reads += [
("node.monitoring.health", node),
("node.monitoring.plugins", node),
]
all_node_health_details = zkhandler.read_many(node_health_reads)
# Parse out the Node health details
node_health = dict()
for index, node in enumerate(node_list):
for nidx, node in enumerate(node_list):
# Split the large list of return values by the IDX of this node
# Each node result is 2 fields long
pos_start = nidx * 2
pos_end = nidx * 2 + 2
node_health_value, node_health_plugins = tuple(
all_node_health_details[pos_start:pos_end]
)
node_health_details = pvc_node.getNodeHealthDetails(
zkhandler, node, node_health_plugins.split()
)
node_health_messages = list()
node_health_value = node["health"]
for entry in node["health_details"]:
for entry in node_health_details:
if entry["health_delta"] > 0:
node_health_messages.append(f"'{entry['name']}': {entry['message']}")
node_health_entry = {
"health": node_health_value,
"health": int(node_health_value),
"messages": node_health_messages,
}
node_health[node["name"]] = node_health_entry
node_health[node] = node_health_entry
return node_health
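The pos_start/pos_end arithmetic above recurs throughout these diffs wherever a flat read_many() result carries several fields per item. A hypothetical helper (not part of the actual change) expresses the same de-interleaving more compactly:

def chunk(flat_results, size):
    # Split a flat read_many() result into fixed-size per-item tuples
    return [
        tuple(flat_results[i:i + size])
        for i in range(0, len(flat_results), size)
    ]

# e.g. two fields per node, in node_list order:
#   for node, (health, plugins) in zip(node_list, chunk(results, 2)): ...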
@@ -239,78 +252,146 @@ def getClusterInformation(zkhandler):
# Get cluster maintenance state
maintenance_state = zkhandler.read("base.config.maintenance")
# Get node information object list
retcode, node_list = pvc_node.get_list(zkhandler, None)
# Get primary node
primary_node = common.getPrimaryNode(zkhandler)
# Get PVC version of primary node
pvc_version = "0.0.0"
for node in node_list:
if node["name"] == primary_node:
pvc_version = node["pvc_version"]
# Get vm information object list
retcode, vm_list = pvc_vm.get_list(zkhandler, None, None, None, None)
# Get network information object list
retcode, network_list = pvc_network.get_list(zkhandler, None, None)
# Get storage information object list
retcode, ceph_osd_list = pvc_ceph.get_list_osd(zkhandler, None)
retcode, ceph_pool_list = pvc_ceph.get_list_pool(zkhandler, None)
retcode, ceph_volume_list = pvc_ceph.get_list_volume(zkhandler, None, None)
retcode, ceph_snapshot_list = pvc_ceph.get_list_snapshot(
zkhandler, None, None, None
maintenance_state, primary_node = zkhandler.read_many(
[
("base.config.maintenance"),
("base.config.primary_node"),
]
)
# Determine, for each subsection, the total count
# Get PVC version of primary node
pvc_version = zkhandler.read(("node.data.pvc_version", primary_node))
# Get the list of Nodes
node_list = zkhandler.children("base.node")
node_count = len(node_list)
vm_count = len(vm_list)
network_count = len(network_list)
ceph_osd_count = len(ceph_osd_list)
ceph_pool_count = len(ceph_pool_list)
ceph_volume_count = len(ceph_volume_list)
ceph_snapshot_count = len(ceph_snapshot_list)
# Format the Node states
# Get the daemon and domain states of all Nodes
node_state_reads = list()
for node in node_list:
node_state_reads += [
("node.state.daemon", node),
("node.state.domain", node),
]
all_node_states = zkhandler.read_many(node_state_reads)
# Parse out the Node states
node_data = list()
formatted_node_states = {"total": node_count}
for state in common.node_state_combinations:
state_count = 0
for node in node_list:
node_state = f"{node['daemon_state']},{node['domain_state']}"
if node_state == state:
state_count += 1
if state_count > 0:
formatted_node_states[state] = state_count
for nidx, node in enumerate(node_list):
# Split the large list of return values by the IDX of this node
# Each node result is 2 fields long
pos_start = nidx * 2
pos_end = nidx * 2 + 2
node_daemon_state, node_domain_state = tuple(all_node_states[pos_start:pos_end])
node_data.append(
{
"name": node,
"daemon_state": node_daemon_state,
"domain_state": node_domain_state,
}
)
node_state = f"{node_daemon_state},{node_domain_state}"
# Add to the count for this node's state
if node_state in common.node_state_combinations:
if formatted_node_states.get(node_state) is not None:
formatted_node_states[node_state] += 1
else:
formatted_node_states[node_state] = 1
# Format the VM states
# Get the list of VMs
vm_list = zkhandler.children("base.domain")
vm_count = len(vm_list)
# Get the states of all VMs
vm_state_reads = list()
for vm in vm_list:
vm_state_reads += [
("domain", vm),
("domain.state", vm),
]
all_vm_states = zkhandler.read_many(vm_state_reads)
# Parse out the VM states
vm_data = list()
formatted_vm_states = {"total": vm_count}
for state in common.vm_state_combinations:
state_count = 0
for vm in vm_list:
if vm["state"] == state:
state_count += 1
if state_count > 0:
formatted_vm_states[state] = state_count
for vidx, vm in enumerate(vm_list):
# Split the large list of return values by the IDX of this VM
# Each VM result is 2 fields long
pos_start = vidx * 2
pos_end = vidx * 2 + 2
vm_name, vm_state = tuple(all_vm_states[pos_start:pos_end])
vm_data.append(
{
"uuid": vm,
"name": vm_name,
"state": vm_state,
}
)
# Add to the count for this VM's state
if vm_state in common.vm_state_combinations:
if formatted_vm_states.get(vm_state) is not None:
formatted_vm_states[vm_state] += 1
else:
formatted_vm_states[vm_state] = 1
# Format the OSD states
# Get the list of Ceph OSDs
ceph_osd_list = zkhandler.children("base.osd")
ceph_osd_count = len(ceph_osd_list)
# Get the states of all OSDs ("stat" is not a typo since we're reading stats; states are in
# the stats JSON object)
osd_stat_reads = list()
for osd in ceph_osd_list:
osd_stat_reads += [("osd.stats", osd)]
all_osd_stats = zkhandler.read_many(osd_stat_reads)
# Parse out the OSD states
osd_data = list()
formatted_osd_states = {"total": ceph_osd_count}
up_texts = {1: "up", 0: "down"}
in_texts = {1: "in", 0: "out"}
formatted_osd_states = {"total": ceph_osd_count}
for state in common.ceph_osd_state_combinations:
state_count = 0
for ceph_osd in ceph_osd_list:
ceph_osd_state = f"{up_texts[ceph_osd['stats']['up']]},{in_texts[ceph_osd['stats']['in']]}"
if ceph_osd_state == state:
state_count += 1
if state_count > 0:
formatted_osd_states[state] = state_count
for oidx, osd in enumerate(ceph_osd_list):
# Split the large list of return values by the IDX of this OSD
# Each OSD result is 1 field long, so just use the IDX
_osd_stats = all_osd_stats[oidx]
# We have to load this JSON object and get our up/in states from it
osd_stats = loads(_osd_stats)
# Get our states
osd_up = up_texts[osd_stats["up"]]
osd_in = in_texts[osd_stats["in"]]
osd_data.append(
{
"id": osd,
"up": osd_up,
"in": osd_in,
}
)
osd_state = f"{osd_up},{osd_in}"
# Add to the count for this OSD's state
if osd_state in common.ceph_osd_state_combinations:
if formatted_osd_states.get(osd_state) is not None:
formatted_osd_states[osd_state] += 1
else:
formatted_osd_states[osd_state] = 1
# Get the list of Networks
network_list = zkhandler.children("base.network")
network_count = len(network_list)
# Get the list of Ceph pools
ceph_pool_list = zkhandler.children("base.pool")
ceph_pool_count = len(ceph_pool_list)
# Get the list of Ceph volumes
ceph_volume_list = zkhandler.children("base.volume")
ceph_volume_count = len(ceph_volume_list)
# Get the list of Ceph snapshots
ceph_snapshot_list = zkhandler.children("base.snapshot")
ceph_snapshot_count = len(ceph_snapshot_list)
# Get the list of faults
faults_data = faults.getAllFaults(zkhandler)
# Format the status data
cluster_information = {
"cluster_health": getClusterHealthFromFaults(zkhandler),
"cluster_health": getClusterHealthFromFaults(zkhandler, faults_data),
"node_health": getNodeHealth(zkhandler, node_list),
"maintenance": maintenance_state,
"primary_node": primary_node,
@@ -323,6 +404,12 @@ def getClusterInformation(zkhandler):
"pools": ceph_pool_count,
"volumes": ceph_volume_count,
"snapshots": ceph_snapshot_count,
"detail": {
"node": node_data,
"vm": vm_data,
"osd": osd_data,
"faults": faults_data,
},
}
return cluster_information
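For reference, a hypothetical example (all values invented) of the shape the new "detail" key gives the status output, following the structures built above:

detail = {
    "node": [{"name": "hv1", "daemon_state": "run", "domain_state": "ready"}],
    "vm": [
        {
            "uuid": "00000000-0000-0000-0000-000000000000",
            "name": "vm1",
            "state": "start",
        }
    ],
    "osd": [{"id": "0", "up": "up", "in": "in"}],
    "faults": [],  # fault dicts as built by faults.getAllFaults()
}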

View File

@@ -95,12 +95,24 @@ def getFault(zkhandler, fault_id):
return None
fault_id = fault_id
fault_last_time = zkhandler.read(("faults.last_time", fault_id))
fault_first_time = zkhandler.read(("faults.first_time", fault_id))
fault_ack_time = zkhandler.read(("faults.ack_time", fault_id))
fault_status = zkhandler.read(("faults.status", fault_id))
fault_delta = int(zkhandler.read(("faults.delta", fault_id)))
fault_message = zkhandler.read(("faults.message", fault_id))
(
fault_last_time,
fault_first_time,
fault_ack_time,
fault_status,
fault_delta,
fault_message,
) = zkhandler.read_many(
[
("faults.last_time", fault_id),
("faults.first_time", fault_id),
("faults.ack_time", fault_id),
("faults.status", fault_id),
("faults.delta", fault_id),
("faults.message", fault_id),
]
)
# Acknowledged faults have a delta of 0
if fault_ack_time != "":
@@ -112,7 +124,7 @@ def getFault(zkhandler, fault_id):
"first_reported": fault_first_time,
"acknowledged_at": fault_ack_time,
"status": fault_status,
"health_delta": fault_delta,
"health_delta": int(fault_delta),
"message": fault_message,
}
@@ -126,11 +138,42 @@ def getAllFaults(zkhandler, sort_key="last_reported"):
all_faults = zkhandler.children(("base.faults"))
faults_detail = list()
faults_reads = list()
for fault_id in all_faults:
fault_detail = getFault(zkhandler, fault_id)
faults_detail.append(fault_detail)
faults_reads += [
("faults.last_time", fault_id),
("faults.first_time", fault_id),
("faults.ack_time", fault_id),
("faults.status", fault_id),
("faults.delta", fault_id),
("faults.message", fault_id),
]
all_faults_data = list(zkhandler.read_many(faults_reads))
faults_detail = list()
for fidx, fault_id in enumerate(all_faults):
# Split the large list of return values by the IDX of this fault
# Each fault result is 6 fields long
pos_start = fidx * 6
pos_end = fidx * 6 + 6
(
fault_last_time,
fault_first_time,
fault_ack_time,
fault_status,
fault_delta,
fault_message,
) = tuple(all_faults_data[pos_start:pos_end])
fault_output = {
"id": fault_id,
"last_reported": fault_last_time,
"first_reported": fault_first_time,
"acknowledged_at": fault_ack_time,
"status": fault_status,
"health_delta": int(fault_delta),
"message": fault_message,
}
faults_detail.append(fault_output)
sorted_faults = sorted(faults_detail, key=lambda x: x[sort_key])
# Sort newest-first for time-based sorts
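The trailing comment suggests time-based sorts are flipped to newest-first; a guess at what that continuation looks like (not the verbatim upstream code):

# Assumed continuation: reverse so the newest entries come first when
# sorting on a timestamp field
if sort_key in ("first_reported", "last_reported", "acknowledged_at"):
    sorted_faults.reverse()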

View File

@@ -142,19 +142,37 @@ def getNetworkACLs(zkhandler, vni, _direction):
def getNetworkInformation(zkhandler, vni):
description = zkhandler.read(("network", vni))
nettype = zkhandler.read(("network.type", vni))
mtu = zkhandler.read(("network.mtu", vni))
domain = zkhandler.read(("network.domain", vni))
name_servers = zkhandler.read(("network.nameservers", vni))
ip6_network = zkhandler.read(("network.ip6.network", vni))
ip6_gateway = zkhandler.read(("network.ip6.gateway", vni))
dhcp6_flag = zkhandler.read(("network.ip6.dhcp", vni))
ip4_network = zkhandler.read(("network.ip4.network", vni))
ip4_gateway = zkhandler.read(("network.ip4.gateway", vni))
dhcp4_flag = zkhandler.read(("network.ip4.dhcp", vni))
dhcp4_start = zkhandler.read(("network.ip4.dhcp_start", vni))
dhcp4_end = zkhandler.read(("network.ip4.dhcp_end", vni))
(
description,
nettype,
mtu,
domain,
name_servers,
ip6_network,
ip6_gateway,
dhcp6_flag,
ip4_network,
ip4_gateway,
dhcp4_flag,
dhcp4_start,
dhcp4_end,
) = zkhandler.read_many(
[
("network", vni),
("network.type", vni),
("network.mtu", vni),
("network.domain", vni),
("network.nameservers", vni),
("network.ip6.network", vni),
("network.ip6.gateway", vni),
("network.ip6.dhcp", vni),
("network.ip4.network", vni),
("network.ip4.gateway", vni),
("network.ip4.dhcp", vni),
("network.ip4.dhcp_start", vni),
("network.ip4.dhcp_end", vni),
]
)
# Construct a data structure to represent the data
network_information = {
@@ -818,31 +836,45 @@ def getSRIOVVFInformation(zkhandler, node, vf):
if not zkhandler.exists(("node.sriov.vf", node, "sriov_vf", vf)):
return []
pf = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pf", vf))
mtu = zkhandler.read(("node.sriov.vf", node, "sriov_vf.mtu", vf))
mac = zkhandler.read(("node.sriov.vf", node, "sriov_vf.mac", vf))
vlan_id = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.vlan_id", vf))
vlan_qos = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.vlan_qos", vf))
tx_rate_min = zkhandler.read(
("node.sriov.vf", node, "sriov_vf.config.tx_rate_min", vf)
(
pf,
mtu,
mac,
vlan_id,
vlan_qos,
tx_rate_min,
tx_rate_max,
link_state,
spoof_check,
trust,
query_rss,
pci_domain,
pci_bus,
pci_slot,
pci_function,
used,
used_by_domain,
) = zkhandler.read_many(
[
("node.sriov.vf", node, "sriov_vf.pf", vf),
("node.sriov.vf", node, "sriov_vf.mtu", vf),
("node.sriov.vf", node, "sriov_vf.mac", vf),
("node.sriov.vf", node, "sriov_vf.config.vlan_id", vf),
("node.sriov.vf", node, "sriov_vf.config.vlan_qos", vf),
("node.sriov.vf", node, "sriov_vf.config.tx_rate_min", vf),
("node.sriov.vf", node, "sriov_vf.config.tx_rate_max", vf),
("node.sriov.vf", node, "sriov_vf.config.link_state", vf),
("node.sriov.vf", node, "sriov_vf.config.spoof_check", vf),
("node.sriov.vf", node, "sriov_vf.config.trust", vf),
("node.sriov.vf", node, "sriov_vf.config.query_rss", vf),
("node.sriov.vf", node, "sriov_vf.pci.domain", vf),
("node.sriov.vf", node, "sriov_vf.pci.bus", vf),
("node.sriov.vf", node, "sriov_vf.pci.slot", vf),
("node.sriov.vf", node, "sriov_vf.pci.function", vf),
("node.sriov.vf", node, "sriov_vf.used", vf),
("node.sriov.vf", node, "sriov_vf.used_by", vf),
]
)
tx_rate_max = zkhandler.read(
("node.sriov.vf", node, "sriov_vf.config.tx_rate_max", vf)
)
link_state = zkhandler.read(
("node.sriov.vf", node, "sriov_vf.config.link_state", vf)
)
spoof_check = zkhandler.read(
("node.sriov.vf", node, "sriov_vf.config.spoof_check", vf)
)
trust = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.trust", vf))
query_rss = zkhandler.read(("node.sriov.vf", node, "sriov_vf.config.query_rss", vf))
pci_domain = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.domain", vf))
pci_bus = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.bus", vf))
pci_slot = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.slot", vf))
pci_function = zkhandler.read(("node.sriov.vf", node, "sriov_vf.pci.function", vf))
used = zkhandler.read(("node.sriov.vf", node, "sriov_vf.used", vf))
used_by_domain = zkhandler.read(("node.sriov.vf", node, "sriov_vf.used_by", vf))
vf_information = {
"phy": vf,

View File

@@ -26,60 +26,49 @@ import json
import daemon_lib.common as common
def getNodeInformation(zkhandler, node_name):
"""
Gather information about a node from the Zookeeper database and return a dict() containing it.
"""
node_daemon_state = zkhandler.read(("node.state.daemon", node_name))
node_coordinator_state = zkhandler.read(("node.state.router", node_name))
node_domain_state = zkhandler.read(("node.state.domain", node_name))
node_static_data = zkhandler.read(("node.data.static", node_name)).split()
node_pvc_version = zkhandler.read(("node.data.pvc_version", node_name))
node_cpu_count = int(node_static_data[0])
node_kernel = node_static_data[1]
node_os = node_static_data[2]
node_arch = node_static_data[3]
node_vcpu_allocated = int(zkhandler.read(("node.vcpu.allocated", node_name)))
node_mem_total = int(zkhandler.read(("node.memory.total", node_name)))
node_mem_allocated = int(zkhandler.read(("node.memory.allocated", node_name)))
node_mem_provisioned = int(zkhandler.read(("node.memory.provisioned", node_name)))
node_mem_used = int(zkhandler.read(("node.memory.used", node_name)))
node_mem_free = int(zkhandler.read(("node.memory.free", node_name)))
node_load = float(zkhandler.read(("node.cpu.load", node_name)))
node_domains_count = int(
zkhandler.read(("node.count.provisioned_domains", node_name))
)
node_running_domains = zkhandler.read(("node.running_domains", node_name)).split()
try:
node_health = int(zkhandler.read(("node.monitoring.health", node_name)))
except Exception:
node_health = "N/A"
try:
node_health_plugins = zkhandler.read(
("node.monitoring.plugins", node_name)
).split()
except Exception:
node_health_plugins = list()
node_health_details = list()
def getNodeHealthDetails(zkhandler, node_name, node_health_plugins):
plugin_reads = list()
for plugin in node_health_plugins:
plugin_last_run = zkhandler.read(
("node.monitoring.data", node_name, "monitoring_plugin.last_run", plugin)
)
plugin_health_delta = zkhandler.read(
plugin_reads += [
(
"node.monitoring.data",
node_name,
"monitoring_plugin.last_run",
plugin,
),
(
"node.monitoring.data",
node_name,
"monitoring_plugin.health_delta",
plugin,
)
)
plugin_message = zkhandler.read(
("node.monitoring.data", node_name, "monitoring_plugin.message", plugin)
)
plugin_data = zkhandler.read(
("node.monitoring.data", node_name, "monitoring_plugin.data", plugin)
)
),
(
"node.monitoring.data",
node_name,
"monitoring_plugin.message",
plugin,
),
(
"node.monitoring.data",
node_name,
"monitoring_plugin.data",
plugin,
),
]
all_plugin_data = list(zkhandler.read_many(plugin_reads))
node_health_details = list()
for pidx, plugin in enumerate(node_health_plugins):
# Split the large list of return values by the IDX of this plugin
# Each plugin result is 4 fields long
pos_start = pidx * 4
pos_end = pidx * 4 + 4
(
plugin_last_run,
plugin_health_delta,
plugin_message,
plugin_data,
) = tuple(all_plugin_data[pos_start:pos_end])
plugin_output = {
"name": plugin,
"last_run": int(plugin_last_run),
@@ -89,6 +78,82 @@ def getNodeInformation(zkhandler, node_name):
}
node_health_details.append(plugin_output)
return node_health_details
def getNodeInformation(zkhandler, node_name):
"""
Gather information about a node from the Zookeeper database and return a dict() containing it.
"""
(
node_daemon_state,
node_coordinator_state,
node_domain_state,
node_pvc_version,
_node_static_data,
_node_vcpu_allocated,
_node_mem_total,
_node_mem_allocated,
_node_mem_provisioned,
_node_mem_used,
_node_mem_free,
_node_load,
_node_domains_count,
_node_running_domains,
_node_health,
_node_health_plugins,
) = zkhandler.read_many(
[
("node.state.daemon", node_name),
("node.state.router", node_name),
("node.state.domain", node_name),
("node.data.pvc_version", node_name),
("node.data.static", node_name),
("node.vcpu.allocated", node_name),
("node.memory.total", node_name),
("node.memory.allocated", node_name),
("node.memory.provisioned", node_name),
("node.memory.used", node_name),
("node.memory.free", node_name),
("node.cpu.load", node_name),
("node.count.provisioned_domains", node_name),
("node.running_domains", node_name),
("node.monitoring.health", node_name),
("node.monitoring.plugins", node_name),
]
)
node_static_data = _node_static_data.split()
node_cpu_count = int(node_static_data[0])
node_kernel = node_static_data[1]
node_os = node_static_data[2]
node_arch = node_static_data[3]
node_vcpu_allocated = int(_node_vcpu_allocated)
node_mem_total = int(_node_mem_total)
node_mem_allocated = int(_node_mem_allocated)
node_mem_provisioned = int(_node_mem_provisioned)
node_mem_used = int(_node_mem_used)
node_mem_free = int(_node_mem_free)
node_load = float(_node_load)
node_domains_count = int(_node_domains_count)
node_running_domains = _node_running_domains.split()
try:
node_health = int(_node_health)
except Exception:
node_health = "N/A"
try:
node_health_plugins = _node_health_plugins.split()
except Exception:
node_health_plugins = list()
node_health_details = getNodeHealthDetails(
zkhandler, node_name, node_health_plugins
)
# Construct a data structure to represent the data
node_information = {
"name": node_name,

View File

@@ -19,6 +19,7 @@
#
###############################################################################
import asyncio
import os
import time
import uuid
@@ -239,10 +240,41 @@ class ZKHandler(object):
# This path is invalid; this is likely due to missing schema entries, so return None
return None
return self.zk_conn.get(path)[0].decode(self.encoding)
res = self.zk_conn.get(path)
return res[0].decode(self.encoding)
except NoNodeError:
return None
async def read_async(self, key):
"""
Read data from a key asynchronously
"""
try:
path = self.get_schema_path(key)
if path is None:
# This path is invalid; this is likely due to missing schema entries, so return None
return None
val = self.zk_conn.get_async(path)
data = val.get()
return data[0].decode(self.encoding)
except NoNodeError:
return None
async def _read_many(self, keys):
"""
Async runner for read_many
"""
res = await asyncio.gather(*(self.read_async(key) for key in keys))
return tuple(res)
def read_many(self, keys):
"""
Read data from several keys, asynchronously. Returns a tuple of all key values once all
reads are complete.
"""
return asyncio.run(self._read_many(keys))
def write(self, kvpairs):
"""
Create or update one or more keys' data
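A usage sketch for the new read_many() (assuming a connected ZKHandler instance and the existing schema key tuples; "hv1" is an example node name, not taken from the diff):

daemon_state, domain_state = zkhandler.read_many(
    [
        ("node.state.daemon", "hv1"),
        ("node.state.domain", "hv1"),
    ]
)
# Values come back in request order, one per key, as a tuple of strings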

debian/changelog
View File

@@ -1,3 +1,11 @@
pvc (0.9.85-0) unstable; urgency=high
* [Packaging] Fixes a dependency bug introduced in 0.9.84
* [Node Daemon] Fixes an output bug during keepalives
* [Node Daemon] Fixes a bug in the example Prometheus Grafana dashboard
-- Joshua M. Boniface <joshua@boniface.me> Sun, 10 Dec 2023 01:00:33 -0500
pvc (0.9.84-0) unstable; urgency=high
**Breaking Changes:** This release features a major reconfiguration of how monitoring and reporting of cluster health works. Node health plugins now report "faults" from within the Health daemon, as do several other issues which were previously checked for manually in the "cluster" daemon library for the "/status" endpoint. These faults are persistent: each identifier can be triggered once, and subsequent triggers simply update its "last reported" time. An additional set of API endpoints and commands is added to manage these faults, either by "ack"(nowledging) them (keeping the alert around to be further updated but setting its health delta to 0%) or "delete"ing them (completely removing the fault unless it retriggers), individually, (from the CLI) in groups, or all at once. Cluster health reporting is now based solely on these faults, and the default interval for health checks is reduced to 15 seconds to accommodate this. In addition, Prometheus metrics have been added for the PVC cluster itself, along with an example Grafana dashboard, as well as a proxy to the Ceph cluster metrics. This release also fixes some bugs in the VM provisioner that were introduced in 0.9.83; these fixes require a **reimport or reconfiguration of any provisioner scripts**; reference the updated examples for details.

debian/control
View File

@@ -8,7 +8,7 @@ X-Python3-Version: >= 3.7
Package: pvc-daemon-node
Architecture: all
Depends: systemd, pvc-daemon-common, pvc-daemon-health, pvc-daemon-worker, python3-kazoo, python3-psutil, python3-apscheduler, python3-libvirt, python3-psycopg2, python3-dnspython, python3-yaml, python3-distutils, python3-rados, python3-gevent, python3-prometheus-client, ipmitool, libvirt-daemon-system, arping, vlan, bridge-utils, dnsmasq, nftables, pdns-server, pdns-backend-pgsql
Depends: systemd, pvc-daemon-common, pvc-daemon-health, pvc-daemon-worker, python3-kazoo, python3-psutil, python3-apscheduler, python3-libvirt, python3-psycopg2, python3-dnspython, python3-yaml, python3-distutils, python3-rados, python3-gevent, ipmitool, libvirt-daemon-system, arping, vlan, bridge-utils, dnsmasq, nftables, pdns-server, pdns-backend-pgsql
Description: Parallel Virtual Cluster node daemon
A KVM/Zookeeper/Ceph-based VM and private cloud manager
.
@@ -16,7 +16,7 @@ Description: Parallel Virtual Cluster node daemon
Package: pvc-daemon-health
Architecture: all
Depends: systemd, pvc-daemon-common, python3-kazoo, python3-psutil, python3-apscheduler, python3-yaml, python3-prometheus-client
Depends: systemd, pvc-daemon-common, python3-kazoo, python3-psutil, python3-apscheduler, python3-yaml
Description: Parallel Virtual Cluster health daemon
A KVM/Zookeeper/Ceph-based VM and private cloud manager
.
@@ -24,7 +24,7 @@ Description: Parallel Virtual Cluster health daemon
Package: pvc-daemon-worker
Architecture: all
Depends: systemd, pvc-daemon-common, python3-kazoo, python3-celery, python3-redis, python3-yaml, python3-prometheus-client, python-celery-common, fio
Depends: systemd, pvc-daemon-common, python3-kazoo, python3-celery, python3-redis, python3-yaml, python-celery-common, fio
Description: Parallel Virtual Cluster worker daemon
A KVM/Zookeeper/Ceph-based VM and private cloud manager
.
@@ -32,7 +32,7 @@ Description: Parallel Virtual Cluster worker daemon
Package: pvc-daemon-api
Architecture: all
Depends: systemd, pvc-daemon-common, python3-yaml, python3-flask, python3-flask-restful, python3-celery, python3-distutils, python3-redis, python3-lxml, python3-flask-migrate, python3-prometheus-client
Depends: systemd, pvc-daemon-common, python3-yaml, python3-flask, python3-flask-restful, python3-celery, python3-distutils, python3-redis, python3-lxml, python3-flask-migrate
Description: Parallel Virtual Cluster API daemon
A KVM/Zookeeper/Ceph-based VM and private cloud manager
.

(Binary diffs not shown: four images removed (88 KiB, 41 KiB, 300 KiB, and 42 KiB), matching the old docs/images screenshots dropped from the README above.)

View File

@@ -33,7 +33,7 @@ import os
import signal
# Daemon version
version = "0.9.84"
version = "0.9.85"
##########################################################

(Binary diffs not shown: eleven new images added, among them images/10-provisioner.png, images/4-vm-information.png, and images/5-vm-details.png; sizes range from 50 KiB to 177 KiB.)

View File

(Binary image modified; 49 KiB both before and after.)

View File

@@ -2475,7 +2475,7 @@
},
"disableTextWrap": false,
"editorMode": "builder",
"expr": "pvc_osd_in_states",
"expr": "pvc_osd_in_states{cluster=\"$cluster\"}",
"fullMetaSearch": false,
"hide": false,
"includeNullMetadata": true,
@@ -2592,6 +2592,6 @@
"timezone": "",
"title": "PVC Cluster",
"uid": "fbddd9f9-aadb-4c97-8aea-57c29e5de234",
"version": 55,
"version": 56,
"weekStart": ""
}
}

View File

@@ -48,7 +48,7 @@ import re
import json
# Daemon version
version = "0.9.84"
version = "0.9.85"
##########################################################

View File

@@ -701,7 +701,7 @@ def node_keepalive(logger, config, zkhandler, this_node):
runtime_start = datetime.now()
logger.out(
"Starting node keepalive run",
"Starting node keepalive run at {datetime.now()}",
state="t",
)

View File

@@ -44,7 +44,7 @@ from daemon_lib.vmbuilder import (
)
# Daemon version
version = "0.9.84"
version = "0.9.85"
config = cfg.get_configuration()