3135 Commits

Author SHA1 Message Date
cb3c2cd86d Adjust name of PVC cluster dashboard 2023-12-27 11:42:58 -05:00
d0de4f1825 Update Grafana dashboard to overview
Adds resource utilization in addition to health.
2023-12-27 11:38:39 -05:00
494c20263d Move monitoring folder to top level 2023-12-27 11:37:49 -05:00
431ee69620 Use proper percentage for pool util 2023-12-27 10:03:00 -05:00
88f4d79d5a Handle invalid values on older Libvirt versions 2023-12-27 09:51:24 -05:00
84d22751d8 Fix bad JSON data handler 2023-12-27 09:43:37 -05:00
40ff005a09 Fix handling of Ceph OSD bytes 2023-12-26 12:43:51 -05:00
ab4ec7a5fa Remove WebUI from README 2023-12-25 02:48:44 -05:00
9604f655d0 Improve node utilization metrics and fix bugs 2023-12-25 02:47:41 -05:00
3e4cc53fdd Add node network statistics and utilization values
Adds a new physical network interface stats parser to the node
keepalives, and leverages this information to provide a network
utilization overview in the Prometheus metrics.
2023-12-21 15:45:01 -05:00
d2d2a9c617 Include our newline atomically
Sometimes clashing log entries would print on the same line, likely due
to some sort of race condition in Python's print() built-in.

Instead, add a newline to our actual message and print without an end
character. This ensures atomic printing of our log messages.
2023-12-21 13:12:43 -05:00
6ed4efad33 Add new network.stats key to nodes 2023-12-21 12:48:48 -05:00
39f9f3640c Rename health metrics and add resource metrics 2023-12-21 09:40:49 -05:00
c64e888d30 Fix incorrect cast of None v0.9.86 2023-12-14 16:00:53 -05:00
f1249452e5 Fix bug if no nodes are present 2023-12-14 15:32:18 -05:00
0a93f526e0 Bump version to 0.9.86 2023-12-14 14:46:29 -05:00
7c9512fb22 Fix broken config file in API migration script 2023-12-14 14:45:58 -05:00
e88b97f3a9 Print fenced state in red 2023-12-13 15:02:18 -05:00
709c9cb73e Pause pvchealthd startup until node daemon is run
If the health daemon starts too soon during a node bootup, it will
generate tons of erroneous faults while the node starts up.
Adds a conditional wait for the current node daemon to be in "run"
state before the health daemon really starts up.
2023-12-13 14:53:54 -05:00
f41c5176be Ensure health value is an int properly 2023-12-13 14:34:02 -05:00
38e43b46c3 Update health detail messages format 2023-12-13 03:17:47 -05:00
ed9c37982a Move metric collection into daemon library 2023-12-11 19:20:30 -05:00
0f24184b78 Explicitly clear resources of fenced node
This actually solves the bug originally "fixed" in
5f1432ccdd38996dac0f528035634cbc82827abd without breaking VM resource
allocations for working nodes.
2023-12-11 12:14:56 -05:00
1ba37fe33d Restore VM resource allocation location
Commit 5f1432ccdd38996dac0f528035634cbc82827abd changed where these
happen due to a bug after fencing. However, this completely broke node
resource reporting as only the final instance will be queried here.

Revert this change and look further into the original bug.
2023-12-11 11:52:59 -05:00
1a05077b10 Fix missing fstring 2023-12-11 11:29:49 -05:00
57c28376a6 Port one final Ceph function to read_many 2023-12-11 10:25:36 -05:00
e781d742e6 Fix bug with volume and snapshot listing 2023-12-11 10:21:46 -05:00
6c6d1508a1 Add VNC info to screenshots 2023-12-11 03:40:49 -05:00
741dafb26b Port VM functions to read_many 2023-12-11 03:34:36 -05:00
032d3ebf18 Remove debug output from image 2023-12-11 03:23:10 -05:00
5d9e83e8ed Fix output bugs in VM information 2023-12-11 03:04:46 -05:00
ad0bd8649f Finish missing sentence 2023-12-11 02:39:39 -05:00
9b5e53e4b6 Add Grafana dashboard screenshot 2023-12-11 00:39:24 -05:00
9617660342 Update Prometheus Grafana dashboard 2023-12-11 00:23:08 -05:00
ab0a1e0946 Update and streamline README and update images 2023-12-10 23:57:01 -05:00
7c116b2fbc Ensure node health value is an int 2023-12-10 23:56:50 -05:00
1023c55087 Fix bug in VM state list 2023-12-10 23:44:01 -05:00
9235187c6f Port Ceph functions to read_many
Only ports getOSDInformation, as all the others feature three or fewer
reads, which is acceptable sequentially.
2023-12-10 22:24:38 -05:00
0c94f1b4f8 Port Network functions to read_many 2023-12-10 22:19:21 -05:00
44a4f0e1f7 Use new info detail output instead of new lists
Avoids multiple additional ZK calls by using data that is now in the
status detail output.
2023-12-10 22:19:09 -05:00
5d53a3e529 Add state and faults detail to cluster information
We already parse this information out anyways, so might as well add it
to the API output JSON. This can be leveraged by the Prometheus endpoint
as well to avoid duplicate listings.
2023-12-10 17:29:32 -05:00
35e22cb50f Simplify cluster status handling
This significantly simplifies cluster state handling by removing most of
the superfluous get_list() calls, replacing them with basic child reads
since most of them are just for a count anyways. The ones that require
states simplify this down to a child read plus direct reads for the
exact items required while leveraging the new read_many() function.
2023-12-10 17:05:46 -05:00
a3171b666b Split node health into separate function 2023-12-10 16:52:10 -05:00
48e41d7b05 Port Faults getFault and getAllFaults to read_many 2023-12-10 16:05:16 -05:00
d6aecf195e Port Node getNodeInformation to read_many 2023-12-10 15:53:28 -05:00
9329784010 Implement async ZK read function
Adds a function, "read_many", which can take in multiple ZK keys and
return the values from all of them, using asyncio to avoid reading
sequentially.

Initial tests show a marked improvement in read performance of multiple
read()-heavy functions (e.g. "get_list()" functions) with this method.
2023-12-10 15:35:40 -05:00
9dc5097dbc Bump version to 0.9.85 v0.9.85 2023-12-10 01:00:33 -05:00
5776cb3a09 Remove Prometheus client dependencies
We don't actually use this (yet!) so remove the dependency for now.
2023-12-10 00:58:09 -05:00
53d632f283 Fix bug in example PVC Grafana dashboard 2023-12-10 00:50:05 -05:00
7bc0760b78 Add time to "starting keepalive" message
Matches the pvchealthd output and provides a useful message detail to
this otherwise contextless message.
2023-12-10 00:40:32 -05:00