Commit Graph

3367 Commits

Author SHA1 Message Date
Joshua Boniface 709c9cb73e Pause pvchealthd startup until node daemon is run
If the health daemon starts too soon during a node bootup, it will
generate generate tons of erroneous faults while the node starts up.
Adds a conditional wait for the current node daemon to be in "run"
state before the health daemon really starts up.
2023-12-13 14:53:54 -05:00
Joshua Boniface f41c5176be Ensure health value is an int properly 2023-12-13 14:34:02 -05:00
Joshua Boniface 38e43b46c3 Update health detail messages format 2023-12-13 03:17:47 -05:00
Joshua Boniface ed9c37982a Move metric collection into daemon library 2023-12-11 19:20:30 -05:00
Joshua Boniface 0f24184b78 Explicitly clear resources of fenced node
This actually solves the bug originally "fixed" in
5f1432ccdd without breaking VM resource
allocations for working nodes.
2023-12-11 12:14:56 -05:00
Joshua Boniface 1ba37fe33d Restore VM resource allocation location
Commit 5f1432ccdd changed where these
happen due to a bug after fencing. However this completely broke node
resource reporting as only the final instance will be queried here.

Revert this change and look further into the original bug.
2023-12-11 11:52:59 -05:00
Joshua Boniface 1a05077b10 Fix missing fstring 2023-12-11 11:29:49 -05:00
Joshua Boniface 57c28376a6 Port one final Ceph function to read_many 2023-12-11 10:25:36 -05:00
Joshua Boniface e781d742e6 Fix bug with volume and snapshot listing 2023-12-11 10:21:46 -05:00
Joshua Boniface 6c6d1508a1 Add VNC info to screenshots 2023-12-11 03:40:49 -05:00
Joshua Boniface 741dafb26b Port VM functions to read_many 2023-12-11 03:34:36 -05:00
Joshua Boniface 032d3ebf18 Remove debug output from image 2023-12-11 03:23:10 -05:00
Joshua Boniface 5d9e83e8ed Fix output bugs in VM information 2023-12-11 03:04:46 -05:00
Joshua Boniface ad0bd8649f Finish missing sentence 2023-12-11 02:39:39 -05:00
Joshua Boniface 9b5e53e4b6 Add Grafana dashboard screenshot 2023-12-11 00:39:24 -05:00
Joshua Boniface 9617660342 Update Prometheus Grafana dashboard 2023-12-11 00:23:08 -05:00
Joshua Boniface ab0a1e0946 Update and streamline README and update images 2023-12-10 23:57:01 -05:00
Joshua Boniface 7c116b2fbc Ensure node health value is an int 2023-12-10 23:56:50 -05:00
Joshua Boniface 1023c55087 Fix bug in VM state list 2023-12-10 23:44:01 -05:00
Joshua Boniface 9235187c6f Port Ceph functions to read_many
Only ports getOSDInformation, as all the others feature 3 or less reads
which is acceptable sequentially.
2023-12-10 22:24:38 -05:00
Joshua Boniface 0c94f1b4f8 Port Network functions to read_many 2023-12-10 22:19:21 -05:00
Joshua Boniface 44a4f0e1f7 Use new info detail output instead of new lists
Avoids multiple additional ZK calls by using data that is now in the
status detail output.
2023-12-10 22:19:09 -05:00
Joshua Boniface 5d53a3e529 Add state and faults detail to cluster information
We already parse this information out anyways, so might as well add it
to the API output JSON. This can be leveraged by the Prometheus endpoint
as well to avoid duplicate listings.
2023-12-10 17:29:32 -05:00
Joshua Boniface 35e22cb50f Simplify cluster status handling
This significantly simplifies cluster state handling by removing most of
the superfluous get_list() calls, replacing them with basic child reads
since most of them are just for a count anyways. The ones that require
states simplify this down to a child read plus direct reads for the
exact items required while leveraging the new read_many() function.
2023-12-10 17:05:46 -05:00
Joshua Boniface a3171b666b Split node health into separate function 2023-12-10 16:52:10 -05:00
Joshua Boniface 48e41d7b05 Port Faults getFault and getAllFaults to read_many 2023-12-10 16:05:16 -05:00
Joshua Boniface d6aecf195e Port Node getNodeInformation to read_many 2023-12-10 15:53:28 -05:00
Joshua Boniface 9329784010 Implement async ZK read function
Adds a function, "read_many", which can take in multiple ZK keys and
return the values from all of them, using asyncio to avoid reading
sequentially.

Initial tests show a marked improvement in read performance of multiple
read()-heavy functions (e.g. "get_list()" functions) with this method.
2023-12-10 15:35:40 -05:00
Joshua Boniface 9dc5097dbc Bump version to 0.9.85 2023-12-10 01:00:33 -05:00
Joshua Boniface 5776cb3a09 Remove Prometheus client dependencies
We don't actually use this (yet!) so remove the dependency for now.
2023-12-10 00:58:09 -05:00
Joshua Boniface 53d632f283 Fix bug in example PVC Grafana dashboard 2023-12-10 00:50:05 -05:00
Joshua Boniface 7bc0760b78 Add time to "starting keepalive" message
Matches the pvchealthd output and provides a useful message detail to
this otherwise contextless message.
2023-12-10 00:40:32 -05:00
Joshua Boniface 9aee2a9075 Bump version to 0.9.84 2023-12-09 23:05:40 -05:00
Joshua Boniface 8f0ae3e2dd Fix config file for database migrations 2023-12-09 22:51:54 -05:00
Joshua Boniface 946d3eaf43 Add wait after stopping VM 2023-12-09 18:14:03 -05:00
Joshua Boniface 1f6347d24b Add Prometheus monitoring examples 2023-12-09 17:42:51 -05:00
Joshua Boniface e8552b471b Require at least one FAULT_ID 2023-12-09 17:31:56 -05:00
Joshua Boniface fc443a323b Allow ack/delete of multiple faults at once 2023-12-09 17:28:13 -05:00
Joshua Boniface b0557edb76 Ensure entry in name is uppercase 2023-12-09 17:01:41 -05:00
Joshua Boniface 47bd7bf2f5 Only run cluster-wide health checks on primary
Avoids multiple coordinators trying to write updated cluster-wide fault
events. Instead, they are now only written by the primary (or the
incoming primary if still in a transition).
2023-12-09 16:50:51 -05:00
Joshua Boniface b9fbfe2ed5 Improve fault ID format
Instead of using random hex characters from an md5sum, use a nice name
in all-caps similar to how Ceph does. This further helps prevent dupes
but also permits a changing health delta within a single event (which
would really only ever apply to plugin faults).
2023-12-09 16:48:14 -05:00
Joshua Boniface 764e3e3722 Fix bug in fault header format 2023-12-09 16:47:56 -05:00
Joshua Boniface 7e6d922877 Improve fault detail handling further
Since we already had a "details" field, simply move where it gets added
to the message later, in generate_fault, after the main message value
was used to generate the ID.
2023-12-09 16:13:36 -05:00
Joshua Boniface 4ca2381077 Rework metrics output and add combined endpoint 2023-12-09 15:47:40 -05:00
Joshua Boniface 4003204f14 Remove bracketed text from fault_str
This ensures that certain faults e.g. Ceph status faults, will be
combined despite the added text in brackets, while still keeping them
mostly separate.

Also ensure the health text is updated each time to assist with this, as
this health text may now change independent of the fault ID.
2023-12-09 15:34:18 -05:00
Joshua Boniface a70c1d63b0 Separate state totals from states, separate states 2023-12-09 13:59:17 -05:00
Joshua Boniface 2bea78d25e Make all remaining limits optional 2023-12-09 13:43:58 -05:00
Joshua Boniface fd717b702d Use external list of fault states 2023-12-09 12:51:41 -05:00
Joshua Boniface 132cde5591 Add totals and nice-format states
Avoids tons of annoying rewriting in the UI later.
2023-12-09 12:50:19 -05:00
Joshua Boniface ba565ead4c Report all state combinations in Prom metrics
Ensures that every state combination is always shown to metrics, even if
it contains 0 entries.
2023-12-09 12:40:37 -05:00