Commit Graph

374 Commits

Author SHA1 Message Date
Joshua Boniface 9604f655d0 Improve node utilization metrics and fix bugs 2023-12-25 02:47:41 -05:00
Joshua Boniface 3e4cc53fdd Add node network statistics and utilization values
Adds a new physical network interface stats parser to the node
keepalives, and leverages this information to provide a network
utilization overview in the Prometheus metrics.
2023-12-21 15:45:01 -05:00
Joshua Boniface d2d2a9c617 Include our newline atomically
Sometimes clashing log entries would print on the same line, likely due
to some sort of race condition in Python's print() built-in.

Instead, add a newline to our actual message and print without an end
character. This ensures atomic printing of our log messages.
2023-12-21 13:12:43 -05:00
Joshua Boniface 6ed4efad33 Add new network.stats key to nodes 2023-12-21 12:48:48 -05:00
Joshua Boniface 39f9f3640c Rename health metrics and add resource metrics 2023-12-21 09:40:49 -05:00
Joshua Boniface c64e888d30 Fix incorrect cast of None 2023-12-14 16:00:53 -05:00
Joshua Boniface f1249452e5 Fix bug if no nodes are present 2023-12-14 15:32:18 -05:00
Joshua Boniface f41c5176be Ensure health value is an int properly 2023-12-13 14:34:02 -05:00
Joshua Boniface ed9c37982a Move metric collection into daemon library 2023-12-11 19:20:30 -05:00
Joshua Boniface 57c28376a6 Port one final Ceph function to read_many 2023-12-11 10:25:36 -05:00
Joshua Boniface e781d742e6 Fix bug with volume and snapshot listing 2023-12-11 10:21:46 -05:00
Joshua Boniface 741dafb26b Port VM functions to read_many 2023-12-11 03:34:36 -05:00
Joshua Boniface 5d9e83e8ed Fix output bugs in VM information 2023-12-11 03:04:46 -05:00
Joshua Boniface 7c116b2fbc Ensure node health value is an int 2023-12-10 23:56:50 -05:00
Joshua Boniface 1023c55087 Fix bug in VM state list 2023-12-10 23:44:01 -05:00
Joshua Boniface 9235187c6f Port Ceph functions to read_many
Only ports getOSDInformation, as all the others feature 3 or less reads
which is acceptable sequentially.
2023-12-10 22:24:38 -05:00
Joshua Boniface 0c94f1b4f8 Port Network functions to read_many 2023-12-10 22:19:21 -05:00
Joshua Boniface 44a4f0e1f7 Use new info detail output instead of new lists
Avoids multiple additional ZK calls by using data that is now in the
status detail output.
2023-12-10 22:19:09 -05:00
Joshua Boniface 5d53a3e529 Add state and faults detail to cluster information
We already parse this information out anyways, so might as well add it
to the API output JSON. This can be leveraged by the Prometheus endpoint
as well to avoid duplicate listings.
2023-12-10 17:29:32 -05:00
Joshua Boniface 35e22cb50f Simplify cluster status handling
This significantly simplifies cluster state handling by removing most of
the superfluous get_list() calls, replacing them with basic child reads
since most of them are just for a count anyways. The ones that require
states simplify this down to a child read plus direct reads for the
exact items required while leveraging the new read_many() function.
2023-12-10 17:05:46 -05:00
Joshua Boniface a3171b666b Split node health into separate function 2023-12-10 16:52:10 -05:00
Joshua Boniface 48e41d7b05 Port Faults getFault and getAllFaults to read_many 2023-12-10 16:05:16 -05:00
Joshua Boniface d6aecf195e Port Node getNodeInformation to read_many 2023-12-10 15:53:28 -05:00
Joshua Boniface 9329784010 Implement async ZK read function
Adds a function, "read_many", which can take in multiple ZK keys and
return the values from all of them, using asyncio to avoid reading
sequentially.

Initial tests show a marked improvement in read performance of multiple
read()-heavy functions (e.g. "get_list()" functions) with this method.
2023-12-10 15:35:40 -05:00
Joshua Boniface b9fbfe2ed5 Improve fault ID format
Instead of using random hex characters from an md5sum, use a nice name
in all-caps similar to how Ceph does. This further helps prevent dupes
but also permits a changing health delta within a single event (which
would really only ever apply to plugin faults).
2023-12-09 16:48:14 -05:00
Joshua Boniface 7e6d922877 Improve fault detail handling further
Since we already had a "details" field, simply move where it gets added
to the message later, in generate_fault, after the main message value
was used to generate the ID.
2023-12-09 16:13:36 -05:00
Joshua Boniface 4003204f14 Remove bracketed text from fault_str
This ensures that certain faults e.g. Ceph status faults, will be
combined despite the added text in brackets, while still keeping them
mostly separate.

Also ensure the health text is updated each time to assist with this, as
this health text may now change independent of the fault ID.
2023-12-09 15:34:18 -05:00
Joshua Boniface 2bea78d25e Make all remaining limits optional 2023-12-09 13:43:58 -05:00
Joshua Boniface fd717b702d Use external list of fault states 2023-12-09 12:51:41 -05:00
Joshua Boniface 317ca4b98c Move defined state combinations into common 2023-12-09 12:36:32 -05:00
Joshua Boniface 0bda095571 Move libvirt_schema and fix other imports 2023-12-09 12:20:29 -05:00
Joshua Boniface 813aef1463 Fix incorrect UUID key name 2023-12-09 12:14:57 -05:00
Joshua Boniface 5a7ea25266 Fix incorrect database name entries 2023-12-09 12:12:00 -05:00
Joshua Boniface 61b39d0739 Fix incorrect cluster health calculation 2023-12-07 11:13:36 -05:00
Joshua Boniface 4bf80a5913 Fix missing datetime shrink 2023-12-06 17:15:36 -05:00
Joshua Boniface e0bf7f7d1a Fix bad ID values in acknowledge 2023-12-06 14:18:31 -05:00
Joshua Boniface 20acf3295f Add mass ack/delete of faults 2023-12-06 13:59:39 -05:00
Joshua Boniface d1e34e7333 Store fault times only to the second
Any more precision is unnecessary and saves 6 chars when displaying
these times elsewhere.
2023-12-06 13:20:18 -05:00
Joshua Boniface 79eb54d5da Move fault generation to common library 2023-12-06 13:17:10 -05:00
Joshua Boniface 2267a9c85d Improve output formatting for simplicity 2023-12-05 10:37:35 -05:00
Joshua Boniface 672e58133f Implement interfaces to faults 2023-12-04 01:37:54 -05:00
Joshua Boniface 3dc48c1783 Lower default monitoring interval to 15s
Faults are also reported on the monitoring interval, so 60s seems like
too long. Lower this to 15 seconds by default instead.
2023-12-01 17:38:28 -05:00
Joshua Boniface 9c2b1b29ee Add node health to fault states
Adjusts ordering and ensures that node health states are included in
faults if they are less than 50%.

Also adjusts fault ID generation and runs fault checks only coordinator
nodes to avoid too many runs.
2023-12-01 17:38:28 -05:00
Joshua Boniface 8594eb697f Add initial fault generation in pvchealthd
References: #164
2023-12-01 17:38:27 -05:00
Joshua Boniface 7cb9ebae6b Remove legacy configuration handler
This is not going to be needed.
2023-12-01 01:25:40 -05:00
Joshua Boniface 102c3c3106 Port all Celery worker functions to discrete pkg
Moves all tasks run by the Celery worker into a discrete package/module
for easier installation. Also adjusts several parameters throughout to
accomplish this.
2023-11-30 02:24:54 -05:00
Joshua Boniface 03a738f878 Move config parser into daemon_lib
And reformat/add config values for API.
2023-11-30 00:05:37 -05:00
Joshua Boniface 11db3c5b20 Fix ordering during termination 2023-11-29 21:21:51 -05:00
Joshua Boniface fa12a3c9b1 Permit buffered log appending 2023-11-29 21:21:51 -05:00
Joshua Boniface 787f4216b3 Expand Zookeeper log daemon prefix to match 2023-11-29 21:21:51 -05:00