Commit Graph

389 Commits

Author SHA1 Message Date
Joshua Boniface 09269f182c Add live migrate max downtime selector meta field
Adds a new flag to VM metadata to allow setting the VM live migration
max downtime. This will enable very busy VMs that hang live migration to
have this value changed.
2024-01-11 00:05:50 -05:00
Joshua Boniface 362edeed8c Add backup reporting and improve metrics
Major improvements to autobackup and backups, including additional
information/fields in the backup JSON itself, improved error handling,
and the ability to email reports of autobackups using a local sendmail
utility.
2024-01-10 14:18:44 -05:00
Joshua Boniface 39c8367723 Add additional metainfo to VM backups
Adds additional information about failures, runtime, file sizes, etc. to
the JSON output of a VM backup.

This helps enable additional reporting and summary information for
autobackup runs.
2024-01-10 10:37:29 -05:00
Joshua Boniface aac306c55b Fix missing stats on old Debians 2024-01-09 12:10:16 -05:00
Joshua Boniface c1ae571213 Add additional VM details to Prometheus 2023-12-29 14:09:39 -05:00
Joshua Boniface 123c7ce857 Update copyright header on all files for 2024
Last release of 2023 is probably the best time to do this.
2023-12-29 11:16:59 -05:00
Joshua Boniface 4969e90f8a Allow enable/disable of Prometheus endpoints
Since these are unauthenticated, it might be the case that an
administrator wishes to completely disable these metrics endpoints.
Provide that option via pvc.conf through pvc-ansible's existing
enable_prometheus_exporters option and the new enable_prometheus
configuration flag.

Defaults to "yes" to provide all functionality unless explicitly
disabled, as the author assumes that the PVC API is secured in other
ways as well and that metric information is not completely sensitive.
2023-12-29 09:25:10 -05:00
Joshua Boniface 3346ce9bb0 Add missing shutdown state from combinations 2023-12-27 13:40:30 -05:00
Joshua Boniface e654fbba08 Move debug condition handling to Logger
Avoids many dozens of conditionals sprinkled throughout the code by
centralizing this check into the main Logger instance.
2023-12-27 13:01:45 -05:00
Joshua Boniface 4375f66793 Use proper get() for invalid values 2023-12-27 12:03:48 -05:00
Joshua Boniface 3df3ca5b44 Fix value for OSD utilization
Ceph provides in KB; convert to bytes.
2023-12-27 11:56:50 -05:00
Joshua Boniface 431ee69620 Use proper percentage for pool util 2023-12-27 10:03:00 -05:00
Joshua Boniface 88f4d79d5a Handle invalid values on older Libvirt versions 2023-12-27 09:51:24 -05:00
Joshua Boniface 84d22751d8 Fix bad JSON data handler 2023-12-27 09:43:37 -05:00
Joshua Boniface 40ff005a09 Fix handling of Ceph OSD bytes 2023-12-26 12:43:51 -05:00
Joshua Boniface 9604f655d0 Improve node utilization metrics and fix bugs 2023-12-25 02:47:41 -05:00
Joshua Boniface 3e4cc53fdd Add node network statistics and utilization values
Adds a new physical network interface stats parser to the node
keepalives, and leverages this information to provide a network
utilization overview in the Prometheus metrics.
2023-12-21 15:45:01 -05:00
Joshua Boniface d2d2a9c617 Include our newline atomically
Sometimes clashing log entries would print on the same line, likely due
to some sort of race condition in Python's print() built-in.

Instead, add a newline to our actual message and print without an end
character. This ensures atomic printing of our log messages.
2023-12-21 13:12:43 -05:00
Joshua Boniface 6ed4efad33 Add new network.stats key to nodes 2023-12-21 12:48:48 -05:00
Joshua Boniface 39f9f3640c Rename health metrics and add resource metrics 2023-12-21 09:40:49 -05:00
Joshua Boniface c64e888d30 Fix incorrect cast of None 2023-12-14 16:00:53 -05:00
Joshua Boniface f1249452e5 Fix bug if no nodes are present 2023-12-14 15:32:18 -05:00
Joshua Boniface f41c5176be Ensure health value is an int properly 2023-12-13 14:34:02 -05:00
Joshua Boniface ed9c37982a Move metric collection into daemon library 2023-12-11 19:20:30 -05:00
Joshua Boniface 57c28376a6 Port one final Ceph function to read_many 2023-12-11 10:25:36 -05:00
Joshua Boniface e781d742e6 Fix bug with volume and snapshot listing 2023-12-11 10:21:46 -05:00
Joshua Boniface 741dafb26b Port VM functions to read_many 2023-12-11 03:34:36 -05:00
Joshua Boniface 5d9e83e8ed Fix output bugs in VM information 2023-12-11 03:04:46 -05:00
Joshua Boniface 7c116b2fbc Ensure node health value is an int 2023-12-10 23:56:50 -05:00
Joshua Boniface 1023c55087 Fix bug in VM state list 2023-12-10 23:44:01 -05:00
Joshua Boniface 9235187c6f Port Ceph functions to read_many
Only ports getOSDInformation, as all the others feature 3 or less reads
which is acceptable sequentially.
2023-12-10 22:24:38 -05:00
Joshua Boniface 0c94f1b4f8 Port Network functions to read_many 2023-12-10 22:19:21 -05:00
Joshua Boniface 44a4f0e1f7 Use new info detail output instead of new lists
Avoids multiple additional ZK calls by using data that is now in the
status detail output.
2023-12-10 22:19:09 -05:00
Joshua Boniface 5d53a3e529 Add state and faults detail to cluster information
We already parse this information out anyways, so might as well add it
to the API output JSON. This can be leveraged by the Prometheus endpoint
as well to avoid duplicate listings.
2023-12-10 17:29:32 -05:00
Joshua Boniface 35e22cb50f Simplify cluster status handling
This significantly simplifies cluster state handling by removing most of
the superfluous get_list() calls, replacing them with basic child reads
since most of them are just for a count anyways. The ones that require
states simplify this down to a child read plus direct reads for the
exact items required while leveraging the new read_many() function.
2023-12-10 17:05:46 -05:00
Joshua Boniface a3171b666b Split node health into separate function 2023-12-10 16:52:10 -05:00
Joshua Boniface 48e41d7b05 Port Faults getFault and getAllFaults to read_many 2023-12-10 16:05:16 -05:00
Joshua Boniface d6aecf195e Port Node getNodeInformation to read_many 2023-12-10 15:53:28 -05:00
Joshua Boniface 9329784010 Implement async ZK read function
Adds a function, "read_many", which can take in multiple ZK keys and
return the values from all of them, using asyncio to avoid reading
sequentially.

Initial tests show a marked improvement in read performance of multiple
read()-heavy functions (e.g. "get_list()" functions) with this method.
2023-12-10 15:35:40 -05:00
Joshua Boniface b9fbfe2ed5 Improve fault ID format
Instead of using random hex characters from an md5sum, use a nice name
in all-caps similar to how Ceph does. This further helps prevent dupes
but also permits a changing health delta within a single event (which
would really only ever apply to plugin faults).
2023-12-09 16:48:14 -05:00
Joshua Boniface 7e6d922877 Improve fault detail handling further
Since we already had a "details" field, simply move where it gets added
to the message later, in generate_fault, after the main message value
was used to generate the ID.
2023-12-09 16:13:36 -05:00
Joshua Boniface 4003204f14 Remove bracketed text from fault_str
This ensures that certain faults e.g. Ceph status faults, will be
combined despite the added text in brackets, while still keeping them
mostly separate.

Also ensure the health text is updated each time to assist with this, as
this health text may now change independent of the fault ID.
2023-12-09 15:34:18 -05:00
Joshua Boniface 2bea78d25e Make all remaining limits optional 2023-12-09 13:43:58 -05:00
Joshua Boniface fd717b702d Use external list of fault states 2023-12-09 12:51:41 -05:00
Joshua Boniface 317ca4b98c Move defined state combinations into common 2023-12-09 12:36:32 -05:00
Joshua Boniface 0bda095571 Move libvirt_schema and fix other imports 2023-12-09 12:20:29 -05:00
Joshua Boniface 813aef1463 Fix incorrect UUID key name 2023-12-09 12:14:57 -05:00
Joshua Boniface 5a7ea25266 Fix incorrect database name entries 2023-12-09 12:12:00 -05:00
Joshua Boniface 61b39d0739 Fix incorrect cluster health calculation 2023-12-07 11:13:36 -05:00
Joshua Boniface 4bf80a5913 Fix missing datetime shrink 2023-12-06 17:15:36 -05:00