Commit Graph

480 Commits

Author SHA1 Message Date
Joshua Boniface 0d918d66fe Port VM autobackups into pvcworkerd with snaps
Moves VM autobackups from being in-CLI to being handled by the
pvcworkerd system on the primary coordinator. Turns the CLI autobackup
command into an actual API client endpoint rather than having its logic
in the CLI.

In addition, modifies the new autobackup to leverage the new "pvc vm
snapshot" function set, just with special snapshot names. This helps
automate this within the new snapshot scaffolding.
2024-08-23 17:23:06 -04:00
Joshua Boniface f6c009beac Allow overriding stages in some commands
This allows them to be called by autobackup commands while still
preserving the current Celery report flow.
2024-08-23 11:21:02 -04:00
Joshua Boniface fc89f4f2f5 Fix error message contents 2024-08-23 10:23:51 -04:00
Joshua Boniface 565011b277 Set snapshot name before start 2024-08-20 23:01:52 -04:00
Joshua Boniface 0bf9cc6b06 Improve stage handling
Run start() at the beginning, and leverage the new tweaks to the CLI to
update the total steps later. Allows errors to be handled gracefully
2024-08-20 17:50:27 -04:00
Joshua Boniface f2dfada73e Improve return handling for snapshot tasks 2024-08-20 17:40:44 -04:00
Joshua Boniface 9b3075be18 Add UUID check and fix wording
Don't suggest renaming any more as it's not enough.
2024-08-20 17:05:27 -04:00
Joshua Boniface 9a661d0173 Convert VM snapshots to worker tasks
Improves manageability and offloads these from the API context.
2024-08-20 16:50:41 -04:00
Joshua Boniface 4a0680b27f Fix issues with snapshot imports 2024-08-20 13:59:05 -04:00
Joshua Boniface f42a1bad0e Allow passing zk_only into VM snapshot creation 2024-08-20 12:57:53 -04:00
Joshua Boniface 3fb52a13c2 Add missing VM states from snapshots 2024-08-20 11:53:57 -04:00
Joshua Boniface 8937ddf331 Simplify VM rename to preserve data
A rename is simply a change to two values, so instead of undefining and
re-defining the VM, just edit those two fields. This ensures things like
snapshots are preserved automatically.
2024-08-20 11:37:28 -04:00
Joshua Boniface 7cc354466f Finish implementing snapshot import 2024-08-20 11:25:09 -04:00
Joshua Boniface 0a8bad3418 Add VM snapshot import 2024-08-20 10:53:56 -04:00
Joshua Boniface f10d32987b Fix up comments 2024-08-20 10:37:58 -04:00
Joshua Boniface d060787503 Add initial implementation of snapshot export 2024-08-19 18:46:07 -04:00
Joshua Boniface f1b4593367 Store current stats with snapshots
Allows getting info like size, etc. for the snapshot.
2024-08-19 14:07:27 -04:00
Joshua Boniface 33f905459a Implement VM rollback
Closes #184
2024-08-16 10:47:18 -04:00
Joshua Boniface 359191c83f Ensure snapshot name does not already exist 2024-08-16 10:46:25 -04:00
Joshua Boniface 3d0d5e63f6 Make default snap name just the datestring 2024-08-16 10:46:25 -04:00
Joshua Boniface e6bfbb6d45 Actually fix incorrect naming bug 2024-08-16 10:46:25 -04:00
Joshua Boniface b80f9e28dc Add human-readable age to snapshots
This is parsed server-side for consistent timing and to simplify the API
consumers.
2024-08-16 10:46:25 -04:00
Joshua Boniface fbd5b3cca3 Remove is_backup flag for snapshots
This won't be needed for anything.
2024-08-16 10:46:25 -04:00
Joshua Boniface 2b1082590e Fix bug in snapshot removal 2024-08-16 10:46:25 -04:00
Joshua Boniface 6fc7f45027 Add snapshot lists and timestamp
Adds snapshots to the list of data in VM objects
2024-08-16 10:46:25 -04:00
Joshua Boniface 0c240a5129 Add VM snapshot removal 2024-08-16 10:46:25 -04:00
Joshua Boniface 553c1e670e Add VM snapshots functionality
Adds the ability to create snapshots of an entire VM, including all its
RBD disks and the VM XML config, though not any PVC metadata.
2024-08-16 10:46:25 -04:00
Joshua Boniface 942de9f15b Add better exception handling for XML configs 2024-08-16 10:46:04 -04:00
Joshua Boniface c186015d6f Add check for invalid profile 2024-07-13 17:13:40 -04:00
Joshua Boniface 7a99e0e524 Fix bugs listing snapshots by pool/volume
The logic of this didn't work, so reconfigure to use these like limits.
Also fixes a bug in the upper getCephVolumes for invalid pools.
2024-05-16 16:32:22 -04:00
Joshua Boniface 5d0e7931d1 Add support for rolling back snapshots
We supported creating snapshots, but not doing anything with them. This
removes the manual task of restoring a snapshot and replace it with a
PVC abstraction of rolling back to a snapshot.

While Ceph recommends cloning a snapshot instead of rolling back, due to
the time taken, in our usecase I don't think that is an optimal
strategy, as it will leave dangling clones that we'd then have to
manage.

Closes #183
2024-05-13 15:24:51 -04:00
Joshua Boniface ab944f9b95 Add RBD snap purge during volume removal
Fixes #180
2024-04-19 10:31:11 -04:00
Joshua Boniface 9714ac20b2 Update formatting for Black 24.4.0 2024-04-19 10:26:06 -04:00
Joshua Boniface a461791ce8 Fix bug cleaning up successful benchmark results 2024-03-08 14:22:07 -05:00
Joshua Boniface 9fdb6d8708 Fix bug with network stats 2024-03-07 15:44:35 -05:00
Joshua Boniface 2fb7c40497 Work around bad plugin data 2024-03-07 14:37:05 -05:00
Joshua Boniface 67ec41aaf9 Fix invalid memory errors for stopped VMs 2024-02-06 13:30:48 -05:00
Joshua Boniface a95e72008e Add size validations for volume clones
Adds the same validations as a volume add or resize to volume clones, to
ensure there is enough free space for them.
2024-02-02 11:37:29 -05:00
Joshua Boniface efc7434143 Add safety check for 80% full size
Adds a check that a volume creation or resize won't violate the 80% full
rule for the storage cluster. This ensures a cluster won't get too full
if a storage volume fills up.

Also adds a force flag throughout the pipeline to override this check,
should an administrator really want to do so.

Closes #177
2024-02-02 11:37:00 -05:00
Joshua Boniface 8419659e1b Ensure zkhandler is always cleaned up
Even if the subfunction of an API @ZKConnection call fails, the
zkhandler needs to terminate and clean up, or it leaves stuck threads
around.
2024-01-30 09:48:17 -05:00
Joshua Boniface d28fb71f57 Fix incorrect variable set 2024-01-24 14:40:40 -05:00
Joshua Boniface 09269f182c Add live migrate max downtime selector meta field
Adds a new flag to VM metadata to allow setting the VM live migration
max downtime. This will enable very busy VMs that hang live migration to
have this value changed.
2024-01-11 00:05:50 -05:00
Joshua Boniface 362edeed8c Add backup reporting and improve metrics
Major improvements to autobackup and backups, including additional
information/fields in the backup JSON itself, improved error handling,
and the ability to email reports of autobackups using a local sendmail
utility.
2024-01-10 14:18:44 -05:00
Joshua Boniface 39c8367723 Add additional metainfo to VM backups
Adds additional information about failures, runtime, file sizes, etc. to
the JSON output of a VM backup.

This helps enable additional reporting and summary information for
autobackup runs.
2024-01-10 10:37:29 -05:00
Joshua Boniface aac306c55b Fix missing stats on old Debians 2024-01-09 12:10:16 -05:00
Joshua Boniface c1ae571213 Add additional VM details to Prometheus 2023-12-29 14:09:39 -05:00
Joshua Boniface 123c7ce857 Update copyright header on all files for 2024
Last release of 2023 is probably the best time to do this.
2023-12-29 11:16:59 -05:00
Joshua Boniface 4969e90f8a Allow enable/disable of Prometheus endpoints
Since these are unauthenticated, it might be the case that an
administrator wishes to completely disable these metrics endpoints.
Provide that option via pvc.conf through pvc-ansible's existing
enable_prometheus_exporters option and the new enable_prometheus
configuration flag.

Defaults to "yes" to provide all functionality unless explicitly
disabled, as the author assumes that the PVC API is secured in other
ways as well and that metric information is not completely sensitive.
2023-12-29 09:25:10 -05:00
Joshua Boniface 3346ce9bb0 Add missing shutdown state from combinations 2023-12-27 13:40:30 -05:00
Joshua Boniface e654fbba08 Move debug condition handling to Logger
Avoids many dozens of conditionals sprinkled throughout the code by
centralizing this check into the main Logger instance.
2023-12-27 13:01:45 -05:00
Joshua Boniface 4375f66793 Use proper get() for invalid values 2023-12-27 12:03:48 -05:00
Joshua Boniface 3df3ca5b44 Fix value for OSD utilization
Ceph provides in KB; convert to bytes.
2023-12-27 11:56:50 -05:00
Joshua Boniface 431ee69620 Use proper percentage for pool util 2023-12-27 10:03:00 -05:00
Joshua Boniface 88f4d79d5a Handle invalid values on older Libvirt versions 2023-12-27 09:51:24 -05:00
Joshua Boniface 84d22751d8 Fix bad JSON data handler 2023-12-27 09:43:37 -05:00
Joshua Boniface 40ff005a09 Fix handling of Ceph OSD bytes 2023-12-26 12:43:51 -05:00
Joshua Boniface 9604f655d0 Improve node utilization metrics and fix bugs 2023-12-25 02:47:41 -05:00
Joshua Boniface 3e4cc53fdd Add node network statistics and utilization values
Adds a new physical network interface stats parser to the node
keepalives, and leverages this information to provide a network
utilization overview in the Prometheus metrics.
2023-12-21 15:45:01 -05:00
Joshua Boniface d2d2a9c617 Include our newline atomically
Sometimes clashing log entries would print on the same line, likely due
to some sort of race condition in Python's print() built-in.

Instead, add a newline to our actual message and print without an end
character. This ensures atomic printing of our log messages.
2023-12-21 13:12:43 -05:00
Joshua Boniface 6ed4efad33 Add new network.stats key to nodes 2023-12-21 12:48:48 -05:00
Joshua Boniface 39f9f3640c Rename health metrics and add resource metrics 2023-12-21 09:40:49 -05:00
Joshua Boniface c64e888d30 Fix incorrect cast of None 2023-12-14 16:00:53 -05:00
Joshua Boniface f1249452e5 Fix bug if no nodes are present 2023-12-14 15:32:18 -05:00
Joshua Boniface f41c5176be Ensure health value is an int properly 2023-12-13 14:34:02 -05:00
Joshua Boniface ed9c37982a Move metric collection into daemon library 2023-12-11 19:20:30 -05:00
Joshua Boniface 57c28376a6 Port one final Ceph function to read_many 2023-12-11 10:25:36 -05:00
Joshua Boniface e781d742e6 Fix bug with volume and snapshot listing 2023-12-11 10:21:46 -05:00
Joshua Boniface 741dafb26b Port VM functions to read_many 2023-12-11 03:34:36 -05:00
Joshua Boniface 5d9e83e8ed Fix output bugs in VM information 2023-12-11 03:04:46 -05:00
Joshua Boniface 7c116b2fbc Ensure node health value is an int 2023-12-10 23:56:50 -05:00
Joshua Boniface 1023c55087 Fix bug in VM state list 2023-12-10 23:44:01 -05:00
Joshua Boniface 9235187c6f Port Ceph functions to read_many
Only ports getOSDInformation, as all the others feature 3 or less reads
which is acceptable sequentially.
2023-12-10 22:24:38 -05:00
Joshua Boniface 0c94f1b4f8 Port Network functions to read_many 2023-12-10 22:19:21 -05:00
Joshua Boniface 44a4f0e1f7 Use new info detail output instead of new lists
Avoids multiple additional ZK calls by using data that is now in the
status detail output.
2023-12-10 22:19:09 -05:00
Joshua Boniface 5d53a3e529 Add state and faults detail to cluster information
We already parse this information out anyways, so might as well add it
to the API output JSON. This can be leveraged by the Prometheus endpoint
as well to avoid duplicate listings.
2023-12-10 17:29:32 -05:00
Joshua Boniface 35e22cb50f Simplify cluster status handling
This significantly simplifies cluster state handling by removing most of
the superfluous get_list() calls, replacing them with basic child reads
since most of them are just for a count anyways. The ones that require
states simplify this down to a child read plus direct reads for the
exact items required while leveraging the new read_many() function.
2023-12-10 17:05:46 -05:00
Joshua Boniface a3171b666b Split node health into separate function 2023-12-10 16:52:10 -05:00
Joshua Boniface 48e41d7b05 Port Faults getFault and getAllFaults to read_many 2023-12-10 16:05:16 -05:00
Joshua Boniface d6aecf195e Port Node getNodeInformation to read_many 2023-12-10 15:53:28 -05:00
Joshua Boniface 9329784010 Implement async ZK read function
Adds a function, "read_many", which can take in multiple ZK keys and
return the values from all of them, using asyncio to avoid reading
sequentially.

Initial tests show a marked improvement in read performance of multiple
read()-heavy functions (e.g. "get_list()" functions) with this method.
2023-12-10 15:35:40 -05:00
Joshua Boniface b9fbfe2ed5 Improve fault ID format
Instead of using random hex characters from an md5sum, use a nice name
in all-caps similar to how Ceph does. This further helps prevent dupes
but also permits a changing health delta within a single event (which
would really only ever apply to plugin faults).
2023-12-09 16:48:14 -05:00
Joshua Boniface 7e6d922877 Improve fault detail handling further
Since we already had a "details" field, simply move where it gets added
to the message later, in generate_fault, after the main message value
was used to generate the ID.
2023-12-09 16:13:36 -05:00
Joshua Boniface 4003204f14 Remove bracketed text from fault_str
This ensures that certain faults e.g. Ceph status faults, will be
combined despite the added text in brackets, while still keeping them
mostly separate.

Also ensure the health text is updated each time to assist with this, as
this health text may now change independent of the fault ID.
2023-12-09 15:34:18 -05:00
Joshua Boniface 2bea78d25e Make all remaining limits optional 2023-12-09 13:43:58 -05:00
Joshua Boniface fd717b702d Use external list of fault states 2023-12-09 12:51:41 -05:00
Joshua Boniface 317ca4b98c Move defined state combinations into common 2023-12-09 12:36:32 -05:00
Joshua Boniface 0bda095571 Move libvirt_schema and fix other imports 2023-12-09 12:20:29 -05:00
Joshua Boniface 813aef1463 Fix incorrect UUID key name 2023-12-09 12:14:57 -05:00
Joshua Boniface 5a7ea25266 Fix incorrect database name entries 2023-12-09 12:12:00 -05:00
Joshua Boniface 61b39d0739 Fix incorrect cluster health calculation 2023-12-07 11:13:36 -05:00
Joshua Boniface 4bf80a5913 Fix missing datetime shrink 2023-12-06 17:15:36 -05:00
Joshua Boniface e0bf7f7d1a Fix bad ID values in acknowledge 2023-12-06 14:18:31 -05:00
Joshua Boniface 20acf3295f Add mass ack/delete of faults 2023-12-06 13:59:39 -05:00
Joshua Boniface d1e34e7333 Store fault times only to the second
Any more precision is unnecessary and saves 6 chars when displaying
these times elsewhere.
2023-12-06 13:20:18 -05:00
Joshua Boniface 79eb54d5da Move fault generation to common library 2023-12-06 13:17:10 -05:00
Joshua Boniface 2267a9c85d Improve output formatting for simplicity 2023-12-05 10:37:35 -05:00
Joshua Boniface 672e58133f Implement interfaces to faults 2023-12-04 01:37:54 -05:00
Joshua Boniface 3dc48c1783 Lower default monitoring interval to 15s
Faults are also reported on the monitoring interval, so 60s seems like
too long. Lower this to 15 seconds by default instead.
2023-12-01 17:38:28 -05:00
Joshua Boniface 9c2b1b29ee Add node health to fault states
Adjusts ordering and ensures that node health states are included in
faults if they are less than 50%.

Also adjusts fault ID generation and runs fault checks only coordinator
nodes to avoid too many runs.
2023-12-01 17:38:28 -05:00
Joshua Boniface 8594eb697f Add initial fault generation in pvchealthd
References: #164
2023-12-01 17:38:27 -05:00