Moves VM autobackups from being in-CLI to being handled by the
pvcworkerd system on the primary coordinator. Turns the CLI autobackup
command into an actual API client endpoint rather than having its logic
in the CLI.
In addition, modifies the new autobackup to leverage the new "pvc vm
snapshot" function set, just with special snapshot names. This helps
automate this within the new snapshot scaffolding.
A rename is simply a change to two values, so instead of undefining and
re-defining the VM, just edit those two fields. This ensures things like
snapshots are preserved automatically.
We supported creating snapshots, but not doing anything with them. This
removes the manual task of restoring a snapshot and replace it with a
PVC abstraction of rolling back to a snapshot.
While Ceph recommends cloning a snapshot instead of rolling back, due to
the time taken, in our usecase I don't think that is an optimal
strategy, as it will leave dangling clones that we'd then have to
manage.
Closes#183
Adds a check that a volume creation or resize won't violate the 80% full
rule for the storage cluster. This ensures a cluster won't get too full
if a storage volume fills up.
Also adds a force flag throughout the pipeline to override this check,
should an administrator really want to do so.
Closes#177
Adds a new flag to VM metadata to allow setting the VM live migration
max downtime. This will enable very busy VMs that hang live migration to
have this value changed.
Major improvements to autobackup and backups, including additional
information/fields in the backup JSON itself, improved error handling,
and the ability to email reports of autobackups using a local sendmail
utility.
Adds additional information about failures, runtime, file sizes, etc. to
the JSON output of a VM backup.
This helps enable additional reporting and summary information for
autobackup runs.
Since these are unauthenticated, it might be the case that an
administrator wishes to completely disable these metrics endpoints.
Provide that option via pvc.conf through pvc-ansible's existing
enable_prometheus_exporters option and the new enable_prometheus
configuration flag.
Defaults to "yes" to provide all functionality unless explicitly
disabled, as the author assumes that the PVC API is secured in other
ways as well and that metric information is not completely sensitive.
Adds a new physical network interface stats parser to the node
keepalives, and leverages this information to provide a network
utilization overview in the Prometheus metrics.
Sometimes clashing log entries would print on the same line, likely due
to some sort of race condition in Python's print() built-in.
Instead, add a newline to our actual message and print without an end
character. This ensures atomic printing of our log messages.
We already parse this information out anyways, so might as well add it
to the API output JSON. This can be leveraged by the Prometheus endpoint
as well to avoid duplicate listings.
This significantly simplifies cluster state handling by removing most of
the superfluous get_list() calls, replacing them with basic child reads
since most of them are just for a count anyways. The ones that require
states simplify this down to a child read plus direct reads for the
exact items required while leveraging the new read_many() function.
Adds a function, "read_many", which can take in multiple ZK keys and
return the values from all of them, using asyncio to avoid reading
sequentially.
Initial tests show a marked improvement in read performance of multiple
read()-heavy functions (e.g. "get_list()" functions) with this method.
Instead of using random hex characters from an md5sum, use a nice name
in all-caps similar to how Ceph does. This further helps prevent dupes
but also permits a changing health delta within a single event (which
would really only ever apply to plugin faults).
Since we already had a "details" field, simply move where it gets added
to the message later, in generate_fault, after the main message value
was used to generate the ID.
This ensures that certain faults e.g. Ceph status faults, will be
combined despite the added text in brackets, while still keeping them
mostly separate.
Also ensure the health text is updated each time to assist with this, as
this health text may now change independent of the fault ID.
Adjusts ordering and ensures that node health states are included in
faults if they are less than 50%.
Also adjusts fault ID generation and runs fault checks only coordinator
nodes to avoid too many runs.