Compare commits


176 Commits

Author SHA1 Message Date
Joshua Boniface 9441cb3b2e Bump version to 0.9.103 2024-11-01 17:23:24 -04:00
Joshua Boniface b16542c8fc Fix double-appending domain bug
Since storage_hosts entries are now FQDNs that already include the
storage domain, don't re-append it within vmbuilder.
2024-11-01 17:18:51 -04:00
Joshua Boniface de0c7e37f2 Allow environment setting for Munin 2024-10-30 13:12:08 -04:00
Joshua Boniface ae26a071c7 Fix bugs with Munin plugin 2024-10-30 12:53:29 -04:00
Joshua Boniface 49a34acd14 Fix README images 2024-10-25 23:51:08 -04:00
Joshua Boniface 82365ea539 Update README badge order 2024-10-25 23:47:33 -04:00
Joshua Boniface 86f0c5c3ae Update README 2024-10-25 23:43:57 -04:00
Joshua Boniface 83294298e1 Update README to match GitHub 2024-10-25 23:37:32 -04:00
Joshua Boniface 4187aacc5b Correct formatting of OpenAPI Swagger specs 2024-10-19 02:23:46 -04:00
Joshua Boniface 35c82b5249 Bump version to 0.9.102 2024-10-17 10:48:31 -04:00
Joshua Boniface e80b797e3a Add missing sorter for detail parser 2024-10-17 10:09:49 -04:00
Joshua Boniface 7c8c71dff7 Improve handling of local connections in CLI
1. Ensure the local connection is actually always present if it exists,
and stored in the store file.

2. Remove any invalid "local" store entries if present (i.e.
pvcapid.yaml entries from legacy versions).

3. Order the connection lists such that "local" is always first.

4. Improve pretty list output format such that all fields are wider if
needed.
2024-10-17 09:56:54 -04:00
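
A rough sketch of the connection-store cleanup described in the commit above. The store layout, key names, and the `/etc/pvc/pvc.conf` check are assumptions for illustration, not PVC's actual store format.

```python
# Illustrative only: store layout and keys ("local", "host", "scheme",
# "config_file") are assumptions, not the real PVC connection store.
import json
import os


def load_connections(store_path, local_config_path="/etc/pvc/pvc.conf"):
    with open(store_path, "r") as fh:
        store = json.load(fh)

    # Drop legacy "local" entries that still point at old pvcapid.yaml configs
    local = store.get("local")
    if local is not None and local.get("config_file", "").endswith("pvcapid.yaml"):
        del store["local"]

    # Always (re)create "local" if the local configuration file exists
    if os.path.exists(local_config_path):
        store["local"] = {"host": "127.0.0.1", "scheme": "http"}

    # Order connections so that "local" always sorts first
    names = sorted(store.keys(), key=lambda n: (n != "local", n))
    return {name: store[name] for name in names}
```
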
Joshua Boniface 861fef91e3 Add modification of Monitor hosts on XML import
Without this, clusters with different storage hosts would silently fail
to start. Ensure these are updated just as the secret UUID is.
2024-10-16 16:00:54 -04:00
Joshua Boniface d1fcac1f0a Bump version to 0.9.101 2024-10-15 11:39:11 -04:00
Joshua Boniface 6ace2ebf6a Set expected PVC version for mirroring 2024-10-15 11:31:50 -04:00
Joshua Boniface 962fba7621 Bump up startup waits slightly
Ensures there's more time for daemons (specifically Zookeeper) to start
up and synchronize between nodes.
2024-10-15 11:10:23 -04:00
Joshua Boniface 49bf51da38 Fix indentation of previous fix 2024-10-15 10:57:33 -04:00
Joshua Boniface 1293e8ae7e Fix bugs in lock freeing function
1. The destination state on an error was invalid; should be "stop".

2. If a lock was listed but removing it failed (because it was already
cleared somehow), this would error. In turn this would cause the VM to
not migrate and be left in an undefined state. Fix that when unlocking
is forced.
2024-10-15 10:43:52 -04:00
Joshua Boniface ae2cf8a070 Add some time for Zookeeper to synchronize 2024-10-15 10:43:44 -04:00
Joshua Boniface ab5bd3c57d Fix handling of invalid nets in list
Ensure we add the difference in length between the visual output and the
ANSI-coded output to avoid the format handler mishandling the length.
2024-10-14 12:51:02 -04:00
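
A minimal sketch of the width fix described above: when a cell contains ANSI colour codes, the field width passed to the formatter is widened by the difference between the raw length and the visible length. The regex and helper names are illustrative, not the CLI's actual code.

```python
import re

ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")


def visible_length(text):
    # Length of the string as it appears on screen, without escape codes
    return len(ANSI_RE.sub("", text))


def pad_width(text, column_width):
    # Add the invisible-character overhead back onto the field width
    return column_width + (len(text) - visible_length(text))


cell = "\x1b[91minvalid\x1b[0m"
print("{cell: <{width}} |".format(cell=cell, width=pad_width(cell, 10)))
```
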
Joshua Boniface 35153cd6b6 Fix path handling for zkhandler
Using full paths broke the local schema generator, so convert these to
proper class instance methods and use them along with a new default +
settable override.
2024-10-11 16:03:40 -04:00
Joshua Boniface 7f7047dd52 Add one more instance of mirror as purple 2024-10-11 14:44:14 -04:00
Joshua Boniface 9a91767405 Add proper return codes to API handlers 2024-10-11 14:43:44 -04:00
Joshua Boniface bcfa6851e1 Use purple for mirror state colour 2024-10-11 10:44:39 -04:00
Joshua Boniface 28b8b3bb44 Use proper response parsing instead of raise_for 2024-10-11 10:32:15 -04:00
Joshua Boniface 02425159ef Update Grafana graphs 2024-10-11 09:47:19 -04:00
Joshua Boniface a6f8500309 Improve fence handling to prevent anomalies
1. Move fence monitoring to its own thread rather than doing the listing
and triggering within the main keepalive thread.
2. Add a global lock key at /config/fence_lock and use this lock key to
prevent multiple nodes from trying to run fences simultaneously.
3. Run the fencing monitor for each node sequentially within the context
of the main fence monitoring thread, to ensure that fences of multiple
nodes happen sequentially rather than in parallel.

All of these should help to prevent any anomalies where one node can try
to fence multiple nodes at once without recourse.
2024-10-10 16:42:57 -04:00
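
A hedged sketch of the serialization idea using a Kazoo lock. The `/config/fence_lock` path comes from the commit message, but the surrounding function names, arguments, and node list are illustrative only.

```python
from kazoo.client import KazooClient


def run_fence_monitor(zk_hosts, my_node, dead_nodes, fence_node):
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    lock = zk.Lock("/config/fence_lock", identifier=my_node)
    # Only one node in the cluster holds the lock at a time, so fences
    # cannot run in parallel from multiple coordinators.
    with lock:
        for node in dead_nodes:
            # Fence each dead node sequentially within this single thread
            fence_node(node)
    zk.stop()
```
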
Joshua Boniface ebec1332e9 Return to relative paths for SCHEMA_ROOT_PATH 2024-10-10 16:20:02 -04:00
Joshua Boniface c08c3b2d7d Improve thread timeouts in keepalive
Avoids parts of the keepalive deadlocking while waiting on data that
will never come when internal processes fail. Based on testing, this
should ensure the keepalive always finishes in under 5 seconds.
2024-10-10 15:33:47 -04:00
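
A sketch of the timeout pattern, assuming hypothetical collector functions; the point is that a hung collector no longer blocks the whole keepalive run.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def run_keepalive_collectors(collectors, timeout=4.0):
    """Run each named collector callable and give up on stragglers."""
    results = {}
    pool = ThreadPoolExecutor(max_workers=len(collectors))
    futures = {name: pool.submit(fn) for name, fn in collectors.items()}
    for name, future in futures.items():
        try:
            results[name] = future.result(timeout=timeout)
        except TimeoutError:
            # Data never arrived; report nothing rather than deadlock
            results[name] = None
    # Don't wait for hung workers; let them finish in the background
    pool.shutdown(wait=False)
    return results
```
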
Joshua Boniface 4c0d90b517 Add read lock timeouts to prevent deadlocks 2024-10-10 15:19:05 -04:00
Joshua Boniface 70c588d3a8 Add confirmation option for mirror promote 2024-10-10 01:57:06 -04:00
Joshua Boniface 214e7f835a Properly preserve state on promotion
Ensure if the state is start, stop, or disable, that state is preserved;
if it's anything else, the remote side will be started.
2024-10-10 01:21:05 -04:00
Joshua Boniface 96cebfb42a Handle cross-cluster Ceph storage secrets 2024-10-10 00:47:50 -04:00
Joshua Boniface c4763ac596 Fix invalid responses during promote 2024-10-09 01:14:19 -04:00
Joshua Boniface ea5512e3d8 Only shut down VM if it is running 2024-10-09 01:10:42 -04:00
Joshua Boniface ac00f7c4c8 Fix boolean state of remove_on_source 2024-10-09 01:04:08 -04:00
Joshua Boniface 6d31bf439e Update error text 2024-10-09 01:00:51 -04:00
Joshua Boniface c714093a2e Ensure VM start is forced 2024-10-09 00:58:43 -04:00
Joshua Boniface 04a09b9269 Fix invalid data in state change 2024-10-09 00:55:13 -04:00
Joshua Boniface 3ede0c7d38 Name mirror snapshots like autobackup snapshots 2024-10-09 00:49:22 -04:00
Joshua Boniface ab9390fdb8 Fix another bad stage counting instance 2024-10-09 00:44:20 -04:00
Joshua Boniface 1c83584788 Set correct verbiage 2024-10-09 00:38:59 -04:00
Joshua Boniface 7f3ab4e119 Fix stage counting in tasks 2024-10-09 00:37:13 -04:00
Joshua Boniface 16eb09dc22 Fix ordering bug with vm_detail 2024-10-09 00:33:00 -04:00
Joshua Boniface 7ba75adef4 Fix bug if destination is missing 2024-10-09 00:27:42 -04:00
Joshua Boniface a691d26c30 Add check for scheme in destination
Allows handling invalid cluster names properly.
2024-10-09 00:25:13 -04:00
Joshua Boniface 1d90b066bc Add guard rails against manipulating mirrors
Snapshot mirrors should normally be promoted using "mirror promote", and
not started manually. This adds guard rails against that to the "start",
"stop", and "disable" state commands to prevent changing mirror states
without an explicit "--force" option.
2024-10-08 23:51:48 -04:00
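
A rough illustration of the guard-rail pattern with Click; the command, state name, and `--force` flag mirror the commit message, but the helper functions called here are hypothetical.

```python
import click


@click.command(name="start")
@click.argument("domain")
@click.option("--force", "force_flag", is_flag=True, default=False,
              help="Force a state change on a snapshot mirror.")
def vm_start(domain, force_flag):
    state = get_vm_state(domain)  # hypothetical lookup helper
    if state == "mirror" and not force_flag:
        raise click.ClickException(
            "VM is a snapshot mirror; use 'vm mirror promote' or pass --force"
        )
    start_vm(domain, force=force_flag)  # hypothetical API call
```
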
Joshua Boniface 3ea7421f09 Implement friendlier VM mirror commands
Adds two helper commands which automate sending and promoting VM
snapshots as "vm mirror" commands.

"vm mirror create" replicates the functionality of "snapshot create" and
"snapshot send", performing both in one single task using an
autogenerated dated snapshot name for automatic cross-cluster
replication.

"vm mirror promote" replicates the functionality of "vm shutdown",
"snapshot create", "snapshot send", "vm start" (remote), and,
optionally, "vm remove", performing in one single task an entire
cross-cluster VM move with or without retaining the copy on the local
cluster (if retained, the local copy becomes a snapshot mirror of the
remote, flipping their statuses).
2024-10-08 23:51:39 -04:00
Joshua Boniface df4d437d31 Update the description of VM define endpoint 2024-10-01 13:30:44 -04:00
Joshua Boniface 8295e2089d Add proper response schema for 202 responses 2024-10-01 13:25:11 -04:00
Joshua Boniface 4ccb570762 Enhance documentation of snapshot send command 2024-09-30 23:54:53 -04:00
Joshua Boniface 235299942a Add volume resize if changed 2024-09-30 20:51:59 -04:00
Joshua Boniface 9aa32134a9 Fix bug in API specification 2024-09-30 20:51:49 -04:00
Joshua Boniface 75eac356d5 Increase send blocksize and add total speed
It's much faster and seems to cause no issues.
2024-09-30 20:11:12 -04:00
Joshua Boniface fb8561cc5d Actually fix incremental sending 2024-09-30 17:00:18 -04:00
Joshua Boniface 5f7aa0b2d6 Improve incremental send speed 2024-09-30 04:15:17 -04:00
Joshua Boniface 7fac7a62cf Clean up debug print statements 2024-09-30 03:51:39 -04:00
Joshua Boniface b19642aa2e Fix bug where snapshot rollback was never called 2024-09-30 03:04:35 -04:00
Joshua Boniface 974e0d6ac2 Shorten progress bars to 20 characters
They were needlessly long and this limited the message size.
2024-09-30 03:04:10 -04:00
Joshua Boniface 7785166a7e Finish working implementation of send/receive
Required some significant refactoring due to issues with the diff send,
but it works.
2024-09-30 02:53:23 -04:00
Joshua Boniface 34f0a2f388 Add mostly complete implementation of VM send 2024-09-29 01:31:13 -04:00
Joshua Boniface 8fa37d21c0 Fix handling of invalid network lengths 2024-09-29 00:39:53 -04:00
Joshua Boniface f462ebbc6b Add VM snapshot send (initial) 2024-09-28 10:49:35 -04:00
Joshua Boniface 0d533f3658 Rework task output bar operation
Allows sending constant updates including changes to the message within
the same task.
2024-09-28 10:48:39 -04:00
Joshua Boniface 792d135950 Update responses for Celery tasks 2024-09-28 02:01:56 -04:00
Joshua Boniface a64e0c1985 Fix incorrect default value typos 2024-09-28 02:01:56 -04:00
Joshua Boniface 1cbadb1172 Add "mirror" VM state 2024-09-28 02:01:56 -04:00
Joshua Boniface b1c4b2e928 Add Ceph block receive (initial) 2024-09-28 02:01:56 -04:00
Joshua Boniface 7fe1262887 Fix indentation in faults 2024-09-28 02:01:33 -04:00
Joshua Boniface 0e389ba1f4 Fix bug when setting split count = 1
Would set the OSD as split in Zookeeper, even though it wasn't.
2024-09-23 13:06:05 -04:00
Joshua Boniface 41cd34ba4d Allow specifying job names for benchmarks 2024-09-18 14:55:12 -04:00
Joshua Boniface 736762901c Update benchmarks to include resource utilization
Adds additional polled information on node cpu, memory, and network
bandwidth for the node running the test. This should provide additional
useful information about the results of the test.

Also bumps the test format to 2 to ensure clients can handle the changes
properly.
2024-09-18 14:32:03 -04:00
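
A sketch of the kind of polling loop implied above, using psutil in a side thread while the benchmark runs; the sample interval and result keys are assumptions, not the actual format-2 benchmark schema.

```python
import time
import psutil


def poll_node_utilization(stop_event, interval=1.0):
    """Sample node CPU, memory, and network counters until stopped."""
    samples = []
    while not stop_event.is_set():
        net = psutil.net_io_counters()
        samples.append(
            {
                "cpu_pct": psutil.cpu_percent(interval=None),
                "mem_pct": psutil.virtual_memory().percent,
                "net_bytes_sent": net.bytes_sent,
                "net_bytes_recv": net.bytes_recv,
            }
        )
        time.sleep(interval)
    return samples
```
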
Joshua Boniface ecb812ccac Update linting for pvcapid recent changes 2024-09-18 10:18:50 -04:00
Joshua Boniface a2e5df9f6d Add support for Gunicorn execution
Modifies pvcapid to run under Gunicorn when in non-debug mode, instead
of the Flask development server. This is proper practice for one, and
also helps increase performance slightly in some workloads (file uploads
mainly).
2024-09-09 13:20:03 -04:00
Joshua Boniface 73c0834f85 Remove headers and add util to short output 2024-09-06 11:40:39 -04:00
Joshua Boniface 2de999c700 Add total cluster utilization stats
Useful for evaluating the cluster resources as a whole.
2024-09-05 16:05:33 -04:00
Joshua Boniface 7543eb839d Add dedicated volume scan endpoint
Allows an imported volume to be scanned for stats independently.

Designed to be used as part of a snapshot import via the API, to allow
the "create" to happen before the real import (to check for available
space, etc.) and then run this scan afterwards, once the RBD volume
actually exists.
2024-09-03 20:32:27 -04:00
Joshua Boniface 8cb44c0c5d Bump version to 0.9.100 2024-08-30 11:03:33 -04:00
Joshua Boniface c55021f30c Update information about detect strings in CLI 2024-08-30 11:02:44 -04:00
Joshua Boniface 783c9e46c2 Only add packages to bookworm repo
Deprecates Debian 10 (Buster) and 11 (Bullseye); those versions will not
receive PVC 0.9.100 or newer.
2024-08-30 10:56:24 -04:00
Joshua Boniface b7f33c1fcb Update deprecation warning
Hotfixes throw a wrench in this, so just make them generic.
2024-08-30 10:55:24 -04:00
Joshua Boniface 0f578d7c7d Ensure decimals are captured from size regex 2024-08-30 10:51:41 -04:00
Joshua Boniface f87b96887c Add detect string parser with nvme
Some newer servers do not report NVMe device paths properly using
`lsscsi` as expected. To work around this, add an `nvme`-based detect
parser that is called if the `lsscsi` parser returns a `-` (or None).
2024-08-30 10:41:56 -04:00
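
A hedged sketch of the fallback: if the `lsscsi`-based lookup yields `-` (or None), try `nvme list -o json` instead. The JSON keys used here ("Devices", "DevicePath", "ModelNumber") match nvme-cli's usual output but should be treated as assumptions, as should the matching logic.

```python
import json
import subprocess


def detect_device(lsscsi_result, model_substring):
    # Prefer the lsscsi result when it is usable
    if lsscsi_result not in (None, "-"):
        return lsscsi_result

    out = subprocess.run(
        ["nvme", "list", "-o", "json"], capture_output=True, text=True
    )
    if out.returncode != 0:
        return None

    for device in json.loads(out.stdout).get("Devices", []):
        if model_substring.lower() in device.get("ModelNumber", "").lower():
            return device.get("DevicePath")
    return None
```
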
Joshua Boniface 02a775c99b Bump version to 0.9.99 2024-08-28 11:15:55 -04:00
Joshua Boniface 8177d5f8b7 Use absolute path for ZK schema 2024-08-27 09:40:24 -04:00
Joshua Boniface 26d0d08873 Add is-primary command
Used by the cron to check if the node firing an autobackup is the
primary node or not, so it will not multi-fire from all coordinators.
2024-08-25 22:09:03 -04:00
Joshua Boniface f57b8d4a15 Simplify Celery event handling
It was far too cumbersome to report every possible stage here in a
consistent way. Realistically, this command will be run silently from
cron 99.95% of the time, so all this overcomplexity to handle individual
Celery state updates just isn't worth it.
2024-08-25 21:59:12 -04:00
Joshua Boniface 10de85cce3 Allow API-only builds and deploy 2024-08-25 20:45:52 -04:00
Joshua Boniface e938140414 Refactor autobackups to make more sense 2024-08-25 19:21:00 -04:00
Joshua Boniface fd87a28eb3 Fix bug in API parameters 2024-08-25 19:13:31 -04:00
Joshua Boniface 4ef5fbdbe8 Restore previous autobackup continue behaviour
With the original system, the failure of one VM's backups would not
trigger a total fault, thus allowing other backups to complete.
Restore that behaviour.
2024-08-25 17:04:43 -04:00
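
A minimal sketch of the per-VM error isolation being restored: one failed backup is recorded and reported, but the loop carries on so the remaining VMs still get backed up. `backup_vm` is a stand-in name, not the real task function.

```python
def run_autobackups(vm_list, backup_vm):
    failures = {}
    for vm in vm_list:
        try:
            backup_vm(vm)
        except Exception as exc:  # isolate per-VM faults
            failures[vm] = str(exc)
    # Surface the failures in the summary report instead of aborting the run
    return failures
```
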
Joshua Boniface 8fa6bed736 Ensure cron flag triggers truly silent output 2024-08-25 16:35:24 -04:00
Joshua Boniface f7926726f2 Adjust snapshot name again 2024-08-25 16:20:59 -04:00
Joshua Boniface de58efdaa9 Ensure email_recipients is always a list 2024-08-25 16:18:19 -04:00
Joshua Boniface 8ca6976892 Re-add cron flag for autobackups 2024-08-25 16:17:41 -04:00
Joshua Boniface a957218976 Fix staging for summary report 2024-08-25 16:11:35 -04:00
Joshua Boniface 61365e6e01 Adjust autobackup snap name and output messages 2024-08-25 16:09:52 -04:00
Joshua Boniface 35fe16ce75 Revert "Adjust stage naming to reflect autobackup stages"
This reverts commit c1f320ede2.
2024-08-25 15:58:25 -04:00
Joshua Boniface c45e488958 Improve output of build-and-deploy 2024-08-25 15:57:07 -04:00
Joshua Boniface c1f320ede2 Adjust stage naming to reflect autobackup stages 2024-08-25 15:55:16 -04:00
Joshua Boniface 03db9604e1 Ensure recipients is a proper list 2024-08-25 15:55:00 -04:00
Joshua Boniface f1668bffcc Refactor autobackups to implement vm.worker defs
Avoid trying to subcall other Celery worker tasks, as this just gets
very screwy with the stages. Instead reimplement what is needed directly
here. While this does cause a fair bit of code duplication, I believe
the resulting clarity is worthwhile.
2024-08-25 15:54:03 -04:00
Joshua Boniface c0686fc5c7 Remove stage overrides
These aren't needed after pending refactor.
2024-08-25 15:17:46 -04:00
Joshua Boniface 7ecc05b413 Restart worker after becoming primary 2024-08-25 14:18:18 -04:00
Joshua Boniface 4b37c4fea3 Fix assignment bug 2024-08-25 14:10:59 -04:00
Joshua Boniface 0d918d66fe Port VM autobackups into pvcworkerd with snaps
Moves VM autobackups from being in-CLI to being handled by the
pvcworkerd system on the primary coordinator. Turns the CLI autobackup
command into an actual API client endpoint rather than having its logic
in the CLI.

In addition, modifies the new autobackup to leverage the new "pvc vm
snapshot" function set, just with special snapshot names. This helps
automate this within the new snapshot scaffolding.
2024-08-23 17:23:06 -04:00
Joshua Boniface fd199f405b Add deprecation warning to pvc vm backup commands 2024-08-23 17:04:15 -04:00
Joshua Boniface f6c009beac Allow overriding stages in some commands
This allows them to be called by autobackup commands while still
preserving the current Celery report flow.
2024-08-23 11:21:02 -04:00
Joshua Boniface fc89f4f2f5 Fix error message contents 2024-08-23 10:23:51 -04:00
Joshua Boniface 565011b277 Set snapshot name before start 2024-08-20 23:01:52 -04:00
Joshua Boniface 0bf9cc6b06 Improve stage handling
Run start() at the beginning, and leverage the new tweaks to the CLI to
update the total steps later. This allows errors to be handled gracefully.
2024-08-20 17:50:27 -04:00
Joshua Boniface f2dfada73e Improve return handling for snapshot tasks 2024-08-20 17:40:44 -04:00
Joshua Boniface f63c392ba6 Show primary status in node run_on 2024-08-20 17:32:33 -04:00
Joshua Boniface 7663ad72c5 Update length of progress bar each update
Allows us to start with a lower length, and increase it later.
2024-08-20 17:22:15 -04:00
Joshua Boniface 9b3075be18 Add UUID check and fix wording
Don't suggest renaming any more as it's not enough.
2024-08-20 17:05:27 -04:00
Joshua Boniface 9a661d0173 Convert VM snapshots to worker tasks
Improves manageability and offloads these from the API context.
2024-08-20 16:50:41 -04:00
Joshua Boniface 4a0680b27f Fix issues with snapshot imports 2024-08-20 13:59:05 -04:00
Joshua Boniface 6597f7aef6 Fix bad function call 2024-08-20 12:58:17 -04:00
Joshua Boniface f42a1bad0e Allow passing zk_only into VM snapshot creation 2024-08-20 12:57:53 -04:00
Joshua Boniface 3fb52a13c2 Add missing VM states from snapshots 2024-08-20 11:53:57 -04:00
Joshua Boniface 8937ddf331 Simplify VM rename to preserve data
A rename is simply a change to two values, so instead of undefining and
re-defining the VM, just edit those two fields. This ensures things like
snapshots are preserved automatically.
2024-08-20 11:37:28 -04:00
Joshua Boniface 7cc354466f Finish implementing snapshot import 2024-08-20 11:25:09 -04:00
Joshua Boniface 44232fe3c6 Fix export swagger definition 2024-08-20 11:07:56 -04:00
Joshua Boniface 0a8bad3418 Add VM snapshot import 2024-08-20 10:53:56 -04:00
Joshua Boniface f10d32987b Fix up comments 2024-08-20 10:37:58 -04:00
Joshua Boniface faf920ac1d Fix bug where force_flag is a string 2024-08-20 10:10:33 -04:00
Joshua Boniface a6e824a049 Improve output text message 2024-08-19 18:51:41 -04:00
Joshua Boniface 624eb4e752 Fix bug in primary node display 2024-08-19 18:48:32 -04:00
Joshua Boniface d060787503 Add initial implementation of snapshot export 2024-08-19 18:46:07 -04:00
Joshua Boniface 9a435fe2ae Allow specifying become-primary during deploys 2024-08-19 17:44:13 -04:00
Joshua Boniface 9f47da6777 Fix triplicate API calls on GET commands 2024-08-19 17:33:21 -04:00
Joshua Boniface 0cf229273a Add API endpoint for current primary node
This was never exposed before, so expose it for use in other functions
being built.
2024-08-19 17:15:52 -04:00
Joshua Boniface 212ecaab68 Fix Swagger doc issues 2024-08-19 16:56:18 -04:00
Joshua Boniface f1b4593367 Store current stats with snapshots
Allows getting info like size, etc. for the snapshot.
2024-08-19 14:07:27 -04:00
Joshua Boniface fc55046812 Add confirmation of snapshot removals 2024-08-19 13:57:20 -04:00
Joshua Boniface 33f905459a Implement VM rollback
Closes #184
2024-08-16 10:47:18 -04:00
Joshua Boniface 174e6e08e3 Correct issues with VM output formats 2024-08-16 10:46:25 -04:00
Joshua Boniface 9f85c92dff Handle missing or empty snapshot lists 2024-08-16 10:46:25 -04:00
Joshua Boniface 4b30d2f58a Always show snapshots 2024-08-16 10:46:25 -04:00
Joshua Boniface 2fcee28fed Hide topology in long output 2024-08-16 10:46:25 -04:00
Joshua Boniface 1f18e88c06 Add snapshots to VM info details 2024-08-16 10:46:25 -04:00
Joshua Boniface 359191c83f Ensure snapshot name does not already exist 2024-08-16 10:46:25 -04:00
Joshua Boniface 3d0d5e63f6 Make default snap name just the datestring 2024-08-16 10:46:25 -04:00
Joshua Boniface e6bfbb6d45 Actually fix incorrect naming bug 2024-08-16 10:46:25 -04:00
Joshua Boniface b80f9e28dc Add human-readable age to snapshots
This is parsed server-side for consistent timing and to simplify the API
consumers.
2024-08-16 10:46:25 -04:00
Joshua Boniface fbd5b3cca3 Remove is_backup flag for snapshots
This won't be needed for anything.
2024-08-16 10:46:25 -04:00
Joshua Boniface 2b1082590e Fix bug in snapshot removal 2024-08-16 10:46:25 -04:00
Joshua Boniface a4ca112128 Add snapshot count to VM list 2024-08-16 10:46:25 -04:00
Joshua Boniface 6fc7f45027 Add snapshot lists and timestamp
Adds snapshots to the list of data in VM objects
2024-08-16 10:46:25 -04:00
Joshua Boniface 0c240a5129 Add VM snapshot removal 2024-08-16 10:46:25 -04:00
Joshua Boniface 553c1e670e Add VM snapshots functionality
Adds the ability to create snapshots of an entire VM, including all its
RBD disks and the VM XML config, though not any PVC metadata.
2024-08-16 10:46:25 -04:00
Joshua Boniface 942de9f15b Add better exception handling for XML configs 2024-08-16 10:46:04 -04:00
Joshua Boniface 9aca8e215b Run IPMI check 3 times with 2s timeout
Avoids potential timeouts or deadlocks, and retries if a single try
fails.
2024-07-28 12:36:01 -04:00
Joshua Boniface 97329bb90d Sort Ceph pool data by name
There is no guarantee that both commands output the pools in the same
order, so sort them by name first so that the iteration over the pools
by ID succeeds.
2024-07-22 13:26:27 -04:00
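
A tiny sketch of the ordering fix: sort both command outputs by pool name before combining them, so stats are matched to the right pool regardless of each command's native ordering. The dict keys are placeholders for whatever the two Ceph listing commands report.

```python
def merge_pool_data(df_pools, rados_pools):
    df_sorted = sorted(df_pools, key=lambda p: p["name"])
    rados_sorted = sorted(rados_pools, key=lambda p: p["name"])
    # Zip entries pairwise now that both lists share the same order
    return [
        {**df_entry, **rados_entry}
        for df_entry, rados_entry in zip(df_sorted, rados_sorted)
    ]
```
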
Joshua Boniface c186015d6f Add check for invalid profile 2024-07-13 17:13:40 -04:00
Joshua Boniface 1aa5999109 Bump version to 0.9.98 2024-06-05 12:01:31 -04:00
Joshua Boniface 570460e5ee Add --version flag to pvcnoded.py for info 2024-06-05 11:57:47 -04:00
Joshua Boniface 7a99e0e524 Fix bugs listing snapshots by pool/volume
The logic of this didn't work, so reconfigure it to use these as
limits. Also fixes a bug in the upper getCephVolumes for invalid pools.
2024-05-16 16:32:22 -04:00
Joshua Boniface 234d6ae83b Add warnings about snapshot consistency 2024-05-13 15:29:43 -04:00
Joshua Boniface 5d0e7931d1 Add support for rolling back snapshots
We supported creating snapshots, but not doing anything with them. This
removes the manual task of restoring a snapshot and replaces it with a
PVC abstraction of rolling back to a snapshot.

While Ceph recommends cloning a snapshot instead of rolling back, due to
the time taken, in our use case I don't think that is an optimal
strategy, as it will leave dangling clones that we'd then have to
manage.

Closes #183
2024-05-13 15:24:51 -04:00
Joshua Boniface dcb9c0d12c Improve fence handling conditions
Use the intermediate output text when judging the fence status, rather
than the retcode of the stop, as this should be more reliable.
2024-05-08 10:55:15 -04:00
Joshua Boniface f6e856bf98 Fix debug output on timeout 2024-05-06 10:49:57 -04:00
Joshua Boniface f1fe0c63f5 Bump version to 0.9.97 2024-04-19 10:32:16 -04:00
Joshua Boniface ab944f9b95 Add RBD snap purge during volume removal
Fixes #180
2024-04-19 10:31:11 -04:00
Joshua Boniface 9714ac20b2 Update formatting for Black 24.4.0 2024-04-19 10:26:06 -04:00
Joshua Boniface 79ad09ae59 Switch virtual memory free to allocated
Avoids incorrect reporting when cache/buffer usage is high.
2024-04-19 10:25:33 -04:00
Joshua Boniface 4c6aabec6a Fix bug if d_network changes 2024-04-05 14:05:51 -04:00
Joshua Boniface 559400ed90 Explicitly set --lines to integer type 2024-03-13 13:01:02 -04:00
Joshua Boniface 78c774b607 Bump version to 0.9.96 2024-03-08 14:23:07 -05:00
Joshua Boniface a461791ce8 Fix bug cleaning up successful benchmark results 2024-03-08 14:22:07 -05:00
Joshua Boniface 9fdb6d8708 Fix bug with network stats 2024-03-07 15:44:35 -05:00
Joshua Boniface 2fb7c40497 Work around bad plugin data 2024-03-07 14:37:05 -05:00
Joshua Boniface dee8d186cf Bump version to 0.9.95 2024-02-12 13:12:48 -05:00
Joshua Boniface 1e9871241e Fix bug showing OSDs as split when not 2024-02-12 13:12:08 -05:00
Joshua Boniface 9cd88ebccb Ensure storage template disks are sorted 2024-02-09 12:40:20 -05:00
Joshua Boniface 3bc500bc55 Permit duplicate VNIs in templates with flag
Supports niche use cases whereby a network template should contain the
same VNI(s) more than once.
2024-02-09 12:12:04 -05:00
49 changed files with 15558 additions and 7397 deletions

View File

@@ -4,4 +4,4 @@ bbuilder:
   published:
     - git submodule update --init
     - /bin/bash build-stable-deb.sh
-    - sudo /usr/local/bin/deploy-package -C pvc
+    - sudo /usr/local/bin/deploy-package -C pvc -D bookworm

View File

@@ -1 +1 @@
-0.9.94
+0.9.103

View File

@@ -1,5 +1,94 @@
 ## PVC Changelog
###### [v0.9.103](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.103)
* [Provisioner] Fixes a bug with the change in `storage_hosts` to FQDNs affecting the VM Builder
* [Monitoring] Fixes the Munin plugin to work properly with sudo
###### [v0.9.102](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.102)
* [API Daemon] Ensures that received config snapshots update storage hosts in addition to secret UUIDs
* [CLI Client] Fixes several bugs around local connection handling and connection listings
###### [v0.9.101](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.101)
**New Feature**: Adds VM snapshot sending (`vm snapshot send`), VM mirroring (`vm mirror create`), and (offline) mirror promotion (`vm mirror promote`). Permits transferring VM snapshots to remote clusters, individually or repeatedly, and promoting them to active status, for disaster recovery and migration between clusters.
**Breaking Change**: Migrates the API daemon into Gunicorn when in production mode. Permits more scalable and performant operation of the API. **Requires additional dependency packages on all coordinator nodes** (`gunicorn`, `python3-gunicorn`, `python3-setuptools`); upgrade via `pvc-ansible` is strongly recommended.
**Enhancement**: Provides whole cluster utilization stats in the cluster status data. Permits better observability into the overall resource utilization of the cluster.
**Enhancement**: Adds a new storage benchmark format (v2) which includes additional resource utilization statistics. This allows for better evaluation of storage performance impact on the cluster as a whole. The updated format also permits arbitrary benchmark job names for easier parsing and tracking.
* [API Daemon] Allows scanning of new volumes added manually via other commands
* [API Daemon/CLI Client] Adds whole cluster utilization statistics to cluster status
* [API Daemon] Moves production API execution into Gunicorn
* [API Daemon] Adds a new storage benchmark format (v2) with additional resource tracking
* [API Daemon] Adds support for named storage benchmark jobs
* [API Daemon] Fixes a bug in OSD creation which would create `split` OSDs if `--osd-count` was set to 1
* [API Daemon] Adds support for the `mirror` VM state used by snapshot mirrors
* [CLI Client] Fixes several output display bugs in various commands and in Worker task outputs
* [CLI Client] Improves and shrinks the status progress bar output to support longer messages
* [API Daemon] Adds support for sending snapshots to remote clusters
* [API Daemon] Adds support for updating and promoting snapshot mirrors to remote clusters
* [Node Daemon] Improves timeouts during primary/secondary coordinator transitions to avoid deadlocks
* [Node Daemon] Improves timeouts during keepalive updates to avoid deadlocks
* [Node Daemon] Refactors fencing thread structure to ensure a single fencing task per cluster and sequential node fences to avoid potential anomalies (e.g. fencing 2 nodes simultaneously)
* [Node Daemon] Fixes a bug in fencing if VM locks were already freed, leaving VMs in an invalid state
* [Node Daemon] Increases the wait time during system startup to ensure Zookeeper has more time to synchronize
###### [v0.9.100](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.100)
* [API Daemon] Improves the handling of "detect:" disk strings on newer systems by leveraging the "nvme" command
* [Client CLI] Update help text about "detect:" disk strings
* [Meta] Updates deprecation warnings and updates builder to only add this version for Debian 12 (Bookworm)
###### [v0.9.99](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.99)
**Deprecation Warning**: `pvc vm backup` commands are now deprecated and will be removed in a future version. Use `pvc vm snapshot` commands instead.
**Breaking Change**: The on-disk format of VM snapshot exports differs from backup exports, and the PVC autobackup system now leverages these. It is recommended to start fresh with a new tree of backups for `pvc autobackup` for maximum compatibility.
**Breaking Change**: VM autobackups now run in `pvcworkerd` instead of the CLI client directly, allowing them to be triggered from any node (or externally). It is important to apply the timer unit changes from the `pvc-ansible` role after upgrading to 0.9.99 to avoid duplicate runs.
**Usage Note**: VM snapshots are displayed in the `pvc vm list` and `pvc vm info` outputs, not in a unique "list" endpoint.
* [API Daemon] Adds a proper error when an invalid provisioner profile is specified
* [Node Daemon] Sorts Ceph pools properly in node keepalive to avoid incorrect ordering
* [Health Daemon] Improves handling of IPMI checks by adding multiple tries but a shorter timeout
* [API Daemon] Improves handling of XML parsing errors in VM configurations
* [ALL] Adds support for whole VM snapshots, including configuration XML details, and direct rollback to snapshots
* [ALL] Adds support for exporting and importing whole VM snapshots
* [Client CLI] Removes vCPU topology from short VM info output
* [Client CLI] Improves output format of VM info output
* [API Daemon] Adds an endpoint to get the current primary node
* [Client CLI] Fixes a bug where API requests were made 3 times
* [Other] Improves the build-and-deploy.sh script
* [API Daemon] Improves the "vm rename" command to avoid redefining VM, preserving history etc.
* [API Daemon] Adds an indication when a task is run on the primary node
* [API Daemon] Fixes a bug where the ZK schema relative path didn't work sometimes
###### [v0.9.98](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.98)
* [CLI Client] Fixed output when API call times out
* [Node Daemon] Improves the handling of fence states
* [API Daemon/CLI Client] Adds support for storage snapshot rollback
* [CLI Client] Adds additional warning messages about snapshot consistency to help output
* [API Daemon] Fixes a bug listing snapshots by pool/volume
* [Node Daemon] Adds a --version flag for information gathering by update-motd.sh
###### [v0.9.97](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.97)
* [Client CLI] Ensures --lines is always an integer value
* [Node Daemon] Fixes a bug if d_network changes during iteration
* [Node Daemon] Moves to using allocated instead of free memory for node reporting
* [API Daemon] Fixes a bug if lingering RBD snapshots exist when removing a volume (#180)
###### [v0.9.96](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.96)
* [API Daemon] Fixes a bug when reporting node stats
* [API Daemon] Fixes a bug deleting successful benchmark results
###### [v0.9.95](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.95)
* [API Daemon/CLI Client] Adds a flag to allow duplicate VNIs in network templates
* [API Daemon] Ensures that storage template disks are returned in disk ID order
* [Client CLI] Fixes a display bug showing all OSDs as split
 ###### [v0.9.94](https://github.com/parallelvirtualcluster/pvc/releases/tag/v0.9.94)
 * [CLI Client] Fixes an incorrect ordering issue with autobackup summary emails

View File

@@ -1,10 +1,11 @@
 <p align="center">
-<img alt="Logo banner" src="images/pvc_logo_black.png"/>
+<img alt="Logo banner" src="https://docs.parallelvirtualcluster.org/en/latest/images/pvc_logo_black.png"/>
 <br/><br/>
+<a href="https://www.parallelvirtualcluster.org"><img alt="Website" src="https://img.shields.io/badge/visit-website-blue"/></a>
+<a href="https://github.com/parallelvirtualcluster/pvc/releases"><img alt="Latest Release" src="https://img.shields.io/github/release-pre/parallelvirtualcluster/pvc"/></a>
+<a href="https://docs.parallelvirtualcluster.org/en/latest/?badge=latest"><img alt="Documentation Status" src="https://readthedocs.org/projects/parallelvirtualcluster/badge/?version=latest"/></a>
 <a href="https://github.com/parallelvirtualcluster/pvc"><img alt="License" src="https://img.shields.io/github/license/parallelvirtualcluster/pvc"/></a>
 <a href="https://github.com/psf/black"><img alt="Code style: Black" src="https://img.shields.io/badge/code%20style-black-000000.svg"/></a>
-<a href="https://github.com/parallelvirtualcluster/pvc/releases"><img alt="Release" src="https://img.shields.io/github/release-pre/parallelvirtualcluster/pvc"/></a>
-<a href="https://docs.parallelvirtualcluster.org/en/latest/?badge=latest"><img alt="Documentation Status" src="https://readthedocs.org/projects/parallelvirtualcluster/badge/?version=latest"/></a>
 </p>
 ## What is PVC?
@@ -23,62 +24,64 @@ Installation of PVC is accomplished by two main components: a [Node installer IS
 Just give it physical servers, and it will run your VMs without you having to think about it, all in just an hour or two of setup time.
+More information about PVC, its motivations, the hardware requirements, and setting up and managing a cluster [can be found over at our docs page](https://docs.parallelvirtualcluster.org).
 ## Getting Started
 To get started with PVC, please see the [About](https://docs.parallelvirtualcluster.org/en/latest/about-pvc/) page for general information about the project, and the [Getting Started](https://docs.parallelvirtualcluster.org/en/latest/deployment/getting-started/) page for details on configuring your first cluster.
 ## Changelog
-View the changelog in [CHANGELOG.md](CHANGELOG.md). **Please note that any breaking changes are announced here; ensure you read the changelog before upgrading!**
+View the changelog in [CHANGELOG.md](https://github.com/parallelvirtualcluster/pvc/blob/master/CHANGELOG.md). **Please note that any breaking changes are announced here; ensure you read the changelog before upgrading!**
 ## Screenshots
 These screenshots show some of the available functionality of the PVC system and CLI as of PVC v0.9.85.
-<p><img alt="0. Integrated help" src="images/0-integrated-help.png"/><br/>
+<p><img alt="0. Integrated help" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/0-integrated-help.png"/><br/>
 <i>The CLI features an integrated, fully-featured help system to show details about every possible command.</i>
 </p>
-<p><img alt="1. Connection management" src="images/1-connection-management.png"/><br/>
+<p><img alt="1. Connection management" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/1-connection-management.png"/><br/>
 <i>A single CLI instance can manage multiple clusters, including a quick detail view, and will default to a "local" connection if an "/etc/pvc/pvc.conf" file is found; sensitive API keys are hidden by default.</i>
 </p>
-<p><img alt="2. Cluster details and output formats" src="images/2-cluster-details-and-output-formats.png"/><br/>
+<p><img alt="2. Cluster details and output formats" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/2-cluster-details-and-output-formats.png"/><br/>
 <i>PVC can show the key details of your cluster at a glance, including health, persistent fault events, and key resources; the CLI can output both in pretty human format and JSON for easier machine parsing in scripts.</i>
 </p>
-<p><img alt="3. Node information" src="images/3-node-information.png"/><br/>
+<p><img alt="3. Node information" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/3-node-information.png"/><br/>
 <i>PVC can show details about the nodes in the cluster, including their live health and resource utilization.</i>
 </p>
-<p><img alt="4. VM information" src="images/4-vm-information.png"/><br/>
+<p><img alt="4. VM information" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/4-vm-information.png"/><br/>
 <i>PVC can show details about the VMs in the cluster, including their state, resource allocations, current hosting node, and metadata.</i>
 </p>
-<p><img alt="5. VM details" src="images/5-vm-details.png"/><br/>
+<p><img alt="5. VM details" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/5-vm-details.png"/><br/>
 <i>In addition to the above basic details, PVC can also show extensive information about a running VM's devices and other resource utilization.</i>
 </p>
-<p><img alt="6. Network information" src="images/6-network-information.png"/><br/>
+<p><img alt="6. Network information" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/6-network-information.png"/><br/>
 <i>PVC has two major client network types, and ensures a consistent configuration of client networks across the entire cluster; managed networks can feature DHCP, DNS, firewall, and other functionality including DHCP reservations.</i>
 </p>
-<p><img alt="7. Storage information" src="images/7-storage-information.png"/><br/>
+<p><img alt="7. Storage information" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/7-storage-information.png"/><br/>
 <i>PVC provides a convenient abstracted view of the underlying Ceph system and can manage all core aspects of it.</i>
 </p>
-<p><img alt="8. VM and node logs" src="images/8-vm-and-node-logs.png"/><br/>
+<p><img alt="8. VM and node logs" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/8-vm-and-node-logs.png"/><br/>
 <i>PVC can display logs from VM serial consoles (if properly configured) and nodes in-client to facilitate quick troubleshooting.</i>
 </p>
-<p><img alt="9. VM and worker tasks" src="images/9-vm-and-worker-tasks.png"/><br/>
+<p><img alt="9. VM and worker tasks" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/9-vm-and-worker-tasks.png"/><br/>
 <i>PVC provides full VM lifecycle management, as well as long-running worker-based commands (in this example, clearing a VM's storage locks).</i>
 </p>
-<p><img alt="10. Provisioner" src="images/10-provisioner.png"/><br/>
+<p><img alt="10. Provisioner" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/10-provisioner.png"/><br/>
 <i>PVC features an extensively customizable and configurable VM provisioner system, including EC2-compatible CloudInit support, allowing you to define flexible VM profiles and provision new VMs with a single command.</i>
 </p>
-<p><img alt="11. Prometheus and Grafana dashboard" src="images/11-prometheus-grafana.png"/><br/>
+<p><img alt="11. Prometheus and Grafana dashboard" src="https://raw.githubusercontent.com/parallelvirtualcluster/pvc/refs/heads/master/images/11-prometheus-grafana.png"/><br/>
 <i>PVC features several monitoring integration examples under "node-daemon/monitoring", including CheckMK, Munin, and, most recently, Prometheus, including an example Grafana dashboard for cluster monitoring and alerting.</i>
 </p>

View File

@@ -21,4 +21,5 @@
 from daemon_lib.zkhandler import ZKSchema
-ZKSchema.write()
+schema = ZKSchema(root_path=".")
+schema.write()

View File

@@ -19,6 +19,13 @@
 #
 ###############################################################################
-import pvcapid.Daemon # noqa: F401
+import sys
+from os import path
+# Ensure current directory (/usr/share/pvc) is in the system path for Gunicorn
+current_dir = path.dirname(path.abspath(__file__))
+sys.path.append(current_dir)
+import pvcapid.Daemon # noqa: F401, E402
 pvcapid.Daemon.entrypoint()

View File

@@ -19,15 +19,13 @@
 #
 ###############################################################################
+import subprocess
 from ssl import SSLContext, TLSVersion
 from distutils.util import strtobool as dustrtobool
 import daemon_lib.config as cfg
 # Daemon version
-version = "0.9.94"
+version = "0.9.100~git-73c0834f"
 # API version
 API_VERSION = 1.0
@@ -53,7 +51,6 @@ def strtobool(stringv):
 # Configuration Parsing
 ##########################################################
 # Get our configuration
 config = cfg.get_configuration()
 config["daemon_name"] = "pvcapid"
@@ -61,22 +58,16 @@ config["daemon_version"] = version
 ##########################################################
-# Entrypoint
+# Flask App Creation for Gunicorn
 ##########################################################
-def entrypoint():
-import pvcapid.flaskapi as pvc_api # noqa: E402
-if config["api_ssl_enabled"]:
-context = SSLContext()
-context.minimum_version = TLSVersion.TLSv1
-context.get_ca_certs()
-context.load_cert_chain(
-config["api_ssl_cert_file"], keyfile=config["api_ssl_key_file"]
-)
-else:
-context = None
+def create_app():
+"""
+Create and return the Flask app and SSL context if necessary.
+"""
+# Import the Flask app from pvcapid.flaskapi after adjusting the path
+import pvcapid.flaskapi as pvc_api
 # Print our startup messages
 print("")
@@ -102,9 +93,69 @@ def entrypoint():
 print("")
 pvc_api.celery_startup()
-pvc_api.app.run(
return pvc_api.app
##########################################################
# Entrypoint
##########################################################
def entrypoint():
if config["debug"]:
app = create_app()
if config["api_ssl_enabled"]:
ssl_context = SSLContext()
ssl_context.minimum_version = TLSVersion.TLSv1
ssl_context.get_ca_certs()
ssl_context.load_cert_chain(
config["api_ssl_cert_file"], keyfile=config["api_ssl_key_file"]
)
else:
ssl_context = None
app.run(
config["api_listen_address"], config["api_listen_address"],
config["api_listen_port"], config["api_listen_port"],
threaded=True, threaded=True,
ssl_context=context, ssl_context=ssl_context,
) )
else:
# Build the command to run Gunicorn
gunicorn_cmd = [
"gunicorn",
"--workers",
"1",
"--threads",
"8",
"--timeout",
"86400",
"--bind",
"{}:{}".format(config["api_listen_address"], config["api_listen_port"]),
"pvcapid.Daemon:create_app()",
"--log-level",
"info",
"--access-logfile",
"-",
"--error-logfile",
"-",
]
if config["api_ssl_enabled"]:
gunicorn_cmd += [
"--certfile",
config["api_ssl_cert_file"],
"--keyfile",
config["api_ssl_key_file"],
]
# Run Gunicorn
try:
subprocess.run(gunicorn_cmd)
except KeyboardInterrupt:
exit(0)
except Exception as e:
print(e)
exit(1)

File diff suppressed because it is too large.

View File

@@ -21,7 +21,9 @@
 import flask
 import json
+import logging
 import lxml.etree as etree
+import sys
 from re import match
 from requests import get
@@ -40,6 +42,15 @@ import daemon_lib.network as pvc_network
 import daemon_lib.ceph as pvc_ceph
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.INFO)
+handler = logging.StreamHandler(sys.stdout)
+handler.setLevel(logging.INFO)
+formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
+handler.setFormatter(formatter)
+logger.addHandler(handler)
 #
 # Cluster base functions
 #
@@ -765,6 +776,134 @@ def vm_restore(
 return output, retcode
@ZKConnection(config)
def create_vm_snapshot(
zkhandler,
domain,
snapshot_name=None,
):
"""
Take a snapshot of a VM.
"""
retflag, retdata = pvc_vm.create_vm_snapshot(
zkhandler,
domain,
snapshot_name,
)
if retflag:
retcode = 200
else:
retcode = 400
output = {"message": retdata.replace('"', "'")}
return output, retcode
@ZKConnection(config)
def remove_vm_snapshot(
zkhandler,
domain,
snapshot_name,
):
"""
Take a snapshot of a VM.
"""
retflag, retdata = pvc_vm.remove_vm_snapshot(
zkhandler,
domain,
snapshot_name,
)
if retflag:
retcode = 200
else:
retcode = 400
output = {"message": retdata.replace('"', "'")}
return output, retcode
@ZKConnection(config)
def rollback_vm_snapshot(
zkhandler,
domain,
snapshot_name,
):
"""
Roll back to a snapshot of a VM.
"""
retflag, retdata = pvc_vm.rollback_vm_snapshot(
zkhandler,
domain,
snapshot_name,
)
if retflag:
retcode = 200
else:
retcode = 400
output = {"message": retdata.replace('"', "'")}
return output, retcode
@ZKConnection(config)
def export_vm_snapshot(
zkhandler,
domain,
snapshot_name,
export_path,
incremental_parent=None,
):
"""
Export a snapshot of a VM to files.
"""
retflag, retdata = pvc_vm.export_vm_snapshot(
zkhandler,
domain,
snapshot_name,
export_path,
incremental_parent,
)
if retflag:
retcode = 200
else:
retcode = 400
output = {"message": retdata.replace('"', "'")}
return output, retcode
@ZKConnection(config)
def import_vm_snapshot(
zkhandler,
domain,
snapshot_name,
export_path,
retain_snapshot=False,
):
"""
Import a snapshot of a VM from files.
"""
retflag, retdata = pvc_vm.import_vm_snapshot(
zkhandler,
domain,
snapshot_name,
export_path,
retain_snapshot,
)
if retflag:
retcode = 200
else:
retcode = 400
output = {"message": retdata.replace('"', "'")}
return output, retcode
 @ZKConnection(config)
 def vm_attach_device(zkhandler, vm, device_spec_xml):
 """
@@ -1014,11 +1153,11 @@ def vm_remove(zkhandler, name):
 @ZKConnection(config)
-def vm_start(zkhandler, name):
+def vm_start(zkhandler, name, force=False):
 """
 Start a VM in the PVC cluster.
 """
-retflag, retdata = pvc_vm.start_vm(zkhandler, name)
+retflag, retdata = pvc_vm.start_vm(zkhandler, name, force=force)
 if retflag:
 retcode = 200
@@ -1062,11 +1201,11 @@ def vm_shutdown(zkhandler, name, wait):
 @ZKConnection(config)
-def vm_stop(zkhandler, name):
+def vm_stop(zkhandler, name, force=False):
 """
 Forcibly stop a VM in the PVC cluster.
 """
-retflag, retdata = pvc_vm.stop_vm(zkhandler, name)
+retflag, retdata = pvc_vm.stop_vm(zkhandler, name, force=force)
 if retflag:
 retcode = 200
@@ -1152,7 +1291,7 @@ def vm_flush_locks(zkhandler, vm):
 zkhandler, None, None, None, vm, is_fuzzy=False, negate=False
 )
-if retdata[0].get("state") not in ["stop", "disable"]:
+if retdata[0].get("state") not in ["stop", "disable", "mirror"]:
 return {"message": "VM must be stopped to flush locks"}, 400
 retflag, retdata = pvc_vm.flush_locks(zkhandler, vm)
@@ -1166,6 +1305,342 @@ def vm_flush_locks(zkhandler, vm):
 return output, retcode
@ZKConnection(config)
def vm_snapshot_receive_block_full(zkhandler, pool, volume, snapshot, size, request):
"""
Receive an RBD volume from a remote system
"""
import rados
import rbd
_, rbd_detail = pvc_ceph.get_list_volume(
zkhandler, pool, limit=volume, is_fuzzy=False
)
if len(rbd_detail) > 0:
volume_exists = True
else:
volume_exists = False
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(pool)
if not volume_exists:
rbd_inst = rbd.RBD()
rbd_inst.create(ioctx, volume, size)
retflag, retdata = pvc_ceph.add_volume(
zkhandler, pool, volume, str(size) + "B", force_flag=True, zk_only=True
)
if not retflag:
ioctx.close()
cluster.shutdown()
if retflag:
retcode = 200
else:
retcode = 400
output = {"message": retdata.replace('"', "'")}
return output, retcode
image = rbd.Image(ioctx, volume)
last_chunk = 0
chunk_size = 1024 * 1024 * 1024
logger.info(f"Importing full snapshot {pool}/{volume}@{snapshot}")
while True:
chunk = request.stream.read(chunk_size)
if not chunk:
break
image.write(chunk, last_chunk)
last_chunk += len(chunk)
image.close()
ioctx.close()
cluster.shutdown()
return {"message": "Successfully received RBD block device"}, 200
@ZKConnection(config)
def vm_snapshot_receive_block_diff(
zkhandler, pool, volume, snapshot, source_snapshot, request
):
"""
Receive an RBD volume from a remote system
"""
import rados
import rbd
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(pool)
image = rbd.Image(ioctx, volume)
if len(request.files) > 0:
logger.info(f"Applying {len(request.files)} RBD diff chunks for {snapshot}")
for i in range(len(request.files)):
object_key = f"object_{i}"
if object_key in request.files:
object_data = request.files[object_key].read()
offset = int.from_bytes(object_data[:8], "big")
length = int.from_bytes(object_data[8:16], "big")
data = object_data[16 : 16 + length]
logger.info(f"Applying RBD diff chunk at {offset} ({length} bytes)")
image.write(data, offset)
else:
return {"message": "No data received"}, 400
image.close()
ioctx.close()
cluster.shutdown()
return {
"message": f"Successfully received {len(request.files)} RBD diff chunks"
}, 200
@ZKConnection(config)
def vm_snapshot_receive_block_createsnap(zkhandler, pool, volume, snapshot):
"""
Create the snapshot of a remote volume
"""
import rados
import rbd
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(pool)
image = rbd.Image(ioctx, volume)
image.create_snap(snapshot)
image.close()
ioctx.close()
cluster.shutdown()
retflag, retdata = pvc_ceph.add_snapshot(
zkhandler, pool, volume, snapshot, zk_only=True
)
if not retflag:
if retflag:
retcode = 200
else:
retcode = 400
output = {"message": retdata.replace('"', "'")}
return output, retcode
return {"message": "Successfully received RBD snapshot"}, 200
@ZKConnection(config)
def vm_snapshot_receive_config(zkhandler, snapshot, vm_config, source_snapshot=None):
"""
Receive a VM configuration snapshot from a remote system, and modify it to work on our system
"""
def parse_unified_diff(diff_text, original_text):
"""
Take a unified diff and apply it to an original string
"""
# Split the original string into lines
original_lines = original_text.splitlines(keepends=True)
patched_lines = []
original_idx = 0 # Track position in original lines
diff_lines = diff_text.splitlines(keepends=True)
for line in diff_lines:
if line.startswith("---") or line.startswith("+++"):
# Ignore prefix lines
continue
if line.startswith("@@"):
# Extract line numbers from the diff hunk header
hunk_header = line
parts = hunk_header.split(" ")
original_range = parts[1]
# Get the starting line number and range length for the original file
original_start, _ = map(int, original_range[1:].split(","))
# Adjust for zero-based indexing
original_start -= 1
# Add any lines between the current index and the next hunk's start
while original_idx < original_start:
patched_lines.append(original_lines[original_idx])
original_idx += 1
elif line.startswith("-"):
# This line should be removed from the original, skip it
original_idx += 1
elif line.startswith("+"):
# This line should be added to the patched version, removing the '+'
patched_lines.append(line[1:])
else:
# Context line (unchanged), it has no prefix, add from the original
patched_lines.append(original_lines[original_idx])
original_idx += 1
# Add any remaining lines from the original file after the last hunk
patched_lines.extend(original_lines[original_idx:])
return "".join(patched_lines).strip()
# Get our XML configuration for this snapshot
# We take the main XML configuration, then apply the diff for this particular incremental
current_snapshot = [s for s in vm_config["snapshots"] if s["name"] == snapshot][0]
vm_xml = vm_config["xml"]
vm_xml_diff = "\n".join(current_snapshot["xml_diff_lines"])
snapshot_vm_xml = parse_unified_diff(vm_xml_diff, vm_xml)
xml_data = etree.fromstring(snapshot_vm_xml)
# Replace the Ceph storage secret UUID with this cluster's
our_ceph_secret_uuid = config["ceph_secret_uuid"]
ceph_secrets = xml_data.xpath("//secret[@type='ceph']")
for ceph_secret in ceph_secrets:
ceph_secret.set("uuid", our_ceph_secret_uuid)
# Replace the Ceph source hosts with this cluster's
our_ceph_storage_hosts = config["storage_hosts"]
our_ceph_storage_port = str(config["ceph_monitor_port"])
ceph_sources = xml_data.xpath("//source[@protocol='rbd']")
for ceph_source in ceph_sources:
for host in ceph_source.xpath("host"):
ceph_source.remove(host)
for ceph_storage_host in our_ceph_storage_hosts:
new_host = etree.Element("host")
new_host.set("name", ceph_storage_host)
new_host.set("port", our_ceph_storage_port)
ceph_source.append(new_host)
# Regenerate the VM XML
snapshot_vm_xml = etree.tostring(xml_data, pretty_print=True).decode("utf8")
if (
source_snapshot is not None
or pvc_vm.searchClusterByUUID(zkhandler, vm_config["uuid"]) is not None
):
logger.info(
f"Receiving incremental VM configuration for {vm_config['name']}@{snapshot}"
)
# Modify the VM based on our passed detail
retcode, retmsg = pvc_vm.modify_vm(
zkhandler,
vm_config["uuid"],
False,
snapshot_vm_xml,
)
if not retcode:
retcode = 400
retdata = {"message": retmsg}
return retdata, retcode
retcode, retmsg = pvc_vm.modify_vm_metadata(
zkhandler,
vm_config["uuid"],
None, # Node limits are left unchanged
vm_config["node_selector"],
vm_config["node_autostart"],
vm_config["profile"],
vm_config["migration_method"],
vm_config["migration_max_downtime"],
)
if not retcode:
retcode = 400
retdata = {"message": retmsg}
return retdata, retcode
current_vm_tags = zkhandler.children(("domain.meta.tags", vm_config["uuid"]))
new_vm_tags = [t["name"] for t in vm_config["tags"]]
remove_tags = []
add_tags = []
for tag in vm_config["tags"]:
if tag["name"] not in current_vm_tags:
add_tags.append((tag["name"], tag["protected"]))
for tag in current_vm_tags:
if tag not in new_vm_tags:
remove_tags.append(tag)
for tag in add_tags:
name, protected = tag
pvc_vm.modify_vm_tag(
zkhandler, vm_config["uuid"], "add", name, protected=protected
)
for tag in remove_tags:
pvc_vm.modify_vm_tag(zkhandler, vm_config["uuid"], "remove", tag)
else:
logger.info(
f"Receiving full VM configuration for {vm_config['name']}@{snapshot}"
)
# Define the VM based on our passed detail
retcode, retmsg = pvc_vm.define_vm(
zkhandler,
snapshot_vm_xml,
None, # Target node is autoselected
None, # Node limits are invalid here so ignore them
vm_config["node_selector"],
vm_config["node_autostart"],
vm_config["migration_method"],
vm_config["migration_max_downtime"],
vm_config["profile"],
vm_config["tags"],
"mirror",
)
if not retcode:
retcode = 400
retdata = {"message": retmsg}
return retdata, retcode
# Add this snapshot to the VM manually in Zookeeper
zkhandler.write(
[
(
(
"domain.snapshots",
vm_config["uuid"],
"domain_snapshot.name",
snapshot,
),
snapshot,
),
(
(
"domain.snapshots",
vm_config["uuid"],
"domain_snapshot.timestamp",
snapshot,
),
current_snapshot["timestamp"],
),
(
(
"domain.snapshots",
vm_config["uuid"],
"domain_snapshot.xml",
snapshot,
),
snapshot_vm_xml,
),
(
(
"domain.snapshots",
vm_config["uuid"],
"domain_snapshot.rbd_snapshots",
snapshot,
),
",".join(current_snapshot["rbd_snapshots"]),
),
]
)
return {"message": "Successfully received VM configuration snapshot"}, 200
#
# Network functions
#
@@ -1869,7 +2344,23 @@ def ceph_volume_list(zkhandler, pool=None, limit=None, is_fuzzy=True):
@ZKConnection(config)
-def ceph_volume_add(zkhandler, pool, name, size, force_flag):
+def ceph_volume_scan(zkhandler, pool, name):
"""
(Re)scan a Ceph RBD volume for stats in the PVC Ceph storage cluster.
"""
retflag, retdata = pvc_ceph.scan_volume(zkhandler, pool, name)
if retflag:
retcode = 200
else:
retcode = 400
output = {"message": retdata.replace('"', "'")}
return output, retcode
@ZKConnection(config)
def ceph_volume_add(zkhandler, pool, name, size, force_flag=False):
""" """
Add a Ceph RBD volume to the PVC Ceph storage cluster. Add a Ceph RBD volume to the PVC Ceph storage cluster.
""" """
@@ -2183,6 +2674,22 @@ def ceph_volume_snapshot_rename(zkhandler, pool, volume, name, new_name):
return output, retcode
@ZKConnection(config)
def ceph_volume_snapshot_rollback(zkhandler, pool, volume, name):
"""
Roll back a Ceph RBD volume to a given snapshot in the PVC Ceph storage cluster.
"""
retflag, retdata = pvc_ceph.rollback_snapshot(zkhandler, pool, volume, name)
if retflag:
retcode = 200
else:
retcode = 400
output = {"message": retdata.replace('"', "'")}
return output, retcode
@ZKConnection(config)
def ceph_volume_snapshot_remove(zkhandler, pool, volume, name):
"""


@@ -125,7 +125,7 @@ def list_template(limit, table, is_fuzzy=True):
args = (template_data["id"],)
cur.execute(query, args)
disks = cur.fetchall()
-data[template_id]["disks"] = disks
+data[template_id]["disks"] = sorted(disks, key=lambda x: x["disk_id"])
close_database(conn, cur)
@@ -284,12 +284,13 @@ def create_template_network(name, mac_template=None):
return retmsg, retcode
-def create_template_network_element(name, vni):
+def create_template_network_element(name, vni, permit_duplicate=False):
if list_template_network(name, is_fuzzy=False)[-1] != 200:
retmsg = {"message": 'The network template "{}" does not exist.'.format(name)}
retcode = 400
return retmsg, retcode
if not permit_duplicate:
networks, code = list_template_network_vnis(name)
if code != 200:
networks = []


@@ -13,6 +13,8 @@ else
fi
KEEP_ARTIFACTS=""
API_ONLY=""
PRIMARY_NODE=""
if [[ -n ${1} ]]; then
for arg in ${@}; do
case ${arg} in
@@ -20,12 +22,23 @@ if [[ -n ${1} ]]; then
KEEP_ARTIFACTS="y"
shift
;;
-a|--api-only)
API_ONLY="y"
shift
;;
-p=*|--become-primary=*)
PRIMARY_NODE=$( awk -F'=' '{ print $NF }' <<<"${arg}" )
shift
;;
esac
done
fi
HOSTS=( ${@} )
echo "Deploying to host(s): ${HOSTS[@]}"
if [[ -n ${PRIMARY_NODE} ]]; then
echo "Will become primary on ${PRIMARY_NODE} after updating it"
fi
# Move to repo root if we're not
pushd $( git rev-parse --show-toplevel ) &>/dev/null
@@ -67,6 +80,7 @@ for HOST in ${HOSTS[@]}; do
ssh $HOST $SUDO systemctl restart pvcapid &>/dev/null
sleep 2
ssh $HOST $SUDO systemctl restart pvcworkerd &>/dev/null
if [[ -z ${API_ONLY} ]]; then
sleep 2
ssh $HOST $SUDO systemctl restart pvchealthd &>/dev/null
sleep 2
@@ -77,7 +91,14 @@ for HOST in ${HOSTS[@]}; do
sleep 5
echo -n "."
done
fi
echo " done." echo " done."
if [[ -n ${PRIMARY_NODE} && ${PRIMARY_NODE} == ${HOST} ]]; then
echo -n ">>> Setting node $HOST to primary coordinator state... "
ssh $HOST pvc -q node primary --wait &>/dev/null
ssh $HOST $SUDO systemctl restart pvcworkerd &>/dev/null
echo "done."
fi
done
popd &>/dev/null

File diff suppressed because it is too large


@@ -83,6 +83,37 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
total_volumes = data.get("volumes", 0)
total_snapshots = data.get("snapshots", 0)
total_cpu_total = data.get("resources", {}).get("cpu", {}).get("total", 0)
total_cpu_load = data.get("resources", {}).get("cpu", {}).get("load", 0)
total_cpu_utilization = (
data.get("resources", {}).get("cpu", {}).get("utilization", 0)
)
total_cpu_string = (
f"{total_cpu_utilization:.1f}% ({total_cpu_load:.1f} / {total_cpu_total})"
)
total_memory_total = (
data.get("resources", {}).get("memory", {}).get("total", 0) / 1024
)
total_memory_used = (
data.get("resources", {}).get("memory", {}).get("used", 0) / 1024
)
total_memory_utilization = (
data.get("resources", {}).get("memory", {}).get("utilization", 0)
)
total_memory_string = f"{total_memory_utilization:.1f}% ({total_memory_used:.1f} GB / {total_memory_total:.1f} GB)"
total_disk_total = (
data.get("resources", {}).get("disk", {}).get("total", 0) / 1024 / 1024
)
total_disk_used = (
data.get("resources", {}).get("disk", {}).get("used", 0) / 1024 / 1024
)
total_disk_utilization = round(
data.get("resources", {}).get("disk", {}).get("utilization", 0)
)
total_disk_string = f"{total_disk_utilization:.1f}% ({total_disk_used:.1f} GB / {total_disk_total:.1f} GB)"
if maintenance == "true" or health == -1:
health_colour = ansii["blue"]
elif health > 90:
@@ -94,9 +125,6 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
output = list()
output.append(f"{ansii['bold']}PVC cluster status:{ansii['end']}")
output.append("")
output.append(f"{ansii['purple']}Primary node:{ansii['end']} {primary_node}") output.append(f"{ansii['purple']}Primary node:{ansii['end']} {primary_node}")
output.append(f"{ansii['purple']}PVC version:{ansii['end']} {pvc_version}") output.append(f"{ansii['purple']}PVC version:{ansii['end']} {pvc_version}")
output.append(f"{ansii['purple']}Upstream IP:{ansii['end']} {upstream_ip}") output.append(f"{ansii['purple']}Upstream IP:{ansii['end']} {upstream_ip}")
@ -136,7 +164,17 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
) )
messages = "\n ".join(message_list) messages = "\n ".join(message_list)
output.append(f"{ansii['purple']}Active Faults:{ansii['end']} {messages}") else:
messages = "None"
output.append(f"{ansii['purple']}Active faults:{ansii['end']} {messages}")
output.append(f"{ansii['purple']}Total CPU:{ansii['end']} {total_cpu_string}")
output.append(
f"{ansii['purple']}Total memory:{ansii['end']} {total_memory_string}"
)
output.append(f"{ansii['purple']}Total disk:{ansii['end']} {total_disk_string}")
output.append("") output.append("")
@@ -168,12 +206,12 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
output.append(f"{ansii['purple']}Nodes:{ansii['end']} {nodes_string}")
-vm_states = ["start", "disable"]
+vm_states = ["start", "disable", "mirror"]
vm_states.extend(
[
state
for state in data.get("vms", {}).keys()
-if state not in ["total", "start", "disable"]
+if state not in ["total", "start", "disable", "mirror"]
]
)
@@ -183,8 +221,10 @@ def cli_cluster_status_format_pretty(CLI_CONFIG, data):
continue
if state in ["start"]:
state_colour = ansii["green"]
-elif state in ["migrate", "disable", "provision"]:
+elif state in ["migrate", "disable", "provision", "mirror"]:
state_colour = ansii["blue"]
elif state in ["mirror"]:
state_colour = ansii["purple"]
elif state in ["stop", "fail"]: elif state in ["stop", "fail"]:
state_colour = ansii["red"] state_colour = ansii["red"]
else: else:
@ -258,9 +298,6 @@ def cli_cluster_status_format_short(CLI_CONFIG, data):
output = list() output = list()
output.append(f"{ansii['bold']}PVC cluster status:{ansii['end']}")
output.append("")
if health != "-1": if health != "-1":
health = f"{health}%" health = f"{health}%"
else: else:
@ -295,7 +332,48 @@ def cli_cluster_status_format_short(CLI_CONFIG, data):
) )
messages = "\n ".join(message_list) messages = "\n ".join(message_list)
output.append(f"{ansii['purple']}Active Faults:{ansii['end']} {messages}") else:
messages = "None"
output.append(f"{ansii['purple']}Active faults:{ansii['end']} {messages}")
total_cpu_total = data.get("resources", {}).get("cpu", {}).get("total", 0)
total_cpu_load = data.get("resources", {}).get("cpu", {}).get("load", 0)
total_cpu_utilization = (
data.get("resources", {}).get("cpu", {}).get("utilization", 0)
)
total_cpu_string = (
f"{total_cpu_utilization:.1f}% ({total_cpu_load:.1f} / {total_cpu_total})"
)
total_memory_total = (
data.get("resources", {}).get("memory", {}).get("total", 0) / 1024
)
total_memory_used = (
data.get("resources", {}).get("memory", {}).get("used", 0) / 1024
)
total_memory_utilization = (
data.get("resources", {}).get("memory", {}).get("utilization", 0)
)
total_memory_string = f"{total_memory_utilization:.1f}% ({total_memory_used:.1f} GB / {total_memory_total:.1f} GB)"
total_disk_total = (
data.get("resources", {}).get("disk", {}).get("total", 0) / 1024 / 1024
)
total_disk_used = (
data.get("resources", {}).get("disk", {}).get("used", 0) / 1024 / 1024
)
total_disk_utilization = round(
data.get("resources", {}).get("disk", {}).get("utilization", 0)
)
total_disk_string = f"{total_disk_utilization:.1f}% ({total_disk_used:.1f} GB / {total_disk_total:.1f} GB)"
output.append(f"{ansii['purple']}CPU usage:{ansii['end']} {total_cpu_string}")
output.append(
f"{ansii['purple']}Memory usage:{ansii['end']} {total_memory_string}"
)
output.append(f"{ansii['purple']}Disk usage:{ansii['end']} {total_disk_string}")
output.append("") output.append("")
@@ -580,9 +658,11 @@ def cli_cluster_fault_list_format_long(CLI_CONFIG, fault_data):
fault_id=fault["id"],
fault_status=fault["status"].title(),
fault_health_delta=f"-{fault['health_delta']}%",
-fault_acknowledged_at=fault["acknowledged_at"]
+fault_acknowledged_at=(
fault["acknowledged_at"]
if fault["acknowledged_at"] != ""
-else "N/A",
+else "N/A"
),
fault_last_reported=fault["last_reported"],
fault_first_reported=fault["first_reported"],
)
@@ -825,7 +905,7 @@ def cli_connection_list_format_pretty(CLI_CONFIG, data):
# Parse each connection and adjust field lengths
for connection in data:
for field, length in [(f, fields[f]["length"]) for f in fields]:
-_length = len(str(connection[field]))
+_length = len(str(connection[field])) + 1
if _length > length:
length = len(str(connection[field])) + 1
@@ -925,7 +1005,7 @@ def cli_connection_detail_format_pretty(CLI_CONFIG, data):
# Parse each connection and adjust field lengths
for connection in data:
for field, length in [(f, fields[f]["length"]) for f in fields]:
-_length = len(str(connection[field]))
+_length = len(str(connection[field])) + 1
if _length > length:
length = len(str(connection[field])) + 1


@@ -20,26 +20,16 @@
###############################################################################
from click import echo as click_echo
from click import confirm
from datetime import datetime
from distutils.util import strtobool
from getpass import getuser
from json import load as jload
from json import dump as jdump
-from os import chmod, environ, getpid, path, popen, makedirs, get_terminal_size
+from os import chmod, environ, getpid, path, get_terminal_size
from re import findall
from socket import gethostname
from subprocess import run, PIPE
from sys import argv
from syslog import syslog, openlog, closelog, LOG_AUTH
from yaml import load as yload
from yaml import SafeLoader
import pvc.lib.provisioner
import pvc.lib.vm
import pvc.lib.node
import pvc.lib.storage
DEFAULT_STORE_DATA = {"cfgfile": "/etc/pvc/pvc.conf"}
DEFAULT_STORE_FILENAME = "pvc.json"
@@ -177,9 +167,17 @@ def get_store(store_path):
with open(store_file) as fh:
try:
store_data = jload(fh)
return store_data
except Exception:
-return dict()
+store_data = dict()
if path.exists(DEFAULT_STORE_DATA["cfgfile"]):
if store_data.get("local", None) != DEFAULT_STORE_DATA:
del store_data["local"]
if "local" not in store_data.keys():
store_data["local"] = DEFAULT_STORE_DATA
update_store(store_path, store_data)
return store_data
def update_store(store_path, store_data):
@@ -196,452 +194,3 @@ def update_store(store_path, store_data):
with open(store_file, "w") as fh:
jdump(store_data, fh, sort_keys=True, indent=4)
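The get_store() change above guarantees a "local" entry whenever /etc/pvc/pvc.conf exists. A hedged sketch of what the normalized pvc.json store might then contain; the remote-connection entry and its fields are illustrative only, not the exact schema:
example_store = {
    # Always present and reset to DEFAULT_STORE_DATA on any node that has
    # /etc/pvc/pvc.conf, so stale legacy "local" entries are dropped.
    "local": {"cfgfile": "/etc/pvc/pvc.conf"},
    # Hypothetical remote cluster connection added via the CLI.
    "remote-cluster": {"host": "10.0.0.10", "port": 7370, "scheme": "http", "api_key": None},
}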
def get_autobackup_config(CLI_CONFIG, cfgfile):
try:
config = dict()
with open(cfgfile) as fh:
full_config = yload(fh, Loader=SafeLoader)
backup_config = full_config["autobackup"]
config["backup_root_path"] = backup_config["backup_root_path"]
config["backup_root_suffix"] = backup_config["backup_root_suffix"]
config["backup_tags"] = backup_config["backup_tags"]
config["backup_schedule"] = backup_config["backup_schedule"]
config["auto_mount_enabled"] = backup_config["auto_mount"]["enabled"]
if config["auto_mount_enabled"]:
config["mount_cmds"] = list()
_mount_cmds = backup_config["auto_mount"]["mount_cmds"]
for _mount_cmd in _mount_cmds:
if "{backup_root_path}" in _mount_cmd:
_mount_cmd = _mount_cmd.format(
backup_root_path=backup_config["backup_root_path"]
)
config["mount_cmds"].append(_mount_cmd)
config["unmount_cmds"] = list()
_unmount_cmds = backup_config["auto_mount"]["unmount_cmds"]
for _unmount_cmd in _unmount_cmds:
if "{backup_root_path}" in _unmount_cmd:
_unmount_cmd = _unmount_cmd.format(
backup_root_path=backup_config["backup_root_path"]
)
config["unmount_cmds"].append(_unmount_cmd)
except FileNotFoundError:
return "Backup configuration does not exist!"
except KeyError as e:
return f"Backup configuration is invalid: {e}"
return config
def vm_autobackup(
CLI_CONFIG,
autobackup_cfgfile=DEFAULT_AUTOBACKUP_FILENAME,
email_report=None,
force_full_flag=False,
cron_flag=False,
):
"""
Perform automatic backups of VMs based on an external config file.
"""
backup_summary = dict()
if email_report is not None:
from email.utils import formatdate
from socket import gethostname
try:
with open(autobackup_cfgfile) as fh:
tmp_config = yload(fh, Loader=SafeLoader)
cluster = tmp_config["cluster"]["name"]
except Exception:
cluster = "unknown"
def send_execution_failure_report(error=None):
echo(CLI_CONFIG, f"Sending email failure report to {email_report}")
current_datetime = datetime.now()
email_datetime = formatdate(float(current_datetime.strftime("%s")))
email = list()
email.append(f"Date: {email_datetime}")
email.append(f"Subject: PVC Autobackup execution failure for cluster {cluster}")
recipients = list()
for recipient in email_report.split(","):
recipients.append(f"<{recipient}>")
email.append(f"To: {', '.join(recipients)}")
email.append(f"From: PVC Autobackup System <pvc@{gethostname()}>")
email.append("")
email.append(
f"A PVC autobackup has FAILED at {current_datetime} due to an execution error."
)
email.append("")
email.append("The reported error message is:")
email.append(f" {error}")
try:
p = popen("/usr/sbin/sendmail -t", "w")
p.write("\n".join(email))
p.close()
except Exception as e:
echo(CLI_CONFIG, f"Failed to send report email: {e}")
# Validate that we are running on the current primary coordinator of the 'local' cluster connection
real_connection = CLI_CONFIG["connection"]
CLI_CONFIG["connection"] = "local"
retcode, retdata = pvc.lib.node.node_info(CLI_CONFIG, DEFAULT_NODE_HOSTNAME)
if not retcode or retdata.get("coordinator_state") != "primary":
if cron_flag:
echo(
CLI_CONFIG,
"Current host is not the primary coordinator of the local cluster and running in cron mode. Exiting cleanly.",
)
exit(0)
else:
echo(
CLI_CONFIG,
f"ERROR: Current host is not the primary coordinator of the local cluster; got connection '{real_connection}', host '{DEFAULT_NODE_HOSTNAME}'.",
)
echo(
CLI_CONFIG,
"Autobackup MUST be run from the cluster active primary coordinator using the 'local' connection. See '-h'/'--help' for details.",
)
if email_report is not None:
send_execution_failure_report(
error=f"Autobackup run attempted from non-local connection or non-primary coordinator; got connection '{real_connection}', host '{DEFAULT_NODE_HOSTNAME}'."
)
exit(1)
# Ensure we're running as root, or show a warning & confirmation
if getuser() != "root":
confirm(
"WARNING: You are not running this command as 'root'. This command should be run under the same user as the API daemon, which is usually 'root'. Are you sure you want to continue?",
prompt_suffix=" ",
abort=True,
)
# Load our YAML config
autobackup_config = get_autobackup_config(CLI_CONFIG, autobackup_cfgfile)
if not isinstance(autobackup_config, dict):
echo(CLI_CONFIG, f"ERROR: {autobackup_config}")
if email_report is not None:
send_execution_failure_report(error=f"{autobackup_config}")
exit(1)
# Get the start time of this run
autobackup_start_time = datetime.now()
# Get a list of all VMs on the cluster
# We don't do tag filtering here, because we could match an arbitrary number of tags; instead, we
# parse the list after
retcode, retdata = pvc.lib.vm.vm_list(CLI_CONFIG, None, None, None, None, None)
if not retcode:
echo(CLI_CONFIG, f"ERROR: Failed to fetch VM list: {retdata}")
if email_report is not None:
send_execution_failure_report(error=f"Failed to fetch VM list: {retdata}")
exit(1)
cluster_vms = retdata
# Parse the list to match tags; too complex for list comprehension alas
backup_vms = list()
for vm in cluster_vms:
vm_tag_names = [t["name"] for t in vm["tags"]]
matching_tags = (
True
if len(
set(vm_tag_names).intersection(set(autobackup_config["backup_tags"]))
)
> 0
else False
)
if matching_tags:
backup_vms.append(vm["name"])
if len(backup_vms) < 1:
echo(CLI_CONFIG, "Found no suitable VMs for autobackup.")
exit(0)
# Pretty print the names of the VMs we'll back up (to stderr)
maxnamelen = max([len(n) for n in backup_vms]) + 2
cols = 1
while (cols * maxnamelen + maxnamelen + 2) <= MAX_CONTENT_WIDTH:
cols += 1
rows = len(backup_vms) // cols
vm_list_rows = list()
for row in range(0, rows + 1):
row_start = row * cols
row_end = (row * cols) + cols
row_str = ""
for x in range(row_start, row_end):
if x < len(backup_vms):
row_str += "{:<{}}".format(backup_vms[x], maxnamelen)
vm_list_rows.append(row_str)
echo(CLI_CONFIG, f"Found {len(backup_vms)} suitable VM(s) for autobackup.")
echo(CLI_CONFIG, "Full VM list:", stderr=True)
echo(CLI_CONFIG, " {}".format("\n ".join(vm_list_rows)), stderr=True)
echo(CLI_CONFIG, "", stderr=True)
if autobackup_config["auto_mount_enabled"]:
# Execute each mount_cmds command in sequence
for cmd in autobackup_config["mount_cmds"]:
echo(
CLI_CONFIG,
f"Executing mount command '{cmd.split()[0]}'... ",
newline=False,
)
tstart = datetime.now()
ret = run(
cmd.split(),
stdout=PIPE,
stderr=PIPE,
)
tend = datetime.now()
ttot = tend - tstart
if ret.returncode != 0:
echo(
CLI_CONFIG,
f"failed. [{ttot.seconds}s]",
)
echo(
CLI_CONFIG,
f"Exiting; command reports: {ret.stderr.decode().strip()}",
)
if email_report is not None:
send_execution_failure_report(error=ret.stderr.decode().strip())
exit(1)
else:
echo(CLI_CONFIG, f"done. [{ttot.seconds}s]")
# For each VM, perform the backup
for vm in backup_vms:
backup_suffixed_path = f"{autobackup_config['backup_root_path']}{autobackup_config['backup_root_suffix']}"
if not path.exists(backup_suffixed_path):
makedirs(backup_suffixed_path)
backup_path = f"{backup_suffixed_path}/{vm}"
autobackup_state_file = f"{backup_path}/.autobackup.json"
if not path.exists(backup_path) or not path.exists(autobackup_state_file):
# There are no new backups so the list is empty
state_data = dict()
tracked_backups = list()
else:
with open(autobackup_state_file) as fh:
state_data = jload(fh)
tracked_backups = state_data["tracked_backups"]
full_interval = autobackup_config["backup_schedule"]["full_interval"]
full_retention = autobackup_config["backup_schedule"]["full_retention"]
full_backups = [b for b in tracked_backups if b["type"] == "full"]
if len(full_backups) > 0:
last_full_backup = full_backups[0]
last_full_backup_idx = tracked_backups.index(last_full_backup)
if force_full_flag:
this_backup_type = "forced-full"
this_backup_incremental_parent = None
this_backup_retain_snapshot = True
elif last_full_backup_idx >= full_interval - 1:
this_backup_type = "full"
this_backup_incremental_parent = None
this_backup_retain_snapshot = True
else:
this_backup_type = "incremental"
this_backup_incremental_parent = last_full_backup["datestring"]
this_backup_retain_snapshot = False
else:
# The very first backup must be full to start the tree
this_backup_type = "full"
this_backup_incremental_parent = None
this_backup_retain_snapshot = True
# Perform the backup
echo(
CLI_CONFIG,
f"Backing up VM '{vm}' ({this_backup_type})... ",
newline=False,
)
tstart = datetime.now()
retcode, retdata = pvc.lib.vm.vm_backup(
CLI_CONFIG,
vm,
backup_suffixed_path,
incremental_parent=this_backup_incremental_parent,
retain_snapshot=this_backup_retain_snapshot,
)
tend = datetime.now()
ttot = tend - tstart
if not retcode:
backup_datestring = findall(r"[0-9]{14}", retdata)[0]
echo(CLI_CONFIG, f"failed. [{ttot.seconds}s]")
echo(
CLI_CONFIG,
retdata.strip().replace(f"ERROR in backup {backup_datestring}: ", ""),
)
skip_cleanup = True
else:
backup_datestring = findall(r"[0-9]{14}", retdata)[0]
echo(
CLI_CONFIG,
f"done. Backup '{backup_datestring}' created. [{ttot.seconds}s]",
)
skip_cleanup = False
# Read backup file to get details
backup_json_file = f"{backup_path}/{backup_datestring}/pvcbackup.json"
with open(backup_json_file) as fh:
backup_json = jload(fh)
tracked_backups.insert(0, backup_json)
# Delete any full backups that are expired
marked_for_deletion = list()
found_full_count = 0
for backup in tracked_backups:
if backup["type"] == "full":
found_full_count += 1
if found_full_count > full_retention:
marked_for_deletion.append(backup)
# Delete any incremental backups that depend on marked parents
for backup in tracked_backups:
if backup["type"] == "incremental" and backup["incremental_parent"] in [
b["datestring"] for b in marked_for_deletion
]:
marked_for_deletion.append(backup)
if len(marked_for_deletion) > 0:
if skip_cleanup:
echo(
CLI_CONFIG,
f"Skipping cleanups for {len(marked_for_deletion)} aged-out backups due to backup failure.",
)
else:
echo(
CLI_CONFIG,
f"Running cleanups for {len(marked_for_deletion)} aged-out backups...",
)
# Execute deletes
for backup_to_delete in marked_for_deletion:
echo(
CLI_CONFIG,
f"Removing old VM '{vm}' backup '{backup_to_delete['datestring']}' ({backup_to_delete['type']})... ",
newline=False,
)
tstart = datetime.now()
retcode, retdata = pvc.lib.vm.vm_remove_backup(
CLI_CONFIG,
vm,
backup_suffixed_path,
backup_to_delete["datestring"],
)
tend = datetime.now()
ttot = tend - tstart
if not retcode:
echo(CLI_CONFIG, f"failed. [{ttot.seconds}s]")
echo(
CLI_CONFIG,
f"Skipping removal from tracked backups; command reports: {retdata}",
)
else:
tracked_backups.remove(backup_to_delete)
echo(CLI_CONFIG, f"done. [{ttot.seconds}s]")
# Update tracked state information
state_data["tracked_backups"] = tracked_backups
with open(autobackup_state_file, "w") as fh:
jdump(state_data, fh)
backup_summary[vm] = tracked_backups
if autobackup_config["auto_mount_enabled"]:
# Execute each unmount_cmds command in sequence
for cmd in autobackup_config["unmount_cmds"]:
echo(
CLI_CONFIG,
f"Executing unmount command '{cmd.split()[0]}'... ",
newline=False,
)
tstart = datetime.now()
ret = run(
cmd.split(),
stdout=PIPE,
stderr=PIPE,
)
tend = datetime.now()
ttot = tend - tstart
if ret.returncode != 0:
echo(
CLI_CONFIG,
f"failed. [{ttot.seconds}s]",
)
echo(
CLI_CONFIG,
f"Continuing; command reports: {ret.stderr.decode().strip()}",
)
else:
echo(CLI_CONFIG, f"done. [{ttot.seconds}s]")
autobackup_end_time = datetime.now()
autobackup_total_time = autobackup_end_time - autobackup_start_time
# Handle report emailing
if email_report is not None:
echo(CLI_CONFIG, "")
echo(CLI_CONFIG, f"Sending email summary report to {email_report}")
current_datetime = datetime.now()
email_datetime = formatdate(float(current_datetime.strftime("%s")))
email = list()
email.append(f"Date: {email_datetime}")
email.append(f"Subject: PVC Autobackup report for cluster {cluster}")
recipients = list()
for recipient in email_report.split(","):
recipients.append(f"<{recipient}>")
email.append(f"To: {', '.join(recipients)}")
email.append(f"From: PVC Autobackup System <pvc@{gethostname()}>")
email.append("")
email.append(
f"A PVC autobackup has been completed at {current_datetime} in {autobackup_total_time}."
)
email.append("")
email.append(
"The following is a summary of all current VM backups after cleanups, most recent first:"
)
email.append("")
for vm in backup_vms:
email.append(f"VM {vm}:")
for backup in backup_summary[vm]:
datestring = backup.get("datestring")
backup_date = datetime.strptime(datestring, "%Y%m%d%H%M%S")
if backup.get("result", False):
email.append(
f" {backup_date}: Success in {backup.get('runtime_secs', 0)} seconds, ID {datestring}, type {backup.get('type', 'unknown')}"
)
email.append(
f" Backup contains {len(backup.get('backup_files'))} files totaling {pvc.lib.storage.format_bytes_tohuman(backup.get('backup_size_bytes', 0))} ({backup.get('backup_size_bytes', 0)} bytes)"
)
else:
email.append(
f" {backup_date}: Failure in {backup.get('runtime_secs', 0)} seconds, ID {datestring}, type {backup.get('type', 'unknown')}"
)
email.append(
f" {backup.get('result_message')}"
)
try:
p = popen("/usr/sbin/sendmail -t", "w")
p.write("\n".join(email))
p.close()
except Exception as e:
echo(CLI_CONFIG, f"Failed to send report email: {e}")
echo(CLI_CONFIG, "")
echo(CLI_CONFIG, f"Autobackup completed in {autobackup_total_time}.")


@@ -68,7 +68,8 @@ def cli_connection_list_parser(connections_config, show_keys_flag):
}
)
-return connections_data
+# Return, ensuring local is always first
return sorted(connections_data, key=lambda x: (x.get("name") != "local"))
def cli_connection_detail_parser(connections_config):
@@ -121,4 +122,5 @@ def cli_connection_detail_parser(connections_config):
}
)
-return connections_data
+# Return, ensuring local is always first
return sorted(connections_data, key=lambda x: (x.get("name") != "local"))
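A small illustration (not from the diff) of why this sort key pins "local" first: the lambda returns False only for the local entry, False orders before True, and Python's sort is stable, so the remaining connections keep their existing order.
connections = [{"name": "backup"}, {"name": "local"}, {"name": "prod"}]
ordered = sorted(connections, key=lambda x: (x.get("name") != "local"))
# ordered == [{"name": "local"}, {"name": "backup"}, {"name": "prod"}]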


@@ -19,6 +19,8 @@
#
###############################################################################
import sys
from click import progressbar
from time import sleep, time
@@ -105,7 +107,7 @@ def wait_for_celery_task(CLI_CONFIG, task_detail, start_late=False):
# Start following the task state, updating progress as we go
total_task = task_status.get("total")
-with progressbar(length=total_task, show_eta=False) as bar:
+with progressbar(length=total_task, width=20, show_eta=False) as bar:
last_task = 0
maxlen = 21
echo(
@@ -115,28 +117,39 @@ def wait_for_celery_task(CLI_CONFIG, task_detail, start_late=False):
)
while True:
sleep(0.5)
task_status = pvc.lib.common.task_status(
CLI_CONFIG, task_id=task_id, is_watching=True
)
if isinstance(task_status, tuple):
continue
if task_status.get("state") != "RUNNING":
break
-if task_status.get("current") > last_task:
+if task_status.get("current") == 0:
continue
current_task = int(task_status.get("current"))
total_task = int(task_status.get("total"))
bar.length = total_task
if current_task > last_task:
bar.update(current_task - last_task)
last_task = current_task
# The extensive spaces at the end cause this to overwrite longer previous messages
curlen = len(str(task_status.get("status")))
if curlen > maxlen:
maxlen = curlen
lendiff = maxlen - curlen
overwrite_whitespace = " " * lendiff
-echo(
-CLI_CONFIG,
-" " + task_status.get("status") + overwrite_whitespace,
-newline=False,
-)
-task_status = pvc.lib.common.task_status(
-CLI_CONFIG, task_id=task_id, is_watching=True
-)
+percent_complete = (current_task / total_task) * 100
+bar_output = f"[{bar.format_bar()}] {percent_complete:3.0f}%"
+sys.stdout.write(
+f"\r {bar_output} {task_status['status']}{overwrite_whitespace}"
+)
+sys.stdout.flush()
if task_status.get("state") == "SUCCESS":
bar.update(total_task - last_task)


@@ -21,6 +21,8 @@
import json
from time import sleep
from pvc.lib.common import call_api
@@ -114,3 +116,22 @@ def get_info(config):
return True, response.json()
else:
return False, response.json().get("message", "")
def get_primary_node(config):
"""
Get the current primary node of the PVC cluster
API endpoint: GET /api/v1/status/primary_node
API arguments:
API schema: {json_data_object}
"""
while True:
response = call_api(config, "get", "/status/primary_node")
resp_code = response.status_code
if resp_code == 200:
break
else:
sleep(1)
return True, response.json()["primary_node"]
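Note that the loop above retries once per second until the API answers 200, so it blocks indefinitely if the cluster never reports a primary coordinator. A hedged variant with a bounded number of attempts (names are local to this sketch only) could look like:
def get_primary_node_with_timeout(config, attempts=30):
    for _ in range(attempts):
        response = call_api(config, "get", "/status/primary_node")
        if response.status_code == 200:
            return True, response.json()["primary_node"]
        sleep(1)
    return False, "Timed out waiting for a primary coordinator"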


@@ -83,7 +83,7 @@ class UploadProgressBar(object):
else:
self.end_suffix = ""
-self.bar = click.progressbar(length=self.length, show_eta=True)
+self.bar = click.progressbar(length=self.length, width=20, show_eta=True)
def update(self, monitor):
bytes_cur = monitor.bytes_read
@@ -108,9 +108,10 @@ class UploadProgressBar(object):
class ErrorResponse(requests.Response):
-def __init__(self, json_data, status_code):
+def __init__(self, json_data, status_code, headers):
self.json_data = json_data
self.status_code = status_code
self.headers = headers
def json(self):
return self.json_data
@@ -157,9 +158,10 @@ def call_api(
if response.status_code in retry_on_code:
failed = True
continue
break
except requests.exceptions.ConnectionError:
failed = True
-pass
+continue
if failed:
error = f"Code {response.status_code}" if response else "Timeout"
raise requests.exceptions.ConnectionError(
@@ -206,7 +208,7 @@ def call_api(
except Exception as e:
message = "Failed to connect to the API: {}".format(e)
code = response.status_code if response else 504
-response = ErrorResponse({"message": message}, code)
+response = ErrorResponse({"message": message}, code, None)
# Display debug output
if config["debug"]:


@@ -30,6 +30,7 @@ from requests_toolbelt.multipart.encoder import (
import pvc.lib.ansiprint as ansiprint
from pvc.lib.common import UploadProgressBar, call_api, get_wait_retdata
from pvc.cli.helpers import MAX_CONTENT_WIDTH
#
# Supplemental functions
@@ -430,7 +431,9 @@ def format_list_osd(config, osd_list):
)
continue
-if osd_information.get("is_split") is not None:
+if osd_information.get("is_split") is not None and osd_information.get(
"is_split"
):
osd_information["device"] = f"{osd_information['device']} [s]"
# Deal with the size to human readable
@@ -1542,6 +1545,30 @@ def ceph_snapshot_add(config, pool, volume, snapshot):
return retstatus, response.json().get("message", "")
def ceph_snapshot_rollback(config, pool, volume, snapshot):
"""
Roll back Ceph volume to snapshot
API endpoint: POST /api/v1/storage/ceph/snapshot/{pool}/{volume}/{snapshot}/rollback
API arguments:
API schema: {"message":"{data}"}
"""
response = call_api(
config,
"post",
"/storage/ceph/snapshot/{pool}/{volume}/{snapshot}/rollback".format(
snapshot=snapshot, volume=volume, pool=pool
),
)
if response.status_code == 200:
retstatus = True
else:
retstatus = False
return retstatus, response.json().get("message", "")
def ceph_snapshot_remove(config, pool, volume, snapshot):
"""
Remove Ceph snapshot
@@ -1698,15 +1725,17 @@ def format_list_snapshot(config, snapshot_list):
#
# Benchmark functions
#
-def ceph_benchmark_run(config, pool, wait_flag):
+def ceph_benchmark_run(config, pool, name, wait_flag):
"""
Run a storage benchmark against {pool}
API endpoint: POST /api/v1/storage/ceph/benchmark
-API arguments: pool={pool}
+API arguments: pool={pool}, name={name}
API schema: {message}
"""
params = {"pool": pool}
if name:
params["name"] = name
response = call_api(config, "post", "/storage/ceph/benchmark", params=params)
return get_wait_retdata(response, wait_flag)
@@ -1778,7 +1807,7 @@ def get_benchmark_list_results(benchmark_format, benchmark_data):
benchmark_bandwidth, benchmark_iops = get_benchmark_list_results_legacy(
benchmark_data
)
-elif benchmark_format == 1:
+elif benchmark_format == 1 or benchmark_format == 2:
benchmark_bandwidth, benchmark_iops = get_benchmark_list_results_json(
benchmark_data
)
@@ -1980,6 +2009,7 @@ def format_info_benchmark(config, benchmark_information):
benchmark_matrix = {
0: format_info_benchmark_legacy,
1: format_info_benchmark_json,
2: format_info_benchmark_json,
}
benchmark_version = benchmark_information[0]["test_format"]
@@ -2314,12 +2344,15 @@ def format_info_benchmark_json(config, benchmark_information):
if benchmark_information["benchmark_result"] == "Running":
return "Benchmark test is still running."
benchmark_format = benchmark_information["test_format"]
benchmark_details = benchmark_information["benchmark_result"]
# Format a nice output; do this line-by-line then concat the elements at the end
ainformation = []
ainformation.append(
-"{}Storage Benchmark details:{}".format(ansiprint.bold(), ansiprint.end())
+"{}Storage Benchmark details (format {}):{}".format(
ansiprint.bold(), benchmark_format, ansiprint.end()
)
)
nice_test_name_map = {
@@ -2367,7 +2400,7 @@ def format_info_benchmark_json(config, benchmark_information):
if element[1] != 0:
useful_latency_tree.append(element)
-max_rows = 9
+max_rows = 5
if len(useful_latency_tree) > 9:
max_rows = len(useful_latency_tree)
elif len(useful_latency_tree) < 9:
# Format the static data # Format the static data
overall_label = [ overall_label = [
"Overall BW/s:", "BW/s:",
"Overall IOPS:", "IOPS:",
"Total I/O:", "I/O:",
"Runtime (s):", "Time:",
"User CPU %:",
"System CPU %:",
"Ctx Switches:",
"Major Faults:",
"Minor Faults:",
] ]
while len(overall_label) < max_rows: while len(overall_label) < max_rows:
overall_label.append("") overall_label.append("")
@ -2393,68 +2421,149 @@ def format_info_benchmark_json(config, benchmark_information):
format_bytes_tohuman(int(job_details[io_class]["bw_bytes"])), format_bytes_tohuman(int(job_details[io_class]["bw_bytes"])),
format_ops_tohuman(int(job_details[io_class]["iops"])), format_ops_tohuman(int(job_details[io_class]["iops"])),
format_bytes_tohuman(int(job_details[io_class]["io_bytes"])), format_bytes_tohuman(int(job_details[io_class]["io_bytes"])),
job_details["job_runtime"] / 1000, str(job_details["job_runtime"] / 1000) + "s",
job_details["usr_cpu"],
job_details["sys_cpu"],
job_details["ctx"],
job_details["majf"],
job_details["minf"],
] ]
while len(overall_data) < max_rows: while len(overall_data) < max_rows:
overall_data.append("") overall_data.append("")
cpu_label = [
"Total:",
"User:",
"Sys:",
"OSD:",
"MON:",
]
while len(cpu_label) < max_rows:
cpu_label.append("")
cpu_data = [
(
benchmark_details[test]["avg_cpu_util_percent"]["total"]
if benchmark_format > 1
else "N/A"
),
round(job_details["usr_cpu"], 2),
round(job_details["sys_cpu"], 2),
(
benchmark_details[test]["avg_cpu_util_percent"]["ceph-osd"]
if benchmark_format > 1
else "N/A"
),
(
benchmark_details[test]["avg_cpu_util_percent"]["ceph-mon"]
if benchmark_format > 1
else "N/A"
),
]
while len(cpu_data) < max_rows:
cpu_data.append("")
memory_label = [
"Total:",
"OSD:",
"MON:",
]
while len(memory_label) < max_rows:
memory_label.append("")
memory_data = [
(
benchmark_details[test]["avg_memory_util_percent"]["total"]
if benchmark_format > 1
else "N/A"
),
(
benchmark_details[test]["avg_memory_util_percent"]["ceph-osd"]
if benchmark_format > 1
else "N/A"
),
(
benchmark_details[test]["avg_memory_util_percent"]["ceph-mon"]
if benchmark_format > 1
else "N/A"
),
]
while len(memory_data) < max_rows:
memory_data.append("")
network_label = [
"Total:",
"Sent:",
"Recv:",
]
while len(network_label) < max_rows:
network_label.append("")
network_data = [
(
format_bytes_tohuman(
int(benchmark_details[test]["avg_network_util_bps"]["total"])
)
if benchmark_format > 1
else "N/A"
),
(
format_bytes_tohuman(
int(benchmark_details[test]["avg_network_util_bps"]["sent"])
)
if benchmark_format > 1
else "N/A"
),
(
format_bytes_tohuman(
int(benchmark_details[test]["avg_network_util_bps"]["recv"])
)
if benchmark_format > 1
else "N/A"
),
]
while len(network_data) < max_rows:
network_data.append("")
bandwidth_label = [ bandwidth_label = [
"Min:", "Min:",
"Max:", "Max:",
"Mean:", "Mean:",
"StdDev:", "StdDev:",
"Samples:", "Samples:",
"",
"",
"",
"",
] ]
while len(bandwidth_label) < max_rows: while len(bandwidth_label) < max_rows:
bandwidth_label.append("") bandwidth_label.append("")
bandwidth_data = [ bandwidth_data = [
format_bytes_tohuman(int(job_details[io_class]["bw_min"]) * 1024), format_bytes_tohuman(int(job_details[io_class]["bw_min"]) * 1024)
format_bytes_tohuman(int(job_details[io_class]["bw_max"]) * 1024), + " / "
format_bytes_tohuman(int(job_details[io_class]["bw_mean"]) * 1024), + format_ops_tohuman(int(job_details[io_class]["iops_min"])),
format_bytes_tohuman(int(job_details[io_class]["bw_dev"]) * 1024), format_bytes_tohuman(int(job_details[io_class]["bw_max"]) * 1024)
job_details[io_class]["bw_samples"], + " / "
"", + format_ops_tohuman(int(job_details[io_class]["iops_max"])),
"", format_bytes_tohuman(int(job_details[io_class]["bw_mean"]) * 1024)
"", + " / "
"", + format_ops_tohuman(int(job_details[io_class]["iops_mean"])),
format_bytes_tohuman(int(job_details[io_class]["bw_dev"]) * 1024)
+ " / "
+ format_ops_tohuman(int(job_details[io_class]["iops_stddev"])),
str(job_details[io_class]["bw_samples"])
+ " / "
+ str(job_details[io_class]["iops_samples"]),
] ]
while len(bandwidth_data) < max_rows: while len(bandwidth_data) < max_rows:
bandwidth_data.append("") bandwidth_data.append("")
iops_data = [ lat_label = [
format_ops_tohuman(int(job_details[io_class]["iops_min"])), "Min:",
format_ops_tohuman(int(job_details[io_class]["iops_max"])), "Max:",
format_ops_tohuman(int(job_details[io_class]["iops_mean"])), "Mean:",
format_ops_tohuman(int(job_details[io_class]["iops_stddev"])), "StdDev:",
job_details[io_class]["iops_samples"],
"",
"",
"",
"",
] ]
while len(iops_data) < max_rows: while len(lat_label) < max_rows:
iops_data.append("") lat_label.append("")
lat_data = [ lat_data = [
int(job_details[io_class]["lat_ns"]["min"]) / 1000, int(job_details[io_class]["lat_ns"]["min"]) / 1000,
int(job_details[io_class]["lat_ns"]["max"]) / 1000, int(job_details[io_class]["lat_ns"]["max"]) / 1000,
int(job_details[io_class]["lat_ns"]["mean"]) / 1000, int(job_details[io_class]["lat_ns"]["mean"]) / 1000,
int(job_details[io_class]["lat_ns"]["stddev"]) / 1000, int(job_details[io_class]["lat_ns"]["stddev"]) / 1000,
"",
"",
"",
"",
"",
] ]
while len(lat_data) < max_rows: while len(lat_data) < max_rows:
lat_data.append("") lat_data.append("")
@ -2463,98 +2572,119 @@ def format_info_benchmark_json(config, benchmark_information):
lat_bucket_label = list() lat_bucket_label = list()
lat_bucket_data = list() lat_bucket_data = list()
for element in useful_latency_tree: for element in useful_latency_tree:
lat_bucket_label.append(element[0]) lat_bucket_label.append(element[0] + ":" if element[0] else "")
lat_bucket_data.append(element[1]) lat_bucket_data.append(round(float(element[1]), 2) if element[1] else "")
while len(lat_bucket_label) < max_rows:
lat_bucket_label.append("")
while len(lat_bucket_data) < max_rows:
lat_bucket_label.append("")
# Column default widths # Column default widths
overall_label_length = 0 overall_label_length = 5
overall_column_length = 0 overall_column_length = 0
bandwidth_label_length = 0 cpu_label_length = 6
bandwidth_column_length = 11 cpu_column_length = 0
iops_column_length = 4 memory_label_length = 6
latency_column_length = 12 memory_column_length = 0
network_label_length = 6
network_column_length = 6
bandwidth_label_length = 8
bandwidth_column_length = 0
latency_label_length = 7
latency_column_length = 0
latency_bucket_label_length = 0 latency_bucket_label_length = 0
latency_bucket_column_length = 0
# Column layout: # Column layout:
# General Bandwidth IOPS Latency Percentiles # Overall CPU Memory Network Bandwidth/IOPS Latency Percentiles
# --------- ---------- -------- -------- --------------- # --------- ----- ------- -------- -------------- -------- ---------------
# Size Min Min Min A # BW Total Total Total Min Min A
# BW Max Max Max B # IOPS Usr OSD Send Max Max B
# IOPS Mean Mean Mean ... # Time Sys MON Recv Mean Mean ...
# Runtime StdDev StdDev StdDev Z # Size OSD StdDev StdDev Z
# UsrCPU Samples Samples # MON Samples
# SysCPU
# CtxSw
# MajFault
# MinFault
# Set column widths # Set column widths
for item in overall_label:
_item_length = len(str(item))
if _item_length > overall_label_length:
overall_label_length = _item_length
for item in overall_data: for item in overall_data:
_item_length = len(str(item)) _item_length = len(str(item))
if _item_length > overall_column_length: if _item_length > overall_column_length:
overall_column_length = _item_length overall_column_length = _item_length
test_name_length = len(nice_test_name_map[test]) for item in cpu_data:
if test_name_length > overall_label_length + overall_column_length:
_diff = test_name_length - (overall_label_length + overall_column_length)
overall_column_length += _diff
for item in bandwidth_label:
_item_length = len(str(item)) _item_length = len(str(item))
if _item_length > bandwidth_label_length: if _item_length > cpu_column_length:
bandwidth_label_length = _item_length cpu_column_length = _item_length
for item in memory_data:
_item_length = len(str(item))
if _item_length > memory_column_length:
memory_column_length = _item_length
for item in network_data:
_item_length = len(str(item))
if _item_length > network_column_length:
network_column_length = _item_length
for item in bandwidth_data: for item in bandwidth_data:
_item_length = len(str(item)) _item_length = len(str(item))
if _item_length > bandwidth_column_length: if _item_length > bandwidth_column_length:
bandwidth_column_length = _item_length bandwidth_column_length = _item_length
for item in iops_data:
_item_length = len(str(item))
if _item_length > iops_column_length:
iops_column_length = _item_length
for item in lat_data: for item in lat_data:
_item_length = len(str(item)) _item_length = len(str(item))
if _item_length > latency_column_length: if _item_length > latency_column_length:
latency_column_length = _item_length latency_column_length = _item_length
for item in lat_bucket_label: for item in lat_bucket_data:
_item_length = len(str(item)) _item_length = len(str(item))
if _item_length > latency_bucket_label_length: if _item_length > latency_bucket_column_length:
latency_bucket_label_length = _item_length latency_bucket_column_length = _item_length
# Top row (Headers) # Top row (Headers)
ainformation.append( ainformation.append(
"{bold}\ "{bold}{overall_label: <{overall_label_length}} {header_fill}{end_bold}".format(
{overall_label: <{overall_label_length}} \
{bandwidth_label: <{bandwidth_label_length}} \
{bandwidth: <{bandwidth_length}} \
{iops: <{iops_length}} \
{latency: <{latency_length}} \
{latency_bucket_label: <{latency_bucket_label_length}} \
{latency_bucket} \
{end_bold}".format(
bold=ansiprint.bold(), bold=ansiprint.bold(),
end_bold=ansiprint.end(), end_bold=ansiprint.end(),
overall_label=nice_test_name_map[test], overall_label=nice_test_name_map[test],
overall_label_length=overall_label_length, overall_label_length=overall_label_length,
bandwidth_label="", header_fill="-"
bandwidth_label_length=bandwidth_label_length, * (
bandwidth="Bandwidth/s", (MAX_CONTENT_WIDTH if MAX_CONTENT_WIDTH <= 120 else 120)
bandwidth_length=bandwidth_column_length, - len(nice_test_name_map[test])
iops="IOPS", - 4
iops_length=iops_column_length, ),
latency="Latency (μs)", )
latency_length=latency_column_length, )
latency_bucket_label="Latency Buckets (μs/%)",
latency_bucket_label_length=latency_bucket_label_length, ainformation.append(
latency_bucket="", "{bold}\
{overall_label: <{overall_label_length}} \
{cpu_label: <{cpu_label_length}} \
{memory_label: <{memory_label_length}} \
{network_label: <{network_label_length}} \
{bandwidth_label: <{bandwidth_label_length}} \
{latency_label: <{latency_label_length}} \
{latency_bucket_label: <{latency_bucket_label_length}}\
{end_bold}".format(
bold=ansiprint.bold(),
end_bold=ansiprint.end(),
overall_label="Overall",
overall_label_length=overall_label_length + overall_column_length + 1,
cpu_label="CPU (%)",
cpu_label_length=cpu_label_length + cpu_column_length + 1,
memory_label="Memory (%)",
memory_label_length=memory_label_length + memory_column_length + 1,
network_label="Network (bps)",
network_label_length=network_label_length + network_column_length + 1,
bandwidth_label="Bandwidth / IOPS",
bandwidth_label_length=bandwidth_label_length
+ bandwidth_column_length
+ 1,
latency_label="Latency (μs)",
latency_label_length=latency_label_length + latency_column_length + 1,
latency_bucket_label="Buckets (μs/%)",
latency_bucket_label_length=latency_bucket_label_length
+ latency_bucket_column_length,
) )
) )
@ -2562,13 +2692,19 @@ def format_info_benchmark_json(config, benchmark_information):
# Top row (Headers) # Top row (Headers)
ainformation.append( ainformation.append(
"{bold}\ "{bold}\
{overall_label: >{overall_label_length}} \ {overall_label: <{overall_label_length}} \
{overall: <{overall_length}} \ {overall: <{overall_length}} \
{bandwidth_label: >{bandwidth_label_length}} \ {cpu_label: <{cpu_label_length}} \
{cpu: <{cpu_length}} \
{memory_label: <{memory_label_length}} \
{memory: <{memory_length}} \
{network_label: <{network_label_length}} \
{network: <{network_length}} \
{bandwidth_label: <{bandwidth_label_length}} \
{bandwidth: <{bandwidth_length}} \ {bandwidth: <{bandwidth_length}} \
{iops: <{iops_length}} \ {latency_label: <{latency_label_length}} \
{latency: <{latency_length}} \ {latency: <{latency_length}} \
{latency_bucket_label: >{latency_bucket_label_length}} \ {latency_bucket_label: <{latency_bucket_label_length}} \
{latency_bucket}\ {latency_bucket}\
{end_bold}".format( {end_bold}".format(
bold="", bold="",
@ -2577,12 +2713,24 @@ def format_info_benchmark_json(config, benchmark_information):
overall_label_length=overall_label_length, overall_label_length=overall_label_length,
overall=overall_data[idx], overall=overall_data[idx],
overall_length=overall_column_length, overall_length=overall_column_length,
cpu_label=cpu_label[idx],
cpu_label_length=cpu_label_length,
cpu=cpu_data[idx],
cpu_length=cpu_column_length,
memory_label=memory_label[idx],
memory_label_length=memory_label_length,
memory=memory_data[idx],
memory_length=memory_column_length,
network_label=network_label[idx],
network_label_length=network_label_length,
network=network_data[idx],
network_length=network_column_length,
bandwidth_label=bandwidth_label[idx], bandwidth_label=bandwidth_label[idx],
bandwidth_label_length=bandwidth_label_length, bandwidth_label_length=bandwidth_label_length,
bandwidth=bandwidth_data[idx], bandwidth=bandwidth_data[idx],
bandwidth_length=bandwidth_column_length, bandwidth_length=bandwidth_column_length,
iops=iops_data[idx], latency_label=lat_label[idx],
iops_length=iops_column_length, latency_label_length=latency_label_length,
latency=lat_data[idx], latency=lat_data[idx],
latency_length=latency_column_length, latency_length=latency_column_length,
latency_bucket_label=lat_bucket_label[idx], latency_bucket_label=lat_bucket_label[idx],
@ -2591,4 +2739,4 @@ def format_info_benchmark_json(config, benchmark_information):
) )
) )
return "\n".join(ainformation) return "\n".join(ainformation) + "\n"


@@ -383,8 +383,8 @@ def vm_state(config, vm, target_state, force=False, wait=False):
"""
params = {
"state": target_state,
-"force": str(force).lower(),
-"wait": str(wait).lower(),
+"force": force,
+"wait": wait,
}
response = call_api(config, "post", "/vm/{vm}/state".format(vm=vm), params=params)
@@ -421,7 +421,7 @@ def vm_node(config, vm, target_node, action, force=False, wait=False, force_live
return retstatus, response.json().get("message", "")
-def vm_locks(config, vm, wait_flag):
+def vm_locks(config, vm, wait_flag=True):
"""
Flush RBD locks of (stopped) VM
@@ -498,6 +498,222 @@ def vm_restore(config, vm, backup_path, backup_datestring, retain_snapshot=False
return True, response.json().get("message", "")
def vm_create_snapshot(config, vm, snapshot_name=None, wait_flag=True):
"""
Take a snapshot of a VM's disks and configuration
API endpoint: POST /vm/{vm}/snapshot
API arguments: snapshot_name=snapshot_name
API schema: {"message":"{data}"}
"""
params = dict()
if snapshot_name is not None:
params["snapshot_name"] = snapshot_name
response = call_api(
config, "post", "/vm/{vm}/snapshot".format(vm=vm), params=params
)
return get_wait_retdata(response, wait_flag)
def vm_remove_snapshot(config, vm, snapshot_name, wait_flag=True):
"""
Remove a snapshot of a VM's disks and configuration
API endpoint: DELETE /vm/{vm}/snapshot
API arguments: snapshot_name=snapshot_name
API schema: {"message":"{data}"}
"""
params = {"snapshot_name": snapshot_name}
response = call_api(
config, "delete", "/vm/{vm}/snapshot".format(vm=vm), params=params
)
return get_wait_retdata(response, wait_flag)
def vm_rollback_snapshot(config, vm, snapshot_name, wait_flag=True):
"""
Roll back to a snapshot of a VM's disks and configuration
API endpoint: POST /vm/{vm}/snapshot/rollback
API arguments: snapshot_name=snapshot_name
API schema: {"message":"{data}"}
"""
params = {"snapshot_name": snapshot_name}
response = call_api(
config, "post", "/vm/{vm}/snapshot/rollback".format(vm=vm), params=params
)
return get_wait_retdata(response, wait_flag)
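A hedged usage sketch (not part of the diff) chaining the new snapshot helpers; "config" is assumed to be the same connection dict the CLI passes to every pvc.lib function, the VM and snapshot names are hypothetical, and get_wait_retdata() is treated here as yielding a (success, detail) pair when waiting:
def snapshot_then_rollback(config, vm="testvm1", name="pre-upgrade"):
    # Take a named snapshot first; bail out if that fails.
    retcode, retdata = vm_create_snapshot(config, vm, snapshot_name=name, wait_flag=True)
    if not retcode:
        return retcode, retdata
    # ... perform the risky change to the VM here ...
    # Roll the VM's disks and configuration back to the snapshot.
    return vm_rollback_snapshot(config, vm, name, wait_flag=True)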
def vm_export_snapshot(
config, vm, snapshot_name, export_path, incremental_parent=None, wait_flag=True
):
"""
Export an (existing) snapshot of a VM's disks and configuration to export_path, optionally
incremental with incremental_parent
API endpoint: POST /vm/{vm}/snapshot/export
API arguments: snapshot_name=snapshot_name, export_path=export_path, incremental_parent=incremental_parent
API schema: {"message":"{data}"}
"""
params = {
"snapshot_name": snapshot_name,
"export_path": export_path,
}
if incremental_parent is not None:
params["incremental_parent"] = incremental_parent
response = call_api(
config, "post", "/vm/{vm}/snapshot/export".format(vm=vm), params=params
)
return get_wait_retdata(response, wait_flag)
def vm_import_snapshot(
config, vm, snapshot_name, import_path, retain_snapshot=False, wait_flag=True
):
"""
Import a snapshot of {vm} and its volumes from a local primary coordinator filesystem path
API endpoint: POST /vm/{vm}/snapshot/import
API arguments: snapshot_name={snapshot_name}, import_path={import_path}, retain_snapshot={retain_snapshot}
API schema: {"message":"{data}"}
"""
params = {
"snapshot_name": snapshot_name,
"import_path": import_path,
"retain_snapshot": retain_snapshot,
}
response = call_api(
config, "post", "/vm/{vm}/snapshot/import".format(vm=vm), params=params
)
return get_wait_retdata(response, wait_flag)
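To show how export and import pair up, a hedged sketch follows; the filesystem path, VM, and snapshot names are hypothetical, and per the docstrings the path is expected to be accessible on the primary coordinator of each cluster:
# Hypothetical values; the path must exist on the primary coordinator.
retcode, retdata = vm_export_snapshot(
    config,
    "webserver1",
    "ab20241101000000",
    "/srv/vm-exports",
    incremental_parent=None,  # set to a prior snapshot name for an incremental export
)
# On the receiving cluster, import from the same filesystem path
retcode, retdata = vm_import_snapshot(
    config,
    "webserver1",
    "ab20241101000000",
    "/srv/vm-exports",
    retain_snapshot=True,  # keep the imported snapshot so later incrementals can apply
)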
def vm_send_snapshot(
config,
vm,
snapshot_name,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=True,
destination_storage_pool=None,
incremental_parent=None,
wait_flag=True,
):
"""
Send an (existing) snapshot of a VM's disks and configuration to a destination PVC cluster, optionally
incremental with incremental_parent
API endpoint: POST /vm/{vm}/snapshot/send
API arguments: snapshot_name=snapshot_name, destination_api_uri=destination_api_uri, destination_api_key=destination_api_key, destination_api_verify_ssl=destination_api_verify_ssl, incremental_parent=incremental_parent, destination_storage_pool=destination_storage_pool
API schema: {"message":"{data}"}
"""
params = {
"snapshot_name": snapshot_name,
"destination_api_uri": destination_api_uri,
"destination_api_key": destination_api_key,
"destination_api_verify_ssl": destination_api_verify_ssl,
}
if destination_storage_pool is not None:
params["destination_storage_pool"] = destination_storage_pool
if incremental_parent is not None:
params["incremental_parent"] = incremental_parent
response = call_api(
config, "post", "/vm/{vm}/snapshot/send".format(vm=vm), params=params
)
return get_wait_retdata(response, wait_flag)
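A hedged example of driving an incremental send to a peer cluster with this helper; the destination URI, API key, pool, and snapshot names are placeholders, and only the keyword arguments mirror the signature above:
# Hypothetical example values.
retcode, retdata = vm_send_snapshot(
    config,
    "webserver1",
    "ab20241101000000",  # snapshot to send
    destination_api_uri="https://pvc-remote.example.com:7370/api/v1",
    destination_api_key="0000-0000-0000",
    destination_api_verify_ssl=True,
    destination_storage_pool="vms",  # optional pool override on the destination
    incremental_parent="ab20241001000000",  # omit for a full send
    wait_flag=True,
)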
def vm_create_mirror(
config,
vm,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=True,
destination_storage_pool=None,
wait_flag=True,
):
"""
Create a new snapshot and send the snapshot to a destination PVC cluster, with automatic incremental handling
API endpoint: POST /vm/{vm}/mirror/create
API arguments: destination_api_uri=destination_api_uri, destination_api_key=destination_api_key, destination_api_verify_ssl=destination_api_verify_ssl, destination_storage_pool=destination_storage_pool
API schema: {"message":"{data}"}
"""
params = {
"destination_api_uri": destination_api_uri,
"destination_api_key": destination_api_key,
"destination_api_verify_ssl": destination_api_verify_ssl,
}
if destination_storage_pool is not None:
params["destination_storage_pool"] = destination_storage_pool
response = call_api(
config, "post", "/vm/{vm}/mirror/create".format(vm=vm), params=params
)
return get_wait_retdata(response, wait_flag)
def vm_promote_mirror(
config,
vm,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=True,
destination_storage_pool=None,
remove_on_source=False,
wait_flag=True,
):
"""
Shut down a VM, create a new snapshot, send the snapshot to a destination PVC cluster, start the VM on the remote cluster, and optionally remove the local VM, with automatic incremental handling
API endpoint: POST /vm/{vm}/mirror/promote
API arguments: destination_api_uri=destination_api_uri, destination_api_key=destination_api_key, destination_api_verify_ssl=destination_api_verify_ssl, destination_storage_pool=destination_storage_pool, remove_on_source=remove_on_source
API schema: {"message":"{data}"}
"""
params = {
"destination_api_uri": destination_api_uri,
"destination_api_key": destination_api_key,
"destination_api_verify_ssl": destination_api_verify_ssl,
"remove_on_source": remove_on_source,
}
if destination_storage_pool is not None:
params["destination_storage_pool"] = destination_storage_pool
response = call_api(
config, "post", "/vm/{vm}/mirror/promote".format(vm=vm), params=params
)
return get_wait_retdata(response, wait_flag)
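Taken together, these two helpers sketch a warm-standby workflow: create mirrors periodically, then promote at cutover. A minimal illustration, with hypothetical destination details:
# Hypothetical destination cluster details.
destination = {
    "destination_api_uri": "https://pvc-dr.example.com:7370/api/v1",
    "destination_api_key": "0000-0000-0000",
    "destination_api_verify_ssl": True,
}
# Keep the remote copy current with periodic mirror snapshots (incrementals are automatic)
retcode, retdata = vm_create_mirror(config, "webserver1", **destination)
# At cutover time: shut down, send a final snapshot, and start the VM on the remote cluster
retcode, retdata = vm_promote_mirror(
    config, "webserver1", remove_on_source=False, **destination
)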
def vm_autobackup(config, email_recipients=None, force_full_flag=False, wait_flag=True):
"""
Perform a cluster VM autobackup
API endpoint: POST /vm/autobackup
API arguments: email_recipients=email_recipients, force_full_flag=force_full_flag
API schema: {"message":"{data}"}
"""
params = {
"email_recipients": email_recipients,
"force_full": force_full_flag,
}
response = call_api(config, "post", "/vm/autobackup", params=params)
return get_wait_retdata(response, wait_flag)
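And a one-call sketch of triggering the cluster autobackup from a client, with a hypothetical recipient address:
# Force a full backup run and email the summary report (recipient is hypothetical)
retcode, retdata = vm_autobackup(
    config,
    email_recipients=["ops@example.com"],
    force_full_flag=True,
    wait_flag=True,
)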
def vm_vcpus_set(config, vm, vcpus, topology, restart): def vm_vcpus_set(config, vm, vcpus, topology, restart):
""" """
Set the vCPU count of the VM with topology Set the vCPU count of the VM with topology
@ -1522,6 +1738,7 @@ def format_info(config, domain_information, long_output):
ansiprint.purple(), ansiprint.end(), domain_information["vcpu"] ansiprint.purple(), ansiprint.end(), domain_information["vcpu"]
) )
) )
if long_output:
ainformation.append( ainformation.append(
"{}Topology (S/C/T):{} {}".format( "{}Topology (S/C/T):{} {}".format(
ansiprint.purple(), ansiprint.end(), domain_information["vcpu_topology"] ansiprint.purple(), ansiprint.end(), domain_information["vcpu_topology"]
@ -1529,22 +1746,32 @@ def format_info(config, domain_information, long_output):
) )
if ( if (
domain_information["vnc"].get("listen", "None") != "None" domain_information["vnc"].get("listen")
and domain_information["vnc"].get("port", "None") != "None" and domain_information["vnc"].get("port")
): ) or long_output:
listen = (
domain_information["vnc"]["listen"]
if domain_information["vnc"].get("listen")
else "N/A"
)
port = (
domain_information["vnc"]["port"]
if domain_information["vnc"].get("port")
else "N/A"
)
ainformation.append("") ainformation.append("")
ainformation.append( ainformation.append(
"{}VNC listen:{} {}".format( "{}VNC listen:{} {}".format(
ansiprint.purple(), ansiprint.end(), domain_information["vnc"]["listen"] ansiprint.purple(), ansiprint.end(), listen
) )
) )
ainformation.append( ainformation.append(
"{}VNC port:{} {}".format( "{}VNC port:{} {}".format(
ansiprint.purple(), ansiprint.end(), domain_information["vnc"]["port"] ansiprint.purple(), ansiprint.end(), port
) )
) )
if long_output is True: if long_output:
# Virtualization information # Virtualization information
ainformation.append("") ainformation.append("")
ainformation.append( ainformation.append(
@ -1633,6 +1860,8 @@ def format_info(config, domain_information, long_output):
"unmigrate": ansiprint.blue(), "unmigrate": ansiprint.blue(),
"provision": ansiprint.blue(), "provision": ansiprint.blue(),
"restore": ansiprint.blue(), "restore": ansiprint.blue(),
"import": ansiprint.blue(),
"mirror": ansiprint.purple(),
} }
ainformation.append( ainformation.append(
"{}State:{} {}{}{}".format( "{}State:{} {}{}{}".format(
@ -1665,12 +1894,18 @@ def format_info(config, domain_information, long_output):
) )
) )
if not domain_information.get("node_selector"): if (
not domain_information.get("node_selector")
or domain_information.get("node_selector") == "None"
):
formatted_node_selector = "Default" formatted_node_selector = "Default"
else: else:
formatted_node_selector = str(domain_information["node_selector"]).title() formatted_node_selector = str(domain_information["node_selector"]).title()
if not domain_information.get("node_limit"): if (
not domain_information.get("node_limit")
or domain_information.get("node_limit") == "None"
):
formatted_node_limit = "Any" formatted_node_limit = "Any"
else: else:
formatted_node_limit = ", ".join(domain_information["node_limit"]) formatted_node_limit = ", ".join(domain_information["node_limit"])
@ -1682,7 +1917,10 @@ def format_info(config, domain_information, long_output):
autostart_colour = ansiprint.green() autostart_colour = ansiprint.green()
formatted_node_autostart = "True" formatted_node_autostart = "True"
if not domain_information.get("migration_method"): if (
not domain_information.get("migration_method")
or domain_information.get("migration_method") == "None"
):
formatted_migration_method = "Live, Shutdown" formatted_migration_method = "Live, Shutdown"
else: else:
formatted_migration_method = ( formatted_migration_method = (
@ -1765,9 +2003,9 @@ def format_info(config, domain_information, long_output):
tags_name=tag["name"], tags_name=tag["name"],
tags_type=tag["type"], tags_type=tag["type"],
tags_protected=str(tag["protected"]), tags_protected=str(tag["protected"]),
tags_protected_colour=ansiprint.green() tags_protected_colour=(
if tag["protected"] ansiprint.green() if tag["protected"] else ansiprint.blue()
else ansiprint.blue(), ),
end=ansiprint.end(), end=ansiprint.end(),
) )
) )
@ -1780,6 +2018,78 @@ def format_info(config, domain_information, long_output):
) )
) )
# Snapshot list
snapshots_name_length = 5
snapshots_age_length = 4
snapshots_xml_changes_length = 12
for snapshot in domain_information.get("snapshots", list()):
xml_diff_plus = 0
xml_diff_minus = 0
for line in snapshot["xml_diff_lines"]:
if re.match(r"^\+ ", line):
xml_diff_plus += 1
elif re.match(r"^- ", line):
xml_diff_minus += 1
xml_diff_counts = f"+{xml_diff_plus}/-{xml_diff_minus}"
_snapshots_name_length = len(snapshot["name"]) + 1
if _snapshots_name_length > snapshots_name_length:
snapshots_name_length = _snapshots_name_length
_snapshots_age_length = len(snapshot["age"]) + 1
if _snapshots_age_length > snapshots_age_length:
snapshots_age_length = _snapshots_age_length
_snapshots_xml_changes_length = len(xml_diff_counts) + 1
if _snapshots_xml_changes_length > snapshots_xml_changes_length:
snapshots_xml_changes_length = _snapshots_xml_changes_length
if len(domain_information.get("snapshots", list())) > 0:
ainformation.append("")
ainformation.append(
"{purple}Snapshots:{end} {bold}{snapshots_name: <{snapshots_name_length}} {snapshots_age: <{snapshots_age_length}} {snapshots_xml_changes: <{snapshots_xml_changes_length}}{end}".format(
purple=ansiprint.purple(),
bold=ansiprint.bold(),
end=ansiprint.end(),
snapshots_name_length=snapshots_name_length,
snapshots_age_length=snapshots_age_length,
snapshots_xml_changes_length=snapshots_xml_changes_length,
snapshots_name="Name",
snapshots_age="Age",
snapshots_xml_changes="XML Changes",
)
)
for snapshot in domain_information.get("snapshots", list()):
xml_diff_plus = 0
xml_diff_minus = 0
for line in snapshot["xml_diff_lines"]:
if re.match(r"^\+ ", line):
xml_diff_plus += 1
elif re.match(r"^- ", line):
xml_diff_minus += 1
xml_diff_counts = f"{ansiprint.green()}+{xml_diff_plus}{ansiprint.end()}/{ansiprint.red()}-{xml_diff_minus}{ansiprint.end()}"
ainformation.append(
" {snapshots_name: <{snapshots_name_length}} {snapshots_age: <{snapshots_age_length}} {snapshots_xml_changes: <{snapshots_xml_changes_length}}{end}".format(
snapshots_name_length=snapshots_name_length,
snapshots_age_length=snapshots_age_length,
snapshots_xml_changes_length=snapshots_xml_changes_length,
snapshots_name=snapshot["name"],
snapshots_age=snapshot["age"],
snapshots_xml_changes=xml_diff_counts,
end=ansiprint.end(),
)
)
else:
ainformation.append("")
ainformation.append(
"{purple}Snapshots:{end} N/A".format(
purple=ansiprint.purple(),
end=ansiprint.end(),
)
)
# Network list # Network list
net_list = [] net_list = []
cluster_net_list = call_api(config, "get", "/network").json() cluster_net_list = call_api(config, "get", "/network").json()
@ -1806,7 +2116,7 @@ def format_info(config, domain_information, long_output):
) )
) )
if long_output is True: if long_output:
# Disk list # Disk list
ainformation.append("") ainformation.append("")
name_length = 0 name_length = 0
@ -1942,6 +2252,7 @@ def format_list(config, vm_list):
vm_name_length = 5 vm_name_length = 5
vm_state_length = 6 vm_state_length = 6
vm_tags_length = 5 vm_tags_length = 5
vm_snapshots_length = 10
vm_nets_length = 9 vm_nets_length = 9
vm_ram_length = 8 vm_ram_length = 8
vm_vcpu_length = 6 vm_vcpu_length = 6
@ -1962,6 +2273,12 @@ def format_list(config, vm_list):
_vm_tags_length = len(",".join(tag_list)) + 1 _vm_tags_length = len(",".join(tag_list)) + 1
if _vm_tags_length > vm_tags_length: if _vm_tags_length > vm_tags_length:
vm_tags_length = _vm_tags_length vm_tags_length = _vm_tags_length
# vm_snapshots column
_vm_snapshots_length = (
len(str(len(domain_information.get("snapshots", list())))) + 1
)
if _vm_snapshots_length > vm_snapshots_length:
vm_snapshots_length = _vm_snapshots_length
# vm_nets column # vm_nets column
_vm_nets_length = len(",".join(net_list)) + 1 _vm_nets_length = len(",".join(net_list)) + 1
if _vm_nets_length > vm_nets_length: if _vm_nets_length > vm_nets_length:
@ -1978,7 +2295,11 @@ def format_list(config, vm_list):
# Format the string (header) # Format the string (header)
vm_list_output.append( vm_list_output.append(
"{bold}{vm_header: <{vm_header_length}} {resource_header: <{resource_header_length}} {node_header: <{node_header_length}}{end_bold}".format( "{bold}{vm_header: <{vm_header_length}} {resource_header: <{resource_header_length}} {node_header: <{node_header_length}}{end_bold}".format(
vm_header_length=vm_name_length + vm_state_length + vm_tags_length + 2, vm_header_length=vm_name_length
+ vm_state_length
+ vm_tags_length
+ vm_snapshots_length
+ 3,
resource_header_length=vm_nets_length + vm_ram_length + vm_vcpu_length + 2, resource_header_length=vm_nets_length + vm_ram_length + vm_vcpu_length + 2,
node_header_length=vm_node_length + vm_migrated_length + 1, node_header_length=vm_node_length + vm_migrated_length + 1,
bold=ansiprint.bold(), bold=ansiprint.bold(),
@ -1988,7 +2309,12 @@ def format_list(config, vm_list):
[ [
"-" "-"
for _ in range( for _ in range(
4, vm_name_length + vm_state_length + vm_tags_length + 1 4,
vm_name_length
+ vm_state_length
+ vm_tags_length
+ vm_snapshots_length
+ 2,
) )
] ]
), ),
@ -2010,6 +2336,7 @@ def format_list(config, vm_list):
"{bold}{vm_name: <{vm_name_length}} \ "{bold}{vm_name: <{vm_name_length}} \
{vm_state_colour}{vm_state: <{vm_state_length}}{end_colour} \ {vm_state_colour}{vm_state: <{vm_state_length}}{end_colour} \
{vm_tags: <{vm_tags_length}} \ {vm_tags: <{vm_tags_length}} \
{vm_snapshots: <{vm_snapshots_length}} \
{vm_networks: <{vm_nets_length}} \ {vm_networks: <{vm_nets_length}} \
{vm_memory: <{vm_ram_length}} {vm_vcpu: <{vm_vcpu_length}} \ {vm_memory: <{vm_ram_length}} {vm_vcpu: <{vm_vcpu_length}} \
{vm_node: <{vm_node_length}} \ {vm_node: <{vm_node_length}} \
@ -2017,6 +2344,7 @@ def format_list(config, vm_list):
vm_name_length=vm_name_length, vm_name_length=vm_name_length,
vm_state_length=vm_state_length, vm_state_length=vm_state_length,
vm_tags_length=vm_tags_length, vm_tags_length=vm_tags_length,
vm_snapshots_length=vm_snapshots_length,
vm_nets_length=vm_nets_length, vm_nets_length=vm_nets_length,
vm_ram_length=vm_ram_length, vm_ram_length=vm_ram_length,
vm_vcpu_length=vm_vcpu_length, vm_vcpu_length=vm_vcpu_length,
@ -2029,6 +2357,7 @@ def format_list(config, vm_list):
vm_name="Name", vm_name="Name",
vm_state="State", vm_state="State",
vm_tags="Tags", vm_tags="Tags",
vm_snapshots="Snapshots",
vm_networks="Networks", vm_networks="Networks",
vm_memory="RAM (M)", vm_memory="RAM (M)",
vm_vcpu="vCPUs", vm_vcpu="vCPUs",
@ -2042,16 +2371,14 @@ def format_list(config, vm_list):
# Format the string (elements) # Format the string (elements)
for domain_information in sorted(vm_list, key=lambda v: v["name"]): for domain_information in sorted(vm_list, key=lambda v: v["name"]):
if domain_information["state"] == "start": if domain_information["state"] in ["start"]:
vm_state_colour = ansiprint.green() vm_state_colour = ansiprint.green()
elif domain_information["state"] == "restart": elif domain_information["state"] in ["restart", "shutdown"]:
vm_state_colour = ansiprint.yellow() vm_state_colour = ansiprint.yellow()
elif domain_information["state"] == "shutdown": elif domain_information["state"] in ["stop", "fail"]:
vm_state_colour = ansiprint.yellow()
elif domain_information["state"] == "stop":
vm_state_colour = ansiprint.red()
elif domain_information["state"] == "fail":
vm_state_colour = ansiprint.red() vm_state_colour = ansiprint.red()
elif domain_information["state"] in ["mirror"]:
vm_state_colour = ansiprint.purple()
else: else:
vm_state_colour = ansiprint.blue() vm_state_colour = ansiprint.blue()
@ -2075,8 +2402,10 @@ def format_list(config, vm_list):
else: else:
net_invalid_list.append(False) net_invalid_list.append(False)
display_net_string_list = []
net_string_list = [] net_string_list = []
for net_idx, net_vni in enumerate(net_list): for net_idx, net_vni in enumerate(net_list):
display_net_string_list.append(net_vni)
if net_invalid_list[net_idx]: if net_invalid_list[net_idx]:
net_string_list.append( net_string_list.append(
"{}{}{}".format( "{}{}{}".format(
@ -2085,9 +2414,6 @@ def format_list(config, vm_list):
ansiprint.end(), ansiprint.end(),
) )
) )
# Fix the length due to the extra fake characters
vm_nets_length -= len(net_vni)
vm_nets_length += len(net_string_list[net_idx])
else: else:
net_string_list.append(net_vni) net_string_list.append(net_vni)
@ -2095,6 +2421,7 @@ def format_list(config, vm_list):
"{bold}{vm_name: <{vm_name_length}} \ "{bold}{vm_name: <{vm_name_length}} \
{vm_state_colour}{vm_state: <{vm_state_length}}{end_colour} \ {vm_state_colour}{vm_state: <{vm_state_length}}{end_colour} \
{vm_tags: <{vm_tags_length}} \ {vm_tags: <{vm_tags_length}} \
{vm_snapshots: <{vm_snapshots_length}} \
{vm_networks: <{vm_nets_length}} \ {vm_networks: <{vm_nets_length}} \
{vm_memory: <{vm_ram_length}} {vm_vcpu: <{vm_vcpu_length}} \ {vm_memory: <{vm_ram_length}} {vm_vcpu: <{vm_vcpu_length}} \
{vm_node: <{vm_node_length}} \ {vm_node: <{vm_node_length}} \
@ -2102,7 +2429,10 @@ def format_list(config, vm_list):
vm_name_length=vm_name_length, vm_name_length=vm_name_length,
vm_state_length=vm_state_length, vm_state_length=vm_state_length,
vm_tags_length=vm_tags_length, vm_tags_length=vm_tags_length,
vm_nets_length=vm_nets_length, vm_snapshots_length=vm_snapshots_length,
vm_nets_length=vm_nets_length
+ len(",".join(net_string_list))
- len(",".join(display_net_string_list)),
vm_ram_length=vm_ram_length, vm_ram_length=vm_ram_length,
vm_vcpu_length=vm_vcpu_length, vm_vcpu_length=vm_vcpu_length,
vm_node_length=vm_node_length, vm_node_length=vm_node_length,
@ -2114,7 +2444,9 @@ def format_list(config, vm_list):
vm_name=domain_information["name"], vm_name=domain_information["name"],
vm_state=domain_information["state"], vm_state=domain_information["state"],
vm_tags=",".join(tag_list), vm_tags=",".join(tag_list),
vm_networks=",".join(net_string_list), vm_snapshots=len(domain_information.get("snapshots", list())),
vm_networks=",".join(net_string_list)
+ ("" if all(net_invalid_list) else " "),
vm_memory=domain_information["memory"], vm_memory=domain_information["memory"],
vm_vcpu=domain_information["vcpu"], vm_vcpu=domain_information["vcpu"],
vm_node=domain_information["node"], vm_node=domain_information["node"],

View File

@ -2,7 +2,7 @@ from setuptools import setup
setup( setup(
name="pvc", name="pvc",
version="0.9.94", version="0.9.103",
packages=["pvc.cli", "pvc.lib"], packages=["pvc.cli", "pvc.lib"],
install_requires=[ install_requires=[
"Click", "Click",

695
daemon-common/autobackup.py Normal file
View File

@ -0,0 +1,695 @@
#!/usr/bin/env python3
# autobackup.py - PVC API Autobackup functions
# Part of the Parallel Virtual Cluster (PVC) system
#
# Copyright (C) 2018-2024 Joshua M. Boniface <joshua@boniface.me>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, version 3.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
###############################################################################
from datetime import datetime
from json import load as jload
from json import dump as jdump
from os import popen, makedirs, path, scandir
from shutil import rmtree
from subprocess import run, PIPE
from daemon_lib.common import run_os_command
from daemon_lib.config import get_autobackup_configuration
from daemon_lib.celery import start, fail, log_info, log_err, update, finish
import daemon_lib.ceph as ceph
import daemon_lib.vm as vm
def send_execution_failure_report(
celery_conf, config, recipients=None, total_time=0, error=None
):
if recipients is None:
return
from email.utils import formatdate
from socket import gethostname
log_message = f"Sending email failure report to {', '.join(recipients)}"
log_info(celery_conf[0], log_message)
update(
celery_conf[0],
log_message,
current=celery_conf[1] + 1,
total=celery_conf[2],
)
current_datetime = datetime.now()
email_datetime = formatdate(float(current_datetime.strftime("%s")))
email = list()
email.append(f"Date: {email_datetime}")
email.append(
f"Subject: PVC Autobackup execution failure for cluster '{config['cluster']}'"
)
email_to = list()
for recipient in recipients:
email_to.append(f"<{recipient}>")
email.append(f"To: {', '.join(email_to)}")
email.append(f"From: PVC Autobackup System <pvc@{gethostname()}>")
email.append("")
email.append(
f"A PVC autobackup has FAILED at {current_datetime} in {total_time}s due to an execution error."
)
email.append("")
email.append("The reported error message is:")
email.append(f" {error}")
try:
with popen("/usr/sbin/sendmail -t", "w") as p:
p.write("\n".join(email))
except Exception as e:
log_err(f"Failed to send report email: {e}")
def send_execution_summary_report(
celery_conf, config, recipients=None, total_time=0, summary=dict()
):
if recipients is None:
return
from email.utils import formatdate
from socket import gethostname
log_message = f"Sending email summary report to {', '.join(recipients)}"
log_info(celery_conf[0], log_message)
update(
celery_conf[0],
log_message,
current=celery_conf[1] + 1,
total=celery_conf[2],
)
current_datetime = datetime.now()
email_datetime = formatdate(float(current_datetime.strftime("%s")))
email = list()
email.append(f"Date: {email_datetime}")
email.append(f"Subject: PVC Autobackup report for cluster '{config['cluster']}'")
email_to = list()
for recipient in recipients:
email_to.append(f"<{recipient}>")
email.append(f"To: {', '.join(email_to)}")
email.append(f"From: PVC Autobackup System <pvc@{gethostname()}>")
email.append("")
email.append(
f"A PVC autobackup has been completed at {current_datetime} in {total_time}."
)
email.append("")
email.append(
"The following is a summary of all current VM backups after cleanups, most recent first:"
)
email.append("")
for vm_name in summary.keys():
email.append(f"VM: {vm_name}:")
for backup in summary[vm_name]:
datestring = backup.get("datestring")
backup_date = datetime.strptime(datestring, "%Y%m%d%H%M%S")
if backup.get("result", False):
email.append(
f" {backup_date}: Success in {backup.get('runtime_secs', 0)} seconds, ID {backup.get('snapshot_name')}, type {backup.get('type', 'unknown')}"
)
email.append(
f" Backup contains {len(backup.get('export_files'))} files totaling {ceph.format_bytes_tohuman(backup.get('export_size_bytes', 0))} ({backup.get('export_size_bytes', 0)} bytes)"
)
else:
email.append(
f" {backup_date}: Failure in {backup.get('runtime_secs', 0)} seconds, ID {backup.get('snapshot_name')}, type {backup.get('type', 'unknown')}"
)
email.append(f" {backup.get('result_message')}")
try:
with popen("/usr/sbin/sendmail -t", "w") as p:
p.write("\n".join(email))
except Exception as e:
log_err(f"Failed to send report email: {e}")
def run_vm_backup(zkhandler, celery, config, vm_detail, force_full=False):
vm_name = vm_detail["name"]
dom_uuid = vm_detail["uuid"]
backup_suffixed_path = f"{config['backup_root_path']}{config['backup_root_suffix']}"
vm_backup_path = f"{backup_suffixed_path}/{vm_name}"
autobackup_state_file = f"{vm_backup_path}/.autobackup.json"
full_interval = config["backup_schedule"]["full_interval"]
full_retention = config["backup_schedule"]["full_retention"]
if not path.exists(vm_backup_path) or not path.exists(autobackup_state_file):
# There are no existing backups so the list is empty
state_data = dict()
tracked_backups = list()
else:
with open(autobackup_state_file) as fh:
state_data = jload(fh)
tracked_backups = state_data["tracked_backups"]
full_backups = [b for b in tracked_backups if b["type"] == "full"]
if len(full_backups) > 0:
last_full_backup = full_backups[0]
last_full_backup_idx = tracked_backups.index(last_full_backup)
if force_full:
this_backup_incremental_parent = None
this_backup_retain_snapshot = True
elif last_full_backup_idx >= full_interval - 1:
this_backup_incremental_parent = None
this_backup_retain_snapshot = True
else:
this_backup_incremental_parent = last_full_backup["snapshot_name"]
this_backup_retain_snapshot = False
else:
# The very first backup must be full to start the tree
this_backup_incremental_parent = None
this_backup_retain_snapshot = True
export_type = (
"incremental" if this_backup_incremental_parent is not None else "full"
)
now = datetime.now()
datestring = now.strftime("%Y%m%d%H%M%S")
snapshot_name = f"ab{datestring}"
# Take the VM snapshot (vm.vm_worker_create_snapshot)
snap_list = list()
failure = False
export_files = None
export_files_size = 0
def update_tracked_backups():
# Read export file to get details
backup_json_file = (
f"{backup_suffixed_path}/{vm_name}/{snapshot_name}/snapshot.json"
)
try:
with open(backup_json_file) as fh:
backup_json = jload(fh)
tracked_backups.insert(0, backup_json)
except Exception as e:
log_err(celery, f"Could not open export JSON: {e}")
return list()
state_data["tracked_backups"] = tracked_backups
with open(autobackup_state_file, "w") as fh:
jdump(state_data, fh)
return tracked_backups
def write_backup_summary(success=False, message=""):
ttotal = (datetime.now() - now).total_seconds()
export_details = {
"type": export_type,
"result": success,
"message": message,
"datestring": datestring,
"runtime_secs": ttotal,
"snapshot_name": snapshot_name,
"incremental_parent": this_backup_incremental_parent,
"vm_detail": vm_detail,
"export_files": export_files,
"export_size_bytes": export_files_size,
}
try:
with open(
f"{backup_suffixed_path}/{vm_name}/{snapshot_name}/snapshot.json",
"w",
) as fh:
jdump(export_details, fh)
except Exception as e:
log_err(celery, f"Error exporting snapshot details: {e}")
return False, e
return True, ""
def cleanup_failure():
for snapshot in snap_list:
rbd, snapshot_name = snapshot.split("@")
pool, volume = rbd.split("/")
# We capture no output here, because if this fails too we're in a deep
# error chain and will just ignore it
ceph.remove_snapshot(zkhandler, pool, volume, snapshot_name)
rbd_list = zkhandler.read(("domain.storage.volumes", dom_uuid)).split(",")
for rbd in rbd_list:
pool, volume = rbd.split("/")
ret, msg = ceph.add_snapshot(
zkhandler, pool, volume, snapshot_name, zk_only=False
)
if not ret:
cleanup_failure()
error_message = msg.replace("ERROR: ", "")
log_err(celery, error_message)
failure = True
break
else:
snap_list.append(f"{pool}/{volume}@{snapshot_name}")
if failure:
error_message = (f"[{vm_name}] Error in snapshot export, skipping",)
write_backup_summary(message=error_message)
tracked_backups = update_tracked_backups()
return tracked_backups
# Get the current domain XML
vm_config = zkhandler.read(("domain.xml", dom_uuid))
# Add the snapshot entry to Zookeeper
ret = zkhandler.write(
[
(
(
"domain.snapshots",
dom_uuid,
"domain_snapshot.name",
snapshot_name,
),
snapshot_name,
),
(
(
"domain.snapshots",
dom_uuid,
"domain_snapshot.timestamp",
snapshot_name,
),
now.strftime("%s"),
),
(
(
"domain.snapshots",
dom_uuid,
"domain_snapshot.xml",
snapshot_name,
),
vm_config,
),
(
(
"domain.snapshots",
dom_uuid,
"domain_snapshot.rbd_snapshots",
snapshot_name,
),
",".join(snap_list),
),
]
)
if not ret:
error_message = (f"[{vm_name}] Error in snapshot export, skipping",)
log_err(celery, error_message)
write_backup_summary(message=error_message)
tracked_backups = update_tracked_backups()
return tracked_backups
# Export the snapshot (vm.vm_worker_export_snapshot)
export_target_path = f"{backup_suffixed_path}/{vm_name}/{snapshot_name}/images"
try:
makedirs(export_target_path)
except Exception as e:
error_message = (
f"[{vm_name}] Failed to create target directory '{export_target_path}': {e}",
)
log_err(celery, error_message)
return tracked_backups
def export_cleanup():
from shutil import rmtree
rmtree(f"{backup_suffixed_path}/{vm_name}/{snapshot_name}")
# Set the export filetype
if this_backup_incremental_parent is not None:
export_fileext = "rbddiff"
else:
export_fileext = "rbdimg"
snapshot_volumes = list()
for rbdsnap in snap_list:
pool, _volume = rbdsnap.split("/")
volume, name = _volume.split("@")
ret, snapshots = ceph.get_list_snapshot(
zkhandler, pool, volume, limit=name, is_fuzzy=False
)
if ret:
snapshot_volumes += snapshots
export_files = list()
for snapshot_volume in snapshot_volumes:
snap_pool = snapshot_volume["pool"]
snap_volume = snapshot_volume["volume"]
snap_snapshot_name = snapshot_volume["snapshot"]
snap_size = snapshot_volume["stats"]["size"]
if this_backup_incremental_parent is not None:
retcode, stdout, stderr = run_os_command(
f"rbd export-diff --from-snap {this_backup_incremental_parent} {snap_pool}/{snap_volume}@{snap_snapshot_name} {export_target_path}/{snap_pool}.{snap_volume}.{export_fileext}"
)
if retcode:
error_message = (
f"[{vm_name}] Failed to export snapshot for volume(s) '{snap_pool}/{snap_volume}'",
)
failure = True
break
else:
export_files.append(
(
f"images/{snap_pool}.{snap_volume}.{export_fileext}",
snap_size,
)
)
else:
retcode, stdout, stderr = run_os_command(
f"rbd export --export-format 2 {snap_pool}/{snap_volume}@{snap_snapshot_name} {export_target_path}/{snap_pool}.{snap_volume}.{export_fileext}"
)
if retcode:
error_message = (
f"[{vm_name}] Failed to export snapshot for volume(s) '{snap_pool}/{snap_volume}'",
)
failure = True
break
else:
export_files.append(
(
f"images/{snap_pool}.{snap_volume}.{export_fileext}",
snap_size,
)
)
if failure:
log_err(celery, error_message)
write_backup_summary(message=error_message)
tracked_backups = update_tracked_backups()
return tracked_backups
def get_dir_size(pathname):
total = 0
with scandir(pathname) as it:
for entry in it:
if entry.is_file():
total += entry.stat().st_size
elif entry.is_dir():
total += get_dir_size(entry.path)
return total
export_files_size = get_dir_size(export_target_path)
ret, e = write_backup_summary(success=True)
if not ret:
error_message = (f"[{vm_name}] Failed to export configuration snapshot: {e}",)
log_err(celery, error_message)
write_backup_summary(message=error_message)
tracked_backups = update_tracked_backups()
return tracked_backups
# Clean up the snapshot (vm.vm_worker_remove_snapshot)
if not this_backup_retain_snapshot:
for snap in snap_list:
rbd, name = snap.split("@")
pool, volume = rbd.split("/")
ret, msg = ceph.remove_snapshot(zkhandler, pool, volume, name)
if not ret:
error_message = msg.replace("ERROR: ", f"[{vm_name}] ")
failure = True
break
if failure:
log_err(celery, error_message)
write_backup_summary(message=error_message)
tracked_backups = update_tracked_backups()
return tracked_backups
ret = zkhandler.delete(
("domain.snapshots", dom_uuid, "domain_snapshot.name", snapshot_name)
)
if not ret:
error_message = (f"[{vm_name}] Failed to remove VM snapshot; continuing",)
log_err(celery, error_message)
marked_for_deletion = list()
# Find any full backups that are expired
found_full_count = 0
for backup in tracked_backups:
if backup["type"] == "full":
found_full_count += 1
if found_full_count > full_retention:
marked_for_deletion.append(backup)
# Find any incremental backups that depend on marked parents
for backup in tracked_backups:
if backup["type"] == "incremental" and backup["incremental_parent"] in [
b["snapshot_name"] for b in marked_for_deletion
]:
marked_for_deletion.append(backup)
if len(marked_for_deletion) > 0:
for backup_to_delete in marked_for_deletion:
ret = vm.vm_worker_remove_snapshot(
zkhandler, None, vm_name, backup_to_delete["snapshot_name"]
)
if ret is False:
error_message = f"Failed to remove obsolete backup snapshot '{backup_to_delete['snapshot_name']}', leaving in tracked backups"
log_err(celery, error_message)
else:
rmtree(f"{vm_backup_path}/{backup_to_delete['snapshot_name']}")
tracked_backups.remove(backup_to_delete)
tracked_backups = update_tracked_backups()
return tracked_backups
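Restating the scheduling rule used above in isolation may help: a run is full when forced, when no prior full exists, or when the most recent full is at least full_interval backups old; otherwise it is incremental against that full. A self-contained sketch of just that decision (values hypothetical):
def choose_backup_type(tracked_backups, full_interval, force_full=False):
    # tracked_backups is newest-first, as maintained by update_tracked_backups()
    full_backups = [b for b in tracked_backups if b["type"] == "full"]
    if force_full or not full_backups:
        return "full", None
    last_full = full_backups[0]
    if tracked_backups.index(last_full) >= full_interval - 1:
        return "full", None
    return "incremental", last_full["snapshot_name"]

# With full_interval=7, six incrementals after a full mean the next run is full again
backups = [{"type": "incremental", "snapshot_name": f"ab{i}"} for i in range(6)]
backups.append({"type": "full", "snapshot_name": "ab6"})
print(choose_backup_type(backups, full_interval=7))  # -> ("full", None)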
def worker_cluster_autobackup(
zkhandler, celery, force_full=False, email_recipients=None
):
config = get_autobackup_configuration()
backup_summary = dict()
current_stage = 0
total_stages = 1
if email_recipients is not None:
total_stages += 1
start(
celery,
f"Starting cluster '{config['cluster']}' VM autobackup",
current=current_stage,
total=total_stages,
)
if not config["autobackup_enabled"]:
message = "Autobackups are not configured on this cluster."
log_info(celery, message)
return finish(
celery,
message,
current=total_stages,
total=total_stages,
)
autobackup_start_time = datetime.now()
retcode, vm_list = vm.get_list(zkhandler)
if not retcode:
error_message = f"Failed to fetch VM list: {vm_list}"
log_err(celery, error_message)
send_execution_failure_report(
(celery, current_stage, total_stages),
config,
recipients=email_recipients,
error=error_message,
)
fail(celery, error_message)
return False
backup_suffixed_path = f"{config['backup_root_path']}{config['backup_root_suffix']}"
if not path.exists(backup_suffixed_path):
makedirs(backup_suffixed_path)
full_interval = config["backup_schedule"]["full_interval"]
backup_vms = list()
for vm_detail in vm_list:
vm_tag_names = [t["name"] for t in vm_detail["tags"]]
matching_tags = (
True
if len(set(vm_tag_names).intersection(set(config["backup_tags"]))) > 0
else False
)
if matching_tags:
backup_vms.append(vm_detail)
if len(backup_vms) < 1:
message = "Found no VMs tagged for autobackup."
log_info(celery, message)
return finish(
celery,
message,
current=total_stages,
total=total_stages,
)
if config["auto_mount_enabled"]:
total_stages += len(config["mount_cmds"])
total_stages += len(config["unmount_cmds"])
total_stages += len(backup_vms)
log_info(
celery,
f"Found {len(backup_vms)} suitable VM(s) for autobackup: {', '.join([b['name'] for b in backup_vms])}",
)
# Handle automount mount commands
if config["auto_mount_enabled"]:
for cmd in config["mount_cmds"]:
current_stage += 1
update(
celery,
f"Executing mount command '{cmd.split()[0]}'",
current=current_stage,
total=total_stages,
)
ret = run(
cmd.split(),
stdout=PIPE,
stderr=PIPE,
)
if ret.returncode != 0:
error_message = f"Failed to execute mount command '{cmd.split()[0]}': {ret.stderr.decode().strip()}"
log_err(celery, error_message)
send_execution_failure_report(
(celery, current_stage, total_stages),
config,
recipients=email_recipients,
total_time=datetime.now() - autobackup_start_time,
error=error_message,
)
fail(celery, error_message)
return False
# Execute the backup: take a snapshot, then export the snapshot
for vm_detail in backup_vms:
vm_backup_path = f"{backup_suffixed_path}/{vm_detail['name']}"
autobackup_state_file = f"{vm_backup_path}/.autobackup.json"
if not path.exists(vm_backup_path) or not path.exists(autobackup_state_file):
# There are no existing backups so the list is empty
state_data = dict()
tracked_backups = list()
else:
with open(autobackup_state_file) as fh:
state_data = jload(fh)
tracked_backups = state_data["tracked_backups"]
full_backups = [b for b in tracked_backups if b["type"] == "full"]
if len(full_backups) > 0:
last_full_backup = full_backups[0]
last_full_backup_idx = tracked_backups.index(last_full_backup)
if force_full:
this_backup_incremental_parent = None
elif last_full_backup_idx >= full_interval - 1:
this_backup_incremental_parent = None
else:
this_backup_incremental_parent = last_full_backup["snapshot_name"]
else:
# The very first backup must be full to start the tree
this_backup_incremental_parent = None
export_type = (
"incremental" if this_backup_incremental_parent is not None else "full"
)
current_stage += 1
update(
celery,
f"Performing autobackup of VM {vm_detail['name']} ({export_type})",
current=current_stage,
total=total_stages,
)
summary = run_vm_backup(
zkhandler,
celery,
config,
vm_detail,
force_full=force_full,
)
backup_summary[vm_detail["name"]] = summary
# Handle automount unmount commands
if config["auto_mount_enabled"]:
for cmd in config["unmount_cmds"]:
current_stage += 1
update(
celery,
f"Executing unmount command '{cmd.split()[0]}'",
current=current_stage,
total=total_stages,
)
ret = run(
cmd.split(),
stdout=PIPE,
stderr=PIPE,
)
if ret.returncode != 0:
error_message = f"Failed to execute unmount command '{cmd.split()[0]}': {ret.stderr.decode().strip()}"
log_err(celery, error_message)
send_execution_failure_report(
(celery, current_stage, total_stages),
config,
recipients=email_recipients,
total_time=datetime.now() - autobackup_start_time,
error=error_message,
)
fail(celery, error_message)
return False
autobackup_end_time = datetime.now()
autobackup_total_time = autobackup_end_time - autobackup_start_time
if email_recipients is not None:
send_execution_summary_report(
(celery, current_stage, total_stages),
config,
recipients=email_recipients,
total_time=autobackup_total_time,
summary=backup_summary,
)
current_stage += 1
current_stage += 1
return finish(
celery,
f"Successfully completed cluster '{config['cluster']}' VM autobackup",
current=current_stage,
total=total_stages,
)
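For reference, the configuration keys this worker consumes (as read in the code above), shown as a hypothetical Python mapping; real deployments load this via get_autobackup_configuration(), and the example values are assumptions, not taken from this diff:
autobackup_config_example = {
    "autobackup_enabled": True,
    "cluster": "cluster1",  # hypothetical cluster name
    "backup_root_path": "/srv/backups",  # hypothetical path
    "backup_root_suffix": "/pvc-autobackup",
    "backup_tags": ["autobackup"],  # VMs carrying any of these tags are backed up
    "backup_schedule": {
        "full_interval": 7,  # every Nth backup in a chain is full
        "full_retention": 2,  # number of full chains kept before pruning
    },
    "auto_mount_enabled": False,
    "mount_cmds": [],  # shell commands run before the backups
    "unmount_cmds": [],  # shell commands run after the backups
}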

View File

@ -19,31 +19,34 @@
# #
############################################################################### ###############################################################################
import os
import psutil
import psycopg2 import psycopg2
import psycopg2.extras import psycopg2.extras
import subprocess
from datetime import datetime from datetime import datetime
from json import loads, dumps from json import loads, dumps
from time import sleep
from daemon_lib.celery import start, fail, log_info, update, finish from daemon_lib.celery import start, fail, log_info, update, finish
import daemon_lib.common as pvc_common
import daemon_lib.ceph as pvc_ceph import daemon_lib.ceph as pvc_ceph
# Define the current test format # Define the current test format
TEST_FORMAT = 1 TEST_FORMAT = 2
# We run a total of 8 tests, to give a generalized idea of performance on the cluster: # We run a total of 8 tests, to give a generalized idea of performance on the cluster:
# 1. A sequential read test of 8GB with a 4M block size # 1. A sequential read test of 64GB with a 4M block size
# 2. A sequential write test of 8GB with a 4M block size # 2. A sequential write test of 64GB with a 4M block size
# 3. A random read test of 8GB with a 4M block size # 3. A random read test of 64GB with a 4M block size
# 4. A random write test of 8GB with a 4M block size # 4. A random write test of 64GB with a 4M block size
# 5. A random read test of 8GB with a 256k block size # 5. A random read test of 64GB with a 256k block size
# 6. A random write test of 8GB with a 256k block size # 6. A random write test of 64GB with a 256k block size
# 7. A random read test of 8GB with a 4k block size # 7. A random read test of 64GB with a 4k block size
# 8. A random write test of 8GB with a 4k block size # 8. A random write test of 64GB with a 4k block size
# Taken together, these 8 results should give a very good indication of the overall storage performance # Taken together, these 8 results should give a very good indication of the overall storage performance
# for a variety of workloads. # for a variety of workloads.
test_matrix = { test_matrix = {
@ -100,7 +103,7 @@ test_matrix = {
# Specify the benchmark volume name and size # Specify the benchmark volume name and size
benchmark_volume_name = "pvcbenchmark" benchmark_volume_name = "pvcbenchmark"
benchmark_volume_size = "8G" benchmark_volume_size = "64G"
# #
@ -115,9 +118,10 @@ class BenchmarkError(Exception):
# #
def cleanup(job_name, db_conn=None, db_cur=None, zkhandler=None): def cleanup(job_name, db_conn=None, db_cur=None, zkhandler=None, final=False):
if db_conn is not None and db_cur is not None: if db_conn is not None and db_cur is not None:
# Clean up our dangling result if not final:
# Clean up our dangling result (non-final runs only)
query = "DELETE FROM storage_benchmarks WHERE job = %s;" query = "DELETE FROM storage_benchmarks WHERE job = %s;"
args = (job_name,) args = (job_name,)
db_cur.execute(query, args) db_cur.execute(query, args)
@ -225,7 +229,7 @@ def cleanup_benchmark_volume(
def run_benchmark_job( def run_benchmark_job(
test, pool, job_name=None, db_conn=None, db_cur=None, zkhandler=None config, test, pool, job_name=None, db_conn=None, db_cur=None, zkhandler=None
): ):
test_spec = test_matrix[test] test_spec = test_matrix[test]
log_info(None, f"Running test '{test}'") log_info(None, f"Running test '{test}'")
@ -255,31 +259,165 @@ def run_benchmark_job(
) )
log_info(None, "Running fio job: {}".format(" ".join(fio_cmd.split()))) log_info(None, "Running fio job: {}".format(" ".join(fio_cmd.split())))
retcode, stdout, stderr = pvc_common.run_os_command(fio_cmd)
# Run the fio command manually instead of using our run_os_command wrapper
# This will help us gather statistics about this node while it's running
process = subprocess.Popen(
fio_cmd.split(),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
)
# Wait 15 seconds for the test to start
log_info(None, "Waiting 15 seconds for test resource stabilization")
sleep(15)
# Set up function to get process CPU utilization by name
def get_cpu_utilization_by_name(process_name):
cpu_usage = 0
for proc in psutil.process_iter(["name", "cpu_percent"]):
if proc.info["name"] == process_name:
cpu_usage += proc.info["cpu_percent"]
return cpu_usage
# Set up function to get process memory utilization by name
def get_memory_utilization_by_name(process_name):
memory_usage = 0
for proc in psutil.process_iter(["name", "memory_percent"]):
if proc.info["name"] == process_name:
memory_usage += proc.info["memory_percent"]
return memory_usage
# Set up function to get network traffic utilization in bps
def get_network_traffic_bps(interface, duration=1):
# Get initial network counters
net_io_start = psutil.net_io_counters(pernic=True)
if interface not in net_io_start:
return None, None, None  # match the three-value shape unpacked by callers
stats_start = net_io_start[interface]
bytes_sent_start = stats_start.bytes_sent
bytes_recv_start = stats_start.bytes_recv
# Wait for the specified duration
sleep(duration)
# Get final network counters
net_io_end = psutil.net_io_counters(pernic=True)
stats_end = net_io_end[interface]
bytes_sent_end = stats_end.bytes_sent
bytes_recv_end = stats_end.bytes_recv
# Calculate bytes per second
bytes_sent_per_sec = (bytes_sent_end - bytes_sent_start) / duration
bytes_recv_per_sec = (bytes_recv_end - bytes_recv_start) / duration
# Convert to bits per second (bps)
bits_sent_per_sec = bytes_sent_per_sec * 8
bits_recv_per_sec = bytes_recv_per_sec * 8
bits_total_per_sec = bits_sent_per_sec + bits_recv_per_sec
return bits_sent_per_sec, bits_recv_per_sec, bits_total_per_sec
log_info(None, f"Starting system resource polling for test '{test}'")
storage_interface = config["storage_dev"]
total_cpus = psutil.cpu_count(logical=True)
ticks = 1
osd_cpu_utilization = 0
osd_memory_utilization = 0
mon_cpu_utilization = 0
mon_memory_utilization = 0
total_cpu_utilization = 0
total_memory_utilization = 0
storage_sent_bps = 0
storage_recv_bps = 0
storage_total_bps = 0
while process.poll() is None:
# Do collection of statistics like network bandwidth and cpu utilization
current_osd_cpu_utilization = get_cpu_utilization_by_name("ceph-osd")
current_osd_memory_utilization = get_memory_utilization_by_name("ceph-osd")
current_mon_cpu_utilization = get_cpu_utilization_by_name("ceph-mon")
current_mon_memory_utilization = get_memory_utilization_by_name("ceph-mon")
current_total_cpu_utilization = psutil.cpu_percent(interval=1)
current_total_memory_utilization = psutil.virtual_memory().percent
(
current_storage_sent_bps,
current_storage_recv_bps,
current_storage_total_bps,
) = get_network_traffic_bps(storage_interface)
# Recheck if the process is done yet; if it's not, we add the values and increase the ticks
# This helps ensure that if the process finishes earlier than the longer polls above,
# this particular tick isn't counted which can skew the average
if process.poll() is None:
osd_cpu_utilization += current_osd_cpu_utilization
osd_memory_utilization += current_osd_memory_utilization
mon_cpu_utilization += current_mon_cpu_utilization
mon_memory_utilization += current_mon_memory_utilization
total_cpu_utilization += current_total_cpu_utilization
total_memory_utilization += current_total_memory_utilization
storage_sent_bps += current_storage_sent_bps
storage_recv_bps += current_storage_recv_bps
storage_total_bps += current_storage_total_bps
ticks += 1
# Get the 1-minute load average and CPU utilization, which covers the test duration
load1, _, _ = os.getloadavg()
load1 = round(load1, 2)
# Calculate the average CPU utilization values over the runtime
# Divide the OSD and MON CPU utilization by the total number of CPU cores, because
# the total is divided this way
avg_osd_cpu_utilization = round(osd_cpu_utilization / ticks / total_cpus, 2)
avg_osd_memory_utilization = round(osd_memory_utilization / ticks, 2)
avg_mon_cpu_utilization = round(mon_cpu_utilization / ticks / total_cpus, 2)
avg_mon_memory_utilization = round(mon_memory_utilization / ticks, 2)
avg_total_cpu_utilization = round(total_cpu_utilization / ticks, 2)
avg_total_memory_utilization = round(total_memory_utilization / ticks, 2)
avg_storage_sent_bps = round(storage_sent_bps / ticks, 2)
avg_storage_recv_bps = round(storage_recv_bps / ticks, 2)
avg_storage_total_bps = round(storage_total_bps / ticks, 2)
stdout, stderr = process.communicate()
retcode = process.returncode
resource_data = {
"avg_cpu_util_percent": {
"total": avg_total_cpu_utilization,
"ceph-mon": avg_mon_cpu_utilization,
"ceph-osd": avg_osd_cpu_utilization,
},
"avg_memory_util_percent": {
"total": avg_total_memory_utilization,
"ceph-mon": avg_mon_memory_utilization,
"ceph-osd": avg_osd_memory_utilization,
},
"avg_network_util_bps": {
"sent": avg_storage_sent_bps,
"recv": avg_storage_recv_bps,
"total": avg_storage_total_bps,
},
}
try: try:
jstdout = loads(stdout) jstdout = loads(stdout)
if retcode: if retcode:
raise raise
except Exception: except Exception:
cleanup( return None, None
job_name,
db_conn=db_conn,
db_cur=db_cur,
zkhandler=zkhandler,
)
fail(
None,
f"Failed to run fio test '{test}': {stderr}",
)
return jstdout return resource_data, jstdout
def worker_run_benchmark(zkhandler, celery, config, pool): def worker_run_benchmark(zkhandler, celery, config, pool, name):
# Phase 0 - connect to databases # Phase 0 - connect to databases
if not name:
cur_time = datetime.now().isoformat(timespec="seconds") cur_time = datetime.now().isoformat(timespec="seconds")
cur_primary = zkhandler.read("base.config.primary_node") cur_primary = zkhandler.read("base.config.primary_node")
job_name = f"{cur_time}_{cur_primary}" job_name = f"{cur_time}_{cur_primary}"
else:
job_name = name
current_stage = 0 current_stage = 0
total_stages = 13 total_stages = 13
@ -357,7 +495,8 @@ def worker_run_benchmark(zkhandler, celery, config, pool):
total=total_stages, total=total_stages,
) )
results[test] = run_benchmark_job( resource_data, fio_data = run_benchmark_job(
config,
test, test,
pool, pool,
job_name=job_name, job_name=job_name,
@ -365,6 +504,25 @@ def worker_run_benchmark(zkhandler, celery, config, pool):
db_cur=db_cur, db_cur=db_cur,
zkhandler=zkhandler, zkhandler=zkhandler,
) )
if resource_data is None or fio_data is None:
cleanup_benchmark_volume(
pool,
job_name=job_name,
db_conn=db_conn,
db_cur=db_cur,
zkhandler=zkhandler,
)
cleanup(
job_name,
db_conn=db_conn,
db_cur=db_cur,
zkhandler=zkhandler,
)
fail(
None,
f"Failed to run fio test '{test}'",
)
results[test] = {**resource_data, **fio_data}
# Phase 3 - cleanup # Phase 3 - cleanup
current_stage += 1 current_stage += 1
@ -410,6 +568,7 @@ def worker_run_benchmark(zkhandler, celery, config, pool):
db_conn=db_conn, db_conn=db_conn,
db_cur=db_cur, db_cur=db_cur,
zkhandler=zkhandler, zkhandler=zkhandler,
final=True,
) )
current_stage += 1 current_stage += 1

View File

@ -320,7 +320,11 @@ def get_list_osd(zkhandler, limit=None, is_fuzzy=True):
# #
def getPoolInformation(zkhandler, pool): def getPoolInformation(zkhandler, pool):
# Parse the stats data # Parse the stats data
(pool_stats_raw, tier, pgs,) = zkhandler.read_many( (
pool_stats_raw,
tier,
pgs,
) = zkhandler.read_many(
[ [
("pool.stats", pool), ("pool.stats", pool),
("pool.tier", pool), ("pool.tier", pool),
@ -536,7 +540,10 @@ def getCephVolumes(zkhandler, pool):
pool_list = [pool] pool_list = [pool]
for pool_name in pool_list: for pool_name in pool_list:
for volume_name in zkhandler.children(("volume", pool_name)): children = zkhandler.children(("volume", pool_name))
if children is None:
continue
for volume_name in children:
volume_list.append("{}/{}".format(pool_name, volume_name)) volume_list.append("{}/{}".format(pool_name, volume_name))
return volume_list return volume_list
@ -553,7 +560,21 @@ def getVolumeInformation(zkhandler, pool, volume):
return volume_information return volume_information
def add_volume(zkhandler, pool, name, size, force_flag=False): def scan_volume(zkhandler, pool, name):
retcode, stdout, stderr = common.run_os_command(
"rbd info --format json {}/{}".format(pool, name)
)
volstats = stdout
# Update the volume stats in Zookeeper
zkhandler.write(
[
(("volume.stats", f"{pool}/{name}"), volstats),
]
)
def add_volume(zkhandler, pool, name, size, force_flag=False, zk_only=False):
# 1. Verify the size of the volume # 1. Verify the size of the volume
pool_information = getPoolInformation(zkhandler, pool) pool_information = getPoolInformation(zkhandler, pool)
size_bytes = format_bytes_fromhuman(size) size_bytes = format_bytes_fromhuman(size)
@ -585,27 +606,28 @@ def add_volume(zkhandler, pool, name, size, force_flag=False):
) )
# 2. Create the volume # 2. Create the volume
# zk_only flag skips actually creating the volume - this would be done by some other mechanism
if not zk_only:
retcode, stdout, stderr = common.run_os_command( retcode, stdout, stderr = common.run_os_command(
"rbd create --size {}B {}/{}".format(size_bytes, pool, name) "rbd create --size {}B {}/{}".format(size_bytes, pool, name)
) )
if retcode: if retcode:
return False, 'ERROR: Failed to create RBD volume "{}": {}'.format(name, stderr) return False, 'ERROR: Failed to create RBD volume "{}": {}'.format(
name, stderr
# 2. Get volume stats
retcode, stdout, stderr = common.run_os_command(
"rbd info --format json {}/{}".format(pool, name)
) )
volstats = stdout
# 3. Add the new volume to Zookeeper # 3. Add the new volume to Zookeeper
zkhandler.write( zkhandler.write(
[ [
(("volume", f"{pool}/{name}"), ""), (("volume", f"{pool}/{name}"), ""),
(("volume.stats", f"{pool}/{name}"), volstats), (("volume.stats", f"{pool}/{name}"), ""),
(("snapshot", f"{pool}/{name}"), ""), (("snapshot", f"{pool}/{name}"), ""),
] ]
) )
# 4. Scan the volume stats
scan_volume(zkhandler, pool, name)
return True, 'Created RBD volume "{}" of size "{}" in pool "{}".'.format( return True, 'Created RBD volume "{}" of size "{}" in pool "{}".'.format(
name, format_bytes_tohuman(size_bytes), pool name, format_bytes_tohuman(size_bytes), pool
) )
@ -655,21 +677,18 @@ def clone_volume(zkhandler, pool, name_src, name_new, force_flag=False):
), ),
) )
# 3. Get volume stats # 3. Add the new volume to Zookeeper
retcode, stdout, stderr = common.run_os_command(
"rbd info --format json {}/{}".format(pool, name_new)
)
volstats = stdout
# 4. Add the new volume to Zookeeper
zkhandler.write( zkhandler.write(
[ [
(("volume", f"{pool}/{name_new}"), ""), (("volume", f"{pool}/{name_new}"), ""),
(("volume.stats", f"{pool}/{name_new}"), volstats), (("volume.stats", f"{pool}/{name_new}"), ""),
(("snapshot", f"{pool}/{name_new}"), ""), (("snapshot", f"{pool}/{name_new}"), ""),
] ]
) )
# 4. Scan the volume stats
scan_volume(zkhandler, pool, name_new)
return True, 'Cloned RBD volume "{}" to "{}" in pool "{}"'.format( return True, 'Cloned RBD volume "{}" to "{}" in pool "{}"'.format(
name_src, name_new, pool name_src, name_new, pool
) )
@ -754,20 +773,8 @@ def resize_volume(zkhandler, pool, name, size, force_flag=False):
except Exception: except Exception:
pass pass
# 4. Get volume stats # 4. Scan the volume stats
retcode, stdout, stderr = common.run_os_command( scan_volume(zkhandler, pool, name)
"rbd info --format json {}/{}".format(pool, name)
)
volstats = stdout
# 5. Update the volume in Zookeeper
zkhandler.write(
[
(("volume", f"{pool}/{name}"), ""),
(("volume.stats", f"{pool}/{name}"), volstats),
(("snapshot", f"{pool}/{name}"), ""),
]
)
return True, 'Resized RBD volume "{}" to size "{}" in pool "{}".'.format( return True, 'Resized RBD volume "{}" to size "{}" in pool "{}".'.format(
name, format_bytes_tohuman(size_bytes), pool name, format_bytes_tohuman(size_bytes), pool
@ -800,18 +807,8 @@ def rename_volume(zkhandler, pool, name, new_name):
] ]
) )
# 3. Get volume stats # 3. Scan the volume stats
retcode, stdout, stderr = common.run_os_command( scan_volume(zkhandler, pool, new_name)
"rbd info --format json {}/{}".format(pool, new_name)
)
volstats = stdout
# 4. Update the volume stats in Zookeeper
zkhandler.write(
[
(("volume.stats", f"{pool}/{new_name}"), volstats),
]
)
return True, 'Renamed RBD volume "{}" to "{}" in pool "{}".'.format( return True, 'Renamed RBD volume "{}" to "{}" in pool "{}".'.format(
name, new_name, pool name, new_name, pool
@ -824,10 +821,22 @@ def remove_volume(zkhandler, pool, name):
name, pool name, pool
) )
# 1. Remove volume snapshots # 1a. Remove PVC-managed volume snapshots
for snapshot in zkhandler.children(("snapshot", f"{pool}/{name}")): for snapshot in zkhandler.children(("snapshot", f"{pool}/{name}")):
remove_snapshot(zkhandler, pool, name, snapshot) remove_snapshot(zkhandler, pool, name, snapshot)
# 1b. Purge any remaining volume snapshots
retcode, stdout, stderr = common.run_os_command(
"rbd snap purge {}/{}".format(pool, name)
)
if retcode:
return (
False,
'ERROR: Failed to purge snapshots from RBD volume "{}" in pool "{}": {}'.format(
name, pool, stderr
),
)
# 2. Remove the volume # 2. Remove the volume
retcode, stdout, stderr = common.run_os_command("rbd rm {}/{}".format(pool, name)) retcode, stdout, stderr = common.run_os_command("rbd rm {}/{}".format(pool, name))
if retcode: if retcode:
@ -996,23 +1005,27 @@ def add_snapshot(zkhandler, pool, volume, name, zk_only=False):
), ),
) )
# 2. Add the snapshot to Zookeeper # 2. Get snapshot stats
retcode, stdout, stderr = common.run_os_command(
"rbd info --format json {}/{}@{}".format(pool, volume, name)
)
snapstats = stdout
# 3. Add the snapshot to Zookeeper
zkhandler.write( zkhandler.write(
[ [
(("snapshot", f"{pool}/{volume}/{name}"), ""), (("snapshot", f"{pool}/{volume}/{name}"), ""),
(("snapshot.stats", f"{pool}/{volume}/{name}"), "{}"), (("snapshot.stats", f"{pool}/{volume}/{name}"), snapstats),
] ]
) )
# 3. Update the count of snapshots on this volume # 4. Update the count of snapshots on this volume
volume_stats_raw = zkhandler.read(("volume.stats", f"{pool}/{volume}")) volume_stats_raw = zkhandler.read(("volume.stats", f"{pool}/{volume}"))
volume_stats = dict(json.loads(volume_stats_raw)) volume_stats = dict(json.loads(volume_stats_raw))
# Format the size to something nicer
volume_stats["snapshot_count"] = volume_stats["snapshot_count"] + 1 volume_stats["snapshot_count"] = volume_stats["snapshot_count"] + 1
volume_stats_raw = json.dumps(volume_stats)
zkhandler.write( zkhandler.write(
[ [
(("volume.stats", f"{pool}/{volume}"), volume_stats_raw), (("volume.stats", f"{pool}/{volume}"), json.dumps(volume_stats)),
] ]
) )
@ -1066,6 +1079,36 @@ def rename_snapshot(zkhandler, pool, volume, name, new_name):
) )
def rollback_snapshot(zkhandler, pool, volume, name):
if not verifyVolume(zkhandler, pool, volume):
return False, 'ERROR: No volume with name "{}" is present in pool "{}".'.format(
volume, pool
)
if not verifySnapshot(zkhandler, pool, volume, name):
return (
False,
'ERROR: No snapshot with name "{}" is present for volume "{}" in pool "{}".'.format(
name, volume, pool
),
)
# 1. Roll back the snapshot
retcode, stdout, stderr = common.run_os_command(
"rbd snap rollback {}/{}@{}".format(pool, volume, name)
)
if retcode:
return (
False,
'ERROR: Failed to roll back RBD volume "{}" in pool "{}" to snapshot "{}": {}'.format(
volume, pool, name, stderr
),
)
return True, 'Rolled back RBD volume "{}" in pool "{}" to snapshot "{}".'.format(
volume, pool, name
)
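For context, a minimal usage sketch of the new rollback_snapshot() helper (not part of the diff; the pool, volume, and snapshot names are assumed placeholders, and zkhandler stands for an already-connected ZKHandler instance):

ok, message = rollback_snapshot(zkhandler, "vms", "testvm_disk0", "before-upgrade")
if not ok:
    raise RuntimeError(message)
# On success, message reads: Rolled back RBD volume "testvm_disk0" in pool "vms" to snapshot "before-upgrade".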
def remove_snapshot(zkhandler, pool, volume, name): def remove_snapshot(zkhandler, pool, volume, name):
if not verifyVolume(zkhandler, pool, volume): if not verifyVolume(zkhandler, pool, volume):
return False, 'ERROR: No volume with name "{}" is present in pool "{}".'.format( return False, 'ERROR: No volume with name "{}" is present in pool "{}".'.format(
@ -1107,20 +1150,9 @@ def remove_snapshot(zkhandler, pool, volume, name):
) )
def get_list_snapshot(zkhandler, pool, volume, limit=None, is_fuzzy=True): def get_list_snapshot(zkhandler, target_pool, target_volume, limit=None, is_fuzzy=True):
snapshot_list = [] snapshot_list = []
if pool and not verifyPool(zkhandler, pool): full_snapshot_list = getCephSnapshots(zkhandler, target_pool, target_volume)
return False, 'ERROR: No pool with name "{}" is present in the cluster.'.format(
pool
)
if volume and not verifyPool(zkhandler, volume):
return (
False,
'ERROR: No volume with name "{}" is present in the cluster.'.format(volume),
)
full_snapshot_list = getCephSnapshots(zkhandler, pool, volume)
if is_fuzzy and limit: if is_fuzzy and limit:
# Implicitly assume fuzzy limits # Implicitly assume fuzzy limits
@ -1132,6 +1164,18 @@ def get_list_snapshot(zkhandler, pool, volume, limit=None, is_fuzzy=True):
for snapshot in full_snapshot_list: for snapshot in full_snapshot_list:
volume, snapshot_name = snapshot.split("@") volume, snapshot_name = snapshot.split("@")
pool_name, volume_name = volume.split("/") pool_name, volume_name = volume.split("/")
if target_pool and pool_name != target_pool:
continue
if target_volume and volume_name != target_volume:
continue
try:
snapshot_stats = json.loads(
zkhandler.read(
("snapshot.stats", f"{pool_name}/{volume_name}/{snapshot_name}")
)
)
except Exception:
snapshot_stats = []
if limit: if limit:
try: try:
if re.fullmatch(limit, snapshot_name): if re.fullmatch(limit, snapshot_name):
@ -1140,13 +1184,19 @@ def get_list_snapshot(zkhandler, pool, volume, limit=None, is_fuzzy=True):
"pool": pool_name, "pool": pool_name,
"volume": volume_name, "volume": volume_name,
"snapshot": snapshot_name, "snapshot": snapshot_name,
"stats": snapshot_stats,
} }
) )
except Exception as e: except Exception as e:
return False, "Regex Error: {}".format(e) return False, "Regex Error: {}".format(e)
else: else:
snapshot_list.append( snapshot_list.append(
{"pool": pool_name, "volume": volume_name, "snapshot": snapshot_name} {
"pool": pool_name,
"volume": volume_name,
"snapshot": snapshot_name,
"stats": snapshot_stats,
}
) )
return True, sorted(snapshot_list, key=lambda x: str(x["snapshot"])) return True, sorted(snapshot_list, key=lambda x: str(x["snapshot"]))
@ -1181,16 +1231,16 @@ def osd_worker_add_osd(
current_stage = 0 current_stage = 0
total_stages = 5 total_stages = 5
if split_count is None: if split_count is None:
_split_count = 1 split_count = 1
else: else:
_split_count = split_count split_count = int(split_count)
total_stages = total_stages + 3 * int(_split_count) total_stages = total_stages + 3 * int(split_count)
if ext_db_ratio is not None or ext_db_size is not None: if ext_db_ratio is not None or ext_db_size is not None:
total_stages = total_stages + 3 * int(_split_count) + 1 total_stages = total_stages + 3 * int(split_count) + 1
start( start(
celery, celery,
f"Adding {_split_count} new OSD(s) on device {device} with weight {weight}", f"Adding {split_count} new OSD(s) on device {device} with weight {weight}",
current=current_stage, current=current_stage,
total=total_stages, total=total_stages,
) )
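A quick worked example of the stage arithmetic above (all values assumed): with split_count = 2 and an external DB requested, the job plans 5 base stages, plus 3 per OSD, plus 3 per OSD and 1 more when an external DB is used:

split_count = 2
total_stages = 5
total_stages = total_stages + 3 * split_count          # 11
ext_db_size = "60GB"                                    # example value; an external DB was requested
if ext_db_size is not None:
    total_stages = total_stages + 3 * split_count + 1   # 18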
@ -1231,7 +1281,7 @@ def osd_worker_add_osd(
else: else:
ext_db_flag = False ext_db_flag = False
if split_count is not None: if split_count > 1:
split_flag = f"--osds-per-device {split_count}" split_flag = f"--osds-per-device {split_count}"
is_split = True is_split = True
log_info( log_info(

View File

@ -262,6 +262,22 @@ def getClusterInformation(zkhandler):
# Get cluster maintenance state # Get cluster maintenance state
maintenance_state = zkhandler.read("base.config.maintenance") maintenance_state = zkhandler.read("base.config.maintenance")
# Prepare cluster total values
cluster_total_node_memory = 0
cluster_total_used_memory = 0
cluster_total_free_memory = 0
cluster_total_allocated_memory = 0
cluster_total_provisioned_memory = 0
cluster_total_average_memory_utilization = 0
cluster_total_cpu_cores = 0
cluster_total_cpu_load = 0
cluster_total_average_cpu_utilization = 0
cluster_total_allocated_cores = 0
cluster_total_osd_space = 0
cluster_total_used_space = 0
cluster_total_free_space = 0
cluster_total_average_osd_utilization = 0
# Get primary node # Get primary node
maintenance_state, primary_node = zkhandler.read_many( maintenance_state, primary_node = zkhandler.read_many(
[ [
@ -276,19 +292,36 @@ def getClusterInformation(zkhandler):
# Get the list of Nodes # Get the list of Nodes
node_list = zkhandler.children("base.node") node_list = zkhandler.children("base.node")
node_count = len(node_list) node_count = len(node_list)
# Get the daemon and domain states of all Nodes # Get the information of all Nodes
node_state_reads = list() node_state_reads = list()
node_memory_reads = list()
node_cpu_reads = list()
for node in node_list: for node in node_list:
node_state_reads += [ node_state_reads += [
("node.state.daemon", node), ("node.state.daemon", node),
("node.state.domain", node), ("node.state.domain", node),
] ]
node_memory_reads += [
("node.memory.total", node),
("node.memory.used", node),
("node.memory.free", node),
("node.memory.allocated", node),
("node.memory.provisioned", node),
]
node_cpu_reads += [
("node.data.static", node),
("node.vcpu.allocated", node),
("node.cpu.load", node),
]
all_node_states = zkhandler.read_many(node_state_reads) all_node_states = zkhandler.read_many(node_state_reads)
all_node_memory = zkhandler.read_many(node_memory_reads)
all_node_cpu = zkhandler.read_many(node_cpu_reads)
# Parse out the Node states # Parse out the Node states
node_data = list() node_data = list()
formatted_node_states = {"total": node_count} formatted_node_states = {"total": node_count}
for nidx, node in enumerate(node_list): for nidx, node in enumerate(node_list):
# Split the large list of return values by the IDX of this node # Split the large list of return values by the IDX of this node (states)
# Each node result is 2 fields long # Each node result is 2 fields long
pos_start = nidx * 2 pos_start = nidx * 2
pos_end = nidx * 2 + 2 pos_end = nidx * 2 + 2
@ -308,6 +341,46 @@ def getClusterInformation(zkhandler):
else: else:
formatted_node_states[node_state] = 1 formatted_node_states[node_state] = 1
# Split the large list of return values by the IDX of this node (memory)
# Each node result is 5 fields long
pos_start = nidx * 5
pos_end = nidx * 5 + 5
(
node_memory_total,
node_memory_used,
node_memory_free,
node_memory_allocated,
node_memory_provisioned,
) = tuple(all_node_memory[pos_start:pos_end])
cluster_total_node_memory += int(node_memory_total)
cluster_total_used_memory += int(node_memory_used)
cluster_total_free_memory += int(node_memory_free)
cluster_total_allocated_memory += int(node_memory_allocated)
cluster_total_provisioned_memory += int(node_memory_provisioned)
# Split the large list of return values by the IDX of this node (cpu)
# Each node result is 3 fields long
pos_start = nidx * 3
pos_end = nidx * 3 + 3
node_static_data, node_vcpu_allocated, node_cpu_load = tuple(
all_node_cpu[pos_start:pos_end]
)
cluster_total_cpu_cores += int(node_static_data.split()[0])
cluster_total_cpu_load += round(float(node_cpu_load), 2)
cluster_total_allocated_cores += int(node_vcpu_allocated)
cluster_total_average_memory_utilization = (
(round((cluster_total_used_memory / cluster_total_node_memory) * 100, 2))
if cluster_total_node_memory > 0
else 0.00
)
cluster_total_average_cpu_utilization = (
(round((cluster_total_cpu_load / cluster_total_cpu_cores) * 100, 2))
if cluster_total_cpu_cores > 0
else 0.00
)
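The two averages above are plain used-over-total ratios; a worked example with assumed figures for a three-node cluster (the memory units here are illustrative MiB):

cluster_total_node_memory = 3 * 131072   # 3 nodes x 128 GiB
cluster_total_used_memory = 98304        # 96 GiB used across the cluster
cluster_total_cpu_cores = 96             # 3 nodes x 32 cores
cluster_total_cpu_load = 21.5            # summed load averages
memory_utilization = round((cluster_total_used_memory / cluster_total_node_memory) * 100, 2)  # 25.0
cpu_utilization = round((cluster_total_cpu_load / cluster_total_cpu_cores) * 100, 2)          # 22.4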
# Get the list of VMs # Get the list of VMs
vm_list = zkhandler.children("base.domain") vm_list = zkhandler.children("base.domain")
vm_count = len(vm_list) vm_count = len(vm_list)
@ -380,6 +453,18 @@ def getClusterInformation(zkhandler):
else: else:
formatted_osd_states[osd_state] = 1 formatted_osd_states[osd_state] = 1
# Add the OSD utilization
cluster_total_osd_space += int(osd_stats["kb"])
cluster_total_used_space += int(osd_stats["kb_used"])
cluster_total_free_space += int(osd_stats["kb_avail"])
cluster_total_average_osd_utilization += float(osd_stats["utilization"])
cluster_total_average_osd_utilization = (
(round(cluster_total_average_osd_utilization / len(ceph_osd_list), 2))
if ceph_osd_list
else 0.00
)
# Get the list of Networks # Get the list of Networks
network_list = zkhandler.children("base.network") network_list = zkhandler.children("base.network")
network_count = len(network_list) network_count = len(network_list)
@ -424,6 +509,28 @@ def getClusterInformation(zkhandler):
"pools": ceph_pool_count, "pools": ceph_pool_count,
"volumes": ceph_volume_count, "volumes": ceph_volume_count,
"snapshots": ceph_snapshot_count, "snapshots": ceph_snapshot_count,
"resources": {
"memory": {
"total": cluster_total_node_memory,
"free": cluster_total_free_memory,
"used": cluster_total_used_memory,
"allocated": cluster_total_allocated_memory,
"provisioned": cluster_total_provisioned_memory,
"utilization": cluster_total_average_memory_utilization,
},
"cpu": {
"total": cluster_total_cpu_cores,
"load": cluster_total_cpu_load,
"allocated": cluster_total_allocated_cores,
"utilization": cluster_total_average_cpu_utilization,
},
"disk": {
"total": cluster_total_osd_space,
"used": cluster_total_used_space,
"free": cluster_total_free_space,
"utilization": cluster_total_average_osd_utilization,
},
},
"detail": { "detail": {
"node": node_data, "node": node_data,
"vm": vm_data, "vm": vm_data,
@ -1051,6 +1158,9 @@ def get_resource_metrics(zkhandler):
"restart": 6, "restart": 6,
"stop": 7, "stop": 7,
"fail": 8, "fail": 8,
"import": 9,
"restore": 10,
"mirror": 99,
} }
state = vm["state"] state = vm["state"]
output_lines.append( output_lines.append(

View File

@ -26,8 +26,10 @@ import subprocess
import signal import signal
from json import loads from json import loads
from re import match as re_match from re import match as re_match
from re import search as re_search
from re import split as re_split from re import split as re_split
from re import sub as re_sub from re import sub as re_sub
from difflib import unified_diff
from distutils.util import strtobool from distutils.util import strtobool
from threading import Thread from threading import Thread
from shlex import split as shlex_split from shlex import split as shlex_split
@ -81,6 +83,9 @@ vm_state_combinations = [
"migrate", "migrate",
"unmigrate", "unmigrate",
"provision", "provision",
"import",
"restore",
"mirror",
] ]
ceph_osd_state_combinations = [ ceph_osd_state_combinations = [
"up,in", "up,in",
@ -427,6 +432,96 @@ def getDomainTags(zkhandler, dom_uuid):
return tags return tags
#
# Get a list of domain snapshots
#
def getDomainSnapshots(zkhandler, dom_uuid):
"""
Get a list of snapshots for domain dom_uuid
The UUID must be validated before calling this function!
"""
snapshots = list()
all_snapshots = zkhandler.children(("domain.snapshots", dom_uuid))
current_timestamp = time.time()
current_dom_xml = zkhandler.read(("domain.xml", dom_uuid))
snapshots = list()
for snapshot in all_snapshots:
(
snap_name,
snap_timestamp,
_snap_rbd_snapshots,
snap_dom_xml,
) = zkhandler.read_many(
[
("domain.snapshots", dom_uuid, "domain_snapshot.name", snapshot),
("domain.snapshots", dom_uuid, "domain_snapshot.timestamp", snapshot),
(
"domain.snapshots",
dom_uuid,
"domain_snapshot.rbd_snapshots",
snapshot,
),
("domain.snapshots", dom_uuid, "domain_snapshot.xml", snapshot),
]
)
snap_rbd_snapshots = _snap_rbd_snapshots.split(",")
snap_dom_xml_diff = list(
unified_diff(
current_dom_xml.split("\n"),
snap_dom_xml.split("\n"),
fromfile="current",
tofile="snapshot",
fromfiledate="",
tofiledate="",
n=1,
lineterm="",
)
)
_snap_timestamp = float(snap_timestamp)
snap_age_secs = int(current_timestamp) - int(_snap_timestamp)
snap_age = f"{snap_age_secs} seconds"
snap_age_minutes = int(snap_age_secs / 60)
if snap_age_minutes > 0:
if snap_age_minutes > 1:
s = "s"
else:
s = ""
snap_age = f"{snap_age_minutes} minute{s}"
snap_age_hours = int(snap_age_secs / 3600)
if snap_age_hours > 0:
if snap_age_hours > 1:
s = "s"
else:
s = ""
snap_age = f"{snap_age_hours} hour{s}"
snap_age_days = int(snap_age_secs / 86400)
if snap_age_days > 0:
if snap_age_days > 1:
s = "s"
else:
s = ""
snap_age = f"{snap_age_days} day{s}"
snapshots.append(
{
"name": snap_name,
"timestamp": snap_timestamp,
"age": snap_age,
"xml_diff_lines": snap_dom_xml_diff,
"rbd_snapshots": snap_rbd_snapshots,
}
)
return sorted(snapshots, key=lambda s: s["timestamp"], reverse=True)
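The age formatting above walks successively larger units and keeps the largest non-zero one; a standalone sketch of the same bucketing (a hypothetical helper, not present in the codebase), useful for checking the cutoffs:

def humanize_age(age_secs):
    # Largest non-zero unit wins, mirroring the loop in getDomainSnapshots()
    age = f"{age_secs} seconds"
    for divisor, unit in ((60, "minute"), (3600, "hour"), (86400, "day")):
        count = int(age_secs / divisor)
        if count > 0:
            age = f"{count} {unit}{'s' if count > 1 else ''}"
    return age

humanize_age(5400)   # "1 hour"
humanize_age(90000)  # "1 day"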
# #
# Get a set of domain metadata # Get a set of domain metadata
# #
@ -515,6 +610,7 @@ def getInformationFromXML(zkhandler, uuid):
) = getDomainMetadata(zkhandler, uuid) ) = getDomainMetadata(zkhandler, uuid)
domain_tags = getDomainTags(zkhandler, uuid) domain_tags = getDomainTags(zkhandler, uuid)
domain_snapshots = getDomainSnapshots(zkhandler, uuid)
if domain_vnc: if domain_vnc:
domain_vnc_listen, domain_vnc_port = domain_vnc.split(":") domain_vnc_listen, domain_vnc_port = domain_vnc.split(":")
@ -574,6 +670,7 @@ def getInformationFromXML(zkhandler, uuid):
"migration_method": domain_migration_method, "migration_method": domain_migration_method,
"migration_max_downtime": int(domain_migration_max_downtime), "migration_max_downtime": int(domain_migration_max_downtime),
"tags": domain_tags, "tags": domain_tags,
"snapshots": domain_snapshots,
"description": domain_description, "description": domain_description,
"profile": domain_profile, "profile": domain_profile,
"memory": int(domain_memory), "memory": int(domain_memory),
@ -978,7 +1075,7 @@ def sortInterfaceNames(interface_names):
# #
# Parse a "detect" device into a real block device name # Parse a "detect" device into a real block device name
# #
def get_detect_device(detect_string): def get_detect_device_lsscsi(detect_string):
""" """
Parses a "detect:" string into a normalized block device path using lsscsi. Parses a "detect:" string into a normalized block device path using lsscsi.
@ -1045,3 +1142,96 @@ def get_detect_device(detect_string):
break break
return blockdev return blockdev
def get_detect_device_nvme(detect_string):
"""
Parses a "detect:" string into a normalized block device path using nvme.
A detect string is formatted "detect:<NAME>:<SIZE>:<ID>", where
NAME is some unique identifier in the NVMe device's model string, SIZE is a human-readable
size value to within +/- 3% of the real size of the device, and
ID is the Nth (0-indexed) matching entry of that NAME and SIZE.
"""
unit_map = {
"kB": 1000,
"MB": 1000 * 1000,
"GB": 1000 * 1000 * 1000,
"TB": 1000 * 1000 * 1000 * 1000,
"PB": 1000 * 1000 * 1000 * 1000 * 1000,
"EB": 1000 * 1000 * 1000 * 1000 * 1000 * 1000,
}
_, name, _size, idd = detect_string.split(":")
if _ != "detect":
return None
size_re = re_search(r"([\d.]+)([kKMGTP]B)", _size)
size_val = float(size_re.group(1))
size_unit = size_re.group(2)
size_bytes = int(size_val * unit_map[size_unit])
retcode, stdout, stderr = run_os_command("nvme list --output-format json")
if retcode:
print(f"Failed to run nvme: {stderr}")
return None
# Parse the output with json
nvme_data = loads(stdout).get("Devices", list())
# Handle size determination (+/- 3%)
size = None
nvme_sizes = set()
for entry in nvme_data:
nvme_sizes.add(entry["PhysicalSize"])
for l_size in nvme_sizes:
plusthreepct = size_bytes * 1.03
minusthreepct = size_bytes * 0.97
if l_size > minusthreepct and l_size < plusthreepct:
size = l_size
break
if size is None:
return None
blockdev = None
matches = list()
for entry in nvme_data:
# Skip if name is not contained in the line (case-insensitive)
if name.lower() not in entry["ModelNumber"].lower():
continue
# Skip if the size does not match
if size != entry["PhysicalSize"]:
continue
# Get our blockdev and append to the list
matches.append(entry["DevicePath"])
blockdev = None
# Find the blockdev at index {idd}
for idx, _blockdev in enumerate(matches):
if int(idx) == int(idd):
blockdev = _blockdev
break
return blockdev
def get_detect_device(detect_string):
"""
Parses a "detect:" string into a normalized block device path.
First tries to parse using "lsscsi" (get_detect_device_lsscsi). If this returns an invalid
block device name, then try to parse using "nvme" (get_detect_device_nvme). This works around
issues with more recent devices (e.g. the Dell R6615 series) not properly reporting block
device paths for NVMe devices with "lsscsi".
"""
device = get_detect_device_lsscsi(detect_string)
if device is None or not re_match(r"^/dev", device):
device = get_detect_device_nvme(detect_string)
if device is not None and re_match(r"^/dev", device):
return device
else:
return None
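A hypothetical usage sketch of the combined resolver (the detect string below is made up): lsscsi is tried first, and the nvme path is only consulted when lsscsi returns nothing or a non-/dev value:

blockdev = get_detect_device("detect:DELLBOSS VD:480GB:0")
if blockdev is None:
    print("no matching block device found")
else:
    print(f"resolved to {blockdev}")   # e.g. /dev/nvme0n1 on a system where only nvme reports the device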

View File

@ -244,9 +244,9 @@ def get_parsed_configuration(config_file):
] ]
][0] ][0]
config_cluster_networks_specific[ config_cluster_networks_specific[f"{network_type}_dev_ip"] = (
f"{network_type}_dev_ip" f"{list(network.hosts())[address_id]}/{network.prefixlen}"
] = f"{list(network.hosts())[address_id]}/{network.prefixlen}" )
config = {**config, **config_cluster_networks_specific} config = {**config, **config_cluster_networks_specific}
@ -375,8 +375,11 @@ def get_parsed_configuration(config_file):
config = {**config, **config_api_ssl} config = {**config, **config_api_ssl}
# Use coordinators as storage hosts if not explicitly specified # Use coordinators as storage hosts if not explicitly specified
# These are added as FQDNs in the storage domain
if not config["storage_hosts"] or len(config["storage_hosts"]) < 1: if not config["storage_hosts"] or len(config["storage_hosts"]) < 1:
config["storage_hosts"] = config["coordinators"] config["storage_hosts"] = []
for host in config["coordinators"]:
config["storage_hosts"].append(f"{host}.{config['storage_domain']}")
# Set up our token list if specified # Set up our token list if specified
if config["api_auth_source"] == "token": if config["api_auth_source"] == "token":
@ -406,6 +409,78 @@ def get_configuration():
return config return config
def get_parsed_autobackup_configuration(config_file):
"""
Load the configuration; this is the same main pvc.conf that the daemons read
"""
print('Loading configuration from file "{}"'.format(config_file))
with open(config_file, "r") as cfgfh:
try:
o_config = yaml.load(cfgfh, Loader=yaml.SafeLoader)
except Exception as e:
print(f"ERROR: Failed to parse configuration file: {e}")
os._exit(1)
config = dict()
try:
o_cluster = o_config["cluster"]
config_cluster = {
"cluster": o_cluster["name"],
"autobackup_enabled": True,
}
config = {**config, **config_cluster}
o_autobackup = o_config["autobackup"]
if o_autobackup is None:
config["autobackup_enabled"] = False
return config
config_autobackup = {
"backup_root_path": o_autobackup["backup_root_path"],
"backup_root_suffix": o_autobackup["backup_root_suffix"],
"backup_tags": o_autobackup["backup_tags"],
"backup_schedule": o_autobackup["backup_schedule"],
}
config = {**config, **config_autobackup}
o_automount = o_autobackup["auto_mount"]
config_automount = {
"auto_mount_enabled": o_automount["enabled"],
}
config = {**config, **config_automount}
if config["auto_mount_enabled"]:
config["mount_cmds"] = list()
for _mount_cmd in o_automount["mount_cmds"]:
if "{backup_root_path}" in _mount_cmd:
_mount_cmd = _mount_cmd.format(
backup_root_path=config["backup_root_path"]
)
config["mount_cmds"].append(_mount_cmd)
config["unmount_cmds"] = list()
for _unmount_cmd in o_automount["unmount_cmds"]:
if "{backup_root_path}" in _unmount_cmd:
_unmount_cmd = _unmount_cmd.format(
backup_root_path=config["backup_root_path"]
)
config["unmount_cmds"].append(_unmount_cmd)
except Exception as e:
raise MalformedConfigurationError(e)
return config
def get_autobackup_configuration():
"""
Get the configuration.
"""
pvc_config_file = get_configuration_path()
config = get_parsed_autobackup_configuration(pvc_config_file)
return config
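For reference, the rough shape of the dict get_parsed_autobackup_configuration() produces when auto_mount is enabled (all values below are invented examples; backup_schedule is passed through from the YAML verbatim and its structure is not shown):

example_autobackup_config = {
    "cluster": "cluster1",
    "autobackup_enabled": True,
    "backup_root_path": "/mnt/backups",
    "backup_root_suffix": "/pvc-autobackup",
    "backup_tags": ["autobackup"],
    "backup_schedule": {},              # copied verbatim from pvc.conf
    "auto_mount_enabled": True,
    "mount_cmds": ["/usr/bin/mount backupsrv:/export/backups /mnt/backups"],
    "unmount_cmds": ["/usr/bin/umount /mnt/backups"],
}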
def validate_directories(config): def validate_directories(config):
if not os.path.exists(config["dynamic_directory"]): if not os.path.exists(config["dynamic_directory"]):
os.makedirs(config["dynamic_directory"]) os.makedirs(config["dynamic_directory"])

View File

@ -0,0 +1 @@
{"version": "14", "root": "", "base": {"root": "", "schema": "/schema", "schema.version": "/schema/version", "config": "/config", "config.maintenance": "/config/maintenance", "config.primary_node": "/config/primary_node", "config.primary_node.sync_lock": "/config/primary_node/sync_lock", "config.upstream_ip": "/config/upstream_ip", "config.migration_target_selector": "/config/migration_target_selector", "logs": "/logs", "faults": "/faults", "node": "/nodes", "domain": "/domains", "network": "/networks", "storage": "/ceph", "storage.health": "/ceph/health", "storage.util": "/ceph/util", "osd": "/ceph/osds", "pool": "/ceph/pools", "volume": "/ceph/volumes", "snapshot": "/ceph/snapshots"}, "logs": {"node": "", "messages": "/messages"}, "faults": {"id": "", "last_time": "/last_time", "first_time": "/first_time", "ack_time": "/ack_time", "status": "/status", "delta": "/delta", "message": "/message"}, "node": {"name": "", "keepalive": "/keepalive", "mode": "/daemonmode", "data.active_schema": "/activeschema", "data.latest_schema": "/latestschema", "data.static": "/staticdata", "data.pvc_version": "/pvcversion", "running_domains": "/runningdomains", "count.provisioned_domains": "/domainscount", "count.networks": "/networkscount", "state.daemon": "/daemonstate", "state.router": "/routerstate", "state.domain": "/domainstate", "cpu.load": "/cpuload", "vcpu.allocated": "/vcpualloc", "memory.total": "/memtotal", "memory.used": "/memused", "memory.free": "/memfree", "memory.allocated": "/memalloc", "memory.provisioned": "/memprov", "ipmi.hostname": "/ipmihostname", "ipmi.username": "/ipmiusername", "ipmi.password": "/ipmipassword", "sriov": "/sriov", "sriov.pf": "/sriov/pf", "sriov.vf": "/sriov/vf", "monitoring.plugins": "/monitoring_plugins", "monitoring.data": "/monitoring_data", "monitoring.health": "/monitoring_health", "network.stats": "/network_stats"}, "monitoring_plugin": {"name": "", "last_run": "/last_run", "health_delta": "/health_delta", "message": "/message", "data": "/data", "runtime": "/runtime"}, "sriov_pf": {"phy": "", "mtu": "/mtu", "vfcount": "/vfcount"}, "sriov_vf": {"phy": "", "pf": "/pf", "mtu": "/mtu", "mac": "/mac", "phy_mac": "/phy_mac", "config": "/config", "config.vlan_id": "/config/vlan_id", "config.vlan_qos": "/config/vlan_qos", "config.tx_rate_min": "/config/tx_rate_min", "config.tx_rate_max": "/config/tx_rate_max", "config.spoof_check": "/config/spoof_check", "config.link_state": "/config/link_state", "config.trust": "/config/trust", "config.query_rss": "/config/query_rss", "pci": "/pci", "pci.domain": "/pci/domain", "pci.bus": "/pci/bus", "pci.slot": "/pci/slot", "pci.function": "/pci/function", "used": "/used", "used_by": "/used_by"}, "domain": {"name": "", "xml": "/xml", "state": "/state", "profile": "/profile", "stats": "/stats", "node": "/node", "last_node": "/lastnode", "failed_reason": "/failedreason", "storage.volumes": "/rbdlist", "console.log": "/consolelog", "console.vnc": "/vnc", "meta.autostart": "/node_autostart", "meta.migrate_method": "/migration_method", "meta.migrate_max_downtime": "/migration_max_downtime", "meta.node_selector": "/node_selector", "meta.node_limit": "/node_limit", "meta.tags": "/tags", "migrate.sync_lock": "/migrate_sync_lock", "snapshots": "/snapshots"}, "tag": {"name": "", "type": "/type", "protected": "/protected"}, "domain_snapshot": {"name": "", "timestamp": "/timestamp", "xml": "/xml", "rbd_snapshots": "/rbdsnaplist"}, "network": {"vni": "", "type": "/nettype", "mtu": "/mtu", "rule": "/firewall_rules", "rule.in": 
"/firewall_rules/in", "rule.out": "/firewall_rules/out", "nameservers": "/name_servers", "domain": "/domain", "reservation": "/dhcp4_reservations", "lease": "/dhcp4_leases", "ip4.gateway": "/ip4_gateway", "ip4.network": "/ip4_network", "ip4.dhcp": "/dhcp4_flag", "ip4.dhcp_start": "/dhcp4_start", "ip4.dhcp_end": "/dhcp4_end", "ip6.gateway": "/ip6_gateway", "ip6.network": "/ip6_network", "ip6.dhcp": "/dhcp6_flag"}, "reservation": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname"}, "lease": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname", "expiry": "/expiry", "client_id": "/clientid"}, "rule": {"description": "", "rule": "/rule", "order": "/order"}, "osd": {"id": "", "node": "/node", "device": "/device", "db_device": "/db_device", "fsid": "/fsid", "ofsid": "/fsid/osd", "cfsid": "/fsid/cluster", "lvm": "/lvm", "vg": "/lvm/vg", "lv": "/lvm/lv", "is_split": "/is_split", "stats": "/stats"}, "pool": {"name": "", "pgs": "/pgs", "tier": "/tier", "stats": "/stats"}, "volume": {"name": "", "stats": "/stats"}, "snapshot": {"name": "", "stats": "/stats"}}

View File

@ -0,0 +1 @@
{"version": "15", "root": "", "base": {"root": "", "schema": "/schema", "schema.version": "/schema/version", "config": "/config", "config.maintenance": "/config/maintenance", "config.fence_lock": "/config/fence_lock", "config.primary_node": "/config/primary_node", "config.primary_node.sync_lock": "/config/primary_node/sync_lock", "config.upstream_ip": "/config/upstream_ip", "config.migration_target_selector": "/config/migration_target_selector", "logs": "/logs", "faults": "/faults", "node": "/nodes", "domain": "/domains", "network": "/networks", "storage": "/ceph", "storage.health": "/ceph/health", "storage.util": "/ceph/util", "osd": "/ceph/osds", "pool": "/ceph/pools", "volume": "/ceph/volumes", "snapshot": "/ceph/snapshots"}, "logs": {"node": "", "messages": "/messages"}, "faults": {"id": "", "last_time": "/last_time", "first_time": "/first_time", "ack_time": "/ack_time", "status": "/status", "delta": "/delta", "message": "/message"}, "node": {"name": "", "keepalive": "/keepalive", "mode": "/daemonmode", "data.active_schema": "/activeschema", "data.latest_schema": "/latestschema", "data.static": "/staticdata", "data.pvc_version": "/pvcversion", "running_domains": "/runningdomains", "count.provisioned_domains": "/domainscount", "count.networks": "/networkscount", "state.daemon": "/daemonstate", "state.router": "/routerstate", "state.domain": "/domainstate", "cpu.load": "/cpuload", "vcpu.allocated": "/vcpualloc", "memory.total": "/memtotal", "memory.used": "/memused", "memory.free": "/memfree", "memory.allocated": "/memalloc", "memory.provisioned": "/memprov", "ipmi.hostname": "/ipmihostname", "ipmi.username": "/ipmiusername", "ipmi.password": "/ipmipassword", "sriov": "/sriov", "sriov.pf": "/sriov/pf", "sriov.vf": "/sriov/vf", "monitoring.plugins": "/monitoring_plugins", "monitoring.data": "/monitoring_data", "monitoring.health": "/monitoring_health", "network.stats": "/network_stats"}, "monitoring_plugin": {"name": "", "last_run": "/last_run", "health_delta": "/health_delta", "message": "/message", "data": "/data", "runtime": "/runtime"}, "sriov_pf": {"phy": "", "mtu": "/mtu", "vfcount": "/vfcount"}, "sriov_vf": {"phy": "", "pf": "/pf", "mtu": "/mtu", "mac": "/mac", "phy_mac": "/phy_mac", "config": "/config", "config.vlan_id": "/config/vlan_id", "config.vlan_qos": "/config/vlan_qos", "config.tx_rate_min": "/config/tx_rate_min", "config.tx_rate_max": "/config/tx_rate_max", "config.spoof_check": "/config/spoof_check", "config.link_state": "/config/link_state", "config.trust": "/config/trust", "config.query_rss": "/config/query_rss", "pci": "/pci", "pci.domain": "/pci/domain", "pci.bus": "/pci/bus", "pci.slot": "/pci/slot", "pci.function": "/pci/function", "used": "/used", "used_by": "/used_by"}, "domain": {"name": "", "xml": "/xml", "state": "/state", "profile": "/profile", "stats": "/stats", "node": "/node", "last_node": "/lastnode", "failed_reason": "/failedreason", "storage.volumes": "/rbdlist", "console.log": "/consolelog", "console.vnc": "/vnc", "meta.autostart": "/node_autostart", "meta.migrate_method": "/migration_method", "meta.migrate_max_downtime": "/migration_max_downtime", "meta.node_selector": "/node_selector", "meta.node_limit": "/node_limit", "meta.tags": "/tags", "migrate.sync_lock": "/migrate_sync_lock", "snapshots": "/snapshots"}, "tag": {"name": "", "type": "/type", "protected": "/protected"}, "domain_snapshot": {"name": "", "timestamp": "/timestamp", "xml": "/xml", "rbd_snapshots": "/rbdsnaplist"}, "network": {"vni": "", "type": "/nettype", "mtu": "/mtu", "rule": 
"/firewall_rules", "rule.in": "/firewall_rules/in", "rule.out": "/firewall_rules/out", "nameservers": "/name_servers", "domain": "/domain", "reservation": "/dhcp4_reservations", "lease": "/dhcp4_leases", "ip4.gateway": "/ip4_gateway", "ip4.network": "/ip4_network", "ip4.dhcp": "/dhcp4_flag", "ip4.dhcp_start": "/dhcp4_start", "ip4.dhcp_end": "/dhcp4_end", "ip6.gateway": "/ip6_gateway", "ip6.network": "/ip6_network", "ip6.dhcp": "/dhcp6_flag"}, "reservation": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname"}, "lease": {"mac": "", "ip": "/ipaddr", "hostname": "/hostname", "expiry": "/expiry", "client_id": "/clientid"}, "rule": {"description": "", "rule": "/rule", "order": "/order"}, "osd": {"id": "", "node": "/node", "device": "/device", "db_device": "/db_device", "fsid": "/fsid", "ofsid": "/fsid/osd", "cfsid": "/fsid/cluster", "lvm": "/lvm", "vg": "/lvm/vg", "lv": "/lvm/lv", "is_split": "/is_split", "stats": "/stats"}, "pool": {"name": "", "pgs": "/pgs", "tier": "/tier", "stats": "/stats"}, "volume": {"name": "", "stats": "/stats"}, "snapshot": {"name": "", "stats": "/stats"}}

View File

@ -69,6 +69,8 @@ def getNodeHealthDetails(zkhandler, node_name, node_health_plugins):
plugin_message, plugin_message,
plugin_data, plugin_data,
) = tuple(all_plugin_data[pos_start:pos_end]) ) = tuple(all_plugin_data[pos_start:pos_end])
if plugin_data is None:
continue
plugin_output = { plugin_output = {
"name": plugin, "name": plugin,
"last_run": int(plugin_last_run) if plugin_last_run is not None else None, "last_run": int(plugin_last_run) if plugin_last_run is not None else None,
@ -156,9 +158,9 @@ def getNodeInformation(zkhandler, node_name):
zkhandler, node_name, node_health_plugins zkhandler, node_name, node_health_plugins
) )
if _node_network_stats is not None: try:
node_network_stats = json.loads(_node_network_stats) node_network_stats = json.loads(_node_network_stats)
else: except Exception:
node_network_stats = dict() node_network_stats = dict()
# Construct a data structure to represent the data # Construct a data structure to represent the data

File diff suppressed because it is too large

View File

@ -258,6 +258,13 @@ def worker_create_vm(
args = (vm_profile,) args = (vm_profile,)
db_cur.execute(query, args) db_cur.execute(query, args)
profile_data = db_cur.fetchone() profile_data = db_cur.fetchone()
if profile_data is None:
fail(
celery,
f'Provisioner profile "{vm_profile}" is not present on the cluster',
exception=ClusterError,
)
if profile_data.get("arguments"): if profile_data.get("arguments"):
vm_data["script_arguments"] = profile_data.get("arguments").split("|") vm_data["script_arguments"] = profile_data.get("arguments").split("|")
else: else:
@ -329,11 +336,7 @@ def worker_create_vm(
retcode, stdout, stderr = pvc_common.run_os_command("uname -m") retcode, stdout, stderr = pvc_common.run_os_command("uname -m")
vm_data["system_architecture"] = stdout.strip() vm_data["system_architecture"] = stdout.strip()
monitor_list = list() vm_data["ceph_monitor_list"] = config["storage_hosts"]
monitor_names = config["storage_hosts"]
for monitor in monitor_names:
monitor_list.append("{}.{}".format(monitor, config["storage_domain"]))
vm_data["ceph_monitor_list"] = monitor_list
vm_data["ceph_monitor_port"] = config["ceph_monitor_port"] vm_data["ceph_monitor_port"] = config["ceph_monitor_port"]
vm_data["ceph_monitor_secret"] = config["ceph_secret_uuid"] vm_data["ceph_monitor_secret"] = config["ceph_secret_uuid"]

View File

@ -30,6 +30,10 @@ from kazoo.client import KazooClient, KazooState
from kazoo.exceptions import NoNodeError from kazoo.exceptions import NoNodeError
DEFAULT_ROOT_PATH = "/usr/share/pvc"
SCHEMA_PATH = "daemon_lib/migrations/versions"
# #
# Function decorators # Function decorators
# #
@ -573,7 +577,7 @@ class ZKHandler(object):
# #
class ZKSchema(object): class ZKSchema(object):
# Current version # Current version
_version = 13 _version = 15
# Root for doing nested keys # Root for doing nested keys
_schema_root = "" _schema_root = ""
@ -589,6 +593,7 @@ class ZKSchema(object):
"schema.version": f"{_schema_root}/schema/version", "schema.version": f"{_schema_root}/schema/version",
"config": f"{_schema_root}/config", "config": f"{_schema_root}/config",
"config.maintenance": f"{_schema_root}/config/maintenance", "config.maintenance": f"{_schema_root}/config/maintenance",
"config.fence_lock": f"{_schema_root}/config/fence_lock",
"config.primary_node": f"{_schema_root}/config/primary_node", "config.primary_node": f"{_schema_root}/config/primary_node",
"config.primary_node.sync_lock": f"{_schema_root}/config/primary_node/sync_lock", "config.primary_node.sync_lock": f"{_schema_root}/config/primary_node/sync_lock",
"config.upstream_ip": f"{_schema_root}/config/upstream_ip", "config.upstream_ip": f"{_schema_root}/config/upstream_ip",
@ -713,13 +718,21 @@ class ZKSchema(object):
"meta.node_limit": "/node_limit", "meta.node_limit": "/node_limit",
"meta.tags": "/tags", "meta.tags": "/tags",
"migrate.sync_lock": "/migrate_sync_lock", "migrate.sync_lock": "/migrate_sync_lock",
"snapshots": "/snapshots",
}, },
# The schema of an individual domain tag entry (/domains/{domain}/tags/{tag}) # The schema of an individual domain tag entry (/domains/{domain}/tags/{tag})
"tag": { "tag": {
"name": "", "name": "", # The root key
"type": "/type", "type": "/type",
"protected": "/protected", "protected": "/protected",
}, # The root key },
# The schema of an individual domain snapshot entry (/domains/{domain}/snapshots/{snapshot})
"domain_snapshot": {
"name": "", # The root key
"timestamp": "/timestamp",
"xml": "/xml",
"rbd_snapshots": "/rbdsnaplist",
},
# The schema of an individual network entry (/networks/{vni}) # The schema of an individual network entry (/networks/{vni})
"network": { "network": {
"vni": "", # The root key "vni": "", # The root key
@ -820,8 +833,8 @@ class ZKSchema(object):
def schema(self, schema): def schema(self, schema):
self._schema = schema self._schema = schema
def __init__(self): def __init__(self, root_path=DEFAULT_ROOT_PATH):
pass self.schema_path = f"{root_path}/{SCHEMA_PATH}"
def __repr__(self): def __repr__(self):
return f"ZKSchema({self.version})" return f"ZKSchema({self.version})"
@ -861,7 +874,7 @@ class ZKSchema(object):
if not quiet: if not quiet:
print(f"Loading schema version {version}") print(f"Loading schema version {version}")
with open(f"daemon_lib/migrations/versions/{version}.json", "r") as sfh: with open(f"{self.schema_path}/{version}.json", "r") as sfh:
self.schema = json.load(sfh) self.schema = json.load(sfh)
self.version = self.schema.get("version") self.version = self.schema.get("version")
@ -1123,7 +1136,7 @@ class ZKSchema(object):
# Migrate from older to newer schema # Migrate from older to newer schema
def migrate(self, zkhandler, new_version): def migrate(self, zkhandler, new_version):
# Determine the versions in between # Determine the versions in between
versions = ZKSchema.find_all(start=self.version, end=new_version) versions = self.find_all(start=self.version, end=new_version)
if versions is None: if versions is None:
return return
@ -1139,7 +1152,7 @@ class ZKSchema(object):
# Rollback from newer to older schema # Rollback from newer to older schema
def rollback(self, zkhandler, old_version): def rollback(self, zkhandler, old_version):
# Determine the versions in between # Determine the versions in between
versions = ZKSchema.find_all(start=old_version - 1, end=self.version - 1) versions = self.find_all(start=old_version - 1, end=self.version - 1)
if versions is None: if versions is None:
return return
@ -1154,6 +1167,12 @@ class ZKSchema(object):
# Apply those changes # Apply those changes
self.run_migrate(zkhandler, changes) self.run_migrate(zkhandler, changes)
# Write the latest schema to a file
def write(self):
schema_file = f"{self.schema_path}/{self._version}.json"
with open(schema_file, "w") as sfh:
json.dump(self._schema, sfh)
@classmethod @classmethod
def key_diff(cls, schema_a, schema_b): def key_diff(cls, schema_a, schema_b):
# schema_a = current # schema_a = current
@ -1199,26 +1218,10 @@ class ZKSchema(object):
return {"add": diff_add, "remove": diff_remove, "rename": diff_rename} return {"add": diff_add, "remove": diff_remove, "rename": diff_rename}
# Load in the schemal of the current cluster
@classmethod
def load_current(cls, zkhandler):
new_instance = cls()
version = new_instance.get_version(zkhandler)
new_instance.load(version)
return new_instance
# Write the latest schema to a file
@classmethod
def write(cls):
schema_file = "daemon_lib/migrations/versions/{}.json".format(cls._version)
with open(schema_file, "w") as sfh:
json.dump(cls._schema, sfh)
# Static methods for reading information from the files # Static methods for reading information from the files
@staticmethod def find_all(self, start=0, end=None):
def find_all(start=0, end=None):
versions = list() versions = list()
for version in os.listdir("daemon_lib/migrations/versions"): for version in os.listdir(self.schema_path):
sequence_id = int(version.split(".")[0]) sequence_id = int(version.split(".")[0])
if end is None: if end is None:
if sequence_id > start: if sequence_id > start:
@ -1231,11 +1234,18 @@ class ZKSchema(object):
else: else:
return None return None
@staticmethod def find_latest(self):
def find_latest():
latest_version = 0 latest_version = 0
for version in os.listdir("daemon_lib/migrations/versions"): for version in os.listdir(self.schema_path):
sequence_id = int(version.split(".")[0]) sequence_id = int(version.split(".")[0])
if sequence_id > latest_version: if sequence_id > latest_version:
latest_version = sequence_id latest_version = sequence_id
return latest_version return latest_version
# Load in the schema of the current cluster
@classmethod
def load_current(cls, zkhandler):
new_instance = cls()
version = new_instance.get_version(zkhandler)
new_instance.load(version)
return new_instance
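With the move from class/static methods to instance methods, the schema directory is now derived from a settable root path; a short sketch of the new flow (the override below simply restates DEFAULT_ROOT_PATH and is shown only to illustrate the parameter):

schema = ZKSchema(root_path="/usr/share/pvc")   # schema_path becomes /usr/share/pvc/daemon_lib/migrations/versions
latest = schema.find_latest()                   # highest <N>.json found in schema_path
schema.load(latest)                             # populates schema.version from the file's "version" key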

debian/changelog
View File

@ -1,3 +1,110 @@
pvc (0.9.103-0) unstable; urgency=high
* [Provisioner] Fixes a bug with the change in `storage_hosts` to FQDNs affecting the VM Builder
* [Monitoring] Fixes the Munin plugin to work properly with sudo
-- Joshua M. Boniface <joshua@boniface.me> Fri, 01 Nov 2024 17:19:44 -0400
pvc (0.9.102-0) unstable; urgency=high
* [API Daemon] Ensures that received config snapshots update storage hosts in addition to secret UUIDs
* [CLI Client] Fixes several bugs around local connection handling and connection listings
-- Joshua M. Boniface <joshua@boniface.me> Thu, 17 Oct 2024 10:48:31 -0400
pvc (0.9.101-0) unstable; urgency=high
**New Feature**: Adds VM snapshot sending (`vm snapshot send`), VM mirroring (`vm mirror create`), and (offline) mirror promotion (`vm mirror promote`). Permits transferring VM snapshots to remote clusters, individually or repeatedly, and promoting them to active status, for disaster recovery and migration between clusters.
**Breaking Change**: Migrates the API daemon into Gunicorn when in production mode. Permits more scalable and performant operation of the API. **Requires additional dependency packages on all coordinator nodes** (`gunicorn`, `python3-gunicorn`, `python3-setuptools`); upgrade via `pvc-ansible` is strongly recommended.
**Enhancement**: Provides whole cluster utilization stats in the cluster status data. Permits better observability into the overall resource utilization of the cluster.
**Enhancement**: Adds a new storage benchmark format (v2) which includes additional resource utilization statistics. This allows for better evaluation of storage performance impact on the cluster as a whole. The updated format also permits arbitrary benchmark job names for easier parsing and tracking.
* [API Daemon] Allows scanning of new volumes added manually via other commands
* [API Daemon/CLI Client] Adds whole cluster utilization statistics to cluster status
* [API Daemon] Moves production API execution into Gunicorn
* [API Daemon] Adds a new storage benchmark format (v2) with additional resource tracking
* [API Daemon] Adds support for named storage benchmark jobs
* [API Daemon] Fixes a bug in OSD creation which would create `split` OSDs if `--osd-count` was set to 1
* [API Daemon] Adds support for the `mirror` VM state used by snapshot mirrors
* [CLI Client] Fixes several output display bugs in various commands and in Worker task outputs
* [CLI Client] Improves and shrinks the status progress bar output to support longer messages
* [API Daemon] Adds support for sending snapshots to remote clusters
* [API Daemon] Adds support for updating and promoting snapshot mirrors to remote clusters
* [Node Daemon] Improves timeouts during primary/secondary coordinator transitions to avoid deadlocks
* [Node Daemon] Improves timeouts during keepalive updates to avoid deadlocks
* [Node Daemon] Refactors fencing thread structure to ensure a single fencing task per cluster and sequential node fences to avoid potential anomalies (e.g. fencing 2 nodes simultaneously)
* [Node Daemon] Fixes a bug in fencing if VM locks were already freed, leaving VMs in an invalid state
* [Node Daemon] Increases the wait time during system startup to ensure Zookeeper has more time to synchronize
-- Joshua M. Boniface <joshua@boniface.me> Tue, 15 Oct 2024 11:39:11 -0400
pvc (0.9.100-0) unstable; urgency=high
* [API Daemon] Improves the handling of "detect:" disk strings on newer systems by leveraging the "nvme" command
* [Client CLI] Update help text about "detect:" disk strings
* [Meta] Updates deprecation warnings and updates builder to only add this version for Debian 12 (Bookworm)
-- Joshua M. Boniface <joshua@boniface.me> Fri, 30 Aug 2024 11:03:33 -0400
pvc (0.9.99-0) unstable; urgency=high
**Deprecation Warning**: `pvc vm backup` commands are now deprecated and will be removed in **0.9.100**. Use `pvc vm snapshot` commands instead.
**Breaking Change**: The on-disk format of VM snapshot exports differs from backup exports, and the PVC autobackup system now leverages these. It is recommended to start fresh with a new tree of backups for `pvc autobackup` for maximum compatibility.
**Breaking Change**: VM autobackups now run in `pvcworkerd` instead of the CLI client directly, allowing them to be triggered from any node (or externally). It is important to apply the timer unit changes from the `pvc-ansible` role after upgrading to 0.9.99 to avoid duplicate runs.
**Usage Note**: VM snapshots are displayed in the `pvc vm list` and `pvc vm info` outputs, not in a unique "list" endpoint.
* [API Daemon] Adds a proper error when an invalid provisioner profile is specified
* [Node Daemon] Sorts Ceph pools properly in node keepalive to avoid incorrect ordering
* [Health Daemon] Improves handling of IPMI checks by adding multiple tries but a shorter timeout
* [API Daemon] Improves handling of XML parsing errors in VM configurations
* [ALL] Adds support for whole VM snapshots, including configuration XML details, and direct rollback to snapshots
* [ALL] Adds support for exporting and importing whole VM snapshots
* [Client CLI] Removes vCPU topology from short VM info output
* [Client CLI] Improves output format of VM info output
* [API Daemon] Adds an endpoint to get the current primary node
* [Client CLI] Fixes a bug where API requests were made 3 times
* [Other] Improves the build-and-deploy.sh script
* [API Daemon] Improves the "vm rename" command to avoid redefining VM, preserving history etc.
* [API Daemon] Adds an indication when a task is run on the primary node
* [API Daemon] Fixes a bug where the ZK schema relative path didn't work sometimes
-- Joshua M. Boniface <joshua@boniface.me> Wed, 28 Aug 2024 11:15:55 -0400
pvc (0.9.98-0) unstable; urgency=high
* [CLI Client] Fixed output when API call times out
* [Node Daemon] Improves the handling of fence states
* [API Daemon/CLI Client] Adds support for storage snapshot rollback
* [CLI Client] Adds additional warning messages about snapshot consistency to help output
* [API Daemon] Fixes a bug listing snapshots by pool/volume
* [Node Daemon] Adds a --version flag for information gathering by update-motd.sh
-- Joshua M. Boniface <joshua@boniface.me> Wed, 05 Jun 2024 12:01:31 -0400
pvc (0.9.97-0) unstable; urgency=high
* [Client CLI] Ensures --lines is always an integer value
* [Node Daemon] Fixes a bug if d_network changes during iteration
* [Node Daemon] Moves to using allocated instead of free memory for node reporting
* [API Daemon] Fixes a bug if lingering RBD snapshots exist when removing a volume (#180)
-- Joshua M. Boniface <joshua@boniface.me> Fri, 19 Apr 2024 10:32:16 -0400
pvc (0.9.96-0) unstable; urgency=high
* [API Daemon] Fixes a bug when reporting node stats
* [API Daemon] Fixes a bug deleting successful benchmark results
-- Joshua M. Boniface <joshua@boniface.me> Fri, 08 Mar 2024 14:23:06 -0500
pvc (0.9.95-0) unstable; urgency=high
* [API Daemon/CLI Client] Adds a flag to allow duplicate VNIs in network templates
* [API Daemon] Ensures that storage template disks are returned in disk ID order
* [Client CLI] Fixes a display bug showing all OSDs as split
-- Joshua M. Boniface <joshua@boniface.me> Fri, 09 Feb 2024 12:42:00 -0500
pvc (0.9.94-0) unstable; urgency=high pvc (0.9.94-0) unstable; urgency=high
* [CLI Client] Fixes an incorrect ordering issue with autobackup summary emails * [CLI Client] Fixes an incorrect ordering issue with autobackup summary emails

debian/control
View File

@ -32,7 +32,7 @@ Description: Parallel Virtual Cluster worker daemon
Package: pvc-daemon-api Package: pvc-daemon-api
Architecture: all Architecture: all
Depends: systemd, pvc-daemon-common, python3-yaml, python3-flask, python3-flask-restful, python3-celery, python3-distutils, python3-redis, python3-lxml, python3-flask-migrate Depends: systemd, pvc-daemon-common, gunicorn, python3-gunicorn, python3-yaml, python3-flask, python3-flask-restful, python3-celery, python3-distutils, python3-redis, python3-lxml, python3-flask-migrate
Description: Parallel Virtual Cluster API daemon Description: Parallel Virtual Cluster API daemon
A KVM/Zookeeper/Ceph-based VM and private cloud manager A KVM/Zookeeper/Ceph-based VM and private cloud manager
. .

View File

@ -69,26 +69,33 @@ class MonitoringPluginScript(MonitoringPlugin):
# Run any imports first # Run any imports first
from daemon_lib.common import run_os_command from daemon_lib.common import run_os_command
from time import sleep
# Check the node's IPMI interface # Check the node's IPMI interface
ipmi_hostname = self.config["ipmi_hostname"] ipmi_hostname = self.config["ipmi_hostname"]
ipmi_username = self.config["ipmi_username"] ipmi_username = self.config["ipmi_username"]
ipmi_password = self.config["ipmi_password"] ipmi_password = self.config["ipmi_password"]
retcode = 1
trycount = 0
while retcode > 0 and trycount < 3:
retcode, _, _ = run_os_command( retcode, _, _ = run_os_command(
f"/usr/bin/ipmitool -I lanplus -H {ipmi_hostname} -U {ipmi_username} -P {ipmi_password} chassis power status", f"/usr/bin/ipmitool -I lanplus -H {ipmi_hostname} -U {ipmi_username} -P {ipmi_password} chassis power status",
timeout=5 timeout=2
) )
trycount += 1
if retcode > 0 and trycount < 3:
sleep(trycount)
if retcode > 0: if retcode > 0:
# Set the health delta to 10 (subtract 10 from the total of 100) # Set the health delta to 10 (subtract 10 from the total of 100)
health_delta = 10 health_delta = 10
# Craft a message that can be used by the clients # Craft a message that can be used by the clients
message = f"IPMI via {ipmi_username}@{ipmi_hostname} is NOT responding" message = f"IPMI via {ipmi_username}@{ipmi_hostname} is NOT responding after 3 attempts"
else: else:
# Set the health delta to 0 (no change) # Set the health delta to 0 (no change)
health_delta = 0 health_delta = 0
# Craft a message that can be used by the clients # Craft a message that can be used by the clients
message = f"IPMI via {ipmi_username}@{ipmi_hostname} is responding" message = f"IPMI via {ipmi_username}@{ipmi_hostname} is responding after {trycount} attempts"
# Set the health delta in our local PluginResult object # Set the health delta in our local PluginResult object
self.plugin_result.set_health_delta(health_delta) self.plugin_result.set_health_delta(health_delta)

View File

@ -33,7 +33,7 @@ import os
import signal import signal
# Daemon version # Daemon version
version = "0.9.94" version = "0.9.103"
########################################################## ##########################################################

View File

@ -34,7 +34,7 @@ warning=0.99
critical=1.99 critical=1.99
export PVC_CLIENT_DIR="/run/shm/munin-pvc" export PVC_CLIENT_DIR="/run/shm/munin-pvc"
PVC_CMD="/usr/bin/pvc --quiet --cluster local status --format json-pretty" PVC_CMD="/usr/bin/sudo -E /usr/bin/pvc --quiet cluster status --format json-pretty"
JQ_CMD="/usr/bin/jq" JQ_CMD="/usr/bin/jq"
output_usage() { output_usage() {
@ -126,7 +126,7 @@ output_values() {
is_maintenance="$( $JQ_CMD ".maintenance" <<<"${PVC_OUTPUT}" | tr -d '"' )" is_maintenance="$( $JQ_CMD ".maintenance" <<<"${PVC_OUTPUT}" | tr -d '"' )"
cluster_health="$( $JQ_CMD ".cluster_health.health" <<<"${PVC_OUTPUT}" | tr -d '"' )" cluster_health="$( $JQ_CMD ".cluster_health.health" <<<"${PVC_OUTPUT}" | tr -d '"' )"
cluster_health_messages="$( $JQ_CMD -r ".cluster_health.messages | @csv" <<<"${PVC_OUTPUT}" | tr -d '"' | sed 's/,/, /g' )" cluster_health_messages="$( $JQ_CMD -r ".cluster_health.messages | map(.text) | join(\", \")" <<<"${PVC_OUTPUT}" )"
echo 'multigraph pvc_cluster_health' echo 'multigraph pvc_cluster_health'
echo "pvc_cluster_health.value ${cluster_health}" echo "pvc_cluster_health.value ${cluster_health}"
echo "pvc_cluster_health.extinfo ${cluster_health_messages}" echo "pvc_cluster_health.extinfo ${cluster_health_messages}"
@ -142,7 +142,7 @@ output_values() {
echo "pvc_cluster_alert.value ${cluster_health_alert}" echo "pvc_cluster_alert.value ${cluster_health_alert}"
node_health="$( $JQ_CMD ".node_health.${HOST}.health" <<<"${PVC_OUTPUT}" | tr -d '"' )" node_health="$( $JQ_CMD ".node_health.${HOST}.health" <<<"${PVC_OUTPUT}" | tr -d '"' )"
node_health_messages="$( $JQ_CMD -r ".node_health.${HOST}.messages | @csv" <<<"${PVC_OUTPUT}" | tr -d '"' | sed 's/,/, /g' )" node_health_messages="$( $JQ_CMD -r ".node_health.${HOST}.messages | join(\", \")" <<<"${PVC_OUTPUT}" )"
echo 'multigraph pvc_node_health' echo 'multigraph pvc_node_health'
echo "pvc_node_health.value ${node_health}" echo "pvc_node_health.value ${node_health}"
echo "pvc_node_health.extinfo ${node_health_messages}" echo "pvc_node_health.extinfo ${node_health_messages}"

File diff suppressed because it is too large

View File

@ -15,7 +15,7 @@
"type": "grafana", "type": "grafana",
"id": "grafana", "id": "grafana",
"name": "Grafana", "name": "Grafana",
"version": "10.2.2" "version": "11.1.4"
}, },
{ {
"type": "datasource", "type": "datasource",
@ -112,6 +112,7 @@
"graphMode": "area", "graphMode": "area",
"justifyMode": "auto", "justifyMode": "auto",
"orientation": "auto", "orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": { "reduceOptions": {
"calcs": [ "calcs": [
"lastNotNull" "lastNotNull"
@ -119,10 +120,11 @@
"fields": "/^pvc_cluster_id$/", "fields": "/^pvc_cluster_id$/",
"values": false "values": false
}, },
"showPercentChange": false,
"textMode": "auto", "textMode": "auto",
"wideLayout": true "wideLayout": true
}, },
"pluginVersion": "10.2.2", "pluginVersion": "11.1.4",
"targets": [ "targets": [
{ {
"datasource": { "datasource": {
@ -144,7 +146,6 @@
} }
], ],
"title": "Cluster", "title": "Cluster",
"transformations": [],
"type": "stat" "type": "stat"
}, },
{ {
@ -187,6 +188,7 @@
"graphMode": "area", "graphMode": "area",
"justifyMode": "auto", "justifyMode": "auto",
"orientation": "auto", "orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": { "reduceOptions": {
"calcs": [ "calcs": [
"lastNotNull" "lastNotNull"
@ -194,10 +196,11 @@
"fields": "/^vm$/", "fields": "/^vm$/",
"values": false "values": false
}, },
"showPercentChange": false,
"textMode": "auto", "textMode": "auto",
"wideLayout": true "wideLayout": true
}, },
"pluginVersion": "10.2.2", "pluginVersion": "11.1.4",
"targets": [ "targets": [
{ {
"datasource": { "datasource": {
@ -219,7 +222,6 @@
} }
], ],
"title": "VM Name", "title": "VM Name",
"transformations": [],
"type": "stat" "type": "stat"
}, },
{ {
@ -301,6 +303,21 @@
"color": "dark-red", "color": "dark-red",
"index": 8, "index": 8,
"text": "fail" "text": "fail"
},
"9": {
"color": "dark-blue",
"index": 9,
"text": "import"
},
"10": {
"color": "dark-blue",
"index": 10,
"text": "restore"
},
"99": {
"color": "dark-purple",
"index": 11,
"text": "mirror"
} }
}, },
"type": "value" "type": "value"
@ -323,6 +340,7 @@
"graphMode": "none", "graphMode": "none",
"justifyMode": "auto", "justifyMode": "auto",
"orientation": "auto", "orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": { "reduceOptions": {
"calcs": [ "calcs": [
"lastNotNull" "lastNotNull"
@ -330,10 +348,11 @@
"fields": "/^Value$/", "fields": "/^Value$/",
"values": false "values": false
}, },
"showPercentChange": false,
"textMode": "auto", "textMode": "auto",
"wideLayout": true "wideLayout": true
}, },
"pluginVersion": "10.2.2", "pluginVersion": "11.1.4",
"targets": [ "targets": [
{ {
"datasource": { "datasource": {
@ -355,7 +374,6 @@
} }
], ],
"title": "State", "title": "State",
"transformations": [],
"type": "stat" "type": "stat"
}, },
{ {
@ -398,6 +416,7 @@
"graphMode": "area", "graphMode": "area",
"justifyMode": "auto", "justifyMode": "auto",
"orientation": "auto", "orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": { "reduceOptions": {
"calcs": [ "calcs": [
"lastNotNull" "lastNotNull"
@ -405,10 +424,11 @@
"fields": "/^uuid$/", "fields": "/^uuid$/",
"values": false "values": false
}, },
"showPercentChange": false,
"textMode": "auto", "textMode": "auto",
"wideLayout": true "wideLayout": true
}, },
"pluginVersion": "10.2.2", "pluginVersion": "11.1.4",
"targets": [ "targets": [
{ {
"datasource": { "datasource": {
@ -430,7 +450,6 @@
} }
], ],
"title": "UUID", "title": "UUID",
"transformations": [],
"type": "stat" "type": "stat"
}, },
{ {
@ -473,6 +492,7 @@
"graphMode": "none", "graphMode": "none",
"justifyMode": "auto", "justifyMode": "auto",
"orientation": "auto", "orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": { "reduceOptions": {
"calcs": [ "calcs": [
"lastNotNull" "lastNotNull"
@ -480,10 +500,11 @@
"fields": "/^node$/", "fields": "/^node$/",
"values": false "values": false
}, },
"showPercentChange": false,
"textMode": "auto", "textMode": "auto",
"wideLayout": true "wideLayout": true
}, },
"pluginVersion": "10.2.2", "pluginVersion": "11.1.4",
"targets": [ "targets": [
{ {
"datasource": { "datasource": {
@ -505,7 +526,6 @@
} }
], ],
"title": "Active Node", "title": "Active Node",
"transformations": [],
"type": "stat" "type": "stat"
}, },
{ {
@ -545,6 +565,7 @@
"graphMode": "none", "graphMode": "none",
"justifyMode": "auto", "justifyMode": "auto",
"orientation": "auto", "orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": { "reduceOptions": {
"calcs": [ "calcs": [
"lastNotNull" "lastNotNull"
@ -552,10 +573,11 @@
"fields": "/^last_node$/", "fields": "/^last_node$/",
"values": false "values": false
}, },
"showPercentChange": false,
"textMode": "auto", "textMode": "auto",
"wideLayout": true "wideLayout": true
}, },
"pluginVersion": "10.2.2", "pluginVersion": "11.1.4",
"targets": [ "targets": [
{ {
"datasource": { "datasource": {
@ -577,7 +599,6 @@
} }
], ],
"title": "Migrated", "title": "Migrated",
"transformations": [],
"type": "stat" "type": "stat"
}, },
{ {
@ -646,6 +667,7 @@
"graphMode": "none", "graphMode": "none",
"justifyMode": "auto", "justifyMode": "auto",
"orientation": "auto", "orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": { "reduceOptions": {
"calcs": [ "calcs": [
"lastNotNull" "lastNotNull"
@ -653,10 +675,11 @@
"fields": "/^Value$/", "fields": "/^Value$/",
"values": false "values": false
}, },
"showPercentChange": false,
"textMode": "auto", "textMode": "auto",
"wideLayout": true "wideLayout": true
}, },
"pluginVersion": "10.2.2", "pluginVersion": "11.1.4",
"targets": [ "targets": [
{ {
"datasource": { "datasource": {
@ -678,7 +701,6 @@
} }
], ],
"title": "Autostart", "title": "Autostart",
"transformations": [],
"type": "stat" "type": "stat"
}, },
{ {
@ -721,6 +743,7 @@
"graphMode": "none", "graphMode": "none",
"justifyMode": "auto", "justifyMode": "auto",
"orientation": "auto", "orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": { "reduceOptions": {
"calcs": [ "calcs": [
"lastNotNull" "lastNotNull"
@ -728,10 +751,11 @@
"fields": "/^description$/", "fields": "/^description$/",
"values": false "values": false
}, },
"showPercentChange": false,
"textMode": "auto", "textMode": "auto",
"wideLayout": true "wideLayout": true
}, },
"pluginVersion": "10.2.2", "pluginVersion": "11.1.4",
"targets": [ "targets": [
{ {
"datasource": { "datasource": {
@ -753,7 +777,6 @@
} }
], ],
"title": "Description", "title": "Description",
"transformations": [],
"type": "stat" "type": "stat"
}, },
{ {
@ -796,6 +819,7 @@
"graphMode": "none", "graphMode": "none",
"justifyMode": "auto", "justifyMode": "auto",
"orientation": "auto", "orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": { "reduceOptions": {
"calcs": [ "calcs": [
"lastNotNull" "lastNotNull"
@ -803,10 +827,11 @@
"fields": "/^Value$/", "fields": "/^Value$/",
"values": false "values": false
}, },
"showPercentChange": false,
"textMode": "auto", "textMode": "auto",
"wideLayout": true "wideLayout": true
}, },
"pluginVersion": "10.2.2", "pluginVersion": "11.1.4",
"targets": [ "targets": [
{ {
"datasource": { "datasource": {
@ -828,7 +853,6 @@
} }
], ],
"title": "vCPUs", "title": "vCPUs",
"transformations": [],
"type": "stat" "type": "stat"
}, },
{ {
@ -871,6 +895,7 @@
"graphMode": "none", "graphMode": "none",
"justifyMode": "auto", "justifyMode": "auto",
"orientation": "auto", "orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": { "reduceOptions": {
"calcs": [ "calcs": [
"lastNotNull" "lastNotNull"
@ -878,10 +903,11 @@
"fields": "/^topology$/", "fields": "/^topology$/",
"values": false "values": false
}, },
"showPercentChange": false,
"textMode": "auto", "textMode": "auto",
"wideLayout": true "wideLayout": true
}, },
"pluginVersion": "10.2.2", "pluginVersion": "11.1.4",
"targets": [ "targets": [
{ {
"datasource": { "datasource": {
@ -903,7 +929,6 @@
} }
], ],
"title": "vCPU Topology", "title": "vCPU Topology",
"transformations": [],
"type": "stat" "type": "stat"
}, },
{ {
@ -947,6 +972,7 @@
"graphMode": "none", "graphMode": "none",
"justifyMode": "auto", "justifyMode": "auto",
"orientation": "auto", "orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": { "reduceOptions": {
"calcs": [ "calcs": [
"lastNotNull" "lastNotNull"
@ -954,10 +980,11 @@
"fields": "/^Value$/", "fields": "/^Value$/",
"values": false "values": false
}, },
"showPercentChange": false,
"textMode": "auto", "textMode": "auto",
"wideLayout": true "wideLayout": true
}, },
"pluginVersion": "10.2.2", "pluginVersion": "11.1.4",
"targets": [ "targets": [
{ {
"datasource": { "datasource": {
@ -979,7 +1006,6 @@
} }
], ],
"title": "vRAM", "title": "vRAM",
"transformations": [],
"type": "stat" "type": "stat"
}, },
{ {
@ -1022,6 +1048,7 @@
"graphMode": "none", "graphMode": "none",
"justifyMode": "auto", "justifyMode": "auto",
"orientation": "auto", "orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": { "reduceOptions": {
"calcs": [ "calcs": [
"lastNotNull" "lastNotNull"
@ -1029,10 +1056,11 @@
"fields": "/^node_limit$/", "fields": "/^node_limit$/",
"values": false "values": false
}, },
"showPercentChange": false,
"textMode": "auto", "textMode": "auto",
"wideLayout": true "wideLayout": true
}, },
"pluginVersion": "10.2.2", "pluginVersion": "11.1.4",
"targets": [ "targets": [
{ {
"datasource": { "datasource": {
@ -1054,7 +1082,6 @@
} }
], ],
"title": "Node Limits", "title": "Node Limits",
"transformations": [],
"type": "stat" "type": "stat"
}, },
{ {
@ -1097,6 +1124,7 @@
"graphMode": "none", "graphMode": "none",
"justifyMode": "auto", "justifyMode": "auto",
"orientation": "auto", "orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": { "reduceOptions": {
"calcs": [ "calcs": [
"lastNotNull" "lastNotNull"
@ -1104,10 +1132,11 @@
"fields": "failed_reason", "fields": "failed_reason",
"values": false "values": false
}, },
"showPercentChange": false,
"textMode": "auto", "textMode": "auto",
"wideLayout": true "wideLayout": true
}, },
"pluginVersion": "10.2.2", "pluginVersion": "11.1.4",
"targets": [ "targets": [
{ {
"datasource": { "datasource": {
@ -1129,11 +1158,10 @@
} }
], ],
"title": "Failure Reason", "title": "Failure Reason",
"transformations": [],
"type": "stat" "type": "stat"
}, },
{ {
"collapsed": true, "collapsed": false,
"gridPos": { "gridPos": {
"h": 1, "h": 1,
"w": 24, "w": 24,
@ -1141,7 +1169,10 @@
"y": 10 "y": 10
}, },
"id": 14, "id": 14,
"panels": [ "panels": [],
"title": "CPU & Memory Stats",
"type": "row"
},
{ {
"datasource": { "datasource": {
"type": "prometheus", "type": "prometheus",
@ -1664,21 +1695,20 @@
], ],
"title": "Swap Utilization (+ in/- out)", "title": "Swap Utilization (+ in/- out)",
"type": "timeseries" "type": "timeseries"
}
],
"title": "CPU & Memory Stats",
"type": "row"
}, },
{ {
"collapsed": true, "collapsed": false,
"gridPos": { "gridPos": {
"h": 1, "h": 1,
"w": 24, "w": 24,
"x": 0, "x": 0,
"y": 11 "y": 27
}, },
"id": 19, "id": 19,
"panels": [ "panels": [],
"title": "NIC Stats",
"type": "row"
},
{ {
"datasource": { "datasource": {
"type": "prometheus", "type": "prometheus",
@ -1727,8 +1757,7 @@
"mode": "absolute", "mode": "absolute",
"steps": [ "steps": [
{ {
"color": "green", "color": "green"
"value": null
}, },
{ {
"color": "red", "color": "red",
@ -1757,7 +1786,7 @@
"h": 10, "h": 10,
"w": 24, "w": 24,
"x": 0, "x": 0,
"y": 12 "y": 28
}, },
"id": 20, "id": 20,
"options": { "options": {
@ -1864,8 +1893,7 @@
"mode": "absolute", "mode": "absolute",
"steps": [ "steps": [
{ {
"color": "green", "color": "green"
"value": null
}, },
{ {
"color": "red", "color": "red",
@ -1894,7 +1922,7 @@
"h": 10, "h": 10,
"w": 24, "w": 24,
"x": 0, "x": 0,
"y": 22 "y": 38
}, },
"id": 21, "id": 21,
"options": { "options": {
@ -2001,8 +2029,7 @@
"mode": "absolute", "mode": "absolute",
"steps": [ "steps": [
{ {
"color": "green", "color": "green"
"value": null
} }
] ]
}, },
@ -2027,7 +2054,7 @@
"h": 8, "h": 8,
"w": 12, "w": 12,
"x": 0, "x": 0,
"y": 32 "y": 48
}, },
"id": 22, "id": 22,
"options": { "options": {
@ -2134,8 +2161,7 @@
"mode": "absolute", "mode": "absolute",
"steps": [ "steps": [
{ {
"color": "green", "color": "green"
"value": null
} }
] ]
}, },
@ -2160,7 +2186,7 @@
"h": 8, "h": 8,
"w": 12, "w": 12,
"x": 12, "x": 12,
"y": 32 "y": 48
}, },
"id": 23, "id": 23,
"options": { "options": {
@ -2218,21 +2244,20 @@
], ],
"title": "Errors (+ RX/- TX)", "title": "Errors (+ RX/- TX)",
"type": "timeseries" "type": "timeseries"
}
],
"title": "NIC Stats",
"type": "row"
}, },
{ {
"collapsed": true, "collapsed": false,
"gridPos": { "gridPos": {
"h": 1, "h": 1,
"w": 24, "w": 24,
"x": 0, "x": 0,
"y": 12 "y": 56
}, },
"id": 24, "id": 24,
"panels": [ "panels": [],
"title": "Disk Stats",
"type": "row"
},
{ {
"datasource": { "datasource": {
"type": "prometheus", "type": "prometheus",
@ -2281,8 +2306,7 @@
"mode": "absolute", "mode": "absolute",
"steps": [ "steps": [
{ {
"color": "green", "color": "green"
"value": null
}, },
{ {
"color": "red", "color": "red",
@ -2311,7 +2335,7 @@
"h": 9, "h": 9,
"w": 24, "w": 24,
"x": 0, "x": 0,
"y": 13 "y": 57
}, },
"id": 25, "id": 25,
"options": { "options": {
@ -2368,7 +2392,6 @@
} }
], ],
"title": "IOPS (+ Read/- Write)", "title": "IOPS (+ Read/- Write)",
"transformations": [],
"type": "timeseries" "type": "timeseries"
}, },
{ {
@ -2419,8 +2442,7 @@
"mode": "absolute", "mode": "absolute",
"steps": [ "steps": [
{ {
"color": "green", "color": "green"
"value": null
}, },
{ {
"color": "red", "color": "red",
@ -2449,7 +2471,7 @@
"h": 9, "h": 9,
"w": 24, "w": 24,
"x": 0, "x": 0,
"y": 22 "y": 66
}, },
"id": 26, "id": 26,
"options": { "options": {
@ -2509,12 +2531,8 @@
"type": "timeseries" "type": "timeseries"
} }
], ],
"title": "Disk Stats",
"type": "row"
}
],
"refresh": "5s", "refresh": "5s",
"schemaVersion": 38, "schemaVersion": 39,
"tags": [ "tags": [
"pvc" "pvc"
], ],
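The dashboard diff above bumps each stat panel's pluginVersion from 10.2.2 to 11.1.4, raises schemaVersion from 38 to 39, adds the new percentChangeColorMode/showPercentChange options, and flattens the previously collapsed rows. In practice Grafana rewrites these fields itself when a dashboard is re-saved under the newer version; the following is only a hypothetical sketch of applying the same bump offline with Python, with an illustrative file name.

import json

def bump_dashboard_versions(path, plugin_version="11.1.4", schema_version=39):
    # Hypothetical helper: bump per-panel pluginVersion and the dashboard schemaVersion
    # in an exported dashboard JSON (panels nested inside rows are not handled here).
    with open(path) as f:
        dashboard = json.load(f)
    for panel in dashboard.get("panels", []):
        if "pluginVersion" in panel:
            panel["pluginVersion"] = plugin_version
    dashboard["schemaVersion"] = schema_version
    with open(path, "w") as f:
        json.dump(dashboard, f, indent=2)

bump_dashboard_versions("pvc-vm-dashboard.json")  # file name is illustrative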

View File

@ -19,6 +19,11 @@
# #
############################################################################### ###############################################################################
from sys import argv
import pvcnoded.Daemon # noqa: F401 import pvcnoded.Daemon # noqa: F401
if "--version" in argv:
print(pvcnoded.Daemon.version)
exit(0)
pvcnoded.Daemon.entrypoint() pvcnoded.Daemon.entrypoint()
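The change above lets the node daemon report its version and exit before the entrypoint is called. A minimal sketch of the same pattern, assuming a module that exposes a version string and an entrypoint() function:

from sys import argv

import mydaemon  # hypothetical module exposing `version` and `entrypoint()`

if "--version" in argv:
    # Report the version without actually starting the daemon.
    print(mydaemon.version)
    exit(0)

mydaemon.entrypoint()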

View File

@ -49,7 +49,7 @@ import re
import json import json
# Daemon version # Daemon version
version = "0.9.94" version = "0.9.103"
########################################################## ##########################################################

View File

@ -231,7 +231,7 @@ class NetstatsInstance(object):
# Get a list of all active interfaces # Get a list of all active interfaces
net_root_path = "/sys/class/net" net_root_path = "/sys/class/net"
all_ifaces = list() all_ifaces = list()
for (_, dirnames, _) in walk(net_root_path): for _, dirnames, _ in walk(net_root_path):
all_ifaces.extend(dirnames) all_ifaces.extend(dirnames)
all_ifaces.sort() all_ifaces.sort()
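The hunk above only drops the redundant parentheses around the tuple unpacking; behaviour is unchanged. A self-contained sketch of the same interface enumeration (only the top level of /sys/class/net matters, so the walk can stop after the first yield):

from os import walk

net_root_path = "/sys/class/net"
all_ifaces = list()
# walk() yields (dirpath, dirnames, filenames); the first yield lists the interfaces.
for _, dirnames, _ in walk(net_root_path):
    all_ifaces.extend(dirnames)
    break
all_ifaces.sort()
print(all_ifaces)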

View File

@ -438,8 +438,11 @@ class NodeInstance(object):
# Synchronize nodes B (I am reader) # Synchronize nodes B (I am reader)
lock = self.zkhandler.readlock("base.config.primary_node.sync_lock") lock = self.zkhandler.readlock("base.config.primary_node.sync_lock")
self.logger.out("Acquiring read lock for synchronization phase B", state="i") self.logger.out("Acquiring read lock for synchronization phase B", state="i")
lock.acquire() try:
self.logger.out("Acquired read lock for synchronization phase B", state="o") lock.acquire(timeout=5) # Don't wait forever and completely block us
self.logger.out("Acquired read lock for synchronization phase G", state="o")
except Exception:
pass
self.logger.out("Releasing read lock for synchronization phase B", state="i") self.logger.out("Releasing read lock for synchronization phase B", state="i")
lock.release() lock.release()
self.logger.out("Released read lock for synchronization phase B", state="o") self.logger.out("Released read lock for synchronization phase B", state="o")
@ -521,7 +524,7 @@ class NodeInstance(object):
self.logger.out("Acquired write lock for synchronization phase F", state="o") self.logger.out("Acquired write lock for synchronization phase F", state="o")
time.sleep(0.2) # Time for reader to acquire the lock time.sleep(0.2) # Time for reader to acquire the lock
# 4. Add gateway IPs # 4. Add gateway IPs
for network in self.d_network: for network in self.d_network.copy():
self.d_network[network].createGateways() self.d_network[network].createGateways()
self.logger.out("Releasing write lock for synchronization phase F", state="i") self.logger.out("Releasing write lock for synchronization phase F", state="i")
self.zkhandler.write([("base.config.primary_node.sync_lock", "")]) self.zkhandler.write([("base.config.primary_node.sync_lock", "")])
@ -648,8 +651,11 @@ class NodeInstance(object):
# Synchronize nodes A (I am reader) # Synchronize nodes A (I am reader)
lock = self.zkhandler.readlock("base.config.primary_node.sync_lock") lock = self.zkhandler.readlock("base.config.primary_node.sync_lock")
self.logger.out("Acquiring read lock for synchronization phase A", state="i") self.logger.out("Acquiring read lock for synchronization phase A", state="i")
lock.acquire() try:
self.logger.out("Acquired read lock for synchronization phase A", state="o") lock.acquire(timeout=5) # Don't wait forever and completely block us
self.logger.out("Acquired read lock for synchronization phase G", state="o")
except Exception:
pass
self.logger.out("Releasing read lock for synchronization phase A", state="i") self.logger.out("Releasing read lock for synchronization phase A", state="i")
lock.release() lock.release()
self.logger.out("Released read lock for synchronization phase A", state="o") self.logger.out("Released read lock for synchronization phase A", state="o")
@ -682,8 +688,11 @@ class NodeInstance(object):
# Synchronize nodes C (I am reader) # Synchronize nodes C (I am reader)
lock = self.zkhandler.readlock("base.config.primary_node.sync_lock") lock = self.zkhandler.readlock("base.config.primary_node.sync_lock")
self.logger.out("Acquiring read lock for synchronization phase C", state="i") self.logger.out("Acquiring read lock for synchronization phase C", state="i")
lock.acquire() try:
self.logger.out("Acquired read lock for synchronization phase C", state="o") lock.acquire(timeout=5) # Don't wait forever and completely block us
self.logger.out("Acquired read lock for synchronization phase G", state="o")
except Exception:
pass
# 5. Remove Upstream floating IP # 5. Remove Upstream floating IP
self.logger.out( self.logger.out(
"Removing floating upstream IP {}/{} from interface {}".format( "Removing floating upstream IP {}/{} from interface {}".format(
@ -701,8 +710,11 @@ class NodeInstance(object):
# Synchronize nodes D (I am reader) # Synchronize nodes D (I am reader)
lock = self.zkhandler.readlock("base.config.primary_node.sync_lock") lock = self.zkhandler.readlock("base.config.primary_node.sync_lock")
self.logger.out("Acquiring read lock for synchronization phase D", state="i") self.logger.out("Acquiring read lock for synchronization phase D", state="i")
lock.acquire() try:
self.logger.out("Acquired read lock for synchronization phase D", state="o") lock.acquire(timeout=5) # Don't wait forever and completely block us
self.logger.out("Acquired read lock for synchronization phase G", state="o")
except Exception:
pass
# 6. Remove Cluster & Storage floating IP # 6. Remove Cluster & Storage floating IP
self.logger.out( self.logger.out(
"Removing floating management IP {}/{} from interface {}".format( "Removing floating management IP {}/{} from interface {}".format(
@ -729,8 +741,11 @@ class NodeInstance(object):
# Synchronize nodes E (I am reader) # Synchronize nodes E (I am reader)
lock = self.zkhandler.readlock("base.config.primary_node.sync_lock") lock = self.zkhandler.readlock("base.config.primary_node.sync_lock")
self.logger.out("Acquiring read lock for synchronization phase E", state="i") self.logger.out("Acquiring read lock for synchronization phase E", state="i")
lock.acquire() try:
self.logger.out("Acquired read lock for synchronization phase E", state="o") lock.acquire(timeout=5) # Don't wait forever and completely block us
self.logger.out("Acquired read lock for synchronization phase G", state="o")
except Exception:
pass
# 7. Remove Metadata link-local IP # 7. Remove Metadata link-local IP
self.logger.out( self.logger.out(
"Removing Metadata link-local IP {}/{} from interface {}".format( "Removing Metadata link-local IP {}/{} from interface {}".format(
@ -746,8 +761,11 @@ class NodeInstance(object):
# Synchronize nodes F (I am reader) # Synchronize nodes F (I am reader)
lock = self.zkhandler.readlock("base.config.primary_node.sync_lock") lock = self.zkhandler.readlock("base.config.primary_node.sync_lock")
self.logger.out("Acquiring read lock for synchronization phase F", state="i") self.logger.out("Acquiring read lock for synchronization phase F", state="i")
lock.acquire() try:
self.logger.out("Acquired read lock for synchronization phase F", state="o") lock.acquire(timeout=5) # Don't wait forever and completely block us
self.logger.out("Acquired read lock for synchronization phase G", state="o")
except Exception:
pass
# 8. Remove gateway IPs # 8. Remove gateway IPs
for network in self.d_network: for network in self.d_network:
self.d_network[network].removeGateways() self.d_network[network].removeGateways()
@ -759,7 +777,7 @@ class NodeInstance(object):
lock = self.zkhandler.readlock("base.config.primary_node.sync_lock") lock = self.zkhandler.readlock("base.config.primary_node.sync_lock")
self.logger.out("Acquiring read lock for synchronization phase G", state="i") self.logger.out("Acquiring read lock for synchronization phase G", state="i")
try: try:
lock.acquire(timeout=60) # Don't wait forever and completely block us lock.acquire(timeout=5) # Don't wait forever and completely block us
self.logger.out("Acquired read lock for synchronization phase G", state="o") self.logger.out("Acquired read lock for synchronization phase G", state="o")
except Exception: except Exception:
pass pass

View File

@ -21,15 +21,72 @@
import time import time
from kazoo.exceptions import LockTimeout
import daemon_lib.common as common import daemon_lib.common as common
from daemon_lib.vm import vm_worker_flush_locks from daemon_lib.vm import vm_worker_flush_locks
# #
# Fence thread entry function # Fence monitor thread entrypoint
# #
def fence_node(node_name, zkhandler, config, logger): def fence_monitor(zkhandler, config, logger):
# Attempt to acquire an exclusive lock on the fence_lock key
# If it is already held, we'll abort since another node is processing fences
lock = zkhandler.exclusivelock("base.config.fence_lock")
try:
lock.acquire(timeout=config["keepalive_interval"] - 1)
for node_name in zkhandler.children("base.node"):
try:
node_daemon_state = zkhandler.read(("node.state.daemon", node_name))
node_keepalive = int(zkhandler.read(("node.keepalive", node_name)))
except Exception:
node_daemon_state = "unknown"
node_keepalive = 0
node_deadtime = int(time.time()) - (
int(config["keepalive_interval"]) * int(config["fence_intervals"])
)
if node_keepalive < node_deadtime and node_daemon_state == "run":
logger.out(
f"Node {node_name} seems dead; starting monitor for fencing",
state="w",
)
zk_lock = zkhandler.writelock(("node.state.daemon", node_name))
with zk_lock:
# Ensures that, if we lost the lock race and come out of waiting,
# we won't try to trigger our own fence thread.
if zkhandler.read(("node.state.daemon", node_name)) != "dead":
# Write the updated data after we start the fence thread
zkhandler.write([(("node.state.daemon", node_name), "dead")])
# Start the fence monitoring task for this node
# NOTE: This is not a subthread and is designed to block this for loop
# This ensures that only one node is ever being fenced at a time
fence_node(zkhandler, config, logger, node_name)
else:
logger.out(
f"Node {node_name} is OK; last checkin is {node_deadtime - node_keepalive}s from threshold, node state is '{node_daemon_state}'",
state="d",
prefix="fence-thread",
)
except LockTimeout:
logger.out(
"Fence monitor thread failed to acquire exclusive lock; skipping", state="i"
)
except Exception as e:
logger.out(f"Fence monitor thread failed: {e}", state="w")
finally:
# We're finished, so release the global lock
lock.release()
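The monitor above serialises fencing cluster-wide: only the node that wins the exclusive Zookeeper lock walks the node list, and fence_node() is called inline so at most one node is fenced at a time. A stripped-down, self-contained sketch of the same gate using kazoo directly (paths and the sweep callback are illustrative):

from kazoo.client import KazooClient
from kazoo.exceptions import LockTimeout

def run_fence_sweep_exclusively(zk_hosts, keepalive_interval, sweep):
    # Only one node cluster-wide runs sweep() per cycle; the others skip.
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    lock = zk.Lock("/config/fence_lock")  # exclusive lock
    try:
        lock.acquire(timeout=keepalive_interval - 1)
        sweep()
    except LockTimeout:
        print("another node holds the fence lock; skipping this cycle")
    finally:
        lock.release()
        zk.stop()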
#
# Fence action function
#
def fence_node(zkhandler, config, logger, node_name):
# We allow exactly 6 saving throws (30 seconds) for the host to come back online or we kill it # We allow exactly 6 saving throws (30 seconds) for the host to come back online or we kill it
failcount_limit = 6 failcount_limit = 6
failcount = 0 failcount = 0
@ -190,7 +247,7 @@ def migrateFromFencedNode(zkhandler, node_name, config, logger):
) )
zkhandler.write( zkhandler.write(
{ {
(("domain.state", dom_uuid), "stopped"), (("domain.state", dom_uuid), "stop"),
(("domain.meta.autostart", dom_uuid), "True"), (("domain.meta.autostart", dom_uuid), "True"),
} }
) )
@ -202,6 +259,9 @@ def migrateFromFencedNode(zkhandler, node_name, config, logger):
# Loop through the VMs # Loop through the VMs
for dom_uuid in dead_node_running_domains: for dom_uuid in dead_node_running_domains:
if dom_uuid in ["0", 0]:
# Skip the invalid "0" UUID we sometimes get
continue
try: try:
fence_migrate_vm(dom_uuid) fence_migrate_vm(dom_uuid)
except Exception as e: except Exception as e:
@ -253,12 +313,16 @@ def reboot_via_ipmi(node_name, ipmi_hostname, ipmi_user, ipmi_password, logger):
state="i", state="i",
prefix=f"fencing {node_name}", prefix=f"fencing {node_name}",
) )
ipmi_status_retcode, ipmi_status_stdout, ipmi_status_stderr = common.run_os_command( (
ipmi_intermediate_status_retcode,
ipmi_intermediate_status_stdout,
ipmi_intermediate_status_stderr,
) = common.run_os_command(
f"/usr/bin/ipmitool -I lanplus -H {ipmi_hostname} -U {ipmi_user} -P {ipmi_password} chassis power status" f"/usr/bin/ipmitool -I lanplus -H {ipmi_hostname} -U {ipmi_user} -P {ipmi_password} chassis power status"
) )
if ipmi_status_retcode == 0: if ipmi_intermediate_status_retcode == 0:
logger.out( logger.out(
f"Current chassis power state is: {ipmi_status_stdout.strip()}", f"Current chassis power state is: {ipmi_intermediate_status_stdout.strip()}",
state="i", state="i",
prefix=f"fencing {node_name}", prefix=f"fencing {node_name}",
) )
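The renamed variables distinguish the intermediate status check (after chassis power off) from the final one (after power on). The underlying check is a plain ipmitool invocation; a hedged sketch using subprocess rather than the project's common.run_os_command wrapper:

import subprocess

def chassis_power_status(host, user, password):
    # Returns e.g. "Chassis Power is on" / "Chassis Power is off", or None on failure.
    result = subprocess.run(
        ["/usr/bin/ipmitool", "-I", "lanplus", "-H", host,
         "-U", user, "-P", password, "chassis", "power", "status"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()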
@ -299,12 +363,14 @@ def reboot_via_ipmi(node_name, ipmi_hostname, ipmi_user, ipmi_password, logger):
state="i", state="i",
prefix=f"fencing {node_name}", prefix=f"fencing {node_name}",
) )
ipmi_status_retcode, ipmi_status_stdout, ipmi_status_stderr = common.run_os_command( ipmi_final_status_retcode, ipmi_final_status_stdout, ipmi_final_status_stderr = (
common.run_os_command(
f"/usr/bin/ipmitool -I lanplus -H {ipmi_hostname} -U {ipmi_user} -P {ipmi_password} chassis power status" f"/usr/bin/ipmitool -I lanplus -H {ipmi_hostname} -U {ipmi_user} -P {ipmi_password} chassis power status"
) )
)
if ipmi_stop_retcode == 0: if ipmi_intermediate_status_stdout.strip() == "Chassis Power is off":
if ipmi_status_stdout.strip() == "Chassis Power is on": if ipmi_final_status_stdout.strip() == "Chassis Power is on":
# We successfully rebooted the node and it is powered on; this is a successful fence # We successfully rebooted the node and it is powered on; this is a successful fence
logger.out( logger.out(
"Successfully rebooted dead node; proceeding with fence recovery action", "Successfully rebooted dead node; proceeding with fence recovery action",
@ -312,7 +378,7 @@ def reboot_via_ipmi(node_name, ipmi_hostname, ipmi_user, ipmi_password, logger):
prefix=f"fencing {node_name}", prefix=f"fencing {node_name}",
) )
return True return True
elif ipmi_status_stdout.strip() == "Chassis Power is off": elif ipmi_final_status_stdout.strip() == "Chassis Power is off":
# We successfully rebooted the node but it is powered off; this might be expected or not, but the node is confirmed off so we can call it a successful fence # We successfully rebooted the node but it is powered off; this might be expected or not, but the node is confirmed off so we can call it a successful fence
logger.out( logger.out(
"Chassis power is in confirmed off state after successfuly IPMI reboot; proceeding with fence recovery action", "Chassis power is in confirmed off state after successfuly IPMI reboot; proceeding with fence recovery action",
@ -323,13 +389,13 @@ def reboot_via_ipmi(node_name, ipmi_hostname, ipmi_user, ipmi_password, logger):
else: else:
# We successfully rebooted the node but it is in some unknown power state; since this might indicate a silent failure, we must call it a failed fence # We successfully rebooted the node but it is in some unknown power state; since this might indicate a silent failure, we must call it a failed fence
logger.out( logger.out(
f"Chassis power is in an unknown state ({ipmi_status_stdout.strip()}) after successful IPMI reboot; NOT proceeding fence recovery action", f"Chassis power is in an unknown state ({ipmi_final_status_stdout.strip()}) after successful IPMI reboot; NOT proceeding fence recovery action",
state="e", state="e",
prefix=f"fencing {node_name}", prefix=f"fencing {node_name}",
) )
return False return False
else: else:
if ipmi_status_stdout.strip() == "Chassis Power is off": if ipmi_final_status_stdout.strip() == "Chassis Power is off":
# We failed to reboot the node but it is powered off; it has probably suffered a serious hardware failure, but the node is confirmed off so we can call it a successful fence # We failed to reboot the node but it is powered off; it has probably suffered a serious hardware failure, but the node is confirmed off so we can call it a successful fence
logger.out( logger.out(
"Chassis power is in confirmed off state after failed IPMI reboot; proceeding with fence recovery action", "Chassis power is in confirmed off state after failed IPMI reboot; proceeding with fence recovery action",

View File

@ -157,7 +157,9 @@ def collect_ceph_stats(logger, config, zkhandler, this_node, queue):
1 1
].decode("ascii") ].decode("ascii")
try: try:
ceph_pool_df_raw = json.loads(ceph_df_output)["pools"] ceph_pool_df_raw = sorted(
json.loads(ceph_df_output)["pools"], key=lambda x: x["name"]
)
except Exception as e: except Exception as e:
logger.out("Failed to obtain Pool data (ceph df): {}".format(e), state="w") logger.out("Failed to obtain Pool data (ceph df): {}".format(e), state="w")
ceph_pool_df_raw = [] ceph_pool_df_raw = []
@ -166,7 +168,9 @@ def collect_ceph_stats(logger, config, zkhandler, this_node, queue):
"rados df --format json", timeout=1 "rados df --format json", timeout=1
) )
try: try:
rados_pool_df_raw = json.loads(stdout)["pools"] rados_pool_df_raw = sorted(
json.loads(stdout)["pools"], key=lambda x: x["name"]
)
except Exception as e: except Exception as e:
logger.out("Failed to obtain Pool data (rados df): {}".format(e), state="w") logger.out("Failed to obtain Pool data (rados df): {}".format(e), state="w")
rados_pool_df_raw = [] rados_pool_df_raw = []
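Sorting both the `ceph df` and `rados df` pool lists by name gives the two sources a deterministic, matching order, so entries at the same index refer to the same pool. In isolation:

import json

ceph_df_output = '{"pools": [{"name": "vms"}, {"name": "images"}]}'
rados_df_output = '{"pools": [{"name": "images"}, {"name": "vms"}]}'

ceph_pool_df_raw = sorted(json.loads(ceph_df_output)["pools"], key=lambda x: x["name"])
rados_pool_df_raw = sorted(json.loads(rados_df_output)["pools"], key=lambda x: x["name"])

# After sorting, index i refers to the same pool in both lists.
for ceph_pool, rados_pool in zip(ceph_pool_df_raw, rados_pool_df_raw):
    assert ceph_pool["name"] == rados_pool["name"]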
@ -743,7 +747,7 @@ def node_keepalive(logger, config, zkhandler, this_node, netstats):
# Get node performance statistics # Get node performance statistics
this_node.memtotal = int(psutil.virtual_memory().total / 1024 / 1024) this_node.memtotal = int(psutil.virtual_memory().total / 1024 / 1024)
this_node.memused = int(psutil.virtual_memory().used / 1024 / 1024) this_node.memused = int(psutil.virtual_memory().used / 1024 / 1024)
this_node.memfree = int(psutil.virtual_memory().free / 1024 / 1024) this_node.memfree = int(psutil.virtual_memory().available / 1024 / 1024)
this_node.cpuload = round(os.getloadavg()[0], 2) this_node.cpuload = round(os.getloadavg()[0], 2)
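Switching the reported free memory from psutil's .free to .available is the more meaningful number on Linux: .free excludes memory that is merely holding reclaimable page cache, while .available estimates what new workloads could actually use. For example:

import psutil

mem = psutil.virtual_memory()
mib = 1024 * 1024
print(f"total:     {int(mem.total / mib)} MiB")
print(f"free:      {int(mem.free / mib)} MiB (excludes cache/buffers)")
print(f"available: {int(mem.available / mib)} MiB (realistic headroom for new allocations)")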
# Get node network statistics via netstats instance # Get node network statistics via netstats instance
@ -752,29 +756,21 @@ def node_keepalive(logger, config, zkhandler, this_node, netstats):
# Join against running threads # Join against running threads
if config["enable_hypervisor"]: if config["enable_hypervisor"]:
vm_stats_thread.join(timeout=config["keepalive_interval"]) vm_stats_thread.join(timeout=config["keepalive_interval"] - 1)
if vm_stats_thread.is_alive(): if vm_stats_thread.is_alive():
logger.out("VM stats gathering exceeded timeout, continuing", state="w") logger.out("VM stats gathering exceeded timeout, continuing", state="w")
if config["enable_storage"]: if config["enable_storage"]:
ceph_stats_thread.join(timeout=config["keepalive_interval"]) ceph_stats_thread.join(timeout=config["keepalive_interval"] - 1)
if ceph_stats_thread.is_alive(): if ceph_stats_thread.is_alive():
logger.out("Ceph stats gathering exceeded timeout, continuing", state="w") logger.out("Ceph stats gathering exceeded timeout, continuing", state="w")
# Get information from thread queues # Get information from thread queues
if config["enable_hypervisor"]: if config["enable_hypervisor"]:
try: try:
this_node.domains_count = vm_thread_queue.get( this_node.domains_count = vm_thread_queue.get(timeout=0.1)
timeout=config["keepalive_interval"] this_node.memalloc = vm_thread_queue.get(timeout=0.1)
) this_node.memprov = vm_thread_queue.get(timeout=0.1)
this_node.memalloc = vm_thread_queue.get( this_node.vcpualloc = vm_thread_queue.get(timeout=0.1)
timeout=config["keepalive_interval"]
)
this_node.memprov = vm_thread_queue.get(
timeout=config["keepalive_interval"]
)
this_node.vcpualloc = vm_thread_queue.get(
timeout=config["keepalive_interval"]
)
except Exception: except Exception:
logger.out("VM stats queue get exceeded timeout, continuing", state="w") logger.out("VM stats queue get exceeded timeout, continuing", state="w")
else: else:
@ -785,9 +781,7 @@ def node_keepalive(logger, config, zkhandler, this_node, netstats):
if config["enable_storage"]: if config["enable_storage"]:
try: try:
osds_this_node = ceph_thread_queue.get( osds_this_node = ceph_thread_queue.get(timeout=0.1)
timeout=(config["keepalive_interval"] - 1)
)
except Exception: except Exception:
logger.out("Ceph stats queue get exceeded timeout, continuing", state="w") logger.out("Ceph stats queue get exceeded timeout, continuing", state="w")
osds_this_node = "?" osds_this_node = "?"
@ -883,44 +877,12 @@ def node_keepalive(logger, config, zkhandler, this_node, netstats):
) )
# Look for dead nodes and fence them # Look for dead nodes and fence them
if not this_node.maintenance: if not this_node.maintenance and config["daemon_mode"] == "coordinator":
logger.out( logger.out(
"Look for dead nodes and fence them", state="d", prefix="main-thread" "Look for dead nodes and fence them", state="d", prefix="main-thread"
) )
if config["daemon_mode"] == "coordinator": fence_monitor_thread = Thread(
for node_name in zkhandler.children("base.node"): target=pvcnoded.util.fencing.fence_monitor,
try: args=(zkhandler, config, logger),
node_daemon_state = zkhandler.read(("node.state.daemon", node_name))
node_keepalive = int(zkhandler.read(("node.keepalive", node_name)))
except Exception:
node_daemon_state = "unknown"
node_keepalive = 0
# Handle deadtime and fencing if needed
# (A node is considered dead when its keepalive timer is >6*keepalive_interval seconds
# out-of-date while in 'start' state)
node_deadtime = int(time.time()) - (
int(config["keepalive_interval"]) * int(config["fence_intervals"])
)
if node_keepalive < node_deadtime and node_daemon_state == "run":
logger.out(
"Node {} seems dead - starting monitor for fencing".format(
node_name
),
state="w",
)
zk_lock = zkhandler.writelock(("node.state.daemon", node_name))
with zk_lock:
# Ensures that, if we lost the lock race and come out of waiting,
# we won't try to trigger our own fence thread.
if zkhandler.read(("node.state.daemon", node_name)) != "dead":
fence_thread = Thread(
target=pvcnoded.util.fencing.fence_node,
args=(node_name, zkhandler, config, logger),
kwargs={},
)
fence_thread.start()
# Write the updated data after we start the fence thread
zkhandler.write(
[(("node.state.daemon", node_name), "dead")]
) )
fence_monitor_thread.start()
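With the per-node loop moved into fence_monitor(), the keepalive simply spawns one monitor thread per cycle on the coordinator; the exclusive lock inside the monitor keeps overlapping cycles (and other coordinators) from fencing concurrently. A minimal sketch of the call site, assuming the pvcnoded package is importable:

from threading import Thread

import pvcnoded.util.fencing

def start_fence_monitor(zkhandler, config, logger):
    # Fire-and-forget: fence_monitor serialises itself via the Zookeeper fence lock,
    # so overlapping invocations simply skip their run.
    fence_monitor_thread = Thread(
        target=pvcnoded.util.fencing.fence_monitor,
        args=(zkhandler, config, logger),
    )
    fence_monitor_thread.start()
    return fence_monitor_thread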

View File

@ -102,5 +102,5 @@ def start_system_services(logger, config):
start_workerd(logger, config) start_workerd(logger, config)
start_healthd(logger, config) start_healthd(logger, config)
logger.out("Waiting 5 seconds for daemons to start", state="s") logger.out("Waiting 10 seconds for daemons to start", state="s")
sleep(5) sleep(10)

View File

@ -188,3 +188,6 @@ def setup_node(logger, config, zkhandler):
(("node.count.networks", config["node_hostname"]), "0"), (("node.count.networks", config["node_hostname"]), "0"),
] ]
) )
logger.out("Waiting 5 seconds for Zookeeper to synchronize", state="s")
time.sleep(5)

View File

@ -28,6 +28,14 @@ from daemon_lib.vm import (
vm_worker_flush_locks, vm_worker_flush_locks,
vm_worker_attach_device, vm_worker_attach_device,
vm_worker_detach_device, vm_worker_detach_device,
vm_worker_create_snapshot,
vm_worker_remove_snapshot,
vm_worker_rollback_snapshot,
vm_worker_export_snapshot,
vm_worker_import_snapshot,
vm_worker_send_snapshot,
vm_worker_create_mirror,
vm_worker_promote_mirror,
) )
from daemon_lib.ceph import ( from daemon_lib.ceph import (
osd_worker_add_osd, osd_worker_add_osd,
@ -42,9 +50,12 @@ from daemon_lib.benchmark import (
from daemon_lib.vmbuilder import ( from daemon_lib.vmbuilder import (
worker_create_vm, worker_create_vm,
) )
from daemon_lib.autobackup import (
worker_cluster_autobackup,
)
# Daemon version # Daemon version
version = "0.9.94" version = "0.9.103"
config = cfg.get_configuration() config = cfg.get_configuration()
@ -88,12 +99,27 @@ def create_vm(
@celery.task(name="storage.benchmark", bind=True, routing_key="run_on") @celery.task(name="storage.benchmark", bind=True, routing_key="run_on")
def storage_benchmark(self, pool=None, run_on="primary"): def storage_benchmark(self, pool=None, name=None, run_on="primary"):
@ZKConnection(config) @ZKConnection(config)
def run_storage_benchmark(zkhandler, self, pool): def run_storage_benchmark(zkhandler, self, pool, name):
return worker_run_benchmark(zkhandler, self, config, pool) return worker_run_benchmark(zkhandler, self, config, pool, name)
return run_storage_benchmark(self, pool) return run_storage_benchmark(self, pool, name)
@celery.task(name="cluster.autobackup", bind=True, routing_key="run_on")
def cluster_autobackup(self, force_full=False, email_recipients=None, run_on="primary"):
@ZKConnection(config)
def run_cluster_autobackup(
zkhandler, self, force_full=False, email_recipients=None
):
return worker_cluster_autobackup(
zkhandler, self, force_full=force_full, email_recipients=email_recipients
)
return run_cluster_autobackup(
self, force_full=force_full, email_recipients=email_recipients
)
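The new cluster.autobackup task follows the existing run_on="primary" convention, so a client dispatches it by name rather than importing the worker module. A hedged sketch of such a dispatch (broker URL and routing details are illustrative):

from celery import Celery

celery = Celery("pvcworkerd", broker="redis://localhost:6379/0")

# Dispatch by task name; the actual queue/routing depends on the cluster configuration.
result = celery.send_task(
    "cluster.autobackup",
    kwargs={"force_full": False, "email_recipients": None, "run_on": "primary"},
)
print(result.id)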
@celery.task(name="vm.flush_locks", bind=True, routing_key="run_on") @celery.task(name="vm.flush_locks", bind=True, routing_key="run_on")
@ -123,6 +149,219 @@ def vm_device_detach(self, domain=None, xml=None, run_on=None):
return run_vm_device_detach(self, domain, xml) return run_vm_device_detach(self, domain, xml)
@celery.task(name="vm.create_snapshot", bind=True, routing_key="run_on")
def vm_create_snapshot(self, domain=None, snapshot_name=None, run_on="primary"):
@ZKConnection(config)
def run_vm_create_snapshot(zkhandler, self, domain, snapshot_name):
return vm_worker_create_snapshot(zkhandler, self, domain, snapshot_name)
return run_vm_create_snapshot(self, domain, snapshot_name)
@celery.task(name="vm.remove_snapshot", bind=True, routing_key="run_on")
def vm_remove_snapshot(self, domain=None, snapshot_name=None, run_on="primary"):
@ZKConnection(config)
def run_vm_remove_snapshot(zkhandler, self, domain, snapshot_name):
return vm_worker_remove_snapshot(zkhandler, self, domain, snapshot_name)
return run_vm_remove_snapshot(self, domain, snapshot_name)
@celery.task(name="vm.rollback_snapshot", bind=True, routing_key="run_on")
def vm_rollback_snapshot(self, domain=None, snapshot_name=None, run_on="primary"):
@ZKConnection(config)
def run_vm_rollback_snapshot(zkhandler, self, domain, snapshot_name):
return vm_worker_rollback_snapshot(zkhandler, self, domain, snapshot_name)
return run_vm_rollback_snapshot(self, domain, snapshot_name)
@celery.task(name="vm.export_snapshot", bind=True, routing_key="run_on")
def vm_export_snapshot(
self,
domain=None,
snapshot_name=None,
export_path=None,
incremental_parent=None,
run_on="primary",
):
@ZKConnection(config)
def run_vm_export_snapshot(
zkhandler, self, domain, snapshot_name, export_path, incremental_parent=None
):
return vm_worker_export_snapshot(
zkhandler,
self,
domain,
snapshot_name,
export_path,
incremental_parent=incremental_parent,
)
return run_vm_export_snapshot(
self, domain, snapshot_name, export_path, incremental_parent=incremental_parent
)
@celery.task(name="vm.import_snapshot", bind=True, routing_key="run_on")
def vm_import_snapshot(
self,
domain=None,
snapshot_name=None,
import_path=None,
retain_snapshot=True,
run_on="primary",
):
@ZKConnection(config)
def run_vm_import_snapshot(
zkhandler, self, domain, snapshot_name, import_path, retain_snapshot=True
):
return vm_worker_import_snapshot(
zkhandler,
self,
domain,
snapshot_name,
import_path,
retain_snapshot=retain_snapshot,
)
return run_vm_import_snapshot(
self, domain, snapshot_name, import_path, retain_snapshot=retain_snapshot
)
@celery.task(name="vm.send_snapshot", bind=True, routing_key="run_on")
def vm_send_snapshot(
self,
domain=None,
snapshot_name=None,
destination_api_uri="",
destination_api_key="",
destination_api_verify_ssl=True,
incremental_parent=None,
destination_storage_pool=None,
run_on="primary",
):
@ZKConnection(config)
def run_vm_send_snapshot(
zkhandler,
self,
domain,
snapshot_name,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=True,
incremental_parent=None,
destination_storage_pool=None,
):
return vm_worker_send_snapshot(
zkhandler,
self,
domain,
snapshot_name,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=destination_api_verify_ssl,
incremental_parent=incremental_parent,
destination_storage_pool=destination_storage_pool,
)
return run_vm_send_snapshot(
self,
domain,
snapshot_name,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=destination_api_verify_ssl,
incremental_parent=incremental_parent,
destination_storage_pool=destination_storage_pool,
)
@celery.task(name="vm.create_mirror", bind=True, routing_key="run_on")
def vm_create_mirror(
self,
domain=None,
destination_api_uri="",
destination_api_key="",
destination_api_verify_ssl=True,
destination_storage_pool=None,
run_on="primary",
):
@ZKConnection(config)
def run_vm_create_mirror(
zkhandler,
self,
domain,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=True,
destination_storage_pool=None,
):
return vm_worker_create_mirror(
zkhandler,
self,
domain,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=destination_api_verify_ssl,
destination_storage_pool=destination_storage_pool,
)
return run_vm_create_mirror(
self,
domain,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=destination_api_verify_ssl,
destination_storage_pool=destination_storage_pool,
)
@celery.task(name="vm.promote_mirror", bind=True, routing_key="run_on")
def vm_promote_mirror(
self,
domain=None,
destination_api_uri="",
destination_api_key="",
destination_api_verify_ssl=True,
destination_storage_pool=None,
remove_on_source=False,
run_on="primary",
):
@ZKConnection(config)
def run_vm_promote_mirror(
zkhandler,
self,
domain,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=True,
destination_storage_pool=None,
remove_on_source=False,
):
return vm_worker_promote_mirror(
zkhandler,
self,
domain,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=destination_api_verify_ssl,
destination_storage_pool=destination_storage_pool,
remove_on_source=remove_on_source,
)
return run_vm_promote_mirror(
self,
domain,
destination_api_uri,
destination_api_key,
destination_api_verify_ssl=destination_api_verify_ssl,
destination_storage_pool=destination_storage_pool,
remove_on_source=remove_on_source,
)
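All of the snapshot and mirror tasks above share one shape: the bound Celery task unpacks its keyword arguments, and an inner function decorated with @ZKConnection(config) receives a live zkhandler and forwards to the corresponding daemon_lib worker. A compressed, self-contained sketch of that shape; the decorator and worker here are stand-ins, not the real daemon_lib implementations:

from celery import Celery

celery = Celery("pvcworkerd")  # broker/backend configuration omitted
config = {}

def ZKConnection(config):
    # Stand-in for daemon_lib's decorator, which opens a Zookeeper connection
    # and passes the live zkhandler as the first argument of the wrapped function.
    def decorator(func):
        def wrapper(*args, **kwargs):
            zkhandler = object()  # placeholder for an open connection
            return func(zkhandler, *args, **kwargs)
        return wrapper
    return decorator

@celery.task(name="vm.example_action", bind=True, routing_key="run_on")
def vm_example_action(self, domain=None, run_on="primary"):
    @ZKConnection(config)
    def run_vm_example_action(zkhandler, self, domain):
        # A real task would forward to a daemon_lib vm_worker_* function here.
        return f"would act on {domain}"

    return run_vm_example_action(self, domain)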
@celery.task(name="osd.add", bind=True, routing_key="run_on") @celery.task(name="osd.add", bind=True, routing_key="run_on")
def osd_add( def osd_add(
self, self,