Commit Graph

487 Commits

Author SHA1 Message Date
Joshua Boniface d1e34e7333 Store fault times only to the second
Any more precision is unnecessary and saves 6 chars when displaying
these times elsewhere.
2023-12-06 13:20:18 -05:00
Joshua Boniface 79eb54d5da Move fault generation to common library 2023-12-06 13:17:10 -05:00
Joshua Boniface 2267a9c85d Improve output formatting for simplicity 2023-12-05 10:37:35 -05:00
Joshua Boniface 672e58133f Implement interfaces to faults 2023-12-04 01:37:54 -05:00
Joshua Boniface 3dc48c1783 Lower default monitoring interval to 15s
Faults are also reported on the monitoring interval, so 60s seems like
too long. Lower this to 15 seconds by default instead.
2023-12-01 17:38:28 -05:00
Joshua Boniface 9c2b1b29ee Add node health to fault states
Adjusts ordering and ensures that node health states are included in
faults if they are less than 50%.

Also adjusts fault ID generation and runs fault checks only coordinator
nodes to avoid too many runs.
2023-12-01 17:38:28 -05:00
Joshua Boniface 8594eb697f Add initial fault generation in pvchealthd
References: #164
2023-12-01 17:38:27 -05:00
Joshua Boniface 7cb9ebae6b Remove legacy configuration handler
This is not going to be needed.
2023-12-01 01:25:40 -05:00
Joshua Boniface 102c3c3106 Port all Celery worker functions to discrete pkg
Moves all tasks run by the Celery worker into a discrete package/module
for easier installation. Also adjusts several parameters throughout to
accomplish this.
2023-11-30 02:24:54 -05:00
Joshua Boniface 03a738f878 Move config parser into daemon_lib
And reformat/add config values for API.
2023-11-30 00:05:37 -05:00
Joshua Boniface 11db3c5b20 Fix ordering during termination 2023-11-29 21:21:51 -05:00
Joshua Boniface fa12a3c9b1 Permit buffered log appending 2023-11-29 21:21:51 -05:00
Joshua Boniface 787f4216b3 Expand Zookeeper log daemon prefix to match 2023-11-29 21:21:51 -05:00
Joshua Boniface 83ceb41138 Add daemon name to Logger entries 2023-11-29 15:18:37 -05:00
Joshua Boniface 2e5958640a Remove erroneous time from message 2023-11-29 15:12:41 -05:00
Joshua Boniface 7abc697c8a Improve Zookeeper log handling
Ensures that messages are fully read before each append. Adds more
Zookeeper hits, but ensures logs won't be overwritten by multiple
daemons.

Also don't use a set on the client side, to avoid "removing duplicate"
entries erroneously.
2023-11-29 15:12:41 -05:00
Joshua Boniface dd6a38d5ea Properly pass the name of the exception 2023-11-16 18:05:52 -05:00
Joshua Boniface f50f170d4e Convert vmbuilder to use new Celery step structure 2023-11-16 16:08:49 -05:00
Joshua Boniface 83c4c6633d Readd RBD lock detection and clearing on startup
This is still needed due to the nature of the locks and freeing them on
startup, and to preserve lock=fail behaviour on VM startup.

Also fixes the fencing lock flush to directly use the client library
outside of Celery. I don't like this hack but it seems prudent until we
move fencing to the workers as well.
2023-11-10 01:33:48 -05:00
Joshua Boniface b522306f87 Increase Celery wait times
It's a bit inefficient, but provides nicer output and a bit of settling
time between each stage.
2023-11-09 23:54:05 -05:00
Joshua Boniface 07026efb63 Ensure OSD checks in before completing
Avoids issues where the new OSD doesn't check in; at least the
administrator will know.

Also fixes some issues with osd_db in removal.
2023-11-09 23:51:05 -05:00
Joshua Boniface 08411708f6 Clean up dangling references to cmd pipes
Also removes the schema references for these CMD pipes as they are no
longer required.
2023-11-09 23:28:14 -05:00
Joshua Boniface ce17c60a20 Port OSD on-node tasks to Celery worker system
Adds Celery versions of the osd_add, osd_replace, osd_refresh,
osd_remove, and osd_db_vg_add functions.
2023-11-09 23:28:08 -05:00
Joshua Boniface 89681d54b9 Port VM on-node tasks to Celery worker system
Adds Celery versions of the flush_locks, device_attach, and
device_detach functions.
2023-11-06 20:40:46 -05:00
Joshua Boniface a016337f57 Remove block verify in APi
This doesn't work right and is handled by the node anyways.
2023-11-04 02:45:10 -04:00
Joshua Boniface 7f5dd385b5 Use right key for FSID elsewhere 2023-11-03 23:51:01 -04:00
Joshua Boniface ec42b19d0e Send FSID to clients too 2023-11-03 16:37:55 -04:00
Joshua Boniface 64e37ae963 Update OSD replacement functionality
1. Simplify this by leveraging the existing remove_osd/add_osd
functions, since its task was functionally identical to those two in
sequential order.
2. Add support for split OSDs within the command (replacing all OSDs on
the block device(s) as required).
3. Add additional configurability and flexibility around the old device,
weight, and external DB LVs.
2023-11-03 01:45:49 -04:00
Joshua Boniface 980ea6a9e9 Adjust handling of ext_db and _count options
Avoid the use of superfluous flag options, default them to none, and add
support for fixed-size DB LVs.
2023-11-02 13:29:47 -04:00
Joshua Boniface 526a5f4a74 Add support for split OSD adds
Allows creating multiple OSDs on a single (NVMe) block device,
leveraging the "ceph-volume lvm batch" command. Replaces the previous
method of creating OSDs.

Also adds a new ZK item for each OSD indicating if it is split or not.
2023-11-01 21:31:35 -04:00
Joshua Boniface 35f80e544c Use more hierarchical backup path structure 2023-10-24 02:04:16 -04:00
Joshua Boniface 83b937654c Avoid removing nonexistent snapshots
Store retain_snapshot in JSON and use that to check during delete.
2023-10-24 01:35:00 -04:00
Joshua Boniface 714bde89e6 Fix incorrect variable ref 2023-10-24 01:25:01 -04:00
Joshua Boniface c87736eb0a Use consistent path name and format 2023-10-24 01:20:44 -04:00
Joshua Boniface 63d0a85e29 Add backup deletion command 2023-10-24 01:18:27 -04:00
Joshua Boniface 55ca131c2c Handle snapshots on restore and provide options
Also rename the retain option to remove superfluous plural.
2023-10-24 00:25:06 -04:00
Joshua Boniface 8d256a1737 Complete VM restore functionality 2023-10-23 22:23:17 -04:00
Joshua Boniface d3b3fdfc80 Revert "Export backup images to a tar archive"
This reverts commit 38abd078af.
2023-10-23 11:01:16 -04:00
Joshua Boniface f1b29ea94e Initial VM restore work 2023-10-23 11:00:54 -04:00
Joshua Boniface 38abd078af Export backup images to a tar archive
This helps ensure an easier restore as the tar archive(s) can be sent
directly to the API via the normal process of image uploading, instead
of individual disks.
2023-10-23 09:56:50 -04:00
Joshua Boniface fabb97cf48 Only split a command_string if its not a list 2023-10-23 09:50:58 -04:00
Joshua Boniface 68124db323 Remove extra spaces 2023-10-17 13:01:38 -04:00
Joshua Boniface 8921efd269 Fix incorrect tuple construct 2023-10-17 12:55:44 -04:00
Joshua Boniface 3d12915989 Further improve return messages 2023-10-17 12:53:08 -04:00
Joshua Boniface 67b0b19bca Use better time functionality 2023-10-17 12:39:37 -04:00
Joshua Boniface 5d0c674d1d Add runtime and adjust ordering 2023-10-17 12:32:40 -04:00
Joshua Boniface f441b0d823 Improve missing parent message 2023-10-17 12:17:29 -04:00
Joshua Boniface a5d0f219e4 Improve return messages 2023-10-17 12:10:55 -04:00
Joshua Boniface 0169510df0 Fix up datestring generation 2023-10-17 12:05:45 -04:00
Joshua Boniface a58c1d5a8c Fix bad snapshot removals 2023-10-17 12:02:24 -04:00
Joshua Boniface a8e4b01b67 Handle return data even better 2023-10-17 11:51:03 -04:00
Joshua Boniface 45c4c86911 Handle extra return variable 2023-10-17 11:47:01 -04:00
Joshua Boniface 6448b31d2c Improve VM list arguments
Use kwargs here instead of fixed args to allow default None values.
2023-10-17 11:01:38 -04:00
Joshua Boniface b997c6f31e Add support for full VM backups
Adds support for exporting full VM backups, including configuration,
metainfo, and RBD disk images, with incremental support.
2023-10-17 10:15:06 -04:00
Joshua Boniface a0b45a2bcd Always create RBDs with bytes value
Converting into human results in imprecise values when specifying bytes
directly, which in turn breaks VMDK image uploads. Instead, just use the
raw bytes value when creating the volume instead of converting it back.
2023-09-30 12:37:43 -04:00
Joshua Boniface c4397219da Ensure fencing states are properly reflected 2023-09-18 09:59:18 -04:00
Joshua Boniface 311bb69785 Format based on updated Black 2023-09-12 16:41:02 -04:00
Joshua Boniface 653b95ee25 Normalize return messages for node commands 2023-05-04 17:02:46 -04:00
Joshua Boniface 78322f4de4 Improve size handling during volume add/resize 2023-04-28 12:16:16 -04:00
Joshua Boniface c1782c5004 Add full/nearfull OSD health detection 2023-04-28 11:33:39 -04:00
Joshua Boniface e773211293 Add PVC version to cluster status output 2023-02-22 16:09:24 -05:00
Joshua Boniface 70ba364f1d Flip VM state condition to remove shutdown
Don't cause health degredation for shutdown state, and flip the list
around to make it clearer.
2023-02-16 20:32:33 -05:00
Joshua Boniface 1f8561d59a Format cluster health like node healths
Make a cleaner construct here.
2023-02-16 12:33:36 -05:00
Joshua Boniface 1093ca6264 Disallow health less than 0 2023-02-15 16:50:24 -05:00
Joshua Boniface 29584e5636 Add per-node health entries for 3rd party checks 2023-02-15 16:44:49 -05:00
Joshua Boniface f4e8449356 Fix bugs and formatting of health messages 2023-02-15 16:28:56 -05:00
Joshua Boniface ec79acf061 Fix linting of cluster.py file 2023-02-15 15:48:31 -05:00
Joshua Boniface 00586074cf Modify cluster health to use new values 2023-02-15 15:45:43 -05:00
Joshua Boniface f4eef30770 Add JSON health to cluster data 2023-02-15 15:26:57 -05:00
Joshua Boniface b07396c39a Fix bugs if plugins fail to load 2023-02-13 21:51:48 -05:00
Joshua Boniface e6f9e6e0e8 Fix several bugs and optimize output 2023-02-13 16:36:15 -05:00
Joshua Boniface 9c14d84bfc Add node health value and send out API 2023-02-13 15:53:39 -05:00
Joshua Boniface 3c742a827b Initial implementation of monitoring plugin system 2023-02-13 12:06:26 -05:00
Joshua Boniface 671a907236 Allow rename in disable state 2023-01-30 11:48:43 -05:00
Joshua Boniface 38d63d9837 Flip behaviour of memory selectors
It didn't make any sense to me for mem(prov) to be the default selector,
since this has too many caveats versus mem(free). Switch to using
mem(free) as the default (i.e. "mem") and make memprov the alternative.
2022-11-15 15:45:59 -05:00
Joshua Boniface 79eb994a5e Ensure equality of none and None for selector 2022-11-07 11:59:53 -05:00
Joshua Boniface 8af7189dd0 Add module tag for daemon lib 2022-11-04 03:47:18 -04:00
Joshua Boniface 726d0a562b Update copyright header year 2022-10-06 11:55:27 -04:00
Joshua Boniface 881550b610 Actually fix VM sorting
Due to the executor the previous attempt did not work.
2022-08-12 17:46:29 -04:00
Joshua Boniface bcabd7d079 Always sort VM list
Same justification as previous commit.
2022-08-09 12:05:40 -04:00
Joshua Boniface 05a316cdd6 Ensure the node list is sorted
Otherwise the node entries could come back in an arbitrary order; since
this is an ordered list of dictionaries that might not be expected by
the API consumers, so ensure it's always sorted.
2022-08-09 12:03:49 -04:00
Joshua Boniface d8d3feee22 Add selector help and adjust flag name
1. Add documentation on the node selector flags. In the API, reference
the daemon configuration manual which now includes details in this
section; in the CLI, provide the help in "pvc vm define" in detail and
then reference that command's help in the other commands that use this
field.

2. Ensure the naming is consistent in the CLI, using the flag name
"--node-selector" everywhere (was "--selector" for "pvc vm" commands and
"--node-selector" for "pvc provisioner" commands).
2022-06-10 02:42:06 -04:00
Joshua Boniface f8cdcb30ba Add migration selector via free memory
Closes #152
2022-05-18 03:47:16 -04:00
Joshua Boniface c401a1f655 Use consistent language for primary mode
I didn't call it "router" anywhere else, but the state in the list is
called "coordinator" so, call it "coordinator mode".
2022-05-06 15:40:52 -04:00
Joshua Boniface 7a40c7a55b Add support for replacing/refreshing OSDs
Adds commands to both replace an OSD disk, and refresh (reimport) an
existing OSD disk on a new node. This handles the cases where an OSD
disk should be replaced (either due to upgrades or failures) or where a
node is rebuilt in-place and an existing OSD must be re-imported to it.

This should avoid the need to do a full remove/add sequence for either
case.

Also cleans up some aspects of OSD removal that are identical between
methods (e.g. using safe-to-destroy and sleeping after stopping) and
fixes a bug if an OSD does not truly exist when the daemon starts up.
2022-05-06 15:32:06 -04:00
Joshua Boniface 464f0e0356 Store additional OSD information in ZK
Ensures that information like the FSIDs and the OSD LVM volume are
stored in Zookeeper at creation time and updated at daemon start time
(to ensure the data is populated at least once, or if the /dev/sdX
path changes).

This will allow safer operation of OSD removals and the potential
implementation of re-activation after node replacements.
2022-05-02 12:11:39 -04:00
Joshua Boniface d6ca74376a Fix bugs with forced removal 2022-04-29 14:03:07 -04:00
Joshua Boniface 4d698be34b Add OSD removal force option
Ensures a removal can continue even in situations where some step(s)
might fail, for instance removing an obsolete OSD from a replaced node.
2022-04-29 11:16:33 -04:00
Joshua Boniface 1142454934 Add pool PGs count modification
Allows an administrator to adjust the PG count of a given pool. This can
be used to increase the PGs (for example after adding more OSDs) or
decrease it (to remove OSDs, reduce CPU load, etc.).
2021-12-28 21:53:29 -05:00
Joshua Boniface bbfad340a1 Add PGs count to pool list 2021-12-28 21:12:02 -05:00
Joshua Boniface c73939e1c5 Fix issue if pool stats have not updated yet 2021-12-28 21:03:10 -05:00
Joshua Boniface 25fe45dd28 Add device class tiers to Ceph pools
Allows specifying a particular device class ("tier") for a given pool,
for instance SSD-only or NVMe-only. This is implemented with Crush
rules on the Ceph side, and via an additional new key in the pool
Zookeeper schema which is defaulted to "default".
2021-12-28 20:58:15 -05:00
Joshua Boniface 6ccd19e636 Standardize fuzzy matching and use fullmatch
Solves two problems:

1. How match fuzziness was used was very inconsistent; make them all the
same, i.e. "if is_fuzzy and limit, apply .* to both sides".

2. Use re.fullmatch instead of re.match to ensure exact matching of the
regex to the value. Without fuzziness, this would sometimes cause
inconsistent behavior, for instance if a limit was non-fuzzy "vm",
expecting to match the actual "vm", but also matching "vm1" too.
2021-12-06 16:35:29 -05:00
Joshua Boniface 0d857d5ab8 Use positive check rather than negative
Ensure the VM is start before doing shutdown/stop, rather than being
stopped. Prevents overwrite of existing disable state and other
weirdness.
2021-11-06 04:08:33 -04:00
Joshua Boniface 5f193a6134 Perform automatic shutdown/stop on VM disable
Instead of requiring the VM to already be stopped, instead allow disable
state changes to perform a shutdown first. Also add a force option which
will do a hard stop instead of a shutdown.

References #148
2021-11-06 03:57:24 -04:00
Joshua Boniface c41664d2da Reformat code with Black code formatter
Unify the code style along PEP and Black principles using the tool.
2021-11-06 03:02:43 -04:00
Joshua Boniface 87cda72ca9 Fix invalid schema key
Addresses #144
2021-10-09 18:42:33 -04:00
Joshua Boniface 24de0f4189 Add MTU to network creation/modification
Addresses #144
2021-10-09 17:51:32 -04:00
Joshua Boniface 50d8aa0586 Add handlers for client network MTUs
Refactors some of the code in VXNetworkInterface to handle MTUs in a
more streamlined fashion. Also fixes a bug whereby bridge client
networks were being explicitly given the cluster dev MTU which might not
be correct. Now adds support for this option explicitly in the configs,
and defaults to 1500 for safety (the standard Ethernet MTU).

Addresses #144
2021-10-09 17:02:27 -04:00
Joshua Boniface c0f7ba0125 Add limit negation to VM list
When using the "state", "node", or "tag" arguments to a VM list, add
support for a "negate" flag to look for all VMs *not in* the state,
node, or tag state.
2021-10-07 11:50:52 -04:00
Joshua Boniface 65df807b09 Add support for configurable OSD DB ratios
The default of 0.05 (5%) is likely ideal in the initial implementation,
but allow this to be set explicitly for maximum flexibility in
space-constrained or performance-critical use-cases.
2021-09-24 01:06:39 -04:00
Joshua Boniface adc8a5a3bc Add separate OSD DB device support
Adds in three parts:

1. Create an API endpoint to create OSD DB volume groups on a device.
Passed through to the node via the same command pipeline as
creating/removing OSDs, and creates a volume group with a fixed name
(osd-db).

2. Adds API support for specifying whether or not to use this DB volume
group when creating a new OSD via the "ext_db" flag. Naming and sizing
is fixed for simplicity and based on Ceph recommendations (5% of OSD
size). The Zookeeper schema tracks the block device to use during
removal.

3. Adds CLI support for the new and modified API endpoints, as well as
displaying the block device and DB block device in the OSD list.

While I debated supporting adding a DB device to an existing OSD, in
practice this ended up being a very complex operation involving stopping
the OSD and setting some options, so this is not supported; this can be
specified during OSD creation only.

Closes #142
2021-09-23 13:59:49 -04:00
Joshua Boniface 58db537093 Add memory and vCPU checks to VM define/modify
Ensures that a VM won't:

(a) Have provisioned more RAM than there is available on a given node.
Due to memory overprovisioning, this is simply a "is the VM memory count
more than the node count", and doesn't factor in free or used memory on
a node, total cluster usage, etc. So if a node has 64GB total RAM, the
VM limit is 64GB. It is up to an administrator to ensure sanity *below*
that value.

(b) Have provisioned more vCPUs than there are CPU cores on the node,
minus 2 to account for hypervisor/storage processes. Will ensure there
is no severe CPU contention caused by a single VM having more vCPUs than
there are actual execution threads available.

Closes #139
2021-09-13 01:51:21 -04:00
Joshua Boniface e71a6c90bf Add pool size check when resizing volumes
Closes #140
2021-09-12 19:54:51 -04:00
Joshua Boniface e962743e51 Add VM device hot attach/detach support
Adds a new API endpoint to support hot attach/detach of devices, and the
corresponding client-side logic to use this endpoint when doing VM
network/storage add/remove actions.

The live attach is now the default behaviour for these types of
additions and removals, and can be disabled if needed.

Closes #141
2021-09-12 19:33:00 -04:00
Joshua Boniface 73e8149cb0 Remove explicit image-features from rbd cmd
This should be managed in ceph.conf with the `rbd default
features` configuration option instead, and thus can be tailored to the
underlying OS version.
2021-07-30 11:33:59 -04:00
Joshua Boniface 4a7246b8c0 Ensure RBD resize has bytes appended
If this isn't, the resize will be interpreted as a MB value and result
in an absurdly big volume instead. This is the same consistency
validation that occurs on add.
2021-07-30 11:25:13 -04:00
Joshua Boniface c49351469b Revert "Ensure consistent sizing of volumes"
This reverts commit dc03e95bbf.
2021-07-29 15:30:00 -04:00
Joshua Boniface dc03e95bbf Ensure consistent sizing of volumes
Convert from human to bytes, then to megabytes and always pass this to
the RBD command. This ensures consistency regardless of what is actually
passed by the user.
2021-07-29 15:14:25 -04:00
Joshua Boniface 45f23c12ea Remove logs from schema validation
These are managed entirely by the logging subsystem not by the schema
handler due to catch-22's.
2021-07-20 00:00:37 -04:00
Joshua Boniface b14bc7e3a3 Add retry to log writes 2021-07-19 13:11:28 -04:00
Joshua Boniface 4d6842f942 Don't bail out if write fails, keep retrying 2021-07-19 13:09:36 -04:00
Joshua Boniface e9df043c0a Ensure ZK logging does not block startup 2021-07-19 12:19:59 -04:00
Joshua Boniface 5be968123f Readd 1 second queue get timeout
Otherwise daemon stops will sometimes inexplicably block.
2021-07-18 22:17:57 -04:00
Joshua Boniface 99fd7ebe63 Fix excessive CPU due to looping 2021-07-18 22:06:50 -04:00
Joshua Boniface cffc96d156 Fix failure in creating base keys 2021-07-18 21:00:23 -04:00
Joshua Boniface b770e15a91 Fix final termination of logger
We need to do a bit more finagling with the logger on termination to
ensure that all messages are written and the queue drained before
actually terminating.
2021-07-18 19:53:00 -04:00
Joshua Boniface 982dfd52c6 Adjust date output format 2021-07-18 19:00:54 -04:00
Joshua Boniface a088aa4484 Add node log functions to API and CLI 2021-07-18 18:54:28 -04:00
Joshua Boniface 323c7c41ae Implement node logging into Zookeeper
Adds the ability to send node daemon logs to Zookeeper to facilitate a
command like "pvc node log", similar to "pvc vm log". Each node stores
its logs in a separate tree under "/logs" which can then be combined or
queried. By default, set by config, only 2000 lines are kept.
2021-07-18 17:11:43 -04:00
Joshua Boniface 75fb60b1b4 Add VM list filtering by tag
Uses same method as state or node filtering, rather than altering how
the main LIMIT field works.
2021-07-14 00:59:20 -04:00
Joshua Boniface 9ea9ac3b8a Revamp tag handling and display
Add an additional protected class, limit manipulation to one at a time,
and ensure future flexibility. Also makes display consistent with other
VM elements.
2021-07-13 22:39:52 -04:00
Joshua Boniface c0a3467b70 Simplify VM metadata reads
Directly call the new common getDomainMetadata function to avoid
excessive Zookeeper calls for this information.
2021-07-13 19:05:33 -04:00
Joshua Boniface 9a199992a1 Add functions for manipulating VM tags
Adds tags to schema (v3), to VM definition, adds function to modify
tags, adds function to get tags, and adds tags to VM data output.

Tags will enable more granular classification of VMs based either on
administrator configuration or from automated system events.
2021-07-13 19:05:33 -04:00
Joshua Boniface c76149141f Only log ZK connections when persistent
Prevents spam in the API logs.
2021-07-10 23:35:49 -04:00
Joshua Boniface 0699c48d10 Fix bad schema path name 2021-07-09 16:47:09 -04:00
Joshua Boniface 4832245d9c Handle non-RBD disks and non-RBD errors better 2021-07-09 15:48:57 -04:00
Joshua Boniface 2138f2f59f Fail VM removal on disk removal failures
Prevents bad states where the VM is "removed" but some of its disks
remain due to e.g. stuck watchers.

Rearrange the sequence so it goes stop, delete disks, then delete VM,
and then return a failure if any of the disk(s) fail to remove, allowing
the task to be rerun after fixing the problem.
2021-07-09 15:39:06 -04:00
Joshua Boniface d1d355a96b Avoid errors if stats data is None 2021-07-09 13:13:54 -04:00
Joshua Boniface c0c9327a7d Return an empty log if the value is None 2021-07-09 13:08:00 -04:00
Joshua Boniface 5ffabcfef5 Avoid failing if we can't get the future data 2021-07-09 13:05:37 -04:00
Joshua Boniface 80fe96b24d Add some additional docstrings 2021-07-07 12:28:08 -04:00
Joshua Boniface 80f04ce8ee Remove connection renewal in state handler
Regenerating the ZK connection was fraught with issues, including
duplicate connections, strange failures to reconnect, and various other
wonkiness.

Instead let Kazoo handle states sensibly. Kazoo moves to SUSPENDED state
when it loses connectivity, and stays there indefinitely (based on
cursory tests). And Kazoo seems to always resume from this just fine on
its own. Thus all that hackery did nothing but complicate reconnection.

This therefore turns the listener into a purely informational function,
providing logs of when/why it failed, and we also add some additional
output messages during initial connection and final disconnection.
2021-07-07 11:55:12 -04:00
Joshua Boniface a8c28786dd Better handle empty ipaths in schema
When trying to write to sub-item paths that don't yet exist, the
previous method would just blindly write to whatever the root key is,
which is never what we actually want.

Instead, check explicitly for a "base path" situation, and handle that.
Then, if we try to get a subpath that isn't valid, return None. Finally
in the various functions, if the path is None, just continue (or return
false/None) and (try to) chug along.
2021-07-05 23:35:03 -04:00
Joshua Boniface c45804e8c1 Revert "Return none if a schema path is not found"
This reverts commit b1fcf6a4a5.
2021-07-05 23:16:39 -04:00
Joshua Boniface b1fcf6a4a5 Return none if a schema path is not found
This can cause overwriting of unintended keys, so should not be
happening. Will have to find the bugs this causes.
2021-07-05 17:15:55 -04:00
Joshua Boniface a69105569f Add node PVC version data to Node information
Allows API client to see the currently-active version of the node
daemon.
2021-07-05 09:57:38 -04:00
Joshua Boniface e44f3d623e Remove unnecessary try/except blocks from VM reads
The zkhandler read() function takes care of ensuring there is a None
value returned if these fail, so these aren't required. Makes the code a
fair bit more readable here.
2021-07-02 12:01:58 -04:00
Joshua Boniface 43009486ae Move Ceph pool/volume list assembly to thread pool
Same reasons as the VM list, though less impactful.
2021-07-01 17:33:13 -04:00
Joshua Boniface 58789f1db4 Move VM list assembly to thread pool
This helps parallelize the numerous Zookeeper calls a little bit, at
least within the bounds of the GIL, to improve performance when getting
a large list of VMs. The max_workers value is capped at 32 to avoid
causing too many threads during concurrent executions, but still
provides a noticeable speedup (on the order of 0.2-0.4 seconds with 75
VMs, scaling up further as counts grow).
2021-07-01 17:32:47 -04:00
Joshua Boniface baf4c3fbc7 Add performance profiler function
Usable anywhere that the global daemon "config" parameter can be passed
in (e.g. pvcapid/helper.py, pvcnoded/Daemon.py, etc.). Stores results in
a subdirectory of the PVC logdir called "profiler" if this directory can
be created, or prints results.

The debug config parameter ensures that the profiler can be added to
functions and not run unless the server is explicitly in debug mode.
Might not be useful as I don't initially plan to add this to every
function (only when investigating performance problems), but this
flexibility allows that to change later.
2021-07-01 14:01:33 -04:00
Joshua Boniface e093efceb1 Add NoNodeError handlers in ZK locks
Instead of looping 5+ times acquiring an impossible lock on a
nonexistent key, just fail on a different error and return failure
immediately.

This is likely a major corner case that shouldn't happen, but better to
be safe than 500.
2021-07-01 01:17:38 -04:00
Joshua Boniface a080598781 Avoid superfluous ZK exists calls
These cause a major (2x) slowdown in read calls since Zookeeper
connections are expensive/slow. Instead, just try the thing and return
None if there's no key there.

Also wrap the children command in similar error handling since that did
not exist and could likely cause some bugs at some point.
2021-07-01 01:15:51 -04:00
Joshua Boniface 6adaf1f669 Fix incorrect handling of deletions in init 2021-06-29 18:41:02 -04:00
Joshua Boniface f91c07fdcf Re-add UUID limit matching for full UUIDs
This *was* valuable when passing a full UUID in, so go back to that.
Verify first that the limit string is an actual UUID, and then compare
against it if applicable.
2021-06-28 12:27:43 -04:00
Joshua Boniface c54f66efa8 Limit match only on VM name
I can see no possible reason to want to do limits against UUIDs, but
supporting that means match is not what one would expect since a random
UUID could match the limit. So only limit based on the name.
2021-06-23 19:17:35 -04:00
Joshua Boniface cd860bae6b Optimize VM list in API
With many VMs this slows down linearly. Rework it a bit so there are
fewer calls to getInformationFromXML and so the processing could happen
in parallel at some point.
2021-06-23 19:14:26 -04:00
Joshua Boniface 07dbd55f03 Use list comprehension to compare against source 2021-06-22 02:31:14 -04:00
Joshua Boniface 6cd0ccf0ad Fix network check on VM config modification 2021-06-22 02:21:55 -04:00
Joshua Boniface e623909a43 Store PHY MAC for VFs and restore after free 2021-06-22 00:56:47 -04:00
Joshua Boniface 60e1da09dd Don't try any shenannegans when updating NICs
Trying to do this on the VMInstance side had problems because we can't
differentiate the 3 types of migration there. So, just update this in
the API side and hope everything goes well.

This introduces an edge bug: if a VM is using a macvtap SR-IOV device,
and then tries to migrate, and the migrate is aborted, the NIC lists
will be inconsistent.

When I revamp the VMInstance in the future, I should be able to correct
this, but for now we'll have to live with that edgecase.
2021-06-22 00:00:50 -04:00
Joshua Boniface dc560c1dcb Better handle retcodes in migrate update 2021-06-21 23:46:47 -04:00
Joshua Boniface 68c7481aa2 Ensure offline migrations update SR-IOV NIC states 2021-06-21 23:35:52 -04:00
Joshua Boniface 24ce361a04 Ensure SR-IOV NIC states are updated on migration 2021-06-21 23:18:34 -04:00
Joshua Boniface eeb83da97d Add support for SR-IOV NICs to VMs 2021-06-21 23:18:22 -04:00
Joshua Boniface 64d1a37b3c Add PCIe device paths to SR-IOV VF information
This will be used when adding VM network interfaces of type hostdev.
2021-06-21 21:08:46 -04:00
Joshua Boniface 13cc0f986f Implement SR-IOV VF config set
Also fixes some random bugs, adds proper interface sorting, and assorted
tweaks.
2021-06-21 18:40:11 -04:00
Joshua Boniface 33195c3c29 Ensure VF list is sorted 2021-06-21 17:11:48 -04:00
Joshua Boniface a697c2db2e Add SRIOV PF and VF listing to API 2021-06-21 01:42:55 -04:00
Joshua Boniface 509afd4d05 Add hostdev net_type to handler as well 2021-06-17 01:52:58 -04:00
Joshua Boniface e7b6a3eac1 Implement SR-IOV PF and VF instances
Adds support for the node daemon managing SR-IOV PF and VF instances.

PFs are added to Zookeeper automatically based on the config at startup
during network configuration, and are otherwise completely static. PFs
are automatically removed from Zookeeper, along with all coresponding
VFs, should the PF phy device be removed from the configuration.

VFs are configured based on the (autocreated) VFs of each PF device,
added to Zookeeper, and then a new class instance, SRIOVVFInstance, is
used to watch them for configuration changes. This will enable the
runtime management of VF settings by the API. The set of keys ensures
that both configuration and details of the NIC can be tracked.

Most keys are self-explanatory, especially for PFs and the basic keys
for VFs. The configuration tree is also self-explanatory, being based
entirely on the options available in the `ip link set {dev} vf` command.

Two additional keys are also present: `used` and `used_by`, which will
be able to track the (boolean) state of usage, as well as the VM that
uses a given VIF. Since the VM side implementation will support both
macvtap and direct "hostdev" assignments, this will ensure that this
state can be tracked on both the VF and the VM side.
2021-06-17 01:33:03 -04:00
Joshua Boniface f540dd320b Allow VNI for "direct" type vNICs 2021-06-15 00:27:01 -04:00
Joshua Boniface 23318524b9 Ensure validate writes a valid schema version 2021-06-14 21:27:37 -04:00
Joshua Boniface 5f11b3198b Fix base schema None issue in handler too 2021-06-14 21:13:40 -04:00
Joshua Boniface 20c773413c Fix bug in snapshot rename 2021-06-14 00:55:26 -04:00
Joshua Boniface 49f4feb482 Fix typo bug in key rename 2021-06-14 00:51:45 -04:00
Joshua Boniface 30a160d5ff Fix invalid type_key 2021-06-13 21:20:10 -04:00
Joshua Boniface 1cbc66dccf Fix bugs in lease listing 2021-06-13 21:10:42 -04:00
Joshua Boniface bbd903e568 Fix bad schema name 2021-06-13 21:02:44 -04:00
Joshua Boniface 9511dc9864 Correct issue with invalid ACL ordering 2021-06-13 20:55:28 -04:00
Joshua Boniface 3013973975 Fix bad schema names 2021-06-13 20:32:41 -04:00
Joshua Boniface 8269930d40 Fix bad entry in network add 2021-06-13 18:22:13 -04:00
Joshua Boniface ae79113f7c Correct key typo and add error handler 2021-06-13 15:49:30 -04:00
Joshua Boniface 3bad3de720 Verify if key exists before reading 2021-06-13 15:39:43 -04:00
Joshua Boniface 680c62a6e4 Fix schema path call and version check 2021-06-13 14:46:30 -04:00
Joshua Boniface 88a1d89501 Fix bad key name 2021-06-13 14:29:54 -04:00
Joshua Boniface 7110a42e5f Add final schema elements after refactoring 2021-06-13 14:26:17 -04:00
Joshua Boniface 01c82f5d19 Move backup and restore into common 2021-06-13 14:25:51 -04:00
Joshua Boniface 059230d369 Convert vm.py to new ZK schema handler 2021-06-13 13:41:21 -04:00
Joshua Boniface f6e37906a9 Convert node.py to new ZK schema handler 2021-06-13 13:18:34 -04:00
Joshua Boniface 0a162b304a Convert network.py to new ZK schema handler 2021-06-12 18:40:25 -04:00
Joshua Boniface f071343333 Add DHCP lease schema and temp workaround 2021-06-12 18:22:43 -04:00
Joshua Boniface 01c762a362 Convert common.py to new ZK schema handler 2021-06-12 17:59:09 -04:00
Joshua Boniface 9b1bd8476f Convert cluster.py to new ZK schema handler 2021-06-12 17:11:32 -04:00
Joshua Boniface 6d00ec07b5 Convert ceph.py to new ZK schema handler 2021-06-12 17:09:29 -04:00
Joshua Boniface 247ae4fe2d Fix pre-refactor path bug 2021-06-10 01:18:33 -04:00
Joshua Boniface b1c13c9fc1 Fix another bug with read call 2021-06-10 01:08:18 -04:00
Joshua Boniface 75fc40a1e8 Fix bug with nkipath 2021-06-10 01:00:40 -04:00
Joshua Boniface 2aa7f87ca9 Fix bug in creating child path keys 2021-06-10 00:55:54 -04:00
Joshua Boniface 5273c4ebfa Fix bug with encoding raw creates 2021-06-10 00:52:07 -04:00
Joshua Boniface 8dc9fd6dcb Fix bug with sub self command path/key 2021-06-10 00:49:01 -04:00
Joshua Boniface f030ed974c Correct schema and handling of network subkeys
Required a bit of refactoring in the validation code to ensure we have
direct access, without relying on the translations done in the normal
zkhandler functions.
2021-06-10 00:35:42 -04:00
Joshua Boniface 9985e1dadd Add support for 2-level dynamic keys 2021-06-09 23:52:21 -04:00
Joshua Boniface 7e42118e6f Adjust lock schema in NodeInstance and VMInstance
Removes a superfluous lock and puts the sync_lock keys in more usable
places.
2021-06-09 22:51:00 -04:00
Joshua Boniface 24663a3333 Add missing VM schema entry 2021-06-09 22:12:24 -04:00
Joshua Boniface a9a57533a7 Integrate schema handling within ZKHandler
Abstracts away the schema management, especially when doing actions, to
prevent duplication in other areas.
2021-06-09 13:23:57 -04:00
Joshua Boniface 76c37e6628 Tweak some field names slightly and add missing 2021-06-09 09:58:18 -04:00
Joshua Boniface 0a04adf8f9 Allow empty sub_paths 2021-06-09 01:54:29 -04:00
Joshua Boniface f2b55ba937 Fix some bugs with migrations 2021-06-09 00:04:16 -04:00
Joshua Boniface 5540bdc86b Add automatic schema upgrade to nodes
Performs an automatic schema upgrade when all nodes are updated to the
latest version.

Addresses #129
2021-06-08 23:35:39 -04:00