Commit Graph

3201 Commits

Author SHA1 Message Date
Joshua Boniface 0cb81f96e6 Use custom task IDs for Celery tasks
Full UUIDs were obnoxiously long, so switch to using just the first
8-character section of a UUID instead. Keeps the list nice and short,
makes them easier to copy, and is just generally nicer.

Could this cause uniqueness problems? Perhaps, but I don't see that
happening nearly frequently enough to matter.
2023-11-16 13:22:14 -05:00
Joshua Boniface 3651885954 Add --events to workers 2023-11-16 12:35:54 -05:00
Joshua Boniface d226e9f4e5 Enable extended Celery results 2023-11-16 12:02:57 -05:00
Joshua Boniface fa361a55d9 Explicitly use kwargs in Celery task calls 2023-11-16 11:55:30 -05:00
Joshua Boniface 8915864fa9 Lower truncation size and add elipses 2023-11-16 11:47:36 -05:00
Joshua Boniface 79f7e8f82e Skip "run_on" argument in output
This isn't required to know, it's internal.
2023-11-16 11:46:15 -05:00
Joshua Boniface 0d818017e8 Name the celery workers pvcworkerd@<hostname> 2023-11-16 11:43:17 -05:00
Joshua Boniface eb1d61a8b9 Generalize task status output 2023-11-16 11:39:08 -05:00
Joshua Boniface 262babc63d Use kwargs for all task arguments
This will help ensure that the CLI frontend can properly parse the args
in a consistent way.
2023-11-16 10:10:48 -05:00
Joshua Boniface 63773a3061 Allow watching existing task via cluster task 2023-11-16 03:06:13 -05:00
Joshua Boniface 289049d223 Properly handle a "primary" run_on value 2023-11-16 02:49:29 -05:00
Joshua Boniface e818df5dae Use enable/disable --now instead of two commands
Avoids needing two calls here especially for the stop.
2023-11-16 02:40:35 -05:00
Joshua Boniface c76a5afd04 Avoid waits during node secondary
Waiting for the daemons to stop took too much time on some nodes and
could throw off the lockstep. Instead, leverage background=True to run
the systemctl os_commands in the background (when they complete is
irrelevant), stop the Metadata API first, and don't delay during its
stop at all.
2023-11-16 02:34:12 -05:00
Joshua Boniface 0bec6abe71 Return proper run_on for ported tasks 2023-11-16 02:28:57 -05:00
Joshua Boniface 18e43a9377 Adjust name in worker log output 2023-11-16 02:25:14 -05:00
Joshua Boniface 4555f5a20a Remove warnings when switch coordinator state
Tasks are no longer bound to the primary coordinator for state updates
due to using KeyDB and a proper shared queue and result backend, so this
warning is now obsolete and no longer required.

This would interrupt "--wait" commands on provisioner tasks, but we no
longer believe that this warrants a warning, as the affected user could
simply run "pvc cluster task" to validate or resume the watcher.
2023-11-16 02:15:01 -05:00
Joshua Boniface d727764ebc Remove obsolete status and add cluster task
Removes the obsoleted "pvc provisioner status" command and replaces it
with a generalized "pvc cluster task" command to show all
currently-active or pending tasks on the cluster workers.
2023-11-16 02:13:26 -05:00
Joshua Boniface 484e6542c2 Port remaining tasks to new task handler
Move the create_vm and run_benchmark tasks to use the new Celery
subsystem, handlers, and wait command. Remove the obsolete, dedicated
API endpoints.

Standardize the CLI client and move the repeated handler code into a
separate common function.
2023-11-16 02:00:23 -05:00
Joshua Boniface aef38639cf Rename pvcapid-worker to pvcworkerd 2023-11-15 20:31:39 -05:00
Joshua Boniface 5f1432ccdd Fix memory allocation updates and add more debug
Previously, we were assigning memalloc/memprov/vcpualloc during an
earlier phase using the main d_domain list. I'm not sure exactly why,
but this was throwing off stats after a fence. Instead, set these values
later on while parsing the actually-active VMs.
2023-11-10 10:29:32 -05:00
Joshua Boniface d6b8808448 Clean up fencing handler
1. Remove all format strings in favour of f-strings
2. Ensure all logger messages have a prefix
3. Add a few more logger messages for clarity
2023-11-10 10:09:54 -05:00
Joshua Boniface 83c4c6633d Readd RBD lock detection and clearing on startup
This is still needed due to the nature of the locks and freeing them on
startup, and to preserve lock=fail behaviour on VM startup.

Also fixes the fencing lock flush to directly use the client library
outside of Celery. I don't like this hack but it seems prudent until we
move fencing to the workers as well.
2023-11-10 01:33:48 -05:00
Joshua Boniface 2a9bc632fa Add node monitoring plugin for KeyDB/Redis 2023-11-10 00:56:46 -05:00
Joshua Boniface b5e4c52387 Increase worker concurrency to 3 2023-11-10 00:39:42 -05:00
Joshua Boniface b522306f87 Increase Celery wait times
It's a bit inefficient, but provides nicer output and a bit of settling
time between each stage.
2023-11-09 23:54:05 -05:00
Joshua Boniface 07026efb63 Ensure OSD checks in before completing
Avoids issues where the new OSD doesn't check in; at least the
administrator will know.

Also fixes some issues with osd_db in removal.
2023-11-09 23:51:05 -05:00
Joshua Boniface d7ea705e31 Improve waiter output
Add an extra newline, show the name of the task (from start()), and show
the first step as a "Gathering information" message on the progressbar.
2023-11-09 23:28:18 -05:00
Joshua Boniface 08411708f6 Clean up dangling references to cmd pipes
Also removes the schema references for these CMD pipes as they are no
longer required.
2023-11-09 23:28:14 -05:00
Joshua Boniface ce17c60a20 Port OSD on-node tasks to Celery worker system
Adds Celery versions of the osd_add, osd_replace, osd_refresh,
osd_remove, and osd_db_vg_add functions.
2023-11-09 23:28:08 -05:00
Joshua Boniface 89681d54b9 Port VM on-node tasks to Celery worker system
Adds Celery versions of the flush_locks, device_attach, and
device_detach functions.
2023-11-06 20:40:46 -05:00
Joshua Boniface f0c2e9d295 Don't start pvcapid-worker on primary
It will be running anyways
2023-11-05 19:44:00 -05:00
Joshua Boniface 2c15036f86 Add KeyDB to node startup services
Also ensure API worker starts on all nodes, not just coordinators.
2023-11-05 19:26:38 -05:00
Joshua Boniface 42ed6f6420 Remove redis as a dependency 2023-11-05 18:23:34 -05:00
Joshua Boniface 3dc1f57de2 Revert "Switch to ZK+PG over Redis for Celery queue"
This reverts commit 54215bab6c.
2023-11-05 17:10:46 -05:00
Joshua Boniface b99b4e64b2 Ensure store path is passed properly 2023-11-05 16:48:47 -05:00
Joshua Boniface 91af1175ef Fix missing CLI_CONFIG in echo() 2023-11-04 15:17:50 -04:00
Joshua Boniface af8a8d969e Ensure queues are set up for non-coordinator nodes
Allows a runner to operate on every possible node, not just
coordinators, as OSDs or other things could be on any node.

Also add more comments.
2023-11-04 15:05:07 -04:00
Joshua Boniface a6caac1b78 Add Celery queue routing for tasks
By default, tasks will continue to run as they did, on the primary
coordinator's task runner. However this opens the possibility for
defining more tasks that will run on other nodes or coordinators.
2023-11-04 14:29:59 -04:00
Joshua Boniface 30d7e49401 Start API worker with node daemon on coordinators 2023-11-04 13:08:16 -04:00
Joshua Boniface ab629f6b51 Use per-host hostname and queues in worker
Opens up the ability to direct tasks to specific workers.
2023-11-04 13:02:30 -04:00
Joshua Boniface 54215bab6c Switch to ZK+PG over Redis for Celery queue
Redis did not provide a distributed solution for the worker, which
precluded several important planned functions. So instead, move to using
Zookeeper + PostgreSQL as the broker and result backend respectively.

Should be a seamless drop-in change but for future uses requires the
database host to be the primary coordinator IP rather than localhost, so
that writes can occur to the database from non-primary hosts.
2023-11-04 12:46:34 -04:00
Joshua Boniface 7490f13b7c Check for partition tables on new devices 2023-11-04 03:13:58 -04:00
Joshua Boniface d1602f35de Adjust split indicator 2023-11-04 02:56:21 -04:00
Joshua Boniface 7cdedde2fb Adjust wording about extdb 2023-11-04 02:54:25 -04:00
Joshua Boniface ab156b14b7 Update help messages for OSD refresh 2023-11-04 02:47:04 -04:00
Joshua Boniface a016337f57 Remove block verify in APi
This doesn't work right and is handled by the node anyways.
2023-11-04 02:45:10 -04:00
Joshua Boniface e32054be81 Refactor refresh as well 2023-11-04 02:44:52 -04:00
Joshua Boniface 18d32fede3 Fix wording of detect strings 2023-11-04 01:37:07 -04:00
Joshua Boniface b3d13fe9be Add log message for zap 2023-11-04 01:02:51 -04:00
Joshua Boniface 48b2ccbd95 Add timeout for safe-to-destroy
Continuously take the OSD down and out while doing so.
2023-11-04 00:55:05 -04:00