400 Commits

Author SHA1 Message Date
Joshua Boniface b9fbfe2ed5 Improve fault ID format
Instead of using random hex characters from an md5sum, use a nice name
in all-caps similar to how Ceph does. This further helps prevent dupes
but also permits a changing health delta within a single event (which
would really only ever apply to plugin faults).
2023-12-09 16:48:14 -05:00
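
The change in ID style described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names `md5_fault_id` and `named_fault_id`, the 8-character length, and the underscore convention are all assumptions made for the example.

```python
import hashlib
import re


def md5_fault_id(message: str, length: int = 8) -> str:
    """Older style (assumed): an opaque ID cut from the message's md5 hex digest."""
    return hashlib.md5(message.encode("utf-8")).hexdigest()[:length].upper()


def named_fault_id(name: str) -> str:
    """Newer style (assumed): a readable all-caps name, similar to Ceph health codes."""
    # Collapse each run of non-alphanumeric characters into a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")
    return cleaned.upper()


if __name__ == "__main__":
    print(md5_fault_id("Ceph health warning"))
    print(named_fault_id("ceph health warn"))
```

Because the readable ID is derived from a stable name rather than from the full message text, the message (and its health delta) can change without producing a new ID.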
Joshua Boniface 7e6d922877 Improve fault detail handling further
Since we already had a "details" field, simply move where it gets added
to the message later, in generate_fault, after the main message value
was used to generate the ID.
2023-12-09 16:13:36 -05:00
Joshua Boniface 4003204f14 Remove bracketed text from fault_str
This ensures that certain faults, e.g. Ceph status faults, will be
combined despite the added text in brackets, while still keeping them
mostly separate.

Also ensure the health text is updated each time to assist with this, as
this health text may now change independent of the fault ID.
2023-12-09 15:34:18 -05:00
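
One way to picture the bracket stripping is the sketch below. It assumes that "bracketed text" means square-bracketed groups and that the stripped string is what feeds ID generation; the helper name `strip_bracketed` is hypothetical.

```python
import re


def strip_bracketed(fault_str: str) -> str:
    """Remove "[...]" groups so variable details do not affect fault identity."""
    without_brackets = re.sub(r"\[[^\]]*\]", "", fault_str)
    # Squeeze any whitespace left behind by the removal.
    return " ".join(without_brackets.split())


if __name__ == "__main__":
    a = strip_bracketed("Ceph HEALTH_WARN [1 osds down]")
    b = strip_bracketed("Ceph HEALTH_WARN [2 osds down]")
    print(a == b)
```

Two messages that differ only in their bracketed detail reduce to the same string, so they would be combined into one fault while the displayed health text can still be refreshed separately.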
Joshua Boniface 2bea78d25e Make all remaining limits optional 2023-12-09 13:43:58 -05:00
Joshua Boniface fd717b702d Use external list of fault states 2023-12-09 12:51:41 -05:00
Joshua Boniface 317ca4b98c Move defined state combinations into common 2023-12-09 12:36:32 -05:00
Joshua Boniface 0bda095571 Move libvirt_schema and fix other imports 2023-12-09 12:20:29 -05:00
Joshua Boniface 813aef1463 Fix incorrect UUID key name 2023-12-09 12:14:57 -05:00
Joshua Boniface 5a7ea25266 Fix incorrect database name entries 2023-12-09 12:12:00 -05:00
Joshua Boniface 61b39d0739 Fix incorrect cluster health calculation 2023-12-07 11:13:36 -05:00
Joshua Boniface 4bf80a5913 Fix missing datetime shrink 2023-12-06 17:15:36 -05:00
Joshua Boniface e0bf7f7d1a Fix bad ID values in acknowledge 2023-12-06 14:18:31 -05:00
Joshua Boniface 20acf3295f Add mass ack/delete of faults 2023-12-06 13:59:39 -05:00
Joshua Boniface d1e34e7333 Store fault times only to the second
Any more precision is unnecessary, and dropping it saves 6 chars when
displaying these times elsewhere.
2023-12-06 13:20:18 -05:00
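
A minimal sketch of truncating a timestamp to whole seconds with the standard library; the helper name `to_second` is an assumption, and the project may store its times in a different form.

```python
from datetime import datetime


def to_second(ts: datetime) -> datetime:
    """Drop the sub-second part of a timestamp."""
    return ts.replace(microsecond=0)


if __name__ == "__main__":
    now = datetime.now()
    print(str(now))             # includes fractional seconds when non-zero
    print(str(to_second(now)))  # whole seconds only
```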
Joshua Boniface 79eb54d5da Move fault generation to common library 2023-12-06 13:17:10 -05:00
Joshua Boniface 2267a9c85d Improve output formatting for simplicity 2023-12-05 10:37:35 -05:00
Joshua Boniface 672e58133f Implement interfaces to faults 2023-12-04 01:37:54 -05:00
Joshua Boniface 3dc48c1783 Lower default monitoring interval to 15s
Faults are also reported on the monitoring interval, so 60s seems like
too long. Lower this to 15 seconds by default instead.
2023-12-01 17:38:28 -05:00
Joshua Boniface 9c2b1b29ee Add node health to fault states
Adjusts ordering and ensures that node health states are included in
faults if they are less than 50%.

Also adjusts fault ID generation and runs fault checks only on
coordinator nodes to avoid too many runs.
2023-12-01 17:38:28 -05:00
Joshua Boniface 8594eb697f Add initial fault generation in pvchealthd
References: #164
2023-12-01 17:38:27 -05:00
Joshua Boniface 7cb9ebae6b Remove legacy configuration handler
This is not going to be needed.
2023-12-01 01:25:40 -05:00
Joshua Boniface 102c3c3106 Port all Celery worker functions to discrete pkg
Moves all tasks run by the Celery worker into a discrete package/module
for easier installation. Also adjusts several parameters throughout to
accomplish this.
2023-11-30 02:24:54 -05:00
Joshua Boniface 03a738f878 Move config parser into daemon_lib
And reformat/add config values for API.
2023-11-30 00:05:37 -05:00
Joshua Boniface 11db3c5b20 Fix ordering during termination 2023-11-29 21:21:51 -05:00
Joshua Boniface fa12a3c9b1 Permit buffered log appending 2023-11-29 21:21:51 -05:00
Joshua Boniface 787f4216b3 Expand Zookeeper log daemon prefix to match 2023-11-29 21:21:51 -05:00
Joshua Boniface 83ceb41138 Add daemon name to Logger entries 2023-11-29 15:18:37 -05:00
Joshua Boniface 2e5958640a Remove erroneous time from message 2023-11-29 15:12:41 -05:00
Joshua Boniface 7abc697c8a Improve Zookeeper log handling
Ensures that messages are fully read before each append. Adds more
Zookeeper hits, but ensures logs won't be overwritten by multiple
daemons.

Also don't use a set on the client side, to avoid "removing duplicate"
entries erroneously.
2023-11-29 15:12:41 -05:00
Joshua Boniface dd6a38d5ea Properly pass the name of the exception 2023-11-16 18:05:52 -05:00
Joshua Boniface f50f170d4e Convert vmbuilder to use new Celery step structure 2023-11-16 16:08:49 -05:00
Joshua Boniface 83c4c6633d Readd RBD lock detection and clearing on startup
This is still needed due to the nature of the locks and freeing them on
startup, and to preserve lock=fail behaviour on VM startup.

Also fixes the fencing lock flush to directly use the client library
outside of Celery. I don't like this hack but it seems prudent until we
move fencing to the workers as well.
2023-11-10 01:33:48 -05:00
Joshua Boniface b522306f87 Increase Celery wait times
It's a bit inefficient, but provides nicer output and a bit of settling
time between each stage.
2023-11-09 23:54:05 -05:00
Joshua Boniface 07026efb63 Ensure OSD checks in before completing
Avoids issues where the new OSD doesn't check in; at least the
administrator will know.

Also fixes some issues with osd_db in removal.
2023-11-09 23:51:05 -05:00
Joshua Boniface 08411708f6 Clean up dangling references to cmd pipes
Also removes the schema references for these CMD pipes as they are no
longer required.
2023-11-09 23:28:14 -05:00
Joshua Boniface ce17c60a20 Port OSD on-node tasks to Celery worker system
Adds Celery versions of the osd_add, osd_replace, osd_refresh,
osd_remove, and osd_db_vg_add functions.
2023-11-09 23:28:08 -05:00
Joshua Boniface 89681d54b9 Port VM on-node tasks to Celery worker system
Adds Celery versions of the flush_locks, device_attach, and
device_detach functions.
2023-11-06 20:40:46 -05:00
Joshua Boniface a016337f57 Remove block verify in API
This doesn't work right and is handled by the node anyway.
2023-11-04 02:45:10 -04:00
Joshua Boniface 7f5dd385b5 Use right key for FSID elsewhere 2023-11-03 23:51:01 -04:00
Joshua Boniface ec42b19d0e Send FSID to clients too 2023-11-03 16:37:55 -04:00
Joshua Boniface 64e37ae963 Update OSD replacement functionality
1. Simplify this by leveraging the existing remove_osd/add_osd
functions, since its task was functionally identical to those two in
sequential order.
2. Add support for split OSDs within the command (replacing all OSDs on
the block device(s) as required).
3. Add additional configurability and flexibility around the old device,
weight, and external DB LVs.
2023-11-03 01:45:49 -04:00
Joshua Boniface 980ea6a9e9 Adjust handling of ext_db and _count options
Avoid the use of superfluous flag options, default them to none, and add
support for fixed-size DB LVs.
2023-11-02 13:29:47 -04:00
Joshua Boniface 526a5f4a74 Add support for split OSD adds
Allows creating multiple OSDs on a single (NVMe) block device,
leveraging the "ceph-volume lvm batch" command. Replaces the previous
method of creating OSDs.

Also adds a new ZK item for each OSD indicating if it is split or not.
2023-11-01 21:31:35 -04:00
Joshua Boniface 35f80e544c Use more hierarchical backup path structure 2023-10-24 02:04:16 -04:00
Joshua Boniface 83b937654c Avoid removing nonexistent snapshots
Store retain_snapshot in JSON and use that to check during delete.
2023-10-24 01:35:00 -04:00
Joshua Boniface 714bde89e6 Fix incorrect variable ref 2023-10-24 01:25:01 -04:00
Joshua Boniface c87736eb0a Use consistent path name and format 2023-10-24 01:20:44 -04:00
Joshua Boniface 63d0a85e29 Add backup deletion command 2023-10-24 01:18:27 -04:00
Joshua Boniface 55ca131c2c Handle snapshots on restore and provide options
Also rename the retain option to remove superfluous plural.
2023-10-24 00:25:06 -04:00
Joshua Boniface 8d256a1737 Complete VM restore functionality 2023-10-23 22:23:17 -04:00
Joshua Boniface d3b3fdfc80 Revert "Export backup images to a tar archive"
This reverts commit 38abd078af.
2023-10-23 11:01:16 -04:00
Joshua Boniface f1b29ea94e Initial VM restore work 2023-10-23 11:00:54 -04:00
Joshua Boniface 38abd078af Export backup images to a tar archive
This helps ensure an easier restore as the tar archive(s) can be sent
directly to the API via the normal process of image uploading, instead
of individual disks.
2023-10-23 09:56:50 -04:00
Joshua Boniface fabb97cf48 Only split a command_string if it's not a list 2023-10-23 09:50:58 -04:00
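
The guard named in this commit title can be illustrated as below. The function name `normalize_command` and the use of `shlex.split` are assumptions for the sketch; the actual code may split differently.

```python
import shlex


def normalize_command(command_string):
    """Return an argument list, splitting only when given a plain string."""
    if isinstance(command_string, list):
        # Already tokenized: splitting again would break arguments containing spaces.
        return command_string
    return shlex.split(command_string)


if __name__ == "__main__":
    print(normalize_command("rbd ls mypool"))
    print(normalize_command(["echo", "two words"]))
```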
Joshua Boniface 68124db323 Remove extra spaces 2023-10-17 13:01:38 -04:00
Joshua Boniface 8921efd269 Fix incorrect tuple construct 2023-10-17 12:55:44 -04:00
Joshua Boniface 3d12915989 Further improve return messages 2023-10-17 12:53:08 -04:00
Joshua Boniface 67b0b19bca Use better time functionality 2023-10-17 12:39:37 -04:00
Joshua Boniface 5d0c674d1d Add runtime and adjust ordering 2023-10-17 12:32:40 -04:00
Joshua Boniface f441b0d823 Improve missing parent message 2023-10-17 12:17:29 -04:00
Joshua Boniface a5d0f219e4 Improve return messages 2023-10-17 12:10:55 -04:00
Joshua Boniface 0169510df0 Fix up datestring generation 2023-10-17 12:05:45 -04:00
Joshua Boniface a58c1d5a8c Fix bad snapshot removals 2023-10-17 12:02:24 -04:00
Joshua Boniface a8e4b01b67 Handle return data even better 2023-10-17 11:51:03 -04:00
Joshua Boniface 45c4c86911 Handle extra return variable 2023-10-17 11:47:01 -04:00
Joshua Boniface 6448b31d2c Improve VM list arguments
Use kwargs here instead of fixed args to allow default None values.
2023-10-17 11:01:38 -04:00
Joshua Boniface b997c6f31e Add support for full VM backups
Adds support for exporting full VM backups, including configuration,
metainfo, and RBD disk images, with incremental support.
2023-10-17 10:15:06 -04:00
Joshua Boniface a0b45a2bcd Always create RBDs with bytes value
Converting into human results in imprecise values when specifying bytes
directly, which in turn breaks VMDK image uploads. Instead, just use the
raw bytes value when creating the volume instead of converting it back.
2023-09-30 12:37:43 -04:00
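
The precision loss described above can be reproduced with a toy round trip. The two helpers below are hypothetical stand-ins written for this example, not the project's conversion functions; they only show why a byte count that passes through a rounded human-readable form may not come back unchanged.

```python
UNITS = ["B", "K", "M", "G", "T"]


def to_human(size_bytes: int) -> str:
    """Lossy conversion: floor-divide by 1024 until the value is below 1024."""
    value, unit = size_bytes, 0
    while value >= 1024 and unit < len(UNITS) - 1:
        value //= 1024
        unit += 1
    return f"{value}{UNITS[unit]}"


def from_human(text: str) -> int:
    """Inverse of to_human for strings such as '9G'."""
    return int(text[:-1]) * 1024 ** UNITS.index(text[-1])


if __name__ == "__main__":
    requested = 10_000_000_000
    round_tripped = from_human(to_human(requested))
    print(requested, to_human(requested), round_tripped)
```

A volume created from the round-tripped value would be smaller than the uploaded image expects, which is the kind of mismatch that passing the raw byte count avoids.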
Joshua Boniface c4397219da Ensure fencing states are properly reflected 2023-09-18 09:59:18 -04:00
Joshua Boniface 311bb69785 Format based on updated Black 2023-09-12 16:41:02 -04:00
Joshua Boniface 653b95ee25 Normalize return messages for node commands 2023-05-04 17:02:46 -04:00
Joshua Boniface 78322f4de4 Improve size handling during volume add/resize 2023-04-28 12:16:16 -04:00
Joshua Boniface c1782c5004 Add full/nearfull OSD health detection 2023-04-28 11:33:39 -04:00
Joshua Boniface e773211293 Add PVC version to cluster status output 2023-02-22 16:09:24 -05:00
Joshua Boniface 70ba364f1d Flip VM state condition to remove shutdown
Don't cause health degradation for shutdown state, and flip the list
around to make it clearer.
2023-02-16 20:32:33 -05:00
Joshua Boniface 1f8561d59a Format cluster health like node healths
Make a cleaner construct here.
2023-02-16 12:33:36 -05:00
Joshua Boniface 1093ca6264 Disallow health less than 0 2023-02-15 16:50:24 -05:00
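
A floor at zero is commonly done with a clamp, as in the sketch below. The idea that health starts at 100 and is reduced by per-fault deltas is an assumption for illustration, as is the function name `cluster_health`.

```python
def cluster_health(deltas):
    """Subtract each health delta from 100, never returning less than 0."""
    health = 100 - sum(deltas)
    return max(0, health)


if __name__ == "__main__":
    print(cluster_health([10, 25]))
    print(cluster_health([50, 40, 30]))
```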
Joshua Boniface 29584e5636 Add per-node health entries for 3rd party checks 2023-02-15 16:44:49 -05:00
Joshua Boniface f4e8449356 Fix bugs and formatting of health messages 2023-02-15 16:28:56 -05:00
Joshua Boniface ec79acf061 Fix linting of cluster.py file 2023-02-15 15:48:31 -05:00
Joshua Boniface 00586074cf Modify cluster health to use new values 2023-02-15 15:45:43 -05:00
Joshua Boniface f4eef30770 Add JSON health to cluster data 2023-02-15 15:26:57 -05:00
Joshua Boniface b07396c39a Fix bugs if plugins fail to load 2023-02-13 21:51:48 -05:00
Joshua Boniface e6f9e6e0e8 Fix several bugs and optimize output 2023-02-13 16:36:15 -05:00
Joshua Boniface 9c14d84bfc Add node health value and send out API 2023-02-13 15:53:39 -05:00
Joshua Boniface 3c742a827b Initial implementation of monitoring plugin system 2023-02-13 12:06:26 -05:00
Joshua Boniface 671a907236 Allow rename in disable state 2023-01-30 11:48:43 -05:00
Joshua Boniface 38d63d9837 Flip behaviour of memory selectors
It didn't make any sense to me for mem(prov) to be the default selector,
since this has too many caveats versus mem(free). Switch to using
mem(free) as the default (i.e. "mem") and make memprov the alternative.
2022-11-15 15:45:59 -05:00
Joshua Boniface 79eb994a5e Ensure equality of none and None for selector 2022-11-07 11:59:53 -05:00
Joshua Boniface 8af7189dd0 Add module tag for daemon lib 2022-11-04 03:47:18 -04:00
Joshua Boniface 726d0a562b Update copyright header year 2022-10-06 11:55:27 -04:00
Joshua Boniface 881550b610 Actually fix VM sorting
Due to the executor the previous attempt did not work.
2022-08-12 17:46:29 -04:00
Joshua Boniface bcabd7d079 Always sort VM list
Same justification as previous commit.
2022-08-09 12:05:40 -04:00
Joshua Boniface 05a316cdd6 Ensure the node list is sorted
Otherwise the node entries could come back in an arbitrary order. Since
this is an ordered list of dictionaries, API consumers might not expect
that, so ensure it's always sorted.
2022-08-09 12:03:49 -04:00
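
Sorting a list of dictionaries by one field is a one-liner in Python. The sketch assumes each node entry carries a `name` key, which is a guess about the data shape; `sort_nodes` is a hypothetical helper.

```python
def sort_nodes(node_list):
    """Return node entries ordered by name so API output is deterministic."""
    return sorted(node_list, key=lambda node: node["name"])


if __name__ == "__main__":
    nodes = [{"name": "hv3"}, {"name": "hv1"}, {"name": "hv2"}]
    print([n["name"] for n in sort_nodes(nodes)])
```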
Joshua Boniface d8d3feee22 Add selector help and adjust flag name
1. Add documentation on the node selector flags. In the API, reference
the daemon configuration manual which now includes details in this
section; in the CLI, provide the help in "pvc vm define" in detail and
then reference that command's help in the other commands that use this
field.

2. Ensure the naming is consistent in the CLI, using the flag name
"--node-selector" everywhere (was "--selector" for "pvc vm" commands and
"--node-selector" for "pvc provisioner" commands).
2022-06-10 02:42:06 -04:00
Joshua Boniface f8cdcb30ba Add migration selector via free memory
Closes #152
2022-05-18 03:47:16 -04:00
Joshua Boniface c401a1f655 Use consistent language for primary mode
I didn't call it "router" anywhere else, but the state in the list is
called "coordinator", so call it "coordinator mode".
2022-05-06 15:40:52 -04:00
Joshua Boniface 7a40c7a55b Add support for replacing/refreshing OSDs
Adds commands to both replace an OSD disk, and refresh (reimport) an
existing OSD disk on a new node. This handles the cases where an OSD
disk should be replaced (either due to upgrades or failures) or where a
node is rebuilt in-place and an existing OSD must be re-imported to it.

This should avoid the need to do a full remove/add sequence for either
case.

Also cleans up some aspects of OSD removal that are identical between
methods (e.g. using safe-to-destroy and sleeping after stopping) and
fixes a bug if an OSD does not truly exist when the daemon starts up.
2022-05-06 15:32:06 -04:00
Joshua Boniface 464f0e0356 Store additional OSD information in ZK
Ensures that information like the FSIDs and the OSD LVM volume are
stored in Zookeeper at creation time and updated at daemon start time
(to ensure the data is populated at least once, or if the /dev/sdX
path changes).

This will allow safer operation of OSD removals and the potential
implementation of re-activation after node replacements.
2022-05-02 12:11:39 -04:00
Joshua Boniface d6ca74376a Fix bugs with forced removal 2022-04-29 14:03:07 -04:00