Commit Graph

831 Commits

Author SHA1 Message Date
Joshua Boniface 5f1432ccdd Fix memory allocation updates and add more debug
Previously, we were assigning memalloc/memprov/vcpualloc during an
earlier phase using the main d_domain list. I'm not sure exactly why,
but this was throwing off stats after a fence. Instead, set these values
later on while parsing the actually-active VMs.
2023-11-10 10:29:32 -05:00
Joshua Boniface d6b8808448 Clean up fencing handler
1. Remove all format strings in favour of f-strings
2. Ensure all logger messages have a prefix
3. Add a few more logger messages for clarity
2023-11-10 10:09:54 -05:00
Joshua Boniface 83c4c6633d Readd RBD lock detection and clearing on startup
This is still needed due to the nature of the locks and freeing them on
startup, and to preserve lock=fail behaviour on VM startup.

Also fixes the fencing lock flush to directly use the client library
outside of Celery. I don't like this hack but it seems prudent until we
move fencing to the workers as well.
2023-11-10 01:33:48 -05:00
Joshua Boniface 2a9bc632fa Add node monitoring plugin for KeyDB/Redis 2023-11-10 00:56:46 -05:00
Joshua Boniface 08411708f6 Clean up dangling references to cmd pipes
Also removes the schema references for these CMD pipes as they are no
longer required.
2023-11-09 23:28:14 -05:00
Joshua Boniface ce17c60a20 Port OSD on-node tasks to Celery worker system
Adds Celery versions of the osd_add, osd_replace, osd_refresh,
osd_remove, and osd_db_vg_add functions.
2023-11-09 23:28:08 -05:00
Joshua Boniface 89681d54b9 Port VM on-node tasks to Celery worker system
Adds Celery versions of the flush_locks, device_attach, and
device_detach functions.
2023-11-06 20:40:46 -05:00
Joshua Boniface f0c2e9d295 Don't start pvcapid-worker on primary
It will be running anyways
2023-11-05 19:44:00 -05:00
Joshua Boniface 2c15036f86 Add KeyDB to node startup services
Also ensure API worker starts on all nodes, not just coordinators.
2023-11-05 19:26:38 -05:00
Joshua Boniface 30d7e49401 Start API worker with node daemon on coordinators 2023-11-04 13:08:16 -04:00
Joshua Boniface 7490f13b7c Check for partition tables on new devices 2023-11-04 03:13:58 -04:00
Joshua Boniface e32054be81 Refactor refresh as well 2023-11-04 02:44:52 -04:00
Joshua Boniface b3d13fe9be Add log message for zap 2023-11-04 01:02:51 -04:00
Joshua Boniface 48b2ccbd95 Add timeout for safe-to-destroy
Continuously take the OSD down and out while doing so.
2023-11-04 00:55:05 -04:00
Joshua Boniface 1535078842 Fix lvremove, lvcreate, and update ZK details 2023-11-04 00:30:14 -04:00
Joshua Boniface 0e45613634 Use right key with correct data 2023-11-04 00:02:00 -04:00
Joshua Boniface 7f5dd385b5 Use right key for FSID elsewhere 2023-11-03 23:51:01 -04:00
Joshua Boniface befce62925 Add OSD destroy before purge 2023-11-03 23:44:27 -04:00
Joshua Boniface b0909aed61 Get proper FSID value 2023-11-03 23:38:24 -04:00
Joshua Boniface f418b40527 Use proper FSID instead of hack 2023-11-03 16:38:19 -04:00
Joshua Boniface dd0177ce10 Rework replacement procedure again
Avoid calling other functions; replicate the actual process from Ceph
docs (https://docs.ceph.com/en/pacific/rados/operations/add-or-rm-osds/)
to ensure things work out well (e.g. preserving OSD IDs).
2023-11-03 16:31:56 -04:00
Joshua Boniface ed5bc9fb43 Fix numerous formatting and function bugs 2023-11-03 14:00:05 -04:00
Joshua Boniface 94d8d2cf75 Fix skip_zap_flag anomaly and add crush rm 2023-11-03 02:35:12 -04:00
Joshua Boniface 20497cf89d Fix bugs and skip safe_to_destroy on force 2023-11-03 02:29:50 -04:00
Joshua Boniface 64e37ae963 Update OSD replacement functionality
1. Simplify this by leveraging the existing remove_osd/add_osd
functions, since its task was functionally identical to those two in
sequential order.
2. Add support for split OSDs within the command (replacing all OSDs on
the block device(s) as required).
3. Add additional configurability and flexibility around the old device,
weight, and external DB LVs.
2023-11-03 01:45:49 -04:00
Joshua Boniface 3cb8a70f04 Add forcing to OSD purge 2023-11-02 23:20:48 -04:00
Joshua Boniface f53af510c1 Avoid startup failures if OSD removed 2023-11-02 22:24:39 -04:00
Joshua Boniface d5d783fad3 Set proper split flag 2023-11-02 22:20:22 -04:00
Joshua Boniface 980ea6a9e9 Adjust handling of ext_db and _count options
Avoid the use of superfluous flag options, default them to none, and add
support for fixed-size DB LVs.
2023-11-02 13:29:47 -04:00
Joshua Boniface 8780044be6 Ensure db_device is an empty string 2023-11-02 00:52:18 -04:00
Joshua Boniface f08c654f22 Fix missing fstring 2023-11-01 21:41:06 -04:00
Joshua Boniface 8b93f9a80e Handle OSD index errors during stats collection 2023-11-01 21:33:40 -04:00
Joshua Boniface 526a5f4a74 Add support for split OSD adds
Allows creating multiple OSDs on a single (NVMe) block device,
leveraging the "ceph-volume lvm batch" command. Replaces the previous
method of creating OSDs.

Also adds a new ZK item for each OSD indicating if it is split or not.
2023-11-01 21:31:35 -04:00
Joshua Boniface aa0b1f504f Fix output bug 2023-11-01 15:46:38 -04:00
Joshua Boniface 5b4dd61754 Bump version to 0.9.80 2023-10-27 09:56:31 -04:00
Joshua Boniface 221af3f241 Bump version to 0.9.79 2023-10-24 02:10:24 -04:00
Joshua Boniface 0769f1ea52 Increase service start time to 10s 2023-10-23 22:24:03 -04:00
Joshua Boniface 50aabde320 Ensure bond count is compared with actual qty 2023-10-22 02:28:04 -04:00
Joshua Boniface 6e83300d78 Increase ipmi plugin timeout 2023-10-04 19:21:59 -04:00
Joshua Boniface c6c44bf775 Bump version to 0.9.78 2023-09-30 12:57:55 -04:00
Joshua Boniface 7c0f12750e Bump version to 0.9.77 2023-09-19 11:05:55 -04:00
Joshua Boniface 51e78480fa Bump version to 0.9.76 2023-09-18 10:15:52 -04:00
Joshua Boniface f46bfc962f Bump version to 0.9.75 2023-09-16 23:06:38 -04:00
Joshua Boniface 714d4b6005 Revert float conversion of cpu_cores
Results in much uglier output, there are no decimal core counts.
2023-09-16 23:06:07 -04:00
Joshua Boniface fa8329ac3d Explicitly round load avg in load plugin 2023-09-16 22:58:49 -04:00
Joshua Boniface 457b7bed3d Handle exceptions in fence migrations 2023-09-16 22:56:09 -04:00
Joshua Boniface 86115b2928 Add startup message for IPMI reachability
It's good to know that this succeeded in addition to knowing if it
failed.
2023-09-16 22:41:58 -04:00
Joshua Boniface 1a906b589e Bump version to 0.9.74 2023-09-16 00:18:13 -04:00
Joshua Boniface 7b230d8bd5 Add monitoring plugin for hardware RAID arrays 2023-09-16 00:02:53 -04:00
Joshua Boniface 48662e90c1 Remove obsolete monitoring_instance passing 2023-09-15 22:47:45 -04:00
Joshua Boniface 079381c03e Move printing to end and add runtime 2023-09-15 22:40:09 -04:00
Joshua Boniface 794cea4a02 Reverse ordering, run checks before starting timer 2023-09-15 22:25:37 -04:00
Joshua Boniface fa24f3ba75 Fix bad fstring in psur check 2023-09-15 22:19:49 -04:00
Joshua Boniface caadafa80d Add PSU redundancy sensor check 2023-09-15 19:07:29 -04:00
Joshua Boniface 479e156234 Run monitoring plugins once on startup 2023-09-15 17:53:16 -04:00
Joshua Boniface 86830286f3 Adjust message printing to be on one line 2023-09-15 17:00:34 -04:00
Joshua Boniface 4d51318a40 Make monitoring interval configurable 2023-09-15 16:54:51 -04:00
Joshua Boniface cba6f5be48 Fix wording of non-coordinator state 2023-09-15 16:51:04 -04:00
Joshua Boniface 254303b9d4 Use coordinator_state instead of router_state
Makes it much clearer what this variable represents.
2023-09-15 16:47:56 -04:00
Joshua Boniface 40b7d68853 Separate monitoring and move to 60s interval
Removes the dependency of the monitoring subsystem from the node
keepalives, and runs them at a 60s interval to avoid excessive backups
if a plugin takes too long.

Adds its own logs and related items as required.

Finally adds a new required argument to the run() of plugins, the
coordinator state, which can be used by a plugin to determine actions
based on whether the node is a primary, secondary, or non-coordinator.
2023-09-15 16:47:11 -04:00
Joshua Boniface a8115cafd1 Bump version to 0.9.73 2023-09-02 02:16:19 -04:00
Joshua Boniface 570da99605 Avoid failures if no children found 2023-09-02 01:36:17 -04:00
Joshua Boniface fdda47e8a2 Bump version to 0.9.72 2023-09-01 16:34:45 -04:00
Joshua Boniface bb2aac145d Bump version to 0.9.71 2023-09-01 00:36:38 -04:00
Joshua Boniface 6c407d54c3 Bump version to 0.9.70 2023-08-31 14:15:54 -04:00
Joshua Boniface cb413e5ce6 [Bookworm] Fix Ceph 16 OSD stat parsing 2023-08-31 00:45:03 -04:00
Joshua Boniface 123499f75f [Bookworm] Specify YAML loader explicitly 2023-08-31 00:16:19 -04:00
Joshua Boniface 83b8ce7b62 Bump version to 0.9.69 (nice) 2023-08-29 22:02:13 -04:00
Joshua Boniface 5e43f9bd7c Ensure Patroni failures do not block takeover 2023-08-29 22:00:11 -04:00
Joshua Boniface ed087d83c2 Found cpuload to 2 decimal places 2023-08-29 21:41:44 -04:00
Joshua Boniface 83d475bd15 Bump version to 0.9.68 2023-08-27 20:59:23 -04:00
Joshua Boniface 705ec802a3 Bump version to 0.9.67 2023-08-27 14:47:20 -04:00
Joshua Boniface 0b90f37518 Bump version to 0.9.66 2023-08-27 11:41:22 -04:00
Joshua Boniface 1e083d7652 Bump version to 0.9.65 2023-08-23 01:56:57 -04:00
Joshua Boniface 075dbe7cc9 Bump version to 0.9.64 2023-08-18 12:34:27 -04:00
Joshua Boniface b5f996febd Fix bugs for node flush for stop/shutdown/restart
Previously VMs in stop/shutdown/restart states wouldn't be properly
handled during a node flush. This fixes the bugs and ensures that the
transient VM states (shutdown/restart) are completed before proceeding,
and then avoids setting a stopped/shutdown VM to shutdown/auotstart.
2023-08-18 11:25:59 -04:00
Joshua Boniface 3a90fda109 Bump version to 0.9.63 2023-04-28 14:47:04 -04:00
Joshua Boniface 9114255af5 Add *.update-* obsolete configs to dpkg plugin 2023-04-10 15:39:40 -04:00
Joshua Boniface 2c3a3cdf52 Use try when watching health value in NodeInstance 2023-03-07 09:53:01 -05:00
Joshua Boniface 0b583bfdaf Bump IPMI timeout to 2 seconds 2023-03-07 09:25:27 -05:00
Joshua Boniface 7c07fbefff Adjust keepalive health printing and ordering 2023-02-24 11:08:30 -05:00
Joshua Boniface 202dc3ed59 Correct error handling if monitoring plugins fail 2023-02-24 10:19:41 -05:00
Joshua Boniface 4c2d99f8a6 Fix bug with SMART info 2023-02-23 13:21:23 -05:00
Joshua Boniface bcff6650d0 Set timeout on IPMI command 2023-02-23 11:10:09 -05:00
Joshua Boniface a11206253d Fix ZK check location 2023-02-23 11:04:02 -05:00
Joshua Boniface 45ad3b9a17 Bump version to 0.9.62 2023-02-22 18:13:45 -05:00
Joshua Boniface dc4e56db4b Add IPMI monitoring check 2023-02-22 15:02:08 -05:00
Joshua Boniface e45b3108a2 Add health delta change to message output 2023-02-22 15:02:08 -05:00
Joshua Boniface 118237a53b Fix bad string value for message 2023-02-22 15:02:08 -05:00
Joshua Boniface 9805681f94 Use consistent connection with other checks 2023-02-22 15:02:08 -05:00
Joshua Boniface 6c9abb2abe Add Libvirtd monitoring check 2023-02-22 15:02:08 -05:00
Joshua Boniface a1122c6e71 Add Zookeeper monitoring check 2023-02-22 15:02:08 -05:00
Joshua Boniface 3696f81597 Add PostgreSQL monitoring check 2023-02-22 15:02:08 -05:00
Joshua Boniface 5ca0d903b6 Adjust comment message 2023-02-22 15:02:08 -05:00
Joshua Boniface 626424b74a Adjust Munin threshold values 2023-02-22 10:42:43 -05:00
Joshua Boniface c9ceb3159b Remove obsolete LINKSPEED variable 2023-02-22 01:04:25 -05:00
Joshua Boniface 6525a2568b Adjust health delta of load to 50
This is a very bad situation and should be critical.
2023-02-22 01:03:12 -05:00
Joshua Boniface 09a005d3d7 Adjust health delta of EDAC Uncorrected to 50
This is a very bad situation and should be critical.
2023-02-22 01:01:54 -05:00
Joshua Boniface fb0fcc0597 Update readme for Munin plugin 2023-02-18 00:00:04 -05:00
Joshua Boniface 3009f24910 Fix typo in var and flip conditional 2023-02-17 16:18:42 -05:00