Joshua Boniface
d6b8808448
Clean up fencing handler
...
1. Remove all format strings in favour of f-strings
2. Ensure all logger messages have a prefix
3. Add a few more logger messages for clarity
2023-11-10 10:09:54 -05:00
Joshua Boniface
83c4c6633d
Re-add RBD lock detection and clearing on startup
...
This is still needed due to the nature of the locks and freeing them on
startup, and to preserve lock=fail behaviour on VM startup.
Also fixes the fencing lock flush to directly use the client library
outside of Celery. I don't like this hack but it seems prudent until we
move fencing to the workers as well.
2023-11-10 01:33:48 -05:00
Joshua Boniface
2a9bc632fa
Add node monitoring plugin for KeyDB/Redis
2023-11-10 00:56:46 -05:00
Joshua Boniface
08411708f6
Clean up dangling references to cmd pipes
...
Also removes the schema references for these CMD pipes as they are no
longer required.
2023-11-09 23:28:14 -05:00
Joshua Boniface
ce17c60a20
Port OSD on-node tasks to Celery worker system
...
Adds Celery versions of the osd_add, osd_replace, osd_refresh,
osd_remove, and osd_db_vg_add functions.
2023-11-09 23:28:08 -05:00
Joshua Boniface
89681d54b9
Port VM on-node tasks to Celery worker system
...
Adds Celery versions of the flush_locks, device_attach, and
device_detach functions.
2023-11-06 20:40:46 -05:00
Joshua Boniface
f0c2e9d295
Don't start pvcapid-worker on primary
...
It will already be running there anyway.
2023-11-05 19:44:00 -05:00
Joshua Boniface
2c15036f86
Add KeyDB to node startup services
...
Also ensure API worker starts on all nodes, not just coordinators.
2023-11-05 19:26:38 -05:00
Joshua Boniface
30d7e49401
Start API worker with node daemon on coordinators
2023-11-04 13:08:16 -04:00
Joshua Boniface
7490f13b7c
Check for partition tables on new devices
2023-11-04 03:13:58 -04:00
Joshua Boniface
e32054be81
Refactor refresh as well
2023-11-04 02:44:52 -04:00
Joshua Boniface
b3d13fe9be
Add log message for zap
2023-11-04 01:02:51 -04:00
Joshua Boniface
48b2ccbd95
Add timeout for safe-to-destroy
...
Continuously take the OSD down and out while doing so.
2023-11-04 00:55:05 -04:00
Joshua Boniface
1535078842
Fix lvremove, lvcreate, and update ZK details
2023-11-04 00:30:14 -04:00
Joshua Boniface
0e45613634
Use right key with correct data
2023-11-04 00:02:00 -04:00
Joshua Boniface
7f5dd385b5
Use right key for FSID elsewhere
2023-11-03 23:51:01 -04:00
Joshua Boniface
befce62925
Add OSD destroy before purge
2023-11-03 23:44:27 -04:00
Joshua Boniface
b0909aed61
Get proper FSID value
2023-11-03 23:38:24 -04:00
Joshua Boniface
f418b40527
Use proper FSID instead of hack
2023-11-03 16:38:19 -04:00
Joshua Boniface
dd0177ce10
Rework replacement procedure again
...
Avoid calling other functions; replicate the actual process from Ceph
docs (https://docs.ceph.com/en/pacific/rados/operations/add-or-rm-osds/)
to ensure things work out well (e.g. preserving OSD IDs).
2023-11-03 16:31:56 -04:00
Joshua Boniface
ed5bc9fb43
Fix numerous formatting and function bugs
2023-11-03 14:00:05 -04:00
Joshua Boniface
94d8d2cf75
Fix skip_zap_flag anomaly and add crush rm
2023-11-03 02:35:12 -04:00
Joshua Boniface
20497cf89d
Fix bugs and skip safe_to_destroy on force
2023-11-03 02:29:50 -04:00
Joshua Boniface
64e37ae963
Update OSD replacement functionality
...
1. Simplify this by leveraging the existing remove_osd/add_osd
functions, since its task was functionally identical to those two in
sequential order.
2. Add support for split OSDs within the command (replacing all OSDs on
the block device(s) as required).
3. Add additional configurability and flexibility around the old device,
weight, and external DB LVs.
2023-11-03 01:45:49 -04:00
Joshua Boniface
3cb8a70f04
Add forcing to OSD purge
2023-11-02 23:20:48 -04:00
Joshua Boniface
f53af510c1
Avoid startup failures if OSD removed
2023-11-02 22:24:39 -04:00
Joshua Boniface
d5d783fad3
Set proper split flag
2023-11-02 22:20:22 -04:00
Joshua Boniface
980ea6a9e9
Adjust handling of ext_db and _count options
...
Avoid the use of superfluous flag options, default them to none, and add
support for fixed-size DB LVs.
2023-11-02 13:29:47 -04:00
Joshua Boniface
8780044be6
Ensure db_device is an empty string
2023-11-02 00:52:18 -04:00
Joshua Boniface
f08c654f22
Fix missing fstring
2023-11-01 21:41:06 -04:00
Joshua Boniface
8b93f9a80e
Handle OSD index errors during stats collection
2023-11-01 21:33:40 -04:00
Joshua Boniface
526a5f4a74
Add support for split OSD adds
...
Allows creating multiple OSDs on a single (NVMe) block device,
leveraging the "ceph-volume lvm batch" command. Replaces the previous
method of creating OSDs.
Also adds a new ZK item for each OSD indicating if it is split or not.
2023-11-01 21:31:35 -04:00
Joshua Boniface
aa0b1f504f
Fix output bug
2023-11-01 15:46:38 -04:00
Joshua Boniface
5b4dd61754
Bump version to 0.9.80
2023-10-27 09:56:31 -04:00
Joshua Boniface
221af3f241
Bump version to 0.9.79
2023-10-24 02:10:24 -04:00
Joshua Boniface
0769f1ea52
Increase service start time to 10s
2023-10-23 22:24:03 -04:00
Joshua Boniface
50aabde320
Ensure bond count is compared with actual qty
2023-10-22 02:28:04 -04:00
Joshua Boniface
6e83300d78
Increase ipmi plugin timeout
2023-10-04 19:21:59 -04:00
Joshua Boniface
c6c44bf775
Bump version to 0.9.78
2023-09-30 12:57:55 -04:00
Joshua Boniface
7c0f12750e
Bump version to 0.9.77
2023-09-19 11:05:55 -04:00
Joshua Boniface
51e78480fa
Bump version to 0.9.76
2023-09-18 10:15:52 -04:00
Joshua Boniface
f46bfc962f
Bump version to 0.9.75
2023-09-16 23:06:38 -04:00
Joshua Boniface
714d4b6005
Revert float conversion of cpu_cores
...
Results in much uglier output, and there are no decimal core counts.
2023-09-16 23:06:07 -04:00
Joshua Boniface
fa8329ac3d
Explicitly round load avg in load plugin
2023-09-16 22:58:49 -04:00
Joshua Boniface
457b7bed3d
Handle exceptions in fence migrations
2023-09-16 22:56:09 -04:00
Joshua Boniface
86115b2928
Add startup message for IPMI reachability
...
It's good to know that this succeeded in addition to knowing if it
failed.
2023-09-16 22:41:58 -04:00
Joshua Boniface
1a906b589e
Bump version to 0.9.74
2023-09-16 00:18:13 -04:00
Joshua Boniface
7b230d8bd5
Add monitoring plugin for hardware RAID arrays
2023-09-16 00:02:53 -04:00
Joshua Boniface
48662e90c1
Remove obsolete monitoring_instance passing
2023-09-15 22:47:45 -04:00
Joshua Boniface
079381c03e
Move printing to end and add runtime
2023-09-15 22:40:09 -04:00
Joshua Boniface
794cea4a02
Reverse ordering, run checks before starting timer
2023-09-15 22:25:37 -04:00
Joshua Boniface
fa24f3ba75
Fix bad fstring in psur check
2023-09-15 22:19:49 -04:00
Joshua Boniface
caadafa80d
Add PSU redundancy sensor check
2023-09-15 19:07:29 -04:00
Joshua Boniface
479e156234
Run monitoring plugins once on startup
2023-09-15 17:53:16 -04:00
Joshua Boniface
86830286f3
Adjust message printing to be on one line
2023-09-15 17:00:34 -04:00
Joshua Boniface
4d51318a40
Make monitoring interval configurable
2023-09-15 16:54:51 -04:00
Joshua Boniface
cba6f5be48
Fix wording of non-coordinator state
2023-09-15 16:51:04 -04:00
Joshua Boniface
254303b9d4
Use coordinator_state instead of router_state
...
Makes it much clearer what this variable represents.
2023-09-15 16:47:56 -04:00
Joshua Boniface
40b7d68853
Separate monitoring and move to 60s interval
...
Removes the dependency of the monitoring subsystem on the node
keepalives, and runs the plugins at a 60s interval to avoid excessive
backups if a plugin takes too long.
Adds its own logs and related items as required.
Finally adds a new required argument to the run() of plugins, the
coordinator state, which can be used by a plugin to determine actions
based on whether the node is a primary, secondary, or non-coordinator.
2023-09-15 16:47:11 -04:00
Joshua Boniface
a8115cafd1
Bump version to 0.9.73
2023-09-02 02:16:19 -04:00
Joshua Boniface
570da99605
Avoid failures if no children found
2023-09-02 01:36:17 -04:00
Joshua Boniface
fdda47e8a2
Bump version to 0.9.72
2023-09-01 16:34:45 -04:00
Joshua Boniface
bb2aac145d
Bump version to 0.9.71
2023-09-01 00:36:38 -04:00
Joshua Boniface
6c407d54c3
Bump version to 0.9.70
2023-08-31 14:15:54 -04:00
Joshua Boniface
cb413e5ce6
[Bookworm] Fix Ceph 16 OSD stat parsing
2023-08-31 00:45:03 -04:00
Joshua Boniface
123499f75f
[Bookworm] Specify YAML loader explicitly
2023-08-31 00:16:19 -04:00
Joshua Boniface
83b8ce7b62
Bump version to 0.9.69 (nice)
2023-08-29 22:02:13 -04:00
Joshua Boniface
5e43f9bd7c
Ensure Patroni failures do not block takeover
2023-08-29 22:00:11 -04:00
Joshua Boniface
ed087d83c2
Round cpuload to 2 decimal places
2023-08-29 21:41:44 -04:00
Joshua Boniface
83d475bd15
Bump version to 0.9.68
2023-08-27 20:59:23 -04:00
Joshua Boniface
705ec802a3
Bump version to 0.9.67
2023-08-27 14:47:20 -04:00
Joshua Boniface
0b90f37518
Bump version to 0.9.66
2023-08-27 11:41:22 -04:00
Joshua Boniface
1e083d7652
Bump version to 0.9.65
2023-08-23 01:56:57 -04:00
Joshua Boniface
075dbe7cc9
Bump version to 0.9.64
2023-08-18 12:34:27 -04:00
Joshua Boniface
b5f996febd
Fix bugs for node flush for stop/shutdown/restart
...
Previously VMs in stop/shutdown/restart states wouldn't be properly
handled during a node flush. This fixes the bugs and ensures that the
transient VM states (shutdown/restart) are completed before proceeding,
and then avoids setting a stopped/shutdown VM to shutdown/autostart.
2023-08-18 11:25:59 -04:00
Joshua Boniface
3a90fda109
Bump version to 0.9.63
2023-04-28 14:47:04 -04:00
Joshua Boniface
9114255af5
Add *.update-* obsolete configs to dpkg plugin
2023-04-10 15:39:40 -04:00
Joshua Boniface
2c3a3cdf52
Use try when watching health value in NodeInstance
2023-03-07 09:53:01 -05:00
Joshua Boniface
0b583bfdaf
Bump IPMI timeout to 2 seconds
2023-03-07 09:25:27 -05:00
Joshua Boniface
7c07fbefff
Adjust keepalive health printing and ordering
2023-02-24 11:08:30 -05:00
Joshua Boniface
202dc3ed59
Correct error handling if monitoring plugins fail
2023-02-24 10:19:41 -05:00
Joshua Boniface
4c2d99f8a6
Fix bug with SMART info
2023-02-23 13:21:23 -05:00
Joshua Boniface
bcff6650d0
Set timeout on IPMI command
2023-02-23 11:10:09 -05:00
Joshua Boniface
a11206253d
Fix ZK check location
2023-02-23 11:04:02 -05:00
Joshua Boniface
45ad3b9a17
Bump version to 0.9.62
2023-02-22 18:13:45 -05:00
Joshua Boniface
dc4e56db4b
Add IPMI monitoring check
2023-02-22 15:02:08 -05:00
Joshua Boniface
e45b3108a2
Add health delta change to message output
2023-02-22 15:02:08 -05:00
Joshua Boniface
118237a53b
Fix bad string value for message
2023-02-22 15:02:08 -05:00
Joshua Boniface
9805681f94
Use consistent connection with other checks
2023-02-22 15:02:08 -05:00
Joshua Boniface
6c9abb2abe
Add Libvirtd monitoring check
2023-02-22 15:02:08 -05:00
Joshua Boniface
a1122c6e71
Add Zookeeper monitoring check
2023-02-22 15:02:08 -05:00
Joshua Boniface
3696f81597
Add PostgreSQL monitoring check
2023-02-22 15:02:08 -05:00
Joshua Boniface
5ca0d903b6
Adjust comment message
2023-02-22 15:02:08 -05:00
Joshua Boniface
626424b74a
Adjust Munin threshold values
2023-02-22 10:42:43 -05:00
Joshua Boniface
c9ceb3159b
Remove obsolete LINKSPEED variable
2023-02-22 01:04:25 -05:00
Joshua Boniface
6525a2568b
Adjust health delta of load to 50
...
This is a very bad situation and should be critical.
2023-02-22 01:03:12 -05:00
Joshua Boniface
09a005d3d7
Adjust health delta of EDAC Uncorrected to 50
...
This is a very bad situation and should be critical.
2023-02-22 01:01:54 -05:00
Joshua Boniface
fb0fcc0597
Update readme for Munin plugin
2023-02-18 00:00:04 -05:00
Joshua Boniface
3009f24910
Fix typo in var and flip conditional
2023-02-17 16:18:42 -05:00
Joshua Boniface
5ae836f1c5
Fix various issues with PVC Munin plugin
2023-02-17 15:41:16 -05:00