Commit Graph

35 Commits

Author SHA1 Message Date
Joshua Boniface 83c4c6633d Readd RBD lock detection and clearing on startup
This is still needed due to the nature of the locks and freeing them on
startup, and to preserve lock=fail behaviour on VM startup.

Also fixes the fencing lock flush to directly use the client library
outside of Celery. I don't like this hack but it seems prudent until we
move fencing to the workers as well.
2023-11-10 01:33:48 -05:00
Joshua Boniface 2c15036f86 Add KeyDB to node startup services
Also ensure API worker starts on all nodes, not just coordinators.
2023-11-05 19:26:38 -05:00
Joshua Boniface 30d7e49401 Start API worker with node daemon on coordinators 2023-11-04 13:08:16 -04:00
Joshua Boniface 8b93f9a80e Handle OSD index errors during stats collection 2023-11-01 21:33:40 -04:00
Joshua Boniface 0769f1ea52 Increase service start time to 10s 2023-10-23 22:24:03 -04:00
Joshua Boniface 457b7bed3d Handle exceptions in fence migrations 2023-09-16 22:56:09 -04:00
Joshua Boniface 48662e90c1 Remove obsolete monitoring_instance passing 2023-09-15 22:47:45 -04:00
Joshua Boniface 079381c03e Move printing to end and add runtime 2023-09-15 22:40:09 -04:00
Joshua Boniface 4d51318a40 Make monitoring interval configurable 2023-09-15 16:54:51 -04:00
Joshua Boniface 254303b9d4 Use coordinator_state instead of router_state
Makes it much clearer what this variable represents.
2023-09-15 16:47:56 -04:00
Joshua Boniface 40b7d68853 Separate monitoring and move to 60s interval
Removes the dependency of the monitoring subsystem from the node
keepalives, and runs them at a 60s interval to avoid excessive backups
if a plugin takes too long.

Adds its own logs and related items as required.

Finally adds a new required argument to the run() of plugins, the
coordinator state, which can be used by a plugin to determine actions
based on whether the node is a primary, secondary, or non-coordinator.
2023-09-15 16:47:11 -04:00
Joshua Boniface cb413e5ce6 [Bookworm] Fix Ceph 16 OSD stat parsing 2023-08-31 00:45:03 -04:00
Joshua Boniface ed087d83c2 Found cpuload to 2 decimal places 2023-08-29 21:41:44 -04:00
Joshua Boniface 7c07fbefff Adjust keepalive health printing and ordering 2023-02-24 11:08:30 -05:00
Joshua Boniface f4eef30770 Add JSON health to cluster data 2023-02-15 15:26:57 -05:00
Joshua Boniface bc88d764b0 Add logging flag for montioring plugin output 2023-02-13 22:04:39 -05:00
Joshua Boniface 2ee52e44d3 Move Ceph cluster health reporting to plugin
Also removes several outputs from the normal keepalive that were
superfluous/static so that the main output fits on one line.
2023-02-13 12:13:56 -05:00
Joshua Boniface 3c742a827b Initial implementation of monitoring plugin system 2023-02-13 12:06:26 -05:00
Joshua Boniface 726d0a562b Update copyright header year 2022-10-06 11:55:27 -04:00
Joshua Boniface 5942aa50fc Avoid raise/handle deadlocks
Can cause log flooding in some edge cases and isn't really needed any
longer. Use a proper conditional followed by an actual error handler.
2022-10-03 14:04:12 -04:00
Joshua Boniface 8d0f26ff7a Add additional kb_ values to OSD stats
Allows for easier parsing later to get e.g. % values and more details on
the used amounts.
2022-08-11 11:06:36 -04:00
Joshua Boniface 23b1501f40 Fix linting error F541 f-string placeholders 2021-11-06 03:26:03 -04:00
Joshua Boniface c41664d2da Reformat code with Black code formatter
Unify the code style along PEP and Black principles using the tool.
2021-11-06 03:02:43 -04:00
Joshua Boniface 2e7b9b28b3 Add some delay and additional tries to fencing 2021-10-27 16:24:17 -04:00
Joshua Boniface 55f397a347 Fix bad location of config sets 2021-10-12 17:23:04 -04:00
Joshua Boniface fe73dfbdc9 Use current live value for bridge_mtu
This will ensure that upgrading without the bridge_mtu config key set
will keep things as they are.
2021-10-12 12:24:03 -04:00
Joshua Boniface 8f906c1f81 Use power off in fence instead of reset
Use a power off (and then make the power on a requirement) during a node
fence. Removes some potential ambiguity in the power state, since we
will know for certain if it is off.
2021-10-12 11:04:27 -04:00
Joshua Boniface 50d8aa0586 Add handlers for client network MTUs
Refactors some of the code in VXNetworkInterface to handle MTUs in a
more streamlined fashion. Also fixes a bug whereby bridge client
networks were being explicitly given the cluster dev MTU which might not
be correct. Now adds support for this option explicitly in the configs,
and defaults to 1500 for safety (the standard Ethernet MTU).

Addresses #144
2021-10-09 17:02:27 -04:00
Joshua Boniface 6ee4c55071 Correct flawed conditional in verify_ipmi 2021-10-07 15:11:19 -04:00
Joshua Boniface f6f6f07488 Add timeouts to queue gets and adjust
Ensure that all keepalive timeouts are set (prevent the queue.get()
actions from blocking forever) and set the thread timeouts to line up as
well. Everything here is thus limited to keepalive_interval seconds
(default 5s) to keep it uniform.
2021-09-27 16:10:27 -04:00
Joshua Boniface 3b805cdc34 Fix failure to connect to libvirt in keepalive
This should be caught and abort the thread rather than failing and
holding up keepalives.
2021-09-26 20:42:01 -04:00
Joshua Boniface 06f0f7ed91 Fix several bugs in fence handling
1. Output from ipmitool was not being stripped, and stray newlines were
throwing off the comparisons. Fixes this.

2. Several stages were lacking meaningful messages. Adds these in so the
output is more clear about what is going on.

3. Reduce the sleep time after a fence to just 1x the
keepalive_interval, rather than 2x, because this seemed like excessively
long even for slow IPMI interfaces, especially since we're checking the
power state now anyways.

4. Set the node daemon state to an explicit 'fenced' state after a
successful fence to indicate to users that the node was indeed fenced
successfully and not still 'dead'.
2021-09-26 20:07:30 -04:00
Joshua Boniface f3fb492633 Handle VM disk/network stats gathering exceptions 2021-09-12 19:41:07 -04:00
Joshua Boniface a4c0e0befd Fix typo in output message 2021-08-23 00:39:19 -04:00
Joshua Boniface afb0359c20 Refactor pvcnoded to reduce Daemon.py size
This branch commit refactors the pvcnoded component to better adhere to
good programming practices. The previous Daemon.py was a massive file
which contained almost 2000 lines of direct, root-level code which was
directly imported. Not only was this poor practice, but this resulted
in a nigh-unmaintainable file which was hard even for me to understand.

This refactoring splits a large section of the code from Daemon.py into
separate small modules and functions in the `util/` directory. This will
hopefully make most of the functionality easy to find and modify without
having to dig through a single large file.

Further the existing subcomponents have been moved to the `objects/`
directory which clearly separates them.

Finally, the Daemon.py code has mostly been moved into a function,
`entrypoint()`, which is then called from the `pvcnoded.py` stub.

An additional item is that most format strings have been replaced by
f-strings to make use of the Python 3.6 features in Daemon.py and the
utility files.
2021-08-21 03:14:22 -04:00