Commit Graph

469 Commits

Author SHA1 Message Date
Joshua Boniface f08c654f22 Fix missing fstring 2023-11-01 21:41:06 -04:00
Joshua Boniface 8b93f9a80e Handle OSD index errors during stats collection 2023-11-01 21:33:40 -04:00
Joshua Boniface 526a5f4a74 Add support for split OSD adds
Allows creating multiple OSDs on a single (NVMe) block device,
leveraging the "ceph-volume lvm batch" command. Replaces the previous
method of creating OSDs.

Also adds a new ZK item for each OSD indicating if it is split or not.
2023-11-01 21:31:35 -04:00
Joshua Boniface aa0b1f504f Fix output bug 2023-11-01 15:46:38 -04:00
Joshua Boniface 5b4dd61754 Bump version to 0.9.80 2023-10-27 09:56:31 -04:00
Joshua Boniface 221af3f241 Bump version to 0.9.79 2023-10-24 02:10:24 -04:00
Joshua Boniface 0769f1ea52 Increase service start time to 10s 2023-10-23 22:24:03 -04:00
Joshua Boniface c6c44bf775 Bump version to 0.9.78 2023-09-30 12:57:55 -04:00
Joshua Boniface 7c0f12750e Bump version to 0.9.77 2023-09-19 11:05:55 -04:00
Joshua Boniface 51e78480fa Bump version to 0.9.76 2023-09-18 10:15:52 -04:00
Joshua Boniface f46bfc962f Bump version to 0.9.75 2023-09-16 23:06:38 -04:00
Joshua Boniface 457b7bed3d Handle exceptions in fence migrations 2023-09-16 22:56:09 -04:00
Joshua Boniface 86115b2928 Add startup message for IPMI reachability
It's good to know that this succeeded in addition to knowing if it
failed.
2023-09-16 22:41:58 -04:00
Joshua Boniface 1a906b589e Bump version to 0.9.74 2023-09-16 00:18:13 -04:00
Joshua Boniface 48662e90c1 Remove obsolete monitoring_instance passing 2023-09-15 22:47:45 -04:00
Joshua Boniface 079381c03e Move printing to end and add runtime 2023-09-15 22:40:09 -04:00
Joshua Boniface 794cea4a02 Reverse ordering, run checks before starting timer 2023-09-15 22:25:37 -04:00
Joshua Boniface 479e156234 Run monitoring plugins once on startup 2023-09-15 17:53:16 -04:00
Joshua Boniface 86830286f3 Adjust message printing to be on one line 2023-09-15 17:00:34 -04:00
Joshua Boniface 4d51318a40 Make monitoring interval configurable 2023-09-15 16:54:51 -04:00
Joshua Boniface cba6f5be48 Fix wording of non-coordinator state 2023-09-15 16:51:04 -04:00
Joshua Boniface 254303b9d4 Use coordinator_state instead of router_state
Makes it much clearer what this variable represents.
2023-09-15 16:47:56 -04:00
Joshua Boniface 40b7d68853 Separate monitoring and move to 60s interval
Removes the dependency of the monitoring subsystem from the node
keepalives, and runs them at a 60s interval to avoid excessive backups
if a plugin takes too long.

Adds its own logs and related items as required.

Finally adds a new required argument to the run() of plugins, the
coordinator state, which can be used by a plugin to determine actions
based on whether the node is a primary, secondary, or non-coordinator.
2023-09-15 16:47:11 -04:00
Joshua Boniface a8115cafd1 Bump version to 0.9.73 2023-09-02 02:16:19 -04:00
Joshua Boniface 570da99605 Avoid failures if no children found 2023-09-02 01:36:17 -04:00
Joshua Boniface fdda47e8a2 Bump version to 0.9.72 2023-09-01 16:34:45 -04:00
Joshua Boniface bb2aac145d Bump version to 0.9.71 2023-09-01 00:36:38 -04:00
Joshua Boniface 6c407d54c3 Bump version to 0.9.70 2023-08-31 14:15:54 -04:00
Joshua Boniface cb413e5ce6 [Bookworm] Fix Ceph 16 OSD stat parsing 2023-08-31 00:45:03 -04:00
Joshua Boniface 123499f75f [Bookworm] Specify YAML loader explicitly 2023-08-31 00:16:19 -04:00
Joshua Boniface 83b8ce7b62 Bump version to 0.9.69 (nice) 2023-08-29 22:02:13 -04:00
Joshua Boniface 5e43f9bd7c Ensure Patroni failures do not block takeover 2023-08-29 22:00:11 -04:00
Joshua Boniface ed087d83c2 Found cpuload to 2 decimal places 2023-08-29 21:41:44 -04:00
Joshua Boniface 83d475bd15 Bump version to 0.9.68 2023-08-27 20:59:23 -04:00
Joshua Boniface 705ec802a3 Bump version to 0.9.67 2023-08-27 14:47:20 -04:00
Joshua Boniface 0b90f37518 Bump version to 0.9.66 2023-08-27 11:41:22 -04:00
Joshua Boniface 1e083d7652 Bump version to 0.9.65 2023-08-23 01:56:57 -04:00
Joshua Boniface 075dbe7cc9 Bump version to 0.9.64 2023-08-18 12:34:27 -04:00
Joshua Boniface b5f996febd Fix bugs for node flush for stop/shutdown/restart
Previously VMs in stop/shutdown/restart states wouldn't be properly
handled during a node flush. This fixes the bugs and ensures that the
transient VM states (shutdown/restart) are completed before proceeding,
and then avoids setting a stopped/shutdown VM to shutdown/auotstart.
2023-08-18 11:25:59 -04:00
Joshua Boniface 3a90fda109 Bump version to 0.9.63 2023-04-28 14:47:04 -04:00
Joshua Boniface 2c3a3cdf52 Use try when watching health value in NodeInstance 2023-03-07 09:53:01 -05:00
Joshua Boniface 7c07fbefff Adjust keepalive health printing and ordering 2023-02-24 11:08:30 -05:00
Joshua Boniface 202dc3ed59 Correct error handling if monitoring plugins fail 2023-02-24 10:19:41 -05:00
Joshua Boniface 45ad3b9a17 Bump version to 0.9.62 2023-02-22 18:13:45 -05:00
Joshua Boniface e45b3108a2 Add health delta change to message output 2023-02-22 15:02:08 -05:00
Joshua Boniface 118237a53b Fix bad string value for message 2023-02-22 15:02:08 -05:00
Joshua Boniface 1093ca6264 Disallow health less than 0 2023-02-15 16:50:24 -05:00
Joshua Boniface f4eef30770 Add JSON health to cluster data 2023-02-15 15:26:57 -05:00
Joshua Boniface 0ecf219910 Run setup during plugin loads 2023-02-15 10:11:38 -05:00
Joshua Boniface 0f4edc54d1 Use percentage in keepalie output 2023-02-15 01:56:02 -05:00
Joshua Boniface 14d29f2986 Adjust text on log message 2023-02-13 22:21:23 -05:00
Joshua Boniface bc88d764b0 Add logging flag for montioring plugin output 2023-02-13 22:04:39 -05:00
Joshua Boniface b07396c39a Fix bugs if plugins fail to load 2023-02-13 21:51:48 -05:00
Joshua Boniface 1ea4800212 Set node health to None when restarting 2023-02-13 15:54:46 -05:00
Joshua Boniface 9c14d84bfc Add node health value and send out API 2023-02-13 15:53:39 -05:00
Joshua Boniface d8f346abdd Move Ceph cluster health reporting to plugin
Also removes several outputs from the normal keepalive that were
superfluous/static so that the main output fits on one line.
2023-02-13 13:29:40 -05:00
Joshua Boniface 2ee52e44d3 Move Ceph cluster health reporting to plugin
Also removes several outputs from the normal keepalive that were
superfluous/static so that the main output fits on one line.
2023-02-13 12:13:56 -05:00
Joshua Boniface 3c742a827b Initial implementation of monitoring plugin system 2023-02-13 12:06:26 -05:00
Joshua Boniface aeb238f43c Bump version to 0.9.61 2023-02-08 10:08:05 -05:00
Joshua Boniface a49510ecc8 Bump version to 0.9.60 2022-12-06 15:42:55 -05:00
Joshua Boniface 92feeefd26 Bump version to 0.9.59 2022-11-15 15:50:15 -05:00
Joshua Boniface 095bcb2373 Bump version to 0.9.58 2022-11-07 12:27:48 -05:00
Joshua Boniface d65f512897 Bump version to 0.9.57 2022-11-06 01:39:50 -04:00
Joshua Boniface c3bc55eff8 Bump version to 0.9.56 2022-10-27 14:21:04 -04:00
Joshua Boniface 726d0a562b Update copyright header year 2022-10-06 11:55:27 -04:00
Joshua Boniface f1df1cfe93 Bump version to 0.9.55 2022-10-04 13:21:40 -04:00
Joshua Boniface 5942aa50fc Avoid raise/handle deadlocks
Can cause log flooding in some edge cases and isn't really needed any
longer. Use a proper conditional followed by an actual error handler.
2022-10-03 14:04:12 -04:00
Joshua Boniface 239c392892 Bump version to 0.9.54 2022-08-23 11:01:05 -04:00
Joshua Boniface 9b499b9f48 Bump version to 0.9.53 2022-08-12 17:47:11 -04:00
Joshua Boniface 2a21d48128 Bump version to 0.9.52 2022-08-12 11:09:25 -04:00
Joshua Boniface 8d0f26ff7a Add additional kb_ values to OSD stats
Allows for easier parsing later to get e.g. % values and more details on
the used amounts.
2022-08-11 11:06:36 -04:00
Joshua Boniface 645b525ad7 Bump version to 0.9.51 2022-07-25 23:25:41 -04:00
Joshua Boniface 932b3c55a3 Bump version to 0.9.50 2022-07-06 16:01:14 -04:00
Joshua Boniface 92e2ff7449 Fix bug with space-containing detect strings 2022-07-06 15:58:57 -04:00
Joshua Boniface 51ad2058ed Bump version to 0.9.49 2022-05-06 15:49:39 -04:00
Joshua Boniface 7a40c7a55b Add support for replacing/refreshing OSDs
Adds commands to both replace an OSD disk, and refresh (reimport) an
existing OSD disk on a new node. This handles the cases where an OSD
disk should be replaced (either due to upgrades or failures) or where a
node is rebuilt in-place and an existing OSD must be re-imported to it.

This should avoid the need to do a full remove/add sequence for either
case.

Also cleans up some aspects of OSD removal that are identical between
methods (e.g. using safe-to-destroy and sleeping after stopping) and
fixes a bug if an OSD does not truly exist when the daemon starts up.
2022-05-06 15:32:06 -04:00
Joshua Boniface 3801fcc07b Fix bug with initial JSON for stats 2022-05-02 13:28:19 -04:00
Joshua Boniface c741900baf Refactor OSD removal to use new ZK data
With the OSD LVM information stored in Zookeeper, we can use this to
determine the actual block device to zap rather than relying on runtime
determination and guestimation.
2022-05-02 12:52:22 -04:00
Joshua Boniface 464f0e0356 Store additional OSD information in ZK
Ensures that information like the FSIDs and the OSD LVM volume are
stored in Zookeeper at creation time and updated at daemon start time
(to ensure the data is populated at least once, or if the /dev/sdX
path changes).

This will allow safer operation of OSD removals and the potential
implementation of re-activation after node replacements.
2022-05-02 12:11:39 -04:00
Joshua Boniface cea8832f90 Ensure initial OSD stats is populated
Values are all invalid but this ensures the client won't error out when
trying to show an OSD that has never checked in yet.
2022-04-29 16:50:30 -04:00
Joshua Boniface 5807351405 Bump version to 0.9.48 2022-04-29 15:03:52 -04:00
Joshua Boniface d6ca74376a Fix bugs with forced removal 2022-04-29 14:03:07 -04:00
Joshua Boniface 4d698be34b Add OSD removal force option
Ensures a removal can continue even in situations where some step(s)
might fail, for instance removing an obsolete OSD from a replaced node.
2022-04-29 11:16:33 -04:00
Joshua Boniface ea709f573f Bump version to 0.9.47 2021-12-28 22:03:08 -05:00
Joshua Boniface 58d57d7037 Bump version to 0.9.46 2021-12-28 15:02:14 -05:00
Joshua Boniface 00d2c67c41 Allow single-node clusters to restart and timeout
Prevents a daemon from waiting forever to terminate if it is primary,
and avoids this entirely if there is only a single node in the cluster.
2021-12-28 03:06:03 -05:00
Joshua Boniface 67131de4f6 Fix bug when removing OSDs
Ensure the OSD is down as well as out or purge might fail.
2021-12-28 03:05:34 -05:00
Joshua Boniface abc23ebb18 Handle detect strings as arguments for blockdevs
Allows specifying blockdevs in the OSD and OSD-DB addition commands as
detect strings rather than actual block device paths. This provides
greater flexibility for automation with pvcbootstrapd (which originates
the concept of detect strings) and in general usage as well.
2021-12-28 02:53:02 -05:00
Joshua Boniface f164d898c1 Bump version to 0.9.45 2021-11-25 09:34:20 -05:00
Joshua Boniface 817dffcf30 Bump version to 0.9.44 2021-11-11 16:20:38 -05:00
Joshua Boniface 6e9fcd38a3 Bump version to 0.9.43 2021-11-08 02:29:17 -05:00
Joshua Boniface 78faa90139 Reformat recent changes with Black 2021-11-06 03:27:07 -04:00
Joshua Boniface 23b1501f40 Fix linting error F541 f-string placeholders 2021-11-06 03:26:03 -04:00
Joshua Boniface 66bfad3109 Fix linting errors F522/F523 unused args 2021-11-06 03:24:50 -04:00
Joshua Boniface c41664d2da Reformat code with Black code formatter
Unify the code style along PEP and Black principles using the tool.
2021-11-06 03:02:43 -04:00
Joshua Boniface 2e7b9b28b3 Add some delay and additional tries to fencing 2021-10-27 16:24:17 -04:00
Joshua Boniface 55f397a347 Fix bad location of config sets 2021-10-12 17:23:04 -04:00
Joshua Boniface dfebb2d3e5 Also validate on failures 2021-10-12 17:11:03 -04:00
Joshua Boniface e88147db4a Bump version to 0.9.42 2021-10-12 15:25:42 -04:00
Joshua Boniface b8204d89ac Go back to passing if exception
Validation already happened and the set happens again later.
2021-10-12 14:21:52 -04:00