Commit Graph

684 Commits

Author SHA1 Message Date
Joshua Boniface 09a005d3d7 Adjust health delta of EDAC Uncorrected to 50
This is a very bad situation and should be critical.
2023-02-22 01:01:54 -05:00
Joshua Boniface fb0fcc0597 Update readme for Munin plugin 2023-02-18 00:00:04 -05:00
Joshua Boniface 3009f24910 Fix typo in var and flip conditional 2023-02-17 16:18:42 -05:00
Joshua Boniface 5ae836f1c5 Fix various issues with PVC Munin plugin 2023-02-17 15:41:16 -05:00
Joshua Boniface eda1b95d5f Update Munin plugin example 2023-02-16 16:06:00 -05:00
Joshua Boniface 3bd93563e6 Add CheckMK monitoring example plugins 2023-02-16 16:05:47 -05:00
Joshua Boniface 1093ca6264 Disallow health less than 0 2023-02-15 16:50:24 -05:00
Joshua Boniface 388f6556c0 Remove extra text from packages plugin 2023-02-15 16:28:41 -05:00
Joshua Boniface 6c7be492b8 Move Ceph health to global cluster health 2023-02-15 15:46:13 -05:00
Joshua Boniface f4eef30770 Add JSON health to cluster data 2023-02-15 15:26:57 -05:00
Joshua Boniface 8565cf26b3 Add disk monitoring plugin 2023-02-15 11:30:49 -05:00
Joshua Boniface 0ecf219910 Run setup during plugin loads 2023-02-15 10:11:38 -05:00
Joshua Boniface 0f4edc54d1 Use percentage in keepalive output 2023-02-15 01:56:02 -05:00
Joshua Boniface ca91be51e1 Improve ethtool parsing speeds 2023-02-14 15:49:58 -05:00
Joshua Boniface e29d0e89eb Add NIC monitoring plugin 2023-02-14 15:43:52 -05:00
Joshua Boniface 14d29f2986 Adjust text on log message 2023-02-13 22:21:23 -05:00
Joshua Boniface bc88d764b0 Add logging flag for monitoring plugin output 2023-02-13 22:04:39 -05:00
Joshua Boniface a3c31564ca Flip condition in EDAC check 2023-02-13 21:58:56 -05:00
Joshua Boniface b07396c39a Fix bugs if plugins fail to load 2023-02-13 21:51:48 -05:00
Joshua Boniface 71139fa66d Add EDAC check plugin 2023-02-13 21:43:13 -05:00
Joshua Boniface 1ea4800212 Set node health to None when restarting 2023-02-13 15:54:46 -05:00
Joshua Boniface 9c14d84bfc Add node health value and send it out via API 2023-02-13 15:53:39 -05:00
Joshua Boniface d8f346abdd Move Ceph cluster health reporting to plugin
Also removes several outputs from the normal keepalive that were
superfluous/static so that the main output fits on one line.
2023-02-13 13:29:40 -05:00
Joshua Boniface 2ee52e44d3 Move Ceph cluster health reporting to plugin
Also removes several outputs from the normal keepalive that were
superfluous/static so that the main output fits on one line.
2023-02-13 12:13:56 -05:00
Joshua Boniface 3c742a827b Initial implementation of monitoring plugin system 2023-02-13 12:06:26 -05:00
Joshua Boniface aeb238f43c Bump version to 0.9.61 2023-02-08 10:08:05 -05:00
Joshua Boniface a49510ecc8 Bump version to 0.9.60 2022-12-06 15:42:55 -05:00
Joshua Boniface 92feeefd26 Bump version to 0.9.59 2022-11-15 15:50:15 -05:00
Joshua Boniface 38d63d9837 Flip behaviour of memory selectors
It didn't make any sense to me for mem(prov) to be the default selector,
since this has too many caveats versus mem(free). Switch to using
mem(free) as the default (i.e. "mem") and make memprov the alternative.
2022-11-15 15:45:59 -05:00
Joshua Boniface 095bcb2373 Bump version to 0.9.58 2022-11-07 12:27:48 -05:00
Joshua Boniface d65f512897 Bump version to 0.9.57 2022-11-06 01:39:50 -04:00
Joshua Boniface c3bc55eff8 Bump version to 0.9.56 2022-10-27 14:21:04 -04:00
Joshua Boniface 6c58d52fa1 Add node autoready oneshot unit
This replicates some of the more important functionality of the defunct
pvc-flush.service unit. On presence of a trigger file (i.e.
/etc/pvc/autoready), it will trigger a "node ready" on boot. It does
nothing on shutdown as this must be handled by other mechanisms, though
a similar autoflush could be added as well.
2022-10-27 14:09:14 -04:00
Joshua Boniface 726d0a562b Update copyright header year 2022-10-06 11:55:27 -04:00
Joshua Boniface f1df1cfe93 Bump version to 0.9.55 2022-10-04 13:21:40 -04:00
Joshua Boniface 5942aa50fc Avoid raise/handle deadlocks
Can cause log flooding in some edge cases and isn't really needed any
longer. Use a proper conditional followed by an actual error handler.
2022-10-03 14:04:12 -04:00
Joshua Boniface 239c392892 Bump version to 0.9.54 2022-08-23 11:01:05 -04:00
Joshua Boniface 9b499b9f48 Bump version to 0.9.53 2022-08-12 17:47:11 -04:00
Joshua Boniface 2a21d48128 Bump version to 0.9.52 2022-08-12 11:09:25 -04:00
Joshua Boniface 8d0f26ff7a Add additional kb_ values to OSD stats
Allows for easier parsing later to get e.g. % values and more details on
the used amounts.
2022-08-11 11:06:36 -04:00
Joshua Boniface 645b525ad7 Bump version to 0.9.51 2022-07-25 23:25:41 -04:00
Joshua Boniface ec559aec0d Remove pvc-flush service
This service caused more headaches than it was worth, so remove it.

The original goal was to cleanly flush nodes on shutdown and unflush
them on startup, but this is tightly controlled by Ansible playbooks at
this point, and this is something best left to the Administrator and
their particular situation anyway.
2022-07-25 23:21:34 -04:00
Joshua Boniface 932b3c55a3 Bump version to 0.9.50 2022-07-06 16:01:14 -04:00
Joshua Boniface 92e2ff7449 Fix bug with space-containing detect strings 2022-07-06 15:58:57 -04:00
Joshua Boniface f8cdcb30ba Add migration selector via free memory
Closes #152
2022-05-18 03:47:16 -04:00
Joshua Boniface 51ad2058ed Bump version to 0.9.49 2022-05-06 15:49:39 -04:00
Joshua Boniface 7a40c7a55b Add support for replacing/refreshing OSDs
Adds commands to both replace an OSD disk, and refresh (reimport) an
existing OSD disk on a new node. This handles the cases where an OSD
disk should be replaced (either due to upgrades or failures) or where a
node is rebuilt in-place and an existing OSD must be re-imported to it.

This should avoid the need to do a full remove/add sequence for either
case.

Also cleans up some aspects of OSD removal that are identical between
methods (e.g. using safe-to-destroy and sleeping after stopping) and
fixes a bug if an OSD does not truly exist when the daemon starts up.
2022-05-06 15:32:06 -04:00
Joshua Boniface 3801fcc07b Fix bug with initial JSON for stats 2022-05-02 13:28:19 -04:00
Joshua Boniface c741900baf Refactor OSD removal to use new ZK data
With the OSD LVM information stored in Zookeeper, we can use this to
determine the actual block device to zap rather than relying on runtime
determination and guesstimation.
2022-05-02 12:52:22 -04:00
Joshua Boniface 464f0e0356 Store additional OSD information in ZK
Ensures that information like the FSIDs and the OSD LVM volume are
stored in Zookeeper at creation time and updated at daemon start time
(to ensure the data is populated at least once, or if the /dev/sdX
path changes).

This will allow safer operation of OSD removals and the potential
implementation of re-activation after node replacements.
2022-05-02 12:11:39 -04:00