Joshua Boniface
48662e90c1
Remove obsolete monitoring_instance passing
2023-09-15 22:47:45 -04:00
Joshua Boniface
079381c03e
Move printing to end and add runtime
2023-09-15 22:40:09 -04:00
Joshua Boniface
794cea4a02
Reverse ordering, run checks before starting timer
2023-09-15 22:25:37 -04:00
Joshua Boniface
fa24f3ba75
Fix bad fstring in psur check
2023-09-15 22:19:49 -04:00
Joshua Boniface
caadafa80d
Add PSU redundancy sensor check
2023-09-15 19:07:29 -04:00
Joshua Boniface
479e156234
Run monitoring plugins once on startup
2023-09-15 17:53:16 -04:00
Joshua Boniface
86830286f3
Adjust message printing to be on one line
2023-09-15 17:00:34 -04:00
Joshua Boniface
4d51318a40
Make monitoring interval configurable
2023-09-15 16:54:51 -04:00
Joshua Boniface
cba6f5be48
Fix wording of non-coordinator state
2023-09-15 16:51:04 -04:00
Joshua Boniface
254303b9d4
Use coordinator_state instead of router_state
...
Makes it much clearer what this variable represents.
2023-09-15 16:47:56 -04:00
Joshua Boniface
40b7d68853
Separate monitoring and move to 60s interval
...
Removes the dependency of the monitoring subsystem on the node
keepalives, and runs the plugins at a 60s interval to avoid excessive
backups if a plugin takes too long.
Adds its own logs and related items as required.
Finally adds a new required argument to the run() of plugins, the
coordinator state, which can be used by a plugin to determine actions
based on whether the node is a primary, secondary, or non-coordinator.
2023-09-15 16:47:11 -04:00
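A minimal sketch of how a plugin might use the coordinator state passed to run(), as described in the commit above. The class name, the exact signature, the state strings, and the return values are all hypothetical illustrations and are not taken from the PVC plugin API; only the idea of branching on primary, secondary, or non-coordinator comes from the commit message.

```python
class ExamplePlugin:
    """Hypothetical plugin; not the real PVC plugin base class."""

    def run(self, coordinator_state=None):
        # The state labels below are assumed for illustration only.
        if coordinator_state == "primary":
            # Work that only makes sense on the primary coordinator.
            return "primary checks"
        if coordinator_state == "secondary":
            # Work that any coordinator could perform.
            return "coordinator checks"
        # Anything else is treated as a non-coordinator node.
        return "node-local checks"
```

A caller would invoke it as `ExamplePlugin().run(coordinator_state="primary")`.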
Joshua Boniface
a8115cafd1
Bump version to 0.9.73
2023-09-02 02:16:19 -04:00
Joshua Boniface
570da99605
Avoid failures if no children found
2023-09-02 01:36:17 -04:00
Joshua Boniface
fdda47e8a2
Bump version to 0.9.72
2023-09-01 16:34:45 -04:00
Joshua Boniface
bb2aac145d
Bump version to 0.9.71
2023-09-01 00:36:38 -04:00
Joshua Boniface
6c407d54c3
Bump version to 0.9.70
2023-08-31 14:15:54 -04:00
Joshua Boniface
cb413e5ce6
[Bookworm] Fix Ceph 16 OSD stat parsing
2023-08-31 00:45:03 -04:00
Joshua Boniface
123499f75f
[Bookworm] Specify YAML loader explicitly
2023-08-31 00:16:19 -04:00
Joshua Boniface
83b8ce7b62
Bump version to 0.9.69 (nice)
2023-08-29 22:02:13 -04:00
Joshua Boniface
5e43f9bd7c
Ensure Patroni failures do not block takeover
2023-08-29 22:00:11 -04:00
Joshua Boniface
ed087d83c2
Round cpuload to 2 decimal places
2023-08-29 21:41:44 -04:00
Joshua Boniface
83d475bd15
Bump version to 0.9.68
2023-08-27 20:59:23 -04:00
Joshua Boniface
705ec802a3
Bump version to 0.9.67
2023-08-27 14:47:20 -04:00
Joshua Boniface
0b90f37518
Bump version to 0.9.66
2023-08-27 11:41:22 -04:00
Joshua Boniface
1e083d7652
Bump version to 0.9.65
2023-08-23 01:56:57 -04:00
Joshua Boniface
075dbe7cc9
Bump version to 0.9.64
2023-08-18 12:34:27 -04:00
Joshua Boniface
b5f996febd
Fix bugs for node flush for stop/shutdown/restart
...
Previously VMs in stop/shutdown/restart states wouldn't be properly
handled during a node flush. This fixes the bugs and ensures that the
transient VM states (shutdown/restart) are completed before proceeding,
and then avoids setting a stopped/shutdown VM to shutdown/autostart.
2023-08-18 11:25:59 -04:00
Joshua Boniface
3a90fda109
Bump version to 0.9.63
2023-04-28 14:47:04 -04:00
Joshua Boniface
9114255af5
Add *.update-* obsolete configs to dpkg plugin
2023-04-10 15:39:40 -04:00
Joshua Boniface
2c3a3cdf52
Use try when watching health value in NodeInstance
2023-03-07 09:53:01 -05:00
Joshua Boniface
0b583bfdaf
Bump IPMI timeout to 2 seconds
2023-03-07 09:25:27 -05:00
Joshua Boniface
7c07fbefff
Adjust keepalive health printing and ordering
2023-02-24 11:08:30 -05:00
Joshua Boniface
202dc3ed59
Correct error handling if monitoring plugins fail
2023-02-24 10:19:41 -05:00
Joshua Boniface
4c2d99f8a6
Fix bug with SMART info
2023-02-23 13:21:23 -05:00
Joshua Boniface
bcff6650d0
Set timeout on IPMI command
2023-02-23 11:10:09 -05:00
Joshua Boniface
a11206253d
Fix ZK check location
2023-02-23 11:04:02 -05:00
Joshua Boniface
45ad3b9a17
Bump version to 0.9.62
2023-02-22 18:13:45 -05:00
Joshua Boniface
dc4e56db4b
Add IPMI monitoring check
2023-02-22 15:02:08 -05:00
Joshua Boniface
e45b3108a2
Add health delta change to message output
2023-02-22 15:02:08 -05:00
Joshua Boniface
118237a53b
Fix bad string value for message
2023-02-22 15:02:08 -05:00
Joshua Boniface
9805681f94
Use consistent connection with other checks
2023-02-22 15:02:08 -05:00
Joshua Boniface
6c9abb2abe
Add Libvirtd monitoring check
2023-02-22 15:02:08 -05:00
Joshua Boniface
a1122c6e71
Add Zookeeper monitoring check
2023-02-22 15:02:08 -05:00
Joshua Boniface
3696f81597
Add PostgreSQL monitoring check
2023-02-22 15:02:08 -05:00
Joshua Boniface
5ca0d903b6
Adjust comment message
2023-02-22 15:02:08 -05:00
Joshua Boniface
626424b74a
Adjust Munin threshold values
2023-02-22 10:42:43 -05:00
Joshua Boniface
c9ceb3159b
Remove obsolete LINKSPEED variable
2023-02-22 01:04:25 -05:00
Joshua Boniface
6525a2568b
Adjust health delta of load to 50
...
This is a very bad situation and should be critical.
2023-02-22 01:03:12 -05:00
Joshua Boniface
09a005d3d7
Adjust health delta of EDAC Uncorrected to 50
...
This is a very bad situation and should be critical.
2023-02-22 01:01:54 -05:00
Joshua Boniface
fb0fcc0597
Update readme for Munin plugin
2023-02-18 00:00:04 -05:00
Joshua Boniface
3009f24910
Fix typo in var and flip conditional
2023-02-17 16:18:42 -05:00
Joshua Boniface
5ae836f1c5
Fix various issues with PVC Munin plugin
2023-02-17 15:41:16 -05:00
Joshua Boniface
eda1b95d5f
Update Munin plugin example
2023-02-16 16:06:00 -05:00
Joshua Boniface
3bd93563e6
Add CheckMK monitoring example plugins
2023-02-16 16:05:47 -05:00
Joshua Boniface
1093ca6264
Disallow health less than 0
2023-02-15 16:50:24 -05:00
Joshua Boniface
388f6556c0
Remove extra text from packages plugin
2023-02-15 16:28:41 -05:00
Joshua Boniface
6c7be492b8
Move Ceph health to global cluster health
2023-02-15 15:46:13 -05:00
Joshua Boniface
f4eef30770
Add JSON health to cluster data
2023-02-15 15:26:57 -05:00
Joshua Boniface
8565cf26b3
Add disk monitoring plugin
2023-02-15 11:30:49 -05:00
Joshua Boniface
0ecf219910
Run setup during plugin loads
2023-02-15 10:11:38 -05:00
Joshua Boniface
0f4edc54d1
Use percentage in keepalive output
2023-02-15 01:56:02 -05:00
Joshua Boniface
ca91be51e1
Improve ethtool parsing speeds
2023-02-14 15:49:58 -05:00
Joshua Boniface
e29d0e89eb
Add NIC monitoring plugin
2023-02-14 15:43:52 -05:00
Joshua Boniface
14d29f2986
Adjust text on log message
2023-02-13 22:21:23 -05:00
Joshua Boniface
bc88d764b0
Add logging flag for monitoring plugin output
2023-02-13 22:04:39 -05:00
Joshua Boniface
a3c31564ca
Flip condition in EDAC check
2023-02-13 21:58:56 -05:00
Joshua Boniface
b07396c39a
Fix bugs if plugins fail to load
2023-02-13 21:51:48 -05:00
Joshua Boniface
71139fa66d
Add EDAC check plugin
2023-02-13 21:43:13 -05:00
Joshua Boniface
1ea4800212
Set node health to None when restarting
2023-02-13 15:54:46 -05:00
Joshua Boniface
9c14d84bfc
Add node health value and send out API
2023-02-13 15:53:39 -05:00
Joshua Boniface
d8f346abdd
Move Ceph cluster health reporting to plugin
...
Also removes several outputs from the normal keepalive that were
superfluous/static so that the main output fits on one line.
2023-02-13 13:29:40 -05:00
Joshua Boniface
2ee52e44d3
Move Ceph cluster health reporting to plugin
...
Also removes several outputs from the normal keepalive that were
superfluous/static so that the main output fits on one line.
2023-02-13 12:13:56 -05:00
Joshua Boniface
3c742a827b
Initial implementation of monitoring plugin system
2023-02-13 12:06:26 -05:00
Joshua Boniface
aeb238f43c
Bump version to 0.9.61
2023-02-08 10:08:05 -05:00
Joshua Boniface
a49510ecc8
Bump version to 0.9.60
2022-12-06 15:42:55 -05:00
Joshua Boniface
92feeefd26
Bump version to 0.9.59
2022-11-15 15:50:15 -05:00
Joshua Boniface
38d63d9837
Flip behaviour of memory selectors
...
It didn't make any sense to me for mem(prov) to be the default selector,
since this has too many caveats versus mem(free). Switch to using
mem(free) as the default (i.e. "mem") and make memprov the alternative.
2022-11-15 15:45:59 -05:00
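The selector change above could be sketched roughly as follows. This assumes that "mem" picks the node with the most free memory and "memprov" picks the node with the least provisioned memory; that reading, the function name, and the dictionary keys are assumptions for illustration, not the project's actual code.

```python
def select_target_node(nodes, selector="mem"):
    """Pick a node name from a list of dicts (hypothetical data layout)."""
    if selector == "mem":
        # Default: prefer the node with the most free memory.
        return max(nodes, key=lambda n: n["mem_free"])["name"]
    if selector == "memprov":
        # Alternative: prefer the node with the least provisioned memory.
        return min(nodes, key=lambda n: n["mem_prov"])["name"]
    raise ValueError(f"unknown selector: {selector}")


example_nodes = [
    {"name": "node1", "mem_free": 4096, "mem_prov": 8192},
    {"name": "node2", "mem_free": 16384, "mem_prov": 32768},
]
```

With this sample data, the two selectors disagree, which is the kind of caveat the commit message alludes to.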
Joshua Boniface
095bcb2373
Bump version to 0.9.58
2022-11-07 12:27:48 -05:00
Joshua Boniface
d65f512897
Bump version to 0.9.57
2022-11-06 01:39:50 -04:00
Joshua Boniface
c3bc55eff8
Bump version to 0.9.56
2022-10-27 14:21:04 -04:00
Joshua Boniface
6c58d52fa1
Add node autoready oneshot unit
...
This replicates some of the more important functionality of the defunct
pvc-flush.service unit. On presence of a trigger file (i.e.
/etc/pvc/autoready), it will trigger a "node ready" on boot. It does
nothing on shutdown as this must be handled by other mechanisms, though
a similar autoflush could be added as well.
2022-10-27 14:09:14 -04:00
Joshua Boniface
726d0a562b
Update copyright header year
2022-10-06 11:55:27 -04:00
Joshua Boniface
f1df1cfe93
Bump version to 0.9.55
2022-10-04 13:21:40 -04:00
Joshua Boniface
5942aa50fc
Avoid raise/handle deadlocks
...
Can cause log flooding in some edge cases and isn't really needed any
longer. Use a proper conditional followed by an actual error handler.
2022-10-03 14:04:12 -04:00
Joshua Boniface
239c392892
Bump version to 0.9.54
2022-08-23 11:01:05 -04:00
Joshua Boniface
9b499b9f48
Bump version to 0.9.53
2022-08-12 17:47:11 -04:00
Joshua Boniface
2a21d48128
Bump version to 0.9.52
2022-08-12 11:09:25 -04:00
Joshua Boniface
8d0f26ff7a
Add additional kb_ values to OSD stats
...
Allows for easier parsing later to get e.g. % values and more details on
the used amounts.
2022-08-11 11:06:36 -04:00
Joshua Boniface
645b525ad7
Bump version to 0.9.51
2022-07-25 23:25:41 -04:00
Joshua Boniface
ec559aec0d
Remove pvc-flush service
...
This service caused more headaches than it was worth, so remove it.
The original goal was to cleanly flush nodes on shutdown and unflush
them on startup, but this is tightly controlled by Ansible playbooks at
this point, and this is something best left to the Administrator and
their particular situation anyways.
2022-07-25 23:21:34 -04:00
Joshua Boniface
932b3c55a3
Bump version to 0.9.50
2022-07-06 16:01:14 -04:00
Joshua Boniface
92e2ff7449
Fix bug with space-containing detect strings
2022-07-06 15:58:57 -04:00
Joshua Boniface
f8cdcb30ba
Add migration selector via free memory
...
Closes #152
2022-05-18 03:47:16 -04:00
Joshua Boniface
51ad2058ed
Bump version to 0.9.49
2022-05-06 15:49:39 -04:00
Joshua Boniface
7a40c7a55b
Add support for replacing/refreshing OSDs
...
Adds commands to both replace an OSD disk, and refresh (reimport) an
existing OSD disk on a new node. This handles the cases where an OSD
disk should be replaced (either due to upgrades or failures) or where a
node is rebuilt in-place and an existing OSD must be re-imported to it.
This should avoid the need to do a full remove/add sequence for either
case.
Also cleans up some aspects of OSD removal that are identical between
methods (e.g. using safe-to-destroy and sleeping after stopping) and
fixes a bug if an OSD does not truly exist when the daemon starts up.
2022-05-06 15:32:06 -04:00
Joshua Boniface
3801fcc07b
Fix bug with initial JSON for stats
2022-05-02 13:28:19 -04:00
Joshua Boniface
c741900baf
Refactor OSD removal to use new ZK data
...
With the OSD LVM information stored in Zookeeper, we can use this to
determine the actual block device to zap rather than relying on runtime
determination and guesstimation.
2022-05-02 12:52:22 -04:00
Joshua Boniface
464f0e0356
Store additional OSD information in ZK
...
Ensures that information like the FSIDs and the OSD LVM volume are
stored in Zookeeper at creation time and updated at daemon start time
(to ensure the data is populated at least once, or if the /dev/sdX
path changes).
This will allow safer operation of OSD removals and the potential
implementation of re-activation after node replacements.
2022-05-02 12:11:39 -04:00
Joshua Boniface
cea8832f90
Ensure initial OSD stats is populated
...
Values are all invalid but this ensures the client won't error out when
trying to show an OSD that has never checked in yet.
2022-04-29 16:50:30 -04:00
Joshua Boniface
5807351405
Bump version to 0.9.48
2022-04-29 15:03:52 -04:00
Joshua Boniface
d6ca74376a
Fix bugs with forced removal
2022-04-29 14:03:07 -04:00
Joshua Boniface
4d698be34b
Add OSD removal force option
...
Ensures a removal can continue even in situations where some step(s)
might fail, for instance removing an obsolete OSD from a replaced node.
2022-04-29 11:16:33 -04:00
Joshua Boniface
ea709f573f
Bump version to 0.9.47
2021-12-28 22:03:08 -05:00
Joshua Boniface
58d57d7037
Bump version to 0.9.46
2021-12-28 15:02:14 -05:00
Joshua Boniface
00d2c67c41
Allow single-node clusters to restart and timeout
...
Prevents a daemon from waiting forever to terminate if it is primary,
and avoids this entirely if there is only a single node in the cluster.
2021-12-28 03:06:03 -05:00
Joshua Boniface
67131de4f6
Fix bug when removing OSDs
...
Ensure the OSD is down as well as out or purge might fail.
2021-12-28 03:05:34 -05:00
Joshua Boniface
abc23ebb18
Handle detect strings as arguments for blockdevs
...
Allows specifying blockdevs in the OSD and OSD-DB addition commands as
detect strings rather than actual block device paths. This provides
greater flexibility for automation with pvcbootstrapd (which originates
the concept of detect strings) and in general usage as well.
2021-12-28 02:53:02 -05:00
Joshua Boniface
f164d898c1
Bump version to 0.9.45
2021-11-25 09:34:20 -05:00
Joshua Boniface
a8899a1d66
Fix ordering of pvcnoded unit
...
We want to be after network.target and want network-online.target
2021-11-18 16:56:49 -05:00
Joshua Boniface
817dffcf30
Bump version to 0.9.44
2021-11-11 16:20:38 -05:00
Joshua Boniface
eda2a57a73
Add Munin plugin for Ceph utilization
2021-11-08 15:21:09 -05:00
Joshua Boniface
6e9fcd38a3
Bump version to 0.9.43
2021-11-08 02:29:17 -05:00
Joshua Boniface
78faa90139
Reformat recent changes with Black
2021-11-06 03:27:07 -04:00
Joshua Boniface
23b1501f40
Fix linting error F541 f-string placeholders
2021-11-06 03:26:03 -04:00
Joshua Boniface
66bfad3109
Fix linting errors F522/F523 unused args
2021-11-06 03:24:50 -04:00
Joshua Boniface
c41664d2da
Reformat code with Black code formatter
...
Unify the code style along PEP and Black principles using the tool.
2021-11-06 03:02:43 -04:00
Joshua Boniface
2e7b9b28b3
Add some delay and additional tries to fencing
2021-10-27 16:24:17 -04:00
Joshua Boniface
55f397a347
Fix bad location of config sets
2021-10-12 17:23:04 -04:00
Joshua Boniface
dfebb2d3e5
Also validate on failures
2021-10-12 17:11:03 -04:00
Joshua Boniface
e88147db4a
Bump version to 0.9.42
2021-10-12 15:25:42 -04:00
Joshua Boniface
b8204d89ac
Go back to passing if exception
...
Validation already happened and the set happens again later.
2021-10-12 14:21:52 -04:00
Joshua Boniface
fe73dfbdc9
Use current live value for bridge_mtu
...
This will ensure that upgrading without the bridge_mtu config key set
will keep things as they are.
2021-10-12 12:24:03 -04:00
Joshua Boniface
8f906c1f81
Use power off in fence instead of reset
...
Use a power off (and then make the power on a requirement) during a node
fence. Removes some potential ambiguity in the power state, since we
will know for certain if it is off.
2021-10-12 11:04:27 -04:00
Joshua Boniface
2d9fb9688d
Validate network MTU after initial read
2021-10-12 10:53:17 -04:00
Joshua Boniface
f13cc04b89
Bump version to 0.9.41
2021-10-09 19:39:21 -04:00
Joshua Boniface
95e01f38d5
Adjust log type of object setup message
2021-10-09 19:23:12 -04:00
Joshua Boniface
3122d73bf5
Avoid duplicate runs of MTU set
...
It wasn't the validator duplicating, but the update duplicating, so
avoid that happening properly this time.
2021-10-09 19:21:47 -04:00
Joshua Boniface
7ed8ef179c
Revert "Avoid duplicate runs of MTU validator"
...
This reverts commit 56021c443a.
2021-10-09 19:11:42 -04:00
Joshua Boniface
caead02b2a
Set all log messages to information state
...
None of these were "success" messages and thus shouldn't have been ok
state.
2021-10-09 19:09:38 -04:00
Joshua Boniface
87bc5f93e6
Avoid duplicate runs of MTU validator
2021-10-09 19:07:41 -04:00
Joshua Boniface
203893559e
Use correct isinstance instead of type
2021-10-09 19:03:31 -04:00
Joshua Boniface
2c51bb0705
Move MTU validation to function
...
Prevents code duplication and ensures validation runs when an MTU is
updated, not just on network creation.
2021-10-09 19:01:45 -04:00
Joshua Boniface
46d3daf686
Add logger message when setting MTU
2021-10-09 18:56:18 -04:00
Joshua Boniface
e9d05aa24e
Ensure vx_mtu is always an int()
2021-10-09 18:52:50 -04:00
Joshua Boniface
6ce28c43af
Add MTU value checking and log messages
...
Ensures that if a specified MTU is more than the maximum it is set to
the maximum instead, and adds warning messages for both situations.
2021-10-09 18:48:56 -04:00
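A minimal sketch of the over-maximum case described above: clamp the requested MTU to the maximum and log a warning. The function name, logger name, and message text are hypothetical; the second warning situation mentioned in the commit is not specified there, so it is not modelled here.

```python
import logging

logger = logging.getLogger("mtu-example")  # hypothetical logger name


def validate_mtu(requested_mtu, max_mtu):
    """Return an MTU no larger than max_mtu, warning when it is clamped."""
    requested_mtu = int(requested_mtu)  # accept string input as well
    if requested_mtu > max_mtu:
        logger.warning(
            "Requested MTU %d exceeds maximum %d; using maximum",
            requested_mtu,
            max_mtu,
        )
        return max_mtu
    return requested_mtu
```

For example, `validate_mtu(9000, 1500)` returns 1500 and emits a warning, while `validate_mtu(1400, 1500)` returns 1400 unchanged.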
Joshua Boniface
c45f8f5bd5
Have VXNetworkInstance set MTU if unset
...
Makes this explicit in Zookeeper if a network is unset, post-migration
(schema version 6).
Addresses #144
2021-10-09 17:52:57 -04:00
Joshua Boniface
3690a2c1e0
Fix migration bugs and invalid vx_mtu
...
Addresses #144
2021-10-09 17:35:10 -04:00
Joshua Boniface
50d8aa0586
Add handlers for client network MTUs
...
Refactors some of the code in VXNetworkInterface to handle MTUs in a
more streamlined fashion. Also fixes a bug whereby bridge client
networks were being explicitly given the cluster dev MTU which might not
be correct. Now adds support for this option explicitly in the configs,
and defaults to 1500 for safety (the standard Ethernet MTU).
Addresses #144
2021-10-09 17:02:27 -04:00
Joshua Boniface
6ee4c55071
Correct flawed conditional in verify_ipmi
2021-10-07 15:11:19 -04:00
Joshua Boniface
c27359c4bf
Bump version to 0.9.40
2021-10-07 14:42:04 -04:00
Joshua Boniface
46078932c3
Correct bad stop_keepalive_timer call
2021-10-07 14:41:12 -04:00
Joshua Boniface
bdb9db8375
Bump version to 0.9.39
2021-10-07 11:52:38 -04:00
Joshua Boniface
da9248cfa2
Bump version to 0.9.38
2021-10-03 22:32:41 -04:00
Joshua Boniface
23977b04fc
Bump version to 0.9.37
2021-09-30 02:08:14 -04:00
Joshua Boniface
f6f6f07488
Add timeouts to queue gets and adjust
...
Ensure that all keepalive timeouts are set (prevent the queue.get()
actions from blocking forever) and set the thread timeouts to line up as
well. Everything here is thus limited to keepalive_interval seconds
(default 5s) to keep it uniform.
2021-09-27 16:10:27 -04:00
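The bounded queue read described above can be illustrated with the standard library alone. The helper name and the choice to return None on timeout are assumptions for this sketch; only the use of a timeout on queue.get() and the 5 second default keepalive interval come from the commit message.

```python
import queue


def get_result(result_queue, keepalive_interval=5):
    """Wait at most keepalive_interval seconds for an item."""
    try:
        # A timeout keeps get() from blocking forever on an empty queue.
        return result_queue.get(timeout=keepalive_interval)
    except queue.Empty:
        # No result arrived in time; the caller decides how to proceed.
        return None
```

A worker thread joined with the same timeout value would keep both waits in step, which is the uniformity the commit aims for.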
Joshua Boniface
142c999ce8
Re-add success log output during migration
2021-09-27 11:50:55 -04:00
Joshua Boniface
1de069298c
Fix missing character in log message
2021-09-27 00:49:43 -04:00
Joshua Boniface
55221b3d97
Simplify VM migration down to 3 steps
...
Remove two superfluous synchronization steps which are not needed here,
since the exclusive lock handles that situation anyways.
Still does not fix the weird flush->unflush lock timeout bug, but is
better worked-around now due to the cancelling of the other wait freeing
this up and continuing.
2021-09-27 00:03:20 -04:00
Joshua Boniface
0d72798814
Work around synchronization lock issues
...
Make the block on stage C only wait for 900 seconds (15 minutes) to
prevent indefinite blocking.
The issue comes if a VM is being received, and the current unflush is
cancelled for a flush. When this happens, this lock acquisition seems to
block for no obvious reason, and no other changes seem to affect it.
This is certainly some sort of locking bug within Kazoo but I can't
diagnose it as-is. Leave a TODO to look into this again in the future.
2021-09-26 23:26:21 -04:00
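A rough sketch of bounding a lock wait, in the spirit of the 900 second limit above. It uses threading.Lock from the standard library as a stand-in for the Kazoo lock, so the function name and the boolean return convention are assumptions; the real lock may signal a timeout differently.

```python
import threading


def acquire_with_timeout(lock, timeout=900):
    """Try to take the lock, giving up after timeout seconds."""
    # threading.Lock.acquire returns False if the timeout expires.
    acquired = lock.acquire(timeout=timeout)
    return acquired


example_lock = threading.Lock()
```

The caller should check the returned value and skip or retry the stage when it is False, rather than assuming the lock is held.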
Joshua Boniface
3638efc77e
Improve log messages during VM migration
2021-09-26 23:15:38 -04:00