58 Commits

Author SHA1 Message Date
a9e7713abf Add health delta change to message output 2023-02-22 15:02:08 -05:00
0f3cd13da1 Fix bad string value for message 2023-02-22 15:02:08 -05:00
4ab0bdd9e8 Disallow health less than 0 2023-02-15 16:50:24 -05:00
fc16e26f23 Run setup during plugin loads 2023-02-15 10:11:38 -05:00
8aa74aae62 Use percentage in keepalie output 2023-02-15 01:56:02 -05:00
8e6632bf10 Adjust text on log message 2023-02-13 22:21:23 -05:00
96d3aff7ad Add logging flag for montioring plugin output 2023-02-13 22:04:39 -05:00
54373c5bec Fix bugs if plugins fail to load 2023-02-13 21:51:48 -05:00
af436a93cc Set node health to None when restarting 2023-02-13 15:54:46 -05:00
edb3aea990 Add node health value and send out API 2023-02-13 15:53:39 -05:00
4d786c11e3 Move Ceph cluster health reporting to plugin
Also removes several outputs from the normal keepalive that were
superfluous/static so that the main output fits on one line.
2023-02-13 13:29:40 -05:00
3ad6ff2d9c Initial implementation of monitoring plugin system 2023-02-13 12:06:26 -05:00
a81d419a2e Update copyright header year 2022-10-06 11:55:27 -04:00
27214c8190 Fix bug with space-containing detect strings 2022-07-06 15:58:57 -04:00
21bbb0393f Add support for replacing/refreshing OSDs
Adds commands to both replace an OSD disk, and refresh (reimport) an
existing OSD disk on a new node. This handles the cases where an OSD
disk should be replaced (either due to upgrades or failures) or where a
node is rebuilt in-place and an existing OSD must be re-imported to it.

This should avoid the need to do a full remove/add sequence for either
case.

Also cleans up some aspects of OSD removal that are identical between
methods (e.g. using safe-to-destroy and sleeping after stopping) and
fixes a bug if an OSD does not truly exist when the daemon starts up.
2022-05-06 15:32:06 -04:00
1f8f3252a6 Fix bug with initial JSON for stats 2022-05-02 13:28:19 -04:00
b47c9832b7 Refactor OSD removal to use new ZK data
With the OSD LVM information stored in Zookeeper, we can use this to
determine the actual block device to zap rather than relying on runtime
determination and guestimation.
2022-05-02 12:52:22 -04:00
d2757004db Store additional OSD information in ZK
Ensures that information like the FSIDs and the OSD LVM volume are
stored in Zookeeper at creation time and updated at daemon start time
(to ensure the data is populated at least once, or if the /dev/sdX
path changes).

This will allow safer operation of OSD removals and the potential
implementation of re-activation after node replacements.
2022-05-02 12:11:39 -04:00
7323269775 Ensure initial OSD stats is populated
Values are all invalid but this ensures the client won't error out when
trying to show an OSD that has never checked in yet.
2022-04-29 16:50:30 -04:00
19c37c3ed5 Fix bugs with forced removal 2022-04-29 14:03:07 -04:00
cb50eee2a9 Add OSD removal force option
Ensures a removal can continue even in situations where some step(s)
might fail, for instance removing an obsolete OSD from a replaced node.
2022-04-29 11:16:33 -04:00
07f2006f68 Fix bug when removing OSDs
Ensure the OSD is down as well as out or purge might fail.
2021-12-28 03:05:34 -05:00
f4c7fdffb8 Handle detect strings as arguments for blockdevs
Allows specifying blockdevs in the OSD and OSD-DB addition commands as
detect strings rather than actual block device paths. This provides
greater flexibility for automation with pvcbootstrapd (which originates
the concept of detect strings) and in general usage as well.
2021-12-28 02:53:02 -05:00
16544227eb Reformat recent changes with Black 2021-11-06 03:27:07 -04:00
66230ce971 Fix linting errors F522/F523 unused args 2021-11-06 03:24:50 -04:00
2083fd824a Reformat code with Black code formatter
Unify the code style along PEP and Black principles using the tool.
2021-11-06 03:02:43 -04:00
c790c331a7 Also validate on failures 2021-10-12 17:11:03 -04:00
057071a7b7 Go back to passing if exception
Validation already happened and the set happens again later.
2021-10-12 14:21:52 -04:00
cc309fc021 Validate network MTU after initial read 2021-10-12 10:53:17 -04:00
c44732be83 Avoid duplicate runs of MTU set
It wasn't the validator duplicating, but the update duplicating, so
avoid that happening properly this time.
2021-10-09 19:21:47 -04:00
a8b68e0968 Revert "Avoid duplicate runs of MTU validator"
This reverts commit 56021c443a0a992e44350aa960976c6e8bddcb79.
2021-10-09 19:11:42 -04:00
e59152afee Set all log messages to information state
None of these were "success" messages and thus shouldn't have been ok
state.
2021-10-09 19:09:38 -04:00
56021c443a Avoid duplicate runs of MTU validator 2021-10-09 19:07:41 -04:00
ebdea165f1 Use correct isinstance instead of type 2021-10-09 19:03:31 -04:00
fb0651fb05 Move MTU validation to function
Prevents code duplication and ensures validation runs when an MTU is
updated, not just on network creation.
2021-10-09 19:01:45 -04:00
35e7e11403 Add logger message when setting MTU 2021-10-09 18:56:18 -04:00
b7555468eb Ensure vx_mtu is always an int() 2021-10-09 18:52:50 -04:00
4698edc98e Add MTU value checking and log messages
Ensures that if a specified MTU is more than the maximum it is set to
the maximum instead, and adds warning messages for both situations.
2021-10-09 18:48:56 -04:00
b0b0b75605 Have VXNetworkInstance set MTU if unset
Makes this explicit in Zookeeper if a network is unset, post-migration
(schema version 6).

Addresses #144
2021-10-09 17:52:57 -04:00
925141ed65 Fix migration bugs and invalid vx_mtu
Addresses #144
2021-10-09 17:35:10 -04:00
f7a826bf52 Add handlers for client network MTUs
Refactors some of the code in VXNetworkInterface to handle MTUs in a
more streamlined fashion. Also fixes a bug whereby bridge client
networks were being explicitly given the cluster dev MTU which might not
be correct. Now adds support for this option explicitly in the configs,
and defaults to 1500 for safety (the standard Ethernet MTU).

Addresses #144
2021-10-09 17:02:27 -04:00
e514eed414 Re-add success log output during migration 2021-09-27 11:50:55 -04:00
c2a473ed8b Simplify VM migration down to 3 steps
Remove two superfluous synchronization steps which are not needed here,
since the exclusive lock handles that situation anyways.

Still does not fix the weird flush->unflush lock timeout bug, but is
better worked-around now due to the cancelling of the other wait freeing
this up and continuing.
2021-09-27 00:03:20 -04:00
5355f6ff48 Work around synchronization lock issues
Make the block on stage C only wait for 900 seconds (15 minutes) to
prevent indefinite blocking.

The issue comes if a VM is being received, and the current unflush is
cancelled for a flush. When this happens, this lock acquisition seems to
block for no obvious reason, and no other changes seem to affect it.
This is certainly some sort of locking bug within Kazoo but I can't
diagnose it as-is. Leave a TODO to look into this again in the future.
2021-09-26 23:26:21 -04:00
bf7823deb5 Improve log messages during VM migration 2021-09-26 23:15:38 -04:00
8ba371723e Use event to non-block wait and fix inf wait 2021-09-26 22:55:39 -04:00
e10ac52116 Track status of VM state thread 2021-09-26 22:55:21 -04:00
341073521b Simplify locking process for VM migration
Rather than using a cumbersome and overly complex ping-pong of read and
write locks, instead move to a much simpler process using exclusive
locks.

Describing the process in ASCII or narrative is cumbersome, but the
process ping-pongs via a set of exclusive locks and wait timers, so that
the two sides are able to synchronize via blocking the exclusive lock.
The end result is a much more streamlined migration (takes about half
the time all things considered) which should be less error-prone.
2021-09-26 22:08:07 -04:00
8e62d5b30b Fix typo in log message 2021-09-26 03:35:30 -04:00
7df5b8e52e Fix typo in sgdisk command options 2021-09-26 00:59:05 -04:00