108 Commits

Author SHA1 Message Date
123c7ce857 Update copyright header on all files for 2024
Last release of 2023 is probably the best time to do this.
2023-12-29 11:16:59 -05:00
e654fbba08 Move debug condition handling to Logger
Avoids many dozens of conditionals sprinkled throughout the code by
centralizing this check into the main Logger instance.
2023-12-27 13:01:45 -05:00
3e4cc53fdd Add node network statistics and utilization values
Adds a new physical network interface stats parser to the node
keepalives, and leverages this information to provide a network
utilization overview in the Prometheus metrics.
2023-12-21 15:45:01 -05:00
03a738f878 Move config parser into daemon_lib
And reformat/add config values for API.
2023-11-30 00:05:37 -05:00
41f4e4fb2f Split health monitoring into discrete daemon/pkg 2023-11-29 21:21:51 -05:00
bcc57638a9 Refactor pvcnoded to use new configuration 2023-11-26 15:41:25 -05:00
e818df5dae Use enable/disable --now instead of two commands
Avoids needing two calls here especially for the stop.
2023-11-16 02:40:35 -05:00
c76a5afd04 Avoid waits during node secondary
Waiting for the daemons to stop took too much time on some nodes and
could throw off the lockstep. Instead, leverage background=True to run
the systemctl os_commands in the background (when they complete is
irrelevant), stop the Metadata API first, and don't delay during its
stop at all.
2023-11-16 02:34:12 -05:00
aef38639cf Rename pvcapid-worker to pvcworkerd 2023-11-15 20:31:39 -05:00
83c4c6633d Readd RBD lock detection and clearing on startup
This is still needed due to the nature of the locks and freeing them on
startup, and to preserve lock=fail behaviour on VM startup.

Also fixes the fencing lock flush to directly use the client library
outside of Celery. I don't like this hack but it seems prudent until we
move fencing to the workers as well.
2023-11-10 01:33:48 -05:00
08411708f6 Clean up dangling references to cmd pipes
Also removes the schema references for these CMD pipes as they are no
longer required.
2023-11-09 23:28:14 -05:00
ce17c60a20 Port OSD on-node tasks to Celery worker system
Adds Celery versions of the osd_add, osd_replace, osd_refresh,
osd_remove, and osd_db_vg_add functions.
2023-11-09 23:28:08 -05:00
89681d54b9 Port VM on-node tasks to Celery worker system
Adds Celery versions of the flush_locks, device_attach, and
device_detach functions.
2023-11-06 20:40:46 -05:00
f0c2e9d295 Don't start pvcapid-worker on primary
It will be running anyways
2023-11-05 19:44:00 -05:00
7490f13b7c Check for partition tables on new devices 2023-11-04 03:13:58 -04:00
e32054be81 Refactor refresh as well 2023-11-04 02:44:52 -04:00
b3d13fe9be Add log message for zap 2023-11-04 01:02:51 -04:00
48b2ccbd95 Add timeout for safe-to-destroy
Continuously take the OSD down and out while doing so.
2023-11-04 00:55:05 -04:00
1535078842 Fix lvremove, lvcreate, and update ZK details 2023-11-04 00:30:14 -04:00
0e45613634 Use right key with correct data 2023-11-04 00:02:00 -04:00
7f5dd385b5 Use right key for FSID elsewhere 2023-11-03 23:51:01 -04:00
befce62925 Add OSD destroy before purge 2023-11-03 23:44:27 -04:00
b0909aed61 Get proper FSID value 2023-11-03 23:38:24 -04:00
f418b40527 Use proper FSID instead of hack 2023-11-03 16:38:19 -04:00
dd0177ce10 Rework replacement procedure again
Avoid calling other functions; replicate the actual process from Ceph
docs (https://docs.ceph.com/en/pacific/rados/operations/add-or-rm-osds/)
to ensure things work out well (e.g. preserving OSD IDs).
2023-11-03 16:31:56 -04:00
ed5bc9fb43 Fix numerous formatting and function bugs 2023-11-03 14:00:05 -04:00
94d8d2cf75 Fix skip_zap_flag anomaly and add crush rm 2023-11-03 02:35:12 -04:00
20497cf89d Fix bugs and skip safe_to_destroy on force 2023-11-03 02:29:50 -04:00
64e37ae963 Update OSD replacement functionality
1. Simplify this by leveraging the existing remove_osd/add_osd
functions, since its task was functionally identical to those two in
sequential order.
2. Add support for split OSDs within the command (replacing all OSDs on
the block device(s) as required).
3. Add additional configurability and flexibility around the old device,
weight, and external DB LVs.
2023-11-03 01:45:49 -04:00
3cb8a70f04 Add forcing to OSD purge 2023-11-02 23:20:48 -04:00
f53af510c1 Avoid startup failures if OSD removed 2023-11-02 22:24:39 -04:00
d5d783fad3 Set proper split flag 2023-11-02 22:20:22 -04:00
980ea6a9e9 Adjust handling of ext_db and _count options
Avoid the use of superfluous flag options, default them to none, and add
support for fixed-size DB LVs.
2023-11-02 13:29:47 -04:00
8780044be6 Ensure db_device is an empty string 2023-11-02 00:52:18 -04:00
f08c654f22 Fix missing fstring 2023-11-01 21:41:06 -04:00
526a5f4a74 Add support for split OSD adds
Allows creating multiple OSDs on a single (NVMe) block device,
leveraging the "ceph-volume lvm batch" command. Replaces the previous
method of creating OSDs.

Also adds a new ZK item for each OSD indicating if it is split or not.
2023-11-01 21:31:35 -04:00
aa0b1f504f Fix output bug 2023-11-01 15:46:38 -04:00
794cea4a02 Reverse ordering, run checks before starting timer 2023-09-15 22:25:37 -04:00
479e156234 Run monitoring plugins once on startup 2023-09-15 17:53:16 -04:00
86830286f3 Adjust message printing to be on one line 2023-09-15 17:00:34 -04:00
4d51318a40 Make monitoring interval configurable 2023-09-15 16:54:51 -04:00
cba6f5be48 Fix wording of non-coordinator state 2023-09-15 16:51:04 -04:00
254303b9d4 Use coordinator_state instead of router_state
Makes it much clearer what this variable represents.
2023-09-15 16:47:56 -04:00
40b7d68853 Separate monitoring and move to 60s interval
Removes the dependency of the monitoring subsystem from the node
keepalives, and runs them at a 60s interval to avoid excessive backups
if a plugin takes too long.

Adds its own logs and related items as required.

Finally adds a new required argument to the run() of plugins, the
coordinator state, which can be used by a plugin to determine actions
based on whether the node is a primary, secondary, or non-coordinator.
2023-09-15 16:47:11 -04:00
570da99605 Avoid failures if no children found 2023-09-02 01:36:17 -04:00
5e43f9bd7c Ensure Patroni failures do not block takeover 2023-08-29 22:00:11 -04:00
b5f996febd Fix bugs for node flush for stop/shutdown/restart
Previously VMs in stop/shutdown/restart states wouldn't be properly
handled during a node flush. This fixes the bugs and ensures that the
transient VM states (shutdown/restart) are completed before proceeding,
and then avoids setting a stopped/shutdown VM to shutdown/auotstart.
2023-08-18 11:25:59 -04:00
2c3a3cdf52 Use try when watching health value in NodeInstance 2023-03-07 09:53:01 -05:00
7c07fbefff Adjust keepalive health printing and ordering 2023-02-24 11:08:30 -05:00
202dc3ed59 Correct error handling if monitoring plugins fail 2023-02-24 10:19:41 -05:00