Commit Graph

110 Commits

Author SHA1 Message Date
Joshua Boniface 4c6aabec6a Fix bug if d_network changes 2024-04-05 14:05:51 -04:00
Joshua Boniface 09269f182c Add live migrate max downtime selector meta field
Adds a new flag to VM metadata to allow setting the VM live migration
max downtime. This will enable very busy VMs that hang live migration to
have this value changed.
2024-01-11 00:05:50 -05:00
Joshua Boniface 123c7ce857 Update copyright header on all files for 2024
Last release of 2023 is probably the best time to do this.
2023-12-29 11:16:59 -05:00
Joshua Boniface e654fbba08 Move debug condition handling to Logger
Avoids many dozens of conditionals sprinkled throughout the code by
centralizing this check into the main Logger instance.
2023-12-27 13:01:45 -05:00
Joshua Boniface 3e4cc53fdd Add node network statistics and utilization values
Adds a new physical network interface stats parser to the node
keepalives, and leverages this information to provide a network
utilization overview in the Prometheus metrics.
2023-12-21 15:45:01 -05:00
Joshua Boniface 03a738f878 Move config parser into daemon_lib
And reformat/add config values for API.
2023-11-30 00:05:37 -05:00
Joshua Boniface 41f4e4fb2f Split health monitoring into discrete daemon/pkg 2023-11-29 21:21:51 -05:00
Joshua Boniface bcc57638a9 Refactor pvcnoded to use new configuration 2023-11-26 15:41:25 -05:00
Joshua Boniface e818df5dae Use enable/disable --now instead of two commands
Avoids needing two calls here especially for the stop.
2023-11-16 02:40:35 -05:00
Joshua Boniface c76a5afd04 Avoid waits during node secondary
Waiting for the daemons to stop took too much time on some nodes and
could throw off the lockstep. Instead, leverage background=True to run
the systemctl os_commands in the background (when they complete is
irrelevant), stop the Metadata API first, and don't delay during its
stop at all.
2023-11-16 02:34:12 -05:00
Joshua Boniface aef38639cf Rename pvcapid-worker to pvcworkerd 2023-11-15 20:31:39 -05:00
Joshua Boniface 83c4c6633d Readd RBD lock detection and clearing on startup
This is still needed due to the nature of the locks and freeing them on
startup, and to preserve lock=fail behaviour on VM startup.

Also fixes the fencing lock flush to directly use the client library
outside of Celery. I don't like this hack but it seems prudent until we
move fencing to the workers as well.
2023-11-10 01:33:48 -05:00
Joshua Boniface 08411708f6 Clean up dangling references to cmd pipes
Also removes the schema references for these CMD pipes as they are no
longer required.
2023-11-09 23:28:14 -05:00
Joshua Boniface ce17c60a20 Port OSD on-node tasks to Celery worker system
Adds Celery versions of the osd_add, osd_replace, osd_refresh,
osd_remove, and osd_db_vg_add functions.
2023-11-09 23:28:08 -05:00
Joshua Boniface 89681d54b9 Port VM on-node tasks to Celery worker system
Adds Celery versions of the flush_locks, device_attach, and
device_detach functions.
2023-11-06 20:40:46 -05:00
Joshua Boniface f0c2e9d295 Don't start pvcapid-worker on primary
It will be running anyways
2023-11-05 19:44:00 -05:00
Joshua Boniface 7490f13b7c Check for partition tables on new devices 2023-11-04 03:13:58 -04:00
Joshua Boniface e32054be81 Refactor refresh as well 2023-11-04 02:44:52 -04:00
Joshua Boniface b3d13fe9be Add log message for zap 2023-11-04 01:02:51 -04:00
Joshua Boniface 48b2ccbd95 Add timeout for safe-to-destroy
Continuously take the OSD down and out while doing so.
2023-11-04 00:55:05 -04:00
Joshua Boniface 1535078842 Fix lvremove, lvcreate, and update ZK details 2023-11-04 00:30:14 -04:00
Joshua Boniface 0e45613634 Use right key with correct data 2023-11-04 00:02:00 -04:00
Joshua Boniface 7f5dd385b5 Use right key for FSID elsewhere 2023-11-03 23:51:01 -04:00
Joshua Boniface befce62925 Add OSD destroy before purge 2023-11-03 23:44:27 -04:00
Joshua Boniface b0909aed61 Get proper FSID value 2023-11-03 23:38:24 -04:00
Joshua Boniface f418b40527 Use proper FSID instead of hack 2023-11-03 16:38:19 -04:00
Joshua Boniface dd0177ce10 Rework replacement procedure again
Avoid calling other functions; replicate the actual process from Ceph
docs (https://docs.ceph.com/en/pacific/rados/operations/add-or-rm-osds/)
to ensure things work out well (e.g. preserving OSD IDs).
2023-11-03 16:31:56 -04:00
Joshua Boniface ed5bc9fb43 Fix numerous formatting and function bugs 2023-11-03 14:00:05 -04:00
Joshua Boniface 94d8d2cf75 Fix skip_zap_flag anomaly and add crush rm 2023-11-03 02:35:12 -04:00
Joshua Boniface 20497cf89d Fix bugs and skip safe_to_destroy on force 2023-11-03 02:29:50 -04:00
Joshua Boniface 64e37ae963 Update OSD replacement functionality
1. Simplify this by leveraging the existing remove_osd/add_osd
functions, since its task was functionally identical to those two in
sequential order.
2. Add support for split OSDs within the command (replacing all OSDs on
the block device(s) as required).
3. Add additional configurability and flexibility around the old device,
weight, and external DB LVs.
2023-11-03 01:45:49 -04:00
Joshua Boniface 3cb8a70f04 Add forcing to OSD purge 2023-11-02 23:20:48 -04:00
Joshua Boniface f53af510c1 Avoid startup failures if OSD removed 2023-11-02 22:24:39 -04:00
Joshua Boniface d5d783fad3 Set proper split flag 2023-11-02 22:20:22 -04:00
Joshua Boniface 980ea6a9e9 Adjust handling of ext_db and _count options
Avoid the use of superfluous flag options, default them to none, and add
support for fixed-size DB LVs.
2023-11-02 13:29:47 -04:00
Joshua Boniface 8780044be6 Ensure db_device is an empty string 2023-11-02 00:52:18 -04:00
Joshua Boniface f08c654f22 Fix missing fstring 2023-11-01 21:41:06 -04:00
Joshua Boniface 526a5f4a74 Add support for split OSD adds
Allows creating multiple OSDs on a single (NVMe) block device,
leveraging the "ceph-volume lvm batch" command. Replaces the previous
method of creating OSDs.

Also adds a new ZK item for each OSD indicating if it is split or not.
2023-11-01 21:31:35 -04:00
Joshua Boniface aa0b1f504f Fix output bug 2023-11-01 15:46:38 -04:00
Joshua Boniface 794cea4a02 Reverse ordering, run checks before starting timer 2023-09-15 22:25:37 -04:00
Joshua Boniface 479e156234 Run monitoring plugins once on startup 2023-09-15 17:53:16 -04:00
Joshua Boniface 86830286f3 Adjust message printing to be on one line 2023-09-15 17:00:34 -04:00
Joshua Boniface 4d51318a40 Make monitoring interval configurable 2023-09-15 16:54:51 -04:00
Joshua Boniface cba6f5be48 Fix wording of non-coordinator state 2023-09-15 16:51:04 -04:00
Joshua Boniface 254303b9d4 Use coordinator_state instead of router_state
Makes it much clearer what this variable represents.
2023-09-15 16:47:56 -04:00
Joshua Boniface 40b7d68853 Separate monitoring and move to 60s interval
Removes the dependency of the monitoring subsystem from the node
keepalives, and runs them at a 60s interval to avoid excessive backups
if a plugin takes too long.

Adds its own logs and related items as required.

Finally adds a new required argument to the run() of plugins, the
coordinator state, which can be used by a plugin to determine actions
based on whether the node is a primary, secondary, or non-coordinator.
2023-09-15 16:47:11 -04:00
Joshua Boniface 570da99605 Avoid failures if no children found 2023-09-02 01:36:17 -04:00
Joshua Boniface 5e43f9bd7c Ensure Patroni failures do not block takeover 2023-08-29 22:00:11 -04:00
Joshua Boniface b5f996febd Fix bugs for node flush for stop/shutdown/restart
Previously VMs in stop/shutdown/restart states wouldn't be properly
handled during a node flush. This fixes the bugs and ensures that the
transient VM states (shutdown/restart) are completed before proceeding,
and then avoids setting a stopped/shutdown VM to shutdown/auotstart.
2023-08-18 11:25:59 -04:00
Joshua Boniface 2c3a3cdf52 Use try when watching health value in NodeInstance 2023-03-07 09:53:01 -05:00