Commit Graph

825 Commits

Author SHA1 Message Date
Joshua Boniface 09269f182c Add live migrate max downtime selector meta field
Adds a new flag to VM metadata to allow setting the VM live migration
max downtime. This will enable very busy VMs that hang live migration to
have this value changed.
2024-01-11 00:05:50 -05:00
Joshua Boniface e9b6072fa0 Bump version to 0.9.89 2024-01-09 12:15:53 -05:00
Joshua Boniface 1d480f5629 Bump version to 0.9.88 2023-12-29 14:56:33 -05:00
Joshua Boniface 123c7ce857 Update copyright header on all files for 2024
Last release of 2023 is probably the best time to do this.
2023-12-29 11:16:59 -05:00
Joshua Boniface 8083b7a3e6 Bump version to 0.9.87 2023-12-27 13:40:51 -05:00
Joshua Boniface e654fbba08 Move debug condition handling to Logger
Avoids many dozens of conditionals sprinkled throughout the code by
centralizing this check into the main Logger instance.
2023-12-27 13:01:45 -05:00
Joshua Boniface 494c20263d Move monitoring folder to top level 2023-12-27 11:37:49 -05:00
Joshua Boniface 3e4cc53fdd Add node network statistics and utilization values
Adds a new physical network interface stats parser to the node
keepalives, and leverages this information to provide a network
utilization overview in the Prometheus metrics.
2023-12-21 15:45:01 -05:00
Joshua Boniface 39f9f3640c Rename health metrics and add resource metrics 2023-12-21 09:40:49 -05:00
Joshua Boniface 0a93f526e0 Bump version to 0.9.86 2023-12-14 14:46:29 -05:00
Joshua Boniface 38e43b46c3 Update health detail messages format 2023-12-13 03:17:47 -05:00
Joshua Boniface 0f24184b78 Explicitly clear resources of fenced node
This actually solves the bug originally "fixed" in
5f1432ccdd without breaking VM resource
allocations for working nodes.
2023-12-11 12:14:56 -05:00
Joshua Boniface 1ba37fe33d Restore VM resource allocation location
Commit 5f1432ccdd changed where these
happen due to a bug after fencing. However this completely broke node
resource reporting as only the final instance will be queried here.

Revert this change and look further into the original bug.
2023-12-11 11:52:59 -05:00
Joshua Boniface 1a05077b10 Fix missing fstring 2023-12-11 11:29:49 -05:00
Joshua Boniface 9617660342 Update Prometheus Grafana dashboard 2023-12-11 00:23:08 -05:00
Joshua Boniface 9dc5097dbc Bump version to 0.9.85 2023-12-10 01:00:33 -05:00
Joshua Boniface 53d632f283 Fix bug in example PVC Grafana dashboard 2023-12-10 00:50:05 -05:00
Joshua Boniface 7bc0760b78 Add time to "starting keepalive" message
Matches the pvchealthd output and provides a useful message detail to
this otherwise contextless message.
2023-12-10 00:40:32 -05:00
Joshua Boniface 9aee2a9075 Bump version to 0.9.84 2023-12-09 23:05:40 -05:00
Joshua Boniface 1f6347d24b Add Prometheus monitoring examples 2023-12-09 17:42:51 -05:00
Joshua Boniface 988de1218f Bump version to 0.9.83 2023-12-01 17:37:42 -05:00
Joshua Boniface 1fb0463dea Adjust daemon service startup
Add healthd, adjust workerd, lower waittime
2023-11-30 03:28:02 -05:00
Joshua Boniface 03a738f878 Move config parser into daemon_lib
And reformat/add config values for API.
2023-11-30 00:05:37 -05:00
Joshua Boniface 4a2eba0961 Improve node output messages (from pvchealthd)
1. Output startup "list" entries in cyan with s state
2. Add start of keepalive run message
2023-11-29 21:21:51 -05:00
Joshua Boniface 647cba3cf5 Expand startup width for new daemon name 2023-11-29 21:21:51 -05:00
Joshua Boniface 41f4e4fb2f Split health monitoring into discrete daemon/pkg 2023-11-29 21:21:51 -05:00
Joshua Boniface 83ceb41138 Add daemon name to Logger entries 2023-11-29 15:18:37 -05:00
Joshua Boniface 2545a7b744 Allow similar for IPMI hostnames 2023-11-28 16:09:01 -05:00
Joshua Boniface ce907ff26a Allow specifying static IPs instead of a file 2023-11-28 15:28:31 -05:00
Joshua Boniface 71e589e461 Remove superflous debug output
This is printed in the startup logo block anyways.
2023-11-27 13:46:30 -05:00
Joshua Boniface fc3d292081 Add missing subdirectory configs 2023-11-27 13:40:07 -05:00
Joshua Boniface eab1ae873b Ensure upstream_gateway key will exist 2023-11-27 13:37:57 -05:00
Joshua Boniface eaf93cdf96 Readd missing subsystem configurations 2023-11-27 13:33:41 -05:00
Joshua Boniface c8f4cbb39e Fix node entry keys 2023-11-27 13:24:01 -05:00
Joshua Boniface 786fae7769 Improve logo output 2023-11-27 13:01:43 -05:00
Joshua Boniface bcc57638a9 Refactor pvcnoded to use new configuration 2023-11-26 15:41:25 -05:00
Joshua Boniface 2666e0603e Update dnsmasq script to use new config file 2023-11-26 14:18:13 -05:00
Joshua Boniface dab7396196 Move to unified pvc.conf configuration file 2023-11-26 14:16:21 -05:00
Joshua Boniface 460a2dd09f Bump version to 0.9.82 2023-11-25 15:38:50 -05:00
Joshua Boniface 3e001b08b6 Bump version to 0.9.81 2023-11-17 01:29:41 -05:00
Joshua Boniface e818df5dae Use enable/disable --now instead of two commands
Avoids needing two calls here especially for the stop.
2023-11-16 02:40:35 -05:00
Joshua Boniface c76a5afd04 Avoid waits during node secondary
Waiting for the daemons to stop took too much time on some nodes and
could throw off the lockstep. Instead, leverage background=True to run
the systemctl os_commands in the background (when they complete is
irrelevant), stop the Metadata API first, and don't delay during its
stop at all.
2023-11-16 02:34:12 -05:00
Joshua Boniface 18e43a9377 Adjust name in worker log output 2023-11-16 02:25:14 -05:00
Joshua Boniface aef38639cf Rename pvcapid-worker to pvcworkerd 2023-11-15 20:31:39 -05:00
Joshua Boniface 5f1432ccdd Fix memory allocation updates and add more debug
Previously, we were assigning memalloc/memprov/vcpualloc during an
earlier phase using the main d_domain list. I'm not sure exactly why,
but this was throwing off stats after a fence. Instead, set these values
later on while parsing the actually-active VMs.
2023-11-10 10:29:32 -05:00
Joshua Boniface d6b8808448 Clean up fencing handler
1. Remove all format strings in favour of f-strings
2. Ensure all logger messages have a prefix
3. Add a few more logger messages for clarity
2023-11-10 10:09:54 -05:00
Joshua Boniface 83c4c6633d Readd RBD lock detection and clearing on startup
This is still needed due to the nature of the locks and freeing them on
startup, and to preserve lock=fail behaviour on VM startup.

Also fixes the fencing lock flush to directly use the client library
outside of Celery. I don't like this hack but it seems prudent until we
move fencing to the workers as well.
2023-11-10 01:33:48 -05:00
Joshua Boniface 2a9bc632fa Add node monitoring plugin for KeyDB/Redis 2023-11-10 00:56:46 -05:00
Joshua Boniface 08411708f6 Clean up dangling references to cmd pipes
Also removes the schema references for these CMD pipes as they are no
longer required.
2023-11-09 23:28:14 -05:00
Joshua Boniface ce17c60a20 Port OSD on-node tasks to Celery worker system
Adds Celery versions of the osd_add, osd_replace, osd_refresh,
osd_remove, and osd_db_vg_add functions.
2023-11-09 23:28:08 -05:00