819 Commits

Author SHA1 Message Date
494c20263d Move monitoring folder to top level 2023-12-27 11:37:49 -05:00
3e4cc53fdd Add node network statistics and utilization values
Adds a new physical network interface stats parser to the node
keepalives, and leverages this information to provide a network
utilization overview in the Prometheus metrics.
2023-12-21 15:45:01 -05:00
39f9f3640c Rename health metrics and add resource metrics 2023-12-21 09:40:49 -05:00
0a93f526e0 Bump version to 0.9.86 2023-12-14 14:46:29 -05:00
38e43b46c3 Update health detail messages format 2023-12-13 03:17:47 -05:00
0f24184b78 Explicitly clear resources of fenced node
This actually solves the bug originally "fixed" in
5f1432ccdd38996dac0f528035634cbc82827abd without breaking VM resource
allocations for working nodes.
2023-12-11 12:14:56 -05:00
1ba37fe33d Restore VM resource allocation location
Commit 5f1432ccdd38996dac0f528035634cbc82827abd changed where these
happen due to a bug after fencing. However this completely broke node
resource reporting as only the final instance will be queried here.

Revert this change and look further into the original bug.
2023-12-11 11:52:59 -05:00
1a05077b10 Fix missing fstring 2023-12-11 11:29:49 -05:00
9617660342 Update Prometheus Grafana dashboard 2023-12-11 00:23:08 -05:00
9dc5097dbc Bump version to 0.9.85 2023-12-10 01:00:33 -05:00
53d632f283 Fix bug in example PVC Grafana dashboard 2023-12-10 00:50:05 -05:00
7bc0760b78 Add time to "starting keepalive" message
Matches the pvchealthd output and provides a useful message detail to
this otherwise contextless message.
2023-12-10 00:40:32 -05:00
9aee2a9075 Bump version to 0.9.84 2023-12-09 23:05:40 -05:00
1f6347d24b Add Prometheus monitoring examples 2023-12-09 17:42:51 -05:00
988de1218f Bump version to 0.9.83 2023-12-01 17:37:42 -05:00
1fb0463dea Adjust daemon service startup
Add healthd, adjust workerd, lower waittime
2023-11-30 03:28:02 -05:00
03a738f878 Move config parser into daemon_lib
And reformat/add config values for API.
2023-11-30 00:05:37 -05:00
4a2eba0961 Improve node output messages (from pvchealthd)
1. Output startup "list" entries in cyan with s state
2. Add start of keepalive run message
2023-11-29 21:21:51 -05:00
647cba3cf5 Expand startup width for new daemon name 2023-11-29 21:21:51 -05:00
41f4e4fb2f Split health monitoring into discrete daemon/pkg 2023-11-29 21:21:51 -05:00
83ceb41138 Add daemon name to Logger entries 2023-11-29 15:18:37 -05:00
2545a7b744 Allow similar for IPMI hostnames 2023-11-28 16:09:01 -05:00
ce907ff26a Allow specifying static IPs instead of a file 2023-11-28 15:28:31 -05:00
71e589e461 Remove superfluous debug output
This is printed in the startup logo block anyways.
2023-11-27 13:46:30 -05:00
fc3d292081 Add missing subdirectory configs 2023-11-27 13:40:07 -05:00
eab1ae873b Ensure upstream_gateway key will exist 2023-11-27 13:37:57 -05:00
eaf93cdf96 Readd missing subsystem configurations 2023-11-27 13:33:41 -05:00
c8f4cbb39e Fix node entry keys 2023-11-27 13:24:01 -05:00
786fae7769 Improve logo output 2023-11-27 13:01:43 -05:00
bcc57638a9 Refactor pvcnoded to use new configuration 2023-11-26 15:41:25 -05:00
2666e0603e Update dnsmasq script to use new config file 2023-11-26 14:18:13 -05:00
dab7396196 Move to unified pvc.conf configuration file 2023-11-26 14:16:21 -05:00
460a2dd09f Bump version to 0.9.82 2023-11-25 15:38:50 -05:00
3e001b08b6 Bump version to 0.9.81 2023-11-17 01:29:41 -05:00
e818df5dae Use enable/disable --now instead of two commands
Avoids needing two calls here especially for the stop.
2023-11-16 02:40:35 -05:00
c76a5afd04 Avoid waits during node secondary
Waiting for the daemons to stop took too much time on some nodes and
could throw off the lockstep. Instead, leverage background=True to run
the systemctl os_commands in the background (when they complete is
irrelevant), stop the Metadata API first, and don't delay during its
stop at all.
2023-11-16 02:34:12 -05:00
18e43a9377 Adjust name in worker log output 2023-11-16 02:25:14 -05:00
aef38639cf Rename pvcapid-worker to pvcworkerd 2023-11-15 20:31:39 -05:00
5f1432ccdd Fix memory allocation updates and add more debug
Previously, we were assigning memalloc/memprov/vcpualloc during an
earlier phase using the main d_domain list. I'm not sure exactly why,
but this was throwing off stats after a fence. Instead, set these values
later on while parsing the actually-active VMs.
2023-11-10 10:29:32 -05:00
d6b8808448 Clean up fencing handler
1. Remove all format strings in favour of f-strings
2. Ensure all logger messages have a prefix
3. Add a few more logger messages for clarity
2023-11-10 10:09:54 -05:00
83c4c6633d Readd RBD lock detection and clearing on startup
This is still needed due to the nature of the locks and freeing them on
startup, and to preserve lock=fail behaviour on VM startup.

Also fixes the fencing lock flush to directly use the client library
outside of Celery. I don't like this hack but it seems prudent until we
move fencing to the workers as well.
2023-11-10 01:33:48 -05:00
2a9bc632fa Add node monitoring plugin for KeyDB/Redis 2023-11-10 00:56:46 -05:00
08411708f6 Clean up dangling references to cmd pipes
Also removes the schema references for these CMD pipes as they are no
longer required.
2023-11-09 23:28:14 -05:00
ce17c60a20 Port OSD on-node tasks to Celery worker system
Adds Celery versions of the osd_add, osd_replace, osd_refresh,
osd_remove, and osd_db_vg_add functions.
2023-11-09 23:28:08 -05:00
89681d54b9 Port VM on-node tasks to Celery worker system
Adds Celery versions of the flush_locks, device_attach, and
device_detach functions.
2023-11-06 20:40:46 -05:00
f0c2e9d295 Don't start pvcapid-worker on primary
It will be running anyways
2023-11-05 19:44:00 -05:00
2c15036f86 Add KeyDB to node startup services
Also ensure API worker starts on all nodes, not just coordinators.
2023-11-05 19:26:38 -05:00
30d7e49401 Start API worker with node daemon on coordinators 2023-11-04 13:08:16 -04:00
7490f13b7c Check for partition tables on new devices 2023-11-04 03:13:58 -04:00
e32054be81 Refactor refresh as well 2023-11-04 02:44:52 -04:00