dcb9c0d12c
Improve fence handling conditions
...
Use the intermediate output text when judging the fence status, rather
than the retcode of the stop as this should be more reliable.
2024-05-08 10:55:15 -04:00
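A minimal sketch of the idea in dcb9c0d12c: judge the power state from the command's output text instead of its return code. The function name and the matched string are assumptions, modelled on the "Chassis Power is off" style of status line; the actual check in the commit is not shown here.

```python
def node_is_off(status_text: str) -> bool:
    """Return True only if the status output explicitly reports power off.

    Hypothetical helper: the expected wording ("... is off") is an
    assumption, not taken from the commit itself.
    """
    return status_text.strip().lower().endswith("is off")


# An empty or unexpected output is treated as "not confirmed off",
# regardless of what the exit code of the stop command was.
print(node_is_off("Chassis Power is off\n"))
print(node_is_off("Chassis Power is on\n"))
```

Matching on text this way fails closed: anything other than an explicit "off" report leaves the fence unconfirmed.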
79ad09ae59
Switch virtual memory free to allocated
...
Avoids incorrect reporting if cache/buffers exceeds normal.
2024-04-19 10:25:33 -04:00
a5763c9d25
Fix possible race condition applying schemas
...
Found an instance where two of these fired too close together and
caused a fatal error. Use a write lock, and also catch exceptions from
the schema.apply function in case it fails anyway.
2024-01-11 10:21:01 -05:00
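The pattern described in a5763c9d25 (serialize the apply behind a write lock, then still guard the call) might look roughly like the sketch below. Here threading.Lock is only a stand-in for whatever lock the daemon really uses, and apply_schema_safely, apply_fn and log are hypothetical names.

```python
import threading

# Stand-in for the write lock; the real lock type is an assumption.
schema_lock = threading.Lock()


def apply_schema_safely(apply_fn, log=print):
    """Run apply_fn under the lock and report failure instead of raising."""
    with schema_lock:
        try:
            apply_fn()
            return True
        except Exception as e:
            # Even with the lock held, the apply itself may still fail.
            log(f"Failed to apply schema: {e}")
            return False
```

The lock prevents two applies from overlapping; the try/except keeps a failed apply from becoming a fatal error.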
123c7ce857
Update copyright header on all files for 2024
...
Last release of 2023 is probably the best time to do this.
2023-12-29 11:16:59 -05:00
e654fbba08
Move debug condition handling to Logger
...
Avoids many dozens of conditionals sprinkled throughout the code by
centralizing this check into the main Logger instance.
2023-12-27 13:01:45 -05:00
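As an illustration of e654fbba08, centralizing the debug check means callers log unconditionally and the logger decides whether to emit. The class below is a simplified, hypothetical stand-in, not the project's Logger; the "d" state code and the sink parameter are assumptions.

```python
class Logger:
    """Simplified stand-in: drops debug messages unless debug is enabled."""

    def __init__(self, debug=False, sink=print):
        self.debug = debug
        self.sink = sink

    def out(self, message, state="i"):
        # The single, central debug check replaces per-call-site conditionals.
        if state == "d" and not self.debug:
            return
        self.sink(f"[{state}] {message}")


logger = Logger(debug=False)
logger.out("Starting keepalive run")           # emitted
logger.out("Raw stats collected", state="d")   # silently dropped
```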
3e4cc53fdd
Add node network statistics and utilization values
...
Adds a new physical network interface stats parser to the node
keepalives, and leverages this information to provide a network
utilization overview in the Prometheus metrics.
2023-12-21 15:45:01 -05:00
0f24184b78
Explicitly clear resources of fenced node
...
This actually solves the bug originally "fixed" in
5f1432ccdd38996dac0f528035634cbc82827abd without breaking VM resource
allocations for working nodes.
2023-12-11 12:14:56 -05:00
1ba37fe33d
Restore VM resource allocation location
...
Commit 5f1432ccdd38996dac0f528035634cbc82827abd changed where these
happen due to a bug after fencing. However this completely broke node
resource reporting as only the final instance will be queried here.
Revert this change and look further into the original bug.
2023-12-11 11:52:59 -05:00
1a05077b10
Fix missing fstring
2023-12-11 11:29:49 -05:00
7bc0760b78
Add time to "starting keepalive" message
...
Matches the pvchealthd output and provides a useful message detail to
this otherwise contextless message.
2023-12-10 00:40:32 -05:00
1fb0463dea
Adjust daemon service startup
...
Add healthd, adjust workerd, lower waittime
2023-11-30 03:28:02 -05:00
03a738f878
Move config parser into daemon_lib
...
And reformat/add config values for API.
2023-11-30 00:05:37 -05:00
4a2eba0961
Improve node output messages (from pvchealthd)
...
1. Output startup "list" entries in cyan with s state
2. Add start of keepalive run message
2023-11-29 21:21:51 -05:00
83ceb41138
Add daemon name to Logger entries
2023-11-29 15:18:37 -05:00
2545a7b744
Allow similar for IPMI hostnames
2023-11-28 16:09:01 -05:00
ce907ff26a
Allow specifying static IPs instead of a file
2023-11-28 15:28:31 -05:00
fc3d292081
Add missing subdirectory configs
2023-11-27 13:40:07 -05:00
eab1ae873b
Ensure upstream_gateway key will exist
2023-11-27 13:37:57 -05:00
eaf93cdf96
Readd missing subsystem configurations
2023-11-27 13:33:41 -05:00
c8f4cbb39e
Fix node entry keys
2023-11-27 13:24:01 -05:00
bcc57638a9
Refactor pvcnoded to use new configuration
2023-11-26 15:41:25 -05:00
18e43a9377
Adjust name in worker log output
2023-11-16 02:25:14 -05:00
aef38639cf
Rename pvcapid-worker to pvcworkerd
2023-11-15 20:31:39 -05:00
5f1432ccdd
Fix memory allocation updates and add more debug
...
Previously, we were assigning memalloc/memprov/vcpualloc during an
earlier phase using the main d_domain list. I'm not sure exactly why,
but this was throwing off stats after a fence. Instead, set these values
later on while parsing the actually-active VMs.
2023-11-10 10:29:32 -05:00
d6b8808448
Clean up fencing handler
...
1. Remove all format strings in favour of f-strings
2. Ensure all logger messages have a prefix
3. Add a few more logger messages for clarity
2023-11-10 10:09:54 -05:00
83c4c6633d
Readd RBD lock detection and clearing on startup
...
This is still needed due to the nature of the locks and freeing them on
startup, and to preserve lock=fail behaviour on VM startup.
Also fixes the fencing lock flush to directly use the client library
outside of Celery. I don't like this hack but it seems prudent until we
move fencing to the workers as well.
2023-11-10 01:33:48 -05:00
2c15036f86
Add KeyDB to node startup services
...
Also ensure API worker starts on all nodes, not just coordinators.
2023-11-05 19:26:38 -05:00
30d7e49401
Start API worker with node daemon on coordinators
2023-11-04 13:08:16 -04:00
8b93f9a80e
Handle OSD index errors during stats collection
2023-11-01 21:33:40 -04:00
0769f1ea52
Increase service start time to 10s
2023-10-23 22:24:03 -04:00
457b7bed3d
Handle exceptions in fence migrations
2023-09-16 22:56:09 -04:00
48662e90c1
Remove obsolete monitoring_instance passing
2023-09-15 22:47:45 -04:00
079381c03e
Move printing to end and add runtime
2023-09-15 22:40:09 -04:00
4d51318a40
Make monitoring interval configurable
2023-09-15 16:54:51 -04:00
254303b9d4
Use coordinator_state instead of router_state
...
Makes it much clearer what this variable represents.
2023-09-15 16:47:56 -04:00
40b7d68853
Separate monitoring and move to 60s interval
...
Removes the monitoring subsystem's dependency on the node
keepalives, and runs the plugins at a 60s interval to avoid excessive
backups if a plugin takes too long.
Adds its own logs and related items as required.
Finally adds a new required argument to the run() of plugins, the
coordinator state, which can be used by a plugin to determine actions
based on whether the node is a primary, secondary, or non-coordinator.
2023-09-15 16:47:11 -04:00
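To illustrate the new run() argument from 40b7d68853, a plugin could branch on the coordinator state as sketched below. The class name, the state strings and the return values are all hypothetical; only the idea of passing the coordinator state into run() comes from the commit message.

```python
class ExamplePlugin:
    """Hypothetical plugin showing a run() that receives the coordinator state."""

    def run(self, coordinator_state):
        # Assumed state strings; any other value is treated as non-coordinator.
        if coordinator_state == "primary":
            return "ran cluster-wide check"
        if coordinator_state == "secondary":
            return "ran standby check"
        return "skipped: not a coordinator"


print(ExamplePlugin().run("primary"))
```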
cb413e5ce6
[Bookworm] Fix Ceph 16 OSD stat parsing
2023-08-31 00:45:03 -04:00
ed087d83c2
Round cpuload to 2 decimal places
2023-08-29 21:41:44 -04:00
7c07fbefff
Adjust keepalive health printing and ordering
2023-02-24 11:08:30 -05:00
f4eef30770
Add JSON health to cluster data
2023-02-15 15:26:57 -05:00
bc88d764b0
Add logging flag for monitoring plugin output
2023-02-13 22:04:39 -05:00
2ee52e44d3
Move Ceph cluster health reporting to plugin
...
Also removes several outputs from the normal keepalive that were
superfluous/static so that the main output fits on one line.
2023-02-13 12:13:56 -05:00
3c742a827b
Initial implementation of monitoring plugin system
2023-02-13 12:06:26 -05:00
726d0a562b
Update copyright header year
2022-10-06 11:55:27 -04:00
5942aa50fc
Avoid raise/handle deadlocks
...
Can cause log flooding in some edge cases and isn't really needed any
longer. Use a proper conditional followed by an actual error handler.
2022-10-03 14:04:12 -04:00
8d0f26ff7a
Add additional kb_ values to OSD stats
...
Allows for easier parsing later to get e.g. % values and more details on
the used amounts.
2022-08-11 11:06:36 -04:00
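As a rough example of the later parsing that 8d0f26ff7a refers to, a percentage can be derived from raw kilobyte counters as below. The function and parameter names are hypothetical; the commit only says that additional kb_ values were added.

```python
def osd_used_percent(kb_total, kb_used):
    """Return used space as a percentage, rounded to 2 decimal places."""
    if kb_total <= 0:
        # Avoid dividing by zero for an OSD that reports no capacity.
        return 0.0
    return round(100 * kb_used / kb_total, 2)


print(osd_used_percent(1000, 250))  # 25.0
```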
23b1501f40
Fix linting error F541 f-string placeholders
2021-11-06 03:26:03 -04:00
c41664d2da
Reformat code with Black code formatter
...
Unify the code style along PEP and Black principles using the tool.
2021-11-06 03:02:43 -04:00
2e7b9b28b3
Add some delay and additional tries to fencing
2021-10-27 16:24:17 -04:00
55f397a347
Fix bad location of config sets
2021-10-12 17:23:04 -04:00