849 Commits

Author SHA1 Message Date
49bf51da38 Fix indentation of previous fix 2024-10-15 10:57:33 -04:00
1293e8ae7e Fix bugs in lock freeing function
1. The destination state on an error was invalid; should be "stop".

2. If a lock was listed but removing it failed (because it was already
cleared somehow), this would error. In turn, this would cause the VM to
not migrate and be left in an undefined state. Fix that when unlocking
is forced.
2024-10-15 10:43:52 -04:00
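
The forced-unlock behaviour described in 1293e8ae7e can be illustrated with a minimal sketch. This is not the project's code: `free_locks` and the `remove` callable are hypothetical names, and the only assumption is that removing a lock raises an exception when the lock is already gone.

```python
def free_locks(locks, remove, force=False):
    """Remove each listed lock and return the ones that could not be removed.

    `remove` is a hypothetical callable that raises on failure.
    """
    failed = []
    for lock in locks:
        try:
            remove(lock)
        except Exception:
            if not force:
                # Unforced unlock: surface the error to the caller.
                raise
            # Forced unlock: the lock may already be cleared, so record it
            # and continue rather than aborting the whole operation.
            failed.append(lock)
    return failed
```

With `force=True`, a lock that has already disappeared no longer aborts the operation, so the caller can still move the VM to a defined state.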
ae2cf8a070 Add some time for Zookeeper to synchronize 2024-10-15 10:43:44 -04:00
a6f8500309 Improve fence handling to prevent anomalies
1. Move fence monitoring to its own thread rather than doing the listing
and triggering within the main keepalive thread.
2. Add a global lock key at /config/fence_lock and use this lock key to
prevent multiple nodes from trying to run fences simultaneously.
3. Run the fencing monitor for each node sequentially within the context
of the main fence monitoring thread, to ensure that fences of multiple
nodes happen sequentially rather than in parallel.

All of these should help to prevent any anomalies where one node can try
to fence multiple nodes at once without recourse.
2024-10-10 16:42:57 -04:00
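
The three points in a6f8500309 (a dedicated thread, a global lock, sequential fencing) can be sketched roughly as follows. This is an illustration only: a local `threading.Lock` stands in for the cluster-wide lock key at /config/fence_lock, and `fence_node`, `fence_monitor`, and `run_fence_monitor` are hypothetical names.

```python
import threading

# Local stand-in for the cluster-wide lock key at /config/fence_lock.
fence_lock = threading.Lock()


def fence_node(name, fenced):
    # Hypothetical placeholder for the real fence action.
    fenced.append(name)


def fence_monitor(dead_nodes, fenced):
    """Fence each dead node in turn while holding the global lock."""
    with fence_lock:
        for name in dead_nodes:
            fence_node(name, fenced)


def run_fence_monitor(dead_nodes):
    """Run the monitor in its own thread, separate from the caller."""
    fenced = []
    thread = threading.Thread(target=fence_monitor, args=(dead_nodes, fenced))
    thread.start()
    thread.join(timeout=5)
    return fenced
```

Because the loop runs inside a single thread that holds the lock, two nodes are never fenced in parallel, and a second monitor would have to wait for the lock.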
c08c3b2d7d Improve thread timeouts in keepalive
Avoids various parts of the keepalive deadlocking while waiting on data
that will never come when various internal processes fail. Based on
testing, this should ensure that the keepalive always finishes in <5 seconds.
2024-10-10 15:33:47 -04:00
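
One common way to bound a worker thread, in the spirit of c08c3b2d7d, is to join it with a timeout and check whether it is still alive. The sketch below assumes that approach; `run_with_timeout` and `slow_collector` are hypothetical names, not functions from the project.

```python
import threading
import time


def run_with_timeout(target, timeout):
    """Run `target` in a daemon thread; return True if it finished in time."""
    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    # join() returns after `timeout` seconds even if the thread is still running.
    thread.join(timeout=timeout)
    return not thread.is_alive()


def slow_collector():
    # Hypothetical stand-in for a collector whose data never arrives in time.
    time.sleep(1)
```

A caller can then skip the late result and finish the keepalive run instead of blocking indefinitely.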
4c0d90b517 Add read lock timeouts to prevent deadlocks 2024-10-10 15:19:05 -04:00
8cb44c0c5d Bump version to 0.9.100 2024-08-30 11:03:33 -04:00
02a775c99b Bump version to 0.9.99 2024-08-28 11:15:55 -04:00
97329bb90d Sort Ceph pool data by name
There is no guarantee that both commands output the pools in the same
order, so sort them by name first so the iteration over the pools by ID
is successful.
2024-07-22 13:26:27 -04:00
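
The ordering problem fixed in 97329bb90d can be shown with a small sketch. The `name` field and the `pair_pools` helper are assumptions made for illustration; the real command output has its own structure.

```python
def pair_pools(pool_list, pool_stats):
    """Pair entries from two pool listings after sorting both by name."""

    def by_name(pool):
        return pool["name"]

    # Sorting both lists the same way makes positional pairing safe.
    return list(zip(sorted(pool_list, key=by_name), sorted(pool_stats, key=by_name)))
```

Without the sort, pairing by position could match one pool's definition with another pool's statistics.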
1aa5999109 Bump version to 0.9.98 2024-06-05 12:01:31 -04:00
570460e5ee Add --version flag to pvcnoded.py for info 2024-06-05 11:57:47 -04:00
dcb9c0d12c Improve fence handling conditions
Use the intermediate output text when judging the fence status, rather
than the retcode of the stop as this should be more reliable.
2024-05-08 10:55:15 -04:00
f1fe0c63f5 Bump version to 0.9.97 2024-04-19 10:32:16 -04:00
9714ac20b2 Update formatting for Black 24.4.0 2024-04-19 10:26:06 -04:00
79ad09ae59 Switch virtual memory free to allocated
Avoids incorrect reporting if cache/buffers exceeds normal.
2024-04-19 10:25:33 -04:00
4c6aabec6a Fix bug if d_network changes 2024-04-05 14:05:51 -04:00
78c774b607 Bump version to 0.9.96 2024-03-08 14:23:07 -05:00
dee8d186cf Bump version to 0.9.95 2024-02-12 13:12:48 -05:00
d63cc2e661 Bump version to 0.9.94 2024-02-06 13:31:50 -05:00
18f09196be Bump version to 0.9.93 2024-01-30 09:51:21 -05:00
df40b779af Bump version to 0.9.92 2024-01-29 09:39:10 -05:00
f29b4c2755 Bump version to 0.9.91 2024-01-23 10:40:59 -05:00
86ca363697 Bump version to 0.9.90 2024-01-11 10:22:48 -05:00
a5763c9d25 Fix possible race condition applying schemas
Found an instance where two of these fired too close together, and
caused a fatal error. Use a write lock, and then catch the schema.apply
function in case it fails anyways.
2024-01-11 10:21:01 -05:00
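
The pattern described in a5763c9d25, taking a lock and still guarding the call, might look like the sketch below. A local `threading.Lock` stands in for the write lock, and `apply_schema_safely` and its `apply` argument are hypothetical.

```python
import threading

# Local stand-in for the write lock mentioned in the commit message.
schema_lock = threading.Lock()


def apply_schema_safely(apply):
    """Call `apply` under the lock; return False instead of raising on failure."""
    with schema_lock:
        try:
            apply()
            return True
        except Exception:
            # A failed apply is reported to the caller rather than being fatal.
            return False
```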
09269f182c Add live migrate max downtime selector meta field
Adds a new flag to VM metadata to allow setting the VM live migration
max downtime. This will enable very busy VMs that hang live migration to
have this value changed.
2024-01-11 00:05:50 -05:00
e9b6072fa0 Bump version to 0.9.89 2024-01-09 12:15:53 -05:00
1d480f5629 Bump version to 0.9.88 2023-12-29 14:56:33 -05:00
123c7ce857 Update copyright header on all files for 2024
Last release of 2023 is probably the best time to do this.
2023-12-29 11:16:59 -05:00
8083b7a3e6 Bump version to 0.9.87 2023-12-27 13:40:51 -05:00
e654fbba08 Move debug condition handling to Logger
Avoids many dozens of conditionals sprinkled throughout the code by
centralizing this check into the main Logger instance.
2023-12-27 13:01:45 -05:00
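
Centralizing the debug check, as e654fbba08 describes, can be sketched as follows. This `Logger` class, its `out` method, and the `"d"` state value are illustrative assumptions and do not reproduce the project's actual Logger.

```python
class Logger:
    """Minimal logger that drops debug messages unless debug is enabled."""

    def __init__(self, debug=False):
        self.debug = debug
        self.lines = []

    def out(self, message, state="i"):
        # The single debug check lives here, so callers need no conditionals.
        if state == "d" and not self.debug:
            return
        self.lines.append(message)
```

Callers can then log debug messages unconditionally and leave the decision to the logger.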
494c20263d Move monitoring folder to top level 2023-12-27 11:37:49 -05:00
3e4cc53fdd Add node network statistics and utilization values
Adds a new physical network interface stats parser to the node
keepalives, and leverages this information to provide a network
utilization overview in the Prometheus metrics.
2023-12-21 15:45:01 -05:00
39f9f3640c Rename health metrics and add resource metrics 2023-12-21 09:40:49 -05:00
0a93f526e0 Bump version to 0.9.86 2023-12-14 14:46:29 -05:00
38e43b46c3 Update health detail messages format 2023-12-13 03:17:47 -05:00
0f24184b78 Explicitly clear resources of fenced node
This actually solves the bug originally "fixed" in
5f1432ccdd38996dac0f528035634cbc82827abd without breaking VM resource
allocations for working nodes.
2023-12-11 12:14:56 -05:00
1ba37fe33d Restore VM resource allocation location
Commit 5f1432ccdd38996dac0f528035634cbc82827abd changed where these
happen due to a bug after fencing. However, this completely broke node
resource reporting, as only the final instance will be queried here.

Revert this change and look further into the original bug.
2023-12-11 11:52:59 -05:00
1a05077b10 Fix missing fstring 2023-12-11 11:29:49 -05:00
9617660342 Update Prometheus Grafana dashboard 2023-12-11 00:23:08 -05:00
9dc5097dbc Bump version to 0.9.85 2023-12-10 01:00:33 -05:00
53d632f283 Fix bug in example PVC Grafana dashboard 2023-12-10 00:50:05 -05:00
7bc0760b78 Add time to "starting keepalive" message
Matches the pvchealthd output and provides a useful message detail to
this otherwise contextless message.
2023-12-10 00:40:32 -05:00
9aee2a9075 Bump version to 0.9.84 2023-12-09 23:05:40 -05:00
1f6347d24b Add Prometheus monitoring examples 2023-12-09 17:42:51 -05:00
988de1218f Bump version to 0.9.83 2023-12-01 17:37:42 -05:00
1fb0463dea Adjust daemon service startup
Add healthd, adjust workerd, lower waittime
2023-11-30 03:28:02 -05:00
03a738f878 Move config parser into daemon_lib
And reformat/add config values for API.
2023-11-30 00:05:37 -05:00
4a2eba0961 Improve node output messages (from pvchealthd)
1. Output startup "list" entries in cyan with s state
2. Add start of keepalive run message
2023-11-29 21:21:51 -05:00
647cba3cf5 Expand startup width for new daemon name 2023-11-29 21:21:51 -05:00
41f4e4fb2f Split health monitoring into discrete daemon/pkg 2023-11-29 21:21:51 -05:00