parallelvirtualcluster/pvc

Author	SHA1	Message	Date
Joshua M. Boniface	962fba7621	Bump up startup waits slightly Ensures there's more time for daemons (specifically Zookeeper) to start up and synchronize between nodes.	2024-10-15 11:10:23 -04:00
Joshua M. Boniface	49bf51da38	Fix indentation of previous fix	2024-10-15 10:57:33 -04:00
Joshua M. Boniface	1293e8ae7e	Fix bugs in lock freeing function 1. The destination state on an error was invalid; should be "stop". 2. If a lock was listed but removing it fails (because it was already cleared somehow, this would error. In turn this would cause the VM to not migrate and be left in an undefined state. Fix that when unlocking is forced.	2024-10-15 10:43:52 -04:00
Joshua M. Boniface	ae2cf8a070	Add some time for Zookeeper to synchronize	2024-10-15 10:43:44 -04:00
Joshua M. Boniface	a6f8500309	Improve fence handling to prevent anomalies 1. Move fence monitoring to its own thread rather than doing the listing and triggering within the main keepalive thread. 2. Add a global lock key at /config/fence_lock and use this lock key to prevent multiple nodes from trying to run fences simultaneously. 3. Run the fencing monitor for each node sequentially within the context of the main fence monitoring thread, to ensure that fences of multiple nodes happen sequentially rather than in parallel. All of these should help to prevent any anomalies where one node can try to fence multiple nodes at once without recourse.	2024-10-10 16:42:57 -04:00
Joshua M. Boniface	c08c3b2d7d	Improve thread timeouts in keepalive Avoids various parts of the keepalive deadlocking waiting on data that will never come when various internal processes fail. This should ensure based on testing that the keepalive will always finish in <5 seconds.	2024-10-10 15:33:47 -04:00
Joshua M. Boniface	97329bb90d	Sort Ceph pool data by name There is no guarantee that both commands output the pools in the same order, so sort them by name first so the iteration over the pools by ID is successful.	2024-07-22 13:26:27 -04:00
Joshua M. Boniface	dcb9c0d12c	Improve fence handling conditions Use the intermediate output text when judging the fence status, rather than the retcode of the stop as this should be more reliable.	2024-05-08 10:55:15 -04:00
Joshua M. Boniface	79ad09ae59	Switch virtual memory free to allocated Avoids incorrect reporting if cache/buffers exceeds normal.	2024-04-19 10:25:33 -04:00
Joshua M. Boniface	a5763c9d25	Fix possible race condition applying schemas Found an instance where two of these fired too close together, and caused a fatal error. Use a write lock, and then catch the schema.apply function in case it fails anyways.	2024-01-11 10:21:01 -05:00
Joshua M. Boniface	123c7ce857	Update copyright header on all files for 2024 Last release of 2023 is probably the best time to do this.	2023-12-29 11:16:59 -05:00
Joshua M. Boniface	e654fbba08	Move debug condition handling to Logger Avoids many dozens of conditionals sprinkled throughout the code by centralizing this check into the main Logger instance.	2023-12-27 13:01:45 -05:00
Joshua M. Boniface	3e4cc53fdd	Add node network statistics and utilization values Adds a new physical network interface stats parser to the node keepalives, and leverages this information to provide a network utilization overview in the Prometheus metrics.	2023-12-21 15:45:01 -05:00
Joshua M. Boniface	0f24184b78	Explicitly clear resources of fenced node This actually solves the bug originally "fixed" in `5f1432ccdd` without breaking VM resource allocations for working nodes.	2023-12-11 12:14:56 -05:00
Joshua M. Boniface	1ba37fe33d	Restore VM resource allocation location Commit `5f1432ccdd` changed where these happen due to a bug after fencing. However this completely broke node resource reporting as only the final instance will be queried here. Revert this change and look further into the original bug.	2023-12-11 11:52:59 -05:00
Joshua M. Boniface	1a05077b10	Fix missing fstring	2023-12-11 11:29:49 -05:00
Joshua M. Boniface	7bc0760b78	Add time to "starting keepalive" message Matches the pvchealthd output and provides a useful message detail to this otherwise contextless message.	2023-12-10 00:40:32 -05:00
Joshua M. Boniface	1fb0463dea	Adjust daemon service startup Add healthd, adjust workerd, lower waittime	2023-11-30 03:28:02 -05:00
Joshua M. Boniface	03a738f878	Move config parser into daemon_lib And reformat/add config values for API.	2023-11-30 00:05:37 -05:00
Joshua M. Boniface	4a2eba0961	Improve node output messages (from pvchealthd) 1. Output startup "list" entries in cyan with s state 2. Add start of keepalive run message	2023-11-29 21:21:51 -05:00
Joshua M. Boniface	83ceb41138	Add daemon name to Logger entries	2023-11-29 15:18:37 -05:00
Joshua M. Boniface	2545a7b744	Allow similar for IPMI hostnames	2023-11-28 16:09:01 -05:00
Joshua M. Boniface	ce907ff26a	Allow specifying static IPs instead of a file	2023-11-28 15:28:31 -05:00
Joshua M. Boniface	fc3d292081	Add missing subdirectory configs	2023-11-27 13:40:07 -05:00
Joshua M. Boniface	eab1ae873b	Ensure upstream_gateway key will exist	2023-11-27 13:37:57 -05:00
Joshua M. Boniface	eaf93cdf96	Readd missing subsystem configurations	2023-11-27 13:33:41 -05:00
Joshua M. Boniface	c8f4cbb39e	Fix node entry keys	2023-11-27 13:24:01 -05:00
Joshua M. Boniface	bcc57638a9	Refactor pvcnoded to use new configuration	2023-11-26 15:41:25 -05:00
Joshua M. Boniface	18e43a9377	Adjust name in worker log output	2023-11-16 02:25:14 -05:00
Joshua M. Boniface	aef38639cf	Rename pvcapid-worker to pvcworkerd	2023-11-15 20:31:39 -05:00
Joshua M. Boniface	5f1432ccdd	Fix memory allocation updates and add more debug Previously, we were assigning memalloc/memprov/vcpualloc during an earlier phase using the main d_domain list. I'm not sure exactly why, but this was throwing off stats after a fence. Instead, set these values later on while parsing the actually-active VMs.	2023-11-10 10:29:32 -05:00
Joshua M. Boniface	d6b8808448	Clean up fencing handler 1. Remove all format strings in favour of f-strings 2. Ensure all logger messages have a prefix 3. Add a few more logger messages for clarity	2023-11-10 10:09:54 -05:00
Joshua M. Boniface	83c4c6633d	Readd RBD lock detection and clearing on startup This is still needed due to the nature of the locks and freeing them on startup, and to preserve lock=fail behaviour on VM startup. Also fixes the fencing lock flush to directly use the client library outside of Celery. I don't like this hack but it seems prudent until we move fencing to the workers as well.	2023-11-10 01:33:48 -05:00
Joshua M. Boniface	2c15036f86	Add KeyDB to node startup services Also ensure API worker starts on all nodes, not just coordinators.	2023-11-05 19:26:38 -05:00
Joshua M. Boniface	30d7e49401	Start API worker with node daemon on coordinators	2023-11-04 13:08:16 -04:00
Joshua M. Boniface	8b93f9a80e	Handle OSD index errors during stats collection	2023-11-01 21:33:40 -04:00
Joshua M. Boniface	0769f1ea52	Increase service start time to 10s	2023-10-23 22:24:03 -04:00
Joshua M. Boniface	457b7bed3d	Handle exceptions in fence migrations	2023-09-16 22:56:09 -04:00
Joshua M. Boniface	48662e90c1	Remove obsolete monitoring_instance passing	2023-09-15 22:47:45 -04:00
Joshua M. Boniface	079381c03e	Move printing to end and add runtime	2023-09-15 22:40:09 -04:00
Joshua M. Boniface	4d51318a40	Make monitoring interval configurable	2023-09-15 16:54:51 -04:00
Joshua M. Boniface	254303b9d4	Use coordinator_state instead of router_state Makes it much clearer what this variable represents.	2023-09-15 16:47:56 -04:00
Joshua M. Boniface	40b7d68853	Separate monitoring and move to 60s interval Removes the dependency of the monitoring subsystem from the node keepalives, and runs them at a 60s interval to avoid excessive backups if a plugin takes too long. Adds its own logs and related items as required. Finally adds a new required argument to the run() of plugins, the coordinator state, which can be used by a plugin to determine actions based on whether the node is a primary, secondary, or non-coordinator.	2023-09-15 16:47:11 -04:00
Joshua M. Boniface	cb413e5ce6	[Bookworm] Fix Ceph 16 OSD stat parsing	2023-08-31 00:45:03 -04:00
Joshua M. Boniface	ed087d83c2	Found cpuload to 2 decimal places	2023-08-29 21:41:44 -04:00
Joshua M. Boniface	7c07fbefff	Adjust keepalive health printing and ordering	2023-02-24 11:08:30 -05:00
Joshua M. Boniface	f4eef30770	Add JSON health to cluster data	2023-02-15 15:26:57 -05:00
Joshua M. Boniface	bc88d764b0	Add logging flag for montioring plugin output	2023-02-13 22:04:39 -05:00
Joshua M. Boniface	2ee52e44d3	Move Ceph cluster health reporting to plugin Also removes several outputs from the normal keepalive that were superfluous/static so that the main output fits on one line.	2023-02-13 12:13:56 -05:00
Joshua M. Boniface	3c742a827b	Initial implementation of monitoring plugin system	2023-02-13 12:06:26 -05:00

1 2

67 Commits