parallelvirtualcluster/pvc

Author	SHA1	Message	Date
Joshua M. Boniface	017953c2e6	Move lock release to phase D	2020-10-21 11:07:01 -04:00
Joshua M. Boniface	82b4d3ed1b	Add missing prefix statements to loggers	2020-10-21 10:52:53 -04:00
Joshua M. Boniface	bae366a316	Add waits and only receive check on send	2020-10-21 10:43:42 -04:00
Joshua M. Boniface	351076c15e	Check if node changed during final check Avoids situations where two migrates, to different nodes, happen in rapid succession. Aborts the migration if the current target node no longer matches what was set at the start of the execution.	2020-10-21 02:52:36 -04:00
Joshua M. Boniface	42514b9a50	Improve messages further	2020-10-21 02:41:42 -04:00
Joshua M. Boniface	611e47f338	Add messages to migration aborts Results in some information duplication, but ensures logging of the reason a migration was aborted separate from the error(s) this may generate.	2020-10-21 02:38:42 -04:00
Joshua M. Boniface	1523959074	Move where setting last_ vars happens	2020-10-21 02:24:00 -04:00
Joshua M. Boniface	ef762359f4	Adjust timing to avoid migrating to self quickly Add another separate state lock, release it earlier, and ensure timings are good to avoid double-migrating one VM.	2020-10-21 02:17:55 -04:00
Joshua M. Boniface	398d33778f	Avoid stopping duplicates, just lock our own key	2020-10-20 16:10:39 -04:00
Joshua M. Boniface	a6d492ed9f	Remove spurious writes and adjust sleep	2020-10-20 16:04:26 -04:00
Joshua M. Boniface	11fa3b0df3	Remove additional wait and add last_node entries These allow for aborting a migration to retain the previous settings and override what the client set.	2020-10-20 15:58:55 -04:00
Joshua M. Boniface	442aa4e420	Tweak timers further	2020-10-20 15:43:59 -04:00
Joshua M. Boniface	3910843660	Add missing break	2020-10-20 15:39:29 -04:00
Joshua M. Boniface	70f3fdbfb9	Tweak the delays slightly on receive	2020-10-20 15:38:07 -04:00
Joshua M. Boniface	7cb0241a12	Attempt live migrates 3 times before proceeding	2020-10-20 15:33:41 -04:00
Joshua M. Boniface	9fb33ed7a7	Increase peer lock acquiring timers	2020-10-20 15:26:59 -04:00
Joshua M. Boniface	abfe0108ab	Better handle aborting migrations	2020-10-20 15:22:16 -04:00
Joshua M. Boniface	567fe8f36b	Wait for existing migrations before proceeding	2020-10-20 15:12:32 -04:00
Joshua M. Boniface	ec7b78b9b8	Add additional short sleep in receive	2020-10-20 13:29:17 -04:00
Joshua M. Boniface	224c8082ef	Alter text of synchronization messages	2020-10-20 13:08:18 -04:00
Joshua M. Boniface	f9e7e9884f	Improve handling of VM migrations The VM migration code was very old, very spaghettified, and prone to strange failures. Improve this by taking cues from the node primary migration. Use synchronization between the nodes to ensure lockstep completion of the migration in discrete steps. A proper queue can be built later to integrate with this code more cleanly. References #108	2020-10-20 13:01:55 -04:00
Joshua M. Boniface	726501f4d4	Add additional logging to flush selector Adds additional debug logging to the flush selector to determine how and why any given node is selected. Useful for troubleshooting strange choices.	2020-10-20 12:34:18 -04:00
Joshua M. Boniface	c6e34c7dc6	Bump base version to 0.9	2020-10-18 14:31:19 -04:00
Joshua M. Boniface	f749633f7c	Use provisioned memory for mem migration selector Use the new "provisioned" memory field, instead of the "allocated" memory field, to determine the optimal node when using the "mem" migration selector. This will take into account non-running VMs in the calculation as well as running VMs.	2020-10-18 14:17:15 -04:00
Joshua M. Boniface	a4b80be5ed	Add provisioned memory to node info Adds a separate field to the node memory, "provisioned", which totals the amount of memory provisioned to all VMs on the node, regardless of state, and in contrast to "allocated" which only counts running VMs. Allows for the detection of potential overprovisioned states when factoring in non-running VMs. Includes the supporting code to get this data, since the original implementation of VM memory selection was dependent on the VM being running and getting this from libvirt. Now, if the VM is not active, it gets this from the domain XML instead.	2020-10-18 14:17:15 -04:00
Joshua M. Boniface	aa5f8c93fd	Entirely disable IPv6 on bridged interfaces Prevents any potential leakage due to autoconfigured IPv6 on bridged interfaces. These are exclusively VM-side bridges, and the PVC host should not have any IPv6 configuration on them, ever.	2020-10-15 11:00:59 -04:00
Joshua M. Boniface	9366977fe6	Copy d_domain before iterating Prevents a bug where the thread can crash due to a change in the d_domain object while running the for loop. By copying and iterating over the copy, this becomes safer.	2020-09-16 15:12:37 -04:00
Joshua M. Boniface	65b44f2955	Avoid breaking keepalive during incoming migration The keepalive was getting stuck gathering memoryStats from the non-running VM, since it was in a paused state. Avoid this by just skipping past the rest of the stats gathering if the VM isn't running.	2020-08-28 01:47:36 -04:00
Joshua M. Boniface	78dec77987	Bump version to 0.8	2020-08-26 10:24:44 -04:00
Joshua M. Boniface	921e57ca78	Fix syntax error	2020-08-20 23:05:56 -04:00
Joshua M. Boniface	3cc7df63f2	Add configurable VM shutdown timeout Closes #102	2020-08-20 21:26:12 -04:00
Joshua M. Boniface	e8e65934e3	Use logger prefix for thread debug logs	2020-08-17 14:30:21 -04:00
Joshua M. Boniface	24fda8a73f	Use new debug logger for DNS Aggregator	2020-08-17 14:26:43 -04:00
Joshua M. Boniface	9b3ef6d610	Add connect timeout to Ceph This doesn't seem to actually do anything (like most of these timeouts...) but add it just for posterity.	2020-08-17 13:58:14 -04:00
Joshua M. Boniface	b451c0e8e3	Add additional start/finish debug messages	2020-08-17 13:11:03 -04:00
Joshua M. Boniface	f9b126a106	Make zkhandler accept failures more robustly Most of these would silently fail if there was e.g. an issue with the ZK connection. Instead, encase things in try blocks and handle the exceptions in a more graceful way, returning None or False if applicable. Except for locks, which should retry 5 times before aborting.	2020-08-17 13:03:36 -04:00
Joshua M. Boniface	553f96e7ef	Use logger for debug output Using simple print statements was annoying (lack of timing info and formatting), so move to using the debug logger for these instead with a custom state ('d') with white text to differentiate them. Also indicate which subthread of the keepalive each task is being executed in for easier tracing of issues.	2020-08-17 12:46:52 -04:00
Joshua M. Boniface	65add58c9a	Properly properly handle issue	2020-08-16 11:38:39 -04:00
Joshua M. Boniface	0a01d84290	Tie fence timers to keepalive_interval Also wait 2 full keepalive intervals after fencing before doing anything else, to give the Ceph cluster a chance to recover.	2020-08-15 12:38:03 -04:00
Joshua M. Boniface	4afb288429	Properly handle missing domain_name fail	2020-08-15 12:07:23 -04:00
Joshua M. Boniface	985ad5edc0	Warn if fencing will fail Verify our IPMI state on startup, and then warn if fencing will fail. For now, this is sufficient, but in future (requires refactoring) we might want to adjust how fencing occurs based on this information.	2020-08-13 14:42:18 -04:00
Joshua M. Boniface	0587bcbd67	Go back to manual command for OSD stats Using the Ceph library was a disaster here; it had no timeout or way to force it to continue, so keepalives would become stuck and trigger fence storms. Go back to the manual osd dump command with a 2s timeout which is far more reliable and can be adequately terminated if it runs long.	2020-08-12 22:31:25 -04:00
Joshua M. Boniface	e0cb4a58c3	Ensure zk_listener is readded after reconnect	2020-08-11 12:46:15 -04:00
Joshua M. Boniface	099c58ead8	Fix missing char in log message	2020-08-11 12:40:35 -04:00
Joshua M. Boniface	0e5c681ada	Clean up imports Make several imports more specific to reduce redundant code imports and improve memory utilization.	2020-08-11 12:09:10 -04:00
Joshua M. Boniface	46ffe352e3	Better handle subthread timeouts in keepalive Prevent the main keepalive thread from getting stuck due to a subthread taking an enormous time. If this happens, the rest of the main keepalive will continue onward, thus ensuring that the main keepalive does not fail for a significant number of cycles, which would cause a fence.	2020-08-11 11:37:26 -04:00
Joshua M. Boniface	ccee124c8b	Adjust fence failcount limit to 6 (30s) The previous saving throw limit (3/15s) seems to have been too low. I was observing bizarre failures where a node would be fenced while it was still starting up. Some of this may have been related to Zookeeper connections taking too long, but this was inconsistent. Increase this to 6 saving throws (30s). This provides significantly more time for a node to properly check in on startup before another node fences it. In the real world, 15s vs 30s isn't that big of a downtime change, but prevents false-positive fences.	2020-08-05 22:40:07 -04:00
Joshua M. Boniface	02343079c0	Improve fencing migrate layout Open the option to do this in parallel with some threads	2020-08-05 22:26:01 -04:00
Joshua M. Boniface	37b83aad6a	Add logging and use better conditional	2020-08-05 21:57:36 -04:00
Joshua M. Boniface	876f2424e0	Ensure dead state isn't written erroneously	2020-08-05 21:57:11 -04:00

333 Commits