parallelvirtualcluster/pvc

Author	SHA1	Message	Date
Joshua M. Boniface	0a01d84290	Tie fence timers to keepalive_interval Also wait 2 full keepalive intervals after fencing before doing anything else, to give the Ceph cluster a chance to recover.	2020-08-15 12:38:03 -04:00
Joshua M. Boniface	4afb288429	Properly handle missing domain_name fail	2020-08-15 12:07:23 -04:00
Joshua M. Boniface	985ad5edc0	Warn if fencing will fail Verify our IPMI state on startup, and then warn if fencing will fail. For now, this is sufficient, but in future (requires refactoring) we might want to adjust how fencing occurs based on this information.	2020-08-13 14:42:18 -04:00
Joshua M. Boniface	0587bcbd67	Go back to manual command for OSD stats Using the Ceph library was a disaster here; it had no timeout or way to force it to continue, so keepalives would become stuck and trigger fence storms. Go back to the manual osd dump command with a 2s timeout which is far more reliable and can be adequately terminated if it runs long.	2020-08-12 22:31:25 -04:00
Joshua M. Boniface	09c1bb6a46	Increase start delay of flush service	2020-08-11 14:17:35 -04:00
Joshua M. Boniface	e0cb4a58c3	Ensure zk_listener is readded after reconnect	2020-08-11 12:46:15 -04:00
Joshua M. Boniface	099c58ead8	Fix missing char in log message	2020-08-11 12:40:35 -04:00
Joshua M. Boniface	0e5c681ada	Clean up imports Make several imports more specific to reduce redundant code imports and improve memory utilization.	2020-08-11 12:09:10 -04:00
Joshua M. Boniface	46ffe352e3	Better handle subthread timeouts in keepalive Prevent the main keepalive thread from getting stuck due to a subthread taking an enormous time. If this happens, the rest of the main keepalive will continue onward, thus ensuring that the main keepalive does not fail for a significant number of cycles, which would cause a fence.	2020-08-11 11:37:26 -04:00
Joshua M. Boniface	ccee124c8b	Adjust fence failcount limit to 6 (30s) The previous saving throw limit (3/15s) seems to have been too low. I was observing bizarre failures where a node would be fenced while it was still starting up. Some of this may have been related to Zookeeper connections taking too long, but this was inconsistent. Increase this to 6 saving throws (30s). This provides significantly more time for a node to properly check in on startup before another node fences it. In the real world, 15s vs 30s isn't that big of a downtime change, but prevents false-positive fences.	2020-08-05 22:40:07 -04:00
Joshua M. Boniface	02343079c0	Improve fencing migrate layout Open the option to do this in parallel with some threads	2020-08-05 22:26:01 -04:00
Joshua M. Boniface	37b83aad6a	Add logging and use better conditional	2020-08-05 21:57:36 -04:00
Joshua M. Boniface	876f2424e0	Ensure dead state isn't written erroneously	2020-08-05 21:57:11 -04:00
Joshua M. Boniface	5871380e1b	Avoid crashing VM stats thread if domain migrated	2020-06-10 17:10:46 -04:00
Joshua M. Boniface	654a3cb7fa	Improve debug output and use ceph df util data	2020-06-06 22:52:49 -04:00
Joshua M. Boniface	9b65d3271a	Improve handling of Ceph status gathering Use the Rados library instead of random OS commands, which massively improves the performance of these tasks. Closes #97	2020-06-06 22:30:25 -04:00
Joshua M. Boniface	598b2025e8	Use Rados and add Ceph entries to pvcnoded.yaml	2020-06-06 21:12:51 -04:00
Joshua M. Boniface	70b787d1fd	Move all VM functions into thread	2020-06-06 15:44:05 -04:00
Joshua M. Boniface	e1310a05f2	Implement recording of VM stats during keepalive	2020-06-06 15:34:03 -04:00
Joshua M. Boniface	2ad6860dfe	Move Ceph statistics gathering into thread	2020-06-06 13:25:02 -04:00
Joshua M. Boniface	cebb4bbc1a	Comment cleanup	2020-06-06 13:20:40 -04:00
Joshua M. Boniface	a672e06dd2	Move fencing to end of keepalive function	2020-06-06 13:19:11 -04:00
Joshua M. Boniface	1db73bb892	Move libvirt closure into previous section	2020-06-06 13:18:37 -04:00
Joshua M. Boniface	c1956072f0	Rename update_zookeeper function to node_keepalive	2020-06-06 12:49:50 -04:00
Joshua M. Boniface	ce60836c34	Allow enforcement of live migration Provides a CLI and API argument to force live migration, which triggers a new VM state "migrate-live". The node daemon VMInstance during migrate will read this flag from the state and, if enforced, will not trigger a shutdown migration. Closes #95	2020-06-06 12:00:44 -04:00
Joshua M. Boniface	b5434ba744	Fix typo in variable name	2020-06-06 11:29:48 -04:00
Joshua M. Boniface	b9e5b14f94	Update lastnode too if a self-migrate is aborted References #92	2020-06-04 10:28:04 -04:00
Joshua M. Boniface	5d2031d99e	Prevent a VM migrating to the same node Prevents a rare edge case where a node can end up "migrating" to itself. Quick hack to fix this, though like most of the VM management should probably be rethought/rewritten later. Fixes #92	2020-06-04 10:26:47 -04:00
Joshua M. Boniface	5f9836f96d	Add error message to OSD parse fail	2020-05-12 11:04:38 -04:00
Joshua M. Boniface	95c59ba629	Improve flush handling slightly	2020-05-12 11:04:38 -04:00
Joshua M. Boniface	72a38fd437	Correct changed dhcp_reservations key name	2020-05-09 10:00:53 -04:00
Joshua M. Boniface	b580760537	Add missing fmt_cyan variable	2020-05-08 18:15:02 -04:00
Joshua M. Boniface	331027d124	Add further tweaks to takeover state checks Just ensure that everything is proper state before proceeding	2020-04-22 11:16:19 -04:00
Joshua M. Boniface	ae4f36b881	Hook flush into more services Trying to ensure that pvc-flush completes before anything tries to shut down.	2020-04-14 19:58:53 -04:00
Joshua M. Boniface	611e0edd80	Reorder last keepalive during cleanup Make sure the stopping of the keepalive timer and final keepalive update are done as the last step before complete shutdown. The previous setup could conceivably result in a node being fenced should the cleanup operations take longer than ~45 seconds, for instance if primary node switchover took too long or blocked, or log watchers failed to stop quickly enough. Ensures that keepalives will continue to be run during the shutdown process until the last possible moment.	2020-04-12 03:49:29 -04:00
Joshua M. Boniface	b413e042a6	Improve handling of primary contention Previously, contention could occasionally cause a flap/dual primary contention state due to the lack of checking within this function. This could cause a state where a node transitions to primary then is almost immediately shifted away, which could cause undefined behaviour in the cluster. The solution includes several elements: * Implement an exclusive lock operation in zkhandler * Switch the become_primary function to use this exclusive lock * Implement exclusive locking during the contention process * As a failsafe, check stat versions before setting the node as the primary node, in case another node already has * Delay the start of takeover/relinquish operations by slightly longer than the lock timeout * Make the current router_state conditions more explicit (positive conditionals rather than negative conditionals) The new scenario ensures that during contention, only one secondary will ever succeed at acquiring the lock. Ideally, the other would then grab the lock and pass, but in testing this does not seem to be the case - the lock always times out, so the failsafe check is technically not needed but has been left as an added safety mechanism. With this setup, the node that fails the contention will never block the switchover nor will it try to force itself onto the cluster after another node has successfully won contention. Timeouts may need to be adjusted in the future, but the base timeout of 0.4 seconds (and transition delay of 0.5 seconds) seems to work reliably during preliminary tests.	2020-04-12 03:40:17 -04:00
Joshua M. Boniface	e672d799a6	Set flush after pvcapid.service This may or may not help, but should in theory prevent the flush from trying to run after a (locally-running) API daemon is terminated, which could cause an API failure and a failure to flush.	2020-04-12 01:48:50 -04:00
Joshua M. Boniface	a130f19a19	Depend pvcnoded on Zookeeper (harder) and libvirtd	2020-04-09 09:57:53 -04:00
Joshua M. Boniface	a671d9d457	Use consistent tense in messages	2020-04-08 22:00:51 -04:00
Joshua M. Boniface	fee1c7dd6c	Reorder cleanup and gracefully wait for flushes	2020-04-08 22:00:08 -04:00
Joshua M. Boniface	5d58bee34f	Add some time around noded startup/shutdown Otherwise, systemd kills networking before the node daemon fully stops and it goes into "dead" status, which is super annoying.	2020-04-01 23:59:14 -04:00
Joshua M. Boniface	f668412941	Don't use Requires as the dep is too hard Requires seems to flush on every service restart which is NOT what we want. Use Wants instead.	2020-04-01 15:15:37 -04:00
Joshua M. Boniface	a0ebc0d3a7	Add more robust requirements to pvc-flush service	2020-04-01 15:09:44 -04:00
Joshua M. Boniface	98a7005c1b	Add significant TimeoutSec to pvc-flush service This will stop systemd from killing the service in the middle of a flush or unflush operation, which completely defeats the purpose. 30 minutes was chosen as this is a very large but still somewhat manageable value, which should cover even a very large very loaded cluster with room to spare.	2020-04-01 01:24:09 -04:00
Joshua M. Boniface	0a367898a0	Don't trigger aggregator fail if fine	2020-03-12 13:22:12 -04:00
Joshua M. Boniface	c02bc0b46a	Correct issues with VM lock freeing Code was bad and using a deprecated feature.	2020-03-02 12:45:12 -05:00
Joshua M. Boniface	1e4350ca6f	Properly handle takeover state in VXNetworks Most of these actions/conditionals were looking for primary state, but were failing during node takeover. Update the conditionals to look for both router states instead. Also add a wait to lock flushing until a takeover is completed.	2020-03-02 10:41:00 -05:00
Joshua M. Boniface	57768f2583	Remove an obsolete script	2020-02-19 21:40:23 -05:00
Joshua M. Boniface	e4e4e336b4	Handle invalid cursor setup cleanly This seems to happen only during termination, so catch it and continue so the loop terminates.	2020-02-19 16:29:59 -05:00
Joshua M. Boniface	d2a5fe59c0	Use transitional takeover states for migration Use a pair of transitional states, "takeover" and "relinquish", when transitioning between primary and secondary coordinator states. This provides a cluster-wide record that the nodes are still working during their synchronous transition states, and should allow clients to determine when the node(s) have fully switched over. Also add an additional 2 seconds of wait at the end of the transition jobs to ensure everything has had a chance to start before proceeding. References #72	2020-02-19 14:06:54 -05:00


330 Commits
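
Commit 0587bcbd67 above describes going back to an external command with a 2s timeout so that a hung call cannot stall the keepalive. The pvc code for that change is not shown in this log, so the following is only a minimal sketch of the general technique using Python's standard `subprocess` module. The helper name `run_with_timeout` and its return convention are hypothetical and are not taken from the repository.

```python
import subprocess
import sys


def run_with_timeout(command, timeout=2.0):
    """Run a command with a hard time limit.

    Hypothetical helper: returns (returncode, stdout) on completion,
    or (None, "") if the command ran longer than `timeout` seconds.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
            text=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising,
        # so the caller is free to continue immediately.
        return None, ""
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Stand-in for a slow cluster command; a real caller would pass
    # its own command list here.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, timeout=0.5))
```

A caller in a periodic loop can treat the `None` return code as "no data this cycle" and move on, which is the property the commit message is after: a bounded wait instead of a blocked keepalive.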