parallelvirtualcluster/pvc

Author	SHA1	Message	Date
Joshua M. Boniface	331027d124	Add further tweaks to takeover state checks Just ensure that everything is proper state before proceeding	2020-04-22 11:16:19 -04:00
Joshua M. Boniface	ae4f36b881	Hook flush into more services Trying to ensure that pvc-flush completes before anything tries to shut down.	2020-04-14 19:58:53 -04:00
Joshua M. Boniface	611e0edd80	Reorder last keepalive during cleanup Make sure the stopping of the keepalive timer and final keepalive update are done as the last step before complete shutdown. The previous setup could conceivably result in a node being fenced should the cleanup operations take longer than ~45 seconds, for instance if primary node switchover took too long or blocked, or log watchers failed to stop quickly enough. Ensures that keepalives will continue to be run during the shutdown process until the last possible moment.	2020-04-12 03:49:29 -04:00
Joshua M. Boniface	b413e042a6	Improve handling of primary contention Previously, contention could occasionally cause a flap/dual primary contention state due to the lack of checking within this function. This could cause a state where a node transitions to primary then is almost immediately shifted away, which could cause undefined behaviour in the cluster. The solution includes several elements: * Implement an exclusive lock operation in zkhandler * Switch the become_primary function to use this exclusive lock * Implement exclusive locking during the contention process * As a failsafe, check stat versions before setting the node as the primary node, in case another node already has * Delay the start of takeover/relinquish operations by slightly longer than the lock timeout * Make the current router_state conditions more explicit (positive conditionals rather than negative conditionals) The new scenario ensures that during contention, only one secondary will ever succeed at acquiring the lock. Ideally, the other would then grab the lock and pass, but in testing this does not seem to be the case - the lock always times out, so the failsafe check is technically not needed but has been left as an added safety mechanism. With this setup, the node that fails the contention will never block the switchover nor will it try to force itself onto the cluster after another node has successfully won contention. Timeouts may need to be adjusted in the future, but the base timeout of 0.4 seconds (and transition delay of 0.5 seconds) seems to work reliably during preliminary tests.	2020-04-12 03:40:17 -04:00
Joshua M. Boniface	e672d799a6	Set flush after pvcapid.service This may or may not help, but should in theory prevent the flush from trying to run after a (locally-running) API daemon is terminated, which could cause an API failure and a failure to flush.	2020-04-12 01:48:50 -04:00
Joshua M. Boniface	a130f19a19	Depend pvcnoded on Zookeeper (harder) and libvirtd	2020-04-09 09:57:53 -04:00
Joshua M. Boniface	a671d9d457	Use consistent tense in messages	2020-04-08 22:00:51 -04:00
Joshua M. Boniface	fee1c7dd6c	Reorder cleanup and gracefully wait for flushes	2020-04-08 22:00:08 -04:00
Joshua M. Boniface	5d58bee34f	Add some time around noded startup/shutdown Otherwise, systemd kills networking before the node daemon fully stops and it goes into "dead" status, which is super annoying.	2020-04-01 23:59:14 -04:00
Joshua M. Boniface	f668412941	Don't use Requires as the dep is too hard Requires seems to flush on every service restart which is NOT what we want. Use Wants instead.	2020-04-01 15:15:37 -04:00
Joshua M. Boniface	a0ebc0d3a7	Add more robust requirements to pvc-flush service	2020-04-01 15:09:44 -04:00
Joshua M. Boniface	98a7005c1b	Add significant TimeoutSec to pvc-flush service This will stop systemd from killing the service in the middle of a flush or unflush operation, which completely defeats the purpose. 30 minutes was chosen as this is a very large but still somewhat manageable value, which should cover even a very large very loaded cluster with room to spare.	2020-04-01 01:24:09 -04:00
Joshua M. Boniface	0a367898a0	Don't trigger aggregator fail if fine	2020-03-12 13:22:12 -04:00
Joshua M. Boniface	c02bc0b46a	Correct issues with VM lock freeing Code was bad and using a deprecated feature.	2020-03-02 12:45:12 -05:00
Joshua M. Boniface	1e4350ca6f	Properly handle takeover state in VXNetworks Most of these actions/conditionals were looking for primary state, but were failing during node takeover. Update the conditionals to look for both router states instead. Also add a wait to lock flushing until a takeover is completed.	2020-03-02 10:41:00 -05:00
Joshua M. Boniface	57768f2583	Remove an obsolete script	2020-02-19 21:40:23 -05:00
Joshua M. Boniface	e4e4e336b4	Handle invalid cursor setup cleanly This seems to happen only during termination, so catch it and continue so the loop terminates.	2020-02-19 16:29:59 -05:00
Joshua M. Boniface	d2a5fe59c0	Use transitional takeover states for migration Use a pair of transitional states, "takeover" and "relinquish", when transitioning between primary and secondary coordinator states. This provides a cluster-wide record that the nodes are still working during their synchronous transition states, and should allow clients to determine when the node(s) have fully switched over. Also add an additional 2 seconds of wait at the end of the transition jobs to ensure everything has had a chance to start before proceeding. References #72	2020-02-19 14:06:54 -05:00
Joshua M. Boniface	9c7041f12c	Update package version to 0.7	2020-02-15 23:25:47 -05:00
Joshua M. Boniface	7ace5b5056	Remove /ceph/cmd pipe for (most) Ceph commands Addresses #80	2020-02-08 23:40:02 -05:00
Joshua M. Boniface	37310e5455	Correct name of systemd target	2020-02-08 20:39:07 -05:00
Joshua M. Boniface	ce985234c3	Use consistent naming of components Rename "pvcd" to "pvcnoded", and "pvc-api" to "pvcapid" so names for the daemons are fully consistent. Update the names of the configuration files as well to match this new formatting. References #79	2020-02-08 19:34:07 -05:00
Joshua M. Boniface	4505b239eb	Rename API and common Debian packages Closes #79	2020-02-08 18:50:38 -05:00
Joshua M. Boniface	74228eb063	Bump version to 0.6	2020-02-08 18:27:39 -05:00
Joshua M. Boniface	90e42683c6	Reduce sleep time during VM migrations	2020-02-04 17:52:37 -05:00
Joshua M. Boniface	20c8466296	Handle invalid search fields better	2020-02-04 17:35:24 -05:00
Joshua M. Boniface	ab28bf40d1	Change ordering of services during primary switch Fixes #77	2020-01-30 09:18:56 -05:00
Joshua M. Boniface	5d73974e95	Fix several bugs around load-based migrations	2020-01-29 17:35:10 -05:00
Joshua M. Boniface	0b31bab797	Add more helpful config parse error message	2020-01-22 12:09:31 -05:00
Joshua M. Boniface	4c1b78d7a4	Use dictionary get() to prevent crashes Use the get() function throughout to prevent crashes in various scenarios if the profile data isn't present or consistent.	2020-01-13 09:21:57 -05:00
Joshua M. Boniface	4ad29f669d	Update default configuration samples	2020-01-12 21:33:15 -05:00
Joshua M. Boniface	0d2e22a111	Normalize all static networks with bridges Modifies the storage and upstream networks to mirror the cluster network, with a bridge on top of the underlying specified dev, and all IPs bound to the bridge. Allows creating VMs in the storage or upstream networks, as well as the cluster network, should the administrator choose to do so (manually).	2020-01-12 19:04:31 -05:00
Joshua M. Boniface	1671a87dd4	Fix the flush service	2020-01-11 17:04:12 -05:00
Joshua M. Boniface	b6474198a4	Implement cluster maintenance mode Implements a "maintenance mode" for PVC clusters. For now, the only thing this mode does is disable node fencing while the state is true. This allows the administrator to tell PVC that network connectivity, etc. might be interrupted and to avoid fencing nodes. Closes #70	2020-01-09 10:53:27 -05:00
Joshua M. Boniface	4e5bce4975	Update copyright header year to 2020	2020-01-08 19:38:02 -05:00
Joshua M. Boniface	c515d63340	Add provision state for VMs	2020-01-08 17:40:02 -05:00
Joshua M. Boniface	21d87f5e51	Add v6 configurations to dnsmasq These options were only applied with v4 networks; now, use the v6 address in a dual-stack or v6-only network.	2020-01-06 23:48:04 -05:00
Joshua M. Boniface	f326fd99e2	Properly fix IPv4 no-DHCP networking	2020-01-06 22:31:37 -05:00
Joshua M. Boniface	38dae8b32f	Change name of cluster in patronictl command	2020-01-06 16:37:17 -05:00
Joshua M. Boniface	2d2bdb879e	Use get() instead of direct dict reference	2020-01-06 16:34:39 -05:00
Joshua M. Boniface	30d4470c8f	Only print AXFR errors in debug mode	2020-01-06 16:04:37 -05:00
Joshua M. Boniface	bbfadac5e1	Fix dnsmasq options for DHCP-disabled networks	2020-01-06 16:04:26 -05:00
Joshua M. Boniface	7b3e267f7a	Implement bridge_device for bridged VNIs Required due to #64. Bridged networks were being created on top of a vLAN if the Cluster network was a vLAN device, rather than being created on the underlying device. This came from a previous revision of the cluster architecture guidelines where Cluster was supposed to be a raw device rather than a vLAN. This fixed the problem by implementing a configuration field for a "bridge_device", a NIC device that can then have the bridged vLANs created on top of it. Fixes #64	2020-01-06 14:44:56 -05:00
Joshua M. Boniface	094ac8c3a8	Ensure stdout is used	2020-01-06 12:34:35 -05:00
Joshua M. Boniface	13548b791d	Add additional debugging and fix pool_idx loop var	2020-01-06 11:31:22 -05:00
Joshua M. Boniface	e7bc4f7328	Handle empty None-type hostname	2020-01-05 22:46:56 -05:00
Joshua M. Boniface	be20ba02a7	Handle VM states in flush more accurately We don't want to block forever on a failure, so limit valid waiting states to just those we know it should be in during a migration.	2020-01-05 15:21:16 -05:00
Joshua M. Boniface	7311fa561b	Fix bad join with new table name	2020-01-04 15:17:27 -05:00
Joshua M. Boniface	bf89050e8b	Update userdata table name	2020-01-04 15:10:37 -05:00
Joshua M. Boniface	20ae2186f9	Run VM state actions in a thread Prevents blocking the main thread(s) while a VM is changing state. In particular, this caused some issues with nodes not responding to cancellation/reversal of a flush/ready state until the previous migration was finished, which could cause issues. This entire subset of actions is now threaded and so can run on its own in the background.	2019-12-26 11:08:16 -05:00
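
The contention scheme described in commit b413e042a6 (exclusive lock, stat version failsafe, delay slightly longer than the lock timeout) can be sketched in plain Python. This is only an in-process illustration under stated assumptions, not the project's zkhandler code: `PrimaryKey`, `contend`, and `run_contention` are hypothetical names, a `threading.Lock` stands in for the Zookeeper exclusive lock, an integer counter stands in for the stat version, and holding the lock for the transition delay is an assumption made so that the losing node's lock attempt times out. Only the 0.4 and 0.5 second values come from the commit message.

```python
import threading
import time

LOCK_TIMEOUT = 0.4      # base lock timeout quoted in commit b413e042a6
TRANSITION_DELAY = 0.5  # transition delay quoted in the same commit


class PrimaryKey:
    """Hypothetical in-memory stand-in for the shared primary-node key."""

    def __init__(self):
        self.lock = threading.Lock()  # stands in for the exclusive lock
        self.value = "none"           # name of the current primary node
        self.version = 0              # stands in for the stat version


def contend(key, node_name, seen_version, results):
    """Try to become primary; store True in results on success, else False."""
    if not key.lock.acquire(timeout=LOCK_TIMEOUT):
        # Lock timed out: another node is already taking over.
        results[node_name] = False
        return
    try:
        # Failsafe: if the key changed since we read it, another node won.
        if key.version != seen_version:
            results[node_name] = False
            return
        key.value = node_name
        key.version += 1
        results[node_name] = True
        # Assumption: keep the lock slightly longer than the lock timeout
        # so that the losing node's acquire attempt expires.
        time.sleep(TRANSITION_DELAY)
    finally:
        key.lock.release()


def run_contention(node_names):
    """Let every named node contend at once; return the key and the results."""
    key = PrimaryKey()
    results = {}
    seen_version = key.version  # every node read the key before contending
    threads = [
        threading.Thread(target=contend, args=(key, name, seen_version, results))
        for name in node_names
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return key, results
```

Calling `run_contention(["node1", "node2"])` should leave exactly one node marked as the winner. Even if the loser did acquire the lock late, the version check would still reject it, which mirrors the role the commit message gives to the failsafe.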


648 Commits
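
Commits 4c1b78d7a4 and 2d2bdb879e replace direct dictionary references with `get()` so that missing or inconsistent profile data does not crash the caller. A minimal sketch of that pattern follows. The function name `describe_profile` and the field names `system_template`, `vcpu_count`, and `vram_mb` are hypothetical, chosen for illustration only, and are not taken from the PVC source.

```python
def describe_profile(profile):
    """Summarize a profile dict without raising KeyError on missing fields."""
    # get() with a default returns the fallback instead of raising KeyError.
    name = profile.get("name", "N/A")
    # A missing nested section becomes an empty dict, so the lookups
    # below remain safe.
    system = profile.get("system_template") or {}
    vcpus = system.get("vcpu_count", "N/A")
    vram = system.get("vram_mb", "N/A")
    return "{}: vcpus={} vram={}".format(name, vcpus, vram)
```

With a complete dictionary the values are reported as stored. With an empty dictionary, or one whose `system_template` is `None`, every field falls back to `N/A` instead of raising an exception.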