parallelvirtualcluster/pvc - pvc

Commit Graph

Author	SHA1	Message	Date
Joshua Boniface	87bc5f93e6	Avoid duplicate runs of MTU validator	2021-10-09 19:07:41 -04:00
Joshua Boniface	203893559e	Use correct isinstance instead of type	2021-10-09 19:03:31 -04:00
Joshua Boniface	2c51bb0705	Move MTU validation to function Prevents code duplication and ensures validation runs when an MTU is updated, not just on network creation.	2021-10-09 19:01:45 -04:00
Joshua Boniface	46d3daf686	Add logger message when setting MTU	2021-10-09 18:56:18 -04:00
Joshua Boniface	e9d05aa24e	Ensure vx_mtu is always an int()	2021-10-09 18:52:50 -04:00
Joshua Boniface	6ce28c43af	Add MTU value checking and log messages Ensures that if a specified MTU is more than the maximum it is set to the maximum instead, and adds warning messages for both situations.	2021-10-09 18:48:56 -04:00
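The clamping described in 6ce28c43af can be sketched as follows. This is a minimal illustration rather than the PVC implementation: the `validate_mtu` name, its parameters, and the logger name are assumptions. Only the above-maximum case spelled out in the message is handled, since the second warning situation is not detailed there.

```python
import logging

logger = logging.getLogger("mtu-sketch")  # hypothetical logger name


def validate_mtu(requested_mtu, max_mtu):
    """Return an MTU that never exceeds max_mtu, warning when it is clamped."""
    # Coerce to int first, in the spirit of e9d05aa24e ("always an int()").
    requested_mtu = int(requested_mtu)
    if requested_mtu > max_mtu:
        logger.warning(
            "Requested MTU %d exceeds maximum %d; using %d instead",
            requested_mtu, max_mtu, max_mtu,
        )
        return max_mtu
    return requested_mtu
```

Keeping the check in one function, as 2c51bb0705 later does, means the same rule can apply on both network creation and MTU updates.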
Joshua Boniface	c45f8f5bd5	Have VXNetworkInstance set MTU if unset Makes this explicit in Zookeeper if a network is unset, post-migration (schema version 6). Addresses #144	2021-10-09 17:52:57 -04:00
Joshua Boniface	3690a2c1e0	Fix migration bugs and invalid vx_mtu Addresses #144	2021-10-09 17:35:10 -04:00
Joshua Boniface	50d8aa0586	Add handlers for client network MTUs Refactors some of the code in VXNetworkInterface to handle MTUs in a more streamlined fashion. Also fixes a bug whereby bridge client networks were being explicitly given the cluster dev MTU which might not be correct. Now adds support for this option explicitly in the configs, and defaults to 1500 for safety (the standard Ethernet MTU). Addresses #144	2021-10-09 17:02:27 -04:00
Joshua Boniface	6ee4c55071	Correct flawed conditional in verify_ipmi	2021-10-07 15:11:19 -04:00
Joshua Boniface	c27359c4bf	Bump version to 0.9.40	2021-10-07 14:42:04 -04:00
Joshua Boniface	46078932c3	Correct bad stop_keepalive_timer call	2021-10-07 14:41:12 -04:00
Joshua Boniface	bdb9db8375	Bump version to 0.9.39	2021-10-07 11:52:38 -04:00
Joshua Boniface	da9248cfa2	Bump version to 0.9.38	2021-10-03 22:32:41 -04:00
Joshua Boniface	23977b04fc	Bump version to 0.9.37	2021-09-30 02:08:14 -04:00
Joshua Boniface	f6f6f07488	Add timeouts to queue gets and adjust Ensure that all keepalive timeouts are set (prevent the queue.get() actions from blocking forever) and set the thread timeouts to line up as well. Everything here is thus limited to keepalive_interval seconds (default 5s) to keep it uniform.	2021-09-27 16:10:27 -04:00
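A rough sketch of the idea in f6f6f07488, bounding every `queue.get()` by the keepalive interval so a missing result cannot block forever. The `get_result` helper and the constant name are hypothetical; only the 5 second default comes from the commit message.

```python
import queue

KEEPALIVE_INTERVAL = 5  # seconds; the default quoted in the commit message


def get_result(result_queue, timeout=KEEPALIVE_INTERVAL):
    """Return the next queued item, or None if nothing arrives in time."""
    try:
        # With a timeout, get() raises queue.Empty instead of blocking forever.
        return result_queue.get(timeout=timeout)
    except queue.Empty:
        return None
```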
Joshua Boniface	142c999ce8	Re-add success log output during migration	2021-09-27 11:50:55 -04:00
Joshua Boniface	1de069298c	Fix missing character in log message	2021-09-27 00:49:43 -04:00
Joshua Boniface	55221b3d97	Simplify VM migration down to 3 steps Remove two superfluous synchronization steps which are not needed here, since the exclusive lock handles that situation anyways. Still does not fix the weird flush->unflush lock timeout bug, but is better worked-around now due to the cancelling of the other wait freeing this up and continuing.	2021-09-27 00:03:20 -04:00
Joshua Boniface	0d72798814	Work around synchronization lock issues Make the block on stage C only wait for 900 seconds (15 minutes) to prevent indefinite blocking. The issue comes if a VM is being received, and the current unflush is cancelled for a flush. When this happens, this lock acquisition seems to block for no obvious reason, and no other changes seem to affect it. This is certainly some sort of locking bug within Kazoo but I can't diagnose it as-is. Leave a TODO to look into this again in the future.	2021-09-26 23:26:21 -04:00
Joshua Boniface	3638efc77e	Improve log messages during VM migration	2021-09-26 23:15:38 -04:00
Joshua Boniface	c2c888d684	Use event to non-block wait and fix inf wait	2021-09-26 22:55:39 -04:00
Joshua Boniface	febef2e406	Track status of VM state thread	2021-09-26 22:55:21 -04:00
Joshua Boniface	2a4f38e933	Simplify locking process for VM migration Rather than using a cumbersome and overly complex ping-pong of read and write locks, instead move to a much simpler process using exclusive locks. Describing the process in ASCII or narrative is cumbersome, but the process ping-pongs via a set of exclusive locks and wait timers, so that the two sides are able to synchronize via blocking the exclusive lock. The end result is a much more streamlined migration (takes about half the time all things considered) which should be less error-prone.	2021-09-26 22:08:07 -04:00
Joshua Boniface	3b805cdc34	Fix failure to connect to libvirt in keepalive This should be caught and abort the thread rather than failing and holding up keepalives.	2021-09-26 20:42:01 -04:00
Joshua Boniface	06f0f7ed91	Fix several bugs in fence handling 1. Output from ipmitool was not being stripped, and stray newlines were throwing off the comparisons. Fixes this. 2. Several stages were lacking meaningful messages. Adds these in so the output is more clear about what is going on. 3. Reduce the sleep time after a fence to just 1x the keepalive_interval, rather than 2x, because this seemed like excessively long even for slow IPMI interfaces, especially since we're checking the power state now anyways. 4. Set the node daemon state to an explicit 'fenced' state after a successful fence to indicate to users that the node was indeed fenced successfully and not still 'dead'.	2021-09-26 20:07:30 -04:00
Joshua Boniface	fd040ab45a	Ensure pvc-flush is after network-online	2021-09-26 17:40:42 -04:00
Joshua Boniface	e23e2dd9bf	Fix typo in log message	2021-09-26 03:35:30 -04:00
Joshua Boniface	0f02c5eaef	Fix typo in sgdisk command options	2021-09-26 00:59:05 -04:00
Joshua Boniface	075abec5fe	Use re.search instead of re.match Required since we're not matching the start of the string.	2021-09-26 00:55:29 -04:00
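The difference behind 075abec5fe can be shown with a small example. The pattern and input string here are made up for illustration and are not taken from the PVC code: `re.match` only matches at the start of the string, while `re.search` scans the whole string.

```python
import re

text = "status: HEALTH_OK"      # hypothetical input
pattern = r"HEALTH_\w+"         # hypothetical pattern

# re.match anchors at position 0, so it fails here.
match_result = re.match(pattern, text)

# re.search looks for the pattern anywhere in the string.
search_result = re.search(pattern, text)
```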
Joshua Boniface	3a1cbf8d01	Raise basic exceptions in CephInstance Avoids "no exception to reraise" errors on failures.	2021-09-26 00:50:10 -04:00
Joshua Boniface	a438a4155a	Fix OSD creation for partition paths and fix gdisk The previous implementation did not work with /dev/nvme devices or any /dev/disk/by-* devices due to some logical failures in the partition naming scheme, so fix these, and be explicit about what is supported in the PVC CLI command output. The 'echo | gdisk' implementation of partition creation also did not work due to limitations of subprocess.run; instead, use sgdisk which allows these commands to be written out explicitly and is included in the same package as gdisk.	2021-09-26 00:12:28 -04:00
Joshua Boniface	65df807b09	Add support for configurable OSD DB ratios The default of 0.05 (5%) is likely ideal in the initial implementation, but allow this to be set explicitly for maximum flexibility in space-constrained or performance-critical use-cases.	2021-09-24 01:06:39 -04:00
Joshua Boniface	d0f3e9e285	Bump version to 0.9.36	2021-09-23 14:01:38 -04:00
Joshua Boniface	adc8a5a3bc	Add separate OSD DB device support Adds in three parts: 1. Create an API endpoint to create OSD DB volume groups on a device. Passed through to the node via the same command pipeline as creating/removing OSDs, and creates a volume group with a fixed name (osd-db). 2. Adds API support for specifying whether or not to use this DB volume group when creating a new OSD via the "ext_db" flag. Naming and sizing is fixed for simplicity and based on Ceph recommendations (5% of OSD size). The Zookeeper schema tracks the block device to use during removal. 3. Adds CLI support for the new and modified API endpoints, as well as displaying the block device and DB block device in the OSD list. While I debated supporting adding a DB device to an existing OSD, in practice this ended up being a very complex operation involving stopping the OSD and setting some options, so this is not supported; this can be specified during OSD creation only. Closes #142	2021-09-23 13:59:49 -04:00
Joshua Boniface	df277edf1c	Move console watcher stop try up Could cause an exception if d_domain is not defined yet.	2021-09-22 16:02:04 -04:00
Joshua Boniface	772807deb3	Bump version to 0.9.35	2021-09-13 02:20:46 -04:00
Joshua Boniface	f3fb492633	Handle VM disk/network stats gathering exceptions	2021-09-12 19:41:07 -04:00
Joshua Boniface	e962743e51	Add VM device hot attach/detach support Adds a new API endpoint to support hot attach/detach of devices, and the corresponding client-side logic to use this endpoint when doing VM network/storage add/remove actions. The live attach is now the default behaviour for these types of additions and removals, and can be disabled if needed. Closes #141	2021-09-12 19:33:00 -04:00
Joshua Boniface	be954c1625	Don't crash cleanup if no this_node	2021-08-29 03:52:18 -04:00
Joshua Boniface	fb46f5f9e9	Change default node object state to flushed	2021-08-29 03:34:08 -04:00
Joshua Boniface	694b8e85a0	Bump version to 0.9.34	2021-08-24 16:15:25 -04:00
Joshua Boniface	a4c0e0befd	Fix typo in output message	2021-08-23 00:39:19 -04:00
Joshua Boniface	a18cef5f25	Bump version to 0.9.33	2021-08-21 03:28:48 -04:00
Joshua Boniface	afb0359c20	Refactor pvcnoded to reduce Daemon.py size This branch commit refactors the pvcnoded component to better adhere to good programming practices. The previous Daemon.py was a massive file which contained almost 2000 lines of direct, root-level code which was directly imported. Not only was this poor practice, but this resulted in a nigh-unmaintainable file which was hard even for me to understand. This refactoring splits a large section of the code from Daemon.py into separate small modules and functions in the `util/` directory. This will hopefully make most of the functionality easy to find and modify without having to dig through a single large file. Further the existing subcomponents have been moved to the `objects/` directory which clearly separates them. Finally, the Daemon.py code has mostly been moved into a function, `entrypoint()`, which is then called from the `pvcnoded.py` stub. An additional item is that most format strings have been replaced by f-strings to make use of the Python 3.6 features in Daemon.py and the utility files.	2021-08-21 03:14:22 -04:00
Joshua Boniface	afdf254297	Bump version to 0.9.32	2021-08-19 12:37:58 -04:00
Joshua Boniface	42e776fac1	Properly handle exceptions getting VM stats	2021-08-19 12:36:31 -04:00
Joshua Boniface	7ecc6a2635	Bump version to 0.9.31	2021-07-30 12:08:12 -04:00
Joshua Boniface	3ab6365a53	Adjust receive output to show proper source	2021-07-22 15:43:08 -04:00
Joshua Boniface	2a99a27feb	Bump version to 0.9.30	2021-07-20 00:01:45 -04:00
Joshua Boniface	fa1d93e933	Bump version to 0.9.29	2021-07-19 16:55:41 -04:00
Joshua Boniface	6ead21a308	Handle cleanup from a failure properly	2021-07-19 12:39:13 -04:00
Joshua Boniface	b7c8c2ee3d	Fix handling of this_node and d_domain in cleanup	2021-07-19 12:36:35 -04:00
Joshua Boniface	d48f58930b	Use harder exits and add cleanup termination	2021-07-19 12:27:16 -04:00
Joshua Boniface	7c36388c8f	Add post-networking delay and adjust daemon delay	2021-07-19 12:23:45 -04:00
Joshua Boniface	71e4d0b32a	Bump version to 0.9.28	2021-07-19 09:29:34 -04:00
Joshua Boniface	15d92c483f	Bump version to 0.9.27	2021-07-19 00:03:40 -04:00
Joshua Boniface	602093029c	Bump version to 0.9.26	2021-07-18 20:49:52 -04:00
Joshua Boniface	b770e15a91	Fix final termination of logger We need to do a bit more finagling with the logger on termination to ensure that all messages are written and the queue drained before actually terminating.	2021-07-18 19:53:00 -04:00
Joshua Boniface	e23a65128a	Remove del of logger item	2021-07-18 19:03:47 -04:00
Joshua Boniface	3a2478ee0c	Cleanly terminate logger on cleanup	2021-07-18 18:57:44 -04:00
Joshua Boniface	323c7c41ae	Implement node logging into Zookeeper Adds the ability to send node daemon logs to Zookeeper to facilitate a command like "pvc node log", similar to "pvc vm log". Each node stores its logs in a separate tree under "/logs" which can then be combined or queried. By default, set by config, only 2000 lines are kept.	2021-07-18 17:11:43 -04:00
Joshua Boniface	cd1db3d587	Ensure node name is part of config	2021-07-18 16:38:58 -04:00
Joshua Boniface	75fb60b1b4	Add VM list filtering by tag Uses same method as state or node filtering, rather than altering how the main LIMIT field works.	2021-07-14 00:59:20 -04:00
Joshua Boniface	c6d552ae57	Rework success checks for IPMI fencing Previously, if the node failed to restart, it was declared a "bad fence" and no further action would be taken. However, there are some situations, for instance critical hardware failures, where intelligent systems will not attempt (or succeed at) starting up the node in such a case, which would result in dead, known-offline nodes without recovery. Tweak this behaviour somewhat. The main path of Reboot -> Check On -> Success + fence-flush is retained, but some additional side-paths are now defined: 1. We attempt to power "on" the chassis 1 second after the reboot, just in case it is off and can be recovered. We then wait another 2 seconds and check the power status (as we did before). 2. If the reboot succeeded, follow this series of choices: a. If the chassis is on, the fence succeeded. b. If the chassis is off, the fence "succeeded" as well. c. If the chassis is in some other state, the fence failed. 3. If the reboot failed, follow this series of choices: a. If the chassis is off, the fence itself failed, but we can treat it as "succeeded" since the chassis is in a known-offline state. This is the most likely situation when there is a critical hardware failure, and the server's IPMI does not allow itself to start back up again. b. If the chassis is in any other state ("on" or unknown), the fence itself failed and we must treat this as a fence failure. Overall, this should alleviate the aforementioned issue of a critical failure rendering the node persistently "off" not triggering a fence-flush and ensure fencing is more robust.	2021-07-13 17:54:41 -04:00
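The decision rules listed in c6d552ae57 (cases 2a to 2c and 3a to 3b) reduce to a small truth table. The function below is a sketch under assumed names (`fence_succeeded`, `reboot_ok`, `chassis_state`) and is not the actual fencing code. It covers only the final success decision, not the IPMI calls or the wait timers.

```python
def fence_succeeded(reboot_ok, chassis_state):
    """Decide whether a fence counts as successful.

    reboot_ok:     True if the IPMI reboot command itself succeeded.
    chassis_state: "on", "off", or any other string for an unknown state.
    """
    if chassis_state == "off":
        # Cases 2b and 3a: a known-offline chassis is treated as fenced,
        # whether or not the reboot command succeeded.
        return True
    if reboot_ok and chassis_state == "on":
        # Case 2a: the main Reboot -> Check On -> Success path.
        return True
    # Cases 2c and 3b: unknown state, or "on" after a failed reboot.
    return False
```

Written this way, the "off" state is checked first because it leads to the same outcome on both branches of the commit's list.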
Joshua Boniface	2e9f6ac201	Bump version to 0.9.25	2021-07-11 23:19:09 -04:00
Joshua Boniface	f09849bedf	Don't overwrite shutdown state on termination Just a minor quibble and not really impactful.	2021-07-11 23:18:14 -04:00
Joshua Boniface	c76149141f	Only log ZK connections when persistent Prevents spam in the API logs.	2021-07-10 23:35:49 -04:00
Joshua Boniface	f00c4d07f4	Add date output to keepalive Helps track when there is a log follow in "-o cat" mode.	2021-07-10 23:24:59 -04:00
Joshua Boniface	20b66c10e1	Move two more commands to Rados library	2021-07-10 17:28:42 -04:00
Joshua Boniface	cfeba50b17	Revert "Return to all command-based Ceph gathering" This reverts commit `65d14ccd92`. This was actually a bad idea. For inexplicable reasons, running these Ceph commands manually (not even via Python, but in a normal shell) takes several hundred times longer than running them with the Rados module (roughly 0.8 s versus about 1 ms in the timings below), so long in fact that some basic commands like "ceph health" would sometimes take longer than the 1 second timeout to complete. Despite the occasional issues when monitors drop out, the Rados module is clearly far superior to the shell commands for any moderately-loaded Ceph cluster. We can look into solving timeouts another way (perhaps with Processes instead of Threads) at a later time. Rados module "ceph health": b'{"checks":{},"status":"HEALTH_OK"}' 0.001204 (s); b'{"checks":{},"status":"HEALTH_OK"}' 0.001258 (s). Command "ceph health": joshua@hv1.c.bonilan.net ~ $ time ceph health >/dev/null: real 0m0.772s, user 0m0.707s, sys 0m0.046s; joshua@hv1.c.bonilan.net ~ $ time ceph health >/dev/null: real 0m0.796s, user 0m0.728s, sys 0m0.054s	2021-07-10 03:47:45 -04:00
Joshua Boniface	551bae2518	Bump version to 0.9.24	2021-07-09 15:58:36 -04:00
Joshua Boniface	2b5dc286ab	Correct failure to get ceph_health data	2021-07-09 13:10:28 -04:00
Joshua Boniface	330cf14638	Remove return statements in keepalive collectors These seem to bork the keepalive timer process, so just remove them and let it continue to press on.	2021-07-09 13:04:17 -04:00
Joshua Boniface	65d14ccd92	Return to all command-based Ceph gathering Using the Rados module was very problematic, specifically because it had no sensible timeout parameters and thus would hang for many seconds. This has poor implications since it blocks further keepalives. Instead, remove the Rados usage entirely and go back completely to using manual OS commands to gather this information. While this may cause PID exhaustion more quickly it's worthwhile to avoid failure scenarios when Ceph stats time out. Closes #137	2021-07-06 11:30:45 -04:00
Joshua Boniface	7082982a33	Bump version to 0.9.23	2021-07-05 23:40:32 -04:00
Joshua Boniface	5b6ef71909	Ensure daemon mode is updated on startup Fixes the side effect of the previous bug during deploys of 0.9.22.	2021-07-05 23:39:23 -04:00
Joshua Boniface	be7b0be8ed	Fix typo in schema path name	2021-07-05 23:23:23 -04:00
Joshua Boniface	37cd278bc2	Bump version to 0.9.22	2021-07-05 14:18:51 -04:00
Joshua Boniface	a69105569f	Add node PVC version data to Node information Allows API client to see the currently-active version of the node daemon.	2021-07-05 09:57:38 -04:00
Joshua Boniface	21a1a7da9e	Fix bad schema reference Not sure how this didn't cause an issue until now, but the wrong key path was used and this was getting unexpected data with the newly-added version string instead of the proper mode string.	2021-07-05 09:53:51 -04:00
Joshua Boniface	f0fd3d3f0e	Make extra sure VMs terminate when told When doing a stop_vm or terminate_vm, check again after 0.2 seconds and try re-terminating if it's still running. Covers cases where a VM doesn't stop if given the 'stop' state.	2021-07-02 11:40:34 -04:00
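The re-check described in f0fd3d3f0e might look roughly like this. The callables `is_running` and `terminate` are hypothetical stand-ins for whatever the daemon uses to query and stop a domain; only the 0.2 second delay comes from the commit message.

```python
import time


def ensure_stopped(is_running, terminate, recheck_delay=0.2):
    """Terminate a VM, then re-terminate once if it is still running.

    is_running: callable returning True while the VM is still up.
    terminate:  callable that asks the VM to stop.
    """
    terminate()
    time.sleep(recheck_delay)
    if is_running():
        # The first request did not take effect, so try once more.
        terminate()
```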
Joshua Boniface	f12de6727d	Adjust logo slightly and add debug state	2021-07-02 02:32:08 -04:00
Joshua Boniface	e94f5354e6	Update startup messages with new ASCII logo	2021-07-02 02:21:30 -04:00
Joshua Boniface	c51023ba81	Add profiler to keepalive function	2021-07-02 01:55:15 -04:00
Joshua Boniface	39e82ee426	Cast base schema version to int Or all our comparisons will fail later and nodes can't start.	2021-06-30 09:40:33 -04:00
Joshua Boniface	fe0a1d582a	Bump version to 0.9.21	2021-06-29 19:21:31 -04:00
Joshua Boniface	3490ecbb59	Remove explicit ZK address from Patronictl command	2021-06-22 03:31:06 -04:00
Joshua Boniface	2928d695c9	Ensure migration method is updated on state changes	2021-06-22 03:20:15 -04:00
Joshua Boniface	26dd24e3f5	Ensure MTU is set on VF when starting up	2021-06-22 02:26:14 -04:00
Joshua Boniface	e623909a43	Store PHY MAC for VFs and restore after free	2021-06-22 00:56:47 -04:00
Joshua Boniface	60e1da09dd	Don't try any shenanigans when updating NICs Trying to do this on the VMInstance side had problems because we can't differentiate the 3 types of migration there. So, just update this in the API side and hope everything goes well. This introduces an edge bug: if a VM is using a macvtap SR-IOV device, and then tries to migrate, and the migrate is aborted, the NIC lists will be inconsistent. When I revamp the VMInstance in the future, I should be able to correct this, but for now we'll have to live with that edge case.	2021-06-22 00:00:50 -04:00
Joshua Boniface	7d42fba373	Ensure being in migrate doesn't abort shutdown	2021-06-21 23:28:53 -04:00
Joshua Boniface	24ce361a04	Ensure SR-IOV NIC states are updated on migration	2021-06-21 23:18:34 -04:00
Joshua Boniface	eeb83da97d	Add support for SR-IOV NICs to VMs	2021-06-21 23:18:22 -04:00
Joshua Boniface	64d1a37b3c	Add PCIe device paths to SR-IOV VF information This will be used when adding VM network interfaces of type hostdev.	2021-06-21 21:08:46 -04:00
Joshua Boniface	13cc0f986f	Implement SR-IOV VF config set Also fixes some random bugs, adds proper interface sorting, and assorted tweaks.	2021-06-21 18:40:11 -04:00
Joshua Boniface	ca11dbf491	Sort the list of VFs for easier parsing	2021-06-21 01:40:05 -04:00
Joshua Boniface	e8bd1bf2c4	Ensure used/used_by are set on creation	2021-06-21 01:25:38 -04:00
Joshua Boniface	bff6d71e18	Add logging to SRIOVVFInstance and fix bug	2021-06-17 02:02:41 -04:00
Joshua Boniface	57b041dc62	Ensure default for vLAN and QOS is 0 not empty	2021-06-17 01:54:37 -04:00
Joshua Boniface	5607a6bb62	Avoid overwriting VF data Ensures that the configuration of a VF is not overwritten in Zookeeper on a node restart. The SRIOVVFInstance handlers were modified to start with None values, so that the DataWatch statements will always trigger updates to the live system interfaces on daemon startup, thus ensuring that the config stored in Zookeeper is applied to the system on startup (mostly relevant after a cold boot or if the API changes them during a daemon restart).	2021-06-17 01:45:22 -04:00
Joshua Boniface	8f1af2a642	Ignore hostdev interfaces in VM net stat gathering Prevents errors if a SR-IOV hostdev interface is configured until this is more defined.	2021-06-17 01:33:11 -04:00
Joshua Boniface	e7b6a3eac1	Implement SR-IOV PF and VF instances Adds support for the node daemon managing SR-IOV PF and VF instances. PFs are added to Zookeeper automatically based on the config at startup during network configuration, and are otherwise completely static. PFs are automatically removed from Zookeeper, along with all corresponding VFs, should the PF phy device be removed from the configuration. VFs are configured based on the (autocreated) VFs of each PF device, added to Zookeeper, and then a new class instance, SRIOVVFInstance, is used to watch them for configuration changes. This will enable the runtime management of VF settings by the API. The set of keys ensures that both configuration and details of the NIC can be tracked. Most keys are self-explanatory, especially for PFs and the basic keys for VFs. The configuration tree is also self-explanatory, being based entirely on the options available in the `ip link set {dev} vf` command. Two additional keys are also present: `used` and `used_by`, which will be able to track the (boolean) state of usage, as well as the VM that uses a given VIF. Since the VM side implementation will support both macvtap and direct "hostdev" assignments, this will ensure that this state can be tracked on both the VF and the VM side.	2021-06-17 01:33:03 -04:00
Joshua Boniface	0ad6d55dff	Add initial SR-IOV support to node daemon Adds configuration values for enabled flag and SR-IOV devices to the configuration and sets up the initial SR-IOV configuration on daemon startup (inserting the module, configuring the VF count, etc.).	2021-06-15 22:56:09 -04:00
Joshua Boniface	e4a65230a1	Just do the shutdown command itself	2021-06-15 02:32:14 -04:00
Joshua Boniface	284c581845	Ensure shutdown migrations actually time out Without this a VM that fails to respond to a shutdown will just spin forever, blocking state changes.	2021-06-15 00:23:15 -04:00
Joshua Boniface	953e46055a	Fix issue with loading None version schema	2021-06-14 21:09:55 -04:00
Joshua Boniface	d2bcfe5cf7	Bump version to 0.9.20	2021-06-14 18:06:27 -04:00
Joshua Boniface	ef1701b4c8	Handle an additional exception case	2021-06-14 17:15:40 -04:00
Joshua Boniface	08dc756549	Actually disable the pvcapid service Prevents it from trying to start itself during updates or reboots on non-primary coordinators.	2021-06-14 17:13:22 -04:00
Joshua Boniface	0a9c0c1ccb	Use a nicer reload method on hot schema update Instead of exiting and trusting systemd to restart us, instead leverage the os.execv() call to reload the process in the current PID context. Also improves the log messages so it's very clear what's going on.	2021-06-14 17:10:21 -04:00
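A minimal sketch of the `os.execv()` approach mentioned in 0a9c0c1ccb: the running process image is replaced by a fresh interpreter running the same script, so the PID stays the same. The helper names are assumptions, and the real daemon may build its arguments differently.

```python
import os
import sys


def reexec_args():
    """Build the (path, argv) pair used to restart the current script."""
    return sys.executable, [sys.executable] + sys.argv


def reload_in_place():
    """Replace the current process with a fresh copy of itself."""
    # Flush buffered output first; execv() does not return on success,
    # so anything left in the buffers would be lost.
    sys.stdout.flush()
    sys.stderr.flush()
    path, argv = reexec_args()
    os.execv(path, argv)
```

Compared with exiting and relying on systemd to restart the service (the approach in e34a7d4d2a), this keeps the restart inside the process itself.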
Joshua Boniface	e34a7d4d2a	Handle hot reloads properly A hot reload isn't possible due to DataWatch and ChildrenWatch constructs, so we instead need to terminate the daemon to "apply" the schema update. Thus we use exit code 150 (Application defined in LSB) and reorder some of the elements of the schema validation to ensure things happen in the right order.	2021-06-14 12:52:43 -04:00
Joshua Boniface	1f49bfa1b2	Fix name of schema element	2021-06-13 20:56:17 -04:00
Joshua Boniface	647bce2a22	Ensure we don't grab None data	2021-06-13 16:43:25 -04:00
Joshua Boniface	26b1f531e9	Fix bad variable interpolation	2021-06-13 14:37:23 -04:00
Joshua Boniface	be9f1e8636	Use more compatible is_alive in thread	2021-06-13 14:36:27 -04:00
Joshua Boniface	b694945010	Fix incorrect name bug	2021-06-10 01:11:14 -04:00
Joshua Boniface	058c2ceef3	Convert VXNetworkInstance to new ZK schema handler	2021-06-10 00:36:18 -04:00
Joshua Boniface	e7d60260a0	Fix typo in CephInstance path	2021-06-10 00:36:02 -04:00
Joshua Boniface	85aba7cc18	Convert VMInstance to new ZK schema handler	2021-06-09 23:15:08 -04:00
Joshua Boniface	7e42118e6f	Adjust lock schema in NodeInstance and VMInstance Removes a superfluous lock and puts the sync_lock keys in more usable places.	2021-06-09 22:51:00 -04:00
Joshua Boniface	2704badfbe	Convert VMConsole... to new ZK schema handler	2021-06-09 22:08:32 -04:00
Joshua Boniface	450bf6b153	Convert NodeInstance to new ZK schema handler	2021-06-09 22:07:32 -04:00
Joshua Boniface	b94fe88405	Convert fencing to new ZK schema handler	2021-06-09 21:29:01 -04:00
Joshua Boniface	610f6e8f2c	Convert CephInstance to new ZK schema handler	2021-06-09 21:17:09 -04:00
Joshua Boniface	f913f42a6d	Replace schema paths with updated zkhandler	2021-06-09 20:29:42 -04:00
Joshua Boniface	e475552391	Fix some bugs with hot reload	2021-06-09 00:03:26 -04:00
Joshua Boniface	5540bdc86b	Add automatic schema upgrade to nodes Performs an automatic schema upgrade when all nodes are updated to the latest version. Addresses #129	2021-06-08 23:35:39 -04:00
Joshua Boniface	3c102b3769	Add per-node schema tracking This will allow nodes to start with their own schema versions, and then be updated simultaneously by the API. References #129	2021-06-08 23:35:39 -04:00
Joshua Boniface	a4aaf89681	Add ZKSchema loading and validation to Daemon Also removes some previous hack migrations from pre-0.9.19. Addresses #129	2021-06-08 23:35:39 -04:00
Joshua Boniface	126f0742cd	Add Zookeeper schema manager to zkhandler Adds a new class, ZKSchema, to handle schema management in Zookeeper in an automated and consistent way. This should solve several issues: 1. Pain in managing changes to ZK keys 2. Pain in handling those changes during live upgrades 3. Simplifying the codebase to remove hardcoded ZK paths The current master schema for PVC 0.9.19 is committed as version 0. Addresses #129	2021-06-08 23:35:39 -04:00
Joshua Boniface	5843d8aff4	Fix fence call to findTargetNode	2021-06-08 23:34:49 -04:00
Joshua Boniface	cf96bb009f	Bump version to 0.9.19	2021-06-06 01:47:41 -04:00
Joshua Boniface	719954b70b	Fix missing list comma	2021-06-06 01:39:43 -04:00
Joshua Boniface	7dea5d2fac	Move logger to common, fix buffering	2021-06-01 18:50:26 -04:00
Joshua Boniface	3a5226b893	Add missing flushed output	2021-06-01 18:30:18 -04:00
Joshua Boniface	de2ff2e01b	Fix removed function args	2021-06-01 17:02:36 -04:00
Joshua Boniface	cd75413667	Increase initial lock timer With the new library the reader seems to be a little too quick, so hold the write lock for 1 second instead of 1/2 second to ensure it is caught.	2021-06-01 17:00:11 -04:00
Joshua Boniface	9764090d6d	Merge node common with daemon common	2021-06-01 12:22:11 -04:00
Joshua Boniface	12ac3686de	Convert missed elements to new zkhandler	2021-06-01 11:57:21 -04:00
Joshua Boniface	5740d0f2d5	Remove obsolete zkhandler.py	2021-06-01 11:55:44 -04:00
Joshua Boniface	889f4cdf47	Convert common to new zkhandler	2021-06-01 11:55:32 -04:00
Joshua Boniface	8f66a8d00e	Fix missed zkhandler conversion	2021-06-01 11:53:33 -04:00
Joshua Boniface	6beea0693c	Convert fencing to new zkhandler	2021-06-01 11:53:21 -04:00
Joshua Boniface	1c9a7a6479	Convert VXNetworkInstance to new zkhandler	2021-06-01 11:49:39 -04:00
Joshua Boniface	790098f181	Convert VMInstance to new zkhandler	2021-06-01 11:46:27 -04:00
Joshua Boniface	8a4a41e092	Convert NodeInstance to new zkhandler	2021-06-01 11:27:35 -04:00
Joshua Boniface	a48bf2d71e	More gracefully handle none selectors Allow selection of "none" as the node selector, and handle this by always using the cluster default instead of writing it in.	2021-06-01 11:13:13 -04:00
Joshua Boniface	a0b9087167	Set Daemon migration selector in zookeeper	2021-06-01 10:52:41 -04:00
Joshua Boniface	33a54cf7f2	Move configuration keys to /config tree	2021-06-01 10:48:55 -04:00
Joshua Boniface	d6a8cf9780	Convert MetadataAPIInstance to new zkhandler	2021-05-31 19:55:09 -04:00
Joshua Boniface	abd619a3c1	Convert DNSAggregatorInstance to new zkhandler	2021-05-31 19:55:01 -04:00
Joshua Boniface	ef5fe78125	Convert CephInstance to new zkhandler	2021-05-31 19:51:27 -04:00
Joshua Boniface	f6d0e89568	Properly add absent node type	2021-05-31 19:26:27 -04:00
Joshua Boniface	ede3e88cd7	Modify node daemon root to use updated zkhandler	2021-05-31 03:14:09 -04:00
Joshua Boniface	c23a53d082	Add daemon_lib symlink to pvcnoded	2021-05-30 00:00:07 -04:00
Joshua Boniface	0c75a127b2	Bump version to 0.9.18	2021-05-23 17:23:10 -04:00
Joshua Boniface	9de14c46fb	Bump version to 0.9.17	2021-05-19 17:06:29 -04:00
Joshua Boniface	fe15bdb854	Bump version to 0.9.16	2021-05-10 01:13:21 -04:00
Joshua Boniface	b851a6209c	Catch all other exceptions in subprocess run Found a rare glitch where the subprocess pipes would not engage, causing a daemon crash. Catch these exceptions with a retcode of 255 instead of bailing out. Closes #124	2021-05-10 01:07:25 -04:00
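The catch-all described in b851a6209c could be sketched as below. The function name, signature, and return shape are assumptions made for illustration; only the 255 return code on unexpected exceptions comes from the commit message.

```python
import subprocess


def run_os_command(args, timeout=None):
    """Run a command and return (retcode, stdout, stderr).

    Any exception raised while spawning or running the process is
    reported as retcode 255 instead of propagating to the caller.
    """
    try:
        result = subprocess.run(
            args,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except Exception as exc:  # deliberately broad: never crash the caller
        return 255, "", str(exc)
    return (
        result.returncode,
        result.stdout.decode("utf-8", errors="replace"),
        result.stderr.decode("utf-8", errors="replace"),
    )
```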
Joshua Boniface	5ceb57e540	Handle emptying corrupted console log files Libvirt will sometimes write junk out to console log files, which breaks the log parser deque with a UnicodeDecodeError. If this happens, clear the log and re-open the deque again for newer updates. Closes #123	2021-05-10 01:03:04 -04:00
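One way to picture the recovery in 5ceb57e540 is the sketch below, which assumes the console log is read into a bounded `deque` of lines. The function name and the `max_lines` parameter are hypothetical, and the real log parser may be structured differently.

```python
from collections import deque


def load_console_tail(path, max_lines=1000):
    """Return the last max_lines lines of a log file as a deque.

    If the file contains bytes that are not valid UTF-8, empty the file
    and return an empty deque so that newer output can be read cleanly.
    """
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return deque(fh, maxlen=max_lines)
    except UnicodeDecodeError:
        # Truncate the corrupted log, then start over with no history.
        open(path, "w").close()
        return deque(maxlen=max_lines)
```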
Joshua Boniface	669338c22b	Bump version to 0.9.15	2021-04-08 13:37:47 -04:00
Joshua Boniface	c4ac75b973	Bump version to 0.9.14	2021-03-30 10:27:37 -04:00
Joshua Boniface	0bf276fd51	Update copyright year in headers	2021-03-25 17:01:55 -04:00
Joshua Boniface	f4ec161aa2	Update file copyright header. Remove the option to select a later version of the GPL.	2021-03-25 16:58:02 -04:00
Joshua Boniface	0ccfc41398	Bump version to 0.9.13	2021-02-17 11:37:59 -05:00
Joshua Boniface	9100c63e99	Add stored_bytes to pool stats information	2021-02-09 01:46:01 -05:00
Joshua Boniface	aba567d6c9	Add nice startup banners to both daemons Add nicer easy-to-find (yay ASCII art) banners for the startup printouts of both the node and API daemons. Also adds the safe loader to pvcnoded to prevent hassle messages and a version string in the API daemon file.	2021-02-08 02:51:43 -05:00
Joshua Boniface	0db8fd9da6	Bump version to 0.9.12	2021-01-28 16:29:58 -05:00
Joshua Boniface	a44f134230	Remove systemd deps on zookeeper and libvirt This caused a serious race condition, since the IPs managed by PVC had not yet come up, but Zookeeper was trying to start and bind to them, which of course failed. Remove these dependencies entirely - the daemon itself starts these services during initialization and they do not need to be started by systemd first.	2021-01-28 16:25:02 -05:00
Joshua Boniface	9fbe35fd24	Bump version to 0.9.11	2021-01-05 15:58:26 -05:00
Joshua Boniface	a24724d9f0	Use external ceph cmd for ceph df	2020-12-26 14:04:21 -05:00
Joshua Boniface	78c017d51d	Remove erroneous extra colon in log output	2020-12-20 16:06:35 -05:00
Joshua Boniface	1b6613c280	Add live VNC information to domain output Sets in the node daemon, returns via the API, and shows in the CLI, information about the live VNC listen address and port for VNC-enabled VMs. Closes #115	2020-12-20 16:00:55 -05:00
Joshua Boniface	d6ef722997	Fix bad log message	2020-12-15 10:51:52 -05:00
Joshua Boniface	518d699c15	Bump version to 0.9.10	2020-12-15 10:45:15 -05:00
Joshua Boniface	ac3ef3d792	Revamp fencing order Prevents unnecessarily excessive timeouts if IPMI connections time out; before, would have to go through 3 timed out commands at ~20s each before failure was registered; reduced to 1 if the first times out.	2020-12-15 02:48:25 -05:00
Joshua Boniface	3705daff43	Better handle failing RBD lock frees If the VM is not in a stop state, failing to free the lock is now considered a fatal error and will put the domain into fail state, aborting the start. This is better than being unsafe or trying to start a VM which will fail to boot due to read-only volumes.	2020-12-14 16:04:38 -05:00
Joshua Boniface	7c99a7bda7	Safely reset RBD locks on failed VMs Should correct issues on cold start as well as if a VM crashes uncleanly, which would prevent the VM from starting due to stale RBD locks. This implementation has four parts: 1. Update how IP addresses are handled, specifically by replacing all previous instances of "vni_ipaddr" with "vni_floatingipaddr", and then adding the "vni_ipaddr" with the real data for this node's IPs. Also include the storage IPs in this where they weren't before, so each this_node actually has the local IPs plus floating IPs. This enables the next two steps. 2. Modify flush_locks to take this_node as an argument, and update the run_command function to only operate against this node, rather than on the primary coordinator. 3. Have the flush_locks check each lock against the current node, to verify that the lock is actually held by the current node. This is the only way to do this safely. During fencing, we override this by not passing a this_node which bypasses this check. 4. Have the VM start do the check for VM failure/startup and execute a flush_locks before actually starting the VM.	2020-12-14 15:53:18 -05:00
Joshua Boniface	89c7e225a0	Move OSD stats uploading to primary only Instead of each node uploading its own OSD stats, which would not work if the PVC daemon wasn't running, instead have the primary upload stats for all OSDs in the cluster.	2020-12-09 02:46:09 -05:00
Joshua Boniface	b36ec43a2d	Bump version to 0.9.9	2020-12-09 02:20:20 -05:00
Joshua Boniface	ce5ee11841	Bump version to 0.9.8	2020-11-24 12:26:57 -05:00
Joshua Boniface	d4a28d7a58	Bump version to 0.9.7	2020-11-19 10:48:28 -05:00
Joshua Boniface	e69eb93cb3	Bump version to 0.9.6	2020-11-17 13:01:54 -05:00
Joshua Boniface	70dfcd434f	Ensure inmigrate is cleared on failure	2020-11-17 12:57:37 -05:00
Joshua Boniface	a4e5323e81	Bump version to 0.9.5	2020-11-17 12:34:04 -05:00
Joshua Boniface	9053edacd8	Bump version to 0.9.4	2020-11-10 15:33:50 -05:00
Joshua Boniface	baac8f24fd	Bump version to 0.9.3	2020-11-09 10:28:15 -05:00
Joshua Boniface	11702f4bc8	Bump version to 0.9.2	2020-11-08 02:03:29 -05:00
Joshua Boniface	6f66b77a00	Lint: E121/E126 continuation line under/over-indented for hanging indent	2020-11-07 15:06:21 -05:00
Joshua Boniface	9135c5e3e4	Lint: E241 multiple spaces after ','	2020-11-07 14:52:39 -05:00
Joshua Boniface	260b39ebf2	Lint: E302 expected 2 blank lines, found X	2020-11-07 14:45:24 -05:00
Joshua Boniface	ab0b932fe3	Lint: E125 continuation line with same indent as next logical line	2020-11-07 13:49:54 -05:00
Joshua Boniface	f5988ad53d	Lint: F821 undefined name 'pool'/'volume' This class is actually entirely unused but is kept for consistency with the others. It may be used someday for something.	2020-11-07 13:34:18 -05:00
Joshua Boniface	c3dfe2e381	Lint: F821 undefined name 'myshorthostname'	2020-11-07 13:31:19 -05:00
Joshua Boniface	961ebb4c01	Lint: E305 expected 2 blank lines after class or function definition, found X	2020-11-07 13:17:49 -05:00
Joshua Boniface	e553c5d42a	Lint: E122 continuation line missing indentation or outdented	2020-11-07 13:12:26 -05:00
Joshua Boniface	7932be3948	Lint: E261 at least two spaces before inline comment	2020-11-07 13:11:03 -05:00
Joshua Boniface	d2490419c5	Lint: E202 whitespace before ']'	2020-11-07 13:02:54 -05:00
Joshua Boniface	d2e5ede399	Lint: E202 whitespace before ')'	2020-11-07 12:58:54 -05:00
Joshua Boniface	3f242cd437	Lint: E202 whitespace before '}'	2020-11-07 12:57:42 -05:00
Joshua Boniface	b7daa8e1f6	E201 whitespace after '['	2020-11-07 12:39:59 -05:00
Joshua Boniface	c88965e898	Lint: E201 whitespace after '('	2020-11-07 12:39:27 -05:00
Joshua Boniface	e333f2b935	Lint: E201 whitespace after '{'	2020-11-07 12:38:31 -05:00
Joshua Boniface	3cb92fed75	Lint: E401 multiple imports on one line	2020-11-07 12:29:32 -05:00
Joshua Boniface	27c6ac2b66	Lint: W605 invalid escape sequence '\d' This is the only one where forcing an `r` type to the string was required; the remainder of W605 were replaced with character class enclosures.	2020-11-07 12:22:20 -05:00
Joshua Boniface	8ba267a59e	Lint: E211 whitespace before '['/'('	2020-11-07 12:20:01 -05:00
Joshua Boniface	39cc992e9b	Lint: E306 expected 1 blank line before a nested definition, found 0	2020-11-07 12:17:38 -05:00
Joshua Boniface	8c623023d5	Lint: F811 redefinition of unused '<function>'	2020-11-07 12:14:29 -05:00
Joshua Boniface	5b3ee363b2	Lint: E222 multiple spaces after operator	2020-11-07 12:10:24 -05:00
Joshua Boniface	fad27a7f4d	Lint: E131 continuation line unaligned for hanging indent	2020-11-06 22:29:49 -05:00
Joshua Boniface	2eef6a1c21	Lint: E265 block comment should start with '# '	2020-11-06 21:32:17 -05:00
Joshua Boniface	4b47a2424c	Lint: E303 too many blank lines (2)	2020-11-06 21:16:52 -05:00
Joshua Boniface	cb2defbde9	Lint: W391 blank line at end of file	2020-11-06 21:14:19 -05:00
Joshua Boniface	5da314902f	Lint: F841 local variable '<variable>' is assigned to but never used	2020-11-06 21:13:13 -05:00
Joshua Boniface	98a573bbc7	Lint: E402 module level import not at top of file	2020-11-06 20:40:32 -05:00
Joshua Boniface	aecb845d6a	Lint: E713 test for membership should be 'not in'	2020-11-06 20:37:52 -05:00
Joshua Boniface	fde8ea2fea	Lint: W291 trailing whitespace	2020-11-06 19:44:14 -05:00
Joshua Boniface	57c51d3234	Lint: E711 comparison to None should be 'if cond is not None:'	2020-11-06 19:37:13 -05:00
Joshua Boniface	ce01b41d81	Lint: E711 comparison to None should be 'if cond is None:'	2020-11-06 19:36:36 -05:00
Joshua Boniface	4d6f36aca0	Lint: E712 comparison to False should be 'if cond is False:' or 'if not cond:'	2020-11-06 19:35:51 -05:00
Joshua Boniface	fb4aafcea9	Lint: E111 indentation is not a multiple of four	2020-11-06 19:26:22 -05:00
Joshua Boniface	d9e7b7ec15	Lint: F401 <library> imported but unused	2020-11-06 19:22:49 -05:00
Joshua Boniface	ebf254f62d	Lint: W293 blank line contains whitespace	2020-11-06 19:11:07 -05:00
Joshua Boniface	63f4f9aed7	Lint: E722 do not use bare 'except'	2020-11-06 18:55:10 -05:00
Joshua Boniface	56ba7b1457	Bump version to 0.9.1	2020-10-29 12:16:38 -04:00
Joshua Boniface	ec0b8acf90	Support per-VM migration type selectors Allow a VM to specify its migration type as a default choice. The valid options are "default" (i.e. behave as now), "live" which forces a live migration only, and "shutdown" which forces a shutdown migration only. The new option is treated as a VM meta option and is set to default if not found.	2020-10-29 12:01:29 -04:00
Joshua Boniface	5d08ad9573	Fix incorrect keepalive interval setting	2020-10-26 11:44:45 -04:00
Joshua Boniface	0f299777f1	Modify version to 3-digit numbering I expect 0.9 will be fairly long-lived, so add another decimal place so I may continue adding tweaks to it. THIS IS NOT SEMVER.	2020-10-26 02:13:11 -04:00
Joshua Boniface	890023cbfc	Make sender wait dynamic based on receiver	2020-10-21 14:43:54 -04:00
Joshua Boniface	28abb018e3	Improve some timeouts and conditionals	2020-10-21 12:00:10 -04:00
Joshua Boniface	017953c2e6	Move lock release to phase D	2020-10-21 11:07:01 -04:00
Joshua Boniface	82b4d3ed1b	Add missing prefix statements to loggers	2020-10-21 10:52:53 -04:00
Joshua Boniface	bae366a316	Add waits and only receive check on send	2020-10-21 10:43:42 -04:00
Joshua Boniface	351076c15e	Check if node changed during final check Avoids situations where two migrates, to different nodes, happen in rapid succession. Aborts the migration if the current target node no longer matches what was set at the start of the execution.	2020-10-21 02:52:36 -04:00
Joshua Boniface	42514b9a50	Improve messages further	2020-10-21 02:41:42 -04:00
Joshua Boniface	611e47f338	Add messages to migration aborts Results in some information duplication, but ensures logging of the reason a migration was aborted separate from the error(s) this may generate.	2020-10-21 02:38:42 -04:00
Joshua Boniface	1523959074	Move where setting last_ vars happens	2020-10-21 02:24:00 -04:00
Joshua Boniface	ef762359f4	Adjust timing to avoid migrating to self quickly Add another separate state lock, release it earlier, and ensure timings are good to avoid double-migrating one VM.	2020-10-21 02:17:55 -04:00
Joshua Boniface	398d33778f	Avoid stopping duplicates, just lock our own key	2020-10-20 16:10:39 -04:00
Joshua Boniface	a6d492ed9f	Remove spurious writes and adjust sleep	2020-10-20 16:04:26 -04:00
Joshua Boniface	11fa3b0df3	Remove additional wait and add last_node entries These allow for aborting a migration to retain the previous settings and override what the client set.	2020-10-20 15:58:55 -04:00
Joshua Boniface	442aa4e420	Tweak timers further	2020-10-20 15:43:59 -04:00
Joshua Boniface	3910843660	Add missing break	2020-10-20 15:39:29 -04:00
Joshua Boniface	70f3fdbfb9	Tweak the delays slightly on receive	2020-10-20 15:38:07 -04:00
Joshua Boniface	7cb0241a12	Attempt live migrates 3 times before proceeding	2020-10-20 15:33:41 -04:00
Joshua Boniface	9fb33ed7a7	Increase peer lock acquiring timers	2020-10-20 15:26:59 -04:00
Joshua Boniface	abfe0108ab	Better handle aborting migrations	2020-10-20 15:22:16 -04:00
Joshua Boniface	567fe8f36b	Wait for existing migrations before proceeding	2020-10-20 15:12:32 -04:00
Joshua Boniface	ec7b78b9b8	Add additional short sleep in receive	2020-10-20 13:29:17 -04:00
Joshua Boniface	224c8082ef	Alter text of synchronization messages	2020-10-20 13:08:18 -04:00
Joshua Boniface	f9e7e9884f	Improve handling of VM migrations The VM migration code was very old, very spaghettified, and prone to strange failures. Improve this by taking cues from the node primary migration. Use synchronization between the nodes to ensure lockstep completion of the migration in discrete steps. A proper queue can be built later to integrate with this code more cleanly. References #108	2020-10-20 13:01:55 -04:00
Joshua Boniface	726501f4d4	Add additional logging to flush selector Adds additional debug logging to the flush selector to determine how and why any given node is selected. Useful for troubleshooting strange choices.	2020-10-20 12:34:18 -04:00
Joshua Boniface	7cc33451b9	Improve Munin check with extinfo	2020-10-19 11:01:00 -04:00
Joshua Boniface	c6e34c7dc6	Bump base version to 0.9	2020-10-18 14:31:19 -04:00
Joshua Boniface	f749633f7c	Use provisioned memory for mem migration selector Use the new "provisioned" memory field, instead of the "allocated" memory field, to determine the optimal node when using the "mem" migration selector. This will take into account non-running VMs in the calculation as well as running VMs.	2020-10-18 14:17:15 -04:00
Joshua Boniface	a4b80be5ed	Add provisioned memory to node info Adds a separate field to the node memory, "provisioned", which totals the amount of memory provisioned to all VMs on the node, regardless of state, and in contrast to "allocated" which only counts running VMs. Allows for the detection of potential overprovisioned states when factoring in non-running VMs. Includes the supporting code to get this data, since the original implementation of VM memory selection was dependent on the VM being running and getting this from libvirt. Now, if the VM is not active, it gets this from the domain XML instead.	2020-10-18 14:17:15 -04:00
Joshua Boniface	aa5f8c93fd	Entirely disable IPv6 on bridged interfaces Prevents any potential leakage due to autoconfigured IPv6 on bridged interfaces. These are exclusively VM-side bridges, and the PVC host should not have any IPv6 configuration on them, ever.	2020-10-15 11:00:59 -04:00
Joshua Boniface	9366977fe6	Copy d_domain before iterating Prevents a bug where the thread can crash due to a change in the d_domain object while running the for loop. By copying and iterating over the copy, this becomes safer.	2020-09-16 15:12:37 -04:00
Joshua Boniface	65b44f2955	Avoid breaking keepalive during incoming migration The keepalive was getting stuck gathering memoryStats from the non-running VM, since it was in a paused state. Avoid this by just skipping past the rest of the stats gathering if the VM isn't running.	2020-08-28 01:47:36 -04:00
Joshua Boniface	78dec77987	Bump version to 0.8	2020-08-26 10:24:44 -04:00
Joshua Boniface	1dcc1f6d55	Rename sample database for API From pvcprov to pvcapi to facilitate the changing nature of this database and its expansion to benchmark results.	2020-08-25 01:59:35 -04:00
Joshua Boniface	921e57ca78	Fix syntax error	2020-08-20 23:05:56 -04:00
Joshua Boniface	3cc7df63f2	Add configurable VM shutdown timeout Closes #102	2020-08-20 21:26:12 -04:00
Joshua Boniface	7e2114b536	Add initial monitoring configurations to daemon Initial work to support multiple monitoring agents including Munin, Check_MK, and NRPE at the least.	2020-08-17 17:05:55 -04:00
Joshua Boniface	e8e65934e3	Use logger prefix for thread debug logs	2020-08-17 14:30:21 -04:00
Joshua Boniface	24fda8a73f	Use new debug logger for DNS Aggregator	2020-08-17 14:26:43 -04:00
Joshua Boniface	9b3ef6d610	Add connect timeout to Ceph This doesn't seem to actually do anything (like most of these timeouts...) but add it just for posterity.	2020-08-17 13:58:14 -04:00
Joshua Boniface	b451c0e8e3	Add additional start/finish debug messages	2020-08-17 13:11:03 -04:00
Joshua Boniface	f9b126a106	Make zkhandler accept failures more robustly Most of these would silently fail if there was e.g. an issue with the ZK connection. Instead, encase things in try blocks and handle the exceptions in a more graceful way, returning None or False if applicable. Except for locks, which should retry 5 times before aborting.	2020-08-17 13:03:36 -04:00
Joshua Boniface	553f96e7ef	Use logger for debug output Using simple print statements was annoying (lack of timing info and formatting), so move to using the debug logger for these instead with a custom state ('d') with white text to differentiate them. Also indicate which subthread of the keepalive each task is being executed in for easier tracing of issues.	2020-08-17 12:46:52 -04:00
Joshua Boniface	65add58c9a	Properly properly handle issue	2020-08-16 11:38:39 -04:00
Joshua Boniface	0a01d84290	Tie fence timers to keepalive_interval Also wait 2 full keepalive intervals after fencing before doing anything else, to give the Ceph cluster a chance to recover.	2020-08-15 12:38:03 -04:00
Joshua Boniface	4afb288429	Properly handle missing domain_name fail	2020-08-15 12:07:23 -04:00
Joshua Boniface	985ad5edc0	Warn if fencing will fail Verify our IPMI state on startup, and then warn if fencing will fail. For now, this is sufficient, but in future (requires refactoring) we might want to adjust how fencing occurs based on this information.	2020-08-13 14:42:18 -04:00
Joshua Boniface	0587bcbd67	Go back to manual command for OSD stats Using the Ceph library was a disaster here; it had no timeout or way to force it to continue, so keepalives would become stuck and trigger fence storms. Go back to the manual osd dump command with a 2s timeout which is far more reliable and can be adequately terminated if it runs long.	2020-08-12 22:31:25 -04:00
Joshua Boniface	09c1bb6a46	Increase start delay of flush service	2020-08-11 14:17:35 -04:00
Joshua Boniface	e0cb4a58c3	Ensure zk_listener is readded after reconnect	2020-08-11 12:46:15 -04:00
Joshua Boniface	099c58ead8	Fix missing char in log message	2020-08-11 12:40:35 -04:00
Joshua Boniface	0e5c681ada	Clean up imports Make several imports more specific to reduce redundant code imports and improve memory utilization.	2020-08-11 12:09:10 -04:00
Joshua Boniface	46ffe352e3	Better handle subthread timeouts in keepalive Prevent the main keepalive thread from getting stuck due to a subthread taking an enormous time. If this happens, the rest of the main keepalive will continue onward, thus ensuring that the main keepalive does not fail for a significant number of cycles, which would cause a fence.	2020-08-11 11:37:26 -04:00
Joshua Boniface	ccee124c8b	Adjust fence failcount limit to 6 (30s) The previous saving throw limit (3/15s) seems to have been too low. I was observing bizarre failures where a node would be fenced while it was still starting up. Some of this may have been related to Zookeeper connections taking too long, but this was inconsistent. Increase this to 6 saving throws (30s). This provides significantly more time for a node to properly check in on startup before another node fences it. In the real world, 15s vs 30s isn't that big of a downtime change, but prevents false-positive fences.	2020-08-05 22:40:07 -04:00
Joshua Boniface	02343079c0	Improve fencing migrate layout Open the option to do this in parallel with some threads	2020-08-05 22:26:01 -04:00
Joshua Boniface	37b83aad6a	Add logging and use better conditional	2020-08-05 21:57:36 -04:00
Joshua Boniface	876f2424e0	Ensure dead state isn't written erroneously	2020-08-05 21:57:11 -04:00
Joshua Boniface	5871380e1b	Avoid crashing VM stats thread if domain migrated	2020-06-10 17:10:46 -04:00
Joshua Boniface	654a3cb7fa	Improve debug output and use ceph df util data	2020-06-06 22:52:49 -04:00
Joshua Boniface	9b65d3271a	Improve handling of Ceph status gathering Use the Rados library instead of random OS commands, which massively improves the performance of these tasks. Closes #97	2020-06-06 22:30:25 -04:00
Joshua Boniface	598b2025e8	Use Rados and add Ceph entries to pvcnoded.yaml	2020-06-06 21:12:51 -04:00
Joshua Boniface	70b787d1fd	Move all VM functions into thread	2020-06-06 15:44:05 -04:00
Joshua Boniface	e1310a05f2	Implement recording of VM stats during keepalive	2020-06-06 15:34:03 -04:00
Joshua Boniface	2ad6860dfe	Move Ceph statistics gathering into thread	2020-06-06 13:25:02 -04:00
Joshua Boniface	cebb4bbc1a	Comment cleanup	2020-06-06 13:20:40 -04:00
Joshua Boniface	a672e06dd2	Move fencing to end of keepalive function	2020-06-06 13:19:11 -04:00
Joshua Boniface	1db73bb892	Move libvirt closure into previous section	2020-06-06 13:18:37 -04:00
Joshua Boniface	c1956072f0	Rename update_zookeeper function to node_keepalive	2020-06-06 12:49:50 -04:00
Joshua Boniface	ce60836c34	Allow enforcement of live migration Provides a CLI and API argument to force live migration, which triggers a new VM state "migrate-live". The node daemon VMInstance during migrate will read this flag from the state and, if enforced, will not trigger a shutdown migration. Closes #95	2020-06-06 12:00:44 -04:00
Joshua Boniface	b5434ba744	Fix typo in variable name	2020-06-06 11:29:48 -04:00
Joshua Boniface	b9e5b14f94	Update lastnode too if a self-migrate is aborted References #92	2020-06-04 10:28:04 -04:00


853 Commits