parallelvirtualcluster/pvc

Author	SHA1	Message	Date
Joshua M. Boniface	c6d552ae57	Rework success checks for IPMI fencing Previously, if the node failed to restart, it was declared a "bad fence" and no further action would be taken. However, there are some situations, for instance critical hardware failures, where intelligent systems will not attempt (or succeed at) starting up the node in such a case, which would result in dead, known-offline nodes without recovery. Tweak this behaviour somewhat. The main path of Reboot -> Check On -> Success + fence-flush is retained, but some additional side-paths are now defined: 1. We attempt to power "on" the chassis 1 second after the reboot, just in case it is off and can be recovered. We then wait another 2 seconds and check the power status (as we did before). 2. If the reboot succeeded, follow this series of choices: a. If the chassis is on, the fence succeeded. b. If the chassis is off, the fence "succeeded" as well. c. If the chassis is in some other state, the fence failed. 3. If the reboot failed, follow this series of choices: a. If the chassis is off, the fence itself failed, but we can treat it as "succeeded" since the chassis is in a known-offline state. This is the most likely situation when there is a critical hardware failure, and the server's IPMI does not allow itself to start back up again. b. If the chassis is in any other state ("on" or unknown), the fence itself failed and we must treat this as a fence failure. Overall, this should alleviate the aforementioned issue of a critical failure rendering the node persistently "off" not triggering a fence-flush and ensure fencing is more robust.	2021-07-13 17:54:41 -04:00
Joshua M. Boniface	2e9f6ac201	Bump version to 0.9.25	2021-07-11 23:19:09 -04:00
Joshua M. Boniface	f09849bedf	Don't overwrite shutdown state on termination Just a minor quibble and not really impactful.	2021-07-11 23:18:14 -04:00
Joshua M. Boniface	c76149141f	Only log ZK connections when persistent Prevents spam in the API logs.	2021-07-10 23:35:49 -04:00
Joshua M. Boniface	f00c4d07f4	Add date output to keepalive Helps track when there is a log follow in "-o cat" mode.	2021-07-10 23:24:59 -04:00
Joshua M. Boniface	20b66c10e1	Move two more commands to Rados library	2021-07-10 17:28:42 -04:00
Joshua M. Boniface	cfeba50b17	Revert "Return to all command-based Ceph gathering" This reverts commit `65d14ccd92`. This was actually a bad idea. For inexplicable reasons, running these Ceph commands manually (not even via Python, but in a normal shell) takes 7 * two orders of magnitude longer than running them with the Rados module, so long in fact that some basic commands like "ceph health" would sometimes take longer than the 1 second timeout to complete. The Rados commands would however take about 1ms instead. Despite the occasional issues when monitors drop out, the Rados module is clearly far superior to the shell commands for any moderately-loaded Ceph cluster. We can look into solving timeouts another way (perhaps with Processes instead of Threads) at a later time. Rados module "ceph health": b'{"checks":{},"status":"HEALTH_OK"}' 0.001204 (s) b'{"checks":{},"status":"HEALTH_OK"}' 0.001258 (s) Command "ceph health": joshua@hv1.c.bonilan.net ~ $ time ceph health >/dev/null real 0m0.772s user 0m0.707s sys 0m0.046s joshua@hv1.c.bonilan.net ~ $ time ceph health >/dev/null real 0m0.796s user 0m0.728s sys 0m0.054s	2021-07-10 03:47:45 -04:00
Joshua M. Boniface	551bae2518	Bump version to 0.9.24	2021-07-09 15:58:36 -04:00
Joshua M. Boniface	2b5dc286ab	Correct failure to get ceph_health data	2021-07-09 13:10:28 -04:00
Joshua M. Boniface	330cf14638	Remove return statements in keepalive collectors These seem to bork the keepalive timer process, so just remove them and let it continue to press on.	2021-07-09 13:04:17 -04:00
Joshua M. Boniface	65d14ccd92	Return to all command-based Ceph gathering Using the Rados module was very problematic, specifically because it had no sensible timeout parameters and thus would hang for many seconds. This has poor implications since it blocks further keepalives. Instead, remove the Rados usage entirely and go back completely to using manual OS commands to gather this information. While this may cause PID exhaustion more quickly it's worthwhile to avoid failure scenarios when Ceph stats time out. Closes #137	2021-07-06 11:30:45 -04:00
Joshua M. Boniface	7082982a33	Bump version to 0.9.23	2021-07-05 23:40:32 -04:00
Joshua M. Boniface	5b6ef71909	Ensure daemon mode is updated on startup Fixes the side effect of the previous bug during deploys of 0.9.22.	2021-07-05 23:39:23 -04:00
Joshua M. Boniface	be7b0be8ed	Fix typo in schema path name	2021-07-05 23:23:23 -04:00
Joshua M. Boniface	37cd278bc2	Bump version to 0.9.22	2021-07-05 14:18:51 -04:00
Joshua M. Boniface	a69105569f	Add node PVC version data to Node information Allows API client to see the currently-active version of the node daemon.	2021-07-05 09:57:38 -04:00
Joshua M. Boniface	21a1a7da9e	Fix bad schema reference Not sure how this didn't cause an issue until now, but the wrong key path was used and this was getting unexpected data with the newly-added version string instead of the proper mode string.	2021-07-05 09:53:51 -04:00
Joshua M. Boniface	f0fd3d3f0e	Make extra sure VMs terminate when told When doing a stop_vm or terminate_vm, check again after 0.2 seconds and try re-terminating if it's still running. Covers cases where a VM doesn't stop if given the 'stop' state.	2021-07-02 11:40:34 -04:00
Joshua M. Boniface	f12de6727d	Adjust logo slightly and add debug state	2021-07-02 02:32:08 -04:00
Joshua M. Boniface	e94f5354e6	Update startup messages with new ASCII logo	2021-07-02 02:21:30 -04:00
Joshua M. Boniface	c51023ba81	Add profiler to keepalive function	2021-07-02 01:55:15 -04:00
Joshua M. Boniface	39e82ee426	Cast base schema version to int Or all our comparisons will fail later and nodes can't start.	2021-06-30 09:40:33 -04:00
Joshua M. Boniface	fe0a1d582a	Bump version to 0.9.21	2021-06-29 19:21:31 -04:00
Joshua M. Boniface	3490ecbb59	Remove explicit ZK address from Patronictl command	2021-06-22 03:31:06 -04:00
Joshua M. Boniface	2928d695c9	Ensure migration method is updated on state changes	2021-06-22 03:20:15 -04:00
Joshua M. Boniface	26dd24e3f5	Ensure MTU is set on VF when starting up	2021-06-22 02:26:14 -04:00
Joshua M. Boniface	e623909a43	Store PHY MAC for VFs and restore after free	2021-06-22 00:56:47 -04:00
Joshua M. Boniface	60e1da09dd	Don't try any shenanigans when updating NICs Trying to do this on the VMInstance side had problems because we can't differentiate the 3 types of migration there. So, just update this in the API side and hope everything goes well. This introduces an edge bug: if a VM is using a macvtap SR-IOV device, and then tries to migrate, and the migrate is aborted, the NIC lists will be inconsistent. When I revamp the VMInstance in the future, I should be able to correct this, but for now we'll have to live with that edge case.	2021-06-22 00:00:50 -04:00
Joshua M. Boniface	7d42fba373	Ensure being in migrate doesn't abort shutdown	2021-06-21 23:28:53 -04:00
Joshua M. Boniface	24ce361a04	Ensure SR-IOV NIC states are updated on migration	2021-06-21 23:18:34 -04:00
Joshua M. Boniface	eeb83da97d	Add support for SR-IOV NICs to VMs	2021-06-21 23:18:22 -04:00
Joshua M. Boniface	64d1a37b3c	Add PCIe device paths to SR-IOV VF information This will be used when adding VM network interfaces of type hostdev.	2021-06-21 21:08:46 -04:00
Joshua M. Boniface	13cc0f986f	Implement SR-IOV VF config set Also fixes some random bugs, adds proper interface sorting, and assorted tweaks.	2021-06-21 18:40:11 -04:00
Joshua M. Boniface	ca11dbf491	Sort the list of VFs for easier parsing	2021-06-21 01:40:05 -04:00
Joshua M. Boniface	e8bd1bf2c4	Ensure used/used_by are set on creation	2021-06-21 01:25:38 -04:00
Joshua M. Boniface	bff6d71e18	Add logging to SRIOVVFInstance and fix bug	2021-06-17 02:02:41 -04:00
Joshua M. Boniface	57b041dc62	Ensure default for vLAN and QOS is 0 not empty	2021-06-17 01:54:37 -04:00
Joshua M. Boniface	5607a6bb62	Avoid overwriting VF data Ensures that the configuration of a VF is not overwritten in Zookeeper on a node restart. The SRIOVVFInstance handlers were modified to start with None values, so that the DataWatch statements will always trigger updates to the live system interfaces on daemon startup, thus ensuring that the config stored in Zookeeper is applied to the system on startup (mostly relevant after a cold boot or if the API changes them during a daemon restart).	2021-06-17 01:45:22 -04:00
Joshua M. Boniface	8f1af2a642	Ignore hostdev interfaces in VM net stat gathering Prevents errors if a SR-IOV hostdev interface is configured until this is more defined.	2021-06-17 01:33:11 -04:00
Joshua M. Boniface	e7b6a3eac1	Implement SR-IOV PF and VF instances Adds support for the node daemon managing SR-IOV PF and VF instances. PFs are added to Zookeeper automatically based on the config at startup during network configuration, and are otherwise completely static. PFs are automatically removed from Zookeeper, along with all corresponding VFs, should the PF phy device be removed from the configuration. VFs are configured based on the (autocreated) VFs of each PF device, added to Zookeeper, and then a new class instance, SRIOVVFInstance, is used to watch them for configuration changes. This will enable the runtime management of VF settings by the API. The set of keys ensures that both configuration and details of the NIC can be tracked. Most keys are self-explanatory, especially for PFs and the basic keys for VFs. The configuration tree is also self-explanatory, being based entirely on the options available in the `ip link set {dev} vf` command. Two additional keys are also present: `used` and `used_by`, which will be able to track the (boolean) state of usage, as well as the VM that uses a given VIF. Since the VM side implementation will support both macvtap and direct "hostdev" assignments, this will ensure that this state can be tracked on both the VF and the VM side.	2021-06-17 01:33:03 -04:00
Joshua M. Boniface	0ad6d55dff	Add initial SR-IOV support to node daemon Adds configuration values for enabled flag and SR-IOV devices to the configuration and sets up the initial SR-IOV configuration on daemon startup (inserting the module, configuring the VF count, etc.).	2021-06-15 22:56:09 -04:00
Joshua M. Boniface	e4a65230a1	Just do the shutdown command itself	2021-06-15 02:32:14 -04:00
Joshua M. Boniface	284c581845	Ensure shutdown migrations actually time out Without this a VM that fails to respond to a shutdown will just spin forever, blocking state changes.	2021-06-15 00:23:15 -04:00
Joshua M. Boniface	953e46055a	Fix issue with loading None version schema	2021-06-14 21:09:55 -04:00
Joshua M. Boniface	d2bcfe5cf7	Bump version to 0.9.20	2021-06-14 18:06:27 -04:00
Joshua M. Boniface	ef1701b4c8	Handle an additional exception case	2021-06-14 17:15:40 -04:00
Joshua M. Boniface	08dc756549	Actually disable the pvcapid service Prevents it from trying to start itself during updates or reboots on non-primary coordinators.	2021-06-14 17:13:22 -04:00
Joshua M. Boniface	0a9c0c1ccb	Use a nicer reload method on hot schema update Instead of exiting and trusting systemd to restart us, instead leverage the os.execv() call to reload the process in the current PID context. Also improves the log messages so it's very clear what's going on.	2021-06-14 17:10:21 -04:00
Joshua M. Boniface	e34a7d4d2a	Handle hot reloads properly A hot reload isn't possible due to DataWatch and ChildrenWatch constructs, so we instead need to terminate the daemon to "apply" the schema update. Thus we use exit code 150 (Application defined in LSB) and reorder some of the elements of the schema validation to ensure things happen in the right order.	2021-06-14 12:52:43 -04:00
Joshua M. Boniface	1f49bfa1b2	Fix name of schema element	2021-06-13 20:56:17 -04:00
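
The fence outcome rules listed in commit c6d552ae57 (steps 2 and 3) can be read as a small decision table. The following is a minimal sketch of that table only, not the PVC implementation: the function name `fence_is_successful`, its parameters, and the plain-string chassis states are hypothetical, and no IPMI commands or timing (the 1 second and 2 second waits) are modelled.

```python
def fence_is_successful(reboot_succeeded: bool, chassis_state: str) -> bool:
    """Decide whether a fence should be treated as successful.

    Hypothetical helper illustrating the rules in the commit message;
    `chassis_state` is assumed to be "on", "off", or any other string
    for an unknown or unreadable power state.
    """
    if reboot_succeeded:
        # 2a/2b: chassis on or off after a successful reboot -> success.
        # 2c: any other state -> failure.
        return chassis_state in ("on", "off")

    # 3a: reboot failed but the chassis is off, so the node is in a
    #     known-offline state and the fence is treated as successful.
    # 3b: chassis on or unknown -> failure.
    return chassis_state == "off"


if __name__ == "__main__":
    for reboot_ok in (True, False):
        for state in ("on", "off", "unknown"):
            print(reboot_ok, state, fence_is_successful(reboot_ok, state))
```

Under this reading, the only combination that changes relative to the earlier behaviour described in the message is a failed reboot with the chassis off, which now counts as a usable fence so that a fence-flush can proceed.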

298 Commits