parallelvirtualcluster/pvc - pvc

Commit Graph

Author	SHA1	Message	Date
Joshua Boniface	b5f996febd	Fix bugs for node flush for stop/shutdown/restart Previously VMs in stop/shutdown/restart states wouldn't be properly handled during a node flush. This fixes the bugs and ensures that the transient VM states (shutdown/restart) are completed before proceeding, and then avoids setting a stopped/shutdown VM to shutdown/auotstart.	2023-08-18 11:25:59 -04:00
Joshua Boniface	3a90fda109	Bump version to 0.9.63	2023-04-28 14:47:04 -04:00
Joshua Boniface	9114255af5	Add .update- obsolete configs to dpkg plugin	2023-04-10 15:39:40 -04:00
Joshua Boniface	2c3a3cdf52	Use try when watching health value in NodeInstance	2023-03-07 09:53:01 -05:00
Joshua Boniface	0b583bfdaf	Bump IPMI timeout to 2 seconds	2023-03-07 09:25:27 -05:00
Joshua Boniface	7c07fbefff	Adjust keepalive health printing and ordering	2023-02-24 11:08:30 -05:00
Joshua Boniface	202dc3ed59	Correct error handling if monitoring plugins fail	2023-02-24 10:19:41 -05:00
Joshua Boniface	4c2d99f8a6	Fix bug with SMART info	2023-02-23 13:21:23 -05:00
Joshua Boniface	bcff6650d0	Set timeout on IPMI command	2023-02-23 11:10:09 -05:00
Joshua Boniface	a11206253d	Fix ZK check location	2023-02-23 11:04:02 -05:00
Joshua Boniface	45ad3b9a17	Bump version to 0.9.62	2023-02-22 18:13:45 -05:00
Joshua Boniface	dc4e56db4b	Add IPMI monitoring check	2023-02-22 15:02:08 -05:00
Joshua Boniface	e45b3108a2	Add health delta change to message output	2023-02-22 15:02:08 -05:00
Joshua Boniface	118237a53b	Fix bad string value for message	2023-02-22 15:02:08 -05:00
Joshua Boniface	9805681f94	Use consistent connection with other checks	2023-02-22 15:02:08 -05:00
Joshua Boniface	6c9abb2abe	Add Libvirtd monitoring check	2023-02-22 15:02:08 -05:00
Joshua Boniface	a1122c6e71	Add Zookeeper monitoring check	2023-02-22 15:02:08 -05:00
Joshua Boniface	3696f81597	Add PostgreSQL monitoring check	2023-02-22 15:02:08 -05:00
Joshua Boniface	5ca0d903b6	Adjust comment message	2023-02-22 15:02:08 -05:00
Joshua Boniface	626424b74a	Adjust Munin threshold values	2023-02-22 10:42:43 -05:00
Joshua Boniface	c9ceb3159b	Remove obsolete LINKSPEED variable	2023-02-22 01:04:25 -05:00
Joshua Boniface	6525a2568b	Adjust health delta of load to 50 This is a very bad situation and should be critical.	2023-02-22 01:03:12 -05:00
Joshua Boniface	09a005d3d7	Adjust health delta of EDAC Uncorrected to 50 This is a very bad situation and should be critical.	2023-02-22 01:01:54 -05:00
Joshua Boniface	fb0fcc0597	Update readme for Munin plugin	2023-02-18 00:00:04 -05:00
Joshua Boniface	3009f24910	Fix typo in var and flip conditional	2023-02-17 16:18:42 -05:00
Joshua Boniface	5ae836f1c5	Fix various issues with PVC Munin plugin	2023-02-17 15:41:16 -05:00
Joshua Boniface	eda1b95d5f	Update Munin plugin example	2023-02-16 16:06:00 -05:00
Joshua Boniface	3bd93563e6	Add CheckMK monitoring example plugins	2023-02-16 16:05:47 -05:00
Joshua Boniface	1093ca6264	Disallow health less than 0	2023-02-15 16:50:24 -05:00
Joshua Boniface	388f6556c0	Remove extra text from packages plugin	2023-02-15 16:28:41 -05:00
Joshua Boniface	6c7be492b8	Move Ceph health to global cluster health	2023-02-15 15:46:13 -05:00
Joshua Boniface	f4eef30770	Add JSON health to cluster data	2023-02-15 15:26:57 -05:00
Joshua Boniface	8565cf26b3	Add disk monitoring plugin	2023-02-15 11:30:49 -05:00
Joshua Boniface	0ecf219910	Run setup during plugin loads	2023-02-15 10:11:38 -05:00
Joshua Boniface	0f4edc54d1	Use percentage in keepalie output	2023-02-15 01:56:02 -05:00
Joshua Boniface	ca91be51e1	Improve ethtool parsing speeds	2023-02-14 15:49:58 -05:00
Joshua Boniface	e29d0e89eb	Add NIC monitoring plugin	2023-02-14 15:43:52 -05:00
Joshua Boniface	14d29f2986	Adjust text on log message	2023-02-13 22:21:23 -05:00
Joshua Boniface	bc88d764b0	Add logging flag for montioring plugin output	2023-02-13 22:04:39 -05:00
Joshua Boniface	a3c31564ca	Flip condition in EDAC check	2023-02-13 21:58:56 -05:00
Joshua Boniface	b07396c39a	Fix bugs if plugins fail to load	2023-02-13 21:51:48 -05:00
Joshua Boniface	71139fa66d	Add EDAC check plugin	2023-02-13 21:43:13 -05:00
Joshua Boniface	1ea4800212	Set node health to None when restarting	2023-02-13 15:54:46 -05:00
Joshua Boniface	9c14d84bfc	Add node health value and send out API	2023-02-13 15:53:39 -05:00
Joshua Boniface	d8f346abdd	Move Ceph cluster health reporting to plugin Also removes several outputs from the normal keepalive that were superfluous/static so that the main output fits on one line.	2023-02-13 13:29:40 -05:00
Joshua Boniface	2ee52e44d3	Move Ceph cluster health reporting to plugin Also removes several outputs from the normal keepalive that were superfluous/static so that the main output fits on one line.	2023-02-13 12:13:56 -05:00
Joshua Boniface	3c742a827b	Initial implementation of monitoring plugin system	2023-02-13 12:06:26 -05:00
Joshua Boniface	aeb238f43c	Bump version to 0.9.61	2023-02-08 10:08:05 -05:00
Joshua Boniface	a49510ecc8	Bump version to 0.9.60	2022-12-06 15:42:55 -05:00
Joshua Boniface	92feeefd26	Bump version to 0.9.59	2022-11-15 15:50:15 -05:00
Joshua Boniface	38d63d9837	Flip behaviour of memory selectors It didn't make any sense to me for mem(prov) to be the default selector, since this has too many caveats versus mem(free). Switch to using mem(free) as the default (i.e. "mem") and make memprov the alternative.	2022-11-15 15:45:59 -05:00
Joshua Boniface	095bcb2373	Bump version to 0.9.58	2022-11-07 12:27:48 -05:00
Joshua Boniface	d65f512897	Bump version to 0.9.57	2022-11-06 01:39:50 -04:00
Joshua Boniface	c3bc55eff8	Bump version to 0.9.56	2022-10-27 14:21:04 -04:00
Joshua Boniface	6c58d52fa1	Add node autoready oneshot unit This replicates some of the more important functionality of the defunct pvc-flush.service unit. On presence of a trigger file (i.e. /etc/pvc/autoready), it will trigger a "node ready" on boot. It does nothing on shutdown as this must be handled by other mechanisms, though a similar autoflush could be added as well.	2022-10-27 14:09:14 -04:00
Joshua Boniface	726d0a562b	Update copyright header year	2022-10-06 11:55:27 -04:00
Joshua Boniface	f1df1cfe93	Bump version to 0.9.55	2022-10-04 13:21:40 -04:00
Joshua Boniface	5942aa50fc	Avoid raise/handle deadlocks Can cause log flooding in some edge cases and isn't really needed any longer. Use a proper conditional followed by an actual error handler.	2022-10-03 14:04:12 -04:00
Joshua Boniface	239c392892	Bump version to 0.9.54	2022-08-23 11:01:05 -04:00
Joshua Boniface	9b499b9f48	Bump version to 0.9.53	2022-08-12 17:47:11 -04:00
Joshua Boniface	2a21d48128	Bump version to 0.9.52	2022-08-12 11:09:25 -04:00
Joshua Boniface	8d0f26ff7a	Add additional kb_ values to OSD stats Allows for easier parsing later to get e.g. % values and more details on the used amounts.	2022-08-11 11:06:36 -04:00
Joshua Boniface	645b525ad7	Bump version to 0.9.51	2022-07-25 23:25:41 -04:00
Joshua Boniface	ec559aec0d	Remove pvc-flush service This service caused more headaches than it was worth, so remove it. The original goal was to cleanly flush nodes on shutdown and unflush them on startup, but this is tightly controlled by Ansible playbooks at this point, and this is something best left to the Administrator and their particular situation anyways.	2022-07-25 23:21:34 -04:00
Joshua Boniface	932b3c55a3	Bump version to 0.9.50	2022-07-06 16:01:14 -04:00
Joshua Boniface	92e2ff7449	Fix bug with space-containing detect strings	2022-07-06 15:58:57 -04:00
Joshua Boniface	f8cdcb30ba	Add migration selector via free memory Closes #152	2022-05-18 03:47:16 -04:00
Joshua Boniface	51ad2058ed	Bump version to 0.9.49	2022-05-06 15:49:39 -04:00
Joshua Boniface	7a40c7a55b	Add support for replacing/refreshing OSDs Adds commands to both replace an OSD disk, and refresh (reimport) an existing OSD disk on a new node. This handles the cases where an OSD disk should be replaced (either due to upgrades or failures) or where a node is rebuilt in-place and an existing OSD must be re-imported to it. This should avoid the need to do a full remove/add sequence for either case. Also cleans up some aspects of OSD removal that are identical between methods (e.g. using safe-to-destroy and sleeping after stopping) and fixes a bug if an OSD does not truly exist when the daemon starts up.	2022-05-06 15:32:06 -04:00
Joshua Boniface	3801fcc07b	Fix bug with initial JSON for stats	2022-05-02 13:28:19 -04:00
Joshua Boniface	c741900baf	Refactor OSD removal to use new ZK data With the OSD LVM information stored in Zookeeper, we can use this to determine the actual block device to zap rather than relying on runtime determination and guestimation.	2022-05-02 12:52:22 -04:00
Joshua Boniface	464f0e0356	Store additional OSD information in ZK Ensures that information like the FSIDs and the OSD LVM volume are stored in Zookeeper at creation time and updated at daemon start time (to ensure the data is populated at least once, or if the /dev/sdX path changes). This will allow safer operation of OSD removals and the potential implementation of re-activation after node replacements.	2022-05-02 12:11:39 -04:00
Joshua Boniface	cea8832f90	Ensure initial OSD stats is populated Values are all invalid but this ensures the client won't error out when trying to show an OSD that has never checked in yet.	2022-04-29 16:50:30 -04:00
Joshua Boniface	5807351405	Bump version to 0.9.48	2022-04-29 15:03:52 -04:00
Joshua Boniface	d6ca74376a	Fix bugs with forced removal	2022-04-29 14:03:07 -04:00
Joshua Boniface	4d698be34b	Add OSD removal force option Ensures a removal can continue even in situations where some step(s) might fail, for instance removing an obsolete OSD from a replaced node.	2022-04-29 11:16:33 -04:00
Joshua Boniface	ea709f573f	Bump version to 0.9.47	2021-12-28 22:03:08 -05:00
Joshua Boniface	58d57d7037	Bump version to 0.9.46	2021-12-28 15:02:14 -05:00
Joshua Boniface	00d2c67c41	Allow single-node clusters to restart and timeout Prevents a daemon from waiting forever to terminate if it is primary, and avoids this entirely if there is only a single node in the cluster.	2021-12-28 03:06:03 -05:00
Joshua Boniface	67131de4f6	Fix bug when removing OSDs Ensure the OSD is down as well as out or purge might fail.	2021-12-28 03:05:34 -05:00
Joshua Boniface	abc23ebb18	Handle detect strings as arguments for blockdevs Allows specifying blockdevs in the OSD and OSD-DB addition commands as detect strings rather than actual block device paths. This provides greater flexibility for automation with pvcbootstrapd (which originates the concept of detect strings) and in general usage as well.	2021-12-28 02:53:02 -05:00
Joshua Boniface	f164d898c1	Bump version to 0.9.45	2021-11-25 09:34:20 -05:00
Joshua Boniface	a8899a1d66	Fix ordering of pvcnoded unit We want to be after network.target and want network-online.target	2021-11-18 16:56:49 -05:00
Joshua Boniface	817dffcf30	Bump version to 0.9.44	2021-11-11 16:20:38 -05:00
Joshua Boniface	eda2a57a73	Add Munin plugin for Ceph utilization	2021-11-08 15:21:09 -05:00
Joshua Boniface	6e9fcd38a3	Bump version to 0.9.43	2021-11-08 02:29:17 -05:00
Joshua Boniface	78faa90139	Reformat recent changes with Black	2021-11-06 03:27:07 -04:00
Joshua Boniface	23b1501f40	Fix linting error F541 f-string placeholders	2021-11-06 03:26:03 -04:00
Joshua Boniface	66bfad3109	Fix linting errors F522/F523 unused args	2021-11-06 03:24:50 -04:00
Joshua Boniface	c41664d2da	Reformat code with Black code formatter Unify the code style along PEP and Black principles using the tool.	2021-11-06 03:02:43 -04:00
Joshua Boniface	2e7b9b28b3	Add some delay and additional tries to fencing	2021-10-27 16:24:17 -04:00
Joshua Boniface	55f397a347	Fix bad location of config sets	2021-10-12 17:23:04 -04:00
Joshua Boniface	dfebb2d3e5	Also validate on failures	2021-10-12 17:11:03 -04:00
Joshua Boniface	e88147db4a	Bump version to 0.9.42	2021-10-12 15:25:42 -04:00
Joshua Boniface	b8204d89ac	Go back to passing if exception Validation already happened and the set happens again later.	2021-10-12 14:21:52 -04:00
Joshua Boniface	fe73dfbdc9	Use current live value for bridge_mtu This will ensure that upgrading without the bridge_mtu config key set will keep things as they are.	2021-10-12 12:24:03 -04:00
Joshua Boniface	8f906c1f81	Use power off in fence instead of reset Use a power off (and then make the power on a requirement) during a node fence. Removes some potential ambiguity in the power state, since we will know for certain if it is off.	2021-10-12 11:04:27 -04:00
Joshua Boniface	2d9fb9688d	Validate network MTU after initial read	2021-10-12 10:53:17 -04:00
Joshua Boniface	f13cc04b89	Bump version to 0.9.41	2021-10-09 19:39:21 -04:00
Joshua Boniface	95e01f38d5	Adjust log type of object setup message	2021-10-09 19:23:12 -04:00
Joshua Boniface	3122d73bf5	Avoid duplicate runs of MTU set It wasn't the validator duplicating, but the update duplicating, so avoid that happening properly this time.	2021-10-09 19:21:47 -04:00
Joshua Boniface	7ed8ef179c	Revert "Avoid duplicate runs of MTU validator" This reverts commit `56021c443a`.	2021-10-09 19:11:42 -04:00
Joshua Boniface	caead02b2a	Set all log messages to information state None of these were "success" messages and thus shouldn't have been ok state.	2021-10-09 19:09:38 -04:00
Joshua Boniface	87bc5f93e6	Avoid duplicate runs of MTU validator	2021-10-09 19:07:41 -04:00
Joshua Boniface	203893559e	Use correct isinstance instead of type	2021-10-09 19:03:31 -04:00
Joshua Boniface	2c51bb0705	Move MTU validation to function Prevents code duplication and ensures validation runs when an MTU is updated, not just on network creation.	2021-10-09 19:01:45 -04:00
Joshua Boniface	46d3daf686	Add logger message when setting MTU	2021-10-09 18:56:18 -04:00
Joshua Boniface	e9d05aa24e	Ensure vx_mtu is always an int()	2021-10-09 18:52:50 -04:00
Joshua Boniface	6ce28c43af	Add MTU value checking and log messages Ensures that if a specified MTU is more than the maximum it is set to the maximum instead, and adds warning messages for both situations.	2021-10-09 18:48:56 -04:00
Joshua Boniface	c45f8f5bd5	Have VXNetworkInstance set MTU if unset Makes this explicit in Zookeeper if a network is unset, post-migration (schema version 6). Addresses #144	2021-10-09 17:52:57 -04:00
Joshua Boniface	3690a2c1e0	Fix migration bugs and invalid vx_mtu Addresses #144	2021-10-09 17:35:10 -04:00
Joshua Boniface	50d8aa0586	Add handlers for client network MTUs Refactors some of the code in VXNetworkInterface to handle MTUs in a more streamlined fashion. Also fixes a bug whereby bridge client networks were being explicitly given the cluster dev MTU which might not be correct. Now adds support for this option explicitly in the configs, and defaults to 1500 for safety (the standard Ethernet MTU). Addresses #144	2021-10-09 17:02:27 -04:00
Joshua Boniface	6ee4c55071	Correct flawed conditional in verify_ipmi	2021-10-07 15:11:19 -04:00
Joshua Boniface	c27359c4bf	Bump version to 0.9.40	2021-10-07 14:42:04 -04:00
Joshua Boniface	46078932c3	Correct bad stop_keepalive_timer call	2021-10-07 14:41:12 -04:00
Joshua Boniface	bdb9db8375	Bump version to 0.9.39	2021-10-07 11:52:38 -04:00
Joshua Boniface	da9248cfa2	Bump version to 0.9.38	2021-10-03 22:32:41 -04:00
Joshua Boniface	23977b04fc	Bump version to 0.9.37	2021-09-30 02:08:14 -04:00
Joshua Boniface	f6f6f07488	Add timeouts to queue gets and adjust Ensure that all keepalive timeouts are set (prevent the queue.get() actions from blocking forever) and set the thread timeouts to line up as well. Everything here is thus limited to keepalive_interval seconds (default 5s) to keep it uniform.	2021-09-27 16:10:27 -04:00
Joshua Boniface	142c999ce8	Re-add success log output during migration	2021-09-27 11:50:55 -04:00
Joshua Boniface	1de069298c	Fix missing character in log message	2021-09-27 00:49:43 -04:00
Joshua Boniface	55221b3d97	Simplify VM migration down to 3 steps Remove two superfluous synchronization steps which are not needed here, since the exclusive lock handles that situation anyways. Still does not fix the weird flush->unflush lock timeout bug, but is better worked-around now due to the cancelling of the other wait freeing this up and continuing.	2021-09-27 00:03:20 -04:00
Joshua Boniface	0d72798814	Work around synchronization lock issues Make the block on stage C only wait for 900 seconds (15 minutes) to prevent indefinite blocking. The issue comes if a VM is being received, and the current unflush is cancelled for a flush. When this happens, this lock acquisition seems to block for no obvious reason, and no other changes seem to affect it. This is certainly some sort of locking bug within Kazoo but I can't diagnose it as-is. Leave a TODO to look into this again in the future.	2021-09-26 23:26:21 -04:00
Joshua Boniface	3638efc77e	Improve log messages during VM migration	2021-09-26 23:15:38 -04:00
Joshua Boniface	c2c888d684	Use event to non-block wait and fix inf wait	2021-09-26 22:55:39 -04:00
Joshua Boniface	febef2e406	Track status of VM state thread	2021-09-26 22:55:21 -04:00
Joshua Boniface	2a4f38e933	Simplify locking process for VM migration Rather than using a cumbersome and overly complex ping-pong of read and write locks, instead move to a much simpler process using exclusive locks. Describing the process in ASCII or narrative is cumbersome, but the process ping-pongs via a set of exclusive locks and wait timers, so that the two sides are able to synchronize via blocking the exclusive lock. The end result is a much more streamlined migration (takes about half the time all things considered) which should be less error-prone.	2021-09-26 22:08:07 -04:00
Joshua Boniface	3b805cdc34	Fix failure to connect to libvirt in keepalive This should be caught and abort the thread rather than failing and holding up keepalives.	2021-09-26 20:42:01 -04:00
Joshua Boniface	06f0f7ed91	Fix several bugs in fence handling 1. Output from ipmitool was not being stripped, and stray newlines were throwing off the comparisons. Fixes this. 2. Several stages were lacking meaningful messages. Adds these in so the output is more clear about what is going on. 3. Reduce the sleep time after a fence to just 1x the keepalive_interval, rather than 2x, because this seemed like excessively long even for slow IPMI interfaces, especially since we're checking the power state now anyways. 4. Set the node daemon state to an explicit 'fenced' state after a successful fence to indicate to users that the node was indeed fenced successfully and not still 'dead'.	2021-09-26 20:07:30 -04:00
Joshua Boniface	fd040ab45a	Ensure pvc-flush is after network-online	2021-09-26 17:40:42 -04:00
Joshua Boniface	e23e2dd9bf	Fix typo in log message	2021-09-26 03:35:30 -04:00
Joshua Boniface	0f02c5eaef	Fix typo in sgdisk command options	2021-09-26 00:59:05 -04:00
Joshua Boniface	075abec5fe	Use re.search instead of re.match Required since we're not matching the start of the string.	2021-09-26 00:55:29 -04:00
Joshua Boniface	3a1cbf8d01	Raise basic exceptions in CephInstance Avoids no exception to reraise errors on failures.	2021-09-26 00:50:10 -04:00
Joshua Boniface	a438a4155a	Fix OSD creation for partition paths and fix gdisk The previous implementation did not work with /dev/nvme devices or any /dev/disk/by-* devices due to some logical failures in the partition naming scheme, so fix these, and be explicit about what is supported in the PVC CLI command output. The 'echo \| gdisk' implementation of partition creation also did not work due to limitations of subprocess.run; instead, use sgdisk which allows these commands to be written out explicitly and is included in the same package as gdisk.	2021-09-26 00:12:28 -04:00
Joshua Boniface	65df807b09	Add support for configurable OSD DB ratios The default of 0.05 (5%) is likely ideal in the initial implementation, but allow this to be set explicitly for maximum flexibility in space-constrained or performance-critical use-cases.	2021-09-24 01:06:39 -04:00
Joshua Boniface	d0f3e9e285	Bump version to 0.9.36	2021-09-23 14:01:38 -04:00
Joshua Boniface	adc8a5a3bc	Add separate OSD DB device support Adds in three parts: 1. Create an API endpoint to create OSD DB volume groups on a device. Passed through to the node via the same command pipeline as creating/removing OSDs, and creates a volume group with a fixed name (osd-db). 2. Adds API support for specifying whether or not to use this DB volume group when creating a new OSD via the "ext_db" flag. Naming and sizing is fixed for simplicity and based on Ceph recommendations (5% of OSD size). The Zookeeper schema tracks the block device to use during removal. 3. Adds CLI support for the new and modified API endpoints, as well as displaying the block device and DB block device in the OSD list. While I debated supporting adding a DB device to an existing OSD, in practice this ended up being a very complex operation involving stopping the OSD and setting some options, so this is not supported; this can be specified during OSD creation only. Closes #142	2021-09-23 13:59:49 -04:00
Joshua Boniface	df277edf1c	Move console watcher stop try up Could cause an exception if d_domain is not defined yet.	2021-09-22 16:02:04 -04:00
Joshua Boniface	772807deb3	Bump version to 0.9.35	2021-09-13 02:20:46 -04:00
Joshua Boniface	f3fb492633	Handle VM disk/network stats gathering exceptions	2021-09-12 19:41:07 -04:00
Joshua Boniface	e962743e51	Add VM device hot attach/detach support Adds a new API endpoint to support hot attach/detach of devices, and the corresponding client-side logic to use this endpoint when doing VM network/storage add/remove actions. The live attach is now the default behaviour for these types of additions and removals, and can be disabled if needed. Closes #141	2021-09-12 19:33:00 -04:00
Joshua Boniface	be954c1625	Don't crash cleanup if no this_node	2021-08-29 03:52:18 -04:00
Joshua Boniface	fb46f5f9e9	Change default node object state to flushed	2021-08-29 03:34:08 -04:00
Joshua Boniface	694b8e85a0	Bump version to 0.9.34	2021-08-24 16:15:25 -04:00
Joshua Boniface	a4c0e0befd	Fix typo in output message	2021-08-23 00:39:19 -04:00
Joshua Boniface	a18cef5f25	Bump version to 0.9.33	2021-08-21 03:28:48 -04:00
Joshua Boniface	afb0359c20	Refactor pvcnoded to reduce Daemon.py size This branch commit refactors the pvcnoded component to better adhere to good programming practices. The previous Daemon.py was a massive file which contained almost 2000 lines of direct, root-level code which was directly imported. Not only was this poor practice, but this resulted in a nigh-unmaintainable file which was hard even for me to understand. This refactoring splits a large section of the code from Daemon.py into separate small modules and functions in the `util/` directory. This will hopefully make most of the functionality easy to find and modify without having to dig through a single large file. Further the existing subcomponents have been moved to the `objects/` directory which clearly separates them. Finally, the Daemon.py code has mostly been moved into a function, `entrypoint()`, which is then called from the `pvcnoded.py` stub. An additional item is that most format strings have been replaced by f-strings to make use of the Python 3.6 features in Daemon.py and the utility files.	2021-08-21 03:14:22 -04:00
Joshua Boniface	afdf254297	Bump version to 0.9.32	2021-08-19 12:37:58 -04:00
Joshua Boniface	42e776fac1	Properly handle exceptions getting VM stats	2021-08-19 12:36:31 -04:00
Joshua Boniface	7ecc6a2635	Bump version to 0.9.31	2021-07-30 12:08:12 -04:00
Joshua Boniface	3ab6365a53	Adjust receive output to show proper source	2021-07-22 15:43:08 -04:00
Joshua Boniface	2a99a27feb	Bump version to 0.9.30	2021-07-20 00:01:45 -04:00
Joshua Boniface	fa1d93e933	Bump version to 0.9.29	2021-07-19 16:55:41 -04:00
Joshua Boniface	6ead21a308	Handle cleanup from a failure properly	2021-07-19 12:39:13 -04:00
Joshua Boniface	b7c8c2ee3d	Fix handling of this_node and d_domain in cleanup	2021-07-19 12:36:35 -04:00
Joshua Boniface	d48f58930b	Use harder exits and add cleanup termination	2021-07-19 12:27:16 -04:00
Joshua Boniface	7c36388c8f	Add post-networking delay and adjust daemon delay	2021-07-19 12:23:45 -04:00
Joshua Boniface	71e4d0b32a	Bump version to 0.9.28	2021-07-19 09:29:34 -04:00
Joshua Boniface	15d92c483f	Bump version to 0.9.27	2021-07-19 00:03:40 -04:00
Joshua Boniface	602093029c	Bump version to 0.9.26	2021-07-18 20:49:52 -04:00
Joshua Boniface	b770e15a91	Fix final termination of logger We need to do a bit more finagling with the logger on termination to ensure that all messages are written and the queue drained before actually terminating.	2021-07-18 19:53:00 -04:00
Joshua Boniface	e23a65128a	Remove del of logger item	2021-07-18 19:03:47 -04:00
Joshua Boniface	3a2478ee0c	Cleanly terminate logger on cleanup	2021-07-18 18:57:44 -04:00
Joshua Boniface	323c7c41ae	Implement node logging into Zookeeper Adds the ability to send node daemon logs to Zookeeper to facilitate a command like "pvc node log", similar to "pvc vm log". Each node stores its logs in a separate tree under "/logs" which can then be combined or queried. By default, set by config, only 2000 lines are kept.	2021-07-18 17:11:43 -04:00
Joshua Boniface	cd1db3d587	Ensure node name is part of confing	2021-07-18 16:38:58 -04:00
Joshua Boniface	75fb60b1b4	Add VM list filtering by tag Uses same method as state or node filtering, rather than altering how the main LIMIT field works.	2021-07-14 00:59:20 -04:00
Joshua Boniface	c6d552ae57	Rework success checks for IPMI fencing Previously, if the node failed to restart, it was declared a "bad fence" and no further action would be taken. However, there are some situations, for instance critical hardware failures, where intelligent systems will not attempt (or succeed at) starting up the node in such a case, which would result in dead, known-offline nodes without recovery. Tweak this behaviour somewhat. The main path of Reboot -> Check On -> Success + fence-flush is retained, but some additional side-paths are now defined: 1. We attempt to power "on" the chassis 1 second after the reboot, just in case it is off and can be recovered. We then wait another 2 seconds and check the power status (as we did before). 2. If the reboot succeeded, follow this series of choices: a. If the chassis is on, the fence succeeded. b. If the chassis is off, the fence "succeeded" as well. c. If the chassis is in some other state, the fence failed. 3. If the reboot failed, follow this series of choices: a. If the chassis is off, the fence itself failed, but we can treat it as "succeeded"" since the chassis is in a known-offline state. This is the most likely situation when there is a critical hardware failure, and the server's IPMI does not allow itself to start back up again. b. If the chassis is in any other state ("on" or unknown), the fence itself failed and we must treat this as a fence failure. Overall, this should alleviate the aforementioned issue of a critical failure rendering the node persistently "off" not triggering a fence-flush and ensure fencing is more robust.	2021-07-13 17:54:41 -04:00
Joshua Boniface	2e9f6ac201	Bump version to 0.9.25	2021-07-11 23:19:09 -04:00
Joshua Boniface	f09849bedf	Don't overwrite shutdown state on termination Just a minor quibble and not really impactful.	2021-07-11 23:18:14 -04:00
Joshua Boniface	c76149141f	Only log ZK connections when persistent Prevents spam in the API logs.	2021-07-10 23:35:49 -04:00
Joshua Boniface	f00c4d07f4	Add date output to keepalive Helps track when there is a log follow in "-o cat" mode.	2021-07-10 23:24:59 -04:00
Joshua Boniface	20b66c10e1	Move two more commands to Rados library	2021-07-10 17:28:42 -04:00
Joshua Boniface	cfeba50b17	Revert "Return to all command-based Ceph gathering" This reverts commit `65d14ccd92`. This was actually a bad idea. For inexplicable reasons, running these Ceph commands manually (not even via Python, but in a normal shell) takes 7 * two orders of magnitude longer than running them with the Rados module, so long in fact that some basic commands like "ceph health" would sometimes take longer than the 1 second timeout to complete. The Rados commands would however take about 1ms instead. Despite the occasional issues when monitors drop out, the Rados module is clearly far superior to the shell commands for any moderately-loaded Ceph cluster. We can look into solving timeouts another way (perhaps with Processes instead of Threads) at a later time. Rados module "ceph health": b'{"checks":{},"status":"HEALTH_OK"}' 0.001204 (s) b'{"checks":{},"status":"HEALTH_OK"}' 0.001258 (s) Command "ceph health": joshua@hv1.c.bonilan.net ~ $ time ceph health >/dev/null real 0m0.772s user 0m0.707s sys 0m0.046s joshua@hv1.c.bonilan.net ~ $ time ceph health >/dev/null real 0m0.796s user 0m0.728s sys 0m0.054s	2021-07-10 03:47:45 -04:00
Joshua Boniface	551bae2518	Bump version to 0.9.24	2021-07-09 15:58:36 -04:00
Joshua Boniface	2b5dc286ab	Correct failure to get ceph_health data	2021-07-09 13:10:28 -04:00
Joshua Boniface	330cf14638	Remove return statements in keepalive collectors These seem to bork the keepalive timer process, so just remove them and let it continue to press on.	2021-07-09 13:04:17 -04:00
Joshua Boniface	65d14ccd92	Return to all command-based Ceph gathering Using the Rados module was very problematic, specifically because it had no sensible timeout parameters and thus would hang for many seconds. This has poor implications since it blocks further keepalives. Instead, remove the Rados usage entirely and go back completely to using manual OS commands to gather this information. While this may cause PID exhaustion more quickly it's worthwhile to avoid failure scenarios when Ceph stats time out. Closes #137	2021-07-06 11:30:45 -04:00
Joshua Boniface	7082982a33	Bump version to 0.9.23	2021-07-05 23:40:32 -04:00
Joshua Boniface	5b6ef71909	Ensure daemon mode is updated on startup Fixes the side effect of the previous bug during deploys of 0.9.22.	2021-07-05 23:39:23 -04:00
Joshua Boniface	be7b0be8ed	Fix typo in schema path name	2021-07-05 23:23:23 -04:00
Joshua Boniface	37cd278bc2	Bump version to 0.9.22	2021-07-05 14:18:51 -04:00
Joshua Boniface	a69105569f	Add node PVC version data to Node information Allows API client to see the currently-active version of the node daemon.	2021-07-05 09:57:38 -04:00
Joshua Boniface	21a1a7da9e	Fix bad schema reference Not sure how this didn't cause an issue until now, but the wrong key path was used and this was getting unexpected data with the newly-added version string instead of the proper mode string.	2021-07-05 09:53:51 -04:00
Joshua Boniface	f0fd3d3f0e	Make extra sure VMs terminate when told When doing a stop_vm or terminate_vm, check again after 0.2 seconds and try re-terminating if it's still running. Covers cases where a VM doesn't stop if given the 'stop' state.	2021-07-02 11:40:34 -04:00
Joshua Boniface	f12de6727d	Adjust logo slightly and add debug state	2021-07-02 02:32:08 -04:00
Joshua Boniface	e94f5354e6	Update startup messages with new ASCII logo	2021-07-02 02:21:30 -04:00
Joshua Boniface	c51023ba81	Add profiler to keepalive function	2021-07-02 01:55:15 -04:00
Joshua Boniface	39e82ee426	Cast base schema version to int Or all our comparisons will fail later and nodes can't start.	2021-06-30 09:40:33 -04:00
Joshua Boniface	fe0a1d582a	Bump version to 0.9.21	2021-06-29 19:21:31 -04:00
Joshua Boniface	3490ecbb59	Remove explicit ZK address from Patronictl command	2021-06-22 03:31:06 -04:00
Joshua Boniface	2928d695c9	Ensure migration method is updated on state changes	2021-06-22 03:20:15 -04:00
Joshua Boniface	26dd24e3f5	Ensure MTU is set on VF when starting up	2021-06-22 02:26:14 -04:00
Joshua Boniface	e623909a43	Store PHY MAC for VFs and restore after free	2021-06-22 00:56:47 -04:00
Joshua Boniface	60e1da09dd	Don't try any shenannegans when updating NICs Trying to do this on the VMInstance side had problems because we can't differentiate the 3 types of migration there. So, just update this in the API side and hope everything goes well. This introduces an edge bug: if a VM is using a macvtap SR-IOV device, and then tries to migrate, and the migrate is aborted, the NIC lists will be inconsistent. When I revamp the VMInstance in the future, I should be able to correct this, but for now we'll have to live with that edgecase.	2021-06-22 00:00:50 -04:00
Joshua Boniface	7d42fba373	Ensure being in migrate doesn't abort shutdown	2021-06-21 23:28:53 -04:00
Joshua Boniface	24ce361a04	Ensure SR-IOV NIC states are updated on migration	2021-06-21 23:18:34 -04:00
Joshua Boniface	eeb83da97d	Add support for SR-IOV NICs to VMs	2021-06-21 23:18:22 -04:00
Joshua Boniface	64d1a37b3c	Add PCIe device paths to SR-IOV VF information This will be used when adding VM network interfaces of type hostdev.	2021-06-21 21:08:46 -04:00
Joshua Boniface	13cc0f986f	Implement SR-IOV VF config set Also fixes some random bugs, adds proper interface sorting, and assorted tweaks.	2021-06-21 18:40:11 -04:00

... 2 3 4 5 6 ...

856 Commits