parallelvirtualcluster/pvc

Author	SHA1	Message	Date
Joshua M. Boniface	c8134d3a1c	Fix several bugs in fence handling 1. Output from ipmitool was not being stripped, and stray newlines were throwing off the comparisons. Fixes this. 2. Several stages were lacking meaningful messages. Adds these in so the output is more clear about what is going on. 3. Reduce the sleep time after a fence to just 1x the keepalive_interval, rather than 2x, because 2x seemed excessively long even for slow IPMI interfaces, especially since we're checking the power state now anyway. 4. Set the node daemon state to an explicit 'fenced' state after a successful fence to indicate to users that the node was indeed fenced successfully and not still 'dead'.	2021-09-26 20:07:30 -04:00
Joshua M. Boniface	9f41373324	Ensure pvc-flush is after network-online	2021-09-26 17:40:42 -04:00
Joshua M. Boniface	8e62d5b30b	Fix typo in log message	2021-09-26 03:35:30 -04:00
Joshua M. Boniface	7df5b8e52e	Fix typo in sgdisk command options	2021-09-26 00:59:05 -04:00
Joshua M. Boniface	6f96219023	Use re.search instead of re.match Required since we're not matching the start of the string.	2021-09-26 00:55:29 -04:00
Joshua M. Boniface	51967e164b	Raise basic exceptions in CephInstance Avoids "no exception to reraise" errors on failures.	2021-09-26 00:50:10 -04:00
Joshua M. Boniface	7a3a44d47c	Fix OSD creation for partition paths and fix gdisk The previous implementation did not work with /dev/nvme devices or any /dev/disk/by-* devices due to some logical failures in the partition naming scheme, so fix these, and be explicit about what is supported in the PVC CLI command output. The 'echo | gdisk' implementation of partition creation also did not work due to limitations of subprocess.run; instead, use sgdisk which allows these commands to be written out explicitly and is included in the same package as gdisk.	2021-09-26 00:12:28 -04:00
Joshua M. Boniface	44491dd988	Add support for configurable OSD DB ratios The default of 0.05 (5%) is likely ideal in the initial implementation, but allow this to be set explicitly for maximum flexibility in space-constrained or performance-critical use-cases.	2021-09-24 01:06:39 -04:00
Joshua M. Boniface	eba142f470	Bump version to 0.9.36	2021-09-23 14:01:38 -04:00
Joshua M. Boniface	6cef68d157	Add separate OSD DB device support Adds in three parts: 1. Create an API endpoint to create OSD DB volume groups on a device. Passed through to the node via the same command pipeline as creating/removing OSDs, and creates a volume group with a fixed name (osd-db). 2. Adds API support for specifying whether or not to use this DB volume group when creating a new OSD via the "ext_db" flag. Naming and sizing are fixed for simplicity and based on Ceph recommendations (5% of OSD size). The Zookeeper schema tracks the block device to use during removal. 3. Adds CLI support for the new and modified API endpoints, as well as displaying the block device and DB block device in the OSD list. While I debated supporting adding a DB device to an existing OSD, in practice this ended up being a very complex operation involving stopping the OSD and setting some options, so this is not supported; this can be specified during OSD creation only. Closes #142	2021-09-23 13:59:49 -04:00
Joshua M. Boniface	e8caf3369e	Move console watcher stop try up Could cause an exception if d_domain is not defined yet.	2021-09-22 16:02:04 -04:00
Joshua M. Boniface	3e3776a25b	Bump version to 0.9.35	2021-09-13 02:20:46 -04:00
Joshua M. Boniface	1b6d10e03a	Handle VM disk/network stats gathering exceptions	2021-09-12 19:41:07 -04:00
Joshua M. Boniface	73c96d1e93	Add VM device hot attach/detach support Adds a new API endpoint to support hot attach/detach of devices, and the corresponding client-side logic to use this endpoint when doing VM network/storage add/remove actions. The live attach is now the default behaviour for these types of additions and removals, and can be disabled if needed. Closes #141	2021-09-12 19:33:00 -04:00
Joshua M. Boniface	bc6395c959	Don't crash cleanup if no this_node	2021-08-29 03:52:18 -04:00
Joshua M. Boniface	d582f87472	Change default node object state to flushed	2021-08-29 03:34:08 -04:00
Joshua M. Boniface	e9735113af	Bump version to 0.9.34	2021-08-24 16:15:25 -04:00
Joshua M. Boniface	d3392c0282	Fix typo in output message	2021-08-23 00:39:19 -04:00
Joshua M. Boniface	560c013e95	Bump version to 0.9.33	2021-08-21 03:28:48 -04:00
Joshua M. Boniface	534c7cd7f0	Refactor pvcnoded to reduce Daemon.py size This branch commit refactors the pvcnoded component to better adhere to good programming practices. The previous Daemon.py was a massive file which contained almost 2000 lines of direct, root-level code which was directly imported. Not only was this poor practice, but this resulted in a nigh-unmaintainable file which was hard even for me to understand. This refactoring splits a large section of the code from Daemon.py into separate small modules and functions in the `util/` directory. This will hopefully make most of the functionality easy to find and modify without having to dig through a single large file. Further the existing subcomponents have been moved to the `objects/` directory which clearly separates them. Finally, the Daemon.py code has mostly been moved into a function, `entrypoint()`, which is then called from the `pvcnoded.py` stub. An additional item is that most format strings have been replaced by f-strings to make use of the Python 3.6 features in Daemon.py and the utility files.	2021-08-21 03:14:22 -04:00
Joshua M. Boniface	4014ef7714	Bump version to 0.9.32	2021-08-19 12:37:58 -04:00
Joshua M. Boniface	180f0445ac	Properly handle exceptions getting VM stats	2021-08-19 12:36:31 -04:00
Joshua M. Boniface	7ecc6a2635	Bump version to 0.9.31	2021-07-30 12:08:12 -04:00
Joshua M. Boniface	3ab6365a53	Adjust receive output to show proper source	2021-07-22 15:43:08 -04:00
Joshua M. Boniface	2a99a27feb	Bump version to 0.9.30	2021-07-20 00:01:45 -04:00
Joshua M. Boniface	fa1d93e933	Bump version to 0.9.29	2021-07-19 16:55:41 -04:00
Joshua M. Boniface	6ead21a308	Handle cleanup from a failure properly	2021-07-19 12:39:13 -04:00
Joshua M. Boniface	b7c8c2ee3d	Fix handling of this_node and d_domain in cleanup	2021-07-19 12:36:35 -04:00
Joshua M. Boniface	d48f58930b	Use harder exits and add cleanup termination	2021-07-19 12:27:16 -04:00
Joshua M. Boniface	7c36388c8f	Add post-networking delay and adjust daemon delay	2021-07-19 12:23:45 -04:00
Joshua M. Boniface	71e4d0b32a	Bump version to 0.9.28	2021-07-19 09:29:34 -04:00
Joshua M. Boniface	15d92c483f	Bump version to 0.9.27	2021-07-19 00:03:40 -04:00
Joshua M. Boniface	602093029c	Bump version to 0.9.26	2021-07-18 20:49:52 -04:00
Joshua M. Boniface	b770e15a91	Fix final termination of logger We need to do a bit more finagling with the logger on termination to ensure that all messages are written and the queue drained before actually terminating.	2021-07-18 19:53:00 -04:00
Joshua M. Boniface	e23a65128a	Remove del of logger item	2021-07-18 19:03:47 -04:00
Joshua M. Boniface	3a2478ee0c	Cleanly terminate logger on cleanup	2021-07-18 18:57:44 -04:00
Joshua M. Boniface	323c7c41ae	Implement node logging into Zookeeper Adds the ability to send node daemon logs to Zookeeper to facilitate a command like "pvc node log", similar to "pvc vm log". Each node stores its logs in a separate tree under "/logs" which can then be combined or queried. By default, set by config, only 2000 lines are kept.	2021-07-18 17:11:43 -04:00
Joshua M. Boniface	cd1db3d587	Ensure node name is part of config	2021-07-18 16:38:58 -04:00
Joshua M. Boniface	75fb60b1b4	Add VM list filtering by tag Uses same method as state or node filtering, rather than altering how the main LIMIT field works.	2021-07-14 00:59:20 -04:00
Joshua M. Boniface	c6d552ae57	Rework success checks for IPMI fencing Previously, if the node failed to restart, it was declared a "bad fence" and no further action would be taken. However, there are some situations, for instance critical hardware failures, where intelligent systems will not attempt (or succeed at) starting up the node in such a case, which would result in dead, known-offline nodes without recovery. Tweak this behaviour somewhat. The main path of Reboot -> Check On -> Success + fence-flush is retained, but some additional side-paths are now defined: 1. We attempt to power "on" the chassis 1 second after the reboot, just in case it is off and can be recovered. We then wait another 2 seconds and check the power status (as we did before). 2. If the reboot succeeded, follow this series of choices: a. If the chassis is on, the fence succeeded. b. If the chassis is off, the fence "succeeded" as well. c. If the chassis is in some other state, the fence failed. 3. If the reboot failed, follow this series of choices: a. If the chassis is off, the fence itself failed, but we can treat it as "succeeded" since the chassis is in a known-offline state. This is the most likely situation when there is a critical hardware failure, and the server's IPMI does not allow itself to start back up again. b. If the chassis is in any other state ("on" or unknown), the fence itself failed and we must treat this as a fence failure. Overall, this should alleviate the aforementioned issue of a critical failure rendering the node persistently "off" not triggering a fence-flush and ensure fencing is more robust.	2021-07-13 17:54:41 -04:00
Joshua M. Boniface	2e9f6ac201	Bump version to 0.9.25	2021-07-11 23:19:09 -04:00
Joshua M. Boniface	f09849bedf	Don't overwrite shutdown state on termination Just a minor quibble and not really impactful.	2021-07-11 23:18:14 -04:00
Joshua M. Boniface	c76149141f	Only log ZK connections when persistent Prevents spam in the API logs.	2021-07-10 23:35:49 -04:00
Joshua M. Boniface	f00c4d07f4	Add date output to keepalive Helps track when there is a log follow in "-o cat" mode.	2021-07-10 23:24:59 -04:00
Joshua M. Boniface	20b66c10e1	Move two more commands to Rados library	2021-07-10 17:28:42 -04:00
Joshua M. Boniface	cfeba50b17	Revert "Return to all command-based Ceph gathering" This reverts commit `65d14ccd92`. This was actually a bad idea. For inexplicable reasons, running these Ceph commands manually (not even via Python, but in a normal shell) takes roughly 600 to 700 times longer than running them with the Rados module, so long in fact that some basic commands like "ceph health" would sometimes take longer than the 1 second timeout to complete. The Rados commands would however take about 1ms instead. Despite the occasional issues when monitors drop out, the Rados module is clearly far superior to the shell commands for any moderately-loaded Ceph cluster. We can look into solving timeouts another way (perhaps with Processes instead of Threads) at a later time. Rados module "ceph health": b'{"checks":{},"status":"HEALTH_OK"}' 0.001204 (s) b'{"checks":{},"status":"HEALTH_OK"}' 0.001258 (s) Command "ceph health": joshua@hv1.c.bonilan.net ~ $ time ceph health >/dev/null real 0m0.772s user 0m0.707s sys 0m0.046s joshua@hv1.c.bonilan.net ~ $ time ceph health >/dev/null real 0m0.796s user 0m0.728s sys 0m0.054s	2021-07-10 03:47:45 -04:00
Joshua M. Boniface	551bae2518	Bump version to 0.9.24	2021-07-09 15:58:36 -04:00
Joshua M. Boniface	2b5dc286ab	Correct failure to get ceph_health data	2021-07-09 13:10:28 -04:00
Joshua M. Boniface	330cf14638	Remove return statements in keepalive collectors These seem to bork the keepalive timer process, so just remove them and let it continue to press on.	2021-07-09 13:04:17 -04:00
Joshua M. Boniface	65d14ccd92	Return to all command-based Ceph gathering Using the Rados module was very problematic, specifically because it had no sensible timeout parameters and thus would hang for many seconds. This has poor implications since it blocks further keepalives. Instead, remove the Rados usage entirely and go back completely to using manual OS commands to gather this information. While this may cause PID exhaustion more quickly it's worthwhile to avoid failure scenarios when Ceph stats time out. Closes #137	2021-07-06 11:30:45 -04:00
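
The success checks described in c6d552ae57, together with the output stripping mentioned in c8134d3a1c, can be sketched as a small decision function. This is an illustrative sketch only, not code from the repository: the function name, the parameter names, the lowercasing, and the plain "on"/"off" state strings are all assumptions made for the example.

```python
def fence_succeeded(reboot_ok: bool, power_status: str) -> bool:
    """Return True if the node can be treated as safely fenced.

    Hypothetical helper: `reboot_ok` is whether the reboot command
    reported success, `power_status` is the chassis state read afterwards.
    """
    # Strip the raw status text first; stray newlines would otherwise
    # make the string comparisons below fail.
    state = power_status.strip().lower()

    if reboot_ok:
        # Reboot worked: a chassis that is "on" or "off" counts as fenced;
        # any other state is a fence failure.
        return state in ("on", "off")

    # Reboot failed: only a known-off chassis is safe to treat as fenced.
    # "on" or an unknown state must be treated as a fence failure.
    return state == "off"


if __name__ == "__main__":
    for reboot_ok in (True, False):
        for status in ("on\n", "off\n", "unknown\n"):
            print(reboot_ok, repr(status), fence_succeeded(reboot_ok, status))
```

Keeping the decision in a pure function like this separates it from the IPMI calls and the sleeps between them, so each branch of the commit message can be checked on its own.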

578 Commits