parallelvirtualcluster/pvc

Author	SHA1	Message	Date
Joshua M. Boniface	553f96e7ef	Use logger for debug output Using simple print statements was annoying (lack of timing info and formatting), so move to using the debug logger for these instead with a custom state ('d') with white text to differentiate them. Also indicate which subthread of the keepalive each task is being executed in for easier tracing of issues.	2020-08-17 12:46:52 -04:00
Joshua M. Boniface	15e78aa9f0	Add status information in cluster status Provide textual explanations for the degraded status, including specific node/VM/OSD issues as well as detailed Ceph health. "Single pane of glass" mentality.	2020-08-17 12:25:23 -04:00
Joshua M. Boniface	65add58c9a	Properly properly handle issue	2020-08-16 11:38:39 -04:00
Joshua M. Boniface	0a01d84290	Tie fence timers to keepalive_interval Also wait 2 full keepalive intervals after fencing before doing anything else, to give the Ceph cluster a chance to recover.	2020-08-15 12:38:03 -04:00
Joshua M. Boniface	4afb288429	Properly handle missing domain_name fail	2020-08-15 12:07:23 -04:00
Joshua M. Boniface	2b4d980685	Display Ceph health in PVC status as well Makes this output a little more realistic and allows proper monitoring of the Ceph cluster status (separate from the PVC status which is tracking only OSD up/in state).	2020-08-13 15:10:57 -04:00
Joshua M. Boniface	985ad5edc0	Warn if fencing will fail Verify our IPMI state on startup, and then warn if fencing will fail. For now, this is sufficient, but in future (requires refactoring) we might want to adjust how fencing occurs based on this information.	2020-08-13 14:42:18 -04:00
Joshua M. Boniface	0587bcbd67	Go back to manual command for OSD stats Using the Ceph library was a disaster here; it had no timeout or way to force it to continue, so keepalives would become stuck and trigger fence storms. Go back to the manual osd dump command with a 2s timeout which is far more reliable and can be adequately terminated if it runs long.	2020-08-12 22:31:25 -04:00
Joshua M. Boniface	42f2dedf6d	Add syntax checking of userdata YAML	2020-08-12 14:09:56 -04:00
Joshua M. Boniface	0d470ae5f6	Work around formatting fail	2020-08-12 12:12:16 -04:00
Joshua M. Boniface	5b5b7d2276	Improve the conditional so it will always work	2020-08-11 23:08:40 -04:00
Joshua M. Boniface	0468eeb531	Support live resizing of running disk volumes This wasn't happening automatically, nor does it happen with qemu-img commands, so we have to manually trigger a libvirt blockResize against the volume. This setup is a little roundabout but seems to work fine.	2020-08-11 21:46:12 -04:00
Joshua M. Boniface	0dd719a682	Use single-quotes so Python isn't confused	2020-08-11 17:24:11 -04:00
Joshua M. Boniface	09c1bb6a46	Increase start delay of flush service	2020-08-11 14:17:35 -04:00
Joshua M. Boniface	e0cb4a58c3	Ensure zk_listener is readded after reconnect	2020-08-11 12:46:15 -04:00
Joshua M. Boniface	099c58ead8	Fix missing char in log message	2020-08-11 12:40:35 -04:00
Joshua M. Boniface	37b23c0e59	Add comments to build-and-deploy.sh	2020-08-11 12:10:28 -04:00
Joshua M. Boniface	0e5c681ada	Clean up imports Make several imports more specific to reduce redundant code imports and improve memory utilization.	2020-08-11 12:09:10 -04:00
Joshua M. Boniface	46ffe352e3	Better handle subthread timeouts in keepalive Prevent the main keepalive thread from getting stuck due to a subthread taking an enormous time. If this happens, the rest of the main keepalive will continue onward, thus ensuring that the main keepalive does not fail for a significant number of cycles, which would cause a fence.	2020-08-11 11:37:26 -04:00
Joshua M. Boniface	5526e13da9	Move all host provisioner steps to a try block Make the provisioner a bit more robust. This way, even if a provisioning step fails, cleanup is still performed, thus preventing the system from being left in an undefined state requiring manual correction. Addresses #91	2020-08-06 12:27:10 -04:00
Joshua M. Boniface	ccee124c8b	Adjust fence failcount limit to 6 (30s) The previous saving throw limit (3/15s) seems to have been too low. I was observing bizarre failures where a node would be fenced while it was still starting up. Some of this may have been related to Zookeeper connections taking too long, but this was inconsistent. Increase this to 6 saving throws (30s). This provides significantly more time for a node to properly check in on startup before another node fences it. In the real world, 15s vs 30s isn't that big of a downtime change, but prevents false-positive fences.	2020-08-05 22:40:07 -04:00
Joshua M. Boniface	02343079c0	Improve fencing migrate layout Open the option to do this in parallel with some threads	2020-08-05 22:26:01 -04:00
Joshua M. Boniface	37b83aad6a	Add logging and use better conditional	2020-08-05 21:57:36 -04:00
Joshua M. Boniface	876f2424e0	Ensure dead state isn't written erroneously	2020-08-05 21:57:11 -04:00
Joshua M. Boniface	4438dd401f	Add description to example in network add A required field so ensure this is in the example.	2020-08-05 10:35:41 -04:00
Joshua M. Boniface	142743b2c0	Fix erroneous comma	2020-08-05 10:34:30 -04:00
Joshua M. Boniface	bafdcf9f8c	Use new_size to match new_name	2020-08-05 10:25:37 -04:00
Joshua M. Boniface	6fe74b34b2	Use .get for JSON message responses	2020-07-20 12:31:12 -04:00
Joshua M. Boniface	9f86f12f1a	Only parse script_run_args if not None	2020-07-16 02:36:26 -04:00
Joshua M. Boniface	ad45f6097f	Don't output anything if no results and --raw	2020-07-16 02:35:02 -04:00
Joshua M. Boniface	be405caa11	Remove spurious print statement	2020-07-08 13:28:47 -04:00
Joshua M. Boniface	a1ba9d2eeb	Allow specifying arbitrary script_args on CLI Allow the specifying of arbitrary provisioner script install() args on the provisioner create CLI, either overriding or adding additional per-VM arguments to those found in the profile. Reference example is setting a "vm_fqdn" on a per-run basis. Closes #100	2020-07-08 13:18:12 -04:00
Joshua M. Boniface	8fc5299d38	Avoid failing if CPU features are missing	2020-07-08 12:32:42 -04:00
Joshua M. Boniface	37a58d35e8	Implement limiting of node output Closes #98	2020-06-25 11:51:53 -04:00
Joshua M. Boniface	d74f68c904	Add quiet option to CLI Closes #99	2020-06-25 11:09:55 -04:00
Joshua M. Boniface	15e986c158	Support storing client config in override dir	2020-06-25 11:07:01 -04:00
Joshua M. Boniface	5871380e1b	Avoid crashing VM stats thread if domain migrated	2020-06-10 17:10:46 -04:00
Joshua M. Boniface	2967c97f1a	Format and display extra VM statistics	2020-06-07 03:04:36 -04:00
Joshua M. Boniface	4cdf1f7247	Add statistics values to the API	2020-06-07 02:15:33 -04:00
Joshua M. Boniface	deaf138e45	Add stats to VM information	2020-06-07 00:42:11 -04:00
Joshua M. Boniface	654a3cb7fa	Improve debug output and use ceph df util data	2020-06-06 22:52:49 -04:00
Joshua M. Boniface	9b65d3271a	Improve handling of Ceph status gathering Use the Rados library instead of random OS commands, which massively improves the performance of these tasks. Closes #97	2020-06-06 22:30:25 -04:00
Joshua M. Boniface	fba39cb739	Fix broken sorting for pools and volumes	2020-06-06 21:28:54 -04:00
Joshua M. Boniface	598b2025e8	Use Rados and add Ceph entries to pvcnoded.yaml	2020-06-06 21:12:51 -04:00
Joshua M. Boniface	70b787d1fd	Move all VM functions into thread	2020-06-06 15:44:05 -04:00
Joshua M. Boniface	e1310a05f2	Implement recording of VM stats during keepalive	2020-06-06 15:34:03 -04:00
Joshua M. Boniface	2ad6860dfe	Move Ceph statistics gathering into thread	2020-06-06 13:25:02 -04:00
Joshua M. Boniface	cebb4bbc1a	Comment cleanup	2020-06-06 13:20:40 -04:00
Joshua M. Boniface	a672e06dd2	Move fencing to end of keepalive function	2020-06-06 13:19:11 -04:00
Joshua M. Boniface	1db73bb892	Move libvirt closure into previous section	2020-06-06 13:18:37 -04:00
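
Note on commit 0587bcbd67: the message describes running the OSD dump as an external command with a 2s timeout so that a hung call can be terminated. The following is only a minimal sketch of that general pattern, not the code from the commit. The helper name `run_with_timeout` is hypothetical, and the Ceph command line shown in the comment is an assumption about what such a call might look like.

```python
import subprocess
import sys


def run_with_timeout(command, timeout=2.0):
    """Run `command` (a list of arguments) with a hard time limit.

    Returns (returncode, stdout) on completion, or (None, "") if the
    command did not finish within `timeout` seconds.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before raising,
        # so nothing is left running in the background.
        return None, ""
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Assumption: a real caller might pass something like
    # ["ceph", "osd", "dump", "--format", "json"]. The Python interpreter
    # is used here as a stand-in so the sketch runs anywhere.
    print(run_with_timeout([sys.executable, "-c", "print('ok')"]))
```

The point of the design is that the caller always regains control within a bounded time, whether or not the command succeeds, so a slow storage query cannot stall the keepalive.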

3288 Commits
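
Note on commit 46ffe352e3: the message describes keeping the main keepalive thread from getting stuck when a subthread runs far too long. One common way to get that behaviour in Python is to join each worker thread against a shared deadline and move on if it is still alive. This is a hedged sketch of that idea under stated assumptions, not the project's implementation. The function name `run_subtasks` and the task names are hypothetical.

```python
import threading
import time


def run_subtasks(tasks, timeout):
    """Run each callable in `tasks` (a dict of name -> callable) in its own
    daemon thread and wait at most `timeout` seconds in total.

    Returns the list of task names that were still running at the deadline.
    """
    threads = {}
    for name, func in tasks.items():
        thread = threading.Thread(target=func, name=name, daemon=True)
        thread.start()
        threads[name] = thread

    # One shared deadline, so the total wait is bounded by `timeout`
    # regardless of how many subtasks there are.
    deadline = time.monotonic() + timeout
    stuck = []
    for name, thread in threads.items():
        remaining = max(0.0, deadline - time.monotonic())
        thread.join(timeout=remaining)
        if thread.is_alive():
            # The thread cannot be killed from here; it is only abandoned
            # so the caller can carry on with the rest of its cycle.
            stuck.append(name)
    return stuck


if __name__ == "__main__":
    demo = {"fast": lambda: None, "slow": lambda: time.sleep(2)}
    print(run_subtasks(demo, timeout=0.3))
```

Python threads cannot be forcibly stopped, so an abandoned subtask keeps running until it returns on its own. Any work it does should therefore carry its own timeout as well.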