Commit Graph

2041 Commits

Author SHA1 Message Date
Joshua Boniface 9b3ef6d610 Add connect timeout to Ceph
This doesn't seem to actually do anything (like most of these
timeouts...) but add it just for posterity.
2020-08-17 13:58:14 -04:00
Joshua Boniface b451c0e8e3 Add additional start/finish debug messages 2020-08-17 13:11:03 -04:00
Joshua Boniface f9b126a106 Make zkhandler accept failures more robustly
Most of these would silently fail if there was e.g. an issue with the ZK
connection. Instead, encase things in try blocks and handle the
exceptions in a more graceful way, returning None or False if
applicable. Except for locks, which should retry 5 times before
aborting.
2020-08-17 13:03:36 -04:00
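A minimal sketch of the pattern this commit message describes, not the project's actual zkhandler code: reads return None when the call fails, and lock acquisition is retried 5 times before giving up. The names `safe_read` and `acquire_with_retries`, the `delay` parameter, and the callables passed in are all hypothetical.

```python
import time


def safe_read(read, key):
    """Return read(key), or None if the call raises (e.g. a dropped connection)."""
    try:
        return read(key)
    except Exception:
        return None


def acquire_with_retries(acquire, attempts=5, delay=1.0):
    """Call acquire() up to `attempts` times; return True on success, else False."""
    for attempt in range(attempts):
        try:
            acquire()
            return True
        except Exception:
            # Pause before the next try, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

Callers are then expected to check for None or False rather than rely on an exception reaching them.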
Joshua Boniface 553f96e7ef Use logger for debug output
Using simple print statements was annoying (lack of timing info and
formatting), so move to using the debug logger for these instead with a
custom state ('d') with white text to differentiate them. Also indicate
which subthread of the keepalive each task is being executed in for
easier tracing of issues.
2020-08-17 12:46:52 -04:00
Joshua Boniface 15e78aa9f0 Add status information in cluster status
Provide textual explanations for the degraded status, including
specific node/VM/OSD issues as well as detailed Ceph health. "Single
pane of glass" mentality.
2020-08-17 12:25:23 -04:00
Joshua Boniface 65add58c9a Properly properly handle issue 2020-08-16 11:38:39 -04:00
Joshua Boniface 0a01d84290 Tie fence timers to keepalive_interval
Also wait 2 full keepalive intervals after fencing before doing anything
else, to give the Ceph cluster a chance to recover.
2020-08-15 12:38:03 -04:00
Joshua Boniface 4afb288429 Properly handle missing domain_name fail 2020-08-15 12:07:23 -04:00
Joshua Boniface 2b4d980685 Display Ceph health in PVC status as well
Makes this output a little more realistic and allows proper monitoring
of the Ceph cluster status (separate from the PVC status which is
tracking only OSD up/in state).
2020-08-13 15:10:57 -04:00
Joshua Boniface 985ad5edc0 Warn if fencing will fail
Verify our IPMI state on startup, and then warn if fencing will fail.
For now, this is sufficient, but in future (requires refactoring) we
might want to adjust how fencing occurs based on this information.
2020-08-13 14:42:18 -04:00
Joshua Boniface 0587bcbd67 Go back to manual command for OSD stats
Using the Ceph library was a disaster here; it had no timeout or way to
force it to continue, so keepalives would become stuck and trigger fence
storms. Go back to the manual osd dump command with a 2s timeout which
is far more reliable and can be adequately terminated if it runs long.
2020-08-12 22:31:25 -04:00
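A minimal sketch of running an external command with a hard 2s timeout, as this commit message describes. The helper name `run_with_timeout` is hypothetical, and the demo uses a harmless stand-in command; in the setting described above, the command would presumably be the Ceph OSD dump invocation.

```python
import subprocess
import sys


def run_with_timeout(command, timeout=2):
    """Run command; return (returncode, stdout), or None if it exceeds the timeout."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child process before raising this.
        return None
    return result.returncode, result.stdout


# Stand-in command for illustration only (assumption: not the real Ceph call).
demo = run_with_timeout([sys.executable, "-c", "print('ok')"])
```

Because the child process is killed on timeout, a hung command cannot stall the caller for longer than the timeout.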
Joshua Boniface 42f2dedf6d Add syntax checking of userdata YAML 2020-08-12 14:09:56 -04:00
Joshua Boniface 0d470ae5f6 Work around formatting fail 2020-08-12 12:12:16 -04:00
Joshua Boniface 5b5b7d2276 Improve the conditional so it will always work 2020-08-11 23:08:40 -04:00
Joshua Boniface 0468eeb531 Support live resizing of running disk volumes
This wasn't happening automatically, nor does it happen with qemu-img
commands, so we have to manually trigger a libvirt blockResize against
the volume. This setup is a little roundabout but seems to work fine.
2020-08-11 21:46:12 -04:00
Joshua Boniface 0dd719a682 Use single-quotes so Python isn't confused 2020-08-11 17:24:11 -04:00
Joshua Boniface 09c1bb6a46 Increase start delay of flush service 2020-08-11 14:17:35 -04:00
Joshua Boniface e0cb4a58c3 Ensure zk_listener is readded after reconnect 2020-08-11 12:46:15 -04:00
Joshua Boniface 099c58ead8 Fix missing char in log message 2020-08-11 12:40:35 -04:00
Joshua Boniface 37b23c0e59 Add comments to build-and-deploy.sh 2020-08-11 12:10:28 -04:00
Joshua Boniface 0e5c681ada Clean up imports
Make several imports more specific to reduce redundant code imports and
improve memory utilization.
2020-08-11 12:09:10 -04:00
Joshua Boniface 46ffe352e3 Better handle subthread timeouts in keepalive
Prevent the main keepalive thread from getting stuck due to a subthread
taking an enormous time. If this happens, the rest of the main keepalive
will continue onward, thus ensuring that the main keepalive does not
fail for a significant number of cycles, which would cause a fence.
2020-08-11 11:37:26 -04:00
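A minimal sketch of the idea in this commit message, assuming the subthreads are plain `threading.Thread` objects: the main loop joins each one with a timeout and carries on regardless. The function name `run_subtasks` is hypothetical, and the actual keepalive code may differ.

```python
import threading


def run_subtasks(tasks, timeout):
    """Start each task in a daemon thread; return True per task that finished in time."""
    threads = [threading.Thread(target=task, daemon=True) for task in tasks]
    for thread in threads:
        thread.start()
    for thread in threads:
        # join() returns after `timeout` seconds even if the thread is still running.
        thread.join(timeout=timeout)
    return [not thread.is_alive() for thread in threads]
```

Note that the timeout applies to each join in turn, so the worst-case wait is the timeout multiplied by the number of tasks.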
Joshua Boniface 5526e13da9 Move all host provisioner steps to a try block
Make the provisioner a bit more robust. This way, even if a provisioning
step fails, cleanup is still performed, thus preventing the system from
being left in an undefined state requiring manual correction.

Addresses #91
2020-08-06 12:27:10 -04:00
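A minimal sketch of the try-block structure this commit message describes, with hypothetical names (`provision`, `steps`, `cleanup`): cleanup runs in a `finally` clause, so it happens whether or not a step raises.

```python
def provision(steps, cleanup):
    """Run each step in order; always run cleanup; return True only if all steps passed."""
    try:
        for step in steps:
            step()
        return True
    except Exception:
        # A failed step aborts the remaining steps.
        return False
    finally:
        # Runs on success and on failure alike.
        cleanup()
```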
Joshua Boniface ccee124c8b Adjust fence failcount limit to 6 (30s)
The previous saving throw limit (3/15s) seems to have been too low. I
was observing bizarre failures where a node would be fenced while it was
still starting up. Some of this may have been related to Zookeeper
connections taking too long, but this was inconsistent.

Increase this to 6 saving throws (30s). This provides significantly more
time for a node to properly check in on startup before another node
fences it. In the real world, 15s vs 30s isn't that big of a downtime
change, but prevents false-positive fences.
2020-08-05 22:40:07 -04:00
Joshua Boniface 02343079c0 Improve fencing migrate layout
Open the option to do this in parallel with some threads
2020-08-05 22:26:01 -04:00
Joshua Boniface 37b83aad6a Add logging and use better conditional 2020-08-05 21:57:36 -04:00
Joshua Boniface 876f2424e0 Ensure dead state isn't written erroneously 2020-08-05 21:57:11 -04:00
Joshua Boniface 4438dd401f Add description to example in network add
This is a required field, so ensure it is in the example.
2020-08-05 10:35:41 -04:00
Joshua Boniface 142743b2c0 Fix erroneous comma 2020-08-05 10:34:30 -04:00
Joshua Boniface bafdcf9f8c Use new_size to match new_name 2020-08-05 10:25:37 -04:00
Joshua Boniface 6fe74b34b2 Use .get for JSON message responses 2020-07-20 12:31:12 -04:00
Joshua Boniface 9f86f12f1a Only parse script_run_args if not None 2020-07-16 02:36:26 -04:00
Joshua Boniface ad45f6097f Don't output anything if no results and --raw 2020-07-16 02:35:02 -04:00
Joshua Boniface be405caa11 Remove spurious print statement 2020-07-08 13:28:47 -04:00
Joshua Boniface a1ba9d2eeb Allow specifying arbitrary script_args on CLI
Allow the specifying of arbitrary provisioner script install() args on
the provisioner create CLI, either overriding or adding additional
per-VM arguments to those found in the profile. Reference example is
setting a "vm_fqdn" on a per-run basis.

Closes #100
2020-07-08 13:18:12 -04:00
Joshua Boniface 8fc5299d38 Avoid failing if CPU features are missing 2020-07-08 12:32:42 -04:00
Joshua Boniface 37a58d35e8 Implement limiting of node output
Closes #98
2020-06-25 11:51:53 -04:00
Joshua Boniface d74f68c904 Add quiet option to CLI
Closes #99
2020-06-25 11:09:55 -04:00
Joshua Boniface 15e986c158 Support storing client config in override dir 2020-06-25 11:07:01 -04:00
Joshua Boniface 5871380e1b Avoid crashing VM stats thread if domain migrated 2020-06-10 17:10:46 -04:00
Joshua Boniface 2967c97f1a Format and display extra VM statistics 2020-06-07 03:04:36 -04:00
Joshua Boniface 4cdf1f7247 Add statistics values to the API 2020-06-07 02:15:33 -04:00
Joshua Boniface deaf138e45 Add stats to VM information 2020-06-07 00:42:11 -04:00
Joshua Boniface 654a3cb7fa Improve debug output and use ceph df util data 2020-06-06 22:52:49 -04:00
Joshua Boniface 9b65d3271a Improve handling of Ceph status gathering
Use the Rados library instead of random OS commands, which massively
improves the performance of these tasks.

Closes #97
2020-06-06 22:30:25 -04:00
Joshua Boniface fba39cb739 Fix broken sorting for pools and volumes 2020-06-06 21:28:54 -04:00
Joshua Boniface 598b2025e8 Use Rados and add Ceph entries to pvcnoded.yaml 2020-06-06 21:12:51 -04:00
Joshua Boniface 70b787d1fd Move all VM functions into thread 2020-06-06 15:44:05 -04:00
Joshua Boniface e1310a05f2 Implement recording of VM stats during keepalive 2020-06-06 15:34:03 -04:00
Joshua Boniface 2ad6860dfe Move Ceph statistics gathering into thread 2020-06-06 13:25:02 -04:00