Commit Graph

1931 Commits

Author SHA1 Message Date
Joshua Boniface 0587bcbd67 Go back to manual command for OSD stats
Using the Ceph library was a disaster here; it had no timeout or any way
to force it to continue, so keepalives would become stuck and trigger
fence storms. Go back to the manual osd dump command with a 2s timeout,
which is far more reliable and can be cleanly terminated if it runs long.
2020-08-12 22:31:25 -04:00
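A minimal sketch of the timeout approach described above. It assumes the daemon shells out through Python's `subprocess` module. The helper name `run_with_timeout` is hypothetical, and so is the exact `ceph` command line in the comment. `subprocess.run()` kills the child when `timeout` expires, so a slow command gets terminated instead of wedging the keepalive.

```python
import subprocess
import sys


def run_with_timeout(command, timeout=2):
    """Run a command; return (retcode, stdout, stderr), or None if it overran.

    subprocess.run() kills and reaps the child when the timeout expires,
    so a hung command cannot block the caller indefinitely.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child has already been killed; the caller can skip this
        # cycle's OSD stats instead of stalling the keepalive.
        return None
    return result.returncode, result.stdout.decode(), result.stderr.decode()


# Usage (hypothetical command line):
#   run_with_timeout(["ceph", "osd", "dump", "--format", "json"], timeout=2)
# Stand-in that works anywhere Python is installed:
#   run_with_timeout([sys.executable, "-c", "print('ok')"])
```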
Joshua Boniface 42f2dedf6d Add syntax checking of userdata YAML 2020-08-12 14:09:56 -04:00
Joshua Boniface 0d470ae5f6 Work around formatting fail 2020-08-12 12:12:16 -04:00
Joshua Boniface 5b5b7d2276 Improve the conditional so it will always work 2020-08-11 23:08:40 -04:00
Joshua Boniface 0468eeb531 Support live resizing of running disk volumes
This wasn't happening automatically, nor does it happen with qemu-img
commands, so we have to manually trigger a libvirt blockResize against
the volume. This setup is a little roundabout but seems to work fine.
2020-08-11 21:46:12 -04:00
Joshua Boniface 0dd719a682 Use single-quotes so Python isn't confused 2020-08-11 17:24:11 -04:00
Joshua Boniface 09c1bb6a46 Increase start delay of flush service 2020-08-11 14:17:35 -04:00
Joshua Boniface e0cb4a58c3 Ensure zk_listener is re-added after reconnect 2020-08-11 12:46:15 -04:00
Joshua Boniface 099c58ead8 Fix missing char in log message 2020-08-11 12:40:35 -04:00
Joshua Boniface 37b23c0e59 Add comments to build-and-deploy.sh 2020-08-11 12:10:28 -04:00
Joshua Boniface 0e5c681ada Clean up imports
Make several imports more specific to reduce redundant code imports and
improve memory utilization.
2020-08-11 12:09:10 -04:00
Joshua Boniface 46ffe352e3 Better handle subthread timeouts in keepalive
Prevent the main keepalive thread from getting stuck due to a subthread
taking an excessively long time. If this happens, the rest of the main
keepalive will continue onward, ensuring that the main keepalive does
not fail for a significant number of cycles, which would cause a fence.
2020-08-11 11:37:26 -04:00
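A sketch of the pattern described above. It assumes the subthread is a standard `threading.Thread`: the keepalive waits on it with `join(timeout=...)` and moves on if the thread is still running. `collect_stats` and `keepalive_cycle` are hypothetical names, not the daemon's actual functions.

```python
import threading
import time


def collect_stats(results):
    # Hypothetical stand-in for a slow statistics-gathering subthread.
    time.sleep(0.2)
    results["stats"] = "collected"


def keepalive_cycle(subthread_target, timeout=1.0):
    """Run one keepalive cycle, never waiting on a subthread past timeout."""
    results = {}
    # daemon=True so an overrunning subthread cannot block process exit.
    worker = threading.Thread(target=subthread_target, args=(results,), daemon=True)
    worker.start()
    worker.join(timeout=timeout)
    # If the subthread overran, carry on without its data rather than block;
    # the node still checks in this cycle and is not fenced.
    stats = None if worker.is_alive() else results.get("stats")
    # ... the rest of the keepalive (node check-in, fence checks) runs here ...
    return {"stats": stats, "checked_in": True}
```

A thread left running after the timeout keeps working in the background. Its late result is simply ignored for that cycle.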
Joshua Boniface 5526e13da9 Move all host provisioner steps to a try block
Make the provisioner a bit more robust. This way, even if a provisioning
step fails, cleanup is still performed, thus preventing the system from
being left in an undefined state requiring manual correction.

Addresses #91
2020-08-06 12:27:10 -04:00
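The try/finally structure described above can be sketched as follows. `provision_vm`, the `(name, callable)` step list, and the `cleanup` callable are hypothetical stand-ins for the real provisioner steps and cleanup.

```python
def provision_vm(steps, cleanup, log):
    """Run provisioning steps; always run cleanup, even if a step fails.

    steps: list of (name, callable) pairs. cleanup: a callable.
    Returns True on success, False if any step raised.
    """
    try:
        for name, step in steps:
            log.append("running " + name)
            step()
        return True
    except Exception as exc:
        log.append("failed: {}".format(exc))
        return False
    finally:
        # Runs on success and on failure, before the return value is
        # handed back, so the host is never left half-provisioned.
        cleanup()
        log.append("cleanup done")
```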
Joshua Boniface ccee124c8b Adjust fence failcount limit to 6 (30s)
The previous saving throw limit (3/15s) seems to have been too low. I
was observing bizarre failures where a node would be fenced while it was
still starting up. Some of this may have been related to ZooKeeper
connections taking too long, but this was inconsistent.

Increase this to 6 saving throws (30s). This provides significantly more
time for a node to properly check in on startup before another node
fences it. In the real world, 15s vs 30s is not a big difference in
downtime, but the longer window prevents false-positive fences.
2020-08-05 22:40:07 -04:00
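The arithmetic behind the new limit, as a small sketch. The old "3/15s" figure implies one saving throw per 5-second keepalive interval. That interval is an inference, not stated directly, and with it 6 throws give a 30-second window. The constant and function names are hypothetical. This sketch also assumes that reaching the limit, not exceeding it, triggers the fence.

```python
# Assumption: one keepalive (and one saving throw) every 5 seconds,
# inferred from the old "3/15s" limit.
KEEPALIVE_INTERVAL = 5
FENCE_FAILCOUNT_LIMIT = 6  # saving throws before a peer is fenced


def fence_window(limit=FENCE_FAILCOUNT_LIMIT, interval=KEEPALIVE_INTERVAL):
    """Seconds a node may miss check-ins before it is fenced."""
    return limit * interval


def should_fence(missed_keepalives, limit=FENCE_FAILCOUNT_LIMIT):
    """Fence a peer only once all of its saving throws are used up."""
    return missed_keepalives >= limit
```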
Joshua Boniface 02343079c0 Improve fencing migrate layout
Leave open the option to do this in parallel using threads.
2020-08-05 22:26:01 -04:00
Joshua Boniface 37b83aad6a Add logging and use better conditional 2020-08-05 21:57:36 -04:00
Joshua Boniface 876f2424e0 Ensure dead state isn't written erroneously 2020-08-05 21:57:11 -04:00
Joshua Boniface 4438dd401f Add description to example in network add
This is a required field, so ensure it is in the example.
2020-08-05 10:35:41 -04:00
Joshua Boniface 142743b2c0 Fix erroneous comma 2020-08-05 10:34:30 -04:00
Joshua Boniface bafdcf9f8c Use new_size to match new_name 2020-08-05 10:25:37 -04:00
Joshua Boniface 6fe74b34b2 Use .get for JSON message responses 2020-07-20 12:31:12 -04:00
Joshua Boniface 9f86f12f1a Only parse script_run_args if not None 2020-07-16 02:36:26 -04:00
Joshua Boniface ad45f6097f Don't output anything if no results and --raw 2020-07-16 02:35:02 -04:00
Joshua Boniface be405caa11 Remove spurious print statement 2020-07-08 13:28:47 -04:00
Joshua Boniface a1ba9d2eeb Allow specifying arbitrary script_args on CLI
Allow specifying arbitrary provisioner script install() args on the
provisioner create CLI, either overriding or adding additional per-VM
arguments to those found in the profile. The reference example is
setting a "vm_fqdn" on a per-run basis.

Closes #100
2020-07-08 13:18:12 -04:00
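A sketch of the override-or-add behavior. It assumes the CLI passes each argument as a `key=value` string. `merge_script_args` and the `deb_release` key are hypothetical. `vm_fqdn` is the reference example from the message.

```python
def merge_script_args(profile_args, cli_args):
    """Merge per-run CLI arguments into the profile's script arguments.

    cli_args are "key=value" strings. A key already in the profile is
    overridden; a new key is added. The profile dict is not modified.
    """
    merged = dict(profile_args)
    for arg in cli_args:
        # partition() splits on the first "=", so values may contain "=".
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError("expected key=value, got {!r}".format(arg))
        merged[key] = value
    return merged
```

For example, with a profile of `{"deb_release": "buster"}` and `["vm_fqdn=web1.example.com"]` on the CLI, the script receives both keys.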
Joshua Boniface 8fc5299d38 Avoid failing if CPU features are missing 2020-07-08 12:32:42 -04:00
Joshua Boniface 37a58d35e8 Implement limiting of node output
Closes #98
2020-06-25 11:51:53 -04:00
Joshua Boniface d74f68c904 Add quiet option to CLI
Closes #99
2020-06-25 11:09:55 -04:00
Joshua Boniface 15e986c158 Support storing client config in override dir 2020-06-25 11:07:01 -04:00
Joshua Boniface 5871380e1b Avoid crashing VM stats thread if domain migrated 2020-06-10 17:10:46 -04:00
Joshua Boniface 2967c97f1a Format and display extra VM statistics 2020-06-07 03:04:36 -04:00
Joshua Boniface 4cdf1f7247 Add statistics values to the API 2020-06-07 02:15:33 -04:00
Joshua Boniface deaf138e45 Add stats to VM information 2020-06-07 00:42:11 -04:00
Joshua Boniface 654a3cb7fa Improve debug output and use ceph df util data 2020-06-06 22:52:49 -04:00
Joshua Boniface 9b65d3271a Improve handling of Ceph status gathering
Use the Rados library instead of ad-hoc OS commands, which massively
improves the performance of these tasks.

Closes #97
2020-06-06 22:30:25 -04:00
Joshua Boniface fba39cb739 Fix broken sorting for pools and volumes 2020-06-06 21:28:54 -04:00
Joshua Boniface 598b2025e8 Use Rados and add Ceph entries to pvcnoded.yaml 2020-06-06 21:12:51 -04:00
Joshua Boniface 70b787d1fd Move all VM functions into thread 2020-06-06 15:44:05 -04:00
Joshua Boniface e1310a05f2 Implement recording of VM stats during keepalive 2020-06-06 15:34:03 -04:00
Joshua Boniface 2ad6860dfe Move Ceph statistics gathering into thread 2020-06-06 13:25:02 -04:00
Joshua Boniface cebb4bbc1a Comment cleanup 2020-06-06 13:20:40 -04:00
Joshua Boniface a672e06dd2 Move fencing to end of keepalive function 2020-06-06 13:19:11 -04:00
Joshua Boniface 1db73bb892 Move libvirt closure into previous section 2020-06-06 13:18:37 -04:00
Joshua Boniface c1956072f0 Rename update_zookeeper function to node_keepalive 2020-06-06 12:49:50 -04:00
Joshua Boniface ce60836c34 Allow enforcement of live migration
Provides a CLI and API argument to force live migration, which triggers
a new VM state "migrate-live". During a migrate, the node daemon's
VMInstance reads this flag from the state and, if live migration is
enforced, will not trigger a shutdown migration.

Closes #95
2020-06-06 12:00:44 -04:00
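The decision the node daemon makes during a migrate can be sketched like this. The function name and the `"abort"` outcome are assumptions. The message only says that an enforced live migration will not trigger a shutdown migration.

```python
def choose_migration_method(vm_state, live_migrate_ok):
    """Decide how a VM is moved, given its requested state.

    vm_state is "migrate" or "migrate-live"; live_migrate_ok reports
    whether the live migration attempt succeeded.
    """
    if live_migrate_ok:
        return "live"
    if vm_state == "migrate-live":
        # Live migration was enforced: never fall back to shutting the VM
        # down. "abort" is a hypothetical outcome for this sketch.
        return "abort"
    # Unenforced migrate: a shutdown migration is an acceptable fallback.
    return "shutdown"
```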
Joshua Boniface b5434ba744 Fix typo in variable name 2020-06-06 11:29:48 -04:00
Joshua Boniface f61d443773 Allow move of migrated VM to current node
This makes the migration permanent instead of throwing an error.

Fixes #96
2020-06-06 11:25:10 -04:00
Joshua Boniface da20b4493a Properly return the function 2020-06-05 15:50:43 -04:00
Joshua Boniface 440821b136 Refactor cluster validation into a command wrapper
Instead of using group-based validation, which breaks the help context
for subcommands, use a decorator to validate the cluster status for each
command. The eager help option will then override this decorator for
help commands, while enforcing it for others.
2020-06-05 14:49:53 -04:00
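A plain-Python sketch of the per-command decorator approach. It assumes no CLI framework, and `cluster_req`, `invoke`, and `ClusterUnreachable` are hypothetical names. `invoke` mimics an eager `--help` option, which is handled during argument parsing before the decorated command body runs. As a result, help still works even when the cluster check would fail.

```python
import functools


class ClusterUnreachable(Exception):
    """Raised when the cluster validation check fails."""


def cluster_req(check_cluster):
    """Build a decorator that validates the cluster before a command runs."""
    def decorator(command):
        @functools.wraps(command)  # keeps the command's name and docstring
        def wrapper(*args, **kwargs):
            if not check_cluster():
                raise ClusterUnreachable("cluster is not reachable")
            return command(*args, **kwargs)
        return wrapper
    return decorator


def invoke(command, argv):
    # Mimic an eager --help option: it is handled while parsing arguments,
    # so the validating wrapper is never entered for help requests.
    if "--help" in argv:
        return command.__doc__
    return command(*argv)
```

Each command opts in with `@cluster_req(...)`. A group-level check would fire before any subcommand parses its own `--help`.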
Joshua Boniface b9e5b14f94 Update lastnode too if a self-migrate is aborted
References #92
2020-06-04 10:28:04 -04:00