Commit Graph

3317 Commits

Author SHA1 Message Date
Joshua Boniface 7bf91b1003 Improve store file handling for CLI
Don't try to chmod every time, instead only chmod when first creating
the file. Also allow loading the default permission from an envvar
rather than hardcoding it.
2020-08-27 13:14:55 -04:00
Joshua Boniface 4fbec63bf4 Add missing dependency for CLI 2020-08-27 13:14:46 -04:00
Joshua Boniface b51f0a339d Fix bug in SSL enabled WSGI server 2020-08-26 13:52:45 -04:00
Joshua Boniface fc9df76570 Standardize package building
1. Only build on GitLab when there's a tag.
2. Add the packages on GitLab to component "pvc" in the repo.
3. Add build-unstable-deb.sh script to build git-versioned packages.
4. Revamp build-and-deploy to use build-unstable-deb.sh and cut down on
   output.
2020-08-26 11:04:58 -04:00
Joshua Boniface 78dec77987 Bump version to 0.8 2020-08-26 10:24:44 -04:00
Joshua Boniface 6dc6dae26c Disable gtod_reduce for benchmarks
This ended up disabling latency measurements entirely, so don't use this
option for benchmarks.
2020-08-25 17:02:06 -04:00
Joshua Boniface 0089ec4e17 Multiple KiB values by 1024 in detail output
Since these are KiB and not B. Also fix some other anomalies.
2020-08-25 15:01:24 -04:00
Joshua Boniface 486408753b Don't print results to output 2020-08-25 13:38:46 -04:00
Joshua Boniface 169e174d85 Fix size of test volume to 8GB 2020-08-25 13:29:22 -04:00
Joshua Boniface 354150f757 Restore build-and-deploy script 2020-08-25 13:12:20 -04:00
Joshua Boniface eb06c1494e Add API spec for benchmark results 2020-08-25 12:43:16 -04:00
Joshua Boniface bb7b1a2bd0 Remove aggrpct from results
This value is useless to us since we're not running combined read/write
tests at all.
2020-08-25 12:38:49 -04:00
Joshua Boniface 70b9caedc3 Correct typo 2020-08-25 12:23:12 -04:00
Joshua Boniface 2731aa060c Finalize tests and output formatting 2020-08-25 12:16:23 -04:00
Joshua Boniface 18bcd39b46 Use nicer header format 2020-08-25 02:11:34 -04:00
Joshua Boniface d210eef200 Parse response message properly 2020-08-25 02:08:35 -04:00
Joshua Boniface 1dcc1f6d55 Rename sample database for API
From pvcprov to pvcapi to facilitate the changing nature of this
database and its expansion to benchmark results.
2020-08-25 01:59:35 -04:00
Joshua Boniface 887e14a4e2 Add storage benchmarking to API 2020-08-25 01:57:21 -04:00
Joshua Boniface e4891831ce Better handle missing elements from net config
Prevents situations with an un-editable, invalid config being stuck.
2020-08-21 10:27:45 -04:00
Joshua Boniface 1967034493 Use get() for all remaining VM XML gets
Prevents KeyErrors and such.
2020-08-21 10:10:13 -04:00
Joshua Boniface 921e57ca78 Fix syntax error 2020-08-20 23:05:56 -04:00
Joshua Boniface 3cc7df63f2 Add configurable VM shutdown timeout
Closes #102
2020-08-20 21:26:12 -04:00
Joshua Boniface 3dbdd12d8f Correct invalid comparison in template VNI add 2020-08-18 09:48:56 -04:00
Joshua Boniface 7e2114b536 Add initial monitoring configurations to daemon
Initial work to support multiple monitoring agents including Munin,
Check_MK, and NRPE at the least.
2020-08-17 17:05:55 -04:00
Joshua Boniface e8e65934e3 Use logger prefix for thread debug logs 2020-08-17 14:30:21 -04:00
Joshua Boniface 24fda8a73f Use new debug logger for DNS Aggregator 2020-08-17 14:26:43 -04:00
Joshua Boniface 9b3ef6d610 Add connect timeout to Ceph
This doesn't seem to actually do anything (like most of these
timeouts...) but add it just for posterity.
2020-08-17 13:58:14 -04:00
Joshua Boniface b451c0e8e3 Add additional start/finish debug messages 2020-08-17 13:11:03 -04:00
Joshua Boniface f9b126a106 Make zkhandler accept failures more robustly
Most of these would silently fail if there was e.g. an issue with the ZK
connection. Instead, encase things in try blocks and handle the
exceptions in a more graceful way, returning None or False if
applicable. Except for locks, which should retry 5 times before
aborting.
2020-08-17 13:03:36 -04:00
Joshua Boniface 553f96e7ef Use logger for debug output
Using simple print statements was annoying (lack of timing info and
formatting), so move to using the debug logger for these instead with a
custom state ('d') with white text to differentiate them. Also indicate
which subthread of the keepalive each task is being executed in for
easier tracing of issues.
2020-08-17 12:46:52 -04:00
Joshua Boniface 15e78aa9f0 Add status information in cluster status
Provide textual explanations for the degraded status, including
specific node/VM/OSD issues as well as detailed Ceph health. "Single
pane of glass" mentality.
2020-08-17 12:25:23 -04:00
Joshua Boniface 65add58c9a Properly properly handle issue 2020-08-16 11:38:39 -04:00
Joshua Boniface 0a01d84290 Tie fence timers to keepalive_interval
Also wait 2 full keepalive intervals after fencing before doing anything
else, to give the Ceph cluster a chance to recover.
2020-08-15 12:38:03 -04:00
Joshua Boniface 4afb288429 Properly handle missing domain_name fail 2020-08-15 12:07:23 -04:00
Joshua Boniface 2b4d980685 Display Ceph health in PVC status as well
Makes this output a little more realistic and allows proper monitoring
of the Ceph cluster status (separate from the PVC status which is
tracking only OSD up/in state).
2020-08-13 15:10:57 -04:00
Joshua Boniface 985ad5edc0 Warn if fencing will fail
Verify our IPMI state on startup, and then warn if fencing will fail.
For now, this is sufficient, but in future (requires refactoring) we
might want to adjust how fencing occurs based on this information.
2020-08-13 14:42:18 -04:00
Joshua Boniface 0587bcbd67 Go back to manual command for OSD stats
Using the Ceph library was a disaster here; it had no timeout or way to
force it to continue, so keepalives would become stuck and trigger fence
storms. Go back to the manual osd dump command with a 2s timeout which
is far more reliable and can be adequately terminated if it runs long.
2020-08-12 22:31:25 -04:00
Joshua Boniface 42f2dedf6d Add syntax checking of userdata YAML 2020-08-12 14:09:56 -04:00
Joshua Boniface 0d470ae5f6 Work around formatting fail 2020-08-12 12:12:16 -04:00
Joshua Boniface 5b5b7d2276 Improve the conditional so it will always work 2020-08-11 23:08:40 -04:00
Joshua Boniface 0468eeb531 Support live resizing of running disk volumes
This wasn't happening automatically, nor does it happen with qemu-img
commands, so we have to manually trigger a libvirt blockResize against
the volume. This setup is a little roundabout but seems to work fine.
2020-08-11 21:46:12 -04:00
Joshua Boniface 0dd719a682 Use single-quotes so Python isn't confused 2020-08-11 17:24:11 -04:00
Joshua Boniface 09c1bb6a46 Increase start delay of flush service 2020-08-11 14:17:35 -04:00
Joshua Boniface e0cb4a58c3 Ensure zk_listener is readded after reconnect 2020-08-11 12:46:15 -04:00
Joshua Boniface 099c58ead8 Fix missing char in log message 2020-08-11 12:40:35 -04:00
Joshua Boniface 37b23c0e59 Add comments to build-and-deploy.sh 2020-08-11 12:10:28 -04:00
Joshua Boniface 0e5c681ada Clean up imports
Make several imports more specific to reduce redundant code imports and
improve memory utilization.
2020-08-11 12:09:10 -04:00
Joshua Boniface 46ffe352e3 Better handle subthread timeouts in keepalive
Prevent the main keepalive thread from getting stuck due to a subthread
taking an enormous time. If this happens, the rest of the main keepalive
will continue onward, thus ensuring that the main keepalive does not
fail for a significant number of cycles, which would cause a fence.
2020-08-11 11:37:26 -04:00
Joshua Boniface 5526e13da9 Move all host provisioner steps to a try block
Make the provisioner a bit more robust. This way, even if a provisioning
step fails, cleanup is still performed this preventing the system from
being left in an undefined state requiring manual correction.

Addresses #91
2020-08-06 12:27:10 -04:00
Joshua Boniface ccee124c8b Adjust fence failcount limit to 6 (30s)
The previous saving throw limit (3/15s) seems to have been too low. I
was observing bizarre failures where a node would be fenced while it was
still starting up. Some of this may have been related to Zookeeper
connections taking too long, but this was inconsistent.

Increase this to 6 saving throws (30s). This provides significantly more
time for a node to properly check in on startup before another node
fences it. In the real world, 15s vs 30s isn't that big of a downtime
change, but prevents false-positive fences.
2020-08-05 22:40:07 -04:00