2835 Commits

Author SHA1 Message Date
Joshua Boniface f9e7e9884f Improve handling of VM migrations
The VM migration code was very old, very spaghettified, and prone to
strange failures.

Improve this by taking cues from the node primary migration. Use
synchronization between the nodes to ensure lockstep completion of the
migration in discrete steps.

A proper queue can be built later to integrate with this code more
cleanly.

References #108
2020-10-20 13:01:55 -04:00
Joshua Boniface 726501f4d4 Add additional logging to flush selector
Adds additional debug logging to the flush selector to determine how and
why any given node is selected. Useful for troubleshooting strange
choices.
2020-10-20 12:34:18 -04:00
Joshua Boniface 7cc33451b9 Improve Munin check with extinfo 2020-10-19 11:01:00 -04:00
Joshua Boniface ffaa4c033f Improve handling of large file uploads
By default, Werkzeug would require the entire file (be it an OVA or
image file) to be uploaded and saved to a temporary, fake file under
`/tmp`, before any further processing could occur. This blocked most of
the execution of these functions until the upload was completed.

This entirely defeated the purpose of what I was trying to do, which was
to save the uploads directly to the temporary blockdev in each case,
thus avoiding any sort of memory or (host) disk usage.

The solution is two-fold:

  1. First, ensure that the `location='args'` value is set in
  RequestParser; without this, the `files` portion would be parsed
  during the argument parsing, which was the original source of this
  blocking behaviour.

  2. Instead of the convoluted request handling that was being done
  originally here, entirely defer the parsing of the `files`
  arguments until the point in the code where they are ready to be
  saved. Then, using an override stream_factory that simply opens the
  temporary blockdev, the upload can commence while being written
  directly out to it, rather than using `/tmp` space.

This does alter the error handling slightly; it is impossible to check
if the argument was passed until this point in the code, so it may take
longer to fail if the API consumer does not specify a file as they
should. This is a minor trade-off and I would expect my API consumers to
be sane here.
2020-10-19 01:00:34 -04:00
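
The second step of that commit can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `make_blockdev_stream_factory` and the demo path are hypothetical names, and the inner function only mirrors the argument signature that Werkzeug's `stream_factory` hook is called with. In a real handler it would presumably be passed to Werkzeug's form parser (for example `werkzeug.formparser.parse_form_data(environ, stream_factory=...)`) at the point where the file is ready to be saved.

```python
import os
import tempfile


def make_blockdev_stream_factory(target_path):
    """Return a Werkzeug-style stream factory that writes to target_path.

    target_path is assumed to be the temporary block device; any writable
    path works for the purposes of this sketch.
    """
    def stream_factory(total_content_length, content_type, filename,
                       content_length=None):
        # Open the destination directly instead of a temp file under /tmp.
        # Read/write binary mode, because the parser rewinds the stream.
        return open(target_path, "wb+")
    return stream_factory


# Demonstration with a regular file standing in for the block device.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
factory = make_blockdev_stream_factory(demo_path)
with factory(None, "application/octet-stream", "disk.img") as stream:
    stream.write(b"uploaded bytes")
    stream.seek(0)
    demo_contents = stream.read()
os.remove(demo_path)
```

The factory ignores the length and filename arguments on purpose: the destination is fixed in advance, so only the opened handle matters.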
Joshua Boniface 7a27503f1b Allow network-less managed networks
Allows the specification of network-less managed networks, acting like
bridged networks but over the VXLAN system instead.

Closes #107
2020-10-18 23:13:12 -04:00
Joshua Boniface e7ab1bfddd Add cluster overprovision determination
Adds a check of (n-1) memory overprovisioning. (n-1) is considered to be
the configuration that excludes the "largest" node. The cluster will
report degraded when in this state.
2020-10-18 14:57:22 -04:00
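
A rough sketch of the (n-1) check described in that commit. The data layout, the function name, and the choice to compare provisioned memory against the remaining capacity are assumptions made for illustration; the commit only states that (n-1) excludes the "largest" node.

```python
def n_minus_1_overprovisioned(nodes):
    """nodes maps node name -> {"total": MiB, "provisioned": MiB} (assumed layout)."""
    totals = [info["total"] for info in nodes.values()]
    if not totals:
        return False
    provisioned = sum(info["provisioned"] for info in nodes.values())
    # (n-1) capacity: memory of every node except the largest one.
    n_minus_1_total = sum(totals) - max(totals)
    return provisioned > n_minus_1_total


example_cluster = {
    "node1": {"total": 65536, "provisioned": 51200},
    "node2": {"total": 65536, "provisioned": 51200},
    "node3": {"total": 65536, "provisioned": 40960},
}
degraded = n_minus_1_overprovisioned(example_cluster)
```

Here the cluster provisions 143360 MiB against an (n-1) capacity of 131072 MiB, so the sketch would flag it.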
Joshua Boniface c6e34c7dc6 Bump base version to 0.9 2020-10-18 14:31:19 -04:00
Joshua Boniface f749633f7c Use provisioned memory for mem migration selector
Use the new "provisioned" memory field, instead of the "allocated"
memory field, to determine the optimal node when using the "mem"
migration selector. This will take into account non-running VMs in the
calculation as well as running VMs.
2020-10-18 14:17:15 -04:00
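
One way such a selector could look, as a sketch only: the function name, the dictionary layout, and the "most memory left after provisioning" rule are assumptions, not taken from the project's code.

```python
def select_node_by_mem(nodes):
    """Pick the node with the most memory left once provisioned VMs are counted.

    nodes maps node name -> {"total": MiB, "provisioned": MiB} (assumed layout).
    """
    return max(
        nodes,
        key=lambda name: nodes[name]["total"] - nodes[name]["provisioned"],
    )


candidates = {
    "node1": {"total": 65536, "provisioned": 60000},
    "node2": {"total": 65536, "provisioned": 20000},
}
chosen = select_node_by_mem(candidates)
```

Because stopped VMs count toward "provisioned", a node full of shut-down VMs is no longer treated as empty.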
Joshua Boniface a4b80be5ed Add provisioned memory to node info
Adds a separate field to the node memory, "provisioned", which totals
the amount of memory provisioned to all VMs on the node, regardless of
state, and in contrast to "allocated" which only counts running VMs.

Allows for the detection of potential overprovisioned states when
factoring in non-running VMs.

Includes the supporting code to get this data, since the original
implementation of VM memory selection was dependent on the VM being
running and getting this from libvirt. Now, if the VM is not active, it
gets this from the domain XML instead.
2020-10-18 14:17:15 -04:00
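
The difference between the two fields can be illustrated with a small sketch. The VM records and the `"start"` state label are hypothetical; only the distinction itself (all VMs versus running VMs) comes from the commit message.

```python
def node_memory_totals(vms):
    """vms is a list of {"memory": MiB, "state": str} records (assumed layout)."""
    # Provisioned counts every VM assigned to the node, regardless of state.
    provisioned = sum(vm["memory"] for vm in vms)
    # Allocated counts only VMs that are currently running.
    allocated = sum(vm["memory"] for vm in vms if vm["state"] == "start")
    return {"provisioned": provisioned, "allocated": allocated}


node_totals = node_memory_totals([
    {"memory": 4096, "state": "start"},
    {"memory": 8192, "state": "stop"},
])
```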
Joshua Boniface 9d7067469a Correct proper type of uploads 2020-10-16 11:47:09 -04:00
Joshua Boniface 891aeca388 Bump Debian changelog version 2020-10-15 11:02:41 -04:00
Joshua Boniface aa5f8c93fd Entirely disable IPv6 on bridged interfaces
Prevents any potential leakage due to autoconfigured IPv6 on bridged
interfaces. These are exclusively VM-side bridges, and the PVC host
should not have any IPv6 configuration on them, ever.
2020-10-15 11:00:59 -04:00
Joshua Boniface 9366977fe6 Copy d_domain before iterating
Prevents a bug where the thread can crash due to a change in the
d_domain object while running the for loop. By copying and iterating
over the copy, this becomes safer.
2020-09-16 15:12:37 -04:00
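
The failure mode and the fix can be reproduced in a few lines. This sketch assumes `d_domain` behaves like a plain dict and simulates the concurrent change by deleting entries inside the loop; the function names are illustrative.

```python
def iterate_live(domains):
    """Iterate the dict itself while it changes; returns the error text."""
    try:
        for name in domains:
            del domains[name]  # stands in for another thread's change
    except RuntimeError as err:
        return str(err)
    return None


def iterate_copy(domains):
    """Iterate over a snapshot, so changes to the original are harmless."""
    seen = []
    for name in domains.copy():
        seen.append(name)
        del domains[name]
    return seen


live_error = iterate_live({"vm1": object(), "vm2": object()})
copied_names = iterate_copy({"vm1": object(), "vm2": object()})
```

The copy is shallow, which is enough here because only the set of keys changes during iteration.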
Joshua Boniface 973c78b8e0 Use monkeypatch to allow multithreaded prod flask
Without this, tasks were blocking when other tasks were active (for
instance, any task with --wait). Using the monkeypatch, these no longer
block.
2020-08-28 02:09:31 -04:00
Joshua Boniface 65b44f2955 Avoid breaking keepalive during incoming migration
The keepalive was getting stuck gathering memoryStats from the
non-running VM, since it was in a paused state. Avoid this by just
skipping past the rest of the stats gathering if the VM isn't running.
2020-08-28 01:47:36 -04:00
Joshua Boniface 7ce1bfd930 Fix bad integer/string in base convert 2020-08-28 01:08:48 -04:00
Joshua Boniface 423da08f5f Add colour indication if alloc mem is above total
Shows an "overprovisioned" state clearly without adding a hacky
additional domain state to the system.
2020-08-28 00:33:50 -04:00
Joshua Boniface 45542bfd67 Avoid verifying SSL on local connections
Since these will almost always connect to an IP rather than a "real"
hostname, don't verify the SSL cert (if applicable). Also allow the
overriding of SSL verification via an environment variable.

As a consequence, to reduce spam, SSL warnings are disabled for urllib3.
Instead, we warn in the "Using cluster" output whenever verification is
disabled.
2020-08-27 23:54:18 -04:00
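
A minimal sketch of that decision logic. The environment variable name `EXAMPLE_VERIFY_SSL`, the accepted values, and the default of verifying non-local connections are all assumptions for illustration; the commit does not name them.

```python
import os


def should_verify_ssl(is_local, environ=None):
    """Decide whether to verify the SSL certificate for a connection."""
    environ = os.environ if environ is None else environ
    # Hypothetical override variable; the real name is not given in the commit.
    override = environ.get("EXAMPLE_VERIFY_SSL")
    if override is not None:
        return override.strip().lower() in ("1", "true", "yes")
    # Local connections usually target an IP, so skip verification there.
    return not is_local
```

The returned flag could then be passed to the HTTP client and echoed in the "Using cluster" output whenever it is false.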
Joshua Boniface 7bf91b1003 Improve store file handling for CLI
Don't try to chmod every time, instead only chmod when first creating
the file. Also allow loading the default permission from an envvar
rather than hardcoding it.
2020-08-27 13:14:55 -04:00
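
The pattern described there might look like the following sketch. The function name, the JSON format, the `EXAMPLE_STORE_MODE` variable, and the `600` default are hypothetical stand-ins.

```python
import json
import os
import stat
import tempfile


def save_store(path, data, mode=None):
    """Write the store file, applying permissions only when it is first created."""
    if mode is None:
        # Hypothetical envvar holding an octal permission string.
        mode = int(os.environ.get("EXAMPLE_STORE_MODE", "600"), 8)
    created = not os.path.exists(path)
    with open(path, "w") as fh:
        json.dump(data, fh)
    if created:
        os.chmod(path, mode)
    return created


with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "store.json")
    first = save_store(store, {"local": {}}, mode=0o600)
    mode_after_create = stat.S_IMODE(os.stat(store).st_mode)
    os.chmod(store, 0o640)  # the user loosens permissions by hand
    second = save_store(store, {"local": {"host": "127.0.0.1"}}, mode=0o600)
    mode_after_update = stat.S_IMODE(os.stat(store).st_mode)
```

The second write leaves the manually chosen permissions alone, which is the point of only applying chmod at creation time.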
Joshua Boniface 4fbec63bf4 Add missing dependency for CLI 2020-08-27 13:14:46 -04:00
Joshua Boniface b51f0a339d Fix bug in SSL enabled WSGI server 2020-08-26 13:52:45 -04:00
Joshua Boniface fc9df76570 Standardize package building
1. Only build on GitLab when there's a tag.
2. Add the packages on GitLab to component "pvc" in the repo.
3. Add build-unstable-deb.sh script to build git-versioned packages.
4. Revamp build-and-deploy to use build-unstable-deb.sh and cut down on
   output.
2020-08-26 11:04:58 -04:00
Joshua Boniface 78dec77987 Bump version to 0.8 2020-08-26 10:24:44 -04:00
Joshua Boniface 6dc6dae26c Disable gtod_reduce for benchmarks
This ended up disabling latency measurements entirely, so don't use this
option for benchmarks.
2020-08-25 17:02:06 -04:00
Joshua Boniface 0089ec4e17 Multiply KiB values by 1024 in detail output
Since these are KiB and not B. Also fix some other anomalies.
2020-08-25 15:01:24 -04:00
Joshua Boniface 486408753b Don't print results to output 2020-08-25 13:38:46 -04:00
Joshua Boniface 169e174d85 Fix size of test volume to 8GB 2020-08-25 13:29:22 -04:00
Joshua Boniface 354150f757 Restore build-and-deploy script 2020-08-25 13:12:20 -04:00
Joshua Boniface eb06c1494e Add API spec for benchmark results 2020-08-25 12:43:16 -04:00
Joshua Boniface bb7b1a2bd0 Remove aggrpct from results
This value is useless to us since we're not running combined read/write
tests at all.
2020-08-25 12:38:49 -04:00
Joshua Boniface 70b9caedc3 Correct typo 2020-08-25 12:23:12 -04:00
Joshua Boniface 2731aa060c Finalize tests and output formatting 2020-08-25 12:16:23 -04:00
Joshua Boniface 18bcd39b46 Use nicer header format 2020-08-25 02:11:34 -04:00
Joshua Boniface d210eef200 Parse response message properly 2020-08-25 02:08:35 -04:00
Joshua Boniface 1dcc1f6d55 Rename sample database for API
From pvcprov to pvcapi to facilitate the changing nature of this
database and its expansion to benchmark results.
2020-08-25 01:59:35 -04:00
Joshua Boniface 887e14a4e2 Add storage benchmarking to API 2020-08-25 01:57:21 -04:00
Joshua Boniface e4891831ce Better handle missing elements from net config
Prevents situations with an un-editable, invalid config being stuck.
2020-08-21 10:27:45 -04:00
Joshua Boniface 1967034493 Use get() for all remaining VM XML gets
Prevents KeyErrors and such.
2020-08-21 10:10:13 -04:00
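
The difference between indexed access and `get()` can be shown with the standard library's ElementTree. The XML fragment is invented, and the project's own XML handling may use a different parser.

```python
import xml.etree.ElementTree as ET

disk = ET.fromstring('<disk type="network" device="disk"/>')

present = disk.get("type")                # attribute exists
missing = disk.get("bus")                 # absent attribute yields None
with_default = disk.get("bus", "virtio")  # or a caller-supplied default

try:
    disk.attrib["bus"]                    # indexed access raises instead
    raised_key_error = False
except KeyError:
    raised_key_error = True
```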
Joshua Boniface 921e57ca78 Fix syntax error 2020-08-20 23:05:56 -04:00
Joshua Boniface 3cc7df63f2 Add configurable VM shutdown timeout
Closes #102
2020-08-20 21:26:12 -04:00
Joshua Boniface 3dbdd12d8f Correct invalid comparison in template VNI add 2020-08-18 09:48:56 -04:00
Joshua Boniface 7e2114b536 Add initial monitoring configurations to daemon
Initial work to support multiple monitoring agents including Munin,
Check_MK, and NRPE at the least.
2020-08-17 17:05:55 -04:00
Joshua Boniface e8e65934e3 Use logger prefix for thread debug logs 2020-08-17 14:30:21 -04:00
Joshua Boniface 24fda8a73f Use new debug logger for DNS Aggregator 2020-08-17 14:26:43 -04:00
Joshua Boniface 9b3ef6d610 Add connect timeout to Ceph
This doesn't seem to actually do anything (like most of these
timeouts...) but add it just for posterity.
2020-08-17 13:58:14 -04:00
Joshua Boniface b451c0e8e3 Add additional start/finish debug messages 2020-08-17 13:11:03 -04:00
Joshua Boniface f9b126a106 Make zkhandler accept failures more robustly
Most of these would silently fail if there was e.g. an issue with the ZK
connection. Instead, encase things in try blocks and handle the
exceptions in a more graceful way, returning None or False if
applicable. Except for locks, which should retry 5 times before
aborting.
2020-08-17 13:03:36 -04:00
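
A generic sketch of the two behaviours described there, written against plain callables rather than a real ZooKeeper client. The function names and the delay between attempts are assumptions; only "return None or False" and "retry 5 times" come from the commit message.

```python
import time


def safe_read(read, key):
    """Return the value for key, or None if the underlying call fails."""
    try:
        return read(key)
    except Exception:
        return None


def acquire_with_retry(acquire, attempts=5, delay=0.0):
    """Try to take a lock up to `attempts` times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            acquire()
            return True
        except Exception:
            if attempt < attempts:
                time.sleep(delay)  # assumed pause between attempts
    return False


# A fake lock that fails twice and then succeeds.
calls = {"count": 0}


def flaky_acquire():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("lock not available")


lock_taken = acquire_with_retry(flaky_acquire)
```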
Joshua Boniface 553f96e7ef Use logger for debug output
Using simple print statements was annoying (lack of timing info and
formatting), so move to using the debug logger for these instead with a
custom state ('d') with white text to differentiate them. Also indicate
which subthread of the keepalive each task is being executed in for
easier tracing of issues.
2020-08-17 12:46:52 -04:00
Joshua Boniface 15e78aa9f0 Add status information in cluster status
Provide textual explanations for the degraded status, including
specific node/VM/OSD issues as well as detailed Ceph health. "Single
pane of glass" mentality.
2020-08-17 12:25:23 -04:00
Joshua Boniface 65add58c9a Properly properly handle issue 2020-08-16 11:38:39 -04:00