Commit Graph

2404 Commits

Author SHA1 Message Date
Joshua Boniface 252175fb6f Revamp benchmark tests
1. Move to a time-based (60s) benchmark to avoid these taking an absurd
amount of time to show the same information.

2. Eliminate the 256k random benchmarks, since they don't really add
anything.

3. Add in a 4k single-queue benchmark as this might provide valuable
insight into latency.

4. Adjust the output to reflect the above changes.

While this does change the benchmarking, this should not invalidate any
existing benchmarks since most of the test suit is unchanged (especially
the most important 4M sequential and 4K random tests). It simply removes
an unused entry and adds a more helpful one. The time-based change
should not significantly affect the results either, just reduces the
total runtime for long-tests and increase the runtime for quick tests to
provide a better picture.
2021-09-29 20:51:30 -04:00
Joshua Boniface f39b041471 Add primary node to benchmark job name
Ensures tracking of the current primary node the job was run on, since
this may be relevant for performance reasons.
2021-09-28 09:58:22 -04:00
Joshua Boniface 3b41759262 Add timeouts to queue gets and adjust
Ensure that all keepalive timeouts are set (prevent the queue.get()
actions from blocking forever) and set the thread timeouts to line up as
well. Everything here is thus limited to keepalive_interval seconds
(default 5s) to keep it uniform.
2021-09-27 16:10:27 -04:00
Joshua Boniface e514eed414 Re-add success log output during migration 2021-09-27 11:50:55 -04:00
Joshua Boniface b81e70ec18 Fix missing character in log message 2021-09-27 00:49:43 -04:00
Joshua Boniface c2a473ed8b Simplify VM migration down to 3 steps
Remove two superfluous synchronization steps which are not needed here,
since the exclusive lock handles that situation anyways.

Still does not fix the weird flush->unflush lock timeout bug, but is
better worked-around now due to the cancelling of the other wait freeing
this up and continuing.
2021-09-27 00:03:20 -04:00
Joshua Boniface 5355f6ff48 Work around synchronization lock issues
Make the block on stage C only wait for 900 seconds (15 minutes) to
prevent indefinite blocking.

The issue comes if a VM is being received, and the current unflush is
cancelled for a flush. When this happens, this lock acquisition seems to
block for no obvious reason, and no other changes seem to affect it.
This is certainly some sort of locking bug within Kazoo but I can't
diagnose it as-is. Leave a TODO to look into this again in the future.
2021-09-26 23:26:21 -04:00
Joshua Boniface bf7823deb5 Improve log messages during VM migration 2021-09-26 23:15:38 -04:00
Joshua Boniface 8ba371723e Use event to non-block wait and fix inf wait 2021-09-26 22:55:39 -04:00
Joshua Boniface e10ac52116 Track status of VM state thread 2021-09-26 22:55:21 -04:00
Joshua Boniface 341073521b Simplify locking process for VM migration
Rather than using a cumbersome and overly complex ping-pong of read and
write locks, instead move to a much simpler process using exclusive
locks.

Describing the process in ASCII or narrative is cumbersome, but the
process ping-pongs via a set of exclusive locks and wait timers, so that
the two sides are able to synchronize via blocking the exclusive lock.
The end result is a much more streamlined migration (takes about half
the time all things considered) which should be less error-prone.
2021-09-26 22:08:07 -04:00
Joshua Boniface 16c38da5ef Fix failure to connect to libvirt in keepalive
This should be caught and abort the thread rather than failing and
holding up keepalives.
2021-09-26 20:42:01 -04:00
Joshua Boniface c8134d3a1c Fix several bugs in fence handling
1. Output from ipmitool was not being stripped, and stray newlines were
throwing off the comparisons. Fixes this.

2. Several stages were lacking meaningful messages. Adds these in so the
output is more clear about what is going on.

3. Reduce the sleep time after a fence to just 1x the
keepalive_interval, rather than 2x, because this seemed like excessively
long even for slow IPMI interfaces, especially since we're checking the
power state now anyways.

4. Set the node daemon state to an explicit 'fenced' state after a
successful fence to indicate to users that the node was indeed fenced
successfully and not still 'dead'.
2021-09-26 20:07:30 -04:00
Joshua Boniface 9f41373324 Ensure pvc-flush is after network-online 2021-09-26 17:40:42 -04:00
Joshua Boniface 8e62d5b30b Fix typo in log message 2021-09-26 03:35:30 -04:00
Joshua Boniface 7a8eee244a Tweak CLI helptext around OSD actions
Adds some more detail about OSD commands and their values.
2021-09-26 01:29:23 -04:00
Joshua Boniface 7df5b8e52e Fix typo in sgdisk command options 2021-09-26 00:59:05 -04:00
Joshua Boniface 6f96219023 Use re.search instead of re.match
Required since we're not matching the start of the string.
2021-09-26 00:55:29 -04:00
Joshua Boniface 51967e164b Raise basic exceptions in CephInstance
Avoids no exception to reraise errors on failures.
2021-09-26 00:50:10 -04:00
Joshua Boniface 7a3a44d47c Fix OSD creation for partition paths and fix gdisk
The previous implementation did not work with /dev/nvme devices or any
/dev/disk/by-* devices due to some logical failures in the partition
naming scheme, so fix these, and be explicit about what is supported in
the PVC CLI command output.

The 'echo | gdisk' implementation of partition creation also did not
work due to limitations of subprocess.run; instead, use sgdisk which
allows these commands to be written out explicitly and is included in
the same package as gdisk.
2021-09-26 00:12:28 -04:00
Joshua Boniface 44491dd988 Add support for configurable OSD DB ratios
The default of 0.05 (5%) is likely ideal in the initial implementation,
but allow this to be set explicitly for maximum flexibility in
space-constrained or performance-critical use-cases.
2021-09-24 01:06:39 -04:00
Joshua Boniface eba142f470 Bump version to 0.9.36 2021-09-23 14:01:38 -04:00
Joshua Boniface 6cef68d157 Add separate OSD DB device support
Adds in three parts:

1. Create an API endpoint to create OSD DB volume groups on a device.
Passed through to the node via the same command pipeline as
creating/removing OSDs, and creates a volume group with a fixed name
(osd-db).

2. Adds API support for specifying whether or not to use this DB volume
group when creating a new OSD via the "ext_db" flag. Naming and sizing
is fixed for simplicity and based on Ceph recommendations (5% of OSD
size). The Zookeeper schema tracks the block device to use during
removal.

3. Adds CLI support for the new and modified API endpoints, as well as
displaying the block device and DB block device in the OSD list.

While I debated supporting adding a DB device to an existing OSD, in
practice this ended up being a very complex operation involving stopping
the OSD and setting some options, so this is not supported; this can be
specified during OSD creation only.

Closes #142
2021-09-23 13:59:49 -04:00
Joshua Boniface e8caf3369e Move console watcher stop try up
Could cause an exception if d_domain is not defined yet.
2021-09-22 16:02:04 -04:00
Joshua Boniface 3e3776a25b Bump version to 0.9.35 2021-09-13 02:20:46 -04:00
Joshua Boniface 6e0d0e264e Add memory and vCPU checks to VM define/modify
Ensures that a VM won't:

(a) Have provisioned more RAM than there is available on a given node.
Due to memory overprovisioning, this is simply a "is the VM memory count
more than the node count", and doesn't factor in free or used memory on
a node, total cluster usage, etc. So if a node has 64GB total RAM, the
VM limit is 64GB. It is up to an administrator to ensure sanity *below*
that value.

(b) Have provisioned more vCPUs than there are CPU cores on the node,
minus 2 to account for hypervisor/storage processes. Will ensure there
is no severe CPU contention caused by a single VM having more vCPUs than
there are actual execution threads available.

Closes #139
2021-09-13 01:51:21 -04:00
Joshua Boniface 1855d03a36 Add pool size check when resizing volumes
Closes #140
2021-09-12 19:54:51 -04:00
Joshua Boniface 1a286dc8dd Increase build-and-deploy sleep 2021-09-12 19:50:58 -04:00
Joshua Boniface 1b6d10e03a Handle VM disk/network stats gathering exceptions 2021-09-12 19:41:07 -04:00
Joshua Boniface 73c96d1e93 Add VM device hot attach/detach support
Adds a new API endpoint to support hot attach/detach of devices, and the
corresponding client-side logic to use this endpoint when doing VM
network/storage add/remove actions.

The live attach is now the default behaviour for these types of
additions and removals, and can be disabled if needed.

Closes #141
2021-09-12 19:33:00 -04:00
Joshua Boniface 5841c98a59 Adjust lint script for newer linter 2021-09-12 15:40:38 -04:00
Joshua Boniface bc6395c959 Don't crash cleanup if no this_node 2021-08-29 03:52:18 -04:00
Joshua Boniface d582f87472 Change default node object state to flushed 2021-08-29 03:34:08 -04:00
Joshua Boniface e9735113af Bump version to 0.9.34 2021-08-24 16:15:25 -04:00
Joshua Boniface 722fd0a65d Properly handle =-separated fsargs 2021-08-24 11:40:22 -04:00
Joshua Boniface 3b41beb0f3 Convert argument elements of task status to types 2021-08-23 14:28:12 -04:00
Joshua Boniface d3392c0282 Fix typo in output message 2021-08-23 00:39:19 -04:00
Joshua Boniface 560c013e95 Bump version to 0.9.33 2021-08-21 03:28:48 -04:00
Joshua Boniface 384c6320ef Avoid failing if no provisioner tasks 2021-08-21 03:25:16 -04:00
Joshua Boniface 445dec1c38 Ensure pycache files are removed on deb creation 2021-08-21 03:19:18 -04:00
Joshua Boniface 534c7cd7f0 Refactor pvcnoded to reduce Daemon.py size
This branch commit refactors the pvcnoded component to better adhere to
good programming practices. The previous Daemon.py was a massive file
which contained almost 2000 lines of direct, root-level code which was
directly imported. Not only was this poor practice, but this resulted
in a nigh-unmaintainable file which was hard even for me to understand.

This refactoring splits a large section of the code from Daemon.py into
separate small modules and functions in the `util/` directory. This will
hopefully make most of the functionality easy to find and modify without
having to dig through a single large file.

Further the existing subcomponents have been moved to the `objects/`
directory which clearly separates them.

Finally, the Daemon.py code has mostly been moved into a function,
`entrypoint()`, which is then called from the `pvcnoded.py` stub.

An additional item is that most format strings have been replaced by
f-strings to make use of the Python 3.6 features in Daemon.py and the
utility files.
2021-08-21 03:14:22 -04:00
Joshua Boniface 4014ef7714 Bump version to 0.9.32 2021-08-19 12:37:58 -04:00
Joshua Boniface 180f0445ac Properly handle exceptions getting VM stats 2021-08-19 12:36:31 -04:00
Joshua Boniface 074664d4c1 Fix image dimensions and size 2021-08-18 19:51:55 -04:00
Joshua Boniface 418ac23d40 Add screenshots to docs 2021-08-18 19:49:53 -04:00
Joshua Boniface 13e309b450 Fix colours of network status elements 2021-08-18 19:41:53 -04:00
Joshua Boniface 7ecc6a2635 Bump version to 0.9.31 2021-07-30 12:08:12 -04:00
Joshua Boniface 73e8149cb0 Remove explicit image-features from rbd cmd
This should be managed in ceph.conf with the `rbd default
features` configuration option instead, and thus can be tailored to the
underlying OS version.
2021-07-30 11:33:59 -04:00
Joshua Boniface 4a7246b8c0 Ensure RBD resize has bytes appended
If this isn't, the resize will be interpreted as a MB value and result
in an absurdly big volume instead. This is the same consistency
validation that occurs on add.
2021-07-30 11:25:13 -04:00
Joshua Boniface c49351469b Revert "Ensure consistent sizing of volumes"
This reverts commit dc03e95bbf.
2021-07-29 15:30:00 -04:00