125 Commits

SHA1 Message Date
518d699c15 Bump version to 0.9.10 2020-12-15 10:45:15 -05:00
7c99a7bda7 Safely reset RBD locks on failed VMs
Should correct issues on cold start, as well as when a VM crashes
uncleanly, where stale RBD locks would otherwise prevent the VM from
starting.

This implementation has four parts:
  1. Update how IP addresses are handled, specifically by replacing all
  previous instances of "vni_ipaddr" with "vni_floatingipaddr", and then
  re-adding "vni_ipaddr" with the real data for this node's IPs. Also
  include the storage IPs, which were not included before, so that
  this_node actually holds the local IPs plus the floating IPs. This
  enables the next two steps.
  2. Modify flush_locks to take this_node as an argument, and update the
  run_command function to operate only on this node rather than on the
  primary coordinator.
  3. Have flush_locks check each lock against the current node, to
  verify that the lock is actually held by the current node. This is the
  only safe way to flush the locks. During fencing, we override this by
  not passing a this_node, which bypasses the check.
  4. Have the VM start routine check for a VM failure or cold startup and
  run flush_locks before actually starting the VM.
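The ownership check in steps 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the PVC code: each lock is assumed to be a dict with "id", "locker" and "address" keys (roughly what `rbd lock list` reports), addresses are assumed to look like "IP:port/nonce" with IPv4 IPs, and the helpers `lock_ip`, `locks_to_flush` and `removal_commands` are hypothetical. Passing `this_node_ips=None` models the fencing case, where the check is bypassed.

```python
from typing import Dict, List, Optional, Set

# Hypothetical lock shape: {"id": ..., "locker": ..., "address": ...}
Lock = Dict[str, str]


def lock_ip(lock: Lock) -> str:
    """Extract the IP from an address assumed to look like "10.0.0.1:0/123" (IPv4 only)."""
    return lock["address"].split(":", 1)[0]


def locks_to_flush(locks: List[Lock], this_node_ips: Optional[Set[str]]) -> List[Lock]:
    """Return the locks that are safe to remove.

    With this_node_ips=None (the fencing case) every lock is returned;
    otherwise only locks held from one of this node's IPs are returned.
    """
    if this_node_ips is None:
        return list(locks)
    return [lock for lock in locks if lock_ip(lock) in this_node_ips]


def removal_commands(image: str, locks: List[Lock]) -> List[List[str]]:
    """Build `rbd lock remove <image> <lock-id> <locker>` argument lists to run on this node."""
    return [["rbd", "lock", "remove", image, lock["id"], lock["locker"]] for lock in locks]
```

Filtering by IP is what makes step 1 necessary: this_node must know all of its own addresses, including the storage IPs, or a lock it legitimately holds would be skipped.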
2020-12-14 15:53:18 -05:00
89c7e225a0 Move OSD stats uploading to primary only
Instead of each node uploading its own OSD stats, which would not work
if the PVC daemon wasn't running, have the primary upload stats for all
OSDs in the cluster.
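A minimal sketch of the primary-only upload. The router state string "primary", the key layout and the plain mapping standing in for the cluster database are all hypothetical; the point is only that non-primary nodes write nothing and the primary writes every OSD.

```python
import json
from typing import Dict, MutableMapping


def upload_osd_stats(router_state: str, osd_stats: Dict[int, dict],
                     store: MutableMapping[str, str]) -> int:
    """Write stats for every OSD in the cluster, but only on the primary.

    router_state, the key layout and store are hypothetical stand-ins.
    Returns the number of OSD entries written.
    """
    if router_state != "primary":
        # Non-primary nodes no longer upload anything
        return 0
    for osd_id, stats in osd_stats.items():
        store[f"/ceph/osds/{osd_id}/stats"] = json.dumps(stats)
    return len(osd_stats)
```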
2020-12-09 02:46:09 -05:00
b36ec43a2d Bump version to 0.9.9 2020-12-09 02:20:20 -05:00
ce5ee11841 Bump version to 0.9.8 2020-11-24 12:26:57 -05:00
d4a28d7a58 Bump version to 0.9.7 2020-11-19 10:48:28 -05:00
e69eb93cb3 Bump version to 0.9.6 2020-11-17 13:01:54 -05:00
a4e5323e81 Bump version to 0.9.5 2020-11-17 12:34:04 -05:00
9053edacd8 Bump version to 0.9.4 2020-11-10 15:33:50 -05:00
baac8f24fd Bump version to 0.9.3 2020-11-09 10:28:15 -05:00
11702f4bc8 Bump version to 0.9.2 2020-11-08 02:03:29 -05:00
6f66b77a00 Lint: E121/E126 continuation line under/over-indented for hanging indent 2020-11-07 15:06:21 -05:00
260b39ebf2 Lint: E302 expected 2 blank lines, found X 2020-11-07 14:45:24 -05:00
c3dfe2e381 Lint: F821 undefined name 'myshorthostname' 2020-11-07 13:31:19 -05:00
961ebb4c01 Lint: E305 expected 2 blank lines after class or function definition, found X 2020-11-07 13:17:49 -05:00
e553c5d42a Lint: E122 continuation line missing indentation or outdented 2020-11-07 13:12:26 -05:00
7932be3948 Lint: E261 at least two spaces before inline comment 2020-11-07 13:11:03 -05:00
d2490419c5 Lint: E202 whitespace before ']' 2020-11-07 13:02:54 -05:00
d2e5ede399 Lint: E202 whitespace before ')' 2020-11-07 12:58:54 -05:00
3f242cd437 Lint: E202 whitespace before '}' 2020-11-07 12:57:42 -05:00
b7daa8e1f6 E201 whitespace after '[' 2020-11-07 12:39:59 -05:00
c88965e898 Lint: E201 whitespace after '(' 2020-11-07 12:39:27 -05:00
e333f2b935 Lint: E201 whitespace after '{' 2020-11-07 12:38:31 -05:00
8c623023d5 Lint: F811 redefinition of unused '<function>' 2020-11-07 12:14:29 -05:00
2eef6a1c21 Lint: E265 block comment should start with '# ' 2020-11-06 21:32:17 -05:00
5da314902f Lint: F841 local variable '<variable>' is assigned to but never used 2020-11-06 21:13:13 -05:00
98a573bbc7 Lint: E402 module level import not at top of file 2020-11-06 20:40:32 -05:00
aecb845d6a Lint: E713 test for membership should be 'not in' 2020-11-06 20:37:52 -05:00
57c51d3234 Lint: E711 comparison to None should be 'if cond is not None:' 2020-11-06 19:37:13 -05:00
ce01b41d81 Lint: E711 comparison to None should be 'if cond is None:' 2020-11-06 19:36:36 -05:00
ebf254f62d Lint: W293 blank line contains whitespace 2020-11-06 19:11:07 -05:00
63f4f9aed7 Lint: E722 do not use bare 'except' 2020-11-06 18:55:10 -05:00
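For reference, a few of the lint codes above correspond to idioms like these. The function is a made-up example, not code from the repository; it shows the corrected forms for E713 (`not in`), E711 (`is None`) and E722 (catching a named exception rather than using a bare `except`).

```python
from typing import Dict, Optional


def lookup_port(config: Dict[str, Optional[str]], key: str) -> Optional[int]:
    """Hypothetical helper written in the lint-clean style."""
    if key not in config:  # E713: not `if not key in config`
        return None
    value = config[key]
    if value is None:  # E711: not `if value == None`
        return None
    try:
        return int(value)
    except ValueError:  # E722: not a bare `except:`
        return None
```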
56ba7b1457 Bump version to 0.9.1 2020-10-29 12:16:38 -04:00
0f299777f1 Modify version to 3-digit numbering
I expect 0.9 will be fairly long-lived, so add another decimal place so
I may continue adding tweaks to it.

THIS IS NOT SEMVER.
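With a third component, 0.9.10 follows 0.9.9, so comparing version strings lexically gives the wrong order. A minimal sketch of numeric ordering, assuming versions are plain dot-separated integers (`version_key` is a hypothetical helper):

```python
from typing import Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn "0.9.10" into (0, 9, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))
```

For example, `sorted(["0.9.9", "0.9.10", "0.9.1"], key=version_key)` gives `["0.9.1", "0.9.9", "0.9.10"]`, whereas a plain string sort would place "0.9.10" before "0.9.9".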
2020-10-26 02:13:11 -04:00
c6e34c7dc6 Bump base version to 0.9 2020-10-18 14:31:19 -04:00
a4b80be5ed Add provisioned memory to node info
Adds a separate field to the node memory, "provisioned", which totals
the amount of memory provisioned to all VMs on the node regardless of
state, in contrast to "allocated", which only counts running VMs.

Allows for the detection of potential overprovisioning when factoring
in non-running VMs.

Includes the supporting code to get this data, since the original
implementation of VM memory selection depended on the VM running and
getting the value from libvirt. Now, if the VM is not active, the value
is read from the domain XML instead.
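A sketch of both pieces: reading memory from a libvirt domain XML `<memory>` element, and computing "provisioned" versus "allocated". The VM dict shape (`memory_mib`, `state`) and the "running" state name are hypothetical, and only a subset of libvirt's memory units is handled.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Mapping

# Subset of libvirt's binary memory units, expressed in KiB (libvirt's default unit)
_UNIT_KIB = {"k": 1, "KiB": 1, "M": 1024, "MiB": 1024, "G": 1024 ** 2, "GiB": 1024 ** 2}


def memory_mib_from_xml(domain_xml: str) -> int:
    """Read the <memory> element of a domain XML and return its size in MiB."""
    mem = ET.fromstring(domain_xml).find("memory")
    if mem is None or mem.text is None:
        raise ValueError("domain XML has no <memory> element")
    unit = mem.get("unit", "KiB")
    return int(mem.text) * _UNIT_KIB[unit] // 1024


def node_memory(vms: Iterable[Mapping]) -> dict:
    """"provisioned" counts every VM on the node; "allocated" counts only running ones."""
    provisioned = allocated = 0
    for vm in vms:
        provisioned += vm["memory_mib"]
        if vm["state"] == "running":
            allocated += vm["memory_mib"]
    return {"provisioned": provisioned, "allocated": allocated}
```

A node where "provisioned" exceeds physical memory is overprovisioned even if "allocated" currently fits, which is the case this field is meant to expose.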
2020-10-18 14:17:15 -04:00
9366977fe6 Copy d_domain before iterating
Prevents a bug where the thread can crash due to a change in the
d_domain object while the for loop is running. Iterating over a copy
instead makes this safer.
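The failure being avoided is Python's "dictionary changed size during iteration" error. A minimal sketch of the pattern (`prune_stopped` and `is_running` are hypothetical names); here the mutation happens in the loop body, but the same applies when another thread changes the dict. Copying narrows the problem; it is not a substitute for real locking.

```python
from typing import Callable, Dict


def prune_stopped(d_domain: Dict[str, object], is_running: Callable[[object], bool]) -> None:
    """Remove non-running domains from d_domain.

    Iterating over d_domain.copy() means changes to d_domain itself cannot
    invalidate the iterator, which would otherwise raise
    "RuntimeError: dictionary changed size during iteration".
    """
    for name, dom in d_domain.copy().items():
        if not is_running(dom):
            del d_domain[name]
```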
2020-09-16 15:12:37 -04:00
65b44f2955 Avoid breaking keepalive during incoming migration
The keepalive was getting stuck gathering memoryStats from the
non-running VM, since it was in a paused state. Avoid this by just
skipping past the rest of the stats gathering if the VM isn't running.
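A sketch of the early exit, written against libvirt-style domain objects: `state()` returns `[state, reason]` and `memoryStats()` returns a dict, as on libvirt's `virDomain`. The constant is redefined locally with libvirt's value so the sketch runs without the bindings; `gather_vm_stats` is a hypothetical name.

```python
from typing import Any, Dict

# Same value as libvirt.VIR_DOMAIN_RUNNING; redefined so this sketch runs without libvirt
VIR_DOMAIN_RUNNING = 1


def gather_vm_stats(dom: Any) -> Dict[str, Any]:
    """Collect stats from a libvirt-style domain, skipping VMs that aren't running."""
    state, _reason = dom.state()
    stats: Dict[str, Any] = {"state": state}
    if state != VIR_DOMAIN_RUNNING:
        # e.g. a paused VM during an incoming migration: don't query it further
        return stats
    stats["memory"] = dom.memoryStats()
    return stats
```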
2020-08-28 01:47:36 -04:00
78dec77987 Bump version to 0.8 2020-08-26 10:24:44 -04:00
921e57ca78 Fix syntax error 2020-08-20 23:05:56 -04:00
3cc7df63f2 Add configurable VM shutdown timeout
Closes #102
2020-08-20 21:26:12 -04:00
e8e65934e3 Use logger prefix for thread debug logs 2020-08-17 14:30:21 -04:00
9b3ef6d610 Add connect timeout to Ceph
This doesn't seem to actually do anything (like most of these
timeouts...), but add it just for posterity.
2020-08-17 13:58:14 -04:00
b451c0e8e3 Add additional start/finish debug messages 2020-08-17 13:11:03 -04:00
553f96e7ef Use logger for debug output
Using simple print statements was annoying (no timing info or
formatting), so use the debug logger for these instead, with a custom
state ('d') in white text to differentiate them. Also indicate which
subthread of the keepalive each task is executed in, for easier tracing
of issues.
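A sketch of what such a debug line could contain. The actual PVC log layout is not shown in this log, so the timestamp format, the ANSI colour code and the `format_debug` name are assumptions; the relevant parts are the timestamp, the 'd' state marker and the subthread prefix.

```python
from datetime import datetime

WHITE = "\033[97m"  # ANSI bright white (assumed colour code)
RESET = "\033[0m"


def format_debug(message: str, prefix: str, when: datetime, colour: bool = True) -> str:
    """Format a debug line with a timestamp, a 'd' state marker and a subthread prefix."""
    state = "d"
    if colour:
        state = f"{WHITE}{state}{RESET}"
    return f"{when:%Y/%m/%d %H:%M:%S.%f} {state} {prefix}: {message}"
```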
2020-08-17 12:46:52 -04:00
65add58c9a Properly properly handle issue 2020-08-16 11:38:39 -04:00
0a01d84290 Tie fence timers to keepalive_interval
Also wait 2 full keepalive intervals after fencing before doing anything
else, to give the Ceph cluster a chance to recover.
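The timing relationship can be written out as a small sketch. The two-interval post-fence wait comes from the commit above; the number of missed keepalives before fencing (`fence_intervals`) is a hypothetical parameter, as is the class itself.

```python
from dataclasses import dataclass


@dataclass
class FenceTiming:
    keepalive_interval: int  # seconds between keepalives
    fence_intervals: int     # hypothetical: missed keepalives before fencing starts

    @property
    def fence_timeout(self) -> int:
        """Seconds without a keepalive before a node is fenced."""
        return self.keepalive_interval * self.fence_intervals

    @property
    def post_fence_wait(self) -> int:
        """Two full keepalive intervals, giving Ceph a chance to recover."""
        return 2 * self.keepalive_interval
```

With `keepalive_interval=5` and `fence_intervals=6`, fencing starts after 30 seconds and the daemon then waits 10 seconds. Changing the keepalive interval scales both timers together.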
2020-08-15 12:38:03 -04:00
4afb288429 Properly handle missing domain_name fail 2020-08-15 12:07:23 -04:00
985ad5edc0 Warn if fencing will fail
Verify our IPMI state on startup, and then warn if fencing will fail.
For now, this is sufficient, but in future (requires refactoring) we
might want to adjust how fencing occurs based on this information.
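A sketch of a startup IPMI check using `ipmitool` over the `lanplus` interface. Whether PVC reaches the BMC this way is an assumption, and the helper names are hypothetical. Note that passing the password with `-P` exposes it in the process list; `ipmitool` also supports other ways of supplying credentials.

```python
import subprocess
from typing import List


def ipmi_status_command(host: str, user: str, password: str) -> List[str]:
    """Build an ipmitool command asking the BMC for the chassis power state."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password,
            "chassis", "power", "status"]


def ipmi_reply_ok(returncode: int, stdout: str) -> bool:
    """Treat the BMC as usable if ipmitool succeeded and reported a power state."""
    return returncode == 0 and stdout.strip().startswith("Chassis Power is")


def check_ipmi(host: str, user: str, password: str, timeout: float = 5.0) -> bool:
    """Return False on any error, so the caller can warn that fencing will fail."""
    try:
        result = subprocess.run(ipmi_status_command(host, user, password),
                                capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return ipmi_reply_ok(result.returncode, result.stdout)
```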
2020-08-13 14:42:18 -04:00
0587bcbd67 Go back to manual command for OSD stats
Using the Ceph library was a disaster here; it had no timeout and no
way to force it to continue, so keepalives would become stuck and
trigger fence storms. Go back to the manual osd dump command with a 2s
timeout, which is far more reliable and can be adequately terminated if
it runs long.
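A sketch of running the dump as a subprocess with a hard timeout. `ceph osd dump --format json` is a real command, but the exact invocation PVC uses is an assumption, and `osd_dump` is a hypothetical name. `subprocess.run()` kills the child when the timeout expires, which is what keeps a hung command from stalling the keepalive.

```python
import json
import subprocess
from typing import Optional, Sequence

OSD_DUMP_CMD = ("ceph", "osd", "dump", "--format", "json")


def osd_dump(cmd: Sequence[str] = OSD_DUMP_CMD, timeout: float = 2.0) -> Optional[dict]:
    """Run the OSD dump command, giving up after `timeout` seconds.

    Returns the parsed JSON, or None on timeout, a non-zero exit status,
    a missing binary, or unparseable output.
    """
    try:
        # On timeout, subprocess.run() kills the child and raises TimeoutExpired
        result = subprocess.run(list(cmd), capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    try:
        return json.loads(result.stdout)
    except json.JSONDecodeError:
        return None
```

Returning None instead of raising lets the caller skip one round of stats and try again on the next keepalive.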
2020-08-12 22:31:25 -04:00