71 Commits

Author SHA1 Message Date
3910843660 Add missing break 2020-10-20 15:39:29 -04:00
70f3fdbfb9 Tweak the delays slightly on receive 2020-10-20 15:38:07 -04:00
7cb0241a12 Attempt live migrates 3 times before proceeding 2020-10-20 15:33:41 -04:00
9fb33ed7a7 Increase peer lock acquiring timers 2020-10-20 15:26:59 -04:00
abfe0108ab Better handle aborting migrations 2020-10-20 15:22:16 -04:00
567fe8f36b Wait for existing migrations before proceeding 2020-10-20 15:12:32 -04:00
ec7b78b9b8 Add additional short sleep in receive 2020-10-20 13:29:17 -04:00
224c8082ef Alter text of synchronization messages 2020-10-20 13:08:18 -04:00
f9e7e9884f Improve handling of VM migrations
The VM migration code was very old, very spaghettified, and prone to
strange failures.

Improve this by taking cues from the node primary migration. Use
synchronization between the nodes to ensure lockstep completion of the
migration in discrete steps.

A proper queue can be built later to integrate with this code more
cleanly.

References #108
2020-10-20 13:01:55 -04:00
726501f4d4 Add additional logging to flush selector
Adds additional debug logging to the flush selector to determine how and
why any given node is selected. Useful for troubleshooting strange
choices.
2020-10-20 12:34:18 -04:00
c6e34c7dc6 Bump base version to 0.9 2020-10-18 14:31:19 -04:00
f749633f7c Use provisioned memory for mem migration selector
Use the new "provisioned" memory field, instead of the "allocated"
memory field, to determine the optimal node when using the "mem"
migration selector. This will take into account non-running VMs in the
calculation as well as running VMs.
2020-10-18 14:17:15 -04:00
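The effect of switching the "mem" selector from "allocated" to "provisioned" memory can be illustrated with a minimal sketch. The node records, field names, and the rule "pick the node with the most memory left over" are assumptions for illustration, not the daemon's actual data model or selection code.

```python
# Hypothetical node records; the field names are illustrative only.
NODES = [
    {"name": "hv1", "total_mib": 65536, "allocated_mib": 8192, "provisioned_mib": 57344},
    {"name": "hv2", "total_mib": 65536, "allocated_mib": 16384, "provisioned_mib": 24576},
    {"name": "hv3", "total_mib": 65536, "allocated_mib": 12288, "provisioned_mib": 40960},
]


def select_by_mem(nodes, field="provisioned_mib"):
    """Return the name of the node with the most memory left after `field`.

    Assumption: "optimal" means the largest (total - used) difference.
    """
    best = max(nodes, key=lambda node: node["total_mib"] - node[field])
    return best["name"]
```

With these sample figures, selecting on "allocated" would favour hv1, which has few running VMs but many stopped ones, while selecting on "provisioned" favours hv2.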
a4b80be5ed Add provisioned memory to node info
Adds a separate field to the node memory, "provisioned", which totals
the amount of memory provisioned to all VMs on the node, regardless of
state, and in contrast to "allocated" which only counts running VMs.

Allows for the detection of potential overprovisioned states when
factoring in non-running VMs.

Includes the supporting code to get this data, since the original
implementation of VM memory selection was dependent on the VM being
running and getting this from libvirt. Now, if the VM is not active, it
gets this from the domain XML instead.
2020-10-18 14:17:15 -04:00
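A rough sketch of the two ideas in this commit: reading a VM's memory from its libvirt domain XML when it is not running, and totalling "allocated" versus "provisioned" memory. The function names, the VM record layout, and the "running" state label are hypothetical; only the `<memory unit='...'>` element, whose default unit is KiB, comes from the libvirt domain XML format. Only the KiB, MiB, and GiB units are handled here.

```python
import xml.etree.ElementTree as ET

# Conversion factors to KiB for the units this sketch supports.
_UNIT_TO_KIB = {"KiB": 1, "MiB": 1024, "GiB": 1024 * 1024}


def memory_mib_from_domain_xml(domain_xml):
    """Read the <memory> element of a libvirt domain XML string, in MiB."""
    memory = ET.fromstring(domain_xml).find("memory")
    unit = memory.get("unit", "KiB")  # libvirt defaults to KiB when unit is omitted
    return int(memory.text) * _UNIT_TO_KIB[unit] // 1024


def node_memory_totals(vms):
    """Total memory for running VMs (allocated) and for all VMs (provisioned)."""
    provisioned = sum(vm["mem_mib"] for vm in vms)
    allocated = sum(vm["mem_mib"] for vm in vms if vm["state"] == "running")
    return {"allocated": allocated, "provisioned": provisioned}
```

A node whose "provisioned" total exceeds its physical memory while "allocated" does not would be a candidate for the overprovisioning warning the commit describes.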
aa5f8c93fd Entirely disable IPv6 on bridged interfaces
Prevents any potential leakage due to autoconfigured IPv6 on bridged
interfaces. These are exclusively VM-side bridges, and the PVC host
should not have any IPv6 configuration on them, ever.
2020-10-15 11:00:59 -04:00
9366977fe6 Copy d_domain before iterating
Prevents a bug where the thread can crash due to a change in the
d_domain object while running the for loop. By copying and iterating
over the copy, this becomes safer.
2020-09-16 15:12:37 -04:00
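The copy-before-iterating pattern can be shown in isolation. This is a minimal sketch, not the daemon's code: `stop_all` is a hypothetical helper, and the dictionary is mutated from the same thread rather than a second one, which triggers the same class of error (`RuntimeError: dictionary changed size during iteration`) when no copy is taken.

```python
def stop_all(d_domain):
    """Remove every entry while looping, using a snapshot of the dict."""
    stopped = []
    # Iterate over a shallow copy so that changes to d_domain made during
    # the loop (here, deletions) cannot invalidate the iterator.
    for name, _instance in d_domain.copy().items():
        stopped.append(name)
        del d_domain[name]
    return stopped
```

A shallow copy only protects the loop itself; the objects stored in the dictionary are still shared with any other thread.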
65b44f2955 Avoid breaking keepalive during incoming migration
The keepalive was getting stuck gathering memoryStats from the
non-running VM, since it was in a paused state. Avoid this by just
skipping past the rest of the stats gathering if the VM isn't running.
2020-08-28 01:47:36 -04:00
78dec77987 Bump version to 0.8 2020-08-26 10:24:44 -04:00
921e57ca78 Fix syntax error 2020-08-20 23:05:56 -04:00
3cc7df63f2 Add configurable VM shutdown timeout
Closes #102
2020-08-20 21:26:12 -04:00
e8e65934e3 Use logger prefix for thread debug logs 2020-08-17 14:30:21 -04:00
24fda8a73f Use new debug logger for DNS Aggregator 2020-08-17 14:26:43 -04:00
9b3ef6d610 Add connect timeout to Ceph
This doesn't seem to actually do anything (like most of these
timeouts...) but add it just for posterity.
2020-08-17 13:58:14 -04:00
b451c0e8e3 Add additional start/finish debug messages 2020-08-17 13:11:03 -04:00
f9b126a106 Make zkhandler accept failures more robustly
Most of these would silently fail if there was e.g. an issue with the ZK
connection. Instead, encase things in try blocks and handle the
exceptions in a more graceful way, returning None or False if
applicable. Except for locks, which should retry 5 times before
aborting.
2020-08-17 13:03:36 -04:00
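The error-handling approach described here might look roughly like the following. Both function names are hypothetical, `zk_conn` stands for any client whose `get(key)` returns a `(data, stat)` pair, and "retry 5 times" is read as five attempts in total with `None` returned on abort; the actual zkhandler may differ on each of these points.

```python
import time


def readdata(zk_conn, key):
    """Return the decoded value at `key`, or None if the read fails."""
    try:
        data, _stat = zk_conn.get(key)
        return data.decode("utf8")
    except Exception:
        # Report the failure to the caller instead of failing silently
        # or crashing the calling thread.
        return None


def acquire_with_retry(make_lock, attempts=5, delay=0.5):
    """Call make_lock() up to `attempts` times; return None if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            return make_lock()
        except Exception:
            if attempt < attempts:
                time.sleep(delay)  # brief pause before the next attempt
    return None
```

Callers then need to check for `None` explicitly, since it now signals a failed read rather than a missing value.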
553f96e7ef Use logger for debug output
Using simple print statements was annoying (lack of timing info and
formatting), so move to using the debug logger for these instead with a
custom state ('d') with white text to differentiate them. Also indicate
which subthread of the keepalive each task is being executed in for
easier tracing of issues.
2020-08-17 12:46:52 -04:00
65add58c9a Properly properly handle issue 2020-08-16 11:38:39 -04:00
0a01d84290 Tie fence timers to keepalive_interval
Also wait 2 full keepalive intervals after fencing before doing anything
else, to give the Ceph cluster a chance to recover.
2020-08-15 12:38:03 -04:00
4afb288429 Properly handle missing domain_name fail 2020-08-15 12:07:23 -04:00
985ad5edc0 Warn if fencing will fail
Verify our IPMI state on startup, and then warn if fencing will fail.
For now, this is sufficient, but in future (requires refactoring) we
might want to adjust how fencing occurs based on this information.
2020-08-13 14:42:18 -04:00
0587bcbd67 Go back to manual command for OSD stats
Using the Ceph library was a disaster here; it had no timeout or way to
force it to continue, so keepalives would become stuck and trigger fence
storms. Go back to the manual osd dump command with a 2s timeout which
is far more reliable and can be adequately terminated if it runs long.
2020-08-12 22:31:25 -04:00
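One way to run an external command with a hard time limit is sketched below using Python's `subprocess.run` and its `timeout` argument. Whether the daemon uses this mechanism or another (for example a shell-level timeout) is not stated in the commit; `run_with_timeout` is a hypothetical helper.

```python
import subprocess


def run_with_timeout(command, timeout=2):
    """Run `command` (a list of arguments); return stdout, or None on failure.

    subprocess.run kills the child and raises TimeoutExpired if it runs
    longer than `timeout` seconds, so the caller is never blocked for long.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout
```

Under these assumptions, the OSD statistics call would be something like `run_with_timeout(["ceph", "osd", "dump", "--format", "json"])`, with a `None` result treated as "no data this cycle" rather than as a reason to stall the keepalive.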
e0cb4a58c3 Ensure zk_listener is readded after reconnect 2020-08-11 12:46:15 -04:00
099c58ead8 Fix missing char in log message 2020-08-11 12:40:35 -04:00
0e5c681ada Clean up imports
Make several imports more specific to reduce redundant code imports and
improve memory utilization.
2020-08-11 12:09:10 -04:00
46ffe352e3 Better handle subthread timeouts in keepalive
Prevent the main keepalive thread from getting stuck due to a subthread
taking an enormous time. If this happens, the rest of the main keepalive
will continue onward, thus ensuring that the main keepalive does not
fail for a significant number of cycles, which would cause a fence.
2020-08-11 11:37:26 -04:00
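A minimal sketch of bounding the time a parent thread waits for its subthreads, using `Thread.join(timeout=...)` against a shared deadline. The helper name and the task dictionary are hypothetical; the commit does not say how the daemon implements the timeout.

```python
import threading
import time


def run_subthreads(tasks, timeout):
    """Start each task in its own thread and wait at most `timeout` seconds.

    Returns the names of tasks still running at the deadline, so the
    caller can log them and carry on with the rest of its work.
    """
    threads = [
        threading.Thread(target=fn, name=name, daemon=True)
        for name, fn in tasks.items()
    ]
    for thread in threads:
        thread.start()

    deadline = time.monotonic() + timeout
    timed_out = []
    for thread in threads:
        # Share one deadline across all joins so the total wait is bounded.
        thread.join(timeout=max(0.0, deadline - time.monotonic()))
        if thread.is_alive():
            timed_out.append(thread.name)
    return timed_out
```

Note that `join` with a timeout does not stop the slow thread; it only stops waiting for it, so a stuck subthread keeps running in the background.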
ccee124c8b Adjust fence failcount limit to 6 (30s)
The previous saving throw limit (3/15s) seems to have been too low. I
was observing bizarre failures where a node would be fenced while it was
still starting up. Some of this may have been related to Zookeeper
connections taking too long, but this was inconsistent.

Increase this to 6 saving throws (30s). This provides significantly more
time for a node to properly check in on startup before another node
fences it. In the real world, 15s vs 30s isn't that big of a downtime
change, but prevents false-positive fences.
2020-08-05 22:40:07 -04:00
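The figures in this message (3 saving throws for 15s, 6 for 30s) imply a 5-second keepalive interval. The arithmetic is sketched below; the function names and the use of `>=` as the fence condition are assumptions.

```python
KEEPALIVE_INTERVAL = 5  # seconds; implied by the 3/15s and 6/30s figures


def fence_window(failcount_limit, keepalive_interval=KEEPALIVE_INTERVAL):
    """Seconds a node may miss keepalives before it is fenced."""
    return failcount_limit * keepalive_interval


def should_fence(missed_keepalives, failcount_limit=6):
    """Assumed rule: fence once the missed count reaches the limit."""
    return missed_keepalives >= failcount_limit
```

Under this rule a node that misses three keepalives in a row during startup, which the old limit would have fenced, now has three more intervals to check in.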
02343079c0 Improve fencing migrate layout
Open the option to do this in parallel with some threads
2020-08-05 22:26:01 -04:00
37b83aad6a Add logging and use better conditional 2020-08-05 21:57:36 -04:00
876f2424e0 Ensure dead state isn't written erroneously 2020-08-05 21:57:11 -04:00
5871380e1b Avoid crashing VM stats thread if domain migrated 2020-06-10 17:10:46 -04:00
654a3cb7fa Improve debug output and use ceph df util data 2020-06-06 22:52:49 -04:00
9b65d3271a Improve handling of Ceph status gathering
Use the Rados library instead of random OS commands, which massively
improves the performance of these tasks.

Closes #97
2020-06-06 22:30:25 -04:00
598b2025e8 Use Rados and add Ceph entries to pvcnoded.yaml 2020-06-06 21:12:51 -04:00
70b787d1fd Move all VM functions into thread 2020-06-06 15:44:05 -04:00
e1310a05f2 Implement recording of VM stats during keepalive 2020-06-06 15:34:03 -04:00
2ad6860dfe Move Ceph statistics gathering into thread 2020-06-06 13:25:02 -04:00
cebb4bbc1a Comment cleanup 2020-06-06 13:20:40 -04:00
a672e06dd2 Move fencing to end of keepalive function 2020-06-06 13:19:11 -04:00
1db73bb892 Move libvirt closure into previous section 2020-06-06 13:18:37 -04:00
c1956072f0 Rename update_zookeeper function to node_keepalive 2020-06-06 12:49:50 -04:00
ce60836c34 Allow enforcement of live migration
Provides a CLI and API argument to force live migration, which triggers
a new VM state "migrate-live". The node daemon VMInstance during migrate
will read this flag from the state and, if enforced, will not trigger a
shutdown migration.

Closes #95
2020-06-06 12:00:44 -04:00
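The decision this flag controls can be sketched as follows. Only the "migrate-live" state name comes from the commit; the plain "migrate" state, the callables, the return labels, and the behaviour of aborting when a forced live migration fails are all assumptions. The default of three live attempts is borrowed from the later commit 7cb0241a12 in this log.

```python
def migrate_vm(state, try_live, do_shutdown, attempts=3):
    """Attempt a live migration; fall back to shutdown only if permitted.

    try_live: callable returning True when a live migration succeeds.
    do_shutdown: callable performing a shutdown-based migration.
    """
    force_live = state == "migrate-live"
    for _ in range(attempts):
        if try_live():
            return "live"
    if force_live:
        # Live migration was enforced: never trigger a shutdown migration.
        return "aborted"
    do_shutdown()
    return "shutdown"
```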