284 Commits

Author SHA1 Message Date
Joshua Boniface 1e4350ca6f Properly handle takeover state in VXNetworks
Most of these actions/conditionals were looking for primary state, but
were failing during node takeover. Update the conditionals to look for
both router states instead.

Also add a wait to lock flushing until a takeover is completed.
2020-03-02 10:41:00 -05:00
Joshua Boniface 57768f2583 Remove an obsolete script 2020-02-19 21:40:23 -05:00
Joshua Boniface e4e4e336b4 Handle invalid cursor setup cleanly
This seems to happen only during termination, so catch it and continue
so the loop terminates.
2020-02-19 16:29:59 -05:00
Joshua Boniface d2a5fe59c0 Use transitional takeover states for migration
Use a pair of transitional states, "takeover" and "relinquish", when
transitioning between primary and secondary coordinator states. This
provides a cluster-wide record that the nodes are still working during
their synchronous transition states, and should allow clients to
determine when the node(s) have fully switched over. Also add an
additional 2 seconds of wait at the end of the transition jobs to ensure
everything has had a chance to start before proceeding.

References #72
2020-02-19 14:06:54 -05:00
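The transitional states described in d2a5fe59c0 can be pictured as a small state machine. The following is only a sketch: the names `RouterState`, `ALLOWED`, `transition` and `is_settled` are hypothetical and are not taken from the PVC source.

```python
from enum import Enum


class RouterState(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"
    TAKEOVER = "takeover"      # secondary -> primary, still in progress
    RELINQUISH = "relinquish"  # primary -> secondary, still in progress


# Hypothetical transition table: every switch passes through a
# transitional state before reaching its final state.
ALLOWED = {
    RouterState.SECONDARY: {RouterState.TAKEOVER},
    RouterState.TAKEOVER: {RouterState.PRIMARY},
    RouterState.PRIMARY: {RouterState.RELINQUISH},
    RouterState.RELINQUISH: {RouterState.SECONDARY},
}


def transition(current: RouterState, target: RouterState) -> RouterState:
    """Return the new state, or raise if the step is not in ALLOWED."""
    if target not in ALLOWED[current]:
        raise ValueError(f"invalid transition: {current.value} -> {target.value}")
    return target


def is_settled(state: RouterState) -> bool:
    """True once a node is no longer in a transitional state."""
    return state in (RouterState.PRIMARY, RouterState.SECONDARY)
```

A client polling the cluster could, under these assumptions, wait until `is_settled` holds for every coordinator before treating the switchover as complete.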
Joshua Boniface 9c7041f12c Update package version to 0.7 2020-02-15 23:25:47 -05:00
Joshua Boniface 7ace5b5056 Remove /ceph/cmd pipe for (most) Ceph commands
Addresses #80
2020-02-08 23:40:02 -05:00
Joshua Boniface 37310e5455 Correct name of systemd target 2020-02-08 20:39:07 -05:00
Joshua Boniface ce985234c3 Use consistent naming of components
Rename "pvcd" to "pvcnoded", and "pvc-api" to "pvcapid" so names for the
daemons are fully consistent. Update the names of the configuration
files as well to match this new formatting.

References #79
2020-02-08 19:34:07 -05:00
Joshua Boniface 4505b239eb Rename API and common Debian packages
Closes #79
2020-02-08 18:50:38 -05:00
Joshua Boniface 74228eb063 Bump version to 0.6 2020-02-08 18:27:39 -05:00
Joshua Boniface 90e42683c6 Reduce sleep time during VM migrations 2020-02-04 17:52:37 -05:00
Joshua Boniface 20c8466296 Handle invalid search fields better 2020-02-04 17:35:24 -05:00
Joshua Boniface ab28bf40d1 Change ordering of services during primary switch
Fixes #77
2020-01-30 09:18:56 -05:00
Joshua Boniface 5d73974e95 Fix several bugs around load-based migrations 2020-01-29 17:35:10 -05:00
Joshua Boniface 0b31bab797 Add more helpful config parse error message 2020-01-22 12:09:31 -05:00
Joshua Boniface 4c1b78d7a4 Use dictionary get() to prevent crashes
Use the get() function throughout to prevent crashes in various
scenarios if the profile data isn't present or consistent.
2020-01-13 09:21:57 -05:00
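For illustration, the difference between direct indexing and `get()` looks like this. The profile layout and the helper names (`vcpu_count`, `userdata_name`) are assumptions made up for the example, not the actual PVC data model.

```python
# Hypothetical profile record; the "userdata" key is deliberately missing.
profile = {"name": "default", "system_details": {"vcpu_count": 2}}


def vcpu_count(profile: dict) -> int:
    """Read a nested value, falling back to 1 if any level is absent."""
    return profile.get("system_details", {}).get("vcpu_count", 1)


def userdata_name(profile: dict):
    """Return None for a missing key, where profile["userdata"] would raise KeyError."""
    return profile.get("userdata")
```

Chaining `get()` with an empty-dict default keeps the nested lookup from raising when an intermediate key is missing.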
Joshua Boniface 4ad29f669d Update default configuration samples 2020-01-12 21:33:15 -05:00
Joshua Boniface 0d2e22a111 Normalize all static networks with bridges
Modifies the storage and upstream networks to mirror the cluster
network, with a bridge on top of the underlying specified dev, and all
IPs bound to the bridge.

Allows creating VMs in the storage or upstream networks, as well as the
cluster network, should the administrator choose to do so (manually).
2020-01-12 19:04:31 -05:00
Joshua Boniface 1671a87dd4 Fix the flush service 2020-01-11 17:04:12 -05:00
Joshua Boniface b6474198a4 Implement cluster maintenance mode
Implements a "maintenance mode" for PVC clusters. For now, the only
thing this mode does is disable node fencing while the state is true.
This allows the administrator to tell PVC that network connectivity,
etc. might be interrupted and to avoid fencing nodes.

Closes #70
2020-01-09 10:53:27 -05:00
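The behaviour described in b6474198a4 amounts to a guard in front of the fencing decision. A minimal sketch, assuming a hypothetical `should_fence` helper and boolean inputs (how PVC actually stores and reads the flag is not shown in this log):

```python
def should_fence(node_is_dead: bool, maintenance: bool) -> bool:
    """Fence a dead node only when the cluster is not in maintenance mode."""
    return node_is_dead and not maintenance
```

With maintenance mode enabled, an unreachable node is left alone; once the mode is disabled, the normal fencing decision applies again.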
Joshua Boniface 4e5bce4975 Update copyright header year to 2020 2020-01-08 19:38:02 -05:00
Joshua Boniface c515d63340 Add provision state for VMs 2020-01-08 17:40:02 -05:00
Joshua Boniface 21d87f5e51 Add v6 configurations to dnsmasq
These options were only applied with v4 networks; now, use the v6
address in a dual-stack or v6-only network.
2020-01-06 23:48:04 -05:00
Joshua Boniface f326fd99e2 Properly fix IPv4 no-DHCP networking 2020-01-06 22:31:37 -05:00
Joshua Boniface 38dae8b32f Change name of cluster in patronictl command 2020-01-06 16:37:17 -05:00
Joshua Boniface 2d2bdb879e Use get() instead of direct dict reference 2020-01-06 16:34:39 -05:00
Joshua Boniface 30d4470c8f Only print AXFR errors in debug mode 2020-01-06 16:04:37 -05:00
Joshua Boniface bbfadac5e1 Fix dnsmasq options for DHCP-disabled networks 2020-01-06 16:04:26 -05:00
Joshua Boniface 7b3e267f7a Implement bridge_device for bridged VNIs
Required due to #64. Bridged networks were being created on top of a
vLAN if the Cluster network was a vLAN device, rather than being created
on the underlying device. This came from a previous revision of the
cluster architecture guidelines where Cluster was supposed to be a raw
device rather than a vLAN. This fixed the problem by implementing a
configuration field for a "bridge_device", a NIC device that can then
have the bridged vLANs created on top of it.

Fixes #64
2020-01-06 14:44:56 -05:00
Joshua Boniface 094ac8c3a8 Ensure stdout is used 2020-01-06 12:34:35 -05:00
Joshua Boniface 13548b791d Add additional debugging and fix pool_idx loop var 2020-01-06 11:31:22 -05:00
Joshua Boniface e7bc4f7328 Handle empty None-type hostname 2020-01-05 22:46:56 -05:00
Joshua Boniface be20ba02a7 Handle VM states in flush more accurately
We don't want to block forever on a failure, so limit valid waiting
states to just those we know it should be in during a migration.
2020-01-05 15:21:16 -05:00
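The idea in be20ba02a7 can be sketched as a polling loop that blocks only while the VM is in an expected state. The function name, the state names in `waiting_states`, and the poll interval are all assumptions for the example.

```python
import time
from typing import Callable, Iterable


def wait_while_migrating(
    get_state: Callable[[], str],
    waiting_states: Iterable[str] = ("migrate", "unmigrate"),  # hypothetical names
    poll_interval: float = 0.01,
) -> str:
    """Block only while the VM reports an expected in-flight state.

    Any other state, including a failure state, ends the wait and is returned,
    so the caller never blocks forever on a failed migration.
    """
    waiting = set(waiting_states)
    while True:
        state = get_state()
        if state not in waiting:
            return state
        time.sleep(poll_interval)
```

Listing the states to wait on, rather than the states to stop on, means an unexpected state ends the wait by default.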
Joshua Boniface 7311fa561b Fix bad join with new table name 2020-01-04 15:17:27 -05:00
Joshua Boniface bf89050e8b Update userdata table name 2020-01-04 15:10:37 -05:00
Joshua Boniface 20ae2186f9 Run VM state actions in a thread
Prevents blocking the main thread(s) while a VM is changing state. In
particular, this caused some issues with nodes not responding to
cancellation/reversal of a flush/ready state until the previous
migration was finished, which could cause issues. This entire subset of
actions is now threaded and so can run on its own in the background.
2019-12-26 11:08:16 -05:00
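Running a state action in a background thread, as 20ae2186f9 describes, might look roughly like the following. `run_state_action` is a hypothetical helper, not a function from the PVC codebase.

```python
import threading
from typing import Any, Callable


def run_state_action(action: Callable[..., Any], *args: Any) -> threading.Thread:
    """Start a VM state action in its own thread and return immediately.

    The caller keeps the returned Thread so it can check is_alive() or
    join() later, instead of blocking while the action runs.
    """
    thread = threading.Thread(target=action, args=args, daemon=True)
    thread.start()
    return thread
```

Because the call returns as soon as the thread starts, the main loop stays free to react to a cancelled flush while an earlier migration is still running.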
Joshua Boniface b3483fa810 Add explicit returns from flush/ready threads 2019-12-26 11:08:00 -05:00
Joshua Boniface 47cf0a8006 Ensure migration out occurs 2019-12-25 21:11:02 -05:00
Joshua Boniface 77db36a891 Ensure migration out occurs 2019-12-25 21:02:46 -05:00
Joshua Boniface 9a39d739e8 Ensure we empty flush_thread 2019-12-25 20:29:17 -05:00
Joshua Boniface a66b834ae4 Fix several small bugs 2019-12-19 18:58:53 -05:00
Joshua Boniface b17b7bf22b Add black magic to minimize ping losses
This particular arping interval/count, along with forcing it to run in
the foreground, seems to minimize the packet loss when the primary
coordinator transitions. In extensive testing, this value consistently
results in the least loss: 1-2 pings, at a 0.025s ping interval, return
"TTL exceeded", with no other loss, and only when the
node the test VM is on is the one switching to secondary state. No other
combination of values here, nor tweaks to other parts of the code, seem
able to reduce this further, therefore this is likely the best
configuration possible.
2019-12-19 18:57:32 -05:00
Joshua Boniface 8c252aeecc Implemented coordinated locked node transitions
The previous method was a "throw it in the sea"-type migration with some
(very arbitrary) sleep statements thrown in for good measure.
Reimplement this with some hard locking. During each phase of the
transition, the nodes acquire read/write shared locks to a Zookeeper key
so that they can tightly coordinate the actions of transferring each
part of the primary state between them. This is done in a subthread to
prevent strange blocking issues that were encountered, likely due to
busyness in the existing main thread.
2019-12-19 10:56:34 -05:00
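The lock-gated handover in 8c252aeecc can be illustrated in a single process. This sketch uses a plain `threading.Lock` as a stand-in for a lock on a shared Zookeeper key; it does not use Zookeeper or its read/write lock recipes, and the phase names are invented for the example.

```python
import threading


def handover() -> list:
    """Simulate one handover phase gated by a shared lock."""
    lock = threading.Lock()  # stand-in for a lock on a shared Zookeeper key
    log = []

    def new_primary() -> None:
        # Blocks here until the old primary releases the lock.
        with lock:
            log.append("new primary: start services")

    # The old primary takes the lock before its peer starts working.
    lock.acquire()
    peer = threading.Thread(target=new_primary)
    peer.start()
    log.append("old primary: stop services")
    lock.release()
    peer.join()
    return log
```

The peer cannot begin its phase until the lock is released, so the ordering no longer depends on sleep timings.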
Joshua Boniface 0841ddf8b0 Handle integrity errors in DNS aggregator 2019-12-19 10:45:06 -05:00
Joshua Boniface 98764f1edd Clean up some aspects of node switchover 2019-12-18 21:39:40 -05:00
Joshua Boniface 23188199cb Handle failing Patroni events more gracefully 2019-12-18 21:12:22 -05:00
Joshua Boniface 2b1b78622e Fix invalid arping option
It made little difference and didn't error, but was incorrect.
2019-12-18 12:06:40 -05:00
Joshua Boniface 364ab10673 Add slight delay when stopping the metadata API 2019-12-18 11:56:04 -05:00
Joshua Boniface 39c9f911cc Increase arping interval to 0.2s 2019-12-15 14:55:34 -05:00
Joshua Boniface 686af31c08 Reduce arping interval to 0.1s 2019-12-15 12:30:45 -05:00