Commit Graph

381 Commits

Author SHA1 Message Date
Joshua Boniface 65add58c9a Properly properly handle issue 2020-08-16 11:38:39 -04:00
Joshua Boniface 0a01d84290 Tie fence timers to keepalive_interval
Also wait 2 full keepalive intervals after fencing before doing anything
else, to give the Ceph cluster a chance to recover.
2020-08-15 12:38:03 -04:00
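
A minimal sketch of the timer arithmetic this commit describes, assuming illustrative config key names (the 6-interval failcount comes from commit ccee124c8b further down):

```python
# Sketch only; values mirror the commits here, key names are illustrative.
config = {'keepalive_interval': 5, 'fence_intervals': 6}

# A node is only declared dead after this much silence
# (6 "saving throws" x 5s = 30s, per the failcount commit below):
fence_timeout = config['fence_intervals'] * config['keepalive_interval']

# After a successful fence, wait two full keepalive intervals before
# doing anything else, giving the Ceph cluster a chance to recover:
post_fence_delay = 2 * config['keepalive_interval']
```
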
Joshua Boniface 4afb288429 Properly handle missing domain_name fail 2020-08-15 12:07:23 -04:00
Joshua Boniface 985ad5edc0 Warn if fencing will fail
Verify our IPMI state on startup, and then warn if fencing will fail.
For now, this is sufficient, but in future (requires refactoring) we
might want to adjust how fencing occurs based on this information.
2020-08-13 14:42:18 -04:00
Joshua Boniface 0587bcbd67 Go back to manual command for OSD stats
Using the Ceph library was a disaster here; it had no timeout or way to
force it to continue, so keepalives would become stuck and trigger fence
storms. Go back to the manual osd dump command with a 2s timeout which
is far more reliable and can be adequately terminated if it runs long.
2020-08-12 22:31:25 -04:00
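
A hedged sketch of the approach: shell out with a hard timeout so a hung monitor can never wedge the keepalive. The exact command wrapper in the daemon differs; this shows the mechanism only.

```python
import json
import subprocess

def get_osd_stats():
    """Fetch OSD data via the ceph CLI; return None on timeout (sketch)."""
    try:
        ret = subprocess.run(
            ['ceph', 'osd', 'dump', '--format', 'json'],
            capture_output=True, timeout=2,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child when the timeout expires, so a
        # stuck command can no longer stall keepalives and trigger fences.
        return None
    return json.loads(ret.stdout)
```
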
Joshua Boniface 09c1bb6a46 Increase start delay of flush service 2020-08-11 14:17:35 -04:00
Joshua Boniface e0cb4a58c3 Ensure zk_listener is readded after reconnect 2020-08-11 12:46:15 -04:00
Joshua Boniface 099c58ead8 Fix missing char in log message 2020-08-11 12:40:35 -04:00
Joshua Boniface 0e5c681ada Clean up imports
Make several imports more specific to reduce redundant code imports and
improve memory utilization.
2020-08-11 12:09:10 -04:00
Joshua Boniface 46ffe352e3 Better handle subthread timeouts in keepalive
Prevent the main keepalive thread from getting stuck due to a subthread
taking an enormous time. If this happens, the rest of the main keepalive
will continue onward, thus ensuring that the main keepalive does not
fail for a significant number of cycles, which would cause a fence.
2020-08-11 11:37:26 -04:00
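
One way to express the guard, as a sketch (the collector callable is hypothetical): run each gathering task in a daemonized subthread and bound the join, so the main keepalive always proceeds on schedule.

```python
import threading

def run_collector(target, timeout):
    """Run one keepalive collector; give up after `timeout` seconds."""
    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # Abandon the stuck subthread; being daemonized, it cannot hold
        # the process open, and the keepalive cycle completes anyway.
        return False
    return True
```
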
Joshua Boniface ccee124c8b Adjust fence failcount limit to 6 (30s)
The previous saving throw limit (3/15s) seems to have been too low. I
was observing bizarre failures where a node would be fenced while it was
still starting up. Some of this may have been related to Zookeeper
connections taking too long, but this was inconsistent.

Increase this to 6 saving throws (30s). This provides significantly more
time for a node to properly check in on startup before another node
fences it. In the real world, 15s vs 30s isn't that big of a downtime
change, but prevents false-positive fences.
2020-08-05 22:40:07 -04:00
Joshua Boniface 02343079c0 Improve fencing migrate layout
Open the option to do this in parallel with some threads
2020-08-05 22:26:01 -04:00
Joshua Boniface 37b83aad6a Add logging and use better conditional 2020-08-05 21:57:36 -04:00
Joshua Boniface 876f2424e0 Ensure dead state isn't written erroneously 2020-08-05 21:57:11 -04:00
Joshua Boniface 5871380e1b Avoid crashing VM stats thread if domain migrated 2020-06-10 17:10:46 -04:00
Joshua Boniface 654a3cb7fa Improve debug output and use ceph df util data 2020-06-06 22:52:49 -04:00
Joshua Boniface 9b65d3271a Improve handling of Ceph status gathering
Use the Rados library instead of random OS commands, which massively
improves the performance of these tasks.

Closes #97
2020-06-06 22:30:25 -04:00
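
The library approach, roughly (note that commit 0587bcbd67 above later moved the OSD dump back to the CLI when this call proved impossible to time-bound reliably). The conffile path is the stock Ceph location and may differ per cluster:

```python
import json
import rados

conn = rados.Rados(conffile='/etc/ceph/ceph.conf')
conn.connect()
# Issue a mon command directly instead of shelling out to `ceph status`.
cmd = json.dumps({'prefix': 'status', 'format': 'json'})
ret, outbuf, outs = conn.mon_command(cmd, b'', timeout=5)
if ret == 0:
    status = json.loads(outbuf)
conn.shutdown()
```
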
Joshua Boniface 598b2025e8 Use Rados and add Ceph entries to pvcnoded.yaml 2020-06-06 21:12:51 -04:00
Joshua Boniface 70b787d1fd Move all VM functions into thread 2020-06-06 15:44:05 -04:00
Joshua Boniface e1310a05f2 Implement recording of VM stats during keepalive 2020-06-06 15:34:03 -04:00
Joshua Boniface 2ad6860dfe Move Ceph statistics gathering into thread 2020-06-06 13:25:02 -04:00
Joshua Boniface cebb4bbc1a Comment cleanup 2020-06-06 13:20:40 -04:00
Joshua Boniface a672e06dd2 Move fencing to end of keepalive function 2020-06-06 13:19:11 -04:00
Joshua Boniface 1db73bb892 Move libvirt closure into previous section 2020-06-06 13:18:37 -04:00
Joshua Boniface c1956072f0 Rename update_zookeeper function to node_keepalive 2020-06-06 12:49:50 -04:00
Joshua Boniface ce60836c34 Allow enforcement of live migration
Provides a CLI and API argument to force live migration, which triggers
a new VM state "migrate-live". The node daemon VMInstance during migrate
will read this flag from the state and, if enforced, will not trigger a
shutdown migration.

Closes #95
2020-06-06 12:00:44 -04:00
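
Sketched below under assumed names (the Zookeeper path and helpers are hypothetical, not the actual VMInstance code): the daemon reads the state and skips the shutdown fallback when live migration is enforced.

```python
def migrate(self):
    # 'migrate' allows fallback; 'migrate-live' enforces live migration.
    state = self.zkhandler.read(f'/domains/{self.domuuid}/state')
    if self.live_migrate():
        return
    if state == 'migrate-live':
        # Enforced: never fall back to a shutdown migration;
        # report the failure instead.
        self.logger.out('Enforced live migration failed; aborting', state='e')
    else:
        self.shutdown_migrate()
```
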
Joshua Boniface b5434ba744 Fix typo in variable name 2020-06-06 11:29:48 -04:00
Joshua Boniface b9e5b14f94 Update lastnode too if a self-migrate is aborted
References #92
2020-06-04 10:28:04 -04:00
Joshua Boniface 5d2031d99e Prevent a VM migrating to the same node
Prevents a rare edge case where a node can end up "migrating" to itself.
Quick hack to fix this, though like most of the VM management should
probably be rethought/rewritten later.

Fixes #92
2020-06-04 10:26:47 -04:00
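
The "quick hack" amounts to a guard clause along these lines (attribute names illustrative):

```python
def migrate(self, target_node):
    # Refuse a target that is the node the VM is already running on.
    if target_node == self.this_node.name:
        self.logger.out('Target node matches current node; aborting migration', state='w')
        return
    ...
```
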
Joshua Boniface 5f9836f96d Add error message to OSD parse fail 2020-05-12 11:04:38 -04:00
Joshua Boniface 95c59ba629 Improve flush handling slightly 2020-05-12 11:04:38 -04:00
Joshua Boniface 72a38fd437 Correct changed dhcp_reservations key name 2020-05-09 10:00:53 -04:00
Joshua Boniface b580760537 Add missing fmt_cyan variable 2020-05-08 18:15:02 -04:00
Joshua Boniface 331027d124 Add further tweaks to takeover state checks
Just ensure that everything is proper state before proceeding
2020-04-22 11:16:19 -04:00
Joshua Boniface ae4f36b881 Hook flush into more services
Trying to ensure that pvc-flush completes before anything tries to shut
down.
2020-04-14 19:58:53 -04:00
Joshua Boniface 611e0edd80 Reorder last keepalive during cleanup
Make sure the stopping of the keepalive timer and final keepalive update
are done as the last step before complete shutdown. The previous setup
could conceivably result in a node being fenced should the cleanup
operations take longer than ~45 seconds, for instance if primary node
switchover took too long or blocked, or log watchers failed to stop
quickly enough. Ensures that keepalives will continue to be run during
the shutdown process until the last possible moment.
2020-04-12 03:49:29 -04:00
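
The resulting cleanup order, sketched with hypothetical helper names; the point is simply that the keepalive machinery is torn down last:

```python
def cleanup():
    relinquish_primary_role()   # may block on a slow switchover
    stop_log_watchers()
    # Torn down last: keepalives keep firing until the final moment, so
    # a long cleanup (~45s+) can no longer get this node fenced.
    stop_keepalive_timer()
    send_final_keepalive()
    set_node_state('stop')
```
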
Joshua Boniface b413e042a6 Improve handling of primary contention
Previously, contention could occasionally cause a flap/dual primary
contention state due to the lack of checking within this function. This
could cause a state where a node transitions to primary then is almost
immediately shifted away, which could cause undefined behaviour in the
cluster.

The solution includes several elements:
    * Implement an exclusive lock operation in zkhandler
    * Switch the become_primary function to use this exclusive lock
    * Implement exclusive locking during the contention process
    * As a failsafe, check stat versions before setting the node as the
      primary node, in case another node already has
    * Delay the start of takeover/relinquish operations by slightly
      longer than the lock timeout
    * Make the current router_state conditions more explicit (positive
      conditionals rather than negative conditionals)

The new scenario ensures that during contention, only one secondary will
ever succeed at acquiring the lock. Ideally, the other would then grab
the lock and pass, but in testing this does not seem to be the case -
the lock always times out, so the failsafe check is technically not
needed but has been left as an added safety mechanism. With this setup,
the node that fails the contention will never block the switchover nor
will it try to force itself onto the cluster after another node has
successfully won contention.

Timeouts may need to be adjusted in the future, but the base timeout of
0.4 seconds (and transition delay of 0.5 seconds) seems to work reliably
during preliminary tests.
2020-04-12 03:40:17 -04:00
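
PVC wraps Zookeeper through its zkhandler, but kazoo's lock recipe illustrates the shape of the scheme (paths, identifiers, and node values below are illustrative):

```python
from kazoo.client import KazooClient
from kazoo.exceptions import LockTimeout

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

lock = zk.Lock('/locks/primary_node', identifier='node1')
try:
    # Base timeout of 0.4s, per the commit message.
    lock.acquire(timeout=0.4)
except LockTimeout:
    # Lost the contention: never force ourselves onto the cluster.
    pass
else:
    # Failsafe: check the stat version before writing, in case another
    # node has already set itself as primary.
    data, stat = zk.get('/primary_node')
    if data.decode() == 'none':
        zk.set('/primary_node', b'node1', version=stat.version)
    lock.release()
```
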
Joshua Boniface e672d799a6 Set flush after pvcapid.service
This may or may not help, but should in theory prevent the flush from
trying to run after a (locally-running) API daemon is terminated, which
could cause an API failure and a failure to flush.
2020-04-12 01:48:50 -04:00
Joshua Boniface a130f19a19 Depend pvcnoded on Zookeeper (harder) and libvirtd 2020-04-09 09:57:53 -04:00
Joshua Boniface a671d9d457 Use consistent tense in messages 2020-04-08 22:00:51 -04:00
Joshua Boniface fee1c7dd6c Reorder cleanup and gracefully wait for flushes 2020-04-08 22:00:08 -04:00
Joshua Boniface 5d58bee34f Add some time around noded startup/shutdown
Otherwise, systemd kills networking before the node daemon fully stops
and it goes into "dead" status, which is super annoying.
2020-04-01 23:59:14 -04:00
Joshua Boniface f668412941 Don't use Requires as the dep is too hard
Requires seems to flush on every service restart which is NOT what we
want. Use Wants instead.
2020-04-01 15:15:37 -04:00
Joshua Boniface a0ebc0d3a7 Add more robust requirements to pvc-flush service 2020-04-01 15:09:44 -04:00
Joshua Boniface 98a7005c1b Add significant TimeoutSec to pvc-flush service
This will stop systemd from killing the service in the middle of a flush
or unflush operation, which completely defeats the purpose. 30 minutes
was chosen as this is a very large but still somewhat manageable value,
which should cover even a very large very loaded cluster with room to
spare.
2020-04-01 01:24:09 -04:00
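
Taken together, the last two commits imply unit directives along these lines; this is a hedged reconstruction, not the shipped pvc-flush.service:

```ini
[Unit]
Description=PVC node flush service
# Wants=, not Requires=: a noded restart must not re-trigger a flush.
Wants=pvcnoded.service
After=pvcnoded.service

[Service]
Type=oneshot
RemainAfterExit=true
# 30 minutes: systemd will not kill a long-running flush or unflush.
TimeoutSec=1800
```
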
Joshua Boniface 0a367898a0 Don't trigger aggregator fail if fine 2020-03-12 13:22:12 -04:00
Joshua Boniface c02bc0b46a Correct issues with VM lock freeing
The code was bad and used a deprecated feature.
2020-03-02 12:45:12 -05:00
Joshua Boniface 1e4350ca6f Properly handle takeover state in VXNetworks
Most of these actions/conditionals were looking for primary state, but
were failing during node takeover. Update the conditionals to look for
both router states instead.

Also add a wait to lock flushing until a takeover is completed.
2020-03-02 10:41:00 -05:00
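
The widened conditional amounts to checking for both active router states (sketch; names illustrative):

```python
# Previously these actions required strictly 'primary'; now the
# transitional takeover state also qualifies.
if this_node.router_state in ('primary', 'takeover'):
    update_firewall_rules()
```
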
Joshua Boniface 57768f2583 Remove an obsolete script 2020-02-19 21:40:23 -05:00
Joshua Boniface e4e4e336b4 Handle invalid cursor setup cleanly
This seems to happen only during termination, so catch it and continue
so the loop terminates.
2020-02-19 16:29:59 -05:00
Joshua Boniface d2a5fe59c0 Use transitional takeover states for migration
Use a pair of transitional states, "takeover" and "relinquish", when
transitioning between primary and secondary coordinator states. This
provides a cluster-wide record that the nodes are still working during
their synchronous transition states, and should allow clients to
determine when the node(s) have fully switched over. Also add an
additional 2 seconds of wait at the end of the transition jobs to ensure
everything has had a chance to start before proceeding.

References #72
2020-02-19 14:06:54 -05:00
Joshua Boniface 9c7041f12c Update package version to 0.7 2020-02-15 23:25:47 -05:00
Joshua Boniface 7ace5b5056 Remove /ceph/cmd pipe for (most) Ceph commands
Addresses #80
2020-02-08 23:40:02 -05:00
Joshua Boniface 37310e5455 Correct name of systemd target 2020-02-08 20:39:07 -05:00
Joshua Boniface ce985234c3 Use consistent naming of components
Rename "pvcd" to "pvcnoded", and "pvc-api" to "pvcapid" so names for the
daemons are fully consistent. Update the names of the configuration
files as well to match this new formatting.

References #79
2020-02-08 19:34:07 -05:00
Joshua Boniface 4505b239eb Rename API and common Debian packages
Closes #79
2020-02-08 18:50:38 -05:00
Joshua Boniface 74228eb063 Bump version to 0.6 2020-02-08 18:27:39 -05:00
Joshua Boniface 90e42683c6 Reduce sleep time during VM migrations 2020-02-04 17:52:37 -05:00
Joshua Boniface 20c8466296 Handle invalid search fields better 2020-02-04 17:35:24 -05:00
Joshua Boniface ab28bf40d1 Change ordering of services during primary switch
Fixes #77
2020-01-30 09:18:56 -05:00
Joshua Boniface 5d73974e95 Fix several bugs around load-based migrations 2020-01-29 17:35:10 -05:00
Joshua Boniface 0b31bab797 Add more helpful config parse error message 2020-01-22 12:09:31 -05:00
Joshua Boniface 4c1b78d7a4 Use dictionary get() to prevent crashes
Use the get() function throughout to prevent crashes in various
scenarios if the profile data isn't present or consistent.
2020-01-13 09:21:57 -05:00
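
The adopted pattern, for illustration (the profile structure and key are hypothetical):

```python
def get_profile_script(profile_data):
    # dict.get() returns a default (here None) instead of raising
    # KeyError when the profile data is absent or inconsistent; direct
    # indexing with profile_data['script'] would crash the caller.
    return profile_data.get('script', None)
```
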
Joshua Boniface 4ad29f669d Update default configuration samples 2020-01-12 21:33:15 -05:00
Joshua Boniface 0d2e22a111 Normalize all static networks with bridges
Modifies the storage and upstream networks to mirror the cluster
network, with a bridge on top of the underlying specified dev, and all
IPs bound to the bridge.

Allows creating VMs in the storage or upstream networks, as well as the
cluster network, should the administrator choose to do so (manually).
2020-01-12 19:04:31 -05:00
Joshua Boniface 1671a87dd4 Fix the flush service 2020-01-11 17:04:12 -05:00
Joshua Boniface b6474198a4 Implement cluster maintenance mode
Implements a "maintenance mode" for PVC clusters. For now, the only
thing this mode does is disable node fencing while the state is true.
This allows the administrator to tell PVC that network connectivity,
etc. might be interrupted and to avoid fencing nodes.

Closes #70
2020-01-09 10:53:27 -05:00
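
The guard is conceptually a single check ahead of the fence path (sketch; the key path and helper names are illustrative):

```python
# Fencing is skipped entirely while the cluster-wide flag is set.
in_maintenance = zkhandler.read('/maintenance') == 'true'
if node_is_dead and in_maintenance:
    logger.out('Node dead but cluster in maintenance; skipping fence', state='w')
elif node_is_dead:
    fence_node(node)
```
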
Joshua Boniface 4e5bce4975 Update copyright header year to 2020 2020-01-08 19:38:02 -05:00
Joshua Boniface c515d63340 Add provision state for VMs 2020-01-08 17:40:02 -05:00
Joshua Boniface 21d87f5e51 Add v6 configurations to dnsmasq
These options were only applied with v4 networks; now, use the v6
address in a dual-stack or v6-only network.
2020-01-06 23:48:04 -05:00
Joshua Boniface f326fd99e2 Properly fix IPv4 no-DHCP networking 2020-01-06 22:31:37 -05:00
Joshua Boniface 38dae8b32f Change name of cluster in patronictl command 2020-01-06 16:37:17 -05:00
Joshua Boniface 2d2bdb879e Use get() instead of direct dict reference 2020-01-06 16:34:39 -05:00
Joshua Boniface 30d4470c8f Only print AXFR errors in debug mode 2020-01-06 16:04:37 -05:00
Joshua Boniface bbfadac5e1 Fix dnsmasq options for DHCP-disabled networks 2020-01-06 16:04:26 -05:00
Joshua Boniface 7b3e267f7a Implement bridge_device for bridged VNIs
Required due to #64. Bridged networks were being created on top of a
vLAN if the Cluster network was a vLAN device, rather than being created
on the underlying device. This came from a previous revision of the
cluster architecture guidelines where Cluster was supposed to be a raw
device rather than a vLAN. This fixes the problem by implementing a
configuration field for a "bridge_device", a NIC device that can then
have the bridged vLANs created on top of it.

Fixes #64
2020-01-06 14:44:56 -05:00
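
In configuration terms, the new field looks something like the following; the surrounding YAML structure is abbreviated and the device name is an example only:

```yaml
pvc:
  system:
    configuration:
      networking:
        # Raw NIC for bridged client vLANs, used instead of the (possibly
        # vLAN-tagged) Cluster device per the architecture change in #64.
        bridge_device: ens4
```
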
Joshua Boniface 094ac8c3a8 Ensure stdout is used 2020-01-06 12:34:35 -05:00
Joshua Boniface 13548b791d Add additional debugging and fix pool_idx loop var 2020-01-06 11:31:22 -05:00
Joshua Boniface e7bc4f7328 Handle empty None-type hostname 2020-01-05 22:46:56 -05:00
Joshua Boniface be20ba02a7 Handle VM states in flush more accurately
We don't want to block forever on a failure, so limit valid waiting
states to just those we know it should be in during a migration.
2020-01-05 15:21:16 -05:00
Joshua Boniface 7311fa561b Fix bad join with new table name 2020-01-04 15:17:27 -05:00
Joshua Boniface bf89050e8b Update userdata table name 2020-01-04 15:10:37 -05:00
Joshua Boniface 20ae2186f9 Run VM state actions in a thread
Prevents blocking the main thread(s) while a VM is changing state. In
particular, this caused some issues with nodes not responding to
cancellation/reversal of a flush/ready state until the previous
migration was finished, which could cause issues. This entire subset of
actions is now threaded and so can run on its own in the background.
2019-12-26 11:08:16 -05:00
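
Schematically (handler names are hypothetical): the watcher dispatches each state change to a worker thread and returns immediately, so a flush cancellation or reversal is acted on at once.

```python
import threading

def on_state_change(vm, new_state):
    actions = {
        'start': vm.start_vm,
        'shutdown': vm.shutdown_vm,
        'migrate': vm.migrate_vm,
    }
    action = actions.get(new_state)
    if action is not None:
        # Run in the background; the watcher thread is free immediately.
        threading.Thread(target=action).start()
```
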
Joshua Boniface b3483fa810 Add explicit returns from flush/ready threads 2019-12-26 11:08:00 -05:00
Joshua Boniface 47cf0a8006 Ensure migration out occurs 2019-12-25 21:11:02 -05:00
Joshua Boniface 77db36a891 Ensure migration out occurs 2019-12-25 21:02:46 -05:00
Joshua Boniface 9a39d739e8 Ensure we empty the flush_thread 2019-12-25 20:29:17 -05:00
Joshua Boniface a66b834ae4 Fix several small bugs 2019-12-19 18:58:53 -05:00
Joshua Boniface b17b7bf22b Add black magic to minimize ping losses
This particular arping interval/count, along with forcing it to run in
the foreground, seems to minimize the packet loss when the primary
coordinator transitions. Through extensive testing, these values
consistently produce the least loss: 1-2 pings, at a 0.025s ping
interval, return "TTL exceeded", with no other loss, and only when the
node hosting the test VM is the one switching to secondary state. No other
combination of values here, nor any tweaks to other parts of the code, seems
able to reduce this further, so this is likely the best
configuration possible.
2019-12-19 18:57:32 -05:00
Joshua Boniface 8c252aeecc Implemented coordinated locked node transitions
The previous method was a "throw it in the sea"-type migration with some
(very arbitrary) sleep statements thrown in for good measure.
Reimplement this with some hard locking. During each phase of the
transition, the nodes acquire read/write shared locks to a Zookeeper key
so that they can tightly coordinate the actions of transferring each
part of the primary state between them. This is done in a subthread to
prevent strange blocking issues that were encountered, likely due to
busyness in the existing main thread.
2019-12-19 10:56:34 -05:00
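
kazoo's read/write lock recipes show the shape of the coordination (the path and the hand-off helpers below are illustrative): each phase of primary state is transferred under the write lock while the peer waits on the read side, in a subthread.

```python
from kazoo.client import KazooClient

def stop_dns_aggregator():   # illustrative hand-off step
    pass

def start_dns_aggregator():  # illustrative hand-off step
    pass

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# On the relinquishing node: hold the write lock while transferring one
# component of primary state.
with zk.WriteLock('/locks/primary_node', 'old-primary'):
    stop_dns_aggregator()

# On the acquiring node (run in a subthread to avoid blocking the main
# thread): the read lock is granted only once the writer releases.
with zk.ReadLock('/locks/primary_node', 'new-primary'):
    start_dns_aggregator()
```
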
Joshua Boniface 0841ddf8b0 Handle integrity errors in DNS aggregator 2019-12-19 10:45:06 -05:00
Joshua Boniface 98764f1edd Clean up some aspects of node switchover 2019-12-18 21:39:40 -05:00
Joshua Boniface 23188199cb Handle failing Patroni events more gracefully 2019-12-18 21:12:22 -05:00
Joshua Boniface 2b1b78622e Fix invalid arping option
It made little difference and didn't error, but was incorrect.
2019-12-18 12:06:40 -05:00
Joshua Boniface 364ab10673 Add slight delay when stopping the metadata API 2019-12-18 11:56:04 -05:00
Joshua Boniface 39c9f911cc Increase arping interval to 0.2s 2019-12-15 14:55:34 -05:00
Joshua Boniface 686af31c08 Reduce arping interval to 0.1s 2019-12-15 12:30:45 -05:00
Joshua Boniface 0a94fac407 Fix bugs around passing master
Was not passing properly and getting stuck sometimes, so modify the
checking and route creation a bit to prevent it. Seems to work.
2019-12-15 00:08:18 -05:00
Joshua Boniface b3e21a5bf8 Integrate metadata API into node daemon 2019-12-14 16:41:01 -05:00
Joshua Boniface 8c36e7618a Modify node daemon to follow API 2019-12-14 14:13:26 -05:00