269 Commits

Author SHA1 Message Date
4c1b78d7a4 Use dictionary get() to prevent crashes
Use the get() function throughout to prevent crashes in various
scenarios if the profile data isn't present or consistent.
2020-01-13 09:21:57 -05:00
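
A minimal sketch of the pattern this commit describes, assuming a profile stored as a nested dictionary. The key names (`name`, `system`, `vcpus`, `userdata`) and the `describe_profile` helper are hypothetical and are not taken from the PVC source.

```python
# Hypothetical profile record; the key names are illustrative only.
profile = {"name": "default", "system": {"vcpus": 2}}


def describe_profile(profile):
    """Build a summary without raising KeyError on missing fields."""
    # A missing or None "system" entry falls back to an empty dict.
    system = profile.get("system") or {}
    return {
        "name": profile.get("name", "N/A"),
        "vcpus": system.get("vcpus", "N/A"),
        "userdata": profile.get("userdata", "N/A"),
    }


summary = describe_profile(profile)
empty_summary = describe_profile({})
```

A direct reference such as `profile["userdata"]` would raise `KeyError` for this record, while `get()` returns the supplied default instead.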
4ad29f669d Update default configuration samples 2020-01-12 21:33:15 -05:00
0d2e22a111 Normalize all static networks with bridges
Modifies the storage and upstream networks to mirror the cluster
network, with a bridge on top of the underlying specified dev, and all
IPs bound to the bridge.

Allows creating VMs in the storage or upstream networks, as well as the
cluster network, should the administrator choose to do so (manually).
2020-01-12 19:04:31 -05:00
1671a87dd4 Fix the flush service 2020-01-11 17:04:12 -05:00
b6474198a4 Implement cluster maintenance mode
Implements a "maintenance mode" for PVC clusters. For now, the only
thing this mode does is disable node fencing while the state is true.
This allows the administrator to tell PVC that network connectivity,
etc. might be interrupted and to avoid fencing nodes.

Closes #70
2020-01-09 10:53:27 -05:00
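
The behaviour described above could be sketched roughly as follows. The `should_fence` function and its boolean arguments are hypothetical; how the real daemon stores and reads the maintenance flag is not shown in this log.

```python
def should_fence(node_is_dead, maintenance):
    """Return True only when a dead node may be fenced."""
    if maintenance:
        # Maintenance mode: the administrator expects interruptions,
        # so fencing is skipped regardless of node state.
        return False
    return node_is_dead


fence_normal = should_fence(node_is_dead=True, maintenance=False)
fence_in_maintenance = should_fence(node_is_dead=True, maintenance=True)
```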
4e5bce4975 Update copyright header year to 2020 2020-01-08 19:38:02 -05:00
c515d63340 Add provision state for VMs 2020-01-08 17:40:02 -05:00
21d87f5e51 Add v6 configurations to dnsmasq
These options were only applied with v4 networks; now, use the v6
address in a dual-stack or v6-only network.
2020-01-06 23:48:04 -05:00
f326fd99e2 Properly fix IPv4 no-DHCP networking 2020-01-06 22:31:37 -05:00
38dae8b32f Change name of cluster in patronictl command 2020-01-06 16:37:17 -05:00
2d2bdb879e Use get() instead of direct dict reference 2020-01-06 16:34:39 -05:00
30d4470c8f Only print AXFR errors in debug mode 2020-01-06 16:04:37 -05:00
bbfadac5e1 Fix dnsmasq options for DHCP-disabled networks 2020-01-06 16:04:26 -05:00
7b3e267f7a Implement bridge_device for bridged VNIs
Required due to #64. Bridged networks were being created on top of a
vLAN if the Cluster network was a vLAN device, rather than being created
on the underlying device. This came from a previous revision of the
cluster architecture guidelines where Cluster was supposed to be a raw
device rather than a vLAN. This fixes the problem by implementing a
configuration field for a "bridge_device", a NIC device that can then
have the bridged vLANs created on top of it.

Fixes #64
2020-01-06 14:44:56 -05:00
094ac8c3a8 Ensure stdout is used 2020-01-06 12:34:35 -05:00
13548b791d Add additional debugging and fix pool_idx loop var 2020-01-06 11:31:22 -05:00
e7bc4f7328 Handle empty None-type hostname 2020-01-05 22:46:56 -05:00
be20ba02a7 Handle VM states in flush more accurately
We don't want to block forever on a failure, so limit valid waiting
states to just those we know it should be in during a migration.
2020-01-05 15:21:16 -05:00
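
One way to read this change, as a sketch only: the wait loop continues only while the VM is in an explicitly listed in-flight state, so any unexpected state (for example a failure) ends the wait. The state names in `WAIT_STATES` and the `wait_for_vm` helper are assumptions for illustration, not the daemon's actual code.

```python
import time

# Assumed in-flight state names; the real set lives in the node daemon.
WAIT_STATES = {"migrate", "unmigrate", "shutdown"}


def wait_for_vm(get_state, poll_interval=0.0):
    """Poll until the VM leaves the expected in-flight states."""
    while True:
        state = get_state()
        if state not in WAIT_STATES:
            # Any other state, including a failure, stops the wait.
            return state
        time.sleep(poll_interval)


# Simulated sequence of observed states for one VM.
observed = iter(["migrate", "migrate", "fail"])
final_state = wait_for_vm(lambda: next(observed))
```

Waiting on an allowlist rather than on "anything except the desired state" is what keeps a failed VM from blocking the flush indefinitely.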
7311fa561b Fix bad join with new table name 2020-01-04 15:17:27 -05:00
bf89050e8b Update userdata table name 2020-01-04 15:10:37 -05:00
20ae2186f9 Run VM state actions in a thread
Prevents blocking the main thread(s) while a VM is changing state. In
particular, nodes did not respond to cancellation/reversal of a
flush/ready state until the previous migration had finished. This entire
subset of actions is now threaded and so can run on its own in the
background.
2019-12-26 11:08:16 -05:00
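
A rough sketch of the approach using Python's standard `threading` module. The `migrate_vm` and `start_state_action` functions are hypothetical stand-ins; the real state handlers are not shown in this log.

```python
import threading
import time


def migrate_vm(name, results):
    """Stand-in for a slow state change such as a migration."""
    time.sleep(0.05)
    results.append(name)


def start_state_action(name, results):
    """Run the action in the background and return immediately."""
    thread = threading.Thread(
        target=migrate_vm, args=(name, results), daemon=True
    )
    thread.start()
    return thread


results = []
worker = start_state_action("test-vm", results)
# The caller is free to handle other events here while the action runs.
worker.join()
```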
b3483fa810 Add explicit returns from flush/ready threads 2019-12-26 11:08:00 -05:00
47cf0a8006 Ensure migration out occurs 2019-12-25 21:11:02 -05:00
77db36a891 Ensure migration out occurs 2019-12-25 21:02:46 -05:00
9a39d739e8 Ensure we empty flush_thread 2019-12-25 20:29:17 -05:00
a66b834ae4 Fix several small bugs 2019-12-19 18:58:53 -05:00
b17b7bf22b Add black magic to minimize ping losses
This particular arping interval/count, along with forcing it to run in
the foreground, seems to minimize the packet loss when the primary
coordinator transitions. In extensive testing, this value consistently
results in the least loss: 1-2 pings, at a 0.025s ping interval, return
"TTL exceeded", with no other loss, and only when the node the test VM
is on is the one switching to secondary state. No other combination of
values here, nor tweaks to other parts of the code, seems able to reduce
this further, so this is likely the best configuration possible.
2019-12-19 18:57:32 -05:00
8c252aeecc Implemented coordinated locked node transitions
The previous method was a "throw it in the sea"-type migration with some
(very arbitrary) sleep statements thrown in for good measure.
Reimplement this with some hard locking. During each phase of the
transition, the nodes acquire read/write shared locks to a Zookeeper key
so that they can tightly coordinate the actions of transferring each
part of the primary state between them. This is done in a subthread to
prevent strange blocking issues that were encountered, likely due to
busyness in the existing main thread.
2019-12-19 10:56:34 -05:00
0841ddf8b0 Handle integrity errors in DNS aggregator 2019-12-19 10:45:06 -05:00
98764f1edd Clean up some aspects of node switchover 2019-12-18 21:39:40 -05:00
23188199cb Handle failing Patroni events more gracefully 2019-12-18 21:12:22 -05:00
2b1b78622e Fix invalid arping option
It made little difference and didn't error, but was incorrect.
2019-12-18 12:06:40 -05:00
364ab10673 Add slight delay when stopping the metadata API 2019-12-18 11:56:04 -05:00
39c9f911cc Increase arping interval to 0.2s 2019-12-15 14:55:34 -05:00
686af31c08 Reduce arping interval to 0.1s 2019-12-15 12:30:45 -05:00
0a94fac407 Fix bugs around passing master
Was not passing properly and getting stuck sometimes, so modify the
checking and route creation a bit to prevent it. Seems to work.
2019-12-15 00:08:18 -05:00
b3e21a5bf8 Integrate metadata API into node daemon 2019-12-14 16:41:01 -05:00
8c36e7618a Modify node daemon to follow API 2019-12-14 14:13:26 -05:00
78f053d81f Recreate network in aggregator if DNS changes 2019-12-13 00:03:47 -05:00
0a8dd30a48 Restart dnsmasq when network details change 2019-12-12 23:51:22 -05:00
6fa828e721 Don't stop the provisioner worker
It should probably just be running on all nodes all the time already,
but is started when a node first becomes primary.
2019-12-12 23:08:02 -05:00
c1b6ce0ff7 Reorder starting clients 2019-12-12 23:03:34 -05:00
b854d53fab Add API management to node daemon 2019-12-12 22:59:07 -05:00
88a181b20d Allow metadata API in nft rules 2019-12-11 17:04:29 -05:00
1fb560e996 Add DNS nameservers to networks 2019-12-08 23:55:45 -05:00
9cb5561e77 Move default NS record to upstream_domain 2019-12-08 23:05:32 -05:00
3471f4e57a Remove obsolete pvc-nsX and add pvc-ns name
Should point towards the floating IP.
2019-12-08 20:20:20 -05:00
356c12db2e Add ceph df output to pool data
Exposes additional information from the `ceph df` command, including
pool free space and used percentage.
2019-12-06 00:47:27 -05:00
531578fd28 Use consistent tense for VM states
Replace "failed" with "fail" and "disabled" with "disable" for
consistency with the remaining states.
2019-10-23 23:57:59 -04:00
040ca33683 Clean up handling of OSD dump command 2019-10-22 12:51:29 -04:00