Commit Graph

230 Commits

Author SHA1 Message Date
Joshua Boniface b17b7bf22b Add black magic to minimize ping losses
This particular arping interval/count, along with forcing it to run in
the foreground, seems to minimize the packet loss when the primary
coordinator transitions. Through extensive testing, this value results
in the, consistently, least amount of loss: 1-2 pings, at an 0.025s ping
interval, return "TTL exceeded", with no other loss, and only when the
node the test VM is on is the one switching to secondary state. No other
combination of values here, nor tweaks to other parts of the code, seem
able to reduce this further, therefore this is likely the best
configuration possible.
2019-12-19 18:57:32 -05:00
Joshua Boniface 8c252aeecc Implemented coordinated locked node transitions
The previous method was a "throw it in the sea"-type migration with some
(very arbitrary) sleep statements thrown in for good measure.
Reimplement this with some hard locking. During each phase of the
transition, the nodes acquire read/write shared locks to a Zookeeper key
so that they can tightly coordinate the actions of transferring each
part of the primary state between them. This is done in a subthread to
prevent strange blocking issues that were encountered, likely due to
business in the existing main thread.
2019-12-19 10:56:34 -05:00
Joshua Boniface 0841ddf8b0 Handle integrity errors in DNS aggregator 2019-12-19 10:45:06 -05:00
Joshua Boniface 98764f1edd Clean up some aspects of node switchover 2019-12-18 21:39:40 -05:00
Joshua Boniface 23188199cb Handle failing Patroni events more gracefully 2019-12-18 21:12:22 -05:00
Joshua Boniface 2b1b78622e Fix invalid arping option
It made little difference and didn't error, but was incorrect.
2019-12-18 12:06:40 -05:00
Joshua Boniface 364ab10673 Add slight delay when stopping the metadata API 2019-12-18 11:56:04 -05:00
Joshua Boniface 39c9f911cc Increase arping interval to 0.2s 2019-12-15 14:55:34 -05:00
Joshua Boniface 686af31c08 Reduce arping interval to 0.1s 2019-12-15 12:30:45 -05:00
Joshua Boniface 0a94fac407 Fix bugs around passing master
Was not passing properly and getting stuck sometimes, so modify the
checking and route creation a bit to prevent it. Seems to work.
2019-12-15 00:08:18 -05:00
Joshua Boniface b3e21a5bf8 Integrate metadata API into node daemon 2019-12-14 16:41:01 -05:00
Joshua Boniface 8c36e7618a Modify node daemon to follow API 2019-12-14 14:13:26 -05:00
Joshua Boniface 78f053d81f Recreate network in aggregator if DNS changes 2019-12-13 00:03:47 -05:00
Joshua Boniface 0a8dd30a48 Restart dnsmasq when network details change 2019-12-12 23:51:22 -05:00
Joshua Boniface 6fa828e721 Don't stop the provisioner worker
It should probably just be running on all nodes all the time already,
but is started when a node first becomes primary.
2019-12-12 23:08:02 -05:00
Joshua Boniface c1b6ce0ff7 Reorder starting clients 2019-12-12 23:03:34 -05:00
Joshua Boniface b854d53fab Add API management to node daemon 2019-12-12 22:59:07 -05:00
Joshua Boniface 88a181b20d Allow metadata API in nft rules 2019-12-11 17:04:29 -05:00
Joshua Boniface 1fb560e996 Add DNS nameservers to networks 2019-12-08 23:55:45 -05:00
Joshua Boniface 9cb5561e77 Move default NS record to upstream_domain 2019-12-08 23:05:32 -05:00
Joshua Boniface 3471f4e57a Remove obsolete pvc-nsX and add pvc-ns name
Should point towards the floating IP.
2019-12-08 20:20:20 -05:00
Joshua Boniface 356c12db2e Add ceph df output to pool data
Allows additional information visible in the `ceph df` command,
including pool free space and used percentage.
2019-12-06 00:47:27 -05:00
Joshua Boniface 531578fd28 Use consistent tense for VM states
Replace "failed" with "fail" and "disabled" with "disable" for
consistency with the remaining states.
2019-10-23 23:57:59 -04:00
Joshua Boniface 040ca33683 Clean up handling of OSD dump command 2019-10-22 12:51:29 -04:00
Joshua Boniface 190623bdd9 Use empty string for node limit 2019-10-22 12:32:14 -04:00
Joshua Boniface f0e0a38a20 Fix bug in config element retrieval 2019-10-22 12:30:23 -04:00
Joshua Boniface 237a37015d Set upstream IP in key if changed 2019-10-21 16:50:41 -04:00
Joshua Boniface 10ae260b92 Properly handle empty node limit 2019-10-17 13:34:11 -04:00
Joshua Boniface 03447d3374 Update copyright string year to include 2019 2019-10-13 12:09:51 -04:00
Joshua Boniface 116013695f Fix bugs with bad strings 2019-10-12 18:43:29 -04:00
Joshua Boniface 18fc49fc6c Use node instead of hypervisor consistently 2019-10-12 01:59:08 -04:00
Joshua Boniface 8dc0c8f0ac Fix minor bugs 2019-10-12 01:36:50 -04:00
Joshua Boniface 5995353597 Implement VM metadata and use it
Implements the storing of three VM metadata attributes:
1. Node limits - allows specifying a list of hosts on which the VM must
run. This limit influences the migration behaviour of VMs.
2. Per-VM node selectors - allows each VM to have its migration
autoselection method specified, to automatically allow different methods
per VM based on the administrator's preferences.
3. VM autorestart - allows a VM to be automatically restarted from a
stopped state, presumably due to a failure to find a target node (either
due to limits or otherwise) during a flush/fence recovery, on the next
node unflush/ready state of its home hypervisor. Useful mostly in
conjunction with limits to ensure that VMs which were shut down due to
there being no valid migration targets are started back up when their
node becomes ready again.

Includes the full client interaction with these metadata options,
including printing, as well as defining a new function to modify this
metadata. For the CLI it is set/modified either on `vm define` or via the
`vm meta` command. For the API it is set/modified either on a POST to
the `/vm` endpoint (during VM definition) or on POST to the `/vm/<vm>`
endpoint. For the API this replaces the previous reserved word for VM
creation from scratch as this will no longer be implemented in-daemon
(see #22).

Closes #52
2019-10-12 01:17:39 -04:00
Joshua Boniface 76e6b42389 Add clone_volume backend command 2019-10-10 14:09:07 -04:00
Joshua Boniface 983daceaed Fix shutdown abort during restart
Restart state, being different from shutdown, would trigger an abort of
the shutdown. Fix this by including restart in the valid states to
continue.
2019-09-07 12:08:31 -04:00
Joshua Boniface 7c4d18691a Implement configurable replcfg (node-side)
Implements administrator-selectable replication configurations for new
pools in PVC clusters, overriding the default of copies=3,mincopies=2.
2019-08-23 21:58:54 -04:00
Joshua Boniface 267a3d16e5 Bump version to 0.5 2019-08-08 20:56:27 -04:00
Joshua Boniface 2880a761c0 Move Ceph command pipe to new location
Matching the new /cmd/domain pipe, move Ceph pipe to /cmd/ceph.
2019-08-07 14:47:27 -04:00
Joshua Boniface b7546e3711 Fix bugs in command pipeline for VMs 2019-08-07 14:13:01 -04:00
Joshua Boniface 0ff2d7d537 Use shlex for command splitting
This will preserve quoted strings, required for the rbd lock commands.
2019-08-07 14:02:57 -04:00
Joshua Boniface a2a630f6a0 Add pipeline for VM lock flush cmd 2019-08-07 13:49:33 -04:00
Joshua Boniface 496216321e Move lock flushing to VMInstance
Prepares for reuse of this function via client commands.
2019-08-07 13:36:56 -04:00
Joshua Boniface 0446b2db02 Catch exceptions if Patroni is not up 2019-08-07 11:46:58 -04:00
Joshua Boniface 7e77752ce5 Add limit to Patroni switchover attempts 2019-08-07 11:46:42 -04:00
Joshua Boniface 33a963c2af Improve fence output on failure and increase delay 2019-08-07 11:35:49 -04:00
Joshua Boniface e92a57606d Use better forceful arping command
Send ARP responses with the source IP in it to force update even if the
old primary did not cleanly terminate (during fencing for instance).
2019-08-07 11:29:38 -04:00
Joshua Boniface ef3b6b3723 Arping 3 times instead of 2
During fence 2 is not always enough for the network to recognize the
change in primary coordinator.
2019-08-07 11:15:36 -04:00
Joshua Boniface 3b27a88128 Allow abort of shutdown state
Adds some logic to allow an active shutdown state to be aborted by
changing the VM to another state. Useful mostly if a VM is doing funky
things and not responding to the shutdown, but the administrator either
doesn't want to wait for the timer to expire (forcing an immediate
termination) or wishes to abort the shutdown attempt.

Fixes #49
2019-08-07 10:58:18 -04:00
Joshua Boniface e2ae58b62c Add the missing newline to the string compare 2019-08-04 17:00:33 -04:00
Joshua Boniface d0d5ab4425 Fix bug if the switchover target is the same 2019-08-04 16:51:11 -04:00