Make sure the stopping of the keepalive timer and final keepalive update
are done as the last step before complete shutdown. The previous setup
could conceivably result in a node being fenced should the cleanup
operations take longer than ~45 seconds, for instance if primary node
switchover took too long or blocked, or log watchers failed to stop
quickly enough. Ensures that keepalives will continue to be run during
the shutdown process until the last possible moment.
Previously, contention could occasionally cause a flap/dual primary
contention state due to the lack of checking within this function. This
could cause a state where a node transitions to primary than is almost
immediately shifted away, which could cause undefined behaviour in the
cluster.
The solution includes several elements:
* Implement an exclusive lock operation in zkhandler
* Switch the become_primary function to use this exclusive lock
* Implement exclusive locking during the contention process
* As a failsafe, check stat versions before setting the node as the
primary node, in case another node already has
* Delay the start of takeover/relinquish operations by slightly
longer than the lock timeout
* Make the current router_state conditions more explicit (positive
conditionals rather than negative conditionals)
The new scenario ensures that during contention, only one secondary will
ever succeed at acquiring the lock. Ideally, the other would then grab
the lock and pass, but in testing this does not seem to be the case -
the lock always times out, so the failsafe check is technically not
needed but has been left as an added safety mechanism. With this setup,
the node that fails the contention will never block the switchover nor
will it try to force itself onto the cluster after another node has
successfully won contention.
Timeouts may need to be adjusted in the future, but the base timeout of
0.4 seconds (and transition delay of 0.5 seconds) seems to work reliably
during preliminary tests.
This may or may not help, but should in theory prevent the flush from
trying to run after a (locally-running) API daemon is terminated, which
could cause an API failure and a failure to flush.
This will stop systemd from killing the service in the middle of a flush
or unflush operation, which completely defeats the purpose. 30 minutes
was chosen as this is a very large but still somewhat manageable value,
which should cover even a very large very loaded cluster with room to
spare.
Most of these actions/conditionals were looking for primary state, but
were failing during node takeover. Update the conditionals to look for
both router states instead.
Also add a wait to lock flushing until a takeover is completed.
Use a pair of transitional states, "takeover" and "relinquish", when
transitioning between primary and secondary coordinator states. This
provides a clsuter-wide record that the nodes are still working during
their synchronous transition states, and should allow clients to
determine when the node(s) have fully switched over. Also add an
additional 2 seconds of wait at the end of the transition jobs to ensure
everything has had a chance to start before proceeding.
References #72
Rename "pvcd" to "pvcnoded", and "pvc-api" to "pvcapid" so names for the
daemons are fully consistent. Update the names of the configuration
files as well to match this new formatting.
References #79
Modifies the storage and upstream networks to mirror the cluster
network, with a bridge on top of the underlying specified dev, and all
IPs bound to the bridge.
Allows creating VMs in the storage or upstream networks, as well as the
cluster network, should the administrator choose to do so (manually).
Implements a "maintenance mode" for PVC clusters. For now, the only
thing this mode does is disable node fencing while the state is true.
This allows the administrator to tell PVC that network connectivity,
etc. might be interrupted and to avoid fencing nodes.
Closes#70
Required due to #64. Bridged networks were being created on top of a
vLAN if the Cluster network was a vLAN device, rather than being created
on the underlying device. This came from a previous revision of the
cluster architecture guidelines where Cluster was supposed to be a raw
device rather than a vLAN. This fixed the problem by implementing a
configuration field for a "bridge_device", a NIC device that can then
have the bridged vLANs created on top of it.
Fixes#64
Prevents blocking the main thread(s) while a VM is changing state. In
particular, this caused some issues with nodes not responding to
cancellation/reversal of a flush/ready state until the previous
migration was finished, which could cause issues. This entire subset of
actions is now threaded and so can run on its own in the background.
This particular arping interval/count, along with forcing it to run in
the foreground, seems to minimize the packet loss when the primary
coordinator transitions. Through extensive testing, this value results
in the, consistently, least amount of loss: 1-2 pings, at an 0.025s ping
interval, return "TTL exceeded", with no other loss, and only when the
node the test VM is on is the one switching to secondary state. No other
combination of values here, nor tweaks to other parts of the code, seem
able to reduce this further, therefore this is likely the best
configuration possible.
The previous method was a "throw it in the sea"-type migration with some
(very arbitrary) sleep statements thrown in for good measure.
Reimplement this with some hard locking. During each phase of the
transition, the nodes acquire read/write shared locks to a Zookeeper key
so that they can tightly coordinate the actions of transferring each
part of the primary state between them. This is done in a subthread to
prevent strange blocking issues that were encountered, likely due to
business in the existing main thread.
Implements the storing of three VM metadata attributes:
1. Node limits - allows specifying a list of hosts on which the VM must
run. This limit influences the migration behaviour of VMs.
2. Per-VM node selectors - allows each VM to have its migration
autoselection method specified, to automatically allow different methods
per VM based on the administrator's preferences.
3. VM autorestart - allows a VM to be automatically restarted from a
stopped state, presumably due to a failure to find a target node (either
due to limits or otherwise) during a flush/fence recovery, on the next
node unflush/ready state of its home hypervisor. Useful mostly in
conjunction with limits to ensure that VMs which were shut down due to
there being no valid migration targets are started back up when their
node becomes ready again.
Includes the full client interaction with these metadata options,
including printing, as well as defining a new function to modify this
metadata. For the CLI it is set/modified either on `vm define` or via the
`vm meta` command. For the API it is set/modified either on a POST to
the `/vm` endpoint (during VM definition) or on POST to the `/vm/<vm>`
endpoint. For the API this replaces the previous reserved word for VM
creation from scratch as this will no longer be implemented in-daemon
(see #22).
Closes#52
Adds some logic to allow an active shutdown state to be aborted by
changing the VM to another state. Useful mostly if a VM is doing funky
things and not responding to the shutdown, but the administrator either
doesn't want to wait for the timer to expire (forcing an immediate
termination) or wishes to abort the shutdown attempt.
Fixes#49
listen-address is enough; adding interface too causes weird issues where
dnsmasq is listening on an IPv6 global wildcard too which conflicts with
the PowerDNS instance.
Includes a simple implementation of a zookeeper "rename" facility,
allowing a key and all data to be replaced by a new key with a different
name but containing all the same child elements and data.
[2/2] Implements #44
Store the flush_thread of a node as a class object. Before starting a
new flush thread (either flush or unflush), stop the existing one if it
exists to prevent further migrations, then start the new thread. Set the
object to None on init and again once the task actually finishes. Remove
the inflush flag as this is not required when using these threads and
functionally does nothing any longer, but add the flush_stopper flag to
trigger cancellation of the current job.
This just seemed like more trouble that it was worth. Flush locks were
originally intended as a way to counteract the weird issues around
flushing that were mostly fixed by the code refactoring, so this will
help test if those issues are truly gone. If not, will look into a
cleaner solution that doesn't result in unchangeable states.
Without this, the IPMI information set during initial node creation can
never be changed, which can cause issues later. Instead, always set it
fresh on each node boot.
Similar to recent client changes, don't replace the previous node record
of an already-migrated VM. Wait for shutdown if required. Use a
continue statement instead of a needless else block.
Adds a config flag that turns on the API client following the Primary
coordinator. The retcode of the start/stop commands is ignore so this
can fail gracefully if e.g. the client isn't installed.
This was very old code that was hard to follow and quite fragile, with
failures and infinite loops occurring fairly frequently. These changes
make the code more robust, including the addition of timeouts, some code
cleanup, and some improvements to the logical flow.
Also forces the libvirt migration to occur on the cluster network, which
couples to changes in the libvirtd listen (via pvc-ansible) and in
Daemon.py via the previous commit.
Reference: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=717215#68
Without this, DHCP fails when traversing only the local bridge, for
Debian Jessie or earlier (and possibly other OSes as well), due to the
missing UDP checksums. This disables the offload and hence reenables
the checksums even on the software-only bridge.
Also rearranged the steps and added comments arround this section to
better clarify what each command is doing.
There was really no need for this to be shared among all the
coordinators, which seemed more fragile. This way only the primary will
try to fence dead nodes.
This seems like a super-gross way to do this, but at the moment
I don't have a better way. Maybe just remove this component since
none of the volume/snapshot stuff is dynamic; will see as this
progresses.
The old way of doing this was a little cumbersome, with an upper YAML
tree split between "devices" (name and MTU) and addresses. This commit
unifies these under the root "networking" section to make this section
clearer.
MTUs were hardcoded at 9000, which breaks if the underlying interface
or network switch does not support jumbo frames, a possible deployment
limitation. This has non-obvious consequences due to MTU mismatches
for certain services (Ceph, Zookeeper, etc.).
This commit adds support for configurable MTUs for each interface,
set in pvcd.yaml. The example has been updated to reflect this, with
a default of 1500 (the Ethernet standard).
This commit also adds autoconfiguration of the VNI device MTU based
on the `vni_mtu` value, the same for bridge networks and minus 50
(rather than 200 from the hardcoded value, based on the following
resource [1]) for VXLAN networks.
[1] http://ipengineer.net/2014/06/vxlan-mtu-vs-ip-mtu-consideration/
Use RemainAfterExit to avoid pvc-flush from auto-stopping immediately.
Use PartOf to tie services to the target itself.
Use --wait on flush to avoid daemon stopping before flush is complete.
Add a systemd service to manage node flush/unflush, useful during
system startup and shutdown to avoid requiring administrator
intervention for this to occur. This is optional and the service is
not enabled by default, and the postinst script informs the
administrator of this.
Also adds a systemd target to collect the two service units together
and provide an easy way to flush+shutdown or startup+unflush the
entire PVC system.
Closes#28
1 second was just slightly too little time to wait and packets would
occasionally be lost on primary switchover. Increase this to 2
seconds to provide more time for arping to run on the new primary.
1. Remove a number of time.sleep commands which don't really seem
necessary any longer and which significantly increased the startup
time while parsing the VM list.
2. Handle some variable sets during initialization of the object,
rather than waiting for a management command, enabling...
3. Know when a state change, and the corresponding Libvirt lookup,
is unnecessary due to the target node not matching the current node.
This also removes a number of unremovable errors from Libvirt on the
console which were annoying.
This reduces the total time taken by the VM startup segment (lines
760-762 of Daemon.py) from 17.117s down to 0.976s for 82 VMs.
MariaDB+Galera was terribly unstable, with the cluster failing to
start or dying randomly, and generally seemed incredibly unsuitable
for an HA solution. This commit switches the DNS aggregator SQL
backend to PostgreSQL, implemented via Patroni HA.
It also manages the Patroni state, forcing the primary instance to
follow the PVC coordinator, such that the active DNS Aggregator
instance is always able to communicate read+write with the local
system.
This required some logic changes to how the DNS Aggregator worked,
specifically ensuring that database changes aren't attempted while
the instance isn't actively running - to be honest this was a bug
anyways that had just never been noticed.
Closes#34
Makes an unflush a controlled event like flushing, rather than a
free-for-all. This does slow down unflushing somewhat (disallowing
parallelism from multiple hosts to the current host), but allows
the locking to actually be effective.
Implements a locking mechanism to prevent clobbering of node
flushes. When a flush begins, a global cluster lock is placed
which is freed once the flush completes. While the lock is in place,
other flush events queue waiting for the lock to free before
proceeding.
Modifies the CLI output flow when the `--wait` option is specified.
First, if a lock exists when running the command, the message is
tweaked to indicate this, and the client will wait first for the
lock to free, and then for the flush as normal. Second, the wait
depends on the active lock rather than the domain_status for
consistency purposes.
Closes#32
Implements the ability for a client to watch almost-live domain
console logs from the hypervisors. It does this using a deque-based
"tail -f" mechanism (with a configurable buffer per-VM) that watches
the domain console logfile in the (configurable) directory every
half-second. It then stores the current buffer in Zookeeper when
changed, where a client can then request it, either as a static piece
of text in the `less` pager, or via a similar "tail -f" functionality
implemented using fixed line splitting and comparison to provide a
generally-seamless output.
Enabling this feature requires each guest VM to implement a Libvirt
serial log and write its (text) console to it, for example using the
default logging directory:
```
<serial type='pty'>
<log file='/var/log/libvirt/vmname.log' append='off'/>
<serial>
```
The append mode can be either on or off; on grows files unbounded,
off causes the log (and hence the PVC log data) to be truncated on
initial VM startup from offline. The administrator must choose how
they best want to handle this until Libvirt implements their own
clog-type logging format.
1. Move to a YAML-based configuration format instead of the original
INI-based configuration to facilitate better organization and
readability.
2. Modify the daemon to be able to operate in several modes based
on configuration flags. Either networking or storage functions
can be disabled using the configuration, allowing the PVC system
to be used only for hypervisor management if required.
Trying to directly AXFR from dnsmasq is a mess, since their zone is
barely compliant with spec, it doesn't support notifies, and it is
generally really messy.
This implements an advanced "AXFR parser" system, which looks at the
results of an AXFR from the local dnsmasq instances per-network, and
updates the real replicated MariaDB pdns backend cluster with the
changed data. This allows a sensible, transferable zone with its own
SOA that is dynamically reconfigured as hosts come and go from the
dnsmasq zone.
Completely restructure the daemon code to move the 4 discrete daemons
into a single daemon that can be run on every hypervisor. Introduce the
idea of a static list of "coordinator" nodes which are configured at
install time to run Zookeeper and FRR in router mode, and which are
allowed to take on client network management duties (gateway, DHCP, DNS,
etc.) while also allowing them to run VMs (i.e. no dedicated "router"
nodes required).