Instead of exiting and trusting systemd to restart us, instead leverage
the os.execv() call to reload the process in the current PID context.
Also improves the log messages so it's very clear what's going on.
A hot reload isn't possible due to DataWatch and ChildrenWatch
constructs, so we instead need to terminate the daemon to "apply" the
schema update. Thus we use exit code 150 (Application defined in LSB)
and reorder some of the elements of the schema validation to ensure
things happen in the right order.
Adds a new class, ZKSchema, to handle schema management in Zookeeper in
an automated and consistent way. This should solve several issues:
1. Pain in managing changes to ZK keys
2. Pain in handling those changes during live upgrades
3. Simplifying the codebase to remove hardcoded ZK paths
The current master schema for PVC 0.9.19 is committed as version 0.
Addresses #129
Found a rare glitch where the subprocess pipes would not engage, causing
a daemon crash. Catch these exceptions with a retcode of 255 instead of
bailing out.
Closes#124
Libvirt will someones write junk out to console log files, which breaks
the log parser deque with a UnicodeDecodeError.
If this happens, clear the log and re-open the deque again for newer
updates.
Closes#123
Add nicer easy-to-find (yay ASCII art) banners for the startup printouts
of both the node and API daemons. Also adds the safe loader to pvcnoded
to prevent hassle messages and a version string in the API daemon file.
This caused a serious race condition, since the IPs managed by PVC had
not yet come up, but Zookeeper was trying to start and bind to them,
which of course failed.
Remove these dependencies entirely - the daemon itself starts these
services during initialization and they do not need to be started by
systemd first.
Sets in the node daemon, returns via the API, and shows in the CLI,
information about the live VNC listen address and port for VNC-enabled
VMs.
Closes#115
Prevents unnecessarily excessive timeouts if IPMI connections time out;
before, would have to go through 3 timed out commands at ~20s each
before failure was registered; reduced to 1 if the first times out.
If the VM is not in a stop state, failing to free the lock is now
considered a fatal error and will put the domain into fail state,
aborting the start. This is better than being unsafe or trying to start
a VM which will fail to boot due to read-only volumes.
Should correct issues on cold start as well as if a VM crashes
uncleanly, which would prevent the VM from starting due to stale RBD
locks.
This implementation has four parts:
1. Update how IP addresses are handled, specifically by replacing all
previous instances of "vni_ipaddr" with "vni_floatingipaddr", and then
adding the "vni_ipaddr" with the real data for this node's IPs. Also
include the storage IPs in this where they weren't before, so each
this_node actually has the local IPs plus floating IPs. This enables
the next two steps.
2. Modify flush_locks to take this_node as an argument, and update the
run_command function to only operate against this node, rather than on
the primary coordinator.
3. Have the flush_locks check each lock against the current node, to
verify that the lock is actually held by the current node. This is the
only way to do this safely. During fencing, we override this by not
passing a this_node which bypasses this check.
4. Have the VM start do the check for VM failure/startup and execute a
flush_locks before actually starting the VM.
Instead of each node uploading its own OSD stats, which would not work
if the PVC daemon wasn't running, instead have the primary upload stats
for all OSDs in the cluster.
Allow a VM to specify its migration type as a default choice. The valid
options are "default" (i.e. behave as now), "live" which forces a live
migration only, and "shutdown" which forces a shutdown migration only.
The new option is treated as a VM meta option and is set to default if
not found.
Avoids situations where two migrates, to different nodes, happen in
rapid succession. Aborts the migration if the current target node no
longer matches what was set at the start of the execution.
The VM migration code was very old, very spaghettified, and prone to
strange failures.
Improve this by taking cues from the node primary migration. Use
synchronization between the nodes to ensure lockstep completion of the
migration in discrete steps.
A proper queue can be built later to integrate with this code more
cleanly.
References #108
Use the new "provisioned" memory field, instead of the "allocated"
memory field, to determine the optimal node when using the "mem"
migration selector. This will take into account non-running VMs in the
calculation as well as running VMs.
Adds a separate field to the node memory, "provisioned", which totals
the amount of memory provisioned to all VMs on the node, regardless of
state, and in contrast to "allocated" which only counts running VMs.
Allows for the detection of potential overprovisioned states when
factoring in non-running VMs.
Includes the supporting code to get this data, since the original
implementation of VM memory selection was dependent on the VM being
running and getting this from libvirt. Now, if the VM is not active, it
gets this from the domain XML instead.
Prevents any potential leakage due to autoconfigured IPv6 on bridged
interfaces. These are exclusively VM-side bridges, and the PVC host
should not have any IPv6 configuration on them, ever.
Prevents a bug where the thread can crash due to a change in the
d_domain object while running the for loop. By copying and iterating
over the copy, this becomes safer.
The keepalive was getting stuck gathering memoryStats from the
non-running VM, since it was in a paused state. Avoid this by just
skipping past the rest of the stats gathering if the VM isn't running.
Most of these would silently fail if there was e.g. an issue with the ZK
connection. Instead, encase things in try blocks and handle the
exceptions in a more graceful way, returning None or False if
applicable. Except for locks, which should retry 5 times before
aborting.
Using simple print statements was annoying (lack of timing info and
formatting), so move to using the debug logger for these instead with a
custom state ('d') with white text to differentiate them. Also indicate
which subthread of the keepalive each task is being executed in for
easier tracing of issues.
Verify our IPMI state on startup, and then warn if fencing will fail.
For now, this is sufficient, but in future (requires refactoring) we
might want to adjust how fencing occurs based on this information.
Using the Ceph library was a disaster here; it had no timeout or way to
force it to continue, so keepalives would become stuck and trigger fence
storms. Go back to the manual osd dump command with a 2s timeout which
is far more reliable and can be adequately terminated if it runs long.
Prevent the main keepalive thread from getting stuck due to a subthread
taking an enormous time. If this happens, the rest of the main keepalive
will continue onward, thus ensuring that the main keepalive does not
fail for a significant number of cycles, which would cause a fence.
The previous saving throw limit (3/15s) seems to have been too low. I
was observing bizarre failures where a node would be fenced while it was
still starting up. Some of this may have been related to Zookeeper
connections taking too long, but this was inconsistent.
Increase this to 6 saving throws (30s). This provides significantly more
time for a node to properly check in on startup before another node
fences it. In the real world, 15s vs 30s isn't that big of a downtime
change, but prevents false-positive fences.
Provides a CLI and API argument to force live migration, which triggers
a new VM state "migrate-live". The node daemon VMInstance during migrate
will read this flag from the state and, if enforced, will not trigger a
shutdown migration.
Closes#95
Prevents a rare edge case where a node can end up "migrating" to itself.
Quick hack to fix this, though like most of the VM management should
probably be rethought/rewritten later.
Fixes#92
Make sure the stopping of the keepalive timer and final keepalive update
are done as the last step before complete shutdown. The previous setup
could conceivably result in a node being fenced should the cleanup
operations take longer than ~45 seconds, for instance if primary node
switchover took too long or blocked, or log watchers failed to stop
quickly enough. Ensures that keepalives will continue to be run during
the shutdown process until the last possible moment.
Previously, contention could occasionally cause a flap/dual primary
contention state due to the lack of checking within this function. This
could cause a state where a node transitions to primary than is almost
immediately shifted away, which could cause undefined behaviour in the
cluster.
The solution includes several elements:
* Implement an exclusive lock operation in zkhandler
* Switch the become_primary function to use this exclusive lock
* Implement exclusive locking during the contention process
* As a failsafe, check stat versions before setting the node as the
primary node, in case another node already has
* Delay the start of takeover/relinquish operations by slightly
longer than the lock timeout
* Make the current router_state conditions more explicit (positive
conditionals rather than negative conditionals)
The new scenario ensures that during contention, only one secondary will
ever succeed at acquiring the lock. Ideally, the other would then grab
the lock and pass, but in testing this does not seem to be the case -
the lock always times out, so the failsafe check is technically not
needed but has been left as an added safety mechanism. With this setup,
the node that fails the contention will never block the switchover nor
will it try to force itself onto the cluster after another node has
successfully won contention.
Timeouts may need to be adjusted in the future, but the base timeout of
0.4 seconds (and transition delay of 0.5 seconds) seems to work reliably
during preliminary tests.
This may or may not help, but should in theory prevent the flush from
trying to run after a (locally-running) API daemon is terminated, which
could cause an API failure and a failure to flush.
This will stop systemd from killing the service in the middle of a flush
or unflush operation, which completely defeats the purpose. 30 minutes
was chosen as this is a very large but still somewhat manageable value,
which should cover even a very large very loaded cluster with room to
spare.
Most of these actions/conditionals were looking for primary state, but
were failing during node takeover. Update the conditionals to look for
both router states instead.
Also add a wait to lock flushing until a takeover is completed.
Use a pair of transitional states, "takeover" and "relinquish", when
transitioning between primary and secondary coordinator states. This
provides a clsuter-wide record that the nodes are still working during
their synchronous transition states, and should allow clients to
determine when the node(s) have fully switched over. Also add an
additional 2 seconds of wait at the end of the transition jobs to ensure
everything has had a chance to start before proceeding.
References #72
Rename "pvcd" to "pvcnoded", and "pvc-api" to "pvcapid" so names for the
daemons are fully consistent. Update the names of the configuration
files as well to match this new formatting.
References #79
Modifies the storage and upstream networks to mirror the cluster
network, with a bridge on top of the underlying specified dev, and all
IPs bound to the bridge.
Allows creating VMs in the storage or upstream networks, as well as the
cluster network, should the administrator choose to do so (manually).
Implements a "maintenance mode" for PVC clusters. For now, the only
thing this mode does is disable node fencing while the state is true.
This allows the administrator to tell PVC that network connectivity,
etc. might be interrupted and to avoid fencing nodes.
Closes#70
Required due to #64. Bridged networks were being created on top of a
vLAN if the Cluster network was a vLAN device, rather than being created
on the underlying device. This came from a previous revision of the
cluster architecture guidelines where Cluster was supposed to be a raw
device rather than a vLAN. This fixed the problem by implementing a
configuration field for a "bridge_device", a NIC device that can then
have the bridged vLANs created on top of it.
Fixes#64
Prevents blocking the main thread(s) while a VM is changing state. In
particular, this caused some issues with nodes not responding to
cancellation/reversal of a flush/ready state until the previous
migration was finished, which could cause issues. This entire subset of
actions is now threaded and so can run on its own in the background.
This particular arping interval/count, along with forcing it to run in
the foreground, seems to minimize the packet loss when the primary
coordinator transitions. Through extensive testing, this value results
in the, consistently, least amount of loss: 1-2 pings, at an 0.025s ping
interval, return "TTL exceeded", with no other loss, and only when the
node the test VM is on is the one switching to secondary state. No other
combination of values here, nor tweaks to other parts of the code, seem
able to reduce this further, therefore this is likely the best
configuration possible.
The previous method was a "throw it in the sea"-type migration with some
(very arbitrary) sleep statements thrown in for good measure.
Reimplement this with some hard locking. During each phase of the
transition, the nodes acquire read/write shared locks to a Zookeeper key
so that they can tightly coordinate the actions of transferring each
part of the primary state between them. This is done in a subthread to
prevent strange blocking issues that were encountered, likely due to
business in the existing main thread.
Implements the storing of three VM metadata attributes:
1. Node limits - allows specifying a list of hosts on which the VM must
run. This limit influences the migration behaviour of VMs.
2. Per-VM node selectors - allows each VM to have its migration
autoselection method specified, to automatically allow different methods
per VM based on the administrator's preferences.
3. VM autorestart - allows a VM to be automatically restarted from a
stopped state, presumably due to a failure to find a target node (either
due to limits or otherwise) during a flush/fence recovery, on the next
node unflush/ready state of its home hypervisor. Useful mostly in
conjunction with limits to ensure that VMs which were shut down due to
there being no valid migration targets are started back up when their
node becomes ready again.
Includes the full client interaction with these metadata options,
including printing, as well as defining a new function to modify this
metadata. For the CLI it is set/modified either on `vm define` or via the
`vm meta` command. For the API it is set/modified either on a POST to
the `/vm` endpoint (during VM definition) or on POST to the `/vm/<vm>`
endpoint. For the API this replaces the previous reserved word for VM
creation from scratch as this will no longer be implemented in-daemon
(see #22).
Closes#52
Adds some logic to allow an active shutdown state to be aborted by
changing the VM to another state. Useful mostly if a VM is doing funky
things and not responding to the shutdown, but the administrator either
doesn't want to wait for the timer to expire (forcing an immediate
termination) or wishes to abort the shutdown attempt.
Fixes#49
listen-address is enough; adding interface too causes weird issues where
dnsmasq is listening on an IPv6 global wildcard too which conflicts with
the PowerDNS instance.
Includes a simple implementation of a zookeeper "rename" facility,
allowing a key and all data to be replaced by a new key with a different
name but containing all the same child elements and data.
[2/2] Implements #44
Store the flush_thread of a node as a class object. Before starting a
new flush thread (either flush or unflush), stop the existing one if it
exists to prevent further migrations, then start the new thread. Set the
object to None on init and again once the task actually finishes. Remove
the inflush flag as this is not required when using these threads and
functionally does nothing any longer, but add the flush_stopper flag to
trigger cancellation of the current job.
This just seemed like more trouble that it was worth. Flush locks were
originally intended as a way to counteract the weird issues around
flushing that were mostly fixed by the code refactoring, so this will
help test if those issues are truly gone. If not, will look into a
cleaner solution that doesn't result in unchangeable states.
Without this, the IPMI information set during initial node creation can
never be changed, which can cause issues later. Instead, always set it
fresh on each node boot.
Similar to recent client changes, don't replace the previous node record
of an already-migrated VM. Wait for shutdown if required. Use a
continue statement instead of a needless else block.
Adds a config flag that turns on the API client following the Primary
coordinator. The retcode of the start/stop commands is ignore so this
can fail gracefully if e.g. the client isn't installed.
This was very old code that was hard to follow and quite fragile, with
failures and infinite loops occurring fairly frequently. These changes
make the code more robust, including the addition of timeouts, some code
cleanup, and some improvements to the logical flow.
Also forces the libvirt migration to occur on the cluster network, which
couples to changes in the libvirtd listen (via pvc-ansible) and in
Daemon.py via the previous commit.
Reference: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=717215#68
Without this, DHCP fails when traversing only the local bridge, for
Debian Jessie or earlier (and possibly other OSes as well), due to the
missing UDP checksums. This disables the offload and hence reenables
the checksums even on the software-only bridge.
Also rearranged the steps and added comments arround this section to
better clarify what each command is doing.
There was really no need for this to be shared among all the
coordinators, which seemed more fragile. This way only the primary will
try to fence dead nodes.
This seems like a super-gross way to do this, but at the moment
I don't have a better way. Maybe just remove this component since
none of the volume/snapshot stuff is dynamic; will see as this
progresses.
The old way of doing this was a little cumbersome, with an upper YAML
tree split between "devices" (name and MTU) and addresses. This commit
unifies these under the root "networking" section to make this section
clearer.
MTUs were hardcoded at 9000, which breaks if the underlying interface
or network switch does not support jumbo frames, a possible deployment
limitation. This has non-obvious consequences due to MTU mismatches
for certain services (Ceph, Zookeeper, etc.).
This commit adds support for configurable MTUs for each interface,
set in pvcd.yaml. The example has been updated to reflect this, with
a default of 1500 (the Ethernet standard).
This commit also adds autoconfiguration of the VNI device MTU based
on the `vni_mtu` value, the same for bridge networks and minus 50
(rather than 200 from the hardcoded value, based on the following
resource [1]) for VXLAN networks.
[1] http://ipengineer.net/2014/06/vxlan-mtu-vs-ip-mtu-consideration/
Use RemainAfterExit to avoid pvc-flush from auto-stopping immediately.
Use PartOf to tie services to the target itself.
Use --wait on flush to avoid daemon stopping before flush is complete.
Add a systemd service to manage node flush/unflush, useful during
system startup and shutdown to avoid requiring administrator
intervention for this to occur. This is optional and the service is
not enabled by default, and the postinst script informs the
administrator of this.
Also adds a systemd target to collect the two service units together
and provide an easy way to flush+shutdown or startup+unflush the
entire PVC system.
Closes#28
1 second was just slightly too little time to wait and packets would
occasionally be lost on primary switchover. Increase this to 2
seconds to provide more time for arping to run on the new primary.
1. Remove a number of time.sleep commands which don't really seem
necessary any longer and which significantly increased the startup
time while parsing the VM list.
2. Handle some variable sets during initialization of the object,
rather than waiting for a management command, enabling...
3. Know when a state change, and the corresponding Libvirt lookup,
is unnecessary due to the target node not matching the current node.
This also removes a number of unremovable errors from Libvirt on the
console which were annoying.
This reduces the total time taken by the VM startup segment (lines
760-762 of Daemon.py) from 17.117s down to 0.976s for 82 VMs.
MariaDB+Galera was terribly unstable, with the cluster failing to
start or dying randomly, and generally seemed incredibly unsuitable
for an HA solution. This commit switches the DNS aggregator SQL
backend to PostgreSQL, implemented via Patroni HA.
It also manages the Patroni state, forcing the primary instance to
follow the PVC coordinator, such that the active DNS Aggregator
instance is always able to communicate read+write with the local
system.
This required some logic changes to how the DNS Aggregator worked,
specifically ensuring that database changes aren't attempted while
the instance isn't actively running - to be honest this was a bug
anyways that had just never been noticed.
Closes#34
Makes an unflush a controlled event like flushing, rather than a
free-for-all. This does slow down unflushing somewhat (disallowing
parallelism from multiple hosts to the current host), but allows
the locking to actually be effective.