Previously, if the node failed to restart, it was declared a "bad fence"
and no further action would be taken. However, there are some
situations, for instance critical hardware failures, where intelligent
systems will not attempt (or succeed at) starting up the node in such a
case, which would result in dead, known-offline nodes without recovery.
Tweak this behaviour somewhat. The main path of Reboot -> Check On ->
Success + fence-flush is retained, but some additional side-paths are
now defined:
1. We attempt to power "on" the chassis 1 second after the reboot, just
in case it is off and can be recovered. We then wait another 2 seconds
and check the power status (as we did before).
2. If the reboot succeeded, follow this series of choices:
a. If the chassis is on, the fence succeeded.
b. If the chassis is off, the fence "succeeded" as well.
c. If the chassis is in some other state, the fence failed.
3. If the reboot failed, follow this series of choices:
a. If the chassis is off, the fence itself failed, but we can treat
it as "succeeded"" since the chassis is in a known-offline state.
This is the most likely situation when there is a critical hardware
failure, and the server's IPMI does not allow itself to start back
up again.
b. If the chassis is in any other state ("on" or unknown), the fence
itself failed and we must treat this as a fence failure.
Overall, this should alleviate the aforementioned issue of a critical
failure rendering the node persistently "off" not triggering a
fence-flush and ensure fencing is more robust.
This reverts commit 65d14ccd92.
This was actually a bad idea. For inexplicable reasons, running these
Ceph commands manually (not even via Python, but in a normal shell)
takes 7 * two orders of magnitude longer than running them with the
Rados module, so long in fact that some basic commands like "ceph
health" would sometimes take longer than the 1 second timeout to
complete. The Rados commands would however take about 1ms instead.
Despite the occasional issues when monitors drop out, the Rados module
is clearly far superior to the shell commands for any moderately-loaded
Ceph cluster. We can look into solving timeouts another way (perhaps
with Processes instead of Threads) at a later time.
Rados module "ceph health":
b'{"checks":{},"status":"HEALTH_OK"}'
0.001204 (s)
b'{"checks":{},"status":"HEALTH_OK"}'
0.001258 (s)
Command "ceph health":
joshua@hv1.c.bonilan.net ~ $ time ceph health >/dev/null
real 0m0.772s
user 0m0.707s
sys 0m0.046s
joshua@hv1.c.bonilan.net ~ $ time ceph health >/dev/null
real 0m0.796s
user 0m0.728s
sys 0m0.054s
Using the Rados module was very problematic, specifically because it had
no sensible timeout parameters and thus would hang for many seconds.
This has poor implications since it blocks further keepalives.
Instead, remove the Rados usage entirely and go back completely to using
manual OS commands to gather this information. While this may cause PID
exhaustion more quickly it's worthwhile to avoid failure scenarios when
Ceph stats time out.
Closes#137
Not sure how this didn't cause an issue until now, but the wrong key
path was used and this was getting unexpected data with the newly-added
version string instead of the proper mode string.
When doing a stop_vm or terminate_vm, check again after 0.2 seconds
and try re-terminating if it's still running. Covers cases where a VM
doesn't stop if given the 'stop' state.
Trying to do this on the VMInstance side had problems because we can't
differentiate the 3 types of migration there. So, just update this in
the API side and hope everything goes well.
This introduces an edge bug: if a VM is using a macvtap SR-IOV device,
and then tries to migrate, and the migrate is aborted, the NIC lists
will be inconsistent.
When I revamp the VMInstance in the future, I should be able to correct
this, but for now we'll have to live with that edgecase.
Ensures that the configuration of a VF is not overwritten in Zookeeper
on a node restart. The SRIOVVFInstance handlers were modified to start
with None values, so that the DataWatch statements will always trigger
updates to the live system interfaces on daemon startup, thus ensuring
that the config stored in Zookeeper is applied to the system on startup
(mostly relevant after a cold boot or if the API changes them during a
daemon restart).
Adds support for the node daemon managing SR-IOV PF and VF instances.
PFs are added to Zookeeper automatically based on the config at startup
during network configuration, and are otherwise completely static. PFs
are automatically removed from Zookeeper, along with all coresponding
VFs, should the PF phy device be removed from the configuration.
VFs are configured based on the (autocreated) VFs of each PF device,
added to Zookeeper, and then a new class instance, SRIOVVFInstance, is
used to watch them for configuration changes. This will enable the
runtime management of VF settings by the API. The set of keys ensures
that both configuration and details of the NIC can be tracked.
Most keys are self-explanatory, especially for PFs and the basic keys
for VFs. The configuration tree is also self-explanatory, being based
entirely on the options available in the `ip link set {dev} vf` command.
Two additional keys are also present: `used` and `used_by`, which will
be able to track the (boolean) state of usage, as well as the VM that
uses a given VIF. Since the VM side implementation will support both
macvtap and direct "hostdev" assignments, this will ensure that this
state can be tracked on both the VF and the VM side.
Adds configuration values for enabled flag and SR-IOV devices to the
configuration and sets up the initial SR-IOV configuration on daemon
startup (inserting the module, configuring the VF count, etc.).
Instead of exiting and trusting systemd to restart us, instead leverage
the os.execv() call to reload the process in the current PID context.
Also improves the log messages so it's very clear what's going on.