Before, "-y"/"--yes" only confirmed the reboot portion. Instead, modify
this to confirm both the diff portion and the restart portion, and add
separate flags to bypass one or the other independently, ensuring the
administrator has lots of flexibility. UNSAFE mode implies "-y" so both
would be auto-confirmed if that option is set.
We need to do a bit more finagling with the logger on termination to
ensure that all messages are written and the queue drained before
actually terminating.
Adds the ability to send node daemon logs to Zookeeper to facilitate a
command like "pvc node log", similar to "pvc vm log". Each node stores
its logs in a separate tree under "/logs" which can then be combined or
queried. By default, set by config, only 2000 lines are kept.
Add an additional protected class, limit manipulation to one at a time,
and ensure future flexibility. Also makes display consistent with other
VM elements.
Adds tags to schema (v3), to VM definition, adds function to modify
tags, adds function to get tags, and adds tags to VM data output.
Tags will enable more granular classification of VMs based either on
administrator configuration or from automated system events.
Previously, if the node failed to restart, it was declared a "bad fence"
and no further action would be taken. However, there are some
situations, for instance critical hardware failures, where intelligent
systems will not attempt (or succeed at) starting up the node in such a
case, which would result in dead, known-offline nodes without recovery.
Tweak this behaviour somewhat. The main path of Reboot -> Check On ->
Success + fence-flush is retained, but some additional side-paths are
now defined:
1. We attempt to power "on" the chassis 1 second after the reboot, just
in case it is off and can be recovered. We then wait another 2 seconds
and check the power status (as we did before).
2. If the reboot succeeded, follow this series of choices:
a. If the chassis is on, the fence succeeded.
b. If the chassis is off, the fence "succeeded" as well.
c. If the chassis is in some other state, the fence failed.
3. If the reboot failed, follow this series of choices:
a. If the chassis is off, the fence itself failed, but we can treat
it as "succeeded"" since the chassis is in a known-offline state.
This is the most likely situation when there is a critical hardware
failure, and the server's IPMI does not allow itself to start back
up again.
b. If the chassis is in any other state ("on" or unknown), the fence
itself failed and we must treat this as a fence failure.
Overall, this should alleviate the aforementioned issue of a critical
failure rendering the node persistently "off" not triggering a
fence-flush and ensure fencing is more robust.
This reverts commit 65d14ccd92.
This was actually a bad idea. For inexplicable reasons, running these
Ceph commands manually (not even via Python, but in a normal shell)
takes 7 * two orders of magnitude longer than running them with the
Rados module, so long in fact that some basic commands like "ceph
health" would sometimes take longer than the 1 second timeout to
complete. The Rados commands would however take about 1ms instead.
Despite the occasional issues when monitors drop out, the Rados module
is clearly far superior to the shell commands for any moderately-loaded
Ceph cluster. We can look into solving timeouts another way (perhaps
with Processes instead of Threads) at a later time.
Rados module "ceph health":
b'{"checks":{},"status":"HEALTH_OK"}'
0.001204 (s)
b'{"checks":{},"status":"HEALTH_OK"}'
0.001258 (s)
Command "ceph health":
joshua@hv1.c.bonilan.net ~ $ time ceph health >/dev/null
real 0m0.772s
user 0m0.707s
sys 0m0.046s
joshua@hv1.c.bonilan.net ~ $ time ceph health >/dev/null
real 0m0.796s
user 0m0.728s
sys 0m0.054s