Commit Graph

2646 Commits

Author SHA1 Message Date
Joshua Boniface 4d6842f942 Don't bail out if write fails, keep retrying 2021-07-19 13:09:36 -04:00
Joshua Boniface 6ead21a308 Handle cleanup from a failure properly 2021-07-19 12:39:13 -04:00
Joshua Boniface b7c8c2ee3d Fix handling of this_node and d_domain in cleanup 2021-07-19 12:36:35 -04:00
Joshua Boniface d48f58930b Use harder exits and add cleanup termination 2021-07-19 12:27:16 -04:00
Joshua Boniface 7c36388c8f Add post-networking delay and adjust daemon delay 2021-07-19 12:23:45 -04:00
Joshua Boniface e9df043c0a Ensure ZK logging does not block startup 2021-07-19 12:19:59 -04:00
Joshua Boniface 71e4d0b32a Bump version to 0.9.28 2021-07-19 09:29:34 -04:00
Joshua Boniface f16bad4691 Revamp confirmation options for vm modify
Before, "-y"/"--yes" only confirmed the reboot portion. Instead, modify
this to confirm both the diff portion and the restart portion, and add
separate flags to bypass one or the other independently, ensuring the
administrator has lots of flexibility. UNSAFE mode implies "-y" so both
would be auto-confirmed if that option is set.
2021-07-19 00:25:43 -04:00
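The confirmation rules described in f16bad4691 can be sketched as a small truth table. This is only an illustration: the message names "-y"/"--yes" and UNSAFE mode, but not the two independent bypass flags, so `confirm_diff` and `confirm_restart` below are hypothetical names, as are `ConfirmFlags` and `prompts_needed`.

```python
from dataclasses import dataclass


@dataclass
class ConfirmFlags:
    yes: bool = False              # "-y"/"--yes": confirms both prompts
    confirm_diff: bool = False     # hypothetical: bypass only the diff prompt
    confirm_restart: bool = False  # hypothetical: bypass only the restart prompt
    unsafe: bool = False           # UNSAFE mode, which implies "-y"


def prompts_needed(flags: ConfirmFlags) -> dict:
    """Return which interactive prompts would still be shown."""
    confirm_all = flags.yes or flags.unsafe
    return {
        "diff": not (confirm_all or flags.confirm_diff),
        "restart": not (confirm_all or flags.confirm_restart),
    }
```

With no flags set both prompts appear; setting only one of the independent flags leaves the other prompt in place.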
Joshua Boniface 15d92c483f Bump version to 0.9.27 2021-07-19 00:03:40 -04:00
Joshua Boniface 7dd17e71e7 Fix bug with VM editing with file
Current config is needed for the diff but it was in a conditional.
2021-07-19 00:02:19 -04:00
Joshua Boniface 5be968123f Readd 1 second queue get timeout
Otherwise daemon stops will sometimes inexplicably block.
2021-07-18 22:17:57 -04:00
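A likely reason a missing timeout blocks shutdown, shown as a minimal sketch with the standard `queue` module (the names `consume`, `stop` and `received` are illustrative and not taken from the PVC code): a consumer that waits without a timeout never gets a chance to re-check its stop flag.

```python
import queue
import threading


def consume(q, stop, out, timeout=1.0):
    """Move items from q to out until stop is set."""
    while not stop.is_set():
        try:
            # A bounded wait lets the loop re-check the stop flag; a bare
            # q.get() would block indefinitely on an empty queue.
            item = q.get(timeout=timeout)
        except queue.Empty:
            continue
        out.append(item)
        q.task_done()


q = queue.Queue()
stop = threading.Event()
received = []
worker = threading.Thread(target=consume, args=(q, stop, received, 0.1))
worker.start()
q.put("log line")
q.join()                # wait until the item has been processed
stop.set()              # request shutdown; the worker notices within ~0.1 s
worker.join(timeout=5)
```

The shorter timeout here only keeps the demonstration quick; the commit itself uses 1 second.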
Joshua Boniface 99fd7ebe63 Fix excessive CPU due to looping 2021-07-18 22:06:50 -04:00
Joshua Boniface cffc96d156 Fix failure in creating base keys 2021-07-18 21:00:23 -04:00
Joshua Boniface 602093029c Bump version to 0.9.26 2021-07-18 20:49:52 -04:00
Joshua Boniface bd7a773d6b Add node log following functionality 2021-07-18 20:37:53 -04:00
Joshua Boniface 8d671b3422 Add some tag tests to test-cluster.sh 2021-07-18 20:37:37 -04:00
Joshua Boniface 2358ad6bbe Reduce the number of lines per call
500 was a lot every half second; 200 seems more reasonable. Even a fast
kernel boot should generate < 200 lines in half a second.
2021-07-18 20:23:45 -04:00
Joshua Boniface a0e9b57d39 Increase log line frequency 2021-07-18 20:19:59 -04:00
Joshua Boniface 2d48127e9c Use even better/faster set comparison 2021-07-18 20:18:35 -04:00
Joshua Boniface 55f2b00366 Add some spaces for better readability 2021-07-18 20:18:23 -04:00
Joshua Boniface ba257048ad Improve output formatting of node logs 2021-07-18 20:06:08 -04:00
Joshua Boniface b770e15a91 Fix final termination of logger
We need to do a bit more finagling with the logger on termination to
ensure that all messages are written and the queue drained before
actually terminating.
2021-07-18 19:53:00 -04:00
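Draining a queue before termination, as b770e15a91 describes, might look roughly like the following. This is a sketch under assumptions, not the daemon's actual logger: `drain` and the `write` callback are hypothetical.

```python
import queue


def drain(q, write):
    """Write out every message still queued; return how many were flushed."""
    flushed = 0
    while True:
        try:
            message = q.get_nowait()
        except queue.Empty:
            # Nothing left: it is now safe to terminate the logger.
            return flushed
        write(message)
        flushed += 1
```

Calling this after the producer has stopped ensures no queued message is lost on exit.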
Joshua Boniface e23a65128a Remove del of logger item 2021-07-18 19:03:47 -04:00
Joshua Boniface 982dfd52c6 Adjust date output format 2021-07-18 19:00:54 -04:00
Joshua Boniface 3a2478ee0c Cleanly terminate logger on cleanup 2021-07-18 18:57:44 -04:00
Joshua Boniface a088aa4484 Add node log functions to API and CLI 2021-07-18 18:54:28 -04:00
Joshua Boniface 323c7c41ae Implement node logging into Zookeeper
Adds the ability to send node daemon logs to Zookeeper to facilitate a
command like "pvc node log", similar to "pvc vm log". Each node stores
its logs in a separate tree under "/logs" which can then be combined or
queried. By default, set by config, only 2000 lines are kept.
2021-07-18 17:11:43 -04:00
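The retention behaviour in 323c7c41ae (a separate tree per node under "/logs", trimmed to 2000 lines by default) can be illustrated in memory. This sketch does not talk to Zookeeper; `NodeLogStore`, its methods, and the exact "/logs/<node>" key layout are assumptions made for the example.

```python
from collections import deque

MAX_LINES = 2000  # default retention stated in the commit message


class NodeLogStore:
    """In-memory stand-in for per-node log trees under "/logs"."""

    def __init__(self, max_lines=MAX_LINES):
        self.max_lines = max_lines
        self.trees = {}

    def append(self, node, line):
        key = f"/logs/{node}"  # assumed key layout
        buf = self.trees.setdefault(key, deque(maxlen=self.max_lines))
        buf.append(line)       # the oldest line is dropped once full

    def tail(self, node, lines):
        buf = self.trees.get(f"/logs/{node}", ())
        return list(buf)[-lines:] if lines > 0 else []
```

A bounded deque discards the oldest entry automatically, which matches the "only 2000 lines are kept" behaviour without explicit trimming.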
Joshua Boniface cd1db3d587 Ensure node name is part of config 2021-07-18 16:38:58 -04:00
Joshua Boniface 401f102344 Add serial BIOS to default libvirt schema 2021-07-15 10:45:14 -04:00
Joshua Boniface 4ac020888b Add some tag tests to test-cluster.sh 2021-07-14 15:02:03 -04:00
Joshua Boniface 8f3b68d48a Mention multiple option for tags in VM define 2021-07-14 01:12:10 -04:00
Joshua Boniface 6d4c26c8d8 Don't show tag line in info if no tags 2021-07-14 00:59:24 -04:00
Joshua Boniface 75fb60b1b4 Add VM list filtering by tag
Uses same method as state or node filtering, rather than altering how
the main LIMIT field works.
2021-07-14 00:59:20 -04:00
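Treating the tag as one more optional filter alongside state and node, as 75fb60b1b4 describes, could be sketched like this. The dictionary keys and the `filter_vms` function are hypothetical and only illustrate the approach.

```python
def filter_vms(vms, state=None, node=None, tag=None):
    """Return VMs matching every filter that is set (None means "any")."""
    result = []
    for vm in vms:
        if state is not None and vm["state"] != state:
            continue
        if node is not None and vm["node"] != node:
            continue
        if tag is not None and tag not in vm["tags"]:
            continue
        result.append(vm)
    return result
```

Each filter is independent, so adding the tag check does not change how the other filters behave.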
Joshua Boniface 9ea9ac3b8a Revamp tag handling and display
Add an additional protected class, limit manipulation to one at a time,
and ensure future flexibility. Also makes display consistent with other
VM elements.
2021-07-13 22:39:52 -04:00
Joshua Boniface 27f1758791 Add tags manipulation to API
Also fixes some checks for Metadata too since these two actions are
almost identical, and adds tags to define endpoint.
2021-07-13 19:05:33 -04:00
Joshua Boniface c0a3467b70 Simplify VM metadata reads
Directly call the new common getDomainMetadata function to avoid
excessive Zookeeper calls for this information.
2021-07-13 19:05:33 -04:00
Joshua Boniface 9a199992a1 Add functions for manipulating VM tags
Adds tags to schema (v3), to VM definition, adds function to modify
tags, adds function to get tags, and adds tags to VM data output.

Tags will enable more granular classification of VMs based either on
administrator configuration or from automated system events.
2021-07-13 19:05:33 -04:00
Joshua Boniface c6d552ae57 Rework success checks for IPMI fencing
Previously, if the node failed to restart, it was declared a "bad fence"
and no further action would be taken. However, in some situations, for
instance critical hardware failures, an intelligent system will not
attempt (or will not succeed at) starting the node back up, which would
leave dead, known-offline nodes without recovery.

Tweak this behaviour somewhat. The main path of Reboot -> Check On ->
Success + fence-flush is retained, but some additional side-paths are
now defined:

1. We attempt to power "on" the chassis 1 second after the reboot, just
in case it is off and can be recovered. We then wait another 2 seconds
and check the power status (as we did before).

2. If the reboot succeeded, follow this series of choices:

    a. If the chassis is on, the fence succeeded.

    b. If the chassis is off, the fence "succeeded" as well.

    c. If the chassis is in some other state, the fence failed.

3. If the reboot failed, follow this series of choices:

    a. If the chassis is off, the fence itself failed, but we can treat
    it as "succeeded" since the chassis is in a known-offline state.
    This is the most likely situation when there is a critical hardware
    failure, and the server's IPMI does not allow itself to start back
    up again.

    b. If the chassis is in any other state ("on" or unknown), the fence
    itself failed and we must treat this as a fence failure.

Overall, this should alleviate the aforementioned issue of a critical
failure rendering the node persistently "off" not triggering a
fence-flush and ensure fencing is more robust.
2021-07-13 17:54:41 -04:00
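The decision paths 2a-2c and 3a-3b in c6d552ae57 reduce to a small function. This is a sketch of the described logic only; `fence_result` and its string states are illustrative, and the real code obtains the chassis state from IPMI.

```python
def fence_result(reboot_ok, chassis_state):
    """Return True if the fence should be treated as successful."""
    if reboot_ok:
        # 2a/2b: "on" or "off" both count as success; 2c: anything else fails.
        return chassis_state in ("on", "off")
    # 3a: the reboot failed but the chassis is off, a known-offline state.
    # 3b: "on" or unknown means the fence failed.
    return chassis_state == "off"
```

Only a chassis that is confirmed off can rescue a failed reboot, which is what allows the fence-flush to proceed after a critical hardware failure.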
Joshua Boniface 2e9f6ac201 Bump version to 0.9.25 2021-07-11 23:19:09 -04:00
Joshua Boniface f09849bedf Don't overwrite shutdown state on termination
Just a minor quibble and not really impactful.
2021-07-11 23:18:14 -04:00
Joshua Boniface 8c975e5c46 Add chroot context manager example to debootstrap
Closes #132
2021-07-11 23:10:41 -04:00
Joshua Boniface c76149141f Only log ZK connections when persistent
Prevents spam in the API logs.
2021-07-10 23:35:49 -04:00
Joshua Boniface f00c4d07f4 Add date output to keepalive
Helps track when there is a log follow in "-o cat" mode.
2021-07-10 23:24:59 -04:00
Joshua Boniface 20b66c10e1 Move two more commands to Rados library 2021-07-10 17:28:42 -04:00
Joshua Boniface cfeba50b17 Revert "Return to all command-based Ceph gathering"
This reverts commit 65d14ccd92.

This was actually a bad idea. For inexplicable reasons, running these
Ceph commands manually (not even via Python, but in a normal shell)
takes several hundred times longer than running them with the
Rados module, so long in fact that some basic commands like "ceph
health" would sometimes take longer than the 1 second timeout to
complete. The Rados commands would however take about 1ms instead.

Despite the occasional issues when monitors drop out, the Rados module
is clearly far superior to the shell commands for any moderately-loaded
Ceph cluster. We can look into solving timeouts another way (perhaps
with Processes instead of Threads) at a later time.

Rados module "ceph health":
    b'{"checks":{},"status":"HEALTH_OK"}'
    0.001204 (s)
    b'{"checks":{},"status":"HEALTH_OK"}'
    0.001258 (s)
Command "ceph health":
    joshua@hv1.c.bonilan.net ~ $ time ceph health >/dev/null
    real    0m0.772s
    user    0m0.707s
    sys     0m0.046s
    joshua@hv1.c.bonilan.net ~ $ time ceph health >/dev/null
    real    0m0.796s
    user    0m0.728s
    sys     0m0.054s
2021-07-10 03:47:45 -04:00
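The slowdown reported in cfeba50b17 can be checked from the four timings quoted in the message. The snippet below only does that arithmetic, comparing the shell command's "real" time against the Rados module timings.

```python
from statistics import mean

rados_seconds = [0.001204, 0.001258]  # Rados module timings from the message
shell_seconds = [0.772, 0.796]        # "real" times of the shell command

# Ratio of the average shell time to the average Rados time.
ratio = mean(shell_seconds) / mean(rados_seconds)
print(f"shell is about {ratio:.0f}x slower")
```

The result falls between 600x and 700x, and the shell timings sit close to the 1 second timeout mentioned in the message.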
Joshua Boniface 0699c48d10 Fix bad schema path name 2021-07-09 16:47:09 -04:00
Joshua Boniface 551bae2518 Bump version to 0.9.24 2021-07-09 15:58:36 -04:00
Joshua Boniface 4832245d9c Handle non-RBD disks and non-RBD errors better 2021-07-09 15:48:57 -04:00
Joshua Boniface 2138f2f59f Fail VM removal on disk removal failures
Prevents bad states where the VM is "removed" but some of its disks
remain due to e.g. stuck watchers.

Rearrange the sequence so it goes stop, delete disks, then delete VM,
and then return a failure if any of the disk(s) fail to remove, allowing
the task to be rerun after fixing the problem.
2021-07-09 15:39:06 -04:00
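One reading of the sequence in 2138f2f59f, sketched with hypothetical callables (`stop_vm`, `remove_disk`, `delete_vm`) standing in for the real operations: the VM definition is only deleted once every disk has been removed, so a failed run can be repeated. Whether the actual code skips the final delete on failure is an assumption here.

```python
def remove_vm(vm, stop_vm, remove_disk, delete_vm):
    """Stop, delete disks, then delete the VM; fail if any disk remains."""
    stop_vm(vm)
    failed = []
    for disk in vm["disks"]:
        if not remove_disk(disk):
            failed.append(disk)
    if failed:
        # Assumed behaviour: keep the VM definition so the task can be rerun.
        return False, failed
    delete_vm(vm)
    return True, []
```

Returning the list of failed disks tells the administrator what to fix (for example stuck watchers) before rerunning.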
Joshua Boniface d1d355a96b Avoid errors if stats data is None 2021-07-09 13:13:54 -04:00