We already parse this information out anyways, so might as well add it
to the API output JSON. This can be leveraged by the Prometheus endpoint
as well to avoid duplicate listings.
This significantly simplifies cluster state handling by removing most of
the superfluous get_list() calls, replacing them with basic child reads
since most of them are just for a count anyways. The ones that require
states simplify this down to a child read plus direct reads for the
exact items required while leveraging the new read_many() function.
Adds a function, "read_many", which can take in multiple ZK keys and
return the values from all of them, using asyncio to avoid reading
sequentially.
Initial tests show a marked improvement in read performance of multiple
read()-heavy functions (e.g. "get_list()" functions) with this method.
Instead of using random hex characters from an md5sum, use a nice name
in all-caps similar to how Ceph does. This further helps prevent dupes
but also permits a changing health delta within a single event (which
would really only ever apply to plugin faults).
Since we already had a "details" field, simply move where it gets added
to the message later, in generate_fault, after the main message value
was used to generate the ID.
This ensures that certain faults e.g. Ceph status faults, will be
combined despite the added text in brackets, while still keeping them
mostly separate.
Also ensure the health text is updated each time to assist with this, as
this health text may now change independent of the fault ID.
Adjusts ordering and ensures that node health states are included in
faults if they are less than 50%.
Also adjusts fault ID generation and runs fault checks only coordinator
nodes to avoid too many runs.
Moves all tasks run by the Celery worker into a discrete package/module
for easier installation. Also adjusts several parameters throughout to
accomplish this.
Ensures that messages are fully read before each append. Adds more
Zookeeper hits, but ensures logs won't be overwritten by multiple
daemons.
Also don't use a set on the client side, to avoid "removing duplicate"
entries erroneously.
This is still needed due to the nature of the locks and freeing them on
startup, and to preserve lock=fail behaviour on VM startup.
Also fixes the fencing lock flush to directly use the client library
outside of Celery. I don't like this hack but it seems prudent until we
move fencing to the workers as well.