Move monitoring in keepalives to plugin system #161

Closed
opened 2023-02-13 00:38:09 -05:00 by joshuaboniface · 2 comments

To address both #154 and #159, it seems sensible to rework the current statistics/health information collected in the keepalive messages into a more extensible, plugin-based system. This will allow the current "core" monitoring items to work within a unified monitoring framework, and will allow new plugins, both integrated and third-party custom ones, to be added to the system easily.

Supersedes #154 #159

joshuaboniface added the feature label 2023-02-13 00:38:23 -05:00
Author
Owner

The first task is to create a monitoring framework which can be run in the keepalive function and then return a standardized set of values based on a class instance with defined parameters and properties.

That data can then be written to Zookeeper in a set of keys under each node, along with some more information about the plugin (e.g. last successful run time, health delta, and data for client consumption in JSON format), updated at each run.

From there the API can determine things like overall cluster health or show the data of individual plugins.
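As a rough illustration of the shape this could take, here is a minimal sketch of a plugin base class, a standardized result object, and the per-node keys that a run might produce. Everything named here (`PluginResult`, `MonitoringPlugin`, `LoadPlugin`, `run_plugins`, and the `/nodes/<node>/monitoring/<plugin>` path layout) is hypothetical and not the actual PVC schema. The sketch only builds the key/value pairs; writing them to Zookeeper is left out.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class PluginResult:
    """Standardized values returned by one plugin run (hypothetical shape)."""

    plugin_name: str
    health_delta: int = 0  # amount this plugin takes off the node's health
    message: str = ""
    data: dict = field(default_factory=dict)  # free-form data for clients
    last_run: float = field(default_factory=time.time)

    def to_keys(self, node_name: str) -> dict:
        """Map the result to key/value pairs under a per-node path.

        The path layout is an assumption made for this example.
        """
        base = f"/nodes/{node_name}/monitoring/{self.plugin_name}"
        return {
            f"{base}/last_run": str(int(self.last_run)),
            f"{base}/health_delta": str(self.health_delta),
            f"{base}/message": self.message,
            f"{base}/data": json.dumps(self.data),
        }


class MonitoringPlugin:
    """Base class; a concrete plugin overrides run()."""

    name = "base"

    def run(self) -> PluginResult:
        raise NotImplementedError


class LoadPlugin(MonitoringPlugin):
    """Toy plugin: flags a load average above the CPU count."""

    name = "load"

    def __init__(self, load_average: float, cpu_count: int):
        self.load_average = load_average
        self.cpu_count = cpu_count

    def run(self) -> PluginResult:
        overloaded = self.load_average > self.cpu_count
        return PluginResult(
            plugin_name=self.name,
            health_delta=10 if overloaded else 0,
            message="load exceeds CPU count" if overloaded else "load OK",
            data={"load_average": self.load_average, "cpu_count": self.cpu_count},
        )


def run_plugins(node_name: str, plugins: list) -> dict:
    """Run every plugin once, as a keepalive would, and collect the keys to write."""
    keys = {}
    for plugin in plugins:
        keys.update(plugin.run().to_keys(node_name))
    return keys
```

Calling `run_plugins("hv1", [LoadPlugin(9.5, 8)])` would return four keys under `/nodes/hv1/monitoring/load/`, which the keepalive could then write out at each run.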

Author
Owner

The initial implementation of a basic framework is in place and will be tested.

Integrating this framework into a total cluster health setup must be considered.

I believe the best course of action is to standardize on a percentage-based health score.

First, each node has a total health value of 100.

Individual plugins can adjust this node health value by some amount depending on the severity of what they report. Generally speaking, a "1" would be something anomalous but not serious, a "10" would be something anomalous but potentially noteworthy, and a "50" would be a serious fault.

Each node would then display its total health (colourized!) in the node list and the more detailed output in the node info.

On the cluster level, the total health of the cluster would be (100 * nodes), thus a 3-node cluster would have 300 total health. The overall cluster states would then follow a pattern similar to an individual node, only aggregated over the whole cluster.

At the node level, health would become "degraded" at 50/100; at the cluster level, health would become degraded at 1/2 of the total available health (150/300 in a 3-node cluster).
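The arithmetic above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, plugin deltas are assumed to be subtracted from the node's 100, node health is assumed to be clamped at zero, and "degraded at 50/100" is read as "at or below half of the maximum".

```python
NODE_MAX_HEALTH = 100


def node_health(deltas):
    """Node health: 100 minus the sum of plugin deltas (clamping at 0 is an assumption)."""
    return max(0, NODE_MAX_HEALTH - sum(deltas))


def cluster_health(node_healths):
    """Return (current, maximum) cluster health, where maximum is 100 * nodes."""
    return sum(node_healths), NODE_MAX_HEALTH * len(node_healths)


def is_degraded(health, maximum):
    """Assumption: degraded means at or below half of the available health."""
    return health * 2 <= maximum


# Three nodes: one healthy, one with a "1" and a "10", one with a "50" fault.
nodes = [node_health([]), node_health([1, 10]), node_health([50])]
current, maximum = cluster_health(nodes)
print(nodes, current, maximum)  # [100, 89, 50] 239 300
print(is_degraded(nodes[2], NODE_MAX_HEALTH))  # True: the third node is at 50/100
print(is_degraded(current, maximum))  # False: 239/300 is above 150/300
```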

How health will be output at the cluster level will depend on several factors, to be decided later during development.

Reference: parallelvirtualcluster/pvc#161