Move monitoring in keepalives to plugin system #161

joshuaboniface · 2023-02-13T00:38:09-05:00

joshuaboniface commented

2023-02-13 00:38:09 -05:00

To address both #154 and #159, it seems sensible to rework the current statistics/health information collected in the keepalive messages into a more extensible, plugin-based system. This will allow the current "core" monitoring items to work within a unified monitoring framework, and will allow new plugins, both integrated and custom third-party ones, to be added to the system easily.

Supersedes #154 and #159.

joshuaboniface added the feature label

2023-02-13 00:38:23 -05:00

joshuaboniface commented

2023-02-13 01:04:32 -05:00

The first task is to create a monitoring framework which can be run in the keepalive function and then return a standardized set of values based on a class instance with defined parameters and properties.

That data can then be written to Zookeeper in a set of keys under each node, along with some more information about the plugin (e.g. last successful run time, health delta, and data for client consumption in JSON format), updated at each run.

From there the API can determine things like overall cluster health or show the data of individual plugins.
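As a rough illustration of the shape described above, here is a minimal sketch. Everything in it is an assumption made for illustration, not the actual implementation: the class names `PluginResult`, `MonitoringPlugin` and `LoadPlugin`, the `run_plugins` helper, and the `/nodes/<node>/monitoring/<plugin>` key layout are all hypothetical. The sketch only builds the key/value pairs; writing them to Zookeeper is left out.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class PluginResult:
    """Standardized values returned by every plugin run (hypothetical shape)."""
    plugin_name: str
    health_delta: int = 0                      # how much this plugin affects node health
    message: str = ""                          # short human-readable status
    data: dict = field(default_factory=dict)   # data for client consumption
    last_run: float = 0.0                      # Unix time of the last successful run


class MonitoringPlugin:
    """Base class (hypothetical); concrete plugins override run()."""
    name = "base"

    def run(self) -> PluginResult:
        raise NotImplementedError


class LoadPlugin(MonitoringPlugin):
    """Example plugin: flags a load value above a threshold."""
    name = "load"

    def __init__(self, load, threshold):
        self.load = load
        self.threshold = threshold

    def run(self) -> PluginResult:
        over = self.load > self.threshold
        return PluginResult(
            plugin_name=self.name,
            health_delta=10 if over else 0,
            message="load high" if over else "load OK",
            data={"load": self.load, "threshold": self.threshold},
        )


def run_plugins(plugins, node, now=None):
    """Run each plugin once and return {key_path: value} pairs for the store.

    The key layout is an assumption made for this sketch.
    """
    now = time.time() if now is None else now
    keys = {}
    for plugin in plugins:
        result = plugin.run()
        result.last_run = now
        base = f"/nodes/{node}/monitoring/{result.plugin_name}"
        keys[f"{base}/last_run"] = str(result.last_run)
        keys[f"{base}/health_delta"] = str(result.health_delta)
        keys[f"{base}/message"] = result.message
        keys[f"{base}/data"] = json.dumps(result.data)
    return keys
```

A keepalive run could then call something like `run_plugins([LoadPlugin(load=9.5, threshold=8.0)], node="node1")` and write the returned pairs under the node, which the API could later read back per plugin.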

joshuaboniface commented

2023-02-13 03:57:48 -05:00

The initial implementation of a basic framework is in place and will be tested.

Integrating this framework into a total cluster health setup must be considered.

I believe the best course of action is to standardize on a percentage-style health score.

First, each node has a total health value of 100.

Individual plugins can adjust this node health value by some amount depending on the severity of what they detect. Generally speaking, a "1" would be something anomalous but not serious, a "10" would be something anomalous and potentially noteworthy, and a "50" would be a serious fault.

Each node would then display its total health (colourized!) in the node list and the more detailed output in the node info.

On the cluster level, the total health of the cluster would be (100 * nodes), thus a 3-node cluster would have 300 total health. The overall cluster states would then follow a pattern similar to an individual node, only aggregated over the whole cluster.

At the node level, health would become "degraded" at 50/100; at the cluster level, health would become degraded at 1/2 of the total available health (150/300 in a 3-node cluster).

How health is output at the cluster level will depend on several factors to be decided later during development.
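The arithmetic above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, plugin values are assumed to be subtracted from the node's 100 (clamped at 0), and "degraded at 50/100" is read as "at or below half".

```python
NODE_MAX_HEALTH = 100


def node_health(deltas):
    """Start at 100 and subtract each plugin's value, never going below 0.

    Assumption: plugin values (1, 10, 50, ...) reduce the node's health.
    """
    return max(0, NODE_MAX_HEALTH - sum(deltas))


def cluster_health(node_healths):
    """Return (current, total), where total is 100 * number of nodes."""
    total = NODE_MAX_HEALTH * len(node_healths)
    return sum(node_healths), total


def is_degraded(health, total):
    """Assumption: degraded means at or below half of the available health."""
    return health <= total / 2


# One node with a minor and a noteworthy anomaly, one with a serious fault.
nodes = [node_health([]), node_health([1, 10]), node_health([50])]
current, total = cluster_health(nodes)
print(nodes, current, total, is_degraded(current, total))
```

With these assumptions, the third node alone is degraded (50/100), while the cluster as a whole is not, since its summed health is still above half of 300.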

joshuaboniface referenced this issue from a commit

2023-02-22 18:13:51 -05:00

Merge branch 'revamp-health'

joshuaboniface closed this issue

2023-02-22 18:13:51 -05:00

joshuaboniface referenced this issue from a commit

2023-09-01 15:58:46 -04:00

Merge branch 'revamp-health'
