# PVC Node Monitoring Resources

This directory contains monitoring resources for several monitoring systems, which can be used to track and alert on the health of a PVC cluster.

## Prometheus + Grafana

The included example Prometheus configuration and Grafana dashboard can be used to query the PVC API for Prometheus data and display it with a consistent dashboard.

See the README in the `prometheus` folder for more details.

## Munin

The included Munin plugins can be activated by linking to them from `/etc/munin/plugins/`. Two plugins are provided:

* `pvc`: Checks PVC cluster and node health, as well as the resulting alert status (OK/Warning/Critical, taking maintenance state into account), providing 4 graphs.

* `ceph_utilization`: Checks the Ceph cluster statistics, providing multiple graphs. Note that this plugin is independent of PVC and makes local calls to various Ceph commands directly.
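If plugin activation is scripted rather than done by hand, the linking step might look like the following minimal sketch. The function name `activate_munin_plugins` and its arguments are hypothetical and are not part of PVC; the source directory is whatever location holds this repository's Munin plugins on the node.

```python
from pathlib import Path


def activate_munin_plugins(source_dir, plugin_dir, names=("pvc", "ceph_utilization")):
    """Symlink each named plugin from source_dir into plugin_dir.

    Hypothetical helper: existing entries in plugin_dir are left untouched,
    so the function can be re-run safely. Returns the list of link paths.
    """
    source_dir = Path(source_dir).resolve()
    plugin_dir = Path(plugin_dir)
    links = []
    for name in names:
        target = source_dir / name
        if not target.is_file():
            raise FileNotFoundError(f"plugin not found: {target}")
        link = plugin_dir / name
        # Only create the link if nothing (file or dangling link) is there yet.
        if not link.exists() and not link.is_symlink():
            link.symlink_to(target)
        links.append(link)
    return links
```

On a real node, `plugin_dir` would be `/etc/munin/plugins/`, and the Munin node service generally needs a restart before newly linked plugins are picked up.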

The `pvc` plugin provides no configuration; the status is hardcoded such that <=90% health is warning, <=50% health is critical, and maintenance state forces OK. Alerting is handled by two graphs that are separate from the health graphs, so that the actual health value is always logged regardless of the alert state.
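The hardcoded thresholds can be summarized with a small sketch. This is an illustration of the rule as described here, not the plugin's actual code; the function name `pvc_alert_state` and the string return values are assumptions.

```python
def pvc_alert_state(health_percent, maintenance=False):
    """Map a health percentage to an alert state (illustrative only).

    Maintenance forces OK; otherwise <=50% is critical and <=90% is warning.
    """
    if maintenance:
        return "OK"
    if health_percent <= 50:
        return "CRITICAL"
    if health_percent <= 90:
        return "WARNING"
    return "OK"
```

Note that both boundaries are inclusive: exactly 90% health is already a warning, and exactly 50% is already critical.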

The `ceph_utilization` plugin provides no configuration; only the cluster utilization graph alerts, with >80% used as warning and >90% used as critical. Ceph itself begins warning above 80% as well.
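For comparison with the health thresholds, the utilization rule uses strict greater-than comparisons. A minimal sketch, assuming a hypothetical function name `ceph_utilization_state` and string return values that the plugin itself may not use:

```python
def ceph_utilization_state(used_percent):
    """Map cluster utilization to an alert state (illustrative only).

    More than 90% used is critical; more than 80% used is warning.
    """
    if used_percent > 90:
        return "CRITICAL"
    if used_percent > 80:
        return "WARNING"
    return "OK"
```

Under this reading, exactly 80% used is still OK and exactly 90% used is still a warning.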

## CheckMK

The included CheckMK plugin is divided into two parts: the agent plugin and the monitoring server plugin. The monitoring server plugin requires CheckMK version 2.0 or higher. The two parts can be installed as follows:

* `pvc`: Place this file in the `/usr/lib/check_mk_agent/plugins/` directory on each node.

* `pvc.py`: Place this file in the `~/local/lib/python3/cmk/base/plugins/agent_based/` directory on the CheckMK monitoring host for each monitoring site.

The plugin provides no configuration: the status is hardcoded such that <=90% health is warning, <=50% health is critical, and maintenance state forces OK.

With both the agent and server plugins installed, you can then run `cmk -II <node>` (or use WATO) to inventory each node, which should produce two new checks:

* `PVC Cluster`: Provides the cluster-wide health. Note that this will be identical for all nodes in the cluster (i.e. if the cluster health drops, all nodes in the cluster will alert on this check).

* `PVC Node <shortname>`: Provides the per-node health.

The "Summary" text, shown in the check lists, is minimal and shows only the current health percentage.

The "Details" text, found in the specific check details, will show the full list of problem(s) the check finds, as shown by `pvc status` itself.
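The split between state, "Summary", and "Details" can be pictured with the following sketch. It does not use the CheckMK plugin API and is not taken from `pvc.py`; the function name `build_check_result`, the summary wording, and the fallback details text are all assumptions for illustration. The numeric states follow the usual Nagios/CheckMK convention of 0 for OK, 1 for warning, and 2 for critical.

```python
OK, WARN, CRIT = 0, 1, 2  # conventional numeric check states


def build_check_result(health_percent, problems, maintenance=False):
    """Return (state, summary, details) for one check (illustrative only).

    The summary carries only the health percentage; the details carry the
    full problem list, one problem per line.
    """
    if maintenance:
        state = OK
    elif health_percent <= 50:
        state = CRIT
    elif health_percent <= 90:
        state = WARN
    else:
        state = OK
    summary = f"Health is {health_percent}%"  # hypothetical wording
    details = "\n".join(problems) if problems else "No problems reported"
    return state, summary, details
```

A call such as `build_check_result(75, ["first problem", "second problem"])` would yield a warning state, a one-line summary, and a two-line details text.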