Compare commits

9 Commits

7 changed files with 479 additions and 7 deletions

@ -58,6 +58,14 @@ This plugin checks for Debian package updates, invalid package states (i.e. not
This plugin checks the EDAC utility for messages about errors, primarily in the ECC memory subsystem. It will raise a health delta of 50 if any `Uncorrected` EDAC errors are detected, possibly indicating failing memory.
#### `ipmi`
This plugin checks whether the daemon can reach its own IPMI address and connect. If it cannot, it raises a health delta of 10.
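As a rough illustration of this check, the sketch below builds the `ipmitool` invocation as an argument list and maps its exit code to a health delta and message. The helper names `build_ipmitool_args` and `ipmi_health` are hypothetical and not part of PVC; the flags and messages mirror those used by the bundled `ipmi` plugin, which passes a single command string to `run_os_command` instead.

```python
def build_ipmitool_args(hostname, username, password):
    """Return the ipmitool command as an argument list (hypothetical helper)."""
    return [
        "/usr/bin/ipmitool",
        "-I", "lanplus",
        "-H", hostname,
        "-U", username,
        "-P", password,
        "chassis", "power", "status",
    ]


def ipmi_health(retcode, hostname, username):
    """Map an ipmitool exit code to a (health_delta, message) pair."""
    if retcode > 0:
        # Non-zero exit: IPMI is unreachable, subtract 10 from the node health
        return 10, f"IPMI via {username}@{hostname} is NOT responding"
    # Zero exit: IPMI answered, no change to the node health
    return 0, f"IPMI via {username}@{hostname} is responding"
```

Keeping the arguments as a list avoids quoting problems if a credential contains spaces, assuming the command is executed without a shell.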
#### `lbvt`
This plugin checks whether the daemon can connect to the local Libvirt daemon instance. If it cannot, it raises a health delta of 50.
#### `load`
This plugin checks the current 1-minute system load (as reported during keepalives) against the number of total CPU threads available on the node. If the load average is greater, i.e. the node is overloaded, it raises a health delta of 50.
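The comparison described here can be sketched in a few lines. This is only an illustration: `load_health_delta` is a hypothetical name, and the real plugin takes the load from the keepalive data rather than calling `os.getloadavg()` (which is available on Unix-like systems only).

```python
import os


def load_health_delta(load_1min, cpu_threads):
    """Return 50 if the 1-minute load exceeds the CPU thread count, else 0."""
    return 50 if load_1min > cpu_threads else 0


if __name__ == "__main__":
    # Stand-in for the keepalive-reported load (assumption for this sketch)
    load_1min, _, _ = os.getloadavg()
    cpu_threads = os.cpu_count() or 1
    print(load_health_delta(load_1min, cpu_threads))
```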
@ -68,12 +76,20 @@ This plugin checks that all NICs underlying PVC networks and bridges are operati
* For each device defined (`bridge_dev`, `upstream_dev`, `cluster_dev`, and `storage_dev`), it determines the type of device. If it is a vLAN, it obtains the underlying device; otherwise, it uses the specified device. It then adds this device to a list of core NICs. Ideally, this list will contain either bonding interfaces or actual ethernet NICs.
* For each core NIC, it checks its type. If it is a `bond` device, it checks the bonding state to ensure that at least 2 slave interfaces are up and operating. If there are not, it raises a health delta of 10. It then performs the following step for each slave NIC.
* For each core NIC, it checks its type. If it is a `bond` device, it checks the bonding state to ensure that at least 2 slave interfaces are up and operating. If there are not, it raises a health delta of 10.
* For each core NIC or bond slave device, it checks its maximum possible speed as reported by `ethtool` as well as the current active speed. If the NIC is operating at less than its maximum possible speed, it raises a health delta of 10.
* For each core NIC, it checks its maximum possible speed as reported by `ethtool` as well as the current active speed. If the NIC is operating at less than its maximum possible speed, it raises a health delta of 10.
Note that this check may pose problems in some deployment scenarios (e.g. running 25GbE NICs at 10GbE by design). Currently the plugin logic cannot handle this and manual modifications may be required. This is left to the administrator if applicable.
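The bond and speed rules above can be sketched as two small functions. The names are hypothetical, and the optional `expected_speed` override is not an existing PVC option; it only illustrates one way an administrator might modify the logic for deployments that deliberately run NICs below their maximum speed.

```python
def bond_health_delta(active_slaves):
    """Return 10 if fewer than 2 bond slaves are up and operating, else 0."""
    return 10 if active_slaves < 2 else 0


def speed_health_delta(max_speed, current_speed, expected_speed=None):
    """Return 10 if the NIC runs below its target speed (in Mb/s), else 0.

    The target is the maximum speed reported by ethtool unless a
    hypothetical per-NIC expected_speed override is supplied.
    """
    target = max_speed if expected_speed is None else expected_speed
    return 10 if current_speed < target else 0
```

For example, a 25GbE NIC intentionally running at 10GbE yields a delta of 10 by default, and 0 once `expected_speed=10000` is passed.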
#### `psql`
This plugin checks whether the daemon can connect to the local PostgreSQL/Patroni daemon instance. If it cannot, it raises a health delta of 50.
#### `zkpr`
This plugin checks whether the daemon can connect to the local Zookeeper daemon instance. If it cannot, it raises a health delta of 50.
### Custom Health Plugins
In addition to the included health plugins, the plugin architecture allows administrators to write their own plugins as required to check specific node details that might not be checked by the default plugins. While the author has endeavoured to cover as many important aspects as possible with the default plugins, there is always the possibility that some other condition becomes important, and the system is flexible enough to accommodate this need. That said, we would welcome pull requests adding new plugins to future versions of PVC should they be widely applicable.
@ -92,7 +108,7 @@ from pvcnoded.objects.MonitoringInstance import MonitoringPlugin
```
* A `PLUGIN_NAME` variable which defines the name of the plugin. This must match the filename.
* A `PLUGIN_NAME` variable which defines the name of the plugin. This must match the filename. Generally, a plugin name will be 4 characters, but this is purely a convention and not a requirement.
```
# A monitoring plugin script must always expose its nice name, which must be identical to the file name

106
node-daemon/plugins/ipmi Normal file

@ -0,0 +1,106 @@
#!/usr/bin/env python3
# ipmi.py - PVC Monitoring example plugin for IPMI
# Part of the Parallel Virtual Cluster (PVC) system
#
# Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, version 3.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
###############################################################################
# This script provides an example of a PVC monitoring plugin script. It will create
# a simple plugin to check whether the system IPMI is reachable.
# This script can thus be used as an example or reference implementation of a
# PVC monitoring plugin script and expanded upon as required.
# A monitoring plugin script must implement the class "MonitoringPluginScript" which
# extends "MonitoringPlugin", providing the 3 functions indicated. Detailed explanation
# of the role of each function is provided in context of the example; see the other
# examples for more potential uses.
# WARNING:
#
# This script will run in the context of the node daemon keepalives as root.
# DO NOT install untrusted, unvetted plugins under any circumstances.
# This import is always required here, as MonitoringPlugin is used by the
# MonitoringPluginScript class
from pvcnoded.objects.MonitoringInstance import MonitoringPlugin
# A monitoring plugin script must always expose its nice name, which must be identical to
# the file name
PLUGIN_NAME = "ipmi"
# The MonitoringPluginScript class must be named as such, and extend MonitoringPlugin.
class MonitoringPluginScript(MonitoringPlugin):
    def setup(self):
        """
        setup(): Perform special setup steps during node daemon startup
        This step is optional and should be used sparingly.
        If you wish for the plugin to not load in certain conditions, do any checks here
        and return a non-None failure message to indicate the error.
        """
        pass

    def run(self):
        """
        run(): Perform the check actions and return a PluginResult object
        """
        # Run any imports first
        from daemon_lib.common import run_os_command

        # Check the node's IPMI interface
        ipmi_hostname = self.config["ipmi_hostname"]
        ipmi_username = self.config["ipmi_username"]
        ipmi_password = self.config["ipmi_password"]
        retcode, _, _ = run_os_command(
            f"/usr/bin/ipmitool -I lanplus -H {ipmi_hostname} -U {ipmi_username} -P {ipmi_password} chassis power status"
        )

        if retcode > 0:
            # Set the health delta to 10 (subtract 10 from the total of 100)
            health_delta = 10
            # Craft a message that can be used by the clients
            message = f"IPMI via {ipmi_username}@{ipmi_hostname} is NOT responding"
        else:
            # Set the health delta to 0 (no change)
            health_delta = 0
            # Craft a message that can be used by the clients
            message = f"IPMI via {ipmi_username}@{ipmi_hostname} is responding"

        # Set the health delta in our local PluginResult object
        self.plugin_result.set_health_delta(health_delta)
        # Set the message in our local PluginResult object
        self.plugin_result.set_message(message)
        # Return our local PluginResult object
        return self.plugin_result

    def cleanup(self):
        """
        cleanup(): Perform special cleanup steps during node daemon termination
        This step is optional and should be used sparingly.
        """
        pass

105
node-daemon/plugins/lbvt Normal file

@ -0,0 +1,105 @@
#!/usr/bin/env python3
# lbvt.py - PVC Monitoring example plugin for Libvirtd
# Part of the Parallel Virtual Cluster (PVC) system
#
# Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, version 3.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
###############################################################################
# This script provides an example of a PVC monitoring plugin script. It will create
# a simple plugin to check the Libvirt daemon instance on the node for operation.
# This script can thus be used as an example or reference implementation of a
# PVC monitoring plugin script and expanded upon as required.
# A monitoring plugin script must implement the class "MonitoringPluginScript" which
# extends "MonitoringPlugin", providing the 3 functions indicated. Detailed explanation
# of the role of each function is provided in context of the example; see the other
# examples for more potential uses.
# WARNING:
#
# This script will run in the context of the node daemon keepalives as root.
# DO NOT install untrusted, unvetted plugins under any circumstances.
# This import is always required here, as MonitoringPlugin is used by the
# MonitoringPluginScript class
from pvcnoded.objects.MonitoringInstance import MonitoringPlugin
# A monitoring plugin script must always expose its nice name, which must be identical to
# the file name
PLUGIN_NAME = "lbvt"
# The MonitoringPluginScript class must be named as such, and extend MonitoringPlugin.
class MonitoringPluginScript(MonitoringPlugin):
    def setup(self):
        """
        setup(): Perform special setup steps during node daemon startup
        This step is optional and should be used sparingly.
        If you wish for the plugin to not load in certain conditions, do any checks here
        and return a non-None failure message to indicate the error.
        """
        pass

    def run(self):
        """
        run(): Perform the check actions and return a PluginResult object
        """
        # Run any imports first
        from libvirt import openReadOnly as lvopen

        lv_conn = None

        # Set the health delta to 0 (no change)
        health_delta = 0
        # Craft a message that can be used by the clients
        message = "Successfully connected to Libvirtd on localhost"

        # Check the Libvirt connection
        try:
            lv_conn = lvopen(f"qemu+tcp://{self.this_node.name}/system")
            data = lv_conn.getHostname()
        except Exception as e:
            health_delta = 50
            message = f"Failed to connect to Libvirtd: {e}"
        finally:
            if lv_conn is not None:
                lv_conn.close()

        # Set the health delta in our local PluginResult object
        self.plugin_result.set_health_delta(health_delta)
        # Set the message in our local PluginResult object
        self.plugin_result.set_message(message)
        # Return our local PluginResult object
        return self.plugin_result

    def cleanup(self):
        """
        cleanup(): Perform special cleanup steps during node daemon termination
        This step is optional and should be used sparingly.
        """
        pass

@ -20,8 +20,7 @@
###############################################################################
# This script provides an example of a PVC monitoring plugin script. It will create
# a simple plugin to check the system load against the total number of CPU cores,
# and return a 10 health delta (100 -> 90) if the load average is > 1/2 that number.
# a simple plugin to check the system load against the total number of CPU cores.
# This script can thus be used as an example or reference implementation of a
# PVC monitoring plugin script and expanded upon as required.

139
node-daemon/plugins/psql Normal file

@ -0,0 +1,139 @@
#!/usr/bin/env python3
# psql.py - PVC Monitoring example plugin for Postgres/Patroni
# Part of the Parallel Virtual Cluster (PVC) system
#
# Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, version 3.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
###############################################################################
# This script provides an example of a PVC monitoring plugin script. It will create
# a simple plugin to check the Patroni PostgreSQL instance on the node for operation.
# This script can thus be used as an example or reference implementation of a
# PVC monitoring plugin script and expanded upon as required.
# A monitoring plugin script must implement the class "MonitoringPluginScript" which
# extends "MonitoringPlugin", providing the 3 functions indicated. Detailed explanation
# of the role of each function is provided in context of the example; see the other
# examples for more potential uses.
# WARNING:
#
# This script will run in the context of the node daemon keepalives as root.
# DO NOT install untrusted, unvetted plugins under any circumstances.
# This import is always required here, as MonitoringPlugin is used by the
# MonitoringPluginScript class
from pvcnoded.objects.MonitoringInstance import MonitoringPlugin
# A monitoring plugin script must always expose its nice name, which must be identical to
# the file name
PLUGIN_NAME = "psql"
# The MonitoringPluginScript class must be named as such, and extend MonitoringPlugin.
class MonitoringPluginScript(MonitoringPlugin):
    def setup(self):
        """
        setup(): Perform special setup steps during node daemon startup
        This step is optional and should be used sparingly.
        """
        pass

    def run(self):
        """
        run(): Perform the check actions and return a PluginResult object
        """
        # Run any imports first
        from psycopg2 import connect

        # Initialise every handle used in the "finally" blocks below, so that a
        # failed connect() cannot leave one of these names undefined
        conn_metadata = None
        cur_metadata = None
        conn_pdns = None
        cur_pdns = None

        # Set the health delta to 0 (no change)
        health_delta = 0
        # Craft a message that can be used by the clients
        message = "Successfully connected to PostgreSQL databases on localhost"

        # Check the Metadata database (primary)
        try:
            conn_metadata = connect(
                host=self.this_node.name,
                port=self.config["metadata_postgresql_port"],
                dbname=self.config["metadata_postgresql_dbname"],
                user=self.config["metadata_postgresql_user"],
                password=self.config["metadata_postgresql_password"],
            )
            cur_metadata = conn_metadata.cursor()
            cur_metadata.execute("""SELECT * FROM alembic_version""")
            data = cur_metadata.fetchone()
        except Exception as e:
            health_delta = 50
            err = str(e).split('\n')[0]
            message = f"Failed to connect to PostgreSQL database {self.config['metadata_postgresql_dbname']}: {err}"
        finally:
            if cur_metadata is not None:
                cur_metadata.close()
            if conn_metadata is not None:
                conn_metadata.close()

        if health_delta == 0:
            # Check the PowerDNS database (secondary)
            try:
                conn_pdns = connect(
                    host=self.this_node.name,
                    port=self.config["pdns_postgresql_port"],
                    dbname=self.config["pdns_postgresql_dbname"],
                    user=self.config["pdns_postgresql_user"],
                    password=self.config["pdns_postgresql_password"],
                )
                cur_pdns = conn_pdns.cursor()
                cur_pdns.execute("""SELECT * FROM supermasters""")
                data = cur_pdns.fetchone()
            except Exception as e:
                health_delta = 50
                err = str(e).split('\n')[0]
                message = f"Failed to connect to PostgreSQL database {self.config['pdns_postgresql_dbname']}: {err}"
            finally:
                if cur_pdns is not None:
                    cur_pdns.close()
                if conn_pdns is not None:
                    conn_pdns.close()

        # Set the health delta in our local PluginResult object
        self.plugin_result.set_health_delta(health_delta)
        # Set the message in our local PluginResult object
        self.plugin_result.set_message(message)
        # Return our local PluginResult object
        return self.plugin_result

    def cleanup(self):
        """
        cleanup(): Perform special cleanup steps during node daemon termination
        This step is optional and should be used sparingly.
        """
        pass
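
The control flow of `run()` in this plugin (check the primary database, and check the secondary only if the primary succeeded) can be sketched independently of psycopg2. This is a minimal illustration, not the plugin's code: `run_checks` is a hypothetical helper, and each check is assumed to be a plain callable that raises an exception on failure.

```python
def run_checks(checks, failure_delta=50):
    """Run (name, callable) checks in order, stopping at the first failure.

    Returns a (health_delta, message) pair shaped like the plugin's result.
    """
    for name, check in checks:
        try:
            check()
        except Exception as e:
            # Keep only the first line of the error, as the plugin does
            err = str(e).split("\n")[0]
            return failure_delta, f"Failed to connect to PostgreSQL database {name}: {err}"
    return 0, "Successfully connected to PostgreSQL databases on localhost"
```

Stopping at the first failure means the reported message always names the first database that could not be reached, and the delta is applied once rather than once per database.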

107
node-daemon/plugins/zkpr Normal file

@ -0,0 +1,107 @@
#!/usr/bin/env python3
# zkpr.py - PVC Monitoring example plugin for Zookeeper
# Part of the Parallel Virtual Cluster (PVC) system
#
# Copyright (C) 2018-2022 Joshua M. Boniface <joshua@boniface.me>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, version 3.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
###############################################################################
# This script provides an example of a PVC monitoring plugin script. It will create
# a simple plugin to check the Zookeeper instance on the node for operation.
# This script can thus be used as an example or reference implementation of a
# PVC monitoring plugin script and expanded upon as required.
# A monitoring plugin script must implement the class "MonitoringPluginScript" which
# extends "MonitoringPlugin", providing the 3 functions indicated. Detailed explanation
# of the role of each function is provided in context of the example; see the other
# examples for more potential uses.
# WARNING:
#
# This script will run in the context of the node daemon keepalives as root.
# DO NOT install untrusted, unvetted plugins under any circumstances.
# This import is always required here, as MonitoringPlugin is used by the
# MonitoringPluginScript class
from pvcnoded.objects.MonitoringInstance import MonitoringPlugin
# A monitoring plugin script must always expose its nice name, which must be identical to
# the file name
PLUGIN_NAME = "zkpr"
# The MonitoringPluginScript class must be named as such, and extend MonitoringPlugin.
class MonitoringPluginScript(MonitoringPlugin):
    def setup(self):
        """
        setup(): Perform special setup steps during node daemon startup
        This step is optional and should be used sparingly.
        If you wish for the plugin to not load in certain conditions, do any checks here
        and return a non-None failure message to indicate the error.
        """
        pass

    def run(self):
        """
        run(): Perform the check actions and return a PluginResult object
        """
        # Run any imports first
        from kazoo.client import KazooClient, KazooState

        zk_conn = None

        # Set the health delta to 0 (no change)
        health_delta = 0
        # Craft a message that can be used by the clients
        message = "Successfully connected to Zookeeper on localhost"

        # Check the Zookeeper connection
        try:
            zk_conn = KazooClient(hosts=[f"{self.this_node.name}:2181"], timeout=1, read_only=True)
            zk_conn.start(timeout=1)
            data = zk_conn.get('/primary_node')
        except Exception as e:
            health_delta = 50
            message = f"Failed to connect to Zookeeper: {e}"
        finally:
            if zk_conn is not None:
                zk_conn.stop()
                zk_conn.close()

        # Set the health delta in our local PluginResult object
        self.plugin_result.set_health_delta(health_delta)
        # Set the message in our local PluginResult object
        self.plugin_result.set_message(message)
        # Return our local PluginResult object
        return self.plugin_result

    def cleanup(self):
        """
        cleanup(): Perform special cleanup steps during node daemon termination
        This step is optional and should be used sparingly.
        """
        pass

@ -45,7 +45,7 @@ class PluginResult(object):
self.plugin_name = plugin_name
self.current_time = int(time.time())
self.health_delta = 0
self.message = None
self.message = "N/A"
self.data = {}
self.runtime = "0.00"
@ -359,7 +359,7 @@ class MonitoringInstance(object):
for result in sorted(plugin_results, key=lambda x: x.plugin_name):
if self.config["log_keepalive_plugin_details"]:
self.logger.out(
result.message,
result.message + f" [-{result.health_delta}]",
state="t",
prefix=f"{result.plugin_name} ({result.runtime}s)",
)
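
To illustrate the effect of the two changes above, the sketch below reproduces the log prefix and text for a single result and derives an overall health value from the deltas. `Result`, `format_log` and `node_health` are hypothetical stand-ins rather than PVC classes or functions, and clamping the total at 0 is an assumption. Note that the `"N/A"` default matters for the new log line: concatenating a `None` message with a string would raise a `TypeError`.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Simplified stand-in for PluginResult (assumption for this sketch)."""
    plugin_name: str
    health_delta: int = 0
    message: str = "N/A"
    runtime: str = "0.00"


def format_log(result):
    """Return the (prefix, text) pair that the keepalive log line is built from."""
    prefix = f"{result.plugin_name} ({result.runtime}s)"
    text = result.message + f" [-{result.health_delta}]"
    return prefix, text


def node_health(results):
    """Subtract every health delta from 100, never going below 0 (assumed)."""
    return max(0, 100 - sum(r.health_delta for r in results))
```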