Adjust cluster health states to be more meaningful
One big complexity to note would be working the Ceph health messages into this, but luckily their levels are fairly consistent (WARN and ERR being roughly -10 and -50 respectively) and they output…
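A rough sketch of how those Ceph levels could feed a numeric health value, purely as an illustration. The starting value of 100, the floor at 0, the `CEPH_DEDUCTIONS` table, and the `cluster_health` function are all assumptions for the example; only the approximate -10/-50 figures come from the comment above.

```python
# Hypothetical deduction table: the 10/50 values follow the rough figures
# quoted above; the "WARN"/"ERR" keys are illustrative labels.
CEPH_DEDUCTIONS = {"WARN": 10, "ERR": 50}


def cluster_health(ceph_levels, base=100):
    """Return a health value: `base` minus one deduction per Ceph message.

    `ceph_levels` is an iterable of level labels, one per active message.
    Unknown labels deduct nothing; the result never drops below 0.
    (Both behaviours are assumptions of this sketch.)
    """
    total = sum(CEPH_DEDUCTIONS.get(level, 0) for level in ceph_levels)
    return max(0, base - total)


if __name__ == "__main__":
    # One warning plus one error deducts 10 + 50 from the base of 100.
    print(cluster_health(["WARN", "ERR"]))
```

Keeping the deductions in one table means other sources of health messages could add their own entries without changing the calculation.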
Adjust cluster health states to be more meaningful
Add hardware/system diag info to node status
This actually wouldn't work as well as-is as it might first appear. It's probably better to make our own "plugin" system that allows monitoring arbitrary things, then build a coherent check framework around…
Add hardware/system diag info to node status
I definitely think leveraging part of the check_mk_agent system for this might be worthwhile. At regular intervals (say every minute, so roughly every 12 keepalives), the plugins can run and save output…
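The interval idea could look something like the sketch below: count keepalives and run every plugin on each 12th one. `PluginRunner`, `on_keepalive`, and the in-memory `results` dict are hypothetical names for illustration, not part of check_mk_agent or any existing code, and where the output actually gets saved is left open here.

```python
class PluginRunner:
    """Run monitoring plugins once every `every` keepalives (hypothetical)."""

    def __init__(self, plugins, every=12):
        self.plugins = plugins  # mapping: plugin name -> zero-argument callable
        self.every = every      # 12 keepalives is roughly one minute, per the comment
        self.count = 0          # keepalives seen so far
        self.results = {}       # latest output per plugin (storage is an assumption)

    def on_keepalive(self):
        """Call once per keepalive; returns True when the plugins were run."""
        self.count += 1
        if self.count % self.every != 0:
            return False
        for name, plugin in self.plugins.items():
            try:
                self.results[name] = plugin()
            except Exception as exc:
                # A failing plugin should not break the keepalive itself.
                self.results[name] = f"plugin failed: {exc}"
        return True


if __name__ == "__main__":
    runner = PluginRunner({"example": lambda: "ok"})
    for _ in range(12):
        runner.on_keepalive()
    print(runner.results)
```

Tying the plugin run to the keepalive counter avoids a second timer, at the cost of the plugins running inside the keepalive path, so slow plugins would need a timeout or a separate thread.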