Joshua Boniface
1f6347d24b
Add Prometheus monitoring examples
2023-12-09 17:42:51 -05:00
Joshua Boniface
e8552b471b
Require at least one FAULT_ID
2023-12-09 17:31:56 -05:00
Joshua Boniface
fc443a323b
Allow ack/delete of multiple faults at once
2023-12-09 17:28:13 -05:00
Joshua Boniface
b0557edb76
Ensure entry in name is uppercase
2023-12-09 17:01:41 -05:00
Joshua Boniface
47bd7bf2f5
Only run cluster-wide health checks on primary
Avoids multiple coordinators trying to write updated cluster-wide fault
events. Instead, they are now only written by the primary (or the
incoming primary if still in a transition).
2023-12-09 16:50:51 -05:00
Joshua Boniface
b9fbfe2ed5
Improve fault ID format
Instead of using random hex characters from an md5sum, use a nice name
in all-caps similar to how Ceph does. This further helps prevent dupes
but also permits a changing health delta within a single event (which
would really only ever apply to plugin faults).
2023-12-09 16:48:14 -05:00
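The ID change described in b9fbfe2ed5 can be illustrated with a minimal sketch. The function names `md5_fault_id` and `named_fault_id`, the 8-character hash prefix, and the example fault names are assumptions made for illustration; the actual PVC code may build its IDs differently.

```python
import hashlib
import re


def md5_fault_id(message: str, length: int = 8) -> str:
    """Old style (assumed): a short hex prefix of the md5 digest of the message.

    Any change to the message text produces a completely different ID.
    """
    return hashlib.md5(message.encode("utf-8")).hexdigest()[:length]


def named_fault_id(*parts: str) -> str:
    """New style (assumed): an all-caps, underscore-joined name, Ceph-like.

    The ID depends only on the name parts, not on the full message text.
    """
    words = []
    for part in parts:
        # Keep only alphanumeric runs so punctuation never ends up in the ID.
        words.extend(re.findall(r"[A-Za-z0-9]+", part))
    return "_".join(word.upper() for word in words)


if __name__ == "__main__":
    print(md5_fault_id("plugin reported health delta 10"))
    print(md5_fault_id("plugin reported health delta 25"))
    print(named_fault_id("plugin", "example check"))
```

With the named form, a fault keeps the same ID even when its message or health delta changes, which is the property the commit message relies on.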
Joshua Boniface
764e3e3722
Fix bug in fault header format
2023-12-09 16:47:56 -05:00
Joshua Boniface
7e6d922877
Improve fault detail handling further
Since we already had a "details" field, simply add it to the message
later, in generate_fault, after the main message value has been used to
generate the ID.
2023-12-09 16:13:36 -05:00
Joshua Boniface
4ca2381077
Rework metrics output and add combined endpoint
2023-12-09 15:47:40 -05:00
Joshua Boniface
4003204f14
Remove bracketed text from fault_str
This ensures that certain faults, e.g. Ceph status faults, will be
combined despite the added text in brackets, while still keeping them
mostly separate.
Also ensure the health text is updated each time to assist with this, as
this health text may now change independent of the fault ID.
2023-12-09 15:34:18 -05:00
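The idea in 4003204f14 can be sketched as follows. This assumes "bracketed text" means text inside square brackets; the helper name `strip_bracketed` and the sample messages are invented for illustration and are not taken from PVC.

```python
import re


def strip_bracketed(text: str) -> str:
    """Remove any [bracketed] segments, plus the whitespace just before them."""
    return re.sub(r"\s*\[[^\]]*\]", "", text).strip()


if __name__ == "__main__":
    # Two made-up messages that differ only in their bracketed detail.
    first = "Ceph health is HEALTH_WARN [1 osds down]"
    second = "Ceph health is HEALTH_WARN [2 osds down]"

    # Both reduce to the same string, so they would map to one fault,
    # while the full text can still be stored and refreshed separately.
    print(strip_bracketed(first))
    print(strip_bracketed(first) == strip_bracketed(second))
```

Using the stripped string only for identification, and keeping the full text for display, matches the commit's note that the health text may change independently of the fault ID.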
Joshua Boniface
a70c1d63b0
Separate state totals from states, separate states
2023-12-09 13:59:17 -05:00
Joshua Boniface
2bea78d25e
Make all remaining limits optional
2023-12-09 13:43:58 -05:00
Joshua Boniface
fd717b702d
Use external list of fault states
2023-12-09 12:51:41 -05:00
Joshua Boniface
132cde5591
Add totals and nice-format states
Avoids tons of annoying rewriting in the UI later.
2023-12-09 12:50:19 -05:00
Joshua Boniface
ba565ead4c
Report all state combinations in Prom metrics
Ensures that every state combination is always reported in the metrics,
even if it contains 0 entries.
2023-12-09 12:40:37 -05:00
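A minimal sketch of the zero-filling described in ba565ead4c: every possible combination is emitted, with a count of 0 when nothing is in that state. The state lists, the metric name `example_osd_states`, and both function names are hypothetical and are not taken from PVC.

```python
from collections import Counter
from itertools import product

# Hypothetical state axes; the real lists live in PVC's common code.
UP_STATES = ["up", "down"]
IN_STATES = ["in", "out"]


def count_all_combinations(observed):
    """Count observed (up, in) pairs, filling unseen combinations with 0."""
    counts = Counter(observed)
    return {combo: counts.get(combo, 0) for combo in product(UP_STATES, IN_STATES)}


def to_prometheus_lines(metric, counts):
    """Render counts as Prometheus text-format sample lines."""
    return [
        f'{metric}{{state="{up},{inn}"}} {value}'
        for (up, inn), value in counts.items()
    ]


if __name__ == "__main__":
    observed = [("up", "in"), ("up", "in"), ("down", "out")]
    for line in to_prometheus_lines("example_osd_states", count_all_combinations(observed)):
        print(line)
```

Emitting explicit zeros keeps each time series continuous, so dashboards and alert rules do not have to treat a missing series as a special case.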
Joshua Boniface
317ca4b98c
Move defined state combinations into common
2023-12-09 12:36:32 -05:00
Joshua Boniface
2b8abea8df
Remove debug printing
2023-12-09 12:22:36 -05:00
Joshua Boniface
9b3c9f1be5
Add Ceph metrics proxy and health fault counts
2023-12-09 12:22:36 -05:00
Joshua Boniface
7373bfed3f
Add Prometheus metric exporter
Adds a "fake" Prometheus metrics endpoint which returns cluster status
information in Prometheus format.
2023-12-09 12:22:36 -05:00
Joshua Boniface
d0e7c19602
Add prometheus client dependencies
2023-12-09 12:22:36 -05:00
Joshua Boniface
f01c12c86b
Import from pvcworkerd not pvcapid
2023-12-09 12:22:19 -05:00
Joshua Boniface
0bda095571
Move libvirt_schema and fix other imports
2023-12-09 12:20:29 -05:00
Joshua Boniface
7976e1d2d0
Correct import location in scripts
2023-12-09 12:18:33 -05:00
Joshua Boniface
813aef1463
Fix incorrect UUID key name
2023-12-09 12:14:57 -05:00
Joshua Boniface
5a7ea25266
Fix incorrect database name entries
2023-12-09 12:12:00 -05:00
Joshua Boniface
82a7fd3c80
Add more debugging info to psql
2023-12-07 21:36:05 -05:00
Joshua Boniface
ddd9d9ee07
Adjust psql check to avoid weird failures
2023-12-07 15:07:59 -05:00
Joshua Boniface
9e2e749c55
Combine pvchealthd output into single log message
2023-12-07 14:00:43 -05:00
Joshua Boniface
157b8c20bf
Add Patroni output to debug logs
2023-12-07 14:00:35 -05:00
Joshua Boniface
bf158dc2d9
Shorten debug output
2023-12-07 13:31:20 -05:00
Joshua Boniface
1b84553405
Use passed coordinator state
2023-12-07 11:19:26 -05:00
Joshua Boniface
60dac143f2
Use simpler health calculation
2023-12-07 11:17:31 -05:00
Joshua Boniface
a13273335d
Add colon to result text
2023-12-07 11:15:42 -05:00
Joshua Boniface
e7f21b7058
Enhance and fix bugs in psql plugin
1. Check patronictl statuses
2. Don't error during node primary transitions
2023-12-07 11:14:16 -05:00
Joshua Boniface
9dbadfdd6e
Move back to per-plugin fault reporting
2023-12-07 11:13:56 -05:00
Joshua Boniface
61b39d0739
Fix incorrect cluster health calculation
2023-12-07 11:13:36 -05:00
Joshua Boniface
4bf80a5913
Fix missing datetime shrink
2023-12-06 17:15:36 -05:00
Joshua Boniface
6c0dfe16cf
Improve word splitting for fault messages
This ensures that fault messages are split on word boundaries and that
the column length is equal to the longest of these if applicable.
2023-12-06 17:10:19 -05:00
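The behaviour described in 6c0dfe16cf can be approximated with the standard library's `textwrap` module. This is a sketch under assumptions, not the PVC CLI's code: the function name `split_fault_message` and the sample message are made up.

```python
import textwrap


def split_fault_message(message: str, max_width: int):
    """Wrap a message on word boundaries and size the column to fit.

    Returns the wrapped lines and the column width, which is the length
    of the longest wrapped line rather than the full max_width.
    """
    # textwrap.wrap returns [] for an empty string, so fall back to one empty line.
    lines = textwrap.wrap(message, width=max_width) or [""]
    column_width = max(len(line) for line in lines)
    return lines, column_width


if __name__ == "__main__":
    lines, width = split_fault_message(
        "Node hv1 was at high CPU utilization for over 5 minutes", 24
    )
    for line in lines:
        # Pad each line to the computed column width for aligned output.
        print(f"| {line:<{width}} |")
```

Sizing the column to the longest wrapped line avoids trailing padding when no line actually reaches the maximum width.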
Joshua Boniface
3fde494fc5
Add status back to short fault list
2023-12-06 16:53:23 -05:00
Joshua Boniface
0945b3faf3
Use same fault formatting for short and long
2023-12-06 16:19:44 -05:00
Joshua Boniface
1416f9edc0
Remove bad sort values
2023-12-06 14:38:29 -05:00
Joshua Boniface
5691f75ac9
Fix bad import
2023-12-06 14:28:32 -05:00
Joshua Boniface
e0bf7f7d1a
Fix bad ID values in acknowledge
2023-12-06 14:18:31 -05:00
Joshua Boniface
0c34c88a1f
Fix bad dict key name
2023-12-06 14:16:19 -05:00
Joshua Boniface
20acf3295f
Add mass ack/delete of faults
2023-12-06 13:59:39 -05:00
Joshua Boniface
4a02c2c8e3
Add additional faults
2023-12-06 13:27:39 -05:00
Joshua Boniface
6fc5c927a1
Properly sort status faults
2023-12-06 13:27:18 -05:00
Joshua Boniface
d1e34e7333
Store fault times only to the second
Any more precision is unnecessary, and dropping it saves 6 chars when
displaying these times elsewhere.
2023-12-06 13:20:18 -05:00
Joshua Boniface
79eb54d5da
Move fault generation to common library
2023-12-06 13:17:10 -05:00
Joshua Boniface
536fb2080f
Fix get_terminal_size over SSH
2023-12-06 13:11:28 -05:00