4ca2381077
Rework metrics output and add combined endpoint
2023-12-09 15:47:40 -05:00
4003204f14
Remove bracketed text from fault_str
...
This ensures that certain faults, e.g. Ceph status faults, will be
combined despite the added text in brackets, while still keeping them
mostly separate.
Also ensure the health text is updated each time to assist with this, as
this health text may now change independently of the fault ID.
2023-12-09 15:34:18 -05:00
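A minimal sketch of the idea in 4003204f14, assuming the fault ID is derived
by hashing the fault string. The helper names, the SHA-256 hash, and the
sample messages are hypothetical and are not taken from the commit itself;
only the bracket stripping and the 8-character ID length come from this log.

```python
import hashlib
import re


def normalize_fault_str(fault_str: str) -> str:
    """Drop any [bracketed] segments and collapse the leftover whitespace."""
    stripped = re.sub(r"\[[^\]]*\]", "", fault_str)
    return " ".join(stripped.split())


def fault_id(fault_str: str) -> str:
    """Hypothetical ID: first 8 hex characters of a hash of the normalized text."""
    normalized = normalize_fault_str(fault_str)
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest[:8].upper()


# Two messages that differ only in the bracketed detail share one ID,
# while a message with different unbracketed text gets its own ID.
same_a = fault_id("Ceph health is HEALTH_WARN [1 osds down]")
same_b = fault_id("Ceph health is HEALTH_WARN [2 osds down]")
other = fault_id("Ceph health is HEALTH_ERR [2 osds down]")
```

Because the bracketed detail no longer feeds the ID, the stored health text
has to be refreshed on every report to stay current.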
a70c1d63b0
Separate state totals from states, separate states
2023-12-09 13:59:17 -05:00
2bea78d25e
Make all remaining limits optional
2023-12-09 13:43:58 -05:00
fd717b702d
Use external list of fault states
2023-12-09 12:51:41 -05:00
132cde5591
Add totals and nice-format states
...
Avoids tons of annoying rewriting in the UI later.
2023-12-09 12:50:19 -05:00
ba565ead4c
Report all state combinations in Prom metrics
...
Ensures that every state combination is always reported in the metrics,
even if it contains 0 entries.
2023-12-09 12:40:37 -05:00
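A minimal sketch of the zero-filling described in ba565ead4c, assuming state
combinations are comma-joined pairs of states. The function names, the metric
name, and the state lists are hypothetical; the output follows the general
Prometheus text exposition layout and is built by hand rather than with a
client library.

```python
from collections import Counter
from itertools import product


def all_combinations(first_states, second_states):
    """Build every "first,second" state combination in a fixed order."""
    return [f"{a},{b}" for a, b in product(first_states, second_states)]


def count_states(observed, combinations):
    """Count observed states, keeping combinations that never occurred at 0."""
    counts = Counter(observed)
    return {combo: counts.get(combo, 0) for combo in combinations}


def render_gauge(name, label, counts):
    """Render the counts as gauge lines in Prometheus text format."""
    lines = [f"# TYPE {name} gauge"]
    for value, count in counts.items():
        lines.append(f'{name}{{{label}="{value}"}} {count}')
    return "\n".join(lines)


combos = all_combinations(["run", "stop"], ["ready", "flushed"])
counts = count_states(["run,ready", "run,ready"], combos)
print(render_gauge("example_node_states", "state", counts))
```

Emitting explicit zeros keeps each series present between scrapes, so a
combination that empties out reads as 0 instead of disappearing.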
317ca4b98c
Move defined state combinations into common
2023-12-09 12:36:32 -05:00
2b8abea8df
Remove debug printing
2023-12-09 12:22:36 -05:00
9b3c9f1be5
Add Ceph metrics proxy and health fault counts
2023-12-09 12:22:36 -05:00
7373bfed3f
Add Prometheus metric exporter
...
Adds a "fake" Prometheus metrics endpoint which returns cluster status
information in Prometheus format.
2023-12-09 12:22:36 -05:00
d0e7c19602
Add prometheus client dependencies
2023-12-09 12:22:36 -05:00
f01c12c86b
Import from pvcworkerd not pvcapid
2023-12-09 12:22:19 -05:00
0bda095571
Move libvirt_schema and fix other imports
2023-12-09 12:20:29 -05:00
7976e1d2d0
Correct import location in scripts
2023-12-09 12:18:33 -05:00
813aef1463
Fix incorrect UUID key name
2023-12-09 12:14:57 -05:00
5a7ea25266
Fix incorrect database name entries
2023-12-09 12:12:00 -05:00
82a7fd3c80
Add more debugging info to psql
2023-12-07 21:36:05 -05:00
ddd9d9ee07
Adjust psql check to avoid weird failures
2023-12-07 15:07:59 -05:00
9e2e749c55
Combine pvchealthd output into single log message
2023-12-07 14:00:43 -05:00
157b8c20bf
Add Patroni output to debug logs
2023-12-07 14:00:35 -05:00
bf158dc2d9
Shorten debug output
2023-12-07 13:31:20 -05:00
1b84553405
Use passed coordinator state
2023-12-07 11:19:26 -05:00
60dac143f2
Use simpler health calculation
2023-12-07 11:17:31 -05:00
a13273335d
Add colon to result text
2023-12-07 11:15:42 -05:00
e7f21b7058
Enhance and fix bugs in psql plugin
...
1. Check Patronictl statuses
2. Don't error during node primary transitions
2023-12-07 11:14:16 -05:00
9dbadfdd6e
Move back to per-plugin fault reporting
2023-12-07 11:13:56 -05:00
61b39d0739
Fix incorrect cluster health calculation
2023-12-07 11:13:36 -05:00
4bf80a5913
Fix missing datetime shrink
2023-12-06 17:15:36 -05:00
6c0dfe16cf
Improve word splitting for fault messages
...
This ensures that fault messages are split on word boundaries and that
the column length is equal to the longest of these if applicable.
2023-12-06 17:10:19 -05:00
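A minimal sketch of the behaviour described in 6c0dfe16cf, using the standard
library's textwrap module. The function names, the "Message" header, and the
sample text are hypothetical; the actual implementation may split messages
differently.

```python
import textwrap


def split_message(message, max_width):
    """Split a message on word boundaries without breaking any single word."""
    lines = textwrap.wrap(
        message,
        width=max_width,
        break_long_words=False,
        break_on_hyphens=False,
    )
    return lines or [""]


def column_width(messages, max_width, header="Message"):
    """Column width is the longest wrapped line, but never narrower than the header."""
    width = len(header)
    for message in messages:
        for line in split_message(message, max_width):
            width = max(width, len(line))
    return width


sample = "Ceph health is in warning state"
wrapped = split_message(sample, 15)
width = column_width([sample], 15)
```

Sizing the column to the longest wrapped line, not to the maximum allowed
width, avoids trailing padding when every line ends up shorter than the limit.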
3fde494fc5
Add status back to short fault list
2023-12-06 16:53:23 -05:00
0945b3faf3
Use same fault formatting for short and long
2023-12-06 16:19:44 -05:00
1416f9edc0
Remove bad sort values
2023-12-06 14:38:29 -05:00
5691f75ac9
Fix bad import
2023-12-06 14:28:32 -05:00
e0bf7f7d1a
Fix bad ID values in acknowledge
2023-12-06 14:18:31 -05:00
0c34c88a1f
Fix bad dict key name
2023-12-06 14:16:19 -05:00
20acf3295f
Add mass ack/delete of faults
2023-12-06 13:59:39 -05:00
4a02c2c8e3
Add additional faults
2023-12-06 13:27:39 -05:00
6fc5c927a1
Properly sort status faults
2023-12-06 13:27:18 -05:00
d1e34e7333
Store fault times only to the second
...
Any more precision is unnecessary, and dropping it saves 6 chars when
displaying these times elsewhere.
2023-12-06 13:20:18 -05:00
79eb54d5da
Move fault generation to common library
2023-12-06 13:17:10 -05:00
536fb2080f
Fix get_terminal_size over SSH
2023-12-06 13:11:28 -05:00
2267a9c85d
Improve output formatting for simplicity
2023-12-05 10:37:35 -05:00
067e73337f
Shorten health IDs to 8 characters
2023-12-04 15:48:27 -05:00
672e58133f
Implement interfaces to faults
2023-12-04 01:37:54 -05:00
b59f743690
Improve logging and handling of fault entries
2023-12-01 17:38:28 -05:00
4c3f235e05
Avoid running fault updates in maintenance mode
...
When the cluster is in maintenance mode, all faults should be ignored.
2023-12-01 17:38:28 -05:00
3dc48c1783
Lower default monitoring interval to 15s
...
Faults are also reported on the monitoring interval, so 60s seems too
long. Lower this to 15 seconds by default instead.
2023-12-01 17:38:28 -05:00
9c2b1b29ee
Add node health to fault states
...
Adjusts ordering and ensures that node health states are included in
faults if they are less than 50%.
Also adjusts fault ID generation and runs fault checks only on coordinator
nodes to avoid too many runs.
2023-12-01 17:38:28 -05:00
8594eb697f
Add initial fault generation in pvchealthd
...
References: #164
2023-12-01 17:38:27 -05:00