3266 Commits

Author SHA1 Message Date
Joshua Boniface 2b8abea8df Remove debug printing 2023-12-09 12:22:36 -05:00
Joshua Boniface 9b3c9f1be5 Add Ceph metrics proxy and health fault counts 2023-12-09 12:22:36 -05:00
Joshua Boniface 7373bfed3f Add Prometheus metric exporter
Adds a "fake" Prometheus metrics endpoint which returns cluster status
information in Prometheus format.
2023-12-09 12:22:36 -05:00
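For illustration, cluster status can be rendered in the Prometheus text exposition format roughly as follows. This is a minimal sketch, not the code from this commit: the function name render_metrics and the metric names in the usage note are hypothetical, and every metric is assumed to be a gauge.

```python
def render_metrics(metrics):
    """Render {name: (help_text, value)} as Prometheus text exposition lines.

    Hypothetical helper: every metric is emitted as a gauge for simplicity.
    """
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")  # human-readable description
        lines.append(f"# TYPE {name} gauge")        # metric type declaration
        lines.append(f"{name} {value}")             # the sample itself
    # The exposition format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"
```

A call such as render_metrics({"pvc_cluster_health": ("Cluster health percentage", 100)}) would produce three lines that an HTTP handler could return with a text/plain content type.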
Joshua Boniface d0e7c19602 Add prometheus client dependencies 2023-12-09 12:22:36 -05:00
Joshua Boniface f01c12c86b Import from pvcworkerd not pvcapid 2023-12-09 12:22:19 -05:00
Joshua Boniface 0bda095571 Move libvirt_schema and fix other imports 2023-12-09 12:20:29 -05:00
Joshua Boniface 7976e1d2d0 Correct import location in scripts 2023-12-09 12:18:33 -05:00
Joshua Boniface 813aef1463 Fix incorrect UUID key name 2023-12-09 12:14:57 -05:00
Joshua Boniface 5a7ea25266 Fix incorrect database name entries 2023-12-09 12:12:00 -05:00
Joshua Boniface 82a7fd3c80 Add more debugging info to psql 2023-12-07 21:36:05 -05:00
Joshua Boniface ddd9d9ee07 Adjust psql check to avoid weird failures 2023-12-07 15:07:59 -05:00
Joshua Boniface 9e2e749c55 Combine pvchealthd output into single log message 2023-12-07 14:00:43 -05:00
Joshua Boniface 157b8c20bf Add Patroni output to debug logs 2023-12-07 14:00:35 -05:00
Joshua Boniface bf158dc2d9 Shorten debug output 2023-12-07 13:31:20 -05:00
Joshua Boniface 1b84553405 Use passed coordinator state 2023-12-07 11:19:26 -05:00
Joshua Boniface 60dac143f2 Use simpler health calculation 2023-12-07 11:17:31 -05:00
Joshua Boniface a13273335d Add colon to result text 2023-12-07 11:15:42 -05:00
Joshua Boniface e7f21b7058 Enhance and fix bugs in psql plugin
1. Check Patronictl statuses
2. Don't error during node primary transitions
2023-12-07 11:14:16 -05:00
Joshua Boniface 9dbadfdd6e Move back to per-plugin fault reporting 2023-12-07 11:13:56 -05:00
Joshua Boniface 61b39d0739 Fix incorrect cluster health calculation 2023-12-07 11:13:36 -05:00
Joshua Boniface 4bf80a5913 Fix missing datetime shrink 2023-12-06 17:15:36 -05:00
Joshua Boniface 6c0dfe16cf Improve word splitting for fault messages
This ensures that fault messages are split on word boundaries and that
the column length is equal to the longest of these if applicable.
2023-12-06 17:10:19 -05:00
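The behaviour described in the commit above can be sketched with the standard library's textwrap module. This is an assumption about the general approach, not the commit's actual code: the function name split_message and the max_width parameter are hypothetical.

```python
import textwrap


def split_message(message, max_width):
    """Split a fault message on word boundaries and report the column width.

    Hypothetical helper: returns the wrapped lines and the length of the
    longest one, which a table formatter could use as the column width.
    """
    lines = textwrap.wrap(
        message,
        width=max_width,
        break_long_words=False,   # never cut a word in the middle
        break_on_hyphens=False,   # keep hyphenated words together
    )
    column_width = max((len(line) for line in lines), default=0)
    return lines, column_width
```

Because long words are never broken, a single word longer than max_width ends up on its own line and widens the column, which is why the width is taken from the longest resulting line rather than from max_width.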
Joshua Boniface 3fde494fc5 Add status back to short fault list 2023-12-06 16:53:23 -05:00
Joshua Boniface 0945b3faf3 Use same fault formatting for short and long 2023-12-06 16:19:44 -05:00
Joshua Boniface 1416f9edc0 Remove bad sort values 2023-12-06 14:38:29 -05:00
Joshua Boniface 5691f75ac9 Fix bad import 2023-12-06 14:28:32 -05:00
Joshua Boniface e0bf7f7d1a Fix bad ID values in acknowledge 2023-12-06 14:18:31 -05:00
Joshua Boniface 0c34c88a1f Fix bad dict key name 2023-12-06 14:16:19 -05:00
Joshua Boniface 20acf3295f Add mass ack/delete of faults 2023-12-06 13:59:39 -05:00
Joshua Boniface 4a02c2c8e3 Add additional faults 2023-12-06 13:27:39 -05:00
Joshua Boniface 6fc5c927a1 Properly sort status faults 2023-12-06 13:27:18 -05:00
Joshua Boniface d1e34e7333 Store fault times only to the second
Any more precision is unnecessary, and dropping it saves 6 chars when
displaying these times elsewhere.
2023-12-06 13:20:18 -05:00
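One way to truncate a timestamp to whole seconds, as the commit above describes, is to zero the microsecond field before formatting. This is a minimal sketch under that assumption; the function name fault_timestamp is hypothetical and the commit's own implementation may differ.

```python
from datetime import datetime


def fault_timestamp(moment):
    """Return the timestamp as a string with second precision only.

    Hypothetical helper: str() on a datetime omits the fractional part
    when the microsecond field is zero.
    """
    return str(moment.replace(microsecond=0))


if __name__ == "__main__":
    # Example with the current time; the fractional seconds are dropped.
    print(fault_timestamp(datetime.now()))
```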
Joshua Boniface 79eb54d5da Move fault generation to common library 2023-12-06 13:17:10 -05:00
Joshua Boniface 536fb2080f Fix get_terminal_size over SSH 2023-12-06 13:11:28 -05:00
Joshua Boniface 2267a9c85d Improve output formatting for simplicity 2023-12-05 10:37:35 -05:00
Joshua Boniface 067e73337f Shorten health IDs to 8 characters 2023-12-04 15:48:27 -05:00
Joshua Boniface 672e58133f Implement interfaces to faults 2023-12-04 01:37:54 -05:00
Joshua Boniface b59f743690 Improve logging and handling of fault entries 2023-12-01 17:38:28 -05:00
Joshua Boniface 4c3f235e05 Avoid running fault updates in maintenance mode
When the cluster is in maintenance mode, all faults should be ignored.
2023-12-01 17:38:28 -05:00
Joshua Boniface 3dc48c1783 Lower default monitoring interval to 15s
Faults are also reported on the monitoring interval, so 60s seems too
long. Lower this to 15 seconds by default instead.
2023-12-01 17:38:28 -05:00
Joshua Boniface 9c2b1b29ee Add node health to fault states
Adjusts ordering and ensures that node health states are included in
faults if they are less than 50%.

Also adjusts fault ID generation and runs fault checks only on coordinator
nodes to avoid too many runs.
2023-12-01 17:38:28 -05:00
Joshua Boniface 8594eb697f Add initial fault generation in pvchealthd
References: #164
2023-12-01 17:38:27 -05:00
Joshua Boniface 988de1218f Bump version to 0.9.83 2023-12-01 17:37:42 -05:00
Joshua Boniface 0ffcbf3152 Fix bad file paths 2023-12-01 17:25:12 -05:00
Joshua Boniface ad8d8cf7a7 Avoid removing changelog file until the end
Avoids losing a changelog if something else fails.
2023-12-01 17:23:43 -05:00
Joshua Boniface 915a84ee3c Fix psql check for new configs 2023-12-01 03:58:21 -05:00
Joshua Boniface 6315a068d1 Use SafeLoader for config load 2023-12-01 02:01:24 -05:00
Joshua Boniface 2afd064445 Update CLI to read from pvc.conf 2023-12-01 01:53:33 -05:00
Joshua Boniface 7cb9ebae6b Remove legacy configuration handler
This is not going to be needed.
2023-12-01 01:25:40 -05:00
Joshua Boniface 1fb0463dea Adjust daemon service startup
Add healthd, adjust workerd, lower waittime
2023-11-30 03:28:02 -05:00