Commit Graph

3384 Commits

Author SHA1 Message Date
Joshua Boniface 8f0ae3e2dd Fix config file for database migrations 2023-12-09 22:51:54 -05:00
Joshua Boniface 946d3eaf43 Add wait after stopping VM 2023-12-09 18:14:03 -05:00
Joshua Boniface 1f6347d24b Add Prometheus monitoring examples 2023-12-09 17:42:51 -05:00
Joshua Boniface e8552b471b Require at least one FAULT_ID 2023-12-09 17:31:56 -05:00
Joshua Boniface fc443a323b Allow ack/delete of multiple faults at once 2023-12-09 17:28:13 -05:00
Joshua Boniface b0557edb76 Ensure entry in name is uppercase 2023-12-09 17:01:41 -05:00
Joshua Boniface 47bd7bf2f5 Only run cluster-wide health checks on primary
Avoids multiple coordinators trying to write updated cluster-wide fault
events. Instead, they are now only written by the primary (or the
incoming primary if still in a transition).
2023-12-09 16:50:51 -05:00
Joshua Boniface b9fbfe2ed5 Improve fault ID format
Instead of using random hex characters from an md5sum, use a nice name
in all-caps similar to how Ceph does. This further helps prevent dupes
but also permits a changing health delta within a single event (which
would really only ever apply to plugin faults).
2023-12-09 16:48:14 -05:00
Joshua Boniface 764e3e3722 Fix bug in fault header format 2023-12-09 16:47:56 -05:00
Joshua Boniface 7e6d922877 Improve fault detail handling further
Since we already had a "details" field, simply move where it gets added
to the message later, in generate_fault, after the main message value
was used to generate the ID.
2023-12-09 16:13:36 -05:00
Joshua Boniface 4ca2381077 Rework metrics output and add combined endpoint 2023-12-09 15:47:40 -05:00
Joshua Boniface 4003204f14 Remove bracketed text from fault_str
This ensures that certain faults, e.g. Ceph status faults, will be
combined despite the added text in brackets, while still keeping them
mostly separate.

Also ensure the health text is updated each time to assist with this, as
this health text may now change independent of the fault ID.
2023-12-09 15:34:18 -05:00
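A minimal sketch of the idea in 4003204f14, under stated assumptions: the bracketed text is dropped before the fault string is hashed, so messages that differ only in their bracketed detail map to the same ID. The function names, the bracket styles handled (square brackets and parentheses), and the 8-character ID length are hypothetical; only the use of an md5sum is suggested by the later commit b9fbfe2ed5.

```python
import re
from hashlib import md5

# Assumption: "bracketed text" means a trailing or inline [...] or (...) group.
BRACKETED = re.compile(r"\s*(\[[^\]]*\]|\([^)]*\))")


def strip_bracketed(fault_str):
    """Return fault_str with any bracketed groups removed."""
    return BRACKETED.sub("", fault_str).strip()


def fault_id(fault_str, length=8):
    """Derive a short hex ID from the message with bracketed text removed.

    The ID length is a hypothetical choice for illustration.
    """
    cleaned = strip_bracketed(fault_str)
    return md5(cleaned.encode("utf-8")).hexdigest()[:length]


# Two messages that differ only in the bracketed detail share one ID.
first = fault_id("Ceph status is HEALTH_WARN [1 OSD down]")
second = fault_id("Ceph status is HEALTH_WARN [2 OSDs down]")
```

Because the ID no longer reflects the bracketed detail, the displayed health text has to be refreshed separately on each update, which is what the second paragraph of the commit message describes.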
Joshua Boniface a70c1d63b0 Separate state totals from states, separate states 2023-12-09 13:59:17 -05:00
Joshua Boniface 2bea78d25e Make all remaining limits optional 2023-12-09 13:43:58 -05:00
Joshua Boniface fd717b702d Use external list of fault states 2023-12-09 12:51:41 -05:00
Joshua Boniface 132cde5591 Add totals and nice-format states
Avoids tons of annoying rewriting in the UI later.
2023-12-09 12:50:19 -05:00
Joshua Boniface ba565ead4c Report all state combinations in Prom metrics
Ensures that every state combination is always shown to metrics, even if
it contains 0 entries.
2023-12-09 12:40:37 -05:00
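The zero-filling described in ba565ead4c could look roughly like the following sketch. The state lists, the metric name, and both helper functions are hypothetical and chosen only for illustration; the output lines follow the Prometheus text exposition format.

```python
from collections import Counter
from itertools import product

# Hypothetical state components; the real lists live in the project's
# common module and are not reproduced here.
UP_STATES = ["up", "down"]
IN_STATES = ["in", "out"]


def count_state_combinations(observed):
    """Count observed states, emitting 0 for combinations never seen."""
    counts = Counter(observed)
    return {
        f"{up},{inn}": counts.get(f"{up},{inn}", 0)
        for up, inn in product(UP_STATES, IN_STATES)
    }


def to_prometheus_lines(metric_name, totals):
    """Render the totals as Prometheus text-format gauge samples."""
    lines = [f"# TYPE {metric_name} gauge"]
    for state, total in totals.items():
        lines.append(f'{metric_name}{{state="{state}"}} {total}')
    return lines


totals = count_state_combinations(["up,in", "up,in", "down,out"])
metric_lines = to_prometheus_lines("example_osd_states", totals)
```

Always emitting every combination keeps each time series continuous, so a dashboard or alert rule does not have to treat a missing series as an implicit zero.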
Joshua Boniface 317ca4b98c Move defined state combinations into common 2023-12-09 12:36:32 -05:00
Joshua Boniface 2b8abea8df Remove debug printing 2023-12-09 12:22:36 -05:00
Joshua Boniface 9b3c9f1be5 Add Ceph metrics proxy and health fault counts 2023-12-09 12:22:36 -05:00
Joshua Boniface 7373bfed3f Add Prometheus metric exporter
Adds a "fake" Prometheus metrics endpoint which returns cluster status
information in Prometheus format.
2023-12-09 12:22:36 -05:00
Joshua Boniface d0e7c19602 Add prometheus client dependencies 2023-12-09 12:22:36 -05:00
Joshua Boniface f01c12c86b Import from pvcworkerd not pvcapid 2023-12-09 12:22:19 -05:00
Joshua Boniface 0bda095571 Move libvirt_schema and fix other imports 2023-12-09 12:20:29 -05:00
Joshua Boniface 7976e1d2d0 Correct import location in scripts 2023-12-09 12:18:33 -05:00
Joshua Boniface 813aef1463 Fix incorrect UUID key name 2023-12-09 12:14:57 -05:00
Joshua Boniface 5a7ea25266 Fix incorrect database name entries 2023-12-09 12:12:00 -05:00
Joshua Boniface 82a7fd3c80 Add more debugging info to psql 2023-12-07 21:36:05 -05:00
Joshua Boniface ddd9d9ee07 Adjust psql check to avoid weird failures 2023-12-07 15:07:59 -05:00
Joshua Boniface 9e2e749c55 Combine pvchealthd output into single log message 2023-12-07 14:00:43 -05:00
Joshua Boniface 157b8c20bf Add Patroni output to debug logs 2023-12-07 14:00:35 -05:00
Joshua Boniface bf158dc2d9 Shorten debug output 2023-12-07 13:31:20 -05:00
Joshua Boniface 1b84553405 Use passed coordinator state 2023-12-07 11:19:26 -05:00
Joshua Boniface 60dac143f2 Use simpler health calculation 2023-12-07 11:17:31 -05:00
Joshua Boniface a13273335d Add colon to result text 2023-12-07 11:15:42 -05:00
Joshua Boniface e7f21b7058 Enhance and fix bugs in psql plugin
1. Check patronictl statuses
2. Don't error during node primary transitions
2023-12-07 11:14:16 -05:00
Joshua Boniface 9dbadfdd6e Move back to per-plugin fault reporting 2023-12-07 11:13:56 -05:00
Joshua Boniface 61b39d0739 Fix incorrect cluster health calculation 2023-12-07 11:13:36 -05:00
Joshua Boniface 4bf80a5913 Fix missing datetime shrink 2023-12-06 17:15:36 -05:00
Joshua Boniface 6c0dfe16cf Improve word splitting for fault messages
This ensures that fault messages are split on word boundaries and that
the column length is equal to the longest of these if applicable.
2023-12-06 17:10:19 -05:00
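One way to get the behaviour described in 6c0dfe16cf is Python's standard `textwrap` module. This is only a sketch: the function name and the `max_width` parameter are hypothetical, and the commit itself may implement the splitting differently.

```python
import textwrap


def split_fault_message(message, max_width=40):
    """Split a message on word boundaries.

    Returns the wrapped lines and the column width, which is the length
    of the longest resulting line rather than max_width itself.
    """
    lines = textwrap.wrap(message, width=max_width, break_long_words=False)
    if not lines:
        lines = [""]
    column_width = max(len(line) for line in lines)
    return lines, column_width


message = "Ceph cluster health is HEALTH_WARN with degraded placement groups"
lines, width = split_fault_message(message, max_width=30)
```

Sizing the column to the longest wrapped line, instead of the wrap limit, avoids padding the table with unused space when every line ends up shorter than the limit.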
Joshua Boniface 3fde494fc5 Add status back to short fault list 2023-12-06 16:53:23 -05:00
Joshua Boniface 0945b3faf3 Use same fault formatting for short and long 2023-12-06 16:19:44 -05:00
Joshua Boniface 1416f9edc0 Remove bad sort values 2023-12-06 14:38:29 -05:00
Joshua Boniface 5691f75ac9 Fix bad import 2023-12-06 14:28:32 -05:00
Joshua Boniface e0bf7f7d1a Fix bad ID values in acknowledge 2023-12-06 14:18:31 -05:00
Joshua Boniface 0c34c88a1f Fix bad dict key name 2023-12-06 14:16:19 -05:00
Joshua Boniface 20acf3295f Add mass ack/delete of faults 2023-12-06 13:59:39 -05:00
Joshua Boniface 4a02c2c8e3 Add additional faults 2023-12-06 13:27:39 -05:00
Joshua Boniface 6fc5c927a1 Properly sort status faults 2023-12-06 13:27:18 -05:00
Joshua Boniface d1e34e7333 Store fault times only to the second
Any more precision is unnecessary, and dropping it saves 6 chars when
displaying these times elsewhere.
2023-12-06 13:20:18 -05:00
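A minimal sketch of truncating a timestamp to the second, as d1e34e7333 describes. The function name is hypothetical, and the commit may do the truncation at a different point or by a different method.

```python
from datetime import datetime


def fault_timestamp(moment):
    """Return the timestamp as a string with sub-second precision dropped."""
    return str(moment.replace(microsecond=0))


example = datetime(2023, 12, 6, 13, 20, 18, 123456)
short = fault_timestamp(example)
```

Dropping the microseconds at storage time means every consumer of the value displays the same shortened form without having to trim it again.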