Commit Graph

560 Commits

Author SHA1 Message Date
Joshua Boniface 7c7ca4a229 Allow inter-cluster orphan NTP sync
Due to the requirement of Ceph to have all peer nodes tightly
synchronized with each other to come online, PVC nodes need a way to
synchronize to each other even in the absence of an external time
reference. This is especially prevalent if a set of nodes are left
offline for an extended period (>1-2 weeks), since their hardware clocks
will drift. If the resulting Internet connectivity is then dependent on
a VM, this will cause a catch-22 and the cluster will not properly
start.

This configuration will accomplish that - if no suitable >6 stratum
peers are found, the hosts will enter orphan mode. Since they are now
all configured as "peers" with each other, they will collectively decide
on one of them to become the source and sync to it. A local stratum 10
fudge is added so that at least one of the nodes can become this source.

While this is not an ideal use of NTP, it is by far the cleanest
solution to this problem, and does not impact normal functionality when
the two configured stratum-2 servers are reachable.
2023-09-01 15:42:25 -04:00
Joshua Boniface 027a819a83 Move some other tasks to bootstrap role
Avoids an issue where the pvcnoded service is stopped on non-bootstrap
runs.
2023-09-01 15:42:25 -04:00
Joshua Boniface e53342474c Remove GRUB config from base role
This is not actually ideal.
2023-09-01 15:42:25 -04:00
Joshua Boniface 4666db17cb Fix version sorting bugs in kernel-cleanup.sh 2023-09-01 15:42:25 -04:00
Joshua Boniface 6903627150 Add additional items to base role
Backups, GRUB configuration, and IPMI configuration.
2023-09-01 15:42:25 -04:00
Joshua Boniface c96ad603b0 Fix sudoers to use conditional deploy_username 2023-09-01 15:42:25 -04:00
Joshua Boniface 29363ebf80 Allow configurable fail2ban IPs 2023-09-01 15:42:25 -04:00
Joshua Boniface d9be39a048 Allow customization of deploy username 2023-09-01 15:42:25 -04:00
Joshua Boniface 95b9d6d786 Fix group_vars to match new setup 2023-09-01 15:42:25 -04:00
Joshua Boniface 4dc5ebdba0 Move to more dynamic apt configs
Allow specifying repository URLs in the group_vars, and add
release-specific template files to support future version changes.
2023-09-01 15:42:25 -04:00
Joshua Boniface 5873ce7d0c Update root password in default group_vars 2023-09-01 15:42:25 -04:00
Joshua Boniface 6a61f8f7bf Update relative path to bootstrap files 2023-09-01 15:42:25 -04:00
Joshua Boniface 4caab67d03 Remove superfluous symlink 2023-09-01 15:42:25 -04:00
Joshua Boniface 57e5953fd1 Add sensible sorting of kernel removals 2023-09-01 15:42:25 -04:00
Joshua Boniface 2a72a826f5 Remove cruft and add mkpasswd setup 2023-09-01 15:42:25 -04:00
Joshua Boniface 3a24e6dd8a Update file copyright header 2023-09-01 15:42:25 -04:00
Joshua Boniface bf02da693f Correct bad indentation in base role 2023-09-01 15:42:25 -04:00
Joshua Boniface 39b8229c35 Add libguestfs-tools to libvirt role deps 2023-09-01 15:42:25 -04:00
Joshua Boniface f34b2a5f7e Add cleanup to update oneshot playbook 2023-09-01 15:42:25 -04:00
Joshua Boniface 1f6cb077fa Update tags and add kernel-cleanup script 2023-09-01 15:42:25 -04:00
Joshua Boniface 0bf9c6209c Fix incorrect systemd enabling in Patroni 2023-09-01 15:42:25 -04:00
Joshua Boniface 8d41650619 Add reboot to purge 2023-09-01 15:42:25 -04:00
Joshua Boniface c634823ba5 Remove log dirs during purge 2023-09-01 15:42:25 -04:00
Joshua Boniface c0dc6fad4e Add some additional compression libraries 2023-09-01 15:42:25 -04:00
Joshua Boniface a4be011884 Add local domain to resolver config 2023-09-01 15:42:25 -04:00
Joshua Boniface 4f5dbee8ee Correct bugs during bootstrap
1. Ensure Zookeeper restarts and checks out successfully before
proceeding with other steps.
2. Make sure PVC itself doesn't start prematurely.
2023-09-01 15:42:25 -04:00
Joshua Boniface 05a2c1949d Add removal of Zookeeper keys too 2023-09-01 15:42:25 -04:00
Joshua Boniface a4f1d6eedc Update purge script 2023-09-01 15:42:25 -04:00
Joshua Boniface 26dbd082ef Retry pgsql bootstrap startup 6 times
This will sometimes fail, so retry it several times
2023-09-01 15:42:25 -04:00
Joshua Boniface e9f08ad100 Retry msgr2 enabling 6 times
This will sometimes fail, so retry it several times
2023-09-01 15:42:25 -04:00
Joshua Boniface a77e41bf7c Remove invalid timezone entries in postgres conf 2023-09-01 15:42:25 -04:00
Joshua Boniface 0ddd11844e Reorder Ceph stop and lower some waits 2023-09-01 15:42:25 -04:00
Joshua Boniface cf609eb609 Add tasks to verify node has finished (un)flushing 2023-09-01 15:42:25 -04:00
Joshua Boniface c29cdd5305 Increase all wait timeouts to 30s
Ensure that even on slow(er) clusters, these timeouts have more time to
complete before proceeding so the task won't fail.
2023-09-01 15:42:24 -04:00
Joshua Boniface cba276e248 Add default values 2023-09-01 15:42:24 -04:00
Joshua Boniface be94bc134f Add configurable ZK memory limits 2023-09-01 15:42:24 -04:00
Joshua Boniface 6e74ac44a5 Remove libjemalloc package 2023-09-01 15:42:24 -04:00
Joshua Boniface 2bd5cc5a25 Tune Zookeeper memory usage
Use Xms and Xmx=128M to reduce overall Zookeeper memory usage.
2023-09-01 15:42:24 -04:00
Joshua Boniface b4e36d146a Add tuning for Ceph OSDs 2023-09-01 15:42:24 -04:00
Joshua Boniface 24764fe704 Don't use libjemalloc for Ceph daemons
This was an artifact of a much, much older Ceph configuration I ran, and
is not relevant with newer Ceph versions like those used in PVC.
Performance testing with Nautilus and Bluestore reveals a minimal
performance hit, and using `jemalloc` prevents cache autotuning from
being effective, so remove it.
2023-09-01 15:42:24 -04:00
Joshua Boniface 750cb4b55c Disable pvc-flush service while rebooting
Prevents the flush daemon from starting on node boot, before the
playbook is actually ready to unflush the node.
2023-09-01 15:42:24 -04:00
Joshua Boniface cdc7e3377b Tweak oneshot script
Cleanly stop daemons; check if OSDs are back before continuing; wait
less
2023-09-01 15:42:24 -04:00
Joshua Boniface 458e7b4872 Use new init command location
Command was renamed in the PVC CLI to facilitate other "task" actions
like backup/restore.
2023-09-01 15:42:24 -04:00
Joshua Boniface bcb5962353 Add jute.maxbuffer to Zookeeper environment ops
Adds this option based on the findings of
https://github.com/python-zk/kazoo/issues/630, whereby restores of >1MB
in size would fail. This is considered an unsafe option, but given our
usecase no actual znode should ever exceed this limit; this is purely
for the large transactions that come from a `pvc task restore` action to
an empty Zookeeper instance.
2023-09-01 15:42:24 -04:00
Joshua Boniface 075ce8ea22 Add PVC status MOTD script 2023-09-01 15:42:24 -04:00
Joshua Boniface 68a475ccf9 Set proper mode on agent plugins 2023-09-01 15:42:24 -04:00
Joshua Boniface 9962ceaf0a Add cluster safe update playbook
This playbook will perform a oneshot upgrade of the systems in the
cluster, including performing a clean and safe reboot of the node(s) if
required (either due to services needing a restart, or the kernel
changing). It runs in serial=1 and only reboots if needed.
2023-09-01 15:42:24 -04:00
Joshua Boniface f86ec62416 Add check-mk-agent plugin installs
These are used by various Ansible tasks, even if the administrator is
not using Check_MK for monitoring.
2023-09-01 15:42:24 -04:00
Joshua Boniface 62d53b0c9c Add PCI and USB utils 2023-09-01 15:42:24 -04:00
Joshua Boniface f79fb605de Support using existing SSL certs on system
Add the additional pvc_api_ssl_cert_path and pvc_api_ssl_key_path
group_vars options, which can be used to set the SSL details to existing
files on the filesystem if desired. If these are empty (or nonexistent),
the original pvc_api_ssl_cert and pvc_api_ssl_key raw format options
will be used as they were.

Allows the administrator to use outside methods (such as Let's Encrypt)
to obtain the certs locally on the system, avoiding changes to the
group_vars and redeployment to manage SSL keys.
2023-09-01 15:42:24 -04:00