Commit Graph

920 Commits

Author SHA1 Message Date
Joshua Boniface a4ab3075ab Correct some bugs around new code 2019-06-19 00:23:25 -04:00
Joshua Boniface 01959cb9e3 Implementation of RBD volumes and snapshots
Adds the ability to manage RBD volumes (add/remove) and RBD
snapshots (add/remove). (Working) list functions to come.
2019-06-19 00:12:44 -04:00
Joshua Boniface b50b2a827b Add forced delays after pool add/remove
Delays the return to give the cluster some breathing room before
the admin can run other commands. Keep holding the write lock so
that other clients cannot attempt this either.
2019-06-18 21:56:24 -04:00
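
The delay-while-locked behaviour described in b50b2a827b could look roughly like the sketch below. This is only an illustration: `pool_write_lock`, `modify_pool`, and `settle_seconds` are hypothetical names, and a local `threading.Lock` stands in for whatever cluster-wide write lock the project actually uses.

```python
import threading
import time

# Hypothetical stand-in for the cluster-wide write lock; the real lock
# implementation is not shown in the commit message.
pool_write_lock = threading.Lock()


def modify_pool(action, settle_seconds=5.0):
    """Run a pool add/remove action, then wait before releasing the lock."""
    with pool_write_lock:
        result = action()
        # Sleep while still holding the lock, so no other client can start
        # a conflicting operation while the cluster settles.
        time.sleep(settle_seconds)
    return result
```

Sleeping inside the `with` block is the point of the design: the caller is delayed and competing clients stay blocked for the same period.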
Joshua Boniface 537ad5de43 Make ceph pool removal confirmation verbose 2019-06-18 21:51:17 -04:00
Joshua Boniface ee73676114 Fix bug with pool removal 2019-06-18 21:51:11 -04:00
Joshua Boniface 264c2d4748 Fix broken prompting for pool removal 2019-06-18 21:33:39 -04:00
Joshua Boniface 2bbbda3da5 Only trigger pool updates on primary 2019-06-18 21:26:05 -04:00
Joshua Boniface 612f5ab52c Strip pv_block from stdout 2019-06-18 20:34:25 -04:00
Joshua Boniface 1622226c32 Add more logging during OSD creation/deletion 2019-06-18 20:31:04 -04:00
Joshua Boniface 3adeef6fdd Use the fsid to activate new OSDs 2019-06-18 20:22:28 -04:00
Joshua Boniface 443108f53d Add support for enable/disable keepalive detail 2019-06-18 19:54:42 -04:00
Joshua Boniface 79f284a0a9 Pass logger into run_command 2019-06-18 13:45:59 -04:00
Joshua Boniface 080ca3201c Correct actual problem with this_node 2019-06-18 13:43:54 -04:00
Joshua Boniface d076f9f4eb Use self.this_node everywhere 2019-06-18 13:25:16 -04:00
Joshua Boniface aee078f3eb Support disabling keepalive logging 2019-06-18 12:44:07 -04:00
Joshua Boniface b0411e8e1a Remove "error" message from Ceph commands
This triggers at every node start and isn't useful.
2019-06-18 12:41:38 -04:00
Joshua Boniface 8d9007f697 Remove OSD stat collection if count is zero
Otherwise, ceph osd df will hang indefinitely trying to get data
for the zero OSDs.
2019-06-18 12:36:53 -04:00
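
A minimal sketch of the guard described in 8d9007f697, assuming the stats query is wrapped in a callable. The names `collect_osd_stats` and `run_osd_df` are hypothetical and are not taken from the PVC source.

```python
def collect_osd_stats(osd_ids, run_osd_df):
    """Return per-OSD stats, skipping the query when there are no OSDs."""
    if len(osd_ids) == 0:
        # With zero OSDs the stats query would hang, so never issue it.
        return {}
    return run_osd_df(osd_ids)
```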
Joshua Boniface 5a327dc41a Clean up Ceph pipeline and add more debug logs 2019-06-18 11:19:03 -04:00
Joshua Boniface 46a416bc78 Use a proper variable for vni_mtu 2019-06-18 00:01:12 -04:00
Joshua Boniface 1f92b90a3e Don't encode initial data as we're using zkhandler 2019-06-17 23:53:16 -04:00
Joshua Boniface d4ebe63d9b Rename network device field
It seems much nicer and more consistent as "device" rather than as
"name".
2019-06-17 23:44:41 -04:00
Joshua Boniface 1d3f868206 Unify network devices and addresses in config
The old way of doing this was a little cumbersome, with an upper YAML
tree split between "devices" (name and MTU) and addresses. This commit
unifies these under the root "networking" section to make this section
clearer.
2019-06-17 23:41:07 -04:00
Joshua Boniface e70255dbd6 Support configurable interface MTUs
MTUs were hardcoded at 9000, which breaks if the underlying interface
or network switch does not support jumbo frames, a possible deployment
limitation. This has non-obvious consequences due to MTU mismatches
for certain services (Ceph, Zookeeper, etc.).

This commit adds support for configurable MTUs for each interface,
set in pvcd.yaml. The example has been updated to reflect this, with
a default of 1500 (the Ethernet standard).

This commit also adds autoconfiguration of the VNI device MTU based
on the `vni_mtu` value, the same for bridge networks and minus 50
(rather than 200 from the hardcoded value, based on the following
resource [1]) for VXLAN networks.

[1] http://ipengineer.net/2014/06/vxlan-mtu-vs-ip-mtu-consideration/
2019-06-17 23:34:48 -04:00
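
The MTU rule stated in e70255dbd6 (same as `vni_mtu` for bridge networks, `vni_mtu` minus 50 for VXLAN networks) can be sketched as below. The function name and the `"bridged"`/`"vxlan"` type strings are assumptions made for illustration; only the arithmetic comes from the commit message.

```python
VXLAN_OVERHEAD = 50  # bytes subtracted for VXLAN networks, per the commit message


def vni_device_mtu(vni_mtu, network_type):
    """Derive the MTU for a network device from the configured vni_mtu."""
    if network_type == "bridged":
        # Bridge networks use the VNI device MTU unchanged.
        return vni_mtu
    if network_type == "vxlan":
        # VXLAN networks leave room for the encapsulation headers.
        return vni_mtu - VXLAN_OVERHEAD
    raise ValueError("unknown network type: {}".format(network_type))
```

With the example default of 1500, a VXLAN network would get an MTU of 1450 under this rule.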
Joshua Boniface c583ee1709 Revert "Wait a little longer"
This reverts commit bd7a55e9e1.

This is not really needed, but do keep the 5s wait.
2019-06-17 21:56:06 -04:00
Joshua Boniface bd7a55e9e1 Wait a little longer 2019-06-17 12:14:13 -04:00
Joshua Boniface 23994f8a11 Increase wait time for daemons and log message 2019-06-17 10:30:46 -04:00
Joshua Boniface fe654aa5a2 Correct typo in daemon 2019-06-16 19:27:20 -04:00
Joshua Boniface 3ba3c339a7 Show vCPU count on CLI output
Showing the static, total number of CPUs was pointless. Instead,
show the number of allocated vCPUs. To preserve space, no longer
show the host CPU count in the list.
2019-06-02 22:30:26 -04:00
Joshua Boniface 45da4e3f9a Remove backup file 2019-05-30 21:59:56 -04:00
Joshua Boniface 7596e3c3b5 Add missing number 2019-05-28 23:41:31 -04:00
Joshua Boniface b7beea2692 Fix some typos and poor wordings 2019-05-28 20:17:45 -04:00
Joshua Boniface 2a6157521d Reorganize documentation 2019-05-28 20:04:55 -04:00
Joshua Boniface b9774bdf03 Increase wait sleeps in node flush/unflush 2019-05-26 23:21:01 -04:00
Joshua Boniface 14e9ba892c Wait on both sides for 30s
Still finding issues with the flush
2019-05-24 01:23:18 -04:00
Joshua Boniface 703e34e8ea Remove disable of pvc-flush
Since it isn't re-enabled and this makes life difficult, don't
disable the pvc-flush service if it was enabled.
2019-05-23 23:47:57 -04:00
Joshua Boniface ae37afcf75 Wait 10 seconds when starting pvc-flush
Without waiting, the unflush will trigger too soon, before the
daemon is fully ready, and as such it fails in odd ways.
2019-05-23 23:35:01 -04:00
Joshua Boniface e8b666708c Add one final keepalive update before exiting 2019-05-23 23:23:03 -04:00
Joshua Boniface 4c5ce9b995 Perform additional tweaks to units
Use RemainAfterExit to prevent pvc-flush from auto-stopping immediately.

Use PartOf to tie services to the target itself.

Use --wait on flush to avoid daemon stopping before flush is complete.
2019-05-23 23:18:28 -04:00
Joshua Boniface e46aa22989 Remove invalid Restart in pvc-flush.service 2019-05-23 22:51:36 -04:00
Joshua Boniface 0421f5cac8 Make the informational messages stand out 2019-05-23 22:49:00 -04:00
Joshua Boniface 7c6132f7dd Add node autoflush service and target
Add a systemd service to manage node flush/unflush, useful during
system startup and shutdown to avoid requiring administrator
intervention for this to occur. This is optional and the service is
not enabled by default, and the postinst script informs the
administrator of this.

Also adds a systemd target to collect the two service units together
and provide an easy way to flush+shutdown or startup+unflush the
entire PVC system.

Closes #28
2019-05-23 22:42:51 -04:00
Joshua Boniface 69462d2c7b Ensure myhostname is short
PVC now uses shortnames for node names, so ensure this is reflected
in the default choices for some node-level commands.
2019-05-23 22:27:34 -04:00
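
One common way to obtain a short hostname, as 69462d2c7b calls for, is to drop everything after the first dot. This is a sketch under that assumption; the helper name `short_hostname` is hypothetical and the commit may do this differently.

```python
import socket


def short_hostname(name=None):
    """Return the hostname without any domain suffix."""
    if name is None:
        # Default to this machine's hostname, which may be fully qualified.
        name = socket.gethostname()
    return name.split(".")[0]
```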
Joshua Boniface 8ef21cf9f2 Sleep longer before removing gateways
1 second was just slightly too little time to wait and packets would
occasionally be lost on primary switchover. Increase this to 2
seconds to provide more time for arping to run on the new primary.
2019-05-23 22:20:38 -04:00
Joshua Boniface d59280d829 Update dependencies for Postgres 2019-05-22 21:57:06 -04:00
Joshua Boniface 8881b97e8b Correct a missing capitalization 2019-05-21 23:19:19 -04:00
Joshua Boniface 4bfbbaa7d9 Remove commented needless call 2019-05-21 23:08:28 -04:00
Joshua Boniface 3893666507 Improve performance by removing spurious actions
1. Remove a number of time.sleep commands which don't really seem
necessary any longer and which significantly increased the startup
time while parsing the VM list.
2. Handle some variable sets during initialization of the object,
rather than waiting for a management command, enabling...
3. Know when a state change, and the corresponding Libvirt lookup,
is unnecessary due to the target node not matching the current node.
This also removes a number of unremovable errors from Libvirt on the
console which were annoying.

This reduces the total time taken by the VM startup segment (lines
760-762 of Daemon.py) from 17.117s down to 0.976s for 82 VMs.
2019-05-21 22:56:40 -04:00
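
Points 2 and 3 of 3893666507 might be sketched as follows: node fields are set when the object is created, and the lookup is skipped when the VM is targeted at another node. The class, attribute, and callback names here are hypothetical and do not reflect the actual code in Daemon.py.

```python
class VMInstance:
    def __init__(self, name, target_node, this_node, lookup):
        # Set these at initialization instead of waiting for the first
        # management command, so the check below always has what it needs.
        self.name = name
        self.target_node = target_node
        self.this_node = this_node
        self.lookup = lookup  # stand-in for the hypervisor lookup

    def handle_state_change(self, new_state):
        """Act on a state change only if this node is the VM's target."""
        if self.target_node != self.this_node:
            # Another node owns this VM; skip the lookup entirely.
            return None
        return self.lookup(self.name), new_state
```

Skipping the lookup for VMs owned by other nodes avoids both the time cost and the error messages the lookup would otherwise produce.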
Joshua Boniface 6fd4710f7f Remove bad replacement 2019-05-21 19:51:23 -04:00
Joshua Boniface 79d0a2eafc Handle raw sorting properly with new list format 2019-05-21 14:44:45 -04:00
Joshua Boniface 595cf1782c Switch DNS aggregator to PostgreSQL
MariaDB+Galera was terribly unstable, with the cluster failing to
start or dying randomly, and generally seemed incredibly unsuitable
for an HA solution. This commit switches the DNS aggregator SQL
backend to PostgreSQL, implemented via Patroni HA.

It also manages the Patroni state, forcing the primary instance to
follow the PVC coordinator, such that the active DNS Aggregator
instance is always able to communicate read+write with the local
system.

This required some logic changes to how the DNS Aggregator worked,
specifically ensuring that database changes aren't attempted while
the instance isn't actively running - to be honest this was a bug
anyways that had just never been noticed.

Closes #34
2019-05-21 01:07:41 -04:00