2317 Commits

Author SHA1 Message Date
Joshua Boniface 683c3afea6 Correct spelling mistake 2020-05-06 11:29:42 -04:00
Joshua Boniface 4c7cb1a20c Add further wording tweaks and details 2020-05-06 11:20:12 -04:00
Joshua Boniface 90feb83eab Revamp some wording in the documentation 2020-05-06 10:41:13 -04:00
Joshua Boniface b91923735c Move some messages around 2020-05-05 16:19:18 -04:00
Joshua Boniface 34c4690d49 Don't convert bytes into KB in OVA import
Doing so can create an image that is 1 sector (512 bytes) too large,
which will then break qemu-img because it's stupid (or, VMDK is stupid,
I haven't decided which). Current Ceph rbd commands seem to accept
--size in bytes so this is fine.
2020-05-05 16:14:18 -04:00
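
To illustrate the off-by-one-sector problem described in commit 34c4690d49, here is a minimal sketch. It assumes the old conversion rounded the byte count up to whole KB; the commit message does not say how the conversion was done, so the helper `size_via_kb` and the example size are hypothetical.

```python
SECTOR = 512  # bytes per sector, as stated in the commit message


def size_via_kb(size_bytes: int) -> int:
    """Round a byte count up to whole KB, then convert back to bytes.

    Assumption: this mimics a bytes -> KB conversion that rounds up.
    """
    kb = -(-size_bytes // 1024)  # ceiling division
    return kb * 1024


# A disk that is sector-aligned but not KB-aligned (hypothetical size).
disk_bytes = 3 * SECTOR
overshoot = size_via_kb(disk_bytes) - disk_bytes
print(f"requested {disk_bytes} bytes, created {size_via_kb(disk_bytes)} bytes")
print(f"overshoot: {overshoot} bytes ({overshoot // SECTOR} sector)")
```

Any size that is an odd multiple of 512 bytes overshoots by exactly one sector under this assumption, which is why passing the size in bytes avoids the problem.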
Joshua Boniface 3e351bb84a Add additional error checking for profile creation 2020-05-05 15:28:39 -04:00
Joshua Boniface 331027d124 Add further tweaks to takeover state checks
Just ensure that everything is in the proper state before proceeding
2020-04-22 11:16:19 -04:00
Joshua Boniface ae4f36b881 Hook flush into more services
Trying to ensure that pvc-flush completes before anything tries to shut
down.
2020-04-14 19:58:53 -04:00
Joshua Boniface e451426c7c Fix minor bugs from change in VM info handling 2020-04-13 22:56:19 -04:00
Joshua Boniface 611e0edd80 Reorder last keepalive during cleanup
Make sure the stopping of the keepalive timer and final keepalive update
are done as the last step before complete shutdown. The previous setup
could conceivably result in a node being fenced should the cleanup
operations take longer than ~45 seconds, for instance if primary node
switchover took too long or blocked, or log watchers failed to stop
quickly enough. Ensures that keepalives will continue to be run during
the shutdown process until the last possible moment.
2020-04-12 03:49:29 -04:00
Joshua Boniface b413e042a6 Improve handling of primary contention
Previously, contention could occasionally cause a flap/dual primary
contention state due to the lack of checking within this function. This
could cause a state where a node transitions to primary then is almost
immediately shifted away, which could cause undefined behaviour in the
cluster.

The solution includes several elements:
    * Implement an exclusive lock operation in zkhandler
    * Switch the become_primary function to use this exclusive lock
    * Implement exclusive locking during the contention process
    * As a failsafe, check stat versions before setting the node as the
      primary node, in case another node has already claimed it
    * Delay the start of takeover/relinquish operations by slightly
      longer than the lock timeout
    * Make the current router_state conditions more explicit (positive
      conditionals rather than negative conditionals)

The new scenario ensures that during contention, only one secondary will
ever succeed at acquiring the lock. Ideally, the other would then grab
the lock and pass, but in testing this does not seem to be the case -
the lock always times out, so the failsafe check is technically not
needed but has been left as an added safety mechanism. With this setup,
the node that fails the contention will never block the switchover nor
will it try to force itself onto the cluster after another node has
successfully won contention.

Timeouts may need to be adjusted in the future, but the base timeout of
0.4 seconds (and transition delay of 0.5 seconds) seems to work reliably
during preliminary tests.
2020-04-12 03:40:17 -04:00
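
The contention scheme described in commit b413e042a6 can be sketched roughly as follows. This is a simplified simulation, not the project's code: a `threading.Lock` stands in for the exclusive Zookeeper lock, a plain dict stands in for the key and its stat version, and the function name `run_contention` is hypothetical. Only the 0.4 second timeout and 0.5 second delay come from the commit message. As a simplification, the winner here holds the lock for the whole transition delay.

```python
import threading
import time

LOCK_TIMEOUT = 0.4      # base lock timeout quoted in the commit message
TRANSITION_DELAY = 0.5  # transition delay quoted in the commit message


def run_contention(nodes):
    """Let every node race for primary; return the winner and all outcomes."""
    lock = threading.Lock()  # stand-in for the exclusive lock
    state = {"primary": None, "version": 0}
    results = {}
    barrier = threading.Barrier(len(nodes))

    def contend(node):
        seen_version = state["version"]  # version observed before contending
        barrier.wait()                   # start all contenders together
        if not lock.acquire(timeout=LOCK_TIMEOUT):
            results[node] = "lost: lock timed out"
            return
        try:
            # Failsafe: another node may already have become primary.
            if state["version"] != seen_version:
                results[node] = "lost: version changed"
                return
            state["primary"] = node
            state["version"] += 1
            results[node] = "won"
            # Hold the lock for longer than the lock timeout so the other
            # contenders give up instead of taking over right afterwards.
            time.sleep(TRANSITION_DELAY)
        finally:
            lock.release()

    threads = [threading.Thread(target=contend, args=(n,)) for n in nodes]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state["primary"], results


primary, results = run_contention(["node1", "node2"])
print(primary, results)
```

Exactly one contender ends up as primary in this sketch: the loser either times out waiting for the lock or, if it does acquire it late, is rejected by the version check.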
Joshua Boniface e672d799a6 Set flush after pvcapid.service
This may or may not help, but should in theory prevent the flush from
trying to run after a (locally-running) API daemon is terminated, which
could cause an API failure and a failure to flush.
2020-04-12 01:48:50 -04:00
Joshua Boniface 59707bad4e Fix some errors in the FAQ 2020-04-11 01:33:18 -04:00
Joshua Boniface 9c19813808 Fix link to FAQ page 2020-04-11 01:28:32 -04:00
Joshua Boniface 8fe50bea77 Add FAQ to documentation 2020-04-11 01:22:07 -04:00
Joshua Boniface 8faa3bb53d Handle info fuzzy matches better
If we are calling info, we want one VM. Don't silently discard other
options or try (and fail later) to parse multiple, just say no VM found.
2020-04-09 10:26:49 -04:00
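
The behaviour described in commit 8faa3bb53d might look like the sketch below. The function name `find_single_vm` and the use of substring matching as the "fuzzy" match are assumptions made for illustration; the commit message only states that anything other than a single match should be reported as no VM found.

```python
def find_single_vm(query, vm_names):
    """Return the one VM name containing `query`, or None.

    Assumption: "fuzzy" is modelled here as a simple substring match.
    """
    matches = [name for name in vm_names if query in name]
    if len(matches) != 1:
        # Zero or several candidates: report "no VM found" rather than
        # silently picking one or trying to parse several.
        return None
    return matches[0]


vms = ["web1", "web2", "db1"]  # hypothetical VM names
print(find_single_vm("db", vms))
print(find_single_vm("web", vms))
```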
Joshua Boniface a130f19a19 Depend pvcnoded on Zookeeper (harder) and libvirtd 2020-04-09 09:57:53 -04:00
Joshua Boniface a671d9d457 Use consistent tense in messages 2020-04-08 22:00:51 -04:00
Joshua Boniface fee1c7dd6c Reorder cleanup and gracefully wait for flushes 2020-04-08 22:00:08 -04:00
Joshua Boniface b3a75d8069 Use post instead of get on initialize 2020-04-06 15:05:33 -04:00
Joshua Boniface c3bd6b6ecc Add missing call into cluster initialize function 2020-04-06 14:48:26 -04:00
Joshua Boniface 5d58bee34f Add some time around noded startup/shutdown
Otherwise, systemd kills networking before the node daemon fully stops
and it goes into "dead" status, which is super annoying.
2020-04-01 23:59:14 -04:00
Joshua Boniface f668412941 Don't use Requires as the dep is too hard
Requires seems to flush on every service restart which is NOT what we
want. Use Wants instead.
2020-04-01 15:15:37 -04:00
Joshua Boniface a0ebc0d3a7 Add more robust requirements to pvc-flush service 2020-04-01 15:09:44 -04:00
Joshua Boniface 98a7005c1b Add significant TimeoutSec to pvc-flush service
This will stop systemd from killing the service in the middle of a flush
or unflush operation, which completely defeats the purpose. 30 minutes
was chosen as this is a very large but still somewhat manageable value,
which should cover even a very large very loaded cluster with room to
spare.
2020-04-01 01:24:09 -04:00
Joshua Boniface 44efd66f2c Fix error renaming keys
This function was not implemented and thus failed; implement it.
2020-03-30 21:38:18 -04:00
Joshua Boniface 09aeb33d13 Don't convert non-integer bytes/ops 2020-03-30 19:09:16 -04:00
Joshua Boniface 6563053f6c Add underlying OS and architecture blurbs 2020-03-25 15:54:03 -04:00
Joshua Boniface 862f7ee9a8 Reword the opening paragraph 2020-03-25 15:42:51 -04:00
Joshua Boniface 97a560fcbe Update cluster documentation
Add a TOC, add additional sections, improve wording in some sections,
spellcheck.
2020-03-25 15:38:00 -04:00
Joshua Boniface d84e94eff4 Add force_single_node script 2020-03-25 10:48:49 -04:00
Joshua Boniface ce9d0e9603 Add helper scripts to CLI client 2020-03-22 01:19:55 -04:00
Joshua Boniface 3aea5ae34b Correct invalid function call 2020-03-21 16:46:34 -04:00
Joshua Boniface 3f5076d9ca Revamp some architecture documentation 2020-03-15 18:07:05 -04:00
Joshua Boniface 8ed602ef9c Update getting started paragraph 2020-03-15 17:50:16 -04:00
Joshua Boniface e501345e44 Revamp GitHub notice 2020-03-15 17:39:06 -04:00
Joshua Boniface d8f97d090a Update title in README 2020-03-15 17:37:30 -04:00
Joshua Boniface 082648f3b2 Mention Zookeeper in initial paragraph 2020-03-15 17:36:12 -04:00
Joshua Boniface 2df8f5d407 Fix pvcapid config in migrations script 2020-03-15 17:33:27 -04:00
Joshua Boniface ca65cb66b8 Update Debian changelog 2020-03-15 17:32:12 -04:00
Joshua Boniface 616d7c43ed Add additional info about OVA deployment 2020-03-15 17:31:12 -04:00
Joshua Boniface 4fe3a73980 Reorganize manuals and architecture pages 2020-03-15 17:19:51 -04:00
Joshua Boniface 26084741d0 Update README and index for 0.7 2020-03-15 17:17:17 -04:00
Joshua Boniface 4a52ff56b9 Catch failures in getPoolInformation
Fixes #90
2020-03-15 16:58:13 -04:00
Joshua Boniface 0a367898a0 Don't trigger aggregator fail if fine 2020-03-12 13:22:12 -04:00
Joshua Boniface ca5327b908 Make strtobool even more robust
If strtobool fails, return False always.
2020-03-09 09:30:16 -04:00
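
A minimal sketch of the "return False on any failure" approach from commit ca5327b908 follows. The accepted strings and the helper name `strict_strtobool` are assumptions chosen for illustration; the project's actual function may accept different values.

```python
_TRUE = {"y", "yes", "t", "true", "on", "1"}
_FALSE = {"n", "no", "f", "false", "off", "0"}


def strict_strtobool(value):
    """Parse a truthy or falsy string, raising ValueError on anything else."""
    text = value.lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    raise ValueError(f"invalid truth value: {value!r}")


def strtobool(value):
    """Robust wrapper: any failure to parse yields False."""
    if isinstance(value, bool):
        return value
    try:
        return strict_strtobool(value)
    except Exception:
        # Covers unknown strings as well as non-string input such as None.
        return False


print(strtobool("Yes"), strtobool("maybe"), strtobool(None))
```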
Joshua Boniface d36d8e0637 Use custom strtobool to handle weird edge cases 2020-03-06 09:40:13 -05:00
Joshua Boniface 36588a3a81 Work around bad RequestArgs handling 2020-03-03 16:48:20 -05:00
Joshua Boniface c02bc0b46a Correct issues with VM lock freeing
Code was bad and was using a deprecated feature.
2020-03-02 12:45:12 -05:00
Joshua Boniface 1e4350ca6f Properly handle takeover state in VXNetworks
Most of these actions/conditionals were looking for primary state, but
were failing during node takeover. Update the conditionals to look for
both router states instead.

Also add a wait to lock flushing until a takeover is completed.
2020-03-02 10:41:00 -05:00