Commit Graph

2241 Commits

Author SHA1 Message Date
Joshua Boniface cebb4bbc1a Comment cleanup 2020-06-06 13:20:40 -04:00
Joshua Boniface a672e06dd2 Move fencing to end of keepalive function 2020-06-06 13:19:11 -04:00
Joshua Boniface 1db73bb892 Move libvirt closure into previous section 2020-06-06 13:18:37 -04:00
Joshua Boniface c1956072f0 Rename update_zookeeper function to node_keepalive 2020-06-06 12:49:50 -04:00
Joshua Boniface ce60836c34 Allow enforcement of live migration
Provides a CLI and API argument to force live migration, which triggers
a new VM state "migrate-live". The node daemon VMInstance during migrate
will read this flag from the state and, if enforced, will not trigger a
shutdown migration.

Closes #95
2020-06-06 12:00:44 -04:00
Joshua Boniface b5434ba744 Fix typo in variable name 2020-06-06 11:29:48 -04:00
Joshua Boniface f61d443773 Allow move of migrated VM to current node
Will make the migrate permanent instead of throwing an error.

Fixes #96
2020-06-06 11:25:10 -04:00
Joshua Boniface da20b4493a Properly return the function 2020-06-05 15:50:43 -04:00
Joshua Boniface 440821b136 Refactor cluster validation into a command wrapper
Instead of using group-based validation, which breaks the help context
for subcommands, use a decorator to validate the cluster status for each
command. The eager help option will then override this decorator for
help commands, while enforcing it for others.
2020-06-05 14:49:53 -04:00
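
A minimal sketch of the decorator approach described in 440821b136, not the project's actual code. The names `cluster_req`, `cluster_is_reachable`, `vm_list` and the `state` dictionary are hypothetical stand-ins. In click, the help option is eager and exits before the command callback runs, so a wrapper around the callback is never reached for help requests.

```python
import functools


def cluster_is_reachable(state):
    """Hypothetical stand-in for the real cluster status check."""
    return state.get("reachable", False)


def cluster_req(function):
    """Validate the cluster status before running the wrapped command."""
    @functools.wraps(function)
    def validate_before_call(state, *args, **kwargs):
        if not cluster_is_reachable(state):
            # Refuse to run the command body against an invalid cluster.
            return "error: cluster is not reachable"
        return function(state, *args, **kwargs)
    return validate_before_call


@cluster_req
def vm_list(state):
    """Hypothetical command body; only runs once validation passes."""
    return "vm list output"
```

Because validation is attached per command rather than per group, each subcommand keeps its own name and help context.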
Joshua Boniface b9e5b14f94 Update lastnode too if a self-migrate is aborted
References #92
2020-06-04 10:28:04 -04:00
Joshua Boniface 5d2031d99e Prevent a VM migrating to the same node
Prevents a rare edge case where a node can end up "migrating" to itself.
Quick hack to fix this, though like most of the VM management should
probably be rethought/rewritten later.

Fixes #92
2020-06-04 10:26:47 -04:00
Joshua Boniface 9ee5ae4826 Volume and Snapshot are not sorted by ID 2020-05-29 13:43:44 -04:00
Joshua Boniface 48711000b0 Ensure stats sorting is by right key 2020-05-29 13:41:52 -04:00
Joshua Boniface 82c067b591 Sort list output in CLI client properly 2020-05-29 13:39:20 -04:00
Joshua Boniface 0fab7072ac Sort all Ceph lists by numeric ID 2020-05-29 13:31:18 -04:00
Joshua Boniface 2d507f8b42 Ensure rbdlist is updated when modifying VM config 2020-05-12 11:08:47 -04:00
Joshua Boniface 5f9836f96d Add error message to OSD parse fail 2020-05-12 11:04:38 -04:00
Joshua Boniface 95c59ba629 Improve flush handling slightly 2020-05-12 11:04:38 -04:00
Joshua Boniface e724e73140 Don't show built-in bridges as invalid 2020-05-12 10:46:10 -04:00
Joshua Boniface 3cf90c46ad Correct bad handling of static reservations 2020-05-09 10:20:06 -04:00
Joshua Boniface 7b2180b626 Get both reservations in leases by default 2020-05-09 10:05:55 -04:00
Joshua Boniface 72a38fd437 Correct changed dhcp_reservations key name 2020-05-09 10:00:53 -04:00
Joshua Boniface 73eb4fb457 Fix typo of macaddress in dhcp add 2020-05-09 00:15:25 -04:00
Joshua Boniface b580760537 Add missing fmt_cyan variable 2020-05-08 18:15:02 -04:00
Joshua Boniface 683c3afea6 Correct spelling mistake 2020-05-06 11:29:42 -04:00
Joshua Boniface 4c7cb1a20c Add further wording tweaks and details 2020-05-06 11:20:12 -04:00
Joshua Boniface 90feb83eab Revamp some wording in the documentation 2020-05-06 10:41:13 -04:00
Joshua Boniface b91923735c Move some messages around 2020-05-05 16:19:18 -04:00
Joshua Boniface 34c4690d49 Don't convert bytes into KB in OVA import
Doing so can create an image that is 1 sector (512 bytes) too large,
which will then break qemu-img because it's stupid (or, VMDK is stupid,
I haven't decided which is). Current Ceph rbd commands seem to accept
--size in bytes so this is fine.
2020-05-05 16:14:18 -04:00
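
The arithmetic behind 34c4690d49 can be illustrated with a short sketch. The commit does not say how the KB conversion was performed; this assumes it rounded up to a whole KB, and the function name and disk size are hypothetical.

```python
SECTOR = 512  # bytes per sector, as stated in the commit message


def size_via_kb(size_bytes):
    """Assumed old behaviour: round up to whole KB, expressed back in bytes."""
    kilobytes = -(-size_bytes // 1024)  # integer ceiling division
    return kilobytes * 1024


# Hypothetical disk whose size is an odd number of sectors.
disk_bytes = 20 * 1024 * 1024 + SECTOR

# Extra bytes introduced by going through KB instead of passing bytes directly.
overshoot = size_via_kb(disk_bytes) - disk_bytes
```

Under that assumption, any size that is not a multiple of 1024 bytes grows when converted, while passing the byte count straight through leaves it unchanged.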
Joshua Boniface 3e351bb84a Add additional error checking for profile creation 2020-05-05 15:28:39 -04:00
Joshua Boniface 331027d124 Add further tweaks to takeover state checks
Just ensure that everything is in the proper state before proceeding.
2020-04-22 11:16:19 -04:00
Joshua Boniface ae4f36b881 Hook flush into more services
Trying to ensure that pvc-flush completes before anything tries to shut
down.
2020-04-14 19:58:53 -04:00
Joshua Boniface e451426c7c Fix minor bugs from change in VM info handling 2020-04-13 22:56:19 -04:00
Joshua Boniface 611e0edd80 Reorder last keepalive during cleanup
Make sure the stopping of the keepalive timer and final keepalive update
are done as the last step before complete shutdown. The previous setup
could conceivably result in a node being fenced should the cleanup
operations take longer than ~45 seconds, for instance if primary node
switchover took too long or blocked, or log watchers failed to stop
quickly enough. Ensures that keepalives will continue to be run during
the shutdown process until the last possible moment.
2020-04-12 03:49:29 -04:00
Joshua Boniface b413e042a6 Improve handling of primary contention
Previously, contention could occasionally cause a flap/dual primary
contention state due to the lack of checking within this function. This
could cause a state where a node transitions to primary then is almost
immediately shifted away, which could cause undefined behaviour in the
cluster.

The solution includes several elements:
    * Implement an exclusive lock operation in zkhandler
    * Switch the become_primary function to use this exclusive lock
    * Implement exclusive locking during the contention process
    * As a failsafe, check stat versions before setting the node as the
      primary node, in case another node has already become primary
    * Delay the start of takeover/relinquish operations by slightly
      longer than the lock timeout
    * Make the current router_state conditions more explicit (positive
      conditionals rather than negative conditionals)

The new scenario ensures that during contention, only one secondary will
ever succeed at acquiring the lock. Ideally, the other would then grab
the lock and pass, but in testing this does not seem to be the case -
the lock always times out, so the failsafe check is technically not
needed but has been left as an added safety mechanism. With this setup,
the node that fails the contention will never block the switchover nor
will it try to force itself onto the cluster after another node has
successfully won contention.

Timeouts may need to be adjusted in the future, but the base timeout of
0.4 seconds (and transition delay of 0.5 seconds) seems to work reliably
during preliminary tests.
2020-04-12 03:40:17 -04:00
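
The contention scheme described in b413e042a6 can be sketched as follows. This is an illustration only: an in-process `threading.Lock` stands in for the exclusive Zookeeper lock, and the `Cluster` class and `try_become_primary` function are hypothetical names, not the zkhandler API. Only the 0.4 second timeout comes from the commit message.

```python
import threading

LOCK_TIMEOUT = 0.4  # seconds; base timeout quoted in the commit message


class Cluster:
    """Hypothetical shared state; the lock stands in for a Zookeeper lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.primary = None
        self.version = 0  # stand-in for the stat version of the primary key


def try_become_primary(cluster, node, seen_version):
    """Return True only for the single node that wins contention."""
    if not cluster.lock.acquire(timeout=LOCK_TIMEOUT):
        return False  # lock timed out: this node lost and backs off
    if cluster.version != seen_version:
        # Failsafe: another node already changed the primary key.
        cluster.lock.release()
        return False
    cluster.primary = node
    cluster.version += 1
    # The winner deliberately keeps the lock held, modelling it being
    # held through the takeover so that losers time out.
    return True


cluster = Cluster()
seen = cluster.version
results = [try_become_primary(cluster, n, seen) for n in ("node1", "node2")]
```

Here the second node waits out the timeout and gives up rather than blocking the switchover, which matches the behaviour the commit message reports from testing.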
Joshua Boniface e672d799a6 Set flush after pvcapid.service
This may or may not help, but should in theory prevent the flush from
trying to run after a (locally-running) API daemon is terminated, which
could cause an API failure and a failure to flush.
2020-04-12 01:48:50 -04:00
Joshua Boniface 59707bad4e Fix some errors in the FAQ 2020-04-11 01:33:18 -04:00
Joshua Boniface 9c19813808 Fix link to FAQ page 2020-04-11 01:28:32 -04:00
Joshua Boniface 8fe50bea77 Add FAQ to documentation 2020-04-11 01:22:07 -04:00
Joshua Boniface 8faa3bb53d Handle info fuzzy matches better
If we are calling info, we want one VM. Don't silently discard other
options or try (and fail later) to parse multiple, just say no VM found.
2020-04-09 10:26:49 -04:00
Joshua Boniface a130f19a19 Depend pvcnoded on Zookeeper (harder) and libvirtd 2020-04-09 09:57:53 -04:00
Joshua Boniface a671d9d457 Use consistent tense in messages 2020-04-08 22:00:51 -04:00
Joshua Boniface fee1c7dd6c Reorder cleanup and gracefully wait for flushes 2020-04-08 22:00:08 -04:00
Joshua Boniface b3a75d8069 Use post instead of get on initialize 2020-04-06 15:05:33 -04:00
Joshua Boniface c3bd6b6ecc Add missing call into cluster initialize function 2020-04-06 14:48:26 -04:00
Joshua Boniface 5d58bee34f Add some time around noded startup/shutdown
Otherwise, systemd kills networking before the node daemon fully stops
and it goes into "dead" status, which is super annoying.
2020-04-01 23:59:14 -04:00
Joshua Boniface f668412941 Don't use Requires as the dep is too hard
Requires seems to flush on every service restart which is NOT what we
want. Use Wants instead.
2020-04-01 15:15:37 -04:00
Joshua Boniface a0ebc0d3a7 Add more robust requirements to pvc-flush service 2020-04-01 15:09:44 -04:00
Joshua Boniface 98a7005c1b Add significant TimeoutSec to pvc-flush service
This will stop systemd from killing the service in the middle of a flush
or unflush operation, which completely defeats the purpose. 30 minutes
was chosen as this is a very large but still somewhat manageable value,
which should cover even a very large very loaded cluster with room to
spare.
2020-04-01 01:24:09 -04:00
Joshua Boniface 44efd66f2c Fix error renaming keys
This function was not implemented and thus failed; this commit implements it.
2020-03-30 21:38:18 -04:00