Work around synchronization lock issues

Make the block on stage C only wait for 900 seconds (15 minutes) to
prevent indefinite blocking.

The issue occurs when a VM is being received and the in-progress unflush
is cancelled in favour of a flush. When this happens, the lock
acquisition blocks for no obvious reason, and no other changes seem to
affect it. This looks like some sort of locking bug within Kazoo, but I
can't diagnose it as-is. Leave a TODO to look into this again in the
future.
Joshua Boniface 2021-09-26 23:24:23 -04:00
parent 3638efc77e
commit 0d72798814
1 changed files with 10 additions and 3 deletions

@@ -555,9 +555,16 @@ class VMInstance(object):
 time.sleep(0.5)
 self.logger.out('Acquiring lock for phase C', state='i', prefix='Domain {}'.format(self.domuuid))
-lock.acquire()
+try:
+    # Wait for only 900 seconds on this step since we don't do anything and it can fail
+    # if a flush or unflush is cancelled. 900 seconds should be plenty for real long
+    # migrations while still avoiding an indefinite blocking here.
+    # TODO: Really dig into why
+    lock.acquire(timeout=900)
+    # This is strictly a synchronizing step
+    lock.release()
+except Exception:
+    self.logger.out('Failed to acquire lock for phase C within 15 minutes, continuing', state='w', prefix='Domain {}'.format(self.domuuid))
 time.sleep(0.5)
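
The bounded synchronization step in this change can be tried without a ZooKeeper cluster. The following is a minimal sketch, not the project's code: `RaisingLock`, the local `LockTimeout` class, and `synchronize` are hypothetical stand-ins built on `threading.Lock`, written on the assumption that the Kazoo lock signals a timeout by raising an exception (Kazoo's own exception is `kazoo.exceptions.LockTimeout`). The sketch catches only the timeout exception rather than `Exception`, which is one possible tightening and not what the commit does.

```python
import threading


class LockTimeout(Exception):
    """Local stand-in for kazoo.exceptions.LockTimeout (hypothetical)."""


class RaisingLock:
    """Hypothetical wrapper: raises on timeout instead of returning False."""

    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self, timeout=None):
        # threading.Lock uses -1 to mean "wait forever"
        acquired = self._lock.acquire(timeout=-1 if timeout is None else timeout)
        if not acquired:
            raise LockTimeout('lock not acquired within {}s'.format(timeout))
        return True

    def release(self):
        self._lock.release()


def synchronize(lock, timeout, log):
    """Acquire and immediately release `lock` purely as a sync point.

    Returns True if the lock was obtained, False if the wait timed out.
    On timeout a warning is logged and the caller continues.
    """
    try:
        lock.acquire(timeout=timeout)
    except LockTimeout:
        log('Failed to acquire lock within {} seconds, continuing'.format(timeout))
        return False
    # Only release a lock that was actually acquired
    lock.release()
    return True


if __name__ == '__main__':
    held = RaisingLock()
    held.acquire()                        # simulate a peer that never lets go
    print(synchronize(held, 0.2, print))  # logs the warning, then continues
    held.release()
```

Keeping `release()` outside the `try` body means a failure during release is not misreported as an acquisition timeout.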