[BUG] Not all VMs migrate properly #7

Closed
opened 2018-07-17 03:03:04 -04:00 by JoshuaBoniface · 5 comments
JoshuaBoniface commented 2018-07-17 03:03:04 -04:00 (Migrated from git.bonifacelabs.ca)

Some VMs are left running after flush and are never updated. Investigate cause.

JoshuaBoniface commented 2018-07-17 11:58:38 -04:00 (Migrated from git.bonifacelabs.ca)

I think this is caused because the domain_list ends up missing hosts for some reason. Should have the keepalive re-add any missing hosts to the list (it should be already, but shrug).

JoshuaBoniface commented 2018-07-17 12:18:02 -04:00 (Migrated from git.bonifacelabs.ca)

mentioned in commit 778eff2d7d

JoshuaBoniface commented 2018-07-17 13:15:54 -04:00 (Migrated from git.bonifacelabs.ca)

So my initial hunch was not correct. It seems that consistently, it's always the second element in the domain_list that is missed. For example:

>>> 2018/07/17 13:13:42.653969 - Flushing node "hv3.i.bonilan.net" of running VMs
    Domain list: 94711a1c-4e06-4750-b9f0-a2d223bb8d88, eb2bfdcc-ba79-47a2-b500-98bcee885ebc, 4ccffadd-bb55-43af-aeda-50bf3259c355, 95b18050-902c-4f29-86da-cef6c543ac53, 8e113171-6a00-46ca-af0e-94d466fa7ab1, 97a555e7-bfb1-4f26-9018-f2f24f539bd1, 8f93598f-a5bd-46a6-afe1-dd5560d33b85, b4048a35-d7c6-4141-a854-2b2b51434863, 79126c52-a3b0-4ae5-9023-583bb24931cc, 077ce7ef-1d14-43a1-ba65-5a752a3aa20a, a5458207-b642-4683-b9c6-f7ca09535b0e, 80f36f39-9173-45be-80ca-28a1e61a469c, b89f5a9d-0d7f-4653-9ff0-e21044b263de, fd24313f-c73d-434e-a977-61029fdc5405, 358196cd-2656-4a64-bbfa-6b4e9d8e3276
>>> 2018/07/17 13:13:42.654220 - Selecting target to migrate VM "94711a1c-4e06-4750-b9f0-a2d223bb8d88"
>>> 2018/07/17 13:13:42.659490 - Migrating VM "94711a1c-4e06-4750-b9f0-a2d223bb8d88" to hypervisor "hv1.i.bonilan.net"
>>> 2018/07/17 13:13:42.870474 - VM state change for "94711a1c-4e06-4750-b9f0-a2d223bb8d88": migrate hv1.i.bonilan.net
>>> 2018/07/17 13:13:42.870592 - 94711a1c-4e06-4750-b9f0-a2d223bb8d88: Migrating VM to hypervisor "hv1.i.bonilan.net"
>>> 2018/07/17 13:13:44.974542 - 94711a1c-4e06-4750-b9f0-a2d223bb8d88: Successfully migrated VM
libvirt: QEMU Driver error : Domain not found: no domain with matching uuid '94711a1c-4e06-4750-b9f0-a2d223bb8d88' (ombi1)
>>> 2018/07/17 13:13:46.204049 - VM state change for "94711a1c-4e06-4750-b9f0-a2d223bb8d88": start hv1.i.bonilan.net
>>> 2018/07/17 13:13:46.230604 - hv3.i.bonilan.net keepalive
    Active domains: 14  Free memory [MiB]: 78698  Used memory [MiB]: 17736  Load: 1.71
>>> 2018/07/17 13:13:46.236021 - Cluster status
    Active nodes: hv2.i.bonilan.net hv1.i.bonilan.net hv3.i.bonilan.net
    Inactive nodes:
    Flushed nodes:
>>> 2018/07/17 13:13:46.674518 - Selecting target to migrate VM "4ccffadd-bb55-43af-aeda-50bf3259c355"
>>> 2018/07/17 13:13:46.679928 - Migrating VM "4ccffadd-bb55-43af-aeda-50bf3259c355" to hypervisor "hv1.i.bonilan.net"
>>> 2018/07/17 13:13:46.891465 - VM state change for "4ccffadd-bb55-43af-aeda-50bf3259c355": migrate hv1.i.bonilan.net
>>> 2018/07/17 13:13:46.891578 - 4ccffadd-bb55-43af-aeda-50bf3259c355: Migrating VM to hypervisor "hv1.i.bonilan.net"
>>> 2018/07/17 13:13:49.187614 - 4ccffadd-bb55-43af-aeda-50bf3259c355: Successfully migrated VM
libvirt: QEMU Driver error : Domain not found: no domain with matching uuid '4ccffadd-bb55-43af-aeda-50bf3259c355' (radarr1)
>>> 2018/07/17 13:13:50.413796 - VM state change for "4ccffadd-bb55-43af-aeda-50bf3259c355": start hv1.i.bonilan.net

eb2bfdcc-ba79-47a2-b500-98bcee885ebc is the second element of the array, and it is skipped during migration without even beginning (the Selecting target message is the very first thing in the loop). Something weird is up here.

JoshuaBoniface commented 2018-07-17 14:30:59 -04:00 (Migrated from git.bonifacelabs.ca)

closed via commit 7f3caa2859

JoshuaBoniface commented 2018-07-22 20:25:53 -04:00 (Migrated from git.bonifacelabs.ca)

For posterity: the cause was me misunderstanding how Python handles a list that is modified while it is being looped over. In short, elements were being removed from the domain_list list object while I was also looping over that same object, which threw the loop off by one. Only one entry was missed, and it was consistently the second. That makes sense: the element removed was the first, so on the next iteration the loop's position "2" actually pointed at what had originally been the third element, and the original second element was skipped.

Fixed it up by doing an explicit copy() from the domain_list list before looping over it, ensuring the modifications coming from the main daemon process didn't affect the loop.
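
The off-by-one described above can be reproduced in a few lines. This is only a minimal sketch, not the actual PVC daemon code: the function name `visit_domains`, the `vm-N` identifiers, and the in-loop `remove()` call (standing in for the daemon dropping a migrated VM from the list) are all assumptions made for illustration.

```python
def visit_domains(domain_list, iterate_over_copy):
    """Return the entries the loop actually visits.

    Hypothetical stand-in for the flush loop: while the loop runs, the
    first entry is removed from domain_list, simulating another part of
    the daemon dropping a VM from the list once it has migrated away.
    """
    # The fix: loop over a snapshot so later removals cannot shift the loop.
    source = domain_list.copy() if iterate_over_copy else domain_list
    visited = []
    for dom_uuid in source:
        visited.append(dom_uuid)
        if dom_uuid == "vm-1":
            # Removing an earlier element shifts every later element down
            # by one index, but the loop's internal position still advances.
            domain_list.remove(dom_uuid)
    return visited


# Looping over the live list: the second entry is never visited.
buggy = visit_domains(["vm-1", "vm-2", "vm-3", "vm-4"], iterate_over_copy=False)
print(buggy)  # ['vm-1', 'vm-3', 'vm-4']

# Looping over a copy: every entry is visited.
fixed = visit_domains(["vm-1", "vm-2", "vm-3", "vm-4"], iterate_over_copy=True)
print(fixed)  # ['vm-1', 'vm-2', 'vm-3', 'vm-4']
```

A `for` loop over a list tracks an index internally, so removing an element at or before the current position makes the next element slide into a slot the loop has already passed. Iterating over `domain_list.copy()` (or `list(domain_list)`) decouples the loop from those removals.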

Reference: parallelvirtualcluster/pvc#7