Document/Fix RAM safety issues #104

Closed
opened 2020-10-12 17:45:00 -04:00 by JoshuaBoniface · 8 comments
JoshuaBoniface commented 2020-10-12 17:45:00 -04:00 (Migrated from git.bonifacelabs.ca)

As it stands, PVC does little "bounds-checking" or "safety checking" of node RAM usage and VM RAM allocations. As a result, a node can easily end up stuck in a heavy swapping state or crash.

  1. Document the deficiencies and advise on keeping the node memory usage carefully controlled.

  2. Over time, develop safeguards in PVC against this. As a first step, warn the cluster when total memory usage exceeds what the cluster could hold with its largest server removed (n-1).

JoshuaBoniface commented 2020-10-12 17:45:00 -04:00 (Migrated from git.bonifacelabs.ca)

changed milestone to %4

JoshuaBoniface commented 2020-10-12 17:46:00 -04:00 (Migrated from git.bonifacelabs.ca)

Add some sub-issues:

  1. #105 Ensure "VM" node memory counts include stopped VMs
  2. #106 Better handle mismatched-RAM scenarios (e.g. stop allocation in lockstep if one node is smaller).
JoshuaBoniface commented 2020-10-18 14:29:46 -04:00 (Migrated from git.bonifacelabs.ca)

The first two solutions have been implemented.

  1. An additional node memory field, "provisioned", tracks the total provisioned memory of both running (as in "allocated") and non-running VMs. It is shown in the node list and node details; in the CLI it is displayed in yellow, like the allocated memory, when the limit is violated.

  2. The `mem` migration selector has been modified to use this "provisioned" memory count instead of "allocated", to better account for situations where some VM(s) are stopped/non-running during a migration.
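
The difference between the two counters, and why it matters for target selection, can be sketched roughly as follows. This is an illustration only, not PVC's actual code: the `VM` and `Node` classes, their field names, and `select_target_by_mem` are hypothetical, and it is an assumption that the selector prefers the node with the most memory left after subtracting provisioned memory.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    memory_mb: int
    running: bool


@dataclass
class Node:
    name: str
    total_mb: int
    vms: list = field(default_factory=list)

    @property
    def allocated_mb(self) -> int:
        # Memory assigned to running VMs only
        return sum(vm.memory_mb for vm in self.vms if vm.running)

    @property
    def provisioned_mb(self) -> int:
        # Memory assigned to every VM on the node, running or not
        return sum(vm.memory_mb for vm in self.vms)


def select_target_by_mem(nodes, exclude):
    # Hypothetical selector: prefer the node with the most memory
    # remaining once provisioned (not just allocated) memory is subtracted
    candidates = [n for n in nodes if n.name != exclude]
    return max(candidates, key=lambda n: n.total_mb - n.provisioned_mb)


nodes = [
    Node("node1", 65536, [VM("a", 16384, True)]),
    Node("node2", 65536, [VM("b", 8192, True), VM("c", 32768, False)]),
    Node("node3", 65536, [VM("d", 24576, True)]),
]
target = select_target_by_mem(nodes, exclude="node1")
print(target.name)  # node3
```

In this example node2 looks emptier by allocated memory (8192 MB against node3's 24576 MB), but its stopped VM brings its provisioned total to 40960 MB, so node3 is chosen.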

Next steps:

  1. An additional check in the cluster status should show whether memory usage exceeds the combined memory of the (n-1) smallest nodes, so that the administrator becomes aware of potential (n-1) allocation issues.
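
One reading of this check, as a rough sketch: if the largest node fails, the remaining (n-1) nodes must hold everything, so a warning is due when memory usage exceeds the combined memory of the (n-1) smallest nodes. The function name, the plain-integer inputs, and the choice of which memory figure to pass as `used_mb` are assumptions for illustration, not PVC's implementation.

```python
def n_minus_1_overprovisioned(node_totals_mb, used_mb):
    """Return True if used_mb would not fit on the n-1 smallest nodes."""
    if len(node_totals_mb) < 2:
        # With fewer than two nodes there is no failover capacity at all
        return used_mb > 0
    # Dropping the single largest node leaves the n-1 smallest nodes
    n_minus_1_capacity = sum(node_totals_mb) - max(node_totals_mb)
    return used_mb > n_minus_1_capacity


totals = [65536, 65536, 32768]  # n-1 capacity: 65536 + 32768 = 98304
print(n_minus_1_overprovisioned(totals, 90000))   # False
print(n_minus_1_overprovisioned(totals, 100000))  # True
```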
JoshuaBoniface commented 2020-10-18 19:52:15 -04:00 (Migrated from git.bonifacelabs.ca)

The next step has been implemented as well. This makes things much safer in theory.

JoshuaBoniface commented 2020-10-18 23:14:25 -04:00 (Migrated from git.bonifacelabs.ca)

removed milestone

JoshuaBoniface commented 2020-10-18 23:14:28 -04:00 (Migrated from git.bonifacelabs.ca)

changed milestone to %5

JoshuaBoniface commented 2020-10-21 03:30:57 -04:00 (Migrated from git.bonifacelabs.ca)

closed via commit 9bfcab5e2bbf29f1b9d40442544a091c1e8de476
JoshuaBoniface commented 2020-10-21 03:31:29 -04:00 (Migrated from git.bonifacelabs.ca)

Between the implemented warning and the documentation changes, I consider this issue resolved. If administrators push past the warning and ignore the documented advice, undefined behaviour can be expected.

Reference: parallelvirtualcluster/pvc#104