Apply editorial feedback

Joshua Boniface 2024-09-23 13:58:00 -04:00
parent c684fcc281
commit 7c518abd82
1 changed file with 5 additions and 5 deletions

@@ -21,7 +21,7 @@ Like parts 2 and 3, I'll jump right into the cluster specifications, changes to
## The Cluster Specs (even better)
- Parts 1 and 2 used my own home server setup, based on Dell R430 servers using Broadwell-era Intel Xeon CPUs, for analysis; then part 3 used a more modern AMD Epyc based Dell system through my employer. In this part, we have the beefiest cluster I've built yet. These servers are, at least to me, truly extreme in terms of power, featuring a full 64-core high speed Epyc processor, 1TB of RAM, and 2x 100GbE ports per node. Like all previous test clusters, there are 3 nodes:
+ Parts 1 and 2 used my own home server setup, based on Dell R430 servers using Broadwell-era Intel Xeon CPUs, for analysis; then part 3 used a more modern AMD Epyc-based Dell system through my employer. In this part, we have significantly more powerful machines, featuring a 64-core high-speed Epyc processor, 1TB of RAM, and 2x 100GbE ports per node. Like all previous test clusters, there are 3 nodes:
| **Part**           | **node1 + node2 + node3** |
| :-------------------------------------------------------------- | :------------------------ |
@@ -34,7 +34,7 @@ Parts 1 and 2 used my own home server setup, based on Dell R430 servers using Br
## Test Outline and Hypothesis
- The primary hypothesis of this set of benchmarks is that there is a linear scaling of performance the more OSD processes that are added to the Ceph subsystem. To achieve this, we are using one of the highest-possible spec'd systems, and by far the highest spec'd system I have used, possible for PVC in terms of a scale-up system while remaining on 3 nodes. In addition, a secondary hypothesis is that adding additional OSD processes per NVMe disk (i.e. splitting a single NVMe disk into several smaller "virtual" NVMe disks) will increase performance.
+ The primary hypothesis of this set of benchmarks is that performance scales linearly as more OSD processes are added to the Ceph subsystem. In addition, a secondary hypothesis is that adding additional OSD processes per NVMe disk (i.e. splitting a single NVMe disk into several smaller "virtual" NVMe disks) will increase performance.
Based on the results of the last post, I've focused this test suite mostly on determining the levels of performance scaling and exactly how many OSDs will optimize performance on such a powerful system. CPU sets provided some very contradictory results for NVMe drives in part 3, so I have excluded them from any of the testing here, since I do not believe them to be significantly useful in most workloads. In addition, these tests were conducted on a completely empty cluster, with no VMs active, so these tests are truly of the theoretical maximum performance of the Ceph subsystem on the given hardware and nothing else.
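
To make the "multiple OSDs per NVMe disk" idea concrete, here is a minimal sketch of one generic way such a split can be performed, using Ceph's `ceph-volume lvm batch` with its `--osds-per-device` option. The device paths and the per-device OSD count are hypothetical placeholders, and PVC's own OSD management may wrap this step differently.

```python
# Hypothetical sketch: carve each NVMe device into multiple OSDs with ceph-volume.
# Device paths and the per-device OSD count are placeholders, not the real layout.
import subprocess

NVME_DEVICES = ["/dev/nvme0n1", "/dev/nvme1n1"]  # assumed device names
OSDS_PER_DEVICE = 2  # e.g. split each physical NVMe into 2 "virtual" OSDs


def create_split_osds(devices, osds_per_device):
    """Run 'ceph-volume lvm batch' to create several OSDs on each given device."""
    cmd = [
        "ceph-volume", "lvm", "batch",
        "--osds-per-device", str(osds_per_device),
        *devices,
    ]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    create_split_osds(NVME_DEVICES, OSDS_PER_DEVICE)
```
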
@@ -156,15 +156,15 @@ CPU, memory, and network throughput shows similar trends as well though network
For sequential read I/O, there is a clear benefit to more OSDs up to a point, with the maximum performance falling somewhere around 8 total OSD processes on 4 disks, or 4 total OSD processes on 1 disk. Overall, for random reads, there does not seem to be a clear performance benefit to more disks beyond 2, but there is a benefit to more OSD processes on those disks, especially with 1 disk.
- For sequential write I/O, there is a similar benefit to more OSDs up to about 4 total, at which point it seems that network throughput - or, more specifically, throughput between the internal system components - becomes a bottleneck. CPU utilization also remains fairly low at ~10% maximum, so this is not a major constraint.
+ For sequential write I/O, there is a similar benefit to more OSDs up to about 4 total, at which point it seems that network throughput - or, more specifically, throughput between the internal system components - becomes a bottleneck. CPU utilization also remains fairly low at ~10% maximum, so this does not appear to be a constraint on these systems.
Overall, while sequential performance is purely synthetic, these results do help us draw some useful conclusions: namely, that 1-2 disks per node is plenty, and that ~2 OSD processes per disk provides close to optimal performance.
For random read I/O, the returns of more OSD processes are significant, with noticeable scaling, though nowhere close to linear once above 4 total OSD processes per node.
- For random write I/O, the story is similar, with a clear peak around 4 total OSD processes per node, but whether these are on 1 or 4 disks does not seem to matter. CPU utilization does point clearly to "fewer OSDs are better" if this is a constraint, but the overal maximum utilization of ~20% still leaves a significant portion of the CPU resources to VMs.
+ For random write I/O, the story is similar, with a clear peak around 4 total OSD processes per node, but whether these are on 1 or 4 disks does not seem to matter. Where exactly the bottleneck lies on this system would need further investigation.
- Overall, random I/O performance definitely points to a sweet spot of 4 OSD processes per node, though this appears to be a clear maximum. One disk per node with 4 OSD processes can also maintain the same throughput, thus making additional disks unnecessary.
+ Overall, random I/O performance definitely points to a sweet spot of 4 OSD processes per node, though this appears to be a clear maximum; one disk per node with 4 OSD processes can also maintain the same throughput.
For low queue-depth I/O, fewer OSDs are, as expected, better in terms of performance, but latency seems to scale fairly consistently with no major changes; thus, the number of disks does not significantly alter this metric for better or worse.
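
As a rough reference for how the sequential, random, and low queue-depth test classes above differ, the sketch below drives `fio` directly with broadly analogous job parameters. The target file, sizes, block sizes, and runtimes are assumptions for illustration only, not the benchmark configuration actually used to produce these results.

```python
# Hypothetical sketch: run fio jobs roughly matching the test classes discussed
# above (sequential bandwidth, random IOPS, and low queue-depth latency).
# The target file, size, and runtimes are placeholder assumptions.
import subprocess

TARGET = "/mnt/ceph-test/fio.bin"  # assumed test file on Ceph-backed storage

JOBS = {
    "seq_read_bw":     {"rw": "read",      "bs": "4M", "iodepth": 64},
    "seq_write_bw":    {"rw": "write",     "bs": "4M", "iodepth": 64},
    "rand_read_iops":  {"rw": "randread",  "bs": "4k", "iodepth": 64},
    "rand_write_iops": {"rw": "randwrite", "bs": "4k", "iodepth": 64},
    "rand_read_lat":   {"rw": "randread",  "bs": "4k", "iodepth": 1},
    "rand_write_lat":  {"rw": "randwrite", "bs": "4k", "iodepth": 1},
}


def run_job(name, params):
    """Invoke fio for one job and return its raw JSON output."""
    cmd = [
        "fio", f"--name={name}", f"--filename={TARGET}", "--size=4G",
        "--ioengine=libaio", "--direct=1", "--numjobs=1",
        "--time_based", "--runtime=60", "--output-format=json",
        f"--rw={params['rw']}", f"--bs={params['bs']}",
        f"--iodepth={params['iodepth']}",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    for name, params in JOBS.items():
        print(f"Running {name} ...")
        run_job(name, params)
```
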