Correct spelling mistakes

Joshua Boniface 2022-11-13 01:07:53 -05:00
parent 3e7eb613fc
commit 3cc3a5dd29
1 changed file with 10 additions and 10 deletions

@@ -31,7 +31,7 @@ All 3 nodes in the cluster now have the following specifications:
## The OSD Database Device
As part of that original round of testing, I compared various configurations, including no WAL, with WAL, and various cpuset configurations with no WAL. After that testing was completed, the slight gains of the WAL prompted me to leave that configuration in place for production going forward, and I don't see much reason to remove it for further testing, due to the clear benefit (even if slight) that it gave to write performance. Thus, this follow-up post will focus exclusively on the cpuset configurations with the upgraded and balanaced CPUs.
As part of that original round of testing, I compared various configurations, including no WAL, with WAL, and various CPU set configurations with no WAL. After that testing was completed, the slight gains of the WAL prompted me to leave that configuration in place for production going forward, and I don't see much reason to remove it for further testing, due to the clear benefit (even if slight) that it gave to write performance. Thus, this follow-up post will focus exclusively on the CPU set configurations with the upgraded and balanced CPUs.
## The Fatal Flaw of my Previous Tests and Updated CPU Set Configuration
@@ -39,7 +39,7 @@ The CPU-limited tests as outlined in the original post were fatally flawed. Whil
To counteract this, I created a fresh, from-scratch CPU tuning mechanism for the PVC Ansible deployment scheme. With this new mechanism, CPUs are limited with the systemd AllowedCPUs and CPUAffinity flags, which are then set on the various specific systemd slices that the system uses, including a custom OSD slice. This ensures that the limit happens in both directions and everything is forced into its own CPU set.
In addition to seprating the OSDs and VMs, a third CPU set is also added strictly for system processes. This is capped at 2 cores (plus hyperthreads) for all testing here, and the FIO processes are also limited to this CPU set.
In addition to separating the OSDs and VMs, a third CPU set is also added strictly for system processes. This is capped at 2 cores (plus hyperthreads) for all testing here, and the FIO processes are also limited to this CPU set.
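As a rough illustration of the three-way split described above, the following Python sketch computes system, OSD, and VM CPU sets for a 16-core node and renders a systemd slice drop-in using `AllowedCPUs=`. It is not the PVC Ansible implementation: the function names, the `osd.slice` unit name, and the assumption that the hyperthread sibling of core `c` is numbered `c + total_cores` are all hypothetical and should be checked against the actual CPU topology (for example with `lscpu`).

```python
def cpu_sets(system_cores, osd_cores, vm_cores, threads_per_core=2):
    """Split physical cores into system/OSD/VM sets, including sibling threads.

    ASSUMPTION: sibling thread t of physical core c is numbered
    c + t * total_cores. Real numbering depends on the hardware.
    """
    total = system_cores + osd_cores + vm_cores
    layout = {}
    start = 0
    for name, count in (("system", system_cores),
                        ("osd", osd_cores),
                        ("vm", vm_cores)):
        cores = range(start, start + count)
        layout[name] = sorted(
            c + t * total for t in range(threads_per_core) for c in cores
        )
        start += count
    return layout


def to_cpu_list(cpus):
    """Compress CPU indices into a range list such as '0-1,16-17'."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu  # extend the current contiguous range
        else:
            ranges.append([cpu, cpu])  # start a new range
    return ",".join(
        str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in ranges
    )


def slice_dropin(cpus):
    """Render a drop-in body restricting a slice to the given CPUs."""
    return f"[Slice]\nAllowedCPUs={to_cpu_list(cpus)}\n"


if __name__ == "__main__":
    # The 2+4+10 configuration: 2 system, 4 OSD, 10 VM cores.
    layout = cpu_sets(2, 4, 10)
    for name, cpus in layout.items():
        print(f"{name}: {to_cpu_list(cpus)}")
    # Hypothetical drop-in for a custom slice named osd.slice.
    print(slice_dropin(layout["osd"]))
```

The generated text would be placed in a drop-in for each slice and would need to be paired with a matching affinity setting to get the two-directional limit described above; how PVC itself names and deploys those units is not shown here.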
Thus, the final layout of CPU core sets on all 3 nodes looks like:
@@ -91,7 +91,7 @@ This is overall an interesting result and, as will be shown below, the outlier i
![Sequential Write Bandwidth (MB/s, 4M block size, 64 queue depth)](/images/pvc-ceph-tuning-adventures-part-2/seq-write.png)
Sequential write shows a much more consistent result in line with the hypothesis above, and providing a clear "no" answer for the first question and a fairly clear point of diminishing returns for the second. The overal margin between the configurations, however, is minimal, with just 17 MB/s of performance difference between the best (2+6+8) and worst (2+2+12) configurations.
Sequential write shows a much more consistent result in line with the hypothesis above, and providing a clear "no" answer for the first question and a fairly clear point of diminishing returns for the second. The overall margin between the configurations, however, is minimal, with just 17 MB/s of performance difference between the best (2+6+8) and worst (2+2+12) configurations.
There is a clear drop going from the all-cores configuration to the 2+2+12 configuration, however performance immediately spikes to even higher levels with the 2+4+10 and 2+6+8 configurations, with those only showing a 1 MB/s difference between them. This points towards the 2+4+10 configuration as an optimal one for sequential write performance, as it leaves more cores for VMs and shows that OSDs tend to use at most 2 cores each for sequential write operations. The performance spread does however limit the applicability of this test to much higher-throughput devices (i.e. NVMe SSDs), leaving the question still somewhat open.
@@ -107,17 +107,17 @@ Random read, like sequential write above, shows a fairly consistent upward trend
This test shows the all-cores configuration as the clear loser, with a very significant performance benefit to even the most modest (2+2+12) limited configuration; beyond that, the difference between 2 OSD cores and 6 OSD cores is a relatively small 643 IOs per second; still significant, but not nearly as much as the nearly 3500 IOs per second uplift between the all-cores and 2+2+12 configurations.
This test definitely points towards a tradeoff between VM CPU allocations and maximum read performance, but also seems to indicate that, unlike sequential reads, Ceph does far better with just a few dedicated cores versus many shared cores when performing random reads.
This test definitely points towards a trade-off between VM CPU allocations and maximum read performance, but also seems to indicate that, unlike sequential reads, Ceph does far better with just a few dedicated cores versus many shared cores when performing random reads.
System load follows a similar result to the sequential read tests, with more significant load on the testing node for the all-core and 2+2+12 configurations, before balancing out more in the 2+6+8 configuration.
![Random Write IOs (IOPS, 4k block size, 64 queue depth)](/images/pvc-ceph-tuning-adventures-part-2/rand-write.png)
Random write again continues a general trend in line with the hypothesis and providing nearly the same answers as the sequential write tests, with a similar precipitous drop for the 2+2+12 configuration versus the all-core configuration, before rebounding and increasing with the 2+4+10 and 2+6+8 configurations. The overal margin is a very significant 7832 IOs per second between the worst (2+2+12) and best (2+6+8) tests, more than double the performance.
Random write again continues a general trend in line with the hypothesis and providing nearly the same answers as the sequential write tests, with a similar precipitous drop for the 2+2+12 configuration versus the all-core configuration, before rebounding and increasing with the 2+4+10 and 2+6+8 configurations. The overall margin is a very significant 7832 IOs per second between the worst (2+2+12) and best (2+6+8) tests, more than double the performance.
This test definitely shows that Ceph random writes can consume many CPU cores per OSD process, and that providing more, dedicated cores can provide significant uplift in random write performance. Thus, like random reads, there is a definite tradeoff between the CPU and storage performance requirements of VMs, so a balance must be struck. With regards to the second question, this test does show less clear diminishing returns as the number of dedicated cores increases, potentially indicating that it can scale almost indefinitely.
This test definitely shows that Ceph random writes can consume many CPU cores per OSD process, and that providing more, dedicated cores can provide significant uplift in random write performance. Thus, like random reads, there is a definite trade-off between the CPU and storage performance requirements of VMs, so a balance must be struck. With regards to the second question, this test does show less clear diminishing returns as the number of dedicated cores increases, potentially indicating that it can scale almost indefinitely.
System load shows an interesting trend compared to the other tests. Overall, the load remains in a fairly consistent spread between all 3 nodes, though with a closing gap by the 2+6+8 configuration. Of note is that the overall load drops singificantly on all nodes for the 2+2+12 configuration, showing quite clearly that the OSD processes are starved for CPU power during that test and explaining the overall poor performance there.
System load shows an interesting trend compared to the other tests. Overall, the load remains in a fairly consistent spread between all 3 nodes, though with a closing gap by the 2+6+8 configuration. Of note is that the overall load drops significantly on all nodes for the 2+2+12 configuration, showing quite clearly that the OSD processes are starved for CPU power during that test and explaining the overall poor performance there.
### 95th Percentile Latency Read & Write
@@ -129,19 +129,19 @@ These tests are based on the 95th percentile latency numbers; thus, these are th
Read latency shows a consistent downwards trend like most of the tests so far, with a relatively large drop from the all-cores configuration to the 2+2+12 limited configuration, followed by steady decreases through each subsequent increase in cores. This does seem to indicate a clear benefit towards limiting CPUs, though like the random read tests, the point of diminishing returns comes fairly quickly.
System load also follows another hockey-stick-converging pattern, showing that CPU utilization is definitely corelated with the lower latency as the number of dedicated cores increases.
System load also follows another hockey-stick-converging pattern, showing that CPU utilization is definitely correlated with the lower latency as the number of dedicated cores increases.
![Write Latency (μs, 4k block size, 1 queue depth)](/images/pvc-ceph-tuning-adventures-part-2/latency-write.png)
Write latency shows another result consistent with the other write tests, where the 2+2+12 configuration fares (slightly) worse than the all-cores configuration before rebounding. Here the latency difference becomes significant, with the spread of 252 μs being enough to become noticeable in high-performance applications. There is also no clear point of diminishing returns, just like the other write tests.
System load follows a very curious curve, with node1 load dropping off and leveling out with the 2+4+10 and 2+6+8 configurations, while the other nodes continue to increase. I'm not sure exactly what to make of this result, but the overall performance trend does seem to indicate that, like other write tests, more cores dedicated to the OSDs results in higher utilization and performance.
System load follows a very curious curve, with node1 load dropping off and levelling out with the 2+4+10 and 2+6+8 configurations, while the other nodes continue to increase. I'm not sure exactly what to make of this result, but the overall performance trend does seem to indicate that, like other write tests, more cores dedicated to the OSDs results in higher utilization and performance.
## Conclusions
With a valid testing methodology, I believe we can demonstrate some clear takeaways from this testing.
First, our originaly hypothesis that "more cores means better performance" certainly holds. Ceph is absolutely CPU-bound, and better (newer) CPUs at higher frequencies with more cores are always a benefit to a hyperconverged cluster system like PVC.
First, our original hypothesis that "more cores means better performance" certainly holds. Ceph is absolutely CPU-bound, and better (newer) CPUs at higher frequencies with more cores are always a benefit to a hyperconverged cluster system like PVC.
Second, our first unanswered question, "is a limit worthwhile over no limit", seems to be a definitive "yes" in all except for one case: sequential reads. Only in that situation was the all-cores configuration able to beat all other configurations. However, given that sequential read performance is, generally, a purely artificial benchmark, I would say that it is definitely the case that a dedicated set of CPUs for the OSDs is a good best-practice to follow, as the results from all other tests do show a clear benefit.