diff --git a/content/pvc-ceph-tuning-adventures-part-3.md b/content/pvc-ceph-tuning-adventures-part-3.md
index fada21e..226e561 100644
--- a/content/pvc-ceph-tuning-adventures-part-3.md
+++ b/content/pvc-ceph-tuning-adventures-part-3.md
@@ -22,7 +22,7 @@ Like part 2, I'll jump right into the cluster specifications, changes to the tes
 
 Parts 1 and 2 used my own home server setup, based on Dell R430 servers using Broadwell-era Intel Xeon CPUs, for analysis. But being my homelab, I'm quite limited in what hardware I have access to: namely, I'm using several-generations-old hardware and SATA SSDs, so despite having some very interesting results, they are very limited by the hardware.
 
-Enter my employer. We've been using my PVC project as the basis for our on-premesis customer deployments for almost 3 years now, and our customers are not buying used hardware. Instead, they're buying brand-new, top-of-the-line Dell systems featuring modern AMD Epyc CPUs and NVMe SSDs. I wanted to get performance numbers from one of these systems to determine if the conclusions from part 2 continue to hold even with this much more performant, newer hardware.
+Enter my employer. We've been using my PVC project as the basis for our on-premises customer deployments for almost 3 years now, and our customers are not buying used hardware. Instead, they're buying brand-new, top-of-the-line Dell systems featuring modern AMD Epyc CPUs and NVMe SSDs. I wanted to get performance numbers from one of these systems to determine if the conclusions from part 2 continue to hold even with this much more performant, newer hardware.
 
 Like my home cluster, these clusters use 3 nodes, with the following specifications:
 
@@ -93,7 +93,7 @@ Sequential read shows a significant difference with the NVMe SSDs and newer CPUs
 
 This test instead shows a result much more inline with expectations: no-limit performance is significantly lower than the dedicated limits, and by a relatively large 13% margin.
 
-The best result was with the 4+1+27 configuration, with a decreasing stairstep pattern to the 4+4+24 configuration. However, all the limit tests were within 1% of each other, which I would consider the margin of error.
+The best result was with the 4+1+27 configuration, with a decreasing stair-step pattern to the 4+4+24 configuration. However, all the limit tests were within 1% of each other, which I would consider the margin of error.
 
 Thus, this test upholds the hypothesis: a limit is a good thing to avoid scheduler overhead, though there is no clear winner in terms of the number of dedicated OSD CPUs.
 
@@ -101,9 +101,9 @@ CPU load does show an interesting drop with the 4+3+25 configuration before jump
 
 ![Sequential Write Bandwidth (MB/s, 4M block size, 64 queue depth)](/images/pvc-ceph-tuning-adventures-part-3/seq-write.png)
 
-Sequential write shows a similar stairstep pattern, though more pronounced. The no-limit performance is actually the second-best here, which is an interesting result, though again the results are all within a nearly margin-of-error 2% of each other.
+Sequential write shows a similar stair-step pattern, though more pronounced. The no-limit performance is actually the second-best here, which is an interesting result, though again the results are all within a nearly margin-of-error 2% of each other.
 
-The higest performance was the 4+2+26, though interestingly the 4+3+25 configuration performed the worst. Further, these results are not leaps-and-bounds higher than the results with SATA SSDs, potentially pointing to a hard limit within Ceph 14.x "Nautilus" in terms of sequential writes. Though again since these are all within a reasonable margin of error, I think we can conclude that for sequential writes, there is no conclusive benefit to a CPU limit.
+The highest performance was the 4+2+26, though interestingly the 4+3+25 configuration performed the worst. Though again since these are all within a reasonable margin of error, I think we can conclude that for sequential writes, there is no conclusive benefit to a CPU limit.
 
 System load follows the same trend as did sequential reads, with a drop off for each test until a bottom with the 4+3+25 configuration before rebounding slightly higher for the 4+4+24 configuration. I'm not sure at this point if these load numbers are even showing anything at all, but it is still interesting to see.
 
@@ -121,7 +121,7 @@ Random IO tests tend to better reflect the realities of VM clusters, and thus ar
 
 Random read shows a similar trend as sequential reads, and one completely in-line with our hypothesis. There is definitely a more pronounced trend here though, with a clear increase in performance of about 8% between the worst (4+1+27) and best (4+8+24) results.
 
-However this test shows yet another stairstep pattern where the 4+2+26 configuration outpaced the 4+3+25 configuration. I suspect this might be due to the on-package NUMA domains and chiplet architecture of the Epyc chips, whereby the 3rd core has to traverse a higher-latency interconnect and thus hurts performance when going from 2 to 3 dedicated CPUs, though more in-depth testing would be needed to definitively confirm this.
+However this test shows yet another stair-step pattern where the 4+2+26 configuration outpaced the 4+3+25 configuration. I suspect this might be due to the on-package NUMA domains and chiplet architecture of the Epyc chips, whereby the 3rd core has to traverse a higher-latency interconnect and thus hurts performance when going from 2 to 3 dedicated CPUs, though more in-depth testing would be needed to definitively confirm this.
 
 System load continues to show almost no correlation at all with performance, and thus can be ignored.
 
@@ -129,7 +129,9 @@ System load continues to show almost no correlation at all with performance, and
 
 Random writes bring back the strange anomaly that we saw with sequential reads in the previous post. Namely, that for some reason, the no-limit configuration performs significantly better than all limits. After that, the performance seems to scale roughly linearly with each increase in CPU core count, exactly as was seen with the SATA SSDs in the previous post.
 
-The system load here shows a possibly explanation for the anomalous results though. Random writes seem to hit the CPU much harder than the other tests, and the baseline load of all nodes with the no-limit configuration is about 8, which would indicate that the OSD processes want about 8 CPU cores per OSD here. Adding in the 4+8+20 configuration, we can see that this is definitely higher than all the other limit configurations, but is still less than the no-limit configuration. It does appear that the scaling is not linear as well, since doubling the cores only brought us about half-way up to the no-limit performance, thus giving us a pretty conclusively "yes" answer to our first main question.
+One possible explanation is again the NUMA domains within the CPU package. The Linux kernel is aware of these limitations, and thus could potentially be assigning CPU resources to optimize performance, especially for the CPU-to-NIC pipeline. Again this would need some more thorough, in-depth testing to confirm, but it is my hunch that this is occurring.
+
+The system load here shows another possibly explanation for the anomalous results though. Random writes seem to hit the CPU much harder than the other tests, and the baseline load of all nodes with the no-limit configuration is about 8, which would indicate that the OSD processes want about 8 CPU cores per OSD here. Adding in the 4+8+20 configuration, we can see that this is definitely higher than all the other limit configurations, but is still less than the no-limit configuration, so this doesn't seem to be the *only* explanation. It does appear that the scaling is not linear as well, since doubling the cores only brought us about half-way up to the no-limit performance, thus pointing towards the NUMA limit as well and giving us a pretty conclusively "yes" answer to our first main question.
 
 For write-heavy workloads, this is a very important takeaway. This test clearly shows that the no-limit configuration is ideal for random writes on NVMe drives, as the Linux scheduler seems better able to distribute the load among many cores. I'd be interested to see how this is affected by many CPU-heavy noisy-neighbour VMs, but testing this is extremely difficult and thus is not in scope for this series.
 
@@ -141,16 +143,16 @@ These tests are based on the 95th percentile latency numbers; thus, these are th
 
 ![Read Latency (μs, 4k block size, 1 queue depth)](/images/pvc-ceph-tuning-adventures-part-3/latency-read.png)
 
-Read latency shows a consistent downwards trend thoughout the configurations, though with the 4+4+24 result being a strange outlier. However the latency here is very good, only 1/4 of the latency of the SATA SSDs in the previous post, and the results are all so low that they are not likely to be particularly impactful.
+Read latency shows a consistent downwards trend throughout the configurations, though with the 4+4+24 and 4+8+24 results being outliers. However the latency here is very good, only 1/4 of the latency of the SATA SSDs in the previous post, and the results are all so low that they are not likely to be particularly impactful. We're really pushing raw network latency and packet processing overheads with these results.
 
 ![Write Latency (μs, 4k block size, 1 queue depth)](/images/pvc-ceph-tuning-adventures-part-3/latency-write.png)
 
-Write latency also shows a major improvement over SATA SSDs, being only 1/5 of those results. It also, like the read latency, shows a fairly limited spread in results, though with a similar uptick from 4+3+25 to 4+4+24. Like read latency, I don't believe these numbers are significant enoug hto show a major benefit to the CPU limits.
+Write latency also shows a major improvement over SATA SSDs, being only 1/5 of those results. It also, like the read latency, shows a fairly limited spread in results, though with a similar uptick from 4+3+25 to 4+4+24 to 4+8+20. Like read latency, I don't believe these numbers are significant enough to show a major benefit to the CPU limits.
 
 ## Conclusions
 
-Our results with NVMe drives shows some interesting differences from SATA SSDs. For sequential reads, the outlier result of the SATA drives is elimiated, but it is replaced instead with an outlier result for random writes, likely one of the most important metrics when talking of VM workloads. In addition the better CPUs are also likely impacting the results, and the limitations of the 10GbE networking really come into play here: I expect we might see some diferences if we were running on a much faster network interconnect.
+Our results with NVMe drives shows some interesting differences from SATA SSDs. For sequential reads, the outlier result of the SATA drives is eliminated, but it is replaced instead with an outlier result for random writes, likely one of the most important metrics when talking of VM workloads. In addition the better CPUs are also likely impacting the results, and the limitations of the 10GbE networking really come into play here: I expect we might see some differences if we were running on a much faster network interconnect.
 
-Based primarily on that one result, I think we can safely conclude that while there are some minor gains to be made with sequential read performance and some more major gains with random read performance, overall a CPU limit on the Ceph OSD processes does not seem to be worth the tradeoff, at least in write-heavy workloads. If your workload is extremely read-heavy, then a limit might be beneficial, but if it is more write-heavy, CPU limits seem to hurt more than help. This is in contrast to SATA SSDs on the older processors where there were clear benefits to the CPU limit.
+Based primarily on that one result, I think we can safely conclude that while there are some minor gains to be made with sequential read performance and some more major gains with random read performance, overall a CPU limit on the Ceph OSD processes does not seem to be worth the trade-offs for NVMe SSDs, at least in write-heavy workloads. If your workload is extremely random-read-heavy, then a limit might be beneficial, but if it is more write-heavy, CPU limits seem to hurt more than help. This is in contrast to SATA SSDs on the older processors where there were clear benefits to the CPU limit.
 
 The final part of this series will investigate the results if we put multiple OSDs on one NVMe drive and then rerun these same tests. Stay tuned for that in the next few months!