---
title: "Adventures in Ceph tuning, part 4"
description: "A third follow-up to my analysis of Ceph system tuning for Hyperconverged Infrastructure"
date: 2024-09-23
tags:
- PVC
- Ceph
- Development
- Systems Administration
---
In 2021, [I made a post](https://www.boniface.me/posts/pvc-ceph-tuning-adventures/) about Ceph storage tuning with [my Hyperconverged Infrastructure (HCI) project PVC](https://github.com/parallelvirtualcluster/pvc), and [in 2022](https://www.boniface.me/posts/pvc-ceph-tuning-adventures-part-2/) and [in 2023](https://www.boniface.me/posts/pvc-ceph-tuning-adventures-part-3/) I wrote two follow-ups. The original post covered some ideas about tuning, the second clarified the methodologies and tests to provide more accurate data, and the third covered the differences between SATA and NVMe SSDs as well as more modern systems.
At the end of that third part, I said:
> The final part of this series will investigate the results if we put multiple OSDs on one NVMe drive and then rerun these same tests.
Well, here is that final part!
Like parts 2 and 3, I'll jump right into the cluster specifications, changes to the tests, and results. If you haven't read the first 3 parts yet, I suggest you do so now to get the proper context before proceeding.
## The Cluster Specs (even better)
Parts 1 and 2 used my own home server setup, based on Dell R430 servers with Broadwell-era Intel Xeon CPUs, for analysis; part 3 then used a more modern AMD Epyc-based Dell system through my employer. In this part, we have significantly more powerful machines, featuring a 64-core high-speed Epyc processor, 1TB of RAM, and 2x 100GbE ports per node. Like all previous test clusters, there are 3 nodes:
| **Part**           | **node1 + node2 + node3** |
| :-------------------------------------------------------------- | :------------------------ |
| Chassis | Dell R6615 |
| CPU | 1x [AMD 9534](https://www.amd.com/en/products/processors/server/epyc/4th-generation-9004-and-8004-series/amd-epyc-9534.html) (64 core, 128 thread, 2.45 GHz base, 3.7 GHz maximum boost, 3.55 GHz all-core boost) |
| Memory | 1024 GB DDR5 (16x 64 GB) |
| OSD DB/WAL | N/A |
| OSD Data | 4x Dell CM7 6.4TB U.2 NVMe SSD |
| Networking | 2x Broadcom BCM57508 100GbE in 802.3ad (LACP) bond |
## Test Outline and Hypothesis
The primary hypothesis of this set of benchmarks is that performance scales linearly with the number of OSD processes added to the Ceph subsystem. A secondary hypothesis is that adding additional OSD processes per NVMe disk (i.e. splitting a single NVMe disk into several smaller "virtual" NVMe disks) will increase performance.
Based on the results of the last post, I've focused this test suite mostly on determining how performance scales and exactly how many OSDs optimize performance on such a powerful system. CPU sets provided some very contradictory results for NVMe drives in part 3, so I have excluded them from all of the testing here, since I do not believe them to be significantly useful in most workloads. In addition, these tests were conducted on a completely empty cluster, with no VMs active, so they truly measure the theoretical maximum performance of the Ceph subsystem on the given hardware and nothing else.
There are 3 distinct OSD configurations being tested (a sketch of how such a split can be created follows the list):
* 1 OSD process per disk (6.4TB each)
* 2 OSD processes per disk (3.2TB each)
* 4 OSD processes per disk (1.6TB each)
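PVC provisions OSDs through its own CLI and API, so the exact commands differ, but purely to illustrate the underlying mechanism, splitting a single NVMe device into multiple OSDs can be done with `ceph-volume`'s batch mode; the device path below is just an example for one of the 6.4TB drives.

```bash
# Illustrative only; PVC handles OSD provisioning itself, but a per-disk split
# like the "2 OSD processes per disk" case maps to ceph-volume's batch mode.
# The device path is an example for one 6.4TB NVMe drive.
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1

# Verify the resulting logical volumes and OSD IDs
ceph-volume lvm list
```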
Then within each configuration, I ran tests with 1 physical disk, 2 physical disks, 3 physical disks, and all 4 physical disks active, providing a 3x4 matrix of 12 test types in total.
Within each test type, I ran 3 benchmarks and averaged the results, leveraging PVC's built-in benchmarking system exactly as in the previous 2 posts, and plotted the averages.
As preparation for this test, I implemented a new benchmark format in PVC (benchmark format 2, available in version 0.9.102). In addition to the previous FIO performance output, it also collects and presents the overall (percentage) CPU and memory utilization of the testing node's Ceph monitor and OSD processes, as well as the testing node's average network throughput on the `brstorage` backend interface during the test. This helps ensure that all possible bottlenecks in the tests are accounted for and recorded, and provides useful numbers for comparison.
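The exact collection code lives inside PVC's benchmark worker, but conceptually it amounts to sampling the Ceph daemons and the backend interface while fio runs. A rough, hypothetical equivalent using standard sysstat tools might look like this:

```bash
# Rough, hypothetical approximation of what benchmark format 2 records,
# using standard sysstat tools; this is not PVC's actual implementation.

# CPU (-u) and memory (-r) of the local Ceph monitor and OSD processes,
# sampled once per second for a 60-second benchmark phase
pidstat -u -r -h -p "$(pgrep -d, -f 'ceph-mon|ceph-osd')" 1 60 > ceph-proc-stats.log &

# Per-interface network throughput over the same window
sar -n DEV 1 60 > net-stats.log &

# ... the fio benchmark phase runs concurrently here ...

wait
grep brstorage net-stats.log   # throughput on the storage backend interface
```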
The results are presented with very minimal analysis, as the goal here is mostly to present the numbers from such a large cluster, and the hypotheses are quite general. Overall conclusions are presented below the results.
## Test Results
The test results are provided as graphs of the various test phases and number of active disks for each type of test as outlined above. In addition to raw performance numbers (IOPS + latency for random I/O tests and bandwidth + latency for sequential I/O tests), I also provide graphs of the CPU utilization (total, `ceph-mon`, and `ceph-osd`), memory utilization (same), and network throughput (total, send, and receive) to provide additional context and to help spot any bottlenecks.
### Sequential I/O (high queue depth, high block size)
These tests are primarily to determine the maximum possible read and write speed of sequential data to the cluster, for instance copying large files. Within the PVC system, this is a fairly rare occurrence, but this test attempts to saturate the system with as much data as possible to find bottlenecks.
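Purely for reference, a standalone fio run approximating this phase might look like the sketch below; this is not PVC's internal job definition, just a hypothetical equivalent using fio's RBD engine (it assumes fio is built with RBD support, and the pool and image names are placeholders) with the 4M block size and 64 queue depth shown in the graphs.

```bash
# Hypothetical standalone equivalent of this phase (not PVC's internal job):
# sequential read, 4M blocks, queue depth 64, against a test RBD image.
# Requires fio built with RBD support; pool/image names are placeholders.
fio --name=seq-read --ioengine=rbd --clientname=admin \
    --pool=testpool --rbdname=testimg \
    --rw=read --bs=4M --iodepth=64 --numjobs=1 \
    --time_based=1 --runtime=60 --group_reporting
# The write variant swaps --rw=read for --rw=write.
```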
#### Read
![Sequential Read Performance (bandwidth in MB/s, 4M block size, 64 queue depth) + average request latency (µs)](seq-read-bandwidth-latency.svg)
![Sequential Read CPU Utilization](seq-read-cpu.svg)
![Sequential Read Memory Utilization](seq-read-mem.svg)
![Sequential Read Network Utilization](seq-read-net.svg)
Sequential read speed does not seem to scale significantly with the number of OSD processes, being significantly above 45GB/s in all cases. There does seem to be a peak with 2 OSD processes per NVMe disk with both 1 and 4 disks, but this trend does not hold for 2 and 3 disks, or scale evenly, so we can conclude that there is no real benefit to sequential read from having either more OSDs total, or more OSD processes on each NVMe disk.
CPU utilization remains steady for all configurations, so this particular test is not taxing on the CPU. Memory usage increases steadily, as would be expected from the increase in the number of OSD processes. Finally, network throughput is minimal, as reads are done from the "closest" OSD and thus would always remain on the same system.
#### Write
![Sequential Write Performance (bandwidth in MB/s, 4M block size, 64 queue depth) + average request latency (µs)](seq-write-bandwidth-latency.svg)
![Sequential Write CPU Utilization](seq-write-cpu.svg)
![Sequential Write Memory Utilization](seq-write-mem.svg)
![Sequential Write Network Utilization](seq-write-net.svg)
Sequential write speed scales fairly dramatically between 1 and 4 OSDs per node, more than doubling throughput, but this very quickly levels out at 4 total OSDs per node; additional OSDs do not significantly increase performance further and in fact slightly harm it, with peak performance coming from a 2-disk, 2-OSD-per-disk configuration. Latency also begins to increase significantly both as the number of disks increases and as the number of OSDs per disk increases. Thus, for write-latency-sensitive workloads, fewer total OSDs is indeed better, at the cost of overall throughput.
OSD CPU utilization does climb in a similar pattern to the performance, indicating the tight correlation between those values, and memory utilization follows the standard expected increase from more OSD processes. Network utilization reveals a possible bottleneck here, and the likely cause of both the latency increase and the performance cap: total throughput on the two NICs caps out at a combined 100Gbps in multiple tests; given that these cards are full-duplex, it would seem the bottleneck is somewhere else, perhaps in the kernel or within the CPU I/O layer. This also seems to show a theoretical maximum write performance in general of about 6.5GB/s.
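As an aside, a quick way to sanity-check where the network side tops out is to watch the bond and its member interfaces directly during a test run; the interface names below are placeholders for whatever the actual bond and NIC ports are named.

```bash
# Interface names are placeholders; adjust to the actual bond and its members.
cat /proc/net/bonding/bond0                        # LACP state, hash policy, active members
sar -n DEV 1 10 | grep -E 'bond0|ens1f0|ens1f1'    # per-member vs. aggregate throughput
```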
### Random I/O (high queue-depth, low block size)
High queue-depth random I/O is the primary real-world metric of a PVC cluster's storage subsystem, as many VMs all performing small operations is the normal storage load of a cluster. This test attempts to saturate the system with as much I/O as possible simulating VM traffic to determine the highest number of simultaneous I/O operations per second (IOPS).
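As with the sequential tests, a hypothetical standalone fio approximation of this phase (same caveats and placeholder names as before) simply swaps the access pattern and block size:

```bash
# Hypothetical equivalent: random 4k I/O at queue depth 64 against the same
# placeholder RBD image; --rw=randread for reads, --rw=randwrite for writes.
fio --name=rand-rw --ioengine=rbd --clientname=admin \
    --pool=testpool --rbdname=testimg \
    --rw=randread --bs=4k --iodepth=64 --numjobs=1 \
    --time_based=1 --runtime=60 --group_reporting
```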
#### Read
![Random Read Performance (IOPS, 4k block size, 64 queue depth) + average request latency (µs)](rand-read-bandwidth-latency.svg)
![Random Read CPU Utilization](rand-read-cpu.svg)
![Random Read Memory Utilization](rand-read-mem.svg)
![Random Read Network Utilization](rand-read-net.svg)
Random read performance shows a similar pattern of performance scaling to sequential writes, with a clear jump between 1 and 4 OSDs per node, before similarly maxing out at around 120,000 IOPS. There is a clear increase as more OSDs are added to each disk as well, showing that a single OSD process is not quite enough to get all the read performance out of the NVMe disk.
CPU and memory utilization are identical to sequential read, and network throughput, while higher, is still less than 3.5Gbps total with plenty of available performance there.
#### Write
![Random Write Performance (IOPS, 4k block size, 64 queue depth) + average request latency (µs)](rand-write-bandwidth-latency.svg)
![Random Write CPU Utilization](rand-write-cpu.svg)
![Random Write Memory Utilization](rand-write-mem.svg)
![Random Write Network Utilization](rand-write-net.svg)
Random write performance continues to show the same trend as random read and sequential write in terms of scaling from 1 to 4 OSDs before maxing out, with a similar significant jump followed by a clear plateau. Latency also follows the same trajectory as sequential write, though it is overall much lower.
CPU utilization does climb fairly significantly as the total number of OSDs increases, while memory utilization follows the same trends based on total OSD count. Network throughput is higher still, matching the performance trend and maxing out around 6Gbps of total throughput.
### Random I/O (low queue-depth, low block size)
Low queue-depth random I/O primarily tests the actual latency of individual requests in the Ceph subsystem, to determine what the theoretical "best case" is for any single I/O operation to the storage cluster, without the I/O queue adding additional latency. While most real-world applications will see at least some queue latency due to noisy neighbours, this test shows the best case.
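Again purely as a hypothetical reference, the low queue-depth variant drops `--iodepth` to 1, so each request must complete before the next is issued and the reported latency reflects a single round trip through the Ceph stack:

```bash
# Hypothetical equivalent: single outstanding 4k random I/O (queue depth 1),
# so average latency reflects one full round trip through the Ceph stack.
fio --name=rand-qd1 --ioengine=rbd --clientname=admin \
    --pool=testpool --rbdname=testimg \
    --rw=randread --bs=4k --iodepth=1 --numjobs=1 \
    --time_based=1 --runtime=60 --group_reporting
```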
#### Read
![Random Read Performance (IOPS, 4k block size, 1 queue depth) + average request latency (µs)](randlowdepth-read-bandwidth-latency.svg)
![Random Read CPU Utilization](randlowdepth-read-cpu.svg)
![Random Read Memory Utilization](randlowdepth-read-mem.svg)
![Random Read Network Utilization](randlowdepth-read-net.svg)
Single queue-depth random reads show an expected diminishing return as more OSDs are added, along with a very slight increase in latency. This is expected because individual read requests must hit more and more disks, and thus shuffle more and more data around between the PCIe bus, CPU, and main memory.
CPU utilization is consistently low, while memory follows the usual trend. Network throughput is also low but follows a slight downward trajectory matching the slight drops in performance.
#### Write
![Random Write Performance (IOPS, 4k block size, 1 queue depth) + average request latency (µs)](randlowdepth-write-bandwidth-latency.svg)
![Random Write CPU Utilization](randlowdepth-write-cpu.svg)
![Random Write Memory Utilization](randlowdepth-write-mem.svg)
![Random Write Network Utilization](randlowdepth-write-net.svg)
Single queue-depth random writes show a similar diminishing return to reads, with spikier latency.
CPU, memory, and network utilization show similar trends as well, though network throughput is overall quite a bit higher, as expected.
## Overall Conclusions and Key Lessons Learned
For sequential read I/O, there is a clear benefit to more OSDs up to a point, with the maximum performance falling somewhere around 8 total OSD processes on 4 disks, or 4 total OSD processes on 1 disk. Overall for random reads there does not seem to be a clear performance benefit to more disks beyond 2, but there is a benefit to more OSD processes on those disks, especially with 1 disk.
For sequential write I/O, there is a similar benefit to more OSDs up to about 4 total, at which point it seems that network throughput - or, more specifically, throughput between the internal system components - becomes a bottleneck. CPU utilization also remains fairly low at ~10% maximum, so this does not appear to be a constraint on these systems.
Overall, while sequential performance is purely synthetic, these results do help us draw some useful conclusions: namely, that 1-2 disks per node is plenty, and that ~2 OSD processes per disk provides close to optimal performance.
For random read I/O, the return from more OSD processes is significant, with noticeable scaling, though nowhere close to linear once above 4 total OSD processes per node.
For random write I/O, the story is similar, with a clear peak around 4 total OSD processes per node, but whether these are on 1 or 4 disks does not seem to matter. Where the bottleneck seems to be on this system would need to be investigated further.
Overall, random I/O performance definitely points to a sweet spot of 4 OSD processes per node, though this appears to be a clear maximum; one disk per node with 4 OSD processes can also maintain the same throughput.
For low queue-depth I/O, fewer OSDs are, as expected, better in terms of performance, but latency seems to scale fairly consistently with no major changes; thus, the number of disks does not significantly alter this metric for better or worse.
Thus, our overall conclusion would be that somewhere between 1 and 4 OSD processes is optimal with these systems, and that this seems to hold true regardless of the number of actual underlying disks. In terms of our hypotheses, we can definitely say that scaling is not linear beyond 4 OSD processes, and that in any case where there is a clear benefit, it does not matter whether this is achieved purely with multiple processes on one disk or with multiple disks. Given this particular hardware configuration, the optimal choice would be 2 OSD processes per disk with all 4 disks being utilized, at least from a performance standpoint; more disks of course mean more space, which is more likely to be the constraint in real deployments versus a few percentage points of performance.
Thank you for joining me for this long-running series, which has hopefully provided some useful insight into the scaling of storage performance within PVC and in small Ceph clusters more broadly. Hopefully you found this information useful; I know I have, and I have made countless tweaks to the PVC system as a result! Happy serving.