From 1ecfa0bb6e2bb75423fbce1c8542a0c77ec5e0ea Mon Sep 17 00:00:00 2001
From: "Joshua M. Boniface"
Date: Fri, 1 Oct 2021 02:59:05 -0400
Subject: [PATCH] Correct spelling errors

---
 content/pvc-ceph-tuning-adventures.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/content/pvc-ceph-tuning-adventures.md b/content/pvc-ceph-tuning-adventures.md
index de2f93a..93d949e 100644
--- a/content/pvc-ceph-tuning-adventures.md
+++ b/content/pvc-ceph-tuning-adventures.md
@@ -14,7 +14,7 @@ In early 2018, I started work on [my Hyperconverged Infrastructure (HCI) project
 
-Ceph is a distributed, replicated, self-managing, self-healing object store, with exposes 3 primary interfaces: a raw object store, a block device emulator, and a POSIX filesystem. Under the hood, at least in recent releases, it makes use of a custom block storage system called Bluestore with entirely removes a filesystem and OS tuning from the equation. Millions of words have been written about Ceph, its interfaces, and Bluestore elsewhere, so I won't bore you with rehashed eulogies of its benefits here.
+Ceph is a distributed, replicated, self-managing, self-healing object store, which exposes 3 primary interfaces: a raw object store, a block device emulator, and a POSIX filesystem. Under the hood, at least in recent releases, it makes use of a custom block storage system called Bluestore, which entirely removes filesystem and OS tuning from the equation. Millions of words have been written about Ceph, its interfaces, and Bluestore elsewhere, so I won't bore you with rehashed eulogies of its benefits here.
 
-In the typical PVC usecase, we have 3 nodes, each running the Ceph monitor and manager, as well as 2 to 4 OSDs (Object Storage Daemons, what Ceph calls its disks and their management processes). It's a fairly basic Ceph configuratio, and I use exactly one feature on top: the block device emulator, RBD (RADOS Block Device), to provide virtual machine disk images to KVM.
+In the typical PVC use-case, we have 3 nodes, each running the Ceph monitor and manager, as well as 2 to 4 OSDs (Object Storage Daemons, what Ceph calls its disks and their management processes). It's a fairly basic Ceph configuration, and I use exactly one feature on top: the block device emulator, RBD (RADOS Block Device), to provide virtual machine disk images to KVM.
 
 The main problem comes when Ceph is placed under heavy load. It is very CPU-bound, especially when writing random data, and further the replication scheme means that it is also network- and disk- bound in some cases. But primarily, the CPU speed (both in frequency and IPC) is the limiting factor.
 
@@ -36,7 +36,7 @@ Emboldened by the sheer performance of the drives, I quickly implemented OSD DB
 
 ## The Bane of Hyperconverged Architectures: Sharing Resources
 
-I quicky noticed a slight problem, however. My home cluster, which was doing these tests, is a bit of a hodge-podge of server equipment, and runs a fair number (68 at the time of testing) of virtual machines across its 3 nodes. The hardware breakdown is as follows:
+I quickly noticed a slight problem, however. My home cluster, which was doing these tests, is a bit of a hodge-podge of server equipment, and runs a fair number (68 at the time of testing) of virtual machines across its 3 nodes. The hardware breakdown is as follows:
 
 | **Part**     | **node1**            | **node2 + node3** |
 | :-------- | :------- | :------------- |
@@ -44,11 +44,11 @@ I quicky noticed a slight problem, however. My home cluster, which was doing the
 | CPU | 2x [Intel E5-5649](https://ark.intel.com/content/www/us/en/ark/products/52581/intel-xeon-processor-e5649-12m-cache-2-53-ghz-5-86-gt-s-intel-qpi.html) | 1x [Intel E5-2620 v4](https://ark.intel.com/content/www/us/en/ark/products/92986/intel-xeon-processor-e52620-v4-20m-cache-2-10-ghz.html) |
 | Memory | 144 GB (18x 8 GB) | 128 GB (4x 32 GB) |
 
-The VMs themselves also range from basically-idle to very CPU-intensive, with a wide range of vCPU allocations. I quicky realized that there might be another tuning aspect to consider: CPU (and NUMA, for `node1`) pinning.
+The VMs themselves also range from basically-idle to very CPU-intensive, with a wide range of vCPU allocations. I quickly realized that there might be another tuning aspect to consider: CPU (and NUMA, for `node1`) pinning.
 
 I decided to try implementing a basic CPU pinning scheme with the `cpuset` Linux utility. This tool allows the administrator to create static `cset`s, which are logical groups assigned to specific CPUs, and then place processes - either during runtime or at process start - into these `cset`s. So, in addition to testing the Optane drives, I decided to also test Optane-less configurations whereby specific numbers of cores (and their corresponding hyperthreads) were dedicated to the Ceph OSDs instead of all CPUs shared by both OSDs, VMs, and PVC host daemons.
 
-Ultimately, the dispirate configurations here do present potential problems in interpreting the results, however within this particular cluster the comparisons are valid, and I do hope to repeat these tests (and update this post) in the future when I'm able to simplify and unify the server configurations.
+Ultimately, the disparate configurations here do present potential problems in interpreting the results; however, within this particular cluster the comparisons are valid, and I do hope to repeat these tests (and update this post) in the future when I'm able to simplify and unify the server configurations.
 
 ## Test Explanation
 
@@ -68,7 +68,7 @@ The results were fairly interesting, to say the least. First, I'll present the 6
 * No-O, C=4: No Optane, `cpuset` OSD group with 4 CPU cores (+ hyperthreads, on `node1` within CPU0 NUMA domain)
 * No-O, C=6: No Optane, `cpuset` OSD group with 6 CPU cores (+ hyperthreads, on `node1` within CPU0 NUMA domain)
 
-It's worth noting that the 5th test left just 2 CPU cores (+ hyperthreads) to run VMs - the performance inside them was definitely suboptimal!
+It's worth noting that the 5th test left just 2 CPU cores (+ hyperthreads) to run VMs - the performance inside them was definitely sub-optimal!
 
 Each test, in each configuration mode, was run 3 times, with the results presented here being an average of the results of the 3 tests.
 
@@ -101,7 +101,7 @@ For reads, the performance is nearly identical, and almost within margin-of-erro
 
-For writes, the performance shows some very noteworthy results. The Optane drive makes a noticeable, thought not significant, difference in the write performance, likely due to the WAL. A larger drive, and thus larger WAL, might make an even more significant improvement. The `cpuset` tuning, for 2 and 4 CPU `cset`s`, seems to make no difference over no limiting; however once the limit was raised to 6 CPU cores, write performance did increase somewhat, though not as noticeably as with the Optane cache.
+For writes, the performance shows some very noteworthy results. The Optane drive makes a noticeable, though not significant, difference in the write performance, likely due to the WAL. A larger drive, and thus larger WAL, might make an even more significant improvement. The `cpuset` tuning, for 2 and 4 CPU `cset`s, seems to make no difference over no limiting; however, once the limit was raised to 6 CPU cores, write performance did increase somewhat, though not as noticeably as with the Optane cache.
 
-The two main takeaways from these tests seem to be that (a) Optane database/WAL drives do have a noticeable effect on write performance; and (b) that dedicating 3 (or potentially more) CPU cores per OSD *increases* write performance while *decreasing* read performance. The increase in write performance would seem to indicate a CPU bottleneck is occurring with the lower CPU counts (or when contending with VM/`fio` processes), but this does not match the results of the read tests, which in the same situation should increase as well. One possible explanation might lie in the Ceph monitor processes, which direct clients to block on OSDs, but in no test did I see the `ceph-mon` process become a signfiicant CPU user. Perhaps more research into the inner workings of Ceph OSDs and CRUSH maps will reveal the source of this apparent contradiction, but at this time I can not explain it.
+The two main takeaways from these tests seem to be that (a) Optane database/WAL drives do have a noticeable effect on write performance; and (b) that dedicating 3 (or potentially more) CPU cores per OSD *increases* write performance while *decreasing* read performance. The increase in write performance would seem to indicate a CPU bottleneck is occurring with the lower CPU counts (or when contending with VM/`fio` processes), but this does not match the results of the read tests, which in the same situation should increase as well. One possible explanation might lie in the Ceph monitor processes, which direct clients to block on OSDs, but in no test did I see the `ceph-mon` process become a significant CPU user. Perhaps more research into the inner workings of Ceph OSDs and CRUSH maps will reveal the source of this apparent contradiction, but at this time I cannot explain it.
 
 ### Random I/O Performance
 
@@ -124,13 +124,13 @@ For writes, the Optane drive is a clear winner, reducing the average latency by
 
 ## Overall Conclusions and Takeaways
 
-Going into this project, I had hoped that both the Optane drive and the `cpuset` core dedication would make profound, dramatic, and consistent differences to the Ceph performance. However, the results instead show, like much in the realm of computer storage, tradeoffs and caveats. As takeaways from the project, I have the following 4 main thoughts:
+Going into this project, I had hoped that both the Optane drive and the `cpuset` core dedication would make profound, dramatic, and consistent differences to the Ceph performance. However, the results instead show, like much in the realm of computer storage, trade-offs and caveats. As takeaways from the project, I have the following 4 main thoughts:
 
 1. For write-heavy workloads, especially random writes, an Optane DB/WAL device can make a not-insignificant difference in overall performance. However, the money spent on an Optane drive might better be spent elsewhere...
 2. CPU is, as always with Ceph, king. The more CPU cores you can get in your machine, and the faster those CPU cores are, the better, even ignoring the VM side of the equation. Going forward I will definitely be allocating more than my original 1 CPU core per OSD assumption into my overall CPU core count calculations.
-3. While I have not been able to definitely test and validate it myself, it seems that `cpuset` options are, at best, only worthwhile in very read-heavy usecases and in cases where VMs are extremely noisy neighbours and there are insufficient physical CPU cores to satiate them. While there is a marked increase in random write performance, the baseline matching the 4-core limit seems to show that the effect would be minimized the more cores there are for both workloads to use.
+3. While I have not been able to definitively test and validate it myself, it seems that `cpuset` options are, at best, only worthwhile in very read-heavy use-cases and in cases where VMs are extremely noisy neighbours and there are insufficient physical CPU cores to satiate them. While there is a marked increase in random write performance, the baseline matching the 4-core limit seems to show that the effect would be minimized the more cores there are for both workloads to use.
 4. While it was not exactly tested here, memory performance would certainly make a difference to read performance. I expect that reads would be much higher if all nodes were using the latest DDR4 memory.
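For reference, the `cpuset` scheme the patched article describes might be set up with the `cset` front-end roughly as follows. This is an illustrative command fragment under stated assumptions, not taken from the post: the set name `osd`, the core list (physical cores 0-5 plus hyperthread siblings 12-17 on an assumed 12-core/24-thread node), and the `pgrep`-based process selection are all hypothetical and would need adjusting to the actual topology.

```shell
# Create a static cset for the Ceph OSDs on 6 physical cores plus their
# hyperthread siblings (assumed 12-core/24-thread layout; adjust to the
# node's real NUMA/SMT topology, e.g. per lscpu output).
cset set --cpu=0-5,12-17 --set=osd

# Move the running ceph-osd processes (and their threads) into the cset;
# pgrep -d, emits a comma-separated PID list.
cset proc --move --threads --pid="$(pgrep -d, ceph-osd)" --toset=osd

# Verify the placement.
cset set --list
cset proc --list --set=osd
```

Because this fragment manipulates kernel cpusets, it requires root and the `cpuset` package; the same pinning effect could alternatively be achieved with systemd's `CPUAffinity=` or libvirt's `vcpupin`, but `cset` is the tool the article names.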