Fix additional content errors and undraft

This commit is contained in:
Joshua Boniface 2021-10-01 03:05:09 -04:00
parent 1ecfa0bb6e
commit eb3f4c63e8
1 changed file with 13 additions and 13 deletions

+++
date = "2021-10-01T00:34:00-04:00"
tags = ["systems administration", "pvc","ceph"]
title = "Adventures in Ceph tuning"
description = "An analysis of Ceph system tuning for Hyperconverged Infrastructure"
type = "post"
weight = 1
draft = false
+++
In early 2018, I started work on [my Hyperconverged Infrastructure (HCI) project PVC](https://github.com/parallelvirtualcluster/pvc). Very quickly, I decided to use Ceph as the storage backend, for a number of reasons, including its built-in host-level redundancy, self-managing and self-healing functionality, and general good performance. With PVC now being used in numerous production clusters, I decided to tackle optimization. This turned out to be a bit of a rabbit hole, which I will detail below. Happy reading.
## Ceph: A Primer
Ceph is a distributed, replicated, self-managing, self-healing object store, which exposes 3 primary interfaces: a raw object store, a block device emulator, and a POSIX filesystem. Under the hood, at least in recent releases, it makes use of a custom block storage system called Bluestore, which entirely removes the filesystem and its OS tuning from the equation. Millions of words have been written about Ceph, its interfaces, and Bluestore elsewhere, so I won't bore you with rehashed eulogies of its benefits here.
In the typical PVC use-case, we have 3 nodes, each running the Ceph monitor and manager, as well as 2 to 4 OSDs (Object Storage Daemons, what Ceph calls its disks and their management processes). It's a fairly basic Ceph configuration, and I use exactly one feature on top: the block device emulator, RBD (RADOS Block Device), to provide virtual machine disk images to KVM.
The main problem comes when Ceph is placed under heavy load. It is very CPU-bound, especially when writing random data, and further, the replication scheme means that it is also network- and disk-bound in some cases. But primarily, CPU speed (in both frequency and IPC) is the limiting factor.
After having one cluster placed under extreme load by a client application's PostgreSQL database, I began looking into additional tuning, in order to squeeze every bit of performance I could out of the storage layer. The disks we are using are nothing special: fairly standard SATA SSDs with relatively low performance and endurance. But with upgrade costs being a concern, and the monitoring graphs showing plenty of raw disk performance left on the table, I turned my attention to the Ceph layer, with very interesting results.
## Ceph Tuning: A Dead End
The first thought was, of course, to tune the Ceph parameters themselves. Unfortunately for me, or, perhaps, fortunately for everyone, there isn't much to tune here. Using the Nautilus release (14.x) with the Bluestore backing store, most of the defaults seem to be close to optimal already. In fact, despite finding some Red Hat blogs to the contrary, I found that almost nothing I could change would make any appreciable difference to the performance of the Ceph cluster. I had to go deeper.
## The Ceph OSD Database and WAL
With Ceph Bluestore, there are 3 main components of an OSD: the main data block device, the database block device, and the write-ahead log (WAL). In the most basic configuration, all 3 are placed on the same disk. However, Ceph provides the option to move the database (and WAL, if it is large enough) onto a separate block device. It isn't correct to call this a "cache", except in a general, technically-incorrect sense: the database houses mostly metadata about the objects stored on the OSD, and the WAL handles sequential write journaling, and can thus be thought of as similar to a RAID controller write cache, but not precisely the same. In this configuration, one can leverage a very fast device - for example, an Intel Optane SSD - to handle metadata and WAL operations for a relatively "slow" SSD block device, and thus in theory increase performance.
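As a rough sketch of what this split looks like in practice, an OSD with its database (and WAL) on a separate fast device can be created with `ceph-volume`; the device paths below are assumptions for illustration only:

```shell
# Create a Bluestore OSD whose data lives on a SATA SSD, with the
# RocksDB database (and, implicitly, the WAL) placed on a partition
# of a faster NVMe/Optane device. Device paths are illustrative.
ceph-volume lvm create \
    --bluestore \
    --data /dev/sdb \
    --block.db /dev/nvme0n1p1

# A separate --block.wal is only needed when a third, distinct device
# is used; otherwise the WAL lives alongside the database.
```

The same layout can also be achieved after the fact by migrating an existing OSD's DB onto the new device, though creating the OSD fresh with the split is the simpler path.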
I quickly noticed a slight problem, however. My home cluster, which was doing th…
| Component | `node1` | `hv2` |
| :-------- | :------- | :------------- |
| Chassis | HP Proliant DL-360 G6 | Dell R430 |
| CPU | 2x [Intel Xeon E5649](https://ark.intel.com/content/www/us/en/ark/products/52581/intel-xeon-processor-e5649-12m-cache-2-53-ghz-5-86-gt-s-intel-qpi.html) | 1x [Intel Xeon E5-2620 v4](https://ark.intel.com/content/www/us/en/ark/products/92986/intel-xeon-processor-e52620-v4-20m-cache-2-10-ghz.html) |
| Memory | 144 GB DDR3 (18x 8 GB) | 128 GB DDR4 (4x 32 GB) |
The VMs themselves also range from basically-idle to very CPU-intensive, with a wide range of vCPU allocations. I quickly realized that there might be another tuning aspect to consider: CPU (and NUMA, for `node1`) pinning.
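For reference, VM-side pinning of the sort considered here can be sketched with `virsh`; the domain name and core numbers are hypothetical, chosen only to illustrate keeping a guest within one NUMA domain:

```shell
# Pin vCPU 0 of a hypothetical guest "myvm" to physical core 2, and
# vCPU 1 to core 3, keeping the guest inside a single NUMA domain.
# Domain name and core numbers are illustrative.
virsh vcpupin myvm 0 2
virsh vcpupin myvm 1 3

# Show the resulting vCPU-to-physical-CPU pinning map for the guest.
virsh vcpupin myvm
```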
Ultimately, the disparate configurations here do present potential problems in i…
## Test Explanation
The benchmarks themselves were run with the system in production, running the full set of VMs. This was done both for practical reasons and to simulate a real-world scenario with numerous noisy neighbours. While this might affect a single random test, I ran 3 tests each and staggered them over time to minimize the impact of bursty VM effects. Further, the `cpuset` tuning would be fairly moot without additional real load on the nodes, and thus I believe this to be a worthwhile trade-off. A future addition to the results might be to run a similar set of tests against an empty cluster, and if and when I am able to do so, I will add to this post.
The tests were run with PVC's in-built benchmark system, which creates a new, dedicated Ceph RBD volume and then runs the `fio` tests against it directly using the `rbd` engine. To ensure `fio` itself was not limited by noisy neighbours, the node running the tests was flushed of VMs.
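As an illustration of the shape of such a test, a roughly equivalent standalone `fio` run using the `rbd` engine might look like the following; the pool and volume names and the job parameters are assumptions for illustration, not PVC's exact benchmark settings:

```shell
# 4k random-write test directly against a (pre-existing) RBD volume
# using fio's built-in rbd ioengine - no kernel mapping or VM in the
# data path. Pool/volume names and job parameters are placeholders.
fio --name=pvc-benchmark \
    --ioengine=rbd \
    --clientname=admin \
    --pool=vms \
    --rbdname=benchmark-volume \
    --rw=randwrite \
    --bs=4k \
    --iodepth=64 \
    --numjobs=1 \
    --time_based \
    --runtime=60
```

Varying `--rw` (e.g. `read`, `write`, `randread`) and `--bs` (e.g. `4k` vs `64k`) produces the sequential and random test matrix discussed below.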
For the 3 `cpuset` tests, the relevant `cset` configuration was applied to all 3 nodes, regardless of the number of VMs or the load within them, and the `fio` process was put inside the "VM" `cpuset`. Thus the CPUs set aside for the OSDs were completely dedicated to them.
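With the `cset` tool, this kind of split can be sketched roughly as follows; the core numbering matches a hypothetical 4-core OSD set on a 12-core/24-thread node, and the set names are my own illustration:

```shell
# Reserve cores 0-3 (plus their hyperthread siblings 12-15) for the
# Ceph OSDs, and the remainder for VMs. Core numbering is
# machine-specific and illustrative.
cset set --cpu=0-3,12-15 --set=osd
cset set --cpu=4-11,16-23 --set=vm

# Move the running OSD processes into the dedicated set...
cset proc --move --pid=$(pgrep -d, ceph-osd) --toset=osd

# ...and ensure the benchmark runs inside the "VM" set, as the real
# VM workloads would.
cset proc --move --pid=$(pgrep -d, fio) --toset=vm
```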
The results were fairly interesting, to say the least. First, I'll present the 6…
* No-O, C=4: No Optane, `cpuset` OSD group with 4 CPU cores (+ hyperthreads, on `node1` within CPU0 NUMA domain)
* No-O, C=6: No Optane, `cpuset` OSD group with 6 CPU cores (+ hyperthreads, on `node1` within CPU0 NUMA domain)
It's worth noting that the 5th test left just 2 CPU cores (+ hyperthreads) to run VMs on `hv2` - the performance inside them was definitely sub-optimal!
Each test, in each configuration mode, was run 3 times, with the results presented here being an average of the results of the 3 tests.
For reads, the performance is nearly identical, and almost within margin-of-error…
For writes, the performance shows some very noteworthy results. The Optane drive makes a noticeable, though not significant, difference in the write performance, likely due to the WAL. A larger drive, and thus larger WAL, might make an even more significant improvement. The `cpuset` tuning, for the 2- and 4-CPU `cset`s, seems to make no difference over no limiting; however, once the limit was raised to 6 CPU cores, write performance did increase somewhat, though not as noticeably as with the Optane cache.
The two main takeaways from these tests seem to be that (a) Optane database/WAL drives do have a noticeable effect on write performance; and (b) that dedicating 3 (or potentially more) CPU cores per OSD *increases* write performance while *decreasing* read performance. The increase in write performance would seem to indicate a CPU bottleneck is occurring with the lower CPU counts (or when contending with VM/`fio` processes), but this does not match the results of the read tests, which in the same situation should increase as well. One possible explanation might lie in the Ceph monitor processes, which direct clients to data objects on OSDs and were in the "VM" `cset`, but in no test did I see the `ceph-mon` process become a significant CPU user. Perhaps more research into the inner workings of Ceph OSDs and CRUSH maps will reveal the source of this apparent contradiction, but at this time I cannot explain it.
### Random I/O Performance
Going into this project, I had hoped that both the Optane drive and the `cpuset`…
1. For write-heavy workloads, especially random writes, an Optane DB/WAL device can make a not-insignificant difference in overall performance. However, the money spent on an Optane drive might better be spent elsewhere...
2. CPU is, as always with Ceph, king. The more CPU cores you can get in your machine, and the faster those CPU cores are, the better, even ignoring the VM side of the equation. Going forward I will definitely be allocating more than my original 1 CPU core per OSD assumption into my overall CPU core count calculations, with 4 cores per OSD being a good baseline.
3. While I have not been able to definitively test and validate it myself, it seems that `cpuset` options are, at best, only worthwhile in very read-heavy use-cases, and in cases where VMs are extremely noisy neighbours and there are insufficient physical CPU cores to satiate them. While there is a marked increase in random I/O performance, the baseline write performance matching the 4-core limit seems to show that the effect would be minimized the more cores there are for both workloads to use, and the seemingly-dramatic read improvement might be due to the age of some of the CPUs in my particular cluster. More investigation is definitely warranted.
4. While it was not exactly tested here, memory performance would certainly make a difference to read performance. Like with CPUs, I expect that read rates would be much higher if all nodes were using the latest DDR4 memory.
Hopefully this analysis of my recent Ceph tuning adventures was worthwhile, and that you learned something. And of course, I definitely welcome any comments, suggestions, or corrections!