Implement configurable replcfg (documentation)

Implements administrator-selectable replication configurations for new pools in PVC clusters, overriding the default of copies=3,mincopies=2.
2019-08-23 22:14:28 -04:00
parent 7c4d18691a
commit 5158cec0ec
1 changed files with 6 additions and 2 deletions
--- a/docs/architecture/cluster.md
+++ b/docs/architecture/cluster.md
@@ -28,9 +28,13 @@ The Ceph subsystem of PVC, if enabled, creates a "hyperconverged" setup whereby

 The Ceph system is laid out similar to the other daemons. The Ceph Monitor and Manager functions are delegated to the Coordinators over the cluster network, with all nodes connecting to these hosts to obtain the CRUSH maps and select OSD disks. OSDs are then distributed on all hosts, including non-coordinator hypervisors, and communicate with clients over the cluster network and with each other (for replication, failover, etc.) over the storage network.

-Without exception for proper redundancy, Ceph pools on the cluster use the `copies=3` `mincopies=2` replication scheme. That is to say, for each 4MB "object" the cluster stores, it will store 3 copies on 3 different nodes; if one copy becomes unavailable, due to a node maintenance or failure, the other 2 copies continue to enable read/write access to the cluster; if two copies become unavailable, writes to the cluster will block however reads will still proceed from the single remaining copy, allowing recovery. More than 3 nodes running OSD disks increases the resiliency of the cluster, however object placement is decided at write time and is evenly distributed across the cluster, so even in very large clusters only 1 node can be down at one time and writes guaranteed to succeed.
+PVC Ceph pools use Ceph's replication mechanism to store multiple copies of each object, so that data remains available even when a host is unavailable. Note that, mostly for performance reasons related to rewrites and random I/O, erasure coding is *not* supported in PVC.

-In this configuration, therefore, each 1MB of storage at the VM layer consumes 3MB (3 copies) of storage at the raw disk layer. Size OSD disks accordingly to ensure sufficient storage space and performance. Future versions of PVC may support more complex Ceph storage layouts, such as `copies=4` `mincopies=2` or multiple-parity Erasure Coding pools.
+The default replication level for a new pool is `copies=3, mincopies=2`. This stores 3 copies of each object, with a host-level failure domain, and allows I/O as long as 2 copies are available. Thus, in a cluster of any size, all data remains fully available even if a single host becomes unavailable. It does, however, use 3x the space for each piece of data stored, which must be considered when sizing the disk space for the cluster: a pool in this configuration, running on 3 nodes each with a single 400GB disk, will effectively have 400GB of total space available for use (1200GB raw divided by 3 copies). Additionally, on such a 3-node cluster, new disks must be added in groups of 3 spread across the nodes in order to take advantage of the additional space, since each write requires placing one copy on each of the 3 hosts.
+
+Non-default values can also be set at pool creation time. For instance, one could create a `copies=3, mincopies=1` pool, which would allow I/O with two hosts down but would leave the cluster susceptible to a write hole should a disk fail in this state. Alternatively, for more resilience, one could create a `copies=4, mincopies=2` pool, which would allow 2 hosts to fail without a write hole, but would consume 4x the space for each piece of data stored and require new disks to be added in groups of 4 instead. Practically any combination of values is possible; however, these 3 are the most relevant for most use-cases, and for most clusters, especially small ones, the default is sufficient to provide solid redundancy and guard against host failures until the administrator can respond.
+
+Replication levels cannot be changed within PVC once a pool is created; however, they can be changed via manual Ceph commands on a coordinator should the administrator require this. In any case, the administrator should carefully consider sizing, failure domains, and performance when selecting storage devices to ensure the right level of resiliency versus data usage for their use-case and cluster size.

 ## Network Layout: Considering the required networks
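
The sizing arithmetic in the replication paragraphs added by this change can be checked with a small sketch. This is illustrative only and not part of PVC: the `ReplCfg` class and its method names are hypothetical, and the calculation assumes equally sized nodes, a host-level failure domain, and at least as many nodes as copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplCfg:
    """Hypothetical helper modelling a copies/mincopies pool setting."""
    copies: int
    mincopies: int

    def __post_init__(self):
        if not 1 <= self.mincopies <= self.copies:
            raise ValueError("need 1 <= mincopies <= copies")

    def usable_gb(self, nodes: int, raw_gb_per_node: float) -> float:
        """Usable space, assuming identical nodes and one copy per host."""
        if nodes < self.copies:
            raise ValueError("host-level failure domain needs nodes >= copies")
        return nodes * raw_gb_per_node / self.copies

    def hosts_down_with_io(self) -> int:
        """Hosts that can be unavailable while I/O is still allowed."""
        return self.copies - self.mincopies


default = ReplCfg(copies=3, mincopies=2)
print(default.usable_gb(nodes=3, raw_gb_per_node=400))  # 3 x 400GB raw, 3 copies
print(default.hosts_down_with_io())
```

Under these assumptions the default configuration reproduces the 400GB figure from the text, and `ReplCfg(3, 1)` and `ReplCfg(4, 2)` both report two tolerated host outages, differing only in space overhead and in how many copies must remain for I/O to proceed.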