Update cluster documentation

Add a TOC, add additional sections, improve wording in some sections, spellcheck.
2020-03-25 15:38:00 -04:00
parent d84e94eff4
commit 97a560fcbe
1 changed files with 77 additions and 29 deletions
--- a/docs/architecture/cluster.md
+++ b/docs/architecture/cluster.md
@@ -1,32 +1,62 @@
 # PVC Cluster Architecture considerations
 - [PVC Cluster Architecture considerations](#pvc-cluster-architecture-considerations)
  * [Node Specifications: Considering the size of nodes](#node-specifications--considering-the-size-of-nodes)
  * [Storage Layout: Ceph and OSDs](#storage-layout--ceph-and-osds)
  * [Physical network considerations](#physical-network-considerations)
  * [Network Layout: Considering the required networks](#network-layout--considering-the-required-networks)
    + [PVC system networks](#pvc-system-networks)
        - [Upstream: Connecting the nodes to the wider world](#upstream--connecting-the-nodes-to-the-wider-world)
        - [Cluster: Connecting the nodes with each other](#cluster--connecting-the-nodes-with-each-other)
        - [Storage: Connecting Ceph OSD with each other](#storage--connecting-ceph-osd-with-each-other)
    + [PVC client networks](#pvc-client-networks)
        - [Bridged (unmanaged) Client Networks](#bridged--unmanaged--client-networks)
        - [VXLAN (managed) Client Networks](#vxlan--managed--client-networks)
        - [Other Client Networks](#other-client-networks)
  * [Node Layout: Considering how nodes are laid out](#node-layout--considering-how-nodes-are-laid-out)
    + [Node Functions: Coordinators versus Hypervisors](#node-functions--coordinators-versus-hypervisors)
        - [Coordinators](#coordinators)
            * [The Primary Coordinator](#the-primary-coordinator)
        - [Hypervisors](#hypervisors)
    + [Geographic redundancy](#geographic-redundancy)
  * [Example Configurations](#example-configurations)
    + [Basic 3-node cluster](#basic-3-node-cluster)
    + [Mid-sized 8-node cluster with 3 coordinators](#mid-sized-8-node-cluster-with-3-coordinators)
    + [Large 17-node cluster with 5 coordinators](#large-17-node-cluster-with-5-coordinators)
 This document contains considerations the administrator should make when preparing for and building a PVC cluster. It includes four main subsections: node specifications, storage specifications, network layout, and node layout, plus a fifth section featuring diagrams of 3 example topologies.
 It is important that prospective PVC administrators read this document *thoroughly* before deploying a cluster to ensure they understand the requirements, caveats, and important details about how PVC operates.
 ## Node Specifications: Considering the size of nodes
-Each node in the cluster must be sized based on the needs of the cluster and the load placed on it. In general, taller nodes are better for performance and allow for a more powerful cluster on less hardware, though the needs of each specific environment and workload my affect this differently.
+PVC nodes, especially coordinator nodes, run a significant number of software applications in addition to the virtual machines (VMs). It is therefore extremely important to size the systems correctly for the expected workload while planning both for redundancy and future capacity. In general, taller nodes are better for performance, providing a more powerful cluster on fewer physical machines, though each workload may be different in this regard.
-At a bare minimum, each node should have the following specifications:
+The following table provides bare-minimum, recommended, and optimal specifications for a cluster. The bare-minimum specification would be suitable for testing or a small lab, but not for production use. The recommended specification would be suitable for a small production cluster running lightweight VMs. The optimal specification would be ideal for a demanding, resource-intensive production cluster. Note that each column lists the minimum resources for its configuration (i.e. testing, light production, heavy production); actual usage will likely require more resources than those presented here.
-* 12x 1.8GHz or better Intel/AMD cores from at least the Nehalem/Bulldozer eras (~2008 or newer)
+| Resource | Minimum | Recommended | Optimal |
-* 48GB of RAM
+|--------------|-----------|---------------|----------|
-* 2x 1Gbps Ethernet interfaces
+| CPU generation | Intel Nehalem (2008) / AMD Bulldozer (2011) | Intel Sandy Bridge (2011) / AMD Naples (2017) | Intel Haswell (2013) / AMD Rome (2019) |
-* 1x 10GB+ system disk (SSD/HDD/USB/SD/eMMC flash)
+| CPU cores (per node) | 4x @1.8GHz | 8x @2.0GHz | 12x @2.2GHz |
-* 1x 400GB+ OSD data disk (SSD)
+| RAM (per node) | 16GB | 48GB | 64GB |
 | System disk (SSD/HDD/USB/SD/eMMC) | 1x 10GB | 2x 10GB RAID-1 | 2x 32GB RAID-1 |
 | Data disk (SSD only) | 1x 200GB | 1x 400GB | 2x 400GB |
 | Network interfaces | 1x 1Gbps | 2x 1Gbps LAG | 2x 10Gbps LAG |
 | Total CPU cores (healthy) | 12x | 24x | 36x |
 | Total CPU cores (n-1) | 8x | 16x | 24x |
 | Total RAM (healthy) | 48GB | 144GB | 192GB |
 | Total RAM (n-1) | 32GB | 96GB | 128GB |
 | Total disk space | 200GB | 400GB | 800GB |
-For a cluster of 3 such nodes, this will provide a total of:
+Of these totals, some amount of CPU and RAM will be used by the storage subsystem and the PVC daemons themselves, meaning that the total available for virtual machines is slightly less. Generally, each OSD data disk will consume 1 vCPU at load and 1-2GB RAM, so nodes should be sized not only according to the VM workload, but also according to the number of storage disks per node. Additionally, the coordinator databases will use further CPU resources and up to 1-4GB of RAM per node, though there is generally little need to spec coordinators any larger than non-coordinator nodes, and the VM automatic node selection process will take used RAM into account by default.
-* 36 total CPU cores
+Care should also be taken to examine the "healthy" versus "n-1" total resource availability. Under normal operation, PVC will use all the available resources; however, the total cluster utilization should never exceed the "n-1" quantity, otherwise automatic recovery from a single-node failure may be impacted.
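The "healthy" and "n-1" totals in the table can be reproduced with a small calculation. The following is a minimal sketch, assuming a 3-node cluster and a replication level of 3 copies; the `NodeSpec` class and `cluster_totals` function are hypothetical names used only for illustration and are not part of PVC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node resources (hypothetical helper, not a PVC type)."""
    cores: int        # CPU cores per node
    ram_gb: int       # RAM per node, in GB
    osd_disks: int    # number of OSD data disks per node
    osd_size_gb: int  # size of each OSD data disk, in GB


def cluster_totals(spec, nodes=3, copies=3):
    """Return healthy and n-1 totals, plus usable replicated storage.

    Assumes all nodes are identical and that every object is stored
    `copies` times, so usable space is raw space divided by `copies`.
    """
    raw_disk_gb = spec.osd_disks * spec.osd_size_gb * nodes
    return {
        "cores_healthy": spec.cores * nodes,
        "cores_n1": spec.cores * (nodes - 1),
        "ram_healthy_gb": spec.ram_gb * nodes,
        "ram_n1_gb": spec.ram_gb * (nodes - 1),
        "disk_usable_gb": raw_disk_gb // copies,
    }


# The "Recommended" column from the table above.
recommended = NodeSpec(cores=8, ram_gb=48, osd_disks=1, osd_size_gb=400)
totals = cluster_totals(recommended)
print(totals)
```

Planned VM utilization can then be compared against the `*_n1` values rather than the healthy values, which is the check the paragraph above describes.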
 * 144GB RAM
 * 400GB usable Ceph storage space (`copies=3`)
 Of this, some amount of CPU and RAM will be used by the storage subsystem and the PVC daemons themselves, meaning that the total available for virtual machines is slightly less. Generally, each OSD data disk will consume 1 vCPU at load and 1-2GB RAM, so nodes should be sized not only according to the VM workload, but the number of storage disks per node. Additionally the coordinator databases will use additional RAM and CPU resources of up to 1-4GB per node, though there is generally little need to spec coordinators any larger than non-coordinator nodes and the VM automatic node selection process will take used RAM into account by default.
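As a rough illustration of the overhead described above, the sketch below estimates what remains for VMs on a single node. It assumes the upper bounds quoted in the text (1 vCPU and 2GB RAM per OSD, 4GB RAM for coordinator databases) as a worst case; the function name and parameters are hypothetical, and real overhead will vary with load.

```python
def vm_capacity_per_node(cores, ram_gb, osd_disks, coordinator=True,
                         osd_ram_gb=2, coordinator_ram_gb=4):
    """Estimate the cores and RAM left for VMs on one node.

    Worst-case assumptions taken from the text: each OSD uses 1 vCPU at
    load and up to `osd_ram_gb` of RAM; a coordinator uses up to
    `coordinator_ram_gb` of additional RAM for its databases.
    """
    vm_cores = cores - osd_disks
    vm_ram_gb = ram_gb - osd_disks * osd_ram_gb
    if coordinator:
        vm_ram_gb -= coordinator_ram_gb
    return vm_cores, vm_ram_gb


# A "Recommended" node (8 cores, 48GB RAM, 1 OSD) acting as a coordinator.
print(vm_capacity_per_node(cores=8, ram_gb=48, osd_disks=1))
```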
 ## Storage Layout: Ceph and OSDs
-The Ceph subsystem of PVC, if enabled, creates a "hyperconverged" setup whereby storage and VM hypervisor functions are collocated onto the same physical servers. The performance of the storage must be taken into account when sizing the nodes as mentioned above.
+The Ceph subsystem of PVC, if enabled, creates a "hyperconverged" cluster whereby storage and VM hypervisor functions are collocated onto the same physical servers. The performance of the storage must be taken into account when sizing the nodes as mentioned above.
-The Ceph system is laid out similar to the other daemons. The Ceph Monitor and Manager functions are delegated to the Coordinators over the cluster network, with all nodes connecting to these hosts to obtain the CRUSH maps and select OSD disks. OSDs are then distributed on all hosts, including non-coordinator hypervisors, and communicate with clients over the cluster network and with each other (for replication, failover, etc.) over the storage network.
+The Ceph system is laid out similarly to the other daemons. The Ceph Monitor and Manager functions are delegated to the Coordinators over the storage network, with all nodes connecting to these hosts to obtain the CRUSH maps and select OSD disks. OSDs are then distributed on all hosts, including non-coordinator hypervisors, and communicate with clients and each other over the storage network.
 PVC Ceph pools make use of the replication mechanism of Ceph to store multiple copies of each object, thus ensuring that data is always available even when a host is unavailable. Note that, mostly for performance reasons related to rewrites and random I/O, erasure coding is *not* supported in PVC.
@@ -36,17 +66,19 @@ Non-default values can also be set at pool creation time. For instance, one coul
 Replication levels cannot be changed within PVC once a pool is created, however they can be changed via manual Ceph commands on a coordinator should the administrator require this. In any case, the administrator should carefully consider sizing, failure domains, and performance when selecting storage devices to ensure the right level of resiliency versus data usage for their use-case and cluster size.
 ## Physical network considerations
 At a minimum, a production PVC cluster should use at least two 1Gbps Ethernet interfaces, connected in an LACP or active-backup bond on one or more switches. On top of this bond, the various cluster networks should be configured as vLANs. PVC is able to support configurations without 802.1q vLAN support using multiple physical interfaces and no bridged client networks, but this is strongly discouraged.
 More advanced physical network layouts are also possible. For instance, one could have two isolated networks. On the first network, each node has two 10Gbps Ethernet interfaces, which are combined in a bond across two redundant switch fabrics and handle the upstream and cluster networks. On the second network, each node has an additional two 10Gbps Ethernet interfaces, which are also combined in a bond across the redundant switch fabrics and handle the storage network. This configuration could support up to 10Gbps of aggregate client traffic while also supporting 10Gbps of aggregate storage traffic. Even more complex network configurations are possible if the cluster requires such performance. See the [Example Configurations](#example-configurations) section for some examples.
 ## Network Layout: Considering the required networks
 A PVC cluster needs, at minimum, 3 networks in order to function properly. Each of the three networks and its function is detailed below. An additional two sections cover the two kinds of client networks and the considerations for them.
-### Physical network considerations
+### PVC system networks
-At a minimum, a production PVC cluster should use at least two 1Gbps Ethernet interfaces, connected in an LACP or active-backup bond on one or more switches. On top of this bond, the various cluster networks should be configured as vLANs.
+#### Upstream: Connecting the nodes to the wider world
 More advanced physical network layouts are also possible. For instance, one could have two isolated networks. On the first network, each node has two 10Gbps Ethernet interfaces, which are combined in a bond across two redundant switch fabrics and handle the upstream and cluster networks. On the second network, each node has an additional two 10Gbps Ethernet interfaces, which are also combined in a bond across the redundant switch fabrics and handle the storage network. This configuration could support up to 10Gbps of aggregate client traffic while also supporting 10Gbps of aggregate storage traffic. Even more complex network configurations are possible if the cluster requires such performance. See the [Example Configurations](#example-configurations) section for some examples.
 ### Upstream: Connecting the nodes to the wider world
 The upstream network functions as the main upstream for the cluster nodes, providing Internet access and a way to route managed client network traffic out of the cluster. In most deployments, this should be an RFC1918 private subnet with an upstream router which can perform NAT and firewalling as required, both for the cluster nodes themselves and for the RFC1918 managed client networks.
@@ -80,7 +112,7 @@ For example, for a 3+ node cluster, up to about 90 nodes, the following configur
 For even larger clusters, a `/23` or even larger network may be used.
-### Cluster: Connecting the nodes with each other
+#### Cluster: Connecting the nodes with each other
 The cluster network is an unrouted private network used by the PVC nodes to communicate with each other for database access and Libvirt migrations. It is also used as the underlying interface for the BGP EVPN VXLAN interfaces used by managed client networks.
@@ -88,9 +120,9 @@ The floating IP address in the cluster network can be used as a single point of
 Nodes in this network are generally assigned IPs automatically based on their node number (e.g. node1 at `.1`, node2 at `.2`, etc.). The network should be large enough to include all nodes sequentially.
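The node-number-to-address convention described above can be sketched with the Python standard library. This is only an illustration: the `10.0.0.0/24` subnet and the `node_address` function are assumptions made for the example, not PVC configuration values, and the sketch does not reserve any address for the floating IP.

```python
import ipaddress


def node_address(network, node_number):
    """Return the host address for a node based on its node number.

    node1 maps to the first host address (.1), node2 to the second, and
    so on. Raises ValueError if the network is too small to hold the node.
    """
    net = ipaddress.ip_network(network)
    # Exclude the network address (index 0) and the broadcast address.
    max_node = net.num_addresses - 2
    if not 1 <= node_number <= max_node:
        raise ValueError(
            f"node{node_number} does not fit in {net} (max {max_node} nodes)"
        )
    return str(net[node_number])


# Hypothetical cluster network used only for this example.
for n in (1, 2, 3):
    print(f"node{n}: {node_address('10.0.0.0/24', n)}")
```

The range check reflects the requirement that the network be large enough to include all nodes sequentially.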
-Generally the cluster network should be completely separate from the upstream network, either a separate physical interface (or set of bonded interfaces) or a dedicated vLAN on an underlying physical device, but they can be colocated if required.
+Generally the cluster network should be completely separate from the upstream network, either a separate physical interface (or set of bonded interfaces) or a dedicated vLAN on an underlying physical device, but they can be collocated if required.
-### Storage: Connecting Ceph OSD with each other
+#### Storage: Connecting Ceph OSD with each other
 The storage network is an unrouted private network used by the PVC node storage OSDs to communicate with each other, without using the main cluster network and introducing potentially large amounts of traffic there.
@@ -100,25 +132,27 @@ Nodes in this network are generally assigned IPs automatically based on their no
 The administrator may choose to collocate the storage network on the same physical interface as the cluster network, or on a separate physical interface. This should be decided based on the size of the cluster and the perceived ratios of client network versus storage traffic. In large (>3 node) or storage-intensive clusters, this network should generally be a separate set of fast physical interfaces, separate from both the upstream and cluster networks, in order to maximize and isolate the storage bandwidth.
-### Bridged (unmanaged) Client Networks
+### PVC client networks
 #### Bridged (unmanaged) Client Networks
 The first type of client network is the unmanaged bridged network. These networks have a separate vLAN on the device underlying the other networks, which is created when the network is configured. VMs are then bridged into this vLAN.
 With this client network type, PVC does no management of the network. This is left entirely to the administrator. It requires switch support and the configuration of the vLANs on the switchports of each node's physical interfaces before enabling the network.
-### VXLAN (managed) Client Networks
+#### VXLAN (managed) Client Networks
 The second type of client network is the managed VXLAN network. These networks make use of BGP EVPN, managed by route reflection on the coordinators, to create virtual layer 2 Ethernet tunnels between all nodes in the cluster. VXLANs are then run on top of these virtual layer 2 tunnels, with the active primary PVC node providing routing, DHCP, and DNS functionality to the network via a single IP address.
 With this client network type, PVC is in full control of the network. No vLAN configuration is required on the switchports of each node's physical interfaces, as the virtual layer 2 tunnel travels over the cluster layer 3 network. All client network traffic destined for outside the network will exit via the upstream network interface of the active primary coordinator node. NOTE: This may introduce a bottleneck and tromboning if there is a large amount of external and/or inter-network traffic on the cluster. The administrator should consider this carefully when sizing the cluster network.
-### Other Client Networks
+#### Other Client Networks
 Future PVC versions may support other client network types, such as direct-routing between VMs.
 ## Node Layout: Considering how nodes are laid out
-A production-grade PVC cluster requires 3 nodes running the PVC Daemon software. 1-node clusters are supported for very small clusters, homelabs, and testing, but provide no redundancy; they should not be used in production situations.
+A production-grade PVC cluster requires 3 nodes running the PVC Daemon software. 1-node clusters are supported for very small clusters, home labs, and testing, but provide no redundancy; they should not be used in production situations.
 ### Node Functions: Coordinators versus Hypervisors
@@ -151,9 +185,23 @@ PVC gracefully handles transitioning primary coordinator state, to minimize down
 Hypervisors consist of all other PVC nodes in the cluster. For small clusters (3 nodes), there will generally not be any non-coordinator nodes, though adding a 4th would require it to be a hypervisor to preserve quorum between the coordinators. Larger clusters should generally add new nodes as Hypervisors rather than coordinators to preserve the small set of coordinator nodes previously mentioned.
 ### Geographic redundancy
 PVC supports geographic redundancy of nodes in order to facilitate disaster recovery scenarios when uptime is critical. Functionally, PVC behaves the same regardless of whether the 3 or more coordinators are in the same physical location, or remote physical locations.
 When using geographic redundancy, there are several caveats to keep in mind:
 * The Ceph storage subsystem is latency-sensitive. With the default replication configuration, at least 2 writes must succeed for the write to return a success, so the total write latency of a write on any system may be equal to the maximum latency between any two nodes. It is recommended to keep all PVC nodes as "close" latency-wise as possible or storage performance may suffer.
 * The inter-node PVC networks must be layer-2 networks (broadcast domains). These networks must be spanned to all nodes in all locations.
 * The number of sites and positioning of coordinators at those sites is important. A majority (at least 2 in a 3-coordinator cluster, or 3 in a 5-coordinator cluster) of coordinators must be able to reach each other in a failure scenario for the cluster as a whole to remain functional. Thus, configurations such as 2 + 1 or 3 + 2 split across 2 sites do *not* provide full redundancy, and the whole cluster will be down if the majority site is down. It is thus recommended to always have an odd number of sites to match the odd number of coordinators, for instance a 1 + 1 + 1 or 2 + 2 + 1 configuration. Also note that all hypervisors must be able to reach the majority coordinator group or their storage will be impacted as well.
 If these requirements cannot be fulfilled, it may be best to have separate PVC clusters at each site and handle service redundancy at a higher layer to avoid a major disruption.
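The coordinator placement rule above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that a site failure removes all coordinators at that site and that the remaining sites can still reach each other; the function name is hypothetical.

```python
def survives_any_site_failure(coordinators_per_site):
    """Return True if a coordinator majority remains after losing any one site.

    `coordinators_per_site` lists the number of coordinators at each site,
    e.g. [2, 1] for a 2 + 1 layout across two sites.
    """
    total = sum(coordinators_per_site)
    majority = total // 2 + 1  # strict majority of all coordinators
    return all(total - lost >= majority for lost in coordinators_per_site)


# The layouts discussed in the list above.
for layout in ([2, 1], [3, 2], [1, 1, 1], [2, 2, 1]):
    print(layout, survives_any_site_failure(layout))
```

Under these assumptions the two-site layouts fail the check, because losing the larger site leaves fewer than a majority, while the three-site layouts pass.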
 ## Example Configurations
-This section provides diagrams of 3 possible node configurations, providing an idea of the sort of cluster topologies supported by PVC.
+This section provides diagrams of 3 possible node configurations. These diagrams can be extrapolated out to almost any possible configuration and number of nodes.
 #### Basic 3-node cluster