Compare commits

...

120 Commits

Author SHA1 Message Date
f486b2d3ae Fix incorrect version 2024-12-27 21:22:28 -05:00
4190b802a5 Fix link name 2024-12-27 21:19:41 -05:00
0e48068d19 Fix links 2024-12-27 21:18:01 -05:00
98aff8fb4c Add missing step 2024-12-27 21:16:06 -05:00
f891b3c501 Fix hostnames 2024-12-27 21:13:24 -05:00
d457f74d39 Fix indenting 2024-12-27 21:12:19 -05:00
c26034381b Improve formatting 2024-12-27 21:08:17 -05:00
a782113d44 Remove header title 2024-12-27 20:46:52 -05:00
1d2b9d7a99 Add TOC for getting started guide 2024-12-27 20:45:30 -05:00
6db2201c24 Improve wording 2024-12-27 20:44:42 -05:00
1f419ddc64 Be more forceful 2024-12-27 20:37:50 -05:00
8802b28034 Fix incorrect heading 2024-12-27 20:29:46 -05:00
0abbc34ba4 Improve georedundancy documentation 2024-12-27 20:25:47 -05:00
f22fae8277 Add automirror references 2024-11-18 11:00:12 -05:00
f453bd6ac5 Add automirror and adjust snapshot age 2024-11-16 13:47:48 -05:00
ab800a666e Adjust emojis consistently and add net warning 2024-11-14 12:16:39 -05:00
8d584740ea Update Docs index 2024-10-25 23:53:18 -04:00
10ed0526d1 Update README badge order 2024-10-25 23:48:11 -04:00
03aecb1a26 Update README 2024-10-25 23:45:51 -04:00
13635eb883 Update README 2024-10-25 23:38:36 -04:00
ba46e8c3ef Update README 2024-10-25 23:29:16 -04:00
043fa0da7f Fix formatting 2024-10-25 03:03:06 -04:00
6769bdb086 Unify formatting 2024-10-25 03:01:15 -04:00
9039bc0b9d Fix indentation 2024-10-25 02:57:24 -04:00
004be3de16 Add OVA mention 2024-10-25 02:56:09 -04:00
5301e47614 Rework some wording 2024-10-25 02:54:48 -04:00
8ddd9dc965 Move the provisioner guide 2024-10-25 02:52:51 -04:00
fdabdbb52d Remove mention of obsolete worksheet 2024-10-25 02:50:47 -04:00
0e10395419 Bump the base Debian version 2024-10-25 02:50:13 -04:00
1dd8e07a55 Add notes about mirrors 2024-10-25 02:48:26 -04:00
add583472a Fix typo 2024-10-25 02:43:21 -04:00
93ef4985a5 Up to 8 spaces 2024-10-25 02:39:47 -04:00
17244a4b96 Add sidebar to API reference 2024-10-25 02:38:59 -04:00
3973d5d079 Add more indentation 2024-10-25 02:37:07 -04:00
35fe7c64b6 Add trailing spaces too 2024-10-25 02:36:39 -04:00
ecffc2412c Try more spaces 2024-10-25 02:35:07 -04:00
43bc38b998 Try more newlines 2024-10-25 02:33:50 -04:00
0aaa28d5df Try backticks 2024-10-25 02:32:22 -04:00
40647b785c Try to fix formatting of hosts example 2024-10-25 02:31:05 -04:00
968855eca8 Fix formatting 2024-10-25 02:29:31 -04:00
e81211e3c6 Fix formatting 2024-10-25 02:28:26 -04:00
c6eddb6ece Fix formatting 2024-10-25 02:25:47 -04:00
e853f972f1 Fix indents 2024-10-25 02:22:37 -04:00
9c6ed63278 Update the Getting Started documentation 2024-10-25 02:13:08 -04:00
7ee0598b08 Update swagger spec 2024-10-19 11:48:52 -04:00
f809bf8166 Update API doc for 0.9.101/102 2024-10-18 01:29:17 -04:00
1b15c92e51 Update the description of VM define endpoint 2024-10-01 13:31:07 -04:00
4dc77a66f4 Add proper response schema for 202 responses 2024-10-01 13:26:23 -04:00
f940b2ff44 Update API documentation and link 2024-09-30 20:51:12 -04:00
1c3eec48ea Update spec for upcoming release 2024-09-07 12:33:09 -04:00
e0081f73f8 Update Swagger spec for 0.9.99 2024-08-28 11:35:32 -04:00
1a615dbf50 Fix invalid ref 2024-08-19 16:56:32 -04:00
36c2237f6c Update swagger docs 2024-08-19 16:49:52 -04:00
d43ee44a0a Update Swagger doc for 0.9.94 2024-05-27 09:31:55 -04:00
54a00497db Add white transparent logo 2024-05-27 09:31:26 -04:00
b9f3bcbb00 Update Software diagram for 0.9.86+ 2024-01-11 10:37:40 -05:00
b1e06dbf54 Add migration max downtime metafield for VMs 2024-01-10 16:30:52 -05:00
1eeb3bd778 Add Zookeeper metric endpoint and update descrs 2024-01-10 16:30:32 -05:00
3b11a74597 Update Metrics endpoint details 2023-12-25 03:05:26 -05:00
a424e420b5 Remove WebUI from README 2023-12-25 02:49:53 -05:00
a07c995a5a Add VNC info to screenshots 2023-12-11 03:41:07 -05:00
33d67fe03a Remove debug output from image 2023-12-11 03:14:26 -05:00
863202293c Fix output bugs in VM information 2023-12-11 03:05:03 -05:00
6f7f0a834e Finish missing sentence 2023-12-11 02:39:51 -05:00
da61d92a67 Add Grafana dashboard screenshot 2023-12-11 00:39:41 -05:00
ac0bab2b29 Update index.md to match project README 2023-12-10 23:53:09 -05:00
d28246a15b Add API endpoints for 0.9.83 and 0.9.84 2023-12-09 23:47:42 -05:00
b9500034c7 Add new features and version link 2023-10-27 09:47:50 -04:00
e4372b354c Further point tweaks 2023-10-24 16:34:06 -04:00
105c122a6d Revamp a few more points 2023-10-24 16:32:31 -04:00
44b9278cb6 Fix some entryies 2023-10-24 16:09:38 -04:00
a6f29bd350 Spice up initial tagline 2023-10-24 16:05:57 -04:00
c3ae2ae622 Add core features to About page 2023-10-24 16:04:54 -04:00
7e72a0cd66 Add VM backup and restore API endpoints 2023-10-24 02:13:54 -04:00
b86033f2f3 Update API swagger definitions 2023-10-03 09:43:00 -04:00
eaf9b6927c Update API JSON for 0.9.78 2023-10-01 15:24:54 -04:00
5d3ec9a793 Fix bad link 2023-09-21 22:51:48 -04:00
8832a81fa7 Correct spelling errors 2023-09-21 22:50:53 -04:00
09b485988a Fix spelling errors 2023-09-21 22:50:02 -04:00
8964e0aa3c Reorganize node placement 2023-09-21 22:48:20 -04:00
0185853873 Adjust wording in final recommendations 2023-09-21 22:44:32 -04:00
70766dfef2 Avoid saying significant too much 2023-09-21 22:43:29 -04:00
9c1ed0bc57 Add line explaining the diagram 2023-09-21 22:42:04 -04:00
9168792e51 Fix image link 2023-09-21 22:39:00 -04:00
1673448228 Add Georedundancy documentation 2023-09-21 22:13:11 -04:00
3638f3ff21 Mention software req of monitoring 2023-09-21 00:31:35 -04:00
defe4719c5 Remove year ranges 2023-09-20 22:31:09 -04:00
49f391206d Add architecture notes 2023-09-20 22:30:24 -04:00
216cd4426c Fix spelling of ProLiant 2023-09-20 22:26:06 -04:00
cb408b506d Improve wording of N-1 section 2023-09-20 22:25:17 -04:00
e49091f6d4 Mention that fancing only occurs to run state nodes 2023-09-17 20:30:43 -04:00
5262cabaff Move video to top and adjust wording 2023-09-17 13:05:04 -04:00
1172745a96 Fix video embedding 2023-09-17 13:01:05 -04:00
b1e39ff4af Add video to Fencing article 2023-09-17 12:58:19 -04:00
2f998069f6 Fix remaining links 2023-09-17 00:46:31 -04:00
476eddc0f6 Try fixing nested links 2023-09-17 00:43:11 -04:00
8f490a6bfb Rehash table titles for width 2023-09-17 00:41:22 -04:00
f3c513a262 Fix about link 2023-09-17 00:40:27 -04:00
ff0b00683d Remove absolute paths from md links 2023-09-17 00:37:18 -04:00
0777823695 Fix bad links 2023-09-17 00:08:10 -04:00
a9dde4b65e Update wording 2023-09-17 00:04:14 -04:00
0372584f08 Fix bad table row 2023-09-17 00:03:22 -04:00
d8ebc7de1f Reorganize documentation 2023-09-16 23:59:55 -04:00
28950a4d90 Fix broken table and situations 2023-09-16 23:55:50 -04:00
836a61708e Try adjusting width randomly 2023-09-16 23:52:09 -04:00
d3de778ca3 Remove unneeded column 2023-09-16 23:50:23 -04:00
ac7b25dac1 Update navigation pages 2023-09-16 23:46:55 -04:00
daddf13b01 Add fencing documentation 2023-09-16 23:46:18 -04:00
bd541bbe49 Remove double word 2023-09-16 18:40:08 -04:00
47be134ef9 Mention PERC cards 2023-09-16 18:36:34 -04:00
436506d3e2 Remove spurious reference to Ceph 2023-09-16 18:34:39 -04:00
5e078ed193 Correct references to Fencing page 2023-09-16 17:52:39 -04:00
3dbd86a898 Correct spelling and grammar 2023-09-16 15:20:41 -04:00
e2a0ff2c1c Move explanation paragraphs above TOCs 2023-09-16 14:55:39 -04:00
e64f50924c Add hardware requirements document 2023-09-16 14:52:41 -04:00
b0acfa0db7 Fix bad image element 2023-09-16 13:21:12 -04:00
243186af35 Add header paragraph 2023-09-16 13:15:19 -04:00
a336f01952 Rewrite cluster architecture document 2023-09-16 13:12:52 -04:00
a3c22a5c77 WIP cluster architecture revamp 2023-09-15 13:34:12 -04:00
45fa715a27 Remove .bak file 2023-09-15 13:18:29 -04:00
40 changed files with 4570 additions and 1159 deletions

View File

@ -1,10 +1,11 @@
<p align="center">
<img alt="Logo banner" src="docs/images/pvc_logo_black.png"/>
<img alt="Logo banner" src="https://docs.parallelvirtualcluster.org/en/latest/images/pvc_logo_black.png"/>
<br/><br/>
<a href="https://www.parallelvirtualcluster.org"><img alt="Website" src="https://img.shields.io/badge/visit-website-blue"/></a>
<a href="https://github.com/parallelvirtualcluster/pvc/releases"><img alt="Latest Release" src="https://img.shields.io/github/release-pre/parallelvirtualcluster/pvc"/></a>
<a href="https://docs.parallelvirtualcluster.org/en/latest/?badge=latest"><img alt="Documentation Status" src="https://readthedocs.org/projects/parallelvirtualcluster/badge/?version=latest"/></a>
<a href="https://github.com/parallelvirtualcluster/pvc"><img alt="License" src="https://img.shields.io/github/license/parallelvirtualcluster/pvc"/></a>
<a href="https://github.com/psf/black"><img alt="Code style: Black" src="https://img.shields.io/badge/code%20style-black-000000.svg"/></a>
<a href="https://github.com/parallelvirtualcluster/pvc/releases"><img alt="Release" src="https://img.shields.io/github/release-pre/parallelvirtualcluster/pvc"/></a>
<a href="https://docs.parallelvirtualcluster.org/en/latest/?badge=latest"><img alt="Documentation Status" src="https://readthedocs.org/projects/parallelvirtualcluster/badge/?version=latest"/></a>
</p>
## What is PVC?
@ -19,41 +20,13 @@ As a consequence of its features, PVC makes administrating very high-uptime VMs
PVC also features an optional, fully customizable VM provisioning framework, designed to automate and simplify VM deployments using custom provisioning profiles, scripts, and CloudInit userdata API support.
Installation of PVC is accomplished by two main components: a [Node installer ISO](https://github.com/parallelvirtualcluster/pvc-installer) which creates on-demand installer ISOs, and an [Ansible role framework](https://github.com/parallelvirtualcluster/pvc-ansible) to configure, bootstrap, and administrate the nodes. Installation can also be fully automated with a companion [cluster bootstrapping system](https://github.com/parallelvirtualcluster/pvc-bootstrap). Once up, the cluster is managed via an HTTP REST API, accessible via a Python Click CLI client or WebUI.
Installation of PVC is accomplished by two main components: a [Node installer ISO](https://github.com/parallelvirtualcluster/pvc-installer) which creates on-demand installer ISOs, and an [Ansible role framework](https://github.com/parallelvirtualcluster/pvc-ansible) to configure, bootstrap, and administrate the nodes. Installation can also be fully automated with a companion [cluster bootstrapping system](https://github.com/parallelvirtualcluster/pvc-bootstrap). Once up, the cluster is managed via an HTTP REST API, accessible via a Python Click CLI client ~~or WebUI~~ (eventually).
Just give it physical servers, and it will run your VMs without you having to think about it, all in just an hour or two of setup time.
More information about PVC, its motivations, the hardware requirements, and setting up and managing a cluster [can be found over at our docs page](https://docs.parallelvirtualcluster.org).
## What is it based on?
## Documentation
The core node and API daemons, as well as the CLI API client, are written in Python 3 and are fully Free Software (GNU GPL v3). In addition to these, PVC makes use of the following software tools to provide a holistic hyperconverged infrastructure solution:
This repository contains the MkDocs configuration for the https://docs.parallelvirtualcluster.org ReadTheDocs page.
* Debian GNU/Linux as the base OS.
* Linux KVM, QEMU, and Libvirt for VM management.
* Linux `ip`, FRRouting, NFTables, DNSMasq, and PowerDNS for network management.
* Ceph for storage management.
* Apache Zookeeper for the primary cluster state database.
* Patroni PostgreSQL manager for the secondary relation databases (DNS aggregation, Provisioner configuration).
## Getting Started
To get started with PVC, please see the [About](https://docs.parallelvirtualcluster.org/en/latest/about/) page for general information about the project, and the [Getting Started](https://docs.parallelvirtualcluster.org/en/latest/getting-started/) page for details on configuring your first cluster.
## Changelog
View the changelog in [CHANGELOG.md](CHANGELOG.md).
## Screenshots
While PVC's API and internals aren't very screenshot-worthy, here is some example output of the CLI tool.
<p><img alt="Node listing" src="docs/images/pvc-nodes.png"/><br/><i>Listing the nodes in a cluster</i></p>
<p><img alt="Network listing" src="docs/images/pvc-networks.png"/><br/><i>Listing the networks in a cluster, showing 3 bridged and 1 IPv4-only managed networks</i></p>
<p><img alt="VM listing and migration" src="docs/images/pvc-migration.png"/><br/><i>Listing a limited set of VMs and migrating one with status updates</i></p>
<p><img alt="Node logs" src="docs/images/pvc-nodelog.png"/><br/><i>Viewing the logs of a node (keepalives and VM [un]migration)</i></p>

View File

@ -2,6 +2,8 @@
title: "About PVC"
---
This document outlines the basic ideas and inspiration behind PVC as well as the core feature set, the underlying technology, and some frequently asked questions.
[TOC]
## Project Motivation
@ -22,6 +24,71 @@ PVC aims to bridge the gaps between these 3 categories. Like the larger FLOSS an
In short, it is a Free Software, scalable, redundant, self-healing, and self-managing private cloud solution designed with administrator simplicity in mind.
## Core Features
All features are as of the latest version: <a href="https://github.com/parallelvirtualcluster/pvc/releases"><img alt="Release" src="https://img.shields.io/github/release-pre/parallelvirtualcluster/pvc"/></a>
### Overall/Nodes
* Node-level redundancy & node N-1 fault tolerance
* Cluster- and node-level monitoring
* Stable base operating system (5+ year support)
* Convenient, holistic view of the cluster (resources, devices, VMs, etc.) via CLI and API
* Deployment, management, updates, and base OS upgrades via straightforward [Ansible playbooks](https://github.com/parallelvirtualcluster/pvc-ansible) and [a custom installer ISO](https://github.com/parallelvirtualcluster/pvc-installer)
* External [bootstrap system](https://github.com/parallelvirtualcluster/pvc-bootstrap) for low-touch cluster deployment
* Cluster-level backup and restore
* Node hot add/remove from service (flush/unflush/restore) for maintenance
* Automatic fencing of unresponsive node(s) and recovery of affected VMs (conditional)
* Cluster maintenance state (allows monitoring/alerting pause while performing maintenance)
* Included Munin and CheckMK monitoring plugins
### Virtual Machine Management
* Full VM lifecycle management (start/stop/restart/shutdown/disable)
* Live-migration (zero-downtime move) of VMs between nodes
* Automatic restarting of failed VMs
* (For supporting VMs) Serial console logging with interactive follow
* VNC console support with flexible listen directives
* Simple resource management (vCPU/memory) w/restart
* Hot attach/detach of virtual NICs and block devices
* Tag support for organization/classification
* VM hot/online snapshot creation (disks + configuration), with incremental image support, management (delete), and restore
* VM autobackups with self-contained backup rotation and optional automatic mounting of remote storage resources
* VM snapshot shipping to external clusters (mirroring) and mirror promotion
* VM automirrors with self-contained snapshot rotation for regular creation of mirrors
### Network Management
* Bridged (vLAN), Managed (VXLAN, virtual), and Direct (SR-IOV) VM networks
* Consistent cluster view (all nodes are provisioned with all networks) for Bridged and Managed VM networks
* DHCP, DNS, NTP, and TFTP support for Managed VM networks
* Upstream BGP for route learning for Managed VM networks
### Storage Management
* Distributed & replicated self-healing storage backend (Ceph Object Store) with high availability and node-level redundancy
* Shared storage for VM storage volumes/virtual disks (Ceph RBD)
* Integrated monitoring and alerting into PVC frontend
* Zero-cost snapshots
* Flexible pool replication configurations for large or complex clusters
* Support for arbitrary data disk sizes (with limits)
### Provisioning
* Integrated, highly flexible VM provisioning system
* Define custom Python 3 install scripts or use included examples for common OSes
* Amazon EC2-compatible CloudInit "userdata" support
* Define dynamic VM profiles from component templates (system, network, disk), scripts, and userdata
* OVA VM package support
* Virtual disk import (raw, VMDK, qcow2, and others) support
* Volume cloning support (cloning VMs)
### Other
* Free, Libre and Open Source (FLOSS) software
* Written in modern Python 3
* Well-maintained and frequently updated
## Building Blocks
PVC itself is a series of software daemons (services) written in Python 3, with the CLI interface also written in Python 3, designed to glue other FLOSS tools together in order to provide a consistent cluster operation and management experience.
@ -63,7 +130,7 @@ If all you want is a simple home server solution, or you demand scalability beyo
For a redundant cluster, yes. PVC requires a majority quorum for proper operation at various levels, and the smallest possible majority quorum is 2-of-3; thus 3 nodes is the smallest safe minimum. That said, you can run PVC on a single node for testing/lab purposes without host-level redundancy, should you wish to do so, and it might also be possible to run 2 "main" systems with a 3rd "quorum observer" hosting only the management tools but no VMs; however these options are not officially supported, as PVC is designed primarily for 3+ node operation.
For more details, see the [Cluster Architecture page](/deployment/cluster-architecture).
For more details, see the [Cluster Architecture page](architecture/cluster-architecture.md).
#### Does PVC support containers (Docker/Kubernetes/LXC/etc.)?

View File

@ -0,0 +1,194 @@
---
title: Cluster Architecture
---
This document details the general architecture of a PVC cluster, including its main components, how they interoperate, how nodes are laid out, and the networks required by a cluster. This document must be read and understood to properly administrate a PVC cluster.
[TOC]
## Operating System
PVC is designed to run on top of Debian GNU/Linux, of any version since Debian 10.x "Buster". This is the operating system installed by the [PVC Node installer](https://github.com/parallelvirtualcluster/pvc-installer) and expected by the [PVC Ansible deployment playbooks](https://github.com/parallelvirtualcluster/pvc-ansible). No other operating systems, including derivatives of Debian, are supported by PVC.
At this time, only the `amd64` (`x86_64`) architecture is supported by PVC, though this may change in future versions.
PVC requires Python 3, at least version 3.7. It is designed to operate using the system Python3 packages of the supported Debian releases, and does not require a virtualenv, PIP, or other similar tools.
## PVC Terminology
PVC uses several terms throughout that are important to define.
* cluster: A PVC "cluster" is a collection of PVC "nodes" that function together in a redundant configuration. Management is done against a cluster, and resources allocated, assigned, and run on the cluster as a whole.
* node: PVC clusters consist of several "nodes", which are physical server computers running the PVC system daemons.
* virtual machine/VM: A guest operating system running inside of the KVM hypervisor on top of a PVC cluster. VMs are assigned to individual nodes, but can float across the various nodes in the cluster using live-migration (if supported by the VM configuration).
* OSD: PVC clusters allocate storage for virtual machines from a set of special disks called "OSDs"; this term is borrowed from the terminology of the underlying Ceph storage subsystem.
* pool: Storage in the PVC cluster is allocated to "pools", which reside on a set of OSDs using a given replication profile.
* volume: Individual storage "volumes" for VMs are allocated to storage pools.
* network: PVC makes use of several different networks (vLANs or VXLANs) of different classes.
## Cluster Layout
The following diagrams show both a 3-node cluster and an 8-node cluster for comparison, along with how their networks (detailed below) are laid out both physically and logically. These diagrams can be referenced during the following sections to understand the layout of the cluster.
[![Small 3-node cluster](images/pvc-3-node-cluster.png)](images/pvc-3-node-cluster.png)
[![Large 8-node cluster](images/pvc-8-node-cluster.png)](images/pvc-8-node-cluster.png)
## Software Stack
The software stack of a PVC node consists of several daemons as well as the core operating system, and can be represented by the following diagram:
[![PVC software stack](images/pvc-software.png)](images/pvc-software.png)
The 3 classes of node roles are shown, as will be discussed below.
## Node Roles
### Coordinator
Coordinators are nodes which contain the various databases and core daemons which cannot be scaled indefinitely across the entire cluster. This includes the Zookeeper database instances, Patroni database instances, FRRouting instances, and the PVC API and provisioner worker.
Coordinators are decided at deploy time, and the number of coordinators as well as their network addresses cannot be changed at runtime without extensive downtime and reconfiguration. Thus extreme care must be taken to choose the optimal coordinator configuration and network layout when deploying the cluster.
A normal cluster will contain either 3 or 5 coordinators, depending on its size. For clusters of between 3 and 12 nodes, 3 coordinators is generally sufficient; for larger clusters, 5 coordinators may provide additional resiliency as two (2) could then be brought down for maintenance instead of just one (1) while still retaining quorum.
### Primary Coordinator
The primary coordinator takes on some additional, special functionality from the other coordinators during runtime. This includes the active instance of the PVC API and provisioner worker, dnsmasq to provide managed network services, the Patroni leader, and some PVC-internal services for statistics collection.
Only a single coordinator is primary at any given time, and the primary coordinator is selected from the available coordinators at runtime via either automatic selection (contention) or administrator intervention.
The primary role transitions between coordinators via a lockstep migration process taking approximately 5-10 seconds, during which some of the provided services, including the API, are unavailable. During this transition, the special "takeover" and "relinquish" states are used.
### Hypervisor
Any PVC node that is not a coordinator is a hypervisor. Hypervisors do not run any of the indicated services, and are used exclusively for VM compute and, optionally, OSD storage disks.
All nodes added to a cluster beyond the initial coordinators must be added as hypervisors.
## Quorum and Node Loss
Many PVC daemons, as discussed below, leverage a majority quorum to function. A majority quorum requires an *absolute* majority - that is, 2-of-3 or 3-of-5 - of nodes to function; any less, and the cluster will become inoperable. For this reason, there must always be an odd number of coordinators.
This is an important consideration when deciding the number of coordinators to allocate: a 3-coordinator system can tolerate the loss of a single coordinator without impacting the cluster, but losing 2 would render the cluster inoperable; similarly, a 5-coordinator system can tolerate the loss of 2 coordinators, but losing 3 would render the cluster inoperable. In addition, these coordinators must be located in such a way that a majority can communicate in outage events, in order for the cluster to remain operational. This affects the network and physical design of a cluster and must be carefully considered during deployment; for instance, network switches and links, and power, should be redundant.
For more details on this, see the [Fencing](fencing.md) and [Georedundancy](georedundancy.md) documentation. The first also covers the node fencing process, which allows automatic recovery from a node failure in certain outage events.
Hypervisors are not affected by the coordinator quorum: a cluster can lose any number of non-coordinator hypervisors without impacting core services, though compute resources (CPU and memory) must be available on the remaining nodes for VMs to function properly, and any OSDs on these hypervisors, if applicable, would become unavailable, potentially impacting storage availability.
## Databases
### Zookeeper
PVC uses Zookeeper for its state database. Zookeeper provides a multi-master, always-consistent key-value store, which PVC uses to store data about its operation, configuration, and state.
The Zookeeper database runs on the coordinator nodes, and requires a majority quorum (2-of-3, 3-of-5, etc.) of coordinators to function properly.
### Patroni/PostgreSQL
PVC uses the Patroni PostgreSQL cluster manager to store relational data for use by the [Provisioner subsystem](../deployment/provisioner) and managed network DNS aggregation.
The Patroni system runs on the coordinator nodes, with the primary coordinator taking on the "leader" role (read-write) and all others taking on the "follower" role (read-only). Patroni leverages Zookeeper to handle state, and is thus dependent on Zookeeper to function.
## Cluster Storage
PVC leverages a "hyperconverged" storage subsystem, whereby the storage backend (Ceph) runs on the same systems as the VMs themselves, providing a fully colocated storage and compute solution in one set of physical servers.
The core Ceph daemons run on the coordinator nodes, and this includes the Monitor (used to direct clients to OSDs), Manager (used to monitor the Ceph cluster), and OSDs. Additional OSDs can also be added to non-coordinator hypervisor nodes. Like other aspects of the cluster, this system uses quorum and thus requires a majority quorum (2-of-3, 3-of-5, etc.) of coordinators to function properly.
PVC leverages Ceph's Rados Block Device (RBD) interface, which provides virtual block devices similar to Linux LVM or ZFS zvols. These virtual block devices are then mapped to VMs to provide VM disks.
The Ceph storage system features multiple layers. First, OSDs are created on dedicated physical disks (SSDs) on each hypervisor; at least 3 total OSDs, spread evenly across 3 nodes, are recommended for full redundancy. Next, a Pool is created on top of the OSDs, which defines a number of Placement Groups (PGs) which divide data between OSDs, as well as a replication strategy (a total number of copies of each block of data, and a minimum number of copies that must be written for a write to be considered a success). Finally, Volumes (RBD volumes) are created on top of the pool to provide the virtual block devices; a Volume is fixed to a particular Pool.
This default layout provides several benefits, including multi-node replication, the ability to tolerate the loss of a full node without impacting storage, and shared storage facilitating live migration, at the cost of a 3x storage penalty. Additional replication modes (for instance, more copies) are possible to provide more resiliency at the cost of a larger storage penalty.
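As a rough illustration of these layers, the corresponding PVC CLI operations might look like the following sketch. The node names, device paths, pool name, PG count, and volume name are all placeholders, and the exact arguments should be verified against `pvc storage --help` on a running cluster:

```
# Create one OSD on each node's dedicated data SSD
pvc storage osd add hv1 /dev/sdb
pvc storage osd add hv2 /dev/sdb
pvc storage osd add hv3 /dev/sdb

# Create a pool on top of the OSDs with 128 placement groups
# (the default replication of 3 copies / 2 minimum copies is assumed here)
pvc storage pool add vms 128

# Create an RBD volume (virtual block device) in the pool for use as a VM disk
pvc storage volume add vms myvm_disk0 40G
```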
For more advanced configurations, it is important to understand how disk writes work in this system and the implications of this replication. Please see the [Ceph Write Process](../manuals/ceph-write-process.md) documentation for a full explanation.
## Cluster Networking
### Core Networks
PVC requires a number of core networks, which are required independent of any client networks (see below) for the cluster to function correctly. These networks are configured at cluster deploy time, and cannot be easily changed later.
Within each core network, each node is assigned a static IP address; DHCP is not supported. This address must be defined in the `pvcnoded.yaml` file of each node, and the core networks are brought up by the PVC node daemon, not by the operating system. The daemon includes a feature to auto-select the node IP address from the network via its node ID (so that node 1 would receive .1, etc.) which requires at least a `/24` subnet to function.
In addition to the main static IP of each node, there is also a "floating" IP in each network which is bound to the primary coordinator. This IP can be used as a single point of access into the cluster for the API or other services that need to see the "cluster as a whole" rather than individual nodes.
Some or all of these networks can be collapsed, though for optimal performance and security, it is recommended that, at a minimum, the "upstream" and "cluster"/"storage" networks be separated. The physical aspect is discussed further in the [Hardware Requirements](hardware-requirements.md) documentation; however, larger clusters should generally lean towards splitting these networks into separate physical, as well as logical, links.
#### Upstream
The "upstream" network provides outside connectivity to and from the PVC cluster. It is the network which should contain the default gateway, and through which the cluster can be managed via the API and SSH and through which upstream routing to and from managed client networks (see below) will occur.
The "upstream" network requires outbound Internet access, as it will be used to fetch updates and install software.
This network, though it requires Internet access, should not be exposed directly to the Internet or to other untrusted local networks for security reasons. PVC itself makes no attempt to hinder access to nodes from within this network. At a minimum, an upstream firewall should prevent external access to this network, and only trusted hosts or on-cluster VMs should be added to it.
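As a minimal sketch of that recommendation, an upstream firewall might restrict forwarding into the "upstream" network to a trusted management subnet only. The example below assumes an external router running nftables; the table/chain names and all subnets are placeholders:

```
# Create a dedicated table and forward chain on the upstream router
nft add table inet pvc_edge
nft add chain inet pvc_edge forward '{ type filter hook forward priority 0; policy accept; }'

# Allow return traffic for connections initiated from inside the PVC upstream network
nft add rule inet pvc_edge forward ct state established,related accept

# Allow only the trusted admin subnet to reach the PVC upstream network; drop all else
nft add rule inet pvc_edge forward ip saddr 192.168.10.0/24 ip daddr 10.100.0.0/24 accept
nft add rule inet pvc_edge forward ip daddr 10.100.0.0/24 drop
```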
In addition to all other functions, server IPMI interfaces should reside either directly in this network, or in a network directly reachable from this network, to provide fencing and auto-recovery functionality. For more details, see the [Fencing](fencing.md) documentation.
#### Cluster
The "cluster" network provides inter-node connectivity within the PVC cluster, for all purposes except backend storage. Managed network VXLANs, VM migrations, database updates, and node heartbeats use this network.
The "cluster" network requires no outside routing, and is entirely local to the PVC cluster itself. A `/24` subnet of RFC1918 addresses should be used. This network should use the largest possible MTU on the underlying network hardware for optimal performance.
#### Storage
The "storage" network provides inter-node connectivity within the PVC cluster, for storage backend purposes. OSD replication, VM access to storage, and Ceph backend communications use this network.
The "storage" network requires no outside routing, and is entirely local to the PVC cluster itself. A `/24` subnet of RFC1918 addresses should be used. This network should use the largest possible MTU on the underlying network hardware for optimal performance.
For small clusters, a common configuration is to collocate the Storage and Cluster networks onto the same vLAN and IP space, in effect merging their functions. Note that this precludes separation of the networks onto different physical links in the future. Very high performance or large clusters should thus avoid this.
### Client Networks
In addition to the core networks, at least one client network should be created. Client networks are managed on a running cluster via the API, and can be changed dynamically (assuming VMs are reconfigured afterwards).
#### Managed
Managed client networks leverage the EBGP VXLAN subsystem to provide virtual layer 2 networks between VMs. These networks do not require any additional configuration on the underlying network hardware and are completely transparent to it, operating over the "cluster" core network.
PVC can provide services to clients in this network via the DNSMasq subsystem, including IPv4 and IPv6 routing, firewalling, DHCP, DNS, and NTP. An upstream router must be configured to accept and return traffic from these network(s), either via BGP or static routing, if outside access is required.
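For example, with static routing, the upstream router might simply point each managed network's subnet at the cluster's "upstream" floating IP, which follows the primary coordinator that performs the routing. The subnets and address below are placeholders:

```
# On the upstream router: route a managed client network via the PVC upstream floating IP
ip route add 10.200.0.0/24 via 10.100.0.254
```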
📝 **NOTE** Be aware of the potential for "tromboning" when routing between managed networks. All traffic to and from a managed network will flow out via the primary coordinator. Thus, if there is a large amount of inter-network traffic between two managed networks, all this traffic will traverse the primary coordinator, introducing a potential bottleneck. To avoid this, keep the amount of inter-network routing between managed networks or between managed networks and the outside world to a minimum.
One major purpose of managed networks is to provide a bootstrapping mechanism for new VMs deployed using the [PVC provisioner](../deployment/provisioner) with CloudInit metadata services (see that documentation for details). Such deployments will require at least one managed network to provide access to the CloudInit metadata system.
#### Bridged
Bridged client networks leverage vLANs on the underlying network hardware to provide access to externally-managed networks inside VMs. These networks require the necessary vLAN(s) to be plumbed into all PVC cluster nodes as a trunk on the `bridge_device` interface configured in the `pvcnoded.yaml` configuration file.
Bridged networks provide more external flexibility than managed networks, and can handle far more network traffic, though at the cost of losing cluster-aware networking facilities.
Bridged networks are explicitly isolated on the hypervisors, so that traffic to and from them cannot interact with the PVC nodes unless explicitly (and manually, outside of PVC) configured to do so.
#### SR-IOV
SR-IOV provides two mechanisms for directly passing underlying network devices into VMs. On supported network interface cards (NICs), the SR-IOV subsystem can create multiple virtual interfaces ("VFs") from a single physical interface ("PF"), managed via PVC, which can then be passed into VMs using two methods, "macvtap" and "hostdev".
SR-IOV networks require static configuration of the hypervisor nodes, both to define the PFs and to define how many VFs can be created on each PF. These options are defined with the `sriov_device` and `vfcount` options in the `pvcnoded.yaml` configuration file.
📝 **NOTE** Changing the PF or VF configuration cannot be done dynamically, and requires a restart of the `pvcnoded` daemon.
📝 **NOTE** Some SR-IOV NICs, specifically Intel NICs, cannot have the `vfcount` modified during runtime after being set. The node must be rebooted for changes to be applied.
Once one or more PFs are configured, VFs can then be created on individual nodes via the PVC API, which can then be mapped to VMs in a 1-to-1 relationship.
📝 **NOTE** The administrator must be careful to ensure the allocated VFs and PFs are identical between all nodes; otherwise, migration of VMs between nodes can result in incorrect network assignments.
Once VFs are created, they may be attached to VMs using one of the two strategies mentioned above. Each strategy has trade-offs, so careful consideration is required:
* `macvtap` bindings allow VMs to be live-migrated (assuming the required VF exists on the target node), but are less flexible: each VF is limited to a single vLAN, which must be configured in PVC, and maximum performance can be impacted. `macvtap` is mostly useful for isolating networks to a slightly higher degree than normal Bridged networks, rather than for performance, though the hypervisor can still inspect and access traffic on the VF.
* `hostdev` bindings allow the maximum flexibility inside the guest, allowing very high performance and the configuration of an arbitrary number of vLANs on the VF from within the guest, but **the VM cannot be live-migrated** between nodes and must be shut down instead. They also provide the maximum isolation possible, with the VF not being visible in any way to the hypervisor (bi-directional isolation).
#### Direct Pass-through
Though not explicitly managed by PVC, it is also possible to use direct PCIe pass-through mechanisms in Libvirt to pass NICs (or other devices) into a guest. These must be configured manually, and have all the same benefits and caveats as the `hostdev` SR-IOV interface mentioned above.

View File

@ -0,0 +1,101 @@
---
title: Fencing
---
PVC features a fencing system to provide automatic recovery of nodes from certain failure scenarios. This document details the fencing process, limitations, and expectations.
You can also view a video demonstration of the fencing process in action here:
[![Fencing Demonstration](https://img.youtube.com/vi/ZnhJ91-5y1Q/hqdefault.jpg)](https://youtu.be/ZnhJ91-5y1Q)
[TOC]
## Overview
Fencing in PVC provides a mechanism for a cluster's nodes to determine if one of their active (`run` state) peers has stopped responding, take action to ensure the failed node is fully power-cycled, and then, if successful, automatically bring up affected VMs from the dead node onto others awaiting its return to service.
Properly configured fencing can thus help ensure the maximum uptime for VMs in the case of a faulty node.
Fencing is enabled by default for all nodes that have the `fence_intervals` configuration key set and for which the node's IPMI is reachable and usable via `ipmitool` on the peers. Nodes check their own IPMI at daemon startup to validate this and print a warning if failed; in addition a regular health check monitors the IPMI interface and will degrade the node health if it is not reachable or not responding.
Fencing can be temporarily disabled by setting the cluster maintenance mode to `on` and resumed by setting it to `off`. This can be useful during maintenance events; however, the administrator should be careful to first `flush` any affected nodes of running VMs to avoid trouble.
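For example, a typical maintenance workflow might look like the following sketch; the node name is a placeholder, and the exact subcommand names (assumptions based on the features described in this documentation) should be verified against `pvc --help` for your version:

```
# Pause fencing (and monitoring alerts) for the whole cluster
pvc cluster maintenance on

# Flush (live-migrate away) all VMs from the node to be serviced
pvc node flush hv1

# ... perform the maintenance, reboot the node, etc. ...

# Return the node to service and let its VMs migrate back
pvc node ready hv1

# Resume normal fencing and monitoring
pvc cluster maintenance off
```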
## IPMI Configuration
For fencing to be enabled, several configurations must be correctly set.
* The node must have a proper IPMI interface, as detailed in the [Hardware Requirements](hardware-requirements.md#ipmilights-out-management) documentation.
* The IPMI interface must be either in the [cluster "upstream" network](cluster-architecture.md#upstream), or in another network reachable by it. The former is strongly recommended, because the latter is potentially susceptible to network faults in the routing between the networks which might cause fencing to fail in otherwise valid scenarios.
* The IPMI BMC must be configured with an `Administrator`-level user with IPMI-over-LAN privileges enabled.
* The IPMI interface (IP or hostname) and aforementioned user of each node must be configured in the `fencing` -> `ipmi` section of the `pvcnoded.yaml` file of that node.
PVC will automatically check the reachability of its IPMI and its functionality early during node startup. The functionality can also be tested via the `ipmitool -I lanplus` command from a node.
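For instance, a manual check from a peer node might look like the following; the BMC address and credentials are placeholders for the values configured in the `fencing` -> `ipmi` section:

```
# Query the power state of a node's BMC via IPMI-over-LAN
ipmitool -I lanplus -H 10.100.0.101 -U Administrator -P SuperSecretPassword chassis power status
```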
The [PVC Ansible framework](../deployment/getting-started.md) will automatically configure most aspects of this IPMI setup, though some might require manual configuration. Ensure you test this functionality before putting the cluster into production.
## Fencing Process
### Dead Node Detection
Node fencing is handled during regular node keepalive events. Keepalives occur every 5 seconds (default `keepalive_interval`), during which each node checks into the cluster by providing the current UNIX epoch timestamp in a configuration key.
At the end of each keepalive event, all nodes check their peers' timestamps and compare them against the current time. If the peers detect that a node in `run` daemon state has not checked in for 6 intervals (default `fence_intervals`), or 30 seconds by default, one node at random will begin the fencing process as the watching node. First, a timer is started for 6 more `keepalive_intervals` (hard-coded), during which a check-in from the dead node will cancel the fence (a "saving throw").
### Dead Node Fencing
If all 6 saving throw intervals pass without further updates to the dead node's timestamp, actual fencing will begin; by default this will be 60-65 seconds after the last valid keepalive. The exact process is as follows, all run from the selected watching node:
1. The dead node is issued a `chassis power off` via IPMI-over-LAN to trigger an immediate power off.
1. Wait 1 second.
1. The `chassis power state` of the dead node is checked and recorded.
1. The dead node is issued a `chassis power on` via IPMI-over-LAN to trigger a power on.
1. Wait 2 seconds.
1. The `chassis power state` of the dead node is checked and recorded.
With these 6 steps and the 2 saved results of the `chassis power state`, PVC can determine with near certainty that the dead node was actually powered off, and thus that any VMs that were potentially running on it were terminated. Specifically, if the first result was `Off` and the second was any valid value, the node was definitely shut down (either on its own, or by the first `chassis power off` command). If it cannot determine this, for instance because IPMI was unreachable or neither power state result was `Off`, no action is taken.
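For illustration, the sequence performed by the watching node is roughly equivalent to running the following `ipmitool` commands against the dead node's BMC (PVC performs these calls internally; the variables are placeholders):

```
BMC=10.100.0.101; USER=Administrator; PASS=SuperSecretPassword

ipmitool -I lanplus -H "$BMC" -U "$USER" -P "$PASS" chassis power off
sleep 1
ipmitool -I lanplus -H "$BMC" -U "$USER" -P "$PASS" chassis power status   # first recorded state
ipmitool -I lanplus -H "$BMC" -U "$USER" -P "$PASS" chassis power on
sleep 2
ipmitool -I lanplus -H "$BMC" -U "$USER" -P "$PASS" chassis power status   # second recorded state
```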
### VM Recovery
Once a dead node has been successfully fenced and at least 1 more `keepalive_interval` has passed, the watching node will begin fencing recovery.
What action is taken during fencing recovery depends on the `successful_fence` configuration key, which can be either `migrate`, which performs the steps below, or `none`, which performs no recovery action.
First, the node is put into a special `fencing-flush` domain state, to indicate that it is undergoing a forced flush after fencing. Then, for each VM which was running on the dead node:
1. The RBD locks on all VM storage volumes are cleared.
1. The VM is temporarily `migrate`d to one active peer node based on the node's configured `target_selector` (default `mem`).
1. The VM is started up.
If, at a later time, the dead node successfully recovers and resumes normal operation, it can be put back into service. This **will not** occur automatically, as the node could still be in a bad state and only barely operating; an administrator must closely inspect the node and restore it to service manually after confirming correct operation.
### Failures
If a fence fails for any reason (for instance, the IPMI of the dead node is not reachable), by default no action is taken, as this could be unsafe for the integrity of VM data. This can be overridden by adjusting the `failed_fence` configuration key in conjunction with the node suicide discussed below, however this is strongly discouraged.
### Node Suicide
As an alternative to remote fencing, nodes can be configured to kill themselves by adjusting the `suicide_intervals` configuration key to a non-zero value. If the node itself does not check in for this many intervals, it will trigger a self restart via the `reboot -f` command. However, this is not reliable, and the other nodes will have no way of accurately determining the state of the node and whether VMs are safe to migrate, so this is strongly discouraged.
## Valid Fencing Conditions
The conditions in which a node can be successfully fenced are limited, and thus, auto-recovery is limited only to those situations where a fence can succeed. In short, any situation whereby a node's OS is not responding normally, but its IPMI interface is still up and available, should succeed in a fence; in contrast, those where the IPMI interface is also unavailable will fail.
The following table covers some common scenarios, and whether fencing (and subsequent automatic recovery) can be expected to occur.
| Situation | Fence? | Notes |
| --------- | --------------------- | ----- |
| OS lockup (load, OOM, etc.) | ✅ | A key design situation for the fencing system |
| OS kernel panic | ✅ | A key design situation for the fencing system |
| Primary network failure | ✅ | Only affecting primary links, not IPMI (see below); a key design situation |
| Full network failure | ❌ | All links are down, e.g. full network failure including IPMI |
| Power loss | ❌ | Impossible to determine if this is a transient network cut or actual power loss without IPMI |
| Hardware failure (CPU, memory) | ✅ | IPMI interface should remain up in these scenarios; a key design situation |
| Hardware failure (motherboard) | ✅ | If IPMI is **online** after failure |
| Hardware failure (motherboard) | ❌ | If IPMI is **offline** after failure |
| Hardware failure (full chassis) | ❌ | If IPMI is **offline** after failure |
Care should be taken to understand these scenarios and which situations can be recovered from automatically, and which require manual human intervention to confirm the situation ("is the node actually physically off?") and manual recovery.
## Future Development
Future versions of PVC may add support for additional fencing modes, for instance the ability for a fence to trigger a remote power device (switched PDU, etc.) or to detect more esoteric situations with the node power state via IPMI, as need requires. The author however believes that the current implementation satisfies the vast majority of potential situations for which auto-recovery is beneficial and thus such work would not see much benefit, though he is open to changing his mind.

View File

@ -0,0 +1,89 @@
---
title: Georedundancy
---
Georedundancy refers to the ability of a system to run across multiple physical geographic areas, and to help tolerate the loss of one of those areas due to a catastrophic event. With respect to PVC, there are two primary types of georedundancy: single-cluster georedundancy, which covers the distribution of the nodes of a single cluster across multiple locations; and multi-cluster georedundancy, in which individual clusters are created at multiple locations and communicate at a higher level. This page outlines the implementation, the important caveats, and potential solutions where possible, for both kinds of georedundancy.
[TOC]
## Single-Cluster Georedundancy
In a single-cluster georedundant design, one logical cluster can have its nodes, and specifically its coordinator nodes, placed in different physical locations. This can help ensure that the cluster remains available even if one of the physical locations becomes unavailable, but it has multiple major caveats to consider.
### Number of Locations
Since the nodes in a PVC cluster require a majority quorum to function, there must be at least 3 sites, of which any 2 must be able to communicate directly with each other should the 3rd fail. A single coordinator (for a 3 node cluster) would then be placed at each site.
2-site georedundancy is functionally worthless within a single PVC cluster: if the primary site were to go down, the secondary site would not have enough coordinator nodes to form a majority quorum, and the entire cluster would fail.
[![2 Site Caveats](images/pvc-georedundancy-2-site.png)](images/pvc-georedundancy-2-site.png)
In addition, a 3-site configuration without a full mesh or ring, i.e. one where a single site functions as an anchor between the other two, would make that anchor site a single point of failure and would render the cluster non-functional if it went offline.
[![3 Site Caveats](images/pvc-georedundancy-broken-mesh.png)](images/pvc-georedundancy-broken-mesh.png)
Thus, the smallest useful georedundant physical design is 3 sites in full mesh or ring. The loss of any one site in this scenario will still allow the remaining nodes to form quorum and function.
[![3 Site Solution](images/pvc-georedundancy-full-mesh.png)](images/pvc-georedundancy-full-mesh.png)
A larger cluster could theoretically span 3 (as 2+2+1) or more sites, however with a maximum of 5 coordinators recommended, this many sites is likely to be overkill for the PVC solution; multi-cluster georedundancy would be a preferable solution for such a large distribution of nodes.
Since hypervisors neither affect nor are affected by the quorum, any number can be placed at any site; only compute resources would be affected should that site go offline. For instance, a design with one coordinator and one hypervisor at each site would still provide a full 4 nodes of compute resources even if one site is offline.
### Fencing
PVC's [fencing mechanism](fencing.md) relies entirely on network access. First, network access is required for a node to update its keepalives to the other nodes via Zookeeper. Second, IPMI out-of-band connectivity is required for the remaining nodes to fence a dead node.
Georedundancy introduces significant complications to this process. First, it makes network cuts more likely, as the cut can now occur somewhere outside of the administrator's control (e.g. on a public utility pole, or in a provider's upstream network). Second, the nature of the cut means that, without backup connectivity for the IPMI functionality, any fencing attempt would fail, thus preventing automatic recovery of VMs from the cut site onto the remaining sites. Thus, in this design, several situations that would normally be recovered automatically can no longer be, and in some cases no automatic recovery is possible at all. Situations where individual VM availability is paramount are therefore not ideally served by single-cluster georedundancy.
### Orphaned Site Availability
It is also important to note that network cut scenarios in this case will result in the outage of the orphaned site, even if it is otherwise functional. As the orphaned node can no longer communicate with the majority of the storage cluster, its VMs will become unresponsive and blocked from I/O. Thus, splitting a single cluster between sites like this will **not** help ensure that the cut site remains available; on the contrary, the cut site will effectively be sacrificed to preserve the *remainder* of the cluster. For instance, office workers in that location would still not be able to access services on the cluster, even if those services happen to be running in the same physical location.
### Network Speed
PVC clusters are quite network-intensive, as outlined in the [hardware requirements](hardware-requirements.md#networking) documentation. This can pose problems for multi-site clusters with slower interconnects. At least 10Gbps is recommended between nodes, and this includes nodes in different physical locations. In addition, the traffic here is bursty and dependent on VM workloads, both in terms of storage and VM migration. Thus, the site interconnects must account for the speed required of a PVC cluster in addition to any other traffic.
### Network Latency & Distance
The storage write performance within PVC is heavily dependent on network latency. To explain why, one must understand the process behind writes within the Ceph storage subsystem:
[![Ceph Write Process](images/pvc-ceph-write-process.png)](images/pvc-ceph-write-process.png)
As illustrated in this diagram, a write will only be accepted by the client once it has been successfully written to at least `min_copies` OSDs, as defined by the pool replication level (usually 2). Thus, the latency of network communications between any two nodes becomes a major factor in storage performance for writes, as the write cannot complete without at least 4x this latency (send, ack, receive, ack). Significant physical distances and thus latencies (more than about 3ms) begin to introduce performance degradation, and latencies above about 5-10ms can result in a significant drop in write performance.
To combat this, georedundant nodes should be as close as possible, ideally within 20-30km of each other at a maximum. Thus, a ring *within* a city would work well; a ring *between* cities would likely hamper performance significantly.
## Overall Conclusion: Avoid Single-Cluster Georedundancy
It is the opinion of the author that the caveats of single-cluster georedundancy outweigh the benefits in almost every case. The only situation for which single-cluster georedundancy provides a notable benefit is in ensuring that copies of data are stored online at multiple locations, but this can also be achieved at higher layers. Thus, we strongly recommend against this solution for most use-cases.
## Multi-Cluster Georedundancy
Starting with PVC version 0.9.104, the system supports online VM snapshot transfers between clusters. This enables a second georedundancy mode, leveraging a full cluster at each of two sites, between which important VMs are replicated. In addition, this design can be combined with higher-layer abstractions like service-level redundancy to ensure the optimal operation of services even if an entire cluster becomes unavailable. Service-level redundancy between two clusters is not addressed here.
Multi-cluster redundancy eliminates most of the caveats of single-cluster georedundancy while permitting single-instance VMs to be safely replicated for hot availability, but introduces several additional caveats regarding promotion of VMs between clusters that must be considered before and during failure events.
### No Failover Automation
Unlike single-cluster fencing and recovery, georedundancy with multiple clusters offers no automation within the PVC system for transitioning VMs between clusters. If a fault occurs necessitating promotion of services to the secondary cluster, this must be completed manually by the administrator. In addition, once the primary site recovers, it must be handled carefully to re-converge the clusters (see below).
### VM Automirrors
The VM automirror subsystem must be used for proper automatic redundancy of any single-instance VMs within the cluster. A "primary" site must be selected to run the service normally, while a "secondary" site receives regular mirror snapshots to update its local copy and be ready for promotion should this be necessary. Note that controlled cutovers (e.g. for maintenance events) do not present issues aside from brief VM downtime, as a final snapshot is sent during these operations.
The automirror schedule is very important to define here. Since automirrors are point-in-time snapshots, only data up to the last sent snapshot will be available on the secondary cluster. Thus, extremely frequent automirrors, on the order of hours or even minutes, are recommended. In addition, note that automirrors run on a fixed schedule for all VMs in the cluster; it is not currently possible to designate some VMs to be mirrored more frequently than others.
It is also recommended that the guest OSes of any VMs set for automirror support use atomic writes if possible, as online snapshots must be crash-consistent. Most modern operating and file systems are supported, but care must be taken when using e.g. in-memory caching of writes or other similar mechanisms to avoid data loss.
### Data Loss During Transitions
VM automirror snapshots are point-in-time; for a clean promotion without data loss, the `pvc vm mirror promote` command must be used. This affects both directions:
* When promoting a VM on the secondary after a catastrophic failure of the primary (i.e. one in which `pvc vm mirror promote` cannot be used), any data written to the primary side since the last snapshot will be lost. As mentioned above, this necessitates very frequent automirror snapshots to be viable, but even with frequent snapshots some amount of data loss will occur.
* Once the secondary is promoted to become the primary manually, both clusters will consider themselves primary for the VM, should the original primary cluster recover. At that time, there will be a split-brain between the two, and one side's changes must be discarded; there is no reconciliation possible on the PVC side between the two instances. Usually, recovery here will mean the removal of the original primary's copy of the VM and a re-synchronization from the former secondary (now primary) to the original primary cluster with `pvc vm mirror create`, followed by a graceful transition with `pvc vm mirror promote`. Note that the transition will also result in additional downtime for the VM.
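A re-convergence after an unplanned promotion might therefore look like the following sketch. The VM name, cluster connection names, and the exact arguments of the `mirror` subcommands are illustrative assumptions and should be verified against the CLI help for your version:

```
# On the recovered original-primary cluster: discard its now-stale copy of the VM
pvc -c original-primary vm remove myvm

# From the former secondary (now primary) cluster: re-establish the mirror
# back towards the original primary cluster
pvc -c new-primary vm mirror create myvm original-primary

# Later, during a planned window: gracefully cut back, which flips the mirror
# direction and incurs a brief VM outage
pvc -c new-primary vm mirror promote myvm original-primary
```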
## Overall Conclusion: Proceed with Caution
Ultimately, the potential for data loss during unplanned promotions must be carefully weighed against the benefits of manually promoting the peer cluster. For short or transient outages, such a promotion is highly likely to result in more data loss and impact than is acceptable, and thus a manual promotion should only be considered in truly catastrophic situations. In such situations, the amount of acceptable data loss must inform the timing of the automirrors, and thus how frequently snapshots are taken and transmitted. Where any data loss would be catastrophic, service-level redundancy is advised instead.

View File

@ -0,0 +1,246 @@
---
title: Hardware Requirements
---
PVC has relatively stringent restrictions on the hardware it can run on, in order to provide an optimal experience and production-grade performance, redundancy, and security. This document outlines the **minimum** and **recommended** hardware specifications, of each major device class, for running a production-grade PVC cluster.
Note, however, that your individual needs may be different, and thus your own recommended minimum requirements may be higher, even significantly higher, than those outlined here. One of PVC's benefits is the ability to seamlessly take nodes offline for maintenance, so upgrading hardware in the future is a fairly trivial task, but ensuring you get the right performance from the beginning can be critical to the success of a cluster deployment project.
[TOC]
## N-1 Redundancy
This document details the recommendations for *individual* node hardware choices, however it is important to consider the entire cluster when sizing nodes.
PVC is designed to operate in "N-1" mode, that is, all sizing of the cluster should take into account the loss of 1 node after pooling all the available resources.
For example, consider 3 nodes each with 16 CPU cores and 128GB of RAM. This totals 48 CPU cores and 384GB of RAM; however, we should consider the N-1 number, in this case 2 nodes, or 32 CPU cores and 256GB of RAM, to be the maximum usable quantity of each resource available across the entire cluster. PVC will warn the administrator when RAM provisioning exceeds the N-1 number.
Disks are even more limited. As outlined in the [Cluster Storage section of the Cluster Architecture](cluster-architecture.md#cluster-storage) documentation, a normal pool replication level for reliable redundant operation is 3 copies with 2 minimum copies. Thus, to continue the above 3 node example, if each node features a 2TB data SSD, the total available N-1 storage is 2TB (as 3 x 2TB / 3 = 2TB). On larger clusters, this calculation is more complex, so care should be taken to optimize the available number of disks with the pool replication size to ensure efficient utilization.
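To make the above arithmetic concrete, the following Python sketch (the node count, node sizes, and pool replication level are example assumptions, not requirements) computes the N-1 usable resources for a homogeneous cluster:

```python
# Illustrative N-1 sizing for a homogeneous PVC cluster; all inputs are example values.
nodes = 3
cores_per_node = 16
ram_gb_per_node = 128
data_tb_per_node = 2
pool_copies = 3  # default replication level (copies=3)

usable_cores = (nodes - 1) * cores_per_node                    # 32 cores
usable_ram_gb = (nodes - 1) * ram_gb_per_node                  # 256 GB
usable_storage_tb = (nodes * data_tb_per_node) / pool_copies   # 2 TB

print(usable_cores, usable_ram_gb, usable_storage_tb)
```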
## Hardware Vendors
PVC places no limitations on the hardware vendor for nodes; any vendor that produces a system compatible with the rest of these requirements will be suitable.
Some common recommended vendors, with whom the author has had good experiences, include Dell (PowerEdge line, various tiers and generations) and Cisco (UCS C-series line, M4 and M5 era specifically). The author does not recommend Hewlett-Packard ProLiant servers due to severe limitations and issues with their storage controller cards, even though they are otherwise sufficient.
### IPMI/Lights-out Management
All aforementioned server vendors support some form of IPMI Lights-out Management (e.g. Dell iDRAC, Cisco CIMC, HP iLO) with IPMI-over-LAN functionality. Consumer and low-end workstation hardware does not normally support IPMI Lights-out Management and is thus unsuitable for a production node.
* It is **recommended** for a redundant, production PVC node to feature IPMI Lights-out Management, on a dedicated Ethernet port, with support for IPMI-over-LAN functionality, reachable from or in the [cluster "upstream" network](cluster-architecture.md#upstream).
This feature is not strictly required; however, it is necessary for the [PVC fencing system](fencing.md) to function properly, which in turn is required for automatic recovery from node failures. PVC will detect the lack of a reachable IPMI interface at startup and disable fencing and auto-recovery in such a case.
## CPU
PVC requires a relatively large amount of CPU horsepower. In addition to any CPU required by VMs, the storage subsystem can consume a large amount of CPU power, as can other daemons on the system. Recent CPU vulnerabilities and their mitigations have also severely affected performance, and thus this should be considered carefully.
### Vendor & Architecture
PVC will work equally well on (modern, see below) Intel- and AMD-based CPUs using the x86_64 architecture (as implemented in Intel 64 and AMD64, respectively). Which you select depends primarily on availability, workload, and which additional CPU features, if any, complement it. The author has used both extensively to good results.
### Era/Generation
Modern CPUs are a must, as generational improvements compound and can make a major difference in performance. Each CPU vendor has different minimums, however.
#### Intel
* The **minimum** generation/era for a functional PVC node is "Nehalem", i.e. the Xeon L/X/W-3XXX, 2009.
* The **recommended** generation/era for a production PVC node is "Haswell", i.e. the Xeon E5-2XXX V3, 2013. Processors older than this will be a significant bottleneck due to the slower DDR3 memory system and lower general IPC, especially affecting the storage subsystem.
#### AMD
* The **minimum** generation/era for a functional PVC node is "Naples", i.e. the EPYC 7XX1, 2017. Older AMD processors perform significantly worse than their Intel counterparts of similar vintage and should be avoided completely.
* The **recommended** generation/era for a production PVC node is "Rome", i.e. the EPYC 7XX2, 2019. The first-generation "Naples" processors feature strange NUMA limitations that can negatively affect performance, which the second-generation "Rome" processors corrected.
### Cores (+ Single/Multi-processor, SMT/Hyperthreading)
PVC requires a non-trivial number of CPU cores for its internal workload in addition to any VMs that it might run. A normal system should allocate 2 CPU cores for the core system, plus an additional 2 cores for every SATA/SAS OSD or 4 cores for every NVMe OSD for optimal storage performance.
CPU cores can be significantly over-provisioned, with a 3:1 or even 4:1 ratio being acceptable for most workloads; heavily CPU-dependent workloads might lower this ratio, so consider your VM workload carefully. Generally speaking, however, as long as you have enough cores to cover the system plus the *maximum* number of vCPUs a single VM will be allocated, with a few to spare, this should be sufficient.
* The **minimum** number of CPU cores for a functional PVC node should be 8 CPU cores; any lower and even very light storage and VM workloads will be affected negatively by CPU contention.
* The **recommended** number of CPU cores for a production PVC node can be given by:
```
2 + ( [# SATA/SAS OSDs] * 2 ) + ( [# NVMe OSDs] * 4 ) + [# vCPUs of largest expected VM] + 2, round up to the nearest CPU core count
```
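As a worked illustration of this formula, the following Python sketch (the OSD counts and VM size are example assumptions) computes a recommended core count:

```python
def recommended_cores(sata_sas_osds: int, nvme_osds: int, largest_vm_vcpus: int, spare: int = 2) -> int:
    """Recommended CPU cores per node, following the formula above."""
    return 2 + sata_sas_osds * 2 + nvme_osds * 4 + largest_vm_vcpus + spare

# Example: 2 SATA OSDs, no NVMe OSDs, largest expected VM needs 12 vCPUs
print(recommended_cores(sata_sas_osds=2, nvme_osds=0, largest_vm_vcpus=12))  # 20 -> pick a 20+ core CPU
```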
#### Multiple Processors
Generally, modern AMD systems can leverage single-socket `P`-series processors to provide a similar number of cores, memory bandwidth, and PCIe connectivity as a comparable dual-CPU Intel system. Once you have the minimum required CPU core count from the calculation above, choose the best option based on the selected vendor and generation.
For example, if a system needs a total of 24 CPU cores, there are several options available:
* A single 24-core AMD EPYC processor
* A single 24-core Intel processor
* Two 12-core Intel processors
Which to select will depend on the available options from the server vendor you choose as well as cost.
#### SMT/Hyperthreading
SMT/Hyperthreading, the ability of a single CPU core to present itself as 2 (or more) virtual CPU cores, should be considered a **bonus only** and not included in any of the above calculations. Thus, in the above example, two 6-core processors with SMT are *not* a substitute for two 12-core processors.
Recent CPU vulnerabilities have resulted in recommendations to disable SMT on some processors. If this is required by your security practices (e.g. if you will run untrusted guest VMs), this should be done.
### Clock Speed
Several aspects of the storage cluster are limited by core clock, such that the fastest possible CPU clock is **recommended** within a given generation and core or power target.
* The **minimum** CPU clock speed for a functional PVC node depends on the generation (as newer CPUs perform more calculations at the same clock), but usually anything lower than 2.0GHz will result in substandard performance.
* The **recommended** minimum CPU clock speed for a production PVC node depends on the generation, but should be as fast as possible given the system power constraints and core count.
## Memory
### Generation/Speed
Since RAM generation and speed are governed by the chosen CPUs, the following mirrors the recommendations from the CPU section:
* The **minimum** RAM generation for a functional PVC node is DDR3, running at DDR3-1333 speeds or faster for optimal bandwidth.
* The **recommended** RAM generation for a production PVC node is DDR4, running at DDR4-2133 speeds or faster for optimal bandwidth.
### Quantity
Like CPU cores, PVC requires a non-trivial amount of RAM for its internal workload in addition to any VMs that it might run. A normal system should allocate at least 8GB for the core system, plus an additional 4GB for every OSD.
In addition, unlike CPU cores, memory is not easily over-provisioned. While a PVC node will only consume the amount of memory actually used inside each VM, the amount allocated to each VM is what is counted when considering the state of the cluster, since usage inside VMs can change at any time. Thus, carefully consider your VM workload.
* The **minimum** amount of RAM for a functional PVC node should be 32 GB; any lower and RAM contention might become a major issue even with a relatively light VM workload.
* The **recommended** amount of RAM for a production PVC node can be given by:
```
8 + ( [# OSDs] * 4 ) + ( [# VMs] * [average RAM per VM in GB] ), round up to the nearest common RAM quantity
```
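Similarly, as a worked illustration of this formula (the OSD count, VM count, and average VM size are example assumptions):

```python
def recommended_ram_gb(osds: int, vms: int, avg_ram_per_vm_gb: int) -> int:
    """Recommended RAM per node in GB, following the formula above."""
    return 8 + osds * 4 + vms * avg_ram_per_vm_gb

# Example: 2 OSDs and 10 VMs averaging 4 GB each
print(recommended_ram_gb(osds=2, vms=10, avg_ram_per_vm_gb=4))  # 56 -> round up to e.g. 64 GB
```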
## System Disks
### Type
SSDs are critically important for the proper operation of a PVC node's system disks. The system performs a very large number of writes to its system disks, and requires speedy random I/Os both in reads and writes to function correctly. Spinning hard drives or slow flash (e.g. SD cards or USB flash drives) are **not** usable and should not be considered; SATA-DOMs can be sufficient if performant enough.
In addition, power loss protection (PLP) and large drive write endurance - normally collectively covered under the label of "datacenter-grade" - are important to avoid potential data loss during power events and premature failure of disks. For write endurance, a rating of 1 Drive Write Per Day (DWPD) is usually sufficient.
* The **minimum** system disk type for a functional PVC node is SATA SSDs or SATA-DOMs.
* The **recommended** system disk type for a production PVC node is datacenter-grade (>=1 DWPD + PLP) SATA, SAS or NVMe SSDs.
### Size
PVC does not require a large amount of space for its system drives. The default ideal allocation of actual volumes is approximately 88GB, while the minimum allocation is approximately 28GB.
* The **minimum** system disk size for a functional PVC node is 32GB.
* The **recommended** system disk size for a production PVC node is 120GB.
### Quantity/Redundancy
The PVC system disks should be deployed in mirrored mode, via an internal RAID controller or dedicated redundant device (e.g. Dell BOSS card). Note that PVC features a monitoring plugin which can alert to degraded RAID arrays of various types (MegaRAID/Dell PERC, HPSA, and Dell BOSS) when the appropriate software is installed.
* The **minimum** system disk quantity for a functional PVC node is 1.
* The **recommended** system disk quantity for a production PVC node is 2 in RAID-1 mode.
## Data Disks
### Type
Data disks underlie the Ceph OSDs which provide VM storage for the cluster. They should be as fast as possible to optimize the storage performance of VMs.
In addition, power loss protection (PLP) and large drive write endurance - normally collectively covered under the label of "datacenter-grade" - are important to avoid potential data loss during power events and premature failure of disks, especially given Ceph's replication resulting in write amplification. For write endurance, a rating of 1 Drive Write Per Day (DWPD) is usually sufficient.
* The **minimum** data disk type for a functional PVC node is SATA SSDs.
* The **recommended** data disk type for a production PVC node is datacenter-grade (>=1 DWPD + PLP) SATA, SAS or NVMe SSDs.
### Size
Data disks should be as large as possible for the storage expected by VMs, plus overhead of approximately 30% (i.e. the cluster should ideally never exceed 70% full).
Since this sizing is based entirely on VM requirements, no minimum or recommended values can reasonably be given.
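That said, the sizing logic itself can be sketched as follows (the expected VM storage, node count, and replication level are example assumptions):

```python
def per_node_data_disk_tb(expected_vm_storage_tb: float, copies: int = 3, nodes: int = 3,
                          max_fill: float = 0.70) -> float:
    """Approximate raw data disk capacity needed per node for a given VM storage target."""
    raw_needed_tb = expected_vm_storage_tb * copies / max_fill  # replication plus ~30% free-space headroom
    return raw_needed_tb / nodes

# Example: 1.5 TB of expected VM storage on a 3-node, copies=3 cluster
print(round(per_node_data_disk_tb(1.5), 2))  # ~2.14 TB of raw data disk per node
```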
### Quantity/Redundancy
Data disks in the PVC system **must** be added in groupings equal to the pool replication level, across the same number of nodes. For example, in a 3 node cluster, 3 disks, of identical sizes, must be added at the same time, 1 to each of the 3 nodes. Large node counts require more careful calculation of the replication split, though 1-disk-per-node expansion across all nodes is generally recommended.
Data disk redundancy is provided by the Ceph pool replication across nodes, and thus, data disks **should**, if at all possible, be passed **directly** into the system without any intervening RAID or other layers. If this is not possible (e.g. on HP SmartArray controllers), disks should be allocated as single-disk RAID-0 volumes in the storage controller.
The number of data disks, as mentioned above in the CPU section, directly affects the number of recommended CPU cores, as each disk adds additional computation. It is **recommended** to use fewer, larger disks rather than more, smaller disks as much as possible. For example, one 4TB disk would generally be preferable to four 1TB disks, as it would reduce the CPU overhead by as much as 75%, though this is a trade-off with reliability, as a single large disk will affect more data when it fails than a smaller disk would.
## Networking
### Type
Only Ethernet networking with the TCP/IP stack is supported by PVC. Infiniband or other fabric technologies are not supported.
All physical modes (copper, fibre, DAC) are supported by PVC, dependent on the chosen network interface cards and switch(es).
### Speed
Various PVC functions, but especially the storage backend, are limited primarily by network speed, which should thus be as fast as possible.
* The **minimum** network speed for a functional PVC node is 1 Gbps.
* The **recommended** network speed for a production PVC node is 10 Gbps.
### Link Redundancy
Link redundancy is critical for proper network layer redundancy and failover, and can provide additional performance in some cases. LACP (802.3ad) is recommended as it provides both features, though active-backup can also be acceptable.
* The **minimum** number of network links for a functional PVC node is 1.
* The **recommended** number of network links for a production PVC node is 2 in LACP (802.3ad) aggregation mode.
## Switching
### Link Aggregation
As detailed above, LACP (802.3ad) link aggregation is **recommended** for network link redundancy.
### vLANs
An optimal PVC deployment will make extensive use of virtual LANs (vLANs), both for the core networks and to provide bridged client networks. While it is possible to assemble a cluster using only dedicated links, this is highly unusual and it is thus **recommended** to use switches that feature vLAN support.
### Multiple Switches
For true network redundancy, **at least 2** switches should be used, with a pair of links each from one switch to each node, aggregated *across* the switches. This ensures that even in the event of a switch failure or maintenance, the networking of the cluster remains available.
## Power
### Power Type & Wattage
PVC will work equally well regardless of power type (A/C vs D/C, various voltages, etc.) and wattage of power supplies, as long as it is supported by the chosen hardware vendor and within the bounds of the selected hardware.
### Power Supplies
Redundant power supplies will ensure that even if a power supply or power feed fails, the PVC node will continue to function. Note that PVC features a monitoring plugin which can alert to degraded power redundancy from most IPMI-capable vendors.
* The **minimum** number of power supplies for a functional PVC node is, of course, 1.
* The **recommended** number of power supplies for a production PVC node is 2 with active power redundancy configured.
### Power Feeds
For true power redundancy, **at least 2** power feeds should be used, with a pair of power connections, one from each feed to a different power supply on each node. This ensures that even in the event of a power failure or maintenance, the cluster remains available.
## Other
### TPM Chip
A TPM chip is not required for PVC, though it may improve some cryptographic performance or enable advanced functionality in the Libvirt system (not managed by PVC).
### IPMI Serial Console
The IPMI Lights-out Management interface should provide some form of serial console in addition to a GUI console to assist in managing the PVC nodes, though neither is required.
### Warranty
For production deployments, hardware warranty is essential for speedy replacement of parts. While PVC can tolerate running in a degraded mode for some time, avoiding this state as much as possible is ideal, and hardware warranties can help ensure this.


View File

@ -1,414 +0,0 @@
# PVC Cluster Architecture considerations
- [PVC Cluster Architecture considerations](#pvc-cluster-architecture-considerations)
* [Node Specification](#node-specification)
+ [n-1 Redundancy](#n-1-redundancy)
+ [CPU](#cpu)
+ [Memory](#memory)
+ [Disk](#disk)
+ [Network](#network)
* [PVC architecture](#pvc-architecture)
+ [Operating System](#operating-system)
+ [Ceph Storage Layout](#ceph-storage-layout)
+ [Networks](#networks)
- [System Networks](#system-networks)
- [Client Networks](#client-networks)
+ [Fencing and Recovery](#fencing-and-recovery)
* [Advanced Layouts](#advanced-layouts)
+ [Coordinators versus Hypervisors](#coordinators-versus-hypervisors)
+ [Georedundancy](#georedundancy)
* [Example System Diagrams](#example-system-diagrams)
+ [Small 3-node cluster](#small-3-node-cluster)
+ [Large 8-node cluster](#large-8-node-cluster)
This document contains considerations the administrator should make when preparing for and building a PVC cluster. It is important that prospective PVC administrators read this document *thoroughly* before deploying a cluster to ensure they understand the requirements, caveats, and important details about how PVC operates.
## Node Specification
PVC nodes, especially coordinator nodes, run a significant number of software applications in addition to the virtual machines (VMs). It is therefore extremely important to size the systems correctly for the expected workload while planning both for redundancy and future capacity. In general, larger, more capable nodes are better for performance, providing a more powerful cluster on fewer physical machines, though each workload may be different in this regard.
The following table provides recommended minimum specifications for each component of the cluster nodes. In general, these minimums are the lowest possible for a production-quality cluster that would provide decent performance for up to about a dozen virtual machines. Of course, further upward scaling is recommended and the specific computational and storage needs of the VM workloads should be taken into account.
| Resource | Recommended Minimum |
| -------- | --------------------|
| CPU generation | Intel Sandy Bridge (2011) *or* AMD Naples (2017) |
| CPU cores per node | 8 @ 2.0GHz |
| RAM per node | 32GB |
| System disk (SSD/HDD/USB/SD/eMMC) | 2x 100GB RAID-1 |
| Data disk (SSD only) | 1x 400GB |
| Network interfaces | 2x 10Gbps (LACP LAG) |
| Remote IPMI-over-IP | Available and connected |
| Total CPU cores (3 nodes healthy) | 24 |
| Total CPU cores (3 nodes n-1) | 16 |
| Total RAM (3 nodes healthy) | 96GB |
| Total RAM (3 nodes n-1) | 64GB |
| Total disk space (3 nodes) | 400GB |
For testing, or low-budget homelab applications, some aspects can be further tuned down, however consider the following sections carefully.
### n-1 Redundancy
Care should be taken to examine the "healthy" versus "n-1" total resource availability. Under normal operation, PVC will use all available resources and distribute VMs across all cluster nodes. However, during single-node failure or maintenance conditions, all VMs will be required to run on the remaining hypervisors. Thus, care should be taken during planning to ensure there are sufficient resources for the expected workload of the cluster.
The general values for default resource availability of a 3-node cluster for n-1 availability (1 node offline) are:
* 1/3 of the total data disk space (3 copies of all data, distributed across all 3 nodes)
* 2/3 of the total RAM
* 2/3 of the total CPU cores
For memory provisioning of VMs, PVC will warn the administrator, via a Degraded cluster state, if the "n-1" RAM quantity is exceeded by the total maximum allocation of all running VMs. If nodes are of mismatched sizes, the "n-1" RAM quantity is calculated by removing (one of) the largest node in the cluster and adding the remaining nodes' RAM counts together.
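This calculation can be sketched as follows (the per-node RAM quantities are arbitrary example values):

```python
def n1_ram_gb(node_ram_gb: list[int]) -> int:
    """N-1 RAM quantity: drop (one of) the largest node and sum the remainder."""
    return sum(sorted(node_ram_gb)[:-1])

# Example: three mismatched nodes
print(n1_ram_gb([128, 96, 64]))  # 160 GB usable before the Degraded warning threshold
```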
### CPU
CPU resources are a very important part of the overall performance of a PVC cluster. Numerous aspects of the system require high-performance CPU cores, including the VM workloads themselves, the PVC databases, and, especially, the Ceph storage subsystem.
As a general rule, more cores, and faster cores, are always better, and real cores are preferable to SMT virtual cores in most cases.
#### SMT
SMT in particular can be a contentious subject, and performance can vary wildly for different workloads; thus, while SMT virtual cores are useful, in terms of performance calculations they should always be considered an afterthought or "bonus" to assist with many VMs contending for resources, and base specifications should be done based on the number of real CPU cores instead.
#### CPU core counts
The following should be considered recommended minimums for CPU core allocations:
* PVC system daemons, including Zookeeper and PostgreSQL databases: 2 CPU cores
* Ceph Monitor and Manager processes: 1 CPU core
* Ceph OSD processes: 2 CPU cores *per OSD disk*
* Virtual Machines: 1 CPU core per vCPU in the largest spec'd VM (e.g. 12 vCPUs in a VM = 12 cores here)
To provide an example, consider a cluster that would run 2 OSD disks per node and several VMs, the largest of which would require 12 vCPUs:
* PVC system: 2 cores
* Ceph Mon/Mgr: 1 core
* Ceph OSDs: 2 * 2 = 4 cores
* VMs: 12 cores
This gives a total of 19 cores, and thus a 20+ core CPU would be recommended.
Additional CPU cores, as previously mentioned, are always better. For instance, though 2 is the recommended minimum per OSD disk, better performance can be achieved if there are 4 cores available per OSD instead. This trade-off depends heavily on the required workload and VM specifications and should be carefully considered.
#### CPU performance
While CPU frequency is not a tell-all or even particularly useful metric across generations or manufacturers, within a specific generation and manufacturer, faster CPUs will almost always improve performance across the board, especially when considering the Ceph storage subsystem. If a 2.0GHz and a 2.6GHz CPU of the same core count are both available, the 2.6GHz one is almost always the better choice from a pure performance perspective.
### Memory
Memory is extremely important to PVC clusters, and like CPU resources, a not-insignificant amount of memory is required for the baseline cluster before VMs are considered.
#### Memory allocations
The following should be considered recommended minimums for memory allocations:
* PVC daemons: 1 GB
* Zookeeper database: 1 GB
* PostgreSQL database: 1 GB
* Ceph Monitor and Manager processes: 1 GB
* Ceph OSD processes: 1 GB *per OSD disk*
All additional memory can be consumed by virtual machines.
To provide an example, in the same cluster as mentioned in the CPU section:
* PVC system: 1 GB
* Zookeeper: 1 GB
* PostgreSQL: 1 GB
* Ceph Mon/Mgr: 1 GB
* Ceph OSDs: 2 * 1 GB = 2 GB
This gives a total of 6 GB of memory for the base system, with VMs requiring additional memory.
#### VM Memory Overprovisioning
An important consideration is that the KVM hypervisor used by PVC will only allocate guest memory *as required by the guest*, but PVC tracks memory allocation based on the allocated maximum. Thus, for example, a VM may be allocated 8192 MB of memory in PVC, and the PVC system will consider 8 GB "allocated" and "provisioned" to this VM, but if the actual guest is only using 500 MB of that memory, the actual memory usage on the hypervisor node will be 500 MB for that VM. It is therefore possible for "all" memory to be allocated on a node while many GB of memory remain "free". This is an intentional design decision to avoid excessive overprovisioning of memory, and thus situations where non-VM processes become memory starved, as the PVC system itself does *not* track the usage by the aforementioned processes.
#### Memory Performance
Given the recommended CPU requirements, all PVC hypervisors should contain at least DDR3 memory, which is sufficiently performant for all tasks. Memory latency and performance, however, can become important especially in large NUMA systems, and especially with regards to the Ceph storage subsystem. Care should be taken to optimize the memory layout in nodes, for instance making use of all available memory channels in the CPU architecture and preferring 1 DIMM-per-channel (DPC) over 2 DPC.
#### Ceph OSD memory utilization
While the recommended *minimum* is 1 GB per OSD process, in reality, Ceph can allocate between 4 and 6 GB of memory per OSD process, especially for caching metadata and other frequently-used data. Thus, for maximum performance, 4 GB instead of 1 GB should be allocated per-OSD.
#### Memory limit tuning
The PVC Ansible deployment system allows the administrator to specify limits on some aspects of the aforementioned memory requirements, for instance limiting Zookeeper or Ceph OSD processes to lower amounts of memory. This is not recommended except in situations where memory is extremely constrained; in such situations adding additional memory to nodes is always preferable. For details and examples please see the Ansible variable files.
### Disk
#### System Disks
The performance of system disks is of critical importance in the PVC cluster. At least 32GB of space is required, and at least 100GB is recommended to ensure optimal performance. The system disks should be fast SAS HDDs, SSDs, eMMC flash, class-10 SD cards, or other flash-based media, and RAID-1 is critical for reliability purposes, especially for more wear- or failure-sensitive media types.
PVC will store the various databases on these disks, so overall performance can affect the responsiveness of the system. However note that no VM data is ever stored on system disks; this is provided exclusively by the Ceph data disks (OSDs).
#### Ceph OSD disks
All VM block devices are stored on Ceph OSD data disks. The default pool configuration of the Ceph storage subsystem uses a `copies=3` layout with a `host`-level failure domain; thus, in a 3-node cluster, each block of data is stored 3 times, once per node. This ensures that 2 copies of each piece of data are available even if a host is down, at the cost of 1/3 of the total overall storage space. Other configurations are possible, but this is the minimum recommended.
The performance of VM disks will be dictated almost exclusively by the performance of these disks in combination with the CPU resources of the system as discussed previously. Very fast, robust, and resilient storage is highly recommended for OSD disks to maximize performance and longevity. High-performance SATA, SAS, or NVMe SSDs are recommended for this task, sized according to the expected workload. Spinning disks (HDDs) are *not* recommended for this purpose, and their very low random performance will significantly limit the overall storage performance of the cluster.
Initially, it is optimal if all nodes contain the same number and same size of OSD disks, to ensure even distribution of the data across all disks and thus maximize performance. PVC supports adding additional OSDs at a later time, however the administrator should be cautious to always add new disks in parallel on all nodes at the same time, as otherwise the replication ratio will prevent the new space from being utilized. Thus, in a 3-node cluster, disks must be added 3-at-a-time to all 3 nodes, and these disks must be identically sized, in order to increase the total usable storage space by the value of one of these disks.
In addition to the primary data disks, PVC also supports offloading the Ceph BlueStore OSD database and WAL functions of the OSDs onto a separate OSD database volume group on a dedicated storage device. In the normal use-case, this would be an extremely fast, high-endurance Intel Optane or similarly performant NVMe SSD which is significantly faster than the primary data SSDs. This will help accelerate random write I/Os and metadata lookups, especially when using lower-performance SATA or SAS SSDs. Generally speaking, this volume should be large enough to support 5% of the capacity of all OSDs on a node, with some room for future expansion. Only one such device and volume group is supported at this time.
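As a rough sizing sketch for such a database/WAL volume (the OSD sizes are example values, and the 1.5x expansion headroom is an arbitrary illustrative factor, not a PVC requirement):

```python
def osd_db_volume_gb(osd_sizes_gb: list[int], fraction: float = 0.05, headroom: float = 1.5) -> float:
    """~5% of the total OSD capacity on the node, with extra room for future expansion."""
    return sum(osd_sizes_gb) * fraction * headroom

# Example: a node with 2x 2000GB data OSDs
print(osd_db_volume_gb([2000, 2000]))  # 300.0 GB on the dedicated DB/WAL device
```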
### Network
Because PVC makes extensive use of cross-node communications, high-throughput and low-latency networking is critical. At a minimum, 10-gigabit networking is recommended to ensure suitable throughput for the storage subsystem as well as for VM traffic. Higher-speed networking can also improve performance, especially when using extremely fast Ceph OSD disks.
A minimum of 2 network interfaces is recommended. These should then be combined into a logical aggregate (LAG) using 802.3ad (LACP) to provide redundant links and a boost in available bandwidth. Additional NICs can also be used to separate discrete parts of the networking stack, which will be discussed below.
#### Remote IPMI-over-IP
IPMI provides a method to manage the physical chassis of nodes from outside of their operating system. Common implementations include Dell iDRAC, HP iLO, Cisco CIMC, and others.
PVC nodes in production deployments should always feature an IPMI-over-IP interface of some kind, which is then reachable either in, or via, the Upstream system network (see [System Networks](#system-networks)). This requirement is discussed in more detail during the [Fencing and Recovery](#fencing-and-recovery) section below.
## PVC Architecture
### Operating System
As an underlying OS, only Debian GNU/Linux 10.x "Buster" or 11.x "Bullseye" are supported by PVC. This is the operating system installed by the PVC [node installer](https://github.com/parallelvirtualcluster/pvc-installer) and expected by the PVC [Ansible configuration system](https://github.com/parallelvirtualcluster/pvc-ansible). Ubuntu or other Debian-derived distributions may work, but are not officially supported. PVC also makes use of a custom repository to provide the PVC software and (for Debian Buster) an updated version of Ceph beyond what is available in the base operating system, and this is only compatible officially with Debian 10 or 11. PVC will generally be upgraded regularly to support new Debian versions. As a rule, using the current versions of the official node installer and Ansible repository is the preferred and only supported method for deploying PVC.
Currently, only the `amd64` (Intel 64 or AMD64) architecture is officially supported by PVC. Given the cross-platform nature of Python and the various software components in Debian, it may work on `armhf` or `arm64` systems as well, however this has not been tested by the author and is not officially supported at this time.
### Ceph Storage Layout
PVC makes use of Ceph, a distributed, replicated, self-healing, and self-managing storage system to provide shared VM storage. While a PVC administrator is not required to understand Ceph for day-to-day administration, and PVC provides interfaces to most of the common storage functions required to operate a cluster, at least some knowledge of Ceph is advisable.
The Ceph subsystem of PVC creates a "hyperconverged" cluster whereby storage and VM hypervisor functions are collocated onto the same physical servers; PVC does not differentiate between "storage" and "compute" nodes, and while storage support can be disabled and an external Ceph cluster used, this is not recommended. The performance of the storage must be taken into account when sizing the nodes as mentioned above.
Ceph on PVC is laid out similar to the other daemons. The Ceph Monitor and Manager functions are delegated to the Coordinators over the storage network, with all nodes connecting to these hosts to obtain the CRUSH maps and select OSD disks. OSDs are then distributed on all hosts, potentially including non-coordinator hypervisors if desired, and communicate with clients and each other over the storage network.
Disks must be balanced across all storage-containing nodes. For instance, adding 1 disk to 1 node is not sufficient to increase storage space; 1 disk must be added to all storage-containing nodes, based on the configured replication scheme of the various pools (see below), at the same time for the available space to increase. Ideally, disk sizes should also be identical across all storage disks, though the weight of each disk can be configured when added to the cluster. Generally speaking, fewer larger disks are preferable to many smaller disks to minimize storage resource utilization, however slightly more storage performance can be gained from using many small disks, if the other cluster hardware, and specifically CPUs, are performant enough. The administrator should therefore always aim to choose the biggest disks they can and grow by adding more identical disks as space or performance needs grow.
PVC Ceph pools make use of the replication mechanism of Ceph to store multiple copies of each object, thus ensuring that data is always available even when a host is unavailable. Only "replica"-based Ceph redundancy is supported by PVC; erasure coded pools are not supported due to major performance impacts related to rewrites and random I/O as well as management overhead.
The default replication level for a new pool is `copies=3, mincopies=2`. This will store 3 copies of each object, with a host-level failure domain, and will allow I/O as long as 2 copies are available. Thus, in a cluster of any size, all data is fully available even if a single host becomes unavailable. It will however use 3x the space for each piece of data stored, which must be considered when sizing the disk space for the cluster: a pool in this configuration, running on 3 nodes each with a single 400GB disk, will effectively have 400GB of total space available for use. As mentioned above, new disks must also be added in groups across nodes equal to the total number of `copies` to ensure new space is usable; for instance in a `copies=3` scheme, at least 3 disks must thus be added to different hosts at the same time for the available space to grow.
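To make the space calculation concrete, the following sketch (disk sizes and counts are example assumptions) shows usable space under the default `copies=3` scheme:

```python
def usable_space_gb(disks_gb: list[int], copies: int = 3) -> float:
    """Usable pool space: total raw capacity divided by the replication level."""
    return sum(disks_gb) / copies

# Example: 3 nodes with one 400GB disk each, copies=3
print(usable_space_gb([400, 400, 400]))      # 400.0 GB usable
# Adding disks in groups equal to `copies` (one per node) grows usable space by one disk:
print(usable_space_gb([400, 400, 400] * 2))  # 800.0 GB usable
```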
Non-default values can also be set at pool creation time. For instance, one could create a `copies=3, mincopies=1` pool, which would allow I/O with two hosts down, but leaves the cluster susceptible to a write hole should a disk fail in this state; this configuration is not recommended in most situations. Alternatively, for additional resilience, one could create a `copies=4, mincopies=2` pool, which would also allow 2 hosts to fail, without a write hole, but would consume 4x the space for each piece of data stored and require new disks to be added in groups of 4 instead. Practically any combination of values is possible, however these 3 are the most relevant for most use-cases, and for most, especially small, clusters, the default is sufficient to provide solid redundancy and guard against host failures until the administrator can respond.
Replication levels cannot be changed within PVC once a pool is created, however they can be changed via manual Ceph commands on a coordinator should the administrator require this, though discussion of this process is outside of the scope of this documentation. The administrator should carefully consider sizing, failure domains, and performance when first selecting storage devices and creating pools, to ensure the right level of resiliency versus data usage for their use-case and planned cluster size.
### Networks
At a minimum, a production PVC cluster should use at least two 10Gbps Ethernet interfaces, connected in an LACP or active-backup bond on one or more switches. On top of this bond, the various cluster networks should be configured as 802.1q vLANs. PVC is able to support configurations without bonding or 802.1q vLAN support, using multiple physical interfaces and no bridged client networks, but this is strongly discouraged due to the added complexity it introduces; the switches chosen for the cluster should include these requirements as a minimum.
More advanced physical network layouts are also possible. For instance, one could have two isolated networks. On the first network, each node has two 10Gbps Ethernet interfaces, which are combined in a bond across two redundant switch fabrics and handle the upstream and cluster networks. On the second network, each node has an additional two 10Gbps interfaces, which are also combined in a bond across the redundant switch fabrics and handle the storage network. This configuration could support up to 10Gbps of aggregate client traffic while also supporting 10Gbps of aggregate storage traffic. Even more complex network configurations are possible if the cluster requires such performance. See the [Example System Diagrams](#example-system-diagrams) section for some basic topology examples.
Only Ethernet networks are supported by PVC. More exotic interconnects such as Infiniband are not supported by default, and must be manually set up with Ethernet (e.g. EoIB) layers on top to be usable with PVC.
Lower-speed networks (e.g. 1Gbps or 100Mbps) should not be used as these will severely bottleneck the performance of the storage subsystem. In an advanced split layout, it may be acceptable to use 1Gbps interfaces for VM guest networks, however the core system networks should always be a minimum of 10Gbps.
PVC manages the IP addressing of all nodes itself and creates the required addresses during node daemon startup; thus, the on-boot network configuration of each interface should be set to "manual" with no IP addresses configured. This recommendation can, however, safely be ignored and the addresses specified manually in the networking configuration if desired. PVC nodes use a split (`/etc/network/interfaces.d/<iface>`) network configuration model.
### System Networks
#### Upstream: Connecting the nodes to the wider world
The upstream network functions as the main upstream for the cluster nodes, providing Internet access and a way to route managed client network traffic out of the cluster. In most deployments, this should be an RFC1918 private subnet with an upstream router which can perform NAT translation and firewalling as required, both for the cluster nodes themselves, and also for any RFC1918 managed client networks.
The floating IP address in the upstream network can be used as a single point of communication with the active primary node, for instance to access the DNS aggregator instance or the management API. PVC provides only limited access control mechanisms for the API interface, so the upstream network should always be protected by a firewall; running PVC directly accessible on the Internet is strongly discouraged and may pose a serious security risk, and all access should be restricted to the smallest possible set of remote systems.
Nodes in this network are generally assigned static IP addresses which are configured at node install time in the [Ansible deployment configuration](https://github.com/parallelvirtualcluster/pvc-ansible).
The upstream router should be able to handle static routes to the PVC cluster, or form a BGP neighbour relationship with the coordinator nodes and/or floating IP address to learn routes to the managed client networks.
The upstream network should generally be large enough to contain:
0. The upstream router(s)
0. The nodes themselves
0. In most deployments, the node IPMI management interfaces.
For example, for a 3+ node cluster, up to about 90 nodes, the following configuration might be used:
| Description | Address |
|-------------|---------|
| Upstream network | 10.0.0.0/24 |
| Router VIP address | 10.0.0.1 |
| Router 1 address | 10.0.0.2 |
| Router 2 address | 10.0.0.3 |
| PVC floating address | 10.0.0.10 |
| node1 | 10.0.0.11 |
| node2 | 10.0.0.12 |
| etc. | etc. |
| node1-ipmi | 10.0.0.111 |
| node2-ipmi | 10.0.0.112 |
| etc. | etc. |
For even larger clusters, a `/23` or even larger network may be used.
#### Cluster: Connecting the nodes with each other
The cluster network is an unrouted private network used by the PVC nodes to communicate with each other for database access and Libvirt migrations. It is also used as the underlying interface for the BGP EVPN VXLAN interfaces used by managed client networks.
The floating IP address in the cluster network can be used as a single point of communication with the active primary node.
Nodes in this network are generally assigned IPs automatically based on their node number (e.g. node1 at `.1`, node2 at `.2`, etc.). The network should be large enough to include all nodes sequentially.
Generally the cluster network should be completely separate from the upstream network, either a separate physical interface (or set of bonded interfaces) or a dedicated vLAN on an underlying physical device, but they can be collocated if required.
#### Storage: Connecting Ceph daemons with each other and with OSDs
The storage network is an unrouted private network used by the PVC node storage OSDs to communicate with each other, for Ceph management functionality, and for QEMU-to-Ceph disk access, without using the main cluster network and introducing potentially large amounts of traffic there.
The floating IP address in the storage network can be used as a single point of communication with the active primary node, though this will generally be of little use.
Nodes in this network are generally assigned IPs automatically based on their node number (e.g. node1 at `.1`, node2 at `.2`, etc.). The network should be large enough to include all nodes sequentially.
The administrator may choose to collocate the storage network on the same physical interface as the cluster network, or on a separate physical interface. This should be decided based on the size of the cluster and the perceived ratios of client network versus storage traffic. In large (>3 node) or storage-intensive clusters, this network should generally be a separate set of fast physical interfaces, separate from both the upstream and cluster networks, in order to maximize and isolate the storage bandwidth. If the administrator does choose to collocate these networks, they may also share the same IP address, thus eliminating any distinction between the Cluster and Storage networks. The PVC software handles this natively when the Cluster and Storage IPs of a node are identical.
### Client Networks
#### Bridged (unmanaged) Client Networks
The first type of client network is the unmanaged bridged network. These networks have a separate vLAN on the device underlying the other networks, which is created when the network is configured. VMs are then bridged into this vLAN.
With this client network type, PVC does no management of the network. This is left entirely to the administrator. It requires switch support and the configuration of the vLANs on the switchports of each node's physical interfaces before enabling the network.
Generally, the same physical network interface will underlay both the cluster networks as well as bridged client networks. PVC does however support specifying a separate physical device for bridged client networks, for instance to separate these networks onto a different physical interface from the main cluster networks.
#### VXLAN (managed) Client Networks
The second type of client network is the managed VXLAN network. These networks make use of BGP EVPN, managed by route reflection on the coordinators, to create virtual layer 2 Ethernet tunnels between all nodes in the cluster. VXLANs are then run on top of these virtual layer 2 tunnels, with the active primary PVC node providing routing, DHCP, and DNS functionality to the network via a single IP address.
With this client network type, PVC is in full control of the network. No vLAN configuration is required on the switchports of each node's physical interfaces, as the virtual layer 2 tunnel travels over the cluster layer 3 network. All client network traffic destined for outside the network will exit via the upstream network interface of the active primary coordinator node.
NOTE: These networks may introduce a bottleneck and tromboning if there is a large amount of external and/or inter-network traffic on the cluster. The administrator should consider this carefully when deciding whether to use managed or bridged networks and properly evaluate the inter-network traffic requirements.
#### SR-IOV Client Networks
The third type of client network is the SR-IOV network. SR-IOV (Single-Root I/O Virtualization) is a technique and feature enabled on modern high-performance NICs (for instance, those from Intel or nVidia) which allows a single physical Ethernet port (a "PF" in SR-IOV terminology) to be split, at a hardware level, into multiple virtual Ethernet ports ("VF"s), which can then be managed separately. Starting with version 0.9.21, PVC supports SR-IOV PF and VF configuration at the node level, and these VFs can be passed into VMs in two ways.
SR-IOV's main benefit is to offload bridging and network functions from the hypervisor layer, and direct them onto the hardware itself. This can increase network throughput in some situations, as well as provide near-complete isolation of guest networks from the hypervisors (in contrast with bridges which *can* expose client traffic to the hypervisors, and VXLANs which *do* expose client traffic to the hypervisors). For instance, a VF can have a vLAN specified, and the tagging/untagging of packets is then carried out at the hardware layer.
There are however caveats to working with SR-IOV. At the most basic level, the biggest difference with SR-IOV compared to the other two network types is that SR-IOV must be configured on a per-node basis. That is, each node must have SR-IOV explicitly enabled, its specific PF devices defined, and a set of VFs created at PVC startup. Generally, with identical PVC nodes, this will not be a problem but is something to consider, especially if the servers are mismatched in any way. It is thus also possible to set some nodes with SR-IOV functionality, and others without, though care must be taken in this situation to set node limits in the VM metadata of any VMs which use SR-IOV VFs to prevent failed migrations.
PFs are defined in the `pvcnoded.yml` configuration of each node, via the `sriov_device` list. Each PF can have an arbitrary number of VFs (`vfcount`) allocated, though each NIC vendor and model has specific limits. Once configured, specifically with Intel NICs, PFs (and specifically, the `vfcount` attribute in the driver) are immutable and cannot be changed easily without completely flushing the node and rebooting it, so care should be taken to select the desired settings as early in the cluster configuration as possible.
Once created, VFs are also managed on a per-node basis. That is, each VF, on each host, even if they have the exact same device names, is managed separately. For instance, the VF `ens1f0v0` created from the PF `ens1f0` on `hv1` can have a different configuration from the identically-named VF `ens1f0v0` on `hv2`. The administrator is responsible for ensuring consistency here, and for ensuring that devices do not overlap (e.g. assigning the same VF name to VMs on two separate nodes which might migrate to each other). PVC will however explicitly prevent two VMs from being assigned to the same VF on the same node, even if this may be technically possible in some cases.
When attaching VFs to VMs, there are two supported modes: `macvtap`, and `hostdev`.
`macvtap`, as the name suggests, uses the Linux `macvtap` driver to connect the VF to the VM. Once attached, the vNIC behaves just like a "bridged" network connection above, and like "bridged" connections, the "mode" of the NIC can be specified, defaulting to "virtio" but supporting various emulated devices instead. Note that in this mode, vLANs cannot be configured on the guest side; they must be specified in the VF configuration (`pvc network sriov vf set`) with one vLAN per VF. VMs with `macvtap` interfaces can be live migrated between nodes without issue, assuming there is a corresponding free VF on the destination node, and the SR-IOV functionality is transparent to the VM.
`hostdev` is a direct PCIe pass-through method. With a VF attached to a VM in `hostdev` mode, the virtual PCIe NIC device itself becomes hidden from the node, and is visible only to the guest, where it appears as a discrete PCIe device. In this mode, vLANs and other attributes can be set on the guest side at will, though setting vLANs and other properties in the VF configuration is still supported. The main caveat to this mode is that VMs with connected `hostdev` SR-IOV VFs *cannot be live migrated between nodes*. Only a `shutdown` migration is supported, and, like `macvtap`, an identical PCIe device at the same bus address must be present on the target node. To prevent unexpected failures, PVC will explicitly set the VM metadata for the "migration method" to "shutdown" the first time that a `hostdev` VF is attached to it; if this changes later, the administrator must change this back explicitly.
Generally speaking, SR-IOV connections are not recommended unless there is a good use-case for them. On modern hardware, software bridges are extremely performant, and are much simpler to manage. The functionality is provided for those rare use-cases where SR-IOV is absolutely required by the administrator, but care must be taken to understand all the requirements and caveats of SR-IOV before using it in production.
#### Other Client Networks
Future PVC versions may support other client network types, such as direct-routing between VMs.
### Fencing and Recovery
Self-management and self-healing are important components of PVC's design, and to accomplish this, PVC contains automated fencing and recovery functions to handle situations where nodes crash or become unreachable. PVC is then able, if properly configured, to directly power-cycle the failed node, and bring up any VMs that were running on it on the remaining hypervisors. This ensures that, while there might be a few minutes of downtime for VMs, they are recovered as quickly as possible without human intervention.
To operate correctly, these functions require each node in the cluster to have a functional IPMI-over-IP setup with a configured user who is able to perform chassis power commands. This differs depending on the chassis manufacturer and model, and should be tested prior to deploying any production cluster. If IPMI is not configured correctly at node startup, the daemon will warn and disable automatic recovery of the node. The IPMI should be present in the Upstream system network (see [System Networks](#system-networks) above), or in another secured network which is reachable from the Upstream system network, whichever is more convenient for the layout of the networks.
The general process is divided into 3 sections: detecting node failures, fencing nodes, and recovering from fenced nodes. Note that this process only applies to nodes in the `run` "daemon state"; if a node daemon cleanly shuts down (for instance due to a service restart or administrative action), it will not be fenced.
#### Detecting Failed Nodes
Within the PVC configuration, each node has 3 settings which determine the failure detection time. The first is the `keepalive_interval` setting. This is normally set to 5 seconds, and is the interval at which the node daemon of each node sends its keepalives (as well as gathers statistics about running VMs, Ceph components, etc.). This interval should never need to be changed, but is configurable for maximum flexibility in corner cases. During each keepalive, the node updates a specific key in the Zookeeper cluster with the current UNIX timestamp, which determines when the node was last alive. During their own keepalives, the other nodes check their peers' timestamps to confirm if they are updating normally. Note that, due to this happening during the peer keepalives, if all nodes lose contact with the Zookeeper database, they will *not* immediately begin fencing each other, since the keepalives will not complete; they will, however, upon recovery, jump immediately to the next section when they all realize that their last keepalives were over the threshold, and this situation is discussed there.
The second option is the `fence_intervals` setting. This option determines how many keepalive intervals a node can miss before it is marked `dead` and a fencing sequence started. This is normally set to 6 intervals, which combined with the 5 second `keepalive_interval`, gives a total of 30 seconds (+/- up to another 5 second `keepalive_interval` for peers should they not line up) for the node to be without updates before fencing begins.
The third setting is optional, and is best used in situations where the IPMI connectivity of a node is excessively flaky or can be impaired (e.g. georedundant clusters), or where VM uptime is more important than the burden of recovering from a split-brain situation, and is not as extensively tested. This option is `suicide_intervals`, and if set to a non-0 value, is the number of keepalive intervals before a node *itself* determines that it should forcibly power itself off, which should always be equal to or less than the normal `fence_intervals` setting. Naturally, the node must be somewhat functional to do this, and this can go very wrong, so using this option is not normally recommended.
#### Fencing Nodes
Once the cluster, and specifically one node in the cluster, has determined that a given node is `dead` due to a lack of keepalives, the fencing process starts. This spawns a dedicated child thread within the node daemon of the detecting node, which continually monitors the state of the `dead` node and then performs the fence.
During the `dead` process, the failed node has 6 chances, called "saving throws", at `keepalive_interval`-second windows, to send another keepalive before it is fenced. This additional, fixed delay helps ensure that the cluster will gracefully recover from intermittent network failures or loss of Zookeeper contact, by providing nodes up to another 6 keepalive intervals to save themselves once the fence timer actually begins. This brings the total time, with default options, from a node losing contact to that node being fenced, to between 60 and 65 seconds. This duration is considered by the author an acceptable compromise between speedy recovery and avoiding false positives (and hence larger outages).
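The default timing can be summarized with a short calculation (the values here are the defaults described in this section):

```python
keepalive_interval = 5  # seconds between keepalives (default)
fence_intervals = 6     # missed keepalives before a node is marked dead (default)
saving_throws = 6       # additional keepalive windows before the fence actually fires

time_to_dead = keepalive_interval * fence_intervals                  # 30 seconds
time_to_fence = time_to_dead + keepalive_interval * saving_throws    # 60 seconds
# Plus up to one additional keepalive_interval of peer skew: roughly 60-65 seconds in total.
print(time_to_dead, time_to_fence)
```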
Once a node has been marked `dead` and has failed its 6 "saving throws", the fence process triggers an IPMI chassis reset sequence. First, the node is issued an IPMI `chassis power off` command to trigger a cold system shutdown. Next, it waits a fixed 1 second and then checks and logs the current `chassis power state`, and then issues a `chassis power on` signal to start up the node. It then finally waits a fixed 2 seconds, and then checks the current `chassis power status`. Using the results of these 3 commands, PVC is then able to determine with near certainty whether the node has truly been forced offline or not, and it can proceed to the next step.
#### Recovery from Node Fences
Once a node has been fenced, successfully or not, the system waits for one keepalive interval before proceeding.
The cluster then determines what to do based both on the result of the fence (whether the node was determined to have been successfully cold-reset or not) and on two additional configuration values. The first, `successful_fence`, specifies what action to take when the fence was successful, and is either `migrate` (VMs to other nodes), the default, or `None` (no action). The second, `failed_fence`, is an identical choice for when the fence was unsuccessful, and defaults to `None`.
If the fence was successful and `successful_fence` is set to `None`, then no migration takes place and the VMs on the fenced node will remain offline until the node recovers. If instead `successful_fence` is set to the default of `migrate`, the system will then begin migrating (and hence, starting) VMs that were active on the failed node to other nodes in the cluster. During this special `fence-flush` action, any stale RBD locks on the storage volumes are forcibly cleared, and this is considered safe since the fenced node is determined to have successfully been powered off and the VMs thus terminated. Once all VMs are migrated, the fenced node will then be set to a normal `flushed` state, as if it had been cleanly flushed before powering off. If and when the node returns to active, healthy service, either automatically (if the reset cleared the fault condition) or after human intervention, VMs can then migrate back and the cluster can resume normal operation; otherwise the cluster will remain in the degraded state until corrected.
If the fence was unsuccessful and `failed_fence` is set to the default of `None`, no automatic recovery takes place, since the cluster cannot determine that it is safe to do so. This would most commonly occur during network partitions where the `dead` node potentially remains up with VMs running on it, and the cluster is now in a split-brain situation. The `suicide_intervals` option mentioned above is provided for this specific situation, and would allow the administrator to set the `failed_fence` action to `migrate` as well, as they could be somewhat confident that the node will have forcibly terminated itself. However, due to the inherent potential for danger in this scenario, it is recommended to leave these options at their defaults and handle such situations manually instead, as well as to ensure proper network design to avoid the potential for such split-brain situations to occur.
## Advanced Layouts
### Coordinators versus Hypervisors
While a normal basic PVC cluster would consist of 3, or perhaps 5, nodes, PVC is able to scale up much further by differentiating between "coordinator" and "hypervisor" nodes. Such a basic cluster consists only of coordinator nodes. When scaling up, however, it is prudent to add new nodes as hypervisor nodes instead, to minimize database scaling problems.
#### Coordinators
Coordinators are a special set of 3 or 5 nodes with additional functionality. The coordinator nodes run, in addition to the PVC software itself, a number of databases and additional functions which are required by the whole cluster. An odd number of coordinators is *always* required to maintain quorum, though there are diminishing returns when creating more than 3. As mentioned above, generally for small clusters all nodes are coordinators.
These additional functions are:
0. The Zookeeper database cluster containing the cluster state and configuration
0. The Patroni PostgreSQL database cluster containing DNS records for managed networks and provisioning configurations
0. The FRR EBGP route reflectors and upstream BGP peers
In addition to these functions, coordinators can usually also run all other PVC node functions.
The set of coordinator nodes is generally configured at cluster bootstrap, initially with 3 nodes, which are then bootstrapped together to form a basic 3-node cluster. Additional nodes, either as coordinators or as hypervisors, can then be added to the running cluster to bring it up to its final size, either immediately or as the needs of the cluster change.
##### The Primary Coordinator
Within the set of coordinators, a single primary coordinator is elected at cluster startup and as nodes start and stop, or in response to administrative commands. Once a node becomes primary, it will remain so until it stops or is told not to be. This coordinator is responsible for some additional functionality beyond that of the other coordinators. These additional functions are:
0. The floating IPs in the main networks
0. The default gateway IP for each managed client network
0. The DNSMasq instance handling DHCP and DNS for each managed client network
0. The API and provisioner clients and workers
PVC gracefully handles transitioning primary coordinator state in order to minimize downtime. Workers will continue to operate on the old primary coordinator after a switchover, if it is still available; the administrator should be aware of any active tasks before switching the active primary coordinator.
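For reference, a switchover can be triggered manually via the CLI's node commands; the node names below are examples only.

```
# Promote hv2 to primary coordinator, or explicitly relinquish the role from hv1
$ pvc node primary hv2
$ pvc node secondary hv1
```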
#### Hypervisors
Hypervisor nodes do not run any of the database or routing functionality of coordinator nodes, nor can they become the primary coordinator node (for obvious reasons). When scaling a cluster up beyond the initial 3, or perhaps 5, coordinator nodes, or when an even number of nodes (e.g. 4) may be desired, any nodes beyond the 3 coordinators should be added as hypervisors.
Hypervisor nodes are capable of running VMs and Ceph OSD disks, just like coordinator nodes, though the latter is optional.
PVC has no limit to the number of hypervisor nodes that can connect to a set of coordinators, though beyond a dozen or so total nodes, a more scale-focused infrastructure solution may be warranted.
### Georedundancy
PVC supports geographic redundancy of nodes in order to facilitate disaster recovery scenarios when uptime is critical. Functionally, PVC behaves the same regardless of whether the 3 or more coordinators are in the same physical location, or remote physical locations.
When using geographic redundancy, there are several caveats to keep in mind:
* The Ceph storage subsystem is latency-sensitive. With the default replication configuration, at least 2 writes must succeed for a write to return as successful, so the total latency of any write on the cluster will be equal to the maximum latency between any two nodes. It is recommended to keep all PVC nodes as "close" as possible latency-wise or storage performance may suffer.
* The inter-node PVC networks (see [System Networks](#system-networks)) must be layer-2 networks (broadcast domains). These networks must be spanned to all nodes in all locations.
* The number of sites and positioning of coordinators at those sites is important. A majority (at least 2 in a 3-coordinator cluster, or 3 in a 5-coordinator cluster) of coordinators must be able to reach each other in a failure scenario for the cluster as a whole to remain functional. Thus, configurations such as 2 + 1 or 3 + 2 splits across 2 sites do *not* provide full redundancy, and the whole cluster will be down if the majority site is down. It is thus recommended to always have an odd number of sites to match the odd number of coordinators, for instance a 1 + 1 + 1 or 2 + 2 + 1 configuration. Also note that all hypervisors must be able to reach the majority coordinator group or their storage will be impacted as well.
This diagram outlines the supported and unsupported/unreliable georedundant configurations for 3 nodes. Care must always be taken to ensure that the cluster can operate with the loss of any given georedundant site.
![georedundancy-caveats](/images/georedundancy-caveats.png)
*Above: Supported and unsupported/unreliable georedundant configurations*
* Even if the PVC software itself is in an unmanageable state, VMs will continue to run if at all possible. However, since the storage subsystem makes use of the same quorum, losing more than half of the coordinator nodes will very likely result in storage interruption as well, which will affect running VMs.
* Nodes in remote geographic locations might not be able to be fenced by the remaining PVC nodes if the entire site is unreachable. The cluster will thus be unable to automatically recover VMs at the failed site should it go down. If at all possible, redundant links to georedundant sites are recommended to ensure there is always a network path. Note that the `suicide_intervals` configuration option, while it might seem to help here, will not, because the remaining nodes will not be able to reliably confirm whether the remote site actually *did* shut itself off. Thus, automatic failover of georedundant sites is a potential deficiency that must be considered.
If these requirements cannot be fulfilled, it may be best to have separate PVC clusters at each site and handle service redundancy at a higher layer to avoid a major disruption.
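Relating to the storage latency caveat above, it may be worth measuring sustained round-trip latency between candidate sites before committing to a georedundant layout; the hostname below is a placeholder.

```
# Measure round-trip latency from one site's node to another over an extended sample
$ ping -c 100 hv2.site-b.example.tld
```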
## Example System Diagrams
This section provides diagrams of 2 best-practice cluster configurations. These diagrams can be extrapolated out to almost any possible configuration and number of nodes.
### Small 3-node cluster
[![Small 3-node cluster](/images/pvc-3-node-cluster.png)](/images/pvc-3-node-cluster.png)
*Above: A diagram of a simple 3-node cluster with all nodes as coordinators. Dual 10 Gbps network interface per node, unified physical networking with collapsed cluster and storage networks.*
### Large 8-node cluster
[![Larger 8-node cluster](/images/pvc-8-node-cluster.png)](/images/pvc-8-node-cluster.png)
*Above: A diagram of a large 8-node cluster with 3 coordinators and 5 hypervisors. Quad 10Gbps network interfaces per node, split physical networking into guest/cluster and storage networks.*

# PVC Provisioner Guide
The PVC provisioner is a subsection of the main PVC API. It interfaces directly with the Zookeeper database using the common client functions, and with the Patroni PostgreSQL database to store details. The provisioner also interfaces directly with the Ceph storage cluster, for mapping volumes, creating filesystems, and installing guests.
Details of the Provisioner API interface can be found in [the API manual](/manuals/api).
The PVC provisioner is a subsection of the main PVC system, designed to aid administrators in quickly deploying virtual machines (mostly major Linux flavours) according to defined templates and profiles, leveraging CloudInit and customizable provisioning scripts, or by deploying OVA images.
- [PVC Provisioner Guide](#pvc-provisioner-guide)
* [Overview](#overview)
## Overview
The purpose of the Provisioner is to provide a convenient way for administrators to automate the creation of new virtual machines on the PVC cluster.
The Provisioner allows the administrator to construct descriptions of VMs, called profiles, which include system resource specifications, network interfaces, disks, cloud-init userdata, and installation scripts. These profiles are highly modular, allowing the administrator to specify arbitrary combinations of the mentioned VM features with which to build new VMs.
The provisioner supports creating VMs based off of installation scripts, by cloning existing volumes, and by uploading OVA image templates to the cluster.
Examples in the following sections use the CLI exclusively for demonstration purposes. For details of the underlying API calls, please see the [API interface reference](/manuals/api-reference.html).
Use of the PVC Provisioner is not required. Administrators can always perform their own installation tasks, and the provisioner is not specially integrated, calling various other commands as though they were run from the CLI or API.
# PVC Provisioner concepts
As mentioned above, the `VMBuilderScript` instance includes several instance variables:
* `self.vm_data`: A full dictionary representation of the data provided by the PVC provisioner about the VM. Includes many useful details for crafting the VM configuration and setting up disks and networks. An example, in JSON format:
```
{
"ceph_monitor_list": [
"hv1.pvcstorage.tld",
"hv2.pvcstorage.tld",
"hv3.pvcstorage.tld"
],
"ceph_monitor_port": "6789",
"ceph_monitor_secret": "96721723-8650-4a72-b8f6-a93cd1a20f0c",
"mac_template": null,
"networks": [
{
"eth_bridge": "vmbr1001",
"id": 72,
"network_template": 69,
"vni": "1001"
},
{
"eth_bridge": "vmbr101",
"id": 73,
"network_template": 69,
"vni": "101"
}
],
"script": [contents of this file]
"script_arguments": {
"deb_mirror": "http://ftp.debian.org/debian",
"deb_release": "bullseye"
},
"system_architecture": "x86_64",
"system_details": {
"id": 78,
"migration_method": "live",
"name": "small",
"node_autostart": false,
"node_limit": null,
"node_selector": null,
"ova": null,
"serial": true,
"vcpu_count": 2,
"vnc": false,
"vnc_bind": null,
"vram_mb": 2048
},
"volumes": [
{
"disk_id": "sda",
"disk_size_gb": 4,
"filesystem": "ext4",
"filesystem_args": "-L=root",
"id": 9,
"mountpoint": "/",
"pool": "vms",
"source_volume": null,
"storage_template": 67
},
{
"disk_id": "sdb",
"disk_size_gb": 4,
"filesystem": "ext4",
"filesystem_args": "-L=var",
"id": 10,
"mountpoint": "/var",
"pool": "vms",
"source_volume": null,
"storage_template": 67
},
{
"disk_id": "sdc",
"disk_size_gb": 4,
"filesystem": "ext4",
"filesystem_args": "-L=log",
"id": 11,
"mountpoint": "/var/log",
"pool": "vms",
"source_volume": null,
"storage_template": 67
}
]
}
```
Since the `VMBuilderScript` runs within its own context but within the PVC Provisioner/API system, it is possible to use many helper libraries from the PVC system itself, including both the built-in daemon libraries (used by the API itself) and several explicit provisioning script helpers. The following are commonly-used (in the examples) imports that can be leveraged:
* `daemon_lib.common`: Part of the PVC daemon libraries, provides several common functions, including, most usefully, `run_os_command` which provides a wrapped, convenient method to call arbitrary shell/OS commands while returning a POSIX returncode, stdout, and stderr (a tuple of the 3 in that order).
* `daemon_lib.ceph`: Part of the PVC daemon libraries, provides several commands for managing Ceph RBD volumes, including, but not limited to, `clone_volume`, `add_volume`, `map_volume`, and `unmap_volume`. See the `debootstrap` example for a detailed usage example.
For safety reasons, the script runs in a modified chroot environment on the hypervisor. It will have full access to the entire `/` (root partition) of the hypervisor, but read-only. In addition it has read-write access to `/dev`, `/sys`, `/run`, and a fresh `/tmp` to write to; use `/tmp/target` (as convention) as the destination for any mounting of volumes and installation. Thus it is not possible to do things like `apt-get install`ing additional programs within a script; any such requirements must be set up before running the script (e.g. via `pvc-ansible`).
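As a purely illustrative sketch of that convention (the device path and volume name below are hypothetical, and a real script would obtain and map the volumes via the helper libraries described above):

```
# Inside the provisioning chroot: create the conventional target mountpoint and mount the
# VM's mapped root volume there before installing into it
mkdir -p /tmp/target
mount /dev/rbd/vms/my-vm_sda /tmp/target
```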
⚠️ **WARNING** Of course, despite this "safety" mechanism, it is **very important** to be cognizant that this script runs **as root on the hypervisor system** with **full access to the cluster**. You should **never** allow arbitrary, untrusted users the ability to add or modify provisioning scripts. It is trivially easy to write scripts which will do destructive things - for example writing to arbitrary `/dev` objects, running arbitrary root-level commands, or importing PVC library functions to delete VMs, RBD volumes, or pools. Thus, ensure you vet and understand every script on the system, audit them regularly for both intentional and accidental malicious activity, and of course (to reiterate), do not allow untrusted script creation!
## Profiles
```
Using cluster "local" - Host: "10.0.0.1:7370" Scheme: "http" Prefix: "/api/v1"
Task ID: 39639f8c-4866-49de-8c51-4179edec0194
```
📝 **NOTE** A VM that is set to do so will be defined on the cluster early in the provisioning process, before creating disks or executing the provisioning script, with the special status `provision`. Once completed, if the VM is not set to start automatically, the state will remain `provision`, with the VM not running, until its state is explicitly changed with the client (or via autostart when its node returns to `ready` state).
**NOTE**: Provisioning jobs are tied to the node that spawned them. If the primary node changes, provisioning jobs will continue to run against that node until they are completed, interrupted, or fail, but the active API (now on the new primary node) will not have access to any status data from these jobs, until the primary node status is returned to the original host. The CLI will warn the administrator of this if there are active jobs while running `node primary` or `node secondary` commands.
📝 **NOTE** Provisioning jobs cannot be cancelled, either before they start or during execution. The administrator should always let an invalid job either complete or fail out automatically, then remove the erroneous VM with the `vm remove` command.
# Deploying VMs from OVA images
During import, PVC splits the OVA into its constituent parts, including any disk images.
Because of this, OVA profiles do not include storage templates like other PVC profiles. A storage template can still be added to such a profile, and the block devices will be added after the main block devices. However, this is generally not recommended; it is far better to modify the OVA to add additional volume(s) before uploading it instead.
⚠️ **WARNING** Never adjust the sizes of the OVA VMDK-formatted storage volumes (named `ova_<NAME>_sdX`) or remove them without removing the OVA itself in the provisioner; doing so will prevent the deployment of the OVA, specifically the conversion of the images to raw format at deploy time, and render the OVA profile useless.

# Getting started - deploying a Parallel Virtual Cluster
PVC aims to be easy to deploy, letting you get on with managing your cluster in just a few hours at most. Once initial setup is complete, the cluster is managed via the clients, though the Ansible framework is used to add, remove, or modify nodes as required.
This guide will walk you through setting up a simple 3-node PVC cluster from scratch, ending with a fully-usable cluster ready to provision virtual machines. Note that all domains, IP addresses, etc. used are examples - when following this guide, be sure to modify the commands and configurations to suit your needs.
### Part One - Preparing for bootstrap
0. Read through the [Cluster Architecture documentation](/cluster-architecture). This documentation details the requirements and conventions of a PVC cluster, and is important to understand before proceeding.
0. Download the latest copy of the [`pvc-ansible`](https://github.com/parallelvirtualcluster/pvc-ansible) repository to your local machine.
0. Leverage the `create-local-repo.sh` script in the `pvc-ansible` directory to set up a local cluster configuration directory; follow the instructions the script provides, as all future steps will be done inside your new local configuration directory.
0. Create an initial `hosts` inventory, using `hosts.default` in the `pvc-ansible` repo as a template. You can manage multiple PVC clusters ("sites") from the Ansible repository easily, however for simplicity you can use the simple name `cluster` for your initial site. Define the 3 hostnames you will use under the site group; usually the provided names of `pvchv1`, `pvchv2`, and `pvchv3` are sufficient, though you may use any hostname pattern you wish. It is *very important* that the names all contain a sequential number, however, as this is used by various components.
0. Create an initial set of `group_vars` for your cluster at `group_vars/<cluster>`, using the `group_vars/default` in the `pvc-ansible` repo as a template. Inside these group vars are two main files: `base.yml` and `pvc.yml`. These example files are well-documented; read them carefully and specify all required options before proceeding, and reference the [Ansible setup examples](https://github.com/parallelvirtualcluster/pvc-ansible) for more detailed descriptions of the options.
* `base.yml` configures the `base` role and some common per-cluster configurations such as an upstream domain, a root password, a set of administrative users, various hardware configuration items, as well as, most importantly, the basic network configuration of the nodes. Make special note of the various items that must be generated, such as passwords; these should all be cluster-unique.
* `pvc.yml` configures the `pvc` role, including all the dependent software and PVC itself. Important to note is the `pvc_nodes` list, which contains a list of all the nodes as well as per-node configurations for each. All nodes must be a part of this list.
0. In the `pvc-installer` directory, run the `buildiso.sh` script to generate an installer ISO. This script requires `debootstrap`, `isolinux`, and `xorriso` to function. The resulting file will, by default, be named `pvc-installer_<date>.iso` in the current directory. For additional options, use the `-h` flag to show help information for the script.
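As an illustration of the inventory described earlier in this part (using the example site name `cluster` and the default hostnames; adjust to your own naming), a minimal `hosts` file could be created as follows:

```
# Create a minimal inventory for a single site named "cluster" with the default hostnames
cat > hosts <<EOF
[cluster]
pvchv1
pvchv2
pvchv3
EOF
```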
### Part Two - Preparing and installing the physical hosts
0. Prepare 3 physical servers with IPMI. The servers should match the specifications and requirements outlined in the [Cluster Architecture documentation](/cluster-architecture). Connect their networking based on the configuration set in the `base.yml` group vars file for your cluster.
0. Load the installer ISO generated in step 6 of the previous section onto a USB stick, or present it via IPMI virtual media, to the physical servers.
0. Boot the physical servers off of the installer ISO. Use UEFI mode - if available - for maximum flexibility and longevity.
0. Follow the prompts from the installer ISO. It will ask for a hostname, the system disk device to use, the initial network interface to configure as well as vLANs and either DHCP or static IP information, and finally either an HTTP URL containing an SSH `authorized_keys` to use for the `deploy` user, or a password for this user if key auth is unavailable.
0. Wait for the installer to complete. This may take several minutes.
0. At the end of the install process, follow the prompts carefully; it is usually prudent to pre-set the `/etc/network/interfaces` configuration based on your expected final physical network config (e.g. set up bonding, etc.) before proceeding, especially if you use DHCP, as the bonding configuration applied later could affect the address. The `chroot` is likely unneeded unless you have good reason to edit the system in this way.
0. Make note of the (temporary and insecure!) root password set by the installer; you may need it to troubleshoot the system if it does not come up properly. This will be overwritten later in the setup process.
0. Press "Enter" to reboot the system and confirm it is reachable.
0. Repeat the above steps for all 3 initial nodes. On boot, they will display their configured IP address to be used in the next steps.
### Part Three - Initial bootstrap with Ansible
0. Make note of the IP addresses of all 3 initial nodes, and configure DNS, `/etc/hosts`, or Ansible `ansible_host=` hostvars to map these IP addresses to the hostnames set in the Ansible `hosts` and `group_vars` files.
0. Verify connectivity from your administrative host to the 3 initial nodes, including SSH access as the `deploy` user. Accept their host keys as required before proceeding as Ansible does not like those prompts. If you did not configure SSH key auth during the PVC installer process, configure it now, as it greatly simplifies Ansible configuration.
0. Verify your `group_vars` setup from part 1, as errors here may require a re-installation and restart of the bootstrap process.
0. Perform the initial bootstrap. From your local configuration repository directory, execute the following `ansible-playbook` command, replacing `<cluster_name>` with the Ansible group name from the `hosts` file. Make special note of the additional `bootstrap=yes` variable, which tells the playbook that this is an initial bootstrap run.
`$ ansible-playbook -v -i hosts pvc.yml -l <cluster_name> -e bootstrap=yes`
**WARNING:** Never run this playbook with the `-e bootstrap=yes` option against an active, already-bootstrapped cluster. This will have **disastrous consequences** including the **loss of all data** in the Ceph system as well as any configured networks, VMs, etc.
0. Wait for the Ansible playbook run to finish. Once completed, the cluster bootstrap will be finished, and all 3 nodes will have rebooted into a working PVC cluster. If any errors occur, carefully evaluate them and re-run the playbook (with `-e bootstrap=yes` - your cluster is not active yet!) as required.
0. Download and install the CLI client package (`pvc-client-cli.deb`) on your administrative host, and add and verify connectivity to the cluster; this will also verify that the API is working. You will need to know the cluster upstream floating IP address you configured in the `networks` section of the `base.yml` playbook, and if you configured SSL or authentication for the API in your `group_vars`, adjust the first command as needed (see `pvc cluster add -h` for details). A human-readable description can also be specified, which is useful if you manage multiple clusters and their names become unwieldy.
`$ pvc cluster add -a <upstream_floating_ip> -d "My first PVC cluster" mycluster`
`$ pvc -c mycluster node list`
You can also set a default cluster by exporting the `PVC_CLUSTER` environment variable to avoid requiring `-c cluster` with every subsequent command:
`$ export PVC_CLUSTER="mycluster"`
**Note:** It is fully possible to administer the cluster from the nodes themselves via SSH should you so choose, to avoid requiring the PVC client on your local machine.
### Part Four - Configuring the Ceph storage cluster
0. Determine the Ceph OSD block devices on each host via an `ssh` shell. For instance, use `lsblk` or check `/dev/disk/by-path` to show the block devices by their physical SAS/SATA bus location, and obtain the relevant `/dev/sdX` name for each disk you wish to be a Ceph OSD on each host.
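For example, the following can be run from the administrative host to inspect the first node; the node name is an example, and running the same commands in a local shell on each node works equally well.

```
# List whole disks with their size, model, and serial to identify the data disks,
# then inspect their physical bus paths
$ ssh deploy@pvchv1 lsblk -d -o NAME,SIZE,MODEL,SERIAL
$ ssh deploy@pvchv1 ls -l /dev/disk/by-path/
```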
0. Configure an OSD device for each data disk in each host. The general command is:
`$ pvc storage osd add --weight <weight> <node> <device>`
For example, if each node has two data disks, as `/dev/sdb` and `/dev/sdc`, run the commands as follows to add the first disk to each node, then the second disk to each node:
`$ pvc storage osd add --weight 1.0 pvchv1 /dev/sdb`
`$ pvc storage osd add --weight 1.0 pvchv2 /dev/sdb`
`$ pvc storage osd add --weight 1.0 pvchv3 /dev/sdb`
`$ pvc storage osd add --weight 1.0 pvchv1 /dev/sdc`
`$ pvc storage osd add --weight 1.0 pvchv2 /dev/sdc`
`$ pvc storage osd add --weight 1.0 pvchv3 /dev/sdc`
**NOTE:** On the CLI, the `--weight` argument is optional, and defaults to `1.0`. In the API, it must be specified explicitly, but the CLI sets a default value. OSD weights determine the relative amount of data which can fit onto each OSD. Under normal circumstances, you would want all OSDs to be of identical size, and hence all should have the same weight. If your OSDs are instead different sizes, the weight should be proportional to the size, e.g. `1.0` for a 100GB disk, `2.0` for a 200GB disk, etc. For more details, see the [Cluster Architecture](/cluster-architecture) and Ceph documentation.
**NOTE:** OSD commands wait for the action to complete on the node, and can take some time (up to 30 seconds).
**NOTE:** You can add OSDs in any order you wish, for instance you can add the first OSD to each node and then add the second to each node, or you can add all nodes' OSDs together at once like the example. This ordering does not affect the cluster in any way.
0. Verify that the OSDs were added and are functional (`up` and `in`):
`$ pvc storage osd list`
0. Create an RBD pool to store VM images on. The general command is:
`$ pvc storage pool add <name> <placement_groups>`
**NOTE:** Ceph placement groups are a complex topic; as a general rule it's easier to grow than shrink, so start small and grow as your cluster grows. The following are some good starting numbers for 3-node clusters, though the Ceph documentation and the [Ceph placement group calculator](https://ceph.com/pgcalc/) are advisable for anything more complex. There is a trade-off between CPU usage and the number of total PGs for all pools in the cluster, with more PGs meaning more CPU usage.
* 3 OSDs total: 128 PGs (1 pool) or 64 PGs (2 or more pools, each)
* 6 OSDs total: 256 PGs (1 pool) or 128 PGs (2 or more pools, each)
* 9+ OSDs total: 256 PGs
For example, to create a pool named `vms` with 256 placement groups, run the command as follows:
`$ pvc storage pool add vms 256`
**NOTE:** As detailed in the [cluster architecture documentation](/cluster-architecture), you can also set a custom replica configuration for each pool if the default of 3 replica copies with 2 minimum copies is not acceptable. See `pvc storage pool add -h` or that document for full details.
0. Verify that the pool was added:
`$ pvc storage pool list`
### Part Five - Creating virtual networks
0. Determine a domain name and an IPv4 and/or IPv6 network for your first client network, and for any other client networks you may wish to create. These networks must not overlap with the cluster networks. For full details on the client network types, see the [cluster architecture documentation](/cluster-architecture).
0. Create the virtual network. There are many options here, so see `pvc network add -h` for details.
For example, to create the managed (EVPN VXLAN) network `100` with subnet `10.100.0.0/24`, gateway `.1` and DHCP from `.100` to `.199`, run the command as follows:
`$ pvc network add 100 --type managed --description my-managed-network --domain myhosts.local --ipnet 10.100.0.0/24 --gateway 10.100.0.1 --dhcp --dhcp-start 10.100.0.100 --dhcp-end 10.100.0.199`
For another example, to create the static bridged (switch-configured, tagged VLAN, with no PVC management of IPs) network `200`, run the command as follows:
`$ pvc network add 200 --type bridged --description my-bridged-network`
**NOTE:** Network descriptions cannot contain spaces or special characters; keep them short, sweet, and dash or underscore delimited.
0. Verify that the network(s) were added:
`$ pvc network list`
0. On the upstream router, configure one of:
a) A BGP neighbour relationship with the cluster upstream floating address to automatically learn routes.
b) Static routes for the configured client IP networks towards the cluster upstream floating address.
0. On the upstream router, if required, configure NAT for the configured client IP networks.
0. Verify the client networks are reachable by pinging the managed gateway from outside the cluster.
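As a sketch of the router-side steps above, assuming a Linux-based upstream router, the example managed network `10.100.0.0/24`, and a hypothetical cluster upstream floating IP of `10.0.0.250`:

```
# Option (b): a static route for the managed client network via the cluster upstream floating IP
ip route add 10.100.0.0/24 via 10.0.0.250
# Optional NAT for outbound traffic from the managed network (the interface name is an example)
iptables -t nat -A POSTROUTING -s 10.100.0.0/24 -o eth0 -j MASQUERADE
```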
### You're Done!
0. Set all 3 nodes to `ready` state, allowing them to run virtual machines. The general command is:
`$ pvc node ready <node>`
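For example, with the three example nodes used throughout this guide:

```
$ pvc node ready pvchv1
$ pvc node ready pvchv2
$ pvc node ready pvchv3
```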
Congratulations, you now have a basic PVC storage cluster, ready to run your VMs.
For next steps, see the [Provisioner manual](/manuals/provisioner) for details on how to use the PVC provisioner to create new Virtual Machines, as well as the [CLI manual](/manuals/cli) and [API manual](/manuals/api) for details on day-to-day usage of PVC.

<p align="center">
<img alt="Logo banner" src="images/pvc_logo_black.png"/>
<img alt="Logo banner" src="https://docs.parallelvirtualcluster.org/en/latest/images/pvc_logo_black.png"/>
<br/><br/>
<a href="https://www.parallelvirtualcluster.org"><img alt="Website" src="https://img.shields.io/badge/visit-website-blue"/></a>
<a href="https://github.com/parallelvirtualcluster/pvc/releases"><img alt="Latest Release" src="https://img.shields.io/github/release-pre/parallelvirtualcluster/pvc"/></a>
<a href="https://docs.parallelvirtualcluster.org/en/latest/?badge=latest"><img alt="Documentation Status" src="https://readthedocs.org/projects/parallelvirtualcluster/badge/?version=latest"/></a>
<a href="https://github.com/parallelvirtualcluster/pvc"><img alt="License" src="https://img.shields.io/github/license/parallelvirtualcluster/pvc"/></a>
<a href="https://github.com/parallelvirtualcluster/pvc/releases"><img alt="Release" src="https://img.shields.io/github/release-pre/parallelvirtualcluster/pvc"/></a>
<a href="https://parallelvirtualcluster.readthedocs.io/en/latest/?badge=latest"><img alt="Documentation Status" src="https://readthedocs.org/projects/parallelvirtualcluster/badge/?version=latest"/></a>
<a href="https://github.com/psf/black"><img alt="Code style: Black" src="https://img.shields.io/badge/code%20style-black-000000.svg"/></a>
</p>
## What is PVC?
PVC also features an optional, fully customizable VM provisioning framework, designed to automate and simplify VM deployments using custom provisioning profiles, scripts, and CloudInit userdata API support.
Installation of PVC is accomplished by two main components: a [Node installer ISO](https://github.com/parallelvirtualcluster/pvc-installer) which creates on-demand installer ISOs, and an [Ansible role framework](https://github.com/parallelvirtualcluster/pvc-ansible) to configure, bootstrap, and administrate the nodes. Installation can also be fully automated with a companion [cluster bootstrapping system](https://github.com/parallelvirtualcluster/pvc-bootstrap). Once up, the cluster is managed via an HTTP REST API, accessible via a Python Click CLI client ~~or WebUI~~ (eventually).
Just give it physical servers, and it will run your VMs without you having to think about it, all in just an hour or two of setup time.
For more details on the project motivation, please see the [About](https://parallelvirtualcluster.readthedocs.io/en/latest/about/) page.
## What is it based on?
The core node and API daemons, as well as the CLI API client, are written in Python 3 and are fully Free Software (GNU GPL v3). In addition to these, PVC makes use of the following software tools to provide a holistic hyperconverged infrastructure solution:
* Debian GNU/Linux as the base OS.
* Linux KVM, QEMU, and Libvirt for VM management.
* Linux `ip`, FRRouting, NFTables, DNSMasq, and PowerDNS for network management.
* Ceph for storage management.
* Apache Zookeeper for the primary cluster state database.
* Patroni PostgreSQL manager for the secondary relation databases (DNS aggregation, Provisioner configuration).
More information about PVC, its motivations, the hardware requirements, and setting up and managing a cluster [can be found over at our docs page](https://docs.parallelvirtualcluster.org).
## Getting Started
To get started with PVC, please see the [About](https://docs.parallelvirtualcluster.org/en/latest/about-pvc/) page for general information about the project, and the [Getting Started](https://docs.parallelvirtualcluster.org/en/latest/deployment/getting-started/) page for details on configuring your first cluster.
## Changelog
View the changelog in [CHANGELOG.md](https://github.com/parallelvirtualcluster/pvc/blob/master/CHANGELOG.md). **Please note that any breaking changes are announced here; ensure you read the changelog before upgrading!**
## Screenshots
These screenshots show some of the available functionality of the PVC system and CLI as of PVC v0.9.85.
<p><img alt="Node listing" src="images/pvc-nodes.png"/><br/><i>Listing the nodes in a cluster</i></p>
<p><img alt="0. Integrated help" src="images/0-integrated-help.png"/><br/>
<i>The CLI features an integrated, fully-featured help system to show details about every possible command.</i>
</p>
<p><img alt="Network listing" src="images/pvc-networks.png"/><br/><i>Listing the networks in a cluster, showing 3 bridged and 1 IPv4-only managed networks</i></p>
<p><img alt="1. Connection management" src="images/1-connection-management.png"/><br/>
<i>A single CLI instance can manage multiple clusters, including a quick detail view, and will default to a "local" connection if an "/etc/pvc/pvc.conf" file is found; sensitive API keys are hidden by default.</i>
</p>
<p><img alt="VM listing and migration" src="images/pvc-migration.png"/><br/><i>Listing a limited set of VMs and migrating one with status updates</i></p>
<p><img alt="2. Cluster details and output formats" src="images/2-cluster-details-and-output-formats.png"/><br/>
<i>PVC can show the key details of your cluster at a glance, including health, persistent fault events, and key resources; the CLI can output both in pretty human format and JSON for easier machine parsing in scripts.</i>
</p>
<p><img alt="Node logs" src="images/pvc-nodelog.png"/><br/><i>Viewing the logs of a node (keepalives and VM [un]migration)</i></p>
<p><img alt="3. Node information" src="images/3-node-information.png"/><br/>
<i>PVC can show details about the nodes in the cluster, including their live health and resource utilization.</i>
</p>
<p><img alt="4. VM information" src="images/4-vm-information.png"/><br/>
<i>PVC can show details about the VMs in the cluster, including their state, resource allocations, current hosting node, and metadata.</i>
</p>
<p><img alt="5. VM details" src="images/5-vm-details.png"/><br/>
<i>In addition to the above basic details, PVC can also show extensive information about a running VM's devices and other resource utilization.</i>
</p>
<p><img alt="6. Network information" src="images/6-network-information.png"/><br/>
<i>PVC has two major client network types, and ensures a consistent configuration of client networks across the entire cluster; managed networks can feature DHCP, DNS, firewall, and other functionality including DHCP reservations.</i>
</p>
<p><img alt="7. Storage information" src="images/7-storage-information.png"/><br/>
<i>PVC provides a convenient abstracted view of the underlying Ceph system and can manage all core aspects of it.</i>
</p>
<p><img alt="8. VM and node logs" src="images/8-vm-and-node-logs.png"/><br/>
<i>PVC can display logs from VM serial consoles (if properly configured) and nodes in-client to facilitate quick troubleshooting.</i>
</p>
<p><img alt="9. VM and worker tasks" src="images/9-vm-and-worker-tasks.png"/><br/>
<i>PVC provides full VM lifecycle management, as well as long-running worker-based commands (in this example, clearing a VM's storage locks).</i>
</p>
<p><img alt="10. Provisioner" src="images/10-provisioner.png"/><br/>
<i>PVC features an extensively customizable and configurable VM provisioner system, including EC2-compatible CloudInit support, allowing you to define flexible VM profiles and provision new VMs with a single command.</i>
</p>
<p><img alt="11. Prometheus and Grafana dashboard" src="images/11-prometheus-grafana.png"/><br/>
<i>PVC features several monitoring integration examples under "node-daemon/monitoring", including CheckMK, Munin, and, most recently, Prometheus, including an example Grafana dashboard for cluster monitoring and alerting.</i>
</p>

The PVC API is a standalone client application for PVC.
The API is built using Flask and is packaged in the Debian package `pvc-client-api`. The API depends on the common client functions of the `pvc-client-common` package as does the CLI client.
The full API endpoint and schema documentation [can be found here](/en/latest/manuals/api-reference.html).
# PVC HTTP API manual
* *required*
The Libvirt storage secret UUID for the Ceph cluster.
## API Endpoint Documentation
The full API endpoint and schema documentation [can be found here](/manuals/api-reference.html).

The action to take regarding VMs once a node fencing *fails*, i.e. the IPMI command to restart the node reports a failure. Can be one of `None`, to perform no action and the default, or `migrate` to migrate and start all failed VMs on other nodes.
⚠️ **WARNING** This functionality is potentially **dangerous** and can result in data loss or corruption in the VM disks; the post-fence migration process *explicitly clears RBD locks on the disk volumes*. It is designed only for specific and advanced use-cases, such as servers that do not reliably report IPMI responses or servers without IPMI (not recommended; see the [cluster architecture documentation](/architecture/cluster)). If this is set to `migrate`, the `suicide_intervals` **must** be set to provide at least some guarantee that the VMs on the node will actually be terminated before this condition triggers. The administrator should think very carefully about their setup and potential failure modes before enabling this option.
#### `system` → `fencing` → `ipmi` → `host`

theme:
  name: readthedocs
  titles_only: yes
  logo: "images/pvc_logo_black_transparent.png"
  width: "100%"

markdown_extensions:
  - toc:
      permalink: yes

nav:
  - 'Home': 'index.md'
  - 'About PVC': 'about-pvc.md'
  - 'Deployment':
    - 'Cluster Architecture': 'deployment/cluster-architecture.md'
    - 'Hardware Requirements': 'deployment/hardware-requirements.md'
    - 'Fencing & Georedundancy': 'deployment/fencing-and-georedundancy.md'
    - 'Getting Started Guide': 'deployment/getting-started.md'
    - 'Provisioner Guide': 'deployment/provisioner.md'
  - 'Manuals':
    - 'PVC CLI': 'manuals/cli.md'
    - 'PVC HTTP API': 'manuals/api.md'
    - 'PVC Node Daemon': 'manuals/daemon.md'
    - 'PVC Node Health Plugins': 'manuals/health-plugins.md'
  - 'API':
    - 'API Reference': 'manuals/api-reference.html'