pvc/docs/manuals/daemon.md

19 KiB

PVC Node Daemon architecture

The PVC Node Daemon is the heart of the PVC system and runs on each node to manage the state of the node and its configured resources. The daemon connects directly to the Zookeeper cluster for coordination and state.

The node daemon is build using Python 3.X and is packaged in the Debian package pvc-daemon.

Configuration of the daemon is documented in the manual, however it is recommended to use the Ansible configuration system to configure the PVC cluster for you from scratch.

Overall architecture

The PVC daemon is object-oriented - each cluster resource is represented by an Object, which is then present on each node in the cluster. This allows state changes to be reflected across the entire cluster should their data change.

During startup, the system scans the Zookeeper database and sets up the required objects. The database is then watched in real-time for additional changes to the database information.

Startup sequence

The daemon startup sequence is documented below. The main daemon entry-point is Daemon.py inside the pvcnoded folder, which is called from the pvcnoded.py stub file.

  1. The configuration is read from /etc/pvc/pvcnoded.yaml and the configuration object set up.

  2. Any required filesystem directories, mostly dynamic directories, are created.

  3. The logger is set up. If file logging is enabled, this is the state when the first log messages are written.

  4. Host networking is configured based on the pvcnoded.yaml configuration file. In a normal cluster, this is the point where the node will become reachable on the network as all networking is handled by the PVC node daemon.

  5. Sysctl tweaks are applied to the host system, to enable routing/forwarding between nodes via the host.

  6. The node determines its coordinator state and starts the required daemons if applicable. In a normal cluster, this is the point where the dependent services such as Zookeeper, FRR, and Ceph become available. After this step, the daemon waits 5 seconds before proceeding to give these daemons a chance to start up.

  7. The daemon connects to the Zookeeper cluster and starts its listener. If the Zookeeper cluster is unavailable, it will wait some time before abandoning the attempt and starting again from step 1.

  8. Termination handling/cleanup is configured.

  9. The node checks if it is already present in the Zookeeper cluster; if not, it will add itself to the database. Initial static options are also updated in the database here. The daemon state transitions from stop to init.

  10. The node checks if Libvirt is accessible.

  11. The node starts up the NFT firewall if applicable and configures the base rule-set.

  12. The node ensures that dnsmasq is stopped (legacy check, might be safe to remove eventually).

  13. The node begins setting up the object representations of resources, in order:

    a. Node entries

    b. Network entries, creating client networks and starting them as required.

    c. Domain (VM) entries, starting up the VMs as required.

    d. Ceph storage entries (OSDs, Pools, Volumes, Snapshots).

  14. The node activates its keepalived timer and begins sending keepalive updates to the cluster. The daemon state transitions from init to run and the system has started fully.

PVC Node Daemon manual

The PVC node daemon ins build with Python 3 and is run directly on nodes. For details of the startup sequence and general layout, see the architecture document.

Configuration

The Daemon is configured using a YAML configuration file which is passed in to the API process by the environment variable PVCD_CONFIG_FILE. When running with the default package and SystemD unit, this file is located at /etc/pvc/pvcnoded.yaml.

For most deployments, the management of the configuration file is handled entirely by the PVC Ansible framework and should not be modified directly. Many options from the Ansible framework map directly into the configuration options in this file.

Conventions

  • Settings may be required, optional, or ignored.

  • Settings may depends on other settings. This indicates that, if one setting is enabled, the other setting is very likely required by that setting.

pvcnoded.yaml

Example configuration:

pvc:
  node: pvchv1
  debug: False
  functions:
    enable_hypervisor: True
    enable_networking: True
    enable_storage: True
    enable_api: True
  cluster:
    coordinators:
      - pvchv1
      - pvchv2
      - pvchv3
    networks:
      upstream:
        domain: "mydomain.net"
        network: "1.1.1.0/24"
        floating_ip: "1.1.1.10/24"
        gateway: "1.1.1.1"
      cluster:
        domain: "pvc.local"
        network: "10.255.0.0/24"
        floating_ip: "10.255.0.254/24"
      storage:
        domain: "pvc.storage"
        network: "10.254.0.0/24"
        floating_ip: "10.254.0.254/24"
  coordinator:
    dns:
      database:
        host: localhost
        port: 5432
        name: pvcdns
        user: pvcdns
        pass: pvcdnsPassw0rd
    metadata:
      database:
        host: localhost
        port: 5432
        name: pvcapi
        user: pvcapi
        pass: pvcapiPassw0rd
  system:
    fencing:
      intervals:
        keepalive_interval: 5
        fence_intervals: 6
        suicide_intervals: 0
      actions:
        successful_fence: migrate
        failed_fence: None
      ipmi:
        host: pvchv1-lom
        user: admin
        pass: Passw0rd
    migration:
      target_selector: mem
    configuration:
      directories:
        dynamic_directory: "/run/pvc"
        log_directory: "/var/log/pvc"
        console_log_directory: "/var/log/libvirt"
      logging:
        file_logging: True
        stdout_logging: True
        log_colours: True
        log_dates: True
        log_keepalives: True
        log_keepalive_cluster_details: True
        log_keepalive_storage_details: True
        console_log_lines: 1000
      networking:
        bridge_device: ens4
        bridge_mtu: 1500
        sriov_enable: True
        sriov_device:
          - phy: ens1f0
            mtu: 9000
            vfcount: 7
        upstream:
          device: ens4
          mtu: 1500
          address: None
        cluster:
          device: ens4
          mtu: 1500
          address: by-id
        storage:
          device: ens4
          mtu: 1500
          address: by-id

node

  • required

The (short) hostname of the node; host-specific.

debug

  • required

Whether to enable or disable debug mode. Debug mode enables additional logging of subtasks throughout the system.

functionsenable_hypervisor

  • required

Whether to enable the hypervisor functionality of the PVC Daemon or not. This should usually be enabled except in advanced deployment scenarios (such as a dedicated quorum-keeping micro-node or dedicated network routing node).

functionsenable_networking

  • required

Whether to enable the client network functionality of the PVC Daemon or not. This should usually be enabled except in deployment scenarios where networking is completely unmanaged by PVC.

functionsenable_storage

  • required

Whether to enable the virtual storage functionality of the PVC Daemon or not. This should usually be enabled except in advanced deployment scenarios featuring unmanaged external storage.

functionsenable_api

Whether to enable the PVC API client on the cluster floating IPs or not.

clustercoordinators

  • required

A list of coordinator hosts, used to generate the Zookeeper connection string and determine if the current host is a coordinator or not .

clusternetworks

  • optional
  • requires functionsenable_networking

Contains a dictionary of networks and their configurations for the PVC cluster. Optional only if enable_networking is False. The three required network types/names are upstream, cluster, and storage. Each network type contains the following entries.

domain
  • required

The domain name for the network. Should be a valid domain name, or None. Specifically for the upstream network, this should match the domain portion of the node hostname.

network

The CIDR-formatted IPv4 address block for the network.

floating_ip

The CIDR-formatted IPv4 address for the floating IP within the network. This IP will belong exclusively to the primary coordinator node to provide a central entrypoint for functionality on the cluster.

gateway

The IPv4 address for the gateway of the network. Usually applicable only to the upstream network, as the other two are normally unrouted and local to the cluster.

coordinator

  • optional
  • requires functionsenable_networking

Configuration for coordinator functions on the node. Optional only if enable_networking is False. Not optional on non-coordinator hosts, though unused. Contains the following sub-entries.

dnsdatabasehost
  • required

The hostname of the PostgreSQL instance for the DNS aggregator database. Should always be localhost except in advanced deployment scenarios.

dnsdatabaseport
  • required

The port of the PostgreSQL instance for the DNS aggregator database. Should always be 5432.

dnsdatabasename
  • required

The database name for the DNS aggregator database. Should always be pvcdns.

dnsdatabaseuser
  • required

The username for the PVC node daemon to access the DNS aggregator database.

dnsdatabasepass
  • required

The password for the PVC node daemon to access the DNS aggregator database.

metadatadatabasehost
  • required

The hostname of the PostgreSQL instance for the Provisioner database. Should always be localhost except in advanced deployment scenarios.

metadatadatabaseport
  • required

The port of the PostgreSQL instance for the Provisioner database. Should always be 5432.

metadatadatabasename
  • required

The database name for the Provisioner database. Should always be pvcapi.

metadatadatabaseuser
  • required

The username for the PVC node daemon to access the Provisioner database.

metadatadatabasepass
  • required

The password for the PVC node daemon to access the Provisioner database.

systemintervalskeepalive_interval

  • required

The number of seconds between keepalive messages to the cluster. The default is 5 seconds; for slow cluster nodes, 10-30 seconds may be more appropriate however this will result in slower responses to changes in the cluster and less accurate/up-to-date information in the clients.

systemintervalsfence_intervals

  • required

The number of keepalive messages that can be missed before a node is considered dead and the fencing cycle triggered on it. The default is 6, or 30 seconds of inactivity with a 5 second keepalive_interval. Can be set to 0 to disable fencing as the timeout will never trigger.

systemintervalssuicide_intervals

  • required

The number of keepalive message that can be missed before a node considers itself dead and forcibly resets itself. Note that, due to the large number of reasons a node could become unresponsive, the suicide interval alone should not be relied upon. The default is 0, which disables this functionality. If set, should usually be equal to or less than fence_intervals for maximum safety.

systemfencingactionssuccessful_fence

  • required

The action to take regarding VMs once a node is successfully fenced, i.e. the IPMI command to restart the node reports a success. Can be one of migrate, to migrate and start all failed VMs on other nodes and the default, or None to perform no action.

systemfencingactionsfailed_fence

  • required

The action to take regarding VMs once a node fencing fails, i.e. the IPMI command to restart the node reports a failure. Can be one of None, to perform no action and the default, or migrate to migrate and start all failed VMs on other nodes.

WARNING: This functionality is potentially dangerous and can result in data loss or corruption in the VM disks; the post-fence migration process explicitly clears RBD locks on the disk volumes. It is designed only for specific and advanced use-cases, such as servers that do not reliably report IPMI responses or servers without IPMI (not recommended; see the cluster architecture documentation). If this is set to migrate, the suicide_intervals must be set to provide at least some guarantee that the VMs on the node will actually be terminated before this condition triggers. The administrator should think very carefully about their setup and potential failure modes before enabling this option.

systemfencingipmihost

  • required

The hostname or IP address of this node's IPMI interface. Must be reachable from the nodes.

systemfencingipmiuser

  • required

The username for the PVC node daemon to log in to the IPMI interface. Must have permission to reboot the host (command ipmitool chassis power reset).

systemfencingipmipass

  • required

The password for the PVC node daemon to log in to the IPMI interface.

systemmigrationtarget_selector

  • required

The default selector algorithm to use when migrating VMs away from a node; individual VMs can override this default.

Valid target_selector values are:

  • mem: choose the node with the least provisioned VM memory
  • memfree: choose the node with the most (real) free memory
  • vcpus: choose the node with the least allocated VM vCPUs
  • load: choose the node with the lowest current load average
  • vms: choose the node with the least number of provisioned VMs

For most clusters, mem should be sufficient, but others may be used based on the cluster workload and available resources. The following caveats should be considered:

  • mem looks at the provisioned memory, not the allocated memory; thus, stopped or disabled VMs are counted towards a node's memory for this selector, even though their memory is not actively in use.
  • memfree looks at the free memory of the node in general, ignoring the amount provisioned to VMs; if any VM's internal memory usage changes, this value would be affected. This might be preferable to mem on clusters with very high memory utilization versus total capacity or if many VMs are stopped/disabled.
  • load looks at the system load of the node in general, ignoring load in any particular VMs; if any VM's CPU usage changes, this value would be affected. This might be preferable on clusters with some very CPU intensive VMs.

systemconfigurationdirectoriesdynamic_directory

  • required

The directory to store ephemeral configuration files. Usually /run/pvc or a similar temporary directory.

systemconfigurationdirectorieslog_directory

  • required

The directory to store log files for file_logging. Usually /var/log/pvc or a similar directory. Must be specified even if file_logging is False, though ignored.

systemconfigurationdirectoriesconsole_log_directory

  • required

The directory to store VM console logs. Usually /var/log/libvirt or a similar directory.

systemconfigurationloggingfile_logging

  • required

Whether to enable direct logging to a file in log_directory or not.

systemconfigurationloggingstdout_logging

  • required

Whether to enable logging to stdout or not; captured by SystemD and JournalD by default.

systemconfigurationlogginglog_colours

  • required

Whether to log ANSI colour sequences in the log output or not.

systemconfigurationlogginglog_dates

  • required

Whether to log the current date and time in the log output or not.

systemconfigurationlogginglog_keepalives

  • required

Whether to log keepalive messages or not.

systemconfigurationlogginglog_keepalive_cluster_details

  • required

Whether to log node status information during keepalives or not.

systemconfigurationlogginglog_keepalive_storage_details

  • required

Whether to log storage cluster status information during keepalives or not.

systemconfigurationloggingconsole_log_lines

  • required

How many lines of VM console logs to keep in the Zookeeper database for each VM.

systemconfigurationnetworkingbridge_device

  • optional
  • requires functionsenable_networking

The network interface device used to create Bridged client network vLANs on. For most clusters, should match the underlying device of the various static networks (e.g. ens4 or bond0), though may also use a separate network interface.

systemconfigurationnetworkingbridge_mtu

  • optional
  • requires functionsenable_networking

The network interface MTU for the Bridged client network device. This is the maximum MTU a bridged client network can use.

systemconfigurationnetworkingsriov_enable

  • optional, defaults to False
  • requires functionsenable_networking

Enables (or disables) SR-IOV functionality in PVC. If enabled, at least one sriov_device entry should be specified.

systemconfigurationnetworkingsriov_device

  • optional
  • requires functionsenable_networking

Contains a list of SR-IOV PF (physical function) devices and their basic configuration. Each element contains the following entries:

phy:
  • required

The raw Linux network device with SR-IOV PF functionality.

mtu

The MTU of the PF device, set on daemon startup.

vfcount

The number of VF devices to create on this PF. VF devices are then managed via PVC on a per-node basis.

systemconfigurationnetworking

  • optional
  • requires functionsenable_networking

Contains a dictionary of networks and their configurations on this node. Optional only if enable_networking is False. The three required network types/names are upstream, cluster, and storage. Each network type contains the following entries.

device
  • required

The raw Linux network device that the network exists on.

mtu
  • required

The MTU of the network device.

address
  • required

The IPv4 address of the interface. Can be one of: None, for no IP address; by-id, to automatically select an address in the relevant networks section via the host ID (e.g. node1 will get .1, node2 will get .2, etc.); or a static CIDR-formatted IP address.