+++

class = "post"
date = "2020-05-31T00:00:00-04:00"
tags = ["systems administration", "development", "matrix"]
title = "Building a scalable, redundant Matrix Homeserver"
description = "Deploy an advanced, highly-scalable Matrix instance with split-workers and backends from scratch"
type = "post"
weight = 1
draft = true

+++

What is Matrix?

Matrix is, fundamentally, a combination of the best parts of IRC, XMPP, and Slack-like communication platforms (Discord, Mattermost, Rocketchat, etc.) built to modern standards. In the Matrix ecosystem, users can run their own server instances, called "homeservers", which then federate amongst themselves to create a "fediverse". It is thus fully distributed, allowing users to communicate with each other on their own terms, while providing all the features one would expect of a global chat system, such as large public rooms, as well as standard features of more modern platforms, like small private groups, direct messages, file uploads, and advanced integration and moderation features, such as bots. The reference homeserver application is called "Synapse", written in Python 3, and released under an Apache 2.0 license.

In this guide, I seek to provide a document detailing the full steps to deploy a highly-available, redundant, multi-worker Matrix instance, with a fully redundant PostgreSQL database and LDAP authentication and 3PID backend. For those of you who just want to run a quick-and-easy Matrix instance with few advanced features, this guide is probably not for you, and there are numerous guides out there for setting up basic Matrix Synapse instances instead.

Most of the concepts in this guide, as well as most of the configuration files given, can be adapted to a single-host but still split-worker instance instead, should the configuration below be deemed too complicated or excessive for your use case. Be sure to carefully read this document and the Matrix documentation if you wish to do so, though most sections can be adapted verbatim.

The problem with Synapse

The main issue with Synapse in its default configuration, as documented by the Matrix project themselves, is that it is single-threaded and non-redundant. Since a lot of actions inside Synapse require significant CPU resources, especially those related to federation, this can be a significant bottleneck. This is especially true in very large rooms, where there are potentially hundreds of joined users on multiple homeservers that all must be communicated to. Without tweaking, this can manifest as posts to large rooms taking an extraordinarily long time to send, upwards of 10 seconds, as well as problems joining very large rooms for the first time (significant delays, timeouts, join failures, etc.).

Unfortunately, most homeserver operators aren't running their instance on the fastest possible CPU, so the only way to improve performance in this area is to let Synapse spread its work across multiple CPU cores. Luckily for us, Matrix Synapse, since about version 1.10, supports this via workers. Workers allow one to split various functions out of the main Synapse process into separate processes, which can then run in parallel and thus increase performance.

The configuration of workers is discussed in the Synapse documentation; however, a number of details are glossed over or not mentioned at all. Thus, this blog post will outline some of the specific details involved in tuning workers for maximum performance.

Step 1 - Prerequisites and planning

The system outlined in this guide is designed to provide a very scalable and redundant Matrix experience. To this end, the entire system is split up into multiple hosts. In most cases, these should be Virtual Machines running on at least 2 hypervisors for redundancy at the lower layers, though this is outside of the scope of this guide. For our purposes, we will assume that the VMs discussed below are already installed, configured, and operating.

The configuration outlined here makes use of a total of 14 VMs, with 6 distinct roles. Within each role, either 2 or 3 individual VMs are configured to provide redundancy. The roles can be roughly divided into two categories: frontends that expose services to users, and backends that expose databases to the frontend instances.

The full VM list, with an example naming convention where X is the host "ID" (e.g. 1, 2, etc.), is as follows:

| Quantity | Name | Description |
|----------|------|-------------|
| 2 | flbX | Frontend load balancers running HAProxy, handling incoming requests from clients and federated servers. |
| 2 | rwX | Riot Web instances under Nginx. |
| 3 | hsX | Matrix Synapse homeserver instances running the various workers. |
| 2 | blbX | Backend load balancers running HAProxy, handling database requests from the homeserver instances. |
| 3 | mpgX | PostgreSQL database instances running Patroni with Zookeeper. |
| 2 | mldX | OpenLDAP instances. |
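As a quick sanity check on the list above, the role counts can be expanded into concrete hostnames. This is only a sketch: the `ROLES` mapping and the `expand_hostnames` helper are names I made up for illustration, and they simply apply the "X is the host ID" naming convention from the table.

```python
# Role prefix -> number of VMs, copied from the table above.
ROLES = {
    "flb": 2,  # frontend load balancers (HAProxy)
    "rw": 2,   # Riot Web under Nginx
    "hs": 3,   # Synapse homeservers
    "blb": 2,  # backend load balancers (HAProxy)
    "mpg": 3,  # PostgreSQL with Patroni
    "mld": 2,  # OpenLDAP
}


def expand_hostnames(roles):
    """Return every hostname implied by the roles, e.g. flb1, flb2, rw1, ..."""
    return [
        f"{prefix}{host_id}"
        for prefix, count in roles.items()
        for host_id in range(1, count + 1)
    ]


hostnames = expand_hostnames(ROLES)
print(len(hostnames), "VMs:", " ".join(hostnames))
```

The list comes out to 14 entries, matching the total quoted earlier, which is a handy check if you change the counts for your own deployment.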

While this setup may seem like overkill, it is, aside from the homeserver instances, the minimum configuration possible while still providing full redundancy. If redundancy is not desired, a smaller configuration, down to as little as one host, is possible, though this is not detailed below.

In addition to these 14 VMs, some sort of shared storage must be provided for the sharing of media files (e.g. uploaded files) between the homeservers. For the purpose of this guide, we assume that this is an NFS export at /srv/matrix from a system called nas. The configuration of redundant, shared storage is outside of the scope of this guide, and thus we will not discuss this beyond this paragraph, though prospective administrators of highly-available Matrix instances should consider this as well.

All the VMs mentioned above should be running the same operating system. I recommend Debian 10.X (Buster) here, both because it is the distribution I run myself, and also because it provides nearly all the required packages with minimal fuss. If you wish to use another distribution, you must adapt the commands and examples below to fit. Additionally, this guide expects that you are running the Systemd init system. This is not the place for continuing the seemingly-endless initsystem debate, but some advanced features of Systemd (such as template units) are used below and in the official Matrix documentation, so we expect this is the initsystem you are running, and you are on your own if you choose to use an alternative.

For networking purposes, it is sufficient to place all the above servers in a single RFC1918 network. Outbound NAT should be configured to allow all hosts to reach the internet, and a small number of ports should be permitted through a firewall towards the external load balancer VIP (virtual IP address). The following is an example IP configuration in the network 10.0.0.0/24 that can be used for this guide, though you may of course choose a different subnet and host IP allocation scheme if you wish. All these names should resolve in DNS, or be configured in /etc/hosts on all machines.

| IP address | Hostname | Description |
|------------|----------|-------------|
| 10.0.0.1 | gw | NAT gateway and firewall, upstream router. |
| 10.0.0.2 | blbvip | Floating VIP for blbX instances. |
| 10.0.0.3 | blb1 | blbX host 1. |
| 10.0.0.4 | blb2 | blbX host 2. |
| 10.0.0.5 | mpg1 | mpgX host 1. |
| 10.0.0.6 | mpg2 | mpgX host 2. |
| 10.0.0.7 | mpg3 | mpgX host 3. |
| 10.0.0.8 | mld1 | mldX host 1. |
| 10.0.0.9 | mld2 | mldX host 2. |
| 10.0.0.10 | flbvip | Floating VIP for flbX instances. |
| 10.0.0.11 | flb1 | flbX host 1. |
| 10.0.0.12 | flb2 | flbX host 2. |
| 10.0.0.13 | rw1 | rwX host 1. |
| 10.0.0.14 | rw2 | rwX host 2. |
| 10.0.0.15 | hs1 | hsX host 1. |
| 10.0.0.16 | hs2 | hsX host 2. |
| 10.0.0.17 | hs3 | hsX host 3. |
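If you go the /etc/hosts route rather than DNS, the allocation above can be turned into hosts-file lines mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the `HOSTS` mapping and the `render_hosts` helper are illustrative names of my own, and the check that every address falls inside the subnet is just a convenience for catching typos when you adapt the scheme.

```python
import ipaddress

NETWORK = ipaddress.ip_network("10.0.0.0/24")

# Hostname -> address, copied from the table above.
HOSTS = {
    "gw": "10.0.0.1",
    "blbvip": "10.0.0.2",
    "blb1": "10.0.0.3",
    "blb2": "10.0.0.4",
    "mpg1": "10.0.0.5",
    "mpg2": "10.0.0.6",
    "mpg3": "10.0.0.7",
    "mld1": "10.0.0.8",
    "mld2": "10.0.0.9",
    "flbvip": "10.0.0.10",
    "flb1": "10.0.0.11",
    "flb2": "10.0.0.12",
    "rw1": "10.0.0.13",
    "rw2": "10.0.0.14",
    "hs1": "10.0.0.15",
    "hs2": "10.0.0.16",
    "hs3": "10.0.0.17",
}


def render_hosts(hosts, network):
    """Return /etc/hosts lines, refusing any address outside `network`."""
    lines = []
    for name, addr in hosts.items():
        if ipaddress.ip_address(addr) not in network:
            raise ValueError(f"{name} ({addr}) is outside {network}")
        lines.append(f"{addr}\t{name}")
    return "\n".join(lines) + "\n"


print(render_hosts(HOSTS, NETWORK), end="")
```

The output can be appended to /etc/hosts on each machine, or used as the input to whatever configuration management you prefer.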

Step 2 - Installing and configuring OpenLDAP instances

OpenLDAP is a common LDAP server, which provides centralized user administration as well as the configuration of additional details in a user directory. Installing and configuring OpenLDAP is beyond the scope of this guide, though the Matrix Homeserver configurations below assume that this is operating and that all Matrix users are stored in the LDAP database. In our example configuration, there are 2 OpenLDAP instances running with replication (syncrepl) between them, which are then load-balanced in a multi-master fashion. Since no services below here will be performing writes to this database, this is fine. The administrator is expected to configure some sort of user management layer of their choosing (e.g. scripts, or a web-based frontend) for managing users, resetting passwords, etc.

While this short section may seem like a cop-out, this is an extensive topic with many potential caveats, and should thus have its own (future) post on this blog. Until then, I trust that the administrator is able to look up and configure this themselves. I include these references only to help guide the administrator towards full-stack redundancy and to explain why there are LDAP sections in the backend load balancer configurations.

Step 3 - Installing and configuring Patroni instances

Patroni is a service manager for PostgreSQL which provides automated failover and replication support for a PostgreSQL database. Like OpenLDAP above, the configuration of Patroni is beyond the scope of this guide, and the configurations below assume that this is operating and already configured. In our example configuration, there are 3 Patroni instances, which is the minimum required for quorum among the members. As above, I do plan to document this in a future post, but until then, I recommend the administrator reference the Patroni documentation as well as this other post on my blog for details on setting up the Patroni instances.
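The reason 3 is the minimum comes down to simple majority arithmetic. The sketch below is general reasoning about majority-based quorum, not a description of Patroni's internals, and the function names are my own.

```python
def majority(members):
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be lost while a majority still remains."""
    return members - majority(members)


for n in (1, 2, 3, 5):
    print(f"{n} members: majority {majority(n)}, survives {tolerated_failures(n)} failure(s)")
```

With 2 members the majority is still 2, so losing either one loses quorum; 3 is the smallest cluster that survives a single failure.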

Step 4 - Installing and configuring backend load balancers

While I do not go into details in the previous two steps, this section details how to make use of a redundant pair of HAProxy instances to expose the redundant databases mentioned above to the Homeserver instances.

In order to provide a single entrypoint to the load balancers, the administrator should first install and configure Keepalived. The following /etc/keepalived/keepalived.conf configuration will set up the blbvip floating IP address between the two instances, while providing checking of the HAProxy instance health. This configuration below can be used on both proxy hosts, and inline comments provide additional clarification and information as well as indicating any changes required between the hosts.

# Global configuration options.
global_defs {
    # Use a dedicated IPv4 multicast group; adjust the last octet if this conflicts within your network.
    vrrp_mcast_group4 224.0.0.21

    # Use VRRP version 3 in strict mode and with no iptables configuration.
    vrrp_version 3
    vrrp_strict
    vrrp_iptables
}

# HAProxy check script, to ensure that this host will not become PRIMARY if HAProxy is not active.
vrrp_script chk {
    script "/usr/bin/haproxyctl show info"
    interval 5
    rise 2
    fall 2
}

# Primary IPv4 VIP configuration.
vrrp_instance VIP_4 {
    # Initial state, MASTER on both hosts to ensure that at least one host becomes active immediately on boot.
    state MASTER

    # Interface to place the VIP on; this is optional though still recommended on single-NIC machines; replace "ens2" with your actual NIC name.
    interface ens2

    # A dedicated, unique virtual router ID for this cluster; adjust this if required.
    virtual_router_id 21

    # The priority. Set to 200 for the primary (first) server, and to 100 for the secondary (second) server.
    priority 200

    # The (list of) virtual IP address(es) with CIDR subnet mask for the "blbvip" host.
    virtual_ipaddress {
        10.0.0.2/24
    }

    # Use the HAProxy check script for this VIP.
    track_script {
        chk
    }
}

Once the above configuration is installed at /etc/keepalived/keepalived.conf, restart the Keepalived service with sudo systemctl restart keepalived on each host. You should see the VIP become active on the first host.
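To confirm which host currently holds the VIP, you can look for it in the output of `ip -o -4 addr show`. The helper below is a small sketch of that check; the function names and the abridged sample output are mine, and the default VIP is the blbvip address from the example allocation.

```python
import subprocess


def read_ip_output():
    """Return the output of `ip -o -4 addr show` on the local host."""
    result = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def vip_is_active(ip_output, vip="10.0.0.2"):
    """Return True if `vip` is listed as an IPv4 address in the given output."""
    for line in ip_output.splitlines():
        fields = line.split()
        # One-line format: "<index>: <ifname> inet <address>/<prefix> ..."
        if "inet" in fields[:-1]:
            address = fields[fields.index("inet") + 1].split("/")[0]
            if address == vip:
                return True
    return False


# Abridged sample of what the primary host might show (assumed, not captured output).
SAMPLE = (
    "1: lo    inet 127.0.0.1/8 scope host lo\n"
    "2: ens2    inet 10.0.0.3/24 brd 10.0.0.255 scope global ens2\n"
    "2: ens2    inet 10.0.0.2/24 scope global secondary ens2\n"
)
print(vip_is_active(SAMPLE))
```

On a real host, call `vip_is_active(read_ip_output())`; it should return `True` on the first host and `False` on the second until a failover occurs.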

The HAProxy configuration below can be used verbatim on both proxy hosts, and inline comments provide additional clarification and information to avoid breaking up the configuration snippet. This configuration makes use of an advanced feature for the Patroni hosts which is detailed in another post on this blog, to ensure that only the active Patroni node is sent traffic and to prevent the other two database hosts from reporting DOWN state all the time.

# Global settings - tune HAProxy for optimal performance, administration, and security.
global
    # Send logs to the "local6" service on the local host, via an rsyslog UDP listener. Enable debug logging to log individual connections.
    log ip6-localhost:514 local6 debug
    log-send-hostname
    chroot /var/lib/haproxy
    pidfile /run/haproxy/haproxy.pid

    # Use multi-threaded support (available with HAProxy 1.8+) for optimal performance in high-load situations. Adjust `nbthread` as needed for your host's core count (1/2 is optimal).
    nbproc 1
    nbthread 2

    # Provide a stats socket for `hatop`
    stats socket /var/lib/haproxy/admin.sock mode 660 level admin process 1
    stats timeout 30s

    # Run in daemon mode as the `haproxy` user
    daemon
    user haproxy
    group haproxy

    # Set the global connection limit to 10000; this is certainly overkill but avoids needing to tweak this for larger instances.
    maxconn 10000

# Default settings - provide some default settings that are applicable to (most) of the listeners and backends below.
defaults
    log global
    timeout connect 30s
    timeout client 15m
    timeout server 15m
    log-format "%ci:%cp [%t] %ft %b/%s %Tw/%Tc/%Tt %B %ts %ac/%fc/%bc/%sc/%rc %sq/%bq %bi:%bp"

# Statistics listener with authentication - provides stats for the HAProxy instance via a WebUI (optional)
userlist admin
    # WARNING - CHANGE ME TO A REAL PASSWORD OR A SHA512-hashed PASSWORD (with `password` instead of `insecure-password`). IF YOU USE `insecure-password`, MAKE SURE THIS CONFIGURATION IS NOT WORLD-READABLE.
    user admin insecure-password P4ssw0rd 
listen stats
    bind :::5555 v4v6
    mode http
    stats enable
    stats uri /
    stats hide-version
    stats refresh 10s
    stats show-node
    stats show-legends
    acl is_admin http_auth(admin)
    http-request auth realm "Admin access required" if !is_admin

# Stick-tables peers configuration
peers keepalived-pair
    peer blb1 10.0.0.3:1023
    peer blb2 10.0.0.4:1023

# LDAP frontend
frontend ldap
    bind :::389 v4v6
    maxconn 1000
    mode tcp
    option tcpka
    default_backend ldap

# PostgreSQL frontend
frontend pgsql
    bind :::5432 v4v6
    maxconn 1000
    mode tcp
    option tcpka
    default_backend pgsql

# LDAP backend
backend ldap
    mode tcp
    option tcpka
    balance leastconn
    server mld1 10.0.0.8:389 check inter 2000 maxconn 64
    server mld2 10.0.0.9:389 check inter 2000 maxconn 64

# PostgreSQL backend using agent check
backend pgsql
    mode tcp
    option tcpka
    option httpchk OPTIONS /master
    http-check expect status 200
    server mpg1 10.0.0.5:5432 maxconn 1000 check agent-check agent-port 5555 inter 1s fall 2 rise 2 on-marked-down shutdown-sessions port 8008
    server mpg2 10.0.0.6:5432 maxconn 1000 check agent-check agent-port 5555 inter 1s fall 2 rise 2 on-marked-down shutdown-sessions port 8008
    server mpg3 10.0.0.7:5432 maxconn 1000 check agent-check agent-port 5555 inter 1s fall 2 rise 2 on-marked-down shutdown-sessions port 8008

Once the above configurations are installed on each server, restart the HAProxy service with sudo systemctl restart haproxy. Use sudo hatop -s /var/lib/haproxy/admin.sock to view the status of the backends, and continue once all are running correctly.
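If you would rather script this check than watch `hatop`, the same admin socket answers the `show stat` command with CSV output. The following is a hedged sketch: the helper names are my own, the socket path is the one from the configuration above, and the sample CSV is heavily abridged (the real output has many more columns).

```python
import csv
import io
import socket


def query_haproxy(command, path="/var/lib/haproxy/admin.sock"):
    """Send one command to the HAProxy stats socket and return the raw reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(command.encode() + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()


def server_states(show_stat_csv):
    """Map (proxy, server) -> status from `show stat` CSV output."""
    # The header line starts with "# "; strip it so the column names are clean.
    reader = csv.DictReader(io.StringIO(show_stat_csv.lstrip("# ")))
    return {
        (row["pxname"], row["svname"]): row["status"]
        for row in reader
        if row["svname"] not in ("FRONTEND", "BACKEND")
    }


# Abridged, made-up sample in the shape of `show stat` output.
SAMPLE = "# pxname,svname,status\nldap,mld1,UP\nldap,mld2,UP\nldap,BACKEND,UP\n"
print(server_states(SAMPLE))
```

On a load balancer host, `server_states(query_haproxy("show stat"))` run with sufficient privileges gives the per-server view for both the ldap and pgsql backends.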

Step 5 - Install and configure Synapse instances

The core homeserver processes should be configured on all homeserver machines. There are numerous options but

Step 2 - Configure systemd units

The easiest way to set up workers is to use a template unit file with a series of individual worker configurations. A series of unit files are provided within the Synapse documentation, which can be used to set up template-based workers.

I decided to modify these somewhat, by replacing the configuration directory at /etc/matrix-synapse/workers with /etc/matrix-synapse/worker.d, but this is just a personal preference. If you're using official Debian packages (as I am), you will also need to adjust the path to the Python binary. I also adjust the description to be a little more consistent. The resulting template worker unit looks like this:

[Unit]
Description = Synapse Matrix worker %i
PartOf = matrix-synapse.target

[Service]
Type = notify
NotifyAccess = main
User = matrix-synapse
WorkingDirectory = /var/lib/matrix-synapse
EnvironmentFile = /etc/default/matrix-synapse
ExecStart = /usr/bin/python3 -m synapse.app.generic_worker --config-path=/etc/matrix-synapse/homeserver.yaml --config-path=/etc/matrix-synapse/conf.d/ --config-path=/etc/matrix-synapse/worker.d/%i.yaml
ExecReload = /bin/kill -HUP $MAINPID
Restart = on-failure
RestartSec = 3 
SyslogIdentifier = matrix-synapse-%i

[Install]
WantedBy = matrix-synapse.target

There is also a generic target unit that should be installed to provide a unified management point for both the primary Synapse process as well as the workers. After some similar tweaks, including adjusting the After condition to use network-online.target instead of network.target, the resulting file looks like this:

[Unit]
Description = Synapse Matrix homeserver target
After = network-online.target

[Install]
WantedBy = multi-user.target

Install both of these units, as matrix-synapse-worker@.service and matrix-synapse.target respectively, to /etc/systemd/system, and run sudo systemctl daemon-reload.
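The template unit ties each instance name to a configuration file: systemd substitutes `%i` with the text between `@` and `.service`, and the `ExecStart` line above uses that to pick the file under worker.d. As a small illustration of that mapping (the helper and pattern names are hypothetical, and this assumes plain instance names that need no systemd escaping):

```python
import re

WORKER_DIR = "/etc/matrix-synapse/worker.d"
UNIT_PATTERN = re.compile(r"^matrix-synapse-worker@(?P<instance>[^@/]+)\.service$")


def worker_config_path(unit_name):
    """Return the worker config file that a template instance will load."""
    match = UNIT_PATTERN.match(unit_name)
    if match is None:
        raise ValueError(f"not a worker unit: {unit_name}")
    return f"{WORKER_DIR}/{match.group('instance')}.yaml"


print(worker_config_path("matrix-synapse-worker@client_reader.service"))
# prints /etc/matrix-synapse/worker.d/client_reader.yaml
```

In other words, enabling an instance whose YAML file does not exist under worker.d will produce a unit that fails at startup, so the file names and instance names must match exactly.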

Once the unit files are prepared, you can begin building each individual worker configuration.

Step 3 - Configure the individual workers

Each worker is configured via an individual YAML configuration file, with our units under /etc/matrix-synapse/worker.d. By design, each worker makes use of homeserver.yaml for all global configuration values, then the individual worker configurations override specific settings for the particular worker. The Synapse documentation on workers provides a good starting point, but some sections are vague, and thus this guide hopes to provide more detailed instructions and explanations.

Each worker is given a specific section below, which includes the full YAML configuration I use, as well as any notes about the configuration that are worth mentioning. They are provided in alphabetical order, rather than the order provided in the documentation above, for clarity.

For any worker which responds to REST, a port must be selected for the worker to listen on. The main homeserver runs by default on port 8008, and I have ma1sd running on port 8090, so I chose ports from 8091 to 8097 for the various REST workers in order to keep them in a consistent range.

Finally, the main homeserver must be configured with both TCP and HTTP replication listeners, to provide communication between the workers and the main process. For this I use the ports provided by the Matrix documentation above, 9092 and 9093, with the following configuration in the main homeserver.yaml listeners section:

listeners:
  - port: 8008
    tls: false
    bind_addresses:
      - '::'
    type: http
    x_forwarded: true
    resources:
      - names: [client, webclient]
        compress: true
  - port: 9092
    bind_addresses:
      - '::'
    type: replication
    x_forwarded: true
  - port: 9093
    bind_addresses:
      - '::'
    type: http
    x_forwarded: true
    resources:
     - names: [replication]

There are a couple of adjustments here from the default configuration. First, the federation resource has been removed from the primary listener, since this is implemented as a worker below. TLS is disabled here, and x_forwarded: true is added to all 3 listeners, since TLS termination is handled by a reverse proxy, as discussed later in this guide. All three listeners use a global IPv6+IPv4 bind address of :: so they will be accessible by other machines on the network, which is important for the final, multi-host setup. As noted in the Matrix documentation, ensure that the replication ports are not publicly accessible, since they are unauthenticated and unencrypted; I run these servers on an RFC1918 private network behind a firewall so this is secure, but you will need to provide some sort of firewall if your Synapse instance is directly available on the public Internet.
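One way to double-check the exposure of these listeners is to probe the ports from both inside and outside the private network. This is a minimal sketch using only the standard `socket` module; `port_is_open` and `probe` are names I chose for illustration, and the default port list is the three listeners configured above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe(host, ports=(8008, 9092, 9093)):
    """Return {port: reachable} for the main listener and both replication listeners."""
    return {port: port_is_open(host, port) for port in ports}
```

Run `probe("hs1")` from another machine on the private network once Synapse is started, where all three ports should be reachable, and then probe your public address from an outside host, where the replication ports 9092 and 9093 should report `False`.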

The configurations below show a hostname, mlbvip, for all instances of worker_replication_host. This will be explained and discussed further in the reverse proxy section. If you are only interested in running a "single-server" instance, you may use localhost, 127.0.0.1, or ::1 here instead, as these ports will not be managed by the reverse proxy in such a setup.

appservice worker (/etc/matrix-synapse/worker.d/appservice.yaml)

The appservice worker does not service REST endpoints, and thus has a minimal configuration.

---
worker_app: synapse.app.appservice

worker_replication_host: mlbvip
worker_replication_port: 9092
worker_replication_http_port: 9093

Once the configuration is in place, enable the worker by running sudo systemctl enable matrix-synapse-worker@appservice.service. It will be started later in the process.

client_reader worker (/etc/matrix-synapse/worker.d/client_reader.yaml)

The client_reader worker services REST endpoints, and thus has a listener section, with port 8091 chosen.

---
worker_app: synapse.app.client_reader

worker_replication_host: mlbvip
worker_replication_port: 9092
worker_replication_http_port: 9093

worker_listeners:
 - type: http
   port: 8091
   resources:
     - names:
       - client

Once the configuration is in place, enable the worker by running sudo systemctl enable matrix-synapse-worker@client_reader.service. It will be started later in the process.

event_creator worker (/etc/matrix-synapse/worker.d/event_creator.yaml)

The event_creator worker services REST endpoints, and thus has a listener section, with port 8092 chosen.

---
worker_app: synapse.app.event_creator

worker_replication_host: mlbvip
worker_replication_port: 9092
worker_replication_http_port: 9093

worker_listeners:
 - type: http
   port: 8092
   resources:
     - names:
       - client

Once the configuration is in place, enable the worker by running sudo systemctl enable matrix-synapse-worker@event_creator.service. It will be started later in the process.

federation_reader worker (/etc/matrix-synapse/worker.d/federation_reader.yaml)

The federation_reader worker services REST endpoints, and thus has a listener section, with port 8093 chosen. Note that this worker, in addition to a client resource, also provides a federation resource.

---
worker_app: synapse.app.federation_reader

worker_replication_host: mlbvip
worker_replication_port: 9092
worker_replication_http_port: 9093

worker_listeners:
 - type: http
   port: 8093
   resources:
     - names:
       - client
       - federation

Once the configuration is in place, enable the worker by running sudo systemctl enable matrix-synapse-worker@federation_reader.service. It will be started later in the process.

federation_sender worker (/etc/matrix-synapse/worker.d/federation_sender.yaml)

The federation_sender worker does not service REST endpoints, and thus has a minimal configuration.

---
worker_app: synapse.app.federation_sender

worker_replication_host: mlbvip
worker_replication_port: 9092
worker_replication_http_port: 9093

Once the configuration is in place, enable the worker by running sudo systemctl enable matrix-synapse-worker@federation_sender.service. It will be started later in the process.

frontend_proxy worker (/etc/matrix-synapse/worker.d/frontend_proxy.yaml)

The frontend_proxy worker services REST endpoints, and thus has a listener section, with port 8094 chosen. This worker has an additional configuration parameter, worker_main_http_uri, which allows the worker to direct requests back to the primary Synapse instance. Similar to the worker_replication_host value, this uses mlbvip in this example, and for "single-server" instances must be replaced with localhost, 127.0.0.1, or ::1 instead, as this port will not be managed by the reverse proxy in such a setup.

---
worker_app: synapse.app.frontend_proxy

worker_replication_host: mlbvip
worker_replication_port: 9092
worker_replication_http_port: 9093

worker_main_http_uri: http://mlbvip:8008

worker_listeners:
 - type: http
   port: 8094
   resources:
     - names:
       - client

Once the configuration is in place, enable the worker by running sudo systemctl enable matrix-synapse-worker@frontend_proxy.service. It will be started later in the process.

media_repository worker (/etc/matrix-synapse/worker.d/media_repository.yaml)

The media_repository worker services REST endpoints, and thus has a listener section, with port 8095 chosen. Note that this worker, in addition to a client resource, also provides a media resource.

---
worker_app: synapse.app.media_repository

worker_replication_host: mlbvip
worker_replication_port: 9092
worker_replication_http_port: 9093

worker_listeners:
 - type: http
   port: 8095
   resources:
     - names:
       - client
       - media

Once the configuration is in place, enable the worker by running sudo systemctl enable matrix-synapse-worker@media_repository.service. It will be started later in the process.

pusher worker (/etc/matrix-synapse/worker.d/pusher.yaml)

The pusher worker does not service REST endpoints, and thus has a minimal configuration.

---
worker_app: synapse.app.pusher

worker_replication_host: mlbvip
worker_replication_port: 9092
worker_replication_http_port: 9093

Once the configuration is in place, enable the worker by running sudo systemctl enable matrix-synapse-worker@pusher.service. It will be started later in the process.

synchrotron worker (/etc/matrix-synapse/worker.d/synchrotron.yaml)

The synchrotron worker services REST endpoints, and thus has a listener section, with port 8096 chosen.

---
worker_app: synapse.app.synchrotron

worker_replication_host: mlbvip
worker_replication_port: 9092
worker_replication_http_port: 9093

worker_listeners:
 - type: http
   port: 8096
   resources:
     - names:
       - client

Once the configuration is in place, enable the worker by running sudo systemctl enable matrix-synapse-worker@synchrotron.service. It will be started later in the process.

user_dir worker (/etc/matrix-synapse/worker.d/user_dir.yaml)

The user_dir worker services REST endpoints, and thus has a listener section, with port 8097 chosen.

---
worker_app: synapse.app.user_dir

worker_replication_host: mlbvip
worker_replication_port: 9092
worker_replication_http_port: 9093

worker_listeners:
 - type: http
   port: 8097
   resources:
     - names:
       - client

Once the configuration is in place, enable the worker by running sudo systemctl enable matrix-synapse-worker@user_dir.service. It will be started later in the process.
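Since the ten worker files above differ only in name, port, and resources, they can also be generated from a small table rather than written by hand. The sketch below reproduces the same layout with plain string formatting; the `WORKERS` table and `render_worker` function are illustrative names of my own, while the ports, resources, and mlbvip hostname are taken from the sections above.

```python
# Worker name -> (listener port or None, resources beyond "client").
WORKERS = {
    "appservice": (None, []),
    "client_reader": (8091, []),
    "event_creator": (8092, []),
    "federation_reader": (8093, ["federation"]),
    "federation_sender": (None, []),
    "frontend_proxy": (8094, []),
    "media_repository": (8095, ["media"]),
    "pusher": (None, []),
    "synchrotron": (8096, []),
    "user_dir": (8097, []),
}


def render_worker(name, port, extra_resources, replication_host="mlbvip"):
    """Return the YAML text for one worker, in the same layout as the files above."""
    lines = [
        "---",
        f"worker_app: synapse.app.{name}",
        "",
        f"worker_replication_host: {replication_host}",
        "worker_replication_port: 9092",
        "worker_replication_http_port: 9093",
    ]
    if name == "frontend_proxy":
        # Only this worker needs a route back to the main process.
        lines += ["", f"worker_main_http_uri: http://{replication_host}:8008"]
    if port is not None:
        lines += [
            "",
            "worker_listeners:",
            " - type: http",
            f"   port: {port}",
            "   resources:",
            "     - names:",
            "       - client",
        ]
        lines += [f"       - {resource}" for resource in extra_resources]
    return "\n".join(lines) + "\n"


print(render_worker("client_reader", *WORKERS["client_reader"]))
for name in WORKERS:
    print(f"sudo systemctl enable matrix-synapse-worker@{name}.service")
```

For a single-server instance, pass `replication_host="localhost"` instead. Writing each result to /etc/matrix-synapse/worker.d/NAME.yaml is left out here on purpose, so you can review the output first.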

Step 4 - Riot instance

Riot Web is the reference frontend for Matrix instances, allowing a user to access Matrix via a web browser. Riot is an optional, but recommended, feature for your homeserver.

Step 5 - ma1sd instance

ma1sd is an optional component for Matrix, providing 3PID (e.g. email, phone number, etc.) lookup services for Matrix users. I use ma1sd with my Matrix instance for two main reasons: first, to map nice-looking user data such as full names to my Matrix users, and second, as a RESTful authentication provider to interface Matrix with my LDAP instance. For this guide, I assume that you already have an LDAP instance set up and that you are using it in this manner too.

Step 6 - Reverse proxy

For this guide, HAProxy was selected as the reverse proxy of choice. This is mostly due to my familiarity with it, but also to a lesser degree for its more advanced functionality and, in my opinion, nicer configuration syntax. This section provides configuration for a "load-balanced", multi-server instance with an additional 2 slave worker servers and with separate proxy servers; a single-server instance with basic split workers can be made by removing the additional servers. This will allow the homeserver to grow to many dozens or even hundreds of users. In this setup, the load balancer is separated out onto a separate pair of servers, with a keepalived VIP (virtual IP address) shared between them. The name mlbvip should resolve to this IP, and all previous worker configurations should use this mlbvip hostname as the connection target for the replication directives. Both a reasonable keepalived configuration for the VIP and the HAProxy configuration are provided.

The two proxy hosts can be named as desired, in my case using the names mlb1 and mlb2. These names must resolve in DNS, or be specified in /etc/hosts on both servers.

The Keepalived configuration below can be used on both proxy hosts, and inline comments provide additional clarification and information as well as indicating any changes required between the hosts. The VIP should be selected from the free IPs of your server subnet.

# Global configuration options.
global_defs {
    # Use a dedicated IPv4 multicast group; adjust the last octet if this conflicts within your network.
    vrrp_mcast_group4 224.0.0.21

    # Use VRRP version 3 in strict mode and with no iptables configuration.
    vrrp_version 3
    vrrp_strict
    vrrp_iptables
}

# HAProxy check script, to ensure that this host will not become PRIMARY if HAProxy is not active.
vrrp_script chk {
    script "/usr/bin/haproxyctl show info"
    interval 5
    rise 2
    fall 2
}

# Primary IPv4 VIP configuration.
vrrp_instance VIP_4 {
    # Initial state, MASTER on both hosts to ensure that at least one host becomes active immediately on boot.
    state MASTER

    # Interface to place the VIP on; this is optional though still recommended on single-NIC machines; replace "ens2" with your actual NIC name.
    interface ens2

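    # A dedicated, unique virtual router ID for this cluster; this must differ from the ID of any other Keepalived cluster on the same network segment, including the backend load balancer pair above (21).
    virtual_router_id 22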

    # The priority. Set to 200 for the primary (first) server, and to 100 for the secondary (second) server.
    priority 200

    # The (list of) virtual IP address(es) with CIDR subnet mask.
    virtual_ipaddress {
        10.0.0.10/24
    }

    # Use the HAProxy check script for this VIP.
    track_script {
        chk
    }
}

Once the above configuration is installed at /etc/keepalived/keepalived.conf, restart the Keepalived service with sudo systemctl restart keepalived on each host. You should see the VIP become active on the first host.
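If you are adapting the addressing rather than using 10.0.0.10 as in the example, the VIP needs to be an address in the server subnet that nothing else is using. A minimal sketch of finding one with the standard `ipaddress` module follows; the `free_addresses` helper and the `USED` list are illustrative, with the used range taken from the example allocation earlier in this guide.

```python
import ipaddress


def free_addresses(network, used):
    """Yield usable host addresses in `network` that are not in `used`."""
    taken = {ipaddress.ip_address(addr) for addr in used}
    for host in ipaddress.ip_network(network).hosts():
        if host not in taken:
            yield host


# Addresses 10.0.0.1 through 10.0.0.17 are allocated in the example table.
USED = [f"10.0.0.{n}" for n in range(1, 18)]

first_free = next(free_addresses("10.0.0.0/24", USED))
print(first_free)  # 10.0.0.18
```

This only knows about the addresses you list, so it is no substitute for checking your DHCP ranges or pinging the candidate address before assigning it.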

The HAProxy configuration below can be used verbatim on both proxy hosts, and inline comments provide additional clarification and information to avoid breaking up the configuration snippet. In this example we use peer configuration to enable the use of stick-tables directives, which ensure that individual user sessions are synchronized between the HAProxy instances during failovers; with this setting, if the hostnames of the load balancers do not resolve, HAProxy will not start. Some additional, advanced features are used in several ACLs to ensure that, for instance, specific users and rooms are always directed to the same workers if possible, which is required by the individual workers as specified in the Matrix documentation.

global
    # Send logs to the "local6" syslog facility on the local host, via an rsyslog UDP listener. Enable debug logging to log individual connections.
    log ip6-localhost:514 local6 debug
    log-send-hostname
    chroot /var/lib/haproxy
    pidfile /run/haproxy/haproxy.pid

    # Use multi-threaded support (available with HAProxy 1.8+) for optimal performance in high-load situations. Adjust `nbthread` as needed for your host's core count (2-4 is optimal).
    nbproc 1
    nbthread 4

    # Provide a stats socket for `hatop`
    stats socket /var/lib/haproxy/admin.sock mode 660 level admin process 1
    stats timeout 30s

    # Run in daemon mode as the `haproxy` user
    daemon
    user haproxy
    group haproxy

    # Set the global connection limit to 10000; this is certainly overkill but avoids needing to tweak this for larger instances.
    maxconn 10000

    # Set default SSL configurations, including a modern highly-secure configuration requiring TLS1.2 client support.
    ca-base /etc/ssl/certs
    crt-base /etc/ssl/private
    tune.ssl.default-dh-param 2048
    ssl-default-bind-ciphers ECDHE-RSA-AES256-GCM-SHA512:DHE-RSA-AES256-GCM-SHA512:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384
    ssl-default-bind-options ssl-min-ver TLSv1.2 no-tls-tickets
    ssl-default-server-ciphers ECDHE-RSA-AES256-GCM-SHA512:DHE-RSA-AES256-GCM-SHA512:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384
    ssl-default-server-options ssl-min-ver TLSv1.2 no-tls-tickets

defaults
    log global

    option http-keep-alive
    option forwardfor except 127.0.0.0/8
    option redispatch
    option dontlognull
    option splice-auto
    option log-health-checks

    default-server init-addr libc,last,none

    timeout client          30s
    timeout connect         30s
    timeout server         300s
    timeout tunnel        3600s
    timeout http-keep-alive 60s
    timeout http-request    30s
    timeout queue           60s
    timeout tarpit          60s

peers keepalived-pair
    # Peers for site bl0
    peer mlb1.i.bonilan.net mlb1.i.bonilan.net:1023
    peer mlb2.i.bonilan.net mlb2.i.bonilan.net:1023

resolvers nsX
        nameserver ns1 10.101.0.61:53
        nameserver ns2 10.101.0.62:53

userlist admin
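        # `insecure-password` takes a plaintext password; `password` expects a crypt(3) hash instead.
        user admin insecure-password MySuperSecretPassword123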

listen stats
        bind :::5555 v4v6
        mode http
        stats enable
        stats uri /
        stats hide-version
        stats refresh 10s
        stats show-node
        stats show-legends
        acl is_admin http_auth(admin)
        http-request auth realm "Admin access" if !is_admin

frontend http
    bind :::80 v4v6
    mode http
    option httplog
    acl url_letsencrypt path_beg /.well-known/acme-challenge/
    use_backend letsencrypt if url_letsencrypt
    redirect scheme https if !url_letsencrypt !{ ssl_fc }

frontend https
    bind :::443 v4v6 ssl crt /etc/ssl/letsencrypt/ alpn h2,http/1.1
    bind :::8448 v4v6 ssl crt /etc/ssl/letsencrypt/ alpn h2,http/1.1
    mode http
    option httplog
    capture request header Host len 64

    http-request set-header X-Forwarded-Proto https
    http-request add-header X-Forwarded-Host %[req.hdr(host)]
    http-request add-header X-Forwarded-Server %[req.hdr(host)]
    http-request add-header X-Forwarded-Port %[dst_port]

    # Method ACLs
    acl http_method_get           method    GET

    # Domain ACLs
    acl host_matrix               hdr_dom(host) im.bonifacelabs.ca
    acl host_element              hdr_dom(host) chat.bonifacelabs.ca

    # URL ACLs
    # Sync requests
    acl url_workerX_stick-auth    path_reg    ^/_matrix/client/(r0|v3)/sync$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(api/v1|r0|v3)/events$
    acl url_workerX_stick-auth    path_reg    ^/_matrix/client/(api/v1|r0|v3)/initialSync$
    acl url_workerX_stick-auth    path_reg    ^/_matrix/client/(api/v1|r0|v3)/rooms/[^/]+/initialSync$

    # Federation requests
    acl url_workerX_generic       path_reg    ^/_matrix/federation/v1/event/
    acl url_workerX_generic       path_reg    ^/_matrix/federation/v1/state/
    acl url_workerX_generic       path_reg    ^/_matrix/federation/v1/state_ids/
    acl url_workerX_generic       path_reg    ^/_matrix/federation/v1/backfill/
    acl url_workerX_generic       path_reg    ^/_matrix/federation/v1/get_missing_events/
    acl url_workerX_generic       path_reg    ^/_matrix/federation/v1/publicRooms
    acl url_workerX_generic       path_reg    ^/_matrix/federation/v1/query/
    acl url_workerX_generic       path_reg    ^/_matrix/federation/v1/make_join/
    acl url_workerX_generic       path_reg    ^/_matrix/federation/v1/make_leave/
    acl url_workerX_generic       path_reg    ^/_matrix/federation/(v1|v2)/send_join/
    acl url_workerX_generic       path_reg    ^/_matrix/federation/(v1|v2)/send_leave/
    acl url_workerX_generic       path_reg    ^/_matrix/federation/(v1|v2)/invite/
    acl url_workerX_generic       path_reg    ^/_matrix/federation/v1/event_auth/
    acl url_workerX_generic       path_reg    ^/_matrix/federation/v1/exchange_third_party_invite/
    acl url_workerX_generic       path_reg    ^/_matrix/federation/v1/user/devices/
    acl url_workerX_generic       path_reg    ^/_matrix/key/v2/query
    acl url_workerX_generic       path_reg    ^/_matrix/federation/v1/hierarchy/

    # Inbound federation transaction request
    acl url_workerX_stick-src     path_reg    ^/_matrix/federation/v1/send/

    # Client API requests
    acl url_workerX_generic       path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/createRoom$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/publicRooms$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/rooms/.*/joined_members$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/rooms/.*/context/.*$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/rooms/.*/members$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/rooms/.*/state$
    acl url_workerX_generic       path_reg    ^/_matrix/client/v1/rooms/.*/hierarchy$
    acl url_workerX_generic       path_reg    ^/_matrix/client/unstable/org.matrix.msc2716/rooms/.*/batch_send$
    acl url_workerX_generic       path_reg    ^/_matrix/client/unstable/im.nheko.summary/rooms/.*/summary$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(r0|v3|unstable)/account/3pid$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(r0|v3|unstable)/account/whoami$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(r0|v3|unstable)/devices$
    acl url_workerX_generic       path_reg    ^/_matrix/client/versions$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/voip/turnServer$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/rooms/.*/event/
    acl url_workerX_generic       path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/joined_rooms$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/search$

    # Encryption requests
    # Note that ^/_matrix/client/(r0|v3|unstable)/keys/upload/ requires `worker_main_http_uri`
    acl url_workerX_generic       path_reg    ^/_matrix/client/(r0|v3|unstable)/keys/query$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(r0|v3|unstable)/keys/changes$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(r0|v3|unstable)/keys/claim$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(r0|v3|unstable)/room_keys/
    acl url_workerX_generic       path_reg    ^/_matrix/client/(r0|v3|unstable)/keys/upload/

    # Registration/login requests
    acl url_workerX_generic       path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/login$
    acl url_workerX_generic       path_reg    ^/_matrix/client/(r0|v3|unstable)/register$
    acl url_workerX_generic       path_reg    ^/_matrix/client/v1/register/m.login.registration_token/validity$

    # Event sending requests
    acl url_workerX_stick-path    path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/rooms/.*/redact
    acl url_workerX_stick-path    path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/rooms/.*/send
    acl url_workerX_stick-path    path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/rooms/.*/state/
    acl url_workerX_stick-path    path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/rooms/.*/(join|invite|leave|ban|unban|kick)$
    acl url_workerX_stick-path    path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/join/
    acl url_workerX_stick-path    path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/profile/

    # User directory search requests
    acl url_workerX_generic       path_reg    ^/_matrix/client/(r0|v3|unstable)/user_directory/search$

    # Pagination requests
    acl url_workerX_stick-path    path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/rooms/.*/messages$

    # Push rules (GET-only)
    acl url_push-rules            path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/pushrules/

    # Directory worker endpoints
    acl url_directory-worker      path_reg    ^/_matrix/client/(r0|v3|unstable)/user_directory/search$

    # Event persister endpoints
    acl url_stream-worker         path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/rooms/.*/typing
    acl url_stream-worker         path_reg    ^/_matrix/client/(r0|v3|unstable)/sendToDevice/
    acl url_stream-worker         path_reg    ^/_matrix/client/(r0|v3|unstable)/.*/tags
    acl url_stream-worker         path_reg    ^/_matrix/client/(r0|v3|unstable)/.*/account_data
    acl url_stream-worker         path_reg    ^/_matrix/client/(r0|v3|unstable)/rooms/.*/receipt
    acl url_stream-worker         path_reg    ^/_matrix/client/(r0|v3|unstable)/rooms/.*/read_markers
    acl url_stream-worker         path_reg    ^/_matrix/client/(api/v1|r0|v3|unstable)/presence/

    # Backend directors
    use_backend synapseX_worker_generic if host_matrix url_workerX_generic
    use_backend synapseX_worker_generic if host_matrix url_push-rules http_method_get
    use_backend synapseX_worker_stick-auth if host_matrix url_workerX_stick-auth
    use_backend synapseX_worker_stick-src if host_matrix url_workerX_stick-src
    use_backend synapseX_worker_stick-path if host_matrix url_workerX_stick-path
    use_backend synapse0_directory_worker if host_matrix url_directory-worker
    use_backend synapse0_stream_worker if host_matrix url_stream-worker

    # Master workers (single-instance) - Federation media repository requests
    acl url_mediarepository       path_reg  ^/_matrix/media/
    acl url_mediarepository       path_reg  ^/_synapse/admin/v1/purge_media_cache$
    acl url_mediarepository       path_reg  ^/_synapse/admin/v1/room/.*/media.*$
    acl url_mediarepository       path_reg  ^/_synapse/admin/v1/user/.*/media.*$
    acl url_mediarepository       path_reg  ^/_synapse/admin/v1/media/.*$
    acl url_mediarepository       path_reg  ^/_synapse/admin/v1/quarantine_media/.*$
    acl url_mediarepository       path_reg  ^/_synapse/admin/v1/users/.*/media$
    use_backend synapse0_media_repository if host_matrix url_mediarepository

    # MXISD/MA1SD worker
    acl url_ma1sd                 path_reg  ^/_matrix/client/(api/v1|r0|unstable)/user_directory
    acl url_ma1sd                 path_reg  ^/_matrix/client/(api/v1|r0|unstable)/login
    acl url_ma1sd                 path_reg  ^/_matrix/identity
    use_backend synapse0_ma1sd if host_matrix url_ma1sd

    # Webhook service
    acl url_webhook               path_reg  ^/webhook
    use_backend synapse0_webhook if host_matrix url_webhook

    # .well-known configs
    acl url_wellknown             path_reg ^/.well-known/matrix
    use_backend elementX_http if host_matrix url_wellknown

    # Catchall Matrix and Element
    use_backend synapse0_master if host_matrix
    use_backend elementX_http if host_element

    # Default to Element
    default_backend elementX_http

frontend ma1sd_http
    bind :::8090 v4v6
    mode http
    option httplog
    use_backend synapse0_ma1sd

backend letsencrypt
    mode http
        server elbvip.i.bonilan.net elbvip.i.bonilan.net:80 resolvers nsX resolve-prefer ipv4

backend elementX_http
    mode http
    balance leastconn
    option httpchk GET /index.html
    # Force users (by source IP) to visit the same backend server
    stick-table type ipv6 size 5000k peers keepalived-pair expire 72h
    stick on src
        errorfile 500 /etc/haproxy/sorryserver.http
        errorfile 502 /etc/haproxy/sorryserver.http
        errorfile 503 /etc/haproxy/sorryserver.http
        errorfile 504 /etc/haproxy/sorryserver.http
        server element1 element1.i.bonilan.net:80 resolvers nsX resolve-prefer ipv4 check inter 5000 cookie element1.i.bonilan.net
        server element2 element2.i.bonilan.net:80 resolvers nsX resolve-prefer ipv4 check inter 5000 cookie element2.i.bonilan.net

backend synapse0_master
    mode http
    balance roundrobin
    option httpchk
    retries 0
        errorfile 500 /etc/haproxy/sorryserver.http
        errorfile 502 /etc/haproxy/sorryserver.http
        errorfile 503 /etc/haproxy/sorryserver.http
        errorfile 504 /etc/haproxy/sorryserver.http
        server synapse0.i.bonilan.net synapse0.i.bonilan.net:8008 resolvers nsX resolve-prefer ipv4 check inter 5000 backup

backend synapse0_directory_worker
    mode http
    balance roundrobin
    option httpchk
    retries 0
        errorfile 500 /etc/haproxy/sorryserver.http
        errorfile 502 /etc/haproxy/sorryserver.http
        errorfile 503 /etc/haproxy/sorryserver.http
        errorfile 504 /etc/haproxy/sorryserver.http
        server synapse0.i.bonilan.net synapse0.i.bonilan.net:8033 resolvers nsX resolve-prefer ipv4 check inter 5000 backup

backend synapse0_stream_worker
    mode http
    balance roundrobin
    option httpchk
    retries 0
        errorfile 500 /etc/haproxy/sorryserver.http
        errorfile 502 /etc/haproxy/sorryserver.http
        errorfile 503 /etc/haproxy/sorryserver.http
        errorfile 504 /etc/haproxy/sorryserver.http
        server synapse0.i.bonilan.net synapse0.i.bonilan.net:8035 resolvers nsX resolve-prefer ipv4 check inter 5000 backup

backend synapse0_media_repository
    mode http
    balance roundrobin
    option httpchk
    retries 0
        errorfile 500 /etc/haproxy/sorryserver.http
        errorfile 502 /etc/haproxy/sorryserver.http
        errorfile 503 /etc/haproxy/sorryserver.http
        errorfile 504 /etc/haproxy/sorryserver.http
        server synapse0.i.bonilan.net synapse0.i.bonilan.net:8095 resolvers nsX resolve-prefer ipv4 check inter 5000 backup

backend synapse0_ma1sd
    mode http
    balance roundrobin
    option httpchk
        errorfile 500 /etc/haproxy/sorryserver.http
        errorfile 502 /etc/haproxy/sorryserver.http
        errorfile 503 /etc/haproxy/sorryserver.http
        errorfile 504 /etc/haproxy/sorryserver.http
        server synapse0.i.bonilan.net synapse0.i.bonilan.net:8090 resolvers nsX resolve-prefer ipv4 check inter 5000

backend synapse0_webhook
    mode http
    balance roundrobin
    option httpchk GET /
        server synapse0.i.bonilan.net synapse0.i.bonilan.net:4785 resolvers nsX resolve-prefer ipv4 check inter 5000 backup

backend synapseX_worker_generic
    mode http
    balance roundrobin
    option httpchk
        errorfile 500 /etc/haproxy/sorryserver.http
        errorfile 502 /etc/haproxy/sorryserver.http
        errorfile 503 /etc/haproxy/sorryserver.http
        errorfile 504 /etc/haproxy/sorryserver.http
        server synapse1.i.bonilan.net synapse1.i.bonilan.net:8030 resolvers nsX resolve-prefer ipv4 check inter 5000
        server synapse2.i.bonilan.net synapse2.i.bonilan.net:8030 resolvers nsX resolve-prefer ipv4 check inter 5000

backend synapseX_worker_stick-auth
    mode http
    balance roundrobin
    option httpchk
    # Force users (by Authorization header) to visit the same backend server
    stick-table type string len 1024 size 5000k peers keepalived-pair expire 72h
    stick on hdr(Authorization)
        errorfile 500 /etc/haproxy/sorryserver.http
        errorfile 502 /etc/haproxy/sorryserver.http
        errorfile 503 /etc/haproxy/sorryserver.http
        errorfile 504 /etc/haproxy/sorryserver.http
        server synapse1.i.bonilan.net synapse1.i.bonilan.net:8030 resolvers nsX resolve-prefer ipv4 check inter 5000
        server synapse2.i.bonilan.net synapse2.i.bonilan.net:8030 resolvers nsX resolve-prefer ipv4 check inter 5000

backend synapseX_worker_stick-path
    mode http
    balance roundrobin
    option httpchk
    # Force users to visit the same backend server
    stick-table type string len 1024 size 5000k peers keepalived-pair expire 72h
    stick on path,word(5,/) if { path_reg ^/_matrix/client/(r0|v3|unstable)/rooms }
    stick on path,word(6,/) if { path_reg ^/_matrix/client/api/v1/rooms }
    stick on path
        errorfile 500 /etc/haproxy/sorryserver.http
        errorfile 502 /etc/haproxy/sorryserver.http
        errorfile 503 /etc/haproxy/sorryserver.http
        errorfile 504 /etc/haproxy/sorryserver.http
        server synapse1.i.bonilan.net synapse1.i.bonilan.net:8030 resolvers nsX resolve-prefer ipv4 check inter 5000
        server synapse2.i.bonilan.net synapse2.i.bonilan.net:8030 resolvers nsX resolve-prefer ipv4 check inter 5000

backend synapseX_worker_stick-src
    mode http
    balance roundrobin
    option httpchk
    # Force users (by source IP) to visit the same backend server
    stick-table type ipv6 size 5000k peers keepalived-pair expire 72h
    stick on src
        errorfile 500 /etc/haproxy/sorryserver.http
        errorfile 502 /etc/haproxy/sorryserver.http
        errorfile 503 /etc/haproxy/sorryserver.http
        errorfile 504 /etc/haproxy/sorryserver.http
        server synapse1.i.bonilan.net synapse1.i.bonilan.net:8030 resolvers nsX resolve-prefer ipv4 check inter 5000
        server synapse2.i.bonilan.net synapse2.i.bonilan.net:8030 resolvers nsX resolve-prefer ipv4 check inter 5000

Once the above configurations are installed on each server, restart the HAProxy service with sudo systemctl restart haproxy. You will now have access to the various endpoints on ports 443 and 8448 with a redirection from port 80 to port 443 to enforce SSL from clients.
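To make the room-based stickiness in the `synapseX_worker_stick-path` backend more concrete, the following Python sketch approximates how its three `stick on` rules derive their keys. It is an illustration rather than HAProxy's actual implementation: the helper names `word` and `stick_keys` are hypothetical, `word` assumes that HAProxy's `word` converter skips empty fields (so the leading `/` does not count), and the version list assumes the room-scoped rule covers `r0`, `v3` and `unstable`.

```python
import re

# Assumption: API versions whose room-scoped paths share one stick key;
# adjust this to mirror the versions listed in your own `stick on` condition.
ROOM_SCOPED = re.compile(r"^/_matrix/client/(r0|v3|unstable)/rooms")
LEGACY_ROOM_SCOPED = re.compile(r"^/_matrix/client/api/v1/rooms")


def word(path: str, index: int, delimiter: str = "/") -> str:
    """Approximate HAProxy's word() converter: 1-based index, empty words skipped."""
    words = [w for w in path.split(delimiter) if w]
    return words[index - 1] if 0 < index <= len(words) else ""


def stick_keys(path: str) -> list:
    """Return the candidate stick-table keys for `path`, in rule order."""
    keys = []
    if ROOM_SCOPED.search(path):
        keys.append(word(path, 5))  # /_matrix/client/<version>/rooms/<room ID>
    if LEGACY_ROOM_SCOPED.search(path):
        keys.append(word(path, 6))  # /_matrix/client/api/v1/rooms/<room ID>
    keys.append(path)               # unconditional fallback: the whole path
    return keys
```

For a room-scoped request the first candidate key is the room ID, so every event-sending request for that room can share one stick-table entry regardless of which endpoint is called; other paths fall back to the full path.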

Final steps

Now that your proxy is running, test connectivity to your servers. For Element (formerly Riot), visit the bare VIP address or the Element subdomain. For Matrix, visit the Matrix subdomain. In both cases, ensure that the page loads properly. Finally, use the Matrix Homeserver Federation Tester to verify that Federation is correctly configured for your Homeserver.
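If you prefer to script these checks, a small smoke test along the following lines may help. This is a sketch under a few assumptions: the hostnames are the ones used in the `host_matrix` and `host_element` ACLs above and resolve to the VIP, the certificates are valid for them, and the function names `smoke_test_urls` and `fetch_status` are illustrative only. The client `versions` and federation `version` endpoints are standard Matrix API paths; `/index.html` is the same page the Element health check requests.

```python
import urllib.request

MATRIX_HOST = "im.bonifacelabs.ca"     # from the host_matrix ACL
ELEMENT_HOST = "chat.bonifacelabs.ca"  # from the host_element ACL


def smoke_test_urls(matrix_host: str = MATRIX_HOST,
                    element_host: str = ELEMENT_HOST) -> dict:
    """Build the URLs worth probing once the proxy is up."""
    return {
        "client_versions": f"https://{matrix_host}/_matrix/client/versions",
        "federation_version": f"https://{matrix_host}:8448/_matrix/federation/v1/version",
        "element": f"https://{element_host}/index.html",
    }


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET request.

    urllib raises an exception on network failures and on 4xx/5xx responses,
    so a returned value indicates that the request succeeded.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status
```

Looping over `smoke_test_urls().items()` and calling `fetch_status` on each URL gives a quick pass/fail view of the client, federation and Element paths through the load balancer; it complements, rather than replaces, the Federation Tester.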

Congratulations, you now have a fully-configured, multi-worker and, if configured, load-balanced Matrix instance capable of handling many dozens or hundreds of users with optimal performance!

If you have any feedback about this post, including corrections, please contact me - you can find me in the #synapse:matrix.org Matrix room, or via email!