From b0f8382bbeed42e7e12840213697115f08a24459 Mon Sep 17 00:00:00 2001
From: Joshua Boniface
Date: Mon, 17 Sep 2018 12:11:44 -0400
Subject: [PATCH] Add blog post on haproxy agent checks for Patroni

---
 .../post/patroni-and-haproxy-agent-checks.md | 126 ++++++++++++++++++
 1 file changed, 126 insertions(+)
 create mode 100644 content/post/patroni-and-haproxy-agent-checks.md

diff --git a/content/post/patroni-and-haproxy-agent-checks.md b/content/post/patroni-and-haproxy-agent-checks.md
new file mode 100644
index 0000000..dd09959
--- /dev/null
+++ b/content/post/patroni-and-haproxy-agent-checks.md
@@ -0,0 +1,126 @@

+++

class = "post"
date = "2018-09-17T01:43:25-04:00"
tags = ["devops","postgresql","patroni","haproxy"]
title = "Patroni and HAProxy Agent Checks"
description = "Using HAProxy agent checks to clean up your Patroni load balancing"
type = "post"
weight = 1

+++

#### _Patroni and HAProxy Agent Checks_

[Patroni](https://github.com/zalando/patroni) is a wonderful piece of technology. In short, it [allows an administrator to configure a self-healing and self-managing replicated PostgreSQL cluster](https://patroni.readthedocs.io/en/latest/), and [quite simply at that](https://www.opsdash.com/blog/postgres-getting-started-patroni.html). With Patroni, gone are the days of managing your PostgreSQL replication by hand, worrying about failover and failback during an outage or maintenance. Having a tool like this was paramount to supporting PostgreSQL in my own cluster, and after a lot of headaches with [repmgr](https://repmgr.org/), finding Patroni was a dream come true. If you haven't heard of it before, definitely check it out!

Once you have a working Patroni cluster, managing client access to it becomes the next major step. Probably the easiest method (and the one recommended in the Patroni docs) is HAProxy.
With its integrated health checking and simple load balancing, an HAProxy-fronted Patroni cluster gives the administrator maximum flexibility while seamlessly handling failovers.

### The problem - Do you like `DOWN` hosts?

However, the [official HAProxy configuration template](https://github.com/zalando/patroni/blob/master/extras/confd/templates/haproxy.tmpl) has a problem: in a read-write backend, you want your non-`master` hosts to be inaccessible to clients, to prevent write attempts against a read-only replica. But this configuration results in the `replica` hosts being marked `DOWN` in HAProxy.

Now, some people might ask "well, why is that a big deal?" And they may be right. However, as soon as you start monitoring your HAProxy backends with an external monitoring tool, you see the problem: "CRITICAL" alerts during normal operation! After all, a `DOWN` host is considered a _problem_ in 99.9% of HAProxy use cases. But with Patroni, it's expected behaviour, which is not ideal.

So what can we do?

### HAProxy's `agent-check` directive

HAProxy, since at least version 1.5, supports [a feature called `agent-check`](https://cbonte.github.io/haproxy-dconv/1.5/configuration.html#5.2-agent-check). In short, this "enable[s] an auxiliary agent check which is run independently of a regular health check". The `agent-check` connects to a specific port on either the backend host or another target, and modifies the backend's status based on the response, which must be one of the standard HAProxy keywords (e.g. `MAINT` or `READY`).

So how does this help us? Well, if we had some way to obtain Patroni's `master`/`replica` status for each host, we could, instead of having the `replica` machines marked `DOWN`, put them into `MAINT` mode instead.
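To make this concrete, a read-write backend wired up this way might look something like the following. This is a sketch only, not the official template: the backend name, server names, addresses, and timings are all assumptions, and the agent port matches the daemon described below.

```
# Read-write backend sketch: the regular `check` verifies that PostgreSQL
# answers on port 5432, while the agent check on port 5555 flips replicas
# to MAINT instead of DOWN
backend patroni-master
    mode tcp
    server pg0 10.0.0.10:5432 check agent-check agent-port 5555 agent-inter 2s
    server pg1 10.0.0.11:5432 check agent-check agent-port 5555 agent-inter 2s
    server pg2 10.0.0.12:5432 check agent-check agent-port 5555 agent-inter 2s
```

Note that the agent check runs alongside, not instead of, the regular health check: a host that is truly dead still goes `DOWN`, while a healthy replica merely sits in `MAINT`.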
This keeps monitoring clean while still letting us use the typical Patroni HAProxy configuration, with only a minimal modification to the HAProxy configuration and one additional daemon deployed on the Patroni hosts.

### The Code - Python 3 daemon

The following piece of code is a Python 3 daemon I wrote that uses the `socket` and `requests` libraries (the latter requires the `python3-requests` package on Debian, or `requests` via `pip3`) to:

1. Listen for the agent check on port `5555`.
2. In response to a request, query Patroni's local API to determine that host's `role`.
3. Return `MAINT` or `READY` to HAProxy based on the role.

Here is the code - I'm sure it can be improved significantly, but it works for me!

```
#!/usr/bin/env python3

# Simple agent check for HAProxy to determine Patroni master/replica status

import socket
import requests

# External port to listen on and report status to HAProxy
listen_port = 5555

# Location of the Patroni API
data_target = 'http://localhost:8008'

# Get the current role from Patroni's API; return None if the API is
# unreachable or returns invalid JSON
def getstate():
    try:
        r = requests.get(data_target)
        return r.json().get('role')
    except (requests.exceptions.RequestException, ValueError):
        return None

# Set up a listen socket on listen_port; SO_REUSEADDR lets the daemon
# rebind immediately after a restart instead of waiting out TIME_WAIT
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(('', listen_port))
sock.listen(1)

# Loop waiting for client requests in blocking mode
while True:
    conn, addr = sock.accept()
    state = getstate()

    # Set our response based on the state; only `master` should be READY
    # in read-write mode
    if state == 'master':
        data = b'READY\n'
    else:
        data = b'MAINT\n'

    # Send the data to the client; a failed send should not kill the
    # daemon, so just log it and close the connection either way
    try:
        conn.sendall(data)
    except OSError as e:
        print(e)
    finally:
        conn.close()
```

### Running the daemon with systemd

Running the above Python code is really simple with systemd.
I use the following unit file, assuming the above code is located at `/usr/local/bin/patroni-check`.

```
# Patroni agent check systemd unit file

[Unit]
Description=HAProxy agent check for Patroni status
After=syslog.target network.target patroni.service

[Service]
Type=simple
User=postgres
Group=postgres
StartLimitInterval=15
ExecStart=/usr/local/bin/patroni-check
KillMode=process
TimeoutSec=30
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

This is a really straightforward unit with one deviation: `StartLimitInterval=15` prevents the daemon from restarting immediately on failure. In my experience (probably a n00b error), Python doesn't release the listening socket right away after an unclean exit - it lingers in `TIME_WAIT` - so the daemon can blow through its ~5 restart attempts in under a second and fail every time. Setting `SO_REUSEADDR` on the socket is the usual fix for this, but the interval gives the socket some breathing room regardless. And luckily, HAProxy won't change the server's state if the agent check becomes unreachable, so this should be safe.

### Conclusion

I hope this provides some help to those who want to use Patroni fronted by HAProxy but don't want `DOWN` backends all the time!
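As a postscript: you can sanity-check the daemon from any client host by querying it the same way HAProxy's agent check does - open a TCP connection, read one line, close. A minimal sketch (the `query_agent` helper is my own illustration, assuming the daemon above is listening on port `5555`):

```
#!/usr/bin/env python3

# Query an HAProxy-style agent check: connect over TCP, read a single
# line of status, and close the connection.

import socket

def query_agent(host='localhost', port=5555, timeout=5.0):
    """Return the daemon's one-line status, e.g. 'READY' or 'MAINT'."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        return s.makefile().readline().strip()
```

Against the daemon above, `query_agent()` should return `READY` on the master and `MAINT` on replicas.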