99 Commits

Author SHA1 Message Date
83b806d0b5 Move intervals config one level up
Makes for a slightly-better-organized configuration and explanation.
2019-07-28 19:33:23 -04:00
96bc181877 Set the routerstate on daemon startup
Allows switching from coordinator to not coordinator with a service
restart.
2019-07-12 09:51:56 -04:00
2a220cd16e Nicer colour output for coordinator state client 2019-07-12 09:31:42 -04:00
439c5f18c3 Add router_state to output of keepalives 2019-07-11 20:11:05 -04:00
f30be555c1 Improve message output for logging
Improve some formatting of the messages being printed to make it nicer
for long-term logging.
2019-07-10 22:38:32 -04:00
ac36870a86 Implement hup for log rotation
This function was long-existent, but never used; implement it.
2019-07-10 22:22:02 -04:00
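A minimal sketch of what a HUP-driven log reopen can look like, not the daemon's actual code. `RotatableLog`, `hup`, and `install_hup_handler` are hypothetical names chosen for illustration.

```python
import signal


class RotatableLog:
    """Hypothetical file logger that can reopen its file on demand."""

    def __init__(self, path):
        self.path = path
        self.handle = open(self.path, "a")

    def write(self, message):
        self.handle.write(message + "\n")
        self.handle.flush()

    def hup(self):
        # Close and reopen by path so later writes land in the fresh file
        # created after a rotation tool moved the old one aside.
        self.handle.close()
        self.handle = open(self.path, "a")


def install_hup_handler(log):
    # SIGHUP is not defined on every platform, so guard the registration.
    if hasattr(signal, "SIGHUP"):
        signal.signal(signal.SIGHUP, lambda signum, frame: log.hup())
```

A rotation tool would then move the old file aside and send SIGHUP to the daemon, after which new messages go to a new file at the original path.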
58f4222ee7 Support disabling log colours and dates
For use cases such as pure syslog, allow disabling of dates or colours
in the log messages (separately).
2019-07-10 22:17:23 -04:00
7df200ac44 Improve ZK connection loss handling 2019-07-09 19:17:32 -04:00
47f86475f8 Handle failures of Ceph commands gracefully
If these commands fail, catch the error, print a message, and set up
empty lists. Also handle later data parsing in this case.
2019-07-09 16:43:38 -04:00
1a8e7509f7 Support run_os_command timeout; use timeouts 2019-07-09 15:09:13 -04:00
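The timeout and graceful-failure behaviour described in these two commits can be sketched roughly as follows. This is an illustration under assumptions: `run_command` and `list_or_empty` are hypothetical helpers, not the signature of the daemon's `run_os_command`, and the return codes 127 and 128 are arbitrary placeholders.

```python
import subprocess


def run_command(argv, timeout=None):
    """Run argv and return (retcode, stdout, stderr) without raising."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run when the timeout expires.
        return 128, "", "command timed out"
    except OSError as e:
        # For example, the binary is not installed.
        return 127, "", str(e)
    return (
        result.returncode,
        result.stdout.decode(errors="replace"),
        result.stderr.decode(errors="replace"),
    )


def list_or_empty(argv, timeout=2):
    """Return the command's output lines, or an empty list on any failure."""
    retcode, stdout, stderr = run_command(argv, timeout=timeout)
    if retcode != 0:
        print("Command {} failed ({}): {}".format(argv, retcode, stderr.strip()))
        return []
    return stdout.splitlines()
```

Later parsing code then only has to cope with an empty list, rather than with an exception or a hung command.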
83a4140703 Allow enabling debug mode in config
Makes debugging easier without modifying code.
2019-07-09 14:59:00 -04:00
8eeba9bc9b Make Ceph commands time out if needed 2019-07-09 14:35:53 -04:00
19701c66e4 Move fencing to after keepalive output
Just makes the messages a little easier to read when triggered.
2019-07-09 14:24:31 -04:00
b551b54642 Rename message when contending 2019-07-09 14:03:48 -04:00
4249d5d982 Always load and store IPMI on daemon start
Without this, the IPMI information set during initial node creation can
never be changed, which can cause issues later. Instead, always set it
fresh on each node boot.
2019-07-09 14:00:31 -04:00
cda690e94f Set RADOS df information in ZK 2019-07-08 10:19:56 -04:00
0d398f663b Rename "Domain" to "VM" in various class names
The name "Domain", though technically correct from a Libvirt
perspective, was unnecessarily confusing. Call the class instances what
they are, VMs.
2019-07-07 15:20:37 -04:00
8216125b02 Enable autostart of API client on Primary
Adds a config flag that turns on the API client following the Primary
coordinator. The retcode of the start/stop commands is ignored so this
can fail gracefully if e.g. the client isn't installed.
2019-07-06 02:42:56 -04:00
3e591bd09e Remove extra whitespaces on blank lines 2019-06-25 22:33:23 -04:00
d336fce253 Connect to actual IP not localhost for Libvirt 2019-06-25 22:09:32 -04:00
75d0e7f989 Revert "Only perform fencing duties on primary"
This reverts commit 464c69aac67303a7eb16c0061ad8fa202b15a535.

Actually, yea, this made sense - if the primary fails, it can't
fence itself.
2019-06-25 12:36:48 -04:00
464c69aac6 Only perform fencing duties on primary
There was really no need for this to be shared among all the
coordinators, which seemed more fragile. This way only the primary will
try to fence dead nodes.
2019-06-24 20:17:51 -04:00
0f15e7cda5 Set shutdown state after final keepalive 2019-06-19 14:52:47 -04:00
0060c0313b Put daemonstate to shutdown when stopping
This way it isn't "run" all the way until it shuts down.
2019-06-19 14:23:07 -04:00
a940d03959 Fix some bugs and add RBD volume stats 2019-06-19 10:25:22 -04:00
db0b382b3d Don't bother with snapshot management by Daemon
This is *definitely* not needed in the end, and just uses RAM for
no conceivable purpose. Snapshots are fully client-managed.
2019-06-19 09:43:04 -04:00
1c9f606480 Implement volume and snapshot handling by daemon
This seems like a super-gross way to do this, but at the moment
I don't have a better way. Maybe just remove this component since
none of the volume/snapshot stuff is dynamic; will see as this
progresses.
2019-06-19 09:40:32 -04:00
784b428ed0 Add creation of volume and snapshot lists 2019-06-19 09:29:36 -04:00
2bbbda3da5 Only trigger pool updates on primary 2019-06-18 21:26:05 -04:00
443108f53d Add support for enable/disable keepalive detail 2019-06-18 19:54:42 -04:00
79f284a0a9 Pass logger into run_command 2019-06-18 13:45:59 -04:00
080ca3201c Correct actual problem with this_node 2019-06-18 13:43:54 -04:00
aee078f3eb Support disabling keepalive logging 2019-06-18 12:44:07 -04:00
b0411e8e1a Remove "error" message from Ceph commands
This triggers at every node start and isn't useful.
2019-06-18 12:41:38 -04:00
8d9007f697 Remove OSD stat collection if count is zero
Otherwise, ceph osd df will hang indefinitely trying to get data
for the zero OSDs.
2019-06-18 12:36:53 -04:00
5a327dc41a Clean up Ceph pipeline and add more debug logs 2019-06-18 11:19:03 -04:00
1f92b90a3e Don't encode initial data as we're using zkhandler 2019-06-17 23:53:16 -04:00
d4ebe63d9b Rename network device field
It seems much nicer and more consistent as "device" rather than as
"name".
2019-06-17 23:44:41 -04:00
1d3f868206 Unify network devices and addresses in config
The old way of doing this was a little cumbersome, with an upper YAML
tree split between "devices" (name and MTU) and addresses. This commit
unifies these under the root "networking" section to make this section
clearer.
2019-06-17 23:41:07 -04:00
e70255dbd6 Support configurable interface MTUs
MTUs were hardcoded at 9000, which breaks if the underlying interface
or network switch does not support jumbo frames, a possible deployment
limitation. This has non-obvious consequences due to MTU mismatches
for certain services (Ceph, Zookeeper, etc.).

This commit adds support for configurable MTUs for each interface,
set in pvcd.yaml. The example has been updated to reflect this, with
a default of 1500 (the Ethernet standard).

This commit also adds autoconfiguration of the VNI device MTU based
on the `vni_mtu` value: the same value for bridge networks, and
`vni_mtu` minus 50 for VXLAN networks (rather than the minus 200
applied to the old hardcoded value; see [1]).

[1] http://ipengineer.net/2014/06/vxlan-mtu-vs-ip-mtu-consideration/
2019-06-17 23:34:48 -04:00
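The MTU rule from the commit above, as a small sketch. The function name and the `"bridged"`/`"vxlan"` type strings are assumptions for illustration; only the `vni_mtu` value and the 50-byte reduction come from the commit message.

```python
# Bytes subtracted from vni_mtu for VXLAN networks, per the commit message.
VXLAN_OVERHEAD = 50


def vni_device_mtu(vni_mtu, network_type):
    """Return the MTU for a VNI device (hypothetical helper)."""
    if network_type == "bridged":
        # Bridge networks use the configured value unchanged.
        return vni_mtu
    if network_type == "vxlan":
        # VXLAN networks leave room for the encapsulation headers.
        return vni_mtu - VXLAN_OVERHEAD
    raise ValueError("unknown network type: {!r}".format(network_type))


print(vni_device_mtu(1500, "bridged"))
print(vni_device_mtu(1500, "vxlan"))
```

With the example default of 1500, a VXLAN network device would get an MTU of 1450 under this rule.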
c583ee1709 Revert "Wait a little longer"
This reverts commit bd7a55e9e1de08f00208e641b237b1bbe7ab420f.

This is not really needed, but do keep the 5s wait
2019-06-17 21:56:06 -04:00
bd7a55e9e1 Wait a little longer 2019-06-17 12:14:13 -04:00
23994f8a11 Increase wait time for daemons and log message 2019-06-17 10:30:46 -04:00
fe654aa5a2 Correct typo in daemon 2019-06-16 19:27:20 -04:00
e8b666708c Add one final keepalive update before exiting 2019-05-23 23:23:03 -04:00
8881b97e8b Correct a missing capitalization 2019-05-21 23:19:19 -04:00
595cf1782c Switch DNS aggregator to PostgreSQL
MariaDB+Galera was terribly unstable, with the cluster failing to
start or dying randomly, and generally seemed incredibly unsuitable
for an HA solution. This commit switches the DNS aggregator SQL
backend to PostgreSQL, implemented via Patroni HA.

It also manages the Patroni state, forcing the primary instance to
follow the PVC coordinator, such that the active DNS Aggregator
instance is always able to communicate read+write with the local
system.

This required some logic changes to how the DNS Aggregator worked,
specifically ensuring that database changes aren't attempted while
the instance isn't actively running - to be honest this was a bug
anyways that had just never been noticed.

Closes #34
2019-05-21 01:07:41 -04:00
2151566b74 Send total memory via ZK so it's accurate 2019-05-10 23:26:59 -04:00
7416d440d5 Use zkhandler when writing initial node config 2019-05-10 23:26:59 -04:00
b6ecd36588 Implement domain log watching
Implements the ability for a client to watch almost-live domain
console logs from the hypervisors. It does this using a deque-based
"tail -f" mechanism (with a configurable buffer per-VM) that watches
the domain console logfile in the (configurable) directory every
half-second. It then stores the current buffer in Zookeeper when
changed, where a client can then request it, either as a static piece
of text in the `less` pager, or via a similar "tail -f" functionality
implemented using fixed line splitting and comparison to provide a
generally-seamless output.

Enabling this feature requires each guest VM to implement a Libvirt
serial log and write its (text) console to it, for example using the
default logging directory:

```
<serial type='pty'>
    <log file='/var/log/libvirt/vmname.log' append='off'/>
</serial>
```

The append mode can be either on or off; on grows files unbounded,
off causes the log (and hence the PVC log data) to be truncated on
initial VM startup from offline. The administrator must choose how
they best want to handle this until Libvirt implements their own
clog-type logging format.
2019-05-10 23:26:59 -04:00
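A rough sketch of the deque-based "tail -f" idea from the commit above, under stated assumptions: `tail_lines` and `LogWatcher` are hypothetical names, and the `store` callable stands in for the Zookeeper write, which is not shown here.

```python
from collections import deque


def tail_lines(path, max_lines):
    """Return the last max_lines lines of a text file."""
    with open(path, "r", errors="replace") as f:
        # A bounded deque keeps only the newest max_lines entries.
        return list(deque((line.rstrip("\n") for line in f), maxlen=max_lines))


class LogWatcher:
    """Hypothetical watcher that stores the buffer only when it changes."""

    def __init__(self, path, max_lines, store):
        self.path = path
        self.max_lines = max_lines
        self.store = store  # stand-in for writing the buffer to Zookeeper
        self.last = None

    def poll(self):
        """Read the tail once; return True if a new buffer was stored."""
        current = tail_lines(self.path, self.max_lines)
        if current == self.last:
            return False
        self.last = current
        self.store("\n".join(current))
        return True
```

A daemon loop would call `poll()` on each VM's console log roughly every half-second, so unchanged logs cause no writes.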