I’ve been running portions of this infrastructure since about 2013. It started with one VPS, grew to include a high-traffic server hosting a Tor node, and from there has expanded to 3 VPSes, 2 servers, and 3 containers, for a total of 8 hosts to maintain full-time. This is a lot! This is going to be the first in a series of posts about what it’s like to maintain this many hosts over such a long period of time. This week’s theme is updates.
Some of you may know I live on the questionable bleeding edge and run Arch Linux on all my servers, but the challenge of keeping them up to date mostly comes down to remembering to log in and update everything. After exploring a few fleet management systems (e.g. Chef, Puppet, etc.), I settled on Ansible. Specifically, the Ansible Tower project, now open-sourced as AWX, has been really easy for me to set up and use. My method was as follows:
- Pick a host, and install AWX via Docker Compose. This is your main area of operations.
- (optional) You may need to log into the AWX Docker container and install additional Python modules for the AWX features you want, e.g. one of my templates uses DNS resolution.
- Generate a SSH key and store the private key as an AWX secret.
- Deploy the public key as the root user on your fleet. Set the protections you want (I used key-only login, logins restricted to certain source hosts, etc.)
- Install Python on your fleet.
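That last step matters because most Ansible modules need Python on the managed host, while the `raw` module works over bare SSH. A minimal bootstrap play might look like this (the play name and `hosts` target are placeholders, not my actual config):

```yaml
# bootstrap.yml — install Python over bare SSH using the raw module,
# which does not require Python on the target (hypothetical names)
- name: Bootstrap Python on the fleet
  hosts: all
  become: true
  gather_facts: false          # fact gathering itself requires Python
  tasks:
    - name: Install Python via pacman
      ansible.builtin.raw: pacman -Sy --noconfirm python
```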
Within AWX you need to do a few additional setup steps:
- Configure your user account
- Set up your host inventory with the list of hosts you want to manage. Bonus: set tags, which become useful for your Ansible templates later on.
- Add the repository to source for templates.
- Add the templates themselves with their appropriate configs.
- (optional) Configure schedules for those templates so that they run regularly as Jobs.
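AWX stores the inventory in its database and you build it through the UI, but the equivalent static YAML inventory might look something like this (hostnames and group names are made up for illustration; groups play the role of the tags mentioned above):

```yaml
# inventory.yml — hypothetical inventory with hosts grouped so that
# templates can target subsets of the fleet
all:
  children:
    vps:
      hosts:
        web1.example.com:
        mail1.example.com:
    baremetal:
      hosts:
        tor1.example.com:
```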
From there, I have a series of Ansible templates for provisioning (YAML playbook templates with Jinja2 file templates):
- Install base packages
- Lock down SSH
- Set up some common configs in /etc
- Set up certificates (for renewal)
- Set up iptables and sshguard
- Set up syslog-ng
- (optional) Ensure /etc/hosts has static entries
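To give a flavor of what these provisioning templates look like, here is a minimal sketch of the SSH lockdown step (the specific settings and names here are assumptions for illustration, not my actual template):

```yaml
# ssh-lockdown.yml — sketch of an SSH hardening play
- name: Lock down SSH
  hosts: all
  become: true
  tasks:
    - name: Disable password authentication
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?PasswordAuthentication'
        line: 'PasswordAuthentication no'
      notify: Restart sshd
  handlers:
    - name: Restart sshd
      ansible.builtin.systemd:
        name: sshd
        state: restarted
```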
I also have some Ansible templates on a schedule to keep the hosts up to date automatically:
- “Safe” upgrade (ignores kernel, pacman, glibc)
- “System” upgrade + reboot (only kernel, pacman, glibc)
- Certificate renewals (same as above)
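The “safe” upgrade can be sketched with pacman’s `--ignore` flag, leaving the pinned packages for the scheduled “system” upgrade template (the exact package names and play structure here are assumptions; on Arch the kernel package is typically `linux`):

```yaml
# safe-upgrade.yml — upgrade everything except the risky core packages,
# which get their own template paired with a reboot
- name: Safe upgrade
  hosts: all
  become: true
  tasks:
    - name: Upgrade all packages except kernel, pacman, and glibc
      ansible.builtin.command: >
        pacman -Syu --noconfirm
        --ignore linux --ignore pacman --ignore glibc
      register: result
      changed_when: "'there is nothing to do' not in result.stdout"
```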
I’ll finish off with some of the problems I’ve run into over the past few years while updating things (or, in some cases, forgetting to update):
- RAID arrays desyncing because a drive was about to fail, causing a mysterious “missing kernel module” issue when updates landed on only one drive and the host flip-flopped between booting from the synced and unsynced drive
- Exim RCEs causing mysterious errors as a result of the box being pwned
- GRUB mysteriously failing on PV hosts, necessitating a migration from PV to HVM, which in turn required additional GRUB configuration to view the serial console
- Certificate expiries preventing e-mails from being received.
- The occasional glibc dependency issue that isn’t captured properly during a partial update, breaking everything.