
Making nginx more resilient: runtime DNS resolution and auto-restart

The goal

A service in the lab — atlas.nginx.dnsif.ca — stopped responding. I only noticed when I happened to visit it. The container behind it was fine, healthy on its own port for days. The reverse proxy had just... quietly disappeared.

This post is about the two things I changed in the nginx config so that specific failure mode can't happen again, and the one thing I deliberately left alone.

The forensics

First pass: hit the upstream directly.

$ curl -s http://docker.dnsif.ca:5011/api/health
{"status":"ok"}

Upstream is healthy. Next the reverse proxy itself:

$ ssh nginx.dnsif.ca 'systemctl is-active nginx'
failed

Interesting — not "inactive" (a clean stop), but failed. Check the journal:

Apr 16 07:10:10  nginx  Reloaded nginx.service
Apr 17 05:40:23  nginx  Reloading nginx.service
Apr 17 05:40:24  nginx  Reloaded nginx.service
...
Apr 22 06:15:27  nginx  [emerg] host not found in upstream "docker.dnsif.ca" in /etc/nginx/sites-enabled/atlas:15
Apr 22 06:15:27  nginx  nginx: configuration file /etc/nginx/nginx.conf test failed
Apr 25 00:18:56  nginx  Starting nginx.service ...      ← me, today, realising
Apr 25 00:18:57  nginx  Started nginx.service.

Three days and eighteen hours between the last healthy log entry and me starting it by hand.

Host uptime says the machine hasn't rebooted. nginx.service is enabled, so a reboot would have restarted it anyway. What happened is that a config reload failed, and at some later point nginx ended up in a failed systemd state — possibly because something (cron? acme.sh's deploy hook? my finger?) issued a systemctl restart that couldn't come back up while nginx still considered the config broken. Unit in failed state, no auto-restart policy, nobody notices until a human hits the URL.
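
Both halves of that diagnosis show up in the unit's properties. The output below is what I'd expect from a unit in this state, illustrative rather than a capture from the incident:

$ systemctl show nginx.service -p ActiveState -p Restart
ActiveState=failed
Restart=no

ActiveState=failed is the stuck state; Restart=no is the packaged default that keeps it stuck.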

So there are two separate fragilities here, and I want to fix both.

Fragility #1 — DNS is evaluated at reload time

This is the one I didn't know about until I read the error carefully. My atlas vhost did this:

location / {
    proxy_pass http://docker.dnsif.ca:5011;
    # ... headers ...
}

Looks innocent. But nginx resolves docker.dnsif.ca to an IP once, at config load time — either at boot, at explicit reload, or at nginx -t. If resolution fails in that instant, nginx refuses to load the config at all. That's what the [emerg] line is telling us: nginx was asked to reload, DNS for docker.dnsif.ca was briefly unresolvable, and the whole reload was aborted.

It doesn't matter that DNS was back to normal a minute later. The config had failed its test. The service was never told to try again.
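
You can reproduce the failure without waiting for real DNS to flake: point a throwaway vhost (here a hypothetical sites-enabled/repro) at a name under the reserved .invalid TLD, which is guaranteed never to resolve, and run the config test.

location / {
    proxy_pass http://upstream.invalid:5011;
}

$ sudo nginx -t
nginx: [emerg] host not found in upstream "upstream.invalid" in /etc/nginx/sites-enabled/repro:2
nginx: configuration file /etc/nginx/nginx.conf test failed

Same [emerg], same refusal to load; a reload issued at that moment dies the same way mine did.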

The fix: a resolver + a variable

If you put the upstream name in a variable, nginx can't resolve it at parse time — it has to defer to request time. You pair that with a resolver directive telling nginx which nameserver to ask:

server {
    listen 443 ssl;
    server_name atlas.nginx.dnsif.ca;

    ssl_certificate     /home/altanc/.acme.sh/*.nginx.dnsif.ca_ecc/fullchain.cer;
    ssl_certificate_key /home/altanc/.acme.sh/*.nginx.dnsif.ca_ecc/*.nginx.dnsif.ca.key;

    # Re-resolve the upstream at request time via systemd-resolved.
    resolver 127.0.0.53 valid=30s ipv6=off;
    resolver_timeout 5s;

    location / {
        set $atlas_upstream docker.dnsif.ca:5011;
        proxy_pass http://$atlas_upstream;
        # ... headers ...
    }
}

Now the lookup happens at request time: nginx asks the resolver on demand, caches the answer for 30 seconds, and a DNS blip costs at most a few failed requests inside that window. Nothing in the config needs resolving at load time anymore, so a reload can never again be aborted by DNS.

Why 127.0.0.53? That's the systemd-resolved stub resolver — this host already has it running on loopback and it's what the rest of the system uses via /etc/resolv.conf. No new dependency.
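
Two quick checks that the stub really is what this host uses (assuming systemd-resolved in its default stub mode; on a host without it, point resolver at whatever /etc/resolv.conf lists):

$ grep nameserver /etc/resolv.conf
nameserver 127.0.0.53
$ resolvectl query docker.dnsif.ca

If the query returns the upstream's current address, nginx will get the same answer at request time.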

Gotcha. Using a variable in proxy_pass changes how nginx forwards the request URI. With a literal proxy_pass http://upstream:port; and no URI part, nginx passes the original request URI unchanged, and the variable form without a URI part behaves the same. Add a URI part, though, and the two diverge: a literal replaces only the part of the request URI that matched the location prefix, while a variable replaces the entire request URI with the one you gave. If you're ever tempted to write proxy_pass http://$upstream/prefix/;, check the docs twice. For pass-through vhosts like this one, the simple form is fine.
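
To make that concrete, here's a hypothetical /api/ location written both ways; the /v2/ prefix is illustrative, not from my config:

# Literal with a URI part: the matched location prefix is swapped out.
# GET /api/health  →  GET /v2/health on the upstream
location /api/ {
    proxy_pass http://docker.dnsif.ca:5011/v2/;
}

# Variable with a URI part: the URI replaces the WHOLE request URI.
# GET /api/health  →  GET /v2/ on the upstream; the rest of the path is gone
location /api/ {
    set $u docker.dnsif.ca:5011;
    proxy_pass http://$u/v2/;
}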

Fragility #2 — nothing restarts nginx when it dies

If nginx ever lands in failed state — from a bad reload, a crash, a systemctl stop someone forgot about — the default systemd unit doesn't try to bring it back. Stays down until a human intervenes.

That's a small fix too, done as a drop-in instead of editing the packaged unit (so apt upgrades don't stomp it):

# /etc/systemd/system/nginx.service.d/restart.conf
[Service]
Restart=on-failure
RestartSec=10s

Apply it and confirm the properties took:

$ sudo systemctl daemon-reload
$ systemctl show nginx.service -p Restart -p RestartUSec
Restart=on-failure
RestartUSec=10s

Belt and suspenders: the first fix makes this particular failure mode near-impossible; this one makes the whole class of "nginx silently stopped" self-healing.

I deliberately went with Restart=on-failure rather than Restart=always. always would keep restarting nginx even when I've explicitly stopped it, which is not what I want.
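
One way to prove the policy end-to-end, as a hypothetical test rather than something from the incident: SIGKILL counts as a failure under Restart=on-failure, while the clean SIGTERM that systemctl stop sends does not, which is exactly the distinction I wanted.

$ sudo systemctl kill --signal=SIGKILL nginx
$ sleep 12
$ systemctl is-active nginx
active

After the 10-second RestartSec delay the unit should come back on its own.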

What I didn't change

Two things I considered and decided against.

Rewriting every vhost — this host has ~25 sites and I only touched atlas. Most of the others proxy to 127.0.0.1:XXXX (no DNS lookup involved at all) or to names inside .docker.internal-style scopes that can't fail on this host; none of them has bounced in years. Adding a resolver directive to every vhost would be a lot of churn for upstreams that don't need it. If one of them bites me the same way, I'll migrate it — and by then it'll be an established pattern.

An external monitor pinging the site — sure, Uptime Kuma or similar would have caught this in minutes. I'll add that eventually. But monitoring tells you that something is down; the config changes above keep it from going down in the first place. They're doing different jobs.

Verifying

$ sudo nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

$ sudo systemctl reload nginx
$ curl -s -o /dev/null -w '%{http_code}\n' https://atlas.nginx.dnsif.ca/api/health
200

(The warning nginx emitted after the edit — "conflicting server name atlas.nginx.dnsif.ca" — was from the backup file I'd created as sites-enabled/atlas.bak.20260425. nginx loads every file in sites-enabled/ regardless of extension. Backups belong in sites-available/ or /tmp. Moving the backup out of sites-enabled/ made the warning go away.)
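
The reason the extension doesn't matter is the include glob in the stock Debian-family nginx.conf, which matches every file in the directory:

http {
    # ...
    include /etc/nginx/sites-enabled/*;
}

Anything that glob matches gets parsed as configuration: .bak files, editor backups, all of it.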

What the final shape looks like

Two three-line additions, no new dependencies, no new moving parts:

                                 ┌──────────────────────────┐
                                 │  systemd-resolved        │
                                 │  listens on 127.0.0.53   │
                                 └───────────▲──────────────┘
                                             │
                                             │ per-request DNS
                                             │ (30s cache)
┌────────────────────┐                       │
│  request to atlas  │ ─────▶  nginx ────────┴─────▶ docker.dnsif.ca:5011
└────────────────────┘           │
                                 │ if nginx ever dies
                                 ▼
                            systemd  →  Restart=on-failure, 10s delay

The change takes 60 seconds to apply. The failure mode it prevents took me three days to notice. That's a decent trade.