Infra · 13 min read
Adding HA to a single-container SQLite web app, without moving to Postgres
What I needed
I have a small FastAPI app in the lab called Atlas — an inventory of VMs, networks, endpoints, secrets. One container, SQLite on a Docker volume, a React frontend baked into the same image. Boring on purpose. Fits on a 1 GB VM.
Then the usual thing happened: the more I put in it, the more it mattered that it stayed up. A lost DB stopped being "oh well, rebuild it" and started being "weekend of re-typing."
So I wanted HA, but I wanted to keep the things I liked about the app:
- Single container. Whatever replicates the DB has to live inside the same image. No second container, no sidecar.
- No third-party storage. No S3, no MinIO, no shared filesystem to provision.
- Active/passive with manual failover. No heartbeat/RAFT quorum deciding the role. A "Make me primary" button I press.
- No SSHing into the standby to fail over. The whole point of a button is that it's a button.
- Dated, gzipped local backups on each host, with a skip-if-unchanged check so the dir doesn't fill up with identical copies.
Why stay on SQLite at all
The most obvious answer to "I want HA" is "switch to Postgres," so let's get that out of the way first.
Postgres has streaming replication everyone's debugged for twenty years. True, and if I were starting today I'd probably reach for it. But Atlas already exists, and migrating buys: a schema rewrite (SQLAlchemy + SQLite ≠ Postgres — JSON columns, booleans, INTEGER PRIMARY KEY AUTOINCREMENT all differ); a multi-container deployment; a new failure mode (app is up, DB unreachable); a new thing to patch and back up. None of which I'd accept for "single-writer inventory app survives a host dying."
Multi-master systems — Galera, BDR, Patroni, CockroachDB, rqlite/dqlite — all need at least three nodes for quorum. With two nodes there is no majority to win: lose either box, or just the link between them, and you get a halted cluster or a split brain — which is exactly what an odd node count exists to prevent. Going from two hosts to three, each running a stateful DB, isn't a lab inventory app anymore — it's a database cluster with version-upgrade ordering and quorum monitoring. Wrong shape.
So I'm staying on SQLite. The question becomes: how do you replicate it?
The three options
┌──────────────────────────────────────────────────────────────────┐
│  Option 1  Raft-replicated SQLite (rqlite, dqlite)               │
│  Option 2  Stream WAL via Litestream (s3 / sftp / file backend)  │
│  Option 3  Roll our own shipper (HTTP POST the DB itself)        │
└──────────────────────────────────────────────────────────────────┘
Up-front guess at how this will go: Option 1 is overkill, Option 2 is the obvious right answer, Option 3 is a fallback I won't need. Option 2 won the cold pick. Option 3 is what's actually running today.
Option 1 — Raft-replicated SQLite (rqlite, dqlite)
Wrap SQLite in a Raft consensus layer. Any node can take writes, quorum decides who's alive, failover is automatic, everyone's strongly consistent.
Ruled out for the same reason as Postgres-with-Patroni: three nodes minimum. Plus rqlite isn't actually SQLite-the-file-on-disk that the rest of my tooling speaks (sqlite3 .backup, my existing backup job, etc.) — it's a Raft cluster that uses SQLite internally. Different operational model.
For "two boxes, one button, manual failover," Raft is the wrong layer of the stack.
Option 2 — Litestream (the obvious answer that didn't fit)
Litestream is a single Go binary that tails SQLite's WAL and streams the frames to an external store. The author's careful engineering shows: retention policies, point-in-time restore, supports S3 / GCS / Azure / SFTP / plain file backends. The replication is somebody else's code. That's the appeal — I want to own the operator flow (pairing, role state, the failover button), not the byte-shipping.
The catch is somebody else's code needs somewhere to put the bytes. My constraints rule out cloud (S3/GCS/Azure) and shared storage. Of the local backends, the realistic one is SFTP.
So the design became: bake openssh-server into the Atlas container as an SFTP-only chroot, exchange SSH client keys during pairing, have each node SFTP into the other. Single container preserved (just one with sshd inside), no shared storage, no third party.
┌─────────── host A ───────────┐                ┌─────────── host B ───────────┐
│  Atlas container             │                │  Atlas container             │
│  ┌────────────────────────┐  │                │  ┌────────────────────────┐  │
│  │ uvicorn                │  │    WAL via     │  │ uvicorn (standby)      │  │
│  │                        │  │      SFTP      │  │                        │  │
│  │ litestream replicate ──┼──┼────────────────┼──┼─▶ sshd (sftp-only,     │  │
│  │                        │  │                │  │        chrooted)       │  │
│  │ sshd ◀─────────────────┼──┼────────────────┼──┤ /srv/replica-inbound   │  │
│  │                        │  │   (failover    │  │                        │  │
│  └────────────────────────┘  │    reverses)   │  └────────────────────────┘  │
└──────────────────────────────┘                └──────────────────────────────┘
It went sideways at every layer of the stack. Yaks I shaved, in the order I hit them:
- useradd --system makes the account locked, and sshd rejects locked accounts even for pubkey auth. passwd -d atlas after useradd unlocks it.
- StrictModes yes rejects authorized_keys under /data. /data is bind-mounted from the host's non-root user, and sshd refuses to read auth files under non-root-owned parents. Set StrictModes no.
- Chroot ancestry rejection is a separate check — sshd refuses to chroot into a dir whose ancestors aren't root-owned. Same problem with /data/.... Moved the chroot to /srv/replica-inbound, backed by a Docker named volume so its ownership stays root.
- ed25519 keys can't be re-saved as PEM. ssh-keygen -p -m PEM claims success and silently does nothing on ed25519 files.
- The actual blocker. Manual sftp -i /data/atlas-ssh-key -P 2222 atlas@peer succeeded ("Accepted publickey for atlas" in sshd's logs). Litestream's own SSH client, with the same key against the same server, got rejected: attempted methods [none publickey], no supported methods remain. The container's sshd never even logged Litestream's connection — TCP reached the server, but auth never engaged.
I tried the obvious narrowings — RSA-PEM key instead of ed25519, password auth instead of pubkey, every log level Litestream offers, full key fingerprint cross-checks. All paths the same: manual works, Litestream doesn't. Litestream v0.5.x is a relatively recent rewrite around their LTX format; my best guess is an undocumented regression in the SFTP backend. I couldn't pin it down.
What I can count is what it cost:
Option 2 (Litestream + embedded sshd) — never authenticated successfully:

  Dockerfile                 +20 lines   (openssh-server + atlas user + dirs)
  docker/sshd_atlas.conf      35 lines   (chroot, AllowUsers, AuthorizedKeysFile, ...)
  docker/entrypoint.sh        30 lines   (host-key gen, dir setup)
  ha.py                     ~590 lines   (sshd supervisor, key gen, pubkey install,
                                          register-peer pubkey exchange, sftp_host
                                          derivation, litestream YAML generation,
                                          litestream child-process mgmt)
  routers/ha.py              +60 lines   (register-peer, sftp_host fields)
  docker-compose              +5 lines   (port 2222, atlas-replica-inbound volume)
                            ─────────
  ~740 lines of dedicated transport plumbing.
For something whose entire job was supposed to be not having to think about it. That's when I noticed the abstraction had become a tax — every problem was at a layer I hadn't signed up to debug. So I stopped, deleted all of it, and walked.
Option 3 — POST the DB over HTTPS
Originally dismissed as "fun to write, miserable to debug." I had the second adjective wrong for this problem.
The architecture, in full:
┌─────────── host A (primary) ────────────┐           ┌─── host B (standby) ───┐
│                                         │   every   │                        │
│  sqlite3 .backup → tmpfile              │    30s    │  POST /api/ha/         │
│  gzip                                   │  ──────▶  │  replica-push          │
│  POST {peer}/api/ha/replica-push        │   HTTPS   │  atomic write to       │
│  body: gzipped DB                       │           │  /data/atlas.db.replica│
│  header: HA_TOKEN                       │           │  (live DB stays put)   │
└─────────────────────────────────────────┘           └────────────────────────┘
No daemon to supervise. No key exchange. No chroot. Just the two Atlas containers talking HTTPS to each other — something they already do — plus two new endpoints. The primary's per-tick push:
import gzip
import httpx

def push_snapshot_to_peer(force=False):
if current_role() != "primary": return {"skipped": True}
version = read_data_version()
if not force and version == last_pushed_version:
return {"ok": True, "skipped": True} # PRAGMA data_version unchanged
raw = run_atomic_backup(DB_PATH) # sqlite3 .backup, atomic
body = gzip.compress(raw)
r = httpx.post(
f"{peer_base_url()}/api/ha/replica-push",
content=body,
headers={"Authorization": f"Bearer {ha_token()}",
"X-HA-Sender-Id": self_id(),
"X-HA-Data-Version": str(version)},
timeout=300, verify=False,
)
if r.status_code == 200:
record_pushed(version, len(body))
And the receiver:
@router.post("/replica-push")
async def replica_push(request: Request):
require_ha_token(request)
body = await request.body()
Path(REPLICA_DB_PATH + ".incoming").write_bytes(gzip.decompress(body))
os.replace(REPLICA_DB_PATH + ".incoming", REPLICA_DB_PATH) # atomic
record_received(request.headers["X-HA-Data-Version"])
return {"ok": True}
The full transport — push function, receive endpoint, scheduler tick, metadata reader/writer — is 80 lines. That's it. The standby holds the snapshot at /data/atlas.db.replica (a separate file from its own live /data/atlas.db); on promotion, os.replace() swaps it in.
It's full-DB-each-push, which sounds wasteful, but Atlas is 96 KB. Gzipped on the wire it's 14 KB. The PRAGMA data_version skip-if-unchanged gate means the inventory app's actual write rate (a few dozen times a day) is the only time anything moves.
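The scheduler tick is the one piece of those 80 lines not shown above. A minimal sketch of the shape it could take — start_sync_loop and SYNC_INTERVAL_S are names I'm inventing; the post only commits to "every 30 seconds":

```python
import threading
import time

SYNC_INTERVAL_S = 30  # the post's cadence; the constant's name is assumed

def start_sync_loop(push=None, interval=SYNC_INTERVAL_S):
    """Run the per-tick push on a daemon thread.

    `push` defaults to push_snapshot_to_peer from above; it's a
    parameter here only so the loop is testable in isolation."""
    if push is None:
        push = push_snapshot_to_peer  # defined earlier

    def loop():
        while True:
            try:
                push()  # no-ops unless we're primary and data_version moved
            except Exception:
                pass  # a dead peer is normal mid-failover; retry next tick
            time.sleep(interval)

    t = threading.Thread(target=loop, daemon=True, name="ha-sync")
    t.start()
    return t
```

The same loop can run unconditionally on both nodes, because push_snapshot_to_peer() checks current_role() first — a tick on the standby is a cheap no-op.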
Option 2 → Option 3 lines comparison:

  Replication transport:  ~740 lines  →  80 lines         ( ~9× smaller )
  Net codebase delta:     ~590-line ha.py  →  ~270-line ha.py + endpoint
                                                           ( ~half the module size )
The thing I built myself, in less code, does the job. The cost of the dumb thing here is invisible.
Role state: one file, one lock
The single source of truth for "am I primary or standby?" is a tiny JSON file outside the DB:
// /data/ha.json
{
"role": "primary",
"self_id": "A",
"last_promoted_at": "2026-04-25T14:02:11+00:00",
"last_demoted_at": null
}
load_state() / update_state(**kwargs) wrap it behind a threading.Lock, write atomically (tmp → rename), and are the only things allowed to touch the file. Every role transition is a single update_state(role="primary", last_promoted_at=now_iso()).
It lives outside the SQL DB on purpose. If I stored "I am primary" in a row inside atlas.db, then promotion-by-replica-restore would overwrite my role with the old primary's row, and the new primary would think it's still a standby. Files are the right shape for identity; rows are the right shape for shared state. Don't confuse them.
The standby middleware: a 503 with directions
The standby's uvicorn is up but refuses everything that isn't HA plumbing:
@app.middleware("http")
async def gateway_middleware(request, call_next):
path = request.url.path
if ha.ha_enabled() and ha.current_role() == "standby":
if path == "/api/health" or path.startswith("/api/ha/") or not path.startswith("/api/"):
return await call_next(request)
return JSONResponse(
{"detail": "This node is in standby. Direct writes to the primary.",
"primary_url": ha.peer_base_url()},
status_code=503,
)
# ... normal auth middleware continues ...
I debated letting the standby serve stale reads. The problem is the standby's live DB is empty by design — the only DB on the machine is the replica file, which we deliberately keep separate. Serving 503 with a pointer to the primary is honest. The frontend detects role == "standby" from /api/ha/status (open endpoint) up front and renders a dedicated Standby screen with a single Make me primary button.
The failover button
@router.post("/failover")
def failover(body: FailoverRequest, request: Request):
require_session_or_ha_token(request)
if ha.load_state().get("role") == "primary":
raise HTTPException(400, "already primary")
peer = ha.peer_status()
if peer and peer.get("role") == "primary" and not body.force:
raise HTTPException(409, {"error": "peer is still primary",
"hint": "set force=true if you're sure peer is dead"})
promoted = ha.promote_to_primary() # mv .replica → atlas.db, dispose engine, flip role
demoted, msg = ha.demote_peer() # POST /api/ha/demote on the old primary
return {"ok": True, "new_role": "primary", "peer_demoted": demoted, "demote_msg": msg}
The promotion itself is a four-line function:
def promote_to_primary():
from app.database import engine
engine.dispose() # drop pooled connections
os.replace(REPLICA_DB_PATH, DB_PATH) # atomic file swap
for s in ("-wal", "-shm"):
Path(DB_PATH + s).unlink(missing_ok=True)
update_state(role="primary", last_promoted_at=now_iso())
engine.dispose() doesn't have to be perfect — SQLAlchemy will create new connections lazily. Requests in flight may briefly fail; new requests after the swap see the promoted DB. For an inventory app where failover happens at the operator's pace, that's fine.
Auto-demotion
Same as before, simpler now (no Litestream child to stop):
@router.post("/demote")
def demote(request: Request):
require_ha_token(request)
ha.update_state(role="standby", last_demoted_at=ha.now_iso())
return {"ok": True, "new_role": "standby"}
After this returns, the old primary's next request hits the standby middleware and 503s, the next sync tick checks current_role() and skips the push. No process to kill, no SSH session to tear down. One file write, role flipped.
If the old primary was already dead when we failed over, the demote call fails — which is fine; the new primary is already taking traffic. When the old host comes back, its (stale) ha.json claims primary and it briefly tries to push to the new primary, which now refuses with "this node is primary; refusing to overwrite live DB". That's the split-brain mitigation: a primary can't accidentally clobber another primary's DB even by misconfiguration. Operator clicks Demote on the recovered host, and the cluster is consistent again.
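That refusal is just a guard clause at the top of the receive path. Stripped of the FastAPI plumbing, the whole thing fits in one function — a sketch (the function name and return shape are mine):

```python
import gzip
import os

def accept_snapshot(role: str, body_gz: bytes, replica_path: str) -> dict:
    """What /replica-push does. The first branch is the entire
    split-brain guard: a primary never accepts an inbound snapshot."""
    if role == "primary":
        return {"ok": False, "status": 409,
                "detail": "this node is primary; refusing to overwrite live DB"}
    incoming = replica_path + ".incoming"
    with open(incoming, "wb") as f:
        f.write(gzip.decompress(body_gz))
    os.replace(incoming, replica_path)  # atomic publish; never a partial file
    return {"ok": True, "status": 200}
```

No quorum, no fencing tokens — just "primaries refuse inbound snapshots", which is enough when the only writer of the replica file is this endpoint.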
Backups: PRAGMA data_version is all you need
Replication and backups are different products. Replication ships the latest state to a peer for failover. Backups keep historical states for "I deleted the wrong network six hours ago." Atlas does both, with the same trick at the gate:
PRAGMA data_version;
A single-row read, no table scan, bumped whenever the database file changes. One catch from SQLite's docs: the value is per-connection, and a connection never sees its own commits change it — only writes from other connections do — so the check has to read it from a connection that isn't the one writing. That's exactly the signal you want for "anything changed?"
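A self-contained demonstration of both halves of that behavior — a second connection sees the commit, the writer's own view doesn't move:

```python
import os
import sqlite3
import tempfile

db = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(db)
watcher = sqlite3.connect(db)  # separate connection, like the backup job's

def data_version(conn):
    return conn.execute("PRAGMA data_version").fetchone()[0]

watch_before = data_version(watcher)
write_before = data_version(writer)

writer.execute("CREATE TABLE t (x)")
writer.execute("INSERT INTO t VALUES (1)")
writer.commit()

assert data_version(watcher) != watch_before  # other connections see the commit
assert data_version(writer) == write_before   # the writer's own view is unchanged
```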
The backup job:
def run_backup(force=False):
    global _last_data_version
    version = _read_data_version()
    if not force and version == _last_data_version:
        return {"ok": True, "skipped": True, "reason": "no writes since last backup"}
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
    target = BACKUP_DIR / f"atlas-{ts}.db"
    subprocess.run(["sqlite3", DB_PATH, f".backup {target}"], check=True)  # fail loudly
    with open(target, "rb") as src, gzip.open(target.with_suffix(".db.gz"), "wb") as dst:
        shutil.copyfileobj(src, dst)
    target.unlink()
    _last_data_version = version  # remember what this backup covered, for the skip gate
    _prune_old_backups()
The replication push uses the same check on a tighter cadence (30 s vs 15 min), because the cost of a no-op push is one read of data_version. On a typical day for me — fewer than a hundred writes — replication and backups together produce a few kilobytes of gzipped snapshots and otherwise sit silent.
Both run on both nodes. The standby keeps its own dated backups independent of replication, so even if the primary's snapshots got corrupted somehow, the standby has a parallel history.
Auth needs to be replicated, too
auth.json (password hash + API token) used to live next to the DB and not replicate. After failover, every integration using the old primary's API token would 401 on the new primary. Bad surprise.
The fix was a small refactor: a settings table inside the DB, with optional Fernet-encrypted values, and a one-shot migration that imports auth.json into it on first boot. Now auth replicates by virtue of being inside the same SQLite file the snapshots ship.
The KEK that protects the encrypted rows is held outside the DB (env var or a small file) — but pairing carries it across in the bundle, so both nodes use the same KEK without manual coordination. After pairing, every secret the cluster cares about (admin password hash, API token, peer base URL, HA token itself) replicates the moment the next snapshot ships.
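The settings table itself isn't shown. A sketch of the pattern with an assumed schema (key, value, encrypted columns), an assumed env var name for the KEK, and the cryptography package's Fernet:

```python
import os
import sqlite3
from cryptography.fernet import Fernet

def _kek() -> Fernet:
    # The KEK lives outside the DB, so the encrypted rows inside the
    # replicated snapshot are useless without it. Env var name is assumed.
    return Fernet(os.environ["ATLAS_KEK"].encode())

def put_secret(conn: sqlite3.Connection, key: str, value: str) -> None:
    conn.execute(
        "INSERT INTO settings (key, value, encrypted) VALUES (?, ?, 1) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
        (key, _kek().encrypt(value.encode()).decode()),
    )
    conn.commit()

def get_secret(conn: sqlite3.Connection, key: str) -> str:
    row = conn.execute("SELECT value FROM settings WHERE key = ?", (key,)).fetchone()
    return _kek().decrypt(row[0].encode()).decode()
```

Because the ciphertext sits in an ordinary row, it rides along in every snapshot push with zero extra machinery — the KEK travels once, in the pairing bundle.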
What I learned
The whole exercise was a useful clarifier on how much abstraction is the right amount.
The Litestream answer was more abstract: it gives you a streaming WAL replication engine that's a wonder of careful engineering, with backends for cloud storage you might already have, retention policies, point-in-time restore. All things I might want for a different app. None of which I actually needed for an inventory tool with a 96 KB DB and twenty writes a day.
When the tool's smarts didn't translate to my environment (an embedded sshd I couldn't get its SSH client to talk to), the abstraction became a tax — every layer of the puzzle (chroot ownership, key formats, account locks, log levels) was a thing I had to learn before I could even see the actual blocker. Whereas the dumb shipper:
- I wrote it.
- When it broke, the stack trace pointed at my code.
- When it changed, it was three lines.
- It runs.
The lesson isn't "Litestream is bad" — it's a great tool, and I'd reach for it again on the right problem. The lesson is that for a small enough problem, the cost-floor of someone else's good abstraction is higher than the cost-ceiling of your own bad one. I should have noticed sooner that "few hundred writes per day, megabytes of total state" was deep in territory where the bad abstraction wins.
What it looks like from the outside
Day-to-day. Two hosts, same image. One container is primary, one is standby — set initially by env var, then persisted in /data/ha.json. They've been paired through the UI; both know each other's HTTPS URL and share an HA token. Every 30 seconds the primary checks PRAGMA data_version, and if it changed since last push, gzips the DB and POSTs ~14 KB to the standby. The standby's /data/atlas.db.replica is always within 30 seconds of the primary's live state.
A host dies. I open the standby's URL, see the Standby screen, click Make me primary. The new primary moves .replica into place, flips its role file, and the page reloads as the primary. DNS flip and done. I didn't SSH anywhere.
A planned swap. Click Demote on the current primary first — it stops pushing, flips role, starts serving 503s. Click Make me primary on the other. No force checkbox needed, because the former-primary now claims standby.
Outgrowing it. If I push past SQLite's single-writer ceiling, or want multi-region active-active, this design isn't the right one — Postgres is. Today, SQLite + ~250 lines of Python + an HTTPS POST every 30 s is the right-sized answer.
Shape
backend/app/ha.py                   ~270 lines   role state + sync timer + push/receive + pairing
backend/app/backup.py               ~130 lines   data_version gate + sqlite3 .backup + gzip
backend/app/routers/ha.py           ~170 lines   /api/ha/* endpoints (incl /replica-push)
backend/app/settings.py             ~140 lines   encrypted key/value table (auth + HA config)
backend/app/main.py                  +40 lines   gateway middleware + startup hook
frontend/src/App.tsx                 +90 lines   StandbyScreen + JoinClusterForm
frontend/src/pages/ClusterPage.tsx  ~250 lines   status, pairing, sync, advanced editor
~1100 lines total, all of it live code — the abandoned Litestream plumbing is deleted, not counted — and the actual replication transport (push + receive + scheduler) is more like 80 lines. Single container. No database migration. No second container. No quorum. No CLI to operate.
Each piece does one job. The HA token authenticates peer calls. The role file says who's primary. The middleware blocks writes on the standby. The 30-second timer ships snapshots. The promotion endpoint does its four-line dance. None of them know about the others except through tiny, explicit contracts. When something goes wrong, it's always obvious which box to open — usually mine.