Adding HA to a single-container SQLite web app, without moving to Postgres

What we needed

I have a small FastAPI app in the lab called Atlas — an inventory of VMs, networks, endpoints, secrets. One container, SQLite on a Docker volume, a React frontend baked into the same image. Boring on purpose. Fits on a 1 GB VM.

Then the usual thing happened: the more I put in it, the more it mattered that it stayed up. A lost DB stopped being "oh well, rebuild it" and started being "weekend of re-typing."

So I wanted HA, but I wanted to keep the things I liked about the app: one container per host, one SQLite file on disk that my existing tooling understands, no shared storage, no third-party service in the loop, and a footprint that still fits a 1 GB VM.

Why stay on SQLite at all

The most obvious answer to "I want HA" is "switch to Postgres," so let's get that out of the way first.

Postgres has streaming replication everyone's debugged for twenty years. True, and if I were starting today I'd probably reach for it. But Atlas already exists, and migrating buys: a schema rewrite (SQLAlchemy + SQLite ≠ Postgres — JSON columns, booleans, INTEGER PRIMARY KEY AUTOINCREMENT all differ); a multi-container deployment; a new failure mode (app is up, DB unreachable); a new thing to patch and back up. None of which I'd accept for "single-writer inventory app survives a host dying."

Multi-master systems — Galera, BDR, Patroni, CockroachDB, rqlite/dqlite — all need three nodes for quorum, not two. Two nodes is a coin flip; an odd count is required to avoid split-brain deadlock. Going from two hosts to three, each running a stateful DB, isn't a lab inventory app anymore — it's a database cluster with version-upgrade ordering and quorum monitoring. Wrong shape.

So I'm staying on SQLite. The question becomes: how do you replicate it?

The three options

┌────────────────────────────────────────────────────────────────────┐
│  Option 1   Raft-replicated SQLite       (rqlite, dqlite)          │
│  Option 2   Stream WAL via Litestream   (s3 / sftp / file backend) │
│  Option 3   Roll our own shipper         (HTTP POST the DB itself) │
└────────────────────────────────────────────────────────────────────┘

Up-front guess at how this will go: Option 1 is overkill, Option 2 is the obvious right answer, Option 3 is a fallback I won't need. Option 2 won the cold pick. Option 3 is what's actually running today.

Option 1 — Raft-replicated SQLite (rqlite, dqlite)

Wrap SQLite in a Raft consensus layer. Any node can take writes, quorum decides who's alive, failover is automatic, everyone's strongly consistent.

Ruled out for the same reason as Postgres-with-Patroni: three nodes minimum. Plus rqlite isn't actually SQLite-the-file-on-disk that the rest of my tooling speaks (sqlite3 .backup, my existing backup job, etc.) — it's a Raft cluster that uses SQLite internally. Different operational model.

For "two boxes, one button, manual failover," Raft is the wrong layer of the stack.

Option 2 — Litestream (the obvious answer that didn't fit)

Litestream is a single Go binary that tails SQLite's WAL and streams the frames to an external store. The author's careful engineering shows: retention policies, point-in-time restore, support for S3 / GCS / Azure / SFTP / plain-file backends. The replication is somebody else's code. That's the appeal — I want to own the operator flow (pairing, role state, the failover button), not the byte-shipping.

The catch is somebody else's code needs somewhere to put the bytes. My constraints rule out cloud (S3/GCS/Azure) and shared storage. Of the local backends, the realistic one is SFTP.

So the design became: bake openssh-server into the Atlas container as an SFTP-only chroot, exchange SSH client keys during pairing, have each node SFTP into the other. Single container preserved (just one with sshd inside), no shared storage, no third party.

┌─────────── host A ────────────┐              ┌─────────── host B ────────────┐
│  Atlas container              │              │  Atlas container              │
│                               │              │                               │
│   uvicorn                     │   WAL via    │   uvicorn (standby)           │
│   litestream replicate  ──────┼─────SFTP────▶│   sshd (sftp-only, chrooted)  │
│   sshd  ◀─────────────────────┼──────────────┼── /srv/replica-inbound        │
│                               │  (failover   │                               │
│                               │   reverses)  │                               │
└───────────────────────────────┘              └───────────────────────────────┘

It went sideways at every layer of the stack. Yaks I shaved, in the order I hit them:

  1. useradd --system creates the account locked. sshd rejects locked accounts even for pubkey auth. passwd -d atlas after useradd unlocks it.
  2. StrictModes yes rejects authorized_keys under /data. /data is bind-mounted from the host and owned by a non-root user, and sshd refuses to read auth files whose parent directories aren't root-owned. Set StrictModes no.
  3. Chroot ancestry rejection is a separate check — sshd refuses to chroot into a dir whose ancestors aren't root-owned. Same problem with /data/.... Moved the chroot to /srv/replica-inbound, backed by a Docker named volume so its ownership stays root.
  4. ed25519 keys can't be re-saved as PEM. ssh-keygen -p -m PEM claims success and silently does nothing on ed25519 files.
  5. The actual blocker. Manual sftp -i /data/atlas-ssh-key -P 2222 atlas@peer succeeded ("Accepted publickey for atlas" in sshd's logs). Litestream's own SSH client, with the same key against the same server, got rejected: attempted methods [none publickey], no supported methods remain. Container's sshd never even logged Litestream's connection — TCP reached the server, but auth never engaged.

I tried the obvious narrowings — RSA-PEM key instead of ed25519, password auth instead of pubkey, every log level Litestream offers, full key fingerprint cross-checks. All paths the same: manual works, Litestream doesn't. Litestream v0.5.x is a relatively recent rewrite around their LTX format; my best guess is an undocumented regression in the SFTP backend. I couldn't pin it down.

What I can count is what it cost:

Option 2 (Litestream + embedded sshd) — never authenticated successfully:

  Dockerfile               +20 lines  (openssh-server + atlas user + dirs)
  docker/sshd_atlas.conf    35 lines  (chroot, AllowUsers, AuthorizedKeysFile, ...)
  docker/entrypoint.sh      30 lines  (host-key gen, dir setup)
  ha.py                   ~590 lines  (sshd supervisor, key gen, pubkey
                                       install, register-peer pubkey exchange,
                                       sftp_host derivation, litestream YAML
                                       generation, litestream child-process mgmt)
  routers/ha.py            +60 lines  (register-peer, sftp_host fields)
  docker-compose            +5 lines  (port 2222, atlas-replica-inbound volume)
                          ─────────
                          ~740 lines  of dedicated transport plumbing.

For something whose entire job was supposed to be not having to think about it. That's when I noticed the abstraction had become a tax — every problem was at a layer I hadn't signed up to debug. So I stopped, deleted all of it, and walked.

Option 3 — POST the DB over HTTPS

Originally dismissed as "fun to write, miserable to debug." I had the second adjective wrong for this problem.

The architecture, in full:

┌─────────── host A (primary) ────────────┐         ┌──── host B (standby) ────┐
│                                         │  every  │                          │
│   sqlite3 .backup → tmpfile             │   30s   │   POST /api/ha/          │
│   gzip                                  │ ──────▶ │     replica-push         │
│   POST {peer}/api/ha/replica-push       │  HTTPS  │   atomic write to        │
│     body:   gzipped DB                  │         │   /data/atlas.db.replica │
│     header: HA_TOKEN                    │         │   (live DB stays put)    │
└─────────────────────────────────────────┘         └──────────────────────────┘

No daemon to supervise. No key exchange. No chroot. Just HTTPS between the two Atlas containers, plus two new endpoints. The primary's per-tick push:

import gzip

import httpx

def push_snapshot_to_peer(force=False):
    if current_role() != "primary":
        return {"skipped": True}
    version = read_data_version()
    if not force and version == last_pushed_version:
        return {"ok": True, "skipped": True}    # PRAGMA data_version unchanged

    raw = run_atomic_backup(DB_PATH)            # sqlite3 .backup, atomic
    body = gzip.compress(raw)

    r = httpx.post(
        f"{peer_base_url()}/api/ha/replica-push",
        content=body,
        headers={"Authorization": f"Bearer {ha_token()}",
                 "X-HA-Sender-Id": self_id(),
                 "X-HA-Data-Version": str(version)},
        timeout=300, verify=False,              # skip TLS verification for the lab-internal peer
    )
    if r.status_code == 200:
        record_pushed(version, len(body))
    return {"ok": r.status_code == 200}

And the receiver:

@router.post("/replica-push")
async def replica_push(request: Request):
    require_ha_token(request)
    if current_role() == "primary":
        # split-brain guard: a primary never accepts a peer's snapshot
        raise HTTPException(409, "this node is primary; refusing to overwrite live DB")
    body = await request.body()
    Path(REPLICA_DB_PATH + ".incoming").write_bytes(gzip.decompress(body))
    os.replace(REPLICA_DB_PATH + ".incoming", REPLICA_DB_PATH)   # atomic swap of the snapshot file
    record_received(request.headers["X-HA-Data-Version"])
    return {"ok": True}

The full transport — push function, receive endpoint, scheduler tick, metadata reader/writer — is 80 lines. That's it. The standby holds the snapshot at /data/atlas.db.replica (a separate file from its own live /data/atlas.db); on promotion, os.replace() swaps it in.
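
The scheduler tick isn't shown above. One way to run it is a plain daemon thread — a sketch under that assumption (only the 30-second cadence comes from the actual design; the threading approach is illustrative):

import threading
import time

SYNC_INTERVAL_S = 30   # the "every 30 s" cadence from the diagram above

def start_sync_loop():
    def _tick():
        while True:
            try:
                push_snapshot_to_peer()   # no-op unless we're primary and data_version moved
            except Exception:
                pass                      # a failed push just waits for the next tick
            time.sleep(SYNC_INTERVAL_S)
    threading.Thread(target=_tick, daemon=True, name="ha-sync").start()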

It's full-DB-each-push, which sounds wasteful, but Atlas is 96 KB; gzipped on the wire it's 14 KB. And the PRAGMA data_version skip-if-unchanged gate means nothing ships unless the app actually wrote something — a few dozen times a day for this inventory tool.

Option 2 → Option 3 lines comparison:

  Replication transport:    ~740 lines  →    80 lines        (~9× smaller)
  Net codebase delta:       ~590-line ha.py  →  ~270-line ha.py + endpoint
                                               (~half the module size)

The thing I built myself, in less code, does the job. The cost of the dumb thing here is invisible.

Role state: one file, one lock

The single source of truth for "am I primary or standby?" is a tiny JSON file outside the DB:

// /data/ha.json
{
  "role": "primary",
  "self_id": "A",
  "last_promoted_at": "2026-04-25T14:02:11+00:00",
  "last_demoted_at": null
}

load_state() / update_state(**kwargs) wrap it behind a threading.Lock, write atomically (tmp → rename), and are the only things allowed to touch the file. Every role transition is a single update_state(role="primary", last_promoted_at=now_iso()).
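
They're small enough to sketch in full — a minimal version, assuming defaults for a missing file that may differ from Atlas's real ones:

import json
import os
import threading
from pathlib import Path

STATE_PATH = Path("/data/ha.json")
_state_lock = threading.Lock()

def load_state() -> dict:
    with _state_lock:
        if not STATE_PATH.exists():
            return {"role": "standby", "self_id": None}   # assumed defaults
        return json.loads(STATE_PATH.read_text())

def update_state(**kwargs) -> dict:
    # Read-modify-write under one lock, then tmp → os.replace, so a crash
    # mid-write can never leave a truncated ha.json behind.
    with _state_lock:
        state = json.loads(STATE_PATH.read_text()) if STATE_PATH.exists() else {}
        state.update(kwargs)
        tmp = STATE_PATH.with_suffix(".json.tmp")
        tmp.write_text(json.dumps(state, indent=2))
        os.replace(tmp, STATE_PATH)
        return state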

It lives outside the SQL DB on purpose. If I stored "I am primary" in a row inside atlas.db, then promotion-by-replica-restore would overwrite my role with the old primary's row, and the new primary would think it's still a standby. Files are the right shape for identity; rows are the right shape for shared state. Don't confuse them.

The standby middleware: a 503 with directions

The standby's uvicorn is up but refuses everything that isn't HA plumbing:

@app.middleware("http")
async def gateway_middleware(request, call_next):
    path = request.url.path
    if ha.ha_enabled() and ha.current_role() == "standby":
        if path == "/api/health" or path.startswith("/api/ha/") or not path.startswith("/api/"):
            return await call_next(request)
        return JSONResponse(
            {"detail": "This node is in standby. Direct writes to the primary.",
             "primary_url": ha.peer_base_url()},
            status_code=503,
        )
    # ... normal auth middleware continues ...

I debated letting the standby serve stale reads. The problem is the standby's live DB is empty by design — the only DB on the machine is the replica file, which we deliberately keep separate. Serving 503 with a pointer to the primary is honest. The frontend detects role == "standby" from /api/ha/status (open endpoint) up front and renders a dedicated Standby screen with a single Make me primary button.

The failover button

@router.post("/failover")
def failover(body: FailoverRequest, request: Request):
    require_session_or_ha_token(request)
    if ha.load_state().get("role") == "primary":
        raise HTTPException(400, "already primary")

    peer = ha.peer_status()
    if peer and peer.get("role") == "primary" and not body.force:
        raise HTTPException(409, {"error": "peer is still primary",
                                  "hint": "set force=true if you're sure peer is dead"})

    promoted = ha.promote_to_primary()    # mv .replica → atlas.db, dispose engine, flip role
    demoted, msg = ha.demote_peer()       # POST /api/ha/demote on the old primary
    return {"ok": True, "new_role": "primary", "peer_demoted": demoted, "demote_msg": msg}

The promotion itself is a four-step function:

def promote_to_primary():
    from app.database import engine
    engine.dispose()                                  # drop pooled connections
    os.replace(REPLICA_DB_PATH, DB_PATH)              # atomic file swap
    for s in ("-wal", "-shm"):
        Path(DB_PATH + s).unlink(missing_ok=True)
    update_state(role="primary", last_promoted_at=now_iso())

engine.dispose() doesn't have to be perfect — SQLAlchemy will create new connections lazily. Requests in flight may briefly fail; new requests after the swap see the promoted DB. For an inventory app where failover happens at the operator's pace, that's fine.

Auto-demotion

Same as before, simpler now (no Litestream child to stop):

@router.post("/demote")
def demote(request: Request):
    require_ha_token(request)
    ha.update_state(role="standby", last_demoted_at=ha.now_iso())
    return {"ok": True, "new_role": "standby"}

After this returns, the old primary's next request hits the standby middleware and 503s, the next sync tick checks current_role() and skips the push. No process to kill, no SSH session to tear down. One file write, role flipped.
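
The caller side, ha.demote_peer(), is deliberately best-effort. A sketch, reusing the helper names from the push function (the exact error handling is an assumption):

import httpx

def demote_peer() -> tuple[bool, str]:
    try:
        r = httpx.post(
            f"{peer_base_url()}/api/ha/demote",
            headers={"Authorization": f"Bearer {ha_token()}"},
            timeout=10, verify=False,
        )
        return r.status_code == 200, r.text
    except httpx.HTTPError as exc:
        # Peer unreachable — probably dead; promotion has already happened locally.
        return False, str(exc)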

If the old primary was already dead when we failed over, the demote call fails — which is fine; the new primary is already taking traffic. When the old host comes back, its (stale) ha.json claims primary and it briefly tries to push to the new primary, which now refuses with "this node is primary; refusing to overwrite live DB". That's the split-brain mitigation: a primary can't accidentally clobber another primary's DB even by misconfiguration. Operator clicks Demote on the recovered host, and the cluster is consistent again.

Backups: PRAGMA data_version is all you need

Replication and backups are different products. Replication ships the latest state to a peer for failover. Backups keep historical states for "I deleted the wrong network six hours ago." Atlas does both, with the same trick at the gate:

PRAGMA data_version;

A single-row read, no table scan. From a given connection's point of view, the value changes whenever another connection commits a write to the database file — which is exactly what the app's writes look like to a separate checker connection. That's the signal you want for "anything changed?"
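
The reader behind the gate fits in a few lines. A sketch — the important detail is a dedicated long-lived connection, since data_version only moves relative to the connection doing the asking:

import sqlite3

# One connection kept open just for the check; the app's own (SQLAlchemy)
# connections are the "other connections" whose commits bump the value.
_dv_conn = sqlite3.connect(DB_PATH, check_same_thread=False)

def _read_data_version() -> int:
    return _dv_conn.execute("PRAGMA data_version").fetchone()[0]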

The backup job:

import gzip
import shutil
import subprocess
from datetime import datetime, timezone

def run_backup(force=False):
    global _last_data_version
    version = _read_data_version()
    if not force and version == _last_data_version:
        return {"ok": True, "skipped": True, "reason": "no writes since last backup"}

    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
    target = BACKUP_DIR / f"atlas-{ts}.db"
    subprocess.run(["sqlite3", DB_PATH, f".backup {target}"], check=True)

    with open(target, "rb") as src, gzip.open(target.with_suffix(".db.gz"), "wb") as dst:
        shutil.copyfileobj(src, dst)
    target.unlink()
    _prune_old_backups()
    _last_data_version = version                 # remember, so the next tick can skip
    return {"ok": True, "skipped": False}

The replication push uses the same check on a tighter cadence (30 s vs 15 min), because the cost of a no-op tick is one read of data_version. On a typical day for me — fewer than a hundred writes — replication and backups together produce a few kilobytes of gzipped snapshots and otherwise sit silent.

Both run on both nodes. The standby keeps its own dated backups independent of replication, so even if the primary's snapshots got corrupted somehow, the standby has a parallel history.

Auth needs to be replicated, too

auth.json (password hash + API token) used to live next to the DB and not replicate. After failover, every integration using the old primary's API token would 401 on the new primary. Bad surprise.

The fix was a small refactor: a settings table inside the DB, with optional Fernet-encrypted values, and a one-shot migration that imports auth.json into it on first boot. Now auth replicates by virtue of being inside the same SQLite file the snapshots ship.

The KEK that protects the encrypted rows is held outside the DB (env var or a small file) — but pairing carries it across in the bundle, so both nodes use the same KEK without manual coordination. After pairing, every secret the cluster cares about (admin password hash, API token, peer base URL, HA token itself) replicates the moment the next snapshot ships.
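
A sketch of the settings helpers — table name, column names, and load_kek() are illustrative assumptions, not Atlas's actual schema:

from typing import Optional

from cryptography.fernet import Fernet
from sqlalchemy import text

def set_setting(db, key: str, value: str, secret: bool = False) -> None:
    # Secrets are Fernet-encrypted with the KEK before they hit the table.
    stored = Fernet(load_kek()).encrypt(value.encode()).decode() if secret else value
    db.execute(
        text("INSERT INTO settings (key, value, encrypted) VALUES (:k, :v, :e) "
             "ON CONFLICT(key) DO UPDATE SET value = :v, encrypted = :e"),
        {"k": key, "v": stored, "e": secret},
    )
    db.commit()

def get_setting(db, key: str) -> Optional[str]:
    row = db.execute(
        text("SELECT value, encrypted FROM settings WHERE key = :k"), {"k": key}
    ).first()
    if row is None:
        return None
    value, encrypted = row
    return Fernet(load_kek()).decrypt(value.encode()).decode() if encrypted else value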

What I learned

The whole exercise was a useful clarifier on how much abstraction is the right amount.

The Litestream answer was more abstract: it gives you a streaming WAL replication engine that's a wonder of careful engineering, with backends for cloud storage you might already have, retention policies, point-in-time restore. All things I might want for a different app. None of which I actually needed for an inventory tool with a 96 KB DB and twenty writes a day.

When the tool's smarts didn't translate to my environment (an embedded sshd I couldn't get its SSH client to talk to), the abstraction became a tax — every layer of the puzzle (chroot ownership, key formats, account locks, log levels) was a thing I had to learn before I could even see the actual blocker. Whereas every layer of the dumb shipper is one I already own: when it breaks, the bug is in 80 lines of Python I wrote, not in someone else's SSH client.

The lesson isn't "Litestream is bad" — it's a great tool, and I'd reach for it again on the right problem. The lesson is that for a small enough problem, the cost-floor of someone else's good abstraction is higher than the cost-ceiling of your own bad one. I should have noticed sooner that "few hundred writes per day, megabytes of total state" was deep in territory where the bad abstraction wins.

What it looks like from the outside

Day-to-day. Two hosts, same image. One container is primary, one is standby — set initially by env var, then persisted in /data/ha.json. They've been paired through the UI; both know each other's HTTPS URL and share an HA token. Every 30 seconds the primary checks PRAGMA data_version, and if it changed since last push, gzips the DB and POSTs ~14 KB to the standby. The standby's /data/atlas.db.replica is always within 30 seconds of the primary's live state.

A host dies. I open the standby's URL, see the Standby screen, click Make me primary. The new primary moves .replica into place, flips its role file, and the page reloads as the primary. DNS flip and done. I didn't SSH anywhere.

A planned swap. Click Demote on the current primary first — it stops pushing, flips role, starts serving 503s. Click Make me primary on the other. No force checkbox needed, because the former-primary now claims standby.

A slow outgrowing could still force a migration. If I push past SQLite's single-writer ceiling, or want multi-region active-active, this design isn't the right one — Postgres is. Today, SQLite + ~250 lines of Python + an HTTPS POST every 30 s is the right-sized answer.

Shape

backend/app/ha.py          ~270 lines  role state + sync timer + push/receive + pairing
backend/app/backup.py      ~130 lines  data_version gate + sqlite3 .backup + gzip
backend/app/routers/ha.py  ~170 lines  /api/ha/* endpoints (incl /replica-push)
backend/app/settings.py    ~140 lines  encrypted key/value table (auth + HA config)
backend/app/main.py         +40 lines  gateway middleware + startup hook
frontend/src/App.tsx        +90 lines  StandbyScreen + JoinClusterForm
frontend/src/pages/ClusterPage.tsx  ~250 lines  status, pairing, sync, advanced editor

~1100 lines total, not counting the abandoned Litestream plumbing (all of it deleted) — and the actual replication transport (push + receive + scheduler) is more like 80 lines. Single container. No database migration. No second container. No quorum. No CLI to operate.

Each piece does one job. The HA token authenticates peer calls. The role file says who's primary. The middleware blocks writes on the standby. The 30-second timer ships snapshots. The promotion endpoint does its four-step dance. None of them know about the others except through tiny, explicit contracts. When something goes wrong, it's always obvious which box to open — usually mine.