
Infra · 11 min read

GitLab Geo as an HA story: the seams that show

I spent the last few days standing up GitLab Geo as a two-box high-availability pair: streaming-replicated DB, paired sites on two hypervisors, a floating DNS name in front so clients always hit "the active one" without knowing which side that is. It mostly works. But a handful of design choices, sensible for DR but awkward for HA, kept tripping me up. By the third or fourth I started seeing the shape of the underlying gap.

This post catalogs that gap. It's part rant, part roadmap-suggestion, part pragmatic-options. I'm writing it because every team that picks up Geo for HA hits these same walls in roughly the same order, and the docs don't really frame why.


The framing problem

GitLab Geo was built for disaster recovery. The assumed scenario is:

Region A is on fire. Region A is gone. You promote Region B. You tell users the URL has changed and we'll figure out the rest later.

In that scenario, almost every Geo behaviour makes sense. A one-way promotion. A specific URL per site that doesn't pretend to be anything else. A multi-step setup that requires copying files between hosts. Acceptable downtime during cutover. Manual reconfiguration because, hey, it's a disaster — you weren't expecting to be doing this and you'll have a checklist.

The mental shift to HA is different:

Both regions are healthy. I want to do maintenance on Region A. I'll fail traffic over to Region B for ninety seconds, do the work, and fail it back. Users shouldn't notice. Clone URLs shouldn't change. Bookmarks shouldn't break.

That's a load balancer pattern wearing the clothes of a replication pattern, and GitLab Geo wasn't designed to be that. So when you try to use it that way, several rough edges expose themselves at once.

Here are the ones I tripped over.


1. Per-site URL identity

This is the biggest one. Geo identifies sites by URL. GeoNode.url has a uniqueness constraint. You cannot register two Geo nodes with the same URL — the second one fails validation.

That means:

So a user who clicks "Clone with HTTPS" gets https://gitlab2.example.com/group/project.git even when they got there via gitlab.example.com. After a failover, that remote URL keeps working if their DNS resolver hasn't pinned the old IP, but a fresh clone the next day uses the new active site's specific name, and now their two clones point at different remotes for the same project.

The Geo design didn't anticipate this because, in the DR scenario, telling users "the URL is different now" is acceptable. In an HA scenario, it isn't.

How GitLab could fix it: introduce a separate canonical_url (or user_facing_url) setting, decoupled from each node's external_url and from GeoNode.url. The canonical URL would be used everywhere user-visible: clone-string display, email link generation, OAuth callbacks, API responses' web_url fields. The internal Geo identity URL stays specific to each node. This pattern is well-trodden in other distributed-systems software (Kafka's advertised.listeners, MinIO's MINIO_DOMAIN, every Postgres replication setup with a vIP). GitLab just hasn't done it yet.

What you can do today: tell users to clone via the canonical name and add a ~/.gitconfig rewrite rule:

[url "https://gitlab.example.com/"]
  insteadOf = https://gitlab1.example.com/
  insteadOf = https://gitlab2.example.com/

Anything they copy from the UI gets rewritten before git sees it. Imperfect, but cheap.
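If you'd rather push that rule out with a script than ask people to hand-edit their config, something like the following sketch should do it. It assumes the example hostnames from this post; add_geo_rewrites is a name I made up for illustration, not anything GitLab or git ships.

```shell
#!/bin/sh
# Sketch only: add_geo_rewrites is a hypothetical helper, and the hostnames
# are the example names used in this post.
add_geo_rewrites() {
  local config_file="$1"   # e.g. "$HOME/.gitconfig"
  local key="url.https://gitlab.example.com/.insteadOf"
  local site
  for site in "https://gitlab1.example.com/" "https://gitlab2.example.com/"; do
    # Skip values that are already present so re-runs don't duplicate them.
    if git config --file "$config_file" --get-all "$key" 2>/dev/null | grep -qxF "$site"; then
      continue
    fi
    # --add appends a second insteadOf value instead of overwriting the first.
    git config --file "$config_file" --add "$key" "$site"
  done
}
```

Call it as add_geo_rewrites "$HOME/.gitconfig". To check the result without touching the network, git ls-remote --get-url https://gitlab2.example.com/group/project.git should print the gitlab.example.com form, because that option expands insteadOf rules and exits before contacting the remote.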


2. gitlab-secrets.json is a manual prerequisite

GitLab encrypts certain sensitive columns of its database (CI variables, runner tokens, integration credentials, 2FA seeds) using application-level keys that live in /etc/gitlab/gitlab-secrets.json — a file, not a database row. Postgres streaming replication doesn't see files. So unless you manually copy the secrets file from the primary to the secondary before the secondary's first gitlab-ctl reconfigure, the two sites end up with different encryption keys.

The setup technically works in the wrong-key state: most of the DB is plaintext, so most operations look fine. But anything that touches an encrypted column on the wrong side fails opaquely. After a failover, you discover this when the user's API token mysteriously stops working, even though the row is right there in the database.

I covered this in detail in a separate post. The summary: the file isn't part of the database, the database is what gets replicated, and nobody warns you about the disconnect until something specific breaks.

How GitLab could fix it: ship a gitlab-ctl geo-bootstrap-secondary command that, given the primary's hostname and credentials, does the whole secondary bootstrap — including copying gitlab-secrets.json over SSH before the first reconfigure. The current replicate-geo-database script does part of the work but stops short of the file-copy step.

What you can do today: make the file-copy step part of your build automation. Don't trust yourself to remember.
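A drift check is cheap to add alongside the copy step. This is a rough sketch, assuming sha256sum is available on both hosts and that you have SSH plus sudo on each; file_digest and secrets_in_sync are hypothetical helper names, and the hostnames are the examples from this post.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.
file_digest() {
  # Print only the hex digest, dropping the filename column.
  sha256sum "$1" | cut -d' ' -f1
}

secrets_in_sync() {
  # True only when both digests are non-empty and identical.
  [ -n "$1" ] && [ -n "$2" ] && [ "$1" = "$2" ]
}

# Example wiring, run as root on gitlab1 (assumes SSH and sudo on the peer):
#   mine="$(file_digest /etc/gitlab/gitlab-secrets.json)"
#   peer="$(ssh gitlab2.example.com 'sudo sha256sum /etc/gitlab/gitlab-secrets.json' | cut -d' ' -f1)"
#   secrets_in_sync "$mine" "$peer" || echo "gitlab-secrets.json differs between sites" >&2
```

Comparing digests rather than pulling the file itself means the secrets never leave either host just to be checked.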


3. Exact version matching, but no enforcement

Geo stops working reliably when the two sites disagree on version, even at patch level (18.9.1 vs 18.9.2 is enough to break things subtly). The standard pattern is to apt-mark hold gitlab-ee on both sides and upgrade them together.

But none of that is enforced by Geo itself. Nothing in the setup process tells you to set the hold. The first time you find out is when an unattended upgrade ticks one side forward by a patch and replication mysteriously starts having problems.

There's also a small papercut around availability of specific patch versions: GitLab's apt repo tends to carry only the most recent few patches. If your primary was on 18.9.1 and you're building a secondary three weeks later, 18.9.1 may already be gone from the repo. You can still fetch the .deb directly from pkgcloud's storage pool — but you have to know to.

How GitLab could fix it: detect version drift between Geo sites and surface it loudly. The GeoNode model already knows the URL of every other node; it could ask each peer for its version and warn in the admin Geo dashboard when they diverge. Also: have the omnibus package set the apt hold automatically when Geo is enabled, with an explicit unhold during upgrade workflows.

What you can do today: hold gitlab-ee on both sides immediately after install, and set a calendar reminder for coordinated upgrades. If the patch version you need disappears from the apt repo, the direct-download URL pattern is https://packages.gitlab.com/gitlab/gitlab-ee/packages/ubuntu/<dist>/gitlab-ee_<VER>-ee.0_amd64.deb/download.deb.
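Both halves of that advice fit in a few lines of shell. The sketch below assumes Debian/Ubuntu hosts with dpkg; deb_url and versions_match are hypothetical helper names, and the URL is just the pattern above with the placeholders filled in.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.
deb_url() {
  # $1 = Ubuntu codename (e.g. jammy), $2 = GitLab version (e.g. 18.9.1)
  printf 'https://packages.gitlab.com/gitlab/gitlab-ee/packages/ubuntu/%s/gitlab-ee_%s-ee.0_amd64.deb/download.deb\n' "$1" "$2"
}

versions_match() {
  # True only when both version strings are non-empty and identical.
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example wiring (assumes SSH access to the peer site):
#   sudo apt-mark hold gitlab-ee            # on both sides, right after install
#   mine="$(dpkg-query -W -f='${Version}' gitlab-ee)"
#   peer="$(ssh gitlab2.example.com "dpkg-query -W -f='\${Version}' gitlab-ee")"
#   versions_match "$mine" "$peer" || echo "Geo sites differ: $mine vs $peer" >&2
```

Run from cron on either side, the comparison gives you the loud warning Geo doesn't provide; apt-mark showhold confirms the hold is still in place.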


4. gitlab-ctl geo promote leaves config drift

This one annoyed me. sudo gitlab-ctl geo promote --force does the runtime promotion: it takes PostgreSQL out of recovery mode, updates the GeoNode table, and restarts the right services. It does not update /etc/gitlab/gitlab.rb, which still declares:

roles ['geo_secondary_role']

…and various secondary-specific PostgreSQL settings. The next gitlab-ctl reconfigure will try to reassert those settings on top of a now-primary PostgreSQL, and behaviour gets confusing fast.

You have to manually edit gitlab.rb after the promote to declare the node's new role. There's no warning about this. The promote command exits 0 and walks away.

How GitLab could fix it: have geo promote write the new role into gitlab.rb directly, and have a corresponding geo demote (or geo set-secondary <primary-url>) command that does the reverse. Reconfigure should always be idempotent against the current file state.

What you can do today: treat the role flip in gitlab.rb as a mandatory follow-up to every promote. Build it into the same script you use to do the promotion.
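The role flip itself is a one-line substitution. Here is a minimal sketch, assuming the roles line appears in gitlab.rb exactly as shown above and that the target role is geo_primary_role; flip_role_to_primary is a hypothetical helper name. It deliberately leaves the other secondary-specific PostgreSQL settings alone, so those still need a manual pass.

```shell
#!/bin/sh
# Sketch only: flip_role_to_primary is a made-up helper name. It rewrites
# just the roles line; other secondary-only settings are not touched.
flip_role_to_primary() {
  local rb="${1:-/etc/gitlab/gitlab.rb}"
  # Keep a copy of the pre-promote file so the change is easy to diff or undo.
  cp -p "$rb" "$rb.pre-promote" || return 1
  # GNU sed in-place edit; \[ matches a literal opening bracket.
  sed -i "s/^roles \['geo_secondary_role']/roles ['geo_primary_role']/" "$rb"
  # Fail loudly if the file does not end up declaring the primary role.
  grep -q "^roles \['geo_primary_role']" "$rb"
}
```

In a promotion script this would run as root immediately after the promote and before the next gitlab-ctl reconfigure, so the file and the runtime state agree again.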


5. replicate-geo-database is the most fragile thing in the toolchain

This script bootstraps a secondary by doing pg_basebackup from the primary. It does a few useful things — stops services, runs the basebackup, sets up the standby config. It also has a number of quirks that, in my experience, kept biting:

For the reverse-direction replication (turning an old primary into a new secondary), I ended up bypassing the script entirely and running pg_basebackup by hand:

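# Run on the node that is becoming the secondary, with PostgreSQL stopped
# there and the target data directory empty (pg_basebackup refuses to write
# into a non-empty one). The slot named by --slot must already exist on the
# new primary.
sudo -u gitlab-psql PGPASSWORD="$PASS" /opt/gitlab/embedded/bin/pg_basebackup \
  -h <new-primary> -U gitlab_replicator \
  -D /var/opt/gitlab/postgresql/data \
  -X stream --slot=gitlab1 \
  -d "sslmode=disable" \
  --write-recovery-conf --progress --verbose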

This was dramatically more debuggable than the wrapper.

How GitLab could fix it: make the wrapper actually pass through PGPASSWORD/PGPASSFILE reliably, fix the misleading exit-code aggregation so failures upstream don't get reported as failures of the basebackup, and document --skip-backup more prominently. Honestly the cleanest fix is probably to rewrite the wrapper as a thin shell around pg_basebackup with explicit pre/post steps, rather than the monolithic Ruby thing it is now.

What you can do today: when in doubt, run pg_basebackup manually and let --write-recovery-conf handle the recovery config. Then run gitlab-ctl reconfigure to wire up the Geo-specific tracking DB and logcursor. It's three commands, all documentable, all transparent.


6. Maintenance mode is a database row

Before doing the failover, I set maintenance mode on the primary to drain writes. That setting lives in the application_settings table — which is part of the database that streams to the secondary. So when the secondary became the new primary, it cheerfully inherited the maintenance flag and showed every visiting user a "read-only, Geo failover in progress" banner.

Disabling it required clearing the database row directly (the Rails update path failed, possibly because of a Geo-aware code path triggering mid-Rails-boot), then bouncing puma to flush the in-memory cache.

This isn't a bug per se — it's the consistent behaviour of "the whole DB replicates" — but it's a foot-gun that nobody warns you about. A textbook "drain primary, promote secondary, accept traffic" runbook hits this on the first run.

How GitLab could fix it: when gitlab-ctl geo promote runs, have it explicitly clear maintenance_mode on the newly-promoted primary as part of the post-promote steps. Alternatively, make maintenance mode site-local rather than DB-row-based, but that's a bigger change.

What you can do today: disable maintenance mode on the new primary right after the promote. Or disable it on the old primary just before promoting, while that change can still replicate to the secondary.
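For the post-promote cleanup, this is roughly what I'd put in the runbook script. Treat it as a sketch under assumptions: it assumes the flag is a boolean maintenance_mode column on application_settings (check that on your version before running it), and run and clear_maintenance_mode are hypothetical helper names. With DRY_RUN=1 it only prints what it would do.

```shell
#!/bin/sh
# Sketch only: helper names are made up, and the column name is an
# assumption to verify against your GitLab version.
run() {
  # Print the command instead of executing it when DRY_RUN=1.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

clear_maintenance_mode() {
  # Clear the replicated flag directly in the database on the new primary.
  run sudo gitlab-psql -c "UPDATE application_settings SET maintenance_mode = false;"
  # Bounce puma so the cached application settings are re-read.
  run sudo gitlab-ctl restart puma
}
```

Running it once with DRY_RUN=1 on a lab pair is a cheap way to review the exact commands before they ever touch a real primary.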


7. The "Geo Proxy" answer is a different question

GitLab has a feature called the Geo Proxy (matured in 15.x) that solves a related but distinct problem. The model is:

This is genuinely elegant for "always-on multi-region read replicas." It does not address "one floating canonical URL across a two-site HA pair." The Geo Proxy assumes you've already accepted that users use site-specific URLs; it just makes them more useful by routing writes intelligently.

For the HA-with-canonical-URL pattern, the Geo Proxy is half an answer. You could combine it with the canonical-URL idea (route gitlab.example.com to either site, both proxy writes to current primary), but you still hit the GeoNode.url uniqueness constraint and the external_url-leaking-into-emails problem.

How GitLab could mature it: extend the Geo Proxy with first-class support for a virtual front-end name. Have nginx on each site accept the floating name as an additional server_name, have Rails recognise it as a valid canonical URL distinct from any node's identity, and have the proxy use the active primary's URL only for backend routing — not for anything client-visible.


8. Per-site secrets, manual node registration

Smaller things, listed for completeness:


What I'd build, if I were GitLab

If GitLab handed me a roadmap budget tomorrow, this is roughly the order of operations I'd argue for:

  1. First-class canonical URL setting, decoupled from per-node identity. Used for clone strings, email links, OAuth callbacks, web_url in API responses, runner registration URLs. Backward-compatible default: same as external_url.

  2. gitlab-ctl geo-bootstrap-secondary <primary-url> that does the full setup in one command — pull gitlab-secrets.json, run pg_basebackup, write the standby config, configure gitlab.rb, reconfigure. With sane defaults, no five-step prerequisite list.

  3. gitlab-ctl geo failover as a single command that drains the current primary, promotes a named secondary, updates gitlab.rb on both sides, clears stale maintenance flags, and verifies replication is reversed. Today you script this yourself; it should be in the box.

  4. Version-drift detection and warning in the admin UI, automatic apt-pinning on Geo-enabled hosts.

  5. More transparent wrapper scripts — fewer Ruby monoliths that swallow errors and emit misleading banners, more thin shell wrappers around pg_basebackup and gitlab-psql with predictable env-var handling.

  6. First-class Geo Proxy + canonical URL composition — accept that the modern shape of "give users one URL and let GitLab figure out where the writes go" is the right primitive, and build everything else (failover, identity, replication) on top of it.

None of these are far-fetched. Most exist as long-tail issues in the GitLab tracker, with varying levels of discussion and zero clear ownership. The pieces are there; what's missing is someone treating "Geo as HA" as a first-class use case rather than a clever repurposing of a DR feature.


What you can do today

Pragmatic recommendations, in priority order:

  1. Pick your URL strategy up front. If users seeing a per-site name in clone URLs is acceptable, stay on external_url = gitlabN.example.com per site. If it isn't, build the failover-also-changes-external_url automation and accept ~5 minutes of reconfigure during cutover.

  2. Treat gitlab-secrets.json as a first-class artifact. Back it up. Drift-check it. Copy it during initial Geo setup. Re-copy it after upgrades that touch encrypted features. The file is small and the consequences of mismatch are big.

  3. apt-mark hold gitlab-ee on both sides immediately after install. Unhold only for coordinated upgrades.

  4. Automate the failover sequence end-to-end. A single script that handles maintenance mode, promote, gitlab.rb flip on both sides, GeoNode registry updates, DNS flip, and maintenance_mode cleanup. Test it on a lab pair before you need it in anger.

  5. Bypass replicate-geo-database if it gives you trouble. Manual pg_basebackup --write-recovery-conf plus an explicit replication slot is more debuggable and not really more code.

  6. Ship users a ~/.gitconfig rewrite rule so the per-site URL leak is invisible to them:

    [url "https://gitlab.example.com/"]
      insteadOf = https://gitlab1.example.com/
      insteadOf = https://gitlab2.example.com/
    
  7. Decide whether you actually need Geo. For two hosts in the same data center, a Patroni-managed PostgreSQL with a separate Gitaly cluster and a thin HAProxy in front might be a cleaner fit than Geo. Geo's strength is cross-region replication with the asynchronous Geo job framework; if you don't need that asymmetric topology, you may be paying for complexity you don't use.


Closing thought

GitLab Geo is genuinely good software for the disaster-recovery problem it was built to solve. It's also the most plausible "officially supported HA replication" story GitLab ships, which is why everyone reaches for it for HA — and then runs face-first into the seams between the two patterns.

The seams are fixable, mostly with relatively scoped product changes. None of them are deeply architectural; they're mostly about treating "Geo as HA" as a first-class story instead of an emergent one. Until that happens, expect a few weeks of pattern-discovery the first time you build one of these. Write down what you learn — somebody on your team is going to need it during their next promotion.