
Infra · 12 min read

Planned GitLab Geo failover: ten things the docs glossed over

I had a GitLab EE primary on one hypervisor and wanted a hot, streaming-replicated secondary on a second hypervisor — a Geo pair, so that if either host died I could promote the survivor. Then, as a trial, I planned to fail over to the secondary and convert the original primary into the new secondary. End state: the same hardware, just with replication flowing the other direction.

The setup is canonical GitLab Geo: PostgreSQL streaming replication for the database, plus Geo's own job/log machinery for repositories, LFS, uploads, and registry artifacts. Both sides on the same EE patch version (Geo is strict about that). Floating DNS name in front so a CNAME flip is the only client-side change at cutover.

The end state works. The path was full of small detours. Here are the ones worth remembering.


1. GitLab Geo requires exact version match — apt-mark hold both sides

Geo refuses to run when primary and secondary disagree on patch version. Not just minor — patch. So 18.9.1-ee.0 on one side against 18.9.2-ee.0 on the other, and the secondary won't catch up.

Two consequences:

  1. Pin both sides immediately after the install:

    sudo apt-mark hold gitlab-ee
    

    Otherwise the next routine apt upgrade drifts you out of sync silently and your replicas stop.

  2. The pinned version might already be gone from the current apt index. I needed 18.9.1-ee.0 for the secondary, but the repo only had 18.9.2 through 18.9.7. The .deb is still in pkgcloud's storage backend though, and you can pull it directly:

    https://packages.gitlab.com/gitlab/gitlab-ee/packages/ubuntu/<dist>/gitlab-ee_<VER>-ee.0_amd64.deb/download.deb
    

    dpkg -i works on it, then apt-mark hold to stop apt from "fixing" your install.
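
    A sketch of what that looked like for me, with the dist codename and version as placeholders you'd substitute (jammy/18.9.1 here are illustrative):

    DIST=jammy VER=18.9.1
    curl -fL -o "/tmp/gitlab-ee_${VER}-ee.0_amd64.deb" \
      "https://packages.gitlab.com/gitlab/gitlab-ee/packages/ubuntu/${DIST}/gitlab-ee_${VER}-ee.0_amd64.deb/download.deb"
    sudo dpkg -i "/tmp/gitlab-ee_${VER}-ee.0_amd64.deb"
    sudo apt-mark hold gitlab-ee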

TL;DR: hold the package on both sides, and remember pkgcloud's pool URLs when you need an older patch.


2. The external_url lock-in I avoided by issuing multi-SAN certs first

The long-term goal is a floating canonical name (gitlab.example.com) that CNAMEs to whichever site is currently primary (gitlab1.example.com or gitlab2.example.com). For the TLS not to break the instant a CNAME flips, both sides need a cert that already covers both names.

So when I issued the secondary's cert, I asked for both SANs up front:

acme.sh --issue --dns dns_dgon \
  -d gitlab2.example.com -d gitlab.example.com \
  --keylength ec-256 \
  --server letsencrypt \
  --dnssleep 60

acme.sh handles DNS-01 validation against both names automatically. Critically, the canonical name being a CNAME doesn't break this — the CA looks for _acme-challenge.gitlab.example.com directly, which lives at a sibling name, not at the CNAME target.

When I later promoted the secondary and CNAMEd gitlab.example.com over to it, no TLS warnings. Same trick applied in reverse: I re-issued the original primary's cert with multi-SAN before flipping its external_url.
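
Before flipping anything, it's worth confirming the live cert really carries both SANs. A quick check against the secondary (the -ext flag needs OpenSSL 1.1.1+; the output lines are illustrative):

echo | openssl s_client -connect gitlab2.example.com:443 -servername gitlab2.example.com 2>/dev/null \
  | openssl x509 -noout -ext subjectAltName
# X509v3 Subject Alternative Name:
#     DNS:gitlab2.example.com, DNS:gitlab.example.com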

TL;DR: if you have a CNAME flip in your future, issue both names on the cert now, not later.


3. postgresql['listen_address'] = '0.0.0.0' silently flips Rails to TCP

To allow the secondary to connect for streaming replication, the primary needs PostgreSQL listening on a public interface, not just the Unix socket. So I set:

postgresql['listen_address'] = '0.0.0.0'
postgresql['md5_auth_cidr_addresses'] = ['<secondary-ip>/32']

And reconfigured. Reconfigure succeeded. Then Puma went into a crash-loop, returning 502s. Logs:

PG::ConnectionBad: connection to server at "0.0.0.0", port 5432 failed:
FATAL: no pg_hba.conf entry for host "127.0.0.1", user "gitlab", database "gitlabhq_production"

The change to listen_address causes Omnibus to rewrite Rails' database.yml to connect via TCP at the listen address instead of via the Unix socket. Meanwhile the pg_hba.conf entries generated from md5_auth_cidr_addresses covered the secondary's IP only; loopback (127.0.0.1) had no host-level entry. Rails couldn't connect to its own database.

Fix:

postgresql['trust_auth_cidr_addresses'] = ['127.0.0.1/32']

trust on loopback is safe for a single-host setup (a remote attacker can't impersonate 127.0.0.1; only a local privileged user can, and they already have everything).
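
Once the loopback entry is in and the node is reconfigured, two quick checks that Rails is happy again; the path and inline Ruby below are just how I'd spot-check an Omnibus install, nothing Geo-specific:

# What Omnibus told Rails to connect to (host/port appear once listen_address is set):
sudo grep -E 'host:|port:' /var/opt/gitlab/gitlab-rails/etc/database.yml

# Prove the app can actually log in over that path:
sudo gitlab-rails runner 'puts ActiveRecord::Base.connection.select_value("SELECT 1")'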

TL;DR: any time you set listen_address, add a matching loopback auth entry, or your own Rails app stops being able to log in.


4. sql_replication_password doesn't apply to an existing role

I set this in gitlab.rb on the primary:

postgresql['sql_replication_password'] = 'md5<hash>'

…and reconfigured. From the secondary's box, pg_basebackup against the primary still returned:

FATAL: password authentication failed for user "gitlab_replicator"

It turns out the directive only sets the password at role creation. If gitlab_replicator already exists (Omnibus creates it implicitly the moment you enable Geo), reconfigure won't change its stored credential.

You have to set it directly:

ALTER USER gitlab_replicator WITH PASSWORD '<plaintext>';

PostgreSQL stores the new hash under whatever password_encryption is set to (modern default: SCRAM-SHA-256). pg_hba.conf with method md5 accepts both md5 and scram-sha-256 stored hashes transparently, so you don't need to also flip the method.
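
A sketch of running the ALTER through the embedded psql and then confirming what PostgreSQL actually stored (pg_authid needs superuser, which gitlab-psql connects as):

sudo gitlab-psql -d postgres -c "ALTER USER gitlab_replicator WITH PASSWORD '<plaintext>';"
sudo gitlab-psql -d postgres -t -c \
  "SELECT left(rolpassword, 14) FROM pg_authid WHERE rolname = 'gitlab_replicator';"
# 'SCRAM-SHA-256$' means a SCRAM hash was stored; 'md5' plus hex means an md5 hash.
# Either one authenticates through an md5 pg_hba line, per the note above.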

TL;DR: after every change to sql_replication_password, run ALTER USER directly to make sure PG actually has the password you think it has.


5. gitlab-ctl replicate-geo-database ignores PGPASSWORD

The tool that bootstraps a secondary by running pg_basebackup from the primary refuses to read the standard PG environment variables. I tried:

PGPASSWORD="$PASS" sudo gitlab-ctl replicate-geo-database --slot-name=secondary --host=primary

The script's first action is to call ask_pass, which does:

return STDIN.gets.chomp unless STDIN.tty?

…so when stdin is a non-tty (running under SSH non-interactively), it tries to read from stdin and crashes on nil.chomp if you didn't pipe anything in. The fix is to pipe the password through stdin:

echo "$PASS" | sudo gitlab-ctl replicate-geo-database \
  --slot-name=secondary \
  --host=primary \
  --sslmode=disable \
  --no-wait --force

That feeds the password into ask_pass cleanly. But this only works for the first prompt the script makes; subsequent calls to internal helpers (which we'll hit in section 7) need their own auth path.

TL;DR: pipe the password via stdin. PGPASSWORD will not be read.


6. The "replication failed" message is often a lie — check standby.signal

Twice in a row, replicate-geo-database printed an angry banner:

*** Initial replication failed! ***
Replication tool returned with a non zero exit status!

Failed to execute: gitlab-ctl start

Both times, the actual pg_basebackup had succeeded. The failure was in the post-replicate service restart phase — gitlab-kas or sidekiq timed out on its way back up, and the wrapper reported the last failed step as if the whole thing had failed.

The cheap way to tell whether replication actually worked:

ls /var/opt/gitlab/postgresql/data/standby.signal

If that file exists, PostgreSQL is in recovery mode. Confirm with:

SELECT pg_is_in_recovery();
-- t  ✅ secondary; streaming
-- f  ❌ no replication happened

Don't trust the wrapper's exit code. Trust the on-disk state.
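
The two checks as one copy-pasteable guard; paths are the Omnibus defaults, and the second command only answers once PostgreSQL is running again:

test -f /var/opt/gitlab/postgresql/data/standby.signal && echo "standby.signal present"
sudo gitlab-psql -d postgres -t -c "SELECT pg_is_in_recovery();"   # expect 't' on a working secondary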

TL;DR: after any failure from replicate-geo-database, check for standby.signal and pg_is_in_recovery() before assuming you have to start over.


7. gitlab-psql wrapper drops PGPASSFILE through its sudo

When I tried to set up backward replication — making the old primary into a secondary of the new primary — replicate-geo-database failed earlier in its own flow, at the slot-check step:

PGPASSFILE=/var/opt/gitlab/postgresql/.pgpass /opt/gitlab/bin/gitlab-psql \
  -h <new-primary> -p 5432 -U gitlab_replicator -d gitlabhq_production \
  -t -c "SELECT slot_name FROM pg_replication_slots WHERE slot_name = '...';"

That call hit "password authentication failed" even though .pgpass was on disk in the right place, owned by the right user, with the right mode. Why?

The gitlab-psql shell wrapper does this:

exec /opt/gitlab/embedded/bin/chpst -u gitlab-psql:gitlab-psql -U gitlab-psql \
     /usr/bin/env PGSSLCOMPRESSION=0 /opt/gitlab/embedded/bin/psql ...

chpst -u drops to the gitlab-psql user but does not change $HOME. So when libpq looks for a default password file, it checks the wrong home (the calling user's, or an empty default). And the explicit PGPASSFILE we set? The /usr/bin/env PGSSLCOMPRESSION=0 hop doesn't strip it, but the wrapper's environment handling was just inconsistent enough that the variable didn't survive end-to-end the way I expected.
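
You can see the $HOME half of this directly; the output below is illustrative, and /var/opt/gitlab/postgresql is where Omnibus puts the gitlab-psql user's home (and therefore its .pgpass):

sudo /opt/gitlab/embedded/bin/chpst -u gitlab-psql:gitlab-psql /usr/bin/env | grep '^HOME='
# HOME=/root    <- sudo's target home survives chpst unchanged, so libpq never
#                  looks at /var/opt/gitlab/postgresql/.pgpass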

Rather than debug the wrapper, I bypassed it entirely:

sudo -u gitlab-psql PGPASSWORD="$PASS" /opt/gitlab/embedded/bin/pg_basebackup \
  -h <new-primary> -U gitlab_replicator \
  -D /var/opt/gitlab/postgresql/data \
  -X stream --slot=<slot-name> \
  -d "sslmode=disable" \
  --write-recovery-conf --progress --verbose

This works because sudo -u preserves explicitly-passed env vars, and PGPASSWORD is read by libpq directly.

To make this stand in for the wrapper, also:

  1. Create the replication slot manually on the new primary first:
    SELECT pg_create_physical_replication_slot('<slot-name>');
    
  2. Use --write-recovery-conf so pg_basebackup drops a valid standby.signal + postgresql.auto.conf with primary_conninfo and primary_slot_name baked in.
  3. After PG comes up in recovery mode, run gitlab-ctl reconfigure to set up the tracking database (geo-postgresql) and start geo-logcursor.
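
For reference, --write-recovery-conf leaves roughly the following behind in the data directory; the exact primary_conninfo fields vary by PostgreSQL version, and the values here are placeholders:

sudo cat /var/opt/gitlab/postgresql/data/postgresql.auto.conf
# primary_conninfo = 'user=gitlab_replicator password=<plaintext> host=<new-primary> port=5432 sslmode=disable'
# primary_slot_name = '<slot-name>'
sudo ls /var/opt/gitlab/postgresql/data/standby.signal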

TL;DR: for backward replication, skip the wrapper. Manual pg_basebackup --write-recovery-conf plus a manual slot is dramatically more debuggable.


8. Maintenance mode rides along on PostgreSQL replication

Before promoting, I set maintenance_mode = true on the primary to drain writes. That setting lives in the application_settings table — which is part of the database that streams to the secondary. So when the secondary was promoted, it inherited the maintenance flag and proudly displayed the read-only banner on the (now-primary) UI.

Worse: trying to clear it via the obvious path didn't work:

sudo gitlab-rails runner '
  ApplicationSetting.current_without_cache.update!(maintenance_mode: false)
'

…crashed for reasons I didn't dig into post-failover (probably related to the Geo internal node-identity logic kicking in mid-Rails-boot). The blunt fix:

UPDATE application_settings SET maintenance_mode = false WHERE maintenance_mode = true;

…followed by:

sudo gitlab-ctl restart puma

Rails caches application_settings in memory; restarting Puma is the easiest way to invalidate that cache after a direct SQL update.
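
The same pair as shell on the new primary; gitlabhq_production is the Omnibus default database name, and gitlab-psql connects as the PostgreSQL superuser so the direct UPDATE goes through:

sudo gitlab-psql -d gitlabhq_production \
  -c "UPDATE application_settings SET maintenance_mode = false WHERE maintenance_mode = true;"
sudo gitlab-ctl restart puma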

TL;DR: maintenance mode is database state, not local state. Either disable it on the old primary before promote, or disable it on the new primary right after.


9. The DNS flip needs propagation time even from the authoritative server

I deleted the old A record and created a CNAME in one tool call each. Within a second the DO API showed exactly what I wanted: no A record for gitlab.example.com, one CNAME pointing at gitlab2.example.com. Then:

$ dig +short @ns1.digitalocean.com gitlab.example.com
<old-IP>

The authoritative nameserver was still handing out the old A record. Not a recursive cache somewhere — the auth server itself. For about a minute.

So even from a system that should be the source of truth, you get stale responses during the changeover window. Resolve directly to the target's IP with --resolve if you need to verify the new path before DNS settles:

curl -k --resolve gitlab.example.com:443:<new-IP> https://gitlab.example.com/

That confirms TLS works for the floating name on the new host (because the cert has the SAN), independent of DNS. If that passes and you've waited a couple of minutes, you're done worrying about the flip.
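
If you'd rather watch the authoritative answer converge than re-run dig by hand, a small loop does it (ns1.digitalocean.com as in the example above; adjust for your provider):

while true; do
  date +%T
  dig +short @ns1.digitalocean.com gitlab.example.com   # old IP, then the CNAME target once it settles
  sleep 5
done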

TL;DR: post-flip, confirm via --resolve before fretting about DNS. The auth server lag is real but short.


10. gitlab-ctl geo promote succeeds but leaves gitlab.rb lying

After:

sudo gitlab-ctl geo promote --force

…on the secondary, the database is correctly out of recovery mode and the GeoNode table correctly shows the promoted node as primary. But /etc/gitlab/gitlab.rb still has:

roles ['geo_secondary_role']

It reconfigures cleanly anyway and services run fine, but the next reconfigure will reassert the secondary behaviour (re-enable hot_standby, etc.) on top of a non-replica PostgreSQL — which then fails to start properly.

After promote, edit gitlab.rb manually to declare what the node now is:

roles ['geo_primary_role']
# remove any hot_standby, geo_postgresql, max_standby_*_delay lines from the secondary block
postgresql['listen_address'] = '0.0.0.0'
postgresql['md5_auth_cidr_addresses'] = ['<old-primary-ip>/32']  # ready for backward replication
postgresql['trust_auth_cidr_addresses'] = ['127.0.0.1/32']
postgresql['sql_replication_password'] = 'md5<hash>'
postgresql['max_replication_slots'] = 4
postgresql['max_wal_senders'] = 10

Then reconfigure. The reconfigure is now declarative — your node is what its gitlab.rb says it is.
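
After that reconfigure, two cheap sanity checks that the node really behaves as a primary now; gitlab:geo:check is GitLab's own Geo health rake task, and the recovery query is the same one from section 6:

sudo gitlab-rake gitlab:geo:check
sudo gitlab-psql -d postgres -t -c "SELECT pg_is_in_recovery();"   # expect 'f' on the promoted primary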

TL;DR: geo promote does runtime work; you do the config-file work. Don't leave them out of sync.


What the end-to-end recipe ended up looking like

For my own future reference (and yours), here's the trimmed sequence that actually works:

# === Pre-flight ===
# Confirm replication is at zero lag (on primary):
sudo gitlab-psql -d postgres -c "SELECT client_addr, state, replay_lsn FROM pg_stat_replication;"

# === On primary (gitlab1) — set maintenance mode ===
sudo gitlab-rails runner '
  ApplicationSetting.current_without_cache.update!(
    maintenance_mode: true,
    maintenance_mode_message: "Failover in progress"
  )
'

# === On secondary (gitlab2) — promote ===
sudo gitlab-ctl geo promote --force

# Edit /etc/gitlab/gitlab.rb on gitlab2:
#   roles ['geo_secondary_role']  →  roles ['geo_primary_role']
#   Add primary's listen_address / md5_auth / trust_auth / sql_replication_password
# Then:
sudo gitlab-ctl reconfigure
sudo gitlab-ctl restart postgresql   # to pick up listen_address change

# Reset the replicator password so SCRAM hash is fresh:
sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -d postgres \
  -c "ALTER USER gitlab_replicator WITH PASSWORD '<plaintext>';"

# === On old primary (gitlab1) — turn it into the new secondary ===
# Park user services:
sudo gitlab-ctl stop puma sidekiq nginx gitlab-workhorse

# Edit /etc/gitlab/gitlab.rb on gitlab1:
#   external_url → https://gitlab1.example.com
#   ssl cert paths → new multi-SAN cert
#   roles ['geo_primary_role']  →  roles ['geo_secondary_role']
#   Remove all primary postgresql settings (listen_address, md5_auth, sql_replication_password, slots)
#   Add hot_standby, geo_postgresql settings.

# Stop pg, wipe data dir, manual basebackup:
sudo gitlab-ctl stop postgresql
sudo rm -rf /var/opt/gitlab/postgresql/data
sudo install -d -m 0700 -o gitlab-psql -g gitlab-psql /var/opt/gitlab/postgresql/data

# Create slot on new primary:
ssh new-primary "sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -d postgres \
  -c \"SELECT pg_create_physical_replication_slot('<slot>');\""

# Run basebackup manually:
sudo -u gitlab-psql PGPASSWORD="$PASS" /opt/gitlab/embedded/bin/pg_basebackup \
  -h <new-primary> -U gitlab_replicator \
  -D /var/opt/gitlab/postgresql/data \
  -X stream --slot=<slot> \
  -d "sslmode=disable" \
  --write-recovery-conf --progress --verbose

# Bring PG back up — it's now in recovery, streaming from new primary:
sudo gitlab-ctl start postgresql

# Reconfigure to set up tracking DB + geo-logcursor under geo_secondary_role:
sudo gitlab-ctl reconfigure
sudo gitlab-ctl start

# === Geo node registry — done on new primary ===
sudo gitlab-rails runner '
  GeoNode.create(name: "gitlab1.example.com", url: "https://gitlab1.example.com/", primary: false)
'

# === DNS flip ===
# Delete old A record gitlab → <old-primary>
# Create CNAME       gitlab → gitlab2.example.com
# Add A record       gitlab1 → <old-primary>

What the failover was supposed to be vs. what it was

A promote command takes 90 seconds. The cert dance for the floating name is another 90 seconds. The DNS triple-flip is a single API call. If everything else cooperates, this is genuinely a four-minute operation.

What ate the day was the wrapper ignoring PGPASSWORD, its false "replication failed" banners, the gitlab-psql wrapper losing PGPASSFILE through chpst, maintenance mode replicating across to the new primary, and gitlab.rb still claiming geo_secondary_role after the promote.

Most of those have one common workaround: bypass the wrapper, drive pg_basebackup directly, and verify on-disk state rather than trusting exit codes. Geo's replication mechanics are well-engineered. The wrapper around them, less so.

Next failover should be ten minutes. If it isn't, I'll write another post.