Infra · 12 min read
Planned GitLab Geo failover: ten things the docs glossed over
I had a GitLab EE primary on one hypervisor and wanted a hot, streaming-replicated secondary on a second hypervisor — a Geo pair, so that if either host died I could promote the survivor. Then, as a trial, I planned to fail over to the secondary and convert the original primary into the new secondary. End state: the same hardware, just with replication flowing the other direction.
The setup is canonical GitLab Geo: PostgreSQL streaming replication for the database, plus Geo's own job/log machinery for repositories, LFS, uploads, and registry artifacts. Both sides on the same EE patch version (Geo is strict about that). Floating DNS name in front so a CNAME flip is the only client-side change at cutover.
The end state works. The path was full of small detours. Here are the ones worth remembering.
1. GitLab Geo requires exact version match — apt-mark hold both sides
Geo refuses to run when primary and secondary disagree on patch version. Not just minor — patch. So 18.9.1-ee.0 ≠ 18.9.2-ee.0 and the secondary won't catch up.
Two consequences:
- Pin both sides immediately after the install: sudo apt-mark hold gitlab-ee. Otherwise the next routine apt upgrade silently drifts you out of sync and your replicas stop.
- The pinned version might already be gone from the current apt index. I needed 18.9.1-ee.0 for the secondary, but the repo only had 18.9.2 through 18.9.7. The .deb is still in pkgcloud's storage backend, though, and you can pull it directly: https://packages.gitlab.com/gitlab/gitlab-ee/packages/ubuntu/<dist>/gitlab-ee_<VER>-ee.0_amd64.deb/download.deb. dpkg -i works on it, then apt-mark hold to stop apt from "fixing" your install.
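Concretely, the fetch-and-pin dance can be sketched like this. The version matches my setup; the Ubuntu codename (noble) is an example — substitute your own:

```shell
# Target version and Ubuntu codename -- adjust for your install.
VER="18.9.1"
DIST="noble"

# pkgcloud keeps old .debs in its storage backend even after they
# drop out of the apt index:
URL="https://packages.gitlab.com/gitlab/gitlab-ee/packages/ubuntu/${DIST}/gitlab-ee_${VER}-ee.0_amd64.deb/download.deb"
echo "$URL"

# Then, on the box that needs the old version:
#   curl -fL -o "gitlab-ee_${VER}-ee.0_amd64.deb" "$URL"
#   sudo dpkg -i "gitlab-ee_${VER}-ee.0_amd64.deb"
#   sudo apt-mark hold gitlab-ee
```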
TL;DR: hold the package on both sides, and remember pkgcloud's pool URLs when you need an older patch.
2. The external_url lock-in I avoided by issuing multi-SAN certs first
The long-term goal is a floating canonical name (gitlab.example.com) that CNAMEs to whichever site is currently primary (gitlab1.example.com or gitlab2.example.com). For the TLS not to break the instant a CNAME flips, both sides need a cert that already covers both names.
So when I issued the secondary's cert, I asked for both SANs up front:
acme.sh --issue --dns dns_dgon \
-d gitlab2.example.com -d gitlab.example.com \
--keylength ec-256 \
--server letsencrypt \
--dnssleep 60
acme.sh handles DNS-01 validation against both names automatically. Critically, the canonical name being a CNAME doesn't break this — the CA looks for _acme-challenge.gitlab.example.com directly, which lives at a sibling name, not at the CNAME target.
When I later promoted the secondary and CNAMEd gitlab.example.com over to it, no TLS warnings. Same trick applied in reverse: I re-issued the original primary's cert with multi-SAN before flipping its external_url.
TL;DR: if you have a CNAME flip in your future, issue both names on the cert now, not later.
3. postgresql['listen_address'] = '0.0.0.0' silently flips Rails to TCP
To allow the secondary to connect for streaming replication, the primary needs PostgreSQL listening on a public interface, not just the Unix socket. So I set:
postgresql['listen_address'] = '0.0.0.0'
postgresql['md5_auth_cidr_addresses'] = ['<secondary-ip>/32']
And reconfigured. Reconfigure succeeded. Then Puma went into a crash-loop, returning 502s. Logs:
PG::ConnectionBad: connection to server at "0.0.0.0", port 5432 failed:
FATAL: no pg_hba.conf entry for host "127.0.0.1", user "gitlab", database "gitlabhq_production"
The change to listen_address causes Omnibus to rewrite Rails' database.yml to connect via TCP at the listen address instead of via the Unix socket. The pg_hba.conf I'd modified added an entry for the secondary's IP only; loopback (127.0.0.1) had no host-level entry. Rails couldn't connect to its own database.
Fix:
postgresql['trust_auth_cidr_addresses'] = ['127.0.0.1/32']
trust on loopback is safe for a single-host setup (a remote attacker can't impersonate 127.0.0.1; only a local privileged user can, and they already have everything).
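For the record, after reconfigure the two auth settings together should render host entries in the generated pg_hba.conf roughly like this (a sketch from memory, not a verbatim copy of the Omnibus template, which adds more context lines):

```
# /var/opt/gitlab/postgresql/data/pg_hba.conf (generated -- don't hand-edit)
host  all  all  127.0.0.1/32       trust   # from trust_auth_cidr_addresses
host  all  all  <secondary-ip>/32  md5     # from md5_auth_cidr_addresses
```

If the loopback line is missing, the TCP-connecting Rails app has no matching entry, and you get exactly the FATAL above.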
TL;DR: any time you set listen_address, add a matching loopback auth entry, or your own Rails app stops being able to log in.
4. sql_replication_password doesn't apply to an existing role
I set this in gitlab.rb on the primary:
postgresql['sql_replication_password'] = 'md5<hash>'
…and reconfigured. From the secondary's box, pg_basebackup against the primary still returned:
FATAL: password authentication failed for user "gitlab_replicator"
It turns out the directive only sets the password at role creation. If gitlab_replicator already exists (Omnibus creates it implicitly the moment you enable Geo), reconfigure won't change its stored credential.
You have to set it directly:
ALTER USER gitlab_replicator WITH PASSWORD '<plaintext>';
PostgreSQL stores the new hash under whatever password_encryption is set to (modern default: SCRAM-SHA-256). pg_hba.conf with method md5 accepts both md5 and scram-sha-256 stored hashes transparently, so you don't need to also flip the method.
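If you want to confirm what PostgreSQL actually has stored for the role, the hash prefix in pg_authid gives it away (run via sudo gitlab-psql -d postgres; the query is a sketch, standard PostgreSQL):

```sql
-- 'SCRAM-SHA-256$...' means scram-sha-256; a 35-char 'md5...' string means md5
SELECT rolname, left(rolpassword, 13) AS hash_prefix
FROM pg_authid
WHERE rolname = 'gitlab_replicator';
```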
TL;DR: after every change to sql_replication_password, run ALTER USER directly to make sure PG actually has the password you think it has.
5. gitlab-ctl replicate-geo-database ignores PGPASSWORD
The tool that bootstraps a secondary by running pg_basebackup from the primary doesn't read the standard PG environment variables (and sudo's default env_reset would strip PGPASSWORD before the script ever saw it anyway). I tried:
PGPASSWORD="$PASS" sudo gitlab-ctl replicate-geo-database --slot-name=secondary --host=primary
The script's first action is to call ask_pass, which does:
return STDIN.gets.chomp unless STDIN.tty?
…so when stdin is a non-tty (running under SSH non-interactively), it tries to read from stdin and crashes on nil.chomp if you didn't pipe anything in. The fix is to pipe the password through stdin:
echo "$PASS" | sudo gitlab-ctl replicate-geo-database \
--slot-name=secondary \
--host=primary \
--sslmode=disable \
--no-wait --force
That feeds the password into ask_pass cleanly. But this only works for the first prompt the script makes; subsequent calls to internal helpers (which we'll hit in section 7) need their own auth path.
TL;DR: pipe the password via stdin. PGPASSWORD will not be read.
6. The "replication failed" message is often a lie — check standby.signal
Twice in a row, replicate-geo-database printed an angry banner:
*** Initial replication failed! ***
Replication tool returned with a non zero exit status!
Failed to execute: gitlab-ctl start
Both times, the actual pg_basebackup had succeeded. The failure was in the post-replicate service restart phase — gitlab-kas or sidekiq timed out on its way back up, and the wrapper reported the last failed step as if the whole thing had failed.
The cheap way to tell whether replication actually worked:
ls /var/opt/gitlab/postgresql/data/standby.signal
If that file exists, PostgreSQL is in recovery mode. Confirm with:
SELECT pg_is_in_recovery();
-- t ✅ secondary; streaming
-- f ❌ no replication happened
Don't trust the wrapper's exit code. Trust the on-disk state.
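While you're in psql on the secondary anyway, a quick lag check complements the recovery flag (standard PostgreSQL, nothing Geo-specific):

```sql
-- On the secondary: how far replay is behind. NULL right after startup,
-- before the first transaction has been replayed.
SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;
```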
TL;DR: after any failure from replicate-geo-database, check for standby.signal and pg_is_in_recovery() before assuming you have to start over.
7. gitlab-psql wrapper drops PGPASSFILE through its sudo
When I tried to set up backward replication — making the old primary into a secondary of the new primary — replicate-geo-database failed earlier in its own flow, at the slot-check step:
PGPASSFILE=/var/opt/gitlab/postgresql/.pgpass /opt/gitlab/bin/gitlab-psql \
-h <new-primary> -p 5432 -U gitlab_replicator -d gitlabhq_production \
-t -c "SELECT slot_name FROM pg_replication_slots WHERE slot_name = '...';"
That call hit "password authentication failed" even though .pgpass was on disk in the right place, owned by the right user, with the right mode. Why?
The gitlab-psql shell wrapper does this:
exec /opt/gitlab/embedded/bin/chpst -u gitlab-psql:gitlab-psql -U gitlab-psql \
/usr/bin/env PGSSLCOMPRESSION=0 /opt/gitlab/embedded/bin/psql ...
chpst -u drops to the gitlab-psql user but does not change $HOME, so when libpq goes looking for a default ~/.pgpass it checks the calling user's home (or no home at all), not gitlab-psql's. As for the explicit PGPASSFILE I set: the downstream /usr/bin/env PGSSLCOMPRESSION=0 shouldn't strip it, but somewhere in the sudo-chpst-env chain the variable didn't survive to libpq the way I expected, and I stopped debugging there.
Rather than debug the wrapper, I bypassed it entirely:
sudo -u gitlab-psql PGPASSWORD="$PASS" /opt/gitlab/embedded/bin/pg_basebackup \
-h <new-primary> -U gitlab_replicator \
-D /var/opt/gitlab/postgresql/data \
-X stream --slot=<slot-name> \
-d "sslmode=disable" \
--write-recovery-conf --progress --verbose
This works because sudo -u preserves explicitly-passed env vars, and PGPASSWORD is read by libpq directly.
To make this stand in for the wrapper, also:
- Create the replication slot manually on the new primary first: SELECT pg_create_physical_replication_slot('<slot-name>');
- Use --write-recovery-conf so pg_basebackup drops a valid standby.signal plus a postgresql.auto.conf with primary_conninfo and primary_slot_name baked in.
- After PG comes up in recovery mode, run gitlab-ctl reconfigure to set up the tracking database (geo-postgresql) and start geo-logcursor.
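For reference, --write-recovery-conf should leave something like this behind (a sketch; the exact conninfo keys libpq writes vary by version), alongside an empty standby.signal:

```
# /var/opt/gitlab/postgresql/data/postgresql.auto.conf (appended by pg_basebackup)
primary_conninfo = 'user=gitlab_replicator host=<new-primary> port=5432 sslmode=disable'
primary_slot_name = '<slot-name>'
```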
TL;DR: for backward replication, skip the wrapper. Manual pg_basebackup --write-recovery-conf plus a manual slot is dramatically more debuggable.
8. Maintenance mode rides along on PostgreSQL replication
Before promoting, I set maintenance_mode = true on the primary to drain writes. That setting lives in the application_settings table — which is part of the database that streams to the secondary. So when the secondary was promoted, it inherited the maintenance flag and proudly displayed the read-only banner on the (now-primary) UI.
Worse: trying to clear it via the obvious path didn't work:
sudo gitlab-rails runner '
ApplicationSetting.current_without_cache.update!(maintenance_mode: false)
'
…crashed for reasons I didn't dig into post-failover (probably related to the Geo internal node-identity logic kicking in mid-Rails-boot). The blunt fix:
UPDATE application_settings SET maintenance_mode = false WHERE maintenance_mode = true;
…followed by:
sudo gitlab-ctl restart puma
Rails caches application_settings in memory; restarting Puma is the easiest way to invalidate that cache after a direct SQL update.
TL;DR: maintenance mode is database state, not local state. Either disable it on the old primary before promote, or disable it on the new primary right after.
9. The DNS flip needs propagation time even from the authoritative server
I deleted the old A record and created a CNAME in one tool call each. Within a second the DO API showed exactly what I wanted: no A record for gitlab.example.com, one CNAME pointing at gitlab2.example.com. Then:
$ dig +short @ns1.digitalocean.com gitlab.example.com
<old-IP>
The authoritative nameserver was still handing out the old A record. Not a recursive cache somewhere — the auth server itself. For about a minute.
So even from a system that should be the source of truth, you get stale responses during the changeover window. Resolve directly to the target's IP with --resolve if you need to verify the new path before DNS settles:
curl -k --resolve gitlab.example.com:443:<new-IP> https://gitlab.example.com/
That confirms TLS works for the floating name on the new host (because the cert has the SAN), independent of DNS. If that passes and you've waited a couple of minutes, you're done worrying about the flip.
TL;DR: post-flip, confirm via --resolve before fretting about DNS. The auth server lag is real but short.
10. gitlab-ctl geo promote succeeds but leaves gitlab.rb lying
After:
sudo gitlab-ctl geo promote --force
…on the secondary, the database is correctly out of recovery mode and the GeoNode table correctly shows the promoted node as primary. But /etc/gitlab/gitlab.rb still has:
roles ['geo_secondary_role']
It reconfigures cleanly anyway and services run fine, but the next reconfigure will reassert the secondary behaviour (re-enable hot_standby, etc.) on top of a PostgreSQL that is no longer a replica, which then fails to start properly.
After promote, edit gitlab.rb manually to declare what the node now is:
roles ['geo_primary_role']
# remove any hot_standby, geo_postgresql, max_standby_*_delay lines from the secondary block
postgresql['listen_address'] = '0.0.0.0'
postgresql['md5_auth_cidr_addresses'] = ['<old-primary-ip>/32'] # ready for backward replication
postgresql['trust_auth_cidr_addresses'] = ['127.0.0.1/32']
postgresql['sql_replication_password'] = 'md5<hash>'
postgresql['max_replication_slots'] = 4
postgresql['max_wal_senders'] = 10
Then reconfigure. The reconfigure is now declarative — your node is what its gitlab.rb says it is.
TL;DR: geo promote does runtime work; you do the config-file work. Don't leave them out of sync.
What the end-to-end recipe ended up looking like
For my own future reference (and yours), here's the trimmed sequence that actually works:
# === Pre-flight ===
# Confirm replication is at zero lag (on primary):
SELECT client_addr, state, replay_lsn FROM pg_stat_replication;
# === On primary (gitlab1) — set maintenance mode ===
sudo gitlab-rails runner '
ApplicationSetting.current_without_cache.update!(
maintenance_mode: true,
maintenance_mode_message: "Failover in progress"
)
'
# === On secondary (gitlab2) — promote ===
sudo gitlab-ctl geo promote --force
# Edit /etc/gitlab/gitlab.rb on gitlab2:
# roles ['geo_secondary_role'] → roles ['geo_primary_role']
# Add primary's listen_address / md5_auth / trust_auth / sql_replication_password
# Then:
sudo gitlab-ctl reconfigure
sudo gitlab-ctl restart postgresql # to pick up listen_address change
# Reset the replicator password so SCRAM hash is fresh:
sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -d postgres \
-c "ALTER USER gitlab_replicator WITH PASSWORD '<plaintext>';"
# === On old primary (gitlab1) — turn it into the new secondary ===
# Park user services:
sudo gitlab-ctl stop puma sidekiq nginx gitlab-workhorse
# Edit /etc/gitlab/gitlab.rb on gitlab1:
# external_url → https://gitlab1.example.com
# ssl cert paths → new multi-SAN cert
# roles ['geo_primary_role'] → roles ['geo_secondary_role']
# Remove all primary postgresql settings (listen_address, md5_auth, sql_replication_password, slots)
# Add hot_standby, geo_postgresql settings.
# Stop pg, wipe data dir, manual basebackup:
sudo gitlab-ctl stop postgresql
sudo rm -rf /var/opt/gitlab/postgresql/data
sudo install -d -m 0700 -o gitlab-psql -g gitlab-psql /var/opt/gitlab/postgresql/data
# Create slot on new primary:
ssh new-primary "sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -d postgres \
-c \"SELECT pg_create_physical_replication_slot('<slot>');\""
# Run basebackup manually:
sudo -u gitlab-psql PGPASSWORD="$PASS" /opt/gitlab/embedded/bin/pg_basebackup \
-h <new-primary> -U gitlab_replicator \
-D /var/opt/gitlab/postgresql/data \
-X stream --slot=<slot> \
-d "sslmode=disable" \
--write-recovery-conf --progress --verbose
# Bring PG back up — it's now in recovery, streaming from new primary:
sudo gitlab-ctl start postgresql
# Reconfigure to set up tracking DB + geo-logcursor under geo_secondary_role:
sudo gitlab-ctl reconfigure
sudo gitlab-ctl start
# === Geo node registry — done on new primary ===
sudo gitlab-rails runner '
GeoNode.create(name: "gitlab1.example.com", url: "https://gitlab1.example.com/", primary: false)
'
# === DNS flip ===
# Delete old A record gitlab → <old-primary>
# Create CNAME gitlab → gitlab2.example.com
# Add A record gitlab1 → <old-primary>
What the failover was supposed to be vs. what it was
A promote command takes 90 seconds. The cert dance for the floating name is another 90 seconds. The DNS triple-flip is three quick API calls. If everything else cooperates, this is genuinely a four-minute operation.
What ate the day was:
- the replicate-geo-database wrapper papering over real auth failures with cryptic messages;
- assumed-but-not-applied gitlab.rb directives (the password one especially);
- the maintenance-mode flag riding on replication and persisting after promote;
- and the wrapper's environment handling not being quite consistent enough to survive multiple sudo-and-chpst layers.
Most of those have one common workaround: bypass the wrapper, drive pg_basebackup directly, and verify on-disk state rather than trusting exit codes. Geo's replication mechanics are well-engineered. The wrapper around them, less so.
Next failover should be ten minutes. If it isn't, I'll write another post.