
Infra · 14 min read

Standing up GitLab EE on DigitalOcean: the gotchas I'd avoid next time

I built a GitLab EE 18.11 lab on a handful of DigitalOcean droplets recently. Ultimate-NFR license, single top-level group with five Owner users, nine docker-executor runners. The end state works. Getting there exposed enough pitfalls — most of them subtle, several silent — that I want to write them down before the lessons fade.

This isn't a tutorial. There are plenty of those. It's a list of the things that aren't in the tutorials, the silent failures and the wrong assumptions, in roughly the order I tripped over them. Each section ends with a TL;DR for next time — read those first if you're skimming.


1. Confirm your region has an upgrade path before you commit

I picked the region I always pick. Hours later, when capacity became a concern, I discovered that region has no slugs above 8 vCPU available on this account tier. Max is 8 vCPU, full stop. The bigger sizes only live in General Purpose, CPU-Optimized, or Memory-Optimized pools — and those pools have uneven regional coverage. Just because a slug exists doesn't mean it's offered everywhere.

The implication: a lab you intend to scale up can become un-scalable simply because of where you put it. Migrating between regions means snapshot + new droplet + new IP + DNS changes — an order of magnitude more disruptive than a resize.
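
If you do end up migrating, the dance can at least be scripted with doctl. A rough sketch, assuming doctl is authenticated; the droplet ID, snapshot name, target region, and size below are all illustrative:

# Power off for a consistent snapshot, then snapshot (IDs/names illustrative)
doctl compute droplet-action power-off 123456789 --wait
doctl compute droplet-action snapshot 123456789 --snapshot-name gitlab-migrate --wait

# Look up the snapshot ID, then create the replacement in the target region
doctl compute snapshot list --resource droplet --format ID,Name
doctl compute droplet create gitlab-2 \
  --image <snapshot-id> --region nyc1 --size s-8vcpu-16gb --wait

# Then repoint DNS at the new droplet's IP and retire the old one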

TL;DR for next time: before picking a region, list both your initial size and a plausible upgrade target and confirm both are available there:

curl -sS -H "Authorization: Bearer $DO_TOKEN" \
  "https://api.digitalocean.com/v2/sizes?per_page=200" \
| jq -r '.sizes[] | select(.available and (.regions|index("nyc3"))) | "\(.slug)\t\(.vcpus)\t\(.memory)MB\t$\(.price_monthly)"' \
| sort -n -k2

Pick a different region before you provision if your upgrade target isn't there.


2. The root password complexity check is strict and silent-ish

GitLab 18.x rejects any root password that "contains commonly used combinations of words and letters". This is enforced at the admin-seed step, deep inside gitlab-ctl reconfigure. My first attempt used <word>Lab2026!Root — both Lab and Root tripped the dictionary check, and the install died mid-reconfigure with:

Could not create the default administrator account:
--> Password must not contain commonly used combinations of words and letters

It's not catastrophic — the install knows it failed — but if you're not watching the reconfigure log, you only find out when nothing works afterward.
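
A cheap guard: tee the reconfigure output somewhere and grep it before moving on. The log path here is just my choice, nothing standard:

gitlab-ctl reconfigure 2>&1 | tee /root/reconfigure.log
grep -n "Could not create the default administrator" /root/reconfigure.log \
  && echo "ADMIN SEED FAILED, see section 3"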

TL;DR: generate a random 24+ char password with no English words in it, and pass it via env var on the first install:

ROOTPW=$(openssl rand -base64 24 | tr -d '/+=' | head -c 28)
EXTERNAL_URL="https://gitlab.example.com" \
GITLAB_ROOT_PASSWORD="$ROOTPW" \
apt-get install -y gitlab-ee

3. If the admin seed fails, it won't retry on its own

This was the deepest rabbit hole. The default admin-user seed lives at db/fixtures/production/003_admin.rb and only runs on the first DB setup. When step 2 above failed, the DB was left with User.count == 0. Running gitlab-ctl reconfigure again did not re-seed — it considered the DB initialized and moved on.

End result: GitLab serving 200 OK on the login page, root user doesn't exist, no way in via the UI, and gitlab-rails runner can't find the user it expects.
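
You can confirm you're in this state with a one-liner:

gitlab-rails runner 'puts "users: #{User.count}, admins: #{User.where(admin: true).count}"'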

Recovery — seed root manually via Rails:

# /tmp/seed_admin.rb  — GITLAB_ROOT_PASSWORD=... gitlab-rails runner /tmp/seed_admin.rb
password = ENV.fetch('GITLAB_ROOT_PASSWORD')
admin = User.new(
  username: 'root', email: 'admin@gitlab.example.com', name: 'Administrator',
  password: password, password_confirmation: password,
  admin: true, confirmed_at: Time.now, skip_confirmation: true
)
admin.assign_personal_namespace(Organizations::Organization.default_organization)
admin.save!(validate: false)    # bypass model validations for this direct seed
admin.update_columns(admin: true, confirmed_at: Time.now, state: 'active')  # direct writes, skips callbacks

TL;DR: if install ever fails at "Could not create the default administrator account", don't just rerun reconfigure — seed root manually.


4. The CI job token signing key isn't auto-seeded on a partial install

This one cost me the most time, and the error message is wonderfully misleading.

GitLab 17+ signs every CI job's auth token with an RSA key stored in ApplicationSetting.ci_job_token_signing_key. This is seeded during a clean install. If your install was interrupted at the admin-seed step (see above), the key is missing — and every job submission to a runner fails the assignment step server-side with:

RuntimeError: CI job token signing key is not set

What the user sees in the UI is "The scheduler failed to assign job to the runner, please try again or contact system administrator." That message strongly implies a runner-side problem. It isn't. The runner is fine, doing its job, asking for work. The server can't generate the token it would hand the runner, and bails.

TL;DR: after any non-clean install, set the key explicitly:

gitlab-rails runner '
require "openssl"
s = ApplicationSetting.current
if s.ci_job_token_signing_key.blank?
  s.update!(ci_job_token_signing_key: OpenSSL::PKey::RSA.new(2048).to_pem)
end
puts "set? #{s.ci_job_token_signing_key.present?}"
'

And: when CI errors look like runner issues, always check /var/log/gitlab/gitlab-rails/exceptions_json.log on the GitLab server before touching the runner. The real exception is in there, and it usually tells you exactly which class is blowing up. Way faster than poking at runner configs.
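
Something like this pulls out the signal (field names follow GitLab's dotted JSON-log convention; eyeball a raw line first if your version differs):

tail -n 200 /var/log/gitlab/gitlab-rails/exceptions_json.log \
| jq -r '[.time, .["exception.class"], .["exception.message"]] | @tsv' \
| tail -n 20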


5. Modern GitLab requires organization_id on Group create, and an Organizations::OrganizationUser row on every user

Two API quirks introduced by the Organizations feature flag, both with bad error messages.

a) Groups::CreateService returns ServiceResponse with status: :error if organization_id is missing:

# WRONG — fails with "Group has errors: Organization can't be blank"
Groups::CreateService.new(admin, name: 'X', path: 'x').execute

# RIGHT
Groups::CreateService.new(admin,
  name: 'X', path: 'x',
  organization_id: Organizations::Organization.default_organization.id
).execute

b) group.add_owner(user) silently fails if the user has no Organizations::OrganizationUser row. It returns a Member object with access_level: 50 (Owner) — looks successful at a glance — but member.persisted? is false. The error buried inside is "already belongs to another organization", which is doubly misleading: the user belongs to no organization yet; that's the actual problem.

org = Organizations::Organization.default_organization
Organizations::OrganizationUser.find_or_create_by!(user: user, organization: org)
m = group.add_member(user, Gitlab::Access::OWNER)
raise m.errors.full_messages.join(', ') unless m.persisted?

TL;DR: when scripting user creation: (1) pass organization_id when creating groups, (2) insert an Organizations::OrganizationUser row right after User.save!, (3) check Member#persisted? after add_owner — never trust that no exception means success.


6. Don't run package installs in parallel over SSH

I tried installing gitlab-runner on four droplets concurrently by backgrounding four ssh sessions and waiting for them. One installed successfully. The other three exited 0 without installing the package. No errors. No clue why — they raced on something, perhaps the apt-repo init or a lock — and silently dropped the install step.

I didn't notice until gitlab-runner register returned bash: gitlab-runner: command not found.

TL;DR: run package installs sequentially, and check the binary exists after every install. A failing-fast loop is cheaper to debug than four silent successes:

for ip in "${IPS[@]}"; do
  ssh root@"$ip" 'apt-get update -y && apt-get install -y gitlab-runner && which gitlab-runner' \
    || { echo "INSTALL FAILED ON $ip"; exit 1; }
done

7. The standalone gitlab-runner binary doesn't install a systemd unit

A teammate had downloaded the gitlab-runner binary, put it in /usr/bin/, and run gitlab-runner register. The registration succeeded — a runner record appeared in GitLab — but the runner showed offline in the UI, and gitlab-runner verify came back empty.

The package install (.deb) sets up a systemd service for you. The binary doesn't. There's no service running to read /etc/gitlab-runner/config.toml, and depending on which path register wrote to, the config may not contain anything anyway.

TL;DR: for binary installs, always do this explicitly:

sudo gitlab-runner install --user=root --working-directory=/home/gitlab-runner
sudo gitlab-runner start
sudo gitlab-runner register --non-interactive \
  --url https://... --token glrt-... \
  --executor docker --docker-image alpine:3.20 \
  --description "$(hostname)"
sudo gitlab-runner verify   # confirms server-side sees us

If verify says "is removed", the GitLab side has deleted the runner record (e.g. you ran unregister server-side earlier) — re-register with a fresh token.


8. Bash heredoc + parens in arguments = silent failure

I had a runner description with parentheses: "gitlab-runner-3 (lab-b)". When I passed that into a remote SSH heredoc, the remote shell interpreted the parens as subshell syntax. The ssh exited non-zero, register never ran, but the previous lines (which created the runner record server-side) had already succeeded — leaving an orphan.

TL;DR: never inline arbitrary strings into shell heredocs. Pass them as positional args, escaped with printf '%q' (ssh flattens its arguments into one remote command string, so plain local quoting doesn't survive the hop), and use a quoted heredoc delimiter:

ssh root@host "bash -s -- $(printf '%q ' "$TOKEN" "$DESCRIPTION")" <<'REMOTE'
  TOKEN="$1"; DESC="$2"
  gitlab-runner register --token "$TOKEN" --description "$DESC" ...
REMOTE

The 'REMOTE' (single-quoted) delimiter stops local expansion of the heredoc body, printf '%q' makes the values survive the remote shell's re-parse, and $1/$2 pick them up safely on the other side. Once you've been bitten, you start using it for everything.


9. Install postfix at bootstrap, not later

GitLab uses /usr/sbin/sendmail by default for outbound email — pipeline notifications, MR comments, password resets, the works. Without it, every mailer Sidekiq job dies and lands in the dead set:

error_class=Errno::ENOENT
error_message=No such file or directory - /usr/sbin/sendmail

This isn't catastrophic in the abstract, but the second-order effects bite: password resets never arrive, notification mail piles up as Sidekiq dead-set entries, and the eventual fix means installing a package and restarting services on a box people are actively using.

I learned this during active use of the platform. Strongly do not recommend.

TL;DR: install postfix before apt-get install -y gitlab-ee:

debconf-set-selections <<<"postfix postfix/mailname string $(hostname -f)"
debconf-set-selections <<<"postfix postfix/main_mailer_type string 'Local only'"
DEBIAN_FRONTEND=noninteractive apt-get install -y postfix mailutils

Or configure GitLab to use SMTP and skip postfix entirely. Either way, decide before traffic exists.
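
If you take the SMTP route, the gitlab.rb side is a handful of settings; the hostname and credentials below are placeholders:

# /etc/gitlab/gitlab.rb, then gitlab-ctl reconfigure
gitlab_rails['smtp_enable'] = true
gitlab_rails['smtp_address'] = "smtp.example.com"       # placeholder
gitlab_rails['smtp_port'] = 587
gitlab_rails['smtp_user_name'] = "gitlab@example.com"   # placeholder
gitlab_rails['smtp_password'] = "REDACTED"
gitlab_rails['smtp_authentication'] = "login"
gitlab_rails['smtp_enable_starttls_auto'] = true
gitlab_rails['gitlab_email_from'] = "gitlab@example.com"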


10. sidekiq['max_concurrency'] is a cap, not a setting

GitLab Omnibus's default is sidekiq['concurrency'] = 20. With nine runners actively requesting jobs and updating state, this is not enough — pipelines back up at the created → pending transition because PipelineProcessWorker waits its turn behind unrelated queue traffic. Users see jobs sitting in "pending" for 5–10 minutes.
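
Before touching config, confirm the backlog is real; Sidekiq's own API shows per-queue depth and latency:

gitlab-rails runner 'require "sidekiq/api"
Sidekiq::Queue.all.sort_by(&:size).reverse.first(10).each do |q|
  puts "#{q.name}: #{q.size} queued, #{q.latency.round(1)}s latency"
end'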

I set sidekiq['max_concurrency'] = 50 and assumed I'd more than doubled the worker count. I had not. max_concurrency is the upper bound the auto-scaler is allowed to reach; the actual running concurrency is sidekiq['concurrency'], which was still 20.

TL;DR: set the real knob:

# /etc/gitlab/gitlab.rb
sidekiq['concurrency'] = 50          # actual worker count
sidekiq['max_concurrency'] = 50      # cap for auto-scaling
sidekiq['min_concurrency'] = 20      # floor

Then gitlab-ctl reconfigure. Verify at runtime — don't trust the config file:

gitlab-rails runner 'require "sidekiq/api"
Sidekiq::ProcessSet.new.each { |p| puts "concurrency=#{p["concurrency"]}" }'

That output is the source of truth.


11. DO doesn't live-resize. Confirm the size you need first

DigitalOcean droplet resizes always require a power-off. No live migration, no hot CPU/RAM addition. CPU/RAM-only resize ("disk": false in the API) is reversible and takes ~2 min of downtime; full disk resize is irreversible.

Doing this during active use cost me five minutes of full outage. Worse, the DO API reported the resize action as errored even though it had actually completed and updated the droplet's size_slug. The action status is unreliable; check the droplet's actual current size instead:

curl -sS -H "Authorization: Bearer $DO_TOKEN" \
  "https://api.digitalocean.com/v2/droplets/$ID" \
| jq '.droplet | {size_slug, vcpus, memory, disk, status}'
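
For completeness, the CPU/RAM-only resize call itself (target slug illustrative; the droplet must be powered off first):

curl -sS -X POST \
  -H "Authorization: Bearer $DO_TOKEN" -H "Content-Type: application/json" \
  -d '{"type":"resize","size":"s-8vcpu-16gb","disk":false}' \
  "https://api.digitalocean.com/v2/droplets/$ID/actions"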

TL;DR:

- confirm the size you'll eventually need before provisioning, so you never resize a live box
- if you must resize, keep it CPU/RAM-only ("disk": false); that path is reversible
- verify the result from the droplet's size_slug, not the action status
- budget a few minutes of hard downtime and announce it first


12. The meta-lesson: don't do operational changes during active use

Each of these — installing postfix, resizing the droplet, restarting Sidekiq — is fine on its own. But doing all three back-to-back while the platform is in use compounded into a full outage and an unscheduled reboot. The fix is process, not technology: a declared maintenance window, one change at a time with verification in between, and a heads-up to users before anything that can restart services.

I had none of this. I do now.


Pre-flight checklist

The actually-useful part. Before provisioning:

- confirm the region offers both your initial size slug and a plausible upgrade slug (1)
- decide sendmail vs SMTP for email now; postfix goes in before gitlab-ee (9)

During GitLab install:

- use a random 24+ char root password with no dictionary words, via GITLAB_ROOT_PASSWORD (2)
- watch the reconfigure output; if the admin seed fails, seed root manually, since it will not retry (3)

After install, before users:

- check ApplicationSetting.current.ci_job_token_signing_key and set it if blank (4)
- set sidekiq['concurrency'] (not just max_concurrency) and verify at runtime (10)

For each user:

- insert an Organizations::OrganizationUser row right after User.save! (5)
- check Member#persisted? after add_owner; no exception does not mean success (5)

For each runner host:

- install packages sequentially and confirm the binary exists afterward (6)
- for binary installs: gitlab-runner install + start before register, then verify (7)
- pass descriptions as printf '%q'-escaped positional args, never inlined into heredocs (8)

For each test pipeline:

- if job assignment fails, read exceptions_json.log on the server before touching the runner (4)


Appendix: minimal working bootstrap

# bootstrap.rb — gitlab-rails runner /tmp/bootstrap.rb

require 'openssl'

# 1. Make sure CI signing key exists (covers interrupted-install case)
s = ApplicationSetting.current
if s.ci_job_token_signing_key.blank?
  s.update!(ci_job_token_signing_key: OpenSSL::PKey::RSA.new(2048).to_pem)
end

# 2. Activate license (cloud licensing)
GitlabSubscriptions::ActivateService.new.execute(ENV['ACTIVATION_CODE'])

# 3. Create top-level group with org_id
admin = User.find_by!(username: 'root')
org   = Organizations::Organization.default_organization
group = Group.find_or_initialize_by(path: ENV['GROUP_PATH'])
if group.new_record?
  res = Groups::CreateService.new(admin,
    name: ENV['GROUP_NAME'], path: ENV['GROUP_PATH'],
    visibility_level: Gitlab::VisibilityLevel::PRIVATE,
    organization_id: org.id).execute
  group = res.payload[:group]
  raise group.errors.full_messages.join(', ') if group.errors.any?
end

# 4. Create users, attach to org, add as owners
ENV['USERS'].split(',').each do |u|
  user = User.find_by(username: u) || User.new(
    username: u, name: u.capitalize, email: "#{u}@example.com",
    password: ENV['USER_PASSWORD'], password_confirmation: ENV['USER_PASSWORD'],
    skip_confirmation: true
  )
  if user.new_record?
    user.assign_personal_namespace(org)
    user.save!(validate: false)
    user.update_columns(confirmed_at: Time.now, state: 'active')
  end
  Organizations::OrganizationUser.find_or_create_by!(user: user, organization: org)
  m = group.add_member(user, Gitlab::Access::OWNER)
  raise "membership failed for #{u}: #{m.errors.full_messages.join(', ')}" unless m.persisted?
end

puts 'bootstrap complete'

Call with:

ACTIVATION_CODE=xxx GROUP_PATH=labs GROUP_NAME='Labs' \
USERS=user1,user2,user3 USER_PASSWORD='<strong>' \
gitlab-rails runner /tmp/bootstrap.rb

If I do this again with this list in hand, I expect the same lab in ~45 minutes end-to-end. Most of the time we burned was on the silent-failure cluster — admin seed not retrying, the CI signing key looking like a runner problem, add_owner returning a non-persisted member. None of these are bugs exactly. They're just rough edges that the docs assume you'll skate past on your first try.


Pocket cheat-sheet

The 30-second version, for pasting into a new agent or chat window before you start:

- check the region offers your initial and upgrade size slugs before provisioning
- random 24+ char root password, no dictionary words; watch the reconfigure output
- a failed admin seed never retries; seed root via gitlab-rails runner
- "scheduler failed to assign job" usually means ci_job_token_signing_key is blank, not a runner problem
- groups need organization_id; users need an Organizations::OrganizationUser row; check Member#persisted?
- run package installs sequentially; confirm the binary after each one
- binary runner installs need gitlab-runner install + start, not just register
- never inline strings into SSH commands or heredocs; printf '%q' + positional args
- postfix (or SMTP config) before gitlab-ee
- sidekiq['concurrency'] is the real knob; max_concurrency is only a cap
- resizes require power-off; trust size_slug, not the action status
- no operational changes during active use