
Infra · 14 min read

Standing up GitLab EE on DigitalOcean: the gotchas I'd avoid next time

I built a GitLab EE 18.11 lab on a handful of DigitalOcean droplets recently. Ultimate-NFR license, single top-level group with five Owner users, nine docker-executor runners. The end state works. Getting there exposed enough pitfalls — most of them subtle, several silent — that I want to write them down before the lessons fade.

This isn't a tutorial. There are plenty of those. It's a list of the things that aren't in the tutorials, the silent failures and the wrong assumptions, in roughly the order I tripped over them. Each section ends with a TL;DR for next time — read those first if you're skimming.


1. Confirm your region has an upgrade path before you commit

I picked the region I always pick. Hours later, when capacity became a concern, I discovered that region has no slugs above 8 vCPU available on this account tier. Max is 8 vCPU, full stop. The bigger sizes only live in General Purpose, CPU-Optimized, or Memory-Optimized pools — and those pools have uneven regional coverage. Just because a slug exists doesn't mean it's offered everywhere.

The implication: a lab you intend to scale up can become un-scalable simply because of where you put it. Migrating between regions means snapshot + new droplet + new IP + DNS changes — an order of magnitude more disruptive than a resize.
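
If you do end up migrating, the dance can at least be scripted with doctl. A rough sketch, assuming doctl is authenticated; the droplet ID, snapshot name, target region, and size below are all illustrative:

# Power off for a consistent snapshot, then snapshot (IDs/names illustrative)
doctl compute droplet-action power-off 123456789 --wait
doctl compute droplet-action snapshot 123456789 --snapshot-name gitlab-migrate --wait

# Look up the snapshot ID, then create the replacement in the target region
doctl compute snapshot list --resource droplet --format ID,Name
doctl compute droplet create gitlab-2 \
  --image <snapshot-id> --region nyc1 --size s-8vcpu-16gb --wait

# Then repoint DNS at the new droplet's IP and retire the old one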

TL;DR for next time: before picking a region, list both your initial size and a plausible upgrade target and confirm both are available there:

curl -sS -H "Authorization: Bearer $DO_TOKEN" \
  "https://api.digitalocean.com/v2/sizes?per_page=200" \
| jq -r '.sizes[] | select(.available and (.regions|index("nyc3"))) | "\(.slug)\t\(.vcpus)\t\(.memory)MB\t$\(.price_monthly)"' \
| sort -n -k2

Pick a different region before you provision if your upgrade target isn't there.


2. The root password complexity check is strict and silent-ish

GitLab 18.x rejects any root password that "contains commonly used combinations of words and letters". This is enforced at the admin-seed step, deep inside gitlab-ctl reconfigure. My first attempt used <word>Lab2026!Root — both Lab and Root tripped the dictionary check, and the install died mid-reconfigure with:

Could not create the default administrator account:
--> Password must not contain commonly used combinations of words and letters

It's not catastrophic — the install knows it failed — but if you're not watching the reconfigure log, you only find out when nothing works afterward.
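
A cheap guard: tee the reconfigure output somewhere and grep it before moving on. The log path here is just my choice, nothing standard:

gitlab-ctl reconfigure 2>&1 | tee /root/reconfigure.log
grep -n "Could not create the default administrator" /root/reconfigure.log \
  && echo "ADMIN SEED FAILED, see section 3"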

TL;DR: generate a random 24+ char password with no English words in it, and pass it via env var on the first install:

ROOTPW=$(openssl rand -base64 24 | tr -d '/+=' | head -c 28)
EXTERNAL_URL="https://gitlab.example.com" \
GITLAB_ROOT_PASSWORD="$ROOTPW" \
apt-get install -y gitlab-ee

3. If the admin seed fails, it won't retry on its own

This was the deepest rabbit hole. The default admin-user seed lives at db/fixtures/production/003_admin.rb and only runs on the first DB setup. When step 2 above failed, the DB was left with User.count == 0. Running gitlab-ctl reconfigure again did not re-seed — it considered the DB initialized and moved on.

End result: GitLab serving 200 OK on the login page, root user doesn't exist, no way in via the UI, and gitlab-rails runner can't find the user it expects.
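
You can confirm you're in this state with a one-liner:

gitlab-rails runner 'puts "users: #{User.count}, admins: #{User.where(admin: true).count}"'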

Recovery — seed root manually via Rails:

# /tmp/seed_admin.rb  — GITLAB_ROOT_PASSWORD=... gitlab-rails runner /tmp/seed_admin.rb
password = ENV.fetch('GITLAB_ROOT_PASSWORD')
admin = User.new(
  username: 'root', email: 'admin@gitlab.example.com', name: 'Administrator',
  password: password, password_confirmation: password,
  admin: true, confirmed_at: Time.now, skip_confirmation: true
)
admin.assign_personal_namespace(Organizations::Organization.default_organization)
admin.save!(validate: false)    # bypass model validations for this direct seed
admin.update_columns(admin: true, confirmed_at: Time.now, state: 'active')  # direct writes, skips callbacks

TL;DR: if install ever fails at "Could not create the default administrator account", don't just rerun reconfigure — seed root manually.


4. The CI job token signing key isn't auto-seeded on a partial install

This one cost me the most time, and the error message is wonderfully misleading.

GitLab 17+ signs every CI job's auth token with an RSA key stored in ApplicationSetting.ci_job_token_signing_key. This is seeded during a clean install. If your install was interrupted at the admin-seed step (see above), the key is missing — and every job submission to a runner fails the assignment step server-side with:

RuntimeError: CI job token signing key is not set

What the user sees in the UI is "The scheduler failed to assign job to the runner, please try again or contact system administrator." That message strongly implies a runner-side problem. It isn't. The runner is fine, doing its job, asking for work. The server can't generate the token it would hand the runner, and bails.

TL;DR: after any non-clean install, set the key explicitly:

gitlab-rails runner '
require "openssl"
s = ApplicationSetting.current
if s.ci_job_token_signing_key.blank?
  s.update!(ci_job_token_signing_key: OpenSSL::PKey::RSA.new(2048).to_pem)
end
puts "set? #{s.ci_job_token_signing_key.present?}"
'

And: when CI errors look like runner issues, always check /var/log/gitlab/gitlab-rails/exceptions_json.log on the GitLab server before touching the runner. The real exception is in there, and it usually tells you exactly which class is blowing up. Way faster than poking at runner configs.
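
Something like this pulls out the signal (field names follow GitLab's dotted JSON-log convention; eyeball a raw line first if your version differs):

tail -n 200 /var/log/gitlab/gitlab-rails/exceptions_json.log \
| jq -r '[.time, .["exception.class"], .["exception.message"]] | @tsv' \
| tail -n 20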


5. Modern GitLab requires organization_id on Group create, and an Organizations::OrganizationUser row on every user

Two API quirks introduced by the Organizations feature flag, both with bad error messages.

a) Groups::CreateService returns ServiceResponse with status: :error if organization_id is missing:

# WRONG — fails with "Group has errors: Organization can't be blank"
Groups::CreateService.new(admin, name: 'X', path: 'x').execute

# RIGHT
Groups::CreateService.new(admin,
  name: 'X', path: 'x',
  organization_id: Organizations::Organization.default_organization.id
).execute

b) group.add_owner(user) silently fails if the user has no Organizations::OrganizationUser row. It returns a Member object with access_level: 50 (Owner) — looks successful at a glance — but member.persisted? is false. The error buried inside is "already belongs to another organization", which is doubly misleading: the user belongs to no organization yet; that's the actual problem.

org = Organizations::Organization.default_organization
Organizations::OrganizationUser.find_or_create_by!(user: user, organization: org)
m = group.add_member(user, Gitlab::Access::OWNER)
raise m.errors.full_messages.join(', ') unless m.persisted?

TL;DR: when scripting user creation: (1) pass organization_id when creating groups, (2) insert an Organizations::OrganizationUser row right after User.save!, (3) check Member#persisted? after add_owner — never trust that no exception means success.


6. Don't run package installs in parallel over SSH

I tried installing gitlab-runner on four droplets concurrently by backgrounding four ssh sessions and waiting for them. One installed successfully. The other three exited 0 without installing the package. No errors. No clue why — they raced on something, perhaps the apt-repo init or a lock — and silently dropped the install step.

I didn't notice until gitlab-runner register returned bash: gitlab-runner: command not found.

TL;DR: run package installs sequentially, and check the binary exists after every install. A failing-fast loop is cheaper to debug than four silent successes:

for ip in "${IPS[@]}"; do
  ssh root@"$ip" 'apt-get update -y && apt-get install -y gitlab-runner && which gitlab-runner' \
    || { echo "INSTALL FAILED ON $ip"; exit 1; }
done

7. The standalone gitlab-runner binary doesn't install a systemd unit

A teammate had downloaded the gitlab-runner binary, put it in /usr/bin/, and run gitlab-runner register. The registration succeeded — a runner record appeared in GitLab — but the runner showed offline in the UI, and gitlab-runner verify came back empty.

The package install (.deb) sets up a systemd service for you. The binary doesn't. There's no service running to read /etc/gitlab-runner/config.toml, and depending on which path register wrote to, the config may not contain anything anyway.

TL;DR: for binary installs, always do this explicitly:

sudo gitlab-runner install --user=root --working-directory=/home/gitlab-runner
sudo gitlab-runner start
sudo gitlab-runner register --non-interactive \
  --url https://... --token glrt-... \
  --executor docker --docker-image alpine:3.20 \
  --description "$(hostname)"
sudo gitlab-runner verify   # confirms server-side sees us

If verify says "is removed", the GitLab side has deleted the runner record (e.g. you ran unregister server-side earlier) — re-register with a fresh token.


8. Bash heredoc + parens in arguments = silent failure

I had a runner description with parentheses: "gitlab-runner-3 (lab-b)". When I passed that into a remote SSH heredoc, the remote shell interpreted the parens as subshell syntax. The ssh exited non-zero, register never ran, but the previous lines (which created the runner record server-side) had already succeeded — leaving an orphan.

TL;DR: never inline arbitrary strings into shell heredocs. Pass them as positional args, escaped with printf '%q' (ssh flattens its arguments into one remote command string, so plain local quoting doesn't survive the hop), and use a quoted heredoc delimiter:

ssh root@host "bash -s -- $(printf '%q ' "$TOKEN" "$DESCRIPTION")" <<'REMOTE'
  TOKEN="$1"; DESC="$2"
  gitlab-runner register --token "$TOKEN" --description "$DESC" ...
REMOTE

The 'REMOTE' (single-quoted) delimiter stops local expansion of the heredoc body, printf '%q' makes the values survive the remote shell's re-parse, and $1/$2 pick them up safely on the other side. Once you've been bitten, you start using it for everything.


9. Install postfix at bootstrap, not later

GitLab uses /usr/sbin/sendmail by default for outbound email — pipeline notifications, MR comments, password resets, the works. Without it, every mailer Sidekiq job dies and lands in the dead set:

error_class=Errno::ENOENT
error_message=No such file or directory - /usr/sbin/sendmail

This isn't catastrophic in the abstract, but the second-order effects bite: password resets never arrive, notification mail piles up as Sidekiq dead-set entries, and the eventual fix means installing a package and restarting services on a box people are actively using.

I learned this during active use of the platform. Strongly do not recommend.

TL;DR: install postfix before apt-get install -y gitlab-ee:

debconf-set-selections <<<"postfix postfix/mailname string $(hostname -f)"
debconf-set-selections <<<"postfix postfix/main_mailer_type string 'Local only'"
DEBIAN_FRONTEND=noninteractive apt-get install -y postfix mailutils

Or configure GitLab to use SMTP and skip postfix entirely. Either way, decide before traffic exists.
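
If you take the SMTP route, the gitlab.rb side is a handful of settings; the hostname and credentials below are placeholders:

# /etc/gitlab/gitlab.rb, then gitlab-ctl reconfigure
gitlab_rails['smtp_enable'] = true
gitlab_rails['smtp_address'] = "smtp.example.com"       # placeholder
gitlab_rails['smtp_port'] = 587
gitlab_rails['smtp_user_name'] = "gitlab@example.com"   # placeholder
gitlab_rails['smtp_password'] = "REDACTED"
gitlab_rails['smtp_authentication'] = "login"
gitlab_rails['smtp_enable_starttls_auto'] = true
gitlab_rails['gitlab_email_from'] = "gitlab@example.com"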


10. sidekiq['max_concurrency'] is a cap, not a setting

GitLab Omnibus's default is sidekiq['concurrency'] = 20. With nine runners actively requesting jobs and updating state, this is not enough — pipelines back up at the created → pending transition because PipelineProcessWorker waits its turn behind unrelated queue traffic. Users see jobs sitting in "pending" for 5–10 minutes.
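
Before touching config, confirm the backlog is real; Sidekiq's own API shows per-queue depth and latency:

gitlab-rails runner 'require "sidekiq/api"
Sidekiq::Queue.all.sort_by(&:size).reverse.first(10).each do |q|
  puts "#{q.name}: #{q.size} queued, #{q.latency.round(1)}s latency"
end'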

I set sidekiq['max_concurrency'] = 50 and assumed I'd more than doubled the worker count. I had not. max_concurrency is the upper bound the auto-scaler is allowed to reach; the actual running concurrency is sidekiq['concurrency'], which was still 20.

TL;DR: set the real knob:

# /etc/gitlab/gitlab.rb
sidekiq['concurrency'] = 50          # actual worker count
sidekiq['max_concurrency'] = 50      # cap for auto-scaling
sidekiq['min_concurrency'] = 20      # floor

Then gitlab-ctl reconfigure. Verify at runtime — don't trust the config file:

gitlab-rails runner 'require "sidekiq/api"
Sidekiq::ProcessSet.new.each { |p| puts "concurrency=#{p["concurrency"]}" }'

That output is the source of truth.


11. DO doesn't live-resize. Confirm the size you need first

DigitalOcean droplet resizes always require a power-off. No live migration, no hot CPU/RAM addition. CPU/RAM-only resize ("disk": false in the API) is reversible and takes ~2 min of downtime; full disk resize is irreversible.

Doing this during active use cost me five minutes of full outage. Worse, the DO API reported the resize action as errored even though it had actually completed and updated the droplet's size_slug. The action status is unreliable; check the droplet's actual current size instead:

curl -sS -H "Authorization: Bearer $DO_TOKEN" \
  "https://api.digitalocean.com/v2/droplets/$ID" \
| jq '.droplet | {size_slug, vcpus, memory, disk, status}'
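
For completeness, the CPU/RAM-only resize call itself (target slug illustrative; the droplet must be powered off first):

curl -sS -X POST \
  -H "Authorization: Bearer $DO_TOKEN" -H "Content-Type: application/json" \
  -d '{"type":"resize","size":"s-8vcpu-16gb","disk":false}' \
  "https://api.digitalocean.com/v2/droplets/$ID/actions"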

TL;DR:

- confirm the size you'll eventually need before provisioning, so you never resize a live box
- if you must resize, keep it CPU/RAM-only ("disk": false); that path is reversible
- verify the result from the droplet's size_slug, not the action status
- budget a few minutes of hard downtime and announce it first


12. The meta-lesson: don't do operational changes during active use

Each of these — installing postfix, resizing the droplet, restarting Sidekiq — is fine on its own. But doing all three back-to-back while the platform is in use compounded into a full outage and an unscheduled reboot. The fix is process, not technology: a declared maintenance window, one change at a time with verification in between, and a heads-up to users before anything that can restart services.

I had none of this. I do now.


Pre-flight checklist

The actually-useful part. Before provisioning:

- confirm the region offers both your initial size slug and a plausible upgrade slug (1)
- decide sendmail vs SMTP for email now; postfix goes in before gitlab-ee (9)

During GitLab install:

- use a random 24+ char root password with no dictionary words, via GITLAB_ROOT_PASSWORD (2)
- watch the reconfigure output; if the admin seed fails, seed root manually, since it will not retry (3)

After install, before users:

- check ApplicationSetting.current.ci_job_token_signing_key and set it if blank (4)
- set sidekiq['concurrency'] (not just max_concurrency) and verify at runtime (10)

For each user:

- insert an Organizations::OrganizationUser row right after User.save! (5)
- check Member#persisted? after add_owner; no exception does not mean success (5)

For each runner host:

- install packages sequentially and confirm the binary exists afterward (6)
- for binary installs: gitlab-runner install + start before register, then verify (7)
- pass descriptions as printf '%q'-escaped positional args, never inlined into heredocs (8)

For each test pipeline:

- if job assignment fails, read exceptions_json.log on the server before touching the runner (4)


Appendix: minimal working bootstrap

# bootstrap.rb — gitlab-rails runner /tmp/bootstrap.rb

require 'openssl'

# 1. Make sure CI signing key exists (covers interrupted-install case)
s = ApplicationSetting.current
if s.ci_job_token_signing_key.blank?
  s.update!(ci_job_token_signing_key: OpenSSL::PKey::RSA.new(2048).to_pem)
end

# 2. Activate license (cloud licensing)
GitlabSubscriptions::ActivateService.new.execute(ENV['ACTIVATION_CODE'])

# 3. Create top-level group with org_id
admin = User.find_by!(username: 'root')
org   = Organizations::Organization.default_organization
group = Group.find_or_initialize_by(path: ENV['GROUP_PATH'])
if group.new_record?
  res = Groups::CreateService.new(admin,
    name: ENV['GROUP_NAME'], path: ENV['GROUP_PATH'],
    visibility_level: Gitlab::VisibilityLevel::PRIVATE,
    organization_id: org.id).execute
  group = res.payload[:group]
  raise group.errors.full_messages.join(', ') if group.errors.any?
end

# 4. Create users, attach to org, add as owners
ENV['USERS'].split(',').each do |u|
  user = User.find_by(username: u) || User.new(
    username: u, name: u.capitalize, email: "#{u}@example.com",
    password: ENV['USER_PASSWORD'], password_confirmation: ENV['USER_PASSWORD'],
    skip_confirmation: true
  )
  if user.new_record?
    user.assign_personal_namespace(org)
    user.save!(validate: false)
    user.update_columns(confirmed_at: Time.now, state: 'active')
  end
  Organizations::OrganizationUser.find_or_create_by!(user: user, organization: org)
  m = group.add_member(user, Gitlab::Access::OWNER)
  raise "membership failed for #{u}: #{m.errors.full_messages.join(', ')}" unless m.persisted?
end

puts 'bootstrap complete'

Call with:

ACTIVATION_CODE=xxx GROUP_PATH=labs GROUP_NAME='Labs' \
USERS=user1,user2,user3 USER_PASSWORD='<strong>' \
gitlab-rails runner /tmp/bootstrap.rb

If I do this again with this list in hand, I expect the same lab in ~45 minutes end-to-end. Most of the time we burned was on the silent-failure cluster — admin seed not retrying, the CI signing key looking like a runner problem, add_owner returning a non-persisted member. None of these are bugs exactly. They're just rough edges that the docs assume you'll skate past on your first try.


Pocket cheat-sheet

The 30-second version, for pasting into a new agent or chat window before you start:

- check the region offers your initial and upgrade size slugs before provisioning
- random 24+ char root password, no dictionary words; watch the reconfigure output
- a failed admin seed never retries; seed root via gitlab-rails runner
- "scheduler failed to assign job" usually means ci_job_token_signing_key is blank, not a runner problem
- groups need organization_id; users need an Organizations::OrganizationUser row; check Member#persisted?
- run package installs sequentially; confirm the binary after each one
- binary runner installs need gitlab-runner install + start, not just register
- never inline strings into SSH commands or heredocs; printf '%q' + positional args
- postfix (or SMTP config) before gitlab-ee
- sidekiq['concurrency'] is the real knob; max_concurrency is only a cap
- resizes require power-off; trust size_slug, not the action status
- no operational changes during active use