The cloud-native crowd will tell you to spin up an EKS cluster with Crossplane, cert-manager, external-secrets, and a Helm chart for everything. For a VPN control plane serving a few hundred nodes, that’s overkill that costs $400/month before you’ve written a single line of application code. Docker Swarm — declared dead by the same crowd in 2019 — runs this entire stack on three Hetzner cx23 instances for under €30/month, with full HA, automated PostgreSQL failover, distributed storage, and zero dependency on NetBird’s SaaS offering.
This post documents the architecture, the sharp edges, and the reasoning behind every non-obvious choice.
The Problem with Managed NetBird
NetBird’s hosted offering works fine until you need to comply with data residency requirements, connect private cloud networks, or simply refuse to let your VPN control plane phone home to a third party. The self-hosted option exists, but the official docs gloss over the operational reality: NetBird’s control plane is a collection of services — management API, signal server, STUN/TURN relay, dashboard, and an OIDC provider — that need to survive node failures, handle concurrent connections, and store state durably.
Running this on a single VPS means the VPN goes down when you reboot for kernel updates. Running it on Kubernetes means you now maintain a Kubernetes cluster for a VPN. The middle path is a properly configured Docker Swarm cluster with genuine HA at every layer.

Cluster Foundation: Three Nodes, Two Datacenters
The physical layout is deliberate:
node1 (cx23) — Nuremberg NBG1 — 10.0.0.1
node2 (cx23) — Falkenstein FSN1 — 10.0.0.2
node3 (cx23) — Falkenstein FSN1 — 10.0.0.3
Floating IP — 65.21.x.x — client entry point
All three nodes are Swarm managers. Running three managers gives you a raft quorum that survives a single node loss; with a floating IP as the VIP, clients always hit a live node. The private 10.0.0.0/24 network carries all inter-node traffic — etcd replication, PostgreSQL WAL streaming, GlusterFS brick sync — keeping it off the public interface.
VIP Failover Without keepalived
Hetzner’s Floating IPs can be reassigned via API in under two seconds. Instead of running keepalived or a full VRRP stack, a minimal Bash container does the job.
One critical detail: Hetzner does not automatically unassign a floating IP when a server crashes. The IP stays pointing at the dead server’s ID. A naive check for owner == "null" will never fire during a real failure. You must actively test whether the current owner is reachable:
```bash
#!/bin/bash
# Runs on every manager node. Each node checks whether the floating IP owner
# is alive, and claims it if the owner is unreachable.
# Note: The metadata path is /hetzner/v1/, not /hcloud/v1/.
METADATA_URL="http://169.254.169.254/hetzner/v1/metadata/instance-id"
HCLOUD_API="https://api.hetzner.cloud/v1"
FLOATING_IP_ID="${FLOATING_IP_ID}"
POLL_INTERVAL=5

current_instance() {
    curl -sf "$METADATA_URL"
}

floating_ip_owner() {
    curl -sf -H "Authorization: Bearer $HCLOUD_TOKEN" \
        "$HCLOUD_API/floating_ips/$FLOATING_IP_ID" \
        | jq -r '.floating_ip.server'
}

assign_to_self() {
    local self_id
    self_id=$(current_instance)
    curl -sf -X POST \
        -H "Authorization: Bearer $HCLOUD_TOKEN" \
        -H "Content-Type: application/json" \
        -d "{\"server\": $self_id}" \
        "$HCLOUD_API/floating_ips/$FLOATING_IP_ID/actions/assign"
}

while true; do
    owner=$(floating_ip_owner)
    self=$(current_instance)
    if [[ "$owner" == "null" || -z "$owner" ]]; then
        echo "Floating IP unassigned, claiming..."
        sleep $((RANDOM % 5))  # jitter to avoid thundering herd on simultaneous detection
        assign_to_self
    elif [[ "$owner" != "$self" ]]; then
        # Owner is assigned to another node — check whether it's actually alive
        owner_ip=$(curl -sf -H "Authorization: Bearer $HCLOUD_TOKEN" \
            "$HCLOUD_API/servers/$owner" | jq -r '.server.public_net.ipv4.ip')
        if ! curl -sf --max-time 2 "http://$owner_ip:8404/stats" > /dev/null 2>&1; then
            echo "Owner $owner ($owner_ip) unreachable, claiming floating IP..."
            sleep $((RANDOM % 5))  # jitter
            assign_to_self
        fi
    fi
    sleep "$POLL_INTERVAL"
done
```
The jitter sleep $((RANDOM % 5)) is load-bearing. Without it, all surviving managers detect the failure simultaneously and hammer the Hetzner API in lockstep every poll cycle. The API handles concurrent assigns gracefully (last write wins), but the audit log noise is unnecessary and the behavior is fragile. Random backoff costs nothing.
This runs as a Swarm service constrained to manager nodes with replicas: 1. When a manager dies, Swarm reschedules it to a surviving manager, which detects the dead owner and claims the IP. No external dependency, no exotic tooling.
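In stack terms, the service definition looks roughly like this. This is a sketch, not the actual stack file: the image name follows the local-image convention used later in this post, the floating IP ID is a placeholder, and in practice the token belongs in a Swarm secret rather than an environment variable:

```yaml
services:
  fip-failover:
    # Any small image with bash, curl, and jq baked in; name is a placeholder.
    image: fip-failover:local
    environment:
      FLOATING_IP_ID: "123456"        # placeholder ID
      HCLOUD_TOKEN: ${HCLOUD_TOKEN}   # better: read from a Swarm secret
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager      # only managers may claim the VIP
      restart_policy:
        condition: any                # reschedule if the hosting manager dies
```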
The Database Layer: Patroni + etcd + HAProxy
This is where most self-hosted PostgreSQL HA setups fall apart. The common failure mode: you use pg_auto_failover or a simple replication setup, your primary goes down, the replica doesn’t auto-promote cleanly, and you’re manually running pg_ctl promote at 2am.
Patroni solves this properly by separating the concerns:
```mermaid
graph TD
    C["Client :5432"] --> H["HAProxy :5432"]
    H -->|"health check: GET /master → :8008<br/>HTTP 200 = leader, 503 = replica"| P1["Patroni 1<br/>PG Leader"]
    H -.->|"bypassed (replica)"| P2["Patroni 2<br/>Replica"]
    H -.->|"bypassed (replica)"| P3["Patroni 3<br/>Replica"]
    P1 <--> E["etcd cluster :2379<br/>DCS + leader election"]
    P2 <--> E
    P3 <--> E
```
The critical detail is HAProxy’s health check. It hits Patroni’s REST API on each node — only the current leader returns HTTP 200 on /master; replicas return 503. HAProxy routes all write traffic exclusively to the leader.
HAProxy 2.2+ deprecated the inline option httpchk GET /path syntax. Use the current form:
```
backend patroni_primary
    option httpchk
    http-check send meth GET uri /master
    http-check expect status 200
    server pg1 10.0.0.1:5432 check port 8008 inter 2s fall 2 rise 3
    server pg2 10.0.0.2:5432 check port 8008 inter 2s fall 2 rise 3
    server pg3 10.0.0.3:5432 check port 8008 inter 2s fall 2 rise 3
```
On failover timing: inter 2s fall 2 means two consecutive failed checks before HAProxy marks a server down — a minimum detection time of 4 seconds. Add Patroni’s etcd-based leader election (typically 5–10 seconds with default TTL settings). End-to-end failover completes in under 15 seconds, not 5. Plan your application reconnect logic accordingly.
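The arithmetic behind that budget, as a quick sanity check (the 10-second election figure is a typical upper bound for default-ish Patroni settings, not a guarantee):

```bash
#!/bin/bash
# Back-of-envelope failover budget from the HAProxy parameters above.
inter=2          # seconds between health checks (inter 2s)
fall=2           # consecutive failures before a server is marked DOWN (fall 2)
election_max=10  # assumed upper bound for Patroni/etcd leader election

detect=$((inter * fall))
total=$((detect + election_max))
echo "detection=${detect}s total=${total}s"
```

This prints `detection=4s total=14s`, which is where the "under 15 seconds, not 5" figure comes from.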
etcd Configuration
The etcd cluster pins each instance to a specific node via Swarm placement constraints. A few things to get right that are easy to get wrong:
```yaml
services:
  etcd1:
    # Use the maintained etcd-io image. quay.io/coreos/etcd is from the
    # archived CoreOS repository and should not be used for new deployments.
    image: quay.io/etcd-io/etcd:v3.5.16
    deploy:
      placement:
        constraints:
          - node.hostname == node1
    environment:
      ETCD_NAME: etcd1
      # ETCD_DATA_DIR must be set and backed by a persistent volume.
      # Without it, etcd reinitializes on every container restart and
      # Patroni loses its DCS state — a cluster-breaking condition.
      ETCD_DATA_DIR: /var/lib/etcd
      ETCD_INITIAL_CLUSTER: "etcd1=http://10.0.0.1:2380,etcd2=http://10.0.0.2:2380,etcd3=http://10.0.0.3:2380"
      # CRITICAL: ETCD_INITIAL_CLUSTER_STATE: new is for initial bootstrap only.
      # After the cluster is initialized, remove this line or set it to 'existing'.
      # Leaving 'new' in place causes etcd to attempt re-initialization on restart
      # and refuse to rejoin the running cluster.
      ETCD_INITIAL_CLUSTER_STATE: new
      ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
      ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
      ETCD_ADVERTISE_CLIENT_URLS: http://10.0.0.1:2379
      ETCD_INITIAL_ADVERTISE_PEER_URLS: http://10.0.0.1:2380
    volumes:
      - etcd1_data:/var/lib/etcd

volumes:
  etcd1_data:
```
All peer and client URLs use http://. For this architecture, where all inter-node traffic is isolated on the private 10.0.0.0/24 VLAN, unencrypted etcd is an acceptable trade-off. If your private network is shared or untrusted, configure TLS with ETCD_PEER_CERT_FILE, ETCD_PEER_KEY_FILE, ETCD_CERT_FILE, and ETCD_KEY_FILE before putting this in front of production workloads.
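For the TLS variant, the additions are a handful of environment variables. The certificate paths below are placeholders (generate and mount the PKI however you prefer); the variable names map directly to etcd's `--peer-cert-file`-style flags:

```yaml
environment:
  # Switch advertised/listen URLs to https alongside the cert configuration
  ETCD_LISTEN_PEER_URLS: https://0.0.0.0:2380
  ETCD_INITIAL_ADVERTISE_PEER_URLS: https://10.0.0.1:2380
  # Peer (node-to-node) TLS
  ETCD_PEER_CERT_FILE: /etc/etcd/pki/peer.crt
  ETCD_PEER_KEY_FILE: /etc/etcd/pki/peer.key
  ETCD_PEER_TRUSTED_CA_FILE: /etc/etcd/pki/ca.crt
  # Client (Patroni-to-etcd) TLS
  ETCD_CERT_FILE: /etc/etcd/pki/server.crt
  ETCD_KEY_FILE: /etc/etcd/pki/server.key
  ETCD_TRUSTED_CA_FILE: /etc/etcd/pki/ca.crt
```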
PostgreSQL itself runs in global deployment mode — one container per node — with dnsrr endpoint mode. This bypasses Swarm’s virtual IP for the service and lets HAProxy connect directly to each node’s PostgreSQL port. Standard Swarm VIPs would route through the overlay mesh and interfere with HAProxy’s direct health checks.
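In compose terms, that combination is two keys under `deploy` (service and network names here are placeholders, not the actual stack file):

```yaml
services:
  patroni:
    image: patroni-pg17:local
    deploy:
      mode: global              # exactly one Patroni/PostgreSQL task per node
      endpoint_mode: dnsrr      # DNS round-robin: no Swarm VIP in front
    networks:
      - pg_net
```

With `dnsrr`, a DNS lookup of the service name returns each task's overlay IP directly, which is what lets HAProxy health-check individual PostgreSQL instances instead of a load-balanced VIP.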
GlusterFS: The Shared Brain
The fundamental challenge with multi-node Swarm is that containers can schedule anywhere, but configuration files, TLS certificates, and secrets need to be consistent across all nodes. You have three options:
- Swarm secrets/configs — great for small, infrequently-changing data; bad for a directory tree with dynamic cert renewals
- NFS — single point of failure, latency, locking issues
- GlusterFS replica volume — distributed, survives single-node failure, presents as a standard POSIX mount
The setup uses a replica-3 volume with one brick per node:
```bash
# Run once from node1 after GlusterFS peers are established
gluster volume create swarm_vol replica 3 \
    node1:/data/gluster/brick \
    node2:/data/gluster/brick \
    node3:/data/gluster/brick

# Restricts FUSE client connections to localhost only.
# This does NOT affect internal peer replication — bricks communicate
# over the GlusterFS peer protocol on port 24007, which auth.allow does not control.
gluster volume set swarm_vol auth.allow 127.0.0.1

gluster volume start swarm_vol

# Mount on each node (via /etc/fstab)
# localhost:/swarm_vol /mnt/netbird_shared glusterfs defaults,_netdev 0 0
```
Using localhost:/swarm_vol for the mount is intentional. The GlusterFS FUSE client connects to the local brick daemon, which handles replication internally. If network partitioning isolates a node, that node’s local mount continues to serve reads from its local brick — the VPN plane keeps functioning while writes block waiting for quorum.
Everything service-related lives here: docker-compose.yaml, Traefik’s ACME JSON, Zitadel configs, NetBird management configs, and the login-client.pat token.
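Since everything depends on that mount being real, a small pre-deploy guard is worth adding (my suggestion, not part of the original setup): if the GlusterFS mount ever fails, a bare local directory at the same path will happily accept writes that never replicate.

```bash
#!/bin/bash
# check_mount prints "mounted" or "not-mounted" for a given path.
# Run it against /mnt/netbird_shared before `docker stack deploy` and abort
# on "not-mounted" to avoid writing into an unreplicated local directory.
check_mount() {
  if mountpoint -q "$1"; then
    echo "mounted"
  else
    echo "not-mounted"
  fi
}

check_mount /mnt/netbird_shared
```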
Traefik Routing: Priority Order Matters
The routing configuration has a specific priority ordering that took iteration to get right. Zitadel’s login UI is a Next.js app on port 3000; the Zitadel API runs gRPC on port 8080. Both sit behind the same domain, and the path prefixes overlap in non-obvious ways:
```yaml
# Zitadel root redirect — highest priority
- "traefik.http.routers.zitadel-root.rule=Host(`id.yourdomain.com`) && Path(`/`)"
- "traefik.http.routers.zitadel-root.priority=400"
- "traefik.http.routers.zitadel-root.middlewares=redirect-to-login"
# Zitadel login UI (Next.js :3000)
- "traefik.http.routers.zitadel-login.rule=Host(`id.yourdomain.com`) && PathPrefix(`/ui/v2/login`)"
- "traefik.http.routers.zitadel-login.priority=250"
- "traefik.http.routers.zitadel-login.service=zitadel-login-svc"
# Zitadel API (gRPC :8080) — h2c forwarding
- "traefik.http.routers.zitadel-api.rule=Host(`id.yourdomain.com`) && PathPrefix(`/zitadel.`)"
- "traefik.http.routers.zitadel-api.priority=200"
- "traefik.http.services.zitadel-api-svc.loadbalancer.server.scheme=h2c"
- "traefik.http.services.zitadel-api-svc.loadbalancer.server.port=8080"
# Catch-all to Zitadel API
- "traefik.http.routers.zitadel-catchall.rule=Host(`id.yourdomain.com`)"
- "traefik.http.routers.zitadel-catchall.priority=100"
```
The gRPC forwarding as h2c is the piece that trips people up. Traefik terminates TLS from the client, then forwards to the backend over HTTP/2 cleartext. If you forward as regular HTTP/1.1, gRPC connections fail with cryptic PROTOCOL_ERROR messages. The h2c scheme tells Traefik to use HTTP/2 without TLS for the upstream connection, preserving the framing gRPC requires.
NetBird Reverse Proxy and PROXY Protocol
The NetBird relay handles WireGuard UDP on 51820 and exposes management/signal on 8443. The management service needs the real client IP for connection logging, but Traefik is the actual TCP endpoint. PROXY protocol v2 solves this — with one hard constraint you must respect.
TLS passthrough and PROXY protocol are mutually exclusive. With tls.passthrough: true, Traefik forwards raw TLS bytes unchanged to the backend. It cannot prepend a PROXY protocol header to an in-flight TLS stream without corrupting the TLS ClientHello the backend receives. The config breaks silently in ways that are tedious to debug.
The correct approach is to terminate TLS at Traefik, then forward over TCP with PROXY protocol v2:
# Traefik terminates TLS via cert resolver, then forwards with PROXY protocol v2.
# The backend sees decrypted TCP bytes prefixed with a PROXY header containing
# the real client IP.
```yaml
tcp:
  routers:
    netbird-relay:
      entryPoints:
        - netbird-https
      rule: "HostSNI(`vpn.yourdomain.com`)"
      service: netbird-relay-svc
      tls:
        certResolver: letsencrypt
  services:
    netbird-relay-svc:
      loadBalancer:
        proxyProtocol:
          version: 2
        servers:
          - address: "netbird-relay:8443"
```
The backend (NetBird’s reverse proxy) must be configured to trust PROXY protocol headers from Traefik’s overlay IP — for example 172.30.0.10 if that’s Traefik’s address on the overlay network. This is configured in the backend service, not in Traefik. It prevents clients from spoofing their source IP by injecting PROXY protocol headers directly at the entrypoint.
The internal communication between the reverse proxy and the management server uses http://netbird-server:80 deliberately. Direct overlay DNS resolution avoids hairpin NAT — traffic stays inside the overlay network instead of exiting through the host and coming back in on the public IP.
Deployment: No Registry Required
The Patroni image is built locally because no maintained public image pins both Patroni and PostgreSQL 17 at the exact versions needed. The deployment script avoids a private registry entirely:
```bash
#!/bin/bash
# deploy.sh (abbreviated)
set -euo pipefail

NODES=("node1" "node2" "node3")
IMAGE_NAME="patroni-pg17:local"

echo "Building Patroni/PG17 image..."
docker build -t "$IMAGE_NAME" ./patroni/

echo "Exporting image..."
docker save "$IMAGE_NAME" | gzip > /tmp/patroni-pg17.tar.gz

echo "Distributing to nodes..."
for node in "${NODES[@]}"; do
    echo " -> $node"
    scp /tmp/patroni-pg17.tar.gz "$node:/tmp/"
    ssh "$node" "docker load < /tmp/patroni-pg17.tar.gz"
done

echo "Deploying stack..."
docker stack deploy -c /mnt/netbird_shared/docker-compose.yaml netbird
```
docker save | gzip | scp | docker load is inelegant but completely self-contained. For a stack this size, the tar approach is faster to set up than running a registry and managing authentication. The image is ~400MB compressed; transfer to two remote nodes takes about 45 seconds on Hetzner’s internal network.
SQLite to PostgreSQL Migration
Users running NetBird's default SQLite backend can migrate using the pgloader Docker image. Rather than juggling WITH and CAST clauses as individual command-line flags, pass a complete load command on stdin via a heredoc — the command-file syntax is easier to read and to keep in version control:
```bash
#!/bin/bash
# migrate_sqlite_to_postgres.sh
set -euo pipefail

STORE_DB="${1:-/mnt/netbird_shared/netbird/store.db}"
MGMT_DB="${2:-/mnt/netbird_shared/netbird/management.db}"
PG_DSN="postgresql://netbird:${NETBIRD_DB_PASSWORD}@haproxy:5432/netbird"

run_migration() {
    local sqlite_path="$1"
    local label="$2"
    # -i is required so the heredoc actually reaches pgloader's stdin inside
    # the container. The container must also be able to resolve the DSN host
    # (attach it to the overlay network, or point the DSN at a reachable address).
    docker run --rm -i \
        -v "$(dirname "$sqlite_path"):/data" \
        dimitri/pgloader:latest \
        pgloader /dev/stdin <<EOF
LOAD DATABASE
    FROM sqlite:///data/$(basename "$sqlite_path")
    INTO $PG_DSN
WITH include drop, create tables
CAST type text to text using remove-null-characters,
     type blob to bytea;
EOF
    echo "$label migration complete"
}

run_migration "$STORE_DB" "store.db"
run_migration "$MGMT_DB" "management.db"
```
The remove-null-characters cast is a safe defensive inclusion: PostgreSQL rejects null bytes in text columns outright, and if you encounter them in migrated data, this cast handles it transparently. The blob to bytea cast handles binary session data columns.
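For intuition, the cast does to each text value essentially what this shell one-liner does (an illustrative equivalent, not pgloader internals):

```bash
#!/bin/bash
# Strip NUL bytes the way remove-null-characters does for text columns.
# PostgreSQL rejects any text value containing \0 outright.
strip_nulls() {
  tr -d '\000'
}

printf 'ab\000cd' | strip_nulls   # emits "abcd"
```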
When to Use This Architecture
| Good fit | Bad fit |
|---|---|
| <50 services | Hundreds of microservices |
| Small ops team (1–3 people) | Multiple teams owning infra |
| Cost-sensitive | Need advanced scheduling |
| Stable, infrequent deploys | Frequent rolling deploys |
| Strong HA requirements | Need custom CRDs/operators |
| No Kubernetes expertise | Existing k8s investment |
Docker Swarm’s operational surface is dramatically smaller than Kubernetes. There’s no etcd to manage separately (outside of Patroni’s etcd — which you’d need regardless), no kubelet to debug, no CNI plugin to misconfigure. The entire cluster state is in docker stack ls and docker service ls. When something breaks at 2am, the debugging path is docker service logs netbird_management — not parsing 14 layers of Kubernetes events.
The genuine limitation is that Swarm doesn’t support StatefulSets, PodDisruptionBudgets, or custom controllers. For PostgreSQL HA, you compensate with Patroni (which predates Kubernetes-native Postgres operators anyway). For shared storage, GlusterFS fills the gap. These are solved problems that add real complexity, but the complexity is bounded and understandable.
Operational Checklist
Before running this in production, verify:
- etcd cluster health: `etcdctl endpoint health --cluster`
- Patroni leader election (run from inside the container): `docker exec -it $(docker ps -q -f name=netbird_patroni1.1) patronictl -c /etc/patroni/patroni.yml list`
- GlusterFS volume status: `gluster volume status swarm_vol`
- HAProxy stats: `curl http://localhost:8404/stats`
- Floating IP assignment: `hcloud floating-ip list`
- Traefik ACME cert renewal: check `acme.json` on the GlusterFS mount
- Test failover: `docker node update --availability drain node1`, then verify HAProxy reroutes in under 15 seconds
Two things to monitor specifically: Patroni’s REST API (/patroni endpoint returns JSON with full cluster state including replication lag and timeline) and etcd’s health endpoint. These surface real issues that Swarm’s container-level health checks will not catch — a container can be running while etcd has lost quorum and Patroni is stuck in a read-only state.
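A minimal scrape of that endpoint might look like the following. The JSON here is a hand-written sample shaped like a Patroni response (field names and the `master` role string are assumptions; check your Patroni version's output); in production, replace the `echo` with `curl -s http://<node>:8008/patroni`.

```bash
#!/bin/bash
# Pull the fields worth alerting on out of a /patroni response with jq.
sample='{"state":"running","role":"master","timeline":3}'
echo "$sample" | jq -r '"role=\(.role) state=\(.state) timeline=\(.timeline)"'
```

Feed the one-line summary into whatever alerting you already run; `role` flapping or `state` leaving `running` is the signal that Swarm's own health checks will miss.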
The full stack — mesh VPN control plane, HA database cluster, distributed storage, L7 ingress, OIDC provider — runs stably on hardware that costs less than a single managed Kubernetes node on any major cloud. That's the argument for Swarm in 2026: not that it's better than Kubernetes, but that it's the right tool for workloads where you want infrastructure complexity proportional to actual requirements.