
Docker in 2026: Running AI Workloads, Model Services, and the Container Runtime Reality Check

Five years ago, the Docker pitch was simple: wrap your stateless Node.js service in a 100MB Alpine image, push it to a registry, and ship it. The mental model fit on a napkin. Today, I’m pulling 50GB model files into containers, managing GPU passthrough, and writing Compose configs with deploy.resources.reservations.devices blocks that would have looked like science fiction in 2021.

The container runtime hasn’t fundamentally changed — but what we’re putting inside containers has. And that gap between “Docker still works the same” and “your old Dockerfile patterns are completely wrong for this” is where teams are silently bleeding operational hours.

This post is about that gap.

The Shifting Payload Problem

Traditional microservices containers have predictable characteristics: small image size, fast startup, stateless, CPU-bound. You could spin up 50 of them on a single 8-core node with 32GB RAM without breaking a sweat.

AI model services break every one of those assumptions:

| Characteristic | Traditional Microservice | AI Model Service |
| --- | --- | --- |
| Image size | 50-500 MB | 5-80 GB |
| Cold start | <2 seconds | 30-300 seconds |
| GPU required | No | Often yes |
| Memory footprint | 128 MB - 2 GB | 8-80 GB VRAM |
| Scaling trigger | CPU / RPS | Queue depth / GPU util |
| Idle behavior | Scale to zero | Keep warm ($$) |
| Canary metric | Error rate | Model accuracy |

Your auto-scaling logic, health checks, readiness probes, and resource limits all need rethinking when you move from one paradigm to the other. The infra is the same; the operational model is different.
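The shift in scaling trigger is worth making concrete. A scaler for model workers should react to queue depth and GPU utilization, never CPU or request rate. A minimal sketch of that decision logic; the thresholds are illustrative assumptions, not defaults from any tool:

```python
# Sketch of a queue-depth-driven scaling decision for model workers.
# Thresholds are illustrative assumptions, not defaults from any tool.

def desired_replicas(current: int, queue_depth: int, gpu_util: float,
                     scale_up_queue: int = 8, scale_down_util: float = 0.30,
                     min_replicas: int = 1, max_replicas: int = 4) -> int:
    """Scale on queue depth and GPU utilization, never on CPU or RPS."""
    if queue_depth > scale_up_queue:
        # Requests are piling up faster than the GPU can drain them.
        return min(current + 1, max_replicas)
    if queue_depth == 0 and gpu_util < scale_down_util:
        # Idle, but never below min_replicas: cold starts take minutes,
        # so scale-to-zero is rarely an option for model services.
        return max(current - 1, min_replicas)
    return current
```

In practice the inputs would come from the serving layer's own metrics (for vLLM, `vllm:num_requests_waiting`), not from the container runtime's CPU counters.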

[Figure: Docker AI workload architecture on a multi-GPU node]

Container Runtimes and GPU Passthrough: Getting Your Hands Dirty

Before anything else, you need the NVIDIA Container Toolkit installed and configured on the host. This is the part that breaks silently if you get it wrong — the container starts, but your model loads on CPU and you don’t notice until you wonder why inference is taking 40x longer than expected.

# Install NVIDIA Container Toolkit (Ubuntu/Debian)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Configure the Docker daemon to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify -- you should see your GPU listed
docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi

Once that’s in place, here’s a production-grade Compose file for running a vLLM inference server — the kind of thing you’d actually deploy for serving a quantized Llama or Mistral variant:

# docker-compose.yml
services:
  vllm-inference:
    image: vllm/vllm-openai:latest  # pin a specific tag in production
    # runtime: nvidia is the legacy pre-Compose-v2 approach.
    # The deploy.resources block below is the correct modern equivalent.
    # Both are kept here for compatibility with older Docker Engine versions
    # that don't process the deploy key outside of Swarm mode.
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      # HF_TOKEN is the current recommended variable (HUGGING_FACE_HUB_TOKEN is legacy)
      - HF_TOKEN=${HF_TOKEN}
    volumes:
      - model-cache:/root/.cache/huggingface
      - ./config/vllm:/app/config:ro
    ports:
      - "8000:8000"
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --quantization awq
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --served-model-name mistral-7b
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s  # Unhealthy results during start_period don't trigger restarts
    restart: unless-stopped

  model-proxy:
    image: nginx:alpine
    volumes:
      - ./config/nginx/model-proxy.conf:/etc/nginx/conf.d/default.conf:ro
    ports:
      - "80:80"
    depends_on:
      vllm-inference:
        condition: service_healthy

volumes:
  model-cache:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/model-cache  # Put this on your fastest storage

The start_period: 120s on the healthcheck is not a typo. A 7B parameter model with AWQ quantization still takes 45-90 seconds to load on a decent GPU. Docker’s default start_period is 0s — meaning failed health checks count against the restart threshold immediately, and your orchestrator will mark the container unhealthy before it finishes initializing. Set it to at least 2x your measured cold start time.
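A small helper makes the sizing rule explicit. This is a sketch under the assumption that you have a few measured cold-start samples (e.g. pulled from container logs); err on the high side, since a too-long start_period only delays failure detection, while a too-short one causes restart loops:

```python
import math

# Sketch: derive a healthcheck start_period from measured cold starts.
# The 2x safety factor follows the rule of thumb above; adjust to taste.

def start_period_seconds(cold_start_samples: list[float],
                         safety_factor: float = 2.0) -> int:
    """Return at least safety_factor x the worst observed cold start,
    rounded up to the next 10 seconds."""
    worst = max(cold_start_samples)
    return math.ceil(worst * safety_factor / 10) * 10
```

With measured cold starts of 48s, 62s, and 78s this yields `start_period: 160s`.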

The Image Layering Problem

Standard Docker layer caching advice doesn’t hold when your model is the dominant layer. Putting a 40GB model into an image via COPY or ADD means every rebuild triggers a 40GB push and pull. This is the pattern that kills CI pipelines.

The correct approach: separate the model from the runtime.

# Dockerfile -- runtime only, no model weights
# Using runtime (not cudnn8) base: cuDNN is for training, not inference.
# Dropping it saves ~500MB from the base image.
FROM nvidia/cuda:12.3.0-runtime-ubuntu22.04

# Ubuntu 22.04 ships Python 3.10. Pin to it explicitly.
RUN apt-get update && apt-get install -y \
    python3.10 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ./src/

# Model weights are mounted at runtime, not baked in
ENV MODEL_PATH=/models
ENV HF_HOME=/cache/huggingface

ENTRYPOINT ["python3.10", "src/serve.py"]

# Pull model weights to the host once (e.g. with huggingface-cli download),
# then mount them into the container at runtime:
docker run \
  -v /data/models:/models \
  -v /data/hf-cache:/cache/huggingface \
  --gpus all \
  my-inference-server:latest

This keeps your images small (measured in hundreds of MB, not tens of GB), your CI fast, and your model versions managed independently from your runtime versions. Use huggingface-cli or a dedicated model registry to version-control the weights separately.
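Once the weights live outside the image, it pays to pin and verify them explicitly rather than trusting whatever sits in the mount. A minimal sketch of a version-manifest lookup plus streaming checksum; the manifest layout, names, and paths are hypothetical, substitute your own model registry:

```python
import hashlib
from pathlib import Path

# Hypothetical manifest: model name -> version -> host path + checksum.
# In practice this file would live in git next to your Compose configs.
MANIFEST = {
    "mistral-7b": {
        "v0.3-awq": {
            "path": "/data/models/mistral-7b/v0.3-awq",
            "sha256": "0" * 64,  # placeholder digest, not a real checksum
        }
    }
}

def resolve_model(manifest: dict, name: str, version: str) -> dict:
    """Fail loudly on an unknown version instead of falling back to latest."""
    try:
        return manifest[name][version]
    except KeyError:
        raise KeyError(f"model {name}:{version} not in manifest") from None

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so a 40GB weights shard never loads into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Verifying the digest at container startup turns "someone overwrote the shared model volume" from a silent accuracy regression into an immediate, loud failure.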

Architecture: Multi-Service AI Stack

Here’s how a realistic 2026 inference platform looks when you compose these pieces together:

graph TB
    Client["Client / API Consumer"]
    Gateway["API Gateway<br/>(nginx / traefik)"]
    RateLimit["Rate Limiter<br/>(Redis-backed)"]

    subgraph Inference Cluster
        LB["Load Balancer<br/>(round-robin)"]
        Worker1["vLLM Worker 1<br/>(GPU 0)"]
        Worker2["vLLM Worker 2<br/>(GPU 1)"]
        Worker3["vLLM Worker 3<br/>(GPU 2)"]
    end

    subgraph Storage
        ModelVol["Model Volume<br/>(/data/models)"]
        Cache["Redis<br/>(KV Cache)"]
        VectorDB["Weaviate<br/>(Vector DB)"]
    end

    subgraph Observability
        Prometheus["Prometheus"]
        Grafana["Grafana"]
        Jaeger["Jaeger<br/>(Tracing)"]
    end

    Client --> Gateway
    Gateway --> RateLimit
    RateLimit --> LB
    LB --> Worker1
    LB --> Worker2
    LB --> Worker3
    Worker1 & Worker2 & Worker3 --> ModelVol
    Worker1 & Worker2 & Worker3 --> Cache
    Worker1 & Worker2 & Worker3 --> VectorDB
    Worker1 & Worker2 & Worker3 --> Prometheus
    Prometheus --> Grafana
    Worker1 & Worker2 & Worker3 --> Jaeger

Container Lifecycle Management for AI Workloads

The slow startup problem requires a different approach to rolling updates. With stateless microservices, the standard rolling update — kill old, start new, wait for health — is a 5-second operation. With model services, you’re looking at 2-5 minutes per replica.

This changes your deployment strategy significantly:

sequenceDiagram
    participant Orchestrator
    participant OldWorker as Old Worker (v1)
    participant NewWorker as New Worker (v2)
    participant LB as Load Balancer

    Orchestrator->>NewWorker: Start v2 container
    Note over NewWorker: Loading model weights (45-90s)
    NewWorker-->>Orchestrator: /health returns 200
    Orchestrator->>LB: Add v2 to rotation (10% traffic)
    Note over LB: Canary phase: monitor accuracy metrics
    LB->>NewWorker: Route 10% of requests
    LB->>OldWorker: Route 90% of requests
    Note over Orchestrator: Wait for accuracy baseline match
    Orchestrator->>LB: Shift to 50% / 50%
    Note over Orchestrator: Monitor for 5 min
    Orchestrator->>LB: Shift 100% to v2
    Orchestrator->>OldWorker: Drain connections
    Orchestrator->>OldWorker: Stop container

Notice that the canary metric here isn’t error rate — it’s model accuracy. A model that returns a 200 with a confidently wrong answer is worse than a model that returns a 500. Your health checks need to be application-aware, not just HTTP-alive.

A simple accuracy probe implementation — using only the standard library to avoid hidden dependency issues:

# src/healthcheck.py
import json
import sys
import urllib.request
import urllib.error

PROBE_MESSAGES = [{"role": "user", "content": "What is 2 + 2?"}]
EXPECTED_ANSWER_FRAGMENT = "4"

# Mistral-7B-Instruct is a chat model: use /v1/chat/completions, not /v1/completions.
# The legacy completions endpoint works in vLLM but produces unreliable results
# for instruct models without careful prompt formatting.
INFERENCE_URL = "http://localhost:8000/v1/chat/completions"

def check_model_accuracy() -> bool:
    payload = json.dumps({
        "model": "mistral-7b",
        "messages": PROBE_MESSAGES,
        "max_tokens": 20,
        "temperature": 0.0,
    }).encode()

    req = urllib.request.Request(
        INFERENCE_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            body = json.loads(resp.read())
            output = body["choices"][0]["message"]["content"]
            return EXPECTED_ANSWER_FRAGMENT in output
    except (urllib.error.URLError, KeyError, json.JSONDecodeError):
        return False

if __name__ == "__main__":
    sys.exit(0 if check_model_accuracy() else 1)

# In your Dockerfile
COPY src/healthcheck.py /app/healthcheck.py
HEALTHCHECK --interval=60s --timeout=15s --start-period=120s --retries=3 \
  CMD python3.10 /app/healthcheck.py

Resource Limits: Where Teams Get Burned

The most common operational mistake I see is either no resource limits at all (one runaway container triggers OOM kills that take down its neighbors) or CPU-style limits applied to GPU workloads (useless — Docker has no native GPU throttling).

For GPU memory specifically, you manage limits at the application level via CLI arguments, not the container level and not via environment variables. vLLM does not read a GPU_MEMORY_UTILIZATION env var — that value is passed through --gpu-memory-utilization in the command, as shown in the Compose command: block above:

# Correct: GPU memory utilization is a CLI argument, not an env var.
# vLLM ignores any env var you set for this -- it will silently do nothing.
command: >
  --model mistralai/Mistral-7B-Instruct-v0.3
  --gpu-memory-utilization 0.85
  ...

# If you push this to 0.95+, you will get OOM errors under load
# that are very hard to reproduce locally. 0.85-0.90 is a safe ceiling
# for production workloads; the remaining VRAM headroom absorbs
# CUDA context overhead and memory fragmentation.

For CPU and RAM, standard Docker limits apply and you should absolutely set them:

deploy:
  resources:
    limits:
      cpus: '4'
      memory: 16G
    reservations:
      cpus: '2'
      memory: 8G
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]

The reservations block ensures the scheduler knows you need a GPU before placing the container. Without it, you’ll get placement on a node without a GPU and a cryptic failure at runtime.

The Observability Layer

AI workloads need a different metric set than traditional services. Beyond the standard HTTP metrics, you want:

# prometheus.yml scrape config addition
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['vllm-inference:8000']
    metrics_path: '/metrics'
    # vLLM exposes these natively:
    # vllm:gpu_cache_usage_perc
    # vllm:num_requests_running
    # vllm:num_requests_waiting
    # vllm:avg_prompt_throughput_toks_per_s
    # vllm:avg_generation_throughput_toks_per_s
    # vllm:time_to_first_token_seconds (histogram)

Time-to-first-token (TTFT) is your primary SLO metric for interactive use cases. Track its P95 and P99 — not just average. A model that’s fast 95% of the time but takes 30 seconds on the tail will destroy user experience.
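If your metrics pipeline doesn't already compute percentiles, they're cheap to do over raw samples. A nearest-rank sketch, with a synthetic TTFT distribution showing how an average hides exactly the tail described above:

```python
import math

# Sketch: nearest-rank percentile over raw TTFT samples (seconds).

def percentile(samples: list[float], p: float) -> float:
    ordered = sorted(samples)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]

# Synthetic distribution: 95 fast requests at 0.4s, 5 stuck for 30s.
ttft = [0.4] * 95 + [30.0] * 5
mean = sum(ttft) / len(ttft)   # ~1.9s: looks fine on a dashboard
p99 = percentile(ttft, 99)     # 30.0s: the number your users feel
```

The mean here is under two seconds while the P99 is thirty: alert on the percentile, not the average.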

[Figure: Grafana dashboard showing GPU utilization, TTFT distribution, and queue depth]

When Not to Use Docker for This

There are scenarios where Docker alone is the wrong tool, and containerizing your model infrastructure just adds friction without benefit:

Skip Docker when:

  • You’re doing active model training (not inference) — run training directly on bare metal or in a GPU-passthrough VM; every extra layer of abstraction costs you training time and money
  • Your model fits in a Jupyter notebook and your “deployment” is one data scientist on one machine — don’t build infrastructure for problems you don’t have
  • You need MIG (Multi-Instance GPU) partitioning for running multiple small models on one GPU — this works with Docker but requires careful nvidia-ctk configuration that most teams aren’t ready for

Use Docker when:

  • You’re serving models in a multi-tenant environment alongside other workloads
  • You need reproducible model serving environments across dev/staging/prod
  • You’re building a platform that other teams will consume
  • You want isolation between model versions during canary deployments

The Practical Checklist

Before you ship a model service container to production:

  • Health check start_period is at least 2x your measured cold start time (Docker default is 0s — don’t leave it unset)
  • Model weights are mounted as a volume, not baked into the image
  • --gpu-memory-utilization CLI arg is set to ≤0.90 in your command: block (not as an env var — vLLM ignores env vars for this)
  • CPU and RAM limits are set (use your P99 observed usage + 20% headroom)
  • TTFT histogram metrics are being scraped and dashboarded
  • Your rolling update strategy accounts for 2-5 minute startup times
  • You have a warm-up request (or readiness probe) that actually exercises the model, not just checks HTTP 200
  • Your accuracy probe uses /v1/chat/completions for instruct/chat models, not the legacy /v1/completions endpoint
  • Log verbosity is tuned — vLLM at DEBUG level will fill your disk in hours under load

The containers are still the same. The defaults are just wrong for this class of workload, and the cost of getting it wrong is measured in dropped requests, runaway GPU bills, and on-call pages at 2am.

Adjust the defaults. Check the metrics. Ship the model.