Five years ago, the Docker pitch was simple: wrap your stateless Node.js service in a 100MB Alpine image, push it to a registry, and ship it. The mental model fit on a napkin. Today, I’m pulling 50GB model files into containers, managing GPU passthrough, and writing Compose configs with deploy.resources.reservations.devices blocks that would have looked like science fiction in 2021.
The container runtime hasn’t fundamentally changed — but what we’re putting inside containers has. And that gap between “Docker still works the same” and “your old Dockerfile patterns are completely wrong for this” is where teams are silently bleeding operational hours.
This post is about that gap.
## The Shifting Payload Problem
Traditional microservices containers have predictable characteristics: small image size, fast startup, stateless, CPU-bound. You could spin up 50 of them on a single 8-core node with 32GB RAM without breaking a sweat.
AI model services break every one of those assumptions:
| Characteristic | Traditional Microservice | AI Model Service |
|---|---|---|
| Image size | 50-500 MB | 5-80 GB |
| Cold start | <2 seconds | 30-300 seconds |
| GPU required | No | Often yes |
| Memory footprint | 128 MB - 2 GB | 8-80 GB VRAM |
| Scaling trigger | CPU / RPS | Queue depth / GPU util |
| Idle behavior | Scale to zero | Keep warm ($$) |
| Canary metric | Error rate | Model accuracy |
Your auto-scaling logic, health checks, readiness probes, and resource limits all need rethinking when you move from one paradigm to the other. The infra is the same; the operational model is different.
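Because the scaling trigger moves from CPU/RPS to queue depth, even a minimal autoscaler has to read the inference engine's metrics rather than the node's. Here is a sketch against vLLM's Prometheus-format `/metrics` endpoint; the thresholds and helper names are illustrative assumptions, not vLLM API:

```python
import urllib.request

# Hypothetical thresholds -- tune against your own traffic and SLOs.
SCALE_UP_WAITING = 8    # queued requests per worker before adding a replica
SCALE_DOWN_WAITING = 1

def fetch_metrics(url: str = "http://localhost:8000/metrics") -> str:
    """Pull the raw Prometheus-format metrics dump from a vLLM worker."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode()

def parse_metric(metrics_text: str, name: str) -> float:
    """Extract a single gauge value from Prometheus text format."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])
    raise KeyError(name)

def scaling_decision(metrics_text: str) -> int:
    """+1 = add a replica, -1 = remove one, 0 = hold. Keyed on queue depth, not CPU."""
    waiting = parse_metric(metrics_text, "vllm:num_requests_waiting")
    if waiting >= SCALE_UP_WAITING:
        return 1
    if waiting <= SCALE_DOWN_WAITING:
        return -1
    return 0
```

The point is the signal, not the loop: a CPU-based autoscaler will happily keep one GPU worker at 40% CPU while its request queue grows without bound.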

## Container Runtimes and GPU Passthrough: Getting Your Hands Dirty
Before anything else, you need the NVIDIA Container Toolkit installed and configured on the host. This is the part that breaks silently if you get it wrong — the container starts, but your model loads on CPU and you don’t notice until you wonder why inference is taking 40x longer than expected.
```bash
# Install NVIDIA Container Toolkit (Ubuntu/Debian)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Configure the Docker daemon to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify -- you should see your GPU listed
docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi
```
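The silent CPU fallback is worth catching at container start rather than in your latency graphs. A startup guard can refuse to boot when no GPU is visible. This is a sketch: `parse_gpu_list` and the exit messages are my own naming, not part of the NVIDIA toolkit:

```python
import subprocess
import sys

def parse_gpu_list(nvidia_smi_output: str) -> list:
    """Return the 'GPU N: ...' lines from `nvidia-smi -L` output."""
    return [line for line in nvidia_smi_output.splitlines() if line.startswith("GPU ")]

def assert_gpu_visible() -> None:
    """Exit nonzero if the container can't see a GPU, failing fast and loudly."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        sys.exit("nvidia-smi unusable inside the container: GPU passthrough is broken")
    if not parse_gpu_list(out.stdout):
        sys.exit("nvidia-smi ran but listed no GPUs: the model would load on CPU")
```

Call `assert_gpu_visible()` at the top of your serving entrypoint, before any model import, so a misconfigured host fails the health check instead of serving 40x-slower CPU inference.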
Once that’s in place, here’s a production-grade Compose file for running a vLLM inference server — the kind of thing you’d actually deploy for serving a quantized Llama or Mistral variant:
```yaml
# docker-compose.yml
services:
  vllm-inference:
    image: vllm/vllm-openai:latest
    # runtime: nvidia is the legacy pre-Compose-v2 approach.
    # The deploy.resources block below is the correct modern equivalent.
    # Both are kept here for compatibility with older Docker Engine versions
    # that don't process the deploy key outside of Swarm mode.
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      # HF_TOKEN is the current recommended variable (HUGGING_FACE_HUB_TOKEN is legacy)
      - HF_TOKEN=${HF_TOKEN}
    volumes:
      - model-cache:/root/.cache/huggingface
      - ./config/vllm:/app/config:ro
    ports:
      - "8000:8000"
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --quantization awq
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --served-model-name mistral-7b
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s  # Failed checks during start_period don't count toward retries
    restart: unless-stopped

  model-proxy:
    image: nginx:alpine
    volumes:
      - ./config/nginx/model-proxy.conf:/etc/nginx/conf.d/default.conf:ro
    ports:
      - "80:80"
    depends_on:
      vllm-inference:
        condition: service_healthy

volumes:
  model-cache:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/model-cache  # Put this on your fastest storage
```
The `start_period: 120s` on the healthcheck is not a typo. A 7B-parameter model with AWQ quantization still takes 45-90 seconds to load on a decent GPU. Docker's default `start_period` is `0s`, meaning failed health checks count against the retry threshold immediately, and your orchestrator will mark the container unhealthy before it finishes initializing. Set it to at least 2x your measured cold start time.
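The Compose file above mounts `./config/nginx/model-proxy.conf` without showing it. A minimal sketch of what belongs in it, assuming the service names above; the upstream name is arbitrary and the timeouts are assumptions to tune:

```nginx
upstream vllm_backend {
    server vllm-inference:8000;
}

server {
    listen 80;

    location / {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_read_timeout 300s;  # long generations exceed nginx's 60s default
        proxy_buffering off;      # required for token streaming (SSE) to reach clients incrementally
    }
}
```

The two non-default settings matter for LLM traffic specifically: nginx's 60-second read timeout will sever long generations, and response buffering turns a token stream into one delayed blob.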
## The Image Layering Problem
Standard Docker layer caching advice doesn’t hold when your model is the dominant layer. Putting a 40GB model into an image via COPY or ADD means every rebuild triggers a 40GB push and pull. This is the pattern that kills CI pipelines.
The correct approach: separate the model from the runtime.
```dockerfile
# Dockerfile -- runtime only, no model weights
# Using runtime (not cudnn8) base: cuDNN is for training, not inference.
# Dropping it saves ~500MB from the base image.
FROM nvidia/cuda:12.3.0-runtime-ubuntu22.04

# Ubuntu 22.04 ships Python 3.10. Pin to it explicitly.
RUN apt-get update && apt-get install -y \
    python3.10 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY src/ ./src/

# Model weights are mounted at runtime, not baked in
ENV MODEL_PATH=/models
ENV HF_HOME=/cache/huggingface

ENTRYPOINT ["python3.10", "src/serve.py"]
```
```bash
# Pull model weights to host once, mount at runtime
docker run \
  -v /data/models:/models \
  -v /data/hf-cache:/cache/huggingface \
  --gpus all \
  my-inference-server:latest
```
This keeps your images small (measured in hundreds of MB, not tens of GB), your CI fast, and your model versions managed independently of your runtime versions. Use `huggingface-cli` or a dedicated model registry to version-control the weights separately.
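To keep host-side weight paths deterministic, a helper like the following works. The layout is an illustrative convention of mine, not something `huggingface_hub` prescribes:

```python
from pathlib import Path

def model_cache_path(base: str, model_id: str, revision: str) -> Path:
    """Map a Hub model ID plus revision to a stable host directory, e.g.
    mistralai/Mistral-7B-Instruct-v0.3 @ main ->
    /data/models/mistralai--Mistral-7B-Instruct-v0.3/main
    """
    return Path(base) / model_id.replace("/", "--") / revision

# Download once on the host (requires `pip install huggingface_hub`):
#
#   from huggingface_hub import snapshot_download
#   snapshot_download(
#       "mistralai/Mistral-7B-Instruct-v0.3",
#       revision="main",
#       local_dir=model_cache_path("/data/models",
#                                  "mistralai/Mistral-7B-Instruct-v0.3", "main"),
#   )
```

Pinning the revision into the path means two workers can never silently disagree about which weights "main" resolved to on the day they pulled.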
## Architecture: Multi-Service AI Stack
Here’s how a realistic 2026 inference platform looks when you compose these pieces together:
```mermaid
graph TB
    Client["Client / API Consumer"]
    Gateway["API Gateway<br/>(nginx / traefik)"]
    RateLimit["Rate Limiter<br/>(Redis-backed)"]

    subgraph Inference Cluster
        LB["Load Balancer<br/>(round-robin)"]
        Worker1["vLLM Worker 1<br/>(GPU 0)"]
        Worker2["vLLM Worker 2<br/>(GPU 1)"]
        Worker3["vLLM Worker 3<br/>(GPU 2)"]
    end

    subgraph Storage
        ModelVol["Model Volume<br/>(/data/models)"]
        Cache["Redis<br/>(KV Cache)"]
        VectorDB["Weaviate<br/>(Vector DB)"]
    end

    subgraph Observability
        Prometheus["Prometheus"]
        Grafana["Grafana"]
        Jaeger["Jaeger<br/>(Tracing)"]
    end

    Client --> Gateway
    Gateway --> RateLimit
    RateLimit --> LB
    LB --> Worker1
    LB --> Worker2
    LB --> Worker3
    Worker1 & Worker2 & Worker3 --> ModelVol
    Worker1 & Worker2 & Worker3 --> Cache
    Worker1 & Worker2 & Worker3 --> VectorDB
    Worker1 & Worker2 & Worker3 --> Prometheus
    Prometheus --> Grafana
    Worker1 & Worker2 & Worker3 --> Jaeger
```
## Container Lifecycle Management for AI Workloads
The slow startup problem requires a different approach to rolling updates. With stateless microservices, the standard rolling update — kill old, start new, wait for health — is a 5-second operation. With model services, you’re looking at 2-5 minutes per replica.
This changes your deployment strategy significantly:
```mermaid
sequenceDiagram
    participant Orchestrator
    participant OldWorker as Old Worker (v1)
    participant NewWorker as New Worker (v2)
    participant LB as Load Balancer

    Orchestrator->>NewWorker: Start v2 container
    Note over NewWorker: Loading model weights (45-90s)
    NewWorker-->>Orchestrator: /health returns 200
    Orchestrator->>LB: Add v2 to rotation (10% traffic)
    Note over LB: Canary phase: monitor accuracy metrics
    LB->>NewWorker: Route 10% of requests
    LB->>OldWorker: Route 90% of requests
    Note over Orchestrator: Wait for accuracy baseline match
    Orchestrator->>LB: Shift to 50% / 50%
    Note over Orchestrator: Monitor for 5 min
    Orchestrator->>LB: Shift 100% to v2
    Orchestrator->>OldWorker: Drain connections
    Orchestrator->>OldWorker: Stop container
```
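The traffic-shifting logic in that sequence reduces to a small state machine. The step schedule and function below are assumptions for illustration, not a standard orchestrator API:

```python
# Hypothetical canary schedule mirroring the rollout above: 10% -> 50% -> 100%.
CANARY_STEPS = [0.10, 0.50, 1.00]

def next_weight(current: float, accuracy_ok: bool) -> float:
    """Advance v2's traffic share one step while the accuracy probe passes.

    Any probe failure rolls straight back to 0%: with model regressions
    you want a hard rollback, not a gradual one.
    """
    if not accuracy_ok:
        return 0.0
    for step in CANARY_STEPS:
        if step > current:
            return step
    return current  # already at 100%
```

Run this between probe intervals; the diagram's 5-minute soak at 50% corresponds to how often you call it, not to anything inside the function.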
Notice that the canary metric here isn’t error rate — it’s model accuracy. A model that returns a 200 with a confidently wrong answer is worse than a model that returns a 500. Your health checks need to be application-aware, not just HTTP-alive.
A simple accuracy probe implementation — using only the standard library to avoid hidden dependency issues:
```python
# src/healthcheck.py
import json
import sys
import urllib.error
import urllib.request

PROBE_MESSAGES = [{"role": "user", "content": "What is 2 + 2?"}]
EXPECTED_ANSWER_FRAGMENT = "4"

# Mistral-7B-Instruct is a chat model: use /v1/chat/completions, not /v1/completions.
# The legacy completions endpoint works in vLLM but produces unreliable results
# for instruct models without careful prompt formatting.
INFERENCE_URL = "http://localhost:8000/v1/chat/completions"


def check_model_accuracy() -> bool:
    payload = json.dumps({
        "model": "mistral-7b",
        "messages": PROBE_MESSAGES,
        "max_tokens": 20,
        "temperature": 0.0,
    }).encode()
    req = urllib.request.Request(
        INFERENCE_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            body = json.loads(resp.read())
        output = body["choices"][0]["message"]["content"]
        return EXPECTED_ANSWER_FRAGMENT in output
    except (urllib.error.URLError, KeyError, IndexError, json.JSONDecodeError):
        return False


if __name__ == "__main__":
    sys.exit(0 if check_model_accuracy() else 1)
```
```dockerfile
# In your Dockerfile
COPY src/healthcheck.py /app/healthcheck.py
HEALTHCHECK --interval=60s --timeout=15s --start-period=120s --retries=3 \
    CMD python3.10 /app/healthcheck.py
```
## Resource Limits: Where Teams Get Burned
The most common operational mistake I see is either no resource limits (leads to OOM kills taking down neighboring containers) or CPU-style limits applied to GPU workloads (useless — Docker doesn’t throttle GPU by default).
For GPU memory specifically, you manage limits at the application level via CLI arguments, not at the container level and not via environment variables. vLLM does not read a `GPU_MEMORY_UTILIZATION` env var; that value is passed through `--gpu-memory-utilization` in the `command:` block, as shown in the Compose file above:
```yaml
# Correct: GPU memory utilization is a CLI argument, not an env var.
# vLLM ignores any env var you set for this -- it will silently do nothing.
command: >
  --model mistralai/Mistral-7B-Instruct-v0.3
  --gpu-memory-utilization 0.85
  ...

# If you push this to 0.95+, you will get OOM errors under load
# that are very hard to reproduce locally. 0.85-0.90 is a safe ceiling
# for production workloads; the remaining VRAM headroom absorbs
# CUDA context overhead and memory fragmentation.
```
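To see why 0.85-0.90 leaves meaningful headroom, the arithmetic is simple. The weight size below is a rough assumption for an AWQ-quantized 7B model, not a measured number:

```python
def kv_cache_budget_gb(vram_gb: float, utilization: float, weights_gb: float) -> float:
    """VRAM left for the KV cache after weights, under --gpu-memory-utilization.

    vLLM pre-allocates roughly vram_gb * utilization in total; whatever the
    weights don't take becomes KV-cache space, which bounds how many
    sequences can run concurrently.
    """
    return vram_gb * utilization - weights_gb

# Rough assumption: ~4 GB of weights for an AWQ-quantized 7B model on a 24 GB card.
# kv_cache_budget_gb(24, 0.90, 4.0) leaves about 17.6 GB for KV cache, while the
# unreserved ~2.4 GB absorbs CUDA context overhead and fragmentation.
```

Pushing utilization from 0.90 to 0.95 buys only ~1.2 GB of extra cache on that card while cutting the safety margin in half, which is why the OOMs it causes only show up under load.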
For CPU and RAM, standard Docker limits apply and you should absolutely set them:
```yaml
deploy:
  resources:
    limits:
      cpus: '4'
      memory: 16G
    reservations:
      cpus: '2'
      memory: 8G
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```
The reservations block ensures the scheduler knows you need a GPU before placing the container. Without it, you’ll get placement on a node without a GPU and a cryptic failure at runtime.
## The Observability Layer
AI workloads need a different metric set than traditional services. Beyond the standard HTTP metrics, you want:
```yaml
# prometheus.yml scrape config addition
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['vllm-inference:8000']
    metrics_path: '/metrics'

# vLLM exposes these natively:
#   vllm:gpu_cache_usage_perc
#   vllm:num_requests_running
#   vllm:num_requests_waiting
#   vllm:avg_prompt_throughput_toks_per_s
#   vllm:avg_generation_throughput_toks_per_s
#   vllm:time_to_first_token_seconds (histogram)
```
Time-to-first-token (TTFT) is your primary SLO metric for interactive use cases. Track its P95 and P99 — not just average. A model that’s fast 95% of the time but takes 30 seconds on the tail will destroy user experience.
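A P99 TTFT panel in Grafana can then be driven by a query like this, assuming the histogram metric above follows Prometheus's standard `_bucket` convention:

```promql
histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))
```

Swap `0.99` for `0.95` on a second panel; the gap between the two lines is your tail problem made visible.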

## When Not to Use Docker for This
There are scenarios where Docker alone is the wrong tool, and containerizing your model infrastructure just adds friction without benefit:
Skip Docker when:
- You’re doing active model training (not inference) — use the GPU directly via bare metal or a VM; every virtualization layer costs you training time and money
- Your model fits in a Jupyter notebook and your “deployment” is one data scientist on one machine — don’t build infrastructure for problems you don’t have
- You need MIG (Multi-Instance GPU) partitioning for running multiple small models on one GPU — this works with Docker but requires careful `nvidia-ctk` configuration that most teams aren't ready for
Use Docker when:
- You’re serving models in a multi-tenant environment alongside other workloads
- You need reproducible model serving environments across dev/staging/prod
- You’re building a platform that other teams will consume
- You want isolation between model versions during canary deployments
## The Practical Checklist
Before you ship a model service container to production:
- Health check `start_period` is at least 2x your measured cold start time (Docker default is `0s` — don't leave it unset)
- Model weights are mounted as a volume, not baked into the image
- `--gpu-memory-utilization` CLI arg is set to ≤0.90 in your `command:` block (not as an env var — vLLM ignores env vars for this)
- CPU and RAM `limits` are set (use your P99 observed usage + 20% headroom)
- TTFT histogram metrics are being scraped and dashboarded
- Your rolling update strategy accounts for 2-5 minute startup times
- You have a warm-up request (or readiness probe) that actually exercises the model, not just checks HTTP 200
- Your accuracy probe uses `/v1/chat/completions` for instruct/chat models, not the legacy `/v1/completions` endpoint
- Log verbosity is tuned — vLLM at DEBUG level will fill your disk in hours under load
The containers are still the same. The defaults are just wrong for this class of workload, and the cost of getting it wrong is measured in dropped requests, runaway GPU bills, and on-call pages at 2am.
Adjust the defaults. Check the metrics. Ship the model.