
You’ve built an amazing AI application locally. Now you need to deploy it. Simple, right?
Except your laptop has 64GB RAM, local model files, cached embeddings, and environment variables scattered across three different files. Production has… none of that.
Here’s how to bridge the gap between “works on my machine” and “running reliably in production.”
The Production Deployment Checklist
Before we dive into specifics, here’s what you need:
| Component | Purpose | Example Tools |
|---|---|---|
| Containerization | Reproducible environments | Docker, Podman |
| Orchestration | Manage multiple containers | Docker Compose, Kubernetes |
| Reverse Proxy | Handle HTTPS, routing | Caddy, Nginx, Traefik |
| CI/CD | Automated testing & deployment | GitHub Actions, GitLab CI |
| Secrets Management | Secure API keys, passwords | Vault, AWS Secrets Manager |
| Monitoring | Know when things break | Grafana, Datadog, Sentry |
| Logging | Debug production issues | Loki, CloudWatch, Better Stack |
Step 1: Dockerize Your Application
Docker ensures your app runs the same everywhere. Here’s a production-ready Dockerfile for a Python AI application:
```dockerfile
# Multi-stage build for smaller images
FROM python:3.11-slim as builder
WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y gcc g++ && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Final stage
FROM python:3.11-slim
WORKDIR /app

# Don't run as root
RUN useradd -m appuser

# Copy only what we need from builder, into the non-root user's home
# (/root isn't readable once we drop privileges)
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
COPY --chown=appuser:appuser . .

# Make sure scripts are in PATH
ENV PATH=/home/appuser/.local/bin:$PATH

USER appuser

# Health check (assumes the requests package is in requirements.txt)
HEALTHCHECK --interval=30s --timeout=3s CMD python -c "import requests; requests.get('http://localhost:8000/health', timeout=2).raise_for_status()"

# Run application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
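The HEALTHCHECK above assumes your app actually serves a /health route. If you're using FastAPI (which the uvicorn main:app command suggests), a minimal handler could look like this; the route name and response shape are conventions, not requirements:

```python
# main.py (sketch) -- minimal health endpoint for the Docker HEALTHCHECK to hit
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
async def health() -> dict:
    # Keep this cheap: no LLM calls, no heavy database queries.
    # Returning 200 just means the process can serve requests.
    return {"status": "ok"}
```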
Key Docker Best Practices
- Multi-stage builds: Reduce final image size by 60-80%
- Don't run as root: Security best practice
- Health checks: Let orchestrators know if the container is healthy
- Specific base images: Use python:3.11-slim, not python:latest
- .dockerignore: Exclude unnecessary files (node_modules, .git, cache)
Step 2: Environment Configuration
Never hardcode API keys or secrets. Use environment variables:
```bash
# .env.example (check this into git)
ANTHROPIC_API_KEY=sk-ant-xxx
DATABASE_URL=postgresql://localhost/mydb
REDIS_URL=redis://localhost:6379
LOG_LEVEL=info

# .env (never commit this!)
ANTHROPIC_API_KEY=sk-ant-real-key-here
DATABASE_URL=postgresql://user:pass@prod-db.com/prod
```
Load them in your app:
```python
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    anthropic_api_key: str
    database_url: str
    redis_url: str = "redis://localhost:6379"  # default
    log_level: str = "info"

    class Config:
        env_file = ".env"

settings = Settings()
```
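With settings loaded, build your clients from configuration instead of hardcoding keys. A small sketch, assuming the official anthropic SDK; this shared client is reused by the monitoring example in Step 6:

```python
from anthropic import AsyncAnthropic

# One shared client for the whole app, configured from the environment
client = AsyncAnthropic(api_key=settings.anthropic_api_key)
```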
Step 3: Docker Compose for Local Development
Run your entire stack with one command:
```yaml
version: '3.8'

services:
  app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://postgres:password@db:5432/mydb
      - REDIS_URL=redis://redis:6379
    env_file:
      - .env
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
    volumes:
      - ./app:/app  # hot reload in dev
    restart: unless-stopped

  db:
    image: postgres:15-alpine
    environment:
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=mydb
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

  # Vector database for RAG
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage

volumes:
  postgres_data:
  redis_data:
  qdrant_data:
```
Run everything:
```bash
docker compose up -d
```
Your app, database, Redis, and vector database are now running.
Step 4: Caddy for HTTPS and Reverse Proxy
Caddy automatically provisions SSL certificates from Let’s Encrypt. Configuration is beautifully simple:
```
# Caddyfile
ai.yourdomain.com {
    # Automatic HTTPS!
    reverse_proxy app:8000

    # Rate limiting (requires the third-party caddy-ratelimit module in your Caddy build)
    rate_limit {
        zone app_zone {
            key {remote_host}
            events 100
            window 1m
        }
    }

    # Logging
    log {
        output file /var/log/caddy/access.log
        format json
    }
}

# Separate domain for admin panel
admin.yourdomain.com {
    reverse_proxy app:8000

    # Basic auth
    basicauth {
        admin $2a$14$hashed_password_here
    }
}
```
Add Caddy to docker-compose.yml (and declare caddy_data and caddy_config under the top-level volumes block):
```yaml
  caddy:
    image: caddy:2-alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data
      - caddy_config:/config
    restart: unless-stopped
```
Boom. HTTPS, rate limiting, and JSON access logs in a couple dozen lines.
Step 5: CI/CD Pipeline
Automate testing and deployment with GitHub Actions:
```yaml
# .github/workflows/deploy.yml
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov

      - name: Run tests
        run: pytest --cov=app tests/

      - name: Run LLM evals
        run: python scripts/run_evals.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

  deploy:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v3

      # Log in to your registry first (e.g. docker/login-action) or the push will be rejected
      - name: Build and push Docker image
        uses: docker/build-push-action@v4
        with:
          push: true
          tags: your-registry/ai-app:latest

      - name: Deploy to server
        uses: appleboy/ssh-action@master
        with:
          host: ${{ secrets.SERVER_HOST }}
          username: ${{ secrets.SERVER_USER }}
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            cd /opt/ai-app
            docker compose pull
            docker compose up -d
            docker compose exec app python scripts/migrate.py
```
Now every push to main:
- Runs unit tests
- Runs LLM evaluations
- Builds Docker image
- Deploys to production
- Runs database migrations
All automatically.
Step 6: Monitoring and Observability
You need to know when things break. Set up Grafana + Prometheus:
```yaml
# docker-compose.yml additions
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_password_here
```
Instrument your app:
```python
from prometheus_client import Counter, Histogram

llm_requests = Counter('llm_requests_total', 'Total LLM requests')
llm_latency = Histogram('llm_request_duration_seconds', 'LLM request latency')
llm_cost = Counter('llm_cost_dollars', 'Total LLM cost in dollars')

@llm_latency.time()  # the decorator records request duration for us
async def call_llm(prompt: str):
    llm_requests.inc()

    response = await client.messages.create(  # the AsyncAnthropic client from Step 2
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,  # required by the Messages API
        messages=[{"role": "user", "content": prompt}],
    )

    # Track cost
    cost = calculate_cost(response.usage)
    llm_cost.inc(cost)

    return response
```
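calculate_cost isn't part of any SDK, so here's one hypothetical version that turns the response's token usage into dollars. The per-million-token prices are illustrative placeholders; check your provider's current price list before trusting the numbers:

```python
# Hypothetical cost helper -- prices are illustrative placeholders (USD per million tokens)
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def calculate_cost(usage) -> float:
    input_cost = usage.input_tokens / 1_000_000 * PRICE_PER_MTOK["input"]
    output_cost = usage.output_tokens / 1_000_000 * PRICE_PER_MTOK["output"]
    return input_cost + output_cost
```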
Now you have dashboards showing:
- Request volume
- Latency (p50, p95, p99)
- Error rates
- Cost per hour
- Cache hit rates
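For Prometheus to scrape those metrics, the app has to expose them over HTTP. prometheus_client ships an ASGI app you can mount; a sketch assuming FastAPI, with prometheus.yml then pointed at app:8000/metrics:

```python
from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()

# Expose every registered metric (llm_requests, llm_latency, llm_cost, ...)
# at /metrics for Prometheus to scrape.
app.mount("/metrics", make_asgi_app())
```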
Step 7: Blue-Green Deployments
Deploy new versions without downtime:
```bash
# Deploy new version (green)
docker compose -f docker-compose.green.yml up -d

# Test it on separate port
curl http://localhost:8001/health

# If good, switch traffic (update Caddy)
# If bad, kill green and keep blue
```
Or use Kubernetes for automatic rolling updates.
Common Production Issues (And Fixes)
Issue: Out of Memory
Symptom: Container keeps restarting
Fix: Set memory limits in docker-compose.yml:
```yaml
services:
  app:
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 2G
```
Issue: Slow Performance
Symptom: Requests timing out
Fix: Add Redis caching, increase worker processes, use async I/O
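For the caching part of that fix, a minimal sketch using redis.asyncio with a hash of the prompt as the cache key. It reuses call_llm from Step 6, hardcodes the compose service URL for brevity (use settings.redis_url in a real app), and the 1-hour TTL matches the cost section below:

```python
import hashlib

import redis.asyncio as redis

cache = redis.from_url("redis://redis:6379")  # the redis service from docker-compose

async def cached_llm_call(prompt: str) -> str:
    # Identical prompts map to the same key, so repeat requests skip the LLM entirely
    key = "llm:" + hashlib.sha256(prompt.encode()).hexdigest()

    cached = await cache.get(key)
    if cached is not None:
        return cached.decode()

    response = await call_llm(prompt)      # the instrumented call from Step 6
    text = response.content[0].text
    await cache.set(key, text, ex=3600)    # 1-hour TTL
    return text
```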
Issue: Database Connection Exhaustion
Symptom: “Too many connections” errors
Fix: Use connection pooling (SQLAlchemy, asyncpg), increase DB max connections
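A pooling sketch with SQLAlchemy's async engine and asyncpg; the pool sizes are illustrative starting points, so tune them against your database's max_connections:

```python
from sqlalchemy.ext.asyncio import create_async_engine

# One engine (and one connection pool) shared by the whole app,
# instead of opening a fresh connection per request.
engine = create_async_engine(
    "postgresql+asyncpg://user:pass@db:5432/mydb",
    pool_size=10,        # steady-state connections kept open
    max_overflow=20,     # extra connections allowed under burst load
    pool_timeout=30,     # seconds to wait for a free connection before failing
    pool_pre_ping=True,  # drop dead connections instead of handing them out
)
```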
Issue: Secrets Leaked in Logs
Symptom: API keys visible in logs
Fix: Scrub logs, use structured logging with sensitive field redaction
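One way to do that redaction with the standard library: a logging filter attached to the handler that masks anything resembling an API key. The regex is a rough illustration for Anthropic-style keys; extend it for whatever secret formats you handle:

```python
import logging
import re

# Rough pattern for Anthropic-style keys; add patterns for any other secrets you use
SECRET_PATTERN = re.compile(r"sk-ant-[A-Za-z0-9_-]+")

class RedactSecretsFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = SECRET_PATTERN.sub("[REDACTED]", str(record.msg))
        return True  # keep the record, just with secrets scrubbed

handler = logging.StreamHandler()
handler.addFilter(RedactSecretsFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
```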
The Production Deployment Runbook
When deploying a major change:
- ☐ Test locally with docker compose up
- ☐ Run full test suite including LLM evals
- ☐ Deploy to staging environment first
- ☐ Run smoke tests on staging (a minimal sketch follows this checklist)
- ☐ Deploy to 10% of prod traffic (canary)
- ☐ Monitor for 1 hour
- ☐ If metrics look good, deploy to 100%
- ☐ If anything breaks, roll back immediately
- ☐ Keep deployment window open for 24 hours
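Here's what the smoke-test step might look like as a script. The base URL and the /chat endpoint are placeholders for whatever your API actually exposes:

```python
# smoke_test.py (sketch) -- run against staging right after a deploy
import sys

import requests

BASE_URL = "https://staging.yourdomain.com"  # placeholder

def main() -> int:
    # 1. Is the service up at all?
    health = requests.get(f"{BASE_URL}/health", timeout=5)
    if health.status_code != 200:
        print(f"health check failed: {health.status_code}")
        return 1

    # 2. Does a request make it through the LLM path end to end?
    #    (assumes a /chat endpoint -- adjust to your API)
    resp = requests.post(f"{BASE_URL}/chat", json={"prompt": "ping"}, timeout=30)
    if resp.status_code != 200:
        print(f"chat smoke test failed: {resp.status_code}")
        return 1

    print("smoke tests passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```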
Cost Optimization
Running AI in production can get expensive. Optimize:
- Use spot instances for batch jobs (save 60-80%)
- Auto-scale workers based on queue depth
- Cache LLM responses aggressively (Redis with 1-hour TTL)
- Use smaller models where the quality difference is minimal (see the routing sketch after this list)
- Batch similar requests to save on API calls
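The "smaller models" point can be as simple as routing on a heuristic. A sketch with placeholder model names and a deliberately crude rule; in practice you'd base the routing on eval results, not prompt length:

```python
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

CHEAP_MODEL = "claude-3-5-haiku-20241022"    # placeholder model tiers --
STRONG_MODEL = "claude-3-5-sonnet-20241022"  # use whatever you've evaluated

def pick_model(prompt: str) -> str:
    # Crude heuristic for illustration only
    return CHEAP_MODEL if len(prompt) < 500 else STRONG_MODEL

async def routed_llm_call(prompt: str):
    return await client.messages.create(
        model=pick_model(prompt),
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
```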
Security Hardening
Essential security practices:
- Run containers as non-root user
- Use secrets management (Vault, AWS Secrets Manager)
- Enable rate limiting (prevent abuse)
- Scan Docker images for vulnerabilities (Trivy, Snyk)
- Use network policies (isolate services)
- Enable audit logging (track all API calls)
- Rotate API keys regularly
Backup and Disaster Recovery
What happens if your server dies?
- Database backups: Automated daily backups to S3 (a minimal sketch follows this list)
- Vector DB backups: Regular snapshots of Qdrant/Weaviate
- Configuration backups: Store in git (Infrastructure as Code)
- Recovery time objective: Can you restore in under 1 hour?
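A sketch of the first bullet as a nightly job: dump Postgres, compress, and push to S3. The bucket name is a placeholder, and it assumes pg_dump is installed and AWS credentials are available to boto3:

```python
# backup_db.py (sketch) -- hypothetical nightly database backup
import datetime
import os
import subprocess

import boto3

def backup_database() -> None:
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d-%H%M%S")
    dump_path = f"/tmp/mydb-{stamp}.sql.gz"

    # pg_dump reads the connection string from the environment; gzip compresses the dump
    subprocess.run(f'pg_dump "$DATABASE_URL" | gzip > {dump_path}', shell=True, check=True)

    # Upload to S3 (bucket name is a placeholder)
    boto3.client("s3").upload_file(dump_path, "my-backup-bucket", f"db-backups/{os.path.basename(dump_path)}")

if __name__ == "__main__":
    backup_database()
```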
The Bottom Line
Deploying AI applications is more complex than traditional apps because:
- They depend on external APIs (LLMs)
- They have ML-specific failure modes
- They can be expensive to run at scale
- They require continuous monitoring and improvement
But with the right DevOps practices, you can run AI in production with confidence.
Start simple (Docker + Docker Compose), add complexity only when needed (Kubernetes), and always measure what matters (latency, cost, user satisfaction).
Now go deploy that AI app. The world is waiting.