Docker Deployment Guide

Overview

This project uses Docker and Docker Compose for reproducible environments across development, training, and inference.

Prerequisites

Required Software

  • Docker: 20.10 or newer
  • Docker Compose: 2.0 or newer (included with Docker Desktop)
  • NVIDIA Container Toolkit (for GPU support on Linux)

GPU Support (Linux)

Install the NVIDIA Container Toolkit (it replaces the deprecated nvidia-docker2 package):

# Add the NVIDIA Container Toolkit repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
    sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure the Docker runtime and restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Docker Images

Available Build Targets

  1. base: PyTorch with CUDA runtime
  2. development: Base + dev tools (Jupyter, IPython)
  3. production: Optimized for training
  4. inference: Minimal runtime for serving models
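The four targets above map naturally to a multi-stage Dockerfile. A minimal sketch of how the stages could be laid out (the base image tag, package steps, and the serve module are illustrative assumptions, not the project's actual Dockerfile):

```dockerfile
# Shared base: PyTorch with CUDA runtime
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime AS base
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Development: base plus interactive tooling
FROM base AS development
RUN pip install --no-cache-dir jupyter ipython
EXPOSE 8888 6006

# Production: training code, no dev tools
FROM base AS production
COPY . .
CMD ["python", "experiments/baseline/train_resnet50.py"]

# Inference: minimal runtime for serving (entrypoint module is hypothetical)
FROM base AS inference
COPY . .
EXPOSE 8000
CMD ["python", "-m", "serve"]
```

Because all four targets share the `base` stage, its layers are built once and cached across `docker build --target …` invocations.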

Quick Start

1. Build Images

Build all images:

docker-compose build

Build specific target:

# Development
docker build --target development -t skin-cancer-classifier:dev .

# Production
docker build --target production -t skin-cancer-classifier:prod .

# Inference
docker build --target inference -t skin-cancer-classifier:inference .

2. Run Services

Development Environment (Jupyter + TensorBoard)

docker-compose up dev

Access:
  • Jupyter Notebook: http://localhost:8888
  • TensorBoard: http://localhost:6006

Get Jupyter token:

docker logs skin-cancer-dev 2>&1 | grep token

Training with GPU

docker-compose up training

Monitor logs:

docker logs -f skin-cancer-training

Training on CPU (for testing)

docker-compose up training-cpu

TensorBoard Only

docker-compose up tensorboard

Access at: http://localhost:6006

Inference Server

docker-compose up inference

API available at: http://localhost:8000

3. Interactive Shell

Enter development container:

docker-compose run --rm dev bash

Enter training container:

docker-compose run --rm training bash

Advanced Usage

Custom Training Configuration

Override command:

docker-compose run --rm training \
    python experiments/baseline/train_resnet50.py \
    --config baseline_config.yaml \
    --epochs 50 \
    --batch-size 64

Multi-GPU Training

Edit docker-compose.yml:

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all  # Use all GPUs
          capabilities: [gpu]

Or specify GPU IDs:

environment:
  - CUDA_VISIBLE_DEVICES=0,1,2,3  # Use GPUs 0-3
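Alternatively, the Compose device reservation syntax can pin specific GPUs directly (supported by the Compose v2 specification; `device_ids` and `count` are mutually exclusive):

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ['0', '1']  # Pin GPUs 0 and 1
          capabilities: [gpu]
```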

Volume Mounting

Mount custom data directory:

docker-compose run --rm \
    -v /path/to/custom/data:/app/data \
    training python experiments/baseline/train_resnet50.py

Environment Variables

Pass custom environment variables:

docker-compose run --rm \
    -e WANDB_API_KEY=your_key \
    -e BATCH_SIZE=64 \
    training python experiments/baseline/train_resnet50.py
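For more than a couple of variables, an env file keeps secrets out of shell history. A sketch using Compose's `env_file` option (the file name `.env.training` is a suggestion, not an existing project file):

```yaml
# docker-compose.yml — load variables from a file instead of -e flags
services:
  training:
    env_file:
      - .env.training   # contains lines like WANDB_API_KEY=... and BATCH_SIZE=64
```

Remember to add the env file to .gitignore and .dockerignore so keys never land in the repository or the image.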

Resource Limits

Limit CPU and memory:

deploy:
  resources:
    limits:
      cpus: '4'
      memory: 16G
    reservations:
      cpus: '2'
      memory: 8G

Production Deployment

Build Optimized Images

# Optimize for size
docker build --target production \
    --build-arg PYTHON_VERSION=3.10 \
    -t skin-cancer-classifier:v1.0 .

# Tag for registry
docker tag skin-cancer-classifier:v1.0 \
    your-registry/skin-cancer-classifier:v1.0

# Push to registry
docker push your-registry/skin-cancer-classifier:v1.0

Kubernetes Deployment

Example Job for training:

apiVersion: batch/v1
kind: Job
metadata:
  name: skin-cancer-training
spec:
  template:
    spec:
      containers:
      - name: training
        image: your-registry/skin-cancer-classifier:v1.0
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: data
          mountPath: /app/data
        - name: experiments
          mountPath: /app/experiments
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: data-pvc
      - name: experiments
        persistentVolumeClaim:
          claimName: experiments-pvc
      restartPolicy: Never
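
The Job above assumes the data-pvc and experiments-pvc claims already exist. A minimal claim sketch (storage size, access mode, and storage class are placeholders to adjust for the cluster):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
```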

Docker Swarm

docker stack deploy -c docker-compose.yml skin-cancer-stack

Inference API

Start Inference Server

docker-compose up -d inference

API Endpoints

Health check:

curl http://localhost:8000/health

Predict (example - API to be implemented):

curl -X POST http://localhost:8000/predict \
    -F "image=@sample_lesion.jpg" \
    -F "metadata={\"age\":45,\"location\":\"arm\"}"

Batch inference:

curl -X POST http://localhost:8000/predict_batch \
    -F "images[]=@image1.jpg" \
    -F "images[]=@image2.jpg"
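
The curl calls above can also be issued from Python. A stdlib-only sketch that builds the same multipart/form-data request by hand (field names and endpoint match the examples above; the API itself is still to be implemented, so this is untested against a live server):

```python
import json
import urllib.request
import uuid

def build_multipart(fields, files):
    """Build a multipart/form-data body by hand.

    fields: dict mapping field name -> string value
    files:  dict mapping field name -> (filename, bytes)
    Returns (body_bytes, content_type).
    """
    boundary = uuid.uuid4().hex
    body = b""
    for name, value in fields.items():
        body += (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        ).encode()
    for name, (filename, data) in files.items():
        body += (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        body += data + b"\r\n"
    body += f"--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def predict(image_path, metadata, url="http://localhost:8000/predict"):
    """POST an image plus metadata, mirroring the curl example above."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    body, content_type = build_multipart(
        fields={"metadata": json.dumps(metadata)},
        files={"image": (image_path, image_bytes)},
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the inference container running, `predict("sample_lesion.jpg", {"age": 45, "location": "arm"})` would mirror the first curl example.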

Data Management

Persistent Volumes

Create named volumes:

docker volume create skin-cancer-data
docker volume create skin-cancer-experiments

Use in docker-compose.yml:

volumes:
  - skin-cancer-data:/app/data
  - skin-cancer-experiments:/app/experiments

Backup Data

# Backup data volume
docker run --rm \
    -v skin-cancer-data:/data \
    -v $(pwd):/backup \
    busybox tar czf /backup/data-backup.tar.gz /data

# Restore
docker run --rm \
    -v skin-cancer-data:/data \
    -v $(pwd):/backup \
    busybox tar xzf /backup/data-backup.tar.gz -C /

Troubleshooting

Issue: GPU not accessible in container

Check NVIDIA Docker runtime:

docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

If error, restart Docker:

sudo systemctl restart docker

Verify runtime config (/etc/docker/daemon.json):

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}

Issue: Permission denied on volumes

Fix permissions:

# Linux
sudo chown -R $USER:$USER data/ experiments/

# Or run container as current user
docker-compose run --rm --user $(id -u):$(id -g) training bash

Issue: Out of memory

Reduce the batch size, or give Docker more memory:

# Docker Desktop: Settings > Resources > Memory
# Linux: raise the default shared-memory size (PyTorch DataLoader workers
# exchange batches through /dev/shm) in /etc/docker/daemon.json:
{
  "default-shm-size": "2G"
}

Issue: Slow build times

Use BuildKit:

export DOCKER_BUILDKIT=1
docker build --target production -t skin-cancer-classifier:prod .

Cache optimization:

# Use cache from registry
docker build --cache-from your-registry/skin-cancer-classifier:latest \
    --target production -t skin-cancer-classifier:prod .

Issue: Container exits immediately

Check logs:

docker logs skin-cancer-training

Run interactively:

docker-compose run --rm training bash

Best Practices

Security

  1. Don't run as root in production: the Dockerfile already configures a non-root user; keep it that way.
  2. Scan images for vulnerabilities: docker scout cves skin-cancer-classifier:prod (the older docker scan command has been retired).
  3. Secrets management: use Docker secrets or environment files; never hardcode API keys in images.
  4. Network security: use custom networks and restrict port exposure.

Performance

  1. Multi-stage builds: reduce image size
  2. Layer caching: order Dockerfile commands by change frequency
  3. Shared memory: increase it for data loading, e.g. in docker-compose.yml:
     shm_size: '2gb'

Monitoring

Container stats:

docker stats skin-cancer-training

GPU usage:

docker exec skin-cancer-training nvidia-smi

Resource usage over time:

docker stats --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}" \
    --no-stream
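
For scripted monitoring, newer Docker CLIs can also emit stats as JSON (docker stats --no-stream --format json, one object per line; availability depends on the CLI version). A small sketch that turns such a line into numbers; the sample line is illustrative, not captured from a real run:

```python
import json

def parse_stats_line(line):
    """Convert one line of `docker stats --format json` output into numbers."""
    raw = json.loads(line)
    used, total = raw["MemUsage"].split(" / ")
    return {
        "name": raw["Name"],
        "cpu_percent": float(raw["CPUPerc"].rstrip("%")),
        "mem_used": used,
        "mem_total": total,
    }

# Illustrative sample line (field names follow the docker stats JSON format)
sample = '{"Name": "skin-cancer-training", "CPUPerc": "85.30%", "MemUsage": "6.2GiB / 16GiB"}'
print(parse_stats_line(sample))
```

Feeding each line of `docker stats --no-stream --format json` through this function gives values ready for alerting or plotting.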

CI/CD Integration

GitHub Actions

- name: Build Docker image
  run: |
    docker build --target production -t skin-cancer-classifier:latest .

- name: Run tests in container
  run: |
    docker run --rm skin-cancer-classifier:latest pytest tests/

- name: Push to registry
  run: |
    echo ${{ secrets.DOCKER_PASSWORD }} | docker login -u ${{ secrets.DOCKER_USERNAME }} --password-stdin
    docker tag skin-cancer-classifier:latest your-registry/skin-cancer-classifier:latest
    docker push your-registry/skin-cancer-classifier:latest
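
Assembled into a complete workflow file, the steps above might look like this sketch (trigger, registry name, and secret names are placeholders to adapt):

```yaml
# .github/workflows/docker.yml (sketch)
name: docker
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        run: docker build --target production -t skin-cancer-classifier:latest .
      - name: Run tests in container
        run: docker run --rm skin-cancer-classifier:latest pytest tests/
      - name: Push to registry
        run: |
          echo ${{ secrets.DOCKER_PASSWORD }} | docker login -u ${{ secrets.DOCKER_USERNAME }} --password-stdin
          docker tag skin-cancer-classifier:latest your-registry/skin-cancer-classifier:latest
          docker push your-registry/skin-cancer-classifier:latest
```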

Cleanup

Stop all services:

docker-compose down

Remove volumes:

docker-compose down -v

Clean up images:

# Remove unused images
docker image prune -a

# Remove specific image
docker rmi skin-cancer-classifier:dev

Complete cleanup:

docker system prune -a --volumes

Next Steps

  1. Build development image: docker-compose build dev
  2. Start Jupyter environment: docker-compose up dev
  3. Run baseline training: docker-compose up training-cpu
  4. Monitor with TensorBoard: docker-compose up tensorboard
  5. Deploy inference server: docker-compose up inference

Last Updated: 2025-10-13
Docker Version: 20.10+
Docker Compose Version: 2.0+