# Docker Deployment Guide

## Overview
This project uses Docker and Docker Compose for reproducible environments across development, training, and inference.
## Prerequisites

### Required Software
- Docker: 20.10+ (see the official install guide)
- Docker Compose: 2.0+ (included with Docker Desktop)
- NVIDIA Container Toolkit (for GPU): `nvidia-docker2`
### GPU Support (Linux)
Install the NVIDIA Container Toolkit (the `nvidia-docker2` package shown here is the legacy route; newer installs use the `nvidia-container-toolkit` package):

```bash
# Add NVIDIA repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install nvidia-docker2
sudo apt-get update
sudo apt-get install -y nvidia-docker2

# Restart Docker
sudo systemctl restart docker

# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
## Docker Images

### Available Build Targets
- base: PyTorch with CUDA runtime
- development: Base + dev tools (Jupyter, IPython)
- production: Optimized for training
- inference: Minimal runtime for serving models
## Quick Start

### 1. Build Images
Build all images:
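A likely compose invocation, assuming all services are defined in `docker-compose.yml`:

```shell
docker-compose build
```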
Build specific target:
```bash
# Development
docker build --target development -t skin-cancer-classifier:dev .

# Production
docker build --target production -t skin-cancer-classifier:prod .

# Inference
docker build --target inference -t skin-cancer-classifier:inference .
```
### 2. Run Services

#### Development Environment (Jupyter + TensorBoard)
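Start the development stack (the service name `dev` matches the one used elsewhere in this guide):

```shell
docker-compose up dev
```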
Access:

- Jupyter Notebook: http://localhost:8888
- TensorBoard: http://localhost:6006
Get Jupyter token:
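One way to recover the token from the container logs (assumes the service is named `dev`):

```shell
docker-compose logs dev | grep token
```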
#### Training with GPU
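Assuming the GPU-enabled service is named `training`, as in the override commands later in this guide:

```shell
docker-compose up training
```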
Monitor logs:
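For example, following the `training` service:

```shell
docker-compose logs -f training
```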
#### Training on CPU (for testing)
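Using the `training-cpu` service name referenced in Next Steps:

```shell
docker-compose up training-cpu
```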
#### TensorBoard Only
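Assuming a dedicated `tensorboard` service in the compose file:

```shell
docker-compose up tensorboard
```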
Access at: http://localhost:6006
#### Inference Server
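Assuming an `inference` service in the compose file:

```shell
docker-compose up inference
```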
API available at: http://localhost:8000
### 3. Interactive Shell
Enter development container:
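A likely invocation, assuming the development service is named `dev`:

```shell
docker-compose run --rm dev bash
```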
Enter training container:
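Following the same pattern for the `training` service:

```shell
docker-compose run --rm training bash
```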
## Advanced Usage

### Custom Training Configuration
Override command:
```bash
docker-compose run --rm training \
  python experiments/baseline/train_resnet50.py \
  --config baseline_config.yaml \
  --epochs 50 \
  --batch-size 64
```
### Multi-GPU Training
Edit docker-compose.yml:
```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all  # Use all GPUs
          capabilities: [gpu]
```
Or specify GPU IDs:
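A sketch using the Compose `device_ids` field in place of `count` (the IDs shown are examples):

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ['0', '1']  # Pin to specific GPUs
          capabilities: [gpu]
```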
### Volume Mounting
Mount custom data directory:
```bash
docker-compose run --rm \
  -v /path/to/custom/data:/app/data \
  training python experiments/baseline/train_resnet50.py
```
### Environment Variables
Pass custom environment variables:
```bash
docker-compose run --rm \
  -e WANDB_API_KEY=your_key \
  -e BATCH_SIZE=64 \
  training python experiments/baseline/train_resnet50.py
```
### Resource Limits
Limit CPU and memory:
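A compose-file sketch; the service name `training` matches this guide, and the limit values are illustrative:

```yaml
services:
  training:
    deploy:
      resources:
        limits:
          cpus: '8'
          memory: 16G
```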
## Production Deployment

### Build Optimized Images
```bash
# Optimize for size
docker build --target production \
  --build-arg PYTHON_VERSION=3.10 \
  -t skin-cancer-classifier:v1.0 .

# Tag for registry
docker tag skin-cancer-classifier:v1.0 \
  your-registry/skin-cancer-classifier:v1.0

# Push to registry
docker push your-registry/skin-cancer-classifier:v1.0
```
### Kubernetes Deployment
Example Job for training:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: skin-cancer-training
spec:
  template:
    spec:
      containers:
        - name: training
          image: your-registry/skin-cancer-classifier:v1.0
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: data
              mountPath: /app/data
            - name: experiments
              mountPath: /app/experiments
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: data-pvc
        - name: experiments
          persistentVolumeClaim:
            claimName: experiments-pvc
      restartPolicy: Never
```
### Docker Swarm
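A minimal sketch for deploying the same compose file to a Swarm cluster (the stack name `skin-cancer` is illustrative):

```shell
docker swarm init
docker stack deploy -c docker-compose.yml skin-cancer
docker service ls
```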
## Inference API
### Start Inference Server
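Assuming an `inference` service in the compose file, run it detached:

```shell
docker-compose up -d inference
```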
### API Endpoints
Health check:
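Assuming a conventional `/health` endpoint on the server:

```shell
curl http://localhost:8000/health
```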
Predict (example - API to be implemented):
```bash
curl -X POST http://localhost:8000/predict \
  -F "image=@sample_lesion.jpg" \
  -F "metadata={\"age\":45,\"location\":\"arm\"}"
```
Batch inference:
```bash
curl -X POST http://localhost:8000/predict_batch \
  -F "images[]=@image1.jpg" \
  -F "images[]=@image2.jpg"
```
## Data Management

### Persistent Volumes
Create named volumes:
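For example, the `skin-cancer-data` volume used in the backup commands below:

```shell
docker volume create skin-cancer-data
```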
Use in docker-compose.yml:
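A compose-file sketch declaring the pre-created volume as external and mounting it into the `training` service:

```yaml
volumes:
  skin-cancer-data:
    external: true

services:
  training:
    volumes:
      - skin-cancer-data:/app/data
```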
### Backup Data
```bash
# Backup data volume
docker run --rm \
  -v skin-cancer-data:/data \
  -v $(pwd):/backup \
  busybox tar czf /backup/data-backup.tar.gz /data

# Restore
docker run --rm \
  -v skin-cancer-data:/data \
  -v $(pwd):/backup \
  busybox tar xzf /backup/data-backup.tar.gz -C /
```
## Troubleshooting

### Issue: GPU not accessible in container
Check NVIDIA Docker runtime:
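One way to confirm the nvidia runtime is registered with the daemon:

```shell
docker info | grep -i runtime
```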
If error, restart Docker:
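As in the install steps above:

```shell
sudo systemctl restart docker
```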
Verify the runtime config in `/etc/docker/daemon.json`:

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
```
### Issue: Permission denied on volumes
Fix permissions:
```bash
# Linux
sudo chown -R $USER:$USER data/ experiments/

# Or run the container as the current user
docker-compose run --rm --user $(id -u):$(id -g) training bash
```
### Issue: Out of memory
Reduce batch size or increase Docker memory limit:
On Docker Desktop: Settings > Resources > Memory. On Linux, edit `/etc/docker/daemon.json`:

```json
{
  "default-shm-size": "2G"
}
```
### Issue: Slow build times
Use BuildKit:
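BuildKit can be enabled per invocation via an environment variable (build arguments here mirror the production build above):

```shell
DOCKER_BUILDKIT=1 docker build --target production -t skin-cancer-classifier:prod .
```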
Cache optimization:
```bash
# Use cache from registry
docker build --cache-from your-registry/skin-cancer-classifier:latest \
  --target production -t skin-cancer-classifier:prod .
```
### Issue: Container exits immediately
Check logs:
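For example, for the `training` service:

```shell
docker-compose logs training
```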
Run interactively:
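One way to debug is to override the entrypoint and get a shell (service name `training` as an example):

```shell
docker-compose run --rm --entrypoint bash training
```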
## Best Practices

### Security
- Don't run as root in production: use the non-root user already configured in the Dockerfile
- Scan images: `docker scan skin-cancer-classifier:prod`
- Secrets management: use Docker secrets or environment files; never hardcode API keys in images
- Network security: use custom networks and restrict port exposure
### Performance
- Multi-stage builds: Reduce image size
- Layer caching: Order Dockerfile commands by change frequency
- Shared memory: Increase for data loading
### Monitoring
Container stats:
GPU usage:
Resource usage over time:
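Sketches for the three checks above (the container name `training` is illustrative; compose may prefix it with the project name):

```shell
# Container stats: live CPU/memory per container
docker stats

# GPU usage inside a running container
docker exec -it training nvidia-smi

# Resource usage over time: one-shot snapshots suitable for logging
docker stats --no-stream >> container-stats.log
```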
## CI/CD Integration

### GitHub Actions
```yaml
- name: Build Docker image
  run: |
    docker build --target production -t skin-cancer-classifier:latest .

- name: Run tests in container
  run: |
    docker run --rm skin-cancer-classifier:latest pytest tests/

- name: Push to registry
  run: |
    echo ${{ secrets.DOCKER_PASSWORD }} | docker login -u ${{ secrets.DOCKER_USERNAME }} --password-stdin
    docker tag skin-cancer-classifier:latest your-registry/skin-cancer-classifier:latest
    docker push your-registry/skin-cancer-classifier:latest
```
## Cleanup
Stop all services:
Remove volumes:
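Likely compose commands for the two steps above:

```shell
# Stop all services
docker-compose down

# Stop services and remove volumes as well
docker-compose down -v
```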
Clean up images:
```bash
# Remove unused images
docker image prune -a

# Remove specific image
docker rmi skin-cancer-classifier:dev
```
Complete cleanup:
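An aggressive teardown, removing this project's containers, volumes, and images, then pruning everything unused system-wide:

```shell
docker-compose down -v --rmi all
docker system prune -a --volumes
```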
## Next Steps
- Build development image: `docker-compose build dev`
- Start Jupyter environment: `docker-compose up dev`
- Run baseline training: `docker-compose up training-cpu`
- Monitor with TensorBoard: `docker-compose up tensorboard`
- Deploy inference server: `docker-compose up inference`
Last Updated: 2025-10-13 | Docker Version: 20.10+ | Docker Compose Version: 2.0+