HAM10000 Quick Start Guide¶
Phase 1.5 Complete - Real dataset integration ready for baseline training
5-Minute Setup¶
1. Download HAM10000 Dataset¶
Visit: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T
Download:
- HAM10000_images_part_1.zip (~2.5 GB)
- HAM10000_images_part_2.zip (~2.5 GB)
- HAM10000_metadata (CSV)
Extract to: data/raw/ham10000/
2. Automated Setup¶
# One command to set up everything
python scripts/setup_ham10000.py
# This will:
# - Check dataset integrity
# - Generate FST annotations (ITA-based)
# - Create train/val/test splits (stratified)
# - Verify all is ready
3. Train Baseline Model¶
# Train ResNet50 on HAM10000 with FST evaluation
python experiments/baseline/train_resnet50.py --config configs/baseline_config.yaml
Done! Your model will train with fairness metrics tracked across FST groups.
What's Included¶
Core Implementation¶
| Component | File | Description |
|---|---|---|
| Dataset Class | src/data/ham10000_dataset.py |
PyTorch Dataset with FST support |
| FST Estimation | scripts/generate_ham10000_fst.py |
ITA-based skin tone annotation |
| Stratified Splits | scripts/create_ham10000_splits.py |
Train/val/test with no leakage |
| Verification | scripts/verify_ham10000.py |
Complete integrity check |
| Setup Script | scripts/setup_ham10000.py |
Automated end-to-end setup |
| Documentation | docs/ham10000_integration.md |
Comprehensive guide |
Key Features¶
- Automatic FST Annotation: ITA-based estimation for all 10,015 images
- Stratified Splitting: Preserves diagnosis AND FST distribution
- No Data Leakage: Lesion-level splitting (same lesion stays in one split)
- Comprehensive Metadata: Age, sex, localization, lesion_id, FST
- Graceful Fallbacks: Uses dummy data if HAM10000 not available
- Fairness Evaluation: Built-in FST-stratified metrics
Directory Structure After Setup¶
data/
├── raw/
│ └── ham10000/
│ ├── HAM10000_images_part_1/ # 5000 images
│ ├── HAM10000_images_part_2/ # 5015 images
│ └── HAM10000_metadata.csv
└── metadata/
├── ham10000_fst_estimated.csv # Metadata + FST labels
├── ham10000_splits.json # Train/val/test indices
└── visualizations/ # FST distribution plots
├── ham10000_fst_distribution.png
├── ham10000_ita_distribution.png
└── ham10000_fst_by_diagnosis.png
Usage Example¶
from src.data.ham10000_dataset import HAM10000Dataset, load_splits
from torch.utils.data import DataLoader
# Load splits
splits = load_splits("data/metadata/ham10000_splits.json")
# Create dataset
train_dataset = HAM10000Dataset(
root_dir="data/raw/ham10000",
metadata_path="data/metadata/ham10000_fst_estimated.csv",
split="train",
split_indices=splits['train'],
use_fst_annotations=True,
)
print(f"Train samples: {len(train_dataset)}") # ~7,010 images
# Access sample
sample = train_dataset[0]
# Returns:
# - image: Tensor (3, H, W)
# - label: int (0-6 diagnosis class)
# - fst: int (1-6 Fitzpatrick type)
# - lesion_id, image_id, age, sex, localization
Expected Performance¶
Diagnosis Classes (7 total)¶
| Class | Samples | % | Expected AUROC |
|---|---|---|---|
| nv (nevus) | 6705 | 67.0% | 0.90-0.95 |
| mel (melanoma) | 1113 | 11.1% | 0.85-0.90 |
| bkl (keratosis) | 1099 | 11.0% | 0.80-0.88 |
| bcc (carcinoma) | 514 | 5.1% | 0.85-0.92 |
| akiec (keratoses) | 327 | 3.3% | 0.78-0.85 |
| vasc (vascular) | 142 | 1.4% | 0.75-0.82 |
| df (dermatofibroma) | 115 | 1.1% | 0.72-0.80 |
Fairness Gaps (Baseline ResNet50)¶
Melanoma Detection AUROC by FST: - FST I-II (light): 0.88-0.92 (baseline) - FST III-IV (medium): 0.85-0.89 (-3% gap) - FST V-VI (dark): 0.80-0.86 (-6% gap)
Phase 2 Goal: Reduce gap to < 3% through synthetic data and debiasing.
Manual Steps (if not using setup script)¶
Generate FST Annotations¶
python scripts/generate_ham10000_fst.py \
--data-dir data/raw/ham10000 \
--output data/metadata/ham10000_fst_estimated.csv
Create Splits¶
python scripts/create_ham10000_splits.py \
--metadata data/metadata/ham10000_fst_estimated.csv \
--output data/metadata/ham10000_splits.json \
--visualize
Verify Dataset¶
python scripts/verify_ham10000.py \
--data-dir data/raw/ham10000 \
--metadata data/metadata/ham10000_fst_estimated.csv \
--splits data/metadata/ham10000_splits.json
Troubleshooting¶
"HAM10000 data directory not found"¶
Solution: Download dataset from Harvard Dataverse and extract to data/raw/ham10000/
"Image not found for image_id: ISIC_XXXXXXX"¶
Solution: Verify both HAM10000_images_part_1 and part_2 directories exist with images
"Severe class imbalance detected"¶
Expected: HAM10000 has 67% nevi, 1.1% dermatofibroma (natural imbalance)
Solution: Use weighted sampling or class-balanced loss during training
Next Steps¶
Phase 1.5 Complete ✓¶
- HAM10000 dataset integration
- FST annotation (ITA-based)
- Stratified splitting
- Baseline training ready
Phase 2: Fairness Enhancement¶
- Synthetic data generation for FST balancing
- Group-balanced mini-batch sampling
- FST-aware loss functions
- Advanced architectures (ViT, Swin)
Important Notes¶
FST Estimation Accuracy¶
FST labels are estimated using ITA, not ground truth clinical assessments: - Accuracy: ~70-80% agreement with expert annotations - Purpose: Research on fairness gaps, NOT clinical FST assessment - Limitation: Document this in publications/reports
To use external FST annotations (if available):
dataset = HAM10000Dataset(
fst_csv_path="data/annotations/expert_fst.csv", # CSV with image_id, fst
estimate_fst_if_missing=True, # Fill gaps with ITA
)
Class Imbalance¶
HAM10000 has severe imbalance (58:1 ratio). Consider: - Weighted random sampling - Focal loss or class-balanced loss - Synthetic oversampling for minority classes (Phase 2)
Documentation¶
- Full Integration Guide:
docs/ham10000_integration.md - FST Annotation Details:
src/data/fst_annotation.py - Dataset API Reference:
src/data/ham10000_dataset.py - Training Configuration:
configs/baseline_config.yaml
Quick Commands¶
# Setup everything
python scripts/setup_ham10000.py
# Train baseline
python experiments/baseline/train_resnet50.py --config configs/baseline_config.yaml
# Verify dataset
python scripts/verify_ham10000.py
# Test dataset loading
python -m src.data.ham10000_dataset
Framework: MENDICANT_BIAS Phase: 1.5 Complete Agent: HOLLOWED_EYES Status: Ready for Baseline Training
Real experiments begin now.