Fairness-Aware AI for Skin Cancer Detection: Implementation Roadmap¶
Research Foundation

This project is inspired by and builds upon the comprehensive survey:

> Flores, J., & Alzahrani, N. (2025). AI Skin Cancer Detection Across Skin Tones: A Survey of Experimental Advances, Fairness Techniques, and Dataset Limitations. Submitted to Computers (MDPI).

- Authors: Jasmin Flores & Dr. Nabeel Alzahrani
- Institution: School of Computer Science & Engineering, California State University, San Bernardino, USA
Project Vision¶
Develop a production-grade, fairness-aware AI system for skin cancer detection that achieves equitable diagnostic performance across all Fitzpatrick skin types (FST I-VI), addressing the critical healthcare disparity where existing models show 15-30% performance drops on darker skin tones.
Core Principles:
1. Fairness-First Development: Equity across skin tones is not an afterthought; it is embedded from Phase 1
2. Evidence-Based Design: Every architectural and methodological decision is grounded in peer-reviewed research
3. Clinical Viability: Target performance benchmarks from deployed systems (NHS DERM: 97% sensitivity across all FST)
4. Open Science: Transparent methodology, reproducible experiments, public model cards with subgroup metrics
5. Ethical AI: Patient co-design, informed consent, continuous fairness monitoring
Implementation Phases¶
Phase 1: Foundation (Weeks 1-4)¶
Objective: Establish baseline infrastructure, quantify fairness gap, validate evaluation framework
Key Activities:
1. Dataset Acquisition
   - Primary: Fitzpatrick17k, DDI (Diverse Dermatology Images), MIDAS, SCIN
   - Baseline: HAM10000, ISIC 2019 (for comparison)
   - Target FST distribution: minimum 25% FST IV-VI (vs <5% in standard datasets)
2. Baseline Model Training
   - ResNet50, EfficientNet-B4 (transfer learning)
   - Quantify fairness gap: AUROC per FST group
   - Expected: 15-20% AUROC drop for FST V-VI (literature benchmark)
3. Evaluation Framework
   - Metrics: AUROC, sensitivity, specificity, Equal Opportunity Difference (EOD), Expected Calibration Error (ECE)
   - Per-FST reporting: disaggregate ALL metrics by skin tone
   - Visualization: ROC curves per FST, calibration plots, confusion matrices
4. Tone Annotation Pipeline
   - Monk Skin Tone (MST) scale (10-point, finer-grained than the 6-point Fitzpatrick scale)
   - Dual annotation (2 annotators per image, adjudication protocol)
   - ITA (Individual Typology Angle) validation
Success Criteria:
- Dataset access confirmed (4/5 datasets)
- Baseline AUROC gap quantified: 15-20% (FST I-III vs V-VI)
- Evaluation pipeline operational: automated subgroup metrics
- 5,000 images annotated with MST labels
Deliverables:
- src/data/datasets.py: Dataset loaders (Fitzpatrick17k, DDI, HAM10000)
- src/evaluation/metrics.py: Fairness metrics (AUROC per FST, EOD, ECE)
- experiments/baseline/: Baseline model training scripts + results
- docs/datasets.md: Dataset documentation with FST distributions
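The per-FST disaggregation planned for `src/evaluation/metrics.py` can be sketched in a few lines. The function names below are illustrative (not the project's actual API), assuming binary malignant/benign labels and an FST group label per image:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_group_auroc(y_true, y_score, groups):
    """AUROC disaggregated by skin-tone group (e.g. FST labels)."""
    return {g: roc_auc_score(y_true[groups == g], y_score[groups == g])
            for g in np.unique(groups)}

def equal_opportunity_difference(y_true, y_pred, groups, g_a, g_b):
    """EOD: |TPR(group a) - TPR(group b)| on the positive (malignant) class."""
    def tpr(g):
        mask = (groups == g) & (y_true == 1)
        return y_pred[mask].mean()
    return abs(tpr(g_a) - tpr(g_b))

# toy check on synthetic scores
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
s = np.clip(y * 0.5 + rng.normal(0.25, 0.3, 200), 0, 1)
g = rng.choice(["I-III", "V-VI"], 200)
print(per_group_auroc(y, s, g))
print(equal_opportunity_difference(y, (s > 0.5).astype(int), g, "I-III", "V-VI"))
```

The same pattern extends to per-group sensitivity, specificity, and ECE; the key discipline is that every metric in the pipeline accepts a `groups` array and never reports an aggregate without its subgroup breakdown.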
Phase 2: Fairness MVP (Weeks 5-12)¶
Objective: Implement core fairness techniques, reduce AUROC gap to <8%
Strategic Implementation Order (ROI-optimized, from THE DIDACT analysis):
Week 5-6: FairDisCo Adversarial Debiasing (Highest ROI)¶
Rationale: Best EOD reduction (65%), fastest implementation, moderate complexity
Implementation Plan:
- Day 1-2: Clone the siyi-wind/FairDisCo repository; set up the environment (PyTorch 2.1+, CUDA 12.1)
- Day 3-4: Adapt for the combined HAM10000 + Fitzpatrick17k dataset
- Day 5-7: Implement the gradient reversal layer (GRL); verify gradients flow correctly
- Day 8-10: Implement the supervised contrastive loss (temperature 0.07, batch size 64)
- Day 11-12: Integrate into the training loop (multi-task loss: classification + adversarial + contrastive)
- Day 13-14: Train for 100 epochs (~25 GPU hours on an RTX 3090)
Expected Results:
- AUROC gap: 20% → 10% (50% reduction)
- EOD: 0.18 → 0.06 (65% reduction)
- Accuracy trade-off: -0.5% to -2% (acceptable)
Key Metrics to Monitor:
- Discriminator accuracy: should fall from 50-70% (early epochs) to 20-25% (late epochs) as tone information is stripped from the features
- If discriminator accuracy remains above 50% after epoch 50: increase λ_adv (0.3 → 0.4)
Deliverables:
- src/fairness/gradient_reversal.py: GRL PyTorch autograd function
- src/fairness/fairdisco_model.py: Full architecture (backbone + discriminator + contrastive)
- src/fairness/supervised_contrastive_loss.py: Contrastive loss implementation
- experiments/phase2_week5-6_fairdisco/: Training scripts, logs, checkpoints
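The gradient reversal layer at the heart of the adversarial branch is small enough to sketch in full. This follows the standard Ganin & Lempitsky formulation; the λ_adv default of 0.3 mirrors the loss weight used elsewhere in Phase 2:

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; multiplies gradients by -lambda on backward,
    so the backbone learns features the FST discriminator cannot exploit."""
    @staticmethod
    def forward(ctx, x, lambda_adv):
        ctx.lambda_adv = lambda_adv
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # negate and scale the gradient; no gradient w.r.t. lambda itself
        return -ctx.lambda_adv * grad_output, None

def grad_reverse(x, lambda_adv=0.3):
    return GradientReversal.apply(x, lambda_adv)

# sanity check: gradient through the layer is negated and scaled
x = torch.ones(3, requires_grad=True)
grad_reverse(x, 0.3).sum().backward()
print(x.grad)  # tensor([-0.3000, -0.3000, -0.3000])
```

In the full model, `grad_reverse` sits between the backbone embedding and the tone discriminator, so the discriminator trains normally while the backbone receives reversed gradients.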
Week 7-8: CIRCLe Color-Invariant Learning (Low Complexity, Fast)¶
Rationale: Improves calibration (3-5% ECE reduction), easiest implementation
Implementation Plan:
- Day 1-2: Clone the arezou-pakzad/CIRCLe repository; set up the environment
- Day 3-4: Implement simple LAB color transformations (skip StarGAN for Phase 2)
  - Transform images to FST I and VI (the extreme classes)
  - Pre-compute transformations: 3x dataset size (~48GB → ~144GB)
- Day 5-7: Implement the regularization loss (L2 distance between original and transformed embeddings)
- Day 8-10: Integrate into the FairDisCo architecture (add a 4th loss term)
- Day 11-14: Train the combined model (FairDisCo + CIRCLe, 100 epochs, ~30 GPU hours)
Expected Results:
- AUROC gap: 10% → 7% (an additional 30% reduction)
- ECE: 0.10 → 0.07 (improved calibration)
- OOD generalization: +8-12% on unseen FST combinations
Hyperparameter Tuning:
- λ_reg: start at 0.2; tune over [0.1, 0.2, 0.3] if needed
- Target FST: ["I", "VI"] (extreme classes, most effective)
- Distance metric: L2 (simpler than cosine for Phase 2)
Deliverables:
- src/fairness/color_transforms.py: LAB transformation functions
- src/fairness/circle_regularization.py: Regularization loss
- experiments/phase2_week7-8_circle/: Training scripts, ablation studies
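The regularization term itself is nearly a one-liner. A minimal sketch, assuming the tone-transformed copy of each image is passed through the same backbone and λ_reg (from the tuning grid above) scales this term in the total loss:

```python
import torch
import torch.nn.functional as F

def circle_regularization(z_orig, z_transformed):
    """L2 distance between embeddings of an image and its tone-transformed copy.
    Penalizes representations that change when only skin tone changes."""
    return F.mse_loss(z_orig, z_transformed)

# toy: identical embeddings incur zero penalty; divergent ones are penalized
z = torch.randn(8, 128)
print(circle_regularization(z, z).item())        # 0.0
print(circle_regularization(z, z + 1.0).item())  # 1.0
```

Using L2 rather than cosine distance (per the tuning notes above) keeps the term scale-sensitive, which interacts predictably with a fixed λ_reg.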
Week 9-11: FairSkin Diffusion Augmentation (Highest Absolute Gain)¶
Rationale: Largest AUROC improvement (+18-21%), one-time synthetic generation cost
Implementation Plan:
- Week 9 (Textual Inversion + LoRA Setup):
  - Day 1-2: Clone the janet-sw/skin-diff repository; install Hugging Face Diffusers
  - Day 3-4: Prepare training data (Fitzpatrick17k + DDI, resize to 512x512, hair removal)
  - Day 5: Train textual inversion (2,000 steps, ~4 hours)
    - Learn tokens: `<melanoma-FST-VI>`, `<nevus-FST-I>`, etc.
  - Day 6-7: Validate: generate 100 test images from prompts, qualitative review
- Week 10 (LoRA Training):
  - Day 1-3: Train LoRA adapters (rank 16, alpha 32, 10k steps, ~20 hours)
  - Day 4-5: Validate: generate 500 images (all diagnosis × FST combinations)
  - Day 6-7: Quality metrics: FID <20, LPIPS <0.15, classifier confidence >0.7
- Week 11 (Batch Generation + Classifier Training):
  - Day 1-3: Generate 60k synthetic images (50-100 GPU hours; parallelizable on 4 GPUs → 12-25 hours)
    - Balanced FST distribution: 50% FST V-VI (vs <5% in real data)
    - Quality filtering: accept if FID <30, LPIPS <0.2, no artifacts
  - Day 4: Expert review: sample 500 images, dermatologist rating (target: >5.0/7.0)
  - Day 5-7: Train classifier on mixed dataset (real + synthetic, FST-dependent weighting)
    - FST I-III: 20% synthetic, 80% real
    - FST V-VI: 80% synthetic, 20% real
Expected Results:
- AUROC gap: 7% → 3.5% (achieves the <4% Phase 2 target)
- Synthetic dataset: 60k images, FID <20, expert rating >5.0/7.0
- EOD: 0.06 → 0.04 (an additional 33% reduction)
Risk Mitigation:
- If mode collapse (low sample diversity): increase the class diversity loss (λ_diversity = 0.1 → 0.2)
- If poor quality (FID >30): increase LoRA rank (16 → 32) and training steps (10k → 20k)
- If expert rating <5.0: generate 1.5-2x as many images and filter more aggressively
Deliverables:
- data/synthetic/fairskin/: 60,000 synthetic images (balanced FST)
- checkpoints/textual_inversion/: Learned token embeddings
- checkpoints/lora/: LoRA adapter weights
- experiments/phase2_week9-11_fairskin/: Generation scripts, quality reports
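The FST-dependent real/synthetic weighting from Day 5-7 can be expressed as per-example sampling weights (e.g. for a PyTorch `WeightedRandomSampler`). The FST IV fraction below is an assumption, since the plan only specifies the I-III and V-VI mixes:

```python
import numpy as np

# From the plan above: FST I-III draws 20% synthetic / 80% real,
# FST V-VI draws 80% synthetic / 20% real. The FST IV value is assumed.
SYNTHETIC_FRACTION = {"I-III": 0.2, "IV": 0.5, "V-VI": 0.8}

def sample_weights(fst_group, is_synthetic):
    """Per-example sampling weights realizing the FST-dependent mix:
    synthetic images get weight = target synthetic fraction for their group,
    real images get the complementary weight."""
    frac = np.array([SYNTHETIC_FRACTION[g] for g in fst_group])
    return np.where(is_synthetic, frac, 1.0 - frac)

w = sample_weights(["I-III", "V-VI", "V-VI"], np.array([False, True, False]))
print(w)  # [0.8 0.8 0.2]
```

Weighted sampling (rather than physically duplicating images) keeps the dataset on disk unchanged while the loader realizes the target mix in expectation per batch.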
Week 12: Integration, Evaluation & Phase 2 Completion¶
Objective: Train final combined model, comprehensive evaluation, Phase 3 readiness
Activities:
- Day 1-2: Train final model with all three techniques
  - Loss: L_cls + 0.3×L_adv + 0.2×L_con + 0.2×L_reg
  - Dataset: 60k synthetic + 16.5k real (Fitzpatrick17k + DDI)
  - 100 epochs, ~35 GPU hours
- Day 3-4: Comprehensive fairness evaluation
  - AUROC per FST (I-VI), overall AUROC
  - EOD, DPD, equalized odds
  - ECE, calibration curves per FST
  - Confusion matrices, sensitivity/specificity per FST
- Day 5: Ablation studies
  - Baseline (no fairness)
  - FairDisCo only
  - FairDisCo + CIRCLe
  - FairDisCo + CIRCLe + FairSkin (full)
  - Quantify each technique's contribution
- Day 6-7: Documentation and handoff
  - Model card: dataset composition, subgroup metrics, limitations
  - Experiment report: figures, tables, statistical tests
  - Phase 3 preparation: hybrid architecture requirements
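The four-term Phase 2 objective combines as a simple weighted sum. A sketch with the stated weights (the individual loss values below are placeholders):

```python
import torch

def combined_loss(l_cls, l_adv, l_con, l_reg,
                  w_adv=0.3, w_con=0.2, w_reg=0.2):
    """Phase 2 objective: L_cls + 0.3*L_adv + 0.2*L_con + 0.2*L_reg.
    Classification stays dominant; the fairness terms act as regularizers."""
    return l_cls + w_adv * l_adv + w_con * l_con + w_reg * l_reg

total = combined_loss(torch.tensor(1.0), torch.tensor(0.5),
                      torch.tensor(0.4), torch.tensor(0.2))
print(total.item())  # 1.27
```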
Success Criteria (Phase 2 Targets):
- AUROC gap: <8% (target: 3.5-4.0%)
- EOD: <0.08 (target: 0.04-0.05)
- ECE: <0.10 (target: 0.06-0.08)
- Overall accuracy: >88% (target: 89-91%)
- Synthetic quality: FID <20, expert rating >5.0/7.0
Deliverables:
- models/phase2_final_fairness_mvp.pth: Final combined model
- experiments/phase2_week12_final/: Complete evaluation results
- docs/phase2_results.md: Comprehensive results report
- docs/phase3_requirements.md: Hybrid architecture specifications
- Total Phase 2 timeline: 8 weeks (56 days)
- Total GPU hours: ~227 (risk-adjusted; see computational_costs.md)
- Human time: 8 weeks full-time equivalent (1 developer)
Phase 2 Checkpoints:
- Week 6: FairDisCo complete (AUROC gap <12%)
- Week 8: CIRCLe complete (AUROC gap <9%)
- Week 11: FairSkin complete (AUROC gap <5%)
- Week 12: Phase 2 MVP complete (AUROC gap <4%, Phase 3 ready)
Reference Documents (created by THE DIDACT):
- docs/fairskin_implementation_plan.md: Detailed FairSkin guide
- docs/fairdisco_architecture.md: Complete FairDisCo specifications
- docs/circle_implementation.md: CIRCLe methodology
- docs/open_source_fairness_code.md: Repository evaluation
- docs/fairness_computational_costs.md: Cost analysis and ROI
Phase 3: Hybrid Architecture (Weeks 13-18)¶
Objective: Implement state-of-the-art hybrid model, achieve 93%+ accuracy with <4% gap
Architecture: ConvNeXtV2-Swin Transformer Hybrid
Components:
1. ConvNeXtV2-Base (first 2 stages)
   - Local feature extraction (lesion borders, texture)
   - 36.44M parameters, efficient (80-100ms inference)
2. Swin Transformer V2 Small (later stages)
   - Global attention (tone-invariant contextual features)
   - Hierarchical windows (7×7, 14×14, 28×28)
3. Attentional Feature Fusion (AFF)
   - Merges ConvNeXt and Swin features
   - Learned attention weights per branch
4. Metadata Encoder
   - Attention-MLP: FST, age, anatomical site → 64-dim embedding
   - Late fusion with image features
Training Strategy:
- Pre-train: synthetic-augmented dataset (Phase 2 output)
- Fine-tune: real data (Fitzpatrick17k + DDI + MIDAS)
- Multi-task loss: classification + adversarial + color-invariant + metadata
- Loss weights: [1.0, 0.3, 0.2, 0.1] (classification dominant)
Hyperparameters (from literature synthesis):
- Optimizer: AdamW (lr=1e-4, weight_decay=0.01)
- Scheduler: CosineAnnealingWarmRestarts
- Batch size: 32
- Epochs: 100
- Augmentation: RandAugment + FairSkin synthetic images
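The optimizer and scheduler above translate directly to PyTorch. The restart period `T_0` and multiplier `T_mult` below are assumptions, since the roadmap does not specify them:

```python
import torch

model = torch.nn.Linear(16, 2)  # stand-in for the hybrid backbone
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2)  # restart every 10 epochs, doubling each cycle

for epoch in range(3):  # one optimizer/scheduler step per epoch
    optimizer.step()
    scheduler.step()
print(optimizer.param_groups[0]["lr"])  # decayed below the 1e-4 base lr
```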
Success Criteria:
- Overall accuracy: 93-95% (ISIC 2019 benchmark)
- AUROC gap: <4% (FST I-III vs FST V-VI)
- EOD: <0.05, ECE: <0.08 (ALL FST groups)
- Inference time: <120ms (clinical acceptability)
Deliverables:
- src/models/hybrid_convnext_swin.py: Hybrid architecture implementation
- src/models/attention_fusion.py: AFF module
- experiments/hybrid_architecture/: Training pipeline + ablation studies
- docs/architecture.md: Technical documentation with diagrams
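A simplified version of the AFF module: the published AFF uses multi-scale channel attention, so this gate-based sketch only captures the core idea of learned per-branch weights blending the two feature streams:

```python
import torch
import torch.nn as nn

class AttentionalFeatureFusion(nn.Module):
    """Simplified AFF: a learned per-channel sigmoid gate blends the
    ConvNeXt (local) and Swin (global) feature vectors."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, local_feat, global_feat):
        # gate in [0, 1] decides, per channel, how much of each branch to keep
        a = self.gate(torch.cat([local_feat, global_feat], dim=-1))
        return a * local_feat + (1 - a) * global_feat

fuse = AttentionalFeatureFusion(dim=256)
out = fuse(torch.randn(4, 256), torch.randn(4, 256))
print(out.shape)  # torch.Size([4, 256])
```

The convex combination (`a` vs `1 - a`) means the fused feature always stays within the span of the two branches, which keeps the fusion stable early in training.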
Phase 4: Production Hardening (Weeks 19-24)¶
Objective: Optimize for edge deployment, add explainability, prepare for clinical validation
Key Activities:
1. FairPrune Edge Optimization
   - Analyze activation saliency per FST subgroup
   - Prune 30-40% of filters contributing to light-tone overfitting
   - Fine-tune the pruned model
   - Target: <50MB model, <80ms inference on mobile, fairness maintained
2. Quantization
   - INT8 quantization (TensorFlow Lite, ONNX Runtime)
   - Platform-specific: Core ML (iOS), TF Lite (Android)
   - Target: <12MB model (75% size reduction), <1% accuracy loss
3. Grad-CAM Explainability
   - Heatmap overlays: show model attention regions
   - Clinical feature alignment: ABCD rule (Asymmetry, Border, Color, Diameter)
   - Tone-specific failure mode analysis: qualitative review per FST
4. Model Card Documentation
   - Dataset composition: FST distribution, disease categories
   - Subgroup metrics: AUROC, sensitivity, specificity per FST
   - Limitations: intermediate-tone subjectivity, OOD performance
   - Intended use: teledermatology decision support (not standalone diagnosis)
5. Fairness Monitoring Dashboard
   - Real-time metrics: AUROC, EOD, ECE per FST (updated daily)
   - Model drift detection: KL divergence, population stability index
   - Alert system: email/SMS when fairness thresholds are exceeded
Success Criteria:
- Edge model deployed: <50MB, <80ms inference on mobile
- Grad-CAM validated: dermatologist feedback (qualitative)
- Model card complete: 10+ pages, publicly disclosed
- Monitoring dashboard operational: real-time fairness tracking
Deliverables:
- src/optimization/fairprune.py: Pruning implementation
- src/explainability/gradcam.py: Grad-CAM visualization
- docs/model_card.md: Comprehensive model documentation
- scripts/monitoring_dashboard.py: Fairness monitoring (Streamlit/Grafana)
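A minimal Grad-CAM for `src/explainability/gradcam.py`, shown on a toy CNN standing in for the hybrid backbone (the model, layer choice, and shapes are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradCAM:
    """Minimal Grad-CAM: weight the target conv layer's activations by the
    spatially averaged gradients of the class score, then ReLU."""
    def __init__(self, model, target_layer):
        self.model = model
        self.activations, self.gradients = None, None
        target_layer.register_forward_hook(self._save_act)
        target_layer.register_full_backward_hook(self._save_grad)

    def _save_act(self, module, inp, out):
        self.activations = out.detach()

    def _save_grad(self, module, grad_in, grad_out):
        self.gradients = grad_out[0].detach()

    def __call__(self, x, class_idx):
        score = self.model(x)[0, class_idx]
        self.model.zero_grad()
        score.backward()
        weights = self.gradients.mean(dim=(2, 3), keepdim=True)  # GAP over H, W
        cam = F.relu((weights * self.activations).sum(dim=1))
        return cam / (cam.max() + 1e-8)  # normalize to [0, 1]

# toy model standing in for the hybrid backbone
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
cam = GradCAM(model, model[0])
heatmap = cam(torch.randn(1, 3, 32, 32), class_idx=1)
print(heatmap.shape)  # torch.Size([1, 32, 32])
```

For the tone-specific failure analysis above, the same heatmaps would be reviewed per FST group to check whether attention stays on the lesion rather than on surrounding skin.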
Phase 5: Deployment & Validation (Weeks 25-32)¶
Objective: Deploy to teledermatology platform, conduct prospective clinical trial
Key Activities:
1. Teledermatology API
   - RESTful API: OpenAPI 3.0 specification
   - SDKs: Python and JavaScript clients
   - EHR integration: HL7 FHIR messaging
   - Deployment: cloud (AWS/Azure/GCP), auto-scaling (100+ concurrent users)
2. Prospective Clinical Trial
   - Design: multi-site (2-3 hospitals), 500+ patients, all FST types
   - Comparator: dermatologist diagnosis (gold standard: biopsy)
   - Primary outcome: sensitivity and specificity per FST (non-inferiority: 5% margin)
   - Secondary outcomes: calibration (ECE), time-to-diagnosis, patient satisfaction
   - Duration: 4-6 months
3. Continual Learning Pipeline
   - Weekly model updates: incremental learning on new labeled data
   - Bayesian generative approach: store statistics, not raw images (privacy)
   - Drift monitoring: trigger retraining when EOD or ECE exceed thresholds
4. Regulatory Documentation
   - FDA: De Novo submission (breakthrough device pathway)
   - EU: CE marking (MDR Class IIa/IIb or Class III)
   - Clinical data: prospective trial results
   - Risk analysis: FMEA (Failure Mode and Effects Analysis)
Success Criteria:
- API deployed: 99.9% uptime, <200ms response time
- Clinical trial completed: 500+ patients, all FST groups represented
- Non-inferiority demonstrated: sensitivity >95%, specificity >80% for ALL FST groups
- Regulatory submission prepared: FDA De Novo or EU CE application
Deliverables:
- src/api/: RESTful API implementation (FastAPI/Flask)
- src/continual_learning/: Incremental learning pipeline
- docs/clinical_trial_protocol.md: Trial design, statistical analysis plan
- docs/regulatory/: FDA submission documentation
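The drift-monitoring trigger in the continual-learning pipeline can be sketched with a population stability index over the model's output scores. The 0.2 alert threshold below is a common industry rule of thumb, not a value from this roadmap:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference score distribution (validation set) and live
    traffic. Rule of thumb (assumption): PSI > 0.2 signals meaningful drift."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range live scores
    e, _ = np.histogram(expected, bins=edges)
    a, _ = np.histogram(actual, bins=edges)
    e = np.clip(e / e.sum(), 1e-6, None)   # avoid log(0) on empty bins
    a = np.clip(a / a.sum(), 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
ref = rng.beta(2, 5, 5000)
print(round(population_stability_index(ref, rng.beta(2, 5, 5000)), 3))  # near zero
print(population_stability_index(ref, rng.beta(5, 2, 5000)) > 0.2)      # True
```

Computing the PSI per FST group, not just globally, matches the roadmap's requirement that drift in any subgroup triggers retraining.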
Target Performance Benchmarks¶
Literature-Derived Targets (from 100+ papers surveyed):
| Metric | FST I-III | FST IV-VI | Gap | Benchmark Source |
|---|---|---|---|---|
| AUROC | 91-93% | 89-92% | <4% | NHS DERM (deployed), BiaslessNAS |
| Sensitivity (Melanoma) | >95% | >95% | 0% | NHS DERM (97% across all FST) |
| Specificity | >80% | >80% | 0% | Clinical acceptability threshold |
| EOD | --- | --- | <0.05 | Fairness standard (5% max disparity) |
| ECE | <0.08 | <0.08 | 0% | Calibration quality (clinical trust) |
Baseline (No Fairness):
- ResNet50 on ISIC 2020: 91.3% (FST I-III) vs 75.4% (FST V-VI) = -15.9% gap
- InceptionV3 on HAM10000: 90.1% (FST I-III) vs 78.3% (FST V-VI) = -11.8% gap

Phase 2 Target (Fairness MVP):
- AUROC gap: <8% (50% reduction from baseline)
- EOD: <0.08

Phase 3 Target (Hybrid Architecture):
- AUROC gap: <4% (match BiaslessNAS, NHS DERM)
- EOD: <0.05, ECE: <0.08 (all FST groups)

Phase 5 Target (Clinical Deployment):
- Non-inferiority to dermatologists: within 5% sensitivity and 5% specificity
- Patient satisfaction: >80% (NHS DERM achieved 85%)
Key Datasets¶
Primary Training Datasets (FST Diversity):
- Fitzpatrick17k: 16,577 images, ~8% FST V-VI, dual annotation (FST + ITA)
- DDI (Stanford): 656 images, 34% FST V-VI, clinician-rated (gold standard)
- MIDAS: Biopsy-confirmed, ~28% FST V-VI, multi-modal (clinical + dermoscopic)
- SCIN (Google): 10,000+ images, ~33% FST V-VI, triple annotation (eFST, eMST, CST)
- SkinCon (MIT): Built on Fitzpatrick17k + DDI, ~30% FST V-VI, meta-concept tags
Baseline Datasets (For Comparison):
- HAM10000: 10,015 images, <5% FST V-VI (high quality, tone-imbalanced)
- ISIC 2020: 33,126 images, <3% FST V-VI (no tone labels, benchmark)

Synthetic Augmentation:
- FairSkin/DermDiff: generate 60,000 synthetic images with a balanced FST distribution
Fairness Techniques Summary¶
Three-Tier Fairness Methodology:
| Stage | Technique | Expected Impact | Complexity |
|---|---|---|---|
| Data-Level | FairSkin Diffusion Augmentation | +18-21% FST VI AUROC | High (48-72 hrs GPU) |
| Algorithm-Level | FairDisCo Adversarial Debiasing | 65% EOD reduction | Moderate |
| Algorithm-Level | CIRCLe Color-Invariant Loss | 3-5% ECE reduction | Moderate |
| Post-Processing | FairPrune Selective Pruning | 3-6% AUROC gap reduction | Low |
Trade-offs:
- Accuracy cost: 1-3% (mitigated by the contrastive loss in FairDisCo)
- Computational cost: diffusion training (48-72 hrs), NAS (7-14 days if used)
- Calibration: ECE may degrade 5-10% with synthetic data (mitigation: temperature scaling)
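The temperature-scaling mitigation fits a single scalar on a held-out calibration split. A sketch: the toy below fabricates overconfident logits by scaling, so the recovered temperature should land near the scaling factor:

```python
import torch

def fit_temperature(logits, labels, max_iter=200):
    """Learn T > 0 minimizing NLL of softmax(logits / T). Leaves predicted
    classes (and accuracy) unchanged while repairing calibration (ECE)."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T for positivity
    opt = torch.optim.LBFGS([log_t], max_iter=max_iter)

    def closure():
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()

torch.manual_seed(0)
base = torch.randn(500, 2)                                    # well-calibrated logits
labels = torch.distributions.Categorical(logits=base).sample()
t = fit_temperature(3.0 * base, labels)                       # overconfident by 3x
print(round(t, 2))  # roughly 3
```

Fitting T per FST group (rather than one global T) is a natural extension given the per-FST ECE targets above.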
Technology Stack¶
Deep Learning:
- PyTorch 2.0+ (primary framework)
- timm (ConvNeXt, Swin Transformer pre-trained models)
- Diffusers (Hugging Face, for FairSkin diffusion)

Data Science:
- NumPy, pandas, scikit-learn
- OpenCV, Albumentations (image preprocessing, augmentation)

Fairness:
- Fairlearn (fairness metrics)
- Custom implementations: FairDisCo, CIRCLe, FairPrune

Evaluation & Visualization:
- Matplotlib, Seaborn (plots, calibration curves)
- TensorBoard (training monitoring)
- Grad-CAM (explainability)

Deployment:
- FastAPI (RESTful API)
- TensorFlow Lite / ONNX Runtime (edge deployment)
- Docker (containerization)
- AWS/Azure/GCP (cloud hosting)
Regulatory Pathway¶
FDA (United States):
- Pathway: De Novo (Class II), or 510(k) if a predicate exists
- Precedent: DermaSensor (FDA cleared Jan 2024, breakthrough device)
- Requirements: prospective multi-site trial (500+ patients), subgroup metrics, software documentation (IEC 62304)

EU (Europe):
- Pathway: CE marking (MDR 2017/745)
- Classification: Class IIa/IIb (decision support) or Class III (autonomous diagnosis)
- Precedent: DERM (CE Class III approval 2024, 99.8% accuracy)

Timeline:
- FDA De Novo: 18-24 months (12-18 months with breakthrough designation)
- EU CE: 6-18 months (depending on class)
Ethical Considerations¶
Patient Consent:
- Transparent disclosure: the model uses skin tone for fairness-aware training
- Opt-out mechanism: tone-blind inference available (with a performance caveat)

Co-Design:
- Patient advisory board with diverse FST representation
- Iterative feedback: incorporate patient concerns (privacy, bias, transparency)

Model Card Transparency:
- Dataset composition: FST distribution, annotation methods
- Subgroup metrics: AUROC, sensitivity, specificity per FST
- Limitations: intermediate-tone subjectivity, OOD performance, synthetic data artifacts

Post-Market Surveillance:
- Quarterly fairness audits: EOD, ECE per FST
- Incident reporting: misdiagnosis cases, tone-related failures
- Continual learning: model updates based on real-world feedback
Success Metrics¶
Technical:
- AUROC gap <4% (FST I-III vs FST V-VI)
- EOD <0.05 (Equal Opportunity Difference)
- ECE <0.08 per FST (Expected Calibration Error)
- Sensitivity >95% for melanoma (ALL FST groups)
- Inference time <100ms (edge deployment)

Clinical:
- Non-inferiority to dermatologists (within 5% sensitivity and 5% specificity)
- Patient satisfaction >80%
- Reduced time-to-diagnosis (vs the standard referral pathway)
- Cost-effectiveness (QALY analysis)

Regulatory:
- FDA clearance or EU CE marking
- Model card with subgroup metrics (public disclosure)
- Post-market surveillance plan approved
References¶
Foundational Survey:
- Flores, J., & Alzahrani, N. (2025). AI Skin Cancer Detection Across Skin Tones: A Survey of Experimental Advances, Fairness Techniques, and Dataset Limitations. Computers (MDPI). [Submitted]

Key Techniques:
- FairSkin: Ju et al. (2024). Diffusion-based synthetic augmentation for skin tone fairness.
- FairDisCo: Du et al. (2022). Fair disease classification via disentanglement and contrastive learning. ECCV 2022 Workshops (ISIC Skin Image Analysis).
- CIRCLe: Pakzad et al. (2022). Color-invariant representation learning. ECCV 2022.
- BiaslessNAS: Pacal et al. (2025). Neural architecture search for fairness. Biomedical Signal Processing and Control, 104, 107627.

Deployed Systems:
- NHS DERM: Skin Analytics (2022-2023). 9,649 patients, 97% melanoma sensitivity, 85% patient satisfaction.
- DermaSensor: FDA cleared Jan 2024. 96% malignancy sensitivity in primary care.

Datasets:
- Fitzpatrick17k: Groh et al. (2021). 16,577 images with FST labels.
- DDI: Daneshjou et al. (2022). 656 diverse dermatology images. Science Advances, 8(25), eabq6147.
- MIDAS: Stanford AIMI. Multimodal biopsy-confirmed dataset.
Acknowledgments¶
This project builds upon the comprehensive survey by Jasmin Flores and Dr. Nabeel Alzahrani from California State University, San Bernardino. Their systematic analysis of 100+ experimental studies (2022-2025) on fairness-aware skin cancer detection provided the foundational research that informs every phase of this implementation roadmap.
Special recognition to: - The research community: Authors of FairSkin, FairDisCo, CIRCLe, BiaslessNAS, and other fairness techniques - Dataset creators: Fitzpatrick17k, DDI, MIDAS, SCIN, SkinCon teams for enabling diverse-tone research - Clinical pioneers: NHS DERM, DermaSensor teams for demonstrating real-world viability
- Project Status: Foundation Phase
- Last Updated: 2025-10-13
- Framework: MENDICANT_BIAS Multi-Agent System
- License: Apache 2.0

Team:
- Strategic Research: the_didact
- Core Development: hollowed_eyes
- QA & Security: loveless
- DevOps & Deployment: zhadyz
- Supreme Orchestrator: mendicant_bias