CIRCLe: Color-Invariant Representation Learning - Implementation Guide

Executive Summary

CIRCLe (Color Invariant Representation Learning for Unbiased Classification of Skin Lesions) is an algorithm-level fairness technique that enforces skin tone invariance through regularization. By encouraging similar latent representations for images with the same diagnosis but different skin tones, CIRCLe improves calibration and out-of-distribution (OOD) generalization.

Expected Impact: 3-5% ECE (Expected Calibration Error) reduction, improved OOD generalization, +2-4% AUROC for FST V-VI (Pakzad et al., 2022)

1. Methodology Overview

1.1 Core Concept

Problem: Deep learning models learn spurious correlations between skin color and diagnosis
- Example: "Dark skin + lesion = benign nevus" (dataset bias)
- Result: Poor performance on rare (diagnosis, FST) combinations

Solution: Regularize latent embeddings to be invariant to skin tone transformations
- Same diagnosis, different FST → similar embeddings
- Different diagnosis, any FST → dissimilar embeddings

Mathematical Formulation:

L_total = L_cls + λ_reg × L_reg

Where:
L_cls = CrossEntropy(f(x), y_diagnosis)
L_reg = ||f(x_original) - f(T_FST(x_original))||²

f(·): Feature extractor (e.g., ResNet50 embeddings)
T_FST(·): Skin tone transformation (FST I ↔ VI)
λ_reg: Regularization strength (typically 0.1-0.3)
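As a worked illustration of the formulation above, the following sketch evaluates L_total for a single sample in plain Python. All numbers are toy values invented for illustration, and the helper names (`squared_l2`, `cross_entropy`) are hypothetical rather than taken from the paper or its repository. The logits are supplied directly because the classification head is not specified at this point.

```python
import math

def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw logits (log-sum-exp form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]

# Toy values, chosen only for illustration (not from any experiment).
emb_original = [0.5, 1.0, -0.5]      # f(x_original)
emb_transformed = [0.4, 1.2, -0.5]   # f(T_FST(x_original))
logits = [2.0, 0.0, 0.0]             # classifier output for x_original
label = 0                            # y_diagnosis
lambda_reg = 0.2

l_cls = cross_entropy(logits, label)
l_reg = squared_l2(emb_original, emb_transformed)
l_total = l_cls + lambda_reg * l_reg
```

With these values L_reg = 0.05, so the regularizer contributes λ_reg × L_reg = 0.01 to the total; the classification term dominates unless the two embeddings drift far apart.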

1.2 Pipeline

Original Image (FST III)
        |
        v
   ┌─────────────────────┐
   │ Skin Tone Transformer│  ← StarGAN or Color Transformation
   │    T_FST(x)          │     Generate FST I, VI versions
   └─────────────────────┘
        |
        ├────────────────┐
        v                v
  x_FST-I          x_FST-VI
        |                |
        v                v
   ┌────────────────────────┐
   │  Feature Extractor (f) │  ← ResNet50 or other backbone
   │  Shared Weights        │
   └────────────────────────┘
        |                |
        v                v
   emb_FST-I       emb_FST-VI
        |                |
        └───────┬────────┘
                v
        Regularization Loss
        L_reg = ||emb_FST-I - emb_FST-VI||²

                +

        Classification Loss
        L_cls = CrossEntropy(f(x), y)

2. Skin Tone Transformation Approaches

2.1 Approach 1: StarGAN (Original Paper)

Architecture: StarGAN v2 (Choi et al., 2020)
- Generator: Transforms image from source FST → target FST
- Discriminator: Verifies realism of generated images
- Style encoder: Extracts skin tone style codes

Training Requirements:
- Dataset: 5,000+ images per FST class (for robust GAN training)
- GPU: 1x RTX 3090 (24GB VRAM)
- Training time: 100-200 epochs × 1 hour/epoch = 100-200 hours
- Hyperparameters:
  - Learning rate: 1e-4 (generator), 1e-4 (discriminator)
  - Batch size: 8 (high-resolution images)
  - Adversarial loss weight: 1.0
  - Style reconstruction loss weight: 1.0
  - Cycle consistency loss weight: 1.0

Advantages:
- High-quality transformations (realistic skin tone changes)
- Preserves lesion morphology (shape, texture, borders)
- Medical domain adaptation possible (fine-tune on dermoscopy)

Disadvantages:
- Complex training (GAN instability, mode collapse risks)
- Requires large FST-diverse dataset (5k+ images per FST)
- Long training time (100-200 GPU hours)
- Potential artifacts (blurriness, checkerboard patterns)

Implementation (using official StarGAN repository):

# Clone StarGAN v2
git clone https://github.com/clovaai/stargan-v2
cd stargan-v2

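# Prepare dataset (one folder per FST class).
# Note: prepare_data.py is assumed to be a project-specific helper script; it is not shipped with the stargan-v2 repository.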
python prepare_data.py \
    --input_dir data/fitzpatrick17k \
    --output_dir data/stargan_fst \
    --attribute fst \
    --classes I,II,III,IV,V,VI

# Train StarGAN
python main.py \
    --mode train \
    --num_domains 6 \
    --train_img_dir data/stargan_fst/train \
    --val_img_dir data/stargan_fst/val \
    --batch_size 8 \
    --total_iters 100000 \
    --lambda_reg 1.0 \
    --lambda_sty 1.0 \
    --lambda_cyc 1.0

# Generate transformed images
python main.py \
    --mode sample \
    --checkpoint_dir checkpoints/stargan_fst \
    --result_dir results/transformed \
    --src_dir data/fitzpatrick17k/test

Quality Validation:
- FID (Fréchet Inception Distance): <30 per FST transformation
- LPIPS (perceptual similarity): 0.2-0.4 vs. the original (structure should be preserved)
- Expert review: Dermatologist rating >4/7 for realism

2.2 Approach 2: Simple Color Transformations (Practical Alternative)

Concept: Approximate skin tone changes using color space manipulations
- No GAN training required
- Fast, deterministic, no artifacts
- Less realistic but sufficient for regularization

Transformations:

HSV (Hue, Saturation, Value) Adjustment:

import cv2
import numpy as np

def transform_skin_tone_hsv(image, target_fst):
    """
    Transform image to target FST using HSV adjustments.

    FST I-III (Light): Increase brightness, decrease saturation
    FST IV (Intermediate): Minimal changes
    FST V-VI (Dark): Decrease brightness, increase saturation
    """
    hsv = cv2.cvtColor(image, cv2.COLOR_RGB2HSV).astype(np.float32)

    # Define FST-specific adjustments (empirically tuned)
    fst_adjustments = {
        "I":  {"h": 0,   "s": -0.2, "v": +0.3},   # Very light
        "II": {"h": 0,   "s": -0.1, "v": +0.2},   # Light
        "III": {"h": 0,  "s": 0.0,  "v": +0.1},   # Light-medium
        "IV": {"h": 0,   "s": 0.0,  "v": 0.0},    # Medium (baseline)
        "V":  {"h": +5,  "s": +0.1, "v": -0.2},   # Dark
        "VI": {"h": +10, "s": +0.2, "v": -0.3},   # Very dark
    }

    adj = fst_adjustments[target_fst]

    # Apply adjustments (OpenCV 8-bit HSV ranges: H in [0, 179], S and V in [0, 255])
    hsv[:, :, 0] = (hsv[:, :, 0] + adj["h"]) % 180                  # Hue is circular, so wrap instead of clipping
    hsv[:, :, 1] = np.clip(hsv[:, :, 1] * (1 + adj["s"]), 0, 255)  # Saturation
    hsv[:, :, 2] = np.clip(hsv[:, :, 2] * (1 + adj["v"]), 0, 255)  # Value

    rgb = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)
    return rgb

LAB Color Space Adjustment (More Perceptually Uniform):
```python
def transform_skin_tone_lab(image, target_fst):
    """
    Transform using LAB color space (L*a*b*).

    For 8-bit images OpenCV rescales all three channels to 0-255:
    L: Lightness (L* × 255/100)
    a: Green-Red axis (a* + 128)
    b: Blue-Yellow axis (b* + 128)
    """
    lab = cv2.cvtColor(image, cv2.COLOR_RGB2LAB).astype(np.float32)

    # FST-specific LAB adjustments
    fst_adjustments = {
        "I":   {"L": +30, "a": +5, "b": +10},
        "II":  {"L": +20, "a": +3, "b": +7},
        "III": {"L": +10, "a": +2, "b": +5},
        "IV":  {"L": 0,   "a": 0,  "b": 0},
        "V":   {"L": -15, "a": -3, "b": -5},
        "VI":  {"L": -25, "a": -5, "b": -8},
    }

    adj = fst_adjustments[target_fst]

    # Apply adjustments (L channel most important for skin tone)
    lab[:, :, 0] = np.clip(lab[:, :, 0] + adj["L"], 0, 255)
    lab[:, :, 1] = np.clip(lab[:, :, 1] + adj["a"], 0, 255)
    lab[:, :, 2] = np.clip(lab[:, :, 2] + adj["b"], 0, 255)

    rgb = cv2.cvtColor(lab.astype(np.uint8), cv2.COLOR_LAB2RGB)
    return rgb
```

Hybrid: Skin Segmentation + LAB Adjustment:

def transform_skin_tone_segmented(image, target_fst):
    """
    1. Segment skin regions (avoid lesion)
    2. Apply LAB transformation only to skin
    3. Preserve lesion appearance
    """
    # Step 1: Segment skin (simple thresholding or U-Net).
    # segment_skin is a placeholder for a user-supplied function returning a binary HxW mask.
    skin_mask = segment_skin(image)  # Binary mask

    # Step 2: Transform only skin regions
    transformed = transform_skin_tone_lab(image, target_fst)

    # Step 3: Blend original lesion with transformed skin
    result = image.copy()
    result[skin_mask > 0] = transformed[skin_mask > 0]

    return result
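The hybrid function above relies on a `segment_skin` helper that is left undefined. One possible stand-in, sketched here under stated assumptions, thresholds the chroma channels of a YCrCb conversion. The Cr/Cb bounds are a widely used generic skin-detection heuristic, not values from the CIRCLe paper, and they should be validated on every FST group before being trusted. Note that this sketch does not explicitly exclude lesion pixels; a dedicated lesion mask would be needed for that.

```python
import numpy as np

def segment_skin(image_rgb, cr_range=(133, 173), cb_range=(77, 127)):
    """Return a uint8 mask (1 = likely skin) for an HxWx3 uint8 RGB image.

    ASSUMPTION: the Cr/Cb bounds are a generic heuristic, not tuned for
    dermatology images; check them per FST group before relying on them.
    """
    rgb = image_rgb.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma
    cr = (r - y) * 0.713 + 128.0            # red-difference chroma
    cb = (b - y) * 0.564 + 128.0            # blue-difference chroma
    mask = (
        (cr >= cr_range[0]) & (cr <= cr_range[1])
        & (cb >= cb_range[0]) & (cb <= cb_range[1])
    )
    return mask.astype(np.uint8)
```

Because the test uses chroma only, it is less sensitive to brightness than an RGB threshold, but that does not guarantee equal behavior across skin tones.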

Advantages:
- No training required (instant setup)
- Fast (CPU-based, <10ms per image)
- Deterministic (reproducible)
- No artifacts or mode collapse

Disadvantages:
- Less realistic (may not capture FST diversity)
- Heuristic parameters (require manual tuning)
- May alter lesion appearance if not segmented

Recommendation for Phase 2: Use simple color transformations
- Faster iteration (no GAN training)
- Sufficient for regularization (embeddings learn invariance)
- Upgrade to StarGAN in Phase 3 if results are insufficient

2.3 Approach 3: Pre-trained Dermatology StyleGAN (Future Work)

Concept: Use pre-trained StyleGAN2-ADA on dermatology images
- Latent space: Disentangled skin tone from lesion morphology
- Edit: Manipulate tone latent code, preserve morphology

Availability: No public dermatology StyleGAN as of 2025-01
Alternative: Train StyleGAN2-ADA on Fitzpatrick17k (requires 1-2 weeks GPU time)

3. Regularization Loss Formulation

3.1 L2 Distance Regularization (Original Paper)

Formula:

L_reg = 1/N × Σ ||f(x_i) - f(T_FST(x_i))||²

Where:
- N: Batch size
- x_i: Original image (source FST)
- T_FST(x_i): Transformed image (target FST, e.g., FST I → VI)
- f(·): Feature extractor (2048-dim embeddings from ResNet50)

PyTorch Implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.cosine_similarity below

class CIRCLeRegularizationLoss(nn.Module):
    def __init__(self, distance_metric="l2"):
        super().__init__()
        self.distance_metric = distance_metric

    def forward(self, embeddings_original, embeddings_transformed):
        """
        Args:
            embeddings_original: [batch_size, feature_dim] (e.g., 2048)
            embeddings_transformed: [batch_size, feature_dim]

        Returns:
            loss: Scalar regularization loss
        """
        if self.distance_metric == "l2":
            # Mean squared difference over batch and feature dimensions
            # (squared L2 distance per sample, scaled by 1/feature_dim)
            loss = torch.mean((embeddings_original - embeddings_transformed) ** 2)
        elif self.distance_metric == "cosine":
            # Cosine distance (1 - cosine_similarity)
            cosine_sim = F.cosine_similarity(embeddings_original, embeddings_transformed, dim=1)
            loss = torch.mean(1 - cosine_sim)
        else:
            raise ValueError(f"Unsupported distance metric: {self.distance_metric}")

        return loss

Hyperparameters:
- λ_reg = 0.1-0.3 (regularization strength)
  - 0.1: Weak regularization (minimal fairness improvement)
  - 0.2: Balanced (recommended starting point)
  - 0.3: Strong regularization (may hurt accuracy if too aggressive)

3.2 Multi-FST Regularization (Extended)

Concept: Regularize against MULTIPLE tone transformations (not just one)
- Original FST III → Transform to FST I, VI
- Encourage f(x_FST-III) ≈ f(x_FST-I) ≈ f(x_FST-VI)

Formula:

L_reg = 1/(N×K) × Σ_i Σ_k ||f(x_i) - f(T_FST-k(x_i))||²

Where K = number of target FST classes (e.g., 2: FST I and VI)

Implementation:

def multi_fst_regularization_loss(model, images, target_fsts=("I", "VI")):  # tuple avoids a mutable default argument
    """
    Regularize embeddings against multiple FST transformations.

    Args:
        model: Feature extractor (e.g., ResNet50)
        images: Original images [batch_size, 3, H, W]
        target_fsts: List of target FST classes (e.g., ["I", "VI"])

    Returns:
        loss: Multi-FST regularization loss
    """
    # Extract embeddings from original images
    embeddings_original = model.feature_extractor(images)

    total_loss = 0.0
    for target_fst in target_fsts:
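        # Transform images to target FST.
        # transform_skin_tone is assumed to be a batch wrapper around one of the
        # Section 2.2 transforms; it is not defined in this guide.
        images_transformed = transform_skin_tone(images, target_fst)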

        # Extract embeddings from transformed images
        embeddings_transformed = model.feature_extractor(images_transformed)

        # Compute L2 distance
        loss_fst = torch.mean((embeddings_original - embeddings_transformed) ** 2)
        total_loss += loss_fst

    # Average over target FST classes
    return total_loss / len(target_fsts)
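The loss above calls a batch-level `transform_skin_tone`, while the Section 2.2 functions operate on a single HxWx3 uint8 image. A possible bridge is sketched below with NumPy. It assumes the batch is a float array in [0, 1] with NCHW layout that has not yet been mean/std normalized, and the name `transform_skin_tone_batch` and its `transform_fn` argument are hypothetical. For PyTorch tensors the same idea would apply after converting to and from NumPy on the CPU.

```python
import numpy as np

def transform_skin_tone_batch(images, target_fst, transform_fn):
    """Apply a per-image uint8 RGB transform to a float NCHW batch in [0, 1].

    ASSUMPTIONS: `images` is an un-normalized NumPy array, and
    `transform_fn(image_hwc_uint8, target_fst)` returns an HxWx3 uint8 image
    (for example an HSV- or LAB-based transform).
    """
    out = np.empty_like(images, dtype=np.float32)
    for i, chw in enumerate(images):
        hwc = np.transpose(chw, (1, 2, 0))                        # CHW -> HWC
        hwc_u8 = np.clip(np.rint(hwc * 255.0), 0, 255).astype(np.uint8)
        transformed = transform_fn(hwc_u8, target_fst)            # per-image transform
        out[i] = np.transpose(transformed, (2, 0, 1)).astype(np.float32) / 255.0
    return out
```

Binding the transform once with `functools.partial(transform_skin_tone_batch, transform_fn=...)` would give a two-argument callable matching the call in the loss function. Any ImageNet-style normalization should be applied after this step, since the color transforms expect raw pixel values.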

Advantages:
- More robust (invariant to multiple FST directions)
- Better OOD generalization (handles unseen FST combinations)

Disadvantages:
- Higher computational cost (K transformations per image)
- More GPU memory (need to store K transformed images + embeddings)

4. Training Protocol

4.1 Full Training Loop

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Initialize model (CIRCLeModel is defined in Section 5.1; its forward() supports return_embeddings=True)
model = CIRCLeModel(num_classes=7).cuda()

# Loss functions
criterion_cls = nn.CrossEntropyLoss()
criterion_reg = CIRCLeRegularizationLoss(distance_metric="l2")

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# Hyperparameters
lambda_reg = 0.2  # Regularization strength
target_fsts = ["I", "VI"]  # Extreme FST classes for regularization

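# Training loop (num_epochs, train_loader, val_loader, evaluate, and
# transform_skin_tone are assumed to be defined elsewhere)
for epoch in range(num_epochs):
    model.train()  # evaluate() below is expected to switch the model to eval mode
    for images, labels, fst_labels in train_loader: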
        images = images.cuda()
        labels = labels.cuda()

        # Forward pass (original images)
        embeddings_original, logits = model(images, return_embeddings=True)
        loss_cls = criterion_cls(logits, labels)

        # Regularization: Transform to target FST, compute embedding distance
        loss_reg = 0.0
        for target_fst in target_fsts:
            # Transform images (on-the-fly or pre-computed)
            images_transformed = transform_skin_tone(images, target_fst)

            # Extract embeddings from transformed images
            embeddings_transformed, _ = model(images_transformed, return_embeddings=True)

            # Compute regularization loss
            loss_reg += criterion_reg(embeddings_original, embeddings_transformed)

        loss_reg /= len(target_fsts)

        # Total loss
        loss = loss_cls + lambda_reg * loss_reg

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Validation
    val_metrics = evaluate(model, val_loader)
    print(f"Epoch {epoch}: AUROC gap = {val_metrics['auroc_gap']:.2%}, ECE = {val_metrics['ece']:.4f}")
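The validation step above reports an ECE value without showing how it might be computed. A minimal sketch of ECE with equal-width confidence bins, in the spirit of Guo et al. (2017), follows. The function name, the 10-bin default, and the input format are assumptions rather than part of the CIRCLe codebase.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: max softmax probability per sample, each in [0, 1]
    correct:     1 if the prediction was right, else 0
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one sample")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of samples
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

For example, four predictions at confidence 0.9 of which three are correct give an ECE of 0.15. Calling the function separately on the samples of each FST group yields the per-group calibration figures used for fairness comparisons.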

4.2 Data Augmentation Strategy

Standard Augmentation (always applied):
- Random horizontal flip (p=0.5)
- Random rotation (±15°)
- Color jitter (brightness ±0.2, contrast ±0.2, saturation ±0.2)

Tone Transformation (for regularization):
- Applied during training (on-the-fly or pre-computed)
- Targets: FST I and VI (extreme classes)
- Frequency: Every batch (all images transformed)

Pre-computed vs On-the-Fly:

Option 1: Pre-compute Transformations (Recommended for Phase 2)
- Generate transformed versions of the entire dataset BEFORE training
- Store: original + FST-I + FST-VI versions (3x storage)
- Training: Load all 3 versions, compute regularization

Advantages: Fast training (no transformation overhead)
Disadvantages: 3x storage (e.g., 48GB → 144GB)

Option 2: On-the-Fly Transformation
- Transform images during training (in DataLoader)
- No extra storage

Advantages: Storage-efficient
Disadvantages: CPU overhead (~20ms per transformation), may bottleneck GPU

Recommendation: Pre-compute for Phase 2 (easier debugging, faster training)

4.3 Hyperparameter Tuning

Key Hyperparameters:
- λ_reg: 0.1, 0.2, 0.3 (start with 0.2)
- Target FST classes: ["I", "VI"] or ["I", "III", "VI"]
- Distance metric: "l2" or "cosine"
- Transformation method: HSV, LAB, or StarGAN

Tuning Strategy (Grid Search):

hyperparameter_grid = {
    "lambda_reg": [0.1, 0.2, 0.3],
    "target_fsts": [["I", "VI"], ["I", "III", "VI"]],
    "distance_metric": ["l2", "cosine"],
}

for config in iterate_grid(hyperparameter_grid):
    model = train_circle(config)
    val_metrics = evaluate(model, val_loader)
    log_experiment(config, val_metrics)

# Select best: Minimize AUROC gap, maximize calibration (minimize ECE)
best_config = select_best(
    criterion="auroc_gap + ece",  # Multi-objective
    direction="minimize"
)
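The grid-search snippet above uses `iterate_grid` and `select_best` as pseudocode. One way they could be written with the standard library is sketched below. The signatures are assumptions: in particular, `select_best` here takes the logged (config, metrics) pairs and a callable criterion, which differs from the keyword form shown above.

```python
import itertools

def iterate_grid(grid):
    """Yield one dict per combination of the grid's values."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def select_best(results, criterion=lambda m: m["auroc_gap"] + m["ece"]):
    """Return the (config, metrics) pair with the lowest criterion value."""
    return min(results, key=lambda pair: criterion(pair[1]))

hyperparameter_grid = {
    "lambda_reg": [0.1, 0.2, 0.3],
    "target_fsts": [["I", "VI"], ["I", "III", "VI"]],
    "distance_metric": ["l2", "cosine"],
}
configs = list(iterate_grid(hyperparameter_grid))  # 3 x 2 x 2 = 12 configurations
```

The unweighted sum of AUROC gap and ECE treats both metrics as equally important even though their scales may differ; a weighted sum or a Pareto comparison may be a better fit depending on project priorities.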

Expected Training Time per Config:
- 100 epochs × 15 min/epoch = 25 hours (RTX 3090)
- 12 configs (3 λ_reg × 2 target_fsts × 2 metrics) = 300 hours total
- Parallelize: 4 GPUs → 75 hours (~3.1 days)

5. Model Architecture

5.1 Feature Extractor with Dual Outputs

import torch
import torch.nn as nn
import torchvision.models as models

class CIRCLeModel(nn.Module):
    def __init__(self, num_classes=7, backbone="resnet50", pretrained=True):
        super().__init__()

        # Backbone
        if backbone == "resnet50":
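            # torchvision >= 0.13 selects weights via `weights=`; the older `pretrained=` flag is deprecated.
            # IMAGENET1K_V1 matches what `pretrained=True` used to load.
            weights = models.ResNet50_Weights.IMAGENET1K_V1 if pretrained else None
            self.backbone = models.resnet50(weights=weights)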
            feature_dim = self.backbone.fc.in_features  # 2048
            self.backbone.fc = nn.Identity()  # Remove original FC
        else:
            raise ValueError(f"Unsupported backbone: {backbone}")

        # Classification head
        self.classifier = nn.Sequential(
            nn.Linear(feature_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    def forward(self, x, return_embeddings=False):
        # Extract features
        embeddings = self.backbone(x)

        # Classification
        logits = self.classifier(embeddings)

        if return_embeddings:
            return embeddings, logits
        else:
            return logits

    def feature_extractor(self, x):
        """Extract embeddings only (for regularization)."""
        return self.backbone(x)

5.2 Integration with Existing Models

Scenario 1: Add CIRCLe to Baseline ResNet50
- Train baseline ResNet50 (Phase 1)
- Add regularization loss (Phase 2)
- Continue training for 50-100 epochs

Scenario 2: Combine CIRCLe + FairDisCo
- Use FairDisCo architecture (adversarial + contrastive)
- Add CIRCLe regularization as a third auxiliary loss term
- Total loss: L_cls + λ_adv × L_adv + λ_con × L_con + λ_reg × L_reg

Combined Loss:

loss = (
    criterion_cls(diagnosis_logits, labels) +
    0.3 * criterion_adv(fst_logits, fst_labels) +
    0.2 * criterion_con(contrastive_embeddings, labels, fst_labels) +
    0.2 * criterion_circle(embeddings_original, embeddings_transformed)
)

6. Computational Requirements

6.1 GPU Requirements

Training (Batch size 64, ResNet50):
- VRAM: ~10GB (original images) + ~10GB (transformed images) = 20GB
- Minimum: 1x RTX 3090 (24GB VRAM)
- Recommended: 1x RTX 4090 (24GB VRAM, 1.5x faster)

VRAM Optimization:
- Mixed precision (FP16): Reduces to ~12GB
- Gradient checkpointing: Reduces to ~10GB (slower training)

Training Time:
- Single epoch: ~18 minutes (Fitzpatrick17k, 16,577 images, batch 64)
  - Original images: 15 min
  - Transformed images (2x FST): +3 min overhead
- 100 epochs: ~30 hours (RTX 3090)

Inference Time:
- No overhead (regularization only during training)
- Same as baseline: ~30ms per image (RTX 3090)

6.2 Storage Requirements

Pre-computed Transformations:
- Original dataset: 48GB (Fitzpatrick17k, 512×512 PNG)
- Transformed FST-I: 48GB
- Transformed FST-VI: 48GB
- Total: 144GB (3x original)

On-the-Fly Transformation: No extra storage (0GB)

7. Open-Source Implementation

7.1 Official CIRCLe Repository

GitHub: https://github.com/arezou-pakzad/CIRCLe

Key Details:
- Language: Python
- Framework: PyTorch
- License: Not specified (assume academic use, contact authors for commercial use)

Provided Code:
- train_stargan.py: Train StarGAN skin tone transformer
- train_classifier.py: Train classifier with/without regularization
- models/: ResNet, DenseNet, MobileNet implementations
- utils/regularization.py: CIRCLe regularization loss

Training Command:

# Step 1: Train StarGAN (optional, skip if using simple transformations)
python train_stargan.py \
    --dataset fitzpatrick17k \
    --num_domains 6 \
    --epochs 200

# Step 2: Train classifier with CIRCLe regularization
python train_classifier.py \
    --model resnet50 \
    --dataset fitzpatrick17k \
    --use_regularization \
    --lambda_reg 0.2 \
    --target_fsts I,VI

Model Checkpoints: Not publicly released (train from scratch)

7.2 Mirror Repository

GitHub: https://github.com/sfu-mial/CIRCLe (Simon Fraser University)
- Mirror of original repository
- Same codebase, alternative hosting

7.3 Integration Assessment

Ease of Integration: Moderate
- Well-structured, modular code
- Supports multiple backbones (ResNet, DenseNet, MobileNet, VGG)
- Requires StarGAN training (complex) or adaptation to simple transformations

Code Quality: Good
- PyTorch best practices
- Configurable hyperparameters (command-line arguments)
- Limited documentation (assumes familiarity with the paper)

Recommended Approach:
1. Clone repository, install dependencies
2. Skip StarGAN training (Phase 2), use simple LAB transformations
3. Adapt train_classifier.py to use color transformations (modify data loader)
4. Run experiments with λ_reg = 0.1, 0.2, 0.3
5. Integrate into Phase 2 pipeline (after FairSkin + FairDisCo)

8. Implementation Timeline

Week 1: Setup & Simple Transformations
- Day 1-2: Install dependencies, download Fitzpatrick17k
- Day 3-4: Implement simple color transformations (HSV, LAB)
- Day 5: Validate transformations (visual inspection, LPIPS)
- Day 6-7: Pre-compute transformed datasets (FST I, VI versions)

Week 2: CIRCLe Integration
- Day 1-2: Implement regularization loss (L2 distance)
- Day 3: Modify training loop (add regularization term)
- Day 4-5: Debug (verify gradients flow correctly)
- Day 6-7: Baseline experiment (λ_reg = 0.2)

Week 3: Hyperparameter Tuning
- Day 1-3: Grid search (λ_reg, target FST, distance metric)
- Day 4-5: Analyze results (AUROC gap, ECE per config)
- Day 6-7: Final training with best config

Week 4: Combined Fairness (CIRCLe + FairDisCo)
- Day 1-2: Integrate CIRCLe into FairDisCo architecture
- Day 3-5: Train combined model (100 epochs, ~30 hours)
- Day 6-7: Evaluate fairness (AUROC gap, EOD, ECE)

Total: 4 weeks (28 days)
- GPU time: ~150 hours (6 days continuous)
- Human time: ~50 hours (1.25 weeks full-time equivalent)

9. Success Criteria

Fairness Metrics:
- AUROC gap reduction: +2-4% (from FairDisCo 8-10% → CIRCLe 6-8%)
- ECE reduction: 3-5% (improved calibration)
- OOD generalization: +3-5% AUROC on held-out FST classes

Accuracy Maintenance:
- Overall accuracy: >88% (no degradation from Phase 2 baseline)
- AUROC (average): >90%

Calibration:
- ECE <0.08 for ALL FST groups (vs 0.10-0.12 baseline)
- Reliability diagrams: Tighter fit to diagonal (better calibration)
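The fairness criteria above depend on an "AUROC gap" that this guide does not define explicitly. The sketch below assumes it means the difference between the highest and lowest per-group AUROC across FST groups, and it handles the binary case only (a multi-class setup would need one-vs-rest or macro averaging on top). The function names and the group labels in the usage note are hypothetical.

```python
def binary_auroc(scores, labels):
    """AUROC via pairwise comparison (ties count 0.5). O(P*N), fine for a sketch."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes to compute AUROC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auroc_gap(scores, labels, groups):
    """Return (max - min per-group AUROC, dict of per-group AUROC values)."""
    per_group = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        per_group[g] = binary_auroc(
            [scores[i] for i in idx], [labels[i] for i in idx]
        )
    return max(per_group.values()) - min(per_group.values()), per_group
```

Passing group labels such as "I-II" and "V-VI" alongside the validation scores returns both the gap and the per-group values, which makes it easy to see which group drives the disparity.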

10. Risk Mitigation

Risk 1: Simple Transformations Insufficient (Poor Fairness Gain)
Mitigation:
- Increase λ_reg (0.2 → 0.3 → 0.4)
- Use skin segmentation (apply transformation only to skin)
- Upgrade to StarGAN (higher quality transformations)

Risk 2: StarGAN Training Fails (Mode Collapse, Artifacts)
Mitigation:
- Use spectral normalization (improves GAN stability)
- Increase training data (need 5k+ images per FST)
- Lower expectations (simple transformations may suffice)

Risk 3: Over-Regularization (Accuracy Drops)
Mitigation:
- Reduce λ_reg (0.2 → 0.1)
- Use cosine distance (softer than L2)
- Monitor validation accuracy (stop if it drops >2%)

Risk 4: High Storage Overhead (144GB Pre-computed)
Mitigation:
- Use on-the-fly transformation (0GB extra storage)
- Optimize CPU transformation (multi-threading, <10ms overhead)
- Compress transformed images (lossy JPEG, -50% size)
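The Risk 3 mitigation of stopping when validation accuracy drops by more than 2% could be implemented with a small monitor like the one sketched below. It interprets the threshold as an absolute drop of 0.02 relative to the best accuracy seen so far; that interpretation, along with the class and method names, is an assumption.

```python
class AccuracyDropMonitor:
    """Signal a stop when validation accuracy falls too far below its best value."""

    def __init__(self, max_drop=0.02):
        self.max_drop = max_drop  # allowed absolute drop (0.02 = 2 points)
        self.best = None          # best validation accuracy seen so far

    def should_stop(self, val_accuracy):
        # A new best (or the first value) resets the reference point
        if self.best is None or val_accuracy > self.best:
            self.best = val_accuracy
            return False
        return (self.best - val_accuracy) > self.max_drop
```

Calling `should_stop` once per epoch with the current validation accuracy is enough; a relative threshold, or a comparison against the unregularized baseline instead of the running best, would be an equally reasonable variant.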

11. Comparison: StarGAN vs Simple Transformations

Aspect	StarGAN	Simple Color Transformations
Training Time	100-200 GPU hours	0 hours (no training)
Realism	High (photo-realistic)	Low-moderate (heuristic)
Implementation Complexity	High (GAN training, hyperparameters)	Low (10-20 lines of code)
Fairness Improvement	+4-5% AUROC gap reduction	+2-3% AUROC gap reduction
Artifacts	Potential (blur, checkerboard)	Minimal (deterministic)
Dataset Requirements	5k+ images per FST	Any size (even 100 images)
Recommendation	Phase 3+ (after MVP)	Phase 2 (rapid iteration)

Recommendation: Use simple transformations for Phase 2, upgrade to StarGAN in Phase 3 if needed.

12. Key Insights from Literature

Pakzad et al. (2022) Findings:
- CIRCLe improves equal opportunity (+5%) and calibration (ECE -3-5%)
- Regularization most effective on ResNet50 (vs MobileNet, DenseNet)
- Multi-FST regularization (FST I + VI) outperforms single-FST (FST I only)
- Combined with data augmentation (FairSkin), achieves 91.3% AUROC with 3.7% gap

Best Practices:
- Start with λ_reg = 0.2, tune ±0.1 based on validation
- Use both extreme FST classes (I and VI) for regularization
- Pre-compute transformations for faster training (storage permitting)
- Combine with adversarial debiasing (FairDisCo) for synergistic effect

13. References

Primary Paper:
- Pakzad, A., Abhishek, K., Hamarneh, G. (2023). "CIRCLe: Color Invariant Representation Learning for Unbiased Classification of Skin Lesions." ECCV 2022 Workshops. arXiv:2208.13528

Skin Tone Transformation:
- Choi, Y., et al. (2020). "StarGAN v2: Diverse Image Synthesis for Multiple Domains." CVPR.

Color Spaces:
- Fairchild, M.D. (2013). "Color Appearance Models." Wiley. (LAB color space)

Calibration:
- Guo, C., et al. (2017). "On Calibration of Modern Neural Networks." ICML. (Expected Calibration Error)

Document Version: 1.0
Last Updated: 2025-10-13
Author: THE DIDACT (Strategic Research Agent)
Status: IMPLEMENTATION-READY
Next Review: Post-Phase 2 (Week 10)