{ "cells": [ { "cell_type": "markdown", "id": "xai-gradcam-title", "metadata": {}, "source": [ "# Explainable AI (XAI) for Partially Spoofed Audio Detection with Grad-CAM\n", "**Author:** Tianchi Liu \n", "**Status:** In Progress\n", "\n", "**Reference:** [How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?](https://arxiv.org/abs/2406.02483)\n" ] }, { "cell_type": "markdown", "id": "overview", "metadata": {}, "source": [ "## Overview\n", "\n", "This tutorial explains the **step-by-step workflow** for applying **Explainable AI (XAI)** techniques to **partially spoofed audio detection** using the **Gradient-weighted Class Activation Mapping (Grad-CAM)** method.\n", "\n", "**Partially spoofed audio** refers to utterances where only certain segments are synthetic while others remain genuine.\n", "\n", "### 📂 Reference Implementation Path\n", "\n", "```bash\n", "egs/detection/partialspoof/x12_ssl_res1d/\n", "```\n", "\n", "### Key Components\n", "\n", "| File | Purpose |\n", "|------|--------|\n", "| `run.sh` | Main pipeline orchestrating Stages 1-10 |\n", "| `conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml` | Model configuration |\n", "| `local/prepare_data.sh` | Data preparation script |\n", "| `wedefense/bin/train.py` | Model training |\n", "| `wedefense/bin/XAI_GradCam_infer.py` | XAI heatmap extraction |\n", "| `wedefense/bin/XAI_Score_analysis.py` | XAI score analysis and visualization |\n", "\n", "### What This Tutorial Covers\n", "\n", "✅ **Complete Pipeline** - From data preparation to XAI analysis \n", "✅ **Model Architecture** - SSL-Res1D for partial spoofing detection \n", "✅ **Grad-CAM Theory** - How temporal activation maps are computed \n", "✅ **XAI Extraction** - Step-by-step extraction process \n", "✅ **Result Interpretation** - Understanding and analyzing XAI scores" ] }, { "cell_type": "markdown", "id": "pipeline-overview", "metadata": {}, "source": [ "## Complete Pipeline Overview\n", "\n", "The `run.sh` script implements a 
10-stage pipeline:\n", "\n", "```\n", "Stage 1: Data Preparation → wav.scp, utt2lab, lab2utt\n", "Stage 2: Data Format Conversion → Shard/Raw format\n", "Stage 3: Model Training → SSL-Res1D training\n", "Stage 4: Model Averaging → Average best checkpoints\n", "Stage 5: Extract Logits → Model inference\n", "Stage 6: Compute LLR Scores → Log-likelihood ratios\n", "Stage 7: Performance Evaluation → EER, min t-DCF metrics\n", "Stage 8: Analysis → Statistical tests\n", "Stage 9: XAI Extraction → Grad-CAM heatmaps\n", "Stage 10: XAI Analysis → Visualization and interpretation\n", "```\n", "\n", "**This tutorial focuses on Stages 9-10** (XAI extraction and analysis), assuming Stages 1-8 are complete." ] }, { "cell_type": "markdown", "id": "theory-gradcam", "metadata": {}, "source": [ "## Grad-CAM Theory for Audio\n", "\n", "### What is Grad-CAM?\n", "\n", "Grad-CAM (Gradient-weighted Class Activation Mapping) identifies which regions of the input the model focuses on when making predictions.\n", "\n", "### Mathematical Formulation\n", "\n", "For a target class $c$ (e.g., spoof class):\n", "\n", "1. **Forward Pass**: \n", " - Input audio → SSL Frontend → Classifier (Res1D) → Classification score $y^c$\n", " - Extract feature maps $A^k$ from target layer\n", "\n", "2. **Backward Pass**:\n", " - Compute gradients: $\\frac{\\partial y^c}{\\partial A^k}$\n", "\n", "3. **Weight Calculation** (Global Average Pooling):\n", " $$\\alpha_k^c = \\frac{1}{T}\\sum_{t=1}^{T}\\frac{\\partial y^c}{\\partial A^k_t}$$\n", " \n", " where $T$ is the temporal dimension.\n", "\n", "4. **Weighted Combination**:\n", " $$L^c_{\\text{Grad-CAM}} = \\text{ReLU}\\left(\\sum_k \\alpha_k^c A^k\\right)$$\n", "\n", "5. 
**Temporal Heatmap**:\n", " - Normalize to [0, 1]\n", " - High values indicate regions important for classification\n", "\n", "### Why Grad-CAM for Partial Spoofing?\n", "\n", "Unlike fully synthetic audio (uniform fake), partially spoofed audio requires:\n", "- **Temporal localization**: Identify *when* spoofing occurs\n", "- **Boundary detection**: Find transitions between real/fake\n", "- **Segment-level understanding**: Distinguish mixed content\n", "\n", "Grad-CAM provides this temporal resolution by showing activation strength over time." ] }, { "cell_type": "markdown", "id": "model-architecture", "metadata": {}, "source": [ "## Model Architecture: SSL-Res1D\n", "\n", "### Pipeline Components\n", "\n", "```\n", "Audio Input (16kHz)\n", " ↓\n", "[SSL Frontend] XLSR-53\n", " ↓\n", "[Classifier] Res1D Backend\n", " ↓\n", "Classification Score (Bonafide/Spoof)\n", "```\n", "\n", "### Key Configuration\n", "\n", "From `conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml`:\n", "\n", "```yaml\n", "model: ssl_multireso_gmlp\n", "model_args:\n", " feat_dim: 768 # XLSR-53 feature dimension\n", " embed_dim: -2 # Output embedding dimension\n", " num_scale: 6 # Multi-resolution scales\n", " gmlp_layers: 1\n", " batch_first: true\n", " flag_pool: ap # Attentive pooling\n", "\n", "frontend: xlsr_53\n", "xlsr_53_args:\n", " layer: 12 # Use 12th layer of XLSR-53\n", " \n", "projection_args:\n", " project_type: arc_margin\n", " scale: 30.0\n", " margin: 0.2\n", "```\n", "\n", "### Why This Architecture?\n", "\n", "1. **XLSR-53**: Self-supervised speech representations capture fine-grained acoustic patterns\n", "2. **Res1D**: 1D residual blocks effective for temporal modeling\n", "3. **Multi-Resolution**: Captures artifacts at different temporal scales\n", "4. 
**Arc Margin**: Enhances inter-class separation" ] }, { "cell_type": "markdown", "id": "stage1-data-prep", "metadata": {}, "source": [ "## Stage 1: Data Preparation\n", "\n", "### Script: `local/prepare_data.sh`\n", "\n", "### Purpose\n", "Prepare the PartialSpoof dataset in WeDefense format.\n", "\n", "### Input\n", "- PartialSpoof database directory\n", "- Protocol files: `PartialSpoof.LA.cm.{train,dev,eval}.trl.txt`\n", "\n", "### Process\n", "\n", "1. **Create wav.scp**\n", " ```bash\n", " find ${PS_dir}/${dset}/con_wav -name \"*.wav\" | awk -F\"/\" '{print $NF,$0}' | sort\n", " ```\n", " Format: `utterance_id /path/to/audio.wav`\n", "\n", "2. **Extract labels (utt2lab)**\n", " ```bash\n", " cut -d' ' -f2,5 ${PS_dir}/protocols/PartialSpoof_LA_cm_protocols/PartialSpoof.LA.cm.${dset}.trl.txt\n", " ```\n", " Format: `utterance_id bonafide/spoof`\n", "\n", "3. **Create lab2utt mapping**\n", " ```bash\n", " ./tools/utt2lab_to_lab2utt.pl ${data}/${dset}/utt2lab\n", " ```\n", " Groups utterances by label\n", "\n", "4. **Compute durations**\n", " ```bash\n", " python tools/wav2dur.py ${data}/${dset}/wav.scp ${data}/${dset}/utt2dur\n", " ```\n", "\n", "### Output Files\n", "```\n", "data/{train,dev,eval}/\n", " ├── wav.scp # Audio paths\n", " ├── utt2lab # Utterance labels\n", " ├── lab2utt # Label-to-utterance mapping\n", " └── utt2dur # Audio durations\n", "```" ] }, { "cell_type": "markdown", "id": "stage3-training", "metadata": {}, "source": [ "## Stage 3: Model Training (Overview)\n", "\n", "### Command\n", "\n", "```bash\n", "torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:$PORT \\\n", " --nnodes=1 --nproc_per_node=$num_gpus \\\n", " wedefense/bin/train.py --config $config \\\n", " --exp_dir ${exp_dir} \\\n", " --gpus $gpus \\\n", " --num_avg ${num_avg} \\\n", " --data_type \"${data_type}\" \\\n", " --train_data ${data}/train/${data_type}.list \\\n", " --train_label ${data}/train/utt2lab\n", "```\n", "\n", "### Training Process\n", "\n", "1. 
**Data Loading**: Batch sampling from shard/raw format\n", "2. **Frontend**: Extract XLSR-53 features (Layer 12)\n", "3. **Augmentation**: Optional spec augmentation, speed perturbation\n", "4. **Forward**: Encoder → Pooling → Projection\n", "5. **Loss**: Arc Margin Softmax loss\n", "6. **Optimization**: AdamW with learning rate scheduling\n", "\n", "### Key Training Parameters\n", "\n", "- **Batch size**: Typically 64-128\n", "- **Learning rate**: 1e-4 with warmup\n", "- **Epochs**: 50-100 with early stopping\n", "- **Checkpointing**: Save every epoch\n", "\n", "### Output\n", "\n", "```\n", "exp/singlereso_utt_xlsr_53_ft_backend_Res1D/\n", " ├── config.yaml\n", " ├── models/\n", " │ ├── model_1.pt\n", " │ ├── model_2.pt\n", " │ └── ...\n", " └── tensorboard/\n", "```" ] }, { "cell_type": "markdown", "id": "stage4-averaging", "metadata": {}, "source": [ "## Stage 4: Model Averaging\n", "\n", "### Purpose\n", "Average the top-N best model checkpoints to improve robustness.\n", "\n", "### Command\n", "\n", "```bash\n", "python wedefense/bin/average_model.py \\\n", " --dst_model $exp_dir/models/avg_model.pt \\\n", " --src_path $exp_dir/models \\\n", " --num 10\n", "```\n", "\n", "### Process\n", "\n", "1. Identify top-10 checkpoints by validation performance\n", "2. Load state dictionaries\n", "3. Average parameters: $\\theta_{\\text{avg}} = \\frac{1}{N}\\sum_{i=1}^{N}\\theta_i$\n", "4. Save averaged model\n", "\n", "### Output\n", "\n", "```\n", "exp/singlereso_utt_xlsr_53_ft_backend_Res1D/models/avg_model.pt\n", "```\n", "\n", "This averaged model is used for all subsequent stages." 
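, "\n",
"The averaging in step 3 can be sketched in a few lines. This is a minimal illustration only, not the actual `wedefense/bin/average_model.py` implementation; `average_checkpoints` and `paths` are hypothetical names, and all checkpoints are assumed to share one architecture:\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"def average_checkpoints(paths):\n",
"    # theta_avg = (1/N) * sum(theta_i), computed key by key over state dicts\n",
"    avg = None\n",
"    for p in paths:\n",
"        state = torch.load(p, map_location='cpu')\n",
"        if avg is None:\n",
"            avg = {k: v.clone().float() for k, v in state.items()}\n",
"        else:\n",
"            for k in avg:\n",
"                avg[k] += state[k].float()\n",
"    for k in avg:\n",
"        avg[k] /= len(paths)\n",
"    return avg\n",
"```\n",
"\n",
"In practice integer buffers (e.g. BatchNorm batch counters) may need special handling; casting to float here sidesteps dtype issues for the sketch."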
] }, { "cell_type": "markdown", "id": "stage9-xai-extraction", "metadata": {}, "source": [ "## Stage 9: XAI Extraction with Grad-CAM\n", "\n", "### Script: `wedefense/bin/XAI_GradCam_infer.py`\n", "\n", "### Command\n", "\n", "```bash\n", "CUDA_VISIBLE_DEVICES=0 python wedefense/bin/XAI_GradCam_infer.py \\\n", " --config ${exp_dir}/config.yaml \\\n", " --model_path $exp_dir/models/avg_model.pt \\\n", " --data_type \"shard\" \\\n", " --data_list ${data}/dev/shard.list \\\n", " --batch_size 1 \\\n", " --num_workers 1 \\\n", " --num_classes 2 \\\n", " --xai_scores_path ${exp_dir}/xai_scores/dev.pkl\n", "```\n", "\n", "### Step-by-Step Process\n", "\n", "#### 1. Model Preparation\n", "\n", "```python\n", "# Load pretrained model\n", "model = get_model(configs['model'])(**configs['model_args'])\n", "load_checkpoint(model, model_path)\n", "\n", "# Wrap with projection head\n", "projection = get_projection(configs['projection_args'])\n", "full_model = FullModel(model, projection, test_conf)\n", "```\n", "\n", "#### 2. Target Layer Selection\n", "\n", "```python\n", "# For SSL-Res1D, target the final pooling layer\n", "target_layer = [full_model.encoder.stat_pooling]\n", "```\n", "\n", "**Why this layer?**\n", "- Final representation before classification\n", "- Captures high-level temporal features\n", "- Maintains temporal resolution\n", "\n", "#### 3. Grad-CAM Initialization\n", "\n", "```python\n", "from pytorch_grad_cam import GradCAM\n", "from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget\n", "\n", "cam = GradCAM(model=full_model, target_layers=target_layer)\n", "```\n", "\n", "#### 4. 
Per-Utterance Extraction\n", "\n", "For each audio utterance:\n", "\n", "```python\n", "# Load audio\n", "wavs = batch['wav'].float().to(device) # Shape: (1, wav_length)\n", "\n", "# Target spoof class (class 1)\n", "targets = [ClassifierOutputTarget(1)]\n", "\n", "# Extract Grad-CAM heatmap\n", "cam_output = cam(input_tensor=wavs, targets=targets)\n", "# cam_output shape: (temporal_frames,) ranging [0, 1]\n", "```\n", "\n", "#### 5. Save Results\n", "\n", "```python\n", "results = []\n", "for utt, heatmap in zip(utterance_ids, cam_outputs):\n", " results.append([[utt], heatmap.tolist()])\n", "\n", "with open(xai_scores_path, 'wb') as f:\n", " pickle.dump(results, f)\n", "```\n", "\n", "### Output Format\n", "\n", "```python\n", "# xai_scores/dev.pkl structure:\n", "[\n", " [[\"utt_id_1\"], [0.12, 0.23, 0.89, ..., 0.34]], # Heatmap for utterance 1\n", " [[\"utt_id_2\"], [0.08, 0.15, 0.76, ..., 0.21]], # Heatmap for utterance 2\n", " ...\n", "]\n", "```\n", "\n", "Each heatmap is a 1D array where:\n", "- **Length**: Number of temporal frames\n", "- **Values**: [0, 1] indicating activation strength\n", "- **High values**: Model focuses on these regions for spoof detection" ] }, { "cell_type": "markdown", "id": "stage10-xai-analysis", "metadata": {}, "source": [ "## Stage 10: XAI Score Analysis\n", "\n", "### Script: `wedefense/bin/XAI_Score_analysis.py`\n", "\n", "### Command\n", "\n", "```bash\n", "python3 wedefense/bin/XAI_Score_analysis.py \\\n", " --set dev \\\n", " --pkl_path ${exp_dir}/xai_scores/dev.pkl \\\n", " --vad_path \"$VAD_PATH\"\n", "```\n", "\n", "### Analysis Components\n", "\n", "#### 1. Load XAI Scores and VAD Information\n", "\n", "```python\n", "# Load XAI heatmaps\n", "with open(pkl_path, 'rb') as f:\n", " xai_results = pickle.load(f)\n", "\n", "# Load voice activity detection (optional)\n", "# VAD helps focus on speech regions only\n", "vad_info = load_vad(vad_path)\n", "```\n", "\n", "#### 2. 
Compute Statistics\n", "\n", "For each utterance:\n", "\n", "```python\n", "import numpy as np\n", "from scipy.signal import find_peaks\n", "\n", "heatmap = np.array(xai_result[1])\n", "\n", "# Basic statistics\n", "mean_activation = np.mean(heatmap)\n", "max_activation = np.max(heatmap)\n", "std_activation = np.std(heatmap)\n", "\n", "# Temporal analysis: scipy's find_peaks returns (indices, properties)\n", "peak_indices, _ = find_peaks(heatmap, height=0.5)\n", "peak_regions = group_consecutive_peaks(peak_indices) # helper: merge adjacent frames\n", "```\n", "\n", "#### 3. Segment Detection\n", "\n", "**Threshold-based segmentation:**\n", "\n", "```python\n", "threshold = 0.5 # Tunable parameter\n", "spoofed_mask = heatmap > threshold\n", "\n", "# Find continuous regions\n", "segments = []\n", "in_segment = False\n", "for t, is_spoof in enumerate(spoofed_mask):\n", " if is_spoof and not in_segment:\n", " start = t\n", " in_segment = True\n", " elif not is_spoof and in_segment:\n", " end = t\n", " segments.append((start, end))\n", " in_segment = False\n", "\n", "# Close a segment that runs to the end of the utterance\n", "if in_segment:\n", " segments.append((start, len(spoofed_mask)))\n", "```\n", "\n", "#### 4. Visualization\n", "\n", "Generate plots for each utterance:\n", "\n", "**A. Temporal Activation Profile**\n", "```python\n", "import matplotlib.pyplot as plt\n", "\n", "# time_axis: frame centres in seconds (same length as heatmap)\n", "plt.figure(figsize=(12, 4))\n", "plt.plot(time_axis, heatmap, linewidth=2, color='red')\n", "plt.fill_between(time_axis, heatmap, alpha=0.3, color='red')\n", "plt.axhline(y=threshold, linestyle='--', color='blue', label='Threshold')\n", "plt.xlabel('Time (s)')\n", "plt.ylabel('Activation')\n", "plt.title(f'XAI Temporal Activation - {utterance_id}')\n", "plt.legend()\n", "```\n", "\n", "**B. Spectrogram with Heatmap Overlay**\n", "```python\n", "import librosa\n", "\n", "# Load audio and compute spectrogram\n", "audio, sr = librosa.load(audio_path, sr=16000)\n", "D = librosa.amplitude_to_db(np.abs(librosa.stft(audio)))\n", "\n", "# Overlay heatmap\n", "heatmap_2d = np.tile(heatmap, (D.shape[0], 1)) # Repeat along frequency\n", "plt.imshow(heatmap_2d, aspect='auto', cmap='hot', alpha=0.6)\n", "```\n", "\n", "**C. 
Detected Segment Boundaries**\n", "```python\n", "# Mark detected spoofed regions\n", "for start, end in segments:\n", " plt.axvspan(start, end, alpha=0.3, color='red', label='Detected Spoof')\n", "```\n", "\n", "#### 5. Aggregate Analysis\n", "\n", "**Compare Bonafide vs Spoof distributions:**\n", "\n", "```python\n", "# Separate by ground truth label\n", "bonafide_activations = []\n", "spoof_activations = []\n", "\n", "for result, label in zip(xai_results, labels):\n", " mean_act = np.mean(result[1])\n", " if label == 'bonafide':\n", " bonafide_activations.append(mean_act)\n", " else:\n", " spoof_activations.append(mean_act)\n", "\n", "# Plot distributions\n", "plt.hist(bonafide_activations, bins=50, alpha=0.5, label='Bonafide', color='green')\n", "plt.hist(spoof_activations, bins=50, alpha=0.5, label='Spoof', color='red')\n", "plt.xlabel('Mean Activation')\n", "plt.ylabel('Count')\n", "plt.legend()\n", "```\n", "\n", "### Output\n", "\n", "```\n", "exp/xai_scores/\n", " ├── dev.pkl # Raw heatmaps\n", " ├── analysis/\n", " │ ├── temporal_profiles/ # Per-utterance plots\n", " │ ├── segment_detection/ # Detected boundaries\n", " │ ├── statistics.csv # Aggregate stats\n", " │ └── distribution.png # Bonafide vs Spoof comparison\n", "```" ] }, { "cell_type": "markdown", "id": "interpretation", "metadata": {}, "source": [ "## Interpreting XAI Results\n", "\n", "### Activation Patterns and Their Meanings\n", "\n", "| Pattern | Visual Appearance | Interpretation | Example Scenario |\n", "|---------|-------------------|----------------|------------------|\n", "| **Sharp Peaks** | 📈 Sudden spikes at specific time points | Splice boundaries detected | Partially spoofed audio with clear transitions |\n", "| **Sustained High Activation** | 🌊 Long regions with elevated values | Continuous spoofed segment | TTS-generated insertion |\n", "| **Low Flat Profile** | 📉 Consistently low values | Genuine speech | Bonafide utterance |\n", "| **Multiple Peaks** | 🎯 Several distinct high 
regions | Multiple spoofed insertions | Complex partial spoofing |\n", "| **Gradual Rise/Fall** | 📊 Smooth transitions | Soft boundaries or gradual blending | Advanced synthesis with smoothing |\n", "\n", "### Decision Guidelines\n", "\n", "#### For Bonafide Audio:\n", "- ✅ Expected: Low mean activation (<0.3)\n", "- ✅ Expected: Small standard deviation (<0.15)\n", "- ✅ Expected: No sustained high-activation regions\n", "\n", "#### For Partially Spoofed Audio:\n", "- ✅ Expected: Moderate to high mean activation (>0.4)\n", "- ✅ Expected: High variance in temporal profile\n", "- ✅ Expected: Clear peaks corresponding to fake segments\n", "- ⚠️ Watch for: Peaks aligning with VAD boundaries (may indicate model bias)\n", "\n", "### Common Pitfalls\n", "\n", "1. **Edge Effects**: High activation at utterance boundaries may be artifacts\n", " - **Solution**: Ignore first/last 100ms\n", "\n", "2. **VAD Correlation**: Model may focus on silence/non-speech regions\n", " - **Solution**: Compare XAI with VAD labels\n", "\n", "3. **Threshold Sensitivity**: Different thresholds yield different segmentations\n", " - **Solution**: Use multiple thresholds (0.3, 0.5, 0.7) for robustness\n", "\n", "4. **Model Overfitting**: Consistent patterns across all spoof types\n", " - **Solution**: Analyze per-algorithm breakdown\n", "\n", "### Validation Checklist\n", "\n", "✅ Do activation peaks align with known spoofed segments (if ground truth available)? \n", "✅ Are bonafide utterances consistently low-activation? \n", "✅ Do different spoofing algorithms show distinct patterns? \n", "✅ Are high activations focused on speech regions (not silence)? \n", "✅ Can you aurally perceive artifacts in high-activation regions?" ] }, { "cell_type": "markdown", "id": "practical-usage", "metadata": {}, "source": [ "## Practical Usage Guide\n", "\n", "### Running the Complete Pipeline\n", "\n", "#### 1. 
Setup Environment\n", "\n", "```bash\n", "cd egs/detection/partialspoof/x12_ssl_res1d\n", "source path.sh\n", "```\n", "\n", "#### 2. Configure Paths\n", "\n", "Edit `run.sh`:\n", "```bash\n", "PS_dir=/path/to/PartialSpoof/database\n", "data=/path/to/output/data\n", "config=conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml\n", "exp_dir=exp/singlereso_utt_xlsr_53_ft_backend_Res1D\n", "VAD_PATH=/path/to/vad_annotations # Optional\n", "```\n", "\n", "#### 3. Run Data Preparation (Stage 1-2)\n", "\n", "```bash\n", "bash run.sh --stage 1 --stop_stage 2\n", "```\n", "\n", "#### 4. Train Model (Stage 3-4)\n", "\n", "```bash\n", "bash run.sh --stage 3 --stop_stage 4 --gpus \"[0]\"\n", "```\n", "\n", "**Training time**: ~24-48 hours on single GPU\n", "\n", "#### 5. Evaluate Model (Stage 5-7)\n", "\n", "```bash\n", "bash run.sh --stage 5 --stop_stage 7\n", "```\n", "\n", "Check performance:\n", "```\n", "EER: X.XX%\n", "min t-DCF: X.XXX\n", "```\n", "\n", "#### 6. Extract XAI (Stage 9)\n", "\n", "```bash\n", "bash run.sh --stage 9 --stop_stage 9 --gpus \"[0]\"\n", "```\n", "\n", "**Extraction time**: ~1-2 hours for eval set\n", "\n", "#### 7. 
Analyze XAI (Stage 10)\n", "\n", "```bash\n", "bash run.sh --stage 10 --stop_stage 10\n", "```\n", "\n", "### Customization Options\n", "\n", "#### Change Target Layer\n", "\n", "In `wedefense/bin/XAI_GradCam_infer.py`:\n", "```python\n", "# Original: final pooling layer\n", "target_layer = [full_model.encoder.stat_pooling]\n", "\n", "# Alternative: intermediate layer\n", "target_layer = [full_model.encoder.layer4] # Earlier features\n", "```\n", "\n", "#### Adjust Detection Threshold\n", "\n", "In `XAI_Score_analysis.py`:\n", "```python\n", "# Default threshold\n", "threshold = 0.5\n", "\n", "# Stricter detection (fewer false positives)\n", "threshold = 0.7\n", "\n", "# More sensitive (catch subtle spoofs)\n", "threshold = 0.3\n", "```\n", "\n", "#### Target Different Class\n", "\n", "```python\n", "# Original: target spoof class\n", "targets = [ClassifierOutputTarget(1)]\n", "\n", "# Alternative: target bonafide class (what makes it genuine?)\n", "targets = [ClassifierOutputTarget(0)]\n", "```" ] }, { "cell_type": "markdown", "id": "summary", "metadata": {}, "source": [ "## Summary\n", "\n", "This tutorial covered the complete workflow for XAI-based partially spoofed audio detection:\n", "\n", "### Key Takeaways\n", "\n", "✅ **Pipeline Architecture**\n", " - 10-stage pipeline from data to XAI analysis\n", " - SSL-Res1D model with XLSR-53 frontend\n", " - Grad-CAM for temporal activation mapping\n", "\n", "✅ **XAI Extraction Process**\n", " - Target layer selection critical for interpretability\n", " - Per-utterance temporal heatmaps\n", " - Batch processing for efficiency\n", "\n", "✅ **Result Interpretation**\n", " - Activation patterns indicate spoofed regions\n", " - Threshold-based segment detection\n", " - Statistical validation essential\n", "\n", "✅ **Practical Considerations**\n", " - Model quality affects XAI quality\n", " - VAD integration improves focus\n", " - Cross-validation with audio inspection\n", "\n", "### Limitations and Future Directions\n", 
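"\n",
"How much the segmentation threshold from Stage 10 matters can be made concrete with a toy example (the `heatmap` values and `count_segments` helper below are illustrative only):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def count_segments(heatmap, threshold):\n",
"    mask = (heatmap > threshold).astype(int)\n",
"    # A segment starts wherever the mask switches from 0 to 1\n",
"    return int(np.sum(np.diff(np.concatenate(([0], mask))) == 1))\n",
"\n",
"heatmap = np.array([0.4, 0.6, 0.4, 0.6, 0.4])\n",
"for th in (0.3, 0.5, 0.7):\n",
"    print(th, count_segments(heatmap, th))  # 0.3 -> 1 segment, 0.5 -> 2, 0.7 -> 0\n",
"```\n",
"\n",
"This is why the pitfalls list above recommends sweeping several thresholds (0.3, 0.5, 0.7) rather than trusting a single cutoff.\n",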
"\n", "⚠️ **Current Limitations:**\n", "- Grad-CAM shows correlation, not causation\n", "- Requires well-trained model\n", "- Threshold selection is dataset-dependent\n", "- May miss subtle artifacts\n", "\n", "🔬 **Future Work:**\n", "- Multi-layer XAI fusion\n", "- Attention-based explainability\n", "- Frame-level ground truth comparison\n", "- Real-time XAI for streaming audio\n", "\n", "### Resources\n", "\n", "📂 **Implementation**: `egs/detection/partialspoof/x12_ssl_res1d/` \n", "📄 **Paper**: [arxiv.org/abs/2406.02483](https://arxiv.org/abs/2406.02483) \n", "💻 **GitHub**: [github.com/zlin0/wedefense](https://github.com/zlin0/wedefense) \n", "📖 **Docs**: [wedefense.readthedocs.io](https://wedefense.readthedocs.io)" ] }, { "cell_type": "markdown", "id": "references", "metadata": {}, "source": [ "## References\n", "\n", "1. **Partial Spoofing Detection**: Liu et al., \"How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?\", 2024 [[paper](https://arxiv.org/abs/2406.02483)]\n", "\n", "2. **Grad-CAM**: Selvaraju et al., \"Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization\", ICCV 2017 [[paper](https://arxiv.org/abs/1610.02391)]\n", "\n", "3. **SSL Representations**: Baevski et al., \"wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations\", NeurIPS 2020 [[paper](https://arxiv.org/abs/2006.11477)]\n", "\n", "4. **XLSR**: Conneau et al., \"Unsupervised Cross-lingual Representation Learning for Speech Recognition\", Interspeech 2021 [[paper](https://arxiv.org/abs/2006.13979)]\n", "\n", "5. **PartialSpoof Dataset**: Guo et al., \"Partially Spoofed Audio Detection\", ASVspoof 2019 [[paper](https://arxiv.org/abs/2105.08050)]\n", "\n", "6. **WeDefense Framework**: [[GitHub](https://github.com/zlin0/wedefense)] [[Documentation](https://wedefense.readthedocs.io)]\n", "\n", "7. 
**PyTorch Grad-CAM**: [[GitHub](https://github.com/jacobgil/pytorch-grad-cam)]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.0" } }, "nbformat": 4, "nbformat_minor": 5 }