{ "cells": [ { "cell_type": "markdown", "id": "xai-gradcam-title", "metadata": {}, "source": [ "# Explainable AI (XAI) for Partially Spoofed Audio Detection with Grad-CAM\n", "**Author:** Tianchi Liu \n", "**Status:** In Progress\n", "\n", "**Reference:** [How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?](https://arxiv.org/abs/2406.02483)\n" ] }, { "cell_type": "markdown", "id": "overview", "metadata": {}, "source": [ "## Overview\n", "\n", "This tutorial explains the **step-by-step workflow** for applying **Explainable AI (XAI)** techniques to **partially spoofed audio detection** using the **Gradient-weighted Class Activation Mapping (Grad-CAM)** method.\n", "\n", "**Partially spoofed audio** refers to utterances where only certain segments are synthetic while others remain genuine.\n", "\n", "### 📂 Reference Implementation Path\n", "\n", "```bash\n", "egs/detection/partialspoof/x12_ssl_res1d/\n", "```\n", "\n", "### Key Components\n", "\n", "| File | Purpose |\n", "|------|--------|\n", "| `run.sh` | Main pipeline orchestrating Stages 1-10 |\n", "| `conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml` | Model configuration |\n", "| `local/prepare_data.sh` | Data preparation script |\n", "| `wedefense/bin/train.py` | Model training |\n", "| `wedefense/bin/XAI_GradCam_infer.py` | XAI heatmap extraction |\n", "| `wedefense/bin/XAI_Score_analysis.py` | XAI score analysis and visualization |\n", "\n", "### What This Tutorial Covers\n", "\n", "✅ **Complete Pipeline** - From data preparation to XAI analysis \n", "✅ **Model Architecture** - SSL-Res1D for partial spoofing detection \n", "✅ **Grad-CAM Theory** - How temporal activation maps are computed \n", "✅ **XAI Extraction** - Step-by-step extraction process \n", "✅ **Result Interpretation** - Understanding and analyzing XAI scores" ] }, { "cell_type": "markdown", "id": "pipeline-overview", "metadata": {}, "source": [ "## Complete Pipeline Overview\n", "\n", "The `run.sh` script implements a 
10-stage pipeline:\n", "\n", "```\n", "Stage 1: Data Preparation → wav.scp, utt2lab, lab2utt\n", "Stage 2: Data Format Conversion → Shard/Raw format\n", "Stage 3: Model Training → SSL-Res1D training\n", "Stage 4: Model Averaging → Average best checkpoints\n", "Stage 5: Extract Logits → Model inference\n", "Stage 6: Compute LLR Scores → Log-likelihood ratios\n", "Stage 7: Performance Evaluation → EER, min t-DCF metrics\n", "Stage 8: Analysis → Statistical tests\n", "Stage 9: XAI Extraction → Grad-CAM heatmaps\n", "Stage 10: XAI Analysis → Visualization and interpretation\n", "```\n", "\n", "**This tutorial focuses on Stages 9-10** (XAI extraction and analysis), assuming Stages 1-8 are complete." ] }, { "cell_type": "markdown", "id": "theory-gradcam", "metadata": {}, "source": [ "## Grad-CAM Theory for Audio\n", "\n", "### What is Grad-CAM?\n", "\n", "Grad-CAM (Gradient-weighted Class Activation Mapping) identifies which regions of the input the model focuses on when making predictions.\n", "\n", "### Mathematical Formulation\n", "\n", "For a target class $c$ (e.g., spoof class):\n", "\n", "1. **Forward Pass**: \n", " - Input audio → SSL Frontend → Classifier (Res1D) → Classification score $y^c$\n", " - Extract feature maps $A^k$ from target layer\n", "\n", "2. **Backward Pass**:\n", " - Compute gradients: $\\frac{\\partial y^c}{\\partial A^k}$\n", "\n", "3. **Weight Calculation** (Global Average Pooling):\n", " $$\\alpha_k^c = \\frac{1}{T}\\sum_{t=1}^{T}\\frac{\\partial y^c}{\\partial A^k_t}$$\n", " \n", " where $T$ is the temporal dimension.\n", "\n", "4. **Weighted Combination**:\n", " $$L^c_{\\text{Grad-CAM}} = \\text{ReLU}\\left(\\sum_k \\alpha_k^c A^k\\right)$$\n", "\n", "5. 
**Temporal Heatmap**:\n", " - Normalize to [0, 1]\n", " - High values indicate regions important for classification\n", "\n", "### Why Grad-CAM for Partial Spoofing?\n", "\n", "Unlike fully synthetic audio (uniform fake), partially spoofed audio requires:\n", "- **Temporal localization**: Identify *when* spoofing occurs\n", "- **Boundary detection**: Find transitions between real/fake\n", "- **Segment-level understanding**: Distinguish mixed content\n", "\n", "Grad-CAM provides this temporal resolution by showing activation strength over time." ] }, { "cell_type": "markdown", "id": "model-architecture", "metadata": {}, "source": [ "## Model Architecture: SSL-Res1D\n", "\n", "### Pipeline Components\n", "\n", "```\n", "Audio Input (16kHz)\n", " ↓\n", "[SSL Frontend] XLSR-53\n", " ↓\n", "[Classifier] Res1D Backend\n", " ↓\n", "Classification Score (Bonafide/Spoof)\n", "```\n", "\n", "### Key Configuration\n", "\n", "From `conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml`:\n", "\n", "```yaml\n", "model: ssl_multireso_gmlp\n", "model_args:\n", " feat_dim: 768 # XLSR-53 feature dimension\n", " embed_dim: -2 # Output embedding dimension\n", " num_scale: 6 # Multi-resolution scales\n", " gmlp_layers: 1\n", " batch_first: true\n", " flag_pool: ap # Attentive pooling\n", "\n", "frontend: xlsr_53\n", "xlsr_53_args:\n", " layer: 12 # Use 12th layer of XLSR-53\n", " \n", "projection_args:\n", " project_type: arc_margin\n", " scale: 30.0\n", " margin: 0.2\n", "```\n", "\n", "### Why This Architecture?\n", "\n", "1. **XLSR-53**: Self-supervised speech representations capture fine-grained acoustic patterns\n", "2. **Res1D**: 1D residual blocks effective for temporal modeling\n", "3. **Multi-Resolution**: Captures artifacts at different temporal scales\n", "4. 
**Arc Margin**: Enhances inter-class separation" ] }, { "cell_type": "markdown", "id": "stage1-data-prep", "metadata": {}, "source": [ "## Stage 1: Data Preparation\n", "\n", "### Script: `local/prepare_data.sh`\n", "\n", "### Purpose\n", "Prepare the PartialSpoof dataset in WeDefense format.\n", "\n", "### Input\n", "- PartialSpoof database directory\n", "- Protocol files: `PartialSpoof.LA.cm.{train,dev,eval}.trl.txt`\n", "\n", "### Process\n", "\n", "1. **Create wav.scp**\n", " ```bash\n", " find ${PS_dir}/${dset}/con_wav -name \"*.wav\" | awk -F\"/\" '{print $NF,$0}' | sort\n", " ```\n", " Format: `utterance_id /path/to/audio.wav`\n", "\n", "2. **Extract labels (utt2lab)**\n", " ```bash\n", " cut -d' ' -f2,5 ${PS_dir}/protocols/PartialSpoof_LA_cm_protocols/PartialSpoof.LA.cm.${dset}.trl.txt\n", " ```\n", " Format: `utterance_id bonafide/spoof`\n", "\n", "3. **Create lab2utt mapping**\n", " ```bash\n", " ./tools/utt2lab_to_lab2utt.pl ${data}/${dset}/utt2lab\n", " ```\n", " Groups utterances by label\n", "\n", "4. **Compute durations**\n", " ```bash\n", " python tools/wav2dur.py ${data}/${dset}/wav.scp ${data}/${dset}/utt2dur\n", " ```\n", "\n", "### Output Files\n", "```\n", "data/{train,dev,eval}/\n", " ├── wav.scp # Audio paths\n", " ├── utt2lab # Utterance labels\n", " ├── lab2utt # Label-to-utterance mapping\n", " └── utt2dur # Audio durations\n", "```" ] }, { "cell_type": "markdown", "id": "stage3-training", "metadata": {}, "source": [ "## Stage 3: Model Training (Overview)\n", "\n", "### Command\n", "\n", "```bash\n", "torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:$PORT \\\n", " --nnodes=1 --nproc_per_node=$num_gpus \\\n", " wedefense/bin/train.py --config $config \\\n", " --exp_dir ${exp_dir} \\\n", " --gpus $gpus \\\n", " --num_avg ${num_avg} \\\n", " --data_type \"${data_type}\" \\\n", " --train_data ${data}/train/${data_type}.list \\\n", " --train_label ${data}/train/utt2lab\n", "```\n", "\n", "### Training Process\n", "\n", "1. 
**Data Loading**: Batch sampling from shard/raw format\n", "2. **Frontend**: Extract XLSR-53 features (Layer 12)\n", "3. **Augmentation**: Optional spec augmentation, speed perturbation\n", "4. **Forward**: Encoder → Pooling → Projection\n", "5. **Loss**: Arc Margin Softmax loss\n", "6. **Optimization**: AdamW with learning rate scheduling\n", "\n", "### Key Training Parameters\n", "\n", "- **Batch size**: Typically 64-128\n", "- **Learning rate**: 1e-4 with warmup\n", "- **Epochs**: 50-100 with early stopping\n", "- **Checkpointing**: Save every epoch\n", "\n", "### Output\n", "\n", "```\n", "exp/singlereso_utt_xlsr_53_ft_backend_Res1D/\n", " ├── config.yaml\n", " ├── models/\n", " │ ├── model_1.pt\n", " │ ├── model_2.pt\n", " │ └── ...\n", " └── tensorboard/\n", "```" ] }, { "cell_type": "markdown", "id": "stage4-averaging", "metadata": {}, "source": [ "## Stage 4: Model Averaging\n", "\n", "### Purpose\n", "Average the top-N best model checkpoints to improve robustness.\n", "\n", "### Command\n", "\n", "```bash\n", "python wedefense/bin/average_model.py \\\n", " --dst_model $exp_dir/models/avg_model.pt \\\n", " --src_path $exp_dir/models \\\n", " --num 10\n", "```\n", "\n", "### Process\n", "\n", "1. Identify top-10 checkpoints by validation performance\n", "2. Load state dictionaries\n", "3. Average parameters: $\\theta_{\\text{avg}} = \\frac{1}{N}\\sum_{i=1}^{N}\\theta_i$\n", "4. Save averaged model\n", "\n", "### Output\n", "\n", "```\n", "exp/singlereso_utt_xlsr_53_ft_backend_Res1D/models/avg_model.pt\n", "```\n", "\n", "This averaged model is used for all subsequent stages." 
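, "\n",
"The averaging in step 3 can be sketched in a few lines. This is a minimal illustration only, not the actual `wedefense/bin/average_model.py` implementation; `average_checkpoints` and `paths` are hypothetical names, and all checkpoints are assumed to share one architecture:\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"def average_checkpoints(paths):\n",
"    # theta_avg = (1/N) * sum(theta_i), computed key by key over state dicts\n",
"    avg = None\n",
"    for p in paths:\n",
"        state = torch.load(p, map_location='cpu')\n",
"        if avg is None:\n",
"            avg = {k: v.clone().float() for k, v in state.items()}\n",
"        else:\n",
"            for k in avg:\n",
"                avg[k] += state[k].float()\n",
"    for k in avg:\n",
"        avg[k] /= len(paths)\n",
"    return avg\n",
"```\n",
"\n",
"In practice integer buffers (e.g. BatchNorm batch counters) may need special handling; casting to float here sidesteps dtype issues for the sketch."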
] }, { "cell_type": "markdown", "id": "stage9-xai-extraction", "metadata": {}, "source": [ "## Stage 9: XAI Extraction with Grad-CAM\n", "\n", "### Script: `wedefense/bin/XAI_GradCam_infer.py`\n", "\n", "### Command\n", "\n", "```bash\n", "CUDA_VISIBLE_DEVICES=0 python wedefense/bin/XAI_GradCam_infer.py \\\n", " --config ${exp_dir}/config.yaml \\\n", " --model_path $exp_dir/models/avg_model.pt \\\n", " --data_type \"shard\" \\\n", " --data_list ${data}/dev/shard.list \\\n", " --batch_size 1 \\\n", " --num_workers 1 \\\n", " --num_classes 2 \\\n", " --xai_scores_path ${exp_dir}/xai_scores/dev.pkl\n", "```\n", "\n", "### Step-by-Step Process\n", "\n", "#### 1. Model Preparation\n", "\n", "```python\n", "# Load pretrained model\n", "model = get_model(configs['model'])(**configs['model_args'])\n", "load_checkpoint(model, model_path)\n", "\n", "# Wrap with projection head\n", "projection = get_projection(configs['projection_args'])\n", "full_model = FullModel(model, projection, test_conf)\n", "```\n", "\n", "#### 2. Target Layer Selection\n", "\n", "```python\n", "# For SSL-Res1D, target the final pooling layer\n", "target_layer = [full_model.encoder.stat_pooling]\n", "```\n", "\n", "**Why this layer?**\n", "- Final representation before classification\n", "- Captures high-level temporal features\n", "- Maintains temporal resolution\n", "\n", "#### 3. Grad-CAM Initialization\n", "\n", "```python\n", "from pytorch_grad_cam import GradCAM\n", "from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget\n", "\n", "cam = GradCAM(model=full_model, target_layers=target_layer)\n", "```\n", "\n", "#### 4. 
Per-Utterance Extraction\n", "\n", "For each audio utterance:\n", "\n", "```python\n", "# Load audio\n", "wavs = batch['wav'].float().to(device) # Shape: (1, wav_length)\n", "\n", "# Target spoof class (class 1)\n", "targets = [ClassifierOutputTarget(1)]\n", "\n", "# Extract Grad-CAM heatmap\n", "cam_output = cam(input_tensor=wavs, targets=targets)\n", "# cam_output shape: (temporal_frames,) ranging [0, 1]\n", "```\n", "\n", "#### 5. Save Results\n", "\n", "```python\n", "results = []\n", "for utt, heatmap in zip(utterance_ids, cam_outputs):\n", " results.append([[utt], heatmap.tolist()])\n", "\n", "with open(xai_scores_path, 'wb') as f:\n", " pickle.dump(results, f)\n", "```\n", "\n", "### Output Format\n", "\n", "```python\n", "# xai_scores/dev.pkl structure:\n", "[\n", " [[\"utt_id_1\"], [0.12, 0.23, 0.89, ..., 0.34]], # Heatmap for utterance 1\n", " [[\"utt_id_2\"], [0.08, 0.15, 0.76, ..., 0.21]], # Heatmap for utterance 2\n", " ...\n", "]\n", "```\n", "\n", "Each heatmap is a 1D array where:\n", "- **Length**: Number of temporal frames\n", "- **Values**: [0, 1] indicating activation strength\n", "- **High values**: Model focuses on these regions for spoof detection" ] }, { "cell_type": "markdown", "id": "stage10-xai-analysis", "metadata": {}, "source": [ "## Stage 10: XAI Score Analysis\n", "\n", "### Script: `wedefense/bin/XAI_Score_analysis.py`\n", "\n", "### Command\n", "\n", "```bash\n", "python3 wedefense/bin/XAI_Score_analysis.py \\\n", " --set dev \\\n", " --pkl_path ${exp_dir}/xai_scores/dev.pkl \\\n", " --vad_path \"$VAD_PATH\"\n", "```\n", "\n", "### Analysis Components\n", "\n", "#### 1. Load XAI Scores and VAD Information\n", "\n", "```python\n", "# Load XAI heatmaps\n", "with open(pkl_path, 'rb') as f:\n", " xai_results = pickle.load(f)\n", "\n", "# Load voice activity detection (optional)\n", "# VAD helps focus on speech regions only\n", "vad_info = load_vad(vad_path)\n", "```\n", "\n", "#### 2. 
Compute Statistics\n", "\n", "For each utterance:\n", "\n", "```python\n", "import numpy as np\n", "from scipy.signal import find_peaks\n", "\n", "heatmap = np.array(xai_result[1])\n", "\n", "# Basic statistics\n", "mean_activation = np.mean(heatmap)\n", "max_activation = np.max(heatmap)\n", "std_activation = np.std(heatmap)\n", "\n", "# Temporal analysis: scipy's find_peaks returns (indices, properties)\n", "peak_indices, _ = find_peaks(heatmap, height=0.5)\n", "peak_regions = group_consecutive_peaks(peak_indices) # helper: merge adjacent frames\n", "```\n", "\n", "#### 3. Segment Detection\n", "\n", "**Threshold-based segmentation:**\n", "\n", "```python\n", "threshold = 0.5 # Tunable parameter\n", "spoofed_mask = heatmap > threshold\n", "\n", "# Find continuous regions\n", "segments = []\n", "in_segment = False\n", "for t, is_spoof in enumerate(spoofed_mask):\n", " if is_spoof and not in_segment:\n", " start = t\n", " in_segment = True\n", " elif not is_spoof and in_segment:\n", " end = t\n", " segments.append((start, end))\n", " in_segment = False\n", "\n", "# Close a segment that runs to the end of the utterance\n", "if in_segment:\n", " segments.append((start, len(spoofed_mask)))\n", "```\n", "\n", "#### 4. Visualization\n", "\n", "Generate plots for each utterance:\n", "\n", "**A. Temporal Activation Profile**\n", "```python\n", "import matplotlib.pyplot as plt\n", "\n", "# time_axis: frame centres in seconds (same length as heatmap)\n", "plt.figure(figsize=(12, 4))\n", "plt.plot(time_axis, heatmap, linewidth=2, color='red')\n", "plt.fill_between(time_axis, heatmap, alpha=0.3, color='red')\n", "plt.axhline(y=threshold, linestyle='--', color='blue', label='Threshold')\n", "plt.xlabel('Time (s)')\n", "plt.ylabel('Activation')\n", "plt.title(f'XAI Temporal Activation - {utterance_id}')\n", "plt.legend()\n", "```\n", "\n", "**B. Spectrogram with Heatmap Overlay**\n", "```python\n", "import librosa\n", "\n", "# Load audio and compute spectrogram\n", "audio, sr = librosa.load(audio_path, sr=16000)\n", "D = librosa.amplitude_to_db(np.abs(librosa.stft(audio)))\n", "\n", "# Overlay heatmap\n", "heatmap_2d = np.tile(heatmap, (D.shape[0], 1)) # Repeat along frequency\n", "plt.imshow(heatmap_2d, aspect='auto', cmap='hot', alpha=0.6)\n", "```\n", "\n", "**C. 
Detected Segment Boundaries**\n", "```python\n", "# Mark detected spoofed regions\n", "for start, end in segments:\n", " plt.axvspan(start, end, alpha=0.3, color='red', label='Detected Spoof')\n", "```\n", "\n", "#### 5. Aggregate Analysis\n", "\n", "**Compare Bonafide vs Spoof distributions:**\n", "\n", "```python\n", "# Separate by ground truth label\n", "bonafide_activations = []\n", "spoof_activations = []\n", "\n", "for result, label in zip(xai_results, labels):\n", " mean_act = np.mean(result[1])\n", " if label == 'bonafide':\n", " bonafide_activations.append(mean_act)\n", " else:\n", " spoof_activations.append(mean_act)\n", "\n", "# Plot distributions\n", "plt.hist(bonafide_activations, bins=50, alpha=0.5, label='Bonafide', color='green')\n", "plt.hist(spoof_activations, bins=50, alpha=0.5, label='Spoof', color='red')\n", "plt.xlabel('Mean Activation')\n", "plt.ylabel('Count')\n", "plt.legend()\n", "```\n", "\n", "### Output\n", "\n", "```\n", "exp/xai_scores/\n", " ├── dev.pkl # Raw heatmaps\n", " ├── analysis/\n", " │ ├── temporal_profiles/ # Per-utterance plots\n", " │ ├── segment_detection/ # Detected boundaries\n", " │ ├── statistics.csv # Aggregate stats\n", " │ └── distribution.png # Bonafide vs Spoof comparison\n", "```" ] }, { "cell_type": "markdown", "id": "interpretation", "metadata": {}, "source": [ "## Interpreting XAI Results\n", "\n", "### Activation Patterns and Their Meanings\n", "\n", "| Pattern | Visual Appearance | Interpretation | Example Scenario |\n", "|---------|-------------------|----------------|------------------|\n", "| **Sharp Peaks** | 📈 Sudden spikes at specific time points | Splice boundaries detected | Partially spoofed audio with clear transitions |\n", "| **Sustained High Activation** | 🌊 Long regions with elevated values | Continuous spoofed segment | TTS-generated insertion |\n", "| **Low Flat Profile** | 📉 Consistently low values | Genuine speech | Bonafide utterance |\n", "| **Multiple Peaks** | 🎯 Several distinct high 
regions | Multiple spoofed insertions | Complex partial spoofing |\n", "| **Gradual Rise/Fall** | 📊 Smooth transitions | Soft boundaries or gradual blending | Advanced synthesis with smoothing |\n", "\n", "### Decision Guidelines\n", "\n", "#### For Bonafide Audio:\n", "- ✅ Expected: Low mean activation (<0.3)\n", "- ✅ Expected: Small standard deviation (<0.15)\n", "- ✅ Expected: No sustained high-activation regions\n", "\n", "#### For Partially Spoofed Audio:\n", "- ✅ Expected: Moderate to high mean activation (>0.4)\n", "- ✅ Expected: High variance in temporal profile\n", "- ✅ Expected: Clear peaks corresponding to fake segments\n", "- ⚠️ Watch for: Peaks aligning with VAD boundaries (may indicate model bias)\n", "\n", "### Common Pitfalls\n", "\n", "1. **Edge Effects**: High activation at utterance boundaries may be artifacts\n", " - **Solution**: Ignore first/last 100ms\n", "\n", "2. **VAD Correlation**: Model may focus on silence/non-speech regions\n", " - **Solution**: Compare XAI with VAD labels\n", "\n", "3. **Threshold Sensitivity**: Different thresholds yield different segmentations\n", " - **Solution**: Use multiple thresholds (0.3, 0.5, 0.7) for robustness\n", "\n", "4. **Model Overfitting**: Consistent patterns across all spoof types\n", " - **Solution**: Analyze per-algorithm breakdown\n", "\n", "### Validation Checklist\n", "\n", "✅ Do activation peaks align with known spoofed segments (if ground truth available)? \n", "✅ Are bonafide utterances consistently low-activation? \n", "✅ Do different spoofing algorithms show distinct patterns? \n", "✅ Are high activations focused on speech regions (not silence)? \n", "✅ Can you aurally perceive artifacts in high-activation regions?" ] }, { "cell_type": "markdown", "id": "practical-usage", "metadata": {}, "source": [ "## Practical Usage Guide\n", "\n", "### Running the Complete Pipeline\n", "\n", "#### 1. 
Setup Environment\n", "\n", "```bash\n", "cd egs/detection/partialspoof/x12_ssl_res1d\n", "source path.sh\n", "```\n", "\n", "#### 2. Configure Paths\n", "\n", "Edit `run.sh`:\n", "```bash\n", "PS_dir=/path/to/PartialSpoof/database\n", "data=/path/to/output/data\n", "config=conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml\n", "exp_dir=exp/singlereso_utt_xlsr_53_ft_backend_Res1D\n", "VAD_PATH=/path/to/vad_annotations # Optional\n", "```\n", "\n", "#### 3. Run Data Preparation (Stage 1-2)\n", "\n", "```bash\n", "bash run.sh --stage 1 --stop_stage 2\n", "```\n", "\n", "#### 4. Train Model (Stage 3-4)\n", "\n", "```bash\n", "bash run.sh --stage 3 --stop_stage 4 --gpus \"[0]\"\n", "```\n", "\n", "**Training time**: ~24-48 hours on single GPU\n", "\n", "#### 5. Evaluate Model (Stage 5-7)\n", "\n", "```bash\n", "bash run.sh --stage 5 --stop_stage 7\n", "```\n", "\n", "Check performance:\n", "```\n", "EER: X.XX%\n", "min t-DCF: X.XXX\n", "```\n", "\n", "#### 6. Extract XAI (Stage 9)\n", "\n", "```bash\n", "bash run.sh --stage 9 --stop_stage 9 --gpus \"[0]\"\n", "```\n", "\n", "**Extraction time**: ~1-2 hours for eval set\n", "\n", "#### 7. 
Analyze XAI (Stage 10)\n", "\n", "```bash\n", "bash run.sh --stage 10 --stop_stage 10\n", "```\n", "\n", "### Customization Options\n", "\n", "#### Change Target Layer\n", "\n", "In `wedefense/bin/XAI_GradCam_infer.py`:\n", "```python\n", "# Original: final pooling layer\n", "target_layer = [full_model.encoder.stat_pooling]\n", "\n", "# Alternative: intermediate layer\n", "target_layer = [full_model.encoder.layer4] # Earlier features\n", "```\n", "\n", "#### Adjust Detection Threshold\n", "\n", "In `XAI_Score_analysis.py`:\n", "```python\n", "# Default threshold\n", "threshold = 0.5\n", "\n", "# Stricter detection (fewer false positives)\n", "threshold = 0.7\n", "\n", "# More sensitive (catch subtle spoofs)\n", "threshold = 0.3\n", "```\n", "\n", "#### Target Different Class\n", "\n", "```python\n", "# Original: target spoof class\n", "targets = [ClassifierOutputTarget(1)]\n", "\n", "# Alternative: target bonafide class (what makes it genuine?)\n", "targets = [ClassifierOutputTarget(0)]\n", "```" ] }, { "cell_type": "markdown", "id": "summary", "metadata": {}, "source": [ "## Summary\n", "\n", "This tutorial covered the complete workflow for XAI-based partially spoofed audio detection:\n", "\n", "### Key Takeaways\n", "\n", "✅ **Pipeline Architecture**\n", " - 10-stage pipeline from data to XAI analysis\n", " - SSL-Res1D model with XLSR-53 frontend\n", " - Grad-CAM for temporal activation mapping\n", "\n", "✅ **XAI Extraction Process**\n", " - Target layer selection critical for interpretability\n", " - Per-utterance temporal heatmaps\n", " - Batch processing for efficiency\n", "\n", "✅ **Result Interpretation**\n", " - Activation patterns indicate spoofed regions\n", " - Threshold-based segment detection\n", " - Statistical validation essential\n", "\n", "✅ **Practical Considerations**\n", " - Model quality affects XAI quality\n", " - VAD integration improves focus\n", " - Cross-validation with audio inspection\n", "\n", "### Limitations and Future Directions\n", 
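"\n",
"How much the segmentation threshold from Stage 10 matters can be made concrete with a toy example (the `heatmap` values and `count_segments` helper below are illustrative only):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def count_segments(heatmap, threshold):\n",
"    mask = (heatmap > threshold).astype(int)\n",
"    # A segment starts wherever the mask switches from 0 to 1\n",
"    return int(np.sum(np.diff(np.concatenate(([0], mask))) == 1))\n",
"\n",
"heatmap = np.array([0.4, 0.6, 0.4, 0.6, 0.4])\n",
"for th in (0.3, 0.5, 0.7):\n",
"    print(th, count_segments(heatmap, th))  # 0.3 -> 1 segment, 0.5 -> 2, 0.7 -> 0\n",
"```\n",
"\n",
"This is why the pitfalls list above recommends sweeping several thresholds (0.3, 0.5, 0.7) rather than trusting a single cutoff.\n",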
"\n", "⚠️ **Current Limitations:**\n", "- Grad-CAM shows correlation, not causation\n", "- Requires well-trained model\n", "- Threshold selection is dataset-dependent\n", "- May miss subtle artifacts\n", "\n", "🔬 **Future Work:**\n", "- Multi-layer XAI fusion\n", "- Attention-based explainability\n", "- Frame-level ground truth comparison\n", "- Real-time XAI for streaming audio\n", "\n", "### Resources\n", "\n", "📂 **Implementation**: `egs/detection/partialspoof/x12_ssl_res1d/` \n", "📄 **Paper**: [arxiv.org/abs/2406.02483](https://arxiv.org/abs/2406.02483) \n", "💻 **GitHub**: [github.com/zlin0/wedefense](https://github.com/zlin0/wedefense) \n", "📖 **Docs**: [wedefense.readthedocs.io](https://wedefense.readthedocs.io)" ] }, { "cell_type": "markdown", "id": "references", "metadata": {}, "source": [ "## References\n", "\n", "1. **Partial Spoofing Detection**: Liu et al., \"How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?\", 2024 [[paper](https://arxiv.org/abs/2406.02483)]\n", "\n", "2. **Grad-CAM**: Selvaraju et al., \"Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization\", ICCV 2017 [[paper](https://arxiv.org/abs/1610.02391)]\n", "\n", "3. **SSL Representations**: Baevski et al., \"wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations\", NeurIPS 2020 [[paper](https://arxiv.org/abs/2006.11477)]\n", "\n", "4. **XLSR**: Conneau et al., \"Unsupervised Cross-lingual Representation Learning for Speech Recognition\", Interspeech 2021 [[paper](https://arxiv.org/abs/2006.13979)]\n", "\n", "5. **PartialSpoof Dataset**: Guo et al., \"Partially Spoofed Audio Detection\", ASVspoof 2019 [[paper](https://arxiv.org/abs/2105.08050)]\n", "\n", "6. **WeDefense Framework**: [[GitHub](https://github.com/zlin0/wedefense)] [[Documentation](https://wedefense.readthedocs.io)]\n", "\n", "7. 
**PyTorch Grad-CAM**: [[GitHub](https://github.com/jacobgil/pytorch-grad-cam)]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.0" } }, "nbformat": 4, "nbformat_minor": 5 }