Explainable AI (XAI) for Partially Spoofed Audio Detection with Grad-CAM

Author: Tianchi Liu
Status: In Progress

Reference: How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?

Overview

This tutorial explains the step-by-step workflow for applying Explainable AI (XAI) techniques to partially spoofed audio detection using the Gradient-weighted Class Activation Mapping (Grad-CAM) method.

Partially spoofed audio refers to utterances where only certain segments are synthetic while others remain genuine.

πŸ“‚ Reference Implementation Path

egs/detection/partialspoof/x12_ssl_res1d/

Key Components

| File | Purpose |
| --- | --- |
| run.sh | Main pipeline orchestrating Stages 1-10 |
| conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml | Model configuration |
| local/prepare_data.sh | Data preparation script |
| wedefense/bin/train.py | Model training |
| wedefense/bin/XAI_GradCam_infer.py | XAI heatmap extraction |
| wedefense/bin/XAI_Score_analysis.py | XAI score analysis and visualization |

What This Tutorial Covers

βœ… Complete Pipeline - From data preparation to XAI analysis
βœ… Model Architecture - SSL-Res1D for partial spoofing detection
βœ… Grad-CAM Theory - How temporal activation maps are computed
βœ… XAI Extraction - Step-by-step extraction process
βœ… Result Interpretation - Understanding and analyzing XAI scores

Complete Pipeline Overview

The run.sh script implements a 10-stage pipeline:

Stage 1: Data Preparation          β†’ wav.scp, utt2lab, lab2utt
Stage 2: Data Format Conversion    β†’ Shard/Raw format
Stage 3: Model Training            β†’ SSL-Res1D training
Stage 4: Model Averaging           β†’ Average best checkpoints
Stage 5: Extract Logits            β†’ Model inference
Stage 6: Compute LLR Scores        β†’ Log-likelihood ratios
Stage 7: Performance Evaluation    β†’ EER, min t-DCF metrics
Stage 8: Analysis                  β†’ Statistical tests
Stage 9: XAI Extraction            β†’ Grad-CAM heatmaps
Stage 10: XAI Analysis             β†’ Visualization and interpretation

This tutorial focuses on Stages 9-10 (XAI extraction and analysis), assuming Stages 1-8 are complete.

Grad-CAM Theory for Audio

What is Grad-CAM?

Grad-CAM (Gradient-weighted Class Activation Mapping) identifies which regions of the input the model focuses on when making predictions.

Mathematical Formulation

For a target class \(c\) (e.g., spoof class):

  1. Forward Pass:

    • Input audio β†’ SSL Frontend β†’ Classifier (Res1D) β†’ Classification score \(y^c\)

    • Extract feature maps \(A^k\) from target layer

  2. Backward Pass:

    • Compute gradients: \(\frac{\partial y^c}{\partial A^k}\)

  3. Weight Calculation (Global Average Pooling):

    \[\alpha_k^c = \frac{1}{T}\sum_{t=1}^{T}\frac{\partial y^c}{\partial A^k_t}\]

    where \(T\) is the temporal dimension.

  4. Weighted Combination:

    \[L^c_{\text{Grad-CAM}} = \text{ReLU}\left(\sum_k \alpha_k^c A^k\right)\]

  5. Temporal Heatmap:

    • Normalize to [0, 1]

    • High values indicate regions important for classification
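
To make steps 1-5 concrete, below is a minimal, self-contained PyTorch sketch of temporal Grad-CAM. The tiny Conv1d network is only a stand-in for the SSL-Res1D pipeline (all layers here are hypothetical); the Grad-CAM mechanics mirror the formulation above, while the actual recipe delegates this to the pytorch_grad_cam library (see Stage 9).

import torch
import torch.nn as nn

# Toy 1D classifier standing in for SSL-Res1D (hypothetical architecture)
model = nn.Sequential(
    nn.Conv1d(1, 8, kernel_size=9, padding=4),  # target layer (index 0)
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(8, 2),
)
target_layer = model[0]

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(A=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(dA=go[0]))

wav = torch.randn(1, 1, 16000)   # 1 s of 16 kHz audio
score = model(wav)[0, 1]         # y^c for the spoof class (c = 1)
score.backward()                 # backward pass fills grads['dA']

alpha = grads['dA'].mean(dim=2, keepdim=True)                 # GAP over time: alpha_k^c
cam = torch.relu((alpha * feats['A']).sum(dim=1)).squeeze(0)  # weighted sum + ReLU
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalize to [0, 1]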

Why Grad-CAM for Partial Spoofing?

Unlike fully synthetic audio, which is fake throughout, partially spoofed audio requires:

  • Temporal localization: Identify when spoofing occurs

  • Boundary detection: Find transitions between real/fake

  • Segment-level understanding: Distinguish mixed content

Grad-CAM provides this temporal resolution by showing activation strength over time.

Model Architecture: SSL-Res1D

Pipeline Components

Audio Input (16kHz)
    ↓
[SSL Frontend] XLSR-53
    ↓
[Classifier] Res1D Backend
    ↓
Classification Score (Bonafide/Spoof)

Key Configuration

From conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml:

model: ssl_multireso_gmlp
model_args:
  feat_dim: 768          # XLSR-53 feature dimension
  embed_dim: -2          # Output embedding dimension
  num_scale: 6           # Multi-resolution scales
  gmlp_layers: 1
  batch_first: true
  flag_pool: ap          # Attentive pooling

frontend: xlsr_53
xlsr_53_args:
  layer: 12              # Use 12th layer of XLSR-53
  
projection_args:
  project_type: arc_margin
  scale: 30.0
  margin: 0.2
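
As a quick sanity check, the configuration can be loaded and inspected directly (a minimal sketch, assuming only the YAML keys shown above):

import yaml

with open("conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml") as f:
    configs = yaml.safe_load(f)

print(configs["model"])                      # ssl_multireso_gmlp
print(configs["model_args"]["num_scale"])    # 6
print(configs["projection_args"]["margin"])  # 0.2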

Why This Architecture?

  1. XLSR-53: Self-supervised speech representations capture fine-grained acoustic patterns

  2. Res1D: 1D residual blocks effective for temporal modeling

  3. Multi-Resolution: Captures artifacts at different temporal scales

  4. Arc Margin: Enhances inter-class separation

Stage 1: Data Preparation

Script: local/prepare_data.sh

Purpose

Prepare the PartialSpoof dataset in WeDefense format.

Input

  • PartialSpoof database directory

  • Protocol files: PartialSpoof.LA.cm.{train,dev,eval}.trl.txt

Process

  1. Create wav.scp

    find ${PS_dir}/${dset}/con_wav -name "*.wav" | awk -F"/" '{print $NF,$0}' | sort
    

    Format: utterance_id /path/to/audio.wav

  2. Extract labels (utt2lab)

    cut -d' ' -f2,5 ${PS_dir}/protocols/PartialSpoof_LA_cm_protocols/PartialSpoof.LA.cm.${dset}.trl.txt
    

    Format: utterance_id bonafide/spoof

  3. Create lab2utt mapping

    ./tools/utt2lab_to_lab2utt.pl ${data}/${dset}/utt2lab
    

    Groups utterances by label

  4. Compute durations

    python tools/wav2dur.py ${data}/${dset}/wav.scp ${data}/${dset}/utt2dur
    

Output Files

data/{train,dev,eval}/
  β”œβ”€β”€ wav.scp      # Audio paths
  β”œβ”€β”€ utt2lab      # Utterance labels
  β”œβ”€β”€ lab2utt      # Label-to-utterance mapping
  └── utt2dur      # Audio durations

Stage 3: Model Training (Overview)

Command

torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:$PORT \
  --nnodes=1 --nproc_per_node=$num_gpus \
  wedefense/bin/train.py --config $config \
    --exp_dir ${exp_dir} \
    --gpus $gpus \
    --num_avg ${num_avg} \
    --data_type "${data_type}" \
    --train_data ${data}/train/${data_type}.list \
    --train_label ${data}/train/utt2lab

Training Process

  1. Data Loading: Batch sampling from shard/raw format

  2. Frontend: Extract XLSR-53 features (Layer 12)

  3. Augmentation: Optional spec augmentation, speed perturbation

  4. Forward: Encoder β†’ Pooling β†’ Projection

  5. Loss: Arc Margin Softmax loss

  6. Optimization: AdamW with learning rate scheduling

Key Training Parameters

  • Batch size: Typically 64-128

  • Learning rate: 1e-4 with warmup

  • Epochs: 50-100 with early stopping

  • Checkpointing: Save every epoch
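
The exact optimizer and scheduler live in wedefense/bin/train.py; the following is only a minimal PyTorch sketch of the stated setup (AdamW at 1e-4 with warmup). The dummy model and random batches stand in for SSL-Res1D and the shard/raw loader, and the warmup length is an assumed value:

import torch
import torch.nn as nn

model = nn.Linear(768, 2)  # placeholder for the full SSL-Res1D model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

warmup_steps = 1000  # assumed; the recipe's actual schedule may differ
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(5):
    feats = torch.randn(64, 768)                  # batch of pooled embeddings
    labels = torch.randint(0, 2, (64,))           # bonafide/spoof targets
    loss = nn.functional.cross_entropy(model(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                              # per-step warmup LR update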

Output

exp/singlereso_utt_xlsr_53_ft_backend_Res1D/
  β”œβ”€β”€ config.yaml
  β”œβ”€β”€ models/
  β”‚   β”œβ”€β”€ model_1.pt
  β”‚   β”œβ”€β”€ model_2.pt
  β”‚   └── ...
  └── tensorboard/

Stage 4: Model Averaging

Purpose

Average the top-N best model checkpoints to improve robustness.

Command

python wedefense/bin/average_model.py \
  --dst_model $exp_dir/models/avg_model.pt \
  --src_path $exp_dir/models \
  --num 10

Process

  1. Identify top-10 checkpoints by validation performance

  2. Load state dictionaries

  3. Average parameters: \(\theta_{\text{avg}} = \frac{1}{N}\sum_{i=1}^{N}\theta_i\)

  4. Save averaged model
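
Conceptually, the averaging step reduces to loading each state dict and taking the parameter-wise mean; a minimal sketch follows (paths are illustrative, and the real average_model.py also handles checkpoint selection):

import torch

# Parameter-wise mean over N checkpoints (paths are illustrative)
ckpt_paths = [f"models/model_{i}.pt" for i in range(1, 11)]

avg_state = None
for path in ckpt_paths:
    state = torch.load(path, map_location="cpu")
    if avg_state is None:
        avg_state = {k: v.clone().float() for k, v in state.items()}
    else:
        for k, v in state.items():
            avg_state[k] += v.float()

avg_state = {k: v / len(ckpt_paths) for k, v in avg_state.items()}
torch.save(avg_state, "models/avg_model.pt")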

Output

exp/singlereso_utt_xlsr_53_ft_backend_Res1D/models/avg_model.pt

This averaged model is used for all subsequent stages.

Stage 9: XAI Extraction with Grad-CAM

Script: wedefense/bin/XAI_GradCam_infer.py

Command

CUDA_VISIBLE_DEVICES=0 python wedefense/bin/XAI_GradCam_infer.py \
  --config ${exp_dir}/config.yaml \
  --model_path $exp_dir/models/avg_model.pt \
  --data_type "shard" \
  --data_list ${data}/dev/shard.list \
  --batch_size 1 \
  --num_workers 1 \
  --num_classes 2 \
  --xai_scores_path ${exp_dir}/xai_scores/dev.pkl

Step-by-Step Process

1. Model Preparation

# Load pretrained model
model = get_model(configs['model'])(**configs['model_args'])
load_checkpoint(model, model_path)

# Wrap with projection head
projection = get_projection(configs['projection_args'])
full_model = FullModel(model, projection, test_conf)

2. Target Layer Selection

# For SSL-Res1D, target the final pooling layer
target_layer = [full_model.encoder.stat_pooling]

Why this layer?

  • Final representation before classification

  • Captures high-level temporal features

  • Maintains temporal resolution

3. Grad-CAM Initialization

from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

cam = GradCAM(model=full_model, target_layers=target_layer)

4. Per-Utterance Extraction

For each audio utterance:

# Load audio
wavs = batch['wav'].float().to(device)  # Shape: (1, wav_length)

# Target spoof class (class 1)
targets = [ClassifierOutputTarget(1)]

# Extract Grad-CAM heatmap
cam_output = cam(input_tensor=wavs, targets=targets)
# cam_output shape: (temporal_frames,), values in [0, 1]

5. Save Results

results = []
for utt, heatmap in zip(utterance_ids, cam_outputs):
    results.append([[utt], heatmap.tolist()])

with open(xai_scores_path, 'wb') as f:
    pickle.dump(results, f)

Output Format

# xai_scores/dev.pkl structure:
[
  [["utt_id_1"], [0.12, 0.23, 0.89, ..., 0.34]],  # Heatmap for utterance 1
  [["utt_id_2"], [0.08, 0.15, 0.76, ..., 0.21]],  # Heatmap for utterance 2
  ...
]

Each heatmap is a 1D array where:

  • Length: Number of temporal frames

  • Values: [0, 1] indicating activation strength

  • High values: Model focuses on these regions for spoof detection
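
A quick way to verify the output is to load the pickle and inspect one entry (a sketch assuming the structure shown above):

import pickle
import numpy as np

with open("exp/singlereso_utt_xlsr_53_ft_backend_Res1D/xai_scores/dev.pkl", "rb") as f:
    xai_results = pickle.load(f)

(utt_id,), heatmap = xai_results[0]   # e.g., [["utt_id_1"], [0.12, ...]]
heatmap = np.array(heatmap)
print(utt_id, heatmap.shape, heatmap.min(), heatmap.max())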

Stage 10: XAI Score Analysis

Script: wedefense/bin/XAI_Score_analysis.py

Command

python3 wedefense/bin/XAI_Score_analysis.py \
  --set dev \
  --pkl_path ${exp_dir}/xai_scores/dev.pkl \
  --vad_path "$VAD_PATH"

Analysis Components

1. Load XAI Scores and VAD Information

# Load XAI heatmaps
with open(pkl_path, 'rb') as f:
    xai_results = pickle.load(f)

# Load voice activity detection (optional)
# VAD helps focus on speech regions only
vad_info = load_vad(vad_path)

2. Compute Statistics

For each utterance:

import numpy as np
from scipy.signal import find_peaks

heatmap = np.array(xai_result[1])

# Basic statistics
mean_activation = np.mean(heatmap)
max_activation = np.max(heatmap)
std_activation = np.std(heatmap)

# Temporal analysis: local maxima with activation above 0.5
peak_indices, _ = find_peaks(heatmap, height=0.5)
# Group consecutive peak frames into candidate regions
peak_regions = np.split(peak_indices, np.where(np.diff(peak_indices) > 1)[0] + 1)

3. Segment Detection

Threshold-based segmentation:

threshold = 0.5  # Tunable parameter
spoofed_mask = heatmap > threshold

# Find continuous regions above the threshold
segments = []
in_segment = False
for t, is_spoof in enumerate(spoofed_mask):
    if is_spoof and not in_segment:
        start = t
        in_segment = True
    elif not is_spoof and in_segment:
        segments.append((start, t))
        in_segment = False
if in_segment:
    # Close a segment that extends to the end of the utterance
    segments.append((start, len(spoofed_mask)))
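
Note that segments are expressed in frame indices. Reporting boundaries in seconds requires the frame shift of the target layer's temporal axis, which is model-dependent; the conversion below assumes a hypothetical 20 ms shift:

# Frame indices -> seconds (the 0.02 s frame shift is an assumed value)
frame_shift = 0.02
segments = [(10, 35), (80, 120)]  # example output of the loop above
segments_sec = [(s * frame_shift, e * frame_shift) for s, e in segments]
print(segments_sec)  # [(0.2, 0.7), (1.6, 2.4)]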

4. Visualization

Generate plots for each utterance:

A. Temporal Activation Profile

import matplotlib.pyplot as plt

# Frame indices -> seconds (frame_shift as above; model-dependent)
time_axis = np.arange(len(heatmap)) * frame_shift

plt.figure(figsize=(12, 4))
plt.plot(time_axis, heatmap, linewidth=2, color='red')
plt.fill_between(time_axis, heatmap, alpha=0.3, color='red')
plt.axhline(y=threshold, linestyle='--', color='blue', label='Threshold')
plt.xlabel('Time (s)')
plt.ylabel('Activation')
plt.title(f'XAI Temporal Activation - {utterance_id}')
plt.legend()

B. Spectrogram with Heatmap Overlay

import librosa

# Load audio and compute a log-magnitude spectrogram
audio, sr = librosa.load(audio_path, sr=16000)
D = librosa.amplitude_to_db(np.abs(librosa.stft(audio)))

# Resample the heatmap to the STFT frame count, then repeat along frequency
heatmap_stft = np.interp(np.linspace(0, 1, D.shape[1]),
                         np.linspace(0, 1, len(heatmap)), heatmap)
heatmap_2d = np.tile(heatmap_stft, (D.shape[0], 1))
plt.imshow(heatmap_2d, aspect='auto', origin='lower', cmap='hot', alpha=0.6)

C. Detected Segment Boundaries

# Mark detected spoofed regions (label only the first span to avoid
# duplicate legend entries)
for i, (start, end) in enumerate(segments):
    plt.axvspan(start, end, alpha=0.3, color='red',
                label='Detected Spoof' if i == 0 else None)

5. Aggregate Analysis

Compare Bonafide vs Spoof distributions:

# Separate by ground-truth label (`labels` read from utt2lab, aligned with xai_results)
bonafide_activations = []
spoof_activations = []

for result, label in zip(xai_results, labels):
    mean_act = np.mean(result[1])
    if label == 'bonafide':
        bonafide_activations.append(mean_act)
    else:
        spoof_activations.append(mean_act)

# Plot distributions
plt.hist(bonafide_activations, bins=50, alpha=0.5, label='Bonafide', color='green')
plt.hist(spoof_activations, bins=50, alpha=0.5, label='Spoof', color='red')
plt.xlabel('Mean Activation')
plt.ylabel('Count')
plt.legend()
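
To go beyond the visual comparison, a two-sample test can quantify the separation between the distributions; the Mann-Whitney U test below is an illustrative choice, not necessarily the test Stage 8 applies:

from scipy.stats import mannwhitneyu

# One-sided test: do spoofed utterances show higher mean activation?
stat, p_value = mannwhitneyu(spoof_activations, bonafide_activations,
                             alternative="greater")
print(f"U = {stat:.1f}, p = {p_value:.3e}")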

Output

exp/xai_scores/
  β”œβ”€β”€ dev.pkl                    # Raw heatmaps
  β”œβ”€β”€ analysis/
  β”‚   β”œβ”€β”€ temporal_profiles/     # Per-utterance plots
  β”‚   β”œβ”€β”€ segment_detection/     # Detected boundaries
  β”‚   β”œβ”€β”€ statistics.csv         # Aggregate stats
  β”‚   └── distribution.png       # Bonafide vs Spoof comparison

Interpreting XAI Results

Activation Patterns and Their Meanings

| Pattern | Visual Appearance | Interpretation | Example Scenario |
| --- | --- | --- | --- |
| Sharp Peaks | πŸ“ˆ Sudden spikes at specific time points | Splice boundaries detected | Partially spoofed audio with clear transitions |
| Sustained High Activation | 🌊 Long regions with elevated values | Continuous spoofed segment | TTS-generated insertion |
| Low Flat Profile | πŸ“‰ Consistently low values | Genuine speech | Bonafide utterance |
| Multiple Peaks | 🎯 Several distinct high regions | Multiple spoofed insertions | Complex partial spoofing |
| Gradual Rise/Fall | πŸ“Š Smooth transitions | Soft boundaries or gradual blending | Advanced synthesis with smoothing |

Decision Guidelines

For Bonafide Audio:

  • βœ… Expected: Low mean activation (<0.3)

  • βœ… Expected: Small standard deviation (<0.15)

  • βœ… Expected: No sustained high-activation regions

For Partially Spoofed Audio:

  • βœ… Expected: Moderate to high mean activation (>0.4)

  • βœ… Expected: High variance in temporal profile

  • βœ… Expected: Clear peaks corresponding to fake segments

  • ⚠️ Watch for: Peaks aligning with VAD boundaries (may indicate model bias)
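
These guidelines can be wrapped into a simple triage rule. The sketch below hard-codes the thresholds stated above; treat it as a starting point for manual inspection, not a calibrated classifier:

import numpy as np

def guideline_check(heatmap: np.ndarray) -> str:
    """Rule-of-thumb triage using the thresholds stated above."""
    mean, std = heatmap.mean(), heatmap.std()
    if mean < 0.3 and std < 0.15:
        return "consistent with bonafide"
    if mean > 0.4:
        return "consistent with partial spoofing"
    return "inconclusive: inspect the temporal profile"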

Common Pitfalls

  1. Edge Effects: High activation at utterance boundaries may be artifacts

    • Solution: Ignore first/last 100ms

  2. VAD Correlation: Model may focus on silence/non-speech regions

    • Solution: Compare XAI with VAD labels

  3. Threshold Sensitivity: Different thresholds yield different segmentations

    • Solution: Use multiple thresholds (0.3, 0.5, 0.7) for robustness (see the sweep sketch after this list)

  4. Model Overfitting: Consistent patterns across all spoof types

    • Solution: Analyze per-algorithm breakdown
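
For pitfall 3, a threshold sweep shows how stable the detected segments are. A minimal sketch follows; the vectorized helper is equivalent to the loop in Stage 10, step 3, and the random heatmap is only a stand-in for real scores:

import numpy as np

def segments_at(heatmap: np.ndarray, threshold: float):
    """Continuous regions above `threshold`, as (start, end) frame indices."""
    mask = np.concatenate([[False], heatmap > threshold, [False]])
    edges = np.flatnonzero(np.diff(mask.astype(int)))
    return list(zip(edges[::2], edges[1::2]))

heatmap = np.random.rand(200)  # stand-in for a real XAI heatmap
for thr in (0.3, 0.5, 0.7):
    print(thr, segments_at(heatmap, thr))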

Validation Checklist

βœ… Do activation peaks align with known spoofed segments (if ground truth available)?
βœ… Are bonafide utterances consistently low-activation?
βœ… Do different spoofing algorithms show distinct patterns?
βœ… Are high activations focused on speech regions (not silence)?
βœ… Can you aurally perceive artifacts in high-activation regions?

Practical Usage Guide

Running the Complete Pipeline

1. Setup Environment

cd egs/detection/partialspoof/x12_ssl_res1d
source path.sh

2. Configure Paths

Edit run.sh:

PS_dir=/path/to/PartialSpoof/database
data=/path/to/output/data
config=conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml
exp_dir=exp/singlereso_utt_xlsr_53_ft_backend_Res1D
VAD_PATH=/path/to/vad_annotations  # Optional

3. Run Data Preparation (Stage 1-2)

bash run.sh --stage 1 --stop_stage 2

4. Train Model (Stage 3-4)

bash run.sh --stage 3 --stop_stage 4 --gpus "[0]"

Training time: ~24-48 hours on a single GPU

5. Evaluate Model (Stage 5-7)

bash run.sh --stage 5 --stop_stage 7

Check performance:

EER: X.XX%
min t-DCF: X.XXX

6. Extract XAI (Stage 9)

bash run.sh --stage 9 --stop_stage 9 --gpus "[0]"

Extraction time: ~1-2 hours for eval set

7. Analyze XAI (Stage 10)

bash run.sh --stage 10 --stop_stage 10

Customization Options

Change Target Layer

In wedefense/bin/XAI_GradCam_infer.py:

# Original: final pooling layer
target_layer = [full_model.encoder.stat_pooling]

# Alternative: intermediate layer
target_layer = [full_model.encoder.layer4]  # Earlier features

Adjust Detection Threshold

In XAI_Score_analysis.py:

# Default threshold
threshold = 0.5

# Stricter detection (fewer false positives)
threshold = 0.7

# More sensitive (catch subtle spoofs)
threshold = 0.3

Target Different Class

# Original: target spoof class
targets = [ClassifierOutputTarget(1)]

# Alternative: target bonafide class (what makes it genuine?)
targets = [ClassifierOutputTarget(0)]

Summary

This tutorial covered the complete workflow for XAI-based partially spoofed audio detection:

Key Takeaways

βœ… Pipeline Architecture

  • 10-stage pipeline from data to XAI analysis

  • SSL-Res1D model with XLSR-53 frontend

  • Grad-CAM for temporal activation mapping

βœ… XAI Extraction Process

  • Target layer selection critical for interpretability

  • Per-utterance temporal heatmaps

  • Batch processing for efficiency

βœ… Result Interpretation

  • Activation patterns indicate spoofed regions

  • Threshold-based segment detection

  • Statistical validation essential

βœ… Practical Considerations

  • Model quality affects XAI quality

  • VAD integration improves focus

  • Cross-validation with audio inspection

Limitations and Future Directions

⚠️ Current Limitations:

  • Grad-CAM shows correlation, not causation

  • Requires well-trained model

  • Threshold selection is dataset-dependent

  • May miss subtle artifacts

πŸ”¬ Future Work:

  • Multi-layer XAI fusion

  • Attention-based explainability

  • Frame-level ground truth comparison

  • Real-time XAI for streaming audio

Resources

πŸ“‚ Implementation: egs/detection/partialspoof/x12_ssl_res1d/
πŸ“„ Paper: arxiv.org/abs/2406.02483
πŸ’» GitHub: github.com/zlin0/wedefense
πŸ“– Docs: wedefense.readthedocs.io

References

  1. Partial Spoofing Detection: Liu et al., β€œHow Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?”, 2024 [paper]

  2. Grad-CAM: Selvaraju et al., β€œGrad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization”, ICCV 2017 [paper]

  3. SSL Representations: Baevski et al., β€œwav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”, NeurIPS 2020 [paper]

  4. XLSR: Conneau et al., β€œUnsupervised Cross-lingual Representation Learning for Speech Recognition”, Interspeech 2021 [paper]

  5. PartialSpoof Dataset: Zhang et al., β€œAn Initial Investigation for Detecting Partially Spoofed Audio”, Interspeech 2021 [paper]

  6. WeDefense Framework: [GitHub] [Documentation]

  7. PyTorch Grad-CAM: [GitHub]