Explainable AI (XAI) for Partially Spoofed Audio Detection with Grad-CAM

Author: Tianchi Liu
Status: In Progress

Reference: How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?

Overview

This tutorial explains the step-by-step workflow for applying Explainable AI (XAI) techniques to partially spoofed audio detection using the Gradient-weighted Class Activation Mapping (Grad-CAM) method.

Partially spoofed audio refers to utterances where only certain segments are synthetic while others remain genuine.

πŸ“‚ Reference Implementation Path

egs/detection/partialspoof/x12_ssl_res1d/

Key Components

| File | Purpose |
| --- | --- |
| run.sh | Main pipeline orchestrating Stages 1-10 |
| conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml | Model configuration |
| local/prepare_data.sh | Data preparation script |
| wedefense/bin/train.py | Model training |
| wedefense/bin/XAI_GradCam_infer.py | XAI heatmap extraction |
| wedefense/bin/XAI_Score_analysis.py | XAI score analysis and visualization |

What This Tutorial Covers

βœ… Complete Pipeline - From data preparation to XAI analysis
βœ… Model Architecture - SSL-Res1D for partial spoofing detection
βœ… Grad-CAM Theory - How temporal activation maps are computed
βœ… XAI Extraction - Step-by-step extraction process
βœ… Result Interpretation - Understanding and analyzing XAI scores

Complete Pipeline Overview

The run.sh script implements a 10-stage pipeline:

Stage 1: Data Preparation          β†’ wav.scp, utt2lab, lab2utt
Stage 2: Data Format Conversion    β†’ Shard/Raw format
Stage 3: Model Training            β†’ SSL-Res1D training
Stage 4: Model Averaging           β†’ Average best checkpoints
Stage 5: Extract Logits            β†’ Model inference
Stage 6: Compute LLR Scores        β†’ Log-likelihood ratios
Stage 7: Performance Evaluation    β†’ EER, min t-DCF metrics
Stage 8: Analysis                  β†’ Statistical tests
Stage 9: XAI Extraction            β†’ Grad-CAM heatmaps
Stage 10: XAI Analysis             β†’ Visualization and interpretation

This tutorial focuses on Stages 9-10 (XAI extraction and analysis), assuming Stages 1-8 are complete.

Grad-CAM Theory for Audio

What is Grad-CAM?

Grad-CAM (Gradient-weighted Class Activation Mapping) identifies which regions of the input the model focuses on when making predictions.

Mathematical Formulation

For a target class \(c\) (e.g., spoof class):

  1. Forward Pass:

    • Input audio β†’ SSL Frontend β†’ Classifier (Res1D) β†’ Classification score \(y^c\)

    • Extract feature maps \(A^k\) from target layer

  2. Backward Pass:

    • Compute gradients: \(\frac{\partial y^c}{\partial A^k}\)

  3. Weight Calculation (Global Average Pooling):

    \[\alpha_k^c = \frac{1}{T}\sum_{t=1}^{T}\frac{\partial y^c}{\partial A^k_t}\]

    where \(T\) is the temporal dimension.

  4. Weighted Combination:

    \[L^c_{\text{Grad-CAM}} = \text{ReLU}\left(\sum_k \alpha_k^c A^k\right)\]

  5. Temporal Heatmap:

    • Normalize to [0, 1]

    • High values indicate regions important for classification
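
To make steps 1-5 concrete, below is a minimal, self-contained PyTorch sketch of temporal Grad-CAM. The tiny Conv1d network is only a stand-in for the SSL-Res1D pipeline (all layers here are hypothetical); the Grad-CAM mechanics mirror the formulation above, while the actual recipe delegates this to the pytorch_grad_cam library (see Stage 9).

import torch
import torch.nn as nn

# Toy 1D classifier standing in for SSL-Res1D (hypothetical architecture)
model = nn.Sequential(
    nn.Conv1d(1, 8, kernel_size=9, padding=4),  # target layer (index 0)
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(8, 2),
)
target_layer = model[0]

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(A=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(dA=go[0]))

wav = torch.randn(1, 1, 16000)   # 1 s of 16 kHz audio
score = model(wav)[0, 1]         # y^c for the spoof class (c = 1)
score.backward()                 # backward pass fills grads['dA']

alpha = grads['dA'].mean(dim=2, keepdim=True)                 # GAP over time: alpha_k^c
cam = torch.relu((alpha * feats['A']).sum(dim=1)).squeeze(0)  # weighted sum + ReLU
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalize to [0, 1]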

Why Grad-CAM for Partial Spoofing?

Unlike fully synthetic audio, which is fake throughout, partially spoofed audio requires:

  • Temporal localization: Identify when spoofing occurs

  • Boundary detection: Find transitions between real/fake

  • Segment-level understanding: Distinguish mixed content

Grad-CAM provides this temporal resolution by showing activation strength over time.

Model Architecture: SSL-Res1D

Pipeline Components

Audio Input (16kHz)
    ↓
[SSL Frontend] XLSR-53
    ↓
[Classifier] Res1D Backend
    ↓
Classification Score (Bonafide/Spoof)

Key Configuration

From conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml:

model: ssl_multireso_gmlp
model_args:
  feat_dim: 768          # XLSR-53 feature dimension
  embed_dim: -2          # Output embedding dimension
  num_scale: 6           # Multi-resolution scales
  gmlp_layers: 1
  batch_first: true
  flag_pool: ap          # Attentive pooling

frontend: xlsr_53
xlsr_53_args:
  layer: 12              # Use 12th layer of XLSR-53
  
projection_args:
  project_type: arc_margin
  scale: 30.0
  margin: 0.2
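
As a quick sanity check, the configuration can be loaded and inspected directly (a minimal sketch, assuming only the YAML keys shown above):

import yaml

with open("conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml") as f:
    configs = yaml.safe_load(f)

print(configs["model"])                      # ssl_multireso_gmlp
print(configs["model_args"]["num_scale"])    # 6
print(configs["projection_args"]["margin"])  # 0.2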

Why This Architecture?

  1. XLSR-53: Self-supervised speech representations capture fine-grained acoustic patterns

  2. Res1D: 1D residual blocks effective for temporal modeling

  3. Multi-Resolution: Captures artifacts at different temporal scales

  4. Arc Margin: Enhances inter-class separation

Stage 1: Data Preparation

Script: local/prepare_data.sh

Purpose

Prepare the PartialSpoof dataset in WeDefense format.

Input

  • PartialSpoof database directory

  • Protocol files: PartialSpoof.LA.cm.{train,dev,eval}.trl.txt

Process

  1. Create wav.scp

    find ${PS_dir}/${dset}/con_wav -name "*.wav" | awk -F"/" '{print $NF,$0}' | sort
    

    Format: utterance_id /path/to/audio.wav

  2. Extract labels (utt2lab)

    cut -d' ' -f2,5 ${PS_dir}/protocols/PartialSpoof_LA_cm_protocols/PartialSpoof.LA.cm.${dset}.trl.txt
    

    Format: utterance_id bonafide/spoof

  3. Create lab2utt mapping

    ./tools/utt2lab_to_lab2utt.pl ${data}/${dset}/utt2lab
    

    Groups utterances by label

  4. Compute durations

    python tools/wav2dur.py ${data}/${dset}/wav.scp ${data}/${dset}/utt2dur
    

Output Files

data/{train,dev,eval}/
  β”œβ”€β”€ wav.scp      # Audio paths
  β”œβ”€β”€ utt2lab      # Utterance labels
  β”œβ”€β”€ lab2utt      # Label-to-utterance mapping
  └── utt2dur      # Audio durations

Stage 3: Model Training (Overview)

Command

torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:$PORT \
  --nnodes=1 --nproc_per_node=$num_gpus \
  wedefense/bin/train.py --config $config \
    --exp_dir ${exp_dir} \
    --gpus $gpus \
    --num_avg ${num_avg} \
    --data_type "${data_type}" \
    --train_data ${data}/train/${data_type}.list \
    --train_label ${data}/train/utt2lab

Training Process

  1. Data Loading: Batch sampling from shard/raw format

  2. Frontend: Extract XLSR-53 features (Layer 12)

  3. Augmentation: Optional spec augmentation, speed perturbation

  4. Forward: Encoder β†’ Pooling β†’ Projection

  5. Loss: Arc Margin Softmax loss

  6. Optimization: AdamW with learning rate scheduling

Key Training Parameters

  • Batch size: Typically 64-128

  • Learning rate: 1e-4 with warmup

  • Epochs: 50-100 with early stopping

  • Checkpointing: Save every epoch
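
The exact optimizer and scheduler live in wedefense/bin/train.py; the following is only a minimal PyTorch sketch of the stated setup (AdamW at 1e-4 with warmup). The dummy model and random batches stand in for SSL-Res1D and the shard/raw loader, and the warmup length is an assumed value:

import torch
import torch.nn as nn

model = nn.Linear(768, 2)  # placeholder for the full SSL-Res1D model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

warmup_steps = 1000  # assumed; the recipe's actual schedule may differ
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(5):
    feats = torch.randn(64, 768)                  # batch of pooled embeddings
    labels = torch.randint(0, 2, (64,))           # bonafide/spoof targets
    loss = nn.functional.cross_entropy(model(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                              # per-step warmup LR update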

Output

exp/singlereso_utt_xlsr_53_ft_backend_Res1D/
  β”œβ”€β”€ config.yaml
  β”œβ”€β”€ models/
  β”‚   β”œβ”€β”€ model_1.pt
  β”‚   β”œβ”€β”€ model_2.pt
  β”‚   └── ...
  └── tensorboard/

Stage 4: Model Averaging

Purpose

Average the top-N best model checkpoints to improve robustness.

Command

python wedefense/bin/average_model.py \
  --dst_model $exp_dir/models/avg_model.pt \
  --src_path $exp_dir/models \
  --num 10

Process

  1. Identify top-10 checkpoints by validation performance

  2. Load state dictionaries

  3. Average parameters: \(\theta_{\text{avg}} = \frac{1}{N}\sum_{i=1}^{N}\theta_i\)

  4. Save averaged model
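
Conceptually, the averaging step reduces to loading each state dict and taking the parameter-wise mean; a minimal sketch follows (paths are illustrative, and the real average_model.py also handles checkpoint selection):

import torch

# Parameter-wise mean over N checkpoints (paths are illustrative)
ckpt_paths = [f"models/model_{i}.pt" for i in range(1, 11)]

avg_state = None
for path in ckpt_paths:
    state = torch.load(path, map_location="cpu")
    if avg_state is None:
        avg_state = {k: v.clone().float() for k, v in state.items()}
    else:
        for k, v in state.items():
            avg_state[k] += v.float()

avg_state = {k: v / len(ckpt_paths) for k, v in avg_state.items()}
torch.save(avg_state, "models/avg_model.pt")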

Output

exp/singlereso_utt_xlsr_53_ft_backend_Res1D/models/avg_model.pt

This averaged model is used for all subsequent stages.

Stage 9: XAI Extraction with Grad-CAM

Script: wedefense/bin/XAI_GradCam_infer.py

Command

CUDA_VISIBLE_DEVICES=0 python wedefense/bin/XAI_GradCam_infer.py \
  --config ${exp_dir}/config.yaml \
  --model_path $exp_dir/models/avg_model.pt \
  --data_type "shard" \
  --data_list ${data}/dev/shard.list \
  --batch_size 1 \
  --num_workers 1 \
  --num_classes 2 \
  --xai_scores_path ${exp_dir}/xai_scores/dev.pkl

Step-by-Step Process

1. Model Preparation

# Load pretrained model
model = get_model(configs['model'])(**configs['model_args'])
load_checkpoint(model, model_path)

# Wrap with projection head
projection = get_projection(configs['projection_args'])
full_model = FullModel(model, projection, test_conf)

2. Target Layer Selection

# For SSL-Res1D, target the final pooling layer
target_layer = [full_model.encoder.stat_pooling]

Why this layer?

  • Final representation before classification

  • Captures high-level temporal features

  • Maintains temporal resolution

3. Grad-CAM Initialization

from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

cam = GradCAM(model=full_model, target_layers=target_layer)

4. Per-Utterance Extraction

For each audio utterance:

# Load audio
wavs = batch['wav'].float().to(device)  # Shape: (1, wav_length)

# Target spoof class (class 1)
targets = [ClassifierOutputTarget(1)]

# Extract Grad-CAM heatmap
cam_output = cam(input_tensor=wavs, targets=targets)
# cam_output shape: (temporal_frames,), values in [0, 1]

5. Save Results

results = []
for utt, heatmap in zip(utterance_ids, cam_outputs):
    results.append([[utt], heatmap.tolist()])

with open(xai_scores_path, 'wb') as f:
    pickle.dump(results, f)

Output Format

# xai_scores/dev.pkl structure:
[
  [["utt_id_1"], [0.12, 0.23, 0.89, ..., 0.34]],  # Heatmap for utterance 1
  [["utt_id_2"], [0.08, 0.15, 0.76, ..., 0.21]],  # Heatmap for utterance 2
  ...
]

Each heatmap is a 1D array where:

  • Length: Number of temporal frames

  • Values: [0, 1] indicating activation strength

  • High values: Model focuses on these regions for spoof detection
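
A quick way to verify the output is to load the pickle and inspect one entry (a sketch assuming the structure shown above):

import pickle
import numpy as np

with open("exp/singlereso_utt_xlsr_53_ft_backend_Res1D/xai_scores/dev.pkl", "rb") as f:
    xai_results = pickle.load(f)

(utt_id,), heatmap = xai_results[0]   # e.g., [["utt_id_1"], [0.12, ...]]
heatmap = np.array(heatmap)
print(utt_id, heatmap.shape, heatmap.min(), heatmap.max())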

Stage 10: XAI Score Analysis

Script: wedefense/bin/XAI_Score_analysis.py

Command

python3 wedefense/bin/XAI_Score_analysis.py \
  --set dev \
  --pkl_path ${exp_dir}/xai_scores/dev.pkl \
  --vad_path "$VAD_PATH"

Analysis Components

1. Load XAI Scores and VAD Information

# Load XAI heatmaps
with open(pkl_path, 'rb') as f:
    xai_results = pickle.load(f)

# Load voice activity detection (optional)
# VAD helps focus on speech regions only
vad_info = load_vad(vad_path)

2. Compute Statistics

For each utterance:

import numpy as np
from scipy.signal import find_peaks

heatmap = np.array(xai_result[1])

# Basic statistics
mean_activation = np.mean(heatmap)
max_activation = np.max(heatmap)
std_activation = np.std(heatmap)

# Temporal analysis: local maxima with activation above 0.5
peak_indices, _ = find_peaks(heatmap, height=0.5)
# Group consecutive peak frames into candidate regions
peak_regions = np.split(peak_indices, np.where(np.diff(peak_indices) > 1)[0] + 1)

3. Segment Detection

Threshold-based segmentation:

threshold = 0.5  # Tunable parameter
spoofed_mask = heatmap > threshold

# Find continuous regions above the threshold
segments = []
in_segment = False
for t, is_spoof in enumerate(spoofed_mask):
    if is_spoof and not in_segment:
        start = t
        in_segment = True
    elif not is_spoof and in_segment:
        segments.append((start, t))
        in_segment = False
if in_segment:
    # Close a segment that extends to the end of the utterance
    segments.append((start, len(spoofed_mask)))
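
Note that segments are expressed in frame indices. Reporting boundaries in seconds requires the frame shift of the target layer's temporal axis, which is model-dependent; the conversion below assumes a hypothetical 20 ms shift:

# Frame indices -> seconds (the 0.02 s frame shift is an assumed value)
frame_shift = 0.02
segments = [(10, 35), (80, 120)]  # example output of the loop above
segments_sec = [(s * frame_shift, e * frame_shift) for s, e in segments]
print(segments_sec)  # [(0.2, 0.7), (1.6, 2.4)]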

4. Visualization

Generate plots for each utterance:

A. Temporal Activation Profile

import matplotlib.pyplot as plt

# Frame indices -> seconds (frame_shift as above; model-dependent)
time_axis = np.arange(len(heatmap)) * frame_shift

plt.figure(figsize=(12, 4))
plt.plot(time_axis, heatmap, linewidth=2, color='red')
plt.fill_between(time_axis, heatmap, alpha=0.3, color='red')
plt.axhline(y=threshold, linestyle='--', color='blue', label='Threshold')
plt.xlabel('Time (s)')
plt.ylabel('Activation')
plt.title(f'XAI Temporal Activation - {utterance_id}')
plt.legend()

B. Spectrogram with Heatmap Overlay

import librosa

# Load audio and compute a log-magnitude spectrogram
audio, sr = librosa.load(audio_path, sr=16000)
D = librosa.amplitude_to_db(np.abs(librosa.stft(audio)))

# Resample the heatmap to the STFT frame count, then repeat along frequency
heatmap_stft = np.interp(np.linspace(0, 1, D.shape[1]),
                         np.linspace(0, 1, len(heatmap)), heatmap)
heatmap_2d = np.tile(heatmap_stft, (D.shape[0], 1))
plt.imshow(heatmap_2d, aspect='auto', origin='lower', cmap='hot', alpha=0.6)

C. Detected Segment Boundaries

# Mark detected spoofed regions (label only the first span to avoid
# duplicate legend entries)
for i, (start, end) in enumerate(segments):
    plt.axvspan(start, end, alpha=0.3, color='red',
                label='Detected Spoof' if i == 0 else None)

5. Aggregate Analysis

Compare Bonafide vs Spoof distributions:

# Separate by ground-truth label (`labels` read from utt2lab, aligned with xai_results)
bonafide_activations = []
spoof_activations = []

for result, label in zip(xai_results, labels):
    mean_act = np.mean(result[1])
    if label == 'bonafide':
        bonafide_activations.append(mean_act)
    else:
        spoof_activations.append(mean_act)

# Plot distributions
plt.hist(bonafide_activations, bins=50, alpha=0.5, label='Bonafide', color='green')
plt.hist(spoof_activations, bins=50, alpha=0.5, label='Spoof', color='red')
plt.xlabel('Mean Activation')
plt.ylabel('Count')
plt.legend()
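
To go beyond the visual comparison, a two-sample test can quantify the separation between the distributions; the Mann-Whitney U test below is an illustrative choice, not necessarily the test Stage 8 applies:

from scipy.stats import mannwhitneyu

# One-sided test: do spoofed utterances show higher mean activation?
stat, p_value = mannwhitneyu(spoof_activations, bonafide_activations,
                             alternative="greater")
print(f"U = {stat:.1f}, p = {p_value:.3e}")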

Output

exp/xai_scores/
  β”œβ”€β”€ dev.pkl                    # Raw heatmaps
  β”œβ”€β”€ analysis/
  β”‚   β”œβ”€β”€ temporal_profiles/     # Per-utterance plots
  β”‚   β”œβ”€β”€ segment_detection/     # Detected boundaries
  β”‚   β”œβ”€β”€ statistics.csv         # Aggregate stats
  β”‚   └── distribution.png       # Bonafide vs Spoof comparison

Interpreting XAI Results

Activation Patterns and Their Meanings

| Pattern | Visual Appearance | Interpretation | Example Scenario |
| --- | --- | --- | --- |
| Sharp Peaks | πŸ“ˆ Sudden spikes at specific time points | Splice boundaries detected | Partially spoofed audio with clear transitions |
| Sustained High Activation | 🌊 Long regions with elevated values | Continuous spoofed segment | TTS-generated insertion |
| Low Flat Profile | πŸ“‰ Consistently low values | Genuine speech | Bonafide utterance |
| Multiple Peaks | 🎯 Several distinct high regions | Multiple spoofed insertions | Complex partial spoofing |
| Gradual Rise/Fall | πŸ“Š Smooth transitions | Soft boundaries or gradual blending | Advanced synthesis with smoothing |

Decision Guidelines

For Bonafide Audio:

  • βœ… Expected: Low mean activation (<0.3)

  • βœ… Expected: Small standard deviation (<0.15)

  • βœ… Expected: No sustained high-activation regions

For Partially Spoofed Audio:

  • βœ… Expected: Moderate to high mean activation (>0.4)

  • βœ… Expected: High variance in temporal profile

  • βœ… Expected: Clear peaks corresponding to fake segments

  • ⚠️ Watch for: Peaks aligning with VAD boundaries (may indicate model bias)
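
These guidelines can be wrapped into a simple triage rule. The sketch below hard-codes the thresholds stated above; treat it as a starting point for manual inspection, not a calibrated classifier:

import numpy as np

def guideline_check(heatmap: np.ndarray) -> str:
    """Rule-of-thumb triage using the thresholds stated above."""
    mean, std = heatmap.mean(), heatmap.std()
    if mean < 0.3 and std < 0.15:
        return "consistent with bonafide"
    if mean > 0.4:
        return "consistent with partial spoofing"
    return "inconclusive: inspect the temporal profile"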

Common Pitfalls

  1. Edge Effects: High activation at utterance boundaries may be artifacts

    • Solution: Ignore first/last 100ms

  2. VAD Correlation: Model may focus on silence/non-speech regions

    • Solution: Compare XAI with VAD labels

  3. Threshold Sensitivity: Different thresholds yield different segmentations

    • Solution: Use multiple thresholds (0.3, 0.5, 0.7) for robustness (see the sweep sketch after this list)

  4. Model Overfitting: Consistent patterns across all spoof types

    • Solution: Analyze per-algorithm breakdown
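
For pitfall 3, a threshold sweep shows how stable the detected segments are. A minimal sketch follows; the vectorized helper is equivalent to the loop in Stage 10, step 3, and the random heatmap is only a stand-in for real scores:

import numpy as np

def segments_at(heatmap: np.ndarray, threshold: float):
    """Continuous regions above `threshold`, as (start, end) frame indices."""
    mask = np.concatenate([[False], heatmap > threshold, [False]])
    edges = np.flatnonzero(np.diff(mask.astype(int)))
    return list(zip(edges[::2], edges[1::2]))

heatmap = np.random.rand(200)  # stand-in for a real XAI heatmap
for thr in (0.3, 0.5, 0.7):
    print(thr, segments_at(heatmap, thr))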

Validation Checklist

βœ… Do activation peaks align with known spoofed segments (if ground truth available)?
βœ… Are bonafide utterances consistently low-activation?
βœ… Do different spoofing algorithms show distinct patterns?
βœ… Are high activations focused on speech regions (not silence)?
βœ… Can you aurally perceive artifacts in high-activation regions?

Practical Usage Guide

Running the Complete Pipeline

1. Setup Environment

cd egs/detection/partialspoof/x12_ssl_res1d
source path.sh

2. Configure Paths

Edit run.sh:

PS_dir=/path/to/PartialSpoof/database
data=/path/to/output/data
config=conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml
exp_dir=exp/singlereso_utt_xlsr_53_ft_backend_Res1D
VAD_PATH=/path/to/vad_annotations  # Optional

3. Run Data Preparation (Stage 1-2)

bash run.sh --stage 1 --stop_stage 2

4. Train Model (Stage 3-4)

bash run.sh --stage 3 --stop_stage 4 --gpus "[0]"

Training time: ~24-48 hours on a single GPU

5. Evaluate Model (Stage 5-7)

bash run.sh --stage 5 --stop_stage 7

Check performance:

EER: X.XX%
min t-DCF: X.XXX

6. Extract XAI (Stage 9)

bash run.sh --stage 9 --stop_stage 9 --gpus "[0]"

Extraction time: ~1-2 hours for eval set

7. Analyze XAI (Stage 10)

bash run.sh --stage 10 --stop_stage 10

Customization Options

Change Target Layer

In wedefense/bin/XAI_GradCam_infer.py:

# Original: final pooling layer
target_layer = [full_model.encoder.stat_pooling]

# Alternative: intermediate layer
target_layer = [full_model.encoder.layer4]  # Earlier features

Adjust Detection Threshold

In XAI_Score_analysis.py:

# Default threshold
threshold = 0.5

# Stricter detection (fewer false positives)
threshold = 0.7

# More sensitive (catch subtle spoofs)
threshold = 0.3

Target Different Class

# Original: target spoof class
targets = [ClassifierOutputTarget(1)]

# Alternative: target bonafide class (what makes it genuine?)
targets = [ClassifierOutputTarget(0)]

Summary

This tutorial covered the complete workflow for XAI-based partially spoofed audio detection:

Key Takeaways

βœ… Pipeline Architecture

  • 10-stage pipeline from data to XAI analysis

  • SSL-Res1D model with XLSR-53 frontend

  • Grad-CAM for temporal activation mapping

βœ… XAI Extraction Process

  • Target layer selection critical for interpretability

  • Per-utterance temporal heatmaps

  • Batch processing for efficiency

βœ… Result Interpretation

  • Activation patterns indicate spoofed regions

  • Threshold-based segment detection

  • Statistical validation essential

βœ… Practical Considerations

  • Model quality affects XAI quality

  • VAD integration improves focus

  • Cross-validation with audio inspection

Limitations and Future Directions

⚠️ Current Limitations:

  • Grad-CAM shows correlation, not causation

  • Requires well-trained model

  • Threshold selection is dataset-dependent

  • May miss subtle artifacts

πŸ”¬ Future Work:

  • Multi-layer XAI fusion

  • Attention-based explainability

  • Frame-level ground truth comparison

  • Real-time XAI for streaming audio

Resources

πŸ“‚ Implementation: egs/detection/partialspoof/x12_ssl_res1d/
πŸ“„ Paper: arxiv.org/abs/2406.02483
πŸ’» GitHub: github.com/zlin0/wedefense
πŸ“– Docs: wedefense.readthedocs.io

References

  1. Partial Spoofing Detection: Liu et al., β€œHow Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?”, 2024 [paper]

  2. Grad-CAM: Selvaraju et al., β€œGrad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization”, ICCV 2017 [paper]

  3. SSL Representations: Baevski et al., β€œwav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”, NeurIPS 2020 [paper]

  4. XLSR: Conneau et al., β€œUnsupervised Cross-lingual Representation Learning for Speech Recognition”, Interspeech 2021 [paper]

  5. PartialSpoof Dataset: Zhang et al., β€œAn Initial Investigation for Detecting Partially Spoofed Audio”, Interspeech 2021 [paper]

  6. WeDefense Framework: [GitHub] [Documentation]

  7. PyTorch Grad-CAM: [GitHub]