Explainable AI (XAI) for Partially Spoofed Audio Detection with Grad-CAM
Author: Tianchi Liu
Status: In Progress
Reference: How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?
Overview
This tutorial explains the step-by-step workflow for applying Explainable AI (XAI) techniques to partially spoofed audio detection using the Gradient-weighted Class Activation Mapping (Grad-CAM) method.
Partially spoofed audio refers to utterances where only certain segments are synthetic while others remain genuine.
Reference Implementation Path
egs/detection/partialspoof/x12_ssl_res1d/
Key Components

| File | Purpose |
|---|---|
| run.sh | Main pipeline orchestrating Stages 1-10 |
| conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml | Model configuration |
| local/prepare_data.sh | Data preparation script |
| wedefense/bin/train.py | Model training |
| wedefense/bin/XAI_GradCam_infer.py | XAI heatmap extraction |
| wedefense/bin/XAI_Score_analysis.py | XAI score analysis and visualization |
What This Tutorial Covers
✅ Complete Pipeline - From data preparation to XAI analysis
✅ Model Architecture - SSL-Res1D for partial spoofing detection
✅ Grad-CAM Theory - How temporal activation maps are computed
✅ XAI Extraction - Step-by-step extraction process
✅ Result Interpretation - Understanding and analyzing XAI scores
Complete Pipeline Overview
The run.sh script implements a 10-stage pipeline:
Stage 1: Data Preparation → wav.scp, utt2lab, lab2utt
Stage 2: Data Format Conversion → Shard/Raw format
Stage 3: Model Training → SSL-Res1D training
Stage 4: Model Averaging → Average best checkpoints
Stage 5: Extract Logits → Model inference
Stage 6: Compute LLR Scores → Log-likelihood ratios
Stage 7: Performance Evaluation → EER, min t-DCF metrics
Stage 8: Analysis → Statistical tests
Stage 9: XAI Extraction → Grad-CAM heatmaps
Stage 10: XAI Analysis → Visualization and interpretation
This tutorial focuses on Stages 9-10 (XAI extraction and analysis), assuming Stages 1-8 are complete.
Grad-CAM Theory for Audio
What is Grad-CAM?
Grad-CAM (Gradient-weighted Class Activation Mapping) identifies which regions of the input the model focuses on when making predictions.
Mathematical Formulation
For a target class \(c\) (e.g., the spoof class):
Forward Pass:
Input audio → SSL Frontend → Classifier (Res1D) → Classification score \(y^c\)
Extract feature maps \(A^k\) from the target layer
Backward Pass:
Compute gradients: \(\frac{\partial y^c}{\partial A^k}\)
Weight Calculation (Global Average Pooling):
\[\alpha_k^c = \frac{1}{T}\sum_{t=1}^{T}\frac{\partial y^c}{\partial A^k_t}\]
where \(T\) is the temporal dimension.
Weighted Combination:
\[L^c_{\text{Grad-CAM}} = \text{ReLU}\left(\sum_k \alpha_k^c A^k\right)\]
Temporal Heatmap:
Normalize to [0, 1]
High values indicate regions important for classification
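The weight calculation and weighted combination above can be sketched directly in NumPy. Here `feature_maps` and `gradients` are placeholders for the activations \(A^k\) and gradients \(\partial y^c / \partial A^k\) that a real implementation captures with forward/backward hooks; this is an illustration of the math, not the library's code.

```python
import numpy as np

def grad_cam_1d(feature_maps, gradients, eps=1e-8):
    """Compute a temporal Grad-CAM heatmap.

    feature_maps: (K, T) activations A^k from the target layer
    gradients:    (K, T) gradients dy^c/dA^k for the target class
    Returns a length-T heatmap normalized to [0, 1].
    """
    # Global average pooling over time -> channel weights alpha_k^c
    alpha = gradients.mean(axis=1)                                      # (K,)
    # Weighted combination of feature maps, then ReLU
    cam = np.maximum((alpha[:, None] * feature_maps).sum(axis=0), 0.0)  # (T,)
    # Normalize to [0, 1]
    cam = (cam - cam.min()) / (cam.max() - cam.min() + eps)
    return cam

# Toy example: 4 channels, 10 temporal frames
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 10))
G = rng.standard_normal((4, 10))
heatmap = grad_cam_1d(A, G)
print(heatmap.shape)
```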
Why Grad-CAM for Partial Spoofing?
Unlike fully synthetic audio (uniform fake), partially spoofed audio requires:
Temporal localization: Identify when spoofing occurs
Boundary detection: Find transitions between real/fake
Segment-level understanding: Distinguish mixed content
Grad-CAM provides this temporal resolution by showing activation strength over time.
Model Architecture: SSL-Res1D
Pipeline Components
Audio Input (16kHz)
    ↓
[SSL Frontend] XLSR-53
    ↓
[Classifier] Res1D Backend
    ↓
Classification Score (Bonafide/Spoof)
Key Configuration
From conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml:
model: ssl_multireso_gmlp
model_args:
  feat_dim: 768        # XLSR-53 feature dimension
  embed_dim: -2        # Output embedding dimension
  num_scale: 6         # Multi-resolution scales
  gmlp_layers: 1
  batch_first: true
  flag_pool: ap        # Attentive pooling
frontend: xlsr_53
xlsr_53_args:
  layer: 12            # Use the 12th layer of XLSR-53
projection_args:
  project_type: arc_margin
  scale: 30.0
  margin: 0.2
Why This Architecture?
XLSR-53: Self-supervised speech representations capture fine-grained acoustic patterns
Res1D: 1D residual blocks effective for temporal modeling
Multi-Resolution: Captures artifacts at different temporal scales
Arc Margin: Enhances inter-class separation
Stage 1: Data Preparation
Script: local/prepare_data.sh
Purpose
Prepare the PartialSpoof dataset in WeDefense format.
Input
PartialSpoof database directory
Protocol files:
PartialSpoof.LA.cm.{train,dev,eval}.trl.txt
Process
1. Create wav.scp
find ${PS_dir}/${dset}/con_wav -name "*.wav" | awk -F"/" '{print $NF,$0}' | sort
Format:
utterance_id /path/to/audio.wav
2. Extract labels (utt2lab)
cut -d' ' -f2,5 ${PS_dir}/protocols/PartialSpoof_LA_cm_protocols/PartialSpoof.LA.cm.${dset}.trl.txt
Format:
utterance_id bonafide/spoof
3. Create lab2utt mapping
./tools/utt2lab_to_lab2utt.pl ${data}/${dset}/utt2lab
Groups utterances by label.
4. Compute durations
python tools/wav2dur.py ${data}/${dset}/wav.scp ${data}/${dset}/utt2dur
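The wav.scp step can also be mirrored in Python, which is sometimes handier than the find/awk pipeline. The helper name `write_wav_scp` and the utterance IDs below are illustrative, not part of the toolkit:

```python
import os
import tempfile

def write_wav_scp(wav_dir, out_path):
    """Map each utterance ID (file basename) to its path, sorted by ID."""
    entries = []
    for root, _, files in os.walk(wav_dir):
        for f in files:
            if f.endswith(".wav"):
                entries.append((f, os.path.join(root, f)))
    entries.sort()
    with open(out_path, "w") as fh:
        for utt, path in entries:
            fh.write(f"{utt} {path}\n")
    return len(entries)

# Toy demo with a temporary directory standing in for ${PS_dir}/${dset}/con_wav
with tempfile.TemporaryDirectory() as d:
    for name in ["CON_T_0000002.wav", "CON_T_0000001.wav"]:
        open(os.path.join(d, name), "w").close()
    scp = os.path.join(d, "wav.scp")
    n = write_wav_scp(d, scp)
    lines = open(scp).read().splitlines()
print(n, lines[0].split()[0])
```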
Output Files
data/{train,dev,eval}/
├── wav.scp   # Audio paths
├── utt2lab   # Utterance labels
├── lab2utt   # Label-to-utterance mapping
└── utt2dur   # Audio durations
Stage 3: Model Training (Overview)
Command
torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:$PORT \
--nnodes=1 --nproc_per_node=$num_gpus \
wedefense/bin/train.py --config $config \
--exp_dir ${exp_dir} \
--gpus $gpus \
--num_avg ${num_avg} \
--data_type "${data_type}" \
--train_data ${data}/train/${data_type}.list \
--train_label ${data}/train/utt2lab
Training Process
Data Loading: Batch sampling from shard/raw format
Frontend: Extract XLSR-53 features (Layer 12)
Augmentation: Optional spec augmentation, speed perturbation
Forward: Encoder → Pooling → Projection
Loss: Arc Margin Softmax loss
Optimization: AdamW with learning rate scheduling
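The Arc Margin step above can be sketched numerically. This is a NumPy illustration of the additive angular margin scoring rule (with the config's scale=30.0, margin=0.2), not the framework's actual `arc_margin` implementation:

```python
import numpy as np

def arc_margin_logits(embedding, class_weights, target, scale=30.0, margin=0.2):
    """Additive angular margin logits for one sample.

    embedding:     (D,) utterance embedding
    class_weights: (C, D) one weight vector per class
    target:        index of the ground-truth class
    """
    # Cosine similarity between L2-normalized embedding and class weights
    e = embedding / np.linalg.norm(embedding)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = w @ e                                   # (C,)
    # Add the angular margin only to the target class, then rescale
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    cos_m = cos.copy()
    cos_m[target] = np.cos(theta[target] + margin)
    return scale * cos_m

emb = np.array([1.0, 0.5])
W = np.array([[1.0, 0.4], [-1.0, 0.2]])
logits = arc_margin_logits(emb, W, target=0)
# Plain scaled cosine, for comparison: the margin shrinks the target logit,
# forcing tighter intra-class clustering during training
plain = 30.0 * (W / np.linalg.norm(W, axis=1, keepdims=True) @ (emb / np.linalg.norm(emb)))
print(logits[0] < plain[0])
```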
Key Training Parameters
Batch size: Typically 64-128
Learning rate: 1e-4 with warmup
Epochs: 50-100 with early stopping
Checkpointing: Save every epoch
Output
exp/singlereso_utt_xlsr_53_ft_backend_Res1D/
├── config.yaml
├── models/
│   ├── model_1.pt
│   ├── model_2.pt
│   └── ...
└── tensorboard/
Stage 4: Model Averaging
Purpose
Average the top-N best model checkpoints to improve robustness.
Command
python wedefense/bin/average_model.py \
--dst_model $exp_dir/models/avg_model.pt \
--src_path $exp_dir/models \
--num 10
Process
Identify top-10 checkpoints by validation performance
Load state dictionaries
Average parameters: \(\theta_{\text{avg}} = \frac{1}{N}\sum_{i=1}^{N}\theta_i\)
Save averaged model
Output
exp/singlereso_utt_xlsr_53_ft_backend_Res1D/models/avg_model.pt
This averaged model is used for all subsequent stages.
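The averaging itself is just an elementwise mean over checkpoint parameters. A sketch with NumPy arrays standing in for torch tensors (the real `average_model.py` operates on torch state dicts loaded from disk):

```python
import numpy as np

def average_state_dicts(state_dicts):
    """Elementwise average of N checkpoints with identical parameter keys."""
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}

# Two toy "checkpoints" with the same keys
ckpts = [
    {"w": np.array([1.0, 2.0]), "b": np.array([0.0])},
    {"w": np.array([3.0, 4.0]), "b": np.array([2.0])},
]
avg = average_state_dicts(ckpts)
print(avg["w"], avg["b"])  # elementwise mean of the two checkpoints
```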
Stage 9: XAI Extraction with Grad-CAM
Script: wedefense/bin/XAI_GradCam_infer.py
Command
CUDA_VISIBLE_DEVICES=0 python wedefense/bin/XAI_GradCam_infer.py \
--config ${exp_dir}/config.yaml \
--model_path $exp_dir/models/avg_model.pt \
--data_type "shard" \
--data_list ${data}/dev/shard.list \
--batch_size 1 \
--num_workers 1 \
--num_classes 2 \
--xai_scores_path ${exp_dir}/xai_scores/dev.pkl
Step-by-Step Process
1. Model Preparation
# Load pretrained model
model = get_model(configs['model'])(**configs['model_args'])
load_checkpoint(model, model_path)
# Wrap with projection head
projection = get_projection(configs['projection_args'])
full_model = FullModel(model, projection, test_conf)
2. Target Layer Selection
# For SSL-Res1D, target the final pooling layer
target_layer = [full_model.encoder.stat_pooling]
Why this layer?
Final representation before classification
Captures high-level temporal features
Maintains temporal resolution
3. Grad-CAM Initialization
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
cam = GradCAM(model=full_model, target_layers=target_layer)
4. Per-Utterance Extraction
For each audio utterance:
# Load audio
wavs = batch['wav'].float().to(device) # Shape: (1, wav_length)
# Target spoof class (class 1)
targets = [ClassifierOutputTarget(1)]
# Extract Grad-CAM heatmap
cam_output = cam(input_tensor=wavs, targets=targets)
# cam_output shape: (temporal_frames,) ranging [0, 1]
5. Save Results
import pickle

results = []
for utt, heatmap in zip(utterance_ids, cam_outputs):
    results.append([[utt], heatmap.tolist()])
with open(xai_scores_path, 'wb') as f:
    pickle.dump(results, f)
Output Format
# xai_scores/dev.pkl structure:
[
[["utt_id_1"], [0.12, 0.23, 0.89, ..., 0.34]], # Heatmap for utterance 1
[["utt_id_2"], [0.08, 0.15, 0.76, ..., 0.21]], # Heatmap for utterance 2
...
]
Each heatmap is a 1D array where:
Length: Number of temporal frames
Values: [0, 1] indicating activation strength
High values: Model focuses on these regions for spoof detection
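For illustration, a self-contained round-trip through this format (the utterance IDs and heatmap values here are synthetic):

```python
import os
import pickle
import tempfile

# Synthetic results in the [[utt_id], heatmap] layout described above
results = [
    [["utt_id_1"], [0.12, 0.23, 0.89, 0.34]],
    [["utt_id_2"], [0.08, 0.15, 0.76, 0.21]],
]

path = os.path.join(tempfile.mkdtemp(), "dev.pkl")
with open(path, "wb") as f:
    pickle.dump(results, f)

# Re-read and unpack exactly as a downstream analysis script would
with open(path, "rb") as f:
    loaded = pickle.load(f)

for (utt,), heatmap in loaded:
    print(utt, len(heatmap), max(heatmap))
```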
Stage 10: XAI Score Analysis
Script: wedefense/bin/XAI_Score_analysis.py
Command
python3 wedefense/bin/XAI_Score_analysis.py \
--set dev \
--pkl_path ${exp_dir}/xai_scores/dev.pkl \
--vad_path "$VAD_PATH"
Analysis Components
1. Load XAI Scores and VAD Information
import pickle

# Load XAI heatmaps
with open(pkl_path, 'rb') as f:
    xai_results = pickle.load(f)

# Load voice activity detection info (optional);
# VAD helps restrict the analysis to speech regions.
# load_vad is a helper defined in the analysis script.
vad_info = load_vad(vad_path)
2. Compute Statistics
For each utterance:
import numpy as np
from scipy.signal import find_peaks

heatmap = np.array(xai_result[1])

# Basic statistics
mean_activation = np.mean(heatmap)
max_activation = np.max(heatmap)
std_activation = np.std(heatmap)

# Temporal analysis: frames whose activation peaks exceed 0.5
peak_indices, _ = find_peaks(heatmap, height=0.5)
peak_regions = group_consecutive_peaks(peak_indices)
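`group_consecutive_peaks` is a helper rather than a library call; one possible sketch (the `gap` parameter is an assumption of this sketch, controlling how close peaks must be to merge):

```python
def group_consecutive_peaks(indices, gap=1):
    """Group sorted frame indices into (start, end) runs; indices at most
    `gap` frames apart are merged into one region (end is inclusive)."""
    regions = []
    for i in indices:
        if regions and i - regions[-1][1] <= gap:
            regions[-1][1] = i          # extend the current run
        else:
            regions.append([i, i])      # start a new run
    return [tuple(r) for r in regions]

print(group_consecutive_peaks([3, 4, 5, 9, 10, 20]))
```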
3. Segment Detection
Threshold-based segmentation:
threshold = 0.5  # Tunable parameter
spoofed_mask = heatmap > threshold

# Find continuous regions
segments = []
in_segment = False
for t, is_spoof in enumerate(spoofed_mask):
    if is_spoof and not in_segment:
        start = t
        in_segment = True
    elif not is_spoof and in_segment:
        segments.append((start, t))
        in_segment = False
if in_segment:  # close a segment that runs to the end of the utterance
    segments.append((start, len(spoofed_mask)))
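For reuse across thresholds, the segmentation logic can be packaged as a small function (a sketch; note it closes a segment that is still open when the utterance ends, and returns exclusive end frames):

```python
def detect_segments(heatmap, threshold=0.5):
    """Return (start, end) frame ranges where activation exceeds threshold;
    end is exclusive."""
    segments = []
    start = None
    for t, v in enumerate(heatmap):
        if v > threshold and start is None:
            start = t                    # entering a high-activation region
        elif v <= threshold and start is not None:
            segments.append((start, t))  # leaving the region
            start = None
    if start is not None:                # region running to the utterance end
        segments.append((start, len(heatmap)))
    return segments

print(detect_segments([0.1, 0.8, 0.9, 0.2, 0.7, 0.6]))
```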
4. Visualization
Generate plots for each utterance:
A. Temporal Activation Profile
plt.figure(figsize=(12, 4))
plt.plot(time_axis, heatmap, linewidth=2, color='red')
plt.fill_between(time_axis, heatmap, alpha=0.3, color='red')
plt.axhline(y=threshold, linestyle='--', color='blue', label='Threshold')
plt.xlabel('Time (s)')
plt.ylabel('Activation')
plt.title(f'XAI Temporal Activation - {utterance_id}')
plt.legend()
B. Spectrogram with Heatmap Overlay
# Load audio and compute spectrogram
audio, sr = librosa.load(audio_path, sr=16000)
D = librosa.amplitude_to_db(np.abs(librosa.stft(audio)))

# Overlay heatmap (resample it to D.shape[1] frames first if the
# Grad-CAM frame rate differs from the STFT frame rate)
heatmap_2d = np.tile(heatmap, (D.shape[0], 1))  # Repeat along the frequency axis
plt.imshow(D, aspect='auto', origin='lower', cmap='gray')
plt.imshow(heatmap_2d, aspect='auto', origin='lower', cmap='hot', alpha=0.6)
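The Grad-CAM heatmap usually has far fewer frames than the STFT, so it should be resampled to the spectrogram's frame count before tiling. A linear-interpolation sketch, assuming both signals span the same time range:

```python
import numpy as np

def resample_heatmap(heatmap, num_stft_frames):
    """Linearly interpolate a frame-level heatmap to the STFT frame count."""
    heatmap = np.asarray(heatmap, dtype=float)
    src = np.linspace(0.0, 1.0, num=len(heatmap))   # original frame positions
    dst = np.linspace(0.0, 1.0, num=num_stft_frames)
    return np.interp(dst, src, heatmap)

# 3 Grad-CAM frames stretched onto 5 STFT frames
resampled = resample_heatmap([0.0, 1.0, 0.0], 5)
print(resampled)
```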
C. Detected Segment Boundaries
# Mark detected spoofed regions
for start, end in segments:
plt.axvspan(start, end, alpha=0.3, color='red', label='Detected Spoof')
5. Aggregate Analysis
Compare Bonafide vs Spoof distributions:
# Separate by ground truth label
bonafide_activations = []
spoof_activations = []
for result, label in zip(xai_results, labels):
mean_act = np.mean(result[1])
if label == 'bonafide':
bonafide_activations.append(mean_act)
else:
spoof_activations.append(mean_act)
# Plot distributions
plt.hist(bonafide_activations, bins=50, alpha=0.5, label='Bonafide', color='green')
plt.hist(spoof_activations, bins=50, alpha=0.5, label='Spoof', color='red')
plt.xlabel('Mean Activation')
plt.ylabel('Count')
plt.legend()
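Beyond histograms, a single scalar can summarize how well mean activation separates the two classes. This rank-based AUC is an addition of this tutorial sketch, not part of the analysis script:

```python
def rank_auc(pos, neg):
    """Probability that a random spoof score exceeds a random bonafide score
    (area under the ROC curve, computed by pairwise comparison; ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy mean activations; 1.0 would mean perfectly separable classes
spoof_activations = [0.6, 0.7, 0.8]
bonafide_activations = [0.1, 0.2, 0.65]
print(rank_auc(spoof_activations, bonafide_activations))
```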
Output
exp/xai_scores/
├── dev.pkl                  # Raw heatmaps
└── analysis/
    ├── temporal_profiles/   # Per-utterance plots
    ├── segment_detection/   # Detected boundaries
    ├── statistics.csv       # Aggregate stats
    └── distribution.png     # Bonafide vs Spoof comparison
Interpreting XAI Results
Activation Patterns and Their Meanings

| Pattern | Visual Appearance | Interpretation | Example Scenario |
|---|---|---|---|
| Sharp Peaks | Sudden spikes at specific time points | Splice boundaries detected | Partially spoofed audio with clear transitions |
| Sustained High Activation | Long regions with elevated values | Continuous spoofed segment | TTS-generated insertion |
| Low Flat Profile | Consistently low values | Genuine speech | Bonafide utterance |
| Multiple Peaks | Several distinct high regions | Multiple spoofed insertions | Complex partial spoofing |
| Gradual Rise/Fall | Smooth transitions | Soft boundaries or gradual blending | Advanced synthesis with smoothing |
Decision Guidelines
For Bonafide Audio:
✅ Expected: Low mean activation (<0.3)
✅ Expected: Small standard deviation (<0.15)
✅ Expected: No sustained high-activation regions
For Partially Spoofed Audio:
✅ Expected: Moderate to high mean activation (>0.4)
✅ Expected: High variance in the temporal profile
✅ Expected: Clear peaks corresponding to fake segments
⚠️ Watch for: Peaks aligning with VAD boundaries (may indicate model bias)
Common Pitfalls
1. Edge Effects: High activation at utterance boundaries may be artifacts.
   Solution: Ignore the first/last 100ms.
2. VAD Correlation: The model may focus on silence/non-speech regions.
   Solution: Compare XAI output with VAD labels.
3. Threshold Sensitivity: Different thresholds yield different segmentations.
   Solution: Use multiple thresholds (0.3, 0.5, 0.7) for robustness.
4. Model Overfitting: Consistent patterns across all spoof types.
   Solution: Analyze the per-algorithm breakdown.
Validation Checklist
✅ Do activation peaks align with known spoofed segments (if ground truth is available)?
✅ Are bonafide utterances consistently low-activation?
✅ Do different spoofing algorithms show distinct patterns?
✅ Are high activations focused on speech regions (not silence)?
✅ Can you aurally perceive artifacts in high-activation regions?
Practical Usage Guide
Running the Complete Pipeline
1. Setup Environment
cd egs/detection/partialspoof/x12_ssl_res1d
source path.sh
2. Configure Paths
Edit run.sh:
PS_dir=/path/to/PartialSpoof/database
data=/path/to/output/data
config=conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml
exp_dir=exp/singlereso_utt_xlsr_53_ft_backend_Res1D
VAD_PATH=/path/to/vad_annotations # Optional
3. Run Data Preparation (Stages 1-2)
bash run.sh --stage 1 --stop_stage 2
4. Train Model (Stages 3-4)
bash run.sh --stage 3 --stop_stage 4 --gpus "[0]"
Training time: ~24-48 hours on single GPU
5. Evaluate Model (Stages 5-7)
bash run.sh --stage 5 --stop_stage 7
Check performance:
EER: X.XX%
min t-DCF: X.XXX
6. Extract XAI (Stage 9)
bash run.sh --stage 9 --stop_stage 9 --gpus "[0]"
Extraction time: ~1-2 hours for eval set
7. Analyze XAI (Stage 10)
bash run.sh --stage 10 --stop_stage 10
Customization Options
Change Target Layer
In wedefense/bin/XAI_GradCam_infer.py:
# Original: final pooling layer
target_layer = [full_model.encoder.stat_pooling]
# Alternative: intermediate layer
target_layer = [full_model.encoder.layer4] # Earlier features
Adjust Detection Threshold
In XAI_Score_analysis.py:
# Default threshold
threshold = 0.5
# Stricter detection (fewer false positives)
threshold = 0.7
# More sensitive (catch subtle spoofs)
threshold = 0.3
Target Different Class
# Original: target spoof class
targets = [ClassifierOutputTarget(1)]
# Alternative: target bonafide class (what makes it genuine?)
targets = [ClassifierOutputTarget(0)]
Summary
This tutorial covered the complete workflow for XAI-based partially spoofed audio detection:
Key Takeaways
✅ Pipeline Architecture
10-stage pipeline from data to XAI analysis
SSL-Res1D model with XLSR-53 frontend
Grad-CAM for temporal activation mapping
✅ XAI Extraction Process
Target layer selection is critical for interpretability
Per-utterance temporal heatmaps
Batch processing for efficiency
✅ Result Interpretation
Activation patterns indicate spoofed regions
Threshold-based segment detection
Statistical validation is essential
✅ Practical Considerations
Model quality affects XAI quality
VAD integration improves focus
Cross-validate with aural inspection of the audio
Limitations and Future Directions
⚠️ Current Limitations:
Grad-CAM shows correlation, not causation
Requires a well-trained model
Threshold selection is dataset-dependent
May miss subtle artifacts
Future Work:
Multi-layer XAI fusion
Attention-based explainability
Frame-level ground truth comparison
Real-time XAI for streaming audio
Resources
Implementation: egs/detection/partialspoof/x12_ssl_res1d/
Paper: arxiv.org/abs/2406.02483
GitHub: github.com/zlin0/wedefense
Docs: wedefense.readthedocs.io
References
Partial Spoofing Detection: Liu et al., "How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?", 2024 [paper]
Grad-CAM: Selvaraju et al., "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization", ICCV 2017 [paper]
SSL Representations: Baevski et al., "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations", NeurIPS 2020 [paper]
XLSR: Conneau et al., "Unsupervised Cross-lingual Representation Learning for Speech Recognition", Interspeech 2021 [paper]
PartialSpoof Dataset: Zhang et al., "An Initial Investigation for Detecting Partially Spoofed Audio", Interspeech 2021 [paper]
WeDefense Framework: [GitHub] [Documentation]
PyTorch Grad-CAM: [GitHub]