{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Spoof Detection Tutorial with WeDefense\n", "**Author:** Lin Zhang, You Zhang\n", "**Date:** 2026-02-11\n", "**Status:** Ready" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "shellscript" } }, "source": [ "## What is Spoof Detection?\n", "\n", "Spoof detection (also known as anti-spoofing or fake audio detection) aims to detect whether an input audio sample is genuine (bonafide) or artificially generated/modified (spoofed). This is crucial for protecting automatic speaker verification systems and ensuring the authenticity of audio content.\n", "\n", "### Task Definition\n", "\n", "Given an audio input $x$, the goal is to produce a score $s$ that indicates how likely the input is genuine.\n", "\n", "In WeDefense, we map $s$ to an LLR for the final decision:\n", "\n", "$$\n", "x \\xrightarrow{\\text{Model}} \\text{embedding} \\xrightarrow{\\text{Projection}} \\text{logits} \\xrightarrow{\\text{Calibration}} \\text{LLR score}\n", "$$\n", "\n", "The final LLR (Log-Likelihood Ratio) score, defined as $\\log \\frac{p(x|H_a)}{p(x|H_r)}$, determines the decision:\n", "- **Positive LLR** → classified as bonafide (real)\n", "- **Negative LLR** → classified as spoof (fake)\n", "\n", "Here, $H_a$ and $H_r$ represent the accept hypothesis (the input audio is real) and the reject hypothesis (the input audio is fake), respectively.\n", "\n", "### Why use LLR instead of raw posteriors from the network?\n", "\n", "While spoof detection can be framed as binary classification, we recommend using calibrated LLR scores instead of raw logits or posteriors for several reasons:\n", "\n", "1. **Prior-aware calibration**: LLR incorporates the prior probability of spoofing attacks, making it more suitable for real-world deployment where attack frequencies vary.\n", "2. 
**Interpretability**: LLR provides a principled decision threshold (0) with a clear probabilistic meaning: at the threshold, $p(x|H_a) = p(x|H_r)$.\n", "3. **Robustness**: Calibrated scores are less sensitive to training data imbalance and generalize better across datasets.\n", "\n", "To see point 1 concretely: since the LLR equals the posterior log-odds minus the prior log-odds, a raw posterior of 0.6 under a bonafide prior of 0.9 gives an LLR of $\\operatorname{logit}(0.6) - \\operatorname{logit}(0.9) \\approx -1.79 < 0$, so the audio is classified as spoof even though the raw posterior exceeds 0.5.\n", "\n", "Note:\n", "- Genuine/Bona fide/Real: Speech spoken by a human, or authentic audio naturally captured from real sources.\n", "- Spoof/Fake: Generated or modified audio (e.g., TTS, voice conversion, etc.).\n", "\n", "### Evaluation Metrics\n", "WeDefense follows the official [ASVspoof5](https://www.asvspoof.org/) evaluation metrics (see the appendix of [the evaluation plan](https://www.asvspoof.org/file/ASVspoof5___Evaluation_Plan_Phase2.pdf)). Lower is better for all of them.\n", "\n", "| Metric | Range | What it measures |\n", "|--------|-------|------------------|\n", "| **min-DCF** | $[0, \\infty)$ | Best achievable detection cost (optimal threshold) |\n", "| **EER** | $[0, 0.5]$ | Equal error rate, where the false negative rate equals the false positive rate (threshold-independent) |\n", "| **$C_\\text{llr}$** | $[0, \\infty)$ | Score calibration quality: 0 for perfect calibration, 1 for a non-informative system, and $> 1$ for worse than non-informative |\n", "| **act-DCF** | $[0, \\infty)$ | Detection cost at a fixed Bayesian threshold, sensitive to calibration (so act-DCF $\\ge$ min-DCF) |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step-by-Step Implementation in WeDefense\n", "\n", "The implementation stages in WeDefense are:\n", "\n", "- **Stage 1–2:** Data preparation and list generation.\n", "- **Stage 3:** Model training.\n", "- **Stage 4:** Model averaging and embedding extraction.\n", "- **Stage 5:** Logit extraction (and optional posterior output via softmax).\n", "- **Stage 6:** Score calibration (logits to LLR).\n", "- **Stage 7:** Performance evaluation.\n", "\n", "**Note:**\n", "1. 
WeDefense separates embedding extraction (Stage 4) from logit/posterior prediction (Stage 5) to keep the pipeline modular. This makes it easier to analyze or visualize embeddings, reuse the same embeddings with different back-end scoring methods, and debug each stage independently.\n", "2. For evaluation, we convert logits to LLR (Stage 6) to apply prior-aware calibration and obtain well-calibrated decision scores rather than using raw logits or posteriors directly.\n", "\n", "In the remainder of this notebook, we provide a step-by-step guide to running an anti-spoofing experiment using the WeDefense toolkit. We will follow the structure of the `run.sh` script for the `detection/partialspoof/v03_resnet18` recipe on the PartialSpoof dataset.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prerequisites\n", "\n", "1. **WeDefense Installation:** Ensure you have successfully installed the WeDefense toolkit and all its dependencies.\n", "2. **Dataset:** This tutorial assumes you have access to the PartialSpoof dataset. The script will attempt to download it automatically if it's not found.\n", "3. **Environment:** \n", "> [!IMPORTANT] \n", "> Make sure you are running this notebook from the `egs/detection/partialspoof/v03_resnet18/` directory and that the conda environment has been installed successfully.\n", "4. **Hardware:** A GPU is highly recommended for the training stage (Stage 3).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initial Configuration\n", "\n", "First, we set up all the necessary paths and parameters for our experiment. These are the same variables you would find at the top of the `run.sh` script.\n", "\n", "> [!IMPORTANT] \n", "> Please modify the `PS_dir` path to your PartialSpoof database directory."
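] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check before running any stage, the minimal sketch below (illustrative only, not part of the official recipe; the default path simply mirrors the configuration cell that follows) verifies that the PartialSpoof database directory exists:\n", "\n", "```python\n", "import os\n", "\n", "# Adjust to your PartialSpoof database path (same default as the next cell).\n", "PS_dir = '/data/neil/PartialSpoof/database'\n", "\n", "if os.path.isdir(PS_dir):\n", "    print(f'Found PartialSpoof database at {PS_dir}')\n", "else:\n", "    print(f'WARNING: {PS_dir} not found; Stage 1 will attempt to download the dataset.')\n", "```\n", "\n", "If the directory is missing, you can either let Stage 1 download the dataset or fix the path before proceeding.\n"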
] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "shellscript" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Notebook env configured:\n", " WEDEFENSE_ROOT = /home/neil/public_wedefense_Feb2026/wedefense\n", " PS_DIR = /data/neil/PartialSpoof/database\n", " DATA_DIR = data/partialspoof_tutorial\n", " EXP_DIR = exp/resnet_tutorial\n", " CONFIG = conf/resnet.yaml\n" ] } ], "source": [ "import os\n", "\n", "# --- Path and Data Configuration ---\n", "\n", "# TODO: IMPORTANT! Please modify this path to your PartialSpoof database directory.\n", "PS_dir = '/data/neil/PartialSpoof/database'\n", "# Directory to store prepared data files (wav.scp, utt2lab, etc.)\n", "data_dir = 'data/partialspoof_tutorial'\n", "# The format for the dataloader. 'shard' is recommended for large datasets\n", "# as it groups audio files into .tar files, improving I/O efficiency.\n", "# 'raw' loads individual files.\n", "data_type = 'shard'\n", "\n", "# --- Model and Experiment Configuration ---\n", "# The configuration file for the model architecture and training parameters.\n", "config = 'conf/resnet.yaml'\n", "# Directory to save model checkpoints, logs, and results.\n", "exp_dir = 'exp/resnet_tutorial'\n", "\n", "# --- Execution Configuration ---\n", "# Specify which GPUs to use, e.g., \"[0]\" or \"[0,1]\".\n", "gpus = \"[0]\"\n", "# Number of models to average for inference. >0 to use averaging, <=0 to use the single best model.\n", "num_avg = -1\n", "# Save a model checkpoint every N epochs.\n", "save_epoch_interval = 5\n", "# Patience for early stopping. 
<0 disables it.\n", "early_stop_patience = -1\n", "# How often to run validation (in epochs).\n", "validate_interval = 1\n", "\n", "# --- Notebook-wide environment (shared with future %%bash cells) ---\n", "# Assumes you run the notebook from egs/detection/partialspoof/v03_resnet18/\n", "wedefense_root = os.path.abspath(os.path.join(os.getcwd(), '../../../..'))\n", "os.environ['WEDEFENSE_ROOT'] = wedefense_root\n", "\n", "# Make wedefense importable for subprocesses started by %%bash\n", "old_pythonpath = os.environ.get('PYTHONPATH', '')\n", "if old_pythonpath:\n", " os.environ['PYTHONPATH'] = wedefense_root + os.pathsep + old_pythonpath\n", "else:\n", " os.environ['PYTHONPATH'] = wedefense_root\n", "\n", "# Export config so every %%bash cell can reuse it without re-defining\n", "os.environ['PS_DIR'] = PS_dir\n", "os.environ['DATA_DIR'] = data_dir\n", "os.environ['DATA_TYPE'] = data_type\n", "os.environ['CONFIG'] = config\n", "os.environ['EXP_DIR'] = exp_dir\n", "os.environ['GPUS'] = gpus\n", "os.environ['NUM_AVG'] = str(num_avg)\n", "os.environ['SAVE_EPOCH_INTERVAL'] = str(save_epoch_interval)\n", "os.environ['EARLY_STOP_PATIENCE'] = str(early_stop_patience)\n", "os.environ['VALIDATE_INTERVAL'] = str(validate_interval)\n", "\n", "# Training launcher defaults used later\n", "os.environ.setdefault('PORT', '29500')\n", "os.environ.setdefault('NUM_GPUS', '1')\n", "\n", "print('Notebook env configured:')\n", "print(' WEDEFENSE_ROOT =', os.environ['WEDEFENSE_ROOT'])\n", "print(' PS_DIR =', os.environ['PS_DIR'])\n", "print(' DATA_DIR =', os.environ['DATA_DIR'])\n", "print(' EXP_DIR =', os.environ['EXP_DIR'])\n", "print(' CONFIG =', os.environ['CONFIG'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stage 1: Data Preparation\n", "\n", "In this stage, we process the raw PartialSpoof dataset into a standard format required by the toolkit. The `local/prepare_data.sh` script will:\n", "1. Download the dataset if it's not found.\n", "2. 
Create `wav.scp`: Maps a unique utterance ID to its audio file path.\n", "3. Create `utt2lab`: Maps each utterance ID to its label (`bonafide` or `spoof`). (