Whisper Fine-tuning on Common Voice

This example demonstrates how to fine-tune Whisper models on the Common Voice dataset using PaddleSpeech.

Overview

Whisper is a state-of-the-art speech recognition model from OpenAI. This implementation allows you to fine-tune Whisper models on new datasets to improve performance for specific languages, domains, or accents.

Features

  • Complete fine-tuning pipeline for Whisper models on custom datasets
  • Flexible configuration via YAML files
  • Support for all Whisper model sizes (tiny, base, small, medium, large, etc.)
  • Data preparation tools for Common Voice and custom datasets
  • Distributed training support with mixed precision
  • Gradient accumulation for large batch training
  • Learning rate scheduling and optimization techniques
  • Evaluation tools with WER/CER metrics
  • Command-line inference with both fine-tuned and original Whisper models
  • Model export utilities for deployment
  • Visualization tools for performance analysis

Installation

Ensure you have PaddleSpeech installed with all dependencies:

git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech
pip install -e .
pip install datasets soundfile librosa matplotlib pandas jiwer

Data

We use the Common Voice dataset (version 11.0) available on Hugging Face: https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0

Other datasets compatible with the pipeline include LibriSpeech, AISHELL, and any dataset that can be converted to the manifest format (see below).
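
To peek at the raw data before running the preparation step, the dataset can also be loaded directly with the datasets library. A minimal sketch, assuming a recent datasets release and a Hugging Face account that has accepted the dataset's terms:

from datasets import Audio, load_dataset

# Stream the English split so nothing is downloaded up front (the dataset is
# gated: accept its terms on Hugging Face and run `huggingface-cli login` first).
cv = load_dataset(
    "mozilla-foundation/common_voice_11_0",
    "en",
    split="train",
    streaming=True,
)
# Skip audio decoding so this quick peek does not need an MP3-capable backend.
cv = cv.cast_column("audio", Audio(decode=False))

for sample in cv.take(3):
    print(sample["sentence"])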

Unified Command-Line Interface

This example includes a unified CLI for all operations:

python whisper_cli.py COMMAND [OPTIONS]

Available commands:

  • prepare: Prepare dataset for fine-tuning
  • train: Fine-tune Whisper model
  • evaluate: Evaluate model performance
  • infer: Run inference with fine-tuned or original model
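
A typical end-to-end workflow chains these commands, using the options documented in the sections below (paths shown are the defaults used throughout this README):

python whisper_cli.py prepare --language en --output_dir ./data
python whisper_cli.py train --config conf/whisper_base.yaml --resource_path ./resources
python whisper_cli.py evaluate --manifest ./data/test_manifest.json --checkpoint ./exp/whisper_fine_tune/final --output_dir ./eval_results
python whisper_cli.py infer --audio_file path/to/audio.wav --checkpoint ./exp/whisper_fine_tune/final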

Data Preparation

To download and prepare the Common Voice dataset:

python whisper_cli.py prepare --language en --output_dir ./data

Options:

  • --language: Target language code (default: en)
  • --output_dir: Directory to save preprocessed data (default: ./data)
  • --cache_dir: Cache directory for HuggingFace datasets
  • --val_size: Validation set size ratio (default: 0.03)
  • --test_size: Test set size ratio (default: 0.03)
  • --min_duration: Minimum audio duration in seconds (default: 0.5)
  • --max_duration: Maximum audio duration in seconds (default: 30.0)

Manifest format:

{"audio": "path/to/audio.wav", "text": "transcription", "duration": 3.45}

Configuration

Fine-tuning parameters are specified in YAML config files. See conf/whisper_base.yaml for a detailed example.

Key configuration sections:

  • model: Model size, checkpoint path, freeze options
  • data: Dataset paths, languages, tasks
  • training: Batch size, learning rate, optimizer settings
  • distributed: Distributed training options
  • output: Save paths, logging options
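
A minimal config sketch combining these sections (field names are illustrative; conf/whisper_base.yaml is the authoritative reference):

model:
  size: base
  freeze_encoder: false

data:
  train_manifest: ./data/train_manifest.json
  dev_manifest: ./data/dev_manifest.json
  language: en
  task: transcribe

training:
  batch_size: 16
  learning_rate: 1.0e-5
  accum_grad: 1
  amp: false

output:
  save_dir: ./exp/whisper_fine_tune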

Training

To fine-tune the Whisper model:

python whisper_cli.py train --config conf/whisper_base.yaml --resource_path ./resources

For distributed training:

python -m paddle.distributed.launch --gpus "0,1,2,3" whisper_cli.py train --config conf/whisper_base.yaml --distributed True

Options:

  • --config: Path to configuration YAML file
  • --resource_path: Path to resources directory containing model assets
  • --device: Device to use (cpu, gpu, xpu)
  • --seed: Random seed
  • --checkpoint_path: Path to resume training from checkpoint
  • --distributed: Enable distributed training
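
To resume an interrupted run, point --checkpoint_path at a previously saved epoch directory (the path below follows the layout used in the Evaluation example; adjust it to your output settings):

python whisper_cli.py train --config conf/whisper_base.yaml --resource_path ./resources --checkpoint_path ./exp/whisper_fine_tune/epoch_10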

Evaluation

Evaluate model performance on a test set:

python whisper_cli.py evaluate --manifest ./data/test_manifest.json --checkpoint ./exp/whisper_fine_tune/epoch_10 --output_dir ./eval_results

Options:

  • --manifest: Path to test manifest file
  • --checkpoint: Path to model checkpoint
  • --model_size: Model size if using original Whisper
  • --language: Language code
  • --output_dir: Directory to save evaluation results
  • --max_samples: Maximum number of samples to evaluate
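
WER and CER are standard edit-distance metrics; a minimal sketch of how they are typically computed with jiwer (installed above), independent of the evaluation script itself:

import jiwer

references = ["the quick brown fox", "hello world"]
hypotheses = ["the quick brown fox", "hello word"]

# WER: word-level substitutions, insertions, and deletions divided by the
# number of reference words, accumulated over the whole set.
print("WER:", jiwer.wer(references, hypotheses))

# CER: the same edit distance computed at the character level.
print("CER:", jiwer.cer(references, hypotheses))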

Inference

For transcribing audio with a fine-tuned model:

python whisper_cli.py infer --audio_file path/to/audio.wav --checkpoint ./exp/whisper_fine_tune/final

For batch processing a directory:

python whisper_cli.py infer --audio_dir path/to/audio/folder --output_dir ./transcriptions --checkpoint ./exp/whisper_fine_tune/final

For inference with the original Whisper models:

python whisper_cli.py infer --audio_file path/to/audio.wav --use_original --model_size large-v3 --resource_path ./resources

Options:

  • --audio_file: Path to single audio file
  • --audio_dir: Path to directory with audio files
  • --checkpoint: Path to fine-tuned checkpoint
  • --use_original: Use original Whisper model
  • --model_size: Model size (tiny, base, small, medium, large, etc.)
  • --language: Language code (or "auto" for detection)
  • --task: Task type (transcribe or translate)
  • --beam_size: Beam size for beam search
  • --temperature: Temperature for sampling
  • --without_timestamps: Don't include timestamps
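
The decoding options can be combined freely; for example, to let the model detect the language and translate to English with a wider beam (flags as documented above, values illustrative):

python whisper_cli.py infer --audio_file path/to/audio.wav --checkpoint ./exp/whisper_fine_tune/final --language auto --task translate --beam_size 5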

Visualization

Visualize evaluation results:

python visualize.py --results_file ./eval_results/evaluation_results.json --output_dir ./visualizations

Options:

  • --results_file: Path to evaluation results JSON file
  • --output_dir: Directory to save visualizations
  • --audio_dir: Directory with audio files (optional)
  • --num_samples: Number of individual samples to visualize
  • --show: Show plots interactively

Model Export

Export fine-tuned model to inference format:

python export_model.py --checkpoint ./exp/whisper_fine_tune/final --output_path ./exported_model --model_size base

Options:

  • --checkpoint: Path to model checkpoint
  • --output_path: Path to save exported model
  • --model_size: Model size

Advanced Usage

Freezing Encoder

To freeze the encoder and only fine-tune the decoder, set the following in your config file:

model:
  freeze_encoder: true
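
In PaddlePaddle, freezing boils down to marking the encoder parameters as non-trainable; the config option does this for you. A self-contained sketch of the idea, using a toy stand-in for the model:

import paddle

class TinyWhisper(paddle.nn.Layer):
    # Stand-in: only the encoder/decoder split matters for this illustration.
    def __init__(self):
        super().__init__()
        self.encoder = paddle.nn.Linear(8, 8)
        self.decoder = paddle.nn.Linear(8, 8)

model = TinyWhisper()

# Freeze: exclude encoder parameters from gradient computation and updates.
for param in model.encoder.parameters():
    param.stop_gradient = True

print([p.name for p in model.parameters() if not p.stop_gradient])  # decoder params only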

Gradient Accumulation

To simulate a larger batch size when GPU memory is limited, use gradient accumulation:

training:
  accum_grad: 8  # Accumulate gradients over 8 batches
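
For example, with a per-device batch size of 4 and accum_grad: 8, gradients from 8 consecutive batches are summed before each optimizer step, giving an effective batch size of 32 per device.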

Mixed Precision

Enable mixed precision training for faster computation:

training:
  amp: true  # Enable automatic mixed precision
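
In PaddlePaddle this corresponds to the usual auto_cast / GradScaler pattern; the training script wires this up when amp is enabled. A minimal runnable sketch with a toy model:

import paddle

model = paddle.nn.Linear(16, 2)  # stand-in for the Whisper model
optimizer = paddle.optimizer.AdamW(learning_rate=1e-4, parameters=model.parameters())
scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

x = paddle.randn([4, 16])
label = paddle.randint(0, 2, [4])

with paddle.amp.auto_cast():  # run eligible ops in float16
    loss = paddle.nn.functional.cross_entropy(model(x), label)

scaled = scaler.scale(loss)         # scale the loss to avoid float16 underflow
scaled.backward()
scaler.minimize(optimizer, scaled)  # unscale gradients, step, update loss scaling
optimizer.clear_grad()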

Custom Datasets

To use custom datasets, prepare manifest files in the following format:

{"audio": "/absolute/path/to/audio.wav", "text": "transcription text"}

Then specify the manifest paths in your config file:

data:
  train_manifest: path/to/train_manifest.json
  dev_manifest: path/to/dev_manifest.json
  test_manifest: path/to/test_manifest.json

Reference