Whisper Fine-tuning on Common Voice
This example demonstrates how to fine-tune Whisper models on the Common Voice dataset using PaddleSpeech.
Overview
Whisper is a state-of-the-art speech recognition model from OpenAI. This implementation allows you to fine-tune Whisper models on new datasets to improve performance for specific languages, domains, or accents.
Features
- Complete fine-tuning pipeline for Whisper models on custom datasets
- Flexible configuration via YAML files
- Support for all Whisper model sizes (tiny, base, small, medium, large, etc.)
- Data preparation tools for Common Voice and custom datasets
- Distributed training support with mixed precision
- Gradient accumulation for large batch training
- Learning rate scheduling and optimization techniques
- Evaluation tools with WER/CER metrics
- Command-line inference with both fine-tuned and original Whisper models
- Model export utilities for deployment
- Visualization tools for performance analysis
Installation
Ensure you have PaddleSpeech installed with all dependencies:
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech
pip install -e .
pip install datasets soundfile librosa matplotlib pandas jiwer
Data
We use the Common Voice dataset (version 11.0) available on Hugging Face: https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0
Other datasets compatible with the pipeline include LibriSpeech, AISHELL, and any dataset that can be converted to the manifest format (see below).
Unified Command-Line Interface
This example includes a unified CLI for all operations:
python whisper_cli.py COMMAND [OPTIONS]
Available commands:
- `prepare`: Prepare dataset for fine-tuning
- `train`: Fine-tune Whisper model
- `evaluate`: Evaluate model performance
- `infer`: Run inference with a fine-tuned or original model
Data Preparation
To download and prepare the Common Voice dataset:
python whisper_cli.py prepare --language en --output_dir ./data
Options:
- `--language`: Target language code (default: en)
- `--output_dir`: Directory to save preprocessed data (default: ./data)
- `--cache_dir`: Cache directory for HuggingFace datasets
- `--val_size`: Validation set size ratio (default: 0.03)
- `--test_size`: Test set size ratio (default: 0.03)
- `--min_duration`: Minimum audio duration in seconds (default: 0.5)
- `--max_duration`: Maximum audio duration in seconds (default: 30.0)
Manifest format:
{"audio": "path/to/audio.wav", "text": "transcription", "duration": 3.45}
Configuration
Fine-tuning parameters are specified in YAML config files. See conf/whisper_base.yaml
for a detailed example.
Key configuration sections:
- model: Model size, checkpoint path, freeze options
- data: Dataset paths, languages, tasks
- training: Batch size, learning rate, optimizer settings
- distributed: Distributed training options
- output: Save paths, logging options
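Because the config is plain YAML, variants for experiments can also be generated programmatically. A minimal sketch that reuses the `freeze_encoder` and `accum_grad` keys shown under Advanced Usage below; the loader code itself is illustrative and not part of the CLI, and the exact schema should be checked against conf/whisper_base.yaml:

```python
import yaml

# Sketch: load the base config, tweak two settings, and save a variant.
# Key nesting follows the snippets in the Advanced Usage section; check
# conf/whisper_base.yaml for the authoritative schema.
with open("conf/whisper_base.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

config["model"]["freeze_encoder"] = True   # fine-tune the decoder only
config["training"]["accum_grad"] = 8       # accumulate gradients over 8 batches

with open("conf/whisper_base_frozen.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```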
Training
To fine-tune the Whisper model:
python whisper_cli.py train --config conf/whisper_base.yaml --resource_path ./resources
For distributed training:
python -m paddle.distributed.launch --gpus "0,1,2,3" whisper_cli.py train --config conf/whisper_base.yaml --distributed True
Options:
- `--config`: Path to configuration YAML file
- `--resource_path`: Path to resources directory containing model assets
- `--device`: Device to use (cpu, gpu, xpu)
- `--seed`: Random seed
- `--checkpoint_path`: Path to resume training from a checkpoint
- `--distributed`: Enable distributed training
Evaluation
Evaluate model performance on a test set:
python whisper_cli.py evaluate --manifest ./data/test_manifest.json --checkpoint ./exp/whisper_fine_tune/epoch_10 --output_dir ./eval_results
Options:
- `--manifest`: Path to test manifest file
- `--checkpoint`: Path to model checkpoint
- `--model_size`: Model size if using original Whisper
- `--language`: Language code
- `--output_dir`: Directory to save evaluation results
- `--max_samples`: Maximum number of samples to evaluate
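The evaluation report is based on WER (word error rate) and CER (character error rate). If you want to sanity-check a single prediction outside the pipeline, jiwer from the dependency list computes both directly; a minimal sketch with made-up strings:

```python
import jiwer

# Word and character error rates for one reference/hypothesis pair,
# using jiwer (installed with the other dependencies above).
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

print("WER:", jiwer.wer(reference, hypothesis))  # word error rate
print("CER:", jiwer.cer(reference, hypothesis))  # character error rate
```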
Inference
For transcribing audio with a fine-tuned model:
python whisper_cli.py infer --audio_file path/to/audio.wav --checkpoint ./exp/whisper_fine_tune/final
For batch processing a directory:
python whisper_cli.py infer --audio_dir path/to/audio/folder --output_dir ./transcriptions --checkpoint ./exp/whisper_fine_tune/final
For inference with the original Whisper models:
python whisper_cli.py infer --audio_file path/to/audio.wav --use_original --model_size large-v3 --resource_path ./resources
Options:
- `--audio_file`: Path to a single audio file
- `--audio_dir`: Path to a directory with audio files
- `--checkpoint`: Path to fine-tuned checkpoint
- `--use_original`: Use original Whisper model
- `--model_size`: Model size (tiny, base, small, medium, large, etc.)
- `--language`: Language code (or "auto" for detection)
- `--task`: Task type (transcribe or translate)
- `--beam_size`: Beam size for beam search
- `--temperature`: Temperature for sampling
- `--without_timestamps`: Do not include timestamps in the output
Visualization
Visualize evaluation results:
python visualize.py --results_file ./eval_results/evaluation_results.json --output_dir ./visualizations
Options:
- `--results_file`: Path to evaluation results JSON file
- `--output_dir`: Directory to save visualizations
- `--audio_dir`: Directory with audio files (optional)
- `--num_samples`: Number of individual samples to visualize
- `--show`: Show plots interactively
Model Export
Export fine-tuned model to inference format:
python export_model.py --checkpoint ./exp/whisper_fine_tune/final --output_path ./exported_model --model_size base
Options:
- `--checkpoint`: Path to model checkpoint
- `--output_path`: Path to save exported model
- `--model_size`: Model size
Advanced Usage
Freezing Encoder
To freeze the encoder and only fine-tune the decoder, set the following in your config file:
model:
  freeze_encoder: true
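For reference, freezing in PaddlePaddle amounts to marking parameters so they receive no gradient updates. The toy module below illustrates the mechanism only; it is not this example's actual Whisper implementation:

```python
import paddle

# Toy illustration of parameter freezing in PaddlePaddle: parameters with
# stop_gradient = True are excluded from gradient updates.
class ToyModel(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.encoder = paddle.nn.Linear(8, 8)
        self.decoder = paddle.nn.Linear(8, 2)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ToyModel()
for param in model.encoder.parameters():
    param.stop_gradient = True   # freeze the encoder; only the decoder trains
```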
Gradient Accumulation
To reach a larger effective batch size on limited GPU memory, use gradient accumulation:
training:
  accum_grad: 8  # Accumulate gradients over 8 batches
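With accum_grad: 8 the optimizer steps once every 8 micro-batches, so the effective batch size is roughly batch_size × 8 (times the number of GPUs when training is distributed). A toy sketch of the pattern, not this example's actual training loop:

```python
import paddle

# Toy sketch of gradient accumulation (hypothetical model and data).
# Gradients accumulate across micro-batches; the optimizer only steps
# every `accum_grad` iterations.
model = paddle.nn.Linear(10, 1)
optimizer = paddle.optimizer.AdamW(learning_rate=1e-4, parameters=model.parameters())
accum_grad = 8

for step in range(32):
    x = paddle.randn([4, 10])             # micro-batch
    loss = model(x).mean() / accum_grad   # scale so accumulated grads average
    loss.backward()                       # grads keep accumulating
    if (step + 1) % accum_grad == 0:
        optimizer.step()
        optimizer.clear_grad()
```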
Mixed Precision
Enable mixed precision training for faster computation:
training:
  amp: true  # Enable automatic mixed precision
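In PaddlePaddle, AMP typically combines an auto_cast context for the forward and backward pass with a gradient scaler that protects float16 gradients from underflow. A toy sketch of that pattern, not this example's actual training loop (AMP only pays off on GPU):

```python
import paddle

# Toy sketch of automatic mixed precision in PaddlePaddle (hypothetical
# model and data): the forward pass runs under auto_cast, and GradScaler
# rescales the loss so float16 gradients do not underflow.
model = paddle.nn.Linear(10, 1)
optimizer = paddle.optimizer.AdamW(learning_rate=1e-4, parameters=model.parameters())
scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

for _ in range(4):
    x = paddle.randn([8, 10])
    with paddle.amp.auto_cast():
        loss = model(x).mean()
    scaler.scale(loss).backward()   # scale loss, then backprop
    scaler.step(optimizer)          # unscale gradients and apply the update
    scaler.update()
    optimizer.clear_grad()
```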
Custom Datasets
To use custom datasets, prepare manifest files in the following format:
{"audio": "/absolute/path/to/audio.wav", "text": "transcription text"}
Then specify the manifest paths in your config file:
data:
  train_manifest: path/to/train_manifest.json
  dev_manifest: path/to/dev_manifest.json
  test_manifest: path/to/test_manifest.json
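Before launching training on a custom dataset, it can save time to validate the manifests up front. A small sketch, assuming one JSON object per line with the fields shown above:

```python
import json
from pathlib import Path

# Sketch: sanity-check a custom manifest before training. Every line should
# be valid JSON with an existing audio file and non-empty transcription.
with open("path/to/train_manifest.json", encoding="utf-8") as f:
    for line_no, line in enumerate(f, 1):
        entry = json.loads(line)
        assert Path(entry["audio"]).is_file(), f"missing audio on line {line_no}"
        assert entry["text"].strip(), f"empty transcription on line {line_no}"
```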