English | [简体中文](README_ch.md)

# PaddleSpeech

------------------------------------------------------------------------------------

![License](https://img.shields.io/badge/license-Apache%202-red.svg) ![python version](https://img.shields.io/badge/python-3.7+-orange.svg) ![support os](https://img.shields.io/badge/os-linux-yellow.svg)

**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech, with state-of-the-art and influential models.

With an easy-to-use, efficient, flexible and scalable implementation, our vision is to empower both industrial applications and academic research, covering training, inference and testing modules as well as the deployment process. More specifically, this toolkit features:

- **Fast and Light-weight**: we provide high-speed and ultra-lightweight models that are convenient for industrial deployment.
- **Rule-based Chinese frontend**: our frontend contains Text Normalization (TN) and Grapheme-to-Phoneme (G2P, including Polyphone and Tone Sandhi). Moreover, we use self-defined linguistic rules to adapt to the Chinese context.
- **Varieties of Functions that Vitalize both Industry and Academia**:
  - *Implementation of critical audio tasks*: this toolkit contains audio functions like Speech Translation (ST), Automatic Speech Recognition (ASR), Text-To-Speech Synthesis (TTS), Voice Cloning (VC), Punctuation Restoration, etc.
  - *Integration of mainstream models and datasets*: the toolkit implements modules that participate in the whole pipeline of the speech tasks, and uses mainstream datasets like LibriSpeech, LJSpeech, AIShell, CSMSC, etc. See also the [models list](#models-list) for more details.
  - *Cross-domain application*: as an extension of traditional audio tasks, we combine the aforementioned tasks with other fields like NLP.

Let's install PaddleSpeech with only a few lines of code!

> Note: The official name is still deepspeech. 2021/10/26

If you are using Ubuntu, PaddleSpeech can be set up with pip installation (with root privilege).

```shell
git clone https://github.com/PaddlePaddle/DeepSpeech.git
cd DeepSpeech
pip install -e .
```

## Table of Contents

The contents of this README are as follows:

- [Alternative Installation](#alternative-installation)
- [Quick Start](#quick-start)
- [Models List](#models-list)
- [Tutorials](#tutorials)
- [FAQ and Contributing](#faq-and-contributing)
- [License](#license)
- [Acknowledgement](#acknowledgement)

## Alternative Installation

The base environment for this page is:

- Ubuntu 16.04
- python>=3.7
- paddlepaddle==2.1.2

If you want to set up PaddleSpeech in another environment, please see the [ASR installation](docs/source/asr/install.md) and [TTS installation](docs/source/tts/install.md) documents for all the alternatives.

## Quick Start

> Note: the current links to `English ASR` and `English TTS` are not valid.

For a quick test of our functions, try [English ASR](link/hubdetail?name=deepspeech2_aishell&en_category=AutomaticSpeechRecognition) and [English TTS](link/hubdetail?name=fastspeech2_baker&en_category=TextToSpeech) by typing a message or uploading your own audio file.

Developers can try our models with only a few lines of code.
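Before running the examples, it can help to confirm that the local setup matches the base environment listed under Alternative Installation. The following is a minimal sketch, not part of PaddleSpeech: `version_ge` is a hypothetical helper, and the script assumes that `python3` is on the `PATH` and that GNU `sort` (with `-V`) is available, as on Ubuntu.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not shipped with PaddleSpeech):
# succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Check the Python interpreter against the python>=3.7 requirement.
py_version="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null)"
if version_ge "${py_version}" "3.7"; then
    echo "python ${py_version}: ok"
else
    echo "python ${py_version:-not found}: need >= 3.7"
fi

# Report the installed paddlepaddle version, if the package can be imported.
paddle_version="$(python3 -c 'import paddle; print(paddle.__version__)' 2>/dev/null || echo "not installed")"
echo "paddlepaddle: ${paddle_version} (base environment uses 2.1.2)"
```

The script only reports what it finds and does not install or change anything, so it is safe to run before following either installation document.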
A tiny **ASR** DeepSpeech2 model trained on a toy subset of LibriSpeech:

```bash
cd examples/tiny/s0/
# source the environment
source path.sh
# prepare librispeech dataset
bash local/data.sh
# evaluate your model checkpoint file `ckptfile`
bash local/test.sh conf/deepspeech2.yaml ckptfile offline
```

For **TTS**, try pretrained FastSpeech2 + Parallel WaveGAN on CSMSC:

```bash
cd examples/csmsc/tts3
# download the pretrained models and unzip them
wget https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip
unzip pwg_baker_ckpt_0.4.zip
wget https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_baker_ckpt_0.4.zip
unzip fastspeech2_nosil_baker_ckpt_0.4.zip
# source the environment
source path.sh
# run end-to-end synthesis
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \
  --fastspeech2-config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
  --fastspeech2-checkpoint=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
  --fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
  --pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \
  --pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
  --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
  --text=${BIN_DIR}/../sentences.txt \
  --output-dir=exp/default/test_e2e \
  --inference-dir=exp/default/inference \
  --device="gpu" \
  --phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
```

If you want to try more functions like training and tuning, please see [ASR getting started](docs/source/asr/getting_started.md) and [TTS Basic Use](/docs/source/tts/basic_usage.md).

## Models List

PaddleSpeech supports a series of popular models, summarized in [released models](./docs/source/released_model.md) along with the available pretrained models.

The ASR module contains an *Acoustic Model* and a *Language Model*, with the following details:

> Note: The `Link` should be a code path rather than a download link.

| ASR Module Type | Dataset | Model Type | Link |
|---|---|---|---|
| Acoustic Model | Aishell | 2 Conv + 5 LSTM layers with only forward direction | Ds2 Online Aishell Model |
| | | 2 Conv + 3 bidirectional GRU layers | Ds2 Offline Aishell Model |
| | | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention + CTC | Conformer Offline Aishell Model |
| | | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention | Conformer Librispeech Model |
| | Librispeech | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention | Conformer Librispeech Model |
| | | Encoder: Transformer, Decoder: Transformer, Decoding method: Attention | Transformer Librispeech Model |
| Language Model | CommonCrawl(en.00) | English Language Model | English Language Model |
| | Baidu Internal Corpus | Mandarin Language Model Small | Mandarin Language Model Small |
| | | Mandarin Language Model Large | Mandarin Language Model Large |

| TTS Module Type | Model Type | Dataset | Link |
|---|---|---|---|
| Text Frontend | | | chinese-fronted |
| Acoustic Model | Tacotron2 | LJSpeech | tacotron2-vctk |
| | TransformerTTS | | transformer-ljspeech |
| | SpeedySpeech | CSMSC | speedyspeech-csmsc |
| | FastSpeech2 | AISHELL-3 | fastspeech2-aishell3 |
| | | VCTK | fastspeech2-vctk |
| | | LJSpeech | fastspeech2-ljspeech |
| | | CSMSC | fastspeech2-csmsc |
| Vocoder | WaveFlow | LJSpeech | waveflow-ljspeech |
| | Parallel WaveGAN | LJSpeech | PWGAN-ljspeech |
| | | VCTK | PWGAN-vctk |
| | | CSMSC | PWGAN-csmsc |
| Voice Cloning | GE2E | AISHELL-3, etc. | ge2e |
| | GE2E + Tacotron2 | AISHELL-3 | ge2e-tactron2-aishell3 |