English | [简体中文](README_ch.md)

# PaddleSpeech

------------------------------------------------------------------------------------

![License](https://img.shields.io/badge/license-Apache%202-red.svg) ![python version](https://img.shields.io/badge/python-3.7+-orange.svg) ![support os](https://img.shields.io/badge/os-linux-yellow.svg)

**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for two critical speech tasks: **Automatic Speech Recognition (ASR)** and **Text-To-Speech Synthesis (TTS)**, with modules built on state-of-the-art and influential models.

With an easy-to-use, efficient, flexible and scalable implementation, our vision is to empower both industrial applications and academic research, covering training, inference and testing, and deployment. The toolkit also features:

- **Fast and Light-weight**: we provide a high-speed and ultra-lightweight model that is convenient for industrial deployment.
- **Rule-based Chinese frontend**: our frontend contains Text Normalization (TN) and Grapheme-to-Phoneme (G2P, including Polyphone and Tone Sandhi). Moreover, we use self-defined linguistic rules to adapt to Chinese context.
- **Varieties of Functions that Vitalize Research**:
  - *Integration of mainstream models and datasets*: the toolkit implements modules that cover the whole pipeline of both ASR and TTS, and uses datasets like LibriSpeech, LJSpeech, AIShell, etc. See the [model lists](#models-list) for more details.
  - *Support of ASR streaming and non-streaming data*: the toolkit contains non-streaming/streaming models like [DeepSpeech2](http://proceedings.mlr.press/v48/amodei16.pdf), [Transformer](https://arxiv.org/abs/1706.03762), [Conformer](https://arxiv.org/abs/2005.08100) and [U2](https://arxiv.org/pdf/2012.05481.pdf).

Let's install PaddleSpeech with only a few lines of code!

> Note: The official name is still deepspeech. 2021/10/26

```shell
# 1. Install essential libraries and paddlepaddle first.
# install prerequisites
sudo apt-get install -y sox pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig python3-dev libsndfile1
# `pip install paddlepaddle-gpu` instead if you are using GPU.
pip install paddlepaddle

# 2. Then install PaddleSpeech.
git clone https://github.com/PaddlePaddle/DeepSpeech.git
cd DeepSpeech
pip install -e .
```

## Table of Contents

This README is organized as follows:

- [Alternative Installation](#alternative-installation)
- [Quick Start](#quick-start)
- [Models List](#models-list)
- [Tutorials](#tutorials)
- [FAQ and Contributing](#faq-and-contributing)
- [License](#license)
- [Acknowledgement](#acknowledgement)

## Alternative Installation

The base environment for this page is:

- Ubuntu 16.04
- python>=3.7
- paddlepaddle==2.1.2

If you want to set up PaddleSpeech in another environment, please see the [ASR installation](docs/source/asr/install.md) and [TTS installation](docs/source/tts/install.md) documents for all the alternatives.

## Quick Start

> Note: Replace `ckptfile` in the commands below with the real path to your checkpoint files or folders. Similarly, `exp/default` is the folder that contains the pretrained models.

Try a tiny DeepSpeech2 ASR model on a toy subset of LibriSpeech:

```shell
cd examples/tiny/s0/
# source the environment
source path.sh
# prepare librispeech dataset
bash local/data.sh
# evaluate your ckptfile model file
bash local/test.sh conf/deepspeech2.yaml ckptfile offline
```

For TTS, try FastSpeech2 on LJSpeech:

- Download LJSpeech-1.1 from the [ljspeech official website](https://keithito.com/LJ-Speech-Dataset/) and our prepared durations for fastspeech2, [ljspeech_alignment](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz).
- Assuming the dataset is at `~/datasets/LJSpeech-1.1` and the durations are at `./ljspeech_alignment`, preprocess your data and then use our pretrained model to synthesize:

```shell
bash ./local/preprocess.sh conf/default.yaml
bash ./local/synthesize_e2e.sh conf/default.yaml exp/default ckptfile
```

If you want to try more functions like training and tuning, please see [ASR getting started](docs/source/asr/getting_started.md) and [TTS Basic Use](/docs/source/tts/basic_usage.md).

## Models List

PaddleSpeech ASR supports many mainstream models, summarized below. For more information, please refer to [ASR Models](./docs/source/asr/released_model.md).

| ASR Module Type | Dataset | Model Type | Link |
|---|---|---|---|
| Acoustic Model | Aishell | 2 Conv + 5 LSTM layers with only forward direction | Ds2 Online Aishell Model |
| Acoustic Model | Aishell | 2 Conv + 3 bidirectional GRU layers | Ds2 Offline Aishell Model |
| Acoustic Model | Aishell | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention + CTC | Conformer Offline Aishell Model |
| Acoustic Model | Aishell | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention | Conformer Librispeech Model |
| Acoustic Model | Librispeech | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention | Conformer Librispeech Model |
| Acoustic Model | Librispeech | Encoder: Transformer, Decoder: Transformer, Decoding method: Attention | Transformer Librispeech Model |
| Language Model | CommonCrawl(en.00) | English Language Model | English Language Model |
| Language Model | Baidu Internal Corpus | Mandarin Language Model Small | Mandarin Language Model Small |
| Language Model | Baidu Internal Corpus | Mandarin Language Model Large | Mandarin Language Model Large |

| TTS Module Type | Model Type | Dataset | Link |
|---|---|---|---|
| Text Frontend | | | chinese-fronted |
| Acoustic Model | Tacotron2 | LJSpeech | tacotron2-vctk |
| Acoustic Model | TransformerTTS | LJSpeech | transformer-ljspeech |
| Acoustic Model | SpeedySpeech | CSMSC | speedyspeech-csmsc |
| Acoustic Model | FastSpeech2 | AISHELL-3 | fastspeech2-aishell3 |
| Acoustic Model | FastSpeech2 | VCTK | fastspeech2-vctk |
| Acoustic Model | FastSpeech2 | LJSpeech | fastspeech2-ljspeech |
| Acoustic Model | FastSpeech2 | CSMSC | fastspeech2-csmsc |
| Vocoder | WaveFlow | LJSpeech | waveflow-ljspeech |
| Vocoder | Parallel WaveGAN | LJSpeech | PWGAN-ljspeech |
| Vocoder | Parallel WaveGAN | VCTK | PWGAN-vctk |
| Vocoder | Parallel WaveGAN | CSMSC | PWGAN-csmsc |
| Voice Cloning | GE2E | AISHELL-3, etc. | ge2e |
| Voice Cloning | GE2E + Tacotron2 | AISHELL-3 | ge2e-tactron2-aishell3 |
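
Before trying any of the models listed above, it can help to confirm that the prerequisites from the installation steps are in place. The following is a minimal sketch, not part of PaddleSpeech: the helper names `missing_tools` and `python_version_ok` are hypothetical, and the tool list only covers the executables among the `apt-get` packages named in this README (`sox`, `pkg-config`, `swig`, `python3`).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with PaddleSpeech): print every tool
# name from the argument list that cannot be found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Hypothetical helper: succeed if a "major.minor[.patch]" version string
# is at least 3.7, the minimum Python version stated in this README.
python_version_ok() {
    local major minor
    major=${1%%.*}      # text before the first dot
    minor=${1#*.}       # text after the first dot
    minor=${minor%%.*}  # drop a trailing patch component, if any
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 7 ]; }
}

# Report which of the executables installed by the README's apt-get line
# are still missing on this machine.
missing=$(missing_tools sox pkg-config swig python3)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All required tools found."
fi
```

To check the interpreter as well, pass the version reported by your own Python to `python_version_ok`, for example `python_version_ok "$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"`.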