merge parakeet repo into deepspeech

pull/908/head
Hui Zhang 3 years ago
parent aaa87698c4
commit b079577e08

@ -0,0 +1,30 @@
# .readthedocs.yml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Build documentation in the docs/ directory with Sphinx
sphinx:
  configuration: docs/src/conf.py

# Build documentation with MkDocs
#mkdocs:
#  configuration: mkdocs.yml

# Optionally build your docs in additional formats such as PDF
formats: []

# Optionally set the version of Python and requirements required to build your docs
python:
  version: 3.7
  install:
    - method: pip
      path: .
      extra_requirements:
        - doc
    - requirements: docs/requirements.txt

@ -1,4 +1,4 @@
# PaddlePaddle Speech to Any toolkit
# PaddlePaddle Speech toolkit
![License](https://img.shields.io/badge/license-Apache%202-red.svg)
![python version](https://img.shields.io/badge/python-3.7+-orange.svg)

(15 binary image files added, ranging from 19 KiB to 361 KiB; contents not shown.)

@ -25,7 +25,7 @@ import sphinx_rtd_theme
# -- Project information -----------------------------------------------------
project = 'paddle deepspeech'
project = 'paddle speech'
copyright = '2021, Deepspeech-developers'
author = 'Deepspeech-developers'

@ -0,0 +1,129 @@
# Parakeet
Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It is built on PaddlePaddle's dynamic graph mode and includes many influential TTS models.
<div align="center">
<img src="docs/images/logo.png" width=300 /> <br>
</div>
## News <img src="./docs/images/news_icon.png" width="40"/>
- Oct-12-2021, Refactor examples code.
- Oct-12-2021, Parallel WaveGAN with LJSpeech. Check [examples/GANVocoder/parallelwave_gan/ljspeech](./examples/GANVocoder/parallelwave_gan/ljspeech).
- Oct-12-2021, FastSpeech2/FastPitch with LJSpeech. Check [examples/fastspeech2/ljspeech](./examples/fastspeech2/ljspeech).
- Sep-14-2021, Reconstruction of TransformerTTS. Check [examples/transformer_tts/ljspeech](./examples/transformer_tts/ljspeech).
- Aug-31-2021, Chinese Text Frontend. Check [examples/text_frontend](./examples/text_frontend).
- Aug-23-2021, FastSpeech2/FastPitch with AISHELL-3. Check [examples/fastspeech2/aishell3](./examples/fastspeech2/aishell3).
- Aug-03-2021, FastSpeech2/FastPitch with CSMSC. Check [examples/fastspeech2/baker](./examples/fastspeech2/baker).
- Jul-19-2021, SpeedySpeech with CSMSC. Check [examples/speedyspeech/baker](./examples/speedyspeech/baker).
- Jul-01-2021, Parallel WaveGAN with CSMSC. Check [examples/GANVocoder/parallelwave_gan/baker](./examples/GANVocoder/parallelwave_gan/baker).
- Jul-01-2021, Montreal-Forced-Aligner. Check [examples/use_mfa](./examples/use_mfa).
- May-07-2021, Voice Cloning in Chinese. Check [examples/tacotron2_aishell3](./examples/tacotron2_aishell3).
## Overview
To make it easy to use existing TTS models directly and to develop new ones, Parakeet selects typical models and provides their reference implementations in PaddlePaddle. Furthermore, Parakeet abstracts the TTS pipeline and standardizes the procedure of data preprocessing, sharing of common modules, model configuration, and the process of training and synthesis. The models supported here include a Text FrontEnd, end-to-end Acoustic Models and Vocoders:
- Text FrontEnd
- Rule based Chinese frontend.
- Acoustic Models
- [【FastSpeech2】FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558)
- [【SpeedySpeech】SpeedySpeech: Efficient Neural Speech Synthesis](https://arxiv.org/abs/2008.03802)
- [【Transformer TTS】Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895)
- [【Tacotron2】Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884)
- Vocoders
- [【Parallel WaveGAN】Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480)
- [【WaveFlow】WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
- Voice Cloning
- [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558v4.pdf)
- [【GE2E】Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/abs/1710.10467)
## Setup
Some dependencies of this repo are difficult to install on Windows, so we recommend that you **DO NOT** use Windows; please use `Linux` instead.
Make sure the library `libsndfile1` is installed, e.g., on Ubuntu:
```bash
sudo apt-get install libsndfile1
```
### Install PaddlePaddle
See [install](https://www.paddlepaddle.org.cn/install/quick) for more details. This repo requires PaddlePaddle **2.1.2** or above.
### Install Parakeet
```bash
git clone https://github.com/PaddlePaddle/Parakeet
cd Parakeet
pip install -e .
```
If some Python dependencies cannot be installed successfully, you can run the following command first
(replace `python3.6` with your own Python version):
```bash
sudo apt install -y python3.6-dev
```
See [install](https://paddle-parakeet.readthedocs.io/en/latest/install.html) for more details.
## Examples
Entry points to the introduction, training, and synthesis of the example models:
- [>>> Chinese Text Frontend](./examples/text_frontend)
- [>>> FastSpeech2/FastPitch](./examples/fastspeech2)
- [>>> Montreal-Forced-Aligner](./examples/use_mfa)
- [>>> Parallel WaveGAN](./examples/GANVocoder/parallelwave_gan)
- [>>> SpeedySpeech](./examples/speedyspeech)
- [>>> Tacotron2_AISHELL3](./examples/tacotron2_aishell3)
- [>>> GE2E](./examples/ge2e)
- [>>> WaveFlow](./examples/waveflow)
- [>>> TransformerTTS](./examples/transformer_tts)
- [>>> Tacotron2](./examples/tacotron2)
## Audio samples
### TTS models (Acoustic Model + Neural Vocoder)
Check our [website](https://paddleparakeet.readthedocs.io/en/latest/demo.html) for audio samples.
## Released Models
### Acoustic Model
#### FastSpeech2/FastPitch
1. [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_baker_ckpt_0.4.zip)
2. [fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_aishell3_ckpt_0.4.zip)
3. [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)
#### SpeedySpeech
1. [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/speedyspeech_nosil_baker_ckpt_0.5.zip)
#### TransformerTTS
1. [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/transformer_tts_ljspeech_ckpt_0.4.zip)
#### Tacotron2
1. [tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3.zip)
2. [tacotron2_ljspeech_ckpt_0.3_alternative.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3_alternative.zip)
### Vocoder
#### WaveFlow
1. [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_ljspeech_ckpt_0.3.zip)
#### Parallel WaveGAN
1. [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip)
2. [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_ljspeech_ckpt_0.5.zip)
### Voice Cloning
#### Tacotron2_AISHELL3
1. [tacotron2_aishell3_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_aishell3_ckpt_0.3.zip)
#### GE2E
1. [ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip)
## License
Parakeet is provided under the [Apache-2.0 license](LICENSE).

@ -0,0 +1,333 @@
# Advanced Usage
This section covers how to extend Parakeet by implementing your own models and experiments. It also elaborates guidelines for implementation.
For a typical deep learning experiment, there are several parts to deal with:
1. Preprocess the data according to the needs of the model, and iterate over the dataset in batches.
2. Define the model, the optimizer and other components.
3. Write out the training process (generally including forward/backward computation, parameter updates, logging, visualization, periodic evaluation, etc.).
4. Configure and run the experiment.
## Parakeet's Model Components
In order to balance reusability and functionality, we divide models into several types according to their characteristics.
Commonly used modules that can be used as parts of other, larger models are implemented to be as simple and universal as possible, because they will be reused. Modules with trainable parameters are generally implemented as subclasses of `paddle.nn.Layer`. Modules without trainable parameters can be implemented directly as functions whose inputs and outputs are `paddle.Tensor`s.
Models for a specific task are implemented as subclasses of `paddle.nn.Layer`. Models can be simple, like a single-layer RNN. For complicated models, it is recommended to split the model into different components.
For a seq-to-seq model, it's natural to split it into an encoder and a decoder. For a model composed of several similar layers, it's natural to extract the sublayer as a separate layer.
There are two common ways to define a model which consists of several modules.
1. Define a module given its specifications. Here is an example with a multilayer perceptron.
```python
class MLP(nn.Layer):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        return self.linear2(paddle.tanh(self.linear1(x)))

module = MLP(16, 32, 4)  # initialize a module
```
When the module is intended to be a generic, reusable layer that can be integrated into a larger model, we prefer to define it in this way.
For readability and usability, we strongly recommend **NOT** packing the specifications into a single object. Here's an example:
```python
class MLP(nn.Layer):
    def __init__(self, hparams):
        super().__init__()
        self.linear1 = nn.Linear(hparams.input_size, hparams.hidden_size)
        self.linear2 = nn.Linear(hparams.hidden_size, hparams.output_size)

    def forward(self, x):
        return self.linear2(paddle.tanh(self.linear1(x)))
```
For a module defined in this way, it's harder for the user to initialize an instance. Users have to read the code to check which attributes are used.
Also, code in this style tends to be abused by passing a huge config object to initialize every module used in an experiment, even though each module may not need the whole configuration.
We prefer to be explicit.
2. Define a module as a combination of its components. Here is an example of a sequence-to-sequence model.
```python
class Seq2Seq(nn.Layer):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, x):
        encoder_output = self.encoder(x)
        output = self.decoder(encoder_output)
        return output

encoder = Encoder(...)
decoder = Decoder(...)
# compose two components
model = Seq2Seq(encoder, decoder)
```
When a model is complicated and made up of several components, each of which has a separate functionality and can be replaced by other components with the same functionality, we prefer to define it in this way.
In the directory structure of Parakeet, modules with high reusability are placed in `parakeet.modules`, while models for specific tasks are placed in `parakeet.models`. When developing a new model, developers need to consider how feasible it is to split the model into modules and how general those modules are, and then place them in the appropriate directories.
## Parakeet's Data Components
Another critical component of a deep learning project is data.
Parakeet uses the following methods for training data:
1. Preprocess the data.
2. Load the preprocessed data for training.
Previously, we put the preprocessing in the `__getitem__` of the Dataset, so that samples were processed when a batch was accessed, but we encountered some problems:
1. Efficiency. Even though Paddle can load data asynchronously, when the batch size is large, every sample has to be preprocessed and batched on the fly, which takes a lot of time and may even seriously slow down the training process.
2. Data filtering. Some filtering conditions depend on features of the processed sample, for example, filtering out samples whose text is too short. If the text length is only known after `__getitem__`, the entire dataset has to be processed once for every filtering pass. In addition, if you do not pre-filter, a small exception (such as text that is too short) inside `__getitem__` breaks the entire data flow, because `collate_fn` presupposes that every sample can be fetched normally. Even if special flags, such as `None`, are used to mark failed samples and they are skipped in `collate_fn`, the batch size changes.
Therefore, it is not realistic to put the preprocessing entirely in `__getitem__`. We use the two-step method mentioned above instead.
During preprocessing, we can do the filtering. We can also save more intermediate features, such as text length and audio length, which can be used for subsequent filtering. Following the conventions of the TTS field, the data is stored in multiple files, and the processed results are stored in `npy` format.
Metadata is stored in a list-like way, with the file paths inside, so that we are not restricted by the specific storage location of the files. Besides file paths, other metadata can also be stored: the path of the text, the path of the audio, the path of the spectrum, the number of frames, the number of sampling points, and so on.
A path can be opened in multiple ways, e.g. `sf.read`, `np.load`, etc., so the reading method should be a parameter rather than something inferred from the file extension; it's best to let users supply it, so that they can define their own way to parse the data.
So we borrowed from the design of `DataFrame`, but our construction method is simpler: it only needs a `list of dicts`, where each dict represents a record, which makes it convenient to interact with formats such as `json` and `yaml`. For each selected field, we only need to give a parser (called a `converter` in the interface), and that's it.
Then we need to select a format for saving the metadata to disk. Storing a list of records as `json` requires enclosing square brackets, which is inconvenient for streaming reads and writes, so we use `jsonlines`. We don't use `yaml` because it occupies too many lines when storing a list of records.
Meanwhile, a `cache` is added, and a multi-process `Manager` is used to share memory between processes, so that when `num_workers` is used, each worker process does not keep its own copy of the cache. A small sketch of such a metadata file follows.
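To make the metadata format concrete, here is a minimal sketch using the `jsonlines` package; the field names and paths are only illustrative, not the exact schema produced by Parakeet's preprocessing scripts.
```python
import jsonlines

# during preprocessing, write one record (dict) per line
records = [
    {"utt_id": "LJ001-0001", "feats": "feats/LJ001-0001.npy", "num_frames": 743},
    {"utt_id": "LJ001-0002", "feats": "feats/LJ001-0002.npy", "num_frames": 521},
]
with jsonlines.open("metadata.jsonl", mode="w") as writer:
    for record in records:
        writer.write(record)

# reading it back yields the same list of dicts
with jsonlines.open("metadata.jsonl") as reader:
    metadata = list(reader)
```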
The implementation of `DataTable` can be found in `parakeet/datasets/data_table.py`.
```python
from typing import Any, Callable, Dict, List

from paddle.io import Dataset


class DataTable(Dataset):
    """Dataset to load and convert data for general purpose.

    Parameters
    ----------
    data : List[Dict[str, Any]]
        Metadata, a list of meta datum, each of which is composed of
        several fields
    fields : List[str], optional
        Fields to use, if not specified, all the fields in the data are
        used, by default None
    converters : Dict[str, Callable], optional
        Converters used to process each field, by default None
    use_cache : bool, optional
        Whether to use cache, by default False

    Raises
    ------
    ValueError
        If there is some field that does not exist in data.
    ValueError
        If there is some field in converters that does not exist in fields.
    """

    def __init__(self,
                 data: List[Dict[str, Any]],
                 fields: List[str]=None,
                 converters: Dict[str, Callable]=None,
                 use_cache: bool=False):
```
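Given the signature above, building a `DataTable` from such a metadata file might look like the following sketch (the field names and converters are illustrative):
```python
import numpy as np
import jsonlines
from parakeet.datasets.data_table import DataTable

# load the list of records produced during preprocessing
with jsonlines.open("metadata.jsonl") as reader:
    metadata = list(reader)

dataset = DataTable(
    data=metadata,
    fields=["feats", "num_frames"],   # only the fields we need
    converters={"feats": np.load},    # "feats" stores a path; parse it with np.load
    use_cache=True,
)

example = dataset[0]  # e.g. {"feats": np.ndarray, "num_frames": int}
```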
Its `__getitem__` method parses each field with its own converter and then composes a dictionary to return.
```python
def _convert(self, meta_datum: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a meta datum to an example by applying the corresponding
    converters to each field requested.

    Parameters
    ----------
    meta_datum : Dict[str, Any]
        Meta datum

    Returns
    -------
    Dict[str, Any]
        Converted example
    """
    example = {}
    for field in self.fields:
        converter = self.converters.get(field, None)
        meta_datum_field = meta_datum[field]
        if converter is not None:
            converted_field = converter(meta_datum_field)
        else:
            converted_field = meta_datum_field
        example[field] = converted_field
    return example
```
## Parakeet's Training Components
A typical training process includes the following steps:
1. Iterate over the dataset.
2. Process batch data.
3. Neural network forward/backward calculation.
4. Parameter update.
5. Evaluate the model on the validation dataset when certain conditions are met.
6. Write logs, visualize, and in some cases save necessary intermediate results.
7. Save the state of the model and optimizer.
Here, we mainly introduce the training-related components of Parakeet and why we designed them this way.
### Global Reporter
When training and modifying deep learning models, logging is often needed; it has even become key to debugging and improving a model. We usually use various visualization tools, such as `visualdl` in `paddle`, `tensorboard` in `tensorflow`, `visdom`, `wandb`, etc. Besides, `logging` and `print` are usually used for different purposes.
Among these tools, `print` is the simplest: it has no concept of a `logger` and `handler` as in `logging`, or a `summary writer` and `logdir` as in `tensorboard`, and it needs no `global_step` when printing. It is light enough to appear anywhere in the code, and it prints to a common stdout. Of course, its customizability is limited; for example, it is no longer intuitive when printing dictionaries or more complex objects, and its output is fleeting, so people need to use redirection to save the information.
For TTS model development, we hope to have a more universal multimedia stdout, which is in effect a tool similar to `tensorboard` that supports many multimedia forms, but such a tool needs a `summary writer` when used, and a `step` when writing information. If the data are images or audio, some format-control parameters are needed as well.
This destroys the modular design to a certain extent. For example, if my model is composed of multiple sublayers and I want to record some important information in the forward method of some sublayers, I may need to pass the `summary writer` to those sublayers. But a sublayer's job is computation; it should not carry extra responsibilities, and it is hard to accept that the initialization of an `nn.Linear` would take an optional `visualizer` argument. Moreover, **HOW** would a computation module know the global step? These are things that belong to the training process!
Therefore, a more common approach is not to put the logging code in the definition of the layer, but to return the values, obtain them during training, and write them to the `summary writer`. However, the return values then need to be modified: the `summary writer` acts as a broadcaster at the training level, and each module transmits information to it by changing its return values.
We think this method is a little ugly. We prefer to return only the necessary information rather than change the return values to accommodate visualization and recording. When you need to report some information, you should be able to report it without difficulty. So we imitate the design of `chainer` and use a `global reporter`.
It takes advantage of the global nature of Python's module-level variables and the behavior of context managers.
There is a module-level variable `OBSERVATIONS` in `parakeet/training/reporter.py`, which is a `Dict` that stores key-value pairs.
```python
# parakeet/training/reporter.py

@contextlib.contextmanager
def scope(observations):
    # make `observations` the target to report to.
    # it is basically a dictionary that stores temporary observations
    global OBSERVATIONS
    old = OBSERVATIONS
    OBSERVATIONS = observations

    try:
        yield
    finally:
        OBSERVATIONS = old
```
The context manager `scope` above switches the dictionary bound to the name `OBSERVATIONS`. A `getter` function is then defined to fetch the dictionary currently bound to `OBSERVATIONS`.
```python
def get_observations():
    global OBSERVATIONS
    return OBSERVATIONS
```
Then we define a function to get the current `OBSERVATIONS` and write a key-value pair into it.
```python
def report(name, value):
    # a simple function to report a named value
    # you can use it everywhere; it will get the default target and write to it
    # you can think of it as std.out
    observations = get_observations()
    if observations is None:
        return
    else:
        observations[name] = value
```
The following test code shows how it is used:
- use `first` as the current observation and write `first_begin=1`,
- then open the second observation and write `second_begin=2`,
- then open the third observation and write `third_begin=3`,
- exit the third observation; we are back in the second observation automatically,
- write some content in the second observation, then exit it, and we are back in the first observation automatically.
```python
def test_reporter_scope():
    first = {}
    second = {}
    third = {}

    with scope(first):
        report("first_begin", 1)
        with scope(second):
            report("second_begin", 2)
            with scope(third):
                report("third_begin", 3)
                report("third_end", 4)
            report("second_end", 5)
        report("first_end", 6)

    assert first == {'first_begin': 1, 'first_end': 6}
    assert second == {'second_begin': 2, 'second_end': 5}
    assert third == {'third_begin': 3, 'third_end': 4}
```
In this way, when we write modular components, we can call `report` directly. The caller decides where the values are reported: it prepares an observation dictionary, opens a `scope` with it, and calls the component within that `scope`.
The `Trainer` in Parakeet reports information in this way.
```python
while True:
    self.observation = {}
    # set observation as the report target
    # you can use report freely in Updater.update()

    # updating parameters and state
    with scope(self.observation):
        update()  # training for a step is defined here
```
### Updater: Model Training Process
In order to keep functions pure and code reusable, we abstract the model code into a subclass of `paddle.nn.Layer` and write the core computation in it.
We tend to write the forward process of training in `forward()`, but only up to the prediction result, not the loss, so that the module can be called by a larger module.
However, when we compose an experiment, we need to add some other things, such as the training process, evaluation process, checkpoint saving, visualization and the like. In this process, we encounter things that only exist during training, such as the `optimizer`, `learning rate scheduler`, `visualizer`, etc. These things are not part of the model and should **NOT** be written in the model code.
We made an abstraction for these intermediate processes: the `Updater`, which takes the `model`, `optimizer` and `data stream` as input, and whose function is training. Since training methods may differ between models, we tend to write a corresponding `Updater` for each model. It is still different from the final training script; there is a certain degree of encapsulation, which extracts the details of periodic saving, visualization, evaluation, etc., and retains only the most basic function: training the model. A schematic sketch follows below.
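The snippet below is only a schematic sketch of what such an updater could look like; it does not reproduce Parakeet's actual `Updater` base classes or their exact interface, and the loss computation is a placeholder.
```python
import paddle
from parakeet.training.reporter import report


class MyModelUpdater:
    """A schematic updater: it owns the model, optimizer and data stream,
    and knows how to train the model for one step."""

    def __init__(self, model, optimizer, dataloader):
        self.model = model
        self.optimizer = optimizer
        self.train_iterator = iter(dataloader)

    def update(self):
        batch = next(self.train_iterator)
        losses = self.update_core(batch)
        # report to whatever observation dict is active in the current scope
        for name, value in losses.items():
            report(name, float(value))

    def update_core(self, batch):
        # forward/backward computation and parameter update for one batch
        outputs = self.model(batch["inputs"])
        loss = paddle.nn.functional.mse_loss(outputs, batch["targets"])
        loss.backward()
        self.optimizer.step()
        self.optimizer.clear_grad()
        return {"train/loss": loss}
```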
### Visualizer
Because we choose observations as the communication mechanism, we can simply write the values collected in the observation dictionary to the `visualizer`, as in the sketch below.
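For instance, assuming VisualDL (`visualdl`) as the visualizer, dumping the scalar values gathered in an observation dictionary could look like this sketch:
```python
from visualdl import LogWriter

# a sketch: after a training step, write the scalar values gathered in
# `observation` (see the Trainer loop above) into a VisualDL log
writer = LogWriter(logdir="exp/default")

def write_observation(observation, step):
    for name, value in observation.items():
        writer.add_scalar(tag=name, value=value, step=step)
```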
## Parakeet's Configuration Components
Deep learning experiments often have many options to configure. These configurations can be roughly divided into several categories.
1. Data source and data processing mode configuration.
2. Save path configuration of experimental results.
3. Data preprocessing mode configuration.
4. Model structure and hyperparameter configuration.
5. Training process configuration.
It's common to change the running configuration to compare results. To keep track of the running configuration, we use `yaml` configuration files.
We also want to interact with command-line options. Options that usually change with the running environment are provided via command-line arguments. In addition, we want to be able to override an option in the config file without editing it.
Taking these requirements into consideration, we use [yacs](https://github.com/rbgirshick/yacs) as the config management tool. Other tools like [omegaconf](https://github.com/omry/omegaconf) are also powerful and have similar functions.
In each example provided, there is a `config.py`, and the default config is defined in `conf/default.yaml`. If you want the default config, import `config.py` and call `get_cfg_defaults()`. It can then be updated with a `yaml` config file or command-line arguments if needed.
For details about how to use yacs in experiments, see [yacs](https://github.com/rbgirshick/yacs).
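A `config.py` following this pattern might look like the sketch below; the option names and default values are purely illustrative.
```python
# config.py -- a minimal sketch of the pattern described above
from yacs.config import CfgNode as CN

_C = CN()
_C.data = CN(dict(batch_size=32, num_workers=2))
_C.model = CN(dict(hidden_size=256, num_layers=4))
_C.training = CN(dict(lr=1e-3, max_iteration=100000))


def get_cfg_defaults():
    """Get a yacs CfgNode object with default values."""
    return _C.clone()
```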
The following is the basic `ArgumentParser` (a sketch of a training entry point follows the list):
1. `--config` supports configuration file parsing; the configuration file itself handles the options unique to each experiment.
2. `--train-metadata` is the path to the training data.
3. `--output-dir` is the directory to save the training results. (If there are checkpoints in `checkpoints/` of `--output-dir`, the newest checkpoint is reloaded by default to resume training.)
4. `--device` and `--nprocs` determine the operation mode. `--device` specifies the type of running device: whether to run on `cpu` or `gpu`. `--nprocs` is the number of training processes; if `nprocs` > 1, multi-process parallel training is used. (Note: currently only multi-GPU, multi-process training is supported.)
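A sketch of a training entry point combining this `ArgumentParser` with the yacs config (the `get_cfg_defaults` import assumes the hypothetical `config.py` sketched earlier):
```python
import argparse
from config import get_cfg_defaults  # hypothetical config.py as sketched above

parser = argparse.ArgumentParser(description="Train a model.")
parser.add_argument("--config", type=str, help="yaml config file overriding the defaults")
parser.add_argument("--train-metadata", type=str, help="path to the training metadata")
parser.add_argument("--output-dir", type=str, help="directory to save the results")
parser.add_argument("--device", type=str, default="gpu", help="'cpu' or 'gpu'")
parser.add_argument("--nprocs", type=int, default=1, help="number of training processes")
args = parser.parse_args()

config = get_cfg_defaults()
if args.config:
    config.merge_from_file(args.config)  # yaml file overrides the defaults
config.freeze()
```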
Developers can refer to the examples in `Parakeet/examples` to write the default configuration file when adding new experiments.
## Parakeet's Experiment template
The experimental codes in Parakeet are generally organized as follows:
```text
├── conf
│   └── default.yaml (default config)
├── README.md (help information)
├── batch_fn.py (organize metadata into batches)
├── config.py (code to read the default config)
├── *_updater.py (Updater of a specific model)
├── preprocess.py (data preprocessing code)
├── preprocess.sh (script to call preprocess.py)
├── synthesis.py (synthesis from metadata)
├── synthesis.sh (script to call synthesis.py)
├── synthesis_e2e.py (synthesis from raw text)
├── synthesis_e2e.sh (script to call synthesis_e2e.py)
├── train.py (train code)
└── run.sh (script to call train.py)
```
We add a named argument `--output-dir` to each training script to specify the output directory. The directory structure is as follows; it's best for developers to follow this specification:
```text
exp/default/
├── checkpoints/
│ ├── records.jsonl (record file)
│ └── snapshot_iter_*.pdz (checkpoint files)
├── config.yaml (config file of this experiment)
├── vdlrecords.*.log (visualdl record file)
├── worker_*.log (text logging, one file per process)
├── validation/ (output dir during training, information_iter_*/ is the output of each step, if necessary)
├── inference/ (output dir of exported static graph model, which is only used in the final stage of training, if implemented)
└── test/ (output dir of synthesis results)
```
You can view the examples we provide in `Parakeet/examples`. These experiments are provided to users as examples which can be run directly. Users are welcome to add new models and experiments and contribute code to Parakeet.

@ -0,0 +1,115 @@
# Basic Usage
This section shows how to use the pretrained models provided by Parakeet and run inference with them.
Pretrained models in v0.4 are provided as archives. Extract one to get a folder like this:
```
checkpoint_name/
├──default.yaml
├──snapshot_iter_76000.pdz
├──speech_stats.npy
└──phone_id_map.txt
```
`default.yaml` stores the config used to train the model.
`snapshot_iter_N.pdz` is the checkpoint file, where `N` is the number of steps the model has been trained for.
`*_stats.npy` is the statistics file of a feature, if that feature was normalized before training.
`phone_id_map.txt` is the map from phonemes to phoneme ids.
The example code below shows how to use the models for prediction.
## Acoustic Models (text to spectrogram)
The code below shows how to use a `FastSpeech2` model. After loading the pretrained model, use it and the normalizer object to construct a prediction object, then call `fastspeech2_inference(phone_ids)` to generate spectrograms, which can be further used to synthesize raw audio with a vocoder.
```python
from pathlib import Path
import numpy as np
import paddle
import yaml
from yacs.config import CfgNode
from parakeet.models.fastspeech2 import FastSpeech2
from parakeet.models.fastspeech2 import FastSpeech2Inference
from parakeet.modules.normalizer import ZScore
# Parakeet/examples/fastspeech2/baker/frontend.py
from frontend import Frontend
# load the pretrained model
checkpoint_dir = Path("fastspeech2_nosil_baker_ckpt_0.4")
with open(checkpoint_dir / "phone_id_map.txt", "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
with open(checkpoint_dir / "default.yaml") as f:
fastspeech2_config = CfgNode(yaml.safe_load(f))
odim = fastspeech2_config.n_mels
model = FastSpeech2(
idim=vocab_size, odim=odim, **fastspeech2_config["model"])
model.set_state_dict(
paddle.load(args.fastspeech2_checkpoint)["main_params"])
model.eval()
# load stats file
stat = np.load(checkpoint_dir / "speech_stats.npy")
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
fastspeech2_normalizer = ZScore(mu, std)
# construct a prediction object
fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
# load Chinese Frontend
frontend = Frontend(checkpoint_dir / "phone_id_map.txt")
# text to spectrogram
sentence = "你好吗?"
input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
phone_ids = input_ids["phone_ids"]
flags = 0
# The output of the Chinese text frontend is segmented
for part_phone_ids in phone_ids:
    with paddle.no_grad():
        temp_mel = fastspeech2_inference(part_phone_ids)
        if flags == 0:
            mel = temp_mel
            flags = 1
        else:
            mel = paddle.concat([mel, temp_mel])
```
## Vocoder (spectrogram to wave)
The code below shows how to use a `Parallel WaveGAN` model. As in the example above, after loading the pretrained model, use it and the normalizer object to construct a prediction object, then call `pwg_inference(mel)` to generate raw audio (in wav format).
```python
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from parakeet.models.parallel_wavegan import PWGGenerator
from parakeet.models.parallel_wavegan import PWGInference
from parakeet.modules.normalizer import ZScore
# load the pretrained model
checkpoint_dir = Path("parallel_wavegan_baker_ckpt_0.4")
with open(checkpoint_dir / "pwg_default.yaml") as f:
pwg_config = CfgNode(yaml.safe_load(f))
vocoder = PWGGenerator(**pwg_config["generator_params"])
vocoder.set_state_dict(paddle.load(args.pwg_params))
vocoder.remove_weight_norm()
vocoder.eval()
# load stats file
stat = np.load(checkpoint_dir / "pwg_stats.npy")
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)
# construct a prediction object
pwg_inference = PWGInference(pwg_normalizer, vocoder)
# spectrogram to wave
wav = pwg_inference(mel)
# `audio_path` is the output wav file path; `fastspeech2_config` comes from the acoustic model config loaded earlier
sf.write(
    audio_path,
    wav.numpy(),
    samplerate=fastspeech2_config.fs)
```
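Putting the two snippets together, an end-to-end text-to-wav call looks roughly like the sketch below; it reuses the `frontend`, `fastspeech2_inference`, `pwg_inference` and `fastspeech2_config` objects constructed above, and the output path is illustrative.
```python
# end-to-end: text -> phone ids -> mel -> wav, using the objects built above
import paddle
import soundfile as sf

sentence = "你好吗?"
phone_ids = frontend.get_input_ids(sentence, merge_sentences=True)["phone_ids"]

mels = []
with paddle.no_grad():
    for part_phone_ids in phone_ids:
        mels.append(fastspeech2_inference(part_phone_ids))
    mel = paddle.concat(mels)
    wav = pwg_inference(mel)

sf.write("output.wav", wav.numpy(), samplerate=fastspeech2_config.fs)
```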

@ -0,0 +1,108 @@
# Chinese Rule Based Text Frontend
A TTS system mainly includes three modules: `Text Frontend`, `Acoustic Model` and `Vocoder`. We provide a complete Chinese text frontend module in Parakeet; see the example in `Parakeet/examples/text_frontend/`.
A text frontend module mainly includes:
- Text Segmentation
- Text Normalization (TN)
- Word Segmentation (mainly in Chinese)
- Part-of-Speech
- Prosody
- G2P (Grapheme-to-Phoneme, including Polyphone and Tone Sandhi, etc.)
- Linguistic Features/Characters/Phonemes
```text
• text: 90 后为中华人民共和国成立 70 周年准备了大礼
• Text Normalization: 九零后为中华人民共和国成立七十周年准备了大礼
• Word Segmentation: 九零后/为/中华人民/共和国/成立/七十/周年/准备/了/大礼
• G2P:
jiu3 ling2 hou4 wei4 zhong1 hua2 ren2 min2 gong4 he2 guo2 ...
• Prosody (prosodic words #1, prosodic phrases #2, intonation phrases #3, sentence #4):
九零后#1为中华人民#1共和国#2成立七十周年#3准备了大礼#4
```
Among them, Text Normalization and G2P are the most important modules. We mainly introduce them here.
## Text Normalization
### Supported NSW (Non-Standard-Word) Normalization
|NSW type|raw|normalized|
|:--|:-|:-|
|serial number|电影中梁朝伟扮演的陈永仁的编号27149|电影中梁朝伟扮演的陈永仁的编号二七一四九|
|cardinal|这块黄金重达324.75克<br>我们班的最高总分为583分|这块黄金重达三百二十四点七五克<br>我们班的最高总分为五百八十三分|
|numeric range |12\~23<br>-1.5\~2|十二到二十三<br>负一点五到二|
|date|她出生于86年8月18日,她弟弟出生于1995年3月1日|她出生于八六年八月十八日, 她弟弟出生于一九九五年三月一日|
|time|等会请在12:05请通知我|等会请在十二点零五分请通知我|
|temperature|今天的最低气温达到-10°C|今天的最低气温达到零下十度|
|fraction|现场有7/12的观众投出了赞成票|现场有十二分之七的观众投出了赞成票|
|percentage|明天有62%的概率降雨|明天有百分之六十二的概率降雨|
|money|随便来几个价格12块5,34.5元,20.1万|随便来几个价格十二块五,三十四点五元,二十点一万|
|telephone|这是固话0421-33441122<br>这是手机+86 18544139121|这是固话零四二一三三四四一一二二<br>这是手机八六一八五四四一三九一二一|
## Grapheme-to-Phoneme
In Chinese, G2P is a very complex module, which mainly includes **polyphone** and **tone sandhi**.
We use [g2pM](https://github.com/kakaobrain/g2pM) and [pypinyin](https://github.com/mozillazg/python-pinyin) as the default g2p tools. They can solve the problem of polyphones to a certain extent. In the future, we intend to use a trainable language model (for example, [BERT](https://arxiv.org/abs/1810.04805)) for polyphone disambiguation.
However, g2pM and pypinyin do not perform well on tone sandhi, so we use rules to solve this problem, which requires relevant linguistic knowledge. A quick example of the default g2p output is shown below.
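For reference, the sketch below calls `pypinyin` directly to show what the default g2p output looks like; note that no rule-based tone sandhi is applied here.
```python
from pypinyin import lazy_pinyin, Style

sentence = "你好吗"
# pinyin with numeric tone marks appended to each syllable
print(lazy_pinyin(sentence, style=Style.TONE3))
```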
The **tone sandhi** in Chinese mainly include:
- soft tone sandhi (轻声变调)
- "一" "不" tone sandhi ("一" "不" 变调)
- three tone sandhi (三声变调)
For ease of understanding, we list the Chinese tone sandhi rules here.
### 1. Soft tone sandhi (轻声变调)
| | cases |
|:--|:-|
| modal particles such as 吧, 呢, 啊 | 吃吧、走吗、去呢、跑啊 |
| structural particles 的, 地, 得 | 我的书、慢慢地走、跑得很快等 |
| some neutral-tone syllables contrast with full-tone syllables to distinguish meaning | 买卖:一指生意;二指买和卖。 <br/> 地道:一指纯粹、真正;二指地下通道。<br> 大意:一指没有注意;二指主要的意思。 <br/> 东西:一指各种事物;二指东面与西面。<br> 言语:一指所说的话;二指开口,招呼。<br/>运气:一指一种锻炼的方法。二指幸运。<br> |
| noun suffixes 们, 子, 头 | 你们、房子、石头 |
| the second syllable of a reduplicated noun or verb | 奶奶、姐姐、爸爸、试试、看看、说说、问问 |
| locative words 上, 下, 里 after a noun | 桌上、地下、院里 |
| aspect particles 了, 着, 过 | 走了、看着、去过 |
| personal pronouns 我, 你, 他 used as objects | 找我、请你、麻烦他 |
| conventional usage | 匀称、盘算、枇杷、篱笆、活泼、玄乎。狐狸、学生、拾掇、麻烦、蛤蟆、石榴。玫瑰、凉快、萝卜、朋友、奴才、云彩。脑袋、老爷、老婆、嘴巴、指头、指甲。委屈、喇叭、讲究、打发、打听、喜欢。点心、伙计、打扮、哑巴、女婿、首饰。自在、吓唬、力气、漂亮、队伍、地方。痛快、念叨、笑语、丈夫、志气、钥匙。月亮、正经、位置、秀气、上司、悟性。告示、动静、热闹、屁股、阔气、意思。等 |
### 2. "一" "不" 变调
#### "一" 变调
| | 是否变调 | cases|
|:--|:-|:-|
| pronounced alone | no | 第一、一楼 |
| as an ordinal number | no | |
| at the end of a sentence | no | |
| becomes rising tone (2nd tone) before a falling tone (4th tone) | yes | 一栋yí dòng、一段yí duàn、一律yí lǜ、一路yí lù |
| becomes falling tone (4th tone) before a non-falling tone | yes | 阴平(一声)<br>一发yì fā 、一端yì duān、一天yì tiān、一忽yì hū<br>阳平(二声)<br>一叠yì dié 、一同yì tóng 、一头yì tóu 、一条yì tiáo<br>上声(三声)<br>一统yì tǒng、一体yì tǐ、一览yì lǎn、一口yì kǒu |
| read in the neutral tone when "一" is embedded between a reduplicated verb | yes | 听一听 tīng yi tīng |
#### "不" 变调
| | 是否变调 | cases|
|:--|:-|:-|
| pronounced alone | no | |
| at the end of a sentence | no | 我不 |
| becomes rising tone (2nd tone) before a falling tone (4th tone) | yes | 不怕bú pà、不妙bú miào、不犯bú fàn、不忿bú fèn |
| read in the neutral tone when "不" is between reduplicated verbs or adjectives, or between a verb and its complement | yes | 懂不懂 dǒng bu dǒng 、看不清 kàn bu qīng |
### 3. Three tone sandhi (三声变调)
| | subcategory | how the tone changes | cases |
|:--|:-|:-|:-|
| pronounced alone | | no change | |
| at the end of a sentence | | no change | |
| at a pause within a sentence, unaffected by the following syllable | | no change | |
| 3rd tone + 3rd tone | | 2nd tone + 3rd tone | 保险、保养、党委、尽管、老板、本领、引导、古老、敏感、鼓舞、永远、语法、口语、岛屿、保姆、远景、北海、首长、母语 |
| three 3rd tones in a row | disyllable + monosyllable ("双单格" structure) | the first two become 2nd tone | 演讲稿、跑马场、展览馆、管理组、水彩笔、蒙古语、选取法、古典舞、虎骨酒、洗脸水、草稿纸 |
| | monosyllable + disyllable ("单双格" structure) | the second one becomes 2nd tone | 史小姐、党小组、好小伙、跑百米、纸老虎、李厂长、老保姆、冷处理、很友好、小雨伞 |
| | monosyllable + monosyllable + monosyllable ("单三格" structure) | the first two become 2nd tone | 软懒散、稳准狠 |
| when more 3rd-tone syllables are chained | | group the syllables into two- and three-character chunks according to meaning, then apply the rules above | 岂有 / 此理。<br>请你 / 给我 / 打点儿 / 洗脸水。<br>手表厂 / 有五种 /好产品。|
## References
- [chinese_text_normalization](https://github.com/speechio/chinese_text_normalization)
- [声调篇|这些“一、不”变调规律,你不得不知](https://zhuanlan.zhihu.com/p/36156170)
- [TTS前端模块中的普通话变调规则](https://zhuanlan.zhihu.com/p/65091429)
- [轻声和变调](https://wenku.baidu.com/view/ad2016d94693daef5ef73db1.html)
- [必读轻声词语表546条](http://www.chaziwang.com/article-view-504.html)

@ -0,0 +1,583 @@
Audio Sample
==================
The main processes of TTS include:
1. Convert the original text into characters/phonemes, through ``text frontend`` module.
2. Convert characters/phonemes into acoustic features, such as linear spectrogram, mel spectrogram, LPC features, etc., through ``Acoustic models``.
3. Convert acoustic features into waveforms through ``Vocoders``.
When training ``Tacotron2``, ``TransformerTTS`` and ``WaveFlow``, we use the English single-speaker TTS dataset `LJSpeech <https://keithito.com/LJ-Speech-Dataset/>`_ by default. However, when training ``SpeedySpeech``, ``FastSpeech2`` and ``ParallelWaveGAN``, we use the Chinese single-speaker dataset `CSMSC <https://test.data-baker.com/data/index/source/>`_ by default.
In the future, ``Parakeet`` will mainly use Chinese TTS datasets for default examples.
Here, we will display three types of audio samples:
1. Analysis/synthesis (ground-truth spectrograms + Vocoder)
2. TTS (Acoustic model + Vocoder)
3. Chinese TTS with/without text frontend (mainly tone sandhi)
Analysis/synthesis
--------------------------
Audio samples generated from ground-truth spectrograms with a vocoder.
.. raw:: html
<b>LJSpeech(English)</b>
<br>
</br>
<table>
<tr>
<th align="left"> GT </th>
<th align="left"> WaveFlow </th>
</tr>
<tr>
<td>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/ljspeech_gt/LJ001-0001.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/ljspeech_gt/LJ001-0002.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/ljspeech_gt/LJ001-0003.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/ljspeech_gt/LJ001-0004.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/ljspeech_gt/LJ001-0005.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
<td>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_0.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_1.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_2.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_3.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_4.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
</table>
<br>
</br>
<b>CSMSC(Chinese)</b>
<br>
</br>
<table>
<tr>
<th align="left"> GT (convert to 24k) </th>
<th align="left"> ParallelWaveGAN </th>
</tr>
<tr>
<td>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/baker_gt_24k/009901.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/baker_gt_24k/009902.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/baker_gt_24k/009903.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/baker_gt_24k/009904.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/baker_gt_24k/009905.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
<td>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/pwg_baker_ckpt_0.4/009901.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/pwg_baker_ckpt_0.4/009902.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/pwg_baker_ckpt_0.4/009903.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/pwg_baker_ckpt_0.4/009904.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/pwg_baker_ckpt_0.4/009905.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
</table>
TTS
-------------------
Audio samples generated by a TTS system. Text is first transformed into spectrogram by a text-to-spectrogram model, then the spectrogram is converted into raw audio by a vocoder.
.. raw:: html
<table>
<tr>
<th align="left"> TransformerTTS + WaveFlow </th>
<th align="left"> Tacotron2 + WaveFlow </th>
</tr>
<tr>
<td>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/001.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/002.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/003.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/004.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/005.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/006.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/007.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/008.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/009.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
<td>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_1.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_2.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_3.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_4.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_5.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_6.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_7.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_8.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_9.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
</table>
<table>
<tr>
<th align="left"> SpeedySpeech + ParallelWaveGAN </th>
<th align="left"> FastSpeech2 + ParallelWaveGAN </th>
</tr>
<tr>
<td>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/001.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/002.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/003.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/004.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/005.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/006.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/007.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/008.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/009.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
<td>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/001.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/002.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/003.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/004.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/005.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/006.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/007.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/008.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/009.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
</table>
Chinese TTS with/without text frontend
--------------------------------------
We provide a complete Chinese text frontend module in ``Parakeet``. ``Text Normalization`` and ``G2P`` are the most important modules in a text frontend. We assume that the texts are already normalized, and mainly compare the ``G2P`` module here.
We use ``FastSpeech2`` + ``ParallelWaveGAN`` here.
.. raw:: html
<table>
<tr>
<th align="left"> With Text Frontend </th>
<th align="left"> Without Text Frontend </th>
</tr>
<tr>
<td>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/001.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/002.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/003.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/004.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/005.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/006.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/007.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/008.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/009.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/010.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
<td>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/001.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/002.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/003.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/004.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/005.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/006.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/007.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/008.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/009.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
<audio controls="controls">
<source
src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/010.wav"
type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
</table>

@ -0,0 +1,45 @@
.. parakeet documentation master file, created by
sphinx-quickstart on Fri Sep 10 14:22:24 2021.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Parakeet
====================================
``parakeet`` is a deep-learning-based text-to-speech toolkit built upon the ``paddlepaddle`` framework. It aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It includes many influential TTS models proposed by `Baidu Research <http://research.baidu.com>`_ and other research groups.
``parakeet`` mainly consists of components below.
#. Implementation of models and commonly used neural network layers.
#. Dataset abstraction and common data preprocessing pipelines.
#. Ready-to-run experiments.
.. toctree::
:maxdepth: 1
:caption: Introduction
introduction
.. toctree::
:maxdepth: 1
:caption: Getting started
install
basic_usage
advanced_usage
cn_text_frontend
released_models
.. toctree::
:maxdepth: 1
:caption: Demos
demo
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

@ -0,0 +1,47 @@
# Installation
## Install PaddlePaddle
Parakeet requires PaddlePaddle as its backend. Note that PaddlePaddle 2.1.2 or newer is required.
Since paddlepaddle provides multiple packages depending on the device (cpu or gpu) and the dependency libraries, it is recommended to install the proper paddlepaddle package for your device and dependency library versions via `pip`.
Installing paddlepaddle with conda or building paddlepaddle from source is also supported. Please refer to [PaddlePaddle installation](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html) for more details.
Example instructions to install paddlepaddle via pip are listed below.
### PaddlePaddle with GPU
```bash
# PaddlePaddle for CUDA 10.1
python -m pip install paddlepaddle-gpu==2.1.2.post101 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# PaddlePaddle for CUDA 10.2
python -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple
# PaddlePaddle for CUDA 11.0
python -m pip install paddlepaddle-gpu==2.1.2.post110 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# PaddlePaddle for CUDA 11.2
python -m pip install paddlepaddle-gpu==2.1.2.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
```
### PaddlePaddle with CPU
```bash
python -m pip install paddlepaddle==2.1.2 -i https://mirror.baidu.com/pypi/simple
```
## Install libsndfile
Experiments in parakeet often involve audio and spectrum processing, thus `librosa` and `soundfile` are required. `soundfile` requires an extra C library, `libsndfile`, which is not always handled by pip.
For Windows and Mac users, `libsndfile` is installed along with `soundfile` via pip, but Linux users need to install `libsndfile` through the system package manager. Example commands for popular distributions are listed below.
```bash
# ubuntu, debian
sudo apt-get install libsndfile1
# centos, fedora
sudo yum install libsndfile
# openSUSE
sudo zypper in libsndfile
```
For any problem with the installation of soundfile, please refer to [SoundFile](https://pypi.org/project/SoundFile/).
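After installation, a quick way to check that `soundfile` can load `libsndfile` is to write and read back a short file. The snippet below is only a sanity check; the file name is illustrative.
```python
import numpy as np
import soundfile as sf

# Write one second of silence and read it back; this raises an error
# if libsndfile is missing or cannot be loaded.
sr = 16000
sf.write("check_libsndfile.wav", np.zeros(sr, dtype=np.float32), sr)
audio, file_sr = sf.read("check_libsndfile.wav")
print(audio.shape, file_sr)  # expected: (16000,) 16000
```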
## Install Parakeet
There are two ways to install parakeet, depending on how you intend to use it.
1. If you want to run the experiments provided by parakeet or add new models and experiments, it is recommended to clone the project from GitHub (Parakeet) and install it in editable mode.
```bash
git clone https://github.com/PaddlePaddle/Parakeet
cd Parakeet
pip install -e .
```

@ -0,0 +1,27 @@
# Parakeet - PAddle PARAllel text-to-speech toolKIT
## What is Parakeet?
Parakeet is a deep-learning-based text-to-speech toolkit built upon the PaddlePaddle framework. It aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It includes many influential TTS models proposed by Baidu Research and other research groups.
## What can Parakeet do?
Parakeet mainly consists of components below:
- Implementation of models and commonly used neural network layers.
- Dataset abstraction and common data preprocessing pipelines.
- Ready-to-run experiments.
Parakeet provides you with a complete TTS pipeline, including:
- Text FrontEnd
- Rule based Chinese frontend.
- Acoustic Models
- FastSpeech2
- SpeedySpeech
- TransformerTTS
- Tacotron2
- Vocoders
- Parallel WaveGAN
- WaveFlow
- Voice Cloning
- Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
- GE2E
Parakeet helps you to train TTS models with simple commands.

@ -0,0 +1,295 @@
# Released Models
A TTS system mainly includes three modules: `text frontend`, `acoustic model` and `vocoder`. We introduce a rule-based Chinese text frontend in [cn_text_frontend.md](./cn_text_frontend.md). Here, we introduce acoustic models and vocoders, which are trainable models.
The main processes of TTS include:
1. Convert the original text into characters/phonemes through the `text frontend` module.
2. Convert characters/phonemes into acoustic features, such as linear spectrogram, mel spectrogram, LPC features, etc., through `acoustic models`.
3. Convert acoustic features into waveforms through `vocoders`.
A simple text frontend module can be implemented by rules. Acoustic models and vocoders need to be trained. The models provided by Parakeet are acoustic models and vocoders.
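The three-stage pipeline can be summarized with a short sketch; the function names below (`text_frontend`, `acoustic_model`, `vocoder`) are placeholders for whichever frontend, acoustic model and vocoder you choose, not Parakeet APIs.
```python
def synthesize(text, text_frontend, acoustic_model, vocoder):
    """Minimal TTS pipeline: text -> phonemes -> acoustic features -> waveform."""
    # 1. Text frontend: rule-based conversion to characters/phonemes.
    phonemes = text_frontend(text)
    # 2. Acoustic model: trainable mapping from phonemes to e.g. a mel spectrogram.
    mel = acoustic_model(phonemes)
    # 3. Vocoder: trainable mapping from acoustic features to a waveform.
    wav = vocoder(mel)
    return wav
```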
## Acoustic Models
### Modeling Objectives of Acoustic Models
Modeling the mapping relationship between text sequences and speech features
```text
text   X = {x1, ..., xM}
speech Y = {y1, ..., yN}
```
Modeling Objectives:
```text
Ω* = argmax_Ω p(Y|X, Ω)
```
### Modeling process of Acoustic Models
At present, there are two mainstream acoustic model structures.
- Frame level acoustic model:
- Duration model (M tokens -> N frames).
- Acoustic decoder (N frames -> N frames).
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/frame_level_am.png" width=500 /> <br>
</div>
- Sequence to sequence acoustic model:
- M tokens -> N frames.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/seq2seq_am.png" width=500 /> <br>
</div>
### Tacotron2
[Tacotron](https://arxiv.org/abs/1703.10135) is the first end-to-end acoustic model based on deep learning, and it is also the most widely used acoustic model.
[Tacotron2](https://arxiv.org/abs/1712.05884) is an improvement of Tacotron.
#### Tacotron
**Features of Tacotron:**
- Encoder.
- CBHG.
- Input: character sequence.
- Decoder.
- Global soft attention.
- Unidirectional RNN.
- Autoregressive teacher-forcing training (real speech features are fed in).
- Multi-frame prediction.
- CBHG postprocess.
- Vocoder: Griffin-Lim.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/tacotron.png" width=700 /> <br>
</div>
**Advantage of Tacotron:**
- No need for complex text frontend analysis modules.
- No need for an additional duration model.
- Greatly simplifies the acoustic model construction process and reduces the dependence of speech synthesis tasks on domain knowledge.
**Disadvantages of Tacotron:**
- The CBHG is complex and the number of parameters is relatively large.
- Global soft attention.
- Poor stability for speech synthesis tasks.
- During training, the fewer speech frames predicted at each step, the more difficult the model is to train.
- The phase problem in Griffin-Lim causes speech distortion during waveform reconstruction.
- The autoregressive decoder cannot be stopped during the generation process.
#### Tacotron2
**Features of Tacotron2:**
- Reduction of parameters.
- CBHG -> PostNet (3 Conv layers + BLSTM or 5 Conv layers).
- Remove the attention RNN.
- Speech distortion caused by Griffin-Lim.
- WaveNet.
- Improvements of PostNet.
- CBHG -> 5 Conv layers.
- Both the input and the output of the PostNet compute an `L2` loss against the real mel spectrogram.
- Residual connection.
- Bad stop in autoregressive decoder.
- Predict whether decoding should stop at each step (stop token).
- Set a threshold on the stop-token probability to decide when to stop generating during decoding (see the sketch at the end of this subsection).
- Stability of attention.
- Location-aware attention.
- The alignment matrix of the previous time step is considered at step `t` of the decoder.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/tacotron2.png" width=500 /> <br>
</div>
You can find Parakeet's tacotron2 example at `Parakeet/examples/tacotron2`.
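The stop-token mechanism mentioned above can be illustrated with a schematic decoding loop; `decoder_step` below is a placeholder for one step of a Tacotron2-style decoder, not Parakeet's actual API.
```python
import numpy as np

def decode_with_stop_token(decoder_step, encoder_outputs,
                           threshold=0.5, max_frames=1000):
    """Schematic autoregressive decoding controlled by a stop token.

    `decoder_step(prev_frame, encoder_outputs)` is a placeholder that returns
    (next_frame, stop_probability) for one decoder step.
    """
    frames = []
    prev_frame = np.zeros(80)  # initial "go" frame, assuming 80 mel bins
    for _ in range(max_frames):
        frame, stop_prob = decoder_step(prev_frame, encoder_outputs)
        frames.append(frame)
        if stop_prob > threshold:  # stop once the probability crosses the threshold
            break
        prev_frame = frame
    return np.stack(frames)
```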
### TransformerTTS
**Disadvantages of the Tacotrons:**
- The encoder and decoder are relatively weak at modeling global information.
- Vanishing gradient of RNN.
- Fixed-length context modeling problem in CNN kernel.
- Training is relatively inefficient.
- The attention is not robust enough and the stability is poor.
Transformer TTS is a combination of Tacotron2 and Transformer.
#### Transformer
[Transformer](https://arxiv.org/abs/1706.03762) is a seq2seq model based entirely on attention mechanism.
**Features of Transformer:**
- Encoder.
- `N` blocks based on self-attention mechanism.
- Positional Encoding.
- Decoder.
- `N` blocks based on self-attention mechanism.
- Add a mask to the self-attention in the blocks to hide the information after step `t`.
- Attentions between encoder and decoder.
- Positional Encoding.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/transformer.png" width=500 /> <br>
</div>
#### Transformer TTS
Transformer TTS is a seq2seq acoustic model based on Transformer and Tacotron2.
**Motivations**
- The RNNs in Tacotron2 make training inefficient.
- The vanishing gradient of RNNs weakens the model's ability to model long-term contexts.
- Self-attention contains no recurrent structure, so it can be trained in parallel.
- Self-attention can model global context information well.
**Features of Transformer TTS:**
- Add a conv-based PreNet in the encoder and decoder.
- Stop Token in decoder controls when to stop autoregressive generation.
- Add PostNet after decoder to improve the quality of synthetic speech.
- Scaled positional encoding (sketched below).
- Uniform-scale positional encoding may have a negative impact on the input or output sequences.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/transformer_tts.png" width=500 /> <br>
</div>
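The scaled positional encoding mentioned above multiplies the standard sinusoidal encoding by a trainable scalar before adding it to the input. Below is a simplified NumPy sketch, not Parakeet's implementation.
```python
import numpy as np

def sinusoid_position_encoding(num_positions, d_model):
    """Standard sinusoidal positional encoding, shape (num_positions, d_model)."""
    positions = np.arange(num_positions)[:, None]
    dims = np.arange(d_model)[None, :]
    angles = positions / np.power(10000.0, 2 * (dims // 2) / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

def add_scaled_position_encoding(x, alpha=1.0):
    """x: (T, d_model) input sequence. `alpha` is a trainable scale in the real model."""
    T, d_model = x.shape
    return x + alpha * sinusoid_position_encoding(T, d_model)
```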
**Disadvantages of Transformer TTS:**
- The ability of positional encoding to represent timing information is still relatively weak.
- The ability to perceive local information is weak, and local information is more related to pronunciation.
- Stability is worse than Tacotron2.
You can find Parakeet's Transformer TTS example at `Parakeet/examples/transformer_tts`.
### FastSpeech2
**Disadvantage of seq2seq models:**
- In attention-based seq2seq models, no matter how the attention mechanism is improved, it is difficult to avoid generation errors in the decoding stage.
Frame-level acoustic models use duration models to determine the pronunciation duration of phonemes, and the frame-level mapping does not have the uncertainty of sequence generation.
In seq2seq models, a duration model can be used as the alignment module between the two sequences to replace attention, which avoids the uncertainty in attention and significantly improves the stability of seq2seq models.
#### FastSpeech
Instead of using the encoder-attention-decoder architecture adopted by most seq2seq-based autoregressive and non-autoregressive generation models, [FastSpeech](https://arxiv.org/abs/1905.09263) uses a novel feed-forward structure, which can generate a target mel spectrogram sequence in parallel.
**Features of FastSpeech:**
- Encoder: based on Transformer.
- Change the `FFN` to a `CNN` in the self-attention blocks.
- Models local dependencies.
- Length regulator (see the sketch below).
- Uses real phoneme durations to expand the encoder output frames during training.
- Non-autoregressive decoder.
- Improve generation efficiency.
**Length predictor:**
- Pretrain a TransformerTTS model.
- Get alignment matrix of train data.
- Calculate the phoneme durations according to the probabilities of the alignment matrix.
- Use the encoder output to predict the phoneme durations and compute the MSE loss.
- Use real phoneme durations to expand the encoder output frames during training.
- Use the phoneme durations predicted by the duration model to expand the frames during inference.
- Attention cannot control phoneme durations, while explicit duration modeling can control durations through a duration coefficient (the duration coefficient is `1` during training).
**Advantages of non-autoregressive decoder:**
- The built-in duration model of the seq2seq model has converted the input length `M` to the output length `N`.
- The output length is known and the `stop token` is no longer needed, which avoids the problem of being unable to stop.
- Generation can be done in parallel (decoding time is less affected by sequence length).
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/fastspeech.png" width=800 /> <br>
</div>
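The length regulator mentioned above simply repeats each phoneme-level encoder output according to its (predicted or ground-truth) duration. A minimal NumPy sketch, not Parakeet's implementation, is shown below.
```python
import numpy as np

def length_regulate(encoder_outputs, durations, alpha=1.0):
    """Expand phoneme-level features to frame level.

    encoder_outputs: (M, d) array, one row per phoneme.
    durations: (M,) integer array of frame counts per phoneme.
    alpha: duration coefficient (1.0 during training).
    """
    scaled = np.maximum(1, np.round(durations * alpha)).astype(int)
    # Repeat the i-th phoneme representation `scaled[i]` times -> (N, d).
    return np.repeat(encoder_outputs, scaled, axis=0)

phoneme_feats = np.random.randn(3, 4)                   # M=3 phonemes, d=4
frames = length_regulate(phoneme_feats, np.array([2, 1, 3]))
print(frames.shape)                                     # (6, 4)
```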
#### FastPitch
[FastPitch](https://arxiv.org/abs/2006.06873) follows FastSpeech. A single pitch value is predicted for every temporal location, which improves the overall quality of synthesized speech.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/fastpitch.png" width=500 /> <br>
</div>
#### FastSpeech2
**Disadvantages of FastSpeech:**
- The teacher-student distillation pipeline is complicated and time-consuming.
- The duration extracted from the teacher model is not accurate enough.
- The target mel spectrograms distilled from teacher model suffer from information loss due to data simplification.
[FastSpeech2](https://arxiv.org/abs/2006.04558) addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS.
**Features of FastSpeech2:**
- Directly training the model with ground-truth target instead of the simplified output from teacher.
- Introducing more variation information of speech as conditional inputs: `duration`, `pitch` and `energy` are extracted from the speech waveform and used directly as conditional inputs in training, while predicted values are used in inference.
FastSpeech2 is similar to FastPitch but introduces more variation information of speech.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/fastspeech2.png" width=800 /> <br>
</div>
You can find Parakeet's FastSpeech2/FastPitch example at `Parakeet/examples/fastspeech2`. We use the token-averaged pitch and energy values introduced in FastPitch rather than the frame-level ones in FastSpeech2.
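Token averaging can be sketched as follows: frame-level pitch (or energy) values are averaged over each phoneme's frames, yielding one value per token. This is an illustrative NumPy sketch, assuming the durations sum to the number of frames.
```python
import numpy as np

def token_average(frame_values, durations):
    """Average frame-level pitch/energy over each phoneme's frames.

    frame_values: (N,) frame-level pitch or energy.
    durations: (M,) integer frame counts per phoneme, summing to N.
    Returns an (M,) array with one value per token.
    """
    assert frame_values.shape[0] == int(np.sum(durations))
    boundaries = np.cumsum(durations)[:-1]
    segments = np.split(frame_values, boundaries)
    return np.array([seg.mean() if seg.size else 0.0 for seg in segments])

pitch = np.array([100., 110., 120., 200., 210., 90.])
print(token_average(pitch, np.array([3, 2, 1])))  # [110. 205.  90.]
```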
### SpeedySpeech
[SpeedySpeech](https://arxiv.org/abs/2008.03802) simplifies the teacher-student architecture of FastSpeech and provides a fast and stable training procedure.
**Features of SpeedySpeech:**
- Use a simpler, smaller and faster-to-train convolutional teacher model ([Deepvoice3](https://arxiv.org/abs/1710.07654) and [DCTTS](https://arxiv.org/abs/1710.08969)) with a single attention layer instead of Transformer used in FastSpeech.
- Show that self-attention layers in the student network are not needed for high-quality speech synthesis.
- Describe a simple data augmentation technique that can be used early in the training to make the teacher network robust to sequential error propagation.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/speedyspeech.png" width=500 /> <br>
</div>
You can find Parakeet's SpeedySpeech example at `Parakeet/examples/speedyspeech/baker`.
## Vocoders
In speech synthesis, the main task of the vocoder is to convert the spectral parameters predicted by the acoustic model into the final speech waveform.
Taking into account the short-term change frequency of the waveform, the acoustic model usually avoids direct modeling of the speech waveform, but firstly models the spectral features extracted from the speech waveform, and then reconstructs the waveform by the decoding part of the vocoder.
A vocoder usually consists of an encoder and a decoder for speech analysis and synthesis: the encoder estimates the parameters, and the decoder restores the speech.
Neural-network-based vocoders usually perform only the synthesis part, learning the mapping from spectral features to waveforms from training data.
### Categories of neural vocoders
- Autoregression
- WaveNet
- WaveRNN
- LPCNet
- Flow
- **WaveFlow**
- WaveGlow
- FloWaveNet
- Parallel WaveNet
- GAN
- WaveGAN
- **Parallel WaveGAN**
- MelGAN
- HiFi-GAN
- VAE
- Wave-VAE
- Diffusion
- WaveGrad
- DiffWave
**Motivations of GAN-based vocoders:**
- Modeling the speech signal by estimating its probability distribution usually places high demands on the expressive power of the model itself. In addition, specific assumptions need to be made about the distribution of waveforms.
- Although autoregressive neural vocoders can obtain high-quality synthetic speech, such models usually have a **slow generation speed**.
- The training of inverse autoregressive flow vocoders is complex, and they also require the ability to model long-term context information.
- Vocoders based on Bipartite Transformation converge slowly and are complex.
- GAN-based vocoders don't need to make assumptions about the speech distribution, and train through adversarial learning.
Here, we introduce a Flow-based vocoder WaveFlow and a GAN-based vocoder Parallel WaveGAN.
### WaveFlow
[WaveFlow](https://arxiv.org/abs/1912.01219) is proposed by Baidu Research.
**Features of WaveFlow:**
- It can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an Nvidia V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and several orders of magnitude faster than WaveNet.
- It is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M).
- It is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in [Parallel WaveNet](https://arxiv.org/abs/1711.10433) and [ClariNet](https://openreview.net/pdf?id=HklY120cYm), which simplifies the training pipeline and reduces the cost of development.
You can find Parakeet's WaveFlow example at `Parakeet/examples/waveflow`.
### Parallel WaveGAN
[Parallel WaveGAN](https://arxiv.org/abs/1910.11480) trains a non-autoregressive WaveNet variant as a generator in a GAN based training method.
**Features of Parallel WaveGAN:**
- Use non-causal convolution instead of causal convolution.
- The input is random Gaussian white noise.
- The model is non-autoregressive in both training and prediction, which makes it fast.
- Multi-resolution STFT loss (sketched below).
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/pwg.png" width=600 /> <br>
</div>
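The multi-resolution STFT loss compares generated and real waveforms in the spectral domain under several FFT/hop/window settings. Below is a simplified NumPy/librosa sketch; the resolutions are illustrative defaults and this is not Parakeet's implementation.
```python
import numpy as np
import librosa

def stft_loss(x, y, n_fft, hop_length, win_length):
    """Spectral convergence + log STFT magnitude loss for one resolution."""
    X = np.abs(librosa.stft(x, n_fft=n_fft, hop_length=hop_length, win_length=win_length))
    Y = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length))
    sc = np.linalg.norm(Y - X) / (np.linalg.norm(Y) + 1e-7)      # spectral convergence
    mag = np.mean(np.abs(np.log(Y + 1e-7) - np.log(X + 1e-7)))   # log magnitude loss
    return sc + mag

def multi_resolution_stft_loss(generated, real,
                               resolutions=((1024, 120, 600),
                                            (2048, 240, 1200),
                                            (512, 50, 240))):
    """Average the single-resolution STFT loss over several analysis settings."""
    losses = [stft_loss(generated, real, *r) for r in resolutions]
    return sum(losses) / len(losses)
```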
You can find Parakeet's Parallel WaveGAN example at `Parakeet/examples/parallelwave_gan/baker`.

@ -0,0 +1,4 @@
# Aishell3
* tts0 - fastspeech2
* vc0 - tacotron2 voice cloning

@ -0,0 +1,112 @@
## Train a Voice Cloning Model with Tacotron2 and the AISHELL-3 Dataset
This experiment performs a voice cloning task with the AISHELL-3 dataset and a Tacotron 2 model. The overall model structure is largely the same as in the paper [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf). The main steps are as follows:
1. Speaker Encoder: we train a speaker encoder on a speaker verification task. The dataset used here is different from the one used to train Tacotron 2; since no transcription is needed, we can use much more training data. See the implementation in [ge2e](../ge2e).
2. Synthesizer: the trained speaker encoder is then used to generate an utterance embedding for every sentence in the AISHELL-3 dataset. This embedding is concatenated with the encoder outputs as an additional input to the Tacotron model.
3. Vocoder: we use WaveFlow as the vocoder; see the [waveflow](../waveflow) experiment.
## Data Processing
### Generating utterance embeddings
Use the trained speaker encoder to generate an utterance embedding for every sentence in the AISHELL-3 dataset. The embeddings are stored with the same folder structure as the audio files, in `.npy` format.
First, cd into the [ge2e](../ge2e) folder, download the pretrained [model](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip), and run the script to generate the utterance embedding for each sentence.
```bash
python inference.py --input=<input> --output=<output> --device="gpu" --checkpoint_path=<pretrained checkpoint>
```
Here, `input` is a directory that contains only the audio folders, e.g. `~/datasets/aishell3/train/wav`, and `output` is the folder used to store the utterance embeddings, e.g. `~/datasets/aishell3/train/embed`. The utterance embeddings are stored with the same file structure as the audio folder, in `.npy` format.
Computing the utterance embeddings may take several hours; please be patient.
### Audio processing
The AISHELL-3 recordings contain leading and trailing silence, and the speech amplitude is rather small, so we need to trim the silence and normalize the volume. Silence trimming could be done with a simple volume- or energy-based method, but the result is not very good: it is hard to find a threshold that works consistently across sentences. Instead, we first align text and speech with a forced aligner and then trim the silence according to the alignment.
The tool we use is Montreal Forced Aligner 1.0. Since the AISHELL-3 annotations include pinyin, we feed Montreal Forced Aligner the pinyin transcriptions instead of the Chinese characters. The prosody marks (`$` and `%`) must be removed, and the transcriptions must be converted into the file format Montreal Forced Aligner expects: a text file with the same name as the audio file and the extension `.lab`.
A lexicon file is also required, containing the mapping from pinyin syllables to phone sequences. Here we only split initials and finals, and the tone is treated as part of the final. The [lexicon file](./lexicon.txt) we use can be downloaded.
Once everything is prepared, run training and alignment. First download [Montreal Forced Aligner 1.0](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/tag/v1.0.1), extract it, cd into its bin folder and run the command below to train and align. The first three command line arguments are the path of the audio folder, the lexicon path and the output path of the alignment files. The path to save the trained model can be passed with `-o`.
```bash
./mfa_train_and_align \
~/datasets/aishell3/train/wav \
lexicon.txt \
~/datasets/aishell3/train/alignment \
-o aishell3_model \
-v
```
Since training and alignment take quite a long time, we provide the resulting [alignment files](https://paddlespeech.bj.bcebos.com/Parakeet/alignment_aishell3.tar.gz), in which each sentence corresponds to a text file in `.TextGrid` format.
Once the alignment files are available, you can run the `process_wav.py` script to process the audio.
```bash
python process_wav.py --input=<input> --output=<output> --alignment=<alignment>
```
By default, `input`, `output` and `alignment` are `~/datasets/aishell3/train/wav`, `~/datasets/aishell3/train/normalized_wav` and `~/datasets/aishell3/train/alignment`, respectively.
After processing, the processed audio is saved in the `<output>` folder.
### Transcription processing
Convert the text into phones and tones and store them. Note that this processing differs from the one used for Montreal Forced Aligner: here we separate out the tones. This is just one possible scheme; splitting only into initials and finals would also work.
Run the script to process the transcriptions.
```bash
python preprocess_transcription.py --input=<input> --output=<output>
```
The default `input` is `~/datasets/aishell3/train`, which contains the `label_train-set.txt` file. The processed results are saved as `metadata.yaml` and `metadata.pickle`; the former is a text format that is easy to inspect, the latter a binary format that is easy to load directly.
### Mel spectrogram extraction
Extract mel spectrograms from the processed audio and store them with the same folder structure as the audio, in `.npy` format.
```bash
python extract_mel.py --input=<input> --output=<output>
```
`input` is the folder containing the processed audio, and `output` is the folder where the spectrograms are written.
## Training
Run the script to train.
```bash
python train.py --data=<data> --output=<output> --device="gpu"
```
Our model removes the stop token prediction used in Tacotron2, because in practice stop token prediction is an extremely imbalanced classification problem: each sentence may have hundreds of frames as negative examples and only a single frame as a positive example, and the prediction is very sensitive to how the silence is trimmed. Instead, we stop generation when the attention peak reaches the last symbol on the encoder side.
In addition, to speed up convergence, we add a guided attention loss to encourage the encoder-decoder alignment to become diagonal more quickly.
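A guided attention loss penalizes attention weights that lie far from the diagonal. The sketch below (plain NumPy, with `sigma` playing the role of `guided_attention_loss_sigma` in the config) illustrates the idea rather than Parakeet's exact implementation.
```python
import numpy as np

def guided_attention_weight(T_dec, T_enc, sigma=0.2):
    """Soft diagonal mask W[t, n] = 1 - exp(-((n/T_enc - t/T_dec)^2) / (2*sigma^2))."""
    t = np.arange(T_dec)[:, None] / T_dec
    n = np.arange(T_enc)[None, :] / T_enc
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * sigma ** 2))

def guided_attention_loss(attention, sigma=0.2):
    """attention: (T_dec, T_enc) alignment matrix produced by the decoder."""
    W = guided_attention_weight(*attention.shape, sigma=sigma)
    return float(np.mean(attention * W))
```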
You can inspect the training logs with visualdl.
```bash
visualdl --logdir=<output> --host=$HOSTNAME
```
Example training loss / validation loss curves are shown below.
![train](./images/train.png)
![valid](./images/valid.png)
<img src="images/alignment-step2000.png" alt="alignment-step2000" style="zoom:50%;" />
A fuzzy diagonal can be observed in the alignments produced during validation after roughly 2000 training steps, and it becomes clearer as training continues. However, since validation is also run with teacher forcing, it takes longer before a diagonal shows up in the alignments produced by truly autoregressive synthesis.
## Pretrained model
Pretrained model download link: [tacotron2_aishell3_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_aishell3_ckpt_0.3.zip).
## Usage
This experiment includes a simple usage example: you can replace the reference voice and the text, and synthesize speech with the trained model. See the instructions in the [notebook](./voice_cloning.ipynb).

@ -0,0 +1,88 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pickle
from pathlib import Path
import numpy as np
from paddle.io import Dataset
from parakeet.frontend import Vocab
from parakeet.data import batch_text_id, batch_spec
from preprocess_transcription import _phones, _tones
voc_phones = Vocab(sorted(list(_phones)))
print("vocab_phones:\n", voc_phones)
voc_tones = Vocab(sorted(list(_tones)))
print("vocab_tones:\n", voc_tones)
class AiShell3(Dataset):
"""Processed AiShell3 dataset."""
def __init__(self, root):
super().__init__()
self.root = Path(root).expanduser()
self.embed_dir = self.root / "embed"
self.mel_dir = self.root / "mel"
with open(self.root / "metadata.pickle", 'rb') as f:
self.records = pickle.load(f)
def __getitem__(self, index):
metadatum = self.records[index]
sentence_id = metadatum["sentence_id"]
speaker_id = sentence_id[:7]
phones = metadatum["phones"]
tones = metadatum["tones"]
phones = np.array(
[voc_phones.lookup(item) for item in phones], dtype=np.int64)
tones = np.array(
[voc_tones.lookup(item) for item in tones], dtype=np.int64)
mel = np.load(str(self.mel_dir / speaker_id / (sentence_id + ".npy")))
embed = np.load(
str(self.embed_dir / speaker_id / (sentence_id + ".npy")))
return phones, tones, mel, embed
def __len__(self):
return len(self.records)
def collate_aishell3_examples(examples):
phones, tones, mel, embed = list(zip(*examples))
text_lengths = np.array([item.shape[0] for item in phones], dtype=np.int64)
spec_lengths = np.array([item.shape[1] for item in mel], dtype=np.int64)
T_dec = np.max(spec_lengths)
stop_tokens = (
np.arange(T_dec) >= np.expand_dims(spec_lengths, -1)).astype(np.float32)
phones, _ = batch_text_id(phones)
tones, _ = batch_text_id(tones)
mel, _ = batch_spec(mel)
mel = np.transpose(mel, (0, 2, 1))
embed = np.stack(embed)
# 7 fields
# (B, T), (B, T), (B, T, C), (B, C), (B,), (B,), (B, T)
return phones, tones, mel, embed, text_lengths, spec_lengths, stop_tokens
if __name__ == "__main__":
dataset = AiShell3("~/datasets/aishell3/train")
example = dataset[0]
examples = [dataset[i] for i in range(10)]
batch = collate_aishell3_examples(examples)
for field in batch:
print(field.shape, field.dtype)

@ -0,0 +1,39 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List, Tuple
from pypinyin import lazy_pinyin, Style
from preprocess_transcription import split_syllable
def convert_to_pinyin(text: str) -> List[str]:
"""convert text into list of syllables, other characters that are not chinese, thus
cannot be converted to pinyin are splited.
"""
syllables = lazy_pinyin(
text, style=Style.TONE3, neutral_tone_with_five=True)
return syllables
def convert_sentence(text: str) -> List[Tuple[str]]:
"""convert a sentence into two list: phones and tones"""
syllables = convert_to_pinyin(text)
phones = []
tones = []
for syllable in syllables:
p, t = split_syllable(syllable)
phones.extend(p)
tones.extend(t)
return phones, tones

@ -0,0 +1,82 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from yacs.config import CfgNode as CN
_C = CN()
_C.data = CN(
dict(
batch_size=32, # batch size
valid_size=64, # the first N examples are reserved for validation
sample_rate=22050, # Hz, sample rate
n_fft=1024, # fft frame size
win_length=1024, # window size
        hop_length=256, # hop size between adjacent frames
fmax=8000, # Hz, max frequency when converting to mel
fmin=0, # Hz, min frequency when converting to mel
d_mels=80, # mel bands
padding_idx=0, # text embedding's padding index
))
_C.model = CN(
dict(
vocab_size=70,
n_tones=10,
reduction_factor=1, # reduction factor
d_encoder=512, # embedding & encoder's internal size
encoder_conv_layers=3, # number of conv layer in tacotron2 encoder
encoder_kernel_size=5, # kernel size of conv layers in tacotron2 encoder
d_prenet=256, # hidden size of decoder prenet
# hidden size of the first rnn layer in tacotron2 decoder
d_attention_rnn=1024,
# hidden size of the second rnn layer in tacotron2 decoder
d_decoder_rnn=1024,
d_attention=128, # hidden size of decoder location linear layer
attention_filters=32, # number of filter in decoder location conv layer
attention_kernel_size=31, # kernel size of decoder location conv layer
d_postnet=512, # hidden size of decoder postnet
postnet_kernel_size=5, # kernel size of conv layers in postnet
postnet_conv_layers=5, # number of conv layer in decoder postnet
        p_encoder_dropout=0.5, # dropout probability in encoder
        p_prenet_dropout=0.5, # dropout probability in decoder prenet
        # dropout probability of first rnn layer in decoder
        p_attention_dropout=0.1,
        # dropout probability of second rnn layer in decoder
        p_decoder_dropout=0.1,
        p_postnet_dropout=0.5, # dropout probability in decoder postnet
guided_attention_loss_sigma=0.2,
d_global_condition=256,
# whether to use a classifier to predict stop probability
use_stop_token=False,
# whether to use guided attention loss in training
use_guided_attention_loss=True, ))
_C.training = CN(
dict(
lr=1e-3, # learning rate
weight_decay=1e-6, # the coeff of weight decay
grad_clip_thresh=1.0, # the clip norm of grad clip.
valid_interval=1000, # validation
save_interval=1000, # checkpoint
max_iteration=500000, # max iteration to train
))
def get_cfg_defaults():
"""Get a yacs CfgNode object with default values for my_project."""
# Return a clone so that the defaults will not be altered
# This is for the "local variable" use pattern
return _C.clone()

@ -0,0 +1,96 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import multiprocessing as mp
from functools import partial
from pathlib import Path
import numpy as np
from parakeet.audio import AudioProcessor
from parakeet.audio.spec_normalizer import NormalizerBase, LogMagnitude
import tqdm
from config import get_cfg_defaults
def extract_mel(fname: Path,
input_dir: Path,
output_dir: Path,
p: AudioProcessor,
n: NormalizerBase):
relative_path = fname.relative_to(input_dir)
out_path = (output_dir / relative_path).with_suffix(".npy")
out_path.parent.mkdir(parents=True, exist_ok=True)
wav = p.read_wav(fname)
mel = p.mel_spectrogram(wav)
mel = n.transform(mel)
np.save(out_path, mel)
def extract_mel_multispeaker(config, input_dir, output_dir, extension=".wav"):
input_dir = Path(input_dir).expanduser()
fnames = list(input_dir.rglob(f"*{extension}"))
output_dir = Path(output_dir).expanduser()
output_dir.mkdir(parents=True, exist_ok=True)
p = AudioProcessor(config.sample_rate, config.n_fft, config.win_length,
config.hop_length, config.n_mels, config.fmin,
config.fmax)
n = LogMagnitude(1e-5)
func = partial(
extract_mel, input_dir=input_dir, output_dir=output_dir, p=p, n=n)
with mp.Pool(16) as pool:
list(
tqdm.tqdm(
pool.imap(func, fnames), total=len(fnames), unit="utterance"))
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Extract mel spectrogram from processed wav in AiShell3 training dataset."
)
parser.add_argument(
"--config",
type=str,
help="yaml config file to overwrite the default config")
parser.add_argument(
"--input",
type=str,
default="~/datasets/aishell3/train/normalized_wav",
help="path of the processed wav folder")
parser.add_argument(
"--output",
type=str,
default="~/datasets/aishell3/train/mel",
help="path of the folder to save mel spectrograms")
parser.add_argument(
"--opts",
nargs=argparse.REMAINDER,
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
)
default_config = get_cfg_defaults()
args = parser.parse_args()
if args.config:
default_config.merge_from_file(args.config)
if args.opts:
default_config.merge_from_list(args.opts)
default_config.freeze()
audio_config = default_config.data
extract_mel_multispeaker(audio_config, args.input, args.output)

Binary file not shown.

After

Width:  |  Height:  |  Size: 221 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 550 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 514 KiB

File diff suppressed because it is too large

@ -0,0 +1,258 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path
import re
import pickle
import yaml
import tqdm
zh_pattern = re.compile("[\u4e00-\u9fa5]")
_tones = {'<pad>', '<s>', '</s>', '0', '1', '2', '3', '4', '5'}
_pauses = {'%', '$'}
_initials = {
'b',
'p',
'm',
'f',
'd',
't',
'n',
'l',
'g',
'k',
'h',
'j',
'q',
'x',
'zh',
'ch',
'sh',
'r',
'z',
'c',
's',
}
_finals = {
'ii',
'iii',
'a',
'o',
'e',
'ea',
'ai',
'ei',
'ao',
'ou',
'an',
'en',
'ang',
'eng',
'er',
'i',
'ia',
'io',
'ie',
'iai',
'iao',
'iou',
'ian',
'ien',
'iang',
'ieng',
'u',
'ua',
'uo',
'uai',
'uei',
'uan',
'uen',
'uang',
'ueng',
'v',
've',
'van',
'ven',
'veng',
}
_ernized_symbol = {'&r'}
_specials = {'<pad>', '<unk>', '<s>', '</s>'}
_phones = _initials | _finals | _ernized_symbol | _specials | _pauses
def is_zh(word):
global zh_pattern
match = zh_pattern.search(word)
return match is not None
def ernized(syllable):
return syllable[:2] != "er" and syllable[-2] == 'r'
def convert(syllable):
# expansion of o -> uo
syllable = re.sub(r"([bpmf])o$", r"\1uo", syllable)
# syllable = syllable.replace("bo", "buo").replace("po", "puo").replace("mo", "muo").replace("fo", "fuo")
# expansion for iong, ong
syllable = syllable.replace("iong", "veng").replace("ong", "ueng")
# expansion for ing, in
syllable = syllable.replace("ing", "ieng").replace("in", "ien")
# expansion for un, ui, iu
syllable = syllable.replace("un", "uen").replace("ui",
"uei").replace("iu", "iou")
# rule for variants of i
syllable = syllable.replace("zi", "zii").replace("ci", "cii").replace("si", "sii")\
.replace("zhi", "zhiii").replace("chi", "chiii").replace("shi", "shiii")\
.replace("ri", "riii")
# rule for y preceding i, u
syllable = syllable.replace("yi", "i").replace("yu", "v").replace("y", "i")
# rule for w
syllable = syllable.replace("wu", "u").replace("w", "u")
# rule for v following j, q, x
syllable = syllable.replace("ju", "jv").replace("qu",
"qv").replace("xu", "xv")
return syllable
def split_syllable(syllable: str):
"""Split a syllable in pinyin into a list of phones and a list of tones.
Initials have no tone, represented by '0', while finals have tones from
'1,2,3,4,5'.
e.g.
zhang -> ['zh', 'ang'], ['0', '1']
"""
if syllable in _pauses:
# syllable, tone
return [syllable], ['0']
tone = syllable[-1]
syllable = convert(syllable[:-1])
phones = []
tones = []
global _initials
if syllable[:2] in _initials:
phones.append(syllable[:2])
tones.append('0')
phones.append(syllable[2:])
tones.append(tone)
elif syllable[0] in _initials:
phones.append(syllable[0])
tones.append('0')
phones.append(syllable[1:])
tones.append(tone)
else:
phones.append(syllable)
tones.append(tone)
return phones, tones
def load_aishell3_transcription(line: str):
sentence_id, pinyin, text = line.strip().split("|")
syllables = pinyin.strip().split()
results = []
for syllable in syllables:
if syllable in _pauses:
results.append(syllable)
elif not ernized(syllable):
results.append(syllable)
else:
results.append(syllable[:-2] + syllable[-1])
results.append('&r5')
phones = []
tones = []
for syllable in results:
p, t = split_syllable(syllable)
phones.extend(p)
tones.extend(t)
for p in phones:
assert p in _phones, p
return {
"sentence_id": sentence_id,
"text": text,
"syllables": results,
"phones": phones,
"tones": tones
}
def process_aishell3(dataset_root, output_dir):
dataset_root = Path(dataset_root).expanduser()
output_dir = Path(output_dir).expanduser()
output_dir.mkdir(parents=True, exist_ok=True)
prosody_label_path = dataset_root / "label_train-set.txt"
with open(prosody_label_path, 'rt') as f:
lines = [line.strip() for line in f]
records = lines[5:]
processed_records = []
for record in tqdm.tqdm(records):
new_record = load_aishell3_transcription(record)
processed_records.append(new_record)
print(new_record)
with open(output_dir / "metadata.pickle", 'wb') as f:
pickle.dump(processed_records, f)
with open(output_dir / "metadata.yaml", 'wt', encoding="utf-8") as f:
yaml.safe_dump(
processed_records, f, default_flow_style=None, allow_unicode=True)
print("metadata done!")
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Preprocess transcription of AiShell3 and save them in a compact file(yaml and pickle)."
)
parser.add_argument(
"--input",
type=str,
default="~/datasets/aishell3/train",
help="path of the training dataset,(contains a label_train-set.txt).")
parser.add_argument(
"--output",
type=str,
help="the directory to save the processed transcription."
"If not provided, it would be the same as the input.")
args = parser.parse_args()
if args.output is None:
args.output = args.input
process_aishell3(args.input, args.output)

@ -0,0 +1,95 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path
from multiprocessing import Pool
from functools import partial
import numpy as np
import librosa
import soundfile as sf
from tqdm import tqdm
from praatio import tgio
def get_valid_part(fpath):
f = tgio.openTextgrid(fpath)
start = 0
phone_entry_list = f.tierDict['phones'].entryList
first_entry = phone_entry_list[0]
if first_entry.label == "sil":
start = first_entry.end
last_entry = phone_entry_list[-1]
if last_entry.label == "sp":
end = last_entry.start
else:
end = last_entry.end
return start, end
def process_utterance(fpath, source_dir, target_dir, alignment_dir):
rel_path = fpath.relative_to(source_dir)
opath = target_dir / rel_path
apath = (alignment_dir / rel_path).with_suffix(".TextGrid")
opath.parent.mkdir(parents=True, exist_ok=True)
start, end = get_valid_part(apath)
wav, _ = librosa.load(fpath, sr=22050, offset=start, duration=end - start)
normalized_wav = wav / np.max(wav) * 0.999
sf.write(opath, normalized_wav, samplerate=22050, subtype='PCM_16')
# print(f"{fpath} => {opath}")
def preprocess_aishell3(source_dir, target_dir, alignment_dir):
source_dir = Path(source_dir).expanduser()
target_dir = Path(target_dir).expanduser()
alignment_dir = Path(alignment_dir).expanduser()
wav_paths = list(source_dir.rglob("*.wav"))
print(f"there are {len(wav_paths)} audio files in total")
fx = partial(
process_utterance,
source_dir=source_dir,
target_dir=target_dir,
alignment_dir=alignment_dir)
with Pool(16) as p:
list(
tqdm(p.imap(fx, wav_paths), total=len(wav_paths), unit="utterance"))
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Process audio in AiShell3, trim silence according to the alignment "
"files generated by MFA, and normalize volume by peak.")
parser.add_argument(
"--input",
type=str,
default="~/datasets/aishell3/train/wav",
help="path of the original audio folder in aishell3.")
parser.add_argument(
"--output",
type=str,
default="~/datasets/aishell3/train/normalized_wav",
help="path of the folder to save the processed audio files.")
parser.add_argument(
"--alignment",
type=str,
default="~/datasets/aishell3/train/alignment",
help="path of the alignment files.")
args = parser.parse_args()
preprocess_aishell3(args.input, args.output, args.alignment)

@ -0,0 +1,262 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time
from pathlib import Path
from collections import defaultdict
import numpy as np
from matplotlib import pyplot as plt
import paddle
from paddle import distributed as dist
from paddle.io import DataLoader, DistributedBatchSampler
from parakeet.data import dataset
from parakeet.training.cli import default_argument_parser
from parakeet.training.experiment import ExperimentBase
from parakeet.utils import display, mp_tools
from parakeet.models.tacotron2 import Tacotron2, Tacotron2Loss
from config import get_cfg_defaults
from aishell3 import AiShell3, collate_aishell3_examples
class Experiment(ExperimentBase):
def compute_losses(self, inputs, outputs):
texts, tones, mel_targets, utterance_embeds, text_lens, output_lens, stop_tokens = inputs
mel_outputs = outputs["mel_output"]
mel_outputs_postnet = outputs["mel_outputs_postnet"]
alignments = outputs["alignments"]
losses = self.criterion(mel_outputs, mel_outputs_postnet, mel_targets,
alignments, output_lens, text_lens)
return losses
def train_batch(self):
start = time.time()
batch = self.read_batch()
data_loader_time = time.time() - start
self.optimizer.clear_grad()
self.model.train()
texts, tones, mels, utterance_embeds, text_lens, output_lens, stop_tokens = batch
outputs = self.model(
texts,
text_lens,
mels,
output_lens,
tones=tones,
global_condition=utterance_embeds)
losses = self.compute_losses(batch, outputs)
loss = losses["loss"]
loss.backward()
self.optimizer.step()
iteration_time = time.time() - start
losses_np = {k: float(v) for k, v in losses.items()}
# logging
msg = "Rank: {}, ".format(dist.get_rank())
msg += "step: {}, ".format(self.iteration)
msg += "time: {:>.3f}s/{:>.3f}s, ".format(data_loader_time,
iteration_time)
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in losses_np.items())
self.logger.info(msg)
if dist.get_rank() == 0:
for key, value in losses_np.items():
self.visualizer.add_scalar(f"train_loss/{key}", value,
self.iteration)
@mp_tools.rank_zero_only
@paddle.no_grad()
def valid(self):
valid_losses = defaultdict(list)
for i, batch in enumerate(self.valid_loader):
texts, tones, mels, utterance_embeds, text_lens, output_lens, stop_tokens = batch
outputs = self.model(
texts,
text_lens,
mels,
output_lens,
tones=tones,
global_condition=utterance_embeds)
losses = self.compute_losses(batch, outputs)
for key, value in losses.items():
valid_losses[key].append(float(value))
attention_weights = outputs["alignments"]
self.visualizer.add_figure(
f"valid_sentence_{i}_alignments",
display.plot_alignment(attention_weights[0].numpy().T),
self.iteration)
self.visualizer.add_figure(
f"valid_sentence_{i}_target_spectrogram",
display.plot_spectrogram(mels[0].numpy().T), self.iteration)
mel_pred = outputs['mel_outputs_postnet']
self.visualizer.add_figure(
f"valid_sentence_{i}_predicted_spectrogram",
display.plot_spectrogram(mel_pred[0].numpy().T), self.iteration)
# write visual log
valid_losses = {k: np.mean(v) for k, v in valid_losses.items()}
# logging
msg = "Valid: "
msg += "step: {}, ".format(self.iteration)
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in valid_losses.items())
self.logger.info(msg)
for key, value in valid_losses.items():
self.visualizer.add_scalar(f"valid/{key}", value, self.iteration)
@mp_tools.rank_zero_only
@paddle.no_grad()
def eval(self):
"""Evaluation of Tacotron2 in autoregressive manner."""
self.model.eval()
mel_dir = Path(self.output_dir / ("eval_{}".format(self.iteration)))
mel_dir.mkdir(parents=True, exist_ok=True)
for i, batch in enumerate(self.test_loader):
texts, tones, mels, utterance_embeds, *_ = batch
outputs = self.model.infer(
texts, tones=tones, global_condition=utterance_embeds)
display.plot_alignment(outputs["alignments"][0].numpy().T)
plt.savefig(mel_dir / f"sentence_{i}.png")
plt.close()
np.save(mel_dir / f"sentence_{i}",
outputs["mel_outputs_postnet"][0].numpy().T)
print(f"sentence_{i}")
def setup_model(self):
config = self.config
model = Tacotron2(
vocab_size=config.model.vocab_size,
n_tones=config.model.n_tones,
d_mels=config.data.d_mels,
d_encoder=config.model.d_encoder,
encoder_conv_layers=config.model.encoder_conv_layers,
encoder_kernel_size=config.model.encoder_kernel_size,
d_prenet=config.model.d_prenet,
d_attention_rnn=config.model.d_attention_rnn,
d_decoder_rnn=config.model.d_decoder_rnn,
attention_filters=config.model.attention_filters,
attention_kernel_size=config.model.attention_kernel_size,
d_attention=config.model.d_attention,
d_postnet=config.model.d_postnet,
postnet_kernel_size=config.model.postnet_kernel_size,
postnet_conv_layers=config.model.postnet_conv_layers,
reduction_factor=config.model.reduction_factor,
p_encoder_dropout=config.model.p_encoder_dropout,
p_prenet_dropout=config.model.p_prenet_dropout,
p_attention_dropout=config.model.p_attention_dropout,
p_decoder_dropout=config.model.p_decoder_dropout,
p_postnet_dropout=config.model.p_postnet_dropout,
d_global_condition=config.model.d_global_condition,
use_stop_token=config.model.use_stop_token, )
if self.parallel:
model = paddle.DataParallel(model)
grad_clip = paddle.nn.ClipGradByGlobalNorm(
config.training.grad_clip_thresh)
optimizer = paddle.optimizer.Adam(
learning_rate=config.training.lr,
parameters=model.parameters(),
weight_decay=paddle.regularizer.L2Decay(
config.training.weight_decay),
grad_clip=grad_clip)
criterion = Tacotron2Loss(
use_stop_token_loss=config.model.use_stop_token,
use_guided_attention_loss=config.model.use_guided_attention_loss,
sigma=config.model.guided_attention_loss_sigma)
self.model = model
self.optimizer = optimizer
self.criterion = criterion
def setup_dataloader(self):
args = self.args
config = self.config
ljspeech_dataset = AiShell3(args.data)
valid_set, train_set = dataset.split(ljspeech_dataset,
config.data.valid_size)
batch_fn = collate_aishell3_examples
if not self.parallel:
self.train_loader = DataLoader(
train_set,
batch_size=config.data.batch_size,
shuffle=True,
drop_last=True,
collate_fn=batch_fn)
else:
sampler = DistributedBatchSampler(
train_set,
batch_size=config.data.batch_size,
shuffle=True,
drop_last=True)
self.train_loader = DataLoader(
train_set, batch_sampler=sampler, collate_fn=batch_fn)
self.valid_loader = DataLoader(
valid_set,
batch_size=config.data.batch_size,
shuffle=False,
drop_last=False,
collate_fn=batch_fn)
self.test_loader = DataLoader(
valid_set,
batch_size=1,
shuffle=False,
drop_last=False,
collate_fn=batch_fn)
def main_sp(config, args):
exp = Experiment(config, args)
exp.setup()
exp.resume_or_load()
if not args.test:
exp.run()
else:
exp.eval()
def main(config, args):
if args.nprocs > 1 and args.device == "gpu":
dist.spawn(main_sp, args=(config, args), nprocs=args.nprocs)
else:
main_sp(config, args)
if __name__ == "__main__":
config = get_cfg_defaults()
parser = default_argument_parser()
parser.add_argument("--test", action="store_true")
args = parser.parse_args()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
print(args)
main(config, args)

File diff suppressed because one or more lines are too long

@ -0,0 +1,226 @@
# Speedyspeech with CSMSC
This example contains code used to train a [Speedyspeech](http://arxiv.org/abs/2008.03802) model with the [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html). NOTE that we only implement the student part of the Speedyspeech model. The ground-truth alignment used to train the model is extracted from the dataset using [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner).
## Dataset
### Download and Extract the dataset
Download CSMSC from its [Official Website](https://test.data-baker.com/data/index/source).
### Get MFA result of CSMSC and Extract it
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for SPEEDYSPEECH.
You can download it from [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) in our repo.
## Preprocess the dataset
Assume the path to the dataset is `~/datasets/BZNSYP`.
Assume the path to the MFA result of CSMSC is `./baker_alignment_tone`.
Run the command below to preprocess the dataset.
```bash
./preprocess.sh
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the log-magnitude mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrograms. The statistics used to normalize the spectrograms are computed from the training set and are stored in `dump/train/feats_stats.npy`.
There is also a `metadata.jsonl` in each subfolder. It is a table-like file that contains the phones, tones, durations, spectrogram path, and id of each utterance.
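As an illustration, each line of `metadata.jsonl` is one JSON record, and the normalization applied to the raw features is a standard mean/std normalization using the statistics in `dump/train/feats_stats.npy`. The field names and the stats-file layout assumed below are illustrative; check the generated files for the exact keys.
```python
import json
import numpy as np

# Each line of metadata.jsonl is one JSON record describing an utterance.
with open("dump/train/norm/metadata.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(records[0].keys())  # e.g. utterance id, phones, tones, durations, feature path

# Mean/std normalization with the training-set statistics,
# assuming feats_stats.npy stores the per-dimension mean in row 0 and std in row 1.
stats = np.load("dump/train/feats_stats.npy")
mean, std = stats[0], stats[1]

def normalize(feats):
    return (feats - mean) / std
```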
## Train the model
`./run.sh` calls `../train.py`.
```bash
./run.sh
```
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--device DEVICE] [--nprocs NPROCS] [--verbose VERBOSE]
[--use-relative-path USE_RELATIVE_PATH]
[--phones-dict PHONES_DICT] [--tones-dict TONES_DICT]
Train a Speedyspeech model with sigle speaker dataset.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--device DEVICE device type to use.
--nprocs NPROCS number of processes.
--verbose VERBOSE verbose.
--use-relative-path USE_RELATIVE_PATH
whether use relative path in metadata
--phones-dict PHONES_DICT
phone vocabulary file.
--tones-dict TONES_DICT
tone vocabulary file.
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--device` is the type of the device to run the experiment, 'cpu' or 'gpu' are supported.
5. `--nprocs` is the number of processes to run in parallel; note that nprocs > 1 is only supported when `--device` is 'gpu'.
6. `--phones-dict` is the path of the phone vocabulary file.
7. `--tones-dict` is the path of the tone vocabulary file.
## Pretrained Model
Pretrained SpeedySpeech model with no silence in the edge of audios. [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/speedyspeech_nosil_baker_ckpt_0.5.zip)
SpeedySpeech checkpoint contains files listed below.
```text
speedyspeech_nosil_baker_ckpt_0.5
├── default.yaml # default config used to train speedyspeech
├── feats_stats.npy # statistics used to normalize spectrogram when training speedyspeech
├── phone_id_map.txt # phone vocabulary file when training speedyspeech
├── snapshot_iter_11400.pdz # model parameters and optimizer states
└── tone_id_map.txt # tone vocabulary file when training speedyspeech
```
## Synthesize
We use [parallel wavegan](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/parallelwave_gan/baker) as the neural vocoder.
Download pretrained parallel wavegan model from [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip) and unzip it.
```bash
unzip pwg_baker_ckpt_0.4.zip
```
Parallel WaveGAN checkpoint contains files listed below.
```text
pwg_baker_ckpt_0.4
├── pwg_default.yaml # default config used to train parallel wavegan
├── pwg_snapshot_iter_400000.pdz # model parameters of parallel wavegan
└── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
```
`synthesize.sh` calls `../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
./synthesize.sh
```
```text
usage: synthesize.py [-h] [--speedyspeech-config SPEEDYSPEECH_CONFIG]
[--speedyspeech-checkpoint SPEEDYSPEECH_CHECKPOINT]
[--speedyspeech-stat SPEEDYSPEECH_STAT]
[--pwg-config PWG_CONFIG]
[--pwg-checkpoint PWG_CHECKPOINT] [--pwg-stat PWG_STAT]
[--phones-dict PHONES_DICT] [--tones-dict TONES_DICT]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR]
[--inference-dir INFERENCE_DIR] [--device DEVICE]
[--verbose VERBOSE]
Synthesize with speedyspeech & parallel wavegan.
optional arguments:
-h, --help show this help message and exit
--speedyspeech-config SPEEDYSPEECH_CONFIG
config file for speedyspeech.
--speedyspeech-checkpoint SPEEDYSPEECH_CHECKPOINT
speedyspeech checkpoint to load.
--speedyspeech-stat SPEEDYSPEECH_STAT
mean and standard deviation used to normalize
spectrogram when training speedyspeech.
--pwg-config PWG_CONFIG
config file for parallelwavegan.
--pwg-checkpoint PWG_CHECKPOINT
parallel wavegan generator parameters to load.
--pwg-stat PWG_STAT mean and standard deviation used to normalize
spectrogram when training speedyspeech.
--phones-dict PHONES_DICT
phone vocabulary file.
--tones-dict TONES_DICT
tone vocabulary file.
--test-metadata TEST_METADATA
test metadata
--output-dir OUTPUT_DIR
output dir
--inference-dir INFERENCE_DIR
dir to save inference models
--device DEVICE device type to use
--verbose VERBOSE verbose
```
`synthesize_e2e.sh` calls `synthesize_e2e.py`, which can synthesize waveform from text file.
```bash
./synthesize_e2e.sh
```
```text
usage: synthesize_e2e.py [-h] [--speedyspeech-config SPEEDYSPEECH_CONFIG]
[--speedyspeech-checkpoint SPEEDYSPEECH_CHECKPOINT]
[--speedyspeech-stat SPEEDYSPEECH_STAT]
[--pwg-config PWG_CONFIG]
[--pwg-checkpoint PWG_CHECKPOINT]
[--pwg-stat PWG_STAT] [--text TEXT]
[--phones-dict PHONES_DICT] [--tones-dict TONES_DICT]
[--output-dir OUTPUT_DIR]
[--inference-dir INFERENCE_DIR] [--device DEVICE]
[--verbose VERBOSE]
Synthesize with speedyspeech & parallel wavegan.
optional arguments:
-h, --help show this help message and exit
--speedyspeech-config SPEEDYSPEECH_CONFIG
config file for speedyspeech.
--speedyspeech-checkpoint SPEEDYSPEECH_CHECKPOINT
speedyspeech checkpoint to load.
--speedyspeech-stat SPEEDYSPEECH_STAT
mean and standard deviation used to normalize
spectrogram when training speedyspeech.
--pwg-config PWG_CONFIG
config file for parallelwavegan.
--pwg-checkpoint PWG_CHECKPOINT
parallel wavegan checkpoint to load.
  --pwg-stat PWG_STAT   mean and standard deviation used to normalize
                        spectrogram when training parallel wavegan.
--text TEXT text to synthesize, a 'utt_id sentence' pair per line
--phones-dict PHONES_DICT
phone vocabulary file.
--tones-dict TONES_DICT
tone vocabulary file.
--output-dir OUTPUT_DIR
output dir
--inference-dir INFERENCE_DIR
dir to save inference models
--device DEVICE device type to use
--verbose VERBOSE verbose
```
1. `--speedyspeech-config`, `--speedyspeech-checkpoint` and `--speedyspeech-stat` are arguments for speedyspeech, which correspond to the 3 files in the speedyspeech pretrained model.
2. `--pwg-config`, `--pwg-checkpoint` and `--pwg-stat` are arguments for parallel wavegan, which correspond to the 3 files in the parallel wavegan pretrained model.
3. `--text` is the text file, which contains the sentences to synthesize.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--inference-dir` is the directory to save the exported models, which can be used with Paddle Inference.
6. `--device` is the type of device to run synthesis; 'cpu' and 'gpu' are supported, and 'gpu' is recommended for faster synthesis.
7. `--phones-dict` is the path of the phone vocabulary file.
8. `--tones-dict` is the path of the tone vocabulary file.
You can use the following script to synthesize waveforms for `../sentences.txt` using the pretrained speedyspeech and parallel wavegan models.
```bash
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 synthesize_e2e.py \
--speedyspeech-config=speedyspeech_nosil_baker_ckpt_0.5/default.yaml \
--speedyspeech-checkpoint=speedyspeech_nosil_baker_ckpt_0.5/snapshot_iter_11400.pdz \
--speedyspeech-stat=speedyspeech_nosil_baker_ckpt_0.5/feats_stats.npy \
--pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
--pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
--text=../sentences.txt \
--output-dir=exp/default/test_e2e \
--inference-dir=exp/default/inference \
--device="gpu" \
--phones-dict=speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt \
--tones-dict=speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt
```
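If the script above runs correctly, the synthesized audio and the exported static-graph models end up in the output and inference directories (a quick check, assuming the default paths used above):
```bash
ls exp/default/test_e2e    # one wav per line of ../sentences.txt, e.g. 001.wav, 002.wav, ...
ls exp/default/inference   # speedyspeech.pdmodel, speedyspeech.pdiparams, pwg.pdmodel, pwg.pdiparams, ...
```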

@ -0,0 +1,50 @@
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size.
n_shift: 300 # Hop size.
win_length: 1200 # Window length.
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation.
fmax: 7600 # Maximum frequency in mel basis calculation.
###########################################################
# DATA SETTING #
###########################################################
batch_size: 64
num_workers: 4
###########################################################
# MODEL SETTING #
###########################################################
model:
encoder_hidden_size: 128
encoder_kernel_size: 3
encoder_dilations: [1, 3, 9, 27, 1, 3, 9, 27, 1, 1]
duration_predictor_hidden_size: 128
decoder_hidden_size: 128
decoder_output_size: 80
decoder_kernel_size: 3
decoder_dilations: [1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 1]
###########################################################
# OPTIMIZER SETTING #
###########################################################
optimizer:
optim: adam # optimizer type
learning_rate: 0.002 # learning rate
max_grad_norm: 1
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 200
num_snapshots: 5
###########################################################
# OTHER SETTING #
###########################################################
seed: 10086

@ -0,0 +1,146 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
from pathlib import Path
import soundfile as sf
from paddle import inference
from parakeet.frontend.zh_frontend import Frontend
def main():
parser = argparse.ArgumentParser(
description="Paddle Infernce with speedyspeech & parallel wavegan.")
parser.add_argument(
"--inference-dir", type=str, help="dir to save inference models")
parser.add_argument(
"--text",
type=str,
help="text to synthesize, a 'utt_id sentence' pair per line")
parser.add_argument("--output-dir", type=str, help="output dir")
parser.add_argument(
"--enable-auto-log", action="store_true", help="use auto log")
parser.add_argument(
"--phones-dict",
type=str,
default="phones.txt",
help="phone vocabulary file.")
parser.add_argument(
"--tones-dict",
type=str,
default="tones.txt",
help="tone vocabulary file.")
args, _ = parser.parse_known_args()
frontend = Frontend(
phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
print("frontend done!")
speedyspeech_config = inference.Config(
str(Path(args.inference_dir) / "speedyspeech.pdmodel"),
str(Path(args.inference_dir) / "speedyspeech.pdiparams"))
speedyspeech_config.enable_use_gpu(100, 0)
speedyspeech_config.enable_memory_optim()
speedyspeech_predictor = inference.create_predictor(speedyspeech_config)
pwg_config = inference.Config(
str(Path(args.inference_dir) / "pwg.pdmodel"),
str(Path(args.inference_dir) / "pwg.pdiparams"))
pwg_config.enable_use_gpu(100, 0)
pwg_config.enable_memory_optim()
pwg_predictor = inference.create_predictor(pwg_config)
if args.enable_auto_log:
import auto_log
os.makedirs("output", exist_ok=True)
pid = os.getpid()
logger = auto_log.AutoLogger(
model_name="speedyspeech",
model_precision='float32',
batch_size=1,
data_shape="dynamic",
save_path="./output/auto_log.log",
inference_config=speedyspeech_config,
pids=pid,
process_name=None,
gpu_ids=0,
time_keys=['preprocess_time', 'inference_time', 'postprocess_time'],
warmup=0)
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
sentences = []
with open(args.text, 'rt') as f:
for line in f:
utt_id, sentence = line.strip().split()
sentences.append((utt_id, sentence))
for utt_id, sentence in sentences:
if args.enable_auto_log:
logger.times.start()
input_ids = frontend.get_input_ids(
sentence, merge_sentences=True, get_tone_ids=True)
phone_ids = input_ids["phone_ids"]
tone_ids = input_ids["tone_ids"]
phones = phone_ids[0]
tones = tone_ids[0]
if args.enable_auto_log:
logger.times.stamp()
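        # acoustic model: map phone ids and tone ids to a mel spectrogram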
input_names = speedyspeech_predictor.get_input_names()
phones_handle = speedyspeech_predictor.get_input_handle(input_names[0])
tones_handle = speedyspeech_predictor.get_input_handle(input_names[1])
phones_handle.reshape(phones.shape)
phones_handle.copy_from_cpu(phones)
tones_handle.reshape(tones.shape)
tones_handle.copy_from_cpu(tones)
speedyspeech_predictor.run()
output_names = speedyspeech_predictor.get_output_names()
output_handle = speedyspeech_predictor.get_output_handle(
output_names[0])
output_data = output_handle.copy_to_cpu()
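        # vocoder: map the predicted mel spectrogram to a waveform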
input_names = pwg_predictor.get_input_names()
mel_handle = pwg_predictor.get_input_handle(input_names[0])
mel_handle.reshape(output_data.shape)
mel_handle.copy_from_cpu(output_data)
pwg_predictor.run()
output_names = pwg_predictor.get_output_names()
output_handle = pwg_predictor.get_output_handle(output_names[0])
wav = output_data = output_handle.copy_to_cpu()
if args.enable_auto_log:
logger.times.stamp()
sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=24000)
if args.enable_auto_log:
logger.times.end(stamp=True)
print(f"{utt_id} done!")
if args.enable_auto_log:
logger.report()
if __name__ == "__main__":
main()

@ -0,0 +1,8 @@
#!/bin/bash
python3 inference.py \
--inference-dir=exp/default/inference \
--text=../sentences.txt \
--output-dir=exp/default/pd_infer_out \
--phones-dict=dump/phone_id_map.txt \
--tones-dict=dump/tone_id_map.txt

@ -0,0 +1,65 @@
#!/bin/bash
stage=0
stop_stage=100
export MAIN_ROOT=`realpath ${PWD}/../../../`
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./baker_alignment_tone \
--output=durations.txt \
--config=conf/default.yaml
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "Extract features ..."
python3 ../preprocess.py \
--dataset=baker \
--rootdir=~/datasets/BZNSYP/ \
--dumpdir=dump \
--dur-file=durations.txt \
--config=conf/default.yaml \
--num-cpu=20 \
--cut-sil=True \
--use-relative-path=True
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats" \
--use-relative-path=True
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # normalize and convert phone/tone to id; dev and test should use train's stats
echo "Normalize ..."
python3 ../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--tones-dict=dump/tone_id_map.txt \
--use-relative-path=True
python3 ../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--tones-dict=dump/tone_id_map.txt \
--use-relative-path=True
python3 ../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--tones-dict=dump/tone_id_map.txt \
--use-relative-path=True
fi

@ -0,0 +1,12 @@
#!/bin/bash
python ../train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=conf/default.yaml \
--output-dir=exp/default \
--nprocs=2 \
--phones-dict=dump/phone_id_map.txt \
--tones-dict=dump/tone_id_map.txt \
--use-relative-path=True

@ -0,0 +1,16 @@
#!/bin/bash
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ../synthesize.py \
--speedyspeech-config=conf/default.yaml \
--speedyspeech-checkpoint=exp/default/checkpoints/snapshot_iter_11400.pdz \
--speedyspeech-stat=dump/train/feats_stats.npy \
--pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
--pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
--test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=exp/default/test \
--inference-dir=exp/default/inference \
--phones-dict=dump/phone_id_map.txt \
--tones-dict=dump/tone_id_map.txt \
--device="gpu"

@ -0,0 +1,196 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
from pathlib import Path
import numpy as np
import soundfile as sf
import paddle
import yaml
from paddle import jit
from paddle.static import InputSpec
from parakeet.frontend.zh_frontend import Frontend
from parakeet.models.speedyspeech import SpeedySpeech
from parakeet.models.speedyspeech import SpeedySpeechInference
from parakeet.models.parallel_wavegan import PWGGenerator
from parakeet.models.parallel_wavegan import PWGInference
from parakeet.modules.normalizer import ZScore
from yacs.config import CfgNode
def evaluate(args, speedyspeech_config, pwg_config):
# dataloader has been too verbose
logging.getLogger("DataLoader").disabled = True
# construct dataset for evaluation
sentences = []
with open(args.text, 'rt') as f:
for line in f:
utt_id, sentence = line.strip().split()
sentences.append((utt_id, sentence))
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
print("vocab_size:", vocab_size)
with open(args.tones_dict, "r") as f:
tone_id = [line.strip().split() for line in f.readlines()]
tone_size = len(tone_id)
print("tone_size:", tone_size)
model = SpeedySpeech(
vocab_size=vocab_size,
tone_size=tone_size,
**speedyspeech_config["model"])
model.set_state_dict(
paddle.load(args.speedyspeech_checkpoint)["main_params"])
model.eval()
vocoder = PWGGenerator(**pwg_config["generator_params"])
vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
vocoder.remove_weight_norm()
vocoder.eval()
print("model done!")
stat = np.load(args.speedyspeech_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
speedyspeech_normalizer = ZScore(mu, std)
stat = np.load(args.pwg_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)
speedyspeech_inference = SpeedySpeechInference(speedyspeech_normalizer,
model)
speedyspeech_inference.eval()
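    # convert the dygraph inference model to a static graph, save it to
    # --inference-dir for deployment, then reload it for synthesis below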
speedyspeech_inference = jit.to_static(
speedyspeech_inference,
input_spec=[
InputSpec([-1], dtype=paddle.int64), InputSpec(
[-1], dtype=paddle.int64)
])
paddle.jit.save(speedyspeech_inference,
os.path.join(args.inference_dir, "speedyspeech"))
speedyspeech_inference = paddle.jit.load(
os.path.join(args.inference_dir, "speedyspeech"))
pwg_inference = PWGInference(pwg_normalizer, vocoder)
pwg_inference.eval()
pwg_inference = jit.to_static(
pwg_inference, input_spec=[
InputSpec([-1, 80], dtype=paddle.float32),
])
paddle.jit.save(pwg_inference, os.path.join(args.inference_dir, "pwg"))
pwg_inference = paddle.jit.load(os.path.join(args.inference_dir, "pwg"))
frontend = Frontend(
phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
print("frontend done!")
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
for utt_id, sentence in sentences:
input_ids = frontend.get_input_ids(
sentence, merge_sentences=True, get_tone_ids=True)
phone_ids = input_ids["phone_ids"]
tone_ids = input_ids["tone_ids"]
flags = 0
for i in range(len(phone_ids)):
part_phone_ids = phone_ids[i]
part_tone_ids = tone_ids[i]
with paddle.no_grad():
mel = speedyspeech_inference(part_phone_ids, part_tone_ids)
temp_wav = pwg_inference(mel)
if flags == 0:
wav = temp_wav
flags = 1
else:
wav = paddle.concat([wav, temp_wav])
sf.write(
output_dir / (utt_id + ".wav"),
wav.numpy(),
samplerate=speedyspeech_config.fs)
print(f"{utt_id} done!")
def main():
    # parse args and config, then run synthesis
parser = argparse.ArgumentParser(
description="Synthesize with speedyspeech & parallel wavegan.")
parser.add_argument(
"--speedyspeech-config", type=str, help="config file for speedyspeech.")
parser.add_argument(
"--speedyspeech-checkpoint",
type=str,
help="speedyspeech checkpoint to load.")
parser.add_argument(
"--speedyspeech-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training speedyspeech."
)
parser.add_argument(
"--pwg-config", type=str, help="config file for parallelwavegan.")
parser.add_argument(
"--pwg-checkpoint",
type=str,
help="parallel wavegan checkpoint to load.")
parser.add_argument(
"--pwg-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training speedyspeech."
)
parser.add_argument(
"--text",
type=str,
help="text to synthesize, a 'utt_id sentence' pair per line")
parser.add_argument(
"--phones-dict", type=str, default=None, help="phone vocabulary file.")
parser.add_argument(
"--tones-dict", type=str, default=None, help="tone vocabulary file.")
parser.add_argument("--output-dir", type=str, help="output dir")
parser.add_argument(
"--inference-dir", type=str, help="dir to save inference models")
parser.add_argument(
"--device", type=str, default="gpu", help="device type to use")
parser.add_argument("--verbose", type=int, default=1, help="verbose")
args, _ = parser.parse_known_args()
paddle.set_device(args.device)
with open(args.speedyspeech_config) as f:
speedyspeech_config = CfgNode(yaml.safe_load(f))
with open(args.pwg_config) as f:
pwg_config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(speedyspeech_config)
print(pwg_config)
evaluate(args, speedyspeech_config, pwg_config)
if __name__ == "__main__":
main()

@ -0,0 +1,16 @@
#!/bin/bash
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python synthesize_e2e.py \
--speedyspeech-config=conf/default.yaml \
--speedyspeech-checkpoint=exp/default/checkpoints/snapshot_iter_11400.pdz \
--speedyspeech-stat=dump/train/feats_stats.npy \
--pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
--pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
--text=../sentences.txt \
--output-dir=exp/default/test_e2e \
--inference-dir=exp/default/inference \
--device="gpu" \
--phones-dict=dump/phone_id_map.txt \
--tones-dict=dump/tone_id_map.txt

@ -0,0 +1,159 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Normalize feature files and dump them."""
import argparse
import logging
from operator import itemgetter
from pathlib import Path
import jsonlines
import numpy as np
from parakeet.datasets.data_table import DataTable
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
def main():
"""Run preprocessing process."""
parser = argparse.ArgumentParser(
description="Normalize dumped raw features (See detail in parallel_wavegan/bin/normalize.py)."
)
parser.add_argument(
"--metadata",
type=str,
required=True,
help="directory including feature files to be normalized. "
"you need to specify either *-scp or rootdir.")
parser.add_argument(
"--dumpdir",
type=str,
required=True,
help="directory to dump normalized feature files.")
parser.add_argument(
"--stats", type=str, required=True, help="statistics file.")
parser.add_argument(
"--phones-dict", type=str, default=None, help="phone vocabulary file.")
parser.add_argument(
"--tones-dict", type=str, default=None, help="tone vocabulary file.")
parser.add_argument(
"--verbose",
type=int,
default=1,
help="logging level. higher is more logging. (default=1)")
def str2bool(str):
return True if str.lower() == 'true' else False
parser.add_argument(
"--use-relative-path",
type=str2bool,
default=False,
help="whether use relative path in metadata")
args = parser.parse_args()
# set logger
if args.verbose > 1:
logging.basicConfig(
level=logging.DEBUG,
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
)
elif args.verbose > 0:
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
)
else:
logging.basicConfig(
level=logging.WARN,
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
)
logging.warning('Skip DEBUG/INFO messages')
dumpdir = Path(args.dumpdir).expanduser()
# use absolute path
dumpdir = dumpdir.resolve()
dumpdir.mkdir(parents=True, exist_ok=True)
# get dataset
with jsonlines.open(args.metadata, 'r') as reader:
metadata = list(reader)
if args.use_relative_path:
        # if use_relative_path was used in preprocessing, convert it to an absolute path here
metadata_dir = Path(args.metadata).parent
for item in metadata:
item["feats"] = str(metadata_dir / item["feats"])
dataset = DataTable(
metadata, converters={
'feats': np.load,
})
logging.info(f"The number of files = {len(dataset)}.")
# restore scaler
scaler = StandardScaler()
scaler.mean_ = np.load(args.stats)[0]
scaler.scale_ = np.load(args.stats)[1]
# from version 0.23.0, this information is needed
scaler.n_features_in_ = scaler.mean_.shape[0]
vocab_phones = {}
with open(args.phones_dict, 'rt') as f:
phn_id = [line.strip().split() for line in f.readlines()]
for phn, id in phn_id:
vocab_phones[phn] = int(id)
vocab_tones = {}
with open(args.tones_dict, 'rt') as f:
tone_id = [line.strip().split() for line in f.readlines()]
for tone, id in tone_id:
vocab_tones[tone] = int(id)
# process each file
output_metadata = []
for item in tqdm(dataset):
utt_id = item['utt_id']
mel = item['feats']
# normalize
mel = scaler.transform(mel)
# save
mel_path = dumpdir / f"{utt_id}_feats.npy"
np.save(mel_path, mel.astype(np.float32), allow_pickle=False)
phone_ids = [vocab_phones[p] for p in item['phones']]
tone_ids = [vocab_tones[p] for p in item['tones']]
if args.use_relative_path:
# convert absolute path to relative path:
mel_path = mel_path.relative_to(dumpdir)
output_metadata.append({
'utt_id': utt_id,
'phones': phone_ids,
'tones': tone_ids,
'num_phones': item['num_phones'],
'num_frames': item['num_frames'],
'durations': item['durations'],
'feats': str(mel_path),
})
output_metadata.sort(key=itemgetter('utt_id'))
output_metadata_path = Path(args.dumpdir) / "metadata.jsonl"
with jsonlines.open(output_metadata_path, 'w') as writer:
for item in output_metadata:
writer.write(item)
logging.info(f"metadata dumped into {output_metadata_path}")
if __name__ == "__main__":
main()

@ -0,0 +1,293 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from operator import itemgetter
from typing import Any
from typing import Dict
from typing import List
import argparse
import jsonlines
import librosa
import numpy as np
import re
import tqdm
import yaml
from concurrent.futures import ThreadPoolExecutor
from parakeet.data.get_feats import LogMelFBank
from parakeet.datasets.preprocess_utils import compare_duration_and_mel_length
from parakeet.datasets.preprocess_utils import get_phones_tones
from parakeet.datasets.preprocess_utils import get_phn_dur
from parakeet.datasets.preprocess_utils import merge_silence
from pathlib import Path
from yacs.config import CfgNode
def process_sentence(config: Dict[str, Any],
fp: Path,
sentences: Dict,
output_dir: Path,
mel_extractor=None,
cut_sil: bool=True):
utt_id = fp.stem
record = None
if utt_id in sentences:
# reading, resampling may occur
wav, _ = librosa.load(str(fp), sr=config.fs)
if len(wav.shape) != 1 or np.abs(wav).max() > 1.0:
return record
        assert len(wav.shape) == 1, f"{utt_id} is not mono-channel audio."
        assert np.abs(wav).max(
        ) <= 1.0, f"{utt_id} does not seem to be normalized 16-bit PCM."
phones = sentences[utt_id][0]
durations = sentences[utt_id][1]
speaker = sentences[utt_id][2]
d_cumsum = np.pad(np.array(durations).cumsum(0), (1, 0), 'constant')
        # slightly less precise than using *.TextGrid directly
times = librosa.frames_to_time(
d_cumsum, sr=config.fs, hop_length=config.n_shift)
if cut_sil:
start = 0
end = d_cumsum[-1]
if phones[0] == "sil" and len(durations) > 1:
start = times[1]
durations = durations[1:]
phones = phones[1:]
if phones[-1] == 'sil' and len(durations) > 1:
end = times[-2]
durations = durations[:-1]
phones = phones[:-1]
sentences[utt_id][0] = phones
sentences[utt_id][1] = durations
start, end = librosa.time_to_samples([start, end], sr=config.fs)
wav = wav[start:end]
# extract mel feats
logmel = mel_extractor.get_log_mel_fbank(wav)
# change duration according to mel_length
compare_duration_and_mel_length(sentences, utt_id, logmel)
labels = sentences[utt_id][0]
        # separate phones and tones from the labels
phones = []
tones = []
for label in labels:
# split tone from finals
match = re.match(r'^(\w+)([012345])$', label)
if match:
phones.append(match.group(1))
tones.append(match.group(2))
else:
phones.append(label)
tones.append('0')
durations = sentences[utt_id][1]
num_frames = logmel.shape[0]
assert sum(durations) == num_frames
assert len(phones) == len(tones) == len(durations)
mel_path = output_dir / (utt_id + "_feats.npy")
np.save(mel_path, logmel) # (num_frames, n_mels)
record = {
"utt_id": utt_id,
"phones": phones,
"tones": tones,
"num_phones": len(phones),
"num_frames": num_frames,
"durations": durations,
"feats": str(mel_path), # Path object
}
return record
def process_sentences(config,
fps: List[Path],
sentences: Dict,
output_dir: Path,
mel_extractor=None,
nprocs: int=1,
cut_sil: bool=True,
use_relative_path: bool=False):
if nprocs == 1:
results = []
for fp in tqdm.tqdm(fps, total=len(fps)):
record = process_sentence(config, fp, sentences, output_dir,
mel_extractor, cut_sil)
if record:
results.append(record)
else:
with ThreadPoolExecutor(nprocs) as pool:
futures = []
with tqdm.tqdm(total=len(fps)) as progress:
for fp in fps:
future = pool.submit(process_sentence, config, fp,
sentences, output_dir, mel_extractor,
cut_sil)
future.add_done_callback(lambda p: progress.update())
futures.append(future)
results = []
for ft in futures:
record = ft.result()
if record:
results.append(record)
results.sort(key=itemgetter("utt_id"))
output_dir = Path(output_dir)
metadata_path = output_dir / "metadata.jsonl"
# NOTE: use relative path to the meta jsonlines file for Full Chain Project
with jsonlines.open(metadata_path, 'w') as writer:
for item in results:
if use_relative_path:
item["feats"] = str(Path(item["feats"]).relative_to(output_dir))
writer.write(item)
print("Done")
def main():
# parse config and args
parser = argparse.ArgumentParser(
description="Preprocess audio and then extract features.")
parser.add_argument(
"--dataset",
default="baker",
type=str,
help="name of dataset, should in {baker} now")
parser.add_argument(
"--rootdir", default=None, type=str, help="directory to dataset.")
parser.add_argument(
"--dumpdir",
type=str,
required=True,
help="directory to dump feature files.")
parser.add_argument(
"--dur-file",
default=None,
type=str,
help="path to baker durations.txt.")
parser.add_argument("--config", type=str, help="fastspeech2 config file.")
parser.add_argument(
"--verbose",
type=int,
default=1,
help="logging level. higher is more logging. (default=1)")
parser.add_argument(
"--num-cpu", type=int, default=1, help="number of process.")
def str2bool(str):
return True if str.lower() == 'true' else False
parser.add_argument(
"--cut-sil",
type=str2bool,
default=True,
help="whether cut sil in the edge of audio")
parser.add_argument(
"--use-relative-path",
type=str2bool,
default=False,
help="whether use relative path in metadata")
args = parser.parse_args()
rootdir = Path(args.rootdir).expanduser()
dumpdir = Path(args.dumpdir).expanduser()
# use absolute path
dumpdir = dumpdir.resolve()
dumpdir.mkdir(parents=True, exist_ok=True)
dur_file = Path(args.dur_file).expanduser()
assert rootdir.is_dir()
assert dur_file.is_file()
with open(args.config, 'rt') as f:
config = CfgNode(yaml.safe_load(f))
if args.verbose > 1:
print(vars(args))
print(config)
sentences, speaker_set = get_phn_dur(dur_file)
merge_silence(sentences)
phone_id_map_path = dumpdir / "phone_id_map.txt"
tone_id_map_path = dumpdir / "tone_id_map.txt"
get_phones_tones(sentences, phone_id_map_path, tone_id_map_path,
args.dataset)
if args.dataset == "baker":
wav_files = sorted(list((rootdir / "Wave").rglob("*.wav")))
# split data into 3 sections
num_train = 9800
num_dev = 100
train_wav_files = wav_files[:num_train]
dev_wav_files = wav_files[num_train:num_train + num_dev]
test_wav_files = wav_files[num_train + num_dev:]
train_dump_dir = dumpdir / "train" / "raw"
train_dump_dir.mkdir(parents=True, exist_ok=True)
dev_dump_dir = dumpdir / "dev" / "raw"
dev_dump_dir.mkdir(parents=True, exist_ok=True)
test_dump_dir = dumpdir / "test" / "raw"
test_dump_dir.mkdir(parents=True, exist_ok=True)
# Extractor
mel_extractor = LogMelFBank(
sr=config.fs,
n_fft=config.n_fft,
hop_length=config.n_shift,
win_length=config.win_length,
window=config.window,
n_mels=config.n_mels,
fmin=config.fmin,
fmax=config.fmax)
# process for the 3 sections
if train_wav_files:
process_sentences(
config,
train_wav_files,
sentences,
train_dump_dir,
mel_extractor,
nprocs=args.num_cpu,
cut_sil=args.cut_sil,
use_relative_path=args.use_relative_path)
if dev_wav_files:
process_sentences(
config,
dev_wav_files,
sentences,
dev_dump_dir,
mel_extractor,
cut_sil=args.cut_sil,
use_relative_path=args.use_relative_path)
if test_wav_files:
process_sentences(
config,
test_wav_files,
sentences,
test_dump_dir,
mel_extractor,
nprocs=args.num_cpu,
cut_sil=args.cut_sil,
use_relative_path=args.use_relative_path)
if __name__ == "__main__":
main()

@ -0,0 +1,16 @@
001 凯莫瑞安联合体的经济崩溃,迫在眉睫。
002 对于所有想要离开那片废土,去寻找更美好生活的人来说。
003 克哈,是你们所有人安全的港湾。
004 为了保护尤摩扬人民不受异虫的残害,我所做的,比他们自己的领导委员会都多。
005 无论他们如何诽谤我,我将继续为所有泰伦人的最大利益,而努力奋斗。
006 身为你们的元首,我带领泰伦人实现了人类统治领地和经济的扩张。
007 我们将继续成长,用行动回击那些只会说风凉话,不愿意和我们相向而行的害群之马。
008 帝国武装力量,无数的优秀儿女,正时刻守卫着我们的家园大门,但是他们孤木难支。
009 凡是今天应征入伍者,所获的所有刑罚罪责,减半。
010 激进分子和异见者希望你们一听见枪声,就背弃多年的和平与繁荣。
011 他们没有勇气和能力,带领人类穿越一个充满危险的星系。
012 法治是我们的命脉,然而它却受到前所未有的挑战。
013 我将恢复我们帝国的荣光,绝不会向任何外星势力低头。
014 我已经驯服了异虫,荡平了星灵。如今它们的创造者,想要夺走我们拥有的一切。
015 永远记住,谁才是最能保护你们的人。
016 不要听信别人的谗言,我不是什么克隆人。

@ -0,0 +1,180 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import logging
import argparse
from pathlib import Path
import jsonlines
import numpy as np
import soundfile as sf
import paddle
import yaml
from paddle import jit
from paddle.static import InputSpec
from yacs.config import CfgNode
from parakeet.datasets.data_table import DataTable
from parakeet.models.speedyspeech import SpeedySpeech
from parakeet.models.speedyspeech import SpeedySpeechInference
from parakeet.models.parallel_wavegan import PWGGenerator
from parakeet.models.parallel_wavegan import PWGInference
from parakeet.modules.normalizer import ZScore
def evaluate(args, speedyspeech_config, pwg_config):
# dataloader has been too verbose
logging.getLogger("DataLoader").disabled = True
# construct dataset for evaluation
with jsonlines.open(args.test_metadata, 'r') as reader:
test_metadata = list(reader)
test_dataset = DataTable(
data=test_metadata, fields=["utt_id", "phones", "tones"])
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
print("vocab_size:", vocab_size)
with open(args.tones_dict, "r") as f:
tone_id = [line.strip().split() for line in f.readlines()]
tone_size = len(tone_id)
print("tone_size:", tone_size)
model = SpeedySpeech(
vocab_size=vocab_size,
tone_size=tone_size,
**speedyspeech_config["model"])
model.set_state_dict(
paddle.load(args.speedyspeech_checkpoint)["main_params"])
model.eval()
vocoder = PWGGenerator(**pwg_config["generator_params"])
vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
vocoder.remove_weight_norm()
vocoder.eval()
print("model done!")
stat = np.load(args.speedyspeech_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
speedyspeech_normalizer = ZScore(mu, std)
speedyspeech_normalizer.eval()
stat = np.load(args.pwg_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)
pwg_normalizer.eval()
speedyspeech_inference = SpeedySpeechInference(speedyspeech_normalizer,
model)
speedyspeech_inference.eval()
speedyspeech_inference = jit.to_static(
speedyspeech_inference,
input_spec=[
InputSpec([-1], dtype=paddle.int64), InputSpec(
[-1], dtype=paddle.int64)
])
paddle.jit.save(speedyspeech_inference,
os.path.join(args.inference_dir, "speedyspeech"))
speedyspeech_inference = paddle.jit.load(
os.path.join(args.inference_dir, "speedyspeech"))
pwg_inference = PWGInference(pwg_normalizer, vocoder)
pwg_inference.eval()
pwg_inference = jit.to_static(
pwg_inference, input_spec=[
InputSpec([-1, 80], dtype=paddle.float32),
])
paddle.jit.save(pwg_inference, os.path.join(args.inference_dir, "pwg"))
pwg_inference = paddle.jit.load(os.path.join(args.inference_dir, "pwg"))
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
for datum in test_dataset:
utt_id = datum["utt_id"]
phones = paddle.to_tensor(datum["phones"])
tones = paddle.to_tensor(datum["tones"])
with paddle.no_grad():
wav = pwg_inference(speedyspeech_inference(phones, tones))
sf.write(
output_dir / (utt_id + ".wav"),
wav.numpy(),
samplerate=speedyspeech_config.fs)
print(f"{utt_id} done!")
def main():
    # parse args and config, then run synthesis
parser = argparse.ArgumentParser(
description="Synthesize with speedyspeech & parallel wavegan.")
parser.add_argument(
"--speedyspeech-config", type=str, help="config file for speedyspeech.")
parser.add_argument(
"--speedyspeech-checkpoint",
type=str,
help="speedyspeech checkpoint to load.")
parser.add_argument(
"--speedyspeech-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training speedyspeech."
)
parser.add_argument(
"--pwg-config", type=str, help="config file for parallelwavegan.")
parser.add_argument(
"--pwg-checkpoint",
type=str,
help="parallel wavegan generator parameters to load.")
parser.add_argument(
"--pwg-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training speedyspeech."
)
parser.add_argument(
"--phones-dict", type=str, default=None, help="phone vocabulary file.")
parser.add_argument(
"--tones-dict", type=str, default=None, help="tone vocabulary file.")
parser.add_argument("--test-metadata", type=str, help="test metadata")
parser.add_argument("--output-dir", type=str, help="output dir")
parser.add_argument(
"--inference-dir", type=str, help="dir to save inference models")
parser.add_argument(
"--device", type=str, default="gpu", help="device type to use")
parser.add_argument("--verbose", type=int, default=1, help="verbose")
args, _ = parser.parse_known_args()
paddle.set_device(args.device)
with open(args.speedyspeech_config) as f:
speedyspeech_config = CfgNode(yaml.safe_load(f))
with open(args.pwg_config) as f:
pwg_config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(speedyspeech_config)
print(pwg_config)
evaluate(args, speedyspeech_config, pwg_config)
if __name__ == "__main__":
main()

@ -0,0 +1,224 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
import shutil
import jsonlines
import numpy as np
import paddle
import yaml
from paddle import distributed as dist
from paddle import DataParallel
from paddle.io import DataLoader
from paddle.io import DistributedBatchSampler
from parakeet.datasets.data_table import DataTable
from parakeet.datasets.am_batch_fn import speedyspeech_batch_fn
from parakeet.models.speedyspeech import SpeedySpeech
from parakeet.models.speedyspeech import SpeedySpeechEvaluator
from parakeet.models.speedyspeech import SpeedySpeechUpdater
from parakeet.training.extensions.snapshot import Snapshot
from parakeet.training.extensions.visualizer import VisualDL
from parakeet.training.optimizer import build_optimizers
from parakeet.training.seeding import seed_everything
from parakeet.training.trainer import Trainer
from pathlib import Path
from visualdl import LogWriter
from yacs.config import CfgNode
def train_sp(args, config):
# decides device type and whether to run in parallel
# setup running environment correctly
world_size = paddle.distributed.get_world_size()
if not paddle.is_compiled_with_cuda():
paddle.set_device("cpu")
else:
paddle.set_device("gpu")
if world_size > 1:
paddle.distributed.init_parallel_env()
# set the random seed, it is a must for multiprocess training
seed_everything(config.seed)
print(
f"rank: {dist.get_rank()}, pid: {os.getpid()}, parent_pid: {os.getppid()}",
)
# dataloader has been too verbose
logging.getLogger("DataLoader").disabled = True
# construct dataset for training and validation
with jsonlines.open(args.train_metadata, 'r') as reader:
train_metadata = list(reader)
if args.use_relative_path:
        # if use_relative_path was used in preprocessing, convert it to an absolute path here
metadata_dir = Path(args.train_metadata).parent
for item in train_metadata:
item["feats"] = str(metadata_dir / item["feats"])
train_dataset = DataTable(
data=train_metadata,
fields=[
"phones", "tones", "num_phones", "num_frames", "feats", "durations"
],
converters={
"feats": np.load,
}, )
with jsonlines.open(args.dev_metadata, 'r') as reader:
dev_metadata = list(reader)
if args.use_relative_path:
        # if use_relative_path was used in preprocessing, convert it to an absolute path here
metadata_dir = Path(args.dev_metadata).parent
for item in dev_metadata:
item["feats"] = str(metadata_dir / item["feats"])
dev_dataset = DataTable(
data=dev_metadata,
fields=[
"phones", "tones", "num_phones", "num_frames", "feats", "durations"
],
converters={
"feats": np.load,
}, )
# collate function and dataloader
train_sampler = DistributedBatchSampler(
train_dataset,
batch_size=config.batch_size,
shuffle=True,
drop_last=True)
print("samplers done!")
train_dataloader = DataLoader(
train_dataset,
batch_sampler=train_sampler,
collate_fn=speedyspeech_batch_fn,
num_workers=config.num_workers)
dev_dataloader = DataLoader(
dev_dataset,
shuffle=False,
drop_last=False,
batch_size=config.batch_size,
collate_fn=speedyspeech_batch_fn,
num_workers=config.num_workers)
print("dataloaders done!")
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
print("vocab_size:", vocab_size)
with open(args.tones_dict, "r") as f:
tone_id = [line.strip().split() for line in f.readlines()]
tone_size = len(tone_id)
print("tone_size:", tone_size)
model = SpeedySpeech(
vocab_size=vocab_size, tone_size=tone_size, **config["model"])
if world_size > 1:
model = DataParallel(model)
print("model done!")
optimizer = build_optimizers(model, **config["optimizer"])
print("optimizer done!")
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
if dist.get_rank() == 0:
config_name = args.config.split("/")[-1]
# copy conf to output_dir
shutil.copyfile(args.config, output_dir / config_name)
updater = SpeedySpeechUpdater(
model=model,
optimizer=optimizer,
dataloader=train_dataloader,
output_dir=output_dir)
trainer = Trainer(updater, (config.max_epoch, 'epoch'), output_dir)
evaluator = SpeedySpeechEvaluator(
model, dev_dataloader, output_dir=output_dir)
if dist.get_rank() == 0:
trainer.extend(evaluator, trigger=(1, "epoch"))
writer = LogWriter(str(output_dir))
trainer.extend(VisualDL(writer), trigger=(1, "iteration"))
trainer.extend(
Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
trainer.run()
def main():
# parse args and config and redirect to train_sp
parser = argparse.ArgumentParser(
description="Train a Speedyspeech model with sigle speaker dataset.")
parser.add_argument("--config", type=str, help="config file.")
parser.add_argument("--train-metadata", type=str, help="training data.")
parser.add_argument("--dev-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument(
"--device", type=str, default="gpu", help="device type to use.")
parser.add_argument(
"--nprocs", type=int, default=1, help="number of processes.")
parser.add_argument("--verbose", type=int, default=1, help="verbose.")
def str2bool(str):
return True if str.lower() == 'true' else False
parser.add_argument(
"--use-relative-path",
type=str2bool,
default=False,
help="whether use relative path in metadata")
parser.add_argument(
"--phones-dict", type=str, default=None, help="phone vocabulary file.")
parser.add_argument(
"--tones-dict", type=str, default=None, help="tone vocabulary file.")
    # extra options such as max_epoch can also be passed in here
args, rest = parser.parse_known_args()
if args.device == "cpu" and args.nprocs > 1:
raise RuntimeError("Multiprocess training on CPU is not supported.")
with open(args.config) as f:
config = CfgNode(yaml.safe_load(f))
if rest:
extra = []
# to support key=value format
for item in rest:
# remove "--"
item = item[2:]
extra.extend(item.split("=", maxsplit=1))
config.merge_from_list(extra)
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(config)
print(
f"master see the word size: {dist.get_world_size()}, from pid: {os.getpid()}"
)
# dispatch
if args.nprocs > 1:
dist.spawn(train_sp, (args, config), nprocs=args.nprocs)
else:
train_sp(args, config)
if __name__ == "__main__":
main()

@ -0,0 +1,6 @@
# LJSpeech
* tts0 - Tacotron2
* tts1 - TransformerTTS
* voc0 - WaveFlow

@ -0,0 +1,92 @@
# Tacotron2
PaddlePaddle dynamic graph implementation of Tacotron2, a neural network architecture for speech synthesis directly from text. The implementation is based on [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884).
## Project Structure
```text
├── config.py # default configuration file
├── ljspeech.py # dataset and dataloader settings for LJSpeech
├── preprocess.py # script to preprocess LJSpeech dataset
├── synthesize.py # script to synthesize spectrogram from text
├── train.py # script for tacotron2 model training
├── synthesize.ipynb # notebook example for end-to-end TTS
```
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
Then you need to preprocess the data by running ``preprocess.py``; the preprocessed data will be placed in the ``--output`` directory.
```bash
python preprocess.py \
--input=${DATAPATH} \
--output=${PREPROCESSEDDATAPATH} \
    -v
```
For more help on arguments, run ``python preprocess.py --help``.
## Train the model
The Tacotron2 model can be trained by running ``train.py``.
```bash
python train.py \
--data=${PREPROCESSEDDATAPATH} \
--output=${OUTPUTPATH} \
    --device=gpu
```
If you want to train on CPU, just set ``--device=cpu``.
If you want to train on multiple GPUs, set ``--nprocs`` to the number of GPUs.
By default, training resumes from the latest checkpoint in ``--output``. If you want to start a new training run, use a new ``${OUTPUTPATH}`` with no checkpoints in it. If you want to resume from another existing model, set ``--checkpoint_path`` to the checkpoint path you want to load.
**Note: The checkpoint path cannot contain the file extension.**
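For example, resuming from a specific checkpoint could look like the following (a sketch: ``step-10000`` is a hypothetical checkpoint name, and checkpoints are assumed to live under ``${OUTPUTPATH}/checkpoints``; use whatever name actually exists there, without the file extension).
```bash
python train.py \
    --data=${PREPROCESSEDDATAPATH} \
    --output=${OUTPUTPATH} \
    --checkpoint_path=${OUTPUTPATH}/checkpoints/step-10000 \
    --device=gpu
```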
For more help on arguments, run ``python train.py --help``.
## Synthesize
After training Tacotron2, spectrograms can be synthesized by running ``synthesize.py``.
```bash
python synthesize.py \
--config=${CONFIGPATH} \
--checkpoint_path=${CHECKPOINTPATH} \
--input=${TEXTPATH} \
    --output=${OUTPUTPATH} \
--device=gpu
```
The ``${CONFIGPATH}`` needs to match the ``${CHECKPOINTPATH}``, i.e. use the config that the checkpoint was trained with.
For more help on arguments, run ``python synthesize.py --help``.
You can then find the spectrogram files in ``${OUTPUTPATH}``; they can be used as input to a vocoder such as [waveflow](../waveflow/README.md#Synthesis) to get audio files.
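For example, a possible next step with a trained waveflow model (a sketch only: ``${WAVEFLOW_CONFIGPATH}``, ``${WAVEFLOW_CHECKPOINTPATH}`` and ``${WAVPATH}`` are placeholders, and the exact arguments of the waveflow example's ``synthesize.py`` may differ, so check its README):
```bash
python ../waveflow/synthesize.py \
    --config=${WAVEFLOW_CONFIGPATH} \
    --checkpoint_path=${WAVEFLOW_CHECKPOINTPATH} \
    --input=${OUTPUTPATH} \
    --output=${WAVPATH} \
    --device=gpu
```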
## Pretrained Models
Pretrained models can be downloaded from the links below. We provide 2 models with different configurations.
1. This model uses a binary classifier to predict the stop token. [tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3.zip)
2. This model does not have a stop token predictor. It uses the attention peak position to decide whether all the content has been uttered. Guided attention loss is also used to speed up training. This model is trained with `configs/alternative.yaml`. [tacotron2_ljspeech_ckpt_0.3_alternative.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3_alternative.zip)
## Notebook: End-to-end TTS
See [synthesize.ipynb](./synthesize.ipynb) for details about end-to-end TTS with tacotron2 and waveflow.

@ -0,0 +1,76 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from yacs.config import CfgNode as CN
_C = CN()
_C.data = CN(
dict(
batch_size=32, # batch size
valid_size=64, # the first N examples are reserved for validation
sample_rate=22050, # Hz, sample rate
n_fft=1024, # fft frame size
win_length=1024, # window size
        hop_length=256,  # hop size between adjacent frames
fmax=8000, # Hz, max frequency when converting to mel
fmin=0, # Hz, min frequency when converting to mel
n_mels=80, # mel bands
padding_idx=0, # text embedding's padding index
))
_C.model = CN(
dict(
vocab_size=37, # set this according to the frontend's vocab size
n_tones=None,
reduction_factor=1, # reduction factor
d_encoder=512, # embedding & encoder's internal size
encoder_conv_layers=3, # number of conv layer in tacotron2 encoder
encoder_kernel_size=5, # kernel size of conv layers in tacotron2 encoder
d_prenet=256, # hidden size of decoder prenet
d_attention_rnn=1024, # hidden size of the first rnn layer in tacotron2 decoder
d_decoder_rnn=1024, # hidden size of the second rnn layer in tacotron2 decoder
d_attention=128, # hidden size of decoder location linear layer
attention_filters=32, # number of filter in decoder location conv layer
attention_kernel_size=31, # kernel size of decoder location conv layer
d_postnet=512, # hidden size of decoder postnet
postnet_kernel_size=5, # kernel size of conv layers in postnet
postnet_conv_layers=5, # number of conv layer in decoder postnet
        p_encoder_dropout=0.5,  # dropout probability in encoder
        p_prenet_dropout=0.5,  # dropout probability in decoder prenet
        p_attention_dropout=0.1,  # dropout probability of first rnn layer in decoder
        p_decoder_dropout=0.1,  # dropout probability of second rnn layer in decoder
        p_postnet_dropout=0.5,  # dropout probability in decoder postnet
        d_global_condition=None,
        use_stop_token=True,  # whether to use binary classifier to predict when to stop
use_guided_attention_loss=False, # whether to use guided attention loss
guided_attention_loss_sigma=0.2 # sigma in guided attention loss
))
_C.training = CN(
dict(
lr=1e-3, # learning rate
weight_decay=1e-6, # the coeff of weight decay
grad_clip_thresh=1.0, # the clip norm of grad clip.
plot_interval=1000, # plot attention and spectrogram
valid_interval=1000, # validation
save_interval=1000, # checkpoint
max_iteration=500000, # max iteration to train
))
def get_cfg_defaults():
"""Get a yacs CfgNode object with default values for my_project."""
# Return a clone so that the defaults will not be altered
# This is for the "local variable" use pattern
return _C.clone()

@ -0,0 +1,94 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
import pickle
import numpy as np
from paddle.io import Dataset
from parakeet.data.batch import batch_spec, batch_text_id
class LJSpeech(Dataset):
"""A simple dataset adaptor for the processed ljspeech dataset."""
def __init__(self, root):
self.root = Path(root).expanduser()
records = []
with open(self.root / "metadata.pkl", 'rb') as f:
metadata = pickle.load(f)
for mel_name, text, ids in metadata:
mel_name = self.root / "mel" / (mel_name + ".npy")
records.append((mel_name, text, ids))
self.records = records
def __getitem__(self, i):
mel_name, _, ids = self.records[i]
mel = np.load(mel_name)
return ids, mel
def __len__(self):
return len(self.records)
class LJSpeechCollector(object):
"""A simple callable to batch LJSpeech examples."""
def __init__(self, padding_idx=0, padding_value=0., padding_stop_token=1.0):
self.padding_idx = padding_idx
self.padding_value = padding_value
self.padding_stop_token = padding_stop_token
def __call__(self, examples):
texts = []
mels = []
text_lens = []
mel_lens = []
for data in examples:
text, mel = data
text = np.array(text, dtype=np.int64)
text_lens.append(len(text))
mels.append(mel)
texts.append(text)
mel_lens.append(mel.shape[1])
# Sort by text_len in descending order
texts = [
i
for i, _ in sorted(
zip(texts, text_lens), key=lambda x: x[1], reverse=True)
]
mels = [
i
for i, _ in sorted(
zip(mels, text_lens), key=lambda x: x[1], reverse=True)
]
mel_lens = [
i
for i, _ in sorted(
zip(mel_lens, text_lens), key=lambda x: x[1], reverse=True)
]
mel_lens = np.array(mel_lens, dtype=np.int64)
text_lens = np.array(sorted(text_lens, reverse=True), dtype=np.int64)
# Pad sequence with largest len of the batch
texts, _ = batch_text_id(texts, pad_id=self.padding_idx)
mels, _ = batch_spec(mels, pad_value=self.padding_value)
mels = np.transpose(mels, axes=(0, 2, 1))
return texts, mels, text_lens, mel_lens

@ -0,0 +1,99 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import pickle
import argparse
from pathlib import Path
import tqdm
import numpy as np
from parakeet.datasets import LJSpeechMetaData
from parakeet.audio import AudioProcessor, LogMagnitude
from parakeet.frontend import EnglishCharacter
from config import get_cfg_defaults
def create_dataset(config, source_path, target_path, verbose=False):
# create output dir
target_path = Path(target_path).expanduser()
mel_path = target_path / "mel"
os.makedirs(mel_path, exist_ok=True)
meta_data = LJSpeechMetaData(source_path)
frontend = EnglishCharacter()
processor = AudioProcessor(
sample_rate=config.data.sample_rate,
n_fft=config.data.n_fft,
n_mels=config.data.n_mels,
win_length=config.data.win_length,
hop_length=config.data.hop_length,
fmax=config.data.fmax,
fmin=config.data.fmin)
normalizer = LogMagnitude()
records = []
for (fname, text, _) in tqdm.tqdm(meta_data):
wav = processor.read_wav(fname)
mel = processor.mel_spectrogram(wav)
mel = normalizer.transform(mel)
ids = frontend(text)
mel_name = os.path.splitext(os.path.basename(fname))[0]
# save mel spectrogram
records.append((mel_name, text, ids))
np.save(mel_path / mel_name, mel)
if verbose:
print("save mel spectrograms into {}".format(mel_path))
# save meta data as pickle archive
with open(target_path / "metadata.pkl", 'wb') as f:
pickle.dump(records, f)
if verbose:
print("saved metadata into {}".format(target_path / "metadata.pkl"))
print("Done.")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="create dataset")
parser.add_argument(
"--config",
type=str,
metavar="FILE",
help="extra config to overwrite the default config")
parser.add_argument(
"--input", type=str, help="path of the ljspeech dataset")
parser.add_argument(
"--output", type=str, help="path to save output dataset")
parser.add_argument(
"--opts",
nargs=argparse.REMAINDER,
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
)
parser.add_argument(
"-v", "--verbose", action="store_true", help="print msg")
config = get_cfg_defaults()
args = parser.parse_args()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config.data)
create_dataset(config, args.input, args.output, args.verbose)

File diff suppressed because one or more lines are too long

@ -0,0 +1,95 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path
import paddle
import numpy as np
from matplotlib import pyplot as plt
from parakeet.frontend import EnglishCharacter
from parakeet.models.tacotron2 import Tacotron2
from parakeet.utils import display
from config import get_cfg_defaults
def main(config, args):
paddle.set_device(args.device)
# model
frontend = EnglishCharacter()
model = Tacotron2.from_pretrained(config, args.checkpoint_path)
model.eval()
# inputs
input_path = Path(args.input).expanduser()
with open(input_path, "rt") as f:
sentences = f.readlines()
if args.output is None:
output_dir = input_path.parent / "synthesis"
else:
output_dir = Path(args.output).expanduser()
output_dir.mkdir(exist_ok=True)
for i, sentence in enumerate(sentences):
sentence = paddle.to_tensor(frontend(sentence)).unsqueeze(0)
outputs = model.infer(sentence)
mel_output = outputs["mel_outputs_postnet"][0].numpy().T
alignment = outputs["alignments"][0].numpy().T
np.save(str(output_dir / f"sentence_{i}"), mel_output)
display.plot_alignment(alignment)
plt.savefig(str(output_dir / f"sentence_{i}.png"))
if args.verbose:
print("spectrogram saved at {}".format(output_dir /
f"sentence_{i}.npy"))
if __name__ == "__main__":
config = get_cfg_defaults()
parser = argparse.ArgumentParser(
description="generate mel spectrogram with TransformerTTS.")
parser.add_argument(
"--config",
type=str,
metavar="FILE",
help="extra config to overwrite the default config")
parser.add_argument(
"--checkpoint_path", type=str, help="path of the checkpoint to load.")
parser.add_argument("--input", type=str, help="path of the text sentences")
parser.add_argument("--output", type=str, help="path to save outputs")
parser.add_argument(
"--device", type=str, default="cpu", help="device type to use.")
parser.add_argument(
"--opts",
nargs=argparse.REMAINDER,
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
)
parser.add_argument(
"-v", "--verbose", action="store_true", help="print msg")
args = parser.parse_args()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
print(args)
main(config, args)

@ -0,0 +1,218 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time
from collections import defaultdict
import numpy as np
import paddle
from paddle.io import DataLoader
from paddle.io import DistributedBatchSampler
from paddle import distributed as dist
from parakeet.data import dataset
from parakeet.training.cli import default_argument_parser
from parakeet.training.experiment import ExperimentBase
from parakeet.utils import display, mp_tools
from parakeet.models.tacotron2 import Tacotron2, Tacotron2Loss
from config import get_cfg_defaults
from ljspeech import LJSpeech, LJSpeechCollector
class Experiment(ExperimentBase):
def compute_losses(self, inputs, outputs):
texts, mel_targets, plens, slens = inputs
mel_outputs = outputs["mel_output"]
mel_outputs_postnet = outputs["mel_outputs_postnet"]
attention_weight = outputs["alignments"]
if self.config.model.use_stop_token:
stop_logits = outputs["stop_logits"]
else:
stop_logits = None
losses = self.criterion(mel_outputs, mel_outputs_postnet, mel_targets,
attention_weight, slens, plens, stop_logits)
return losses
def train_batch(self):
start = time.time()
batch = self.read_batch()
data_loader_time = time.time() - start
self.optimizer.clear_grad()
self.model.train()
texts, mels, text_lens, output_lens = batch
outputs = self.model(texts, text_lens, mels, output_lens)
losses = self.compute_losses(batch, outputs)
loss = losses["loss"]
loss.backward()
self.optimizer.step()
iteration_time = time.time() - start
losses_np = {k: float(v) for k, v in losses.items()}
# logging
msg = "Rank: {}, ".format(dist.get_rank())
msg += "step: {}, ".format(self.iteration)
msg += "time: {:>.3f}s/{:>.3f}s, ".format(data_loader_time,
iteration_time)
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in losses_np.items())
self.logger.info(msg)
if dist.get_rank() == 0:
for k, v in losses_np.items():
self.visualizer.add_scalar(f"train_loss/{k}", v, self.iteration)
@mp_tools.rank_zero_only
@paddle.no_grad()
def valid(self):
valid_losses = defaultdict(list)
for i, batch in enumerate(self.valid_loader):
texts, mels, text_lens, output_lens = batch
outputs = self.model(texts, text_lens, mels, output_lens)
losses = self.compute_losses(batch, outputs)
for k, v in losses.items():
valid_losses[k].append(float(v))
attention_weights = outputs["alignments"]
self.visualizer.add_figure(
f"valid_sentence_{i}_alignments",
display.plot_alignment(attention_weights[0].numpy().T),
self.iteration)
self.visualizer.add_figure(
f"valid_sentence_{i}_target_spectrogram",
display.plot_spectrogram(mels[0].numpy().T), self.iteration)
self.visualizer.add_figure(
f"valid_sentence_{i}_predicted_spectrogram",
display.plot_spectrogram(outputs['mel_outputs_postnet'][0]
.numpy().T), self.iteration)
# write visual log
valid_losses = {k: np.mean(v) for k, v in valid_losses.items()}
# logging
msg = "Valid: "
msg += "step: {}, ".format(self.iteration)
msg += ', '.join('{}: {:>.6f}'.format(k, v)
for k, v in valid_losses.items())
self.logger.info(msg)
for k, v in valid_losses.items():
self.visualizer.add_scalar(f"valid/{k}", v, self.iteration)
def setup_model(self):
config = self.config
model = Tacotron2(
vocab_size=config.model.vocab_size,
d_mels=config.data.n_mels,
d_encoder=config.model.d_encoder,
encoder_conv_layers=config.model.encoder_conv_layers,
encoder_kernel_size=config.model.encoder_kernel_size,
d_prenet=config.model.d_prenet,
d_attention_rnn=config.model.d_attention_rnn,
d_decoder_rnn=config.model.d_decoder_rnn,
attention_filters=config.model.attention_filters,
attention_kernel_size=config.model.attention_kernel_size,
d_attention=config.model.d_attention,
d_postnet=config.model.d_postnet,
postnet_kernel_size=config.model.postnet_kernel_size,
postnet_conv_layers=config.model.postnet_conv_layers,
reduction_factor=config.model.reduction_factor,
p_encoder_dropout=config.model.p_encoder_dropout,
p_prenet_dropout=config.model.p_prenet_dropout,
p_attention_dropout=config.model.p_attention_dropout,
p_decoder_dropout=config.model.p_decoder_dropout,
p_postnet_dropout=config.model.p_postnet_dropout,
use_stop_token=config.model.use_stop_token)
if self.parallel:
model = paddle.DataParallel(model)
grad_clip = paddle.nn.ClipGradByGlobalNorm(
config.training.grad_clip_thresh)
optimizer = paddle.optimizer.Adam(
learning_rate=config.training.lr,
parameters=model.parameters(),
weight_decay=paddle.regularizer.L2Decay(
config.training.weight_decay),
grad_clip=grad_clip)
criterion = Tacotron2Loss(
use_stop_token_loss=config.model.use_stop_token,
use_guided_attention_loss=config.model.use_guided_attention_loss,
sigma=config.model.guided_attention_loss_sigma)
self.model = model
self.optimizer = optimizer
self.criterion = criterion
def setup_dataloader(self):
args = self.args
config = self.config
ljspeech_dataset = LJSpeech(args.data)
valid_set, train_set = dataset.split(ljspeech_dataset,
config.data.valid_size)
batch_fn = LJSpeechCollector(padding_idx=config.data.padding_idx)
if not self.parallel:
self.train_loader = DataLoader(
train_set,
batch_size=config.data.batch_size,
shuffle=True,
drop_last=True,
collate_fn=batch_fn)
else:
sampler = DistributedBatchSampler(
train_set,
batch_size=config.data.batch_size,
shuffle=True,
drop_last=True)
self.train_loader = DataLoader(
train_set, batch_sampler=sampler, collate_fn=batch_fn)
self.valid_loader = DataLoader(
valid_set,
batch_size=config.data.batch_size,
shuffle=False,
drop_last=False,
collate_fn=batch_fn)
def main_sp(config, args):
exp = Experiment(config, args)
exp.setup()
exp.resume_or_load()
exp.run()
def main(config, args):
if args.nprocs > 1 and args.device == "gpu":
dist.spawn(main_sp, args=(config, args), nprocs=args.nprocs)
else:
main_sp(config, args)
if __name__ == "__main__":
config = get_cfg_defaults()
parser = default_argument_parser()
args = parser.parse_args()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
print(args)
main(config, args)

@ -0,0 +1,194 @@
# TransformerTTS with LJSpeech
## Dataset
### Download the dataset
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
```
### Extract the dataset
```bash
tar xjvf LJSpeech-1.1.tar.bz2
```
### Preprocess the dataset
Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.
Run the command below to preprocess the dataset.
```bash
./preprocess.sh
```
When it is done, a `dump` folder is created in the current directory. The structure of the `dump` folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── speech_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the speech features of each utterance, while the `norm` folder contains the normalized ones. The statistics used to normalize the features are computed from the training set and stored in `dump/train/speech_stats.npy`.
There is also a `metadata.jsonl` in each subfolder. It is a table-like file that contains the phones, text lengths, speech lengths, path of the speech features, speaker and id of each utterance.
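If you want to inspect the processed data yourself, here is a minimal sketch of reading `metadata.jsonl` and one of the feature files it points to. It assumes the default paths produced by `./preprocess.sh` and the field names described above; the `jsonlines` and `numpy` packages are already required by the scripts in this example.
```python
import jsonlines
import numpy as np

# Read the table-like metadata file of the normalized training set.
with jsonlines.open("dump/train/norm/metadata.jsonl", "r") as reader:
    records = list(reader)

first = records[0]
print(first["utt_id"], first["text_lengths"], first["speech_lengths"])

# Each record stores the path of its (already normalized) log-mel spectrogram.
mel = np.load(first["speech"])
print("mel shape:", mel.shape)  # (num_frames, n_mels)
```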
## Train the model
`./run.sh` calls `../train.py`.
```bash
./run.sh
```
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--device DEVICE] [--nprocs NPROCS] [--verbose VERBOSE]
[--phones-dict PHONES_DICT]
Train a TransformerTTS model with LJSpeech TTS dataset.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file to overwrite default config.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--device DEVICE device type to use.
--nprocs NPROCS number of processes.
--verbose VERBOSE verbose.
--phones-dict PHONES_DICT
phone vocabulary file.
```
1. `--config` is a config file in yaml format used to overwrite the default config, which can be found at `conf/default.yaml` (a short sketch of how it is loaded follows this list).
2. `--train-metadata` and `--dev-metadata` should be the metadata files in the `norm` subfolders of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--device` is the type of device to run the experiment on; 'cpu' and 'gpu' are supported.
5. `--nprocs` is the number of processes to run in parallel; note that nprocs > 1 is only supported when `--device` is 'gpu'.
6. `--phones-dict` is the path of the phone vocabulary file.
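As a reference for how the config file is consumed, `train.py` later in this diff simply loads the yaml into a yacs `CfgNode`, so every key becomes an attribute. A minimal sketch (the keys used here, such as `fs`, `n_mels` and `model.adim`, come from `conf/default.yaml`):
```python
import yaml
from yacs.config import CfgNode

# Load the yaml config (conf/default.yaml or whatever is passed via --config)
# into a CfgNode so that values can be accessed as attributes.
with open("conf/default.yaml") as f:
    config = CfgNode(yaml.safe_load(f))

print(config.fs, config.n_mels, config.model.adim)
```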
## Pretrained Model
A pretrained model can be downloaded here: [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/transformer_tts_ljspeech_ckpt_0.4.zip)
The TransformerTTS checkpoint contains the files listed below.
```text
transformer_tts_ljspeech_ckpt_0.4
├── default.yaml # default config used to train transformer_tts
├── phone_id_map.txt # phone vocabulary file when training transformer_tts
├── snapshot_iter_201500.pdz # model parameters and optimizer states
└── speech_stats.npy # statistics used to normalize spectrogram when training transformer_tts
```
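As a side note on `speech_stats.npy`: it stores the mean and standard deviation of the training mels, and the synthesis scripts below load it and wrap it in a `ZScore` normalizer that is handed to `TransformerTTSInference` together with the model. A minimal sketch of that step (the path assumes the unzipped checkpoint directory above):
```python
import numpy as np
import paddle
from parakeet.modules.normalizer import ZScore

# speech_stats.npy holds [mean, std] computed on the training set.
mu, std = np.load("transformer_tts_ljspeech_ckpt_0.4/speech_stats.npy")
normalizer = ZScore(paddle.to_tensor(mu), paddle.to_tensor(std))
# `normalizer` is then passed to TransformerTTSInference together with the model,
# exactly as done in synthesize.py / synthesize_e2e.py below.
```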
## Synthesize
We use [waveflow](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/waveflow) as the neural vocoder.
Download the pretrained WaveFlow model with 128 residual channels from [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_ljspeech_ckpt_0.3.zip) and unzip it.
```bash
unzip waveflow_ljspeech_ckpt_0.3.zip
```
The WaveFlow checkpoint contains the files listed below.
```text
waveflow_ljspeech_ckpt_0.3
├── config.yaml # default config used to train waveflow
└── step-2000000.pdparams # model parameters of waveflow
```
`synthesize.sh` calls `../synthesize.py`, which can synthesize waveforms from `metadata.jsonl`.
```bash
./synthesize.sh
```
```text
usage: synthesize.py [-h] [--transformer-tts-config TRANSFORMER_TTS_CONFIG]
[--transformer-tts-checkpoint TRANSFORMER_TTS_CHECKPOINT]
[--transformer-tts-stat TRANSFORMER_TTS_STAT]
[--waveflow-config WAVEFLOW_CONFIG]
[--waveflow-checkpoint WAVEFLOW_CHECKPOINT]
[--phones-dict PHONES_DICT]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR]
[--device DEVICE] [--verbose VERBOSE]
Synthesize with transformer tts & waveflow.
optional arguments:
-h, --help show this help message and exit
--transformer-tts-config TRANSFORMER_TTS_CONFIG
transformer tts config file.
--transformer-tts-checkpoint TRANSFORMER_TTS_CHECKPOINT
transformer tts checkpoint to load.
--transformer-tts-stat TRANSFORMER_TTS_STAT
mean and standard deviation used to normalize
spectrogram when training transformer tts.
--waveflow-config WAVEFLOW_CONFIG
waveflow config file.
--waveflow-checkpoint WAVEFLOW_CHECKPOINT
waveflow checkpoint to load.
--phones-dict PHONES_DICT
phone vocabulary file.
--test-metadata TEST_METADATA
test metadata.
--output-dir OUTPUT_DIR
output dir.
--device DEVICE device type to use.
--verbose VERBOSE verbose.
```
`synthesize_e2e.sh` calls `synthesize_e2e.py`, which can synthesize waveforms from a text file.
```bash
./synthesize_e2e.sh
```
```text
usage: synthesize_e2e.py [-h]
[--transformer-tts-config TRANSFORMER_TTS_CONFIG]
[--transformer-tts-checkpoint TRANSFORMER_TTS_CHECKPOINT]
[--transformer-tts-stat TRANSFORMER_TTS_STAT]
[--waveflow-config WAVEFLOW_CONFIG]
[--waveflow-checkpoint WAVEFLOW_CHECKPOINT]
[--phones-dict PHONES_DICT] [--text TEXT]
[--output-dir OUTPUT_DIR] [--device DEVICE]
[--verbose VERBOSE]
Synthesize with transformer tts & waveflow.
optional arguments:
-h, --help show this help message and exit
--transformer-tts-config TRANSFORMER_TTS_CONFIG
transformer tts config file.
--transformer-tts-checkpoint TRANSFORMER_TTS_CHECKPOINT
transformer tts checkpoint to load.
--transformer-tts-stat TRANSFORMER_TTS_STAT
mean and standard deviation used to normalize
spectrogram when training transformer tts.
--waveflow-config WAVEFLOW_CONFIG
waveflow config file.
--waveflow-checkpoint WAVEFLOW_CHECKPOINT
waveflow checkpoint to load.
--phones-dict PHONES_DICT
phone vocabulary file.
--text TEXT text to synthesize, a 'utt_id sentence' pair per line.
--output-dir OUTPUT_DIR
output dir.
--device DEVICE device type to use.
--verbose VERBOSE verbose.
```
1. `--transformer-tts-config`, `--transformer-tts-checkpoint`, `--transformer-tts-stat` and `--phones-dict` are arguments for transformer_tts, which correspond to the 4 files in the transformer_tts pretrained model.
2. `--waveflow-config`, `--waveflow-checkpoint` are arguments for waveflow, which correspond to the 2 files in the waveflow pretrained model.
3. `--test-metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
4. `--text` is the text file, which contains sentences to synthesize.
5. `--output-dir` is the directory to save synthesized audio files.
6. `--device` is the type of device to run synthesis on; 'cpu' and 'gpu' are supported, and 'gpu' is recommended for faster synthesis.
You can use the following script to synthesize speech for the sentences in `../sentences.txt` using the pretrained transformer_tts and waveflow models.
```bash
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 synthesize_e2e.py \
--transformer-tts-config=transformer_tts_ljspeech_ckpt_0.4/default.yaml \
--transformer-tts-checkpoint=transformer_tts_ljspeech_ckpt_0.4/snapshot_iter_201500.pdz \
--transformer-tts-stat=transformer_tts_ljspeech_ckpt_0.4/speech_stats.npy \
--waveflow-config=waveflow_ljspeech_ckpt_0.3/config.yaml \
--waveflow-checkpoint=waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams \
--text=../sentences.txt \
--output-dir=exp/default/test_e2e \
--device="gpu" \
--phones-dict=transformer_tts_ljspeech_ckpt_0.4/phone_id_map.txt
```

@ -0,0 +1,92 @@
fs : 22050 # Hz, sample rate
n_fft : 1024 # fft frame size
win_length : 1024 # window size
n_shift : 256 # hop size between adjacent frames
fmin : 0 # Hz, min frequency when converting to mel
fmax : 8000 # Hz, max frequency when converting to mel
n_mels : 80 # mel bands
window: "hann" # Window function.
###########################################################
# DATA SETTING #
###########################################################
batch_size: 16
num_workers: 2
##########################################################
# TTS MODEL SETTING #
##########################################################
tts: transformertts # model architecture
model: # keyword arguments for the selected model
embed_dim: 0 # embedding dimension in encoder prenet
eprenet_conv_layers: 0 # number of conv layers in encoder prenet
# if set to 0, no encoder prenet will be used
eprenet_conv_filts: 0 # filter size of conv layers in encoder prenet
eprenet_conv_chans: 0 # number of channels of conv layers in encoder prenet
dprenet_layers: 2 # number of layers in decoder prenet
dprenet_units: 256 # number of units in decoder prenet
adim: 512 # attention dimension
aheads: 8 # number of attention heads
elayers: 6 # number of encoder layers
eunits: 1024 # number of encoder ff units
dlayers: 6 # number of decoder layers
dunits: 1024 # number of decoder ff units
positionwise_layer_type: conv1d # type of position-wise layer
positionwise_conv_kernel_size: 1 # kernel size of position wise conv layer
postnet_layers: 5 # number of layers of postnet
postnet_filts: 5 # filter size of conv layers in postnet
postnet_chans: 256 # number of channels of conv layers in postnet
use_scaled_pos_enc: True # whether to use scaled positional encoding
encoder_normalize_before: True # whether to perform layer normalization before the input
decoder_normalize_before: True # whether to perform layer normalization before the input
reduction_factor: 1 # reduction factor
init_type: xavier_uniform # initialization type
init_enc_alpha: 1.0 # initial value of alpha of encoder scaled position encoding
init_dec_alpha: 1.0 # initial value of alpha of decoder scaled position encoding
eprenet_dropout_rate: 0.0 # dropout rate for encoder prenet
dprenet_dropout_rate: 0.5 # dropout rate for decoder prenet
postnet_dropout_rate: 0.5 # dropout rate for postnet
transformer_enc_dropout_rate: 0.1 # dropout rate for transformer encoder layer
transformer_enc_positional_dropout_rate: 0.1 # dropout rate for transformer encoder positional encoding
transformer_enc_attn_dropout_rate: 0.1 # dropout rate for transformer encoder attention layer
transformer_dec_dropout_rate: 0.1 # dropout rate for transformer decoder layer
transformer_dec_positional_dropout_rate: 0.1 # dropout rate for transformer decoder positional encoding
transformer_dec_attn_dropout_rate: 0.1 # dropout rate for transformer decoder attention layer
transformer_enc_dec_attn_dropout_rate: 0.1 # dropout rate for transformer encoder-decoder attention layer
num_heads_applied_guided_attn: 2 # number of heads to apply guided attention loss
num_layers_applied_guided_attn: 2 # number of layers to apply guided attention loss
###########################################################
# UPDATER SETTING #
###########################################################
updater:
use_masking: true # whether to apply masking for padded part in loss calculation
loss_type: L1
use_guided_attn_loss: true # whether to use guided attention loss
guided_attn_loss_sigma: 0.4 # sigma in guided attention loss
guided_attn_loss_lambda: 10.0 # lambda in guided attention loss
modules_applied_guided_attn: ["encoder-decoder"] # modules to apply guided attention loss
bce_pos_weight: 5.0 # weight of positive sample in binary cross entropy calculation
##########################################################
# OPTIMIZER & SCHEDULER SETTING #
##########################################################
optimizer:
optim: adam # optimizer type
learning_rate: 0.001 # learning rate
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 500
num_snapshots: 5
###########################################################
# OTHER SETTING #
###########################################################
seed: 10086

@ -0,0 +1,50 @@
#!/bin/bash
stage=1
stop_stage=100
export MAIN_ROOT=`realpath ${PWD}/../../../`
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ../preprocess.py \
--dataset=ljspeech \
--rootdir=~/datasets/LJSpeech-1.1/ \
--dumpdir=dump \
--config-path=conf/default.yaml \
--num-cpu=8
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="speech"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize and convert phones to ids; dev and test should use train's stats
echo "Normalize ..."
python3 ../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
fi

@ -0,0 +1,9 @@
#!/bin/bash
python3 ../train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=conf/default.yaml \
--output-dir=exp/default \
--nprocs=2 \
--phones-dict=dump/phone_id_map.txt

@ -0,0 +1,13 @@
#!/bin/bash
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ../synthesize.py \
--transformer-tts-config=conf/default.yaml \
--transformer-tts-checkpoint=exp/default/checkpoints/snapshot_iter_201500.pdz \
--transformer-tts-stat=dump/train/speech_stats.npy \
--waveflow-config=waveflow_ljspeech_ckpt_0.3/config.yaml \
--waveflow-checkpoint=waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams \
--test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=exp/default/test \
--device="gpu" \
--phones-dict=dump/phone_id_map.txt

@ -0,0 +1,161 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from parakeet.frontend import English
from parakeet.models.transformer_tts import TransformerTTS
from parakeet.models.transformer_tts import TransformerTTSInference
from parakeet.models.waveflow import ConditionalWaveFlow
from parakeet.modules.normalizer import ZScore
from parakeet.utils import layer_tools
def evaluate(args, acoustic_model_config, vocoder_config):
# dataloader has been too verbose
logging.getLogger("DataLoader").disabled = True
# construct dataset for evaluation
sentences = []
with open(args.text, 'rt') as f:
for line in f:
line_list = line.strip().split()
utt_id = line_list[0]
sentence = " ".join(line_list[1:])
sentences.append((utt_id, sentence))
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
phone_id_map = {}
for phn, id in phn_id:
phone_id_map[phn] = int(id)
print("vocab_size:", vocab_size)
odim = acoustic_model_config.n_mels
model = TransformerTTS(
idim=vocab_size, odim=odim, **acoustic_model_config["model"])
model.set_state_dict(
paddle.load(args.transformer_tts_checkpoint)["main_params"])
model.eval()
# remove ".pdparams" in waveflow_checkpoint
vocoder_checkpoint_path = args.waveflow_checkpoint[:-9] if args.waveflow_checkpoint.endswith(
".pdparams") else args.waveflow_checkpoint
vocoder = ConditionalWaveFlow.from_pretrained(vocoder_config,
vocoder_checkpoint_path)
layer_tools.recursively_remove_weight_norm(vocoder)
vocoder.eval()
print("model done!")
frontend = English()
print("frontend done!")
stat = np.load(args.transformer_tts_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
transformer_tts_normalizer = ZScore(mu, std)
transformer_tts_inference = TransformerTTSInference(
transformer_tts_normalizer, model)
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
for utt_id, sentence in sentences:
phones = frontend.phoneticize(sentence)
# remove start_symbol and end_symbol
phones = phones[1:-1]
phones = [phn for phn in phones if not phn.isspace()]
phones = [phn if phn in phone_id_map else "," for phn in phones]
phone_ids = [phone_id_map[phn] for phn in phones]
with paddle.no_grad():
mel = transformer_tts_inference(paddle.to_tensor(phone_ids))
# mel shape is (T, feats) and waveflow's input shape is (batch, feats, T)
mel = mel.unsqueeze(0).transpose([0, 2, 1])
# waveflow's output shape is (B, T)
wav = vocoder.infer(mel)[0]
sf.write(
str(output_dir / (utt_id + ".wav")),
wav.numpy(),
samplerate=acoustic_model_config.fs)
print(f"{utt_id} done!")
def main():
# parse args and config and redirect to train_sp
parser = argparse.ArgumentParser(
description="Synthesize with transformer tts & waveflow.")
parser.add_argument(
"--transformer-tts-config",
type=str,
help="transformer tts config file.")
parser.add_argument(
"--transformer-tts-checkpoint",
type=str,
help="transformer tts checkpoint to load.")
parser.add_argument(
"--transformer-tts-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training transformer tts."
)
parser.add_argument(
"--waveflow-config", type=str, help="waveflow config file.")
# spectrograms are not normalized when training waveflow
parser.add_argument(
"--waveflow-checkpoint", type=str, help="waveflow checkpoint to load.")
parser.add_argument(
"--phones-dict",
type=str,
default="phone_id_map.txt",
help="phone vocabulary file.")
parser.add_argument(
"--text",
type=str,
help="text to synthesize, a 'utt_id sentence' pair per line.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument(
"--device", type=str, default="gpu", help="device type to use.")
parser.add_argument("--verbose", type=int, default=1, help="verbose.")
args = parser.parse_args()
paddle.set_device(args.device)
with open(args.transformer_tts_config) as f:
transformer_tts_config = CfgNode(yaml.safe_load(f))
with open(args.waveflow_config) as f:
waveflow_config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(transformer_tts_config)
print(waveflow_config)
evaluate(args, transformer_tts_config, waveflow_config)
if __name__ == "__main__":
main()

@ -0,0 +1,13 @@
#!/bin/bash
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 synthesize_e2e.py \
--transformer-tts-config=conf/default.yaml \
--transformer-tts-checkpoint=exp/default/checkpoints/snapshot_iter_201500.pdz \
--transformer-tts-stat=dump/train/speech_stats.npy \
--waveflow-config=waveflow_ljspeech_ckpt_0.3/config.yaml \
--waveflow-checkpoint=waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams \
--text=../sentences.txt \
--output-dir=exp/default/test_e2e \
--device="gpu" \
--phones-dict=dump/phone_id_map.txt

@ -0,0 +1,143 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Normalize feature files and dump them."""
import argparse
import logging
from operator import itemgetter
from pathlib import Path
import jsonlines
import numpy as np
from parakeet.datasets.data_table import DataTable
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
def main():
"""Run preprocessing process."""
parser = argparse.ArgumentParser(
description="Normalize dumped raw features (See detail in parallel_wavegan/bin/normalize.py)."
)
parser.add_argument(
"--metadata",
type=str,
required=True,
help="directory including feature files to be normalized. "
"you need to specify either *-scp or rootdir.")
parser.add_argument(
"--dumpdir",
type=str,
required=True,
help="directory to dump normalized feature files.")
parser.add_argument(
"--speech-stats",
type=str,
required=True,
help="speech statistics file.")
parser.add_argument(
"--phones-dict", type=str, default=None, help="phone vocabulary file.")
parser.add_argument(
"--speaker-dict", type=str, default=None, help="speaker id map file.")
parser.add_argument(
"--verbose",
type=int,
default=1,
help="logging level. higher is more logging. (default=1)")
args = parser.parse_args()
# set logger
if args.verbose > 1:
logging.basicConfig(
level=logging.DEBUG,
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
)
elif args.verbose > 0:
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
)
else:
logging.basicConfig(
level=logging.WARN,
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
)
logging.warning('Skip DEBUG/INFO messages')
# check directory existence
dumpdir = Path(args.dumpdir).resolve()
dumpdir.mkdir(parents=True, exist_ok=True)
# get dataset
with jsonlines.open(args.metadata, 'r') as reader:
metadata = list(reader)
dataset = DataTable(
metadata, converters={
"speech": np.load,
})
logging.info(f"The number of files = {len(dataset)}.")
# restore scaler
speech_scaler = StandardScaler()
speech_scaler.mean_ = np.load(args.speech_stats)[0]
speech_scaler.scale_ = np.load(args.speech_stats)[1]
speech_scaler.n_features_in_ = speech_scaler.mean_.shape[0]
vocab_phones = {}
with open(args.phones_dict, 'rt') as f:
phn_id = [line.strip().split() for line in f.readlines()]
for phn, id in phn_id:
vocab_phones[phn] = int(id)
vocab_speaker = {}
with open(args.speaker_dict, 'rt') as f:
spk_id = [line.strip().split() for line in f.readlines()]
for spk, id in spk_id:
vocab_speaker[spk] = int(id)
# process each file
output_metadata = []
for item in tqdm(dataset):
utt_id = item['utt_id']
speech = item['speech']
# normalize
speech = speech_scaler.transform(speech)
speech_dir = dumpdir / "data_speech"
speech_dir.mkdir(parents=True, exist_ok=True)
speech_path = speech_dir / f"{utt_id}_speech.npy"
np.save(speech_path, speech.astype(np.float32), allow_pickle=False)
phone_ids = [vocab_phones[p] for p in item['phones']]
spk_id = vocab_speaker[item["speaker"]]
record = {
"utt_id": item['utt_id'],
"spk_id": spk_id,
"text": phone_ids,
"text_lengths": item['text_lengths'],
"speech_lengths": item['speech_lengths'],
"speech": str(speech_path),
}
output_metadata.append(record)
output_metadata.sort(key=itemgetter('utt_id'))
output_metadata_path = Path(args.dumpdir) / "metadata.jsonl"
with jsonlines.open(output_metadata_path, 'w') as writer:
for item in output_metadata:
writer.write(item)
logging.info(f"metadata dumped into {output_metadata_path}")
if __name__ == "__main__":
main()

@ -0,0 +1,281 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from concurrent.futures import ThreadPoolExecutor
from operator import itemgetter
from pathlib import Path
from typing import Any
from typing import Dict
from typing import List
import jsonlines
import librosa
import numpy as np
import tqdm
import yaml
from parakeet.data.get_feats import LogMelFBank
from parakeet.frontend import English
from yacs.config import CfgNode as Configuration
def get_lj_sentences(file_name, frontend):
'''
read the LJSpeech metadata.csv and phoneticize the raw text with the given frontend
Parameters
----------
file_name : str or Path
    path of the metadata.csv file
frontend : English
    text frontend used to phoneticize the raw text
Returns
----------
Dict
    sentence: {'utt': ([phone], speaker)}
Set
    set of speakers found in the metadata
'''
f = open(file_name, 'r')
sentence = {}
speaker_set = set()
for line in f:
line_list = line.strip().split('|')
utt = line_list[0]
speaker = utt.split("-")[0][:2]
speaker_set.add(speaker)
raw_text = line_list[-1]
phonemes = frontend.phoneticize(raw_text)
phonemes = phonemes[1:-1]
phonemes = [phn for phn in phonemes if not phn.isspace()]
sentence[utt] = (phonemes, speaker)
f.close()
return sentence, speaker_set
def get_input_token(sentence, output_path):
'''
get phone set from training data and save it
Parameters
----------
sentence : Dict
sentence: {'utt': ([phone], speaker)}
output_path : str or path
path to save phone_id_map
'''
phn_token = set()
for utt in sentence:
for phn in sentence[utt][0]:
if phn != "<eos>":
phn_token.add(phn)
phn_token = list(phn_token)
phn_token.sort()
phn_token = ["<pad>", "<unk>"] + phn_token
phn_token += ["<eos>"]
with open(output_path, 'w') as f:
for i, phn in enumerate(phn_token):
f.write(phn + ' ' + str(i) + '\n')
def get_spk_id_map(speaker_set, output_path):
speakers = sorted(list(speaker_set))
with open(output_path, 'w') as f:
for i, spk in enumerate(speakers):
f.write(spk + ' ' + str(i) + '\n')
def process_sentence(config: Dict[str, Any],
fp: Path,
sentences: Dict,
output_dir: Path,
mel_extractor=None):
utt_id = fp.stem
record = None
if utt_id in sentences:
# reading, resampling may occur
wav, _ = librosa.load(str(fp), sr=config.fs)
if len(wav.shape) != 1 or np.abs(wav).max() > 1.0:
return record
assert len(wav.shape) == 1, f"{utt_id} is not a mono-channel audio."
assert np.abs(wav).max(
) <= 1.0, f"{utt_id} seems not to be normalized 16-bit PCM."
phones = sentences[utt_id][0]
speaker = sentences[utt_id][1]
logmel = mel_extractor.get_log_mel_fbank(wav, base='e')
# number of frames of the extracted log-mel
num_frames = logmel.shape[0]
mel_dir = output_dir / "data_speech"
mel_dir.mkdir(parents=True, exist_ok=True)
mel_path = mel_dir / (utt_id + "_speech.npy")
np.save(mel_path, logmel)
record = {
"utt_id": utt_id,
"phones": phones,
"text_lengths": len(phones),
"speech_lengths": num_frames,
"speech": str(mel_path),
"speaker": speaker
}
return record
def process_sentences(config,
fps: List[Path],
sentences: Dict,
output_dir: Path,
mel_extractor=None,
nprocs: int=1):
if nprocs == 1:
results = []
for fp in tqdm.tqdm(fps, total=len(fps)):
record = process_sentence(config, fp, sentences, output_dir,
mel_extractor)
if record:
results.append(record)
else:
with ThreadPoolExecutor(nprocs) as pool:
futures = []
with tqdm.tqdm(total=len(fps)) as progress:
for fp in fps:
future = pool.submit(process_sentence, config, fp,
sentences, output_dir, mel_extractor)
future.add_done_callback(lambda p: progress.update())
futures.append(future)
results = []
for ft in futures:
record = ft.result()
if record:
results.append(record)
results.sort(key=itemgetter("utt_id"))
with jsonlines.open(output_dir / "metadata.jsonl", 'w') as writer:
for item in results:
writer.write(item)
print("Done")
def main():
# parse config and args
parser = argparse.ArgumentParser(
description="Preprocess audio and then extract features.")
parser.add_argument(
"--dataset",
default="ljspeech",
type=str,
help="name of dataset, should in {ljspeech} now")
parser.add_argument(
"--rootdir", default=None, type=str, help="directory to dataset.")
parser.add_argument(
"--dumpdir",
type=str,
required=True,
help="directory to dump feature files.")
parser.add_argument(
"--config-path",
default="conf/default.yaml",
type=str,
help="yaml format configuration file.")
parser.add_argument(
"--verbose",
type=int,
default=1,
help="logging level. higher is more logging. (default=1)")
parser.add_argument(
"--num-cpu", type=int, default=1, help="number of process.")
args = parser.parse_args()
config_path = Path(args.config_path).resolve()
root_dir = Path(args.rootdir).expanduser()
dumpdir = Path(args.dumpdir).expanduser()
# use absolute path
dumpdir = dumpdir.resolve()
dumpdir.mkdir(parents=True, exist_ok=True)
assert root_dir.is_dir()
with open(config_path, 'rt') as f:
_C = yaml.safe_load(f)
_C = Configuration(_C)
config = _C.clone()
if args.verbose > 1:
print(vars(args))
print(config)
phone_id_map_path = dumpdir / "phone_id_map.txt"
speaker_id_map_path = dumpdir / "speaker_id_map.txt"
if args.dataset == "ljspeech":
wav_files = sorted(list((root_dir / "wavs").rglob("*.wav")))
frontend = English()
sentences, speaker_set = get_lj_sentences(root_dir / "metadata.csv",
frontend)
get_input_token(sentences, phone_id_map_path)
get_spk_id_map(speaker_set, speaker_id_map_path)
# split data into 3 sections
num_train = 12900
num_dev = 100
train_wav_files = wav_files[:num_train]
dev_wav_files = wav_files[num_train:num_train + num_dev]
test_wav_files = wav_files[num_train + num_dev:]
train_dump_dir = dumpdir / "train" / "raw"
train_dump_dir.mkdir(parents=True, exist_ok=True)
dev_dump_dir = dumpdir / "dev" / "raw"
dev_dump_dir.mkdir(parents=True, exist_ok=True)
test_dump_dir = dumpdir / "test" / "raw"
test_dump_dir.mkdir(parents=True, exist_ok=True)
# Extractor
mel_extractor = LogMelFBank(
sr=config.fs,
n_fft=config.n_fft,
hop_length=config.n_shift,
win_length=config.win_length,
window=config.window,
n_mels=config.n_mels,
fmin=config.fmin,
fmax=config.fmax)
# process for the 3 sections
if train_wav_files:
process_sentences(
config,
train_wav_files,
sentences,
train_dump_dir,
mel_extractor,
nprocs=args.num_cpu)
if dev_wav_files:
process_sentences(
config,
dev_wav_files,
sentences,
dev_dump_dir,
mel_extractor,
nprocs=args.num_cpu)
if test_wav_files:
process_sentences(
config,
test_wav_files,
sentences,
test_dump_dir,
mel_extractor,
nprocs=args.num_cpu)
if __name__ == "__main__":
main()

@ -0,0 +1,9 @@
001 Life was like a box of chocolates, you never know what you're gonna get.
002 With great power there must come great responsibility.
003 To be or not to be, thats a question.
004 A man can be destroyed but not defeated
005 Do not, for one repulse, give up the purpose that you resolved to effort.
006 Death is just a part of life, something we're all destined to do.
007 I think it's hard winning a war with words.
008 Dont argue with the people of strong determination, because they may change the fact!
009 Love you three thousand times.

@ -0,0 +1,142 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path
import jsonlines
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from parakeet.datasets.data_table import DataTable
from parakeet.models.transformer_tts import TransformerTTS
from parakeet.models.transformer_tts import TransformerTTSInference
from parakeet.models.waveflow import ConditionalWaveFlow
from parakeet.modules.normalizer import ZScore
from parakeet.utils import layer_tools
def evaluate(args, acoustic_model_config, vocoder_config):
# dataloader has been too verbose
logging.getLogger("DataLoader").disabled = True
# construct dataset for evaluation
with jsonlines.open(args.test_metadata, 'r') as reader:
test_metadata = list(reader)
test_dataset = DataTable(data=test_metadata, fields=["utt_id", "text"])
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
print("vocab_size:", vocab_size)
odim = acoustic_model_config.n_mels
model = TransformerTTS(
idim=vocab_size, odim=odim, **acoustic_model_config["model"])
model.set_state_dict(
paddle.load(args.transformer_tts_checkpoint)["main_params"])
model.eval()
# remove ".pdparams" in waveflow_checkpoint
vocoder_checkpoint_path = args.waveflow_checkpoint[:-9] if args.waveflow_checkpoint.endswith(
".pdparams") else args.waveflow_checkpoint
vocoder = ConditionalWaveFlow.from_pretrained(vocoder_config,
vocoder_checkpoint_path)
layer_tools.recursively_remove_weight_norm(vocoder)
vocoder.eval()
print("model done!")
stat = np.load(args.transformer_tts_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
transformer_tts_normalizer = ZScore(mu, std)
transformer_tts_inference = TransformerTTSInference(
transformer_tts_normalizer, model)
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
for datum in test_dataset:
utt_id = datum["utt_id"]
text = paddle.to_tensor(datum["text"])
with paddle.no_grad():
mel = transformer_tts_inference(text)
# mel shape is (T, feats) and waveflow's input shape is (batch, feats, T)
mel = mel.unsqueeze(0).transpose([0, 2, 1])
# waveflow's output shape is (B, T)
wav = vocoder.infer(mel)[0]
sf.write(
str(output_dir / (utt_id + ".wav")),
wav.numpy(),
samplerate=acoustic_model_config.fs)
print(f"{utt_id} done!")
def main():
# parse args and config and redirect to train_sp
parser = argparse.ArgumentParser(
description="Synthesize with transformer tts & waveflow.")
parser.add_argument(
"--transformer-tts-config",
type=str,
help="transformer tts config file.")
parser.add_argument(
"--transformer-tts-checkpoint",
type=str,
help="transformer tts checkpoint to load.")
parser.add_argument(
"--transformer-tts-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training transformer tts."
)
parser.add_argument(
"--waveflow-config", type=str, help="waveflow config file.")
# spectrograms are not normalized when training waveflow
parser.add_argument(
"--waveflow-checkpoint", type=str, help="waveflow checkpoint to load.")
parser.add_argument(
"--phones-dict", type=str, default=None, help="phone vocabulary file.")
parser.add_argument("--test-metadata", type=str, help="test metadata.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument(
"--device", type=str, default="gpu", help="device type to use.")
parser.add_argument("--verbose", type=int, default=1, help="verbose.")
args = parser.parse_args()
paddle.set_device(args.device)
with open(args.transformer_tts_config) as f:
transformer_tts_config = CfgNode(yaml.safe_load(f))
with open(args.waveflow_config) as f:
waveflow_config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(transformer_tts_config)
print(waveflow_config)
evaluate(args, transformer_tts_config, waveflow_config)
if __name__ == "__main__":
main()

@ -0,0 +1,199 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import logging
import shutil
from pathlib import Path
import jsonlines
import numpy as np
import paddle
import yaml
from paddle import DataParallel
from paddle import distributed as dist
from paddle.io import DataLoader
from paddle.io import DistributedBatchSampler
from parakeet.datasets.data_table import DataTable
from parakeet.datasets.am_batch_fn import transformer_single_spk_batch_fn
from parakeet.models.transformer_tts import TransformerTTS
from parakeet.models.transformer_tts import TransformerTTSUpdater
from parakeet.models.transformer_tts import TransformerTTSEvaluator
from parakeet.training.extensions.snapshot import Snapshot
from parakeet.training.extensions.visualizer import VisualDL
from parakeet.training.optimizer import build_optimizers
from parakeet.training.seeding import seed_everything
from parakeet.training.trainer import Trainer
from visualdl import LogWriter
from yacs.config import CfgNode
def train_sp(args, config):
# decides device type and whether to run in parallel
# setup running environment correctly
if not paddle.is_compiled_with_cuda():
paddle.set_device("cpu")
else:
paddle.set_device("gpu")
world_size = paddle.distributed.get_world_size()
if world_size > 1:
paddle.distributed.init_parallel_env()
# set the random seed, it is a must for multiprocess training
seed_everything(config.seed)
print(
f"rank: {dist.get_rank()}, pid: {os.getpid()}, parent_pid: {os.getppid()}",
)
# dataloader has been too verbose
logging.getLogger("DataLoader").disabled = True
# construct dataset for training and validation
with jsonlines.open(args.train_metadata, 'r') as reader:
train_metadata = list(reader)
train_dataset = DataTable(
data=train_metadata,
fields=[
"text",
"text_lengths",
"speech",
"speech_lengths",
],
converters={
"speech": np.load,
}, )
with jsonlines.open(args.dev_metadata, 'r') as reader:
dev_metadata = list(reader)
dev_dataset = DataTable(
data=dev_metadata,
fields=[
"text",
"text_lengths",
"speech",
"speech_lengths",
],
converters={
"speech": np.load,
}, )
# collate function and dataloader
train_sampler = DistributedBatchSampler(
train_dataset,
batch_size=config.batch_size,
shuffle=True,
drop_last=True)
print("samplers done!")
train_dataloader = DataLoader(
train_dataset,
batch_sampler=train_sampler,
collate_fn=transformer_single_spk_batch_fn,
num_workers=config.num_workers)
dev_dataloader = DataLoader(
dev_dataset,
shuffle=False,
drop_last=False,
batch_size=config.batch_size,
collate_fn=transformer_single_spk_batch_fn,
num_workers=config.num_workers)
print("dataloaders done!")
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
print("vocab_size:", vocab_size)
odim = config.n_mels
model = TransformerTTS(idim=vocab_size, odim=odim, **config["model"])
if world_size > 1:
model = DataParallel(model)
print("model done!")
optimizer = build_optimizers(model, **config["optimizer"])
print("optimizer done!")
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
if dist.get_rank() == 0:
config_name = args.config.split("/")[-1]
# copy conf to output_dir
shutil.copyfile(args.config, output_dir / config_name)
updater = TransformerTTSUpdater(
model=model,
optimizer=optimizer,
dataloader=train_dataloader,
output_dir=output_dir,
**config["updater"])
trainer = Trainer(updater, (config.max_epoch, 'epoch'), output_dir)
evaluator = TransformerTTSEvaluator(
model, dev_dataloader, output_dir=output_dir, **config["updater"])
if dist.get_rank() == 0:
trainer.extend(evaluator, trigger=(1, "epoch"))
writer = LogWriter(str(output_dir))
trainer.extend(VisualDL(writer), trigger=(1, "iteration"))
trainer.extend(
Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
# print(trainer.extensions)
trainer.run()
def main():
# parse args and config and redirect to train_sp
parser = argparse.ArgumentParser(description="Train a TransformerTTS "
"model with LJSpeech TTS dataset.")
parser.add_argument(
"--config", type=str, help="config file to overwrite default config.")
parser.add_argument("--train-metadata", type=str, help="training data.")
parser.add_argument("--dev-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument(
"--device", type=str, default="gpu", help="device type to use.")
parser.add_argument(
"--nprocs", type=int, default=1, help="number of processes.")
parser.add_argument("--verbose", type=int, default=1, help="verbose.")
parser.add_argument(
"--phones-dict", type=str, default=None, help="phone vocabulary file.")
args = parser.parse_args()
if args.device == "cpu" and args.nprocs > 1:
raise RuntimeError("Multiprocess training on CPU is not supported.")
with open(args.config) as f:
config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(config)
print(
f"master see the word size: {dist.get_world_size()}, from pid: {os.getpid()}"
)
# dispatch
if args.nprocs > 1:
dist.spawn(train_sp, (args, config), nprocs=args.nprocs)
else:
train_sp(args, config)
if __name__ == "__main__":
main()

@ -0,0 +1,52 @@
# WaveFlow with LJSpeech
## Dataset
### Download the dataset.
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
```
### Extract the dataset.
```bash
tar xjvf LJSpeech-1.1.tar.bz2
```
### Preprocess the dataset.
Assume the path to save the preprocessed dataset is `ljspeech_waveflow`. Run the command below to preprocess the dataset.
```bash
python preprocess.py --input=LJSpeech-1.1/ --output=ljspeech_waveflow
```
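The preprocessed directory is what `ljspeech.py` (later in this diff) consumes: a tab-separated `metadata.csv` with `fname`, `frames` and `samples` columns, plus `mel/` and `wav/` subfolders of `.npy` files. A small sketch for inspecting it (field names are taken from `ljspeech.py`):
```python
from pathlib import Path
import numpy as np
import pandas

root = Path("ljspeech_waveflow")
meta = pandas.read_csv(root / "metadata.csv", sep="\t", header=None,
                       names=["fname", "frames", "samples"])
print(meta.head())

# Each row points to a mel/wav pair saved by preprocess.py.
fname = meta.iloc[0]["fname"]
mel = np.load(root / "mel" / f"{fname}.npy")   # (n_mels, T), log magnitude
wav = np.load(root / "wav" / f"{fname}.npy")   # (T * hop_length,)
print(mel.shape, wav.shape)
```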
## Train the model
The training script requires 4 command line arguments.
`--data` is the path of the training dataset, and `--output` is the path of the output directory (we recommend using a subdirectory in `runs` to manage different experiments).
`--device` should be "cpu" or "gpu", and `--nprocs` is the number of processes used to train the model in parallel.
```bash
python train.py --data=ljspeech_waveflow/ --output=runs/test --device="gpu" --nprocs=1
```
If you want distributed training, set a larger `--nprocs` (e.g. 4). Note that distributed training with cpu is not supported yet.
## Synthesize
Synthesize waveforms. We assume `--input` is a directory containing several mel spectrograms (log magnitude) in `.npy` format. The output is saved in the `--output` directory as `.wav` files, each with the same name as the corresponding mel spectrogram (a short sketch of how such mel files can be produced follows the command below).
`--checkpoint_path` should be the path of the parameter file (`.pdparams`) to load. Note that the extension `.pdparams` is not included here.
`--device` specifies the device to run synthesis on.
```bash
python synthesize.py --input=mels/ --output=wavs/ --checkpoint_path='step-2000000' --device="gpu" --verbose
```
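The expected inputs are the same log-magnitude mels that `preprocess.py` writes. Below is a rough, illustrative sketch of producing one such `.npy` file for a single wav; the parameter values mirror `config.py`, it reuses the `LogMagnitude` normalizer from `parakeet.audio`, and it skips the reflection padding and peak normalization that `preprocess.py` applies, so treat it as a starting point rather than a drop-in tool.
```python
import librosa
import numpy as np
from parakeet.audio import LogMagnitude

# Load one LJSpeech wav at the sample rate used in config.py.
wav, _ = librosa.load("LJ001-0001.wav", sr=22050)

# Linear-magnitude spectrogram, then mel filterbank (same parameters as config.py).
spec = np.abs(librosa.core.stft(wav, n_fft=1024, hop_length=256,
                                win_length=1024, center=False))
mel_basis = librosa.filters.mel(sr=22050, n_fft=1024, n_mels=80, fmin=0, fmax=8000)
mel = np.dot(mel_basis, spec)  # shape (n_mels, T)

# Log-scale it the same way preprocess.py does, then save it for synthesize.py.
mel = LogMagnitude(min=1e-5).transform(mel)
np.save("LJ001-0001.npy", mel)  # point --input at the folder holding these files
```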
## Pretrained Model
A pretrained model with 128 residual channels can be downloaded here: [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_ljspeech_ckpt_0.3.zip).

@ -0,0 +1,56 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from yacs.config import CfgNode as CN
_C = CN()
_C.data = CN(
dict(
batch_size=8, # batch size
valid_size=16, # the first N examples are reserved for validation
sample_rate=22050, # Hz, sample rate
n_fft=1024, # fft frame size
win_length=1024, # window size
hop_length=256, # hop size between adjacent frames
fmin=0,
fmax=8000, # Hz, max frequency when converting to mel
n_mels=80, # mel bands
clip_frames=65, # mel clip frames
))
_C.model = CN(
dict(
upsample_factors=[16, 16],
n_flows=8, # number of flows in WaveFlow
n_layers=8, # number of conv block in each flow
n_group=16, # folding factor of audio and spectrogram
channels=128, # residual channels in each flow
kernel_size=[3, 3], # kernel size in each conv block
sigma=1.0, # stddev of the random noise
))
_C.training = CN(
dict(
lr=2e-4, # learning rate
valid_interval=1000, # validation interval
save_interval=10000, # checkpoint saving interval
max_iteration=3000000, # max iteration to train
))
def get_cfg_defaults():
"""Get a yacs CfgNode object with default values for my_project."""
# Return a clone so that the defaults will not be altered
# This is for the "local variable" use pattern
return _C.clone()

@ -0,0 +1,89 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
import numpy as np
import pandas
from paddle.io import Dataset
from parakeet.data.batch import batch_spec, batch_wav
class LJSpeech(Dataset):
"""A simple dataset adaptor for the processed ljspeech dataset."""
def __init__(self, root):
self.root = Path(root).expanduser()
meta_data = pandas.read_csv(
str(self.root / "metadata.csv"),
sep="\t",
header=None,
names=["fname", "frames", "samples"])
records = []
for row in meta_data.itertuples():
mel_path = str(self.root / "mel" / (row.fname + ".npy"))
wav_path = str(self.root / "wav" / (row.fname + ".npy"))
records.append((mel_path, wav_path))
self.records = records
def __getitem__(self, i):
mel_name, wav_name = self.records[i]
mel = np.load(mel_name)
wav = np.load(wav_name)
return mel, wav
def __len__(self):
return len(self.records)
class LJSpeechCollector(object):
"""A simple callable to batch LJSpeech examples."""
def __init__(self, padding_value=0.):
self.padding_value = padding_value
def __call__(self, examples):
mels = [example[0] for example in examples]
wavs = [example[1] for example in examples]
mels, _ = batch_spec(mels, pad_value=self.padding_value)
wavs, _ = batch_wav(wavs, pad_value=self.padding_value)
return mels, wavs
class LJSpeechClipCollector(object):
def __init__(self, clip_frames=65, hop_length=256):
self.clip_frames = clip_frames
self.hop_length = hop_length
def __call__(self, examples):
mels = []
wavs = []
for example in examples:
mel_clip, wav_clip = self.clip(example)
mels.append(mel_clip)
wavs.append(wav_clip)
mels = np.stack(mels)
wavs = np.stack(wavs)
return mels, wavs
def clip(self, example):
mel, wav = example
frames = mel.shape[-1]
start = np.random.randint(0, frames - self.clip_frames)
mel_clip = mel[:, start:start + self.clip_frames]
wav_clip = wav[start * self.hop_length:(start + self.clip_frames) *
self.hop_length]
return mel_clip, wav_clip

@ -0,0 +1,162 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
from pathlib import Path
import tqdm
import numpy as np
import librosa
import pandas as pd
from parakeet.datasets import LJSpeechMetaData
from parakeet.audio import LogMagnitude
from config import get_cfg_defaults
class Transform(object):
def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels, fmin,
fmax):
self.sample_rate = sample_rate
self.n_fft = n_fft
self.win_length = win_length
self.hop_length = hop_length
self.n_mels = n_mels
self.fmin = fmin
self.fmax = fmax
self.spec_normalizer = LogMagnitude(min=1e-5)
def __call__(self, example):
wav_path, _, _ = example
sr = self.sample_rate
n_fft = self.n_fft
win_length = self.win_length
hop_length = self.hop_length
n_mels = self.n_mels
fmin = self.fmin
fmax = self.fmax
wav, loaded_sr = librosa.load(wav_path, sr=None)
assert loaded_sr == sr, "sample rate of the audio does not match the config"
# Pad audio to the right size.
frames = int(np.ceil(float(wav.size) / hop_length))
fft_padding = (n_fft - hop_length) // 2 # padding added to each side of the audio
desired_length = frames * hop_length + fft_padding * 2
pad_amount = (desired_length - wav.size) // 2
if wav.size % 2 == 0:
wav = np.pad(wav, (pad_amount, pad_amount), mode='reflect')
else:
wav = np.pad(wav, (pad_amount, pad_amount + 1), mode='reflect')
# Normalize audio.
wav = wav / np.abs(wav).max() * 0.999
# Compute mel-spectrogram.
# Turn center to False to prevent internal padding.
spectrogram = librosa.core.stft(
wav,
hop_length=hop_length,
win_length=win_length,
n_fft=n_fft,
center=False)
spectrogram_magnitude = np.abs(spectrogram)
# Apply the mel filter bank.
mel_filter_bank = librosa.filters.mel(
sr=sr, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax)
mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude)
# log scale mel_spectrogram.
mel_spectrogram = self.spec_normalizer.transform(mel_spectrogram)
# Extract the center of audio that corresponds to mel spectrograms.
audio = wav[fft_padding:-fft_padding]
assert mel_spectrogram.shape[1] * hop_length == audio.size
# No clipping is done at preprocessing time; random clips are taken by the collator during training.
return audio, mel_spectrogram
def create_dataset(config, input_dir, output_dir):
input_dir = Path(input_dir).expanduser()
dataset = LJSpeechMetaData(input_dir)
output_dir = Path(output_dir).expanduser()
output_dir.mkdir(exist_ok=True)
transform = Transform(config.sample_rate, config.n_fft, config.win_length,
config.hop_length, config.n_mels, config.fmin,
config.fmax)
file_names = []
for example in tqdm.tqdm(dataset):
fname, _, _ = example
base_name = os.path.splitext(os.path.basename(fname))[0]
wav_dir = output_dir / "wav"
mel_dir = output_dir / "mel"
wav_dir.mkdir(exist_ok=True)
mel_dir.mkdir(exist_ok=True)
audio, mel = transform(example)
np.save(str(wav_dir / base_name), audio)
np.save(str(mel_dir / base_name), mel)
file_names.append((base_name, mel.shape[-1], audio.shape[-1]))
meta_data = pd.DataFrame.from_records(file_names)
meta_data.to_csv(
str(output_dir / "metadata.csv"), sep="\t", index=None, header=None)
print("saved meta data in to {}".format(
os.path.join(output_dir, "metadata.csv")))
print("Done!")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="create dataset")
parser.add_argument(
"--config",
type=str,
metavar="FILE",
help="extra config to overwrite the default config")
parser.add_argument(
"--input", type=str, help="path of the ljspeech dataset")
parser.add_argument(
"--output", type=str, help="path to save output dataset")
parser.add_argument(
"--opts",
nargs=argparse.REMAINDER,
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
)
parser.add_argument(
"-v", "--verbose", action="store_true", help="print msg")
config = get_cfg_defaults()
args = parser.parse_args()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
if args.verbose:
print(config.data)
print(args)
create_dataset(config.data, args.input, args.output)
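
As an illustration, the Transform can be applied to a single wav file outside the CLI; the feature parameters below are illustrative LJSpeech-style values (the script itself reads them from config.data), and the path is hypothetical:

transform = Transform(sample_rate=22050, n_fft=1024, win_length=1024,
                      hop_length=256, n_mels=80, fmin=0, fmax=8000)
audio, mel = transform(("/data/LJSpeech-1.1/wavs/LJ001-0001.wav", None, None))
# mel has shape (n_mels, frames); audio holds exactly frames * hop_length samples.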

@ -0,0 +1,83 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
from pathlib import Path
import numpy as np
import soundfile as sf
import paddle
from parakeet.models.waveflow import ConditionalWaveFlow
from parakeet.utils import layer_tools
from config import get_cfg_defaults
def main(config, args):
paddle.set_device(args.device)
model = ConditionalWaveFlow.from_pretrained(config, args.checkpoint_path)
layer_tools.recursively_remove_weight_norm(model)
model.eval()
mel_dir = Path(args.input).expanduser()
output_dir = Path(args.output).expanduser()
output_dir.mkdir(parents=True, exist_ok=True)
for file_path in mel_dir.glob("*.npy"):
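# Each .npy file is expected to contain a log-magnitude mel spectrogram of shape (n_mels, frames), matching the features the model was trained on.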
mel = np.load(str(file_path))
with paddle.amp.auto_cast():
audio = model.predict(mel)
audio_path = output_dir / (os.path.splitext(file_path.name)[0] + ".wav")
sf.write(audio_path, audio, config.data.sample_rate)
print("[synthesize] {} -> {}".format(file_path, audio_path))
if __name__ == "__main__":
config = get_cfg_defaults()
parser = argparse.ArgumentParser(
description="generate mel spectrogram with TransformerTTS.")
parser.add_argument(
"--config",
type=str,
metavar="FILE",
help="extra config to overwrite the default config")
parser.add_argument(
"--checkpoint_path", type=str, help="path of the checkpoint to load.")
parser.add_argument(
"--input",
type=str,
help="path of directory containing mel spectrogram (in .npy format)")
parser.add_argument("--output", type=str, help="path to save outputs")
parser.add_argument(
"--device", type=str, default="cpu", help="device type to use.")
parser.add_argument(
"--opts",
nargs=argparse.REMAINDER,
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
)
parser.add_argument(
"-v", "--verbose", action="store_true", help="print msg")
args = parser.parse_args()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
print(args)
main(config, args)
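
A usage sketch with the flags defined above (all paths are hypothetical):

python synthesize.py --config=conf/waveflow.yaml --checkpoint_path=exp/waveflow/checkpoints/step-2000000 --input=mels/ --output=wavs/ --device=gpu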

@ -0,0 +1,158 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time
import numpy as np
import paddle
from paddle import distributed as dist
from paddle.io import DataLoader, DistributedBatchSampler
from parakeet.data import dataset
from parakeet.models.waveflow import ConditionalWaveFlow, WaveFlowLoss
from parakeet.utils import mp_tools
from parakeet.training.cli import default_argument_parser
from parakeet.training.experiment import ExperimentBase
from config import get_cfg_defaults
from ljspeech import LJSpeech, LJSpeechClipCollector, LJSpeechCollector
class Experiment(ExperimentBase):
def setup_model(self):
config = self.config
model = ConditionalWaveFlow(
upsample_factors=config.model.upsample_factors,
n_flows=config.model.n_flows,
n_layers=config.model.n_layers,
n_group=config.model.n_group,
channels=config.model.channels,
n_mels=config.data.n_mels,
kernel_size=config.model.kernel_size)
if self.parallel:
model = paddle.DataParallel(model)
optimizer = paddle.optimizer.Adam(
config.training.lr, parameters=model.parameters())
criterion = WaveFlowLoss(sigma=config.model.sigma)
self.model = model
self.optimizer = optimizer
self.criterion = criterion
def setup_dataloader(self):
config = self.config
args = self.args
ljspeech_dataset = LJSpeech(args.data)
valid_set, train_set = dataset.split(ljspeech_dataset,
config.data.valid_size)
batch_fn = LJSpeechClipCollector(config.data.clip_frames,
config.data.hop_length)
if not self.parallel:
train_loader = DataLoader(
train_set,
batch_size=config.data.batch_size,
shuffle=True,
drop_last=True,
collate_fn=batch_fn)
else:
sampler = DistributedBatchSampler(
train_set,
batch_size=config.data.batch_size,
num_replicas=dist.get_world_size(),
rank=dist.get_rank(),
shuffle=True,
drop_last=True)
train_loader = DataLoader(
train_set, batch_sampler=sampler, collate_fn=batch_fn)
valid_batch_fn = LJSpeechCollector()
valid_loader = DataLoader(
valid_set, batch_size=1, collate_fn=valid_batch_fn)
self.train_loader = train_loader
self.valid_loader = valid_loader
def compute_outputs(self, mel, wav):
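# The flow transforms (wav, mel) into a latent z plus the log-determinant of the Jacobian; both terms feed WaveFlowLoss.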
# model_core = model._layers if isinstance(model, paddle.DataParallel) else model
z, log_det_jacobian = self.model(wav, mel)
return z, log_det_jacobian
def train_batch(self):
start = time.time()
batch = self.read_batch()
data_loader_time = time.time() - start
self.model.train()
self.optimizer.clear_grad()
mel, wav = batch
z, log_det_jacobian = self.compute_outputs(mel, wav)
loss = self.criterion(z, log_det_jacobian)
loss.backward()
self.optimizer.step()
iteration_time = time.time() - start
loss_value = float(loss)
msg = "Rank: {}, ".format(dist.get_rank())
msg += "step: {}, ".format(self.iteration)
msg += "time: {:>.3f}s/{:>.3f}s, ".format(data_loader_time,
iteration_time)
msg += "loss: {:>.6f}".format(loss_value)
self.logger.info(msg)
if dist.get_rank() == 0:
self.visualizer.add_scalar("train/loss", loss_value, self.iteration)
@mp_tools.rank_zero_only
@paddle.no_grad()
def valid(self):
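# Note: only a single batch from the validation loader is evaluated per call.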
valid_iterator = iter(self.valid_loader)
valid_losses = []
mel, wav = next(valid_iterator)
z, log_det_jacobian = self.compute_outputs(mel, wav)
loss = self.criterion(z, log_det_jacobian)
valid_losses.append(float(loss))
valid_loss = np.mean(valid_losses)
self.visualizer.add_scalar("valid/loss", valid_loss, self.iteration)
def main_sp(config, args):
exp = Experiment(config, args)
exp.setup()
exp.resume_or_load()
exp.run()
def main(config, args):
if args.nprocs > 1 and args.device == "gpu":
dist.spawn(main_sp, args=(config, args), nprocs=args.nprocs)
else:
main_sp(config, args)
if __name__ == "__main__":
config = get_cfg_defaults()
parser = default_argument_parser()
args = parser.parse_args()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
print(args)
main(config, args)
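
A launch sketch (flag names are inferred from the attribute accesses above, namely args.config, args.data, args.device, args.nprocs and args.opts, so they are assumptions rather than verified against default_argument_parser; paths are hypothetical):

python train.py --config=conf/waveflow.yaml --data=/data/ljspeech_processed --device=gpu --nprocs=1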
