Merge branch 'develop' of https://github.com/PaddlePaddle/PaddleSpeech into finetune
commit 2a95eaa8c6
@ -0,0 +1,42 @@
---
name: "\U0001F41B TTS Bug Report"
about: Create a report to help us improve
title: "[TTS]XXXX"
labels: Bug, T2S
assignees: yt605155624

---

For support and discussions, please use our [GitHub Discussions](https://github.com/PaddlePaddle/DeepSpeech/discussions).

If you've found a bug, please create an issue with the following information:

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Environment (please complete the following information):**
- OS: [e.g. Ubuntu]
- GCC/G++ Version [e.g. 8.3]
- Python Version [e.g. 3.7]
- PaddlePaddle Version [e.g. 2.0.0]
- Model Version [e.g. 2.0.0]
- GPU/DRIVER Information [e.g. Tesla V100-SXM2-32GB/440.64.00]
- CUDA/CUDNN Version [e.g. cuda-10.2]
- MKL Version
- TensorRT Version

**Additional context**
Add any other context about the problem here.
@ -0,0 +1,19 @@
---
name: "\U0001F680 Feature Request"
about: As a user, I want to request a New Feature on the product.
title: ''
labels: feature request
assignees: D-DanielYang, iftaken

---

## Feature Request

**Is your feature request related to a problem? Please describe:**
<!-- A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] -->

**Describe the feature you'd like:**
<!-- A clear and concise description of what you want to happen. -->

**Describe alternatives you've considered:**
<!-- A clear and concise description of any alternative solutions or features you've considered. -->
@ -0,0 +1,15 @@
---
name: "\U0001F9E9 Others"
about: Report any other non-support related issues.
title: ''
labels: ''
assignees: ''

---

## Others

<!--
You can report any issues that are not covered by the previous templates, including but not limited to: enhancement suggestions, feedback on the use of the framework, version compatibility issues, unclear error messages, etc.
-->
@ -0,0 +1,19 @@
---
name: "\U0001F914 Ask a Question"
about: I want to ask a question.
title: ''
labels: Question
assignees: ''

---

## General Question

<!--
Before asking a question, make sure you have:
- Searched Baidu/Google for your question.
- Searched open and closed [GitHub issues](https://github.com/PaddlePaddle/PaddleSpeech/issues?q=is%3Aissue)
- Read the documentation:
  - [Readme](https://github.com/PaddlePaddle/PaddleSpeech)
  - [Doc](https://paddlespeech.readthedocs.io/)
-->
@ -1,66 +0,0 @@
# Changelog

Date: 2022-3-22, Author: yt605155624.
Add features to: CLI:
- Support aishell3_hifigan and vctk_hifigan.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1587

Date: 2022-3-09, Author: yt605155624.
Add features to: T2S:
- Add ljspeech hifigan egs.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1549

Date: 2022-3-08, Author: yt605155624.
Add features to: T2S:
- Add aishell3 hifigan egs.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1545

Date: 2022-3-08, Author: yt605155624.
Add features to: T2S:
- Add vctk hifigan egs.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1544

Date: 2022-1-29, Author: yt605155624.
Add features to: T2S:
- Update aishell3 vc0 with new Tacotron2.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1419

Date: 2022-1-29, Author: yt605155624.
Add features to: T2S:
- Add ljspeech Tacotron2.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1416

Date: 2022-1-24, Author: yt605155624.
Add features to: T2S:
- Add csmsc WaveRNN.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1379

Date: 2022-1-19, Author: yt605155624.
Add features to: T2S:
- Add csmsc Tacotron2.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1314


Date: 2022-1-10, Author: Jackwaterveg.
Add features to: CLI:
- Support English (librispeech/asr1/transformer).
- Support choosing `decode_method` for conformer and transformer models.
- Refactor the config, using the unified config.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1297

***

Date: 2022-1-17, Author: Jackwaterveg.
Add features to: CLI:
- Support deepspeech2 online/offline model (aishell).
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1356

***

Date: 2022-1-24, Author: Jackwaterveg.
Add features to: ctc_decoders:
- Support online ctc prefix-beam search decoder.
- Unified ctc online decoder and ctc offline decoder.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/821

***
@ -0,0 +1,27 @@
(Simplified Chinese | [English](./README.md))

# Metaverse

## Introduction

The metaverse is a new form of Internet application and social interaction that fuses a variety of new technologies to produce virtual reality.

This demo makes a celebrity in a picture "speak". By combining the `TTS` module of `PaddleSpeech` with `PaddleGAN`, we integrate the installation and the relevant modules into a single shell script.

## Usage

You can use the `TTS` module of `PaddleSpeech` together with `PaddleGAN` to make your favorite person say the specified content and build your own virtual human.

Run `run.sh` to complete all the basic steps, including installation.

```bash
./run.sh
```

In `run.sh`, `source path.sh` is executed first to set the environment variables.

If you want to try your own sentences, replace the sentences in `sentences.txt`.

If you want to try your own image, replace `download/Lamarr.png` in the shell script with your image.

The results are shown in our [notebook](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/tutorial/tts/tts_tutorial.ipynb).
@ -1,14 +1,13 @@
aiofiles
faiss-cpu
fastapi
librosa
numpy
paddlenlp
paddlepaddle
paddlespeech
pydantic
python-multipart
scikit_learn
SoundFile
starlette
uvicorn
@ -1,18 +1,13 @@
import random


def randName(n=5):
    # Generate a random lowercase name of length n (sampled without replacement).
    return "".join(random.sample('zyxwvutsrqponmlkjihgfedcba', n))


def SuccessRequest(result=None, message="ok"):
    # Standard JSON body for a successful API response.
    return {"code": 0, "result": result, "message": message}


def ErrorRequest(result=None, message="error"):
    # Standard JSON body for a failed API response.
    return {"code": -1, "result": result, "message": message}
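For context, a minimal sketch of how these response helpers might be used in a FastAPI route (the route path and handler below are hypothetical and only for illustration; `randName`, `SuccessRequest`, and `ErrorRequest` are assumed to be importable from the module above):

```python
from fastapi import FastAPI

# from the module above: randName, SuccessRequest, ErrorRequest

app = FastAPI()


@app.get("/api/random_name")  # hypothetical route, not part of this module
async def random_name(n: int = 5):
    try:
        return SuccessRequest(result={"name": randName(n)})
    except ValueError as e:
        # random.sample raises ValueError when n exceeds the alphabet size.
        return ErrorRequest(message=str(e))
```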
@ -0,0 +1,20 @@
(Simplified Chinese | [English](./README.md))

# Story Talker

## Introduction

Storybooks are important for early childhood education, but parents often do not have enough time to read them to their children. Very young children may not recognize the Chinese characters in a storybook, and sometimes children simply want to "listen" rather than "read".

You can use `PaddleOCR` to extract the text of a storybook and have it read aloud by the `TTS` module of `PaddleSpeech`, as sketched below.
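A minimal sketch of this OCR-then-TTS pipeline, assuming the `PaddleOCR` Python API and the `TTSExecutor` class from the PaddleSpeech CLI; the image path, the output filename, and the way the OCR result is unpacked are illustrative assumptions rather than the code used by `run.sh`:

```python
from paddleocr import PaddleOCR
from paddlespeech.cli.tts.infer import TTSExecutor

# Recognize the text on one storybook page (placeholder image path).
ocr = PaddleOCR(lang="ch")
pages = ocr.ocr("storybook_page.png")
# Assumes entries shaped like [box, (text, score)]; adjust to your paddleocr version.
text = "".join(item[1][0] for item in pages[0])

# Read the recognized text aloud with the PaddleSpeech TTS module.
tts = TTSExecutor()
tts(text=text, output="storybook_page.wav")
```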

## Usage

Run the following command to get started:

```bash
./run.sh
```

The results are shown in the [notebook](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/tutorial/tts/tts_tutorial.ipynb).
@ -0,0 +1,33 @@
(Simplified Chinese | [English](./README.md))

# Style FastSpeech2

## Introduction

[FastSpeech2](https://arxiv.org/abs/2006.04558) is a classic acoustic model for speech synthesis. It introduces controllable speech inputs, including `phoneme duration`, `energy`, and `pitch`.

At inference time, you can change these variables to get some interesting results.

For example:

1. The `duration` in `FastSpeech2` can control the speed of the audio while keeping the `pitch` unchanged. (In some speech tools, increasing the speed raises the pitch, and vice versa.)
2. When we set the `pitch` of a sentence to its average value and set the `tones` of the phonemes to `1`, we get a `robot-style` timbre.
3. When we raise the `pitch` of an adult female voice (with a fixed ratio), we get a `child-style` timbre.

The `duration` and `pitch` of different phonemes in a sentence can be scaled by different ratios. You can set different scale ratios to emphasize or weaken the pronunciation of certain phonemes, as in the sketch below.
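A minimal numpy illustration of this kind of control, assuming `durations` and `pitch` are the per-phoneme arrays predicted by FastSpeech2 (the values and scale factors below are placeholders; see `style_syn.py` for the actual implementation):

```python
import numpy as np

# Per-phoneme predictions (placeholder values for illustration).
durations = np.array([3.0, 5.0, 4.0, 6.0])      # frames per phoneme
pitch = np.array([220.0, 230.0, 210.0, 225.0])  # pitch per phoneme

# 1. Speak more slowly while leaving the pitch untouched.
slow_durations = durations * 1.2

# 2. "Robot" style: flatten the pitch contour to its average value.
robot_pitch = np.full_like(pitch, pitch.mean())

# 3. "Child" style: raise the whole pitch contour by a fixed ratio.
child_pitch = pitch * 1.5

# 4. Emphasize one phoneme by lengthening it and raising its pitch.
emphasized_durations = durations.copy()
emphasized_durations[1] *= 1.5
emphasized_pitch = pitch.copy()
emphasized_pitch[1] *= 1.2
```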

## Run

Run the following command to get started:

```bash
./run.sh
```

In `run.sh`, `source path.sh` is executed first to set the environment variables.

If you want to try your own sentences, replace the sentences in `sentences.txt`.

For more details, please see `style_syn.py`.

Audio samples can be found at [style-control-in-fastspeech2](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html#style-control-in-fastspeech2).
@ -0,0 +1,126 @@
# FastSpeech2 + AISHELL-3 Voice Cloning (ECAPA-TDNN)
This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used for the voice cloning task. We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf). The general steps are as follows:
1. Speaker Encoder: We use speaker verification to train a speaker encoder. The datasets used for this task differ from those used for `FastSpeech2` because transcriptions are not needed, so we can use more datasets; refer to [ECAPA-TDNN](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0).
2. Synthesizer: We use the trained speaker encoder to generate a speaker embedding for each sentence in AISHELL-3. This embedding is an extra input of `FastSpeech2` that is concatenated with the encoder outputs (see the sketch after this list).
3. Vocoder: We use [Parallel WaveGAN](http://arxiv.org/abs/1910.11480) as the neural vocoder; refer to [voc1](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1).
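Below is a minimal sketch of how the utterance-level speaker embedding in step 2 can be combined with the encoder outputs, assuming a 192-dimensional ECAPA-TDNN embedding (matching `spk_embed_dim: 192` and `spk_embed_integration_type: concat` in `conf/default.yaml`); the tensor names and the projection layer are illustrative, not the actual FastSpeech2 implementation:

```python
import paddle

batch, time_steps, adim, spk_dim = 2, 100, 384, 192

encoder_out = paddle.randn([batch, time_steps, adim])  # FastSpeech2 encoder outputs
spk_emb = paddle.randn([batch, spk_dim])               # one ECAPA-TDNN embedding per utterance

# Broadcast the utterance-level embedding over time and concatenate on the feature axis.
spk_emb_expanded = spk_emb.unsqueeze(1).expand([batch, time_steps, spk_dim])
conditioned = paddle.concat([encoder_out, spk_emb_expanded], axis=-1)

# A linear projection typically maps the concatenated features back to the model dimension.
projection = paddle.nn.Linear(adim + spk_dim, adim)
conditioned = projection(conditioned)
print(conditioned.shape)  # [2, 100, 384]
```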

## Dataset
### Download and Extract
Download AISHELL-3 from its [Official Website](http://www.aishelltech.com/aishell_3) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/data_aishell3`.

### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download it here: [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model following the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which uses MFA1.x for now) of our repo.

## Get Started
Assume the path to the dataset is `~/datasets/data_aishell3`.
Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`.

Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize waveforms from `metadata.jsonl`.
5. start a voice cloning inference.
```bash
./run.sh
```
You can choose a range of stages to run, or set `stage` equal to `stop-stage` to use only one stage. For example, the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│   ├── norm
│   └── raw
├── embed
│   ├── SSB0005
│   ├── SSB0009
│   ├── ...
│   └── ...
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│   ├── norm
│   └── raw
└── train
    ├── energy_stats.npy
    ├── norm
    ├── pitch_stats.npy
    ├── raw
    └── speech_stats.npy
```
The `embed` folder contains the generated speaker embedding for each sentence in AISHELL-3; it has the same file structure as the wav files, and the embeddings are stored in `.npy` format.

The computing time of utterance embedding can be x hours.

The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the speech, pitch, and energy features of each utterance, while the `norm` folder contains the normalized ones. The statistics used to normalize the features are computed from the training set and located in `dump/train/*_stats.npy`.

Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, the speaker, and the id of each utterance, as in the sketch below.
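A minimal sketch of inspecting one preprocessed record and one generated speaker embedding, assuming preprocessing has finished (the embedding path below is a placeholder that just mirrors the wav directory layout):

```python
import json
import numpy as np

# Read the first record of the normalized training metadata.
with open("dump/train/norm/metadata.jsonl") as f:
    record = json.loads(f.readline())
print(sorted(record.keys()))  # phones, durations, feature paths, speaker, utterance id, ...

# Each sentence also has an ECAPA-TDNN speaker embedding stored as .npy under dump/embed.
embedding = np.load("dump/embed/SSB0005/some_utterance.npy")  # placeholder file name
print(embedding.shape)  # expected to be (192,), matching spk_embed_dim in the config
```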

The preprocessing step is very similar to that of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but there is one more `ECAPA-TDNN/inference` step here.

### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
The training step is very similar to that of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/train.py`.

### Synthesizing
We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1) as the neural vocoder.
Download the pretrained parallel wavegan model from [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) and unzip it.
```bash
unzip pwg_aishell3_ckpt_0.5.zip
```
The Parallel WaveGAN checkpoint contains the files listed below.
```text
pwg_aishell3_ckpt_0.5
├── default.yaml               # default config used to train parallel wavegan
├── feats_stats.npy            # statistics used to normalize spectrogram when training parallel wavegan
└── snapshot_iter_1000000.pdz  # generator parameters of parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveforms from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
The synthesizing step is very similar to that of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/../synthesize.py`.

### Voice Cloning
Assume there are some reference audios in `./ref_audio` (the format must be wav here).
```text
ref_audio
├── 001238.wav
├── LJ015-0254.wav
└── audio_self_test.wav
```
`./local/voice_cloning.sh` calls `${BIN_DIR}/../voice_cloning.py`.

```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ref_audio_dir}
```
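The cloning result depends on the speaker embedding extracted from the reference audio. A minimal, hedged sketch of extracting such an ECAPA-TDNN embedding with the PaddleSpeech vector CLI (assuming the `VectorExecutor` API and the `ecapatdnn_voxceleb12` model name; this is not the code path used by `voice_cloning.py` itself):

```python
from paddlespeech.cli.vector import VectorExecutor

# Extract a speaker embedding from one reference wav.
vector_executor = VectorExecutor()
embedding = vector_executor(
    model="ecapatdnn_voxceleb12",
    audio_file="./ref_audio/001238.wav",
)
print(embedding.shape)  # expected to be a 192-dimensional vector
```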
## Pretrained Model
- [fastspeech2_aishell3_ckpt_vc2_1.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_vc2_1.2.0.zip)

Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss | eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
default | 2 (gpu) x 96400 | 0.991855 | 0.599517 | 0.052142 | 0.094877 | 0.245318

The FastSpeech2 checkpoint contains the files listed below.
(There is no need for `speaker_id_map.txt` here.)

```text
fastspeech2_aishell3_ckpt_vc2_1.2.0
├── default.yaml             # default config used to train fastspeech2
├── energy_stats.npy         # statistics used to normalize energy when training fastspeech2
├── phone_id_map.txt         # phone vocabulary file when training fastspeech2
├── pitch_stats.npy          # statistics used to normalize pitch when training fastspeech2
├── snapshot_iter_96400.pdz  # model parameters and optimizer states
└── speech_stats.npy         # statistics used to normalize spectrogram when training fastspeech2
```
@ -0,0 +1,104 @@
###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################

fs: 24000          # sr
n_fft: 2048        # FFT size (samples).
n_shift: 300       # Hop size (samples). 12.5ms
win_length: 1200   # Window length (samples). 50ms
                   # If set to null, it will be the same as fft_size.
window: "hann"     # Window function.

# Only used for feats_type != raw

fmin: 80           # Minimum frequency of Mel basis.
fmax: 7600         # Maximum frequency of Mel basis.
n_mels: 80         # The number of mel basis.

# Only used for the model using pitch features (e.g. FastSpeech2)
f0min: 80          # Minimum f0 for pitch extraction.
f0max: 400         # Maximum f0 for pitch extraction.


###########################################################
#                       DATA SETTING                      #
###########################################################
batch_size: 64
num_workers: 2


###########################################################
#                       MODEL SETTING                     #
###########################################################
model:
    adim: 384                                    # attention dimension
    aheads: 2                                    # number of attention heads
    elayers: 4                                   # number of encoder layers
    eunits: 1536                                 # number of encoder ff units
    dlayers: 4                                   # number of decoder layers
    dunits: 1536                                 # number of decoder ff units
    positionwise_layer_type: conv1d              # type of position-wise layer
    positionwise_conv_kernel_size: 3             # kernel size of position wise conv layer
    duration_predictor_layers: 2                 # number of layers of duration predictor
    duration_predictor_chans: 256                # number of channels of duration predictor
    duration_predictor_kernel_size: 3            # filter size of duration predictor
    postnet_layers: 5                            # number of layers of postnet
    postnet_filts: 5                             # filter size of conv layers in postnet
    postnet_chans: 256                           # number of channels of conv layers in postnet
    use_scaled_pos_enc: True                     # whether to use scaled positional encoding
    encoder_normalize_before: True               # whether to perform layer normalization before the input
    decoder_normalize_before: True               # whether to perform layer normalization before the input
    reduction_factor: 1                          # reduction factor
    init_type: xavier_uniform                    # initialization type
    init_enc_alpha: 1.0                          # initial value of alpha of encoder scaled position encoding
    init_dec_alpha: 1.0                          # initial value of alpha of decoder scaled position encoding
    transformer_enc_dropout_rate: 0.2            # dropout rate for transformer encoder layer
    transformer_enc_positional_dropout_rate: 0.2 # dropout rate for transformer encoder positional encoding
    transformer_enc_attn_dropout_rate: 0.2       # dropout rate for transformer encoder attention layer
    transformer_dec_dropout_rate: 0.2            # dropout rate for transformer decoder layer
    transformer_dec_positional_dropout_rate: 0.2 # dropout rate for transformer decoder positional encoding
    transformer_dec_attn_dropout_rate: 0.2       # dropout rate for transformer decoder attention layer
    pitch_predictor_layers: 5                    # number of conv layers in pitch predictor
    pitch_predictor_chans: 256                   # number of channels of conv layers in pitch predictor
    pitch_predictor_kernel_size: 5               # kernel size of conv layers in pitch predictor
    pitch_predictor_dropout: 0.5                 # dropout rate in pitch predictor
    pitch_embed_kernel_size: 1                   # kernel size of conv embedding layer for pitch
    pitch_embed_dropout: 0.0                     # dropout rate after conv embedding layer for pitch
    stop_gradient_from_pitch_predictor: True     # whether to stop the gradient from pitch predictor to encoder
    energy_predictor_layers: 2                   # number of conv layers in energy predictor
    energy_predictor_chans: 256                  # number of channels of conv layers in energy predictor
    energy_predictor_kernel_size: 3              # kernel size of conv layers in energy predictor
    energy_predictor_dropout: 0.5                # dropout rate in energy predictor
    energy_embed_kernel_size: 1                  # kernel size of conv embedding layer for energy
    energy_embed_dropout: 0.0                    # dropout rate after conv embedding layer for energy
    stop_gradient_from_energy_predictor: False   # whether to stop the gradient from energy predictor to encoder
    spk_embed_dim: 192                           # speaker embedding dimension
    spk_embed_integration_type: concat           # speaker embedding integration type



###########################################################
#                     UPDATER SETTING                     #
###########################################################
updater:
    use_masking: True     # whether to apply masking for padded part in loss calculation


###########################################################
#                    OPTIMIZER SETTING                    #
###########################################################
optimizer:
    optim: adam           # optimizer type
    learning_rate: 0.001  # learning rate

###########################################################
#                     TRAINING SETTING                    #
###########################################################
max_epoch: 200
num_snapshots: 5


###########################################################
#                      OTHER SETTING                      #
###########################################################
seed: 10086
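A minimal sketch of inspecting this config from Python (assuming PyYAML is installed; the training scripts may wrap the loaded dict differently):

```python
import yaml

with open("conf/default.yaml") as f:
    config = yaml.safe_load(f)

print(config["fs"], config["n_shift"])                # 24000 300
print(config["model"]["spk_embed_dim"])               # 192, matches the ECAPA-TDNN embedding size
print(config["model"]["spk_embed_integration_type"])  # concat
```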
@ -0,0 +1,85 @@
#!/bin/bash

stage=0
stop_stage=100

config_path=$1

# gen speaker embedding
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${BIN_DIR}/vc2_infer.py \
        --input=~/datasets/data_aishell3/train/wav/ \
        --output=dump/embed \
        --num-cpu=20
fi

# copy from tts3/preprocess
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # get durations from MFA's result
    echo "Generate durations.txt from MFA results ..."
    python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
        --inputdir=./aishell3_alignment_tone \
        --output durations.txt \
        --config=${config_path}
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # extract features
    echo "Extract features ..."
    python3 ${BIN_DIR}/preprocess.py \
        --dataset=aishell3 \
        --rootdir=~/datasets/data_aishell3/ \
        --dumpdir=dump \
        --dur-file=durations.txt \
        --config=${config_path} \
        --num-cpu=20 \
        --cut-sil=True \
        --spk_emb_dir=dump/embed
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # get features' stats (mean and std)
    echo "Get features' stats ..."
    python3 ${MAIN_ROOT}/utils/compute_statistics.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --field-name="speech"

    python3 ${MAIN_ROOT}/utils/compute_statistics.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --field-name="pitch"

    python3 ${MAIN_ROOT}/utils/compute_statistics.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --field-name="energy"
fi

if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    # normalize and convert phone/speaker to id, dev and test should use train's stats
    echo "Normalize ..."
    python3 ${BIN_DIR}/normalize.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --dumpdir=dump/train/norm \
        --speech-stats=dump/train/speech_stats.npy \
        --pitch-stats=dump/train/pitch_stats.npy \
        --energy-stats=dump/train/energy_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt

    python3 ${BIN_DIR}/normalize.py \
        --metadata=dump/dev/raw/metadata.jsonl \
        --dumpdir=dump/dev/norm \
        --speech-stats=dump/train/speech_stats.npy \
        --pitch-stats=dump/train/pitch_stats.npy \
        --energy-stats=dump/train/energy_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt

    python3 ${BIN_DIR}/normalize.py \
        --metadata=dump/test/raw/metadata.jsonl \
        --dumpdir=dump/test/norm \
        --speech-stats=dump/train/speech_stats.npy \
        --pitch-stats=dump/train/pitch_stats.npy \
        --energy-stats=dump/train/energy_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt
fi
@ -0,0 +1,22 @@
#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
    --am=fastspeech2_aishell3 \
    --am_config=${config_path} \
    --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
    --am_stat=dump/train/speech_stats.npy \
    --voc=pwgan_aishell3 \
    --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
    --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
    --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
    --test_metadata=dump/test/norm/metadata.jsonl \
    --output_dir=${train_output_path}/test \
    --phones_dict=dump/phone_id_map.txt \
    --speaker_dict=dump/speaker_id_map.txt \
    --voice-cloning=True
@ -0,0 +1,13 @@
#!/bin/bash

config_path=$1
train_output_path=$2

python3 ${BIN_DIR}/train.py \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --config=${config_path} \
    --output-dir=${train_output_path} \
    --ngpu=2 \
    --phones-dict=dump/phone_id_map.txt \
    --voice-cloning=True
@ -0,0 +1,23 @@
#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3
ref_audio_dir=$4

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../voice_cloning.py \
    --am=fastspeech2_aishell3 \
    --am_config=${config_path} \
    --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
    --am_stat=dump/train/speech_stats.npy \
    --voc=pwgan_aishell3 \
    --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
    --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
    --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
    --text="凯莫瑞安联合体的经济崩溃迫在眉睫。" \
    --input-dir=${ref_audio_dir} \
    --output-dir=${train_output_path}/vc_syn \
    --phones-dict=dump/phone_id_map.txt \
    --use_ecapa=True
@ -0,0 +1,13 @@
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`

export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C

export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}

MODEL=fastspeech2
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
@ -0,0 +1,39 @@
#!/bin/bash

set -e
source path.sh

gpus=0,1
stage=0
stop_stage=100

conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_96400.pdz
ref_audio_dir=ref_audio

# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional arguments `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # prepare data
    CUDA_VISIBLE_DEVICES=${gpus} ./local/preprocess.sh ${conf_path} || exit -1
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # train model, all `ckpt` under `train_output_path/checkpoints/` dir
    CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # synthesize, vocoder is pwgan
    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # voice cloning, vocoder is pwgan
    CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ref_audio_dir} || exit -1
fi
@ -0,0 +1,154 @@
# VITS with AISHELL-3
This example contains code used to train a [VITS](https://arxiv.org/abs/2106.06103) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used for the voice cloning task. We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf). The general steps are as follows:
1. Speaker Encoder: We use speaker verification to train a speaker encoder. The datasets used for this task differ from those used for `VITS` because transcriptions are not needed, so we can use more datasets; refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e).
2. Synthesizer and Vocoder: We use the trained speaker encoder to generate a speaker embedding for each sentence in AISHELL-3. This embedding is an extra input of `VITS` that is concatenated with the encoder outputs. The vocoder is part of `VITS` due to its special structure.

## Dataset
### Download and Extract
Download AISHELL-3 from its [Official Website](http://www.aishelltech.com/aishell_3) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/data_aishell3`.

### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get the phonemes for VITS; the MFA durations are not needed here.
You can download it here: [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model following the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which uses MFA1.x for now) of our repo.

## Pretrained GE2E Model
We use a pretrained GE2E model to generate a speaker embedding for each sentence.

Download the pretrained GE2E model from here [ge2e_ckpt_0.3.zip](https://bj.bcebos.com/paddlespeech/Parakeet/released_models/ge2e/ge2e_ckpt_0.3.zip), and `unzip` it.

## Get Started
Assume the path to the dataset is `~/datasets/data_aishell3`.
Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`.
Assume the path to the pretrained ge2e model is `./ge2e_ckpt_0.3`.

Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize waveforms from `metadata.jsonl`.
5. start a voice cloning inference.

```bash
./run.sh
```
You can choose a range of stages to run, or set `stage` equal to `stop-stage` to use only one stage. For example, the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```

### Data Preprocessing
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/preprocess.sh ${conf_path} ${ge2e_ckpt_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.

```text
dump
├── dev
│   ├── norm
│   └── raw
├── embed
│   ├── SSB0005
│   ├── SSB0009
│   ├── ...
│   └── ...
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│   ├── norm
│   └── raw
└── train
    ├── feats_stats.npy
    ├── norm
    └── raw
```
The `embed` folder contains the generated speaker embedding for each sentence in AISHELL-3; it has the same file structure as the wav files, and the embeddings are stored in `.npy` format.

The computing time of utterance embedding can be x hours.

The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the wave and linear spectrogram of each utterance, while the `norm` folder contains the normalized ones. The statistics used to normalize the features are computed from the training set and located in `dump/train/feats_stats.npy`.

Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the phones, text_lengths, feats, feats_lengths, the path of linear spectrogram features, the path of raw waves, the speaker, and the id of each utterance.

The preprocessing step is very similar to that of [vits](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vits), but there is one more `ge2e/inference` step here.

### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
The training step is very similar to that of [vits](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vits), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/train.py`.

### Synthesizing

`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveforms from `metadata.jsonl`.

```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--config CONFIG] [--ckpt CKPT]
                     [--phones_dict PHONES_DICT] [--speaker_dict SPEAKER_DICT]
                     [--voice-cloning VOICE_CLONING] [--ngpu NGPU]
                     [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]

Synthesize with VITS

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       Config of VITS.
  --ckpt CKPT           Checkpoint file of VITS.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --voice-cloning VOICE_CLONING
                        whether training voice cloning model.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --test_metadata TEST_METADATA
                        test metadata.
  --output_dir OUTPUT_DIR
                        output dir.
```
The synthesizing step is very similar to that of [vits](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vits), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/../synthesize.py`.

### Voice Cloning
Assume there are some reference audios in `./ref_audio`
```text
ref_audio
├── 001238.wav
├── LJ015-0254.wav
└── audio_self_test.mp3
```
`./local/voice_cloning.sh` calls `${BIN_DIR}/voice_cloning.py`.

```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${add_blank} ${ref_audio_dir}
```

If you want to convert the voice of a source audio file to that of a reference speaker, run:

```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${add_blank} ${ref_audio_dir} ${src_audio_path}
```

<!-- TODO display these after we trained the model -->
<!--
## Pretrained Model

The pretrained model can be downloaded here:

- [vits_vc_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/vits/vits_vc_aishell3_ckpt_1.1.0.zip) (add_blank=true)

VITS checkpoint contains files listed below.
(There is no need for `speaker_id_map.txt` here.)

```text
vits_vc_aishell3_ckpt_1.1.0
├── default.yaml             # default config used to train vits
├── phone_id_map.txt         # phone vocabulary file when training vits
└── snapshot_iter_333000.pdz # model parameters and optimizer states
```

ps: This ckpt is not good enough; a better result is still being trained.
-->
@ -0,0 +1,185 @@
# This configuration tested on 4 GPUs (V100) with 32GB GPU
# memory. It takes around 2 weeks to finish the training
# but 100k iters model should generate reasonable results.
###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################

fs: 22050          # sr
n_fft: 1024        # FFT size (samples).
n_shift: 256       # Hop size (samples). 12.5ms
win_length: null   # Window length (samples). 50ms
                   # If set to null, it will be the same as fft_size.
window: "hann"     # Window function.


##########################################################
#                  TTS MODEL SETTING                     #
##########################################################
model:
    # generator related
    generator_type: vits_generator
    generator_params:
        hidden_channels: 192
        spk_embed_dim: 256
        global_channels: 256
        segment_size: 32
        text_encoder_attention_heads: 2
        text_encoder_ffn_expand: 4
        text_encoder_blocks: 6
        text_encoder_positionwise_layer_type: "conv1d"
        text_encoder_positionwise_conv_kernel_size: 3
        text_encoder_positional_encoding_layer_type: "rel_pos"
        text_encoder_self_attention_layer_type: "rel_selfattn"
        text_encoder_activation_type: "swish"
        text_encoder_normalize_before: True
        text_encoder_dropout_rate: 0.1
        text_encoder_positional_dropout_rate: 0.0
        text_encoder_attention_dropout_rate: 0.1
        use_macaron_style_in_text_encoder: True
        use_conformer_conv_in_text_encoder: False
        text_encoder_conformer_kernel_size: -1
        decoder_kernel_size: 7
        decoder_channels: 512
        decoder_upsample_scales: [8, 8, 2, 2]
        decoder_upsample_kernel_sizes: [16, 16, 4, 4]
        decoder_resblock_kernel_sizes: [3, 7, 11]
        decoder_resblock_dilations: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
        use_weight_norm_in_decoder: True
        posterior_encoder_kernel_size: 5
        posterior_encoder_layers: 16
        posterior_encoder_stacks: 1
        posterior_encoder_base_dilation: 1
        posterior_encoder_dropout_rate: 0.0
        use_weight_norm_in_posterior_encoder: True
        flow_flows: 4
        flow_kernel_size: 5
        flow_base_dilation: 1
        flow_layers: 4
        flow_dropout_rate: 0.0
        use_weight_norm_in_flow: True
        use_only_mean_in_flow: True
        stochastic_duration_predictor_kernel_size: 3
        stochastic_duration_predictor_dropout_rate: 0.5
        stochastic_duration_predictor_flows: 4
        stochastic_duration_predictor_dds_conv_layers: 3
    # discriminator related
    discriminator_type: hifigan_multi_scale_multi_period_discriminator
    discriminator_params:
        scales: 1
        scale_downsample_pooling: "AvgPool1D"
        scale_downsample_pooling_params:
            kernel_size: 4
            stride: 2
            padding: 2
        scale_discriminator_params:
            in_channels: 1
            out_channels: 1
            kernel_sizes: [15, 41, 5, 3]
            channels: 128
            max_downsample_channels: 1024
            max_groups: 16
            bias: True
            downsample_scales: [2, 2, 4, 4, 1]
            nonlinear_activation: "leakyrelu"
            nonlinear_activation_params:
                negative_slope: 0.1
            use_weight_norm: True
            use_spectral_norm: False
        follow_official_norm: False
        periods: [2, 3, 5, 7, 11]
        period_discriminator_params:
            in_channels: 1
            out_channels: 1
            kernel_sizes: [5, 3]
            channels: 32
            downsample_scales: [3, 3, 3, 3, 1]
            max_downsample_channels: 1024
            bias: True
            nonlinear_activation: "leakyrelu"
            nonlinear_activation_params:
                negative_slope: 0.1
            use_weight_norm: True
            use_spectral_norm: False
    # others
    sampling_rate: 22050          # needed in the inference for saving wav
    cache_generator_outputs: True # whether to cache generator outputs in the training

###########################################################
#                       LOSS SETTING                      #
###########################################################
# loss function related
generator_adv_loss_params:
    average_by_discriminators: False # whether to average loss value by #discriminators
    loss_type: mse                   # loss type, "mse" or "hinge"
discriminator_adv_loss_params:
    average_by_discriminators: False # whether to average loss value by #discriminators
    loss_type: mse                   # loss type, "mse" or "hinge"
feat_match_loss_params:
    average_by_discriminators: False # whether to average loss value by #discriminators
    average_by_layers: False         # whether to average loss value by #layers of each discriminator
    include_final_outputs: True      # whether to include final outputs for loss calculation
mel_loss_params:
    fs: 22050          # must be the same as the training data
    fft_size: 1024     # fft points
    hop_size: 256      # hop size
    win_length: null   # window length
    window: hann       # window type
    num_mels: 80       # number of Mel basis
    fmin: 0            # minimum frequency for Mel basis
    fmax: null         # maximum frequency for Mel basis
    log_base: null     # null represents natural log

###########################################################
#                 ADVERSARIAL LOSS SETTING                #
###########################################################
lambda_adv: 1.0        # loss scaling coefficient for adversarial loss
lambda_mel: 45.0       # loss scaling coefficient for Mel loss
lambda_feat_match: 2.0 # loss scaling coefficient for feat match loss
lambda_dur: 1.0        # loss scaling coefficient for duration loss
lambda_kl: 1.0         # loss scaling coefficient for KL divergence loss
# others
sampling_rate: 22050          # needed in the inference for saving wav
cache_generator_outputs: True # whether to cache generator outputs in the training


###########################################################
#                   DATA LOADER SETTING                   #
###########################################################
batch_size: 50     # Batch size.
num_workers: 4     # Number of workers in DataLoader.

##########################################################
#             OPTIMIZER & SCHEDULER SETTING              #
##########################################################
# optimizer setting for generator
generator_optimizer_params:
    beta1: 0.8
    beta2: 0.99
    epsilon: 1.0e-9
    weight_decay: 0.0
generator_scheduler: exponential_decay
generator_scheduler_params:
    learning_rate: 2.0e-4
    gamma: 0.999875

# optimizer setting for discriminator
discriminator_optimizer_params:
    beta1: 0.8
    beta2: 0.99
    epsilon: 1.0e-9
    weight_decay: 0.0
discriminator_scheduler: exponential_decay
discriminator_scheduler_params:
    learning_rate: 2.0e-4
    gamma: 0.999875
generator_first: False # whether to start updating generator first

##########################################################
#               OTHER TRAINING SETTING                   #
##########################################################
num_snapshots: 10         # max number of snapshots to keep while training
train_max_steps: 350000   # Number of training steps. == total_iters / ngpus, total_iters = 1000000
save_interval_steps: 1000 # Interval steps to save checkpoint.
eval_interval_steps: 250  # Interval steps to evaluate the network.
seed: 777                 # random seed number
@ -0,0 +1,79 @@
#!/bin/bash

stage=0
stop_stage=100

config_path=$1
add_blank=$2
ge2e_ckpt_path=$3

# gen speaker embedding
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${MAIN_ROOT}/paddlespeech/vector/exps/ge2e/inference.py \
        --input=~/datasets/data_aishell3/train/wav/ \
        --output=dump/embed \
        --checkpoint_path=${ge2e_ckpt_path}
fi

# copy from tts3/preprocess
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # get durations from MFA's result
    echo "Generate durations.txt from MFA results ..."
    python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
        --inputdir=./aishell3_alignment_tone \
        --output durations.txt \
        --config=${config_path}
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # extract features
    echo "Extract features ..."
    python3 ${BIN_DIR}/preprocess.py \
        --dataset=aishell3 \
        --rootdir=~/datasets/data_aishell3/ \
        --dumpdir=dump \
        --dur-file=durations.txt \
        --config=${config_path} \
        --num-cpu=20 \
        --cut-sil=True \
        --spk_emb_dir=dump/embed
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # get features' stats (mean and std)
    echo "Get features' stats ..."
    python3 ${MAIN_ROOT}/utils/compute_statistics.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --field-name="feats"
fi

if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    # normalize and convert phone/speaker to id, dev and test should use train's stats
    echo "Normalize ..."
    python3 ${BIN_DIR}/normalize.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --dumpdir=dump/train/norm \
        --feats-stats=dump/train/feats_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt \
        --add-blank=${add_blank} \
        --skip-wav-copy

    python3 ${BIN_DIR}/normalize.py \
        --metadata=dump/dev/raw/metadata.jsonl \
        --dumpdir=dump/dev/norm \
        --feats-stats=dump/train/feats_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt \
        --add-blank=${add_blank} \
        --skip-wav-copy

    python3 ${BIN_DIR}/normalize.py \
        --metadata=dump/test/raw/metadata.jsonl \
        --dumpdir=dump/test/norm \
        --feats-stats=dump/train/feats_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt \
        --add-blank=${add_blank} \
        --skip-wav-copy
fi
@ -0,0 +1,19 @@
#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
    python3 ${BIN_DIR}/synthesize.py \
        --config=${config_path} \
        --ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --phones_dict=dump/phone_id_map.txt \
        --test_metadata=dump/test/norm/metadata.jsonl \
        --output_dir=${train_output_path}/test \
        --voice-cloning=True
fi
@ -0,0 +1,18 @@
#!/bin/bash

config_path=$1
train_output_path=$2

# install monotonic_align
cd ${MAIN_ROOT}/paddlespeech/t2s/models/vits/monotonic_align
python3 setup.py build_ext --inplace
cd -

python3 ${BIN_DIR}/train.py \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --config=${config_path} \
    --output-dir=${train_output_path} \
    --ngpu=4 \
    --phones-dict=dump/phone_id_map.txt \
    --voice-cloning=True
@ -0,0 +1,22 @@
#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3
ge2e_params_path=$4
add_blank=$5
ref_audio_dir=$6
src_audio_path=$7

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/voice_cloning.py \
    --config=${config_path} \
    --ckpt=${train_output_path}/checkpoints/${ckpt_name} \
    --ge2e_params_path=${ge2e_params_path} \
    --phones_dict=dump/phone_id_map.txt \
    --text="凯莫瑞安联合体的经济崩溃迫在眉睫。" \
    --audio-path=${src_audio_path} \
    --input-dir=${ref_audio_dir} \
    --output-dir=${train_output_path}/vc_syn \
    --add-blank=${add_blank}
@ -0,0 +1,45 @@
#!/bin/bash

set -e
source path.sh

gpus=0,1,2,3
stage=0
stop_stage=100

conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_153.pdz
add_blank=true
ref_audio_dir=ref_audio
src_audio_path=''

# not include ".pdparams" here
ge2e_ckpt_path=./ge2e_ckpt_0.3/step-3000000

# include ".pdparams" here
ge2e_params_path=${ge2e_ckpt_path}.pdparams

# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional arguments `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # prepare data
    CUDA_VISIBLE_DEVICES=${gpus} ./local/preprocess.sh ${conf_path} ${add_blank} ${ge2e_ckpt_path} || exit -1
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # train model, all `ckpt` under `train_output_path/checkpoints/` dir
    CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} \
        ${ge2e_params_path} ${add_blank} ${ref_audio_dir} ${src_audio_path} || exit -1
fi
@ -0,0 +1,202 @@
|
|||||||
|
# VITS with AISHELL-3
|
||||||
|
This example contains code used to train a [VITS](https://arxiv.org/abs/2106.06103) model with [AISHELL-3](http://www.aishelltech.com/aishell_3).
|
||||||
|
|
||||||
|
AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus that could be used to train multi-speaker Text-to-Speech (TTS) systems.
|
||||||
|
|
||||||
|
We use AISHELL-3 to train a multi-speaker VITS model here.
|
||||||
|
## Dataset
|
||||||
|
### Download and Extract
|
||||||
|
Download AISHELL-3 from its [Official Website](http://www.aishelltech.com/aishell_3) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/data_aishell3`.
|
||||||
|
|
||||||
|
### Get MFA Result and Extract
|
||||||
|
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get phonemes for VITS; the durations from MFA are not needed here.
|
||||||
|
You can download it from [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which uses MFA1.x for now) in our repo.
|
||||||
|
|
||||||
|
## Get Started
|
||||||
|
Assume the path to the dataset is `~/datasets/data_aishell3`.
|
||||||
|
Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`.
|
||||||
|
Run the command below to
|
||||||
|
1. **source path**.
|
||||||
|
2. preprocess the dataset.
|
||||||
|
3. train the model.
|
||||||
|
4. synthesize wavs.
|
||||||
|
- synthesize waveform from `metadata.jsonl`.
|
||||||
|
- synthesize waveform from a text file.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./run.sh
|
||||||
|
```
|
||||||
|
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
|
||||||
|
```bash
|
||||||
|
./run.sh --stage 0 --stop-stage 0
|
||||||
|
```
|
||||||
|
|
||||||
|
### Data Preprocessing
|
||||||
|
```bash
|
||||||
|
./local/preprocess.sh ${conf_path}
|
||||||
|
```
|
||||||
|
When it is done, a `dump` folder is created in the current directory. The structure of the `dump` folder is listed below.
|
||||||
|
|
||||||
|
```text
|
||||||
|
dump
|
||||||
|
├── dev
|
||||||
|
│ ├── norm
|
||||||
|
│ └── raw
|
||||||
|
├── phone_id_map.txt
|
||||||
|
├── speaker_id_map.txt
|
||||||
|
├── test
|
||||||
|
│ ├── norm
|
||||||
|
│ └── raw
|
||||||
|
└── train
|
||||||
|
├── feats_stats.npy
|
||||||
|
├── norm
|
||||||
|
└── raw
|
||||||
|
```
|
||||||
|
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the wave and linear spectrogram of each utterance, while the `norm` folder contains the normalized ones. The statistics used to normalize features are computed from the training set and stored in `dump/train/feats_stats.npy`.
|
||||||
|
|
||||||
|
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the phones, text_lengths, feats, feats_lengths, the path of the linear spectrogram features, the path of the raw waves, the speaker, and the id of each utterance.
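Below is a minimal sketch, not part of the recipe, of how the normalized metadata and the feature statistics could be inspected; the paths follow the tree above, and the `jsonlines` and `numpy` packages are assumed to be installed.

```python
# Minimal sketch (assumption: the `dump` folder was produced by
# ./local/preprocess.sh as described above).
import jsonlines
import numpy as np

# statistics (computed on the train set) used to normalize the features
feats_stats = np.load("dump/train/feats_stats.npy")
print("feats_stats shape:", feats_stats.shape)

# each line of metadata.jsonl is a JSON record describing one utterance
with jsonlines.open("dump/train/norm/metadata.jsonl") as reader:
    first = next(iter(reader))
    print(sorted(first.keys()))  # e.g. feats, feats_lengths, phones, speaker, ...
```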
|
||||||
|
|
||||||
|
### Model Training
|
||||||
|
```bash
|
||||||
|
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
|
||||||
|
```
|
||||||
|
`./local/train.sh` calls `${BIN_DIR}/train.py`.
|
||||||
|
Here's the complete help message.
|
||||||
|
```text
|
||||||
|
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
|
||||||
|
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
|
||||||
|
[--ngpu NGPU] [--phones-dict PHONES_DICT]
|
||||||
|
[--speaker-dict SPEAKER_DICT] [--voice-cloning VOICE_CLONING]
|
||||||
|
|
||||||
|
Train a VITS model.
|
||||||
|
|
||||||
|
optional arguments:
|
||||||
|
-h, --help show this help message and exit
|
||||||
|
--config CONFIG config file to overwrite default config.
|
||||||
|
--train-metadata TRAIN_METADATA
|
||||||
|
training data.
|
||||||
|
--dev-metadata DEV_METADATA
|
||||||
|
dev data.
|
||||||
|
--output-dir OUTPUT_DIR
|
||||||
|
output dir.
|
||||||
|
--ngpu NGPU if ngpu == 0, use cpu.
|
||||||
|
--phones-dict PHONES_DICT
|
||||||
|
phone vocabulary file.
|
||||||
|
--speaker-dict SPEAKER_DICT
|
||||||
|
speaker id map file for multiple speaker model.
|
||||||
|
--voice-cloning VOICE_CLONING
|
||||||
|
whether training voice cloning model.
|
||||||
|
```
|
||||||
|
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
|
||||||
|
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
|
||||||
|
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
|
||||||
|
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
|
||||||
|
5. `--phones-dict` is the path of the phone vocabulary file.
|
||||||
|
6. `--speaker-dict` is the path of the speaker id map file when training a multi-speaker VITS.
|
||||||
|
|
||||||
|
### Synthesizing
|
||||||
|
|
||||||
|
`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
|
||||||
|
```
|
||||||
|
```text
|
||||||
|
usage: synthesize.py [-h] [--config CONFIG] [--ckpt CKPT]
|
||||||
|
[--phones_dict PHONES_DICT] [--speaker_dict SPEAKER_DICT]
|
||||||
|
[--voice-cloning VOICE_CLONING] [--ngpu NGPU]
|
||||||
|
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
|
||||||
|
|
||||||
|
Synthesize with VITS
|
||||||
|
|
||||||
|
optional arguments:
|
||||||
|
-h, --help show this help message and exit
|
||||||
|
--config CONFIG Config of VITS.
|
||||||
|
--ckpt CKPT Checkpoint file of VITS.
|
||||||
|
--phones_dict PHONES_DICT
|
||||||
|
phone vocabulary file.
|
||||||
|
--speaker_dict SPEAKER_DICT
|
||||||
|
speaker id map file.
|
||||||
|
--voice-cloning VOICE_CLONING
|
||||||
|
whether training voice cloning model.
|
||||||
|
--ngpu NGPU if ngpu == 0, use cpu.
|
||||||
|
--test_metadata TEST_METADATA
|
||||||
|
test metadata.
|
||||||
|
--output_dir OUTPUT_DIR
|
||||||
|
output dir.
|
||||||
|
```
|
||||||
|
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/synthesize_e2e.py`, which can synthesize waveform from a text file.
|
||||||
|
```bash
|
||||||
|
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
|
||||||
|
```
|
||||||
|
```text
|
||||||
|
usage: synthesize_e2e.py [-h] [--config CONFIG] [--ckpt CKPT]
|
||||||
|
[--phones_dict PHONES_DICT]
|
||||||
|
[--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
|
||||||
|
[--lang LANG]
|
||||||
|
[--inference_dir INFERENCE_DIR] [--ngpu NGPU]
|
||||||
|
[--text TEXT] [--output_dir OUTPUT_DIR]
|
||||||
|
|
||||||
|
Synthesize with VITS
|
||||||
|
|
||||||
|
optional arguments:
|
||||||
|
-h, --help show this help message and exit
|
||||||
|
--config CONFIG Config of VITS.
|
||||||
|
--ckpt CKPT Checkpoint file of VITS.
|
||||||
|
--phones_dict PHONES_DICT
|
||||||
|
phone vocabulary file.
|
||||||
|
--speaker_dict SPEAKER_DICT
|
||||||
|
speaker id map file.
|
||||||
|
--spk_id SPK_ID spk id for multi speaker acoustic model
|
||||||
|
--lang LANG Choose model language. zh or en
|
||||||
|
--inference_dir INFERENCE_DIR
|
||||||
|
dir to save inference models
|
||||||
|
--ngpu NGPU if ngpu == 0, use cpu.
|
||||||
|
--text TEXT text to synthesize, a 'utt_id sentence' pair per line.
|
||||||
|
--output_dir OUTPUT_DIR
|
||||||
|
output dir.
|
||||||
|
```
|
||||||
|
1. `--config`, `--ckpt`, `--phones_dict` and `--speaker_dict` are arguments for the acoustic model, which correspond to the 4 files in the VITS pretrained model.
|
||||||
|
2. `--lang` is the model language, which can be `zh` or `en`.
|
||||||
|
3. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
|
||||||
|
4. `--text` is the text file, which contains the sentences to synthesize, one `utt_id sentence` pair per line (see the example after this list).
|
||||||
|
5. `--output_dir` is the directory to save synthesized audio files.
|
||||||
|
6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
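For reference, the file passed to `--text` is a plain text file with one `utt_id sentence` pair per line. The two lines below are only illustrative and are not the actual contents of `${BIN_DIR}/../sentences.txt`:

```text
001 凯莫瑞安联合体的经济崩溃迫在眉睫。
002 大家好，欢迎使用语音合成。
```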
|
||||||
|
|
||||||
|
<!-- TODO display these after we trained the model -->
|
||||||
|
<!--
|
||||||
|
## Pretrained Model
|
||||||
|
|
||||||
|
The pretrained model can be downloaded here:
|
||||||
|
|
||||||
|
- [vits_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/vits/vits_aishell3_ckpt_1.1.0.zip) (add_blank=true)
|
||||||
|
|
||||||
|
VITS checkpoint contains files listed below.
|
||||||
|
```text
|
||||||
|
vits_aishell3_ckpt_1.1.0
|
||||||
|
├── default.yaml             # default config used to train vits
|
||||||
|
├── phone_id_map.txt # phone vocabulary file when training vits
|
||||||
|
├── speaker_id_map.txt # speaker id map file when training a multi-speaker vits
|
||||||
|
└── snapshot_iter_333000.pdz # model parameters and optimizer states
|
||||||
|
```
|
||||||
|
|
||||||
|
P.S.: This ckpt is not good enough yet; a better one is still being trained.
|
||||||
|
|
||||||
|
You can use the following script to synthesize `${BIN_DIR}/../sentences.txt` using the pretrained VITS model.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
source path.sh
|
||||||
|
add_blank=true
|
||||||
|
|
||||||
|
FLAGS_allocator_strategy=naive_best_fit \
|
||||||
|
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
|
||||||
|
python3 ${BIN_DIR}/synthesize_e2e.py \
|
||||||
|
--config=vits_aishell3_ckpt_1.1.0/default.yaml \
|
||||||
|
--ckpt=vits_aishell3_ckpt_1.1.0/snapshot_iter_333000.pdz \
|
||||||
|
--phones_dict=vits_aishell3_ckpt_1.1.0/phone_id_map.txt \
|
||||||
|
--speaker_dict=vits_aishell3_ckpt_1.1.0/speaker_id_map.txt \
|
||||||
|
--output_dir=exp/default/test_e2e \
|
||||||
|
--text=${BIN_DIR}/../sentences.txt \
|
||||||
|
--add-blank=${add_blank}
|
||||||
|
```
|
||||||
|
-->
|
@ -0,0 +1,184 @@
|
|||||||
|
# This configuration was tested on 4 GPUs (V100) with 32GB GPU
|
||||||
|
# memory. It takes around 2 weeks to finish the training
|
||||||
|
# but a 100k-iteration model should already generate reasonable results.
|
||||||
|
###########################################################
|
||||||
|
# FEATURE EXTRACTION SETTING #
|
||||||
|
###########################################################
|
||||||
|
|
||||||
|
fs: 22050 # sr
|
||||||
|
n_fft: 1024 # FFT size (samples).
|
||||||
|
n_shift: 256 # Hop size (samples). 12.5ms
|
||||||
|
win_length: null # Window length (samples). 50ms
|
||||||
|
# If set to null, it will be the same as fft_size.
|
||||||
|
window: "hann" # Window function.
|
||||||
|
|
||||||
|
|
||||||
|
##########################################################
|
||||||
|
# TTS MODEL SETTING #
|
||||||
|
##########################################################
|
||||||
|
model:
|
||||||
|
# generator related
|
||||||
|
generator_type: vits_generator
|
||||||
|
generator_params:
|
||||||
|
hidden_channels: 192
|
||||||
|
global_channels: 256
|
||||||
|
segment_size: 32
|
||||||
|
text_encoder_attention_heads: 2
|
||||||
|
text_encoder_ffn_expand: 4
|
||||||
|
text_encoder_blocks: 6
|
||||||
|
text_encoder_positionwise_layer_type: "conv1d"
|
||||||
|
text_encoder_positionwise_conv_kernel_size: 3
|
||||||
|
text_encoder_positional_encoding_layer_type: "rel_pos"
|
||||||
|
text_encoder_self_attention_layer_type: "rel_selfattn"
|
||||||
|
text_encoder_activation_type: "swish"
|
||||||
|
text_encoder_normalize_before: True
|
||||||
|
text_encoder_dropout_rate: 0.1
|
||||||
|
text_encoder_positional_dropout_rate: 0.0
|
||||||
|
text_encoder_attention_dropout_rate: 0.1
|
||||||
|
use_macaron_style_in_text_encoder: True
|
||||||
|
use_conformer_conv_in_text_encoder: False
|
||||||
|
text_encoder_conformer_kernel_size: -1
|
||||||
|
decoder_kernel_size: 7
|
||||||
|
decoder_channels: 512
|
||||||
|
decoder_upsample_scales: [8, 8, 2, 2]
|
||||||
|
decoder_upsample_kernel_sizes: [16, 16, 4, 4]
|
||||||
|
decoder_resblock_kernel_sizes: [3, 7, 11]
|
||||||
|
decoder_resblock_dilations: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
|
||||||
|
use_weight_norm_in_decoder: True
|
||||||
|
posterior_encoder_kernel_size: 5
|
||||||
|
posterior_encoder_layers: 16
|
||||||
|
posterior_encoder_stacks: 1
|
||||||
|
posterior_encoder_base_dilation: 1
|
||||||
|
posterior_encoder_dropout_rate: 0.0
|
||||||
|
use_weight_norm_in_posterior_encoder: True
|
||||||
|
flow_flows: 4
|
||||||
|
flow_kernel_size: 5
|
||||||
|
flow_base_dilation: 1
|
||||||
|
flow_layers: 4
|
||||||
|
flow_dropout_rate: 0.0
|
||||||
|
use_weight_norm_in_flow: True
|
||||||
|
use_only_mean_in_flow: True
|
||||||
|
stochastic_duration_predictor_kernel_size: 3
|
||||||
|
stochastic_duration_predictor_dropout_rate: 0.5
|
||||||
|
stochastic_duration_predictor_flows: 4
|
||||||
|
stochastic_duration_predictor_dds_conv_layers: 3
|
||||||
|
# discriminator related
|
||||||
|
discriminator_type: hifigan_multi_scale_multi_period_discriminator
|
||||||
|
discriminator_params:
|
||||||
|
scales: 1
|
||||||
|
scale_downsample_pooling: "AvgPool1D"
|
||||||
|
scale_downsample_pooling_params:
|
||||||
|
kernel_size: 4
|
||||||
|
stride: 2
|
||||||
|
padding: 2
|
||||||
|
scale_discriminator_params:
|
||||||
|
in_channels: 1
|
||||||
|
out_channels: 1
|
||||||
|
kernel_sizes: [15, 41, 5, 3]
|
||||||
|
channels: 128
|
||||||
|
max_downsample_channels: 1024
|
||||||
|
max_groups: 16
|
||||||
|
bias: True
|
||||||
|
downsample_scales: [2, 2, 4, 4, 1]
|
||||||
|
nonlinear_activation: "leakyrelu"
|
||||||
|
nonlinear_activation_params:
|
||||||
|
negative_slope: 0.1
|
||||||
|
use_weight_norm: True
|
||||||
|
use_spectral_norm: False
|
||||||
|
follow_official_norm: False
|
||||||
|
periods: [2, 3, 5, 7, 11]
|
||||||
|
period_discriminator_params:
|
||||||
|
in_channels: 1
|
||||||
|
out_channels: 1
|
||||||
|
kernel_sizes: [5, 3]
|
||||||
|
channels: 32
|
||||||
|
downsample_scales: [3, 3, 3, 3, 1]
|
||||||
|
max_downsample_channels: 1024
|
||||||
|
bias: True
|
||||||
|
nonlinear_activation: "leakyrelu"
|
||||||
|
nonlinear_activation_params:
|
||||||
|
negative_slope: 0.1
|
||||||
|
use_weight_norm: True
|
||||||
|
use_spectral_norm: False
|
||||||
|
# others
|
||||||
|
sampling_rate: 22050 # needed in the inference for saving wav
|
||||||
|
cache_generator_outputs: True # whether to cache generator outputs in the training
|
||||||
|
|
||||||
|
###########################################################
|
||||||
|
# LOSS SETTING #
|
||||||
|
###########################################################
|
||||||
|
# loss function related
|
||||||
|
generator_adv_loss_params:
|
||||||
|
average_by_discriminators: False # whether to average loss value by #discriminators
|
||||||
|
loss_type: mse # loss type, "mse" or "hinge"
|
||||||
|
discriminator_adv_loss_params:
|
||||||
|
average_by_discriminators: False # whether to average loss value by #discriminators
|
||||||
|
loss_type: mse # loss type, "mse" or "hinge"
|
||||||
|
feat_match_loss_params:
|
||||||
|
average_by_discriminators: False # whether to average loss value by #discriminators
|
||||||
|
average_by_layers: False # whether to average loss value by #layers of each discriminator
|
||||||
|
include_final_outputs: True # whether to include final outputs for loss calculation
|
||||||
|
mel_loss_params:
|
||||||
|
fs: 22050 # must be the same as the training data
|
||||||
|
fft_size: 1024 # fft points
|
||||||
|
hop_size: 256 # hop size
|
||||||
|
win_length: null # window length
|
||||||
|
window: hann # window type
|
||||||
|
num_mels: 80 # number of Mel basis
|
||||||
|
fmin: 0 # minimum frequency for Mel basis
|
||||||
|
fmax: null # maximum frequency for Mel basis
|
||||||
|
log_base: null # null represent natural log
|
||||||
|
|
||||||
|
###########################################################
|
||||||
|
# ADVERSARIAL LOSS SETTING #
|
||||||
|
###########################################################
|
||||||
|
lambda_adv: 1.0 # loss scaling coefficient for adversarial loss
|
||||||
|
lambda_mel: 45.0 # loss scaling coefficient for Mel loss
|
||||||
|
lambda_feat_match: 2.0 # loss scaling coefficient for feat match loss
|
||||||
|
lambda_dur: 1.0 # loss scaling coefficient for duration loss
|
||||||
|
lambda_kl: 1.0 # loss scaling coefficient for KL divergence loss
|
||||||
|
# others
|
||||||
|
sampling_rate: 22050 # needed in the inference for saving wav
|
||||||
|
cache_generator_outputs: True # whether to cache generator outputs in the training
|
||||||
|
|
||||||
|
|
||||||
|
###########################################################
|
||||||
|
# DATA LOADER SETTING #
|
||||||
|
###########################################################
|
||||||
|
batch_size: 50 # Batch size.
|
||||||
|
num_workers: 4 # Number of workers in DataLoader.
|
||||||
|
|
||||||
|
##########################################################
|
||||||
|
# OPTIMIZER & SCHEDULER SETTING #
|
||||||
|
##########################################################
|
||||||
|
# optimizer setting for generator
|
||||||
|
generator_optimizer_params:
|
||||||
|
beta1: 0.8
|
||||||
|
beta2: 0.99
|
||||||
|
epsilon: 1.0e-9
|
||||||
|
weight_decay: 0.0
|
||||||
|
generator_scheduler: exponential_decay
|
||||||
|
generator_scheduler_params:
|
||||||
|
learning_rate: 2.0e-4
|
||||||
|
gamma: 0.999875
|
||||||
|
|
||||||
|
# optimizer setting for discriminator
|
||||||
|
discriminator_optimizer_params:
|
||||||
|
beta1: 0.8
|
||||||
|
beta2: 0.99
|
||||||
|
epsilon: 1.0e-9
|
||||||
|
weight_decay: 0.0
|
||||||
|
discriminator_scheduler: exponential_decay
|
||||||
|
discriminator_scheduler_params:
|
||||||
|
learning_rate: 2.0e-4
|
||||||
|
gamma: 0.999875
|
||||||
|
generator_first: False # whether to start updating generator first
|
||||||
|
|
||||||
|
##########################################################
|
||||||
|
# OTHER TRAINING SETTING #
|
||||||
|
##########################################################
|
||||||
|
num_snapshots: 10 # max number of snapshots to keep while training
|
||||||
|
train_max_steps: 350000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000
|
||||||
|
save_interval_steps: 1000 # Interval steps to save checkpoint.
|
||||||
|
eval_interval_steps: 250 # Interval steps to evaluate the network.
|
||||||
|
seed: 777 # random seed number
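As a side note, this file is the `--config` passed to `train.py` and `synthesize.py`; below is a minimal sketch, assuming `yaml` and `yacs` are available (they are used elsewhere in this repo), of how it can be loaded into a config object.

```python
# Minimal sketch (assumption: run from the example directory so that
# conf/default.yaml resolves).
import yaml
from yacs.config import CfgNode

with open("conf/default.yaml") as f:
    config = CfgNode(yaml.safe_load(f))

print(config.fs, config.batch_size)      # 22050, 50
print(config.model.generator_type)       # vits_generator
```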
|
@ -0,0 +1,69 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
stage=0
|
||||||
|
stop_stage=100
|
||||||
|
|
||||||
|
config_path=$1
|
||||||
|
add_blank=$2
|
||||||
|
|
||||||
|
# copy from tts3/preprocess
|
||||||
|
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
|
||||||
|
# get durations from MFA's result
|
||||||
|
echo "Generate durations.txt from MFA results ..."
|
||||||
|
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
|
||||||
|
--inputdir=./aishell3_alignment_tone \
|
||||||
|
--output durations.txt \
|
||||||
|
--config=${config_path}
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
|
||||||
|
# extract features
|
||||||
|
echo "Extract features ..."
|
||||||
|
python3 ${BIN_DIR}/preprocess.py \
|
||||||
|
--dataset=aishell3 \
|
||||||
|
--rootdir=~/datasets/data_aishell3/ \
|
||||||
|
--dumpdir=dump \
|
||||||
|
--dur-file=durations.txt \
|
||||||
|
--config=${config_path} \
|
||||||
|
--num-cpu=20 \
|
||||||
|
--cut-sil=True
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
|
||||||
|
# get features' stats (mean and std)
|
||||||
|
echo "Get features' stats ..."
|
||||||
|
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
|
||||||
|
--metadata=dump/train/raw/metadata.jsonl \
|
||||||
|
--field-name="feats"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
|
||||||
|
# normalize and convert phone/speaker to id; dev and test should use train's stats
|
||||||
|
echo "Normalize ..."
|
||||||
|
python3 ${BIN_DIR}/normalize.py \
|
||||||
|
--metadata=dump/train/raw/metadata.jsonl \
|
||||||
|
--dumpdir=dump/train/norm \
|
||||||
|
--feats-stats=dump/train/feats_stats.npy \
|
||||||
|
--phones-dict=dump/phone_id_map.txt \
|
||||||
|
--speaker-dict=dump/speaker_id_map.txt \
|
||||||
|
--add-blank=${add_blank} \
|
||||||
|
--skip-wav-copy
|
||||||
|
|
||||||
|
python3 ${BIN_DIR}/normalize.py \
|
||||||
|
--metadata=dump/dev/raw/metadata.jsonl \
|
||||||
|
--dumpdir=dump/dev/norm \
|
||||||
|
--feats-stats=dump/train/feats_stats.npy \
|
||||||
|
--phones-dict=dump/phone_id_map.txt \
|
||||||
|
--speaker-dict=dump/speaker_id_map.txt \
|
||||||
|
--add-blank=${add_blank} \
|
||||||
|
--skip-wav-copy
|
||||||
|
|
||||||
|
python3 ${BIN_DIR}/normalize.py \
|
||||||
|
--metadata=dump/test/raw/metadata.jsonl \
|
||||||
|
--dumpdir=dump/test/norm \
|
||||||
|
--feats-stats=dump/train/feats_stats.npy \
|
||||||
|
--phones-dict=dump/phone_id_map.txt \
|
||||||
|
--speaker-dict=dump/speaker_id_map.txt \
|
||||||
|
--add-blank=${add_blank} \
|
||||||
|
--skip-wav-copy
|
||||||
|
fi
|
@ -0,0 +1,19 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
config_path=$1
|
||||||
|
train_output_path=$2
|
||||||
|
ckpt_name=$3
|
||||||
|
stage=0
|
||||||
|
stop_stage=0
|
||||||
|
|
||||||
|
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
|
||||||
|
FLAGS_allocator_strategy=naive_best_fit \
|
||||||
|
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
|
||||||
|
python3 ${BIN_DIR}/synthesize.py \
|
||||||
|
--config=${config_path} \
|
||||||
|
--ckpt=${train_output_path}/checkpoints/${ckpt_name} \
|
||||||
|
--phones_dict=dump/phone_id_map.txt \
|
||||||
|
--speaker_dict=dump/speaker_id_map.txt \
|
||||||
|
--test_metadata=dump/test/norm/metadata.jsonl \
|
||||||
|
--output_dir=${train_output_path}/test
|
||||||
|
fi
|
@ -0,0 +1,24 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
config_path=$1
|
||||||
|
train_output_path=$2
|
||||||
|
ckpt_name=$3
|
||||||
|
add_blank=$4
|
||||||
|
|
||||||
|
stage=0
|
||||||
|
stop_stage=0
|
||||||
|
|
||||||
|
|
||||||
|
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
|
||||||
|
FLAGS_allocator_strategy=naive_best_fit \
|
||||||
|
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
|
||||||
|
python3 ${BIN_DIR}/synthesize_e2e.py \
|
||||||
|
--config=${config_path} \
|
||||||
|
--ckpt=${train_output_path}/checkpoints/${ckpt_name} \
|
||||||
|
--phones_dict=dump/phone_id_map.txt \
|
||||||
|
--speaker_dict=dump/speaker_id_map.txt \
|
||||||
|
--spk_id=0 \
|
||||||
|
--output_dir=${train_output_path}/test_e2e \
|
||||||
|
--text=${BIN_DIR}/../sentences.txt \
|
||||||
|
--add-blank=${add_blank}
|
||||||
|
fi
|
@ -0,0 +1,18 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
config_path=$1
|
||||||
|
train_output_path=$2
|
||||||
|
|
||||||
|
# install monotonic_align
|
||||||
|
cd ${MAIN_ROOT}/paddlespeech/t2s/models/vits/monotonic_align
|
||||||
|
python3 setup.py build_ext --inplace
|
||||||
|
cd -
|
||||||
|
|
||||||
|
python3 ${BIN_DIR}/train.py \
|
||||||
|
--train-metadata=dump/train/norm/metadata.jsonl \
|
||||||
|
--dev-metadata=dump/dev/norm/metadata.jsonl \
|
||||||
|
--config=${config_path} \
|
||||||
|
--output-dir=${train_output_path} \
|
||||||
|
--ngpu=4 \
|
||||||
|
--phones-dict=dump/phone_id_map.txt \
|
||||||
|
--speaker-dict=dump/speaker_id_map.txt
|
@ -0,0 +1,13 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
export MAIN_ROOT=`realpath ${PWD}/../../../`
|
||||||
|
|
||||||
|
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
|
||||||
|
export LC_ALL=C
|
||||||
|
|
||||||
|
export PYTHONDONTWRITEBYTECODE=1
|
||||||
|
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
|
||||||
|
export PYTHONIOENCODING=UTF-8
|
||||||
|
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
|
||||||
|
|
||||||
|
MODEL=vits
|
||||||
|
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
|
@ -0,0 +1,36 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
set -e
|
||||||
|
source path.sh
|
||||||
|
|
||||||
|
gpus=0,1,2,3
|
||||||
|
stage=0
|
||||||
|
stop_stage=100
|
||||||
|
|
||||||
|
conf_path=conf/default.yaml
|
||||||
|
train_output_path=exp/default
|
||||||
|
ckpt_name=snapshot_iter_153.pdz
|
||||||
|
add_blank=true
|
||||||
|
|
||||||
|
# with the following command, you can choose the stage range you want to run
|
||||||
|
# such as `./run.sh --stage 0 --stop-stage 0`
|
||||||
|
# this cannot be mixed with `$1`, `$2` ...
|
||||||
|
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
|
||||||
|
|
||||||
|
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
|
||||||
|
# prepare data
|
||||||
|
./local/preprocess.sh ${conf_path} ${add_blank} || exit -1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
|
||||||
|
# train model; all `ckpt` files are saved under the `train_output_path/checkpoints/` dir
|
||||||
|
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
|
||||||
|
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
|
||||||
|
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} ${add_blank} || exit -1
|
||||||
|
fi
|
@ -1,609 +0,0 @@
|
|||||||
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
|
|
||||||
#
|
|
||||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
|
||||||
# you may not use this file except in compliance with the License.
|
|
||||||
# You may obtain a copy of the License at
|
|
||||||
#
|
|
||||||
# http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
#
|
|
||||||
# Unless required by applicable law or agreed to in writing, software
|
|
||||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
|
||||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
||||||
# See the License for the specific language governing permissions and
|
|
||||||
# limitations under the License.
|
|
||||||
import os
|
|
||||||
import random
|
|
||||||
from typing import Dict
|
|
||||||
from typing import List
|
|
||||||
|
|
||||||
import librosa
|
|
||||||
import numpy as np
|
|
||||||
import paddle
|
|
||||||
import soundfile as sf
|
|
||||||
from align import alignment
|
|
||||||
from align import alignment_zh
|
|
||||||
from align import words2phns
|
|
||||||
from align import words2phns_zh
|
|
||||||
from paddle import nn
|
|
||||||
from sedit_arg_parser import parse_args
|
|
||||||
from utils import eval_durs
|
|
||||||
from utils import get_voc_out
|
|
||||||
from utils import is_chinese
|
|
||||||
from utils import load_num_sequence_text
|
|
||||||
from utils import read_2col_text
|
|
||||||
|
|
||||||
from paddlespeech.t2s.datasets.am_batch_fn import build_mlm_collate_fn
|
|
||||||
from paddlespeech.t2s.models.ernie_sat.mlm import build_model_from_file
|
|
||||||
|
|
||||||
random.seed(0)
|
|
||||||
np.random.seed(0)
|
|
||||||
|
|
||||||
|
|
||||||
def get_wav(wav_path: str,
|
|
||||||
source_lang: str='english',
|
|
||||||
target_lang: str='english',
|
|
||||||
model_name: str="paddle_checkpoint_en",
|
|
||||||
old_str: str="",
|
|
||||||
new_str: str="",
|
|
||||||
non_autoreg: bool=True):
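# Overall flow: run masked-LM inference to predict features for the edited
# span, vocode only that masked span, then splice the re-synthesized segment
# back into the original waveform.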
|
|
||||||
wav_org, output_feat, old_span_bdy, new_span_bdy, fs, hop_length = get_mlm_output(
|
|
||||||
source_lang=source_lang,
|
|
||||||
target_lang=target_lang,
|
|
||||||
model_name=model_name,
|
|
||||||
wav_path=wav_path,
|
|
||||||
old_str=old_str,
|
|
||||||
new_str=new_str,
|
|
||||||
use_teacher_forcing=non_autoreg)
|
|
||||||
|
|
||||||
masked_feat = output_feat[new_span_bdy[0]:new_span_bdy[1]]
|
|
||||||
|
|
||||||
alt_wav = get_voc_out(masked_feat)
|
|
||||||
|
|
||||||
old_time_bdy = [hop_length * x for x in old_span_bdy]
|
|
||||||
|
|
||||||
wav_replaced = np.concatenate(
|
|
||||||
[wav_org[:old_time_bdy[0]], alt_wav, wav_org[old_time_bdy[1]:]])
|
|
||||||
|
|
||||||
data_dict = {"origin": wav_org, "output": wav_replaced}
|
|
||||||
|
|
||||||
return data_dict
|
|
||||||
|
|
||||||
|
|
||||||
def load_model(model_name: str="paddle_checkpoint_en"):
|
|
||||||
config_path = './pretrained_model/{}/config.yaml'.format(model_name)
|
|
||||||
model_path = './pretrained_model/{}/model.pdparams'.format(model_name)
|
|
||||||
mlm_model, conf = build_model_from_file(
|
|
||||||
config_file=config_path, model_file=model_path)
|
|
||||||
return mlm_model, conf
|
|
||||||
|
|
||||||
|
|
||||||
def read_data(uid: str, prefix: os.PathLike):
|
|
||||||
# get the text corresponding to uid
|
|
||||||
mfa_text = read_2col_text(prefix + '/text')[uid]
|
|
||||||
# get the audio path corresponding to uid
|
|
||||||
mfa_wav_path = read_2col_text(prefix + '/wav.scp')[uid]
|
|
||||||
if not os.path.isabs(mfa_wav_path):
|
|
||||||
mfa_wav_path = prefix + mfa_wav_path
|
|
||||||
return mfa_text, mfa_wav_path
|
|
||||||
|
|
||||||
|
|
||||||
def get_align_data(uid: str, prefix: os.PathLike):
|
|
||||||
mfa_path = prefix + "mfa_"
|
|
||||||
mfa_text = read_2col_text(mfa_path + 'text')[uid]
|
|
||||||
mfa_start = load_num_sequence_text(
|
|
||||||
mfa_path + 'start', loader_type='text_float')[uid]
|
|
||||||
mfa_end = load_num_sequence_text(
|
|
||||||
mfa_path + 'end', loader_type='text_float')[uid]
|
|
||||||
mfa_wav_path = read_2col_text(mfa_path + 'wav.scp')[uid]
|
|
||||||
return mfa_text, mfa_start, mfa_end, mfa_wav_path
|
|
||||||
|
|
||||||
|
|
||||||
# get the range of mel frames to be masked
|
|
||||||
def get_masked_mel_bdy(mfa_start: List[float],
|
|
||||||
mfa_end: List[float],
|
|
||||||
fs: int,
|
|
||||||
hop_length: int,
|
|
||||||
span_to_repl: List[List[int]]):
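# MFA gives alignments in seconds; convert them to frame indices with
# floor(fs * t / hop_length) and return the frame range (span_bdy) covering
# the phones in span_to_repl.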
|
|
||||||
align_start = np.array(mfa_start)
|
|
||||||
align_end = np.array(mfa_end)
|
|
||||||
align_start = np.floor(fs * align_start / hop_length).astype('int')
|
|
||||||
align_end = np.floor(fs * align_end / hop_length).astype('int')
|
|
||||||
if span_to_repl[0] >= len(mfa_start):
|
|
||||||
span_bdy = [align_end[-1], align_end[-1]]
|
|
||||||
else:
|
|
||||||
span_bdy = [
|
|
||||||
align_start[span_to_repl[0]], align_end[span_to_repl[1] - 1]
|
|
||||||
]
|
|
||||||
return span_bdy, align_start, align_end
|
|
||||||
|
|
||||||
|
|
||||||
def recover_dict(word2phns: Dict[str, str], tp_word2phns: Dict[str, str]):
|
|
||||||
dic = {}
|
|
||||||
keys_to_del = []
|
|
||||||
exist_idx = []
|
|
||||||
sp_count = 0
|
|
||||||
add_sp_count = 0
|
|
||||||
for key in word2phns.keys():
|
|
||||||
idx, wrd = key.split('_')
|
|
||||||
if wrd == 'sp':
|
|
||||||
sp_count += 1
|
|
||||||
exist_idx.append(int(idx))
|
|
||||||
else:
|
|
||||||
keys_to_del.append(key)
|
|
||||||
|
|
||||||
for key in keys_to_del:
|
|
||||||
del word2phns[key]
|
|
||||||
|
|
||||||
cur_id = 0
|
|
||||||
for key in tp_word2phns.keys():
|
|
||||||
if cur_id in exist_idx:
|
|
||||||
dic[str(cur_id) + "_sp"] = 'sp'
|
|
||||||
cur_id += 1
|
|
||||||
add_sp_count += 1
|
|
||||||
idx, wrd = key.split('_')
|
|
||||||
dic[str(cur_id) + "_" + wrd] = tp_word2phns[key]
|
|
||||||
cur_id += 1
|
|
||||||
|
|
||||||
if add_sp_count + 1 == sp_count:
|
|
||||||
dic[str(cur_id) + "_sp"] = 'sp'
|
|
||||||
add_sp_count += 1
|
|
||||||
|
|
||||||
assert add_sp_count == sp_count, "sp are not added in dic"
|
|
||||||
return dic
|
|
||||||
|
|
||||||
|
|
||||||
def get_max_idx(dic):
|
|
||||||
return sorted([int(key.split('_')[0]) for key in dic.keys()])[-1]
|
|
||||||
|
|
||||||
|
|
||||||
def get_phns_and_spans(wav_path: str,
|
|
||||||
old_str: str="",
|
|
||||||
new_str: str="",
|
|
||||||
source_lang: str="english",
|
|
||||||
target_lang: str="english"):
|
|
||||||
is_append = (old_str == new_str[:len(old_str)])
|
|
||||||
old_phns, mfa_start, mfa_end = [], [], []
|
|
||||||
# source
|
|
||||||
if source_lang == "english":
|
|
||||||
intervals, word2phns = alignment(wav_path, old_str)
|
|
||||||
elif source_lang == "chinese":
|
|
||||||
intervals, word2phns = alignment_zh(wav_path, old_str)
|
|
||||||
_, tp_word2phns = words2phns_zh(old_str)
|
|
||||||
|
|
||||||
for key, value in tp_word2phns.items():
|
|
||||||
idx, wrd = key.split('_')
|
|
||||||
cur_val = " ".join(value)
|
|
||||||
tp_word2phns[key] = cur_val
|
|
||||||
|
|
||||||
word2phns = recover_dict(word2phns, tp_word2phns)
|
|
||||||
else:
|
|
||||||
assert source_lang == "chinese" or source_lang == "english", \
|
|
||||||
"source_lang is wrong..."
|
|
||||||
|
|
||||||
for item in intervals:
|
|
||||||
old_phns.append(item[0])
|
|
||||||
mfa_start.append(float(item[1]))
|
|
||||||
mfa_end.append(float(item[2]))
|
|
||||||
# target
|
|
||||||
if is_append and (source_lang != target_lang):
|
|
||||||
cross_lingual_clone = True
|
|
||||||
else:
|
|
||||||
cross_lingual_clone = False
|
|
||||||
|
|
||||||
if cross_lingual_clone:
|
|
||||||
str_origin = new_str[:len(old_str)]
|
|
||||||
str_append = new_str[len(old_str):]
|
|
||||||
|
|
||||||
if target_lang == "chinese":
|
|
||||||
phns_origin, origin_word2phns = words2phns(str_origin)
|
|
||||||
phns_append, append_word2phns_tmp = words2phns_zh(str_append)
|
|
||||||
|
|
||||||
elif target_lang == "english":
|
|
||||||
# original sentence
|
|
||||||
phns_origin, origin_word2phns = words2phns_zh(str_origin)
|
|
||||||
# sentence to clone
|
|
||||||
phns_append, append_word2phns_tmp = words2phns(str_append)
|
|
||||||
else:
|
|
||||||
assert target_lang == "chinese" or target_lang == "english", \
|
|
||||||
"cloning is not support for this language, please check it."
|
|
||||||
|
|
||||||
new_phns = phns_origin + phns_append
|
|
||||||
|
|
||||||
append_word2phns = {}
|
|
||||||
length = len(origin_word2phns)
|
|
||||||
for key, value in append_word2phns_tmp.items():
|
|
||||||
idx, wrd = key.split('_')
|
|
||||||
append_word2phns[str(int(idx) + length) + '_' + wrd] = value
|
|
||||||
new_word2phns = origin_word2phns.copy()
|
|
||||||
new_word2phns.update(append_word2phns)
|
|
||||||
|
|
||||||
else:
|
|
||||||
if source_lang == target_lang and target_lang == "english":
|
|
||||||
new_phns, new_word2phns = words2phns(new_str)
|
|
||||||
elif source_lang == target_lang and target_lang == "chinese":
|
|
||||||
new_phns, new_word2phns = words2phns_zh(new_str)
|
|
||||||
else:
|
|
||||||
assert source_lang == target_lang, \
|
|
||||||
"source language is not same with target language..."
|
|
||||||
|
|
||||||
span_to_repl = [0, len(old_phns) - 1]
|
|
||||||
span_to_add = [0, len(new_phns) - 1]
|
|
||||||
left_idx = 0
|
|
||||||
new_phns_left = []
|
|
||||||
sp_count = 0
|
|
||||||
# find the left different index
|
|
||||||
for key in word2phns.keys():
|
|
||||||
idx, wrd = key.split('_')
|
|
||||||
if wrd == 'sp':
|
|
||||||
sp_count += 1
|
|
||||||
new_phns_left.append('sp')
|
|
||||||
else:
|
|
||||||
idx = str(int(idx) - sp_count)
|
|
||||||
if idx + '_' + wrd in new_word2phns:
|
|
||||||
left_idx += len(new_word2phns[idx + '_' + wrd])
|
|
||||||
new_phns_left.extend(word2phns[key].split())
|
|
||||||
else:
|
|
||||||
span_to_repl[0] = len(new_phns_left)
|
|
||||||
span_to_add[0] = len(new_phns_left)
|
|
||||||
break
|
|
||||||
|
|
||||||
# reverse word2phns and new_word2phns
|
|
||||||
right_idx = 0
|
|
||||||
new_phns_right = []
|
|
||||||
sp_count = 0
|
|
||||||
word2phns_max_idx = get_max_idx(word2phns)
|
|
||||||
new_word2phns_max_idx = get_max_idx(new_word2phns)
|
|
||||||
new_phns_mid = []
|
|
||||||
if is_append:
|
|
||||||
new_phns_right = []
|
|
||||||
new_phns_mid = new_phns[left_idx:]
|
|
||||||
span_to_repl[0] = len(new_phns_left)
|
|
||||||
span_to_add[0] = len(new_phns_left)
|
|
||||||
span_to_add[1] = len(new_phns_left) + len(new_phns_mid)
|
|
||||||
span_to_repl[1] = len(old_phns) - len(new_phns_right)
|
|
||||||
# speech edit
|
|
||||||
else:
|
|
||||||
for key in list(word2phns.keys())[::-1]:
|
|
||||||
idx, wrd = key.split('_')
|
|
||||||
if wrd == 'sp':
|
|
||||||
sp_count += 1
|
|
||||||
new_phns_right = ['sp'] + new_phns_right
|
|
||||||
else:
|
|
||||||
idx = str(new_word2phns_max_idx - (word2phns_max_idx - int(idx)
|
|
||||||
- sp_count))
|
|
||||||
if idx + '_' + wrd in new_word2phns:
|
|
||||||
right_idx -= len(new_word2phns[idx + '_' + wrd])
|
|
||||||
new_phns_right = word2phns[key].split() + new_phns_right
|
|
||||||
else:
|
|
||||||
span_to_repl[1] = len(old_phns) - len(new_phns_right)
|
|
||||||
new_phns_mid = new_phns[left_idx:right_idx]
|
|
||||||
span_to_add[1] = len(new_phns_left) + len(new_phns_mid)
|
|
||||||
if len(new_phns_mid) == 0:
|
|
||||||
span_to_add[1] = min(span_to_add[1] + 1, len(new_phns))
|
|
||||||
span_to_add[0] = max(0, span_to_add[0] - 1)
|
|
||||||
span_to_repl[0] = max(0, span_to_repl[0] - 1)
|
|
||||||
span_to_repl[1] = min(span_to_repl[1] + 1,
|
|
||||||
len(old_phns))
|
|
||||||
break
|
|
||||||
new_phns = new_phns_left + new_phns_mid + new_phns_right
|
|
||||||
'''
|
|
||||||
For that reason cover should not be given.
|
|
||||||
For that reason cover is impossible to be given.
|
|
||||||
span_to_repl: [17, 23] "should not"
|
|
||||||
span_to_add: [17, 30] "is impossible to"
|
|
||||||
'''
|
|
||||||
return mfa_start, mfa_end, old_phns, new_phns, span_to_repl, span_to_add
|
|
||||||
|
|
||||||
|
|
||||||
# the durations obtained from MFA and the durations predicted by fs2's duration_predictor may differ
|
|
||||||
# compute a scaling factor here, used to scale between the predicted and ground-truth values
|
|
||||||
def get_dur_adj_factor(orig_dur: List[int],
|
|
||||||
pred_dur: List[int],
|
|
||||||
phns: List[str]):
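# The factor is a trimmed mean of orig_dur / pred_dur over non-'sp' phones
# (the two smallest and two largest ratios are dropped); it rescales the
# predicted durations towards the speaker's actual speaking rate.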
|
|
||||||
length = 0
|
|
||||||
factor_list = []
|
|
||||||
for orig, pred, phn in zip(orig_dur, pred_dur, phns):
|
|
||||||
if pred == 0 or phn == 'sp':
|
|
||||||
continue
|
|
||||||
else:
|
|
||||||
factor_list.append(orig / pred)
|
|
||||||
factor_list = np.array(factor_list)
|
|
||||||
factor_list.sort()
|
|
||||||
if len(factor_list) < 5:
|
|
||||||
return 1
|
|
||||||
length = 2
|
|
||||||
avg = np.average(factor_list[length:-length])
|
|
||||||
return avg
|
|
||||||
|
|
||||||
|
|
||||||
def prep_feats_with_dur(wav_path: str,
|
|
||||||
source_lang: str="English",
|
|
||||||
target_lang: str="English",
|
|
||||||
old_str: str="",
|
|
||||||
new_str: str="",
|
|
||||||
mask_reconstruct: bool=False,
|
|
||||||
duration_adjust: bool=True,
|
|
||||||
start_end_sp: bool=False,
|
|
||||||
fs: int=24000,
|
|
||||||
hop_length: int=300):
|
|
||||||
'''
|
|
||||||
Returns:
|
|
||||||
np.ndarray: new wav, with the part to be edited in the original wav replaced by zeros
|
|
||||||
List[str]: new phones
|
|
||||||
List[float]: mfa start of new wav
|
|
||||||
List[float]: mfa end of new wav
|
|
||||||
List[int]: masked mel boundary of original wav
|
|
||||||
List[int]: masked mel boundary of new wav
|
|
||||||
'''
|
|
||||||
wav_org, _ = librosa.load(wav_path, sr=fs)
|
|
||||||
|
|
||||||
mfa_start, mfa_end, old_phns, new_phns, span_to_repl, span_to_add = get_phns_and_spans(
|
|
||||||
wav_path=wav_path,
|
|
||||||
old_str=old_str,
|
|
||||||
new_str=new_str,
|
|
||||||
source_lang=source_lang,
|
|
||||||
target_lang=target_lang)
|
|
||||||
|
|
||||||
if start_end_sp:
|
|
||||||
if new_phns[-1] != 'sp':
|
|
||||||
new_phns = new_phns + ['sp']
|
|
||||||
# Chinese phns are not necessarily all in fastspeech2's dictionary; replace them with sp
|
|
||||||
if target_lang == "english" or target_lang == "chinese":
|
|
||||||
old_durs = eval_durs(old_phns, target_lang=source_lang)
|
|
||||||
else:
|
|
||||||
assert target_lang == "chinese" or target_lang == "english", \
|
|
||||||
"calculate duration_predict is not support for this language..."
|
|
||||||
|
|
||||||
orig_old_durs = [e - s for e, s in zip(mfa_end, mfa_start)]
|
|
||||||
if '[MASK]' in new_str:
|
|
||||||
new_phns = old_phns
|
|
||||||
span_to_add = span_to_repl
|
|
||||||
d_factor_left = get_dur_adj_factor(
|
|
||||||
orig_dur=orig_old_durs[:span_to_repl[0]],
|
|
||||||
pred_dur=old_durs[:span_to_repl[0]],
|
|
||||||
phns=old_phns[:span_to_repl[0]])
|
|
||||||
d_factor_right = get_dur_adj_factor(
|
|
||||||
orig_dur=orig_old_durs[span_to_repl[1]:],
|
|
||||||
pred_dur=old_durs[span_to_repl[1]:],
|
|
||||||
phns=old_phns[span_to_repl[1]:])
|
|
||||||
d_factor = (d_factor_left + d_factor_right) / 2
|
|
||||||
new_durs_adjusted = [d_factor * i for i in old_durs]
|
|
||||||
else:
|
|
||||||
if duration_adjust:
|
|
||||||
d_factor = get_dur_adj_factor(
|
|
||||||
orig_dur=orig_old_durs, pred_dur=old_durs, phns=old_phns)
|
|
||||||
d_factor = d_factor * 1.25
|
|
||||||
else:
|
|
||||||
d_factor = 1
|
|
||||||
|
|
||||||
if target_lang == "english" or target_lang == "chinese":
|
|
||||||
new_durs = eval_durs(new_phns, target_lang=target_lang)
|
|
||||||
else:
|
|
||||||
assert target_lang == "chinese" or target_lang == "english", \
|
|
||||||
"calculate duration_predict is not support for this language..."
|
|
||||||
|
|
||||||
new_durs_adjusted = [d_factor * i for i in new_durs]
|
|
||||||
|
|
||||||
new_span_dur_sum = sum(new_durs_adjusted[span_to_add[0]:span_to_add[1]])
|
|
||||||
old_span_dur_sum = sum(orig_old_durs[span_to_repl[0]:span_to_repl[1]])
|
|
||||||
dur_offset = new_span_dur_sum - old_span_dur_sum
|
|
||||||
new_mfa_start = mfa_start[:span_to_repl[0]]
|
|
||||||
new_mfa_end = mfa_end[:span_to_repl[0]]
|
|
||||||
for i in new_durs_adjusted[span_to_add[0]:span_to_add[1]]:
|
|
||||||
if len(new_mfa_end) == 0:
|
|
||||||
new_mfa_start.append(0)
|
|
||||||
new_mfa_end.append(i)
|
|
||||||
else:
|
|
||||||
new_mfa_start.append(new_mfa_end[-1])
|
|
||||||
new_mfa_end.append(new_mfa_end[-1] + i)
|
|
||||||
new_mfa_start += [i + dur_offset for i in mfa_start[span_to_repl[1]:]]
|
|
||||||
new_mfa_end += [i + dur_offset for i in mfa_end[span_to_repl[1]:]]
|
|
||||||
|
|
||||||
# 3. get new wav
|
|
||||||
# append after the original sentence
|
|
||||||
if span_to_repl[0] >= len(mfa_start):
|
|
||||||
left_idx = len(wav_org)
|
|
||||||
right_idx = left_idx
|
|
||||||
# replace in the middle of the original sentence
|
|
||||||
else:
|
|
||||||
left_idx = int(np.floor(mfa_start[span_to_repl[0]] * fs))
|
|
||||||
right_idx = int(np.ceil(mfa_end[span_to_repl[1] - 1] * fs))
|
|
||||||
blank_wav = np.zeros(
|
|
||||||
(int(np.ceil(new_span_dur_sum * fs)), ), dtype=wav_org.dtype)
|
|
||||||
# in the original audio, the part to be edited is replaced with blank audio whose length is determined by fs2's duration_predictor
|
|
||||||
new_wav = np.concatenate(
|
|
||||||
[wav_org[:left_idx], blank_wav, wav_org[right_idx:]])
|
|
||||||
|
|
||||||
# 4. get old and new mel span to be mask
|
|
||||||
# [92, 92]
|
|
||||||
|
|
||||||
old_span_bdy, mfa_start, mfa_end = get_masked_mel_bdy(
|
|
||||||
mfa_start=mfa_start,
|
|
||||||
mfa_end=mfa_end,
|
|
||||||
fs=fs,
|
|
||||||
hop_length=hop_length,
|
|
||||||
span_to_repl=span_to_repl)
|
|
||||||
# [92, 174]
|
|
||||||
# new_mfa_start, new_mfa_end: time-level start/end times -> frame level
|
|
||||||
new_span_bdy, new_mfa_start, new_mfa_end = get_masked_mel_bdy(
|
|
||||||
mfa_start=new_mfa_start,
|
|
||||||
mfa_end=new_mfa_end,
|
|
||||||
fs=fs,
|
|
||||||
hop_length=hop_length,
|
|
||||||
span_to_repl=span_to_add)
|
|
||||||
|
|
||||||
# old_span_bdy and new_span_bdy are frame-level ranges
|
|
||||||
return new_wav, new_phns, new_mfa_start, new_mfa_end, old_span_bdy, new_span_bdy
|
|
||||||
|
|
||||||
|
|
||||||
def prep_feats(wav_path: str,
|
|
||||||
source_lang: str="english",
|
|
||||||
target_lang: str="english",
|
|
||||||
old_str: str="",
|
|
||||||
new_str: str="",
|
|
||||||
duration_adjust: bool=True,
|
|
||||||
start_end_sp: bool=False,
|
|
||||||
mask_reconstruct: bool=False,
|
|
||||||
fs: int=24000,
|
|
||||||
hop_length: int=300,
|
|
||||||
token_list: List[str]=[]):
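# Build on prep_feats_with_dur: map phones to ids with token_list (unknown
# phones fall back to '<unk>') and pack a single-item batch in the format
# expected by the MLM collate function.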
|
|
||||||
wav, phns, mfa_start, mfa_end, old_span_bdy, new_span_bdy = prep_feats_with_dur(
|
|
||||||
source_lang=source_lang,
|
|
||||||
target_lang=target_lang,
|
|
||||||
old_str=old_str,
|
|
||||||
new_str=new_str,
|
|
||||||
wav_path=wav_path,
|
|
||||||
duration_adjust=duration_adjust,
|
|
||||||
start_end_sp=start_end_sp,
|
|
||||||
mask_reconstruct=mask_reconstruct,
|
|
||||||
fs=fs,
|
|
||||||
hop_length=hop_length)
|
|
||||||
|
|
||||||
token_to_id = {item: i for i, item in enumerate(token_list)}
|
|
||||||
text = np.array(
|
|
||||||
list(map(lambda x: token_to_id.get(x, token_to_id['<unk>']), phns)))
|
|
||||||
span_bdy = np.array(new_span_bdy)
|
|
||||||
|
|
||||||
batch = [('1', {
|
|
||||||
"speech": wav,
|
|
||||||
"align_start": mfa_start,
|
|
||||||
"align_end": mfa_end,
|
|
||||||
"text": text,
|
|
||||||
"span_bdy": span_bdy
|
|
||||||
})]
|
|
||||||
|
|
||||||
return batch, old_span_bdy, new_span_bdy
|
|
||||||
|
|
||||||
|
|
||||||
def decode_with_model(mlm_model: nn.Layer,
|
|
||||||
collate_fn,
|
|
||||||
wav_path: str,
|
|
||||||
source_lang: str="english",
|
|
||||||
target_lang: str="english",
|
|
||||||
old_str: str="",
|
|
||||||
new_str: str="",
|
|
||||||
use_teacher_forcing: bool=False,
|
|
||||||
duration_adjust: bool=True,
|
|
||||||
start_end_sp: bool=False,
|
|
||||||
fs: int=24000,
|
|
||||||
hop_length: int=300,
|
|
||||||
token_list: List[str]=[]):
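# Prepare the batch, run the collate_fn to get model inputs, call
# mlm_model.inference on the masked span, and return the original wav plus
# the concatenated predicted features and the old/new span boundaries.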
|
|
||||||
batch, old_span_bdy, new_span_bdy = prep_feats(
|
|
||||||
source_lang=source_lang,
|
|
||||||
target_lang=target_lang,
|
|
||||||
wav_path=wav_path,
|
|
||||||
old_str=old_str,
|
|
||||||
new_str=new_str,
|
|
||||||
duration_adjust=duration_adjust,
|
|
||||||
start_end_sp=start_end_sp,
|
|
||||||
fs=fs,
|
|
||||||
hop_length=hop_length,
|
|
||||||
token_list=token_list)
|
|
||||||
|
|
||||||
feats = collate_fn(batch)[1]
|
|
||||||
|
|
||||||
if 'text_masked_pos' in feats.keys():
|
|
||||||
feats.pop('text_masked_pos')
|
|
||||||
|
|
||||||
output = mlm_model.inference(
|
|
||||||
text=feats['text'],
|
|
||||||
speech=feats['speech'],
|
|
||||||
masked_pos=feats['masked_pos'],
|
|
||||||
speech_mask=feats['speech_mask'],
|
|
||||||
text_mask=feats['text_mask'],
|
|
||||||
speech_seg_pos=feats['speech_seg_pos'],
|
|
||||||
text_seg_pos=feats['text_seg_pos'],
|
|
||||||
span_bdy=new_span_bdy,
|
|
||||||
use_teacher_forcing=use_teacher_forcing)
|
|
||||||
|
|
||||||
# concatenate the output features
|
|
||||||
output_feat = paddle.concat(x=output, axis=0)
|
|
||||||
wav_org, _ = librosa.load(wav_path, sr=fs)
|
|
||||||
return wav_org, output_feat, old_span_bdy, new_span_bdy, fs, hop_length
|
|
||||||
|
|
||||||
|
|
||||||
def get_mlm_output(wav_path: str,
|
|
||||||
model_name: str="paddle_checkpoint_en",
|
|
||||||
source_lang: str="english",
|
|
||||||
target_lang: str="english",
|
|
||||||
old_str: str="",
|
|
||||||
new_str: str="",
|
|
||||||
use_teacher_forcing: bool=False,
|
|
||||||
duration_adjust: bool=True,
|
|
||||||
start_end_sp: bool=False):
|
|
||||||
mlm_model, train_conf = load_model(model_name)
|
|
||||||
mlm_model.eval()
|
|
||||||
|
|
||||||
collate_fn = build_mlm_collate_fn(
|
|
||||||
sr=train_conf.feats_extract_conf['fs'],
|
|
||||||
n_fft=train_conf.feats_extract_conf['n_fft'],
|
|
||||||
hop_length=train_conf.feats_extract_conf['hop_length'],
|
|
||||||
win_length=train_conf.feats_extract_conf['win_length'],
|
|
||||||
n_mels=train_conf.feats_extract_conf['n_mels'],
|
|
||||||
fmin=train_conf.feats_extract_conf['fmin'],
|
|
||||||
fmax=train_conf.feats_extract_conf['fmax'],
|
|
||||||
mlm_prob=train_conf['mlm_prob'],
|
|
||||||
mean_phn_span=train_conf['mean_phn_span'],
|
|
||||||
seg_emb=train_conf.encoder_conf['input_layer'] == 'sega_mlm')
|
|
||||||
|
|
||||||
return decode_with_model(
|
|
||||||
source_lang=source_lang,
|
|
||||||
target_lang=target_lang,
|
|
||||||
mlm_model=mlm_model,
|
|
||||||
collate_fn=collate_fn,
|
|
||||||
wav_path=wav_path,
|
|
||||||
old_str=old_str,
|
|
||||||
new_str=new_str,
|
|
||||||
use_teacher_forcing=use_teacher_forcing,
|
|
||||||
duration_adjust=duration_adjust,
|
|
||||||
start_end_sp=start_end_sp,
|
|
||||||
fs=train_conf.feats_extract_conf['fs'],
|
|
||||||
hop_length=train_conf.feats_extract_conf['hop_length'],
|
|
||||||
token_list=train_conf.token_list)
|
|
||||||
|
|
||||||
|
|
||||||
def evaluate(uid: str,
|
|
||||||
source_lang: str="english",
|
|
||||||
target_lang: str="english",
|
|
||||||
prefix: os.PathLike="./prompt/dev/",
|
|
||||||
model_name: str="paddle_checkpoint_en",
|
|
||||||
new_str: str="",
|
|
||||||
prompt_decoding: bool=False,
|
|
||||||
task_name: str=None):
|
|
||||||
|
|
||||||
# get origin text and path of origin wav
|
|
||||||
old_str, wav_path = read_data(uid=uid, prefix=prefix)
|
|
||||||
|
|
||||||
if task_name == 'edit':
|
|
||||||
new_str = new_str
|
|
||||||
elif task_name == 'synthesize':
|
|
||||||
new_str = old_str + new_str
|
|
||||||
else:
|
|
||||||
new_str = old_str + ' '.join([ch for ch in new_str if is_chinese(ch)])
|
|
||||||
|
|
||||||
print('new_str is ', new_str)
|
|
||||||
|
|
||||||
results_dict = get_wav(
|
|
||||||
source_lang=source_lang,
|
|
||||||
target_lang=target_lang,
|
|
||||||
model_name=model_name,
|
|
||||||
wav_path=wav_path,
|
|
||||||
old_str=old_str,
|
|
||||||
new_str=new_str)
|
|
||||||
return results_dict
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
# parse config and args
|
|
||||||
args = parse_args()
|
|
||||||
|
|
||||||
data_dict = evaluate(
|
|
||||||
uid=args.uid,
|
|
||||||
source_lang=args.source_lang,
|
|
||||||
target_lang=args.target_lang,
|
|
||||||
prefix=args.prefix,
|
|
||||||
model_name=args.model_name,
|
|
||||||
new_str=args.new_str,
|
|
||||||
task_name=args.task_name)
|
|
||||||
sf.write(args.output_name, data_dict['output'], samplerate=24000)
|
|
||||||
print("finished...")
|
|
@ -1,622 +0,0 @@
|
|||||||
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
|
|
||||||
#
|
|
||||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
|
||||||
# you may not use this file except in compliance with the License.
|
|
||||||
# You may obtain a copy of the License at
|
|
||||||
#
|
|
||||||
# http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
#
|
|
||||||
# Unless required by applicable law or agreed to in writing, software
|
|
||||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
|
||||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
||||||
# See the License for the specific language governing permissions and
|
|
||||||
# limitations under the License.
|
|
||||||
import os
|
|
||||||
import random
|
|
||||||
from typing import Dict
|
|
||||||
from typing import List
|
|
||||||
|
|
||||||
import librosa
|
|
||||||
import numpy as np
|
|
||||||
import paddle
|
|
||||||
import soundfile as sf
|
|
||||||
import yaml
|
|
||||||
from align import alignment
|
|
||||||
from align import alignment_zh
|
|
||||||
from align import words2phns
|
|
||||||
from align import words2phns_zh
|
|
||||||
from paddle import nn
|
|
||||||
from sedit_arg_parser import parse_args
|
|
||||||
from utils import eval_durs
|
|
||||||
from utils import get_voc_out
|
|
||||||
from utils import is_chinese
|
|
||||||
from utils import load_num_sequence_text
|
|
||||||
from utils import read_2col_text
|
|
||||||
from yacs.config import CfgNode
|
|
||||||
|
|
||||||
from paddlespeech.t2s.datasets.am_batch_fn import build_mlm_collate_fn
|
|
||||||
from paddlespeech.t2s.models.ernie_sat.ernie_sat import ErnieSAT
|
|
||||||
|
|
||||||
random.seed(0)
|
|
||||||
np.random.seed(0)
|
|
||||||
|
|
||||||
|
|
||||||
def get_wav(wav_path: str,
|
|
||||||
source_lang: str='english',
|
|
||||||
target_lang: str='english',
|
|
||||||
model_name: str="paddle_checkpoint_en",
|
|
||||||
old_str: str="",
|
|
||||||
new_str: str="",
|
|
||||||
non_autoreg: bool=True):
|
|
||||||
wav_org, output_feat, old_span_bdy, new_span_bdy, fs, hop_length = get_mlm_output(
|
|
||||||
source_lang=source_lang,
|
|
||||||
target_lang=target_lang,
|
|
||||||
model_name=model_name,
|
|
||||||
wav_path=wav_path,
|
|
||||||
old_str=old_str,
|
|
||||||
new_str=new_str,
|
|
||||||
use_teacher_forcing=non_autoreg)
|
|
||||||
|
|
||||||
masked_feat = output_feat[new_span_bdy[0]:new_span_bdy[1]]
|
|
||||||
|
|
||||||
alt_wav = get_voc_out(masked_feat)
|
|
||||||
|
|
||||||
old_time_bdy = [hop_length * x for x in old_span_bdy]
|
|
||||||
|
|
||||||
wav_replaced = np.concatenate(
|
|
||||||
[wav_org[:old_time_bdy[0]], alt_wav, wav_org[old_time_bdy[1]:]])
|
|
||||||
|
|
||||||
data_dict = {"origin": wav_org, "output": wav_replaced}
|
|
||||||
|
|
||||||
return data_dict
|
|
||||||
|
|
||||||
|
|
||||||
def load_model(model_name: str="paddle_checkpoint_en"):
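# This variant builds ErnieSAT directly from default.yaml (via yacs CfgNode),
# loads the .pdparams state dict, and prefixes every key with "model." so it
# matches the wrapper's parameter names before set_state_dict.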
|
|
||||||
config_path = './pretrained_model/{}/default.yaml'.format(model_name)
|
|
||||||
model_path = './pretrained_model/{}/model.pdparams'.format(model_name)
|
|
||||||
with open(config_path) as f:
|
|
||||||
conf = CfgNode(yaml.safe_load(f))
|
|
||||||
token_list = list(conf.token_list)
|
|
||||||
vocab_size = len(token_list)
|
|
||||||
odim = conf.n_mels
|
|
||||||
mlm_model = ErnieSAT(idim=vocab_size, odim=odim, **conf["model"])
|
|
||||||
state_dict = paddle.load(model_path)
|
|
||||||
new_state_dict = {}
|
|
||||||
for key, value in state_dict.items():
|
|
||||||
new_key = "model." + key
|
|
||||||
new_state_dict[new_key] = value
|
|
||||||
mlm_model.set_state_dict(new_state_dict)
|
|
||||||
mlm_model.eval()
|
|
||||||
|
|
||||||
return mlm_model, conf
|
|
||||||
|
|
||||||
|
|
||||||
def read_data(uid: str, prefix: os.PathLike):
|
|
||||||
# get the text corresponding to uid
|
|
||||||
mfa_text = read_2col_text(prefix + '/text')[uid]
|
|
||||||
# get the audio path corresponding to uid
|
|
||||||
mfa_wav_path = read_2col_text(prefix + '/wav.scp')[uid]
|
|
||||||
if not os.path.isabs(mfa_wav_path):
|
|
||||||
mfa_wav_path = prefix + mfa_wav_path
|
|
||||||
return mfa_text, mfa_wav_path
|
|
||||||
|
|
||||||
|
|
||||||
def get_align_data(uid: str, prefix: os.PathLike):
|
|
||||||
mfa_path = prefix + "mfa_"
|
|
||||||
mfa_text = read_2col_text(mfa_path + 'text')[uid]
|
|
||||||
mfa_start = load_num_sequence_text(
|
|
||||||
mfa_path + 'start', loader_type='text_float')[uid]
|
|
||||||
mfa_end = load_num_sequence_text(
|
|
||||||
mfa_path + 'end', loader_type='text_float')[uid]
|
|
||||||
mfa_wav_path = read_2col_text(mfa_path + 'wav.scp')[uid]
|
|
||||||
return mfa_text, mfa_start, mfa_end, mfa_wav_path
|
|
||||||
|
|
||||||
|
|
||||||
# get the range of mel frames to be masked
|
|
||||||
def get_masked_mel_bdy(mfa_start: List[float],
|
|
||||||
mfa_end: List[float],
|
|
||||||
fs: int,
|
|
||||||
hop_length: int,
|
|
||||||
span_to_repl: List[List[int]]):
|
|
||||||
align_start = np.array(mfa_start)
|
|
||||||
align_end = np.array(mfa_end)
|
|
||||||
align_start = np.floor(fs * align_start / hop_length).astype('int')
|
|
||||||
align_end = np.floor(fs * align_end / hop_length).astype('int')
|
|
||||||
if span_to_repl[0] >= len(mfa_start):
|
|
||||||
span_bdy = [align_end[-1], align_end[-1]]
|
|
||||||
else:
|
|
||||||
span_bdy = [
|
|
||||||
align_start[span_to_repl[0]], align_end[span_to_repl[1] - 1]
|
|
||||||
]
|
|
||||||
return span_bdy, align_start, align_end
|
|
||||||
|
|
||||||
|
|
||||||
def recover_dict(word2phns: Dict[str, str], tp_word2phns: Dict[str, str]):
|
|
||||||
dic = {}
|
|
||||||
keys_to_del = []
|
|
||||||
exist_idx = []
|
|
||||||
sp_count = 0
|
|
||||||
add_sp_count = 0
|
|
||||||
for key in word2phns.keys():
|
|
||||||
idx, wrd = key.split('_')
|
|
||||||
if wrd == 'sp':
|
|
||||||
sp_count += 1
|
|
||||||
exist_idx.append(int(idx))
|
|
||||||
else:
|
|
||||||
keys_to_del.append(key)
|
|
||||||
|
|
||||||
for key in keys_to_del:
|
|
||||||
del word2phns[key]
|
|
||||||
|
|
||||||
cur_id = 0
|
|
||||||
for key in tp_word2phns.keys():
|
|
||||||
if cur_id in exist_idx:
|
|
||||||
dic[str(cur_id) + "_sp"] = 'sp'
|
|
||||||
cur_id += 1
|
|
||||||
add_sp_count += 1
|
|
||||||
idx, wrd = key.split('_')
|
|
||||||
dic[str(cur_id) + "_" + wrd] = tp_word2phns[key]
|
|
||||||
cur_id += 1
|
|
||||||
|
|
||||||
if add_sp_count + 1 == sp_count:
|
|
||||||
dic[str(cur_id) + "_sp"] = 'sp'
|
|
||||||
add_sp_count += 1
|
|
||||||
|
|
||||||
assert add_sp_count == sp_count, "sp are not added in dic"
|
|
||||||
return dic
|
|
||||||
|
|
||||||
|
|
||||||
def get_max_idx(dic):
|
|
||||||
return sorted([int(key.split('_')[0]) for key in dic.keys()])[-1]
|
|
||||||
|
|
||||||
|
|
||||||
def get_phns_and_spans(wav_path: str,
                       old_str: str="",
                       new_str: str="",
                       source_lang: str="english",
                       target_lang: str="english"):
    is_append = (old_str == new_str[:len(old_str)])
    old_phns, mfa_start, mfa_end = [], [], []
    # source
    if source_lang == "english":
        intervals, word2phns = alignment(wav_path, old_str)
    elif source_lang == "chinese":
        intervals, word2phns = alignment_zh(wav_path, old_str)
        _, tp_word2phns = words2phns_zh(old_str)

        for key, value in tp_word2phns.items():
            idx, wrd = key.split('_')
            cur_val = " ".join(value)
            tp_word2phns[key] = cur_val

        word2phns = recover_dict(word2phns, tp_word2phns)
    else:
        assert source_lang == "chinese" or source_lang == "english", \
            "source_lang is wrong..."

    for item in intervals:
        old_phns.append(item[0])
        mfa_start.append(float(item[1]))
        mfa_end.append(float(item[2]))
    # target
    if is_append and (source_lang != target_lang):
        cross_lingual_clone = True
    else:
        cross_lingual_clone = False

    if cross_lingual_clone:
        str_origin = new_str[:len(old_str)]
        str_append = new_str[len(old_str):]

        if target_lang == "chinese":
            phns_origin, origin_word2phns = words2phns(str_origin)
            phns_append, append_word2phns_tmp = words2phns_zh(str_append)

        elif target_lang == "english":
            # original sentence
            phns_origin, origin_word2phns = words2phns_zh(str_origin)
            # sentence to clone
            phns_append, append_word2phns_tmp = words2phns(str_append)
        else:
            assert target_lang == "chinese" or target_lang == "english", \
                "cloning is not supported for this language, please check it."

        new_phns = phns_origin + phns_append

        append_word2phns = {}
        length = len(origin_word2phns)
        for key, value in append_word2phns_tmp.items():
            idx, wrd = key.split('_')
            append_word2phns[str(int(idx) + length) + '_' + wrd] = value
        new_word2phns = origin_word2phns.copy()
        new_word2phns.update(append_word2phns)

    else:
        if source_lang == target_lang and target_lang == "english":
            new_phns, new_word2phns = words2phns(new_str)
        elif source_lang == target_lang and target_lang == "chinese":
            new_phns, new_word2phns = words2phns_zh(new_str)
        else:
            assert source_lang == target_lang, \
                "source language is not the same as target language..."

    span_to_repl = [0, len(old_phns) - 1]
    span_to_add = [0, len(new_phns) - 1]
    left_idx = 0
    new_phns_left = []
    sp_count = 0
    # find the left different index
    for key in word2phns.keys():
        idx, wrd = key.split('_')
        if wrd == 'sp':
            sp_count += 1
            new_phns_left.append('sp')
        else:
            idx = str(int(idx) - sp_count)
            if idx + '_' + wrd in new_word2phns:
                left_idx += len(new_word2phns[idx + '_' + wrd])
                new_phns_left.extend(word2phns[key].split())
            else:
                span_to_repl[0] = len(new_phns_left)
                span_to_add[0] = len(new_phns_left)
                break

    # reverse word2phns and new_word2phns
    right_idx = 0
    new_phns_right = []
    sp_count = 0
    word2phns_max_idx = get_max_idx(word2phns)
    new_word2phns_max_idx = get_max_idx(new_word2phns)
    new_phns_mid = []
    if is_append:
        new_phns_right = []
        new_phns_mid = new_phns[left_idx:]
        span_to_repl[0] = len(new_phns_left)
        span_to_add[0] = len(new_phns_left)
        span_to_add[1] = len(new_phns_left) + len(new_phns_mid)
        span_to_repl[1] = len(old_phns) - len(new_phns_right)
    # speech edit
    else:
        for key in list(word2phns.keys())[::-1]:
            idx, wrd = key.split('_')
            if wrd == 'sp':
                sp_count += 1
                new_phns_right = ['sp'] + new_phns_right
            else:
                idx = str(new_word2phns_max_idx - (word2phns_max_idx - int(idx)
                                                   - sp_count))
                if idx + '_' + wrd in new_word2phns:
                    right_idx -= len(new_word2phns[idx + '_' + wrd])
                    new_phns_right = word2phns[key].split() + new_phns_right
                else:
                    span_to_repl[1] = len(old_phns) - len(new_phns_right)
                    new_phns_mid = new_phns[left_idx:right_idx]
                    span_to_add[1] = len(new_phns_left) + len(new_phns_mid)
                    if len(new_phns_mid) == 0:
                        span_to_add[1] = min(span_to_add[1] + 1, len(new_phns))
                        span_to_add[0] = max(0, span_to_add[0] - 1)
                        span_to_repl[0] = max(0, span_to_repl[0] - 1)
                        span_to_repl[1] = min(span_to_repl[1] + 1,
                                              len(old_phns))
                    break
    new_phns = new_phns_left + new_phns_mid + new_phns_right
    '''
    For that reason cover should not be given.
    For that reason cover is impossible to be given.
    span_to_repl: [17, 23] "should not"
    span_to_add: [17, 30] "is impossible to"
    '''
    return mfa_start, mfa_end, old_phns, new_phns, span_to_repl, span_to_add
# the durations from MFA and the durations predicted by the fastspeech2
# duration_predictor may differ, so compute a scaling factor between the
# predicted values and the ground-truth values
def get_dur_adj_factor(orig_dur: List[int],
                       pred_dur: List[int],
                       phns: List[str]):
    length = 0
    factor_list = []
    for orig, pred, phn in zip(orig_dur, pred_dur, phns):
        if pred == 0 or phn == 'sp':
            continue
        else:
            factor_list.append(orig / pred)
    factor_list = np.array(factor_list)
    factor_list.sort()
    if len(factor_list) < 5:
        return 1
    # trim the two smallest and two largest ratios before averaging
    length = 2
    avg = np.average(factor_list[length:-length])
    return avg
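# Illustrative example (not from the original source): with per-phone ratios
# orig/pred of [0.8, 0.9, 1.0, 1.1, 1.2, 2.0], the two smallest and two
# largest values are trimmed and the factor is mean([1.0, 1.1]) = 1.05;
# with fewer than 5 usable ratios the factor falls back to 1.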
def prep_feats_with_dur(wav_path: str,
                        source_lang: str="english",
                        target_lang: str="english",
                        old_str: str="",
                        new_str: str="",
                        mask_reconstruct: bool=False,
                        duration_adjust: bool=True,
                        start_end_sp: bool=False,
                        fs: int=24000,
                        hop_length: int=300):
    '''
    Returns:
        np.ndarray: new wav, replace the part to be edited in original wav with 0
        List[str]: new phones
        List[float]: mfa start of new wav
        List[float]: mfa end of new wav
        List[int]: masked mel boundary of original wav
        List[int]: masked mel boundary of new wav
    '''
    wav_org, _ = librosa.load(wav_path, sr=fs)

    mfa_start, mfa_end, old_phns, new_phns, span_to_repl, span_to_add = get_phns_and_spans(
        wav_path=wav_path,
        old_str=old_str,
        new_str=new_str,
        source_lang=source_lang,
        target_lang=target_lang)

    if start_end_sp:
        if new_phns[-1] != 'sp':
            new_phns = new_phns + ['sp']
    # Chinese phones are not necessarily all in the fastspeech2 dictionary;
    # unknown phones are replaced with 'sp'
    if target_lang == "english" or target_lang == "chinese":
        old_durs = eval_durs(old_phns, target_lang=source_lang)
    else:
        assert target_lang == "chinese" or target_lang == "english", \
            "calculating duration_predict is not supported for this language..."

    orig_old_durs = [e - s for e, s in zip(mfa_end, mfa_start)]
    if '[MASK]' in new_str:
        new_phns = old_phns
        span_to_add = span_to_repl
        d_factor_left = get_dur_adj_factor(
            orig_dur=orig_old_durs[:span_to_repl[0]],
            pred_dur=old_durs[:span_to_repl[0]],
            phns=old_phns[:span_to_repl[0]])
        d_factor_right = get_dur_adj_factor(
            orig_dur=orig_old_durs[span_to_repl[1]:],
            pred_dur=old_durs[span_to_repl[1]:],
            phns=old_phns[span_to_repl[1]:])
        d_factor = (d_factor_left + d_factor_right) / 2
        new_durs_adjusted = [d_factor * i for i in old_durs]
    else:
        if duration_adjust:
            d_factor = get_dur_adj_factor(
                orig_dur=orig_old_durs, pred_dur=old_durs, phns=old_phns)
            d_factor = d_factor * 1.25
        else:
            d_factor = 1

        if target_lang == "english" or target_lang == "chinese":
            new_durs = eval_durs(new_phns, target_lang=target_lang)
        else:
            assert target_lang == "chinese" or target_lang == "english", \
                "calculating duration_predict is not supported for this language..."

        new_durs_adjusted = [d_factor * i for i in new_durs]

    new_span_dur_sum = sum(new_durs_adjusted[span_to_add[0]:span_to_add[1]])
    old_span_dur_sum = sum(orig_old_durs[span_to_repl[0]:span_to_repl[1]])
    dur_offset = new_span_dur_sum - old_span_dur_sum
    new_mfa_start = mfa_start[:span_to_repl[0]]
    new_mfa_end = mfa_end[:span_to_repl[0]]
    for i in new_durs_adjusted[span_to_add[0]:span_to_add[1]]:
        if len(new_mfa_end) == 0:
            new_mfa_start.append(0)
            new_mfa_end.append(i)
        else:
            new_mfa_start.append(new_mfa_end[-1])
            new_mfa_end.append(new_mfa_end[-1] + i)
    new_mfa_start += [i + dur_offset for i in mfa_start[span_to_repl[1]:]]
    new_mfa_end += [i + dur_offset for i in mfa_end[span_to_repl[1]:]]

    # 3. get new wav
    # append after the original sentence
    if span_to_repl[0] >= len(mfa_start):
        left_idx = len(wav_org)
        right_idx = left_idx
    # replace inside the original sentence
    else:
        left_idx = int(np.floor(mfa_start[span_to_repl[0]] * fs))
        right_idx = int(np.ceil(mfa_end[span_to_repl[1] - 1] * fs))
    blank_wav = np.zeros(
        (int(np.ceil(new_span_dur_sum * fs)), ), dtype=wav_org.dtype)
    # in the original audio, the part to be edited is replaced with silence
    # whose length is decided by the fastspeech2 duration_predictor
    new_wav = np.concatenate(
        [wav_org[:left_idx], blank_wav, wav_org[right_idx:]])

    # 4. get old and new mel span to be masked
    # e.g. [92, 92]
    old_span_bdy, mfa_start, mfa_end = get_masked_mel_bdy(
        mfa_start=mfa_start,
        mfa_end=mfa_end,
        fs=fs,
        hop_length=hop_length,
        span_to_repl=span_to_repl)
    # e.g. [92, 174]
    # new_mfa_start, new_mfa_end: start/end times in seconds -> frame level
    new_span_bdy, new_mfa_start, new_mfa_end = get_masked_mel_bdy(
        mfa_start=new_mfa_start,
        mfa_end=new_mfa_end,
        fs=fs,
        hop_length=hop_length,
        span_to_repl=span_to_add)

    # old_span_bdy and new_span_bdy are frame-level ranges
    return new_wav, new_phns, new_mfa_start, new_mfa_end, old_span_bdy, new_span_bdy
def prep_feats(wav_path: str,
               source_lang: str="english",
               target_lang: str="english",
               old_str: str="",
               new_str: str="",
               duration_adjust: bool=True,
               start_end_sp: bool=False,
               mask_reconstruct: bool=False,
               fs: int=24000,
               hop_length: int=300,
               token_list: List[str]=[]):
    wav, phns, mfa_start, mfa_end, old_span_bdy, new_span_bdy = prep_feats_with_dur(
        source_lang=source_lang,
        target_lang=target_lang,
        old_str=old_str,
        new_str=new_str,
        wav_path=wav_path,
        duration_adjust=duration_adjust,
        start_end_sp=start_end_sp,
        mask_reconstruct=mask_reconstruct,
        fs=fs,
        hop_length=hop_length)

    token_to_id = {item: i for i, item in enumerate(token_list)}
    text = np.array(
        list(map(lambda x: token_to_id.get(x, token_to_id['<unk>']), phns)))
    span_bdy = np.array(new_span_bdy)

    batch = [('1', {
        "speech": wav,
        "align_start": mfa_start,
        "align_end": mfa_end,
        "text": text,
        "span_bdy": span_bdy
    })]

    return batch, old_span_bdy, new_span_bdy
def decode_with_model(mlm_model: nn.Layer,
                      collate_fn,
                      wav_path: str,
                      source_lang: str="english",
                      target_lang: str="english",
                      old_str: str="",
                      new_str: str="",
                      use_teacher_forcing: bool=False,
                      duration_adjust: bool=True,
                      start_end_sp: bool=False,
                      fs: int=24000,
                      hop_length: int=300,
                      token_list: List[str]=[]):
    batch, old_span_bdy, new_span_bdy = prep_feats(
        source_lang=source_lang,
        target_lang=target_lang,
        wav_path=wav_path,
        old_str=old_str,
        new_str=new_str,
        duration_adjust=duration_adjust,
        start_end_sp=start_end_sp,
        fs=fs,
        hop_length=hop_length,
        token_list=token_list)

    feats = collate_fn(batch)[1]

    if 'text_masked_pos' in feats.keys():
        feats.pop('text_masked_pos')

    output = mlm_model.inference(
        text=feats['text'],
        speech=feats['speech'],
        masked_pos=feats['masked_pos'],
        speech_mask=feats['speech_mask'],
        text_mask=feats['text_mask'],
        speech_seg_pos=feats['speech_seg_pos'],
        text_seg_pos=feats['text_seg_pos'],
        span_bdy=new_span_bdy,
        use_teacher_forcing=use_teacher_forcing)

    # concatenate the output segments
    output_feat = paddle.concat(x=output, axis=0)
    wav_org, _ = librosa.load(wav_path, sr=fs)
    return wav_org, output_feat, old_span_bdy, new_span_bdy, fs, hop_length
def get_mlm_output(wav_path: str,
                   model_name: str="paddle_checkpoint_en",
                   source_lang: str="english",
                   target_lang: str="english",
                   old_str: str="",
                   new_str: str="",
                   use_teacher_forcing: bool=False,
                   duration_adjust: bool=True,
                   start_end_sp: bool=False):
    mlm_model, train_conf = load_model(model_name)

    collate_fn = build_mlm_collate_fn(
        sr=train_conf.fs,
        n_fft=train_conf.n_fft,
        hop_length=train_conf.n_shift,
        win_length=train_conf.win_length,
        n_mels=train_conf.n_mels,
        fmin=train_conf.fmin,
        fmax=train_conf.fmax,
        mlm_prob=train_conf.mlm_prob,
        mean_phn_span=train_conf.mean_phn_span,
        seg_emb=train_conf.model['enc_input_layer'] == 'sega_mlm')

    return decode_with_model(
        source_lang=source_lang,
        target_lang=target_lang,
        mlm_model=mlm_model,
        collate_fn=collate_fn,
        wav_path=wav_path,
        old_str=old_str,
        new_str=new_str,
        use_teacher_forcing=use_teacher_forcing,
        duration_adjust=duration_adjust,
        start_end_sp=start_end_sp,
        fs=train_conf.fs,
        hop_length=train_conf.n_shift,
        token_list=train_conf.token_list)
def evaluate(uid: str,
             source_lang: str="english",
             target_lang: str="english",
             prefix: os.PathLike="./prompt/dev/",
             model_name: str="paddle_checkpoint_en",
             new_str: str="",
             prompt_decoding: bool=False,
             task_name: str=None):

    # get origin text and path of origin wav
    old_str, wav_path = read_data(uid=uid, prefix=prefix)

    if task_name == 'edit':
        new_str = new_str
    elif task_name == 'synthesize':
        new_str = old_str + new_str
    else:
        # cross-lingual clone: keep only the Chinese characters of new_str
        new_str = old_str + ' '.join([ch for ch in new_str if is_chinese(ch)])

    print('new_str is ', new_str)

    results_dict = get_wav(
        source_lang=source_lang,
        target_lang=target_lang,
        model_name=model_name,
        wav_path=wav_path,
        old_str=old_str,
        new_str=new_str)
    return results_dict


if __name__ == "__main__":
    # parse config and args
    args = parse_args()

    data_dict = evaluate(
        uid=args.uid,
        source_lang=args.source_lang,
        target_lang=args.target_lang,
        prefix=args.prefix,
        model_name=args.model_name,
        new_str=args.new_str,
        task_name=args.task_name)
    sf.write(args.output_name, data_dict['output'], samplerate=24000)
    print("finished...")
@ -1,97 +0,0 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse


def parse_args():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with acoustic model & vocoder")
    # acoustic model
    parser.add_argument(
        '--am',
        type=str,
        default='fastspeech2_csmsc',
        choices=[
            'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_ljspeech',
            'fastspeech2_aishell3', 'fastspeech2_vctk', 'tacotron2_csmsc',
            'tacotron2_ljspeech', 'tacotron2_aishell3'
        ],
        help='Choose acoustic model type of tts task.')
    parser.add_argument(
        '--am_config',
        type=str,
        default=None,
        help='Config of acoustic model. Use default config when it is None.')
    parser.add_argument(
        '--am_ckpt',
        type=str,
        default=None,
        help='Checkpoint file of acoustic model.')
    parser.add_argument(
        "--am_stat",
        type=str,
        default=None,
        help="mean and standard deviation used to normalize spectrogram when training acoustic model."
    )
    parser.add_argument(
        "--phones_dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--tones_dict", type=str, default=None, help="tone vocabulary file.")
    parser.add_argument(
        "--speaker_dict", type=str, default=None, help="speaker id map file.")

    # vocoder
    parser.add_argument(
        '--voc',
        type=str,
        default='pwgan_aishell3',
        choices=[
            'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk',
            'mb_melgan_csmsc', 'wavernn_csmsc', 'hifigan_csmsc',
            'hifigan_ljspeech', 'hifigan_aishell3', 'hifigan_vctk',
            'style_melgan_csmsc'
        ],
        help='Choose vocoder type of tts task.')
    parser.add_argument(
        '--voc_config',
        type=str,
        default=None,
        help='Config of voc. Use default config when it is None.')
    parser.add_argument(
        '--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.')
    parser.add_argument(
        "--voc_stat",
        type=str,
        default=None,
        help="mean and standard deviation used to normalize spectrogram when training voc."
    )
    # other
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")

    parser.add_argument("--model_name", type=str, help="model name")
    parser.add_argument("--uid", type=str, help="uid")
    parser.add_argument("--new_str", type=str, help="new string")
    parser.add_argument("--prefix", type=str, help="prefix")
    parser.add_argument(
        "--source_lang", type=str, default="english", help="source language")
    parser.add_argument(
        "--target_lang", type=str, default="english", help="target language")
    parser.add_argument("--output_name", type=str, help="output name")
    parser.add_argument("--task_name", type=str, help="task name")

    # pre
    args = parser.parse_args()
    return args
@ -1,175 +0,0 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
from typing import Dict
from typing import List
from typing import Union

import numpy as np
import paddle
import yaml
from sedit_arg_parser import parse_args
from yacs.config import CfgNode

from paddlespeech.t2s.exps.syn_utils import get_am_inference
from paddlespeech.t2s.exps.syn_utils import get_voc_inference


def read_2col_text(path: Union[Path, str]) -> Dict[str, str]:
    """Read a text file having 2 columns as a dict object.

    Examples:
        wav.scp:
            key1 /some/path/a.wav
            key2 /some/path/b.wav

        >>> read_2col_text('wav.scp')
        {'key1': '/some/path/a.wav', 'key2': '/some/path/b.wav'}

    """

    data = {}
    with Path(path).open("r", encoding="utf-8") as f:
        for linenum, line in enumerate(f, 1):
            sps = line.rstrip().split(maxsplit=1)
            if len(sps) == 1:
                k, v = sps[0], ""
            else:
                k, v = sps
            if k in data:
                raise RuntimeError(f"{k} is duplicated ({path}:{linenum})")
            data[k] = v
    return data
def load_num_sequence_text(path: Union[Path, str], loader_type: str="csv_int"
                           ) -> Dict[str, List[Union[float, int]]]:
    """Read a text file indicating sequences of numbers.

    Examples:
        key1 1 2 3
        key2 34 5 6

        >>> d = load_num_sequence_text('text')
        >>> np.testing.assert_array_equal(d["key1"], np.array([1, 2, 3]))
    """
    if loader_type == "text_int":
        delimiter = " "
        dtype = int
    elif loader_type == "text_float":
        delimiter = " "
        dtype = float
    elif loader_type == "csv_int":
        delimiter = ","
        dtype = int
    elif loader_type == "csv_float":
        delimiter = ","
        dtype = float
    else:
        raise ValueError(f"Not supported loader_type={loader_type}")

    # path looks like:
    #   utta 1,0
    #   uttb 3,4,5
    # -> return {'utta': np.ndarray([1, 0]),
    #            'uttb': np.ndarray([3, 4, 5])}
    d = read_2col_text(path)
    # Using for-loop instead of dict-comprehension for debuggability
    retval = {}
    for k, v in d.items():
        try:
            retval[k] = [dtype(i) for i in v.split(delimiter)]
        except TypeError:
            print(f'Error happened with path="{path}", id="{k}", value="{v}"')
            raise
    return retval
def is_chinese(ch):
    if u'\u4e00' <= ch <= u'\u9fff':
        return True
    else:
        return False
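# Illustrative example (not from the original source):
# is_chinese('好') -> True, is_chinese('a') -> False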
def get_voc_out(mel):
    # vocoder
    args = parse_args()
    with open(args.voc_config) as f:
        voc_config = CfgNode(yaml.safe_load(f))
    voc_inference = get_voc_inference(
        voc=args.voc,
        voc_config=voc_config,
        voc_ckpt=args.voc_ckpt,
        voc_stat=args.voc_stat)

    with paddle.no_grad():
        wav = voc_inference(mel)
    return np.squeeze(wav)
def eval_durs(phns, target_lang="chinese", fs=24000, hop_length=300):
    args = parse_args()

    if target_lang == 'english':
        args.am = "fastspeech2_ljspeech"
        args.am_config = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml"
        args.am_ckpt = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz"
        args.am_stat = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy"
        args.phones_dict = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt"

    elif target_lang == 'chinese':
        args.am = "fastspeech2_csmsc"
        args.am_config = "download/fastspeech2_conformer_baker_ckpt_0.5/conformer.yaml"
        args.am_ckpt = "download/fastspeech2_conformer_baker_ckpt_0.5/snapshot_iter_76000.pdz"
        args.am_stat = "download/fastspeech2_conformer_baker_ckpt_0.5/speech_stats.npy"
        args.phones_dict = "download/fastspeech2_conformer_baker_ckpt_0.5/phone_id_map.txt"

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should be >= 0 !")

    # Init body.
    with open(args.am_config) as f:
        am_config = CfgNode(yaml.safe_load(f))

    am_inference, am = get_am_inference(
        am=args.am,
        am_config=am_config,
        am_ckpt=args.am_ckpt,
        am_stat=args.am_stat,
        phones_dict=args.phones_dict,
        tones_dict=args.tones_dict,
        speaker_dict=args.speaker_dict,
        return_am=True)

    vocab_phones = {}
    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    for phn, phone_id in phn_id:
        vocab_phones[phn] = int(phone_id)
    vocab_size = len(vocab_phones)
    # phones not in the fastspeech2 vocabulary are replaced with 'sp'
    phonemes = [phn if phn in vocab_phones else "sp" for phn in phns]

    phone_ids = [vocab_phones[item] for item in phonemes]
    phone_ids.append(vocab_size - 1)
    phone_ids = paddle.to_tensor(np.array(phone_ids, np.int64))
    _, d_outs, _, _ = am.inference(phone_ids, spk_id=None, spk_emb=None)
    pre_d_outs = d_outs
    # convert predicted durations from frames to seconds
    phu_durs_new = pre_d_outs * hop_length / fs
    phu_durs_new = phu_durs_new.tolist()[:-1]
    return phu_durs_new
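# Illustrative example (not from the original source): with hop_length=300 and
# fs=24000, a predicted duration of 40 frames corresponds to
# 40 * 300 / 24000 = 0.5 seconds for that phone.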
@ -1,3 +0,0 @@
p243_new For that reason cover should not be given.
Prompt_003_new This was not the show for me.
p299_096 We are trying to establish a date.
@ -1,3 +0,0 @@
p243_new ../../prompt_wav/p243_313.wav
Prompt_003_new ../../prompt_wav/this_was_not_the_show_for_me.wav
p299_096 ../../prompt_wav/p299_096.wav
@ -1,27 +0,0 @@
#!/bin/bash

set -e
source path.sh

# en --> zh speech synthesis
# use Prompt_003_new as the prompt speech: "This was not the show for me." to synthesize: '今天天气很好'
# NOTE: the input new_str must be Chinese characters; otherwise preprocessing keeps only the Chinese characters, i.e. the preprocessed Chinese text is synthesized.

python local/inference.py \
    --task_name=cross-lingual_clone \
    --model_name=paddle_checkpoint_dual_mask_enzh \
    --uid=Prompt_003_new \
    --new_str='今天天气很好.' \
    --prefix='./prompt/dev/' \
    --source_lang=english \
    --target_lang=chinese \
    --output_name=pred_clone.wav \
    --voc=pwgan_aishell3 \
    --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
    --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
    --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
    --am=fastspeech2_csmsc \
    --am_config=download/fastspeech2_conformer_baker_ckpt_0.5/conformer.yaml \
    --am_ckpt=download/fastspeech2_conformer_baker_ckpt_0.5/snapshot_iter_76000.pdz \
    --am_stat=download/fastspeech2_conformer_baker_ckpt_0.5/speech_stats.npy \
    --phones_dict=download/fastspeech2_conformer_baker_ckpt_0.5/phone_id_map.txt
@ -1,27 +0,0 @@
#!/bin/bash

set -e
source path.sh

# en --> zh speech synthesis
# use Prompt_003_new as the prompt speech: "This was not the show for me." to synthesize: '今天天气很好'
# NOTE: the input new_str must be Chinese characters; otherwise preprocessing keeps only the Chinese characters, i.e. the preprocessed Chinese text is synthesized.

python local/inference_new.py \
    --task_name=cross-lingual_clone \
    --model_name=paddle_checkpoint_dual_mask_enzh \
    --uid=Prompt_003_new \
    --new_str='今天天气很好.' \
    --prefix='./prompt/dev/' \
    --source_lang=english \
    --target_lang=chinese \
    --output_name=pred_clone.wav \
    --voc=pwgan_aishell3 \
    --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
    --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
    --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
    --am=fastspeech2_csmsc \
    --am_config=download/fastspeech2_conformer_baker_ckpt_0.5/conformer.yaml \
    --am_ckpt=download/fastspeech2_conformer_baker_ckpt_0.5/snapshot_iter_76000.pdz \
    --am_stat=download/fastspeech2_conformer_baker_ckpt_0.5/speech_stats.npy \
    --phones_dict=download/fastspeech2_conformer_baker_ckpt_0.5/phone_id_map.txt
@ -1,26 +0,0 @@
#!/bin/bash

set -e
source path.sh

# English-only speech synthesis
# example: use the speech of p299_096 as the prompt speech: "We are trying to establish a date." to synthesize: 'I enjoy my life, do you?'

python local/inference.py \
    --task_name=synthesize \
    --model_name=paddle_checkpoint_en \
    --uid=p299_096 \
    --new_str='I enjoy my life, do you?' \
    --prefix='./prompt/dev/' \
    --source_lang=english \
    --target_lang=english \
    --output_name=pred_gen.wav \
    --voc=pwgan_aishell3 \
    --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
    --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
    --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
    --am=fastspeech2_ljspeech \
    --am_config=download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \
    --am_ckpt=download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \
    --am_stat=download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \
    --phones_dict=download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt
@ -1,26 +0,0 @@
#!/bin/bash

set -e
source path.sh

# English-only speech synthesis
# example: use the speech of p299_096 as the prompt speech: "We are trying to establish a date." to synthesize: 'I enjoy my life, do you?'

python local/inference_new.py \
    --task_name=synthesize \
    --model_name=paddle_checkpoint_en \
    --uid=p299_096 \
    --new_str='I enjoy my life, do you?' \
    --prefix='./prompt/dev/' \
    --source_lang=english \
    --target_lang=english \
    --output_name=pred_gen.wav \
    --voc=pwgan_aishell3 \
    --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
    --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
    --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
    --am=fastspeech2_ljspeech \
    --am_config=download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \
    --am_ckpt=download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \
    --am_stat=download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \
    --phones_dict=download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt
@ -1,27 +0,0 @@
#!/bin/bash

set -e
source path.sh

# English-only speech editing
# example: edit the original speech of p243_new: "For that reason cover should not be given." into the speech of 'for that reason cover is impossible to be given.'
# NOTE: the speech editing task currently supports replacing or inserting text at only 1 position in a sentence

python local/inference.py \
    --task_name=edit \
    --model_name=paddle_checkpoint_en \
    --uid=p243_new \
    --new_str='for that reason cover is impossible to be given.' \
    --prefix='./prompt/dev/' \
    --source_lang=english \
    --target_lang=english \
    --output_name=pred_edit.wav \
    --voc=pwgan_aishell3 \
    --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
    --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
    --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
    --am=fastspeech2_ljspeech \
    --am_config=download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \
    --am_ckpt=download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \
    --am_stat=download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \
    --phones_dict=download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt
@ -1,27 +0,0 @@
#!/bin/bash

set -e
source path.sh

# English-only speech editing
# example: edit the original speech of p243_new: "For that reason cover should not be given." into the speech of 'for that reason cover is impossible to be given.'
# NOTE: the speech editing task currently supports replacing or inserting text at only 1 position in a sentence

python local/inference_new.py \
    --task_name=edit \
    --model_name=paddle_checkpoint_en \
    --uid=p243_new \
    --new_str='for that reason cover is impossible to be given.' \
    --prefix='./prompt/dev/' \
    --source_lang=english \
    --target_lang=english \
    --output_name=pred_edit.wav \
    --voc=pwgan_aishell3 \
    --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
    --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
    --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
    --am=fastspeech2_ljspeech \
    --am_config=download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \
    --am_ckpt=download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \
    --am_stat=download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \
    --phones_dict=download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt
@ -1,6 +0,0 @@
#!/bin/bash

rm -rf *.wav
./run_sedit_en.sh          # speech editing task (English)
./run_gen_en.sh            # personalized speech synthesis task (English)
./run_clone_en_to_zh.sh    # cross-lingual speech synthesis task (English-to-Chinese voice cloning)
@ -1,6 +0,0 @@
#!/bin/bash

rm -rf *.wav
./run_sedit_en_new.sh          # speech editing task (English)
./run_gen_en_new.sh            # personalized speech synthesis task (English)
./run_clone_en_to_zh_new.sh    # cross-lingual speech synthesis task (English-to-Chinese voice cloning)
@ -1,27 +1,29 @@
import argparse
import os


def process_sentence(line):
    if line == '':
        return ''
    res = line[0]
    for i in range(1, len(line)):
        res += (' ' + line[i])
    return res


if __name__ == "__main__":
    paser = argparse.ArgumentParser(description="Input filename")
    paser.add_argument('-input_file')
    paser.add_argument('-output_file')
    sentence_cnt = 0
    args = paser.parse_args()
    with open(args.input_file, 'r') as f:
        with open(args.output_file, 'w') as write_f:
            while True:
                line = f.readline()
                if line:
                    sentence_cnt += 1
                    write_f.write(process_sentence(line))
                else:
                    break
    print('preprocess over')
    print('total sentences number:', sentence_cnt)