Merge branch 'database-search' of github.com:qingen/PaddleSpeech into database-search

pull/1589/head
qingen 4 years ago
commit 43d1dff1a0

@ -50,13 +50,13 @@ repos:
entry: bash .pre-commit-hooks/clang-format.hook -i
language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|cuh|proto)$
exclude: (?=speechx/speechx/kaldi).*(\.cpp|\.cc|\.h|\.py)$
exclude: (?=speechx/speechx/kaldi|speechx/patch).*(\.cpp|\.cc|\.h|\.py)$
- id: copyright_checker
name: copyright_checker
entry: python .pre-commit-hooks/copyright-check.hook
language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|proto|py)$
exclude: (?=third_party|pypinyin|speechx/speechx/kaldi).*(\.cpp|\.cc|\.h|\.py)$
exclude: (?=third_party|pypinyin|speechx/speechx/kaldi|speechx/patch).*(\.cpp|\.cc|\.h|\.py)$
- repo: https://github.com/asottile/reorder_python_imports
rev: v2.4.0
hooks:
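
The tightened `exclude` patterns above use a lookahead to skip vendored sources under `speechx/speechx/kaldi` and `speechx/patch`. A minimal sketch of how such a pattern behaves, assuming (as pre-commit does for `exclude`) that paths are matched with Python's `re.search`; the example paths are hypothetical:

```python
import re

# Same pattern as the updated copyright_checker exclude above.
pattern = re.compile(
    r"(?=third_party|pypinyin|speechx/speechx/kaldi|speechx/patch)"
    r".*(\.cpp|\.cc|\.h|\.py)$"
)

# Hypothetical paths: the first two are skipped, the last one is still checked.
for path in [
    "speechx/patch/openfst/log.h",
    "speechx/speechx/kaldi/base/io-funcs.cc",
    "paddlespeech/t2s/exps/synthesize.py",
]:
    print(path, "->", "excluded" if pattern.search(path) else "checked")
```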

@ -1,4 +1,13 @@
# Changelog
Date: 2022-3-08, Author: yt605155624.
Add features to: T2S:
- Add aishell3 hifigan egs.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1545
Date: 2022-3-08, Author: yt605155624.
Add features to: T2S:
- Add vctk hifigan egs.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1544
Date: 2022-1-29, Author: yt605155624.
Add features to: T2S:

@ -178,7 +178,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
<!---
2021.12.14: We would like to offer online courses to introduce the basics and research of speech, as well as code practice with `paddlespeech`. Please pay attention to our [Calendar](https://www.paddlepaddle.org.cn/live).
--->
- 🤗 2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/akhaliq/paddlespeech) Demos on Hugging Face Spaces are available!
- 🤗 2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) Demos on Hugging Face Spaces are available!
- 👏🏻 2021.12.10: PaddleSpeech CLI is available for Audio Classification, Automatic Speech Recognition, Speech Translation (English to Chinese) and Text-to-Speech.
### Community
@ -207,6 +207,7 @@ paddlespeech cls --input input.wav
```shell
paddlespeech asr --lang zh --input input_16k.wav
```
- web demo for Automatic Speech Recognition is integrated into [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See Demo: [ASR Demo](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR)
**Speech Translation** (English to Chinese)
(not supported on Mac and Windows now)
@ -218,7 +219,7 @@ paddlespeech st --input input_16k.wav
```shell
paddlespeech tts --input "你好,欢迎使用飞桨深度学习框架!" --output output.wav
```
- web demo for Text to Speech is integrated to [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See Demo: [TTS Demo](https://huggingface.co/spaces/akhaliq/paddlespeech)
- web demo for Text to Speech is integrated into [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See Demo: [TTS Demo](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS)
**Text Postprocessing**
- Punctuation Restoration
@ -397,9 +398,9 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tr>
<tr>
<td >HiFiGAN</td>
<td >CSMSC</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3</td>
<td>
<a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a>
<a href = "./examples/ljspeech/voc5">HiFiGAN-ljspeech</a> / <a href = "./examples/vctk/voc5">HiFiGAN-vctk</a> / <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> / <a href = "./examples/aishell3/voc5">HiFiGAN-aishell3</a>
</td>
</tr>
<tr>
@ -573,7 +574,6 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P
- Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) for years of attention, constructive advice and great help.
- Many thanks to [AK391](https://github.com/AK391) for TTS web demo on Huggingface Spaces using Gradio.
- Many thanks to [mymagicpower](https://github.com/mymagicpower) for the Java implementation of ASR upon [short](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk) and [long](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk) audio files.
- Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for developing Virtual Uploader(VUP)/Virtual YouTuber(VTuber) with PaddleSpeech TTS function.
- Many thanks to [745165806](https://github.com/745165806)/[PaddleSpeechTask](https://github.com/745165806/PaddleSpeechTask) for contributing Punctuation Restoration model.

@ -392,9 +392,9 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</tr>
<tr>
<td >HiFiGAN</td>
<td >CSMSC</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3</td>
<td>
<a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a>
<a href = "./examples/ljspeech/voc5">HiFiGAN-ljspeech</a> / <a href = "./examples/vctk/voc5">HiFiGAN-vctk</a> / <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> / <a href = "./examples/aishell3/voc5">HiFiGAN-aishell3</a>
</td>
</tr>
<tr>

@ -84,5 +84,8 @@ Here is a list of pretrained models released by PaddleSpeech that can be used by
| Model | Language | Sample Rate
| :--- | :---: | :---: |
| conformer_wenetspeech| zh| 16000
| transformer_librispeech| en| 16000
| conformer_wenetspeech| zh| 16k
| transformer_librispeech| en| 16k
| deepspeech2offline_aishell| zh| 16k
| deepspeech2online_aishell | zh | 16k
|deepspeech2offline_librispeech|en| 16k

@ -81,5 +81,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
| Model | Language | Sample Rate
| :--- | :---: | :---: |
| conformer_wenetspeech| zh| 16000
| transformer_librispeech| en| 16000
| conformer_wenetspeech | zh | 16k
| transformer_librispeech | en | 16k
| deepspeech2offline_aishell| zh| 16k
| deepspeech2online_aishell | zh | 16k
| deepspeech2offline_librispeech | en | 16k

@ -110,21 +110,22 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import ASRClientExecutor
import json
asrclient_executor = ASRClientExecutor()
asrclient_executor(
res = asrclient_executor(
input="./zh.wav",
server_ip="127.0.0.1",
port=8090,
sample_rate=16000,
lang="zh_cn",
audio_format="wav")
print(res.json())
```
Output:
```bash
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}}
time cost 0.604353 s.
```
### 5. TTS Client Usage
@ -146,7 +147,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `speed`: Audio speed, the value should be set between 0 and 3. Default: 1.0
- `volume`: Audio volume, the value should be set between 0 and 3. Default: 1.0
- `sample_rate`: Sampling rate, choice: [0, 8000, 16000], the default is the same as the model. Default: 0
- `output`: Output wave filepath. Default: `output.wav`.
- `output`: Output wave filepath. Default: None, which means not to save the audio to the local.
Output:
```bash
@ -160,9 +161,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import TTSClientExecutor
import json
ttsclient_executor = TTSClientExecutor()
ttsclient_executor(
res = ttsclient_executor(
input="您好,欢迎使用百度飞桨语音合成服务。",
server_ip="127.0.0.1",
port=8090,
@ -171,6 +173,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
volume=1.0,
sample_rate=0,
output="./output.wav")
response_dict = res.json()
print(response_dict["message"])
print("Save synthesized audio successfully on %s." % (response_dict['result']['save_path']))
print("Audio duration: %f s." %(response_dict['result']['duration']))
```
Output:
@ -178,7 +185,52 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
{'description': 'success.'}
Save synthesized audio successfully on ./output.wav.
Audio duration: 3.612500 s.
Response time: 0.388317 s.
```
### 6. CLS Client Usage
**Note:** The response time will be slightly longer when using the client for the first time
- Command Line (Recommended)
```
paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
```
Usage:
```bash
paddlespeech_client cls --help
```
Arguments:
- `server_ip`: server ip. Default: 127.0.0.1
- `port`: server port. Default: 8090
- `input`(required): Audio file to be classified.
- `topk`: topk scores of classification result.
Output:
```bash
[2022-03-09 20:44:39,974] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
[2022-03-09 20:44:39,975] [ INFO] - Response time 0.104360 s.
```
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import CLSClientExecutor
import json
clsclient_executor = CLSClientExecutor()
res = clsclient_executor(
input="./zh.wav",
server_ip="127.0.0.1",
port=8090,
topk=1)
print(res.json())
```
Output:
```bash
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
```
@ -189,3 +241,6 @@ Get all models supported by the ASR service via `paddlespeech_server stats --tas
### TTS model
Get all models supported by the TTS service via `paddlespeech_server stats --task tts`, where static models can be used for Paddle Inference.
### CLS model
Get all models supported by the CLS service via `paddlespeech_server stats --task cls`, where static models can be used for Paddle Inference.
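
If these queries need to be scripted, a minimal sketch, assuming only that the `paddlespeech_server` command above is on `PATH`:

```python
import subprocess

# Print the models supported by each service task, using the same
# `paddlespeech_server stats` command documented above.
for task in ["asr", "tts", "cls"]:
    subprocess.run(["paddlespeech_server", "stats", "--task", task], check=True)
```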

@ -80,7 +80,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
### 4. ASR Client Usage
**Note:** The response time will be slightly longer when using the client for the first time
- Command Line (Recommended)
```
@ -111,25 +111,26 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import ASRClientExecutor
import json
asrclient_executor = ASRClientExecutor()
asrclient_executor(
res = asrclient_executor(
input="./zh.wav",
server_ip="127.0.0.1",
port=8090,
sample_rate=16000,
lang="zh_cn",
audio_format="wav")
print(res.json())
```
Output:
```bash
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}}
time cost 0.604353 s.
```
### 5. TTS Client Usage
**Note:** The response time will be slightly longer when using the client for the first time
- Command Line (Recommended)
@ -150,7 +151,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `speed`: Audio speed; the value should be set between 0 and 3. Default: 1.0
- `volume`: Audio volume; the value should be set between 0 and 3. Default: 1.0
- `sample_rate`: Sampling rate, choices: [0, 8000, 16000]; the default is the same as the model. Default: 0
- `output`: Output wave filepath. Default: output.wav
- `output`: Output wave filepath. Default: None, which means the audio is not saved locally
Output:
```bash
@ -163,9 +164,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import TTSClientExecutor
import json
ttsclient_executor = TTSClientExecutor()
ttsclient_executor(
res = ttsclient_executor(
input="您好,欢迎使用百度飞桨语音合成服务。",
server_ip="127.0.0.1",
port=8090,
@ -174,6 +176,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
volume=1.0,
sample_rate=0,
output="./output.wav")
response_dict = res.json()
print(response_dict["message"])
print("Save synthesized audio successfully on %s." % (response_dict['result']['save_path']))
print("Audio duration: %f s." %(response_dict['result']['duration']))
```
Output:
@ -181,13 +188,63 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
{'description': 'success.'}
Save synthesized audio successfully on ./output.wav.
Audio duration: 3.612500 s.
Response time: 0.388317 s.
```
### 6. CLS Client Usage
**Note:** The response time will be slightly longer when using the client for the first time
- Command Line (Recommended)
```
paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
```
Usage:
```bash
paddlespeech_client cls --help
```
Arguments:
- `server_ip`: server ip address. Default: 127.0.0.1
- `port`: server port. Default: 8090
- `input` (required): Audio file to be classified.
- `topk`: topk scores of the classification result.
Output:
```bash
[2022-03-09 20:44:39,974] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
[2022-03-09 20:44:39,975] [ INFO] - Response time 0.104360 s.
```
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import CLSClientExecutor
import json
clsclient_executor = CLSClientExecutor()
res = clsclient_executor(
input="./zh.wav",
server_ip="127.0.0.1",
port=8090,
topk=1)
print(res.json())
```
Output:
```bash
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
```
## Models Supported by the Service
### ASR model
Get all models supported by the ASR service via `paddlespeech_server stats --task asr`, where static models can be used for Paddle Inference.
### TTS model
Get all models supported by the TTS service via `paddlespeech_server stats --task tts`, where static models can be used for Paddle Inference.
### CLS model
Get all models supported by the CLS service via `paddlespeech_server stats --task cls`, where static models can be used for Paddle Inference.

@ -0,0 +1,4 @@
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav --topk 1

@ -9,12 +9,14 @@ port: 8090
# The task format in the engine_list is: <speech task>_<engine type>
# task choices = ['asr_python', 'asr_inference', 'tts_python', 'tts_inference']
engine_list: ['asr_python', 'tts_python']
engine_list: ['asr_python', 'tts_python', 'cls_python']
#################################################################################
# ENGINE CONFIG #
#################################################################################
################################### ASR #########################################
################### speech task: asr; engine_type: python #######################
asr_python:
model: 'conformer_wenetspeech'
@ -46,6 +48,7 @@ asr_inference:
summary: True # False -> do not show predictor config
################################### TTS #########################################
################### speech task: tts; engine_type: python #######################
tts_python:
# am (acoustic model) choices=['speedyspeech_csmsc', 'fastspeech2_csmsc',
@ -105,3 +108,30 @@ tts_inference:
# others
lang: 'zh'
################################### CLS #########################################
################### speech task: cls; engine_type: python #######################
cls_python:
# model choices=['panns_cnn14', 'panns_cnn10', 'panns_cnn6']
model: 'panns_cnn14'
cfg_path: # [optional] Config of cls task.
ckpt_path: # [optional] Checkpoint file of model.
label_file: # [optional] Label file of cls task.
device: # set 'gpu:id' or 'cpu'
################### speech task: cls; engine_type: inference #######################
cls_inference:
# model_type choices=['panns_cnn14', 'panns_cnn10', 'panns_cnn6']
model_type: 'panns_cnn14'
cfg_path:
model_path: # the pdmodel file of the static model [optional]
params_path: # the pdiparams file of the static model [optional]
label_file: # [optional] Label file of cls task.
predictor_conf:
device: # set 'gpu:id' or 'cpu'
switch_ir_optim: True
glog_info: False # True -> print glog
summary: True # False -> do not show predictor config
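
As a quick sanity check before starting the server, the updated config can be loaded to confirm the CLS engine is listed. A minimal sketch, assuming PyYAML is installed and the file is saved at `conf/application.yaml` (the path is an assumption; the diff does not name the file):

```python
import yaml  # PyYAML

# Hypothetical path: the server config shown above, saved locally.
with open("conf/application.yaml") as f:
    config = yaml.safe_load(f)

print(config["engine_list"])          # expect ['asr_python', 'tts_python', 'cls_python']
assert "cls_python" in config["engine_list"]
print(config["cls_python"]["model"])  # 'panns_cnn14' per the CLS section above
```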

@ -35,3 +35,7 @@ We borrowed a lot of code from these repos to build `model` and `engine`, thanks
* [librosa](https://github.com/librosa/librosa/blob/main/LICENSE.md)
- ISC License
- Audio feature
* [ThreadPool](https://github.com/progschj/ThreadPool/blob/master/COPYING)
- zlib License
- ThreadPool

@ -49,17 +49,19 @@ Model Type | Dataset| Example Link | Pretrained Models| Static Models|Size (stat
WaveFlow| LJSpeech |[waveflow-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0)|[waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip)|||
Parallel WaveGAN| CSMSC |[PWGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1)|[pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip)|[pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip)|5.1MB|
Parallel WaveGAN| LJSpeech |[PWGAN-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc1)|[pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)|||
Parallel WaveGAN|AISHELL-3 |[PWGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1)|[pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip)|||
Parallel WaveGAN| AISHELL-3 |[PWGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1)|[pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip)|||
Parallel WaveGAN| VCTK |[PWGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc1)|[pwg_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.5.zip)|||
|Multi Band MelGAN | CSMSC |[MB MelGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc3) | [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip) <br>[mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip)|[mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip) |8.2MB|
Style MelGAN | CSMSC |[Style MelGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc4)|[style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip)| | |
HiFiGAN | CSMSC |[HiFiGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc5)|[hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)|[hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip)|50MB|
HiFiGAN | AISHELL-3 |[HiFiGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5)|[hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip)|||
HiFiGAN | VCTK |[HiFiGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc5)|[hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip)|||
WaveRNN | CSMSC |[WaveRNN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc6)|[wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip)|[wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip)|18MB|
### Voice Cloning
Model Type | Dataset| Example Link | Pretrained Models
:-------------:| :------------:| :-----: | :-----:
:-------------:| :------------:| :-----: | :-----: |
GE2E| AISHELL-3, etc. |[ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e)|[ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ge2e/ge2e_ckpt_0.3.zip)
GE2E + Tactron2| AISHELL-3 |[ge2e-tactron2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vc0)|[tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip)
GE2E + FastSpeech2 | AISHELL-3 |[ge2e-fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vc1)|[fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)
@ -67,9 +69,9 @@ GE2E + FastSpeech2 | AISHELL-3 |[ge2e-fastspeech2-aishell3](https://github.com/
## Audio Classification Models
Model Type | Dataset| Example Link | Pretrained Models
:-------------:| :------------:| :-----: | :-----:
PANN | Audioset| [audioset_tagging_cnn](https://github.com/qiuqiangkong/audioset_tagging_cnn) | [panns_cnn6.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn6.pdparams), [panns_cnn10.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn10.pdparams), [panns_cnn14.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn14.pdparams)
Model Type | Dataset| Example Link | Pretrained Models | Static Models
:-------------:| :------------:| :-----: | :-----: | :-----:
PANN | Audioset| [audioset_tagging_cnn](https://github.com/qiuqiangkong/audioset_tagging_cnn) | [panns_cnn6.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn6.pdparams), [panns_cnn10.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn10.pdparams), [panns_cnn14.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn14.pdparams) | [panns_cnn6_static.tar.gz](https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn6_static.tar.gz)(18M), [panns_cnn10_static.tar.gz](https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn10_static.tar.gz)(19M), [panns_cnn14_static.tar.gz](https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn14_static.tar.gz)(289M)
PANN | ESC-50 |[pann-esc50](../../examples/esc50/cls0)|[esc50_cnn6.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn6.tar.gz), [esc50_cnn10.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn10.tar.gz), [esc50_cnn14.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn14.tar.gz)
## Punctuation Restoration Models

@ -4,18 +4,44 @@ config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_aishell3 \
--voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
fi

@ -4,21 +4,50 @@ config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "in hifigan syn_e2e"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=fastspeech2_nosil_aishell3_ckpt_0.4/speech_stats.npy \
--voc=hifigan_aishell3 \
--voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=fastspeech2_nosil_aishell3_ckpt_0.4/phone_id_map.txt \
--speaker_dict=fastspeech2_nosil_aishell3_ckpt_0.4/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
fi

@ -1,6 +1,6 @@
#!/bin/bash
stage=3
stage=0
stop_stage=100
config_path=$1

@ -3,7 +3,7 @@
set -e
source path.sh
gpus=0
gpus=0,1
stage=0
stop_stage=100

@ -0,0 +1,156 @@
# HiFiGAN with AISHELL-3
This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with [AISHELL-3](http://www.aishelltech.com/aishell_3).
AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus that could be used to train multi-speaker Text-to-Speech (TTS) systems.
## Dataset
### Download and Extract
Download AISHELL-3.
```bash
wget https://www.openslr.org/resources/93/data_aishell3.tgz
```
Extract AISHELL-3.
```bash
mkdir data_aishell3
tar zxvf data_aishell3.tgz -C data_aishell3
```
### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download it from here: [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by following the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which currently uses MFA1.x) in our repo.
## Get Started
Assume the path to the dataset is `~/datasets/data_aishell3`.
Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage; for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance.
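A minimal sketch of inspecting the preprocessed data described above, assuming `numpy` and `jsonlines` are installed; the field names inside `metadata.jsonl` are left to inspection rather than assumed:

```python
import jsonlines
import numpy as np

# Statistics computed from the training set (dump/train/feats_stats.npy above);
# typically a (2, n_mels) array holding per-bin mean and scale.
stats = np.load("dump/train/feats_stats.npy")
print("feats_stats shape:", stats.shape)

# Each line of metadata.jsonl describes one utterance; print the keys of the
# first record instead of assuming exact field names.
with jsonlines.open("dump/train/norm/metadata.jsonl") as reader:
    first = next(iter(reader))
print("metadata fields:", sorted(first.keys()))
```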
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file to overwrite default config.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
benchmark:
arguments related to benchmark.
--batch-size BATCH_SIZE
batch size.
--max-iter MAX_ITER train max steps.
--run-benchmark RUN_BENCHMARK
runing benchmark or not, if True, use the --batch-size
and --max-iter.
--profiler_options PROFILER_OPTIONS
The option of profiler, which should be in format
"key1=value1;key2=value2;key3=value3".
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--output-dir OUTPUT_DIR] [--ngpu NGPU]
Synthesize with GANVocoder.
optional arguments:
-h, --help show this help message and exit
--generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT
snapshot to load.
--test-metadata TEST_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
```
1. `--config` is the config file. You should use the same config with which the model is trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
The pretrained model can be downloaded here [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip).
Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
:-------------:| :------------:| :-----: | :-----: | :--------:
default| 1(gpu) x 2500000|24.060|0.1068|7.499
HiFiGAN checkpoint contains files listed below.
```text
hifigan_aishell3_ckpt_0.2.0
├── default.yaml # default config used to train hifigan
├── feats_stats.npy # statistics used to normalize spectrogram when training hifigan
└── snapshot_iter_2500000.pdz # generator parameters of hifigan
```
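A minimal sketch of poking at the unpacked checkpoint directory, assuming `paddle`, `numpy`, and PyYAML are available; the keys inside the `.pdz` state dict are an implementation detail of the snapshotting code, so they are printed rather than assumed:

```python
import numpy as np
import paddle
import yaml

ckpt_dir = "hifigan_aishell3_ckpt_0.2.0"  # unzipped from the link above

with open(f"{ckpt_dir}/default.yaml") as f:
    config = yaml.safe_load(f)
print("sample rate:", config["fs"], "| hop size:", config["n_shift"])

stats = np.load(f"{ckpt_dir}/feats_stats.npy")
print("feats_stats shape:", stats.shape)

# The snapshot is a paddle-saved dict; just list its top-level keys.
state = paddle.load(f"{ckpt_dir}/snapshot_iter_2500000.pdz")
print("checkpoint keys:", list(state.keys()))
```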
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.

@ -0,0 +1,168 @@
# This is the configuration file for AISHELL-3 dataset.
# This configuration is based on HiFiGAN V1, which is
# an official configuration. But I found that the optimizer
# setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales
# is also modified from the original 256 shift setting.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [5, 5, 4, 3] # Upsampling scales.
upsample_kernel_sizes: [10, 10, 8, 6] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: True # Whether to use additional conv layer in residual blocks.
bias: True # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: True
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: True # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: True # Whether to use bias parameter in conv layer.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
use_spectral_norm: False # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: False # Whether to use multi-resolution STFT loss.
use_mel_loss: True # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 24000
fft_size: 2048
hop_size: 300
win_length: 1200
window: "hann"
num_mels: 80
fmin: 0
fmax: 12000
log_base: null
generator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
use_feat_match_loss: True
feat_match_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
average_by_layers: False # Whether to average loss by #layers in each discriminator.
include_final_outputs: False # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure it is divisible by hop_size.
num_workers: 2 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random
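
The header of this config notes that the original HiFiGAN recipe's AdamW + ExponentialLR setup was swapped for Adam + MultiStepLR. A minimal sketch of what the generator's optimizer and scheduler settings above correspond to in Paddle, with a stand-in `paddle.nn.Linear` instead of the real HiFiGAN generator:

```python
import paddle

# Stand-in module; in the real recipe this would be the HiFiGAN generator.
model = paddle.nn.Linear(80, 80)

# Mirrors generator_scheduler_params / generator_optimizer_params above.
scheduler = paddle.optimizer.lr.MultiStepDecay(
    learning_rate=2.0e-4,
    milestones=[200000, 400000, 600000, 800000],
    gamma=0.5,  # lr is halved at each milestone
)
optimizer = paddle.optimizer.Adam(
    learning_rate=scheduler,
    beta1=0.5,
    beta2=0.9,
    weight_decay=0.0,
    parameters=model.parameters(),
)
print(optimizer.get_lr())  # 2e-4 until the first milestone
```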

@ -0,0 +1,55 @@
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./aishell3_alignment_tone \
--output=durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/../preprocess.py \
--rootdir=~/datasets/data_aishell3/ \
--dataset=aishell3 \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--cut-sil=True \
--num-cpu=20
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
fi

@ -0,0 +1,14 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \
--generator-type=hifigan

@ -0,0 +1,13 @@
#!/bin/bash
config_path=$1
train_output_path=$2
FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1

@ -0,0 +1,13 @@
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=hifigan
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/gan_vocoder/${MODEL}

@ -0,0 +1,32 @@
#!/bin/bash
set -e
source path.sh
gpus=0
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_5000.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

@ -17,11 +17,8 @@ Data preparation.
Download: http://groups.inf.ed.ac.uk/ami/download/
Prepares metadata files (JSON) from manual annotations "segments/" using RTTM format (Oracle VAD).
Authors
* qingenz123@126.com (Qingen ZHAO) 2022
"""
import argparse
import glob
import json

@ -15,10 +15,6 @@
AMI corpus contained 100 hours of meeting recording.
This script returns the standard train, dev and eval split for AMI corpus.
For more information on dataset please refer to http://groups.inf.ed.ac.uk/ami/corpus/datasets.shtml
Authors
* qingenz123@126.com (Qingen ZHAO) 2022
"""
ALLOWED_OPTIONS = ["scenario_only", "full_corpus", "full_corpus_asr"]

@ -13,10 +13,6 @@
# limitations under the License.
"""
Data reading and writing.
Authors
* qingenz123@126.com (Qingen ZHAO) 2022
"""
import os
import pickle

@ -7,7 +7,7 @@ ckpt_name=$3
stage=0
stop_stage=0
# TODO: the exported (static-graph) tacotron2 is not as loud as the dynamic-graph result; some function may not behave the same in dynamic vs. static graph during decode
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \

@ -14,7 +14,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--am_stat=dump/train/feats_stats.npy \
--voc=pwgan_csmsc \
--voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
@ -34,7 +34,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--am_stat=dump/train/feats_stats.npy \
--voc=mb_melgan_csmsc \
--voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz\
@ -53,7 +53,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--am_stat=dump/train/feats_stats.npy \
--voc=style_melgan_csmsc \
--voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \
@ -73,7 +73,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--am_stat=dump/train/feats_stats.npy \
--voc=hifigan_csmsc \
--voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \
@ -93,7 +93,7 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--am_stat=dump/train/feats_stats.npy \
--voc=wavernn_csmsc \
--voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
--voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \

@ -0,0 +1,133 @@
# HiFiGAN with LJSpeech-1.1
This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/).
## Dataset
### Download and Extract
Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/).
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence at the edges of the audio.
You can download it from here: [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model by following the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
## Get Started
Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.
Assume the path to the MFA result of LJSpeech-1.1 is `./ljspeech_alignment`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance.
### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file to overwrite default config.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
benchmark:
arguments related to benchmark.
--batch-size BATCH_SIZE
batch size.
--max-iter MAX_ITER train max steps.
--run-benchmark RUN_BENCHMARK
runing benchmark or not, if True, use the --batch-size
and --max-iter.
--profiler_options PROFILER_OPTIONS
The option of profiler, which should be in format
"key1=value1;key2=value2;key3=value3".
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--output-dir OUTPUT_DIR] [--ngpu NGPU]
Synthesize with GANVocoder.
optional arguments:
-h, --help show this help message and exit
--generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT
snapshot to load.
--test-metadata TEST_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
```
1. `--config` is the vocoder config file. You should use the same config with which the model is trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory (see the sketch after this list).
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
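A minimal sketch of picking the newest snapshot from the training output directory before synthesizing; the `exp/default` path mirrors the `train_output_path/checkpoints/snapshot_iter_*.pdz` convention used by `run.sh` in these examples and is only illustrative:

```python
import re
from pathlib import Path

def latest_snapshot(ckpt_dir: Path) -> Path:
    """Return the snapshot_iter_*.pdz with the largest iteration number."""
    def iteration(p: Path) -> int:
        m = re.search(r"snapshot_iter_(\d+)\.pdz$", p.name)
        return int(m.group(1)) if m else -1
    return max(ckpt_dir.glob("snapshot_iter_*.pdz"), key=iteration)

# Illustrative path; raises ValueError if no snapshot has been written yet.
print(latest_snapshot(Path("exp/default/checkpoints")))
```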
## Pretrained Model
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.

@ -0,0 +1,167 @@
# This is the configuration file for LJSpeech dataset.
# This configuration is based on HiFiGAN V1, which is an official configuration.
# But I found that the optimizer setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales is also modified from the original 256 shift setting.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 22050 # Sampling rate.
n_fft: 1024 # FFT size (samples).
n_shift: 256 # Hop size (samples). 11.6ms
win_length: null # Window length (samples).
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [8, 8, 2, 2] # Upsampling scales.
upsample_kernel_sizes: [16, 16, 4, 4] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: True # Whether to use additional conv layer in residual blocks.
bias: True # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: True
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: True # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: True # Whether to use bias parameter in conv layer.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
use_spectral_norm: False # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: False # Whether to use multi-resolution STFT loss.
use_mel_loss: True # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 22050
fft_size: 1024
hop_size: 256
win_length: null
window: "hann"
num_mels: 80
fmin: 0
fmax: 11025
log_base: null
generator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
use_feat_match_loss: True
feat_match_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
average_by_layers: False # Whether to average loss by #layers in each discriminator.
include_final_outputs: False # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_max_steps: 8192 # Length of each audio clip in the batch. Make sure it is divisible by hop_size.
num_workers: 2 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random
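The `milestones`/`gamma` pairs above implement a step decay: each time training passes a milestone, the learning rate is multiplied by `gamma`. A minimal sketch of the equivalent scheduler, assuming the GAN vocoder trainer maps these settings onto `paddle.optimizer.lr.MultiStepDecay` (check the trainer for the actual wiring):
```python
import paddle

# Sketch of generator_scheduler_params above (the discriminator uses the same settings),
# assuming one scheduler step per training step.
scheduler = paddle.optimizer.lr.MultiStepDecay(
    learning_rate=2.0e-4,                         # initial learning rate
    milestones=[200000, 400000, 600000, 800000],
    gamma=0.5)                                    # lr is halved at each milestone

# After the last milestone the lr has been halved four times:
# 2.0e-4 * 0.5 ** 4 == 1.25e-05
```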

@ -0,0 +1,55 @@
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./ljspeech_alignment \
--output=durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/../preprocess.py \
--rootdir=~/datasets/LJSpeech-1.1/ \
--dataset=ljspeech \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--cut-sil=True \
--num-cpu=20
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
fi

@ -0,0 +1,14 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \
--generator-type=hifigan

@ -0,0 +1,13 @@
#!/bin/bash
config_path=$1
train_output_path=$2
FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1

@ -0,0 +1,13 @@
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=hifigan
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/gan_vocoder/${MODEL}

@ -0,0 +1,32 @@
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_5000.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

@ -4,18 +4,43 @@ config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_vctk \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_vctk \
--voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_vctk \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_vctk \
--voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_vctk \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_vctk \
--voc_config=hifigan_vctk_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_vctk_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_vctk_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
fi

@ -4,21 +4,49 @@ config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_vctk \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_vctk \
--voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--lang=en \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_vctk \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_vctk \
--voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--lang=en \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_vctk \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_vctk \
--voc_config=hifigan_vctk_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_vctk_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_vctk_ckpt_0.2.0/feats_stats.npy \
--lang=en \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
fi

@ -0,0 +1,153 @@
# HiFiGAN with VCTK
This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with [VCTK](https://datashare.ed.ac.uk/handle/10283/3443).
## Dataset
### Download and Extract
Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/VCTK-Corpus-0.92`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
Note: we removed three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)):
1. `p315`, because there are no transcripts for it.
2. `p280` and `p362`, because there are no `*_mic2.flac` files (which are better than `*_mic1.flac`) for them.
## Get Started
Assume the path to the dataset is `~/datasets/VCTK-Corpus-0.92`.
Assume the path to the MFA result of VCTK is `./vctk_alignment`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the `dump` folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set and stored in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the utterance id and the path to the spectrogram of each utterance.
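Each line of `metadata.jsonl` is a standalone JSON object, so the files are easy to inspect; a small sketch (not part of the recipe, and field names such as `utt_id`/`feats` are illustrative) for peeking at the normalized training metadata:
```python
import json

# Print the keys of the first record in the normalized training metadata.
# Check your own metadata.jsonl for the exact field names written by preprocess.py.
with open("dump/train/norm/metadata.jsonl") as f:
    first = json.loads(next(f))
print(first.keys())
```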
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file to overwrite default config.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
benchmark:
arguments related to benchmark.
--batch-size BATCH_SIZE
batch size.
--max-iter MAX_ITER train max steps.
--run-benchmark RUN_BENCHMARK
runing benchmark or not, if True, use the --batch-size
and --max-iter.
--profiler_options PROFILER_OPTIONS
The option of profiler, which should be in format
"key1=value1;key2=value2;key3=value3".
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--output-dir OUTPUT_DIR] [--ngpu NGPU]
Synthesize with GANVocoder.
optional arguments:
-h, --help show this help message and exit
--generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT
snapshot to load.
--test-metadata TEST_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
```
1. `--config` config file. You should use the same config with which the model is trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `test/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
The pretrained model can be downloaded here [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip).
Model | Step | eval/generator_loss | eval/mel_loss | eval/feature_matching_loss
:-------------: | :------------: | :-----: | :-----: | :--------:
default | 1(gpu) x 2500000 | 58.092 | 0.1234 | 24.384
The HiFiGAN checkpoint contains the files listed below.
```text
hifigan_vctk_ckpt_0.2.0
├── default.yaml # default config used to train hifigan
├── feats_stats.npy # statistics used to normalize spectrogram when training hifigan
└── snapshot_iter_2500000.pdz # generator parameters of hifigan
```
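For a quick copy-synthesis check outside the recipe scripts, the released files above can also be loaded directly. The sketch below is only an outline under stated assumptions (the `HiFiGANGenerator` import path, the `generator_params` key inside the snapshot, and the `inference` method are assumed from the GAN vocoder code path and may differ between versions); `./local/synthesize.sh` remains the supported entry point.
```python
import paddle
import yaml
# Import path is an assumption; adjust it to your PaddleSpeech version.
from paddlespeech.t2s.models.hifigan import HiFiGANGenerator

with open("hifigan_vctk_ckpt_0.2.0/default.yaml") as f:
    config = yaml.safe_load(f)

generator = HiFiGANGenerator(**config["generator_params"])
state_dict = paddle.load("hifigan_vctk_ckpt_0.2.0/snapshot_iter_2500000.pdz")
# "generator_params" is the assumed key for the generator weights;
# inspect state_dict.keys() if loading fails.
generator.set_state_dict(state_dict["generator_params"])
generator.remove_weight_norm()
generator.eval()

# mel: normalized log-mel frames with shape (num_frames, 80), e.g. taken from dump/test/norm.
mel = paddle.randn([100, 80])
with paddle.no_grad():
    wav = generator.inference(c=mel)  # `inference` is assumed from the vocoder interface
```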
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.

@ -0,0 +1,168 @@
# This is the configuration file for VCTK dataset.
# This configuration is based on HiFiGAN V1, which is
# an official configuration. But I found that the optimizer
# setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales
# are also modified from the original 256 shift setting.
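# Concretely, the product of upsample_scales must equal the hop size, so that one
# input mel frame is expanded to exactly one hop of waveform samples:
# 5 * 5 * 4 * 3 = 300 = n_shift.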
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [5, 5, 4, 3] # Upsampling scales.
upsample_kernel_sizes: [10, 10, 8, 6] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: True # Whether to use additional conv layer in residual blocks.
bias: True # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: True
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: True # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: True # Whether to use bias parameter in conv layer.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
use_spectral_norm: False # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: False # Whether to use multi-resolution STFT loss.
use_mel_loss: True # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 24000
fft_size: 2048
hop_size: 300
win_length: 1200
window: "hann"
num_mels: 80
fmin: 0
fmax: 12000
log_base: null
generator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
use_feat_match_loss: True
feat_match_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
average_by_layers: False # Whether to average loss by #layers in each discriminator.
include_final_outputs: False # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_max_steps: 8400 # Length of each audio clip in the batch. Make sure it is divisible by hop_size.
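# e.g. 8400 samples / 300 (hop size) = 28 mel frames per training clip.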
num_workers: 2 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random

@ -0,0 +1,55 @@
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./vctk_alignment \
--output=durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/../preprocess.py \
--rootdir=~/datasets/VCTK-Corpus-0.92/ \
--dataset=vctk \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--cut-sil=True \
--num-cpu=20
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
fi

@ -0,0 +1,14 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \
--generator-type=hifigan

@ -0,0 +1,13 @@
#!/bin/bash
config_path=$1
train_output_path=$2
FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1

@ -0,0 +1,13 @@
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=hifigan
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/gan_vocoder/${MODEL}

@ -0,0 +1,32 @@
#!/bin/bash
set -e
source path.sh
gpus=0
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_5000.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

@ -1,5 +1,9 @@
# Changelog
Date: 2022-3-15, Author: Xiaojie Chen.
- kaldi and librosa mfcc, fbank, spectrogram.
- unit test and benchmark.
Date: 2022-2-25, Author: Hui Zhang.
- Refactor architecture.
- dtw distance and mcd style dtw
- dtw distance and mcd style dtw.

@ -11,6 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import warnings
from typing import Optional
from typing import Tuple
@ -19,7 +20,6 @@ from typing import Union
import numpy as np
import resampy
import soundfile as sf
from numpy import ndarray as array
from scipy.io import wavfile
from ..utils import ParameterError
@ -38,13 +38,21 @@ RESAMPLE_MODES = ['kaiser_best', 'kaiser_fast']
EPS = 1e-8
def resample(y: array, src_sr: int, target_sr: int,
mode: str='kaiser_fast') -> array:
""" Audio resampling
This function is the same as using resampy.resample().
Notes:
The default mode is kaiser_fast. For better audio quality, use mode = 'kaiser_fast'
"""
def resample(y: np.ndarray,
src_sr: int,
target_sr: int,
mode: str='kaiser_fast') -> np.ndarray:
"""Audio resampling.
Args:
y (np.ndarray): Input waveform array in 1D or 2D.
src_sr (int): Source sample rate.
target_sr (int): Target sample rate.
mode (str, optional): The resampling filter to use. Defaults to 'kaiser_fast'.
Returns:
np.ndarray: `y` resampled to `target_sr`
"""
if mode == 'kaiser_best':
warnings.warn(
@ -53,7 +61,7 @@ def resample(y: array, src_sr: int, target_sr: int,
if not isinstance(y, np.ndarray):
raise ParameterError(
'Only support numpy array, but received y in {type(y)}')
'Only support numpy.ndarray, but received y in {type(y)}')
if mode not in RESAMPLE_MODES:
raise ParameterError(f'resample mode must in {RESAMPLE_MODES}')
@ -61,9 +69,17 @@ def resample(y: array, src_sr: int, target_sr: int,
return resampy.resample(y, src_sr, target_sr, filter=mode)
def to_mono(y: array, merge_type: str='average') -> array:
""" convert sterior audio to mono
def to_mono(y: np.ndarray, merge_type: str='average') -> np.ndarray:
"""Convert stereo audio to mono.
Args:
y (np.ndarray): Input waveform array in 1D or 2D.
merge_type (str, optional): Merge type to generate mono waveform. Defaults to 'average'.
Returns:
np.ndarray: `y` with mono channel.
"""
if merge_type not in MERGE_TYPES:
raise ParameterError(
f'Unsupported merge type {merge_type}, available types are {MERGE_TYPES}'
@ -101,18 +117,34 @@ def to_mono(y: array, merge_type: str='average') -> array:
return y_out
def _safe_cast(y: array, dtype: Union[type, str]) -> array:
""" data type casting in a safe way, i.e., prevent overflow or underflow
This function is used internally.
def _safe_cast(y: np.ndarray, dtype: Union[type, str]) -> np.ndarray:
"""Data type casting in a safe way, i.e., prevent overflow or underflow.
Args:
y (np.ndarray): Input waveform array in 1D or 2D.
dtype (Union[type, str]): Data type of waveform.
Returns:
np.ndarray: `y` after safe casting.
"""
return np.clip(y, np.iinfo(dtype).min, np.iinfo(dtype).max).astype(dtype)
if 'float' in str(y.dtype):
return np.clip(y, np.finfo(dtype).min,
np.finfo(dtype).max).astype(dtype)
else:
return np.clip(y, np.iinfo(dtype).min,
np.iinfo(dtype).max).astype(dtype)
def depth_convert(y: array, dtype: Union[type, str],
dithering: bool=True) -> array:
"""Convert audio array to target dtype safely
This function convert audio waveform to a target dtype, with addition steps of
def depth_convert(y: np.ndarray, dtype: Union[type, str]) -> np.ndarray:
"""Convert audio array to target dtype safely. This function converts an audio waveform to a target dtype, with additional steps of
preventing overflow/underflow and preserving the audio range.
Args:
y (np.ndarray): Input waveform array in 1D or 2D.
dtype (Union[type, str]): Data type of waveform.
Returns:
np.ndarray: `y` after safe casting.
"""
SUPPORT_DTYPE = ['int16', 'int8', 'float32', 'float64']
@ -157,14 +189,20 @@ def depth_convert(y: array, dtype: Union[type, str],
return y
def sound_file_load(file: str,
def sound_file_load(file: os.PathLike,
offset: Optional[float]=None,
dtype: str='int16',
duration: Optional[int]=None) -> Tuple[array, int]:
"""Load audio using soundfile library
This function load audio file using libsndfile.
Reference:
http://www.mega-nerd.com/libsndfile/#Features
duration: Optional[int]=None) -> Tuple[np.ndarray, int]:
"""Load audio using the soundfile library. This function loads an audio file using libsndfile.
Args:
file (os.PathLike): File of waveform.
offset (Optional[float], optional): Offset to the start of waveform. Defaults to None.
dtype (str, optional): Data type of waveform. Defaults to 'int16'.
duration (Optional[int], optional): Duration of waveform to read. Defaults to None.
Returns:
Tuple[np.ndarray, int]: Waveform in ndarray and its sample rate.
"""
with sf.SoundFile(file) as sf_desc:
sr_native = sf_desc.samplerate
@ -179,9 +217,17 @@ def sound_file_load(file: str,
return y, sf_desc.samplerate
def normalize(y: array, norm_type: str='linear',
mul_factor: float=1.0) -> array:
""" normalize an input audio with additional multiplier.
def normalize(y: np.ndarray, norm_type: str='linear',
mul_factor: float=1.0) -> np.ndarray:
"""Normalize an input audio with additional multiplier.
Args:
y (np.ndarray): Input waveform array in 1D or 2D.
norm_type (str, optional): Type of normalization. Defaults to 'linear'.
mul_factor (float, optional): Scaling factor. Defaults to 1.0.
Returns:
np.ndarray: `y` after normalization.
"""
if norm_type == 'linear':
@ -199,12 +245,13 @@ def normalize(y: array, norm_type: str='linear',
return y
def save(y: array, sr: int, file: str) -> None:
"""Save audio file to disk.
This function saves audio to disk using scipy.io.wavfile, with additional step
to convert input waveform to int16 unless it already is int16
Notes:
It only support raw wav format.
def save(y: np.ndarray, sr: int, file: os.PathLike) -> None:
"""Save audio file to disk. This function saves audio to disk using scipy.io.wavfile, with an additional step to convert the input waveform to int16.
Args:
y (np.ndarray): Input waveform array in 1D or 2D.
sr (int): Sample rate.
file (os.PathLike): Path of audio file to save.
"""
if not file.endswith('.wav'):
raise ParameterError(
@ -226,7 +273,7 @@ def save(y: array, sr: int, file: str) -> None:
def load(
file: str,
file: os.PathLike,
sr: Optional[int]=None,
mono: bool=True,
merge_type: str='average', # ch0,ch1,random,average
@ -236,11 +283,24 @@ def load(
offset: float=0.0,
duration: Optional[int]=None,
dtype: str='float32',
resample_mode: str='kaiser_fast') -> Tuple[array, int]:
"""Load audio file from disk.
This function loads audio from disk using using audio beackend.
Parameters:
Notes:
resample_mode: str='kaiser_fast') -> Tuple[np.ndarray, int]:
"""Load audio file from disk. This function loads audio from disk using an audio backend.
Args:
file (os.PathLike): Path of audio file to load.
sr (Optional[int], optional): Sample rate of loaded waveform. Defaults to None.
mono (bool, optional): Return waveform with mono channel. Defaults to True.
merge_type (str, optional): Merge type of multi-channels waveform. Defaults to 'average'.
normal (bool, optional): Waveform normalization. Defaults to True.
norm_type (str, optional): Type of normalization. Defaults to 'linear'.
norm_mul_factor (float, optional): Scaling factor. Defaults to 1.0.
offset (float, optional): Offset to the start of waveform. Defaults to 0.0.
duration (Optional[int], optional): Duration of waveform to read. Defaults to None.
dtype (str, optional): Data type of waveform. Defaults to 'float32'.
resample_mode (str, optional): The resampling filter to use. Defaults to 'kaiser_fast'.
Returns:
Tuple[np.ndarray, int]: Waveform in ndarray and its sample rate.
"""
y, r = sound_file_load(file, offset=offset, dtype=dtype, duration=duration)
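A usage sketch tying the backend helpers above together; the import path is an assumption (these functions live in the paddleaudio backends module, whose exact name may differ across versions), while the signatures follow the docstrings in this file.
```python
import numpy as np
# Import path is an assumption; point it at wherever load/save/resample/to_mono/
# depth_convert are exposed in your paddleaudio version.
from paddleaudio.backends.audio import load, save, resample, to_mono, depth_convert

y, sr = load("input.wav", sr=None, mono=False, dtype="float32")  # native rate, all channels
y = to_mono(y, merge_type="average")                  # stereo -> mono
y = resample(y, src_sr=sr, target_sr=16000, mode="kaiser_fast")
save(depth_convert(y, "int16"), 16000, "output.wav")  # safe cast back to 16-bit PCM wav
```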

@ -220,7 +220,7 @@ def spectrogram(waveform: Tensor,
"""Compute and return a spectrogram from a waveform. The output is identical to Kaldi's.
Args:
waveform (Tensor): A waveform tensor with shape [C, T].
waveform (Tensor): A waveform tensor with shape `(C, T)`.
blackman_coeff (float, optional): Coefficient for Blackman window. Defaults to 0.42.
channel (int, optional): Select the channel of waveform. Defaults to -1.
dither (float, optional): Dithering constant. Defaults to 0.0.
@ -239,7 +239,7 @@ def spectrogram(waveform: Tensor,
window_type (str, optional): Choose type of window for FFT computation. Defaults to POVEY.
Returns:
Tensor: A spectrogram tensor with shape (m, padded_window_size // 2 + 1) where m is the number of frames
Tensor: A spectrogram tensor with shape `(m, padded_window_size // 2 + 1)`, where m is the number of frames and
depends on frame_length and frame_shift.
"""
dtype = waveform.dtype
@ -422,7 +422,7 @@ def fbank(waveform: Tensor,
"""Compute and return filter banks from a waveform. The output is identical to Kaldi's.
Args:
waveform (Tensor): A waveform tensor with shape [C, T].
waveform (Tensor): A waveform tensor with shape `(C, T)`.
blackman_coeff (float, optional): Coefficient for Blackman window. Defaults to 0.42.
channel (int, optional): Select the channel of waveform. Defaults to -1.
dither (float, optional): Dithering constant. Defaults to 0.0.
@ -451,7 +451,7 @@ def fbank(waveform: Tensor,
window_type (str, optional): Choose type of window for FFT computation. Defaults to POVEY.
Returns:
Tensor: A filter banks tensor with shape (m, n_mels).
Tensor: A filter banks tensor with shape `(m, n_mels)`.
"""
dtype = waveform.dtype
@ -542,7 +542,7 @@ def mfcc(waveform: Tensor,
identical to Kaldi's.
Args:
waveform (Tensor): A waveform tensor with shape [C, T].
waveform (Tensor): A waveform tensor with shape `(C, T)`.
blackman_coeff (float, optional): Coefficient for Blackman window. Defaults to 0.42.
cepstral_lifter (float, optional): Scaling of output mfccs. Defaults to 22.0.
channel (int, optional): Select the channel of waveform. Defaults to -1.
@ -571,7 +571,7 @@ def mfcc(waveform: Tensor,
window_type (str, optional): Choose type of window for FFT computation. Defaults to POVEY.
Returns:
Tensor: A mel frequency cepstral coefficients tensor with shape (m, n_mfcc).
Tensor: A mel frequency cepstral coefficients tensor with shape `(m, n_mfcc)`.
"""
assert n_mfcc <= n_mels, 'n_mfcc cannot be larger than n_mels: %d vs %d' % (
n_mfcc, n_mels)

@ -19,7 +19,6 @@ from typing import Union
import numpy as np
import scipy
from numpy import ndarray as array
from numpy.lib.stride_tricks import as_strided
from scipy import signal
@ -32,7 +31,6 @@ __all__ = [
'mfcc',
'hz_to_mel',
'mel_to_hz',
'split_frames',
'mel_frequencies',
'power_to_db',
'compute_fbank_matrix',
@ -49,7 +47,8 @@ __all__ = [
]
def pad_center(data: array, size: int, axis: int=-1, **kwargs) -> array:
def _pad_center(data: np.ndarray, size: int, axis: int=-1,
**kwargs) -> np.ndarray:
"""Pad an array to a target length along a target axis.
This differs from `np.pad` by centering the data prior to padding,
@ -69,8 +68,10 @@ def pad_center(data: array, size: int, axis: int=-1, **kwargs) -> array:
return np.pad(data, lengths, **kwargs)
def split_frames(x: array, frame_length: int, hop_length: int,
axis: int=-1) -> array:
def _split_frames(x: np.ndarray,
frame_length: int,
hop_length: int,
axis: int=-1) -> np.ndarray:
"""Slice a data array into (overlapping) frames.
This function is aligned with librosa.frame
@ -142,11 +143,16 @@ def _check_audio(y, mono=True) -> bool:
return True
def hz_to_mel(frequencies: Union[float, List[float], array],
htk: bool=False) -> array:
"""Convert Hz to Mels
def hz_to_mel(frequencies: Union[float, List[float], np.ndarray],
htk: bool=False) -> np.ndarray:
"""Convert Hz to Mels.
This function is aligned with librosa.
Args:
frequencies (Union[float, List[float], np.ndarray]): Frequencies in Hz.
htk (bool, optional): Use htk scaling. Defaults to False.
Returns:
np.ndarray: Frequency in mels.
"""
freq = np.asanyarray(frequencies)
@ -177,10 +183,16 @@ def hz_to_mel(frequencies: Union[float, List[float], array],
return mels
def mel_to_hz(mels: Union[float, List[float], array], htk: int=False) -> array:
def mel_to_hz(mels: Union[float, List[float], np.ndarray],
htk: int=False) -> np.ndarray:
"""Convert mel bin numbers to frequencies.
This function is aligned with librosa.
Args:
mels (Union[float, List[float], np.ndarray]): Frequency in mels.
htk (bool, optional): Use htk scaling. Defaults to False.
Returns:
np.ndarray: Frequencies in Hz.
"""
mel_array = np.asanyarray(mels)
@ -212,10 +224,17 @@ def mel_to_hz(mels: Union[float, List[float], array], htk: int=False) -> array:
def mel_frequencies(n_mels: int=128,
fmin: float=0.0,
fmax: float=11025.0,
htk: bool=False) -> array:
"""Compute mel frequencies
htk: bool=False) -> np.ndarray:
"""Compute mel frequencies.
Args:
n_mels (int, optional): Number of mel bins. Defaults to 128.
fmin (float, optional): Minimum frequency in Hz. Defaults to 0.0.
fmax (float, optional): Maximum frequency in Hz. Defaults to 11025.0.
htk (bool, optional): Use htk scaling. Defaults to False.
This function is aligned with librosa.
Returns:
np.ndarray: Vector of n_mels frequencies in Hz with shape `(n_mels,)`.
"""
# 'Center freqs' of mel bands - uniformly spaced between limits
min_mel = hz_to_mel(fmin, htk=htk)
@ -226,10 +245,15 @@ def mel_frequencies(n_mels: int=128,
return mel_to_hz(mels, htk=htk)
def fft_frequencies(sr: int, n_fft: int) -> array:
def fft_frequencies(sr: int, n_fft: int) -> np.ndarray:
"""Compute fourier frequencies.
This function is aligned with librosa.
Args:
sr (int): Sample rate.
n_fft (int): FFT size.
Returns:
np.ndarray: FFT frequencies in Hz with shape `(n_fft//2 + 1,)`.
"""
return np.linspace(0, float(sr) / 2, int(1 + n_fft // 2), endpoint=True)
@ -241,10 +265,22 @@ def compute_fbank_matrix(sr: int,
fmax: Optional[float]=None,
htk: bool=False,
norm: str="slaney",
dtype: type=np.float32):
dtype: type=np.float32) -> np.ndarray:
"""Compute fbank matrix.
This function is aligned with librosa.
Args:
sr (int): Sample rate.
n_fft (int): FFT size.
n_mels (int, optional): Number of mel bins. Defaults to 128.
fmin (float, optional): Minimum frequency in Hz. Defaults to 0.0.
fmax (Optional[float], optional): Maximum frequency in Hz. Defaults to None.
htk (bool, optional): Use htk scaling. Defaults to False.
norm (str, optional): Type of normalization. Defaults to "slaney".
dtype (type, optional): Data type. Defaults to np.float32.
Returns:
np.ndarray: Mel transform matrix with shape `(n_mels, n_fft//2 + 1)`.
"""
if norm != "slaney":
raise ParameterError('norm must set to slaney')
@ -289,17 +325,28 @@ def compute_fbank_matrix(sr: int,
return weights
def stft(x: array,
def stft(x: np.ndarray,
n_fft: int=2048,
hop_length: Optional[int]=None,
win_length: Optional[int]=None,
window: str="hann",
center: bool=True,
dtype: type=np.complex64,
pad_mode: str="reflect") -> array:
pad_mode: str="reflect") -> np.ndarray:
"""Short-time Fourier transform (STFT).
This function is aligned with librosa.
Args:
x (np.ndarray): Input waveform in one dimension.
n_fft (int, optional): FFT size. Defaults to 2048.
hop_length (Optional[int], optional): Number of steps to advance between adjacent windows. Defaults to None.
win_length (Optional[int], optional): The size of window. Defaults to None.
window (str, optional): A string of window specification. Defaults to "hann".
center (bool, optional): Whether to pad `x` so that the :math:`t \times hop\_length`-th sample is centered in the `t`-th frame. Defaults to True.
dtype (type, optional): Data type of STFT results. Defaults to np.complex64.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to "reflect".
Returns:
np.ndarray: The complex STFT output with shape `(n_fft//2 + 1, num_frames)`.
"""
_check_audio(x)
@ -314,7 +361,7 @@ def stft(x: array,
fft_window = signal.get_window(window, win_length, fftbins=True)
# Pad the window out to n_fft size
fft_window = pad_center(fft_window, n_fft)
fft_window = _pad_center(fft_window, n_fft)
# Reshape so that the window can be broadcast
fft_window = fft_window.reshape((-1, 1))
@ -333,7 +380,7 @@ def stft(x: array,
)
# Window the time series.
x_frames = split_frames(x, frame_length=n_fft, hop_length=hop_length)
x_frames = _split_frames(x, frame_length=n_fft, hop_length=hop_length)
# Pre-allocate the STFT matrix
stft_matrix = np.empty(
(int(1 + n_fft // 2), x_frames.shape[1]), dtype=dtype, order="F")
@ -352,16 +399,20 @@ def stft(x: array,
return stft_matrix
def power_to_db(spect: array,
def power_to_db(spect: np.ndarray,
ref: float=1.0,
amin: float=1e-10,
top_db: Optional[float]=80.0) -> array:
"""Convert a power spectrogram (amplitude squared) to decibel (dB) units
top_db: Optional[float]=80.0) -> np.ndarray:
"""Convert a power spectrogram (amplitude squared) to decibel (dB) units. The function computes the scaling `10 * log10(x / ref)` in a numerically stable way.
This computes the scaling ``10 * log10(spect / ref)`` in a numerically
stable way.
Args:
spect (np.ndarray): STFT power spectrogram of an input waveform.
ref (float, optional): The reference value. If smaller than 1.0, the db level of the signal will be pulled up accordingly. Otherwise, the db level is pushed down. Defaults to 1.0.
amin (float, optional): Minimum threshold. Defaults to 1e-10.
top_db (Optional[float], optional): Threshold the output at `top_db` below the peak. Defaults to 80.0.
This function is aligned with librosa.
Returns:
np.ndarray: Power spectrogram in db scale.
"""
spect = np.asarray(spect)
@ -394,49 +445,27 @@ def power_to_db(spect: array,
return log_spec
def mfcc(x,
def mfcc(x: np.ndarray,
sr: int=16000,
spect: Optional[array]=None,
spect: Optional[np.ndarray]=None,
n_mfcc: int=20,
dct_type: int=2,
norm: str="ortho",
lifter: int=0,
**kwargs) -> array:
**kwargs) -> np.ndarray:
"""Mel-frequency cepstral coefficients (MFCCs)
This function is NOT strictly aligned with librosa. The following example shows how to get the
same result with librosa:
# mfcc:
kwargs = {
'window_size':512,
'hop_length':320,
'mel_bins':64,
'fmin':50,
'to_db':False}
a = mfcc(x,
spect=None,
n_mfcc=20,
dct_type=2,
norm='ortho',
lifter=0,
**kwargs)
# librosa mfcc:
spect = librosa.feature.melspectrogram(y=x,sr=16000,n_fft=512,
win_length=512,
hop_length=320,
n_mels=64, fmin=50)
b = librosa.feature.mfcc(y=x,
sr=16000,
S=spect,
n_mfcc=20,
dct_type=2,
norm='ortho',
lifter=0)
assert np.mean( (a-b)**2) < 1e-8
Args:
x (np.ndarray): Input waveform in one dimension.
sr (int, optional): Sample rate. Defaults to 16000.
spect (Optional[np.ndarray], optional): Input log-power Mel spectrogram. Defaults to None.
n_mfcc (int, optional): Number of cepstra in MFCC. Defaults to 20.
dct_type (int, optional): Discrete cosine transform (DCT) type. Defaults to 2.
norm (str, optional): Type of normalization. Defaults to "ortho".
lifter (int, optional): Cepstral filtering. Defaults to 0.
Returns:
np.ndarray: Mel frequency cepstral coefficients array with shape `(n_mfcc, num_frames)`.
"""
if spect is None:
spect = melspectrogram(x, sr=sr, **kwargs)
@ -454,12 +483,12 @@ def mfcc(x,
f"MFCC lifter={lifter} must be a non-negative number")
def melspectrogram(x: array,
def melspectrogram(x: np.ndarray,
sr: int=16000,
window_size: int=512,
hop_length: int=320,
n_mels: int=64,
fmin: int=50,
fmin: float=50.0,
fmax: Optional[float]=None,
window: str='hann',
center: bool=True,
@ -468,27 +497,28 @@ def melspectrogram(x: array,
to_db: bool=True,
ref: float=1.0,
amin: float=1e-10,
top_db: Optional[float]=None) -> array:
top_db: Optional[float]=None) -> np.ndarray:
"""Compute mel-spectrogram.
Parameters:
x: numpy.ndarray
The input wavform is a numpy array [shape=(n,)]
window_size: int, typically 512, 1024, 2048, etc.
The window size for framing, also used as n_fft for stft
Args:
x (np.ndarray): Input waveform in one dimension.
sr (int, optional): Sample rate. Defaults to 16000.
window_size (int, optional): Size of FFT and window length. Defaults to 512.
hop_length (int, optional): Number of steps to advance between adjacent windows. Defaults to 320.
n_mels (int, optional): Number of mel bins. Defaults to 64.
fmin (float, optional): Minimum frequency in Hz. Defaults to 50.0.
fmax (Optional[float], optional): Maximum frequency in Hz. Defaults to None.
window (str, optional): A string of window specification. Defaults to "hann".
center (bool, optional): Whether to pad `x` so that the :math:`t \times hop\_length`-th sample is centered in the `t`-th frame. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to "reflect".
power (float, optional): Exponent for the magnitude melspectrogram. Defaults to 2.0.
to_db (bool, optional): Enable db scale. Defaults to True.
ref (float, optional): The reference value. If smaller than 1.0, the db level of the signal will be pulled up accordingly. Otherwise, the db level is pushed down. Defaults to 1.0.
amin (float, optional): Minimum threshold. Defaults to 1e-10.
top_db (Optional[float], optional): Threshold the output at `top_db` below the peak. Defaults to None.
Returns:
The mel-spectrogram in power scale or db scale(default)
Notes:
1. sr is default to 16000, which is commonly used in speech/speaker processing.
2. when fmax is None, it is set to sr//2.
3. this function will convert mel spectgrum to db scale by default. This is different
that of librosa.
np.ndarray: The mel-spectrogram in power scale or db scale with shape `(n_mels, num_frames)`.
"""
_check_audio(x, mono=True)
if len(x) <= 0:
@ -518,18 +548,28 @@ def melspectrogram(x: array,
return mel_spect
def spectrogram(x: array,
def spectrogram(x: np.ndarray,
sr: int=16000,
window_size: int=512,
hop_length: int=320,
window: str='hann',
center: bool=True,
pad_mode: str='reflect',
power: float=2.0) -> array:
"""Compute spectrogram from an input waveform.
power: float=2.0) -> np.ndarray:
"""Compute spectrogram.
Args:
x (np.ndarray): Input waveform in one dimension.
sr (int, optional): Sample rate. Defaults to 16000.
window_size (int, optional): Size of FFT and window length. Defaults to 512.
hop_length (int, optional): Number of steps to advance between adjacent windows. Defaults to 320.
window (str, optional): A string of window specification. Defaults to "hann".
center (bool, optional): Whether to pad `x` so that the :math:`t \times hop\_length`-th sample is centered in the `t`-th frame. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to "reflect".
power (float, optional): Exponent for the magnitude melspectrogram. Defaults to 2.0.
This function is a wrapper around `stft`, with an additional step to
compute the magnitude of the complex spectrogram.
Returns:
np.ndarray: The STFT spectrogram in power scale `(n_fft//2 + 1, num_frames)`.
"""
s = stft(
@ -544,18 +584,16 @@ def spectrogram(x: array,
return np.abs(s)**power
def mu_encode(x: array, mu: int=255, quantized: bool=True) -> array:
"""Mu-law encoding.
Compute the mu-law decoding given an input code.
When quantized is True, the result will be converted to
integer in range [0,mu-1]. Otherwise, the resulting signal
is in range [-1,1]
def mu_encode(x: np.ndarray, mu: int=255, quantized: bool=True) -> np.ndarray:
"""Mu-law encoding. Encode waveform based on mu-law companding. When quantized is True, the result will be converted to integer in range `[0,mu-1]`. Otherwise, the resulting waveform is in range `[-1,1]`.
Reference:
https://en.wikipedia.org/wiki/%CE%9C-law_algorithm
Args:
x (np.ndarray): The input waveform to encode.
mu (int, optional): The encoding parameter. Defaults to 255.
quantized (bool, optional): If `True`, quantize the encoded values into `1 + mu` distinct integer values. Defaults to True.
Returns:
np.ndarray: The mu-law encoded waveform.
"""
mu = 255
y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
@ -564,17 +602,16 @@ def mu_encode(x: array, mu: int=255, quantized: bool=True) -> array:
return y
def mu_decode(y: array, mu: int=255, quantized: bool=True) -> array:
"""Mu-law decoding.
Compute the mu-law decoding given an input code.
def mu_decode(y: np.ndarray, mu: int=255, quantized: bool=True) -> np.ndarray:
"""Mu-law decoding. Compute the mu-law decoding given an input code. It assumes that the input `y` is in range `[0,mu-1]` when `quantized` is True and `[-1,1]` otherwise.
it assumes that the input y is in
range [0,mu-1] when quantize is True and [-1,1] otherwise
Reference:
https://en.wikipedia.org/wiki/%CE%9C-law_algorithm
Args:
y (np.ndarray): The encoded waveform.
mu (int, optional): The encoding parameter. Defaults to 255.
quantized (bool, optional): If `True`, the input is assumed to be quantized to `1 + mu` distinct integer values. Defaults to True.
Returns:
np.ndarray: The mu-law decoded waveform.
"""
if mu < 1:
raise ParameterError('mu is typically set as 2**k-1, k=1, 2, 3,...')
@ -586,7 +623,7 @@ def mu_decode(y: array, mu: int=255, quantized: bool=True) -> array:
return x
def randint(high: int) -> int:
def _randint(high: int) -> int:
"""Generate one random integer in range [0, high)
This is a helper function for random data augmentation.
@ -594,20 +631,18 @@ def randint(high: int) -> int:
return int(np.random.randint(0, high=high))
def rand() -> float:
"""Generate one floating-point number in range [0 1)
This is a helper function for random data augmentaiton
"""
return float(np.random.rand(1))
def depth_augment(y: array,
def depth_augment(y: np.ndarray,
choices: List=['int8', 'int16'],
probs: List[float]=[0.5, 0.5]) -> array:
""" Audio depth augmentation
probs: List[float]=[0.5, 0.5]) -> np.ndarray:
""" Audio depth augmentation. Do depth conversion on the input audio to simulate the distortion brought by quantization.
Args:
y (np.ndarray): Input waveform array in 1D or 2D.
choices (List, optional): A list of data type to depth conversion. Defaults to ['int8', 'int16'].
probs (List[float], optional): Probabilities to depth conversion. Defaults to [0.5, 0.5].
Do audio depth augmentation to simulate the distortion brought by quantization.
Returns:
np.ndarray: The augmented waveform.
"""
assert len(probs) == len(
choices
@ -621,13 +656,18 @@ def depth_augment(y: array,
return y2
def adaptive_spect_augment(spect: array, tempo_axis: int=0,
level: float=0.1) -> array:
"""Do adpative spectrogram augmentation
def adaptive_spect_augment(spect: np.ndarray,
tempo_axis: int=0,
level: float=0.1) -> np.ndarray:
"""Do adaptive spectrogram augmentation. The level of the augmentation is governed by the parameter `level`, ranging from 0 to 1, with 0 representing no augmentation.
The level of the augmentation is gowern by the paramter level,
ranging from 0 to 1, with 0 represents no augmentation
Args:
spect (np.ndarray): Input spectrogram.
tempo_axis (int, optional): Indicate the tempo axis. Defaults to 0.
level (float, optional): The level factor of masking. Defaults to 0.1.
Returns:
np.ndarray: The augmented spectrogram.
"""
assert spect.ndim == 2., 'only supports 2d tensor or numpy array'
if tempo_axis == 0:
@ -643,32 +683,40 @@ def adaptive_spect_augment(spect: array, tempo_axis: int=0,
if tempo_axis == 0:
for _ in range(num_time_mask):
start = randint(nt - time_mask_width)
start = _randint(nt - time_mask_width)
spect[start:start + time_mask_width, :] = 0
for _ in range(num_freq_mask):
start = randint(nf - freq_mask_width)
start = _randint(nf - freq_mask_width)
spect[:, start:start + freq_mask_width] = 0
else:
for _ in range(num_time_mask):
start = randint(nt - time_mask_width)
start = _randint(nt - time_mask_width)
spect[:, start:start + time_mask_width] = 0
for _ in range(num_freq_mask):
start = randint(nf - freq_mask_width)
start = _randint(nf - freq_mask_width)
spect[start:start + freq_mask_width, :] = 0
return spect
def spect_augment(spect: array,
def spect_augment(spect: np.ndarray,
tempo_axis: int=0,
max_time_mask: int=3,
max_freq_mask: int=3,
max_time_mask_width: int=30,
max_freq_mask_width: int=20) -> array:
"""Do spectrogram augmentation in both time and freq axis
max_freq_mask_width: int=20) -> np.ndarray:
"""Do spectrogram augmentation in both time and freq axes.
Reference:
Args:
spect (np.ndarray): Input spectrogram.
tempo_axis (int, optional): Indicate the tempo axis. Defaults to 0.
max_time_mask (int, optional): Maximum number of time masks. Defaults to 3.
max_freq_mask (int, optional): Maximum number of frequency masks. Defaults to 3.
max_time_mask_width (int, optional): Maximum width of time masking. Defaults to 30.
max_freq_mask_width (int, optional): Maximum width of frequency masking. Defaults to 20.
Returns:
np.ndarray: The augmented spectrogram.
"""
assert spect.ndim == 2., 'only supports 2d tensor or numpy array'
if tempo_axis == 0:
@ -676,52 +724,64 @@ def spect_augment(spect: array,
else:
nf, nt = spect.shape
num_time_mask = randint(max_time_mask)
num_freq_mask = randint(max_freq_mask)
num_time_mask = _randint(max_time_mask)
num_freq_mask = _randint(max_freq_mask)
time_mask_width = randint(max_time_mask_width)
freq_mask_width = randint(max_freq_mask_width)
time_mask_width = _randint(max_time_mask_width)
freq_mask_width = _randint(max_freq_mask_width)
if tempo_axis == 0:
for _ in range(num_time_mask):
start = randint(nt - time_mask_width)
start = _randint(nt - time_mask_width)
spect[start:start + time_mask_width, :] = 0
for _ in range(num_freq_mask):
start = randint(nf - freq_mask_width)
start = _randint(nf - freq_mask_width)
spect[:, start:start + freq_mask_width] = 0
else:
for _ in range(num_time_mask):
start = randint(nt - time_mask_width)
start = _randint(nt - time_mask_width)
spect[:, start:start + time_mask_width] = 0
for _ in range(num_freq_mask):
start = randint(nf - freq_mask_width)
start = _randint(nf - freq_mask_width)
spect[start:start + freq_mask_width, :] = 0
return spect
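The masking done by `adaptive_spect_augment` and `spect_augment` reduces to zeroing random stripes along the time and frequency axes. A minimal standalone sketch of that idea (numpy only; `toy_spect_augment` is an illustrative name, not part of this diff):
```python
import numpy as np


def toy_spect_augment(spect: np.ndarray,
                      max_time_mask: int=3,
                      max_freq_mask: int=3,
                      max_time_mask_width: int=30,
                      max_freq_mask_width: int=20,
                      seed: int=0) -> np.ndarray:
    """Zero out a few random time and frequency stripes (tempo_axis=0)."""
    nt, nf = spect.shape
    rng = np.random.default_rng(seed)
    for _ in range(int(rng.integers(1, max_time_mask + 1))):
        width = int(rng.integers(1, max_time_mask_width + 1))
        start = int(rng.integers(0, nt - width))
        spect[start:start + width, :] = 0
    for _ in range(int(rng.integers(1, max_freq_mask + 1))):
        width = int(rng.integers(1, max_freq_mask_width + 1))
        start = int(rng.integers(0, nf - width))
        spect[:, start:start + width] = 0
    return spect


print(toy_spect_augment(np.random.rand(200, 64)).shape)  # (200, 64)
```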
def random_crop1d(y: array, crop_len: int) -> array:
""" Do random cropping on 1d input signal
def random_crop1d(y: np.ndarray, crop_len: int) -> np.ndarray:
""" Random cropping on a input waveform.
The input is a 1d signal, typically a sound waveform
Args:
y (np.ndarray): Input waveform array in 1D.
crop_len (int): Length of waveform to crop.
Returns:
np.ndarray: The cropped waveform.
"""
if y.ndim != 1:
raise ParameterError('only accept 1d tensor or numpy array')
n = len(y)
idx = randint(n - crop_len)
idx = _randint(n - crop_len)
return y[idx:idx + crop_len]
def random_crop2d(s: array, crop_len: int, tempo_axis: int=0) -> array:
""" Do random cropping for 2D array, typically a spectrogram.
def random_crop2d(s: np.ndarray, crop_len: int,
tempo_axis: int=0) -> np.ndarray:
""" Random cropping on a spectrogram.
The cropping is done in temporal direction on the time-freq input signal.
Args:
s (np.ndarray): Input spectrogram in 2D.
crop_len (int): Length of spectrogram to crop.
tempo_axis (int, optional): Indicate the tempo axis. Defaults to 0.
Returns:
np.ndarray: The cropped spectrogram.
"""
if tempo_axis >= s.ndim:
raise ParameterError('axis out of range')
n = s.shape[tempo_axis]
idx = randint(high=n - crop_len)
idx = _randint(high=n - crop_len)
sli = [slice(None) for i in range(s.ndim)]
sli[tempo_axis] = slice(idx, idx + crop_len)
out = s[tuple(sli)]

@ -17,6 +17,7 @@ from typing import Union
import paddle
import paddle.nn as nn
from paddle import Tensor
from ..functional import compute_fbank_matrix
from ..functional import create_dct
@ -32,42 +33,34 @@ __all__ = [
class Spectrogram(nn.Layer):
"""Compute spectrogram of given signals, typically audio waveforms.
The spectrogram is defined as the complex norm of the short-time Fourier transform.
Args:
n_fft (int, optional): The number of frequency components of the discrete Fourier transform. Defaults to 512.
hop_length (Optional[int], optional): The hop length of the short time FFT. If `None`, it is set to `win_length//4`. Defaults to None.
win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
center (bool, optional): Whether to pad `x` so that the `t`-th frame is centered at `x[t * hop_length]`. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
dtype (str, optional): Data type of input and window. Defaults to 'float32'.
"""
def __init__(self,
n_fft: int=512,
hop_length: Optional[int]=None,
win_length: Optional[int]=None,
window: str='hann',
power: float=2.0,
center: bool=True,
pad_mode: str='reflect',
dtype: str=paddle.float32):
"""Compute spectrogram of a given signal, typically an audio waveform.
The spectorgram is defined as the complex norm of the short-time
Fourier transformation.
Parameters:
n_fft (int): the number of frequency components of the discrete Fourier transform.
The default value is 2048,
hop_length (int|None): the hop length of the short time FFT. If None, it is set to win_length//4.
The default value is None.
win_length: the window length of the short time FFt. If None, it is set to same as n_fft.
The default value is None.
window (str): the name of the window function applied to the single before the Fourier transform.
The folllowing window names are supported: 'hamming','hann','kaiser','gaussian',
'exponential','triang','bohman','blackman','cosine','tukey','taylor'.
The default value is 'hann'
center (bool): if True, the signal is padded so that frame t is centered at x[t * hop_length].
If False, frame t begins at x[t * hop_length]
The default value is True
pad_mode (str): the mode to pad the signal if necessary. The supported modes are 'reflect'
and 'constant'. The default value is 'reflect'.
dtype (str): the data type of input and window.
Notes:
The Spectrogram transform relies on STFT transform to compute the spectrogram.
By default, the weights are not learnable. To fine-tune the Fourier coefficients,
set stop_gradient=False before training.
For more information, see STFT().
"""
dtype: str='float32') -> None:
super(Spectrogram, self).__init__()
assert power > 0, 'Power of spectrogram must be > 0.'
self.power = power
if win_length is None:
win_length = n_fft
@ -83,19 +76,46 @@ class Spectrogram(nn.Layer):
pad_mode=pad_mode)
self.register_buffer('fft_window', self.fft_window)
def forward(self, x):
def forward(self, x: Tensor) -> Tensor:
"""
Args:
x (Tensor): Tensor of waveforms with shape `(N, T)`
Returns:
Tensor: Spectrograms with shape `(N, n_fft//2 + 1, num_frames)`.
"""
stft = self._stft(x)
spectrogram = paddle.square(paddle.abs(stft))
spectrogram = paddle.pow(paddle.abs(stft), self.power)
return spectrogram
class MelSpectrogram(nn.Layer):
"""Compute the melspectrogram of given signals, typically audio waveforms. It is computed by multiplying spectrogram with Mel filter bank matrix.
Args:
sr (int, optional): Sample rate. Defaults to 22050.
n_fft (int, optional): The number of frequency components of the discrete Fourier transform. Defaults to 512.
hop_length (Optional[int], optional): The hop length of the short time FFT. If `None`, it is set to `win_length//4`. Defaults to None.
win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
center (bool, optional): Whether to pad `x` so that the `t`-th frame is centered at `x[t * hop_length]`. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
n_mels (int, optional): Number of mel bins. Defaults to 64.
f_min (float, optional): Minimum frequency in Hz. Defaults to 50.0.
f_max (Optional[float], optional): Maximum frequency in Hz. Defaults to None.
htk (bool, optional): Use HTK formula in computing fbank matrix. Defaults to False.
norm (Union[str, float], optional): Type of normalization in computing fbank matrix. Slaney-style is used by default. You can specify norm=1.0/2.0 to use customized p-norm normalization. Defaults to 'slaney'.
dtype (str, optional): Data type of input and window. Defaults to 'float32'.
"""
def __init__(self,
sr: int=22050,
n_fft: int=512,
hop_length: Optional[int]=None,
win_length: Optional[int]=None,
window: str='hann',
power: float=2.0,
center: bool=True,
pad_mode: str='reflect',
n_mels: int=64,
@ -103,38 +123,7 @@ class MelSpectrogram(nn.Layer):
f_max: Optional[float]=None,
htk: bool=False,
norm: Union[str, float]='slaney',
dtype: str=paddle.float32):
"""Compute the melspectrogram of a given signal, typically an audio waveform.
The melspectrogram is also known as filterbank or fbank feature in audio community.
It is computed by multiplying spectrogram with Mel filter bank matrix.
Parameters:
sr(int): the audio sample rate.
The default value is 22050.
n_fft(int): the number of frequency components of the discrete Fourier transform.
The default value is 2048,
hop_length(int|None): the hop length of the short time FFT. If None, it is set to win_length//4.
The default value is None.
win_length: the window length of the short time FFt. If None, it is set to same as n_fft.
The default value is None.
window(str): the name of the window function applied to the single before the Fourier transform.
The folllowing window names are supported: 'hamming','hann','kaiser','gaussian',
'exponential','triang','bohman','blackman','cosine','tukey','taylor'.
The default value is 'hann'
center(bool): if True, the signal is padded so that frame t is centered at x[t * hop_length].
If False, frame t begins at x[t * hop_length]
The default value is True
pad_mode(str): the mode to pad the signal if necessary. The supported modes are 'reflect'
and 'constant'.
The default value is 'reflect'.
n_mels(int): the mel bins.
f_min(float): the lower cut-off frequency, below which the filter response is zero.
f_max(float): the upper cut-off frequency, above which the filter response is zeros.
htk(bool): whether to use HTK formula in computing fbank matrix.
norm(str|float): the normalization type in computing fbank matrix. Slaney-style is used by default.
You can specify norm=1.0/2.0 to use customized p-norm normalization.
dtype(str): the datatype of fbank matrix used in the transform. Use float64 to increase numerical
accuracy. Note that the final transform will be conducted in float32 regardless of dtype of fbank matrix.
"""
dtype: str='float32') -> None:
super(MelSpectrogram, self).__init__()
self._spectrogram = Spectrogram(
@ -142,6 +131,7 @@ class MelSpectrogram(nn.Layer):
hop_length=hop_length,
win_length=win_length,
window=window,
power=power,
center=center,
pad_mode=pad_mode,
dtype=dtype)
@ -163,19 +153,49 @@ class MelSpectrogram(nn.Layer):
dtype=dtype) # float64 for better numerical results
self.register_buffer('fbank_matrix', self.fbank_matrix)
def forward(self, x):
def forward(self, x: Tensor) -> Tensor:
"""
Args:
x (Tensor): Tensor of waveforms with shape `(N, T)`
Returns:
Tensor: Mel spectrograms with shape `(N, n_mels, num_frames)`.
"""
spect_feature = self._spectrogram(x)
mel_feature = paddle.matmul(self.fbank_matrix, spect_feature)
return mel_feature
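A minimal usage sketch of the transform, mirroring how the benchmark scripts added later in this diff construct it (parameter values here are illustrative):
```python
import paddle
import paddleaudio

# A synthetic 1-second mono waveform at 16 kHz, shaped (N, T) as forward() expects.
waveform = paddle.randn([1, 16000])

mel_extractor = paddleaudio.features.MelSpectrogram(
    sr=16000, n_fft=512, hop_length=128, n_mels=40, f_min=0.0)

mel = mel_extractor(waveform)
print(mel.shape)  # [1, 40, num_frames]
```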
class LogMelSpectrogram(nn.Layer):
"""Compute log-mel-spectrogram feature of given signals, typically audio waveforms.
Args:
sr (int, optional): Sample rate. Defaults to 22050.
n_fft (int, optional): The number of frequency components of the discrete Fourier transform. Defaults to 512.
hop_length (Optional[int], optional): The hop length of the short time FFT. If `None`, it is set to `win_length//4`. Defaults to None.
win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
center (bool, optional): Whether to pad `x` so that the `t`-th frame is centered at `x[t * hop_length]`. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
n_mels (int, optional): Number of mel bins. Defaults to 64.
f_min (float, optional): Minimum frequency in Hz. Defaults to 50.0.
f_max (Optional[float], optional): Maximum frequency in Hz. Defaults to None.
htk (bool, optional): Use HTK formula in computing fbank matrix. Defaults to False.
norm (Union[str, float], optional): Type of normalization in computing fbank matrix. Slaney-style is used by default. You can specify norm=1.0/2.0 to use customized p-norm normalization. Defaults to 'slaney'.
ref_value (float, optional): The reference value. If smaller than 1.0, the db level of the signal will be pulled up accordingly. Otherwise, the db level is pushed down. Defaults to 1.0.
amin (float, optional): The minimum value of input magnitude. Defaults to 1e-10.
top_db (Optional[float], optional): The maximum db value of spectrogram. Defaults to None.
dtype (str, optional): Data type of input and window. Defaults to 'float32'.
"""
def __init__(self,
sr: int=22050,
n_fft: int=512,
hop_length: Optional[int]=None,
win_length: Optional[int]=None,
window: str='hann',
power: float=2.0,
center: bool=True,
pad_mode: str='reflect',
n_mels: int=64,
@ -186,44 +206,7 @@ class LogMelSpectrogram(nn.Layer):
ref_value: float=1.0,
amin: float=1e-10,
top_db: Optional[float]=None,
dtype: str=paddle.float32):
"""Compute log-mel-spectrogram(also known as LogFBank) feature of a given signal,
typically an audio waveform.
Parameters:
sr (int): the audio sample rate.
The default value is 22050.
n_fft (int): the number of frequency components of the discrete Fourier transform.
The default value is 2048,
hop_length (int|None): the hop length of the short time FFT. If None, it is set to win_length//4.
The default value is None.
win_length: the window length of the short time FFt. If None, it is set to same as n_fft.
The default value is None.
window (str): the name of the window function applied to the single before the Fourier transform.
The folllowing window names are supported: 'hamming','hann','kaiser','gaussian',
'exponential','triang','bohman','blackman','cosine','tukey','taylor'.
The default value is 'hann'
center (bool): if True, the signal is padded so that frame t is centered at x[t * hop_length].
If False, frame t begins at x[t * hop_length]
The default value is True
pad_mode (str): the mode to pad the signal if necessary. The supported modes are 'reflect'
and 'constant'.
The default value is 'reflect'.
n_mels (int): the mel bins.
f_min (float): the lower cut-off frequency, below which the filter response is zero.
f_max (float): the upper cut-off frequency, above which the filter response is zeros.
htk (bool): whether to use HTK formula in computing fbank matrix.
norm (str|float): the normalization type in computing fbank matrix. Slaney-style is used by default.
You can specify norm=1.0/2.0 to use customized p-norm normalization.
ref_value (float): the reference value. If smaller than 1.0, the db level
amin (float): the minimum value of input magnitude, below which the input of the signal will be pulled up accordingly.
Otherwise, the db level is pushed down.
magnitude is clipped(to amin). For numerical stability, set amin to a larger value,
e.g., 1e-3.
top_db (float): the maximum db value of resulting spectrum, above which the
spectrum is clipped(to top_db).
dtype (str): the datatype of fbank matrix used in the transform. Use float64 to increase numerical
accuracy. Note that the final transform will be conducted in float32 regardless of dtype of fbank matrix.
"""
dtype: str='float32') -> None:
super(LogMelSpectrogram, self).__init__()
self._melspectrogram = MelSpectrogram(
@ -232,6 +215,7 @@ class LogMelSpectrogram(nn.Layer):
hop_length=hop_length,
win_length=win_length,
window=window,
power=power,
center=center,
pad_mode=pad_mode,
n_mels=n_mels,
@ -245,8 +229,14 @@ class LogMelSpectrogram(nn.Layer):
self.amin = amin
self.top_db = top_db
def forward(self, x):
# import ipdb; ipdb.set_trace()
def forward(self, x: Tensor) -> Tensor:
"""
Args:
x (Tensor): Tensor of waveforms with shape `(N, T)`
Returns:
Tensor: Log mel spectrograms with shape `(N, n_mels, num_frames)`.
"""
mel_feature = self._melspectrogram(x)
log_mel_feature = power_to_db(
mel_feature,
@ -257,6 +247,29 @@ class LogMelSpectrogram(nn.Layer):
class MFCC(nn.Layer):
"""Compute mel frequency cepstral coefficients(MFCCs) feature of given waveforms.
Args:
sr (int, optional): Sample rate. Defaults to 22050.
n_mfcc (int, optional): Number of cepstra in MFCC. Defaults to 40.
n_fft (int, optional): The number of frequency components of the discrete Fourier transform. Defaults to 512.
hop_length (Optional[int], optional): The hop length of the short time FFT. If `None`, it is set to `win_length//4`. Defaults to None.
win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
center (bool, optional): Whether to pad `x` so that the `t`-th frame is centered at `x[t * hop_length]`. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
n_mels (int, optional): Number of mel bins. Defaults to 64.
f_min (float, optional): Minimum frequency in Hz. Defaults to 50.0.
f_max (Optional[float], optional): Maximum frequency in Hz. Defaults to None.
htk (bool, optional): Use HTK formula in computing fbank matrix. Defaults to False.
norm (Union[str, float], optional): Type of normalization in computing fbank matrix. Slaney-style is used by default. You can specify norm=1.0/2.0 to use customized p-norm normalization. Defaults to 'slaney'.
ref_value (float, optional): The reference value. If smaller than 1.0, the db level of the signal will be pulled up accordingly. Otherwise, the db level is pushed down. Defaults to 1.0.
amin (float, optional): The minimum value of input magnitude. Defaults to 1e-10.
top_db (Optional[float], optional): The maximum db value of spectrogram. Defaults to None.
dtype (str, optional): Data type of input and window. Defaults to 'float32'.
"""
def __init__(self,
sr: int=22050,
n_mfcc: int=40,
@ -264,6 +277,7 @@ class MFCC(nn.Layer):
hop_length: Optional[int]=None,
win_length: Optional[int]=None,
window: str='hann',
power: float=2.0,
center: bool=True,
pad_mode: str='reflect',
n_mels: int=64,
@ -274,45 +288,7 @@ class MFCC(nn.Layer):
ref_value: float=1.0,
amin: float=1e-10,
top_db: Optional[float]=None,
dtype: str=paddle.float32):
"""Compute mel frequency cepstral coefficients(MFCCs) feature of given waveforms.
Parameters:
sr(int): the audio sample rate.
The default value is 22050.
n_mfcc (int, optional): Number of cepstra in MFCC. Defaults to 40.
n_fft (int): the number of frequency components of the discrete Fourier transform.
The default value is 2048,
hop_length (int|None): the hop length of the short time FFT. If None, it is set to win_length//4.
The default value is None.
win_length: the window length of the short time FFt. If None, it is set to same as n_fft.
The default value is None.
window (str): the name of the window function applied to the single before the Fourier transform.
The folllowing window names are supported: 'hamming','hann','kaiser','gaussian',
'exponential','triang','bohman','blackman','cosine','tukey','taylor'.
The default value is 'hann'
center (bool): if True, the signal is padded so that frame t is centered at x[t * hop_length].
If False, frame t begins at x[t * hop_length]
The default value is True
pad_mode (str): the mode to pad the signal if necessary. The supported modes are 'reflect'
and 'constant'.
The default value is 'reflect'.
n_mels (int): the mel bins.
f_min (float): the lower cut-off frequency, below which the filter response is zero.
f_max (float): the upper cut-off frequency, above which the filter response is zeros.
htk (bool): whether to use HTK formula in computing fbank matrix.
norm (str|float): the normalization type in computing fbank matrix. Slaney-style is used by default.
You can specify norm=1.0/2.0 to use customized p-norm normalization.
ref_value (float): the reference value. If smaller than 1.0, the db level
amin (float): the minimum value of input magnitude, below which the input of the signal will be pulled up accordingly.
Otherwise, the db level is pushed down.
magnitude is clipped(to amin). For numerical stability, set amin to a larger value,
e.g., 1e-3.
top_db (float): the maximum db value of resulting spectrum, above which the
spectrum is clipped(to top_db).
dtype (str): the datatype of fbank matrix used in the transform. Use float64 to increase numerical
accuracy. Note that the final transform will be conducted in float32 regardless of dtype of fbank matrix.
"""
dtype: str='float32') -> None:
super(MFCC, self).__init__()
assert n_mfcc <= n_mels, 'n_mfcc cannot be larger than n_mels: %d vs %d' % (
n_mfcc, n_mels)
@ -322,6 +298,7 @@ class MFCC(nn.Layer):
hop_length=hop_length,
win_length=win_length,
window=window,
power=power,
center=center,
pad_mode=pad_mode,
n_mels=n_mels,
@ -336,7 +313,14 @@ class MFCC(nn.Layer):
self.dct_matrix = create_dct(n_mfcc=n_mfcc, n_mels=n_mels, dtype=dtype)
self.register_buffer('dct_matrix', self.dct_matrix)
def forward(self, x):
def forward(self, x: Tensor) -> Tensor:
"""
Args:
x (Tensor): Tensor of waveforms with shape `(N, T)`
Returns:
Tensor: Mel frequency cepstral coefficients with shape `(N, n_mfcc, num_frames)`.
"""
log_mel_feature = self._log_melspectrogram(x)
mfcc = paddle.matmul(
log_mel_feature.transpose((0, 2, 1)), self.dct_matrix).transpose(

@ -17,6 +17,7 @@ from typing import Optional
from typing import Union
import paddle
from paddle import Tensor
__all__ = [
'hz_to_mel',
@ -29,19 +30,20 @@ __all__ = [
]
def hz_to_mel(freq: Union[paddle.Tensor, float],
htk: bool=False) -> Union[paddle.Tensor, float]:
def hz_to_mel(freq: Union[Tensor, float],
htk: bool=False) -> Union[Tensor, float]:
"""Convert Hz to Mels.
Parameters:
freq: the input tensor of arbitrary shape, or a single floating point number.
htk: use HTK formula to do the conversion.
The default value is False.
Args:
freq (Union[Tensor, float]): The input frequency in Hz, as a tensor of arbitrary shape or a single float.
htk (bool, optional): Use htk scaling. Defaults to False.
Returns:
The frequencies represented in Mel-scale.
Union[Tensor, float]: Frequency in mels.
"""
if htk:
if isinstance(freq, paddle.Tensor):
if isinstance(freq, Tensor):
return 2595.0 * paddle.log10(1.0 + freq / 700.0)
else:
return 2595.0 * math.log10(1.0 + freq / 700.0)
@ -58,7 +60,7 @@ def hz_to_mel(freq: Union[paddle.Tensor, float],
min_log_mel = (min_log_hz - f_min) / f_sp # same (Mels)
logstep = math.log(6.4) / 27.0 # step size for log region
if isinstance(freq, paddle.Tensor):
if isinstance(freq, Tensor):
target = min_log_mel + paddle.log(
freq / min_log_hz + 1e-10) / logstep # prevent nan with 1e-10
mask = (freq > min_log_hz).astype(freq.dtype)
@ -71,14 +73,16 @@ def hz_to_mel(freq: Union[paddle.Tensor, float],
return mels
def mel_to_hz(mel: Union[float, paddle.Tensor],
htk: bool=False) -> Union[float, paddle.Tensor]:
def mel_to_hz(mel: Union[float, Tensor],
htk: bool=False) -> Union[float, Tensor]:
"""Convert mel bin numbers to frequencies.
Parameters:
mel: the mel frequency represented as a tensor of arbitrary shape, or a floating point number.
htk: use HTK formula to do the conversion.
Args:
mel (Union[float, Tensor]): The mel frequency, as a tensor of arbitrary shape or a single float.
htk (bool, optional): Use htk scaling. Defaults to False.
Returns:
The frequencies represented in hz.
Union[float, Tensor]: Frequencies in Hz.
"""
if htk:
return 700.0 * (10.0**(mel / 2595.0) - 1.0)
@ -90,7 +94,7 @@ def mel_to_hz(mel: Union[float, paddle.Tensor],
min_log_hz = 1000.0 # beginning of log region (Hz)
min_log_mel = (min_log_hz - f_min) / f_sp # same (Mels)
logstep = math.log(6.4) / 27.0 # step size for log region
if isinstance(mel, paddle.Tensor):
if isinstance(mel, Tensor):
target = min_log_hz * paddle.exp(logstep * (mel - min_log_mel))
mask = (mel > min_log_mel).astype(mel.dtype)
freqs = target * mask + freqs * (
@ -106,16 +110,18 @@ def mel_frequencies(n_mels: int=64,
f_min: float=0.0,
f_max: float=11025.0,
htk: bool=False,
dtype: str=paddle.float32):
dtype: str='float32') -> Tensor:
"""Compute mel frequencies.
Parameters:
n_mels(int): number of Mel bins.
f_min(float): the lower cut-off frequency, below which the filter response is zero.
f_max(float): the upper cut-off frequency, above which the filter response is zero.
htk(bool): whether to use htk formula.
dtype(str): the datatype of the return frequencies.
Args:
n_mels (int, optional): Number of mel bins. Defaults to 64.
f_min (float, optional): Minimum frequency in Hz. Defaults to 0.0.
f_max (float, optional): Maximum frequency in Hz. Defaults to 11025.0.
htk (bool, optional): Use htk scaling. Defaults to False.
dtype (str, optional): The data type of the return frequencies. Defaults to 'float32'.
Returns:
The frequencies represented in Mel-scale
Tensor: Tensor of n_mels frequencies in Hz with shape `(n_mels,)`.
"""
# 'Center freqs' of mel bands - uniformly spaced between limits
min_mel = hz_to_mel(f_min, htk=htk)
@ -125,14 +131,16 @@ def mel_frequencies(n_mels: int=64,
return freqs
def fft_frequencies(sr: int, n_fft: int, dtype: str=paddle.float32):
def fft_frequencies(sr: int, n_fft: int, dtype: str='float32') -> Tensor:
"""Compute fourier frequencies.
Parameters:
sr(int): the audio sample rate.
n_fft(float): the number of fft bins.
dtype(str): the datatype of the return frequencies.
Args:
sr (int): Sample rate.
n_fft (int): Number of fft bins.
dtype (str, optional): The data type of the return frequencies. Defaults to 'float32'.
Returns:
The frequencies represented in hz.
Tensor: FFT frequencies in Hz with shape `(n_fft//2 + 1,)`.
"""
return paddle.linspace(0, float(sr) / 2, int(1 + n_fft // 2), dtype=dtype)
@ -144,23 +152,21 @@ def compute_fbank_matrix(sr: int,
f_max: Optional[float]=None,
htk: bool=False,
norm: Union[str, float]='slaney',
dtype: str=paddle.float32):
dtype: str='float32') -> Tensor:
"""Compute fbank matrix.
Parameters:
sr(int): the audio sample rate.
n_fft(int): the number of fft bins.
n_mels(int): the number of Mel bins.
f_min(float): the lower cut-off frequency, below which the filter response is zero.
f_max(float): the upper cut-off frequency, above which the filter response is zero.
htk: whether to use htk formula.
return_complex(bool): whether to return complex matrix. If True, the matrix will
be complex type. Otherwise, the real and image part will be stored in the last
axis of returned tensor.
dtype(str): the datatype of the returned fbank matrix.
Args:
sr (int): Sample rate.
n_fft (int): Number of fft bins.
n_mels (int, optional): Number of mel bins. Defaults to 64.
f_min (float, optional): Minimum frequency in Hz. Defaults to 0.0.
f_max (Optional[float], optional): Maximum frequency in Hz. Defaults to None.
htk (bool, optional): Use htk scaling. Defaults to False.
norm (Union[str, float], optional): Type of normalization. Defaults to 'slaney'.
dtype (str, optional): The data type of the return matrix. Defaults to 'float32'.
Returns:
The fbank matrix of shape (n_mels, int(1+n_fft//2)).
Shape:
output: (n_mels, int(1+n_fft//2))
Tensor: Mel transform matrix with shape `(n_mels, n_fft//2 + 1)`.
"""
if f_max is None:
@ -199,27 +205,20 @@ def compute_fbank_matrix(sr: int,
return weights
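A small usage sketch of the filter bank helper; the absolute import path is an assumption inferred from the relative `from ..functional import compute_fbank_matrix` statement earlier in this diff:
```python
# NOTE: import path assumed from the relative imports shown in this diff.
from paddleaudio.functional import compute_fbank_matrix

# Mel filter bank with shape (n_mels, n_fft//2 + 1); Slaney normalization by default.
fbank = compute_fbank_matrix(sr=16000, n_fft=512, n_mels=40, f_min=0.0, f_max=8000.0)
print(fbank.shape)  # [40, 257]

# HTK-style mel scale instead of the Slaney formula.
fbank_htk = compute_fbank_matrix(sr=16000, n_fft=512, n_mels=40, htk=True)
print(fbank_htk.shape)  # [40, 257]
```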
def power_to_db(magnitude: paddle.Tensor,
def power_to_db(spect: Tensor,
ref_value: float=1.0,
amin: float=1e-10,
top_db: Optional[float]=None) -> paddle.Tensor:
"""Convert a power spectrogram (amplitude squared) to decibel (dB) units.
The function computes the scaling ``10 * log10(x / ref)`` in a numerically
stable way.
Parameters:
magnitude(Tensor): the input magnitude tensor of any shape.
ref_value(float): the reference value. If smaller than 1.0, the db level
of the signal will be pulled up accordingly. Otherwise, the db level
is pushed down.
amin(float): the minimum value of input magnitude, below which the input
magnitude is clipped(to amin).
top_db(float): the maximum db value of resulting spectrum, above which the
spectrum is clipped(to top_db).
top_db: Optional[float]=None) -> Tensor:
"""Convert a power spectrogram (amplitude squared) to decibel (dB) units. The function computes the scaling `10 * log10(x / ref)` in a numerically stable way.
Args:
spect (Tensor): STFT power spectrogram.
ref_value (float, optional): The reference value. If smaller than 1.0, the db level of the signal will be pulled up accordingly. Otherwise, the db level is pushed down. Defaults to 1.0.
amin (float, optional): The minimum value of the input magnitude, below which it is clipped (to amin). Defaults to 1e-10.
top_db (Optional[float], optional): Threshold the output at `top_db` below the peak. Defaults to None.
Returns:
The spectrogram in log-scale.
shape:
input: any shape
output: same as input
Tensor: Power spectrogram in db scale.
"""
if amin <= 0:
raise Exception("amin must be strictly positive")
@ -227,8 +226,8 @@ def power_to_db(magnitude: paddle.Tensor,
if ref_value <= 0:
raise Exception("ref_value must be strictly positive")
ones = paddle.ones_like(magnitude)
log_spec = 10.0 * paddle.log10(paddle.maximum(ones * amin, magnitude))
ones = paddle.ones_like(spect)
log_spec = 10.0 * paddle.log10(paddle.maximum(ones * amin, spect))
log_spec -= 10.0 * math.log10(max(ref_value, amin))
if top_db is not None:
@ -242,15 +241,17 @@ def power_to_db(magnitude: paddle.Tensor,
def create_dct(n_mfcc: int,
n_mels: int,
norm: Optional[str]='ortho',
dtype: Optional[str]=paddle.float32) -> paddle.Tensor:
dtype: str='float32') -> Tensor:
"""Create a discrete cosine transform(DCT) matrix.
Parameters:
Args:
n_mfcc (int): Number of mel frequency cepstral coefficients.
n_mels (int): Number of mel filterbanks.
norm (str, optional): Normalizaiton type. Defaults to 'ortho'.
norm (Optional[str], optional): Normalization type. Defaults to 'ortho'.
dtype (str, optional): The data type of the return matrix. Defaults to 'float32'.
Returns:
Tensor: The DCT matrix with shape (n_mels, n_mfcc).
Tensor: The DCT matrix with shape `(n_mels, n_mfcc)`.
"""
n = paddle.arange(n_mels, dtype=dtype)
k = paddle.arange(n_mfcc, dtype=dtype).unsqueeze(1)

@ -20,24 +20,11 @@ from paddle import Tensor
__all__ = [
'get_window',
# windows
'taylor',
'hamming',
'hann',
'tukey',
'kaiser',
'gaussian',
'exponential',
'triang',
'bohman',
'blackman',
'cosine',
]
def _cat(a: List[Tensor], data_type: str) -> Tensor:
l = [paddle.to_tensor(_a, data_type) for _a in a]
def _cat(x: List[Tensor], data_type: str) -> Tensor:
l = [paddle.to_tensor(_, data_type) for _ in x]
return paddle.concat(l)
@ -48,7 +35,7 @@ def _acosh(x: Union[Tensor, float]) -> Tensor:
def _extend(M: int, sym: bool) -> bool:
"""Extend window by 1 sample if needed for DFT-even symmetry"""
"""Extend window by 1 sample if needed for DFT-even symmetry. """
if not sym:
return M + 1, True
else:
@ -56,7 +43,7 @@ def _extend(M: int, sym: bool) -> bool:
def _len_guards(M: int) -> bool:
"""Handle small or incorrect window lengths"""
"""Handle small or incorrect window lengths. """
if int(M) != M or M < 0:
raise ValueError('Window length M must be a non-negative integer')
@ -64,15 +51,15 @@ def _len_guards(M: int) -> bool:
def _truncate(w: Tensor, needed: bool) -> Tensor:
"""Truncate window by 1 sample if needed for DFT-even symmetry"""
"""Truncate window by 1 sample if needed for DFT-even symmetry. """
if needed:
return w[:-1]
else:
return w
def general_gaussian(M: int, p, sig, sym: bool=True,
dtype: str='float64') -> Tensor:
def _general_gaussian(M: int, p, sig, sym: bool=True,
dtype: str='float64') -> Tensor:
"""Compute a window with a generalized Gaussian shape.
This function is consistent with scipy.signal.windows.general_gaussian().
"""
@ -86,8 +73,8 @@ def general_gaussian(M: int, p, sig, sym: bool=True,
return _truncate(w, needs_trunc)
def general_cosine(M: int, a: float, sym: bool=True,
dtype: str='float64') -> Tensor:
def _general_cosine(M: int, a: float, sym: bool=True,
dtype: str='float64') -> Tensor:
"""Compute a generic weighted sum of cosine terms window.
This function is consistent with scipy.signal.windows.general_cosine().
"""
@ -101,31 +88,23 @@ def general_cosine(M: int, a: float, sym: bool=True,
return _truncate(w, needs_trunc)
def general_hamming(M: int, alpha: float, sym: bool=True,
dtype: str='float64') -> Tensor:
def _general_hamming(M: int, alpha: float, sym: bool=True,
dtype: str='float64') -> Tensor:
"""Compute a generalized Hamming window.
This function is consistent with scipy.signal.windows.general_hamming()
"""
return general_cosine(M, [alpha, 1. - alpha], sym, dtype=dtype)
return _general_cosine(M, [alpha, 1. - alpha], sym, dtype=dtype)
def taylor(M: int,
nbar=4,
sll=30,
norm=True,
sym: bool=True,
dtype: str='float64') -> Tensor:
def _taylor(M: int,
nbar=4,
sll=30,
norm=True,
sym: bool=True,
dtype: str='float64') -> Tensor:
"""Compute a Taylor window.
The Taylor window taper function approximates the Dolph-Chebyshev window's
constant sidelobe level for a parameterized number of near-in sidelobes.
Parameters:
M(int): window size
nbar, sil, norm: the window-specific parameter.
sym(bool)whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
"""
if _len_guards(M):
return paddle.ones((M, ), dtype=dtype)
@ -171,46 +150,25 @@ def taylor(M: int,
return _truncate(w, needs_trunc)
def hamming(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
def _hamming(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
"""Compute a Hamming window.
The Hamming window is a taper formed by using a raised cosine with
non-zero endpoints, optimized to minimize the nearest side lobe.
Parameters:
M(int): window size
sym(bool)whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
"""
return general_hamming(M, 0.54, sym, dtype=dtype)
return _general_hamming(M, 0.54, sym, dtype=dtype)
def hann(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
def _hann(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
"""Compute a Hann window.
The Hann window is a taper formed by using a raised cosine or sine-squared
with ends that touch zero.
Parameters:
M(int): window size
sym(bool)whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
"""
return general_hamming(M, 0.5, sym, dtype=dtype)
return _general_hamming(M, 0.5, sym, dtype=dtype)
def tukey(M: int, alpha=0.5, sym: bool=True, dtype: str='float64') -> Tensor:
def _tukey(M: int, alpha=0.5, sym: bool=True, dtype: str='float64') -> Tensor:
"""Compute a Tukey window.
The Tukey window is also known as a tapered cosine window.
Parameters:
M(int): window size
sym(bool)whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
"""
if _len_guards(M):
return paddle.ones((M, ), dtype=dtype)
@ -237,32 +195,18 @@ def tukey(M: int, alpha=0.5, sym: bool=True, dtype: str='float64') -> Tensor:
return _truncate(w, needs_trunc)
def kaiser(M: int, beta: float, sym: bool=True, dtype: str='float64') -> Tensor:
def _kaiser(M: int, beta: float, sym: bool=True,
dtype: str='float64') -> Tensor:
"""Compute a Kaiser window.
The Kaiser window is a taper formed by using a Bessel function.
Parameters:
M(int): window size.
beta(float): the window-specific parameter.
sym(bool)whether to return symmetric window.
The default value is True
Returns:
Tensor: the window tensor
"""
raise NotImplementedError()
def gaussian(M: int, std: float, sym: bool=True,
dtype: str='float64') -> Tensor:
def _gaussian(M: int, std: float, sym: bool=True,
dtype: str='float64') -> Tensor:
"""Compute a Gaussian window.
The Gaussian window has a Gaussian shape defined by the standard deviation (std).
Parameters:
M(int): window size.
std(float): the window-specific parameter.
sym(bool)whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
"""
if _len_guards(M):
return paddle.ones((M, ), dtype=dtype)
@ -275,21 +219,12 @@ def gaussian(M: int, std: float, sym: bool=True,
return _truncate(w, needs_trunc)
def exponential(M: int,
center=None,
tau=1.,
sym: bool=True,
dtype: str='float64') -> Tensor:
"""Compute an exponential (or Poisson) window.
Parameters:
M(int): window size.
tau(float): the window-specific parameter.
sym(bool)whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
"""
def _exponential(M: int,
center=None,
tau=1.,
sym: bool=True,
dtype: str='float64') -> Tensor:
"""Compute an exponential (or Poisson) window. """
if sym and center is not None:
raise ValueError("If sym==True, center must be None.")
if _len_guards(M):
@ -305,15 +240,8 @@ def exponential(M: int,
return _truncate(w, needs_trunc)
def triang(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
def _triang(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
"""Compute a triangular window.
Parameters:
M(int): window size.
sym(bool)whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
"""
if _len_guards(M):
return paddle.ones((M, ), dtype=dtype)
@ -330,16 +258,9 @@ def triang(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
return _truncate(w, needs_trunc)
def bohman(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
def _bohman(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
"""Compute a Bohman window.
The Bohman window is the autocorrelation of a cosine window.
Parameters:
M(int): window size.
sym(bool)whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
"""
if _len_guards(M):
return paddle.ones((M, ), dtype=dtype)
@ -353,32 +274,18 @@ def bohman(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
return _truncate(w, needs_trunc)
def blackman(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
def _blackman(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
"""Compute a Blackman window.
The Blackman window is a taper formed by using the first three terms of
a summation of cosines. It was designed to have close to the minimal
leakage possible. It is close to optimal, only slightly worse than a
Kaiser window.
Parameters:
M(int): window size.
sym(bool)whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
"""
return general_cosine(M, [0.42, 0.50, 0.08], sym, dtype=dtype)
return _general_cosine(M, [0.42, 0.50, 0.08], sym, dtype=dtype)
def cosine(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
def _cosine(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
"""Compute a window with a simple cosine shape.
Parameters:
M(int): window size.
sym(bool)whether to return symmetric window.
The default value is True
dtype(str): the datatype of returned tensor.
Returns:
Tensor: the window tensor
"""
if _len_guards(M):
return paddle.ones((M, ), dtype=dtype)
@ -388,19 +295,20 @@ def cosine(M: int, sym: bool=True, dtype: str='float64') -> Tensor:
return _truncate(w, needs_trunc)
## factory function
def get_window(window: Union[str, Tuple[str, float]],
win_length: int,
fftbins: bool=True,
dtype: str='float64') -> Tensor:
"""Return a window of a given length and type.
Parameters:
window(str|(str,float)): the type of window to create.
win_length(int): the number of samples in the window.
fftbins(bool): If True, create a "periodic" window. Otherwise,
create a "symmetric" window, for use in filter design.
Args:
window (Union[str, Tuple[str, float]]): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'.
win_length (int): Number of samples.
fftbins (bool, optional): If True, create a "periodic" window. Otherwise, create a "symmetric" window, for use in filter design. Defaults to True.
dtype (str, optional): The data type of the return window. Defaults to 'float64'.
Returns:
The window represented as a tensor.
Tensor: The window represented as a tensor.
"""
sym = not fftbins
@ -420,7 +328,7 @@ def get_window(window: Union[str, Tuple[str, float]],
str(type(window)))
try:
winfunc = eval(winstr)
winfunc = eval('_' + winstr)
except KeyError as e:
raise ValueError("Unknown window type.") from e

@ -20,9 +20,7 @@ __all__ = [
def dtw_distance(xs: np.ndarray, ys: np.ndarray) -> float:
"""dtw distance
Dynamic Time Warping.
"""Dynamic Time Warping.
This function keeps a compact matrix, not the full warping paths matrix.
Uses dynamic programming to compute:

@ -11,19 +11,46 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import glob
import os
import setuptools
from setuptools.command.install import install
from setuptools.command.test import test
# set the version here
VERSION = '0.2.0'
# Inspired by the example at https://pytest.org/latest/goodpractises.html
class TestCommand(test):
def finalize_options(self):
test.finalize_options(self)
self.test_args = []
self.test_suite = True
def run(self):
self.run_benchmark()
super(TestCommand, self).run()
def run_tests(self):
# Run nose ensuring that argv simulates running nosetests directly
import nose
nose.run_exit(argv=['nosetests', '-w', 'tests'])
def run_benchmark(self):
for benchmark_item in glob.glob('tests/benchmark/*py'):
os.system(f'pytest {benchmark_item}')
class InstallCommand(install):
def run(self):
install.run(self)
def write_version_py(filename='paddleaudio/__init__.py'):
import paddleaudio
if hasattr(paddleaudio,
"__version__") and paddleaudio.__version__ == VERSION:
return
with open(filename, "a") as f:
f.write(f"\n__version__ = '{VERSION}'\n")
f.write(f"__version__ = '{VERSION}'")
def remove_version_py(filename='paddleaudio/__init__.py'):
@ -35,6 +62,7 @@ def remove_version_py(filename='paddleaudio/__init__.py'):
f.write(line)
remove_version_py()
write_version_py()
setuptools.setup(
@ -61,6 +89,16 @@ setuptools.setup(
'colorlog',
'dtaidistance >= 2.3.6',
'mcd >= 0.4',
], )
],
extras_require={
'test': [
'nose', 'librosa==0.8.1', 'soundfile==0.10.3.post1',
'torchaudio==0.10.2', 'pytest-benchmark'
],
},
cmdclass={
'install': InstallCommand,
'test': TestCommand,
}, )
remove_version_py()

@ -0,0 +1,13 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@ -0,0 +1,34 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import unittest
import urllib.request
mono_channel_wav = 'https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav'
multi_channels_wav = 'https://paddlespeech.bj.bcebos.com/PaddleAudio/cat.wav'
class BackendTest(unittest.TestCase):
def setUp(self):
self.initWavInput()
def initWavInput(self):
self.files = []
for url in [mono_channel_wav, multi_channels_wav]:
if not os.path.isfile(os.path.basename(url)):
urllib.request.urlretrieve(url, os.path.basename(url))
self.files.append(os.path.basename(url))
def initParmas(self):
raise NotImplementedError

@ -0,0 +1,13 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@ -0,0 +1,73 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import filecmp
import os
import unittest
import numpy as np
import soundfile as sf
import paddleaudio
from ..base import BackendTest
class TestIO(BackendTest):
def test_load_mono_channel(self):
sf_data, sf_sr = sf.read(self.files[0])
pa_data, pa_sr = paddleaudio.load(
self.files[0], normal=False, dtype='float64')
self.assertEqual(sf_data.dtype, pa_data.dtype)
self.assertEqual(sf_sr, pa_sr)
np.testing.assert_array_almost_equal(sf_data, pa_data)
def test_load_multi_channels(self):
sf_data, sf_sr = sf.read(self.files[1])
sf_data = sf_data.T # Channel dim first
pa_data, pa_sr = paddleaudio.load(
self.files[1], mono=False, normal=False, dtype='float64')
self.assertEqual(sf_data.dtype, pa_data.dtype)
self.assertEqual(sf_sr, pa_sr)
np.testing.assert_array_almost_equal(sf_data, pa_data)
def test_save_mono_channel(self):
waveform, sr = np.random.randint(
low=-32768, high=32768, size=(48000), dtype=np.int16), 16000
sf_tmp_file = 'sf_tmp.wav'
pa_tmp_file = 'pa_tmp.wav'
sf.write(sf_tmp_file, waveform, sr)
paddleaudio.save(waveform, sr, pa_tmp_file)
self.assertTrue(filecmp.cmp(sf_tmp_file, pa_tmp_file))
for file in [sf_tmp_file, pa_tmp_file]:
os.remove(file)
def test_save_multi_channels(self):
waveform, sr = np.random.randint(
low=-32768, high=32768, size=(2, 48000), dtype=np.int16), 16000
sf_tmp_file = 'sf_tmp.wav'
pa_tmp_file = 'pa_tmp.wav'
sf.write(sf_tmp_file, waveform.T, sr)
paddleaudio.save(waveform.T, sr, pa_tmp_file)
self.assertTrue(filecmp.cmp(sf_tmp_file, pa_tmp_file))
for file in [sf_tmp_file, pa_tmp_file]:
os.remove(file)
if __name__ == '__main__':
unittest.main()

@ -0,0 +1,39 @@
# 1. Prepare
First, install `pytest-benchmark` via pip.
```sh
pip install pytest-benchmark
```
# 2. Run
Run the script you want to profile.
```sh
pytest melspectrogram.py
```
Result:
```sh
========================================================================== test session starts ==========================================================================
platform linux -- Python 3.7.7, pytest-7.0.1, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /ssd3/chenxiaojie06/PaddleSpeech/DeepSpeech/paddleaudio
plugins: typeguard-2.12.1, benchmark-3.4.1, anyio-3.5.0
collected 4 items
melspectrogram.py .... [100%]
-------------------------------------------------------------------------------------------------- benchmark: 4 tests -------------------------------------------------------------------------------------------------
Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_melspect_gpu_torchaudio 202.0765 (1.0) 360.6230 (1.0) 218.1168 (1.0) 16.3022 (1.0) 214.2871 (1.0) 21.8451 (1.0) 40;3 4,584.7001 (1.0) 286 1
test_melspect_gpu 657.8509 (3.26) 908.0470 (2.52) 724.2545 (3.32) 106.5771 (6.54) 669.9096 (3.13) 113.4719 (5.19) 1;0 1,380.7300 (0.30) 5 1
test_melspect_cpu_torchaudio 1,247.6053 (6.17) 2,892.5799 (8.02) 1,443.2853 (6.62) 345.3732 (21.19) 1,262.7263 (5.89) 221.6385 (10.15) 56;53 692.8637 (0.15) 399 1
test_melspect_cpu 20,326.2549 (100.59) 20,607.8682 (57.15) 20,473.4125 (93.86) 63.8654 (3.92) 20,467.0429 (95.51) 68.4294 (3.13) 8;1 48.8438 (0.01) 29 1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
========================================================================== 4 passed in 21.12s ===========================================================================
```

@ -0,0 +1,124 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import urllib.request
import librosa
import numpy as np
import paddle
import torch
import torchaudio
import paddleaudio
wav_url = 'https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav'
if not os.path.isfile(os.path.basename(wav_url)):
urllib.request.urlretrieve(wav_url, os.path.basename(wav_url))
waveform, sr = paddleaudio.load(os.path.abspath(os.path.basename(wav_url)))
waveform_tensor = paddle.to_tensor(waveform).unsqueeze(0)
waveform_tensor_torch = torch.from_numpy(waveform).unsqueeze(0)
# Feature conf
mel_conf = {
'sr': sr,
'n_fft': 512,
'hop_length': 128,
'n_mels': 40,
}
mel_conf_torchaudio = {
'sample_rate': sr,
'n_fft': 512,
'hop_length': 128,
'n_mels': 40,
'norm': 'slaney',
'mel_scale': 'slaney',
}
def enable_cpu_device():
paddle.set_device('cpu')
def enable_gpu_device():
paddle.set_device('gpu')
log_mel_extractor = paddleaudio.features.LogMelSpectrogram(
**mel_conf, f_min=0.0, top_db=80.0, dtype=waveform_tensor.dtype)
def log_melspectrogram():
return log_mel_extractor(waveform_tensor).squeeze(0)
def test_log_melspect_cpu(benchmark):
enable_cpu_device()
feature_paddleaudio = benchmark(log_melspectrogram)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
feature_librosa = librosa.power_to_db(feature_librosa, top_db=80.0)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
def test_log_melspect_gpu(benchmark):
enable_gpu_device()
feature_paddleaudio = benchmark(log_melspectrogram)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
feature_librosa = librosa.power_to_db(feature_librosa, top_db=80.0)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=2)
mel_extractor_torchaudio = torchaudio.transforms.MelSpectrogram(
**mel_conf_torchaudio, f_min=0.0)
amplitude_to_DB = torchaudio.transforms.AmplitudeToDB('power', top_db=80.0)
def melspectrogram_torchaudio():
return mel_extractor_torchaudio(waveform_tensor_torch).squeeze(0)
def log_melspectrogram_torchaudio():
mel_specgram = mel_extractor_torchaudio(waveform_tensor_torch)
return amplitude_to_DB(mel_specgram).squeeze(0)
def test_log_melspect_cpu_torchaudio(benchmark):
global waveform_tensor_torch, mel_extractor_torchaudio, amplitude_to_DB
mel_extractor_torchaudio = mel_extractor_torchaudio.to('cpu')
waveform_tensor_torch = waveform_tensor_torch.to('cpu')
amplitude_to_DB = amplitude_to_DB.to('cpu')
feature_paddleaudio = benchmark(log_melspectrogram_torchaudio)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
feature_librosa = librosa.power_to_db(feature_librosa, top_db=80.0)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
def test_log_melspect_gpu_torchaudio(benchmark):
global waveform_tensor_torch, mel_extractor_torchaudio, amplitude_to_DB
mel_extractor_torchaudio = mel_extractor_torchaudio.to('cuda')
waveform_tensor_torch = waveform_tensor_torch.to('cuda')
amplitude_to_DB = amplitude_to_DB.to('cuda')
feature_torchaudio = benchmark(log_melspectrogram_torchaudio)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
feature_librosa = librosa.power_to_db(feature_librosa, top_db=80.0)
np.testing.assert_array_almost_equal(
feature_librosa, feature_torchaudio.cpu(), decimal=2)

@ -0,0 +1,108 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import urllib.request
import librosa
import numpy as np
import paddle
import torch
import torchaudio
import paddleaudio
wav_url = 'https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav'
if not os.path.isfile(os.path.basename(wav_url)):
urllib.request.urlretrieve(wav_url, os.path.basename(wav_url))
waveform, sr = paddleaudio.load(os.path.abspath(os.path.basename(wav_url)))
waveform_tensor = paddle.to_tensor(waveform).unsqueeze(0)
waveform_tensor_torch = torch.from_numpy(waveform).unsqueeze(0)
# Feature conf
mel_conf = {
'sr': sr,
'n_fft': 512,
'hop_length': 128,
'n_mels': 40,
}
mel_conf_torchaudio = {
'sample_rate': sr,
'n_fft': 512,
'hop_length': 128,
'n_mels': 40,
'norm': 'slaney',
'mel_scale': 'slaney',
}
def enable_cpu_device():
paddle.set_device('cpu')
def enable_gpu_device():
paddle.set_device('gpu')
mel_extractor = paddleaudio.features.MelSpectrogram(
**mel_conf, f_min=0.0, dtype=waveform_tensor.dtype)
def melspectrogram():
return mel_extractor(waveform_tensor).squeeze(0)
def test_melspect_cpu(benchmark):
enable_cpu_device()
feature_paddleaudio = benchmark(melspectrogram)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
def test_melspect_gpu(benchmark):
enable_gpu_device()
feature_paddleaudio = benchmark(melspectrogram)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
mel_extractor_torchaudio = torchaudio.transforms.MelSpectrogram(
**mel_conf_torchaudio, f_min=0.0)
def melspectrogram_torchaudio():
return mel_extractor_torchaudio(waveform_tensor_torch).squeeze(0)
def test_melspect_cpu_torchaudio(benchmark):
global waveform_tensor_torch, mel_extractor_torchaudio
mel_extractor_torchaudio = mel_extractor_torchaudio.to('cpu')
waveform_tensor_torch = waveform_tensor_torch.to('cpu')
feature_paddleaudio = benchmark(melspectrogram_torchaudio)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
def test_melspect_gpu_torchaudio(benchmark):
global waveform_tensor_torch, mel_extractor_torchaudio
mel_extractor_torchaudio = mel_extractor_torchaudio.to('cuda')
waveform_tensor_torch = waveform_tensor_torch.to('cuda')
feature_torchaudio = benchmark(melspectrogram_torchaudio)
feature_librosa = librosa.feature.melspectrogram(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_torchaudio.cpu(), decimal=3)

@ -0,0 +1,122 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import urllib.request
import librosa
import numpy as np
import paddle
import torch
import torchaudio
import paddleaudio
wav_url = 'https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav'
if not os.path.isfile(os.path.basename(wav_url)):
urllib.request.urlretrieve(wav_url, os.path.basename(wav_url))
waveform, sr = paddleaudio.load(os.path.abspath(os.path.basename(wav_url)))
waveform_tensor = paddle.to_tensor(waveform).unsqueeze(0)
waveform_tensor_torch = torch.from_numpy(waveform).unsqueeze(0)
# Feature conf
mel_conf = {
'sr': sr,
'n_fft': 512,
'hop_length': 128,
'n_mels': 40,
}
mfcc_conf = {
'n_mfcc': 20,
'top_db': 80.0,
}
mfcc_conf.update(mel_conf)
mel_conf_torchaudio = {
'sample_rate': sr,
'n_fft': 512,
'hop_length': 128,
'n_mels': 40,
'norm': 'slaney',
'mel_scale': 'slaney',
}
mfcc_conf_torchaudio = {
'sample_rate': sr,
'n_mfcc': 20,
}
def enable_cpu_device():
paddle.set_device('cpu')
def enable_gpu_device():
paddle.set_device('gpu')
mfcc_extractor = paddleaudio.features.MFCC(
**mfcc_conf, f_min=0.0, dtype=waveform_tensor.dtype)
def mfcc():
return mfcc_extractor(waveform_tensor).squeeze(0)
def test_mfcc_cpu(benchmark):
enable_cpu_device()
feature_paddleaudio = benchmark(mfcc)
feature_librosa = librosa.feature.mfcc(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
def test_mfcc_gpu(benchmark):
enable_gpu_device()
feature_paddleaudio = benchmark(mfcc)
feature_librosa = librosa.feature.mfcc(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
del mel_conf_torchaudio['sample_rate']
mfcc_extractor_torchaudio = torchaudio.transforms.MFCC(
**mfcc_conf_torchaudio, melkwargs=mel_conf_torchaudio)
def mfcc_torchaudio():
return mfcc_extractor_torchaudio(waveform_tensor_torch).squeeze(0)
def test_mfcc_cpu_torchaudio(benchmark):
global waveform_tensor_torch, mfcc_extractor_torchaudio
mfcc_extractor_torchaudio = mfcc_extractor_torchaudio.to('cpu')
waveform_tensor_torch = waveform_tensor_torch.to('cpu')
feature_paddleaudio = benchmark(mfcc_torchaudio)
feature_librosa = librosa.feature.mfcc(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddleaudio, decimal=3)
def test_mfcc_gpu_torchaudio(benchmark):
global waveform_tensor_torch, mfcc_extractor_torchaudio
mfcc_extractor_torchaudio = mfcc_extractor_torchaudio.to('cuda')
waveform_tensor_torch = waveform_tensor_torch.to('cuda')
feature_torchaudio = benchmark(mfcc_torchaudio)
feature_librosa = librosa.feature.mfcc(waveform, **mel_conf)
np.testing.assert_array_almost_equal(
feature_librosa, feature_torchaudio.cpu(), decimal=3)
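
Both benchmark modules rely on the pytest-benchmark `benchmark` fixture, so they are collected and timed by pytest rather than executed as plain scripts. A hedged invocation sketch follows; the module path is an assumption about the repository layout.

```python
# Hedged invocation sketch; the file path below is an assumed location for
# this benchmark module and may differ in the actual repository.
import pytest

# --benchmark-only runs only the benchmarked tests and prints timing statistics.
pytest.main(["tests/benchmark/mfcc.py", "--benchmark-only", "-q"])
```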

@ -0,0 +1,13 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@ -0,0 +1,49 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import unittest
import urllib.request
import numpy as np
import paddle
from paddleaudio import load
wav_url = 'https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav'
class FeatTest(unittest.TestCase):
def setUp(self):
self.initParmas()
self.initWavInput()
self.setUpDevice()
def setUpDevice(self, device='cpu'):
paddle.set_device(device)
def initWavInput(self, url=wav_url):
if not os.path.isfile(os.path.basename(url)):
urllib.request.urlretrieve(url, os.path.basename(url))
self.waveform, self.sr = load(os.path.abspath(os.path.basename(url)))
self.waveform = self.waveform.astype(
np.float32
) # paddlespeech.s2t.transform.spectrogram only supports float32
dim = len(self.waveform.shape)
assert dim in [1, 2]
if dim == 1:
self.waveform = np.expand_dims(self.waveform, 0)
def initParmas(self):
raise NotImplementedError

@ -0,0 +1,49 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
import paddle
from .base import FeatTest
from paddleaudio.functional.window import get_window
from paddlespeech.s2t.transform.spectrogram import IStft
from paddlespeech.s2t.transform.spectrogram import Stft
class TestIstft(FeatTest):
def initParmas(self):
self.n_fft = 512
self.hop_length = 128
self.window_str = 'hann'
def test_istft(self):
ps_stft = Stft(self.n_fft, self.hop_length)
ps_res = ps_stft(
self.waveform.T).squeeze(1).T # (n_fft//2 + 1, n_frames)
x = paddle.to_tensor(ps_res)
ps_istft = IStft(self.hop_length)
ps_res = ps_istft(ps_res.T)
window = get_window(
self.window_str, self.n_fft, dtype=self.waveform.dtype)
pd_res = paddle.signal.istft(
x, self.n_fft, self.hop_length, window=window)
np.testing.assert_array_almost_equal(ps_res, pd_res, decimal=5)
if __name__ == '__main__':
unittest.main()
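
As a quick illustration of the paddle.signal API exercised by this test, the sketch below runs an STFT/iSTFT round trip on synthetic data, reusing the same window helper; it is illustrative only.

```python
# Hedged round-trip sketch for paddle.signal.stft / paddle.signal.istft,
# reusing the window helper from the test above. The input is synthetic noise.
import numpy as np
import paddle
from paddleaudio.functional.window import get_window

x = paddle.to_tensor(np.random.randn(1, 16000).astype('float32'))
n_fft, hop_length = 512, 128
window = get_window('hann', n_fft, dtype=x.dtype)

spec = paddle.signal.stft(x, n_fft, hop_length, window=window)       # (1, n_fft//2 + 1, n_frames)
recon = paddle.signal.istft(spec, n_fft, hop_length, window=window)  # (1, ~16000)
print(spec.shape, recon.shape)
```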

@ -0,0 +1,81 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
import paddle
import torch
import torchaudio
import paddleaudio
from .base import FeatTest
class TestKaldi(FeatTest):
def initParmas(self):
self.window_size = 1024
self.dtype = 'float32'
def test_window(self):
t_hann_window = torch.hann_window(
self.window_size, periodic=False, dtype=eval(f'torch.{self.dtype}'))
t_hamm_window = torch.hamming_window(
self.window_size,
periodic=False,
alpha=0.54,
beta=0.46,
dtype=eval(f'torch.{self.dtype}'))
t_povey_window = torch.hann_window(
self.window_size, periodic=False,
dtype=eval(f'torch.{self.dtype}')).pow(0.85)
p_hann_window = paddleaudio.functional.window.get_window(
'hann',
self.window_size,
fftbins=False,
dtype=eval(f'paddle.{self.dtype}'))
p_hamm_window = paddleaudio.functional.window.get_window(
'hamming',
self.window_size,
fftbins=False,
dtype=eval(f'paddle.{self.dtype}'))
p_povey_window = paddleaudio.functional.window.get_window(
'hann',
self.window_size,
fftbins=False,
dtype=eval(f'paddle.{self.dtype}')).pow(0.85)
np.testing.assert_array_almost_equal(t_hann_window, p_hann_window)
np.testing.assert_array_almost_equal(t_hamm_window, p_hamm_window)
np.testing.assert_array_almost_equal(t_povey_window, p_povey_window)
def test_fbank(self):
ta_features = torchaudio.compliance.kaldi.fbank(
torch.from_numpy(self.waveform.astype(self.dtype)))
pa_features = paddleaudio.compliance.kaldi.fbank(
paddle.to_tensor(self.waveform.astype(self.dtype)))
np.testing.assert_array_almost_equal(
ta_features, pa_features, decimal=4)
def test_mfcc(self):
ta_features = torchaudio.compliance.kaldi.mfcc(
torch.from_numpy(self.waveform.astype(self.dtype)))
pa_features = paddleaudio.compliance.kaldi.mfcc(
paddle.to_tensor(self.waveform.astype(self.dtype)))
np.testing.assert_array_almost_equal(
ta_features, pa_features, decimal=4)
if __name__ == '__main__':
unittest.main()
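
For context, the kaldi-compliance entry points compared here can be called directly on any waveform tensor; the sketch below uses synthetic audio and relies on the extractors' default frame and mel settings, which is an assumption for illustration.

```python
# Hedged sketch of the kaldi-compliance API compared above; the input is
# synthetic and default frame/mel settings are assumed.
import numpy as np
import paddle
import paddleaudio

wav = paddle.to_tensor(np.random.randn(1, 16000).astype('float32'))  # (channel, time)
fbank = paddleaudio.compliance.kaldi.fbank(wav)   # (num_frames, num_mel_bins)
mfcc = paddleaudio.compliance.kaldi.mfcc(wav)     # (num_frames, num_ceps)
print(fbank.shape, mfcc.shape)
```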

@ -0,0 +1,281 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import librosa
import numpy as np
import paddle
import paddleaudio
from .base import FeatTest
from paddleaudio.functional.window import get_window
class TestLibrosa(FeatTest):
def initParmas(self):
self.n_fft = 512
self.hop_length = 128
self.n_mels = 40
self.n_mfcc = 20
self.fmin = 0.0
self.window_str = 'hann'
self.pad_mode = 'reflect'
self.top_db = 80.0
def test_stft(self):
if len(self.waveform.shape) == 2: # (C, T)
self.waveform = self.waveform.squeeze(
0) # 1D input for librosa.feature.melspectrogram
feature_librosa = librosa.core.stft(
y=self.waveform,
n_fft=self.n_fft,
hop_length=self.hop_length,
win_length=None,
window=self.window_str,
center=True,
dtype=None,
pad_mode=self.pad_mode, )
x = paddle.to_tensor(self.waveform).unsqueeze(0)
window = get_window(self.window_str, self.n_fft, dtype=x.dtype)
feature_paddle = paddle.signal.stft(
x=x,
n_fft=self.n_fft,
hop_length=self.hop_length,
win_length=None,
window=window,
center=True,
pad_mode=self.pad_mode,
normalized=False,
onesided=True, ).squeeze(0)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddle, decimal=5)
def test_istft(self):
if len(self.waveform.shape) == 2: # (C, T)
self.waveform = self.waveform.squeeze(
0) # 1D input for librosa.feature.melspectrogram
# Get stft result from librosa.
stft_matrix = librosa.core.stft(
y=self.waveform,
n_fft=self.n_fft,
hop_length=self.hop_length,
win_length=None,
window=self.window_str,
center=True,
pad_mode=self.pad_mode, )
feature_librosa = librosa.core.istft(
stft_matrix=stft_matrix,
hop_length=self.hop_length,
win_length=None,
window=self.window_str,
center=True,
dtype=None,
length=None, )
x = paddle.to_tensor(stft_matrix).unsqueeze(0)
window = get_window(
self.window_str,
self.n_fft,
dtype=paddle.to_tensor(self.waveform).dtype)
feature_paddle = paddle.signal.istft(
x=x,
n_fft=self.n_fft,
hop_length=self.hop_length,
win_length=None,
window=window,
center=True,
normalized=False,
onesided=True,
length=None,
return_complex=False, ).squeeze(0)
np.testing.assert_array_almost_equal(
feature_librosa, feature_paddle, decimal=5)
def test_mel(self):
feature_librosa = librosa.filters.mel(
sr=self.sr,
n_fft=self.n_fft,
n_mels=self.n_mels,
fmin=self.fmin,
fmax=None,
htk=False,
norm='slaney',
dtype=self.waveform.dtype, )
feature_compliance = paddleaudio.compliance.librosa.compute_fbank_matrix(
sr=self.sr,
n_fft=self.n_fft,
n_mels=self.n_mels,
fmin=self.fmin,
fmax=None,
htk=False,
norm='slaney',
dtype=self.waveform.dtype, )
x = paddle.to_tensor(self.waveform)
feature_functional = paddleaudio.functional.compute_fbank_matrix(
sr=self.sr,
n_fft=self.n_fft,
n_mels=self.n_mels,
f_min=self.fmin,
f_max=None,
htk=False,
norm='slaney',
dtype=x.dtype, )
np.testing.assert_array_almost_equal(feature_librosa,
feature_compliance)
np.testing.assert_array_almost_equal(feature_librosa,
feature_functional)
def test_melspect(self):
if len(self.waveform.shape) == 2: # (C, T)
self.waveform = self.waveform.squeeze(
0) # 1D input for librosa.feature.melspectrogram
# librosa:
feature_librosa = librosa.feature.melspectrogram(
y=self.waveform,
sr=self.sr,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
fmin=self.fmin)
# paddleaudio.compliance.librosa:
feature_compliance = paddleaudio.compliance.librosa.melspectrogram(
x=self.waveform,
sr=self.sr,
window_size=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
fmin=self.fmin,
to_db=False)
# paddleaudio.features.layer
x = paddle.to_tensor(
self.waveform, dtype=paddle.float64).unsqueeze(0) # Add batch dim.
feature_extractor = paddleaudio.features.MelSpectrogram(
sr=self.sr,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
f_min=self.fmin,
dtype=x.dtype)
feature_layer = feature_extractor(x).squeeze(0).numpy()
np.testing.assert_array_almost_equal(
feature_librosa, feature_compliance, decimal=5)
np.testing.assert_array_almost_equal(
feature_librosa, feature_layer, decimal=5)
def test_log_melspect(self):
if len(self.waveform.shape) == 2: # (C, T)
self.waveform = self.waveform.squeeze(
0) # 1D input for librosa.feature.melspectrogram
# librosa:
feature_librosa = librosa.feature.melspectrogram(
y=self.waveform,
sr=self.sr,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
fmin=self.fmin)
feature_librosa = librosa.power_to_db(feature_librosa, top_db=None)
# paddleaudio.compliance.librosa:
feature_compliance = paddleaudio.compliance.librosa.melspectrogram(
x=self.waveform,
sr=self.sr,
window_size=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
fmin=self.fmin)
# paddleaudio.features.layer
x = paddle.to_tensor(
self.waveform, dtype=paddle.float64).unsqueeze(0) # Add batch dim.
feature_extractor = paddleaudio.features.LogMelSpectrogram(
sr=self.sr,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
f_min=self.fmin,
dtype=x.dtype)
feature_layer = feature_extractor(x).squeeze(0).numpy()
np.testing.assert_array_almost_equal(
feature_librosa, feature_compliance, decimal=5)
np.testing.assert_array_almost_equal(
feature_librosa, feature_layer, decimal=4)
def test_mfcc(self):
if len(self.waveform.shape) == 2: # (C, T)
self.waveform = self.waveform.squeeze(
0) # 1D input for librosa.feature.melspectrogram
# librosa:
feature_librosa = librosa.feature.mfcc(
y=self.waveform,
sr=self.sr,
S=None,
n_mfcc=self.n_mfcc,
dct_type=2,
norm='ortho',
lifter=0,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
fmin=self.fmin)
# paddleaudio.compliance.librosa:
feature_compliance = paddleaudio.compliance.librosa.mfcc(
x=self.waveform,
sr=self.sr,
n_mfcc=self.n_mfcc,
dct_type=2,
norm='ortho',
lifter=0,
window_size=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
fmin=self.fmin,
top_db=self.top_db)
# paddleaudio.features.layer
x = paddle.to_tensor(
self.waveform, dtype=paddle.float64).unsqueeze(0) # Add batch dim.
feature_extractor = paddleaudio.features.MFCC(
sr=self.sr,
n_mfcc=self.n_mfcc,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
f_min=self.fmin,
top_db=self.top_db,
dtype=x.dtype)
feature_layer = feature_extractor(x).squeeze(0).numpy()
np.testing.assert_array_almost_equal(
feature_librosa, feature_compliance, decimal=4)
np.testing.assert_array_almost_equal(
feature_librosa, feature_layer, decimal=4)
if __name__ == '__main__':
unittest.main()

@ -0,0 +1,50 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
import paddle
import paddleaudio
from .base import FeatTest
from paddlespeech.s2t.transform.spectrogram import LogMelSpectrogram
class TestLogMelSpectrogram(FeatTest):
def initParmas(self):
self.n_fft = 512
self.hop_length = 128
self.n_mels = 40
def test_log_melspect(self):
ps_melspect = LogMelSpectrogram(self.sr, self.n_mels, self.n_fft,
self.hop_length)
ps_res = ps_melspect(self.waveform.T).squeeze(1).T
x = paddle.to_tensor(self.waveform)
# The paddlespeech.s2t feature mixes up magnitude and power spectra, so use
# power=1.0 (magnitude) here and divide the 10*log10 dB output by 10 to match its log10 scale.
ps_melspect = paddleaudio.features.LogMelSpectrogram(
self.sr,
self.n_fft,
self.hop_length,
power=1.0,
n_mels=self.n_mels,
f_min=0.0)
pa_res = (ps_melspect(x) / 10.0).squeeze(0).numpy()
np.testing.assert_array_almost_equal(ps_res, pa_res, decimal=5)
if __name__ == '__main__':
unittest.main()

@ -0,0 +1,42 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
import paddle
import paddleaudio
from .base import FeatTest
from paddlespeech.s2t.transform.spectrogram import Spectrogram
class TestSpectrogram(FeatTest):
def initParmas(self):
self.n_fft = 512
self.hop_length = 128
def test_spectrogram(self):
ps_spect = Spectrogram(self.n_fft, self.hop_length)
ps_res = ps_spect(self.waveform.T).squeeze(1).T # Magnitude
x = paddle.to_tensor(self.waveform)
pa_spect = paddleaudio.features.Spectrogram(
self.n_fft, self.hop_length, power=1.0)
pa_res = pa_spect(x).squeeze(0).numpy()
np.testing.assert_array_almost_equal(ps_res, pa_res, decimal=5)
if __name__ == '__main__':
unittest.main()

@ -0,0 +1,44 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
import paddle
from .base import FeatTest
from paddleaudio.functional.window import get_window
from paddlespeech.s2t.transform.spectrogram import Stft
class TestStft(FeatTest):
def initParmas(self):
self.n_fft = 512
self.hop_length = 128
self.window_str = 'hann'
def test_stft(self):
ps_stft = Stft(self.n_fft, self.hop_length)
ps_res = ps_stft(
self.waveform.T).squeeze(1).T # (n_fft//2 + 1, n_frames)
x = paddle.to_tensor(self.waveform)
window = get_window(self.window_str, self.n_fft, dtype=x.dtype)
pd_res = paddle.signal.stft(
x, self.n_fft, self.hop_length, window=window).squeeze(0).numpy()
np.testing.assert_array_almost_equal(ps_res, pd_res, decimal=5)
if __name__ == '__main__':
unittest.main()

@ -193,7 +193,8 @@ class CLSExecutor(BaseExecutor):
sr=feat_conf['sample_rate'],
mono=True,
dtype='float32')
logger.info("Preprocessing audio_file:" + audio_file)
if isinstance(audio_file, (str, os.PathLike)):
logger.info("Preprocessing audio_file:" + audio_file)
# Feature extraction
feature_extractor = LogMelSpectrogram(

@ -178,7 +178,8 @@ class BaseExecutor(ABC):
Returns:
bool: return `True` for job input, `False` otherwise.
"""
return input_ and os.path.isfile(input_) and input_.endswith('.job')
return input_ and os.path.isfile(input_) and (input_.endswith('.job') or
input_.endswith('.txt'))
def _get_job_contents(
self, job_input: os.PathLike) -> Dict[str, Union[str, os.PathLike]]:

@ -237,6 +237,30 @@ pretrained_models = {
'speech_stats':
'feats_stats.npy',
},
"hifigan_aishell3-zh": {
'url':
'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip',
'md5':
'3bb49bc75032ed12f79c00c8cc79a09a',
'config':
'default.yaml',
'ckpt':
'snapshot_iter_2500000.pdz',
'speech_stats':
'feats_stats.npy',
},
"hifigan_vctk-en": {
'url':
'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip',
'md5':
'7da8f88359bca2457e705d924cf27bd4',
'config':
'default.yaml',
'ckpt':
'snapshot_iter_2500000.pdz',
'speech_stats':
'feats_stats.npy',
},
# wavernn
"wavernn_csmsc-zh": {
@ -365,6 +389,8 @@ class TTSExecutor(BaseExecutor):
'mb_melgan_csmsc',
'style_melgan_csmsc',
'hifigan_csmsc',
'hifigan_aishell3',
'hifigan_vctk',
'wavernn_csmsc',
],
help='Choose vocoder type of tts task.')

@ -192,7 +192,7 @@ class ConfigCache:
try:
cfg = yaml.load(file, Loader=yaml.FullLoader)
self._data.update(cfg)
except:
except Exception as e:
self.flush()
@property

@ -18,6 +18,7 @@ from .base_commands import ClientHelpCommand
from .base_commands import ServerBaseCommand
from .base_commands import ServerHelpCommand
from .bin.paddlespeech_client import ASRClientExecutor
from .bin.paddlespeech_client import CLSClientExecutor
from .bin.paddlespeech_client import TTSClientExecutor
from .bin.paddlespeech_server import ServerExecutor

@ -31,7 +31,7 @@ from paddlespeech.cli.log import logger
from paddlespeech.server.utils.audio_process import wav2pcm
from paddlespeech.server.utils.util import wav2base64
__all__ = ['TTSClientExecutor', 'ASRClientExecutor']
__all__ = ['TTSClientExecutor', 'ASRClientExecutor', 'CLSClientExecutor']
@cli_client_register(
@ -70,13 +70,9 @@ class TTSClientExecutor(BaseExecutor):
choices=[0, 8000, 16000],
help='Sampling rate, the default is the same as the model')
self.parser.add_argument(
'--output',
type=str,
default="./output.wav",
help='Synthesized audio file')
'--output', type=str, default=None, help='Synthesized audio file')
def postprocess(self, response_dict: dict, outfile: str) -> float:
wav_base64 = response_dict["result"]["audio"]
def postprocess(self, wav_base64: str, outfile: str) -> float:
audio_data_byte = base64.b64decode(wav_base64)
# from byte
samples, sample_rate = soundfile.read(
@ -93,37 +89,38 @@ class TTSClientExecutor(BaseExecutor):
else:
logger.error("The format for saving audio only supports wav or pcm")
duration = len(samples) / sample_rate
return duration
def execute(self, argv: List[str]) -> bool:
args = self.parser.parse_args(argv)
try:
url = 'http://' + args.server_ip + ":" + str(
args.port) + '/paddlespeech/tts'
request = {
"text": args.input,
"spk_id": args.spk_id,
"speed": args.speed,
"volume": args.volume,
"sample_rate": args.sample_rate,
"save_path": args.output
}
st = time.time()
response = requests.post(url, json.dumps(request))
time_consume = time.time() - st
response_dict = response.json()
duration = self.postprocess(response_dict, args.output)
input_ = args.input
server_ip = args.server_ip
port = args.port
spk_id = args.spk_id
speed = args.speed
volume = args.volume
sample_rate = args.sample_rate
output = args.output
try:
time_start = time.time()
res = self(
input=input_,
server_ip=server_ip,
port=port,
spk_id=spk_id,
speed=speed,
volume=volume,
sample_rate=sample_rate,
output=output)
time_end = time.time()
time_consume = time_end - time_start
response_dict = res.json()
logger.info(response_dict["message"])
logger.info("Save synthesized audio successfully on %s." %
(args.output))
logger.info("Audio duration: %f s." % (duration))
logger.info("Save synthesized audio successfully on %s." % (output))
logger.info("Audio duration: %f s." %
(response_dict['result']['duration']))
logger.info("Response time: %f s." % (time_consume))
return True
except BaseException:
except Exception as e:
logger.error("Failed to synthesized audio.")
return False
@ -136,7 +133,7 @@ class TTSClientExecutor(BaseExecutor):
speed: float=1.0,
volume: float=1.0,
sample_rate: int=0,
output: str="./output.wav"):
output: str=None):
"""
Python API to call an executor.
"""
@ -151,20 +148,11 @@ class TTSClientExecutor(BaseExecutor):
"save_path": output
}
try:
st = time.time()
response = requests.post(url, json.dumps(request))
time_consume = time.time() - st
response_dict = response.json()
duration = self.postprocess(response_dict, output)
print(response_dict["message"])
print("Save synthesized audio successfully on %s." % (output))
print("Audio duration: %f s." % (duration))
print("Response time: %f s." % (time_consume))
print("RTF: %f " % (time_consume / duration))
except BaseException:
print("Failed to synthesized audio.")
res = requests.post(url, json.dumps(request))
response_dict = res.json()
if output is not None:
self.postprocess(response_dict["result"]["audio"], output)
return res
@cli_client_register(
@ -193,24 +181,27 @@ class ASRClientExecutor(BaseExecutor):
def execute(self, argv: List[str]) -> bool:
args = self.parser.parse_args(argv)
url = 'http://' + args.server_ip + ":" + str(
args.port) + '/paddlespeech/asr'
audio = wav2base64(args.input)
data = {
"audio": audio,
"audio_format": args.audio_format,
"sample_rate": args.sample_rate,
"lang": args.lang,
}
time_start = time.time()
input_ = args.input
server_ip = args.server_ip
port = args.port
sample_rate = args.sample_rate
lang = args.lang
audio_format = args.audio_format
try:
r = requests.post(url=url, data=json.dumps(data))
# ending Timestamp
time_start = time.time()
res = self(
input=input_,
server_ip=server_ip,
port=port,
sample_rate=sample_rate,
lang=lang,
audio_format=audio_format)
time_end = time.time()
logger.info(r.json())
logger.info("time cost %f s." % (time_end - time_start))
logger.info(res.json())
logger.info("Response time %f s." % (time_end - time_start))
return True
except BaseException:
except Exception as e:
logger.error("Failed to speech recognition.")
return False
@ -234,12 +225,65 @@ class ASRClientExecutor(BaseExecutor):
"sample_rate": sample_rate,
"lang": lang,
}
time_start = time.time()
res = requests.post(url=url, data=json.dumps(data))
return res
@cli_client_register(
name='paddlespeech_client.cls', description='visit cls service')
class CLSClientExecutor(BaseExecutor):
def __init__(self):
super(CLSClientExecutor, self).__init__()
self.parser = argparse.ArgumentParser(
prog='paddlespeech_client.cls', add_help=True)
self.parser.add_argument(
'--server_ip', type=str, default='127.0.0.1', help='server ip')
self.parser.add_argument(
'--port', type=int, default=8090, help='server port')
self.parser.add_argument(
'--input',
type=str,
default=None,
help='Audio file to classify.',
required=True)
self.parser.add_argument(
'--topk',
type=int,
default=1,
help='Return topk scores of classification result.')
def execute(self, argv: List[str]) -> bool:
args = self.parser.parse_args(argv)
input_ = args.input
server_ip = args.server_ip
port = args.port
topk = args.topk
try:
r = requests.post(url=url, data=json.dumps(data))
# ending Timestamp
time_start = time.time()
res = self(input=input_, server_ip=server_ip, port=port, topk=topk)
time_end = time.time()
print(r.json())
print("time cost %f s." % (time_end - time_start))
except BaseException:
print("Failed to speech recognition.")
logger.info(res.json())
logger.info("Response time %f s." % (time_end - time_start))
return True
except Exception as e:
logger.error("Failed to speech classification.")
return False
@stats_wrapper
def __call__(self,
input: str,
server_ip: str="127.0.0.1",
port: int=8090,
topk: int=1):
"""
Python API to call an executor.
"""
url = 'http://' + server_ip + ":" + str(port) + '/paddlespeech/cls'
audio = wav2base64(input)
data = {"audio": audio, "topk": topk}
res = requests.post(url=url, data=json.dumps(data))
return res
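
A hedged usage sketch of the new CLS client Python API follows; the wav path, server address, and topk value are placeholders, and a server with a cls engine must already be running.

```python
# Hedged usage sketch for CLSClientExecutor; ./cat.wav, the address and the
# port are illustrative placeholders.
from paddlespeech.server.bin.paddlespeech_client import CLSClientExecutor

cls_client = CLSClientExecutor()
res = cls_client(input='./cat.wav', server_ip='127.0.0.1', port=8090, topk=3)
print(res.json())  # e.g. {"success": true, ..., "result": {"topk": 3, "results": [...]}}
```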

@ -103,13 +103,14 @@ class ServerStatsExecutor():
'--task',
type=str,
default=None,
choices=['asr', 'tts'],
choices=['asr', 'tts', 'cls'],
help='Choose speech task.',
required=True)
self.task_choices = ['asr', 'tts']
self.task_choices = ['asr', 'tts', 'cls']
self.model_name_format = {
'asr': 'Model-Language-Sample Rate',
'tts': 'Model-Language'
'tts': 'Model-Language',
'cls': 'Model-Sample Rate'
}
def show_support_models(self, pretrained_models: dict):
@ -174,53 +175,24 @@ class ServerStatsExecutor():
)
return False
@stats_wrapper
def __call__(
self,
task: str=None, ):
"""
Python API to call an executor.
"""
self.task = task
if self.task not in self.task_choices:
print("Please input correct speech task, choices = ['asr', 'tts']")
elif self.task == 'asr':
try:
from paddlespeech.cli.asr.infer import pretrained_models
print(
"Here is the table of ASR pretrained models supported in the service."
)
self.show_support_models(pretrained_models)
# show ASR static pretrained model
from paddlespeech.server.engine.asr.paddleinference.asr_engine import pretrained_models
print(
"Here is the table of ASR static pretrained models supported in the service."
)
self.show_support_models(pretrained_models)
except BaseException:
print(
"Failed to get the table of ASR pretrained models supported in the service."
)
elif self.task == 'tts':
elif self.task == 'cls':
try:
from paddlespeech.cli.tts.infer import pretrained_models
print(
"Here is the table of TTS pretrained models supported in the service."
from paddlespeech.cli.cls.infer import pretrained_models
logger.info(
"Here is the table of CLS pretrained models supported in the service."
)
self.show_support_models(pretrained_models)
# show TTS static pretrained model
from paddlespeech.server.engine.tts.paddleinference.tts_engine import pretrained_models
print(
"Here is the table of TTS static pretrained models supported in the service."
# show CLS static pretrained model
from paddlespeech.server.engine.cls.paddleinference.cls_engine import pretrained_models
logger.info(
"Here is the table of CLS static pretrained models supported in the service."
)
self.show_support_models(pretrained_models)
return True
except BaseException:
print(
"Failed to get the table of TTS pretrained models supported in the service."
logger.error(
"Failed to get the table of CLS pretrained models supported in the service."
)
return False

@ -9,12 +9,14 @@ port: 8090
# The task format in the engine_list is: <speech task>_<engine type>
# task choices = ['asr_python', 'asr_inference', 'tts_python', 'tts_inference', 'cls_python', 'cls_inference']
engine_list: ['asr_python', 'tts_python']
engine_list: ['asr_python', 'tts_python', 'cls_python']
#################################################################################
# ENGINE CONFIG #
#################################################################################
################################### ASR #########################################
################### speech task: asr; engine_type: python #######################
asr_python:
model: 'conformer_wenetspeech'
@ -46,6 +48,7 @@ asr_inference:
summary: True # False -> do not show predictor config
################################### TTS #########################################
################### speech task: tts; engine_type: python #######################
tts_python:
# am (acoustic model) choices=['speedyspeech_csmsc', 'fastspeech2_csmsc',
@ -105,3 +108,30 @@ tts_inference:
# others
lang: 'zh'
################################### CLS #########################################
################### speech task: cls; engine_type: python #######################
cls_python:
# model choices=['panns_cnn14', 'panns_cnn10', 'panns_cnn6']
model: 'panns_cnn14'
cfg_path: # [optional] Config of cls task.
ckpt_path: # [optional] Checkpoint file of model.
label_file: # [optional] Label file of cls task.
device: # set 'gpu:id' or 'cpu'
################### speech task: cls; engine_type: inference #######################
cls_inference:
# model_type choices=['panns_cnn14', 'panns_cnn10', 'panns_cnn6']
model_type: 'panns_cnn14'
cfg_path:
model_path: # the pdmodel file of am static model [optional]
params_path: # the pdiparams file of am static model [optional]
label_file: # [optional] Label file of cls task.
predictor_conf:
device: # set 'gpu:id' or 'cpu'
switch_ir_optim: True
glog_info: False # True -> print glog
summary: True # False -> do not show predictor config

@ -0,0 +1,13 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@ -0,0 +1,13 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@ -0,0 +1,224 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import io
import os
import time
from typing import Optional
import numpy as np
import paddle
import yaml
from paddlespeech.cli.cls.infer import CLSExecutor
from paddlespeech.cli.log import logger
from paddlespeech.cli.utils import download_and_decompress
from paddlespeech.cli.utils import MODEL_HOME
from paddlespeech.server.engine.base_engine import BaseEngine
from paddlespeech.server.utils.paddle_predictor import init_predictor
from paddlespeech.server.utils.paddle_predictor import run_model
__all__ = ['CLSEngine']
pretrained_models = {
"panns_cnn6-32k": {
'url':
'https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn6_static.tar.gz',
'md5':
'da087c31046d23281d8ec5188c1967da',
'cfg_path':
'panns.yaml',
'model_path':
'inference.pdmodel',
'params_path':
'inference.pdiparams',
'label_file':
'audioset_labels.txt',
},
"panns_cnn10-32k": {
'url':
'https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn10_static.tar.gz',
'md5':
'5460cc6eafbfaf0f261cc75b90284ae1',
'cfg_path':
'panns.yaml',
'model_path':
'inference.pdmodel',
'params_path':
'inference.pdiparams',
'label_file':
'audioset_labels.txt',
},
"panns_cnn14-32k": {
'url':
'https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn14_static.tar.gz',
'md5':
'ccc80b194821274da79466862b2ab00f',
'cfg_path':
'panns.yaml',
'model_path':
'inference.pdmodel',
'params_path':
'inference.pdiparams',
'label_file':
'audioset_labels.txt',
},
}
class CLSServerExecutor(CLSExecutor):
def __init__(self):
super().__init__()
pass
def _get_pretrained_path(self, tag: str) -> os.PathLike:
"""
Download and returns pretrained resources path of current task.
"""
support_models = list(pretrained_models.keys())
assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
tag, '\n\t\t'.join(support_models))
res_path = os.path.join(MODEL_HOME, tag)
decompressed_path = download_and_decompress(pretrained_models[tag],
res_path)
decompressed_path = os.path.abspath(decompressed_path)
logger.info(
'Use pretrained model stored in: {}'.format(decompressed_path))
return decompressed_path
def _init_from_path(
self,
model_type: str='panns_cnn14',
cfg_path: Optional[os.PathLike]=None,
model_path: Optional[os.PathLike]=None,
params_path: Optional[os.PathLike]=None,
label_file: Optional[os.PathLike]=None,
predictor_conf: dict=None, ):
"""
Init model and other resources from a specific path.
"""
if cfg_path is None or model_path is None or params_path is None or label_file is None:
tag = model_type + '-' + '32k'
self.res_path = self._get_pretrained_path(tag)
self.cfg_path = os.path.join(self.res_path,
pretrained_models[tag]['cfg_path'])
self.model_path = os.path.join(self.res_path,
pretrained_models[tag]['model_path'])
self.params_path = os.path.join(
self.res_path, pretrained_models[tag]['params_path'])
self.label_file = os.path.join(self.res_path,
pretrained_models[tag]['label_file'])
else:
self.cfg_path = os.path.abspath(cfg_path)
self.model_path = os.path.abspath(model_path)
self.params_path = os.path.abspath(params_path)
self.label_file = os.path.abspath(label_file)
logger.info(self.cfg_path)
logger.info(self.model_path)
logger.info(self.params_path)
logger.info(self.label_file)
# config
with open(self.cfg_path, 'r') as f:
self._conf = yaml.safe_load(f)
logger.info("Read cfg file successfully.")
# labels
self._label_list = []
with open(self.label_file, 'r') as f:
for line in f:
self._label_list.append(line.strip())
logger.info("Read label file successfully.")
# Create predictor
self.predictor_conf = predictor_conf
self.predictor = init_predictor(
model_file=self.model_path,
params_file=self.params_path,
predictor_conf=self.predictor_conf)
logger.info("Create predictor successfully.")
@paddle.no_grad()
def infer(self):
"""
Model inference and result stored in self.output.
"""
output = run_model(self.predictor, [self._inputs['feats'].numpy()])
self._outputs['logits'] = output[0]
class CLSEngine(BaseEngine):
"""CLS server engine
Args:
metaclass: Defaults to Singleton.
"""
def __init__(self):
super(CLSEngine, self).__init__()
def init(self, config: dict) -> bool:
"""init engine resource
Args:
config (dict): engine config
Returns:
bool: init failed or success
"""
self.executor = CLSServerExecutor()
self.config = config
self.executor._init_from_path(
self.config.model_type, self.config.cfg_path,
self.config.model_path, self.config.params_path,
self.config.label_file, self.config.predictor_conf)
logger.info("Initialize CLS server engine successfully.")
return True
def run(self, audio_data):
"""engine run
Args:
audio_data (bytes): audio bytes decoded from the base64 request payload
"""
self.executor.preprocess(io.BytesIO(audio_data))
st = time.time()
self.executor.infer()
infer_time = time.time() - st
logger.info("inference time: {}".format(infer_time))
logger.info("cls engine type: inference")
def postprocess(self, topk: int):
"""postprocess
"""
assert topk <= len(self.executor._label_list
), 'Value of topk is larger than number of labels.'
result = np.squeeze(self.executor._outputs['logits'], axis=0)
topk_idx = (-result).argsort()[:topk]
topk_results = []
for idx in topk_idx:
res = {}
label, score = self.executor._label_list[idx], result[idx]
res['class_name'] = label
res['prob'] = score
topk_results.append(res)
return topk_results

@ -0,0 +1,13 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@ -0,0 +1,124 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import io
import time
from typing import List
import paddle
from paddlespeech.cli.cls.infer import CLSExecutor
from paddlespeech.cli.log import logger
from paddlespeech.server.engine.base_engine import BaseEngine
__all__ = ['CLSEngine']
class CLSServerExecutor(CLSExecutor):
def __init__(self):
super().__init__()
pass
def get_topk_results(self, topk: int) -> List:
assert topk <= len(
self._label_list), 'Value of topk is larger than number of labels.'
result = self._outputs['logits'].squeeze(0).numpy()
topk_idx = (-result).argsort()[:topk]
topk_results = []
for idx in topk_idx:
res = {}
label, score = self._label_list[idx], result[idx]
res['class'] = label
res['prob'] = score
topk_results.append(res)
return topk_results
class CLSEngine(BaseEngine):
"""CLS server engine
Args:
metaclass: Defaults to Singleton.
"""
def __init__(self):
super(CLSEngine, self).__init__()
def init(self, config: dict) -> bool:
"""init engine resource
Args:
config (dict): engine config
Returns:
bool: init failed or success
"""
self.input = None
self.output = None
self.executor = CLSServerExecutor()
self.config = config
try:
if self.config.device:
self.device = self.config.device
else:
self.device = paddle.get_device()
paddle.set_device(self.device)
except BaseException:
logger.error(
"Set device failed, please check if device is already used and the parameter 'device' in the yaml file"
)
try:
self.executor._init_from_path(
self.config.model, self.config.cfg_path, self.config.ckpt_path,
self.config.label_file)
except BaseException:
logger.error("Initialize CLS server engine Failed.")
return False
logger.info("Initialize CLS server engine successfully on device: %s." %
(self.device))
return True
def run(self, audio_data):
"""engine run
Args:
audio_data (bytes): audio bytes decoded from the base64 request payload
"""
self.executor.preprocess(io.BytesIO(audio_data))
st = time.time()
self.executor.infer()
infer_time = time.time() - st
logger.info("inference time: {}".format(infer_time))
logger.info("cls engine type: python")
def postprocess(self, topk: int):
"""postprocess
"""
assert topk <= len(self.executor._label_list
), 'Value of topk is larger than number of labels.'
result = self.executor._outputs['logits'].squeeze(0).numpy()
topk_idx = (-result).argsort()[:topk]
topk_results = []
for idx in topk_idx:
res = {}
label, score = self.executor._label_list[idx], result[idx]
res['class_name'] = label
res['prob'] = score
topk_results.append(res)
return topk_results
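
The top-k selection in postprocess is a negated argsort over the logits; a tiny illustration on made-up scores:

```python
# Tiny illustration of the (-result).argsort()[:topk] pattern used in
# postprocess above; the scores and labels are made up for the example.
import numpy as np

result = np.array([0.10, 0.70, 0.05, 0.15])
labels = ['Dog', 'Speech', 'Cat', 'Music']
topk_idx = (-result).argsort()[:2]
print([(labels[i], float(result[i])) for i in topk_idx])  # [('Speech', 0.7), ('Music', 0.15)]
```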

@ -31,5 +31,11 @@ class EngineFactory(object):
elif engine_name == 'tts' and engine_type == 'python':
from paddlespeech.server.engine.tts.python.tts_engine import TTSEngine
return TTSEngine()
elif engine_name == 'cls' and engine_type == 'inference':
from paddlespeech.server.engine.cls.paddleinference.cls_engine import CLSEngine
return CLSEngine()
elif engine_name == 'cls' and engine_type == 'python':
from paddlespeech.server.engine.cls.python.cls_engine import CLSEngine
return CLSEngine()
else:
return None

@ -250,27 +250,21 @@ class TTSServerExecutor(TTSExecutor):
self.frontend = English(phone_vocab_path=self.phones_dict)
logger.info("frontend done!")
try:
# am predictor
self.am_predictor_conf = am_predictor_conf
self.am_predictor = init_predictor(
model_file=self.am_model,
params_file=self.am_params,
predictor_conf=self.am_predictor_conf)
logger.info("Create AM predictor successfully.")
except BaseException:
logger.error("Failed to create AM predictor.")
try:
# voc predictor
self.voc_predictor_conf = voc_predictor_conf
self.voc_predictor = init_predictor(
model_file=self.voc_model,
params_file=self.voc_params,
predictor_conf=self.voc_predictor_conf)
logger.info("Create Vocoder predictor successfully.")
except BaseException:
logger.error("Failed to create Vocoder predictor.")
# Create am predictor
self.am_predictor_conf = am_predictor_conf
self.am_predictor = init_predictor(
model_file=self.am_model,
params_file=self.am_params,
predictor_conf=self.am_predictor_conf)
logger.info("Create AM predictor successfully.")
# Create voc predictor
self.voc_predictor_conf = voc_predictor_conf
self.voc_predictor = init_predictor(
model_file=self.voc_model,
params_file=self.voc_params,
predictor_conf=self.voc_predictor_conf)
logger.info("Create Vocoder predictor successfully.")
@paddle.no_grad()
def infer(self,
@ -359,27 +353,22 @@ class TTSEngine(BaseEngine):
def init(self, config: dict) -> bool:
self.executor = TTSServerExecutor()
try:
self.config = config
self.executor._init_from_path(
am=self.config.am,
am_model=self.config.am_model,
am_params=self.config.am_params,
am_sample_rate=self.config.am_sample_rate,
phones_dict=self.config.phones_dict,
tones_dict=self.config.tones_dict,
speaker_dict=self.config.speaker_dict,
voc=self.config.voc,
voc_model=self.config.voc_model,
voc_params=self.config.voc_params,
voc_sample_rate=self.config.voc_sample_rate,
lang=self.config.lang,
am_predictor_conf=self.config.am_predictor_conf,
voc_predictor_conf=self.config.voc_predictor_conf, )
except BaseException:
logger.error("Initialize TTS server engine Failed.")
return False
self.config = config
self.executor._init_from_path(
am=self.config.am,
am_model=self.config.am_model,
am_params=self.config.am_params,
am_sample_rate=self.config.am_sample_rate,
phones_dict=self.config.phones_dict,
tones_dict=self.config.tones_dict,
speaker_dict=self.config.speaker_dict,
voc=self.config.voc,
voc_model=self.config.voc_model,
voc_params=self.config.voc_params,
voc_sample_rate=self.config.voc_sample_rate,
lang=self.config.lang,
am_predictor_conf=self.config.am_predictor_conf,
voc_predictor_conf=self.config.voc_predictor_conf, )
logger.info("Initialize TTS server engine successfully.")
return True
@ -542,4 +531,4 @@ class TTSEngine(BaseEngine):
postprocess_time))
logger.info("RTF: {}".format(rtf))
return lang, target_sample_rate, wav_base64
return lang, target_sample_rate, duration, wav_base64

@ -250,4 +250,4 @@ class TTSEngine(BaseEngine):
logger.info("RTF: {}".format(rtf))
logger.info("device: {}".format(self.device))
return lang, target_sample_rate, wav_base64
return lang, target_sample_rate, duration, wav_base64

@ -16,6 +16,7 @@ from typing import List
from fastapi import APIRouter
from paddlespeech.server.restful.asr_api import router as asr_router
from paddlespeech.server.restful.cls_api import router as cls_router
from paddlespeech.server.restful.tts_api import router as tts_router
_router = APIRouter()
@ -25,7 +26,7 @@ def setup_router(api_list: List):
"""setup router for fastapi
Args:
api_list (List): [asr, tts]
api_list (List): [asr, tts, cls]
Returns:
APIRouter
@ -35,6 +36,8 @@ def setup_router(api_list: List):
_router.include_router(asr_router)
elif api_name == 'tts':
_router.include_router(tts_router)
elif api_name == 'cls':
_router.include_router(cls_router)
else:
pass

@ -0,0 +1,92 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import base64
import traceback
from typing import Union
from fastapi import APIRouter
from paddlespeech.server.engine.engine_pool import get_engine_pool
from paddlespeech.server.restful.request import CLSRequest
from paddlespeech.server.restful.response import CLSResponse
from paddlespeech.server.restful.response import ErrorResponse
from paddlespeech.server.utils.errors import ErrorCode
from paddlespeech.server.utils.errors import failed_response
from paddlespeech.server.utils.exception import ServerBaseException
router = APIRouter()
@router.get('/paddlespeech/cls/help')
def help():
"""help
Returns:
json: [description]
"""
response = {
"success": "True",
"code": 200,
"message": {
"global": "success"
},
"result": {
"description": "cls server",
"input": "base64 string of wavfile",
"output": "classification result"
}
}
return response
@router.post(
"/paddlespeech/cls", response_model=Union[CLSResponse, ErrorResponse])
def cls(request_body: CLSRequest):
"""cls api
Args:
request_body (CLSRequest): [description]
Returns:
json: [description]
"""
try:
audio_data = base64.b64decode(request_body.audio)
# get single engine from engine pool
engine_pool = get_engine_pool()
cls_engine = engine_pool['cls']
cls_engine.run(audio_data)
cls_results = cls_engine.postprocess(request_body.topk)
response = {
"success": True,
"code": 200,
"message": {
"description": "success"
},
"result": {
"topk": request_body.topk,
"results": cls_results
}
}
except ServerBaseException as e:
response = failed_response(e.error_code, e.msg)
except BaseException:
response = failed_response(ErrorCode.SERVER_UNKOWN_ERR)
traceback.print_exc()
return response
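
The route above can also be exercised directly over HTTP; the hedged sketch below posts base64-encoded audio, with the host, port, and wav path as illustrative assumptions.

```python
# Hedged client-side sketch for the /paddlespeech/cls route defined above;
# the server address, port and wav path are illustrative assumptions.
import base64
import json
import requests

with open('./zh.wav', 'rb') as f:
    audio = base64.b64encode(f.read()).decode('utf8')

payload = {"audio": audio, "topk": 1}
res = requests.post('http://127.0.0.1:8090/paddlespeech/cls', data=json.dumps(payload))
print(res.json()["result"]["results"])
```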

@ -15,7 +15,7 @@ from typing import Optional
from pydantic import BaseModel
__all__ = ['ASRRequest', 'TTSRequest']
__all__ = ['ASRRequest', 'TTSRequest', 'CLSRequest']
#****************************************************************************************/
@ -63,3 +63,18 @@ class TTSRequest(BaseModel):
volume: float = 1.0
sample_rate: int = 0
save_path: str = None
#****************************************************************************************/
#************************************ CLS request ***************************************/
#****************************************************************************************/
class CLSRequest(BaseModel):
"""
request body example
{
"audio": "exSI6ICJlbiIsCgkgICAgInBvc2l0aW9uIjogImZhbHNlIgoJf...",
"topk": 1
}
"""
audio: str
topk: int = 1

@ -11,9 +11,11 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List
from pydantic import BaseModel
__all__ = ['ASRResponse', 'TTSResponse']
__all__ = ['ASRResponse', 'TTSResponse', 'CLSResponse']
class Message(BaseModel):
@ -52,10 +54,11 @@ class ASRResponse(BaseModel):
#****************************************************************************************/
class TTSResult(BaseModel):
lang: str = "zh"
sample_rate: int
spk_id: int = 0
speed: float = 1.0
volume: float = 1.0
sample_rate: int
duration: float
save_path: str = None
audio: str
@ -71,9 +74,11 @@ class TTSResponse(BaseModel):
},
"result": {
"lang": "zh",
"sample_rate": 24000,
"spk_id": 0,
"speed": 1.0,
"volume": 1.0,
"sample_rate": 24000,
"duration": 3.6125,
"audio": "LTI1OTIuNjI1OTUwMzQsOTk2OS41NDk4...",
"save_path": "./tts.wav"
}
@ -85,6 +90,45 @@ class TTSResponse(BaseModel):
result: TTSResult
#****************************************************************************************/
#************************************ CLS response **************************************/
#****************************************************************************************/
class CLSResults(BaseModel):
class_name: str
prob: float
class CLSResult(BaseModel):
topk: int
results: List[CLSResults]
class CLSResponse(BaseModel):
"""
response example
{
"success": true,
"code": 0,
"message": {
"description": "success"
},
"result": {
"topk": 1,
"results": [
{
"class_name": "Speech",
"prob": 0.9027184844017029
}
]
}
}
"""
success: bool
code: int
message: Message
result: CLSResult
#****************************************************************************************/
#********************************** Error response **************************************/
#****************************************************************************************/

@ -98,7 +98,7 @@ def tts(request_body: TTSRequest):
tts_engine = engine_pool['tts']
logger.info("Get tts engine successfully.")
lang, target_sample_rate, wav_base64 = tts_engine.run(
lang, target_sample_rate, duration, wav_base64 = tts_engine.run(
text, spk_id, speed, volume, sample_rate, save_path)
response = {
@ -113,6 +113,7 @@ def tts(request_body: TTSRequest):
"speed": speed,
"volume": volume,
"sample_rate": target_sample_rate,
"duration": duration,
"save_path": save_path,
"audio": wav_base64
}

@ -35,10 +35,12 @@ def init_predictor(model_dir: Optional[os.PathLike]=None,
Returns:
predictor (PaddleInferPredictor): created predictor
"""
if model_dir is not None:
assert os.path.isdir(model_dir), 'Please check model dir.'
config = Config(model_dir)
else:
assert os.path.isfile(model_file) and os.path.isfile(
params_file), 'Please check model and parameter files.'
config = Config(model_file, params_file)
# set device
@ -66,7 +68,6 @@ def init_predictor(model_dir: Optional[os.PathLike]=None,
config.enable_memory_optim()
predictor = create_predictor(config)
return predictor
@ -84,10 +85,8 @@ def run_model(predictor, input: List) -> List:
for i, name in enumerate(input_names):
input_handle = predictor.get_input_handle(name)
input_handle.copy_from_cpu(input[i])
# do the inference
predictor.run()
results = []
# get out data from output tensor
output_names = predictor.get_output_names()

@ -0,0 +1,100 @@
009901 昨日,这名伤者与医生全部被警方依法刑事拘留。
009902 钱伟长想到上海来办学校是经过深思熟虑的。
009903 她见我一进门就骂,吃饭时也骂,骂得我抬不起头。
009904 李述德在离开之前,只说了一句柱驼杀父亲了。
009905 这种车票和保险单捆绑出售属于重复性购买。
009906 戴佩妮的男友西米露接唱情歌,让她非常开心。
009907 观大势,谋大局,出大策始终是该院的办院方针。
009908 他们骑着摩托回家,正好为农忙时的父母帮忙。
009909 但是因为还没到退休年龄,只能掰着指头捱日子。
009910 这几天雨水不断,人们恨不得待在家里不出门。
009911 没想到徐赟,张海翔两人就此玩起了人间蒸发。
009912 藤村此番发言可能是为了凸显野田的领导能力。
009913 程长庚,生在清王朝嘉庆年间,安徽的潜山小县。
009914 南海海域综合补给基地码头项目正在论证中。
009915 也就是说今晚成都市民极有可能再次看到飘雪。
009916 随着天气转热,各地的游泳场所开始人头攒动。
009917 更让徐先生纳闷的是,房客的手机也打不通了。
009918 遇到颠簸时,应听从乘务员的安全指令,回座位坐好。
009919 他在后面呆惯了,怕自己一插身后的人会不满,不敢排进去。
009920 傍晚七个小人回来了,白雪公主说,你们就是我命中的七个小矮人吧。
009921 他本想说,教育局管这个,他们是一路的,这样一管岂不是妓女起嫖客?
009922 一种表示商品所有权的财物证券,也称商品证券,如提货单,交货单。
009923 会有很丰富的东西留下来,说都说不完。
009924 这句话像从天而降,吓得四周一片寂静。
009925 记者所在的是受害人家属所在的右区。
009926 不管哈大爷去哪,它都一步不离地跟着。
009927 大家抬头望去,一只老鼠正趴在吊顶上。
009928 我决定过年就辞职,接手我爸的废品站!
009929 最终,中国男子乒乓球队获得此奖项。
009930 防汛抗旱两手抓,抗旱相对抓的不够。
009931 图们江下游地区开发开放的进展如何?
009932 这要求中国必须有一个坚强的政党领导。
009933 再说,关于利益上的事俺俩都不好开口。
009934 明代瓦剌,鞑靼入侵明境也是通过此地。
009935 咪咪舔着孩子,把它身上的毛舔干净。
009936 是否这次的国标修订被大企业绑架了?
009937 判决后,姚某妻子胡某不服,提起上诉。
009938 由此可以看出邯钢的经济效益来自何处。
009939 琳达说,是瑜伽改变了她和马儿的生活。
009940 楼下的保安告诉记者,这里不租也不卖。
009941 习近平说,中斯两国人民传统友谊深厚。
009942 传闻越来越多,后来连老汉儿自己都怕了。
009943 我怒吼一声冲上去,举起砖头砸了过去。
009944 我现在还不会,这就回去问问发明我的人。
009945 显然,洛阳性奴案不具备上述两个前提。
009946 另外,杰克逊有文唇线,眼线,眉毛的动作。
009947 昨晚,华西都市报记者电话采访了尹琪。
009948 涅拉季科未透露这些航空公司的名称。
009949 从运行轨迹上来说,它也不可能是星星。
009950 目前看,如果继续加息也存在两难问题。
009951 曾宝仪在节目录制现场大爆观众糗事。
009952 但任凭周某怎么叫,男子仍酣睡不醒。
009953 老大爷说,小子,你挡我财路了,知道不?
009954 没料到,闯下大头佛的阿伟还不知悔改。
009955 卡扎菲部落式统治已遭遇部落内讧。
009956 这个孩子的生命一半来源于另一位女士捐赠的冷冻卵子。
009957 出现这种泥鳅内阁的局面既是野田有意为之,也实属无奈。
009958 济青高速济南,华山,章丘,邹平,周村,淄博,临淄站。
009959 赵凌飞的话,反映了沈阳赛区所有奥运志愿者的共同心声。
009960 因为,我们所发出的力量必会因难度加大而减弱。
009961 发生事故的楼梯拐角处仍可看到血迹。
009962 想过进公安,可能身高不够,老汉儿也不让我进去。
009963 路上关卡很多,为了方便撤离,只好轻装前进。
009964 原来比尔盖茨就是美国微软公司联合创始人呀。
009965 之后他们一家三口将与双方父母往峇里岛旅游。
009966 谢谢总理,也感谢广大网友的参与,我们明年再见。
009967 事实上是,从来没有一个欺善怕恶的人能作出过稍大一点的成就。
009968 我会打开邮件,你可以从那里继续。
009969 美方对近期东海局势表示关切。
009970 据悉,奥巴马一家人对这座冬季白宫极为满意。
009971 打扫完你会很有成就感的,试一试,你就信了。
009972 诺曼站在滑板车上,各就各位,准备出发啦!
009973 塔河的寒夜,气温降到了零下三十多摄氏度。
009974 其间,连破六点六,六点五,六点四,六点三五等多个重要关口。
009975 算命其实只是人们的一种自我安慰和自我暗示而已,我们还是要相信科学才好。
009976 这一切都令人欢欣鼓舞,阿讷西没理由不坚持到最后。
009977 直至公元前一万一千年,它又再次出现。
009978 尽量少玩电脑,少看电视,少打游戏。
009979 从五到七,前后也就是六个月的时间。
009980 一进咖啡店,他就遇见一张熟悉的脸。
009981 好在众弟兄看到了把她追了回来。
009982 有一个人说,哥们儿我们跑过它才能活。
009983 捅了她以后,模糊记得她没咋动了。
009984 从小到大,葛启义没有收到过压岁钱。
009985 舞台下的你会对舞台上的你说什么?
009986 但考生普遍认为,试题的怪多过难。
009987 我希望每个人都能够尊重我们的隐私。
009988 漫天的红霞使劲给两人增添气氛。
009989 晚上加完班开车回家,太累了,迷迷糊糊开着车,走一半的时候,铛一声!
009990 该车将三人撞倒后,在大雾中逃窜。
009991 这人一哆嗦,方向盘也把不稳了,差点撞上了高速边道护栏。
009992 那女孩儿委屈的说,我一回头见你已经进去了我不敢进去啊!
009993 小明摇摇头说,不是,我只是美女看多了,想换个口味而已。
009994 接下来,红娘要求记者交费,记者表示不知表姐身份证号码。
009995 李东蓊表示,自己当时在法庭上发表了一次独特的公诉意见。
009996 另一男子扑了上来,手里拿着明晃晃的长刀,向他胸口直刺。
009997 今天,快递员拿着一个快递在办公室喊,秦王是哪个,有他快递?
009998 这场抗议活动究竟是如何发展演变的,又究竟是谁伤害了谁?
009999 因华国锋肖鸡,墓地设计根据其属相设计。
010000 在狱中,张明宝悔恨交加,写了一份忏悔书。

@ -34,7 +34,7 @@ def main():
"--generator-type",
type=str,
default="pwgan",
help="type of GANVocoder, should in {pwgan, mb_melgan, style_melgan, } now"
help="type of GANVocoder, should in {pwgan, mb_melgan, style_melgan, hifigan, } now"
)
parser.add_argument("--config", type=str, help="GANVocoder config file.")
parser.add_argument("--checkpoint", type=str, help="snapshot to load.")

@ -17,13 +17,92 @@ from pathlib import Path
import numpy
import soundfile as sf
from paddle import inference
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from timer import timer
from paddlespeech.t2s.exps.syn_utils import get_frontend
from paddlespeech.t2s.exps.syn_utils import get_sentences
from paddlespeech.t2s.utils import str2bool
def get_predictor(args, filed='am'):
full_name = ''
if filed == 'am':
full_name = args.am
elif filed == 'voc':
full_name = args.voc
model_name = full_name[:full_name.rindex('_')]
config = inference.Config(
str(Path(args.inference_dir) / (full_name + ".pdmodel")),
str(Path(args.inference_dir) / (full_name + ".pdiparams")))
if args.device == "gpu":
config.enable_use_gpu(100, 0)
elif args.device == "cpu":
config.disable_gpu()
# enable_memory_optim() must be skipped for fastspeech2, otherwise inference will OOM
if model_name != 'fastspeech2':
config.enable_memory_optim()
predictor = inference.create_predictor(config)
return predictor
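# Usage sketch (this is exactly how main() below builds the two predictors):
#     am_predictor = get_predictor(args, filed='am')
#     voc_predictor = get_predictor(args, filed='voc')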
# only inference for models trained with csmsc now
def main():
def get_am_output(args, am_predictor, frontend, merge_sentences, input):
am_name = args.am[:args.am.rindex('_')]
am_dataset = args.am[args.am.rindex('_') + 1:]
am_input_names = am_predictor.get_input_names()
get_tone_ids = False
get_spk_id = False
if am_name == 'speedyspeech':
get_tone_ids = True
if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
get_spk_id = True
spk_id = numpy.array([args.spk_id])
if args.lang == 'zh':
input_ids = frontend.get_input_ids(
input, merge_sentences=merge_sentences, get_tone_ids=get_tone_ids)
phone_ids = input_ids["phone_ids"]
elif args.lang == 'en':
input_ids = frontend.get_input_ids(
input, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
else:
print("lang should in {'zh', 'en'}!")
if get_tone_ids:
tone_ids = input_ids["tone_ids"]
tones = tone_ids[0].numpy()
tones_handle = am_predictor.get_input_handle(am_input_names[1])
tones_handle.reshape(tones.shape)
tones_handle.copy_from_cpu(tones)
if get_spk_id:
spk_id_handle = am_predictor.get_input_handle(am_input_names[1])
spk_id_handle.reshape(spk_id.shape)
spk_id_handle.copy_from_cpu(spk_id)
phones = phone_ids[0].numpy()
phones_handle = am_predictor.get_input_handle(am_input_names[0])
phones_handle.reshape(phones.shape)
phones_handle.copy_from_cpu(phones)
am_predictor.run()
am_output_names = am_predictor.get_output_names()
am_output_handle = am_predictor.get_output_handle(am_output_names[0])
am_output_data = am_output_handle.copy_to_cpu()
return am_output_data
def get_voc_output(args, voc_predictor, input):
voc_input_names = voc_predictor.get_input_names()
mel_handle = voc_predictor.get_input_handle(voc_input_names[0])
mel_handle.reshape(input.shape)
mel_handle.copy_from_cpu(input)
voc_predictor.run()
voc_output_names = voc_predictor.get_output_names()
voc_output_handle = voc_predictor.get_output_handle(voc_output_names[0])
wav = voc_output_handle.copy_to_cpu()
return wav
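# Hedged sketch of how the helpers above compose for one sentence (mirrors the
# synthesis loop in main() below; the input sentence is taken from the CSMSC
# test set added in this PR):
#     mel = get_am_output(args, am_predictor=am_predictor, frontend=frontend,
#                         merge_sentences=True,
#                         input="昨日,这名伤者与医生全部被警方依法刑事拘留。")
#     wav = get_voc_output(args, voc_predictor=voc_predictor, input=mel)
#     sf.write("009901.wav", wav, samplerate=24000)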
def parse_args():
parser = argparse.ArgumentParser(
description="Paddle Infernce with speedyspeech & parallel wavegan.")
# acoustic model
@ -70,113 +149,97 @@ def main():
parser.add_argument(
"--inference_dir", type=str, help="dir to save inference models")
parser.add_argument("--output_dir", type=str, help="output dir")
# inference
parser.add_argument(
"--use_trt",
type=str2bool,
default=False,
help="Whether to use inference engin TensorRT.", )
parser.add_argument(
"--int8",
type=str2bool,
default=False,
help="Whether to use int8 inference.", )
parser.add_argument(
"--fp16",
type=str2bool,
default=False,
help="Whether to use float16 inference.", )
parser.add_argument(
"--device",
default="gpu",
choices=["gpu", "cpu"],
help="Device selected for inference.", )
args, _ = parser.parse_known_args()
return args
# only inference for models trained with csmsc now
def main():
args = parse_args()
# frontend
if args.lang == 'zh':
frontend = Frontend(
phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
elif args.lang == 'en':
frontend = English(phone_vocab_path=args.phones_dict)
print("frontend done!")
frontend = get_frontend(args)
# am_predictor
am_predictor = get_predictor(args, filed='am')
# model: {model_name}_{dataset}
am_name = args.am[:args.am.rindex('_')]
am_dataset = args.am[args.am.rindex('_') + 1:]
am_config = inference.Config(
str(Path(args.inference_dir) / (args.am + ".pdmodel")),
str(Path(args.inference_dir) / (args.am + ".pdiparams")))
am_config.enable_use_gpu(100, 0)
# This line must be commented for fastspeech2, if not, it will OOM
if am_name != 'fastspeech2':
am_config.enable_memory_optim()
am_predictor = inference.create_predictor(am_config)
voc_config = inference.Config(
str(Path(args.inference_dir) / (args.voc + ".pdmodel")),
str(Path(args.inference_dir) / (args.voc + ".pdiparams")))
voc_config.enable_use_gpu(100, 0)
voc_config.enable_memory_optim()
voc_predictor = inference.create_predictor(voc_config)
# voc_predictor
voc_predictor = get_predictor(args, filed='voc')
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
sentences = []
print("in new inference")
# construct dataset for evaluation
sentences = []
with open(args.text, 'rt') as f:
for line in f:
items = line.strip().split()
utt_id = items[0]
if args.lang == 'zh':
sentence = "".join(items[1:])
elif args.lang == 'en':
sentence = " ".join(items[1:])
sentences.append((utt_id, sentence))
get_tone_ids = False
get_spk_id = False
if am_name == 'speedyspeech':
get_tone_ids = True
if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
get_spk_id = True
spk_id = numpy.array([args.spk_id])
sentences = get_sentences(args)
am_input_names = am_predictor.get_input_names()
print("am_input_names:", am_input_names)
merge_sentences = True
fs = 24000 if am_dataset != 'ljspeech' else 22050
# warmup
for utt_id, sentence in sentences[:3]:
with timer() as t:
am_output_data = get_am_output(
args,
am_predictor=am_predictor,
frontend=frontend,
merge_sentences=merge_sentences,
input=sentence)
wav = get_voc_output(
args, voc_predictor=voc_predictor, input=am_output_data)
speed = wav.size / t.elapse
rtf = fs / speed
print(
f"{utt_id}, mel: {am_output_data.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
)
print("warm up done!")
N = 0
T = 0
for utt_id, sentence in sentences:
if args.lang == 'zh':
input_ids = frontend.get_input_ids(
sentence,
with timer() as t:
am_output_data = get_am_output(
args,
am_predictor=am_predictor,
frontend=frontend,
merge_sentences=merge_sentences,
get_tone_ids=get_tone_ids)
phone_ids = input_ids["phone_ids"]
elif args.lang == 'en':
input_ids = frontend.get_input_ids(
sentence, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
else:
print("lang should in {'zh', 'en'}!")
if get_tone_ids:
tone_ids = input_ids["tone_ids"]
tones = tone_ids[0].numpy()
tones_handle = am_predictor.get_input_handle(am_input_names[1])
tones_handle.reshape(tones.shape)
tones_handle.copy_from_cpu(tones)
if get_spk_id:
spk_id_handle = am_predictor.get_input_handle(am_input_names[1])
spk_id_handle.reshape(spk_id.shape)
spk_id_handle.copy_from_cpu(spk_id)
phones = phone_ids[0].numpy()
phones_handle = am_predictor.get_input_handle(am_input_names[0])
phones_handle.reshape(phones.shape)
phones_handle.copy_from_cpu(phones)
am_predictor.run()
am_output_names = am_predictor.get_output_names()
am_output_handle = am_predictor.get_output_handle(am_output_names[0])
am_output_data = am_output_handle.copy_to_cpu()
voc_input_names = voc_predictor.get_input_names()
mel_handle = voc_predictor.get_input_handle(voc_input_names[0])
mel_handle.reshape(am_output_data.shape)
mel_handle.copy_from_cpu(am_output_data)
voc_predictor.run()
voc_output_names = voc_predictor.get_output_names()
voc_output_handle = voc_predictor.get_output_handle(voc_output_names[0])
wav = voc_output_handle.copy_to_cpu()
input=sentence)
wav = get_voc_output(
args, voc_predictor=voc_predictor, input=am_output_data)
N += wav.size
T += t.elapse
speed = wav.size / t.elapse
rtf = fs / speed
sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=fs)
print(
f"{utt_id}, mel: {am_output_data.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
)
print(f"{utt_id} done!")
print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T) }")
if __name__ == "__main__":

@ -0,0 +1,243 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import numpy as np
import paddle
from paddle import jit
from paddle.static import InputSpec
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.modules.normalizer import ZScore
model_alias = {
# acoustic model
"speedyspeech":
"paddlespeech.t2s.models.speedyspeech:SpeedySpeech",
"speedyspeech_inference":
"paddlespeech.t2s.models.speedyspeech:SpeedySpeechInference",
"fastspeech2":
"paddlespeech.t2s.models.fastspeech2:FastSpeech2",
"fastspeech2_inference":
"paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
"tacotron2":
"paddlespeech.t2s.models.tacotron2:Tacotron2",
"tacotron2_inference":
"paddlespeech.t2s.models.tacotron2:Tacotron2Inference",
# voc
"pwgan":
"paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
"pwgan_inference":
"paddlespeech.t2s.models.parallel_wavegan:PWGInference",
"mb_melgan":
"paddlespeech.t2s.models.melgan:MelGANGenerator",
"mb_melgan_inference":
"paddlespeech.t2s.models.melgan:MelGANInference",
"style_melgan":
"paddlespeech.t2s.models.melgan:StyleMelGANGenerator",
"style_melgan_inference":
"paddlespeech.t2s.models.melgan:StyleMelGANInference",
"hifigan":
"paddlespeech.t2s.models.hifigan:HiFiGANGenerator",
"hifigan_inference":
"paddlespeech.t2s.models.hifigan:HiFiGANInference",
"wavernn":
"paddlespeech.t2s.models.wavernn:WaveRNN",
"wavernn_inference":
"paddlespeech.t2s.models.wavernn:WaveRNNInference",
}
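# Sketch of how model_alias is consumed via dynamic_import (mirrors
# get_am_inference / get_voc_inference below):
#     am_class = dynamic_import("fastspeech2", model_alias)
#     am_inference_class = dynamic_import("fastspeech2" + "_inference", model_alias)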
# input
def get_sentences(args):
# construct dataset for evaluation
sentences = []
with open(args.text, 'rt') as f:
for line in f:
items = line.strip().split()
utt_id = items[0]
if 'lang' in args and args.lang == 'zh':
sentence = "".join(items[1:])
elif 'lang' in args and args.lang == 'en':
sentence = " ".join(items[1:])
sentences.append((utt_id, sentence))
return sentences
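# The --text file read here holds one "utt_id sentence" pair per line, e.g. the
# CSMSC test sentences added in this PR:
#     009901 昨日,这名伤者与医生全部被警方依法刑事拘留。
# so get_sentences(args) returns [(utt_id, sentence), ...] such as
# ("009901", "昨日,这名伤者与医生全部被警方依法刑事拘留。").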
def get_test_dataset(args, test_metadata, am_name, am_dataset):
if am_name == 'fastspeech2':
fields = ["utt_id", "text"]
if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
print("multiple speaker fastspeech2!")
fields += ["spk_id"]
elif 'voice_cloning' in args and args.voice_cloning:
print("voice cloning!")
fields += ["spk_emb"]
else:
print("single speaker fastspeech2!")
elif am_name == 'speedyspeech':
fields = ["utt_id", "phones", "tones"]
elif am_name == 'tacotron2':
fields = ["utt_id", "text"]
if 'voice_cloning' in args and args.voice_cloning:
print("voice cloning!")
fields += ["spk_emb"]
test_dataset = DataTable(data=test_metadata, fields=fields)
return test_dataset
# frontend
def get_frontend(args):
if 'lang' in args and args.lang == 'zh':
frontend = Frontend(
phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
elif 'lang' in args and args.lang == 'en':
frontend = English(phone_vocab_path=args.phones_dict)
else:
print("wrong lang!")
print("frontend done!")
return frontend
# dygraph
def get_am_inference(args, am_config):
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
print("vocab_size:", vocab_size)
tone_size = None
if 'tones_dict' in args and args.tones_dict:
with open(args.tones_dict, "r") as f:
tone_id = [line.strip().split() for line in f.readlines()]
tone_size = len(tone_id)
print("tone_size:", tone_size)
spk_num = None
if 'speaker_dict' in args and args.speaker_dict:
with open(args.speaker_dict, 'rt') as f:
spk_id = [line.strip().split() for line in f.readlines()]
spk_num = len(spk_id)
print("spk_num:", spk_num)
odim = am_config.n_mels
# model: {model_name}_{dataset}
am_name = args.am[:args.am.rindex('_')]
am_dataset = args.am[args.am.rindex('_') + 1:]
am_class = dynamic_import(am_name, model_alias)
am_inference_class = dynamic_import(am_name + '_inference', model_alias)
if am_name == 'fastspeech2':
am = am_class(
idim=vocab_size, odim=odim, spk_num=spk_num, **am_config["model"])
elif am_name == 'speedyspeech':
am = am_class(
vocab_size=vocab_size,
tone_size=tone_size,
spk_num=spk_num,
**am_config["model"])
elif am_name == 'tacotron2':
am = am_class(idim=vocab_size, odim=odim, **am_config["model"])
am.set_state_dict(paddle.load(args.am_ckpt)["main_params"])
am.eval()
am_mu, am_std = np.load(args.am_stat)
am_mu = paddle.to_tensor(am_mu)
am_std = paddle.to_tensor(am_std)
am_normalizer = ZScore(am_mu, am_std)
am_inference = am_inference_class(am_normalizer, am)
am_inference.eval()
print("acoustic model done!")
return am_inference, am_name, am_dataset
def get_voc_inference(args, voc_config):
# model: {model_name}_{dataset}
voc_name = args.voc[:args.voc.rindex('_')]
voc_class = dynamic_import(voc_name, model_alias)
voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
if voc_name != 'wavernn':
voc = voc_class(**voc_config["generator_params"])
voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
voc.remove_weight_norm()
voc.eval()
else:
voc = voc_class(**voc_config["model"])
voc.set_state_dict(paddle.load(args.voc_ckpt)["main_params"])
voc.eval()
voc_mu, voc_std = np.load(args.voc_stat)
voc_mu = paddle.to_tensor(voc_mu)
voc_std = paddle.to_tensor(voc_std)
voc_normalizer = ZScore(voc_mu, voc_std)
voc_inference = voc_inference_class(voc_normalizer, voc)
voc_inference.eval()
print("voc done!")
return voc_inference
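# Hedged usage sketch (this is how synthesize.py below wires the two helpers;
# the direct calls on phone ids / mel are illustrative, not verbatim from this PR):
#     am_inference, am_name, am_dataset = get_am_inference(args, am_config)
#     voc_inference = get_voc_inference(args, voc_config)
#     wav = voc_inference(am_inference(phone_ids))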
# to static
def am_to_static(args, am_inference, am_name, am_dataset):
if am_name == 'fastspeech2':
if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
am_inference = jit.to_static(
am_inference,
input_spec=[
InputSpec([-1], dtype=paddle.int64),
InputSpec([1], dtype=paddle.int64),
], )
else:
am_inference = jit.to_static(
am_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)])
elif am_name == 'speedyspeech':
if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
am_inference = jit.to_static(
am_inference,
input_spec=[
InputSpec([-1], dtype=paddle.int64), # text
InputSpec([-1], dtype=paddle.int64), # tone
InputSpec([1], dtype=paddle.int64), # spk_id
None # duration
])
else:
am_inference = jit.to_static(
am_inference,
input_spec=[
InputSpec([-1], dtype=paddle.int64),
InputSpec([-1], dtype=paddle.int64)
])
elif am_name == 'tacotron2':
am_inference = jit.to_static(
am_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)])
paddle.jit.save(am_inference, os.path.join(args.inference_dir, args.am))
am_inference = paddle.jit.load(os.path.join(args.inference_dir, args.am))
return am_inference
def voc_to_static(args, voc_inference):
voc_inference = jit.to_static(
voc_inference, input_spec=[
InputSpec([-1, 80], dtype=paddle.float32),
])
paddle.jit.save(voc_inference, os.path.join(args.inference_dir, args.voc))
voc_inference = paddle.jit.load(os.path.join(args.inference_dir, args.voc))
return voc_inference
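# Usage sketch of the dygraph-to-static path (mirrors synthesize_e2e.py below):
#     if args.inference_dir:
#         am_inference = am_to_static(args, am_inference, am_name, am_dataset)
#         voc_inference = voc_to_static(args, voc_inference)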

@ -23,48 +23,11 @@ import yaml
from timer import timer
from yacs.config import CfgNode
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.t2s.exps.syn_utils import get_am_inference
from paddlespeech.t2s.exps.syn_utils import get_test_dataset
from paddlespeech.t2s.exps.syn_utils import get_voc_inference
from paddlespeech.t2s.utils import str2bool
model_alias = {
# acoustic model
"speedyspeech":
"paddlespeech.t2s.models.speedyspeech:SpeedySpeech",
"speedyspeech_inference":
"paddlespeech.t2s.models.speedyspeech:SpeedySpeechInference",
"fastspeech2":
"paddlespeech.t2s.models.fastspeech2:FastSpeech2",
"fastspeech2_inference":
"paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
"tacotron2":
"paddlespeech.t2s.models.tacotron2:Tacotron2",
"tacotron2_inference":
"paddlespeech.t2s.models.tacotron2:Tacotron2Inference",
# voc
"pwgan":
"paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
"pwgan_inference":
"paddlespeech.t2s.models.parallel_wavegan:PWGInference",
"mb_melgan":
"paddlespeech.t2s.models.melgan:MelGANGenerator",
"mb_melgan_inference":
"paddlespeech.t2s.models.melgan:MelGANInference",
"style_melgan":
"paddlespeech.t2s.models.melgan:StyleMelGANGenerator",
"style_melgan_inference":
"paddlespeech.t2s.models.melgan:StyleMelGANInference",
"hifigan":
"paddlespeech.t2s.models.hifigan:HiFiGANGenerator",
"hifigan_inference":
"paddlespeech.t2s.models.hifigan:HiFiGANInference",
"wavernn":
"paddlespeech.t2s.models.wavernn:WaveRNN",
"wavernn_inference":
"paddlespeech.t2s.models.wavernn:WaveRNNInference",
}
def evaluate(args):
# dataloader has been too verbose
@ -86,96 +49,12 @@ def evaluate(args):
print(am_config)
print(voc_config)
# construct dataset for evaluation
# model: {model_name}_{dataset}
am_name = args.am[:args.am.rindex('_')]
am_dataset = args.am[args.am.rindex('_') + 1:]
if am_name == 'fastspeech2':
fields = ["utt_id", "text"]
spk_num = None
if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
print("multiple speaker fastspeech2!")
with open(args.speaker_dict, 'rt') as f:
spk_id = [line.strip().split() for line in f.readlines()]
spk_num = len(spk_id)
fields += ["spk_id"]
elif args.voice_cloning:
print("voice cloning!")
fields += ["spk_emb"]
else:
print("single speaker fastspeech2!")
print("spk_num:", spk_num)
elif am_name == 'speedyspeech':
fields = ["utt_id", "phones", "tones"]
elif am_name == 'tacotron2':
fields = ["utt_id", "text"]
if args.voice_cloning:
print("voice cloning!")
fields += ["spk_emb"]
test_dataset = DataTable(data=test_metadata, fields=fields)
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
print("vocab_size:", vocab_size)
tone_size = None
if args.tones_dict:
with open(args.tones_dict, "r") as f:
tone_id = [line.strip().split() for line in f.readlines()]
tone_size = len(tone_id)
print("tone_size:", tone_size)
# acoustic model
odim = am_config.n_mels
am_class = dynamic_import(am_name, model_alias)
am_inference_class = dynamic_import(am_name + '_inference', model_alias)
if am_name == 'fastspeech2':
am = am_class(
idim=vocab_size, odim=odim, spk_num=spk_num, **am_config["model"])
elif am_name == 'speedyspeech':
am = am_class(
vocab_size=vocab_size, tone_size=tone_size, **am_config["model"])
elif am_name == 'tacotron2':
am = am_class(idim=vocab_size, odim=odim, **am_config["model"])
am.set_state_dict(paddle.load(args.am_ckpt)["main_params"])
am.eval()
am_mu, am_std = np.load(args.am_stat)
am_mu = paddle.to_tensor(am_mu)
am_std = paddle.to_tensor(am_std)
am_normalizer = ZScore(am_mu, am_std)
am_inference = am_inference_class(am_normalizer, am)
print("am_inference.training0:", am_inference.training)
am_inference.eval()
print("acoustic model done!")
am_inference, am_name, am_dataset = get_am_inference(args, am_config)
test_dataset = get_test_dataset(args, test_metadata, am_name, am_dataset)
# vocoder
# model: {model_name}_{dataset}
voc_name = args.voc[:args.voc.rindex('_')]
voc_class = dynamic_import(voc_name, model_alias)
voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
if voc_name != 'wavernn':
voc = voc_class(**voc_config["generator_params"])
voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
voc.remove_weight_norm()
voc.eval()
else:
voc = voc_class(**voc_config["model"])
voc.set_state_dict(paddle.load(args.voc_ckpt)["main_params"])
voc.eval()
voc_mu, voc_std = np.load(args.voc_stat)
voc_mu = paddle.to_tensor(voc_mu)
voc_std = paddle.to_tensor(voc_std)
voc_normalizer = ZScore(voc_mu, voc_std)
voc_inference = voc_inference_class(voc_normalizer, voc)
print("voc_inference.training0:", voc_inference.training)
voc_inference.eval()
print("voc done!")
voc_inference = get_voc_inference(args, voc_config)
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
@ -227,7 +106,7 @@ def evaluate(args):
print(f"generation speed: {N / T}Hz, RTF: {am_config.fs / (N / T) }")
def main():
def parse_args():
# parse args and config and redirect to train_sp
parser = argparse.ArgumentParser(
description="Synthesize with acoustic model & vocoder")
@ -264,7 +143,6 @@ def main():
"--tones_dict", type=str, default=None, help="tone vocabulary file.")
parser.add_argument(
"--speaker_dict", type=str, default=None, help="speaker id map file.")
parser.add_argument(
"--voice-cloning",
type=str2bool,
@ -278,10 +156,10 @@ def main():
choices=[
'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk',
'mb_melgan_csmsc', 'wavernn_csmsc', 'hifigan_csmsc',
'hifigan_ljspeech', 'hifigan_aishell3', 'hifigan_vctk',
'style_melgan_csmsc'
],
help='Choose vocoder type of tts task.')
parser.add_argument(
'--voc_config',
type=str,
@ -302,7 +180,12 @@ def main():
parser.add_argument("--output_dir", type=str, help="output dir.")
args = parser.parse_args()
return args
def main():
args = parse_args()
if args.ngpu == 0:
paddle.set_device("cpu")
elif args.ngpu > 0:

@ -12,59 +12,20 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
import yaml
from paddle import jit
from paddle.static import InputSpec
from timer import timer
from yacs.config import CfgNode
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.modules.normalizer import ZScore
model_alias = {
# acoustic model
"speedyspeech":
"paddlespeech.t2s.models.speedyspeech:SpeedySpeech",
"speedyspeech_inference":
"paddlespeech.t2s.models.speedyspeech:SpeedySpeechInference",
"fastspeech2":
"paddlespeech.t2s.models.fastspeech2:FastSpeech2",
"fastspeech2_inference":
"paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
"tacotron2":
"paddlespeech.t2s.models.tacotron2:Tacotron2",
"tacotron2_inference":
"paddlespeech.t2s.models.tacotron2:Tacotron2Inference",
# voc
"pwgan":
"paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
"pwgan_inference":
"paddlespeech.t2s.models.parallel_wavegan:PWGInference",
"mb_melgan":
"paddlespeech.t2s.models.melgan:MelGANGenerator",
"mb_melgan_inference":
"paddlespeech.t2s.models.melgan:MelGANInference",
"style_melgan":
"paddlespeech.t2s.models.melgan:StyleMelGANGenerator",
"style_melgan_inference":
"paddlespeech.t2s.models.melgan:StyleMelGANInference",
"hifigan":
"paddlespeech.t2s.models.hifigan:HiFiGANGenerator",
"hifigan_inference":
"paddlespeech.t2s.models.hifigan:HiFiGANInference",
"wavernn":
"paddlespeech.t2s.models.wavernn:WaveRNN",
"wavernn_inference":
"paddlespeech.t2s.models.wavernn:WaveRNNInference",
}
from paddlespeech.t2s.exps.syn_utils import am_to_static
from paddlespeech.t2s.exps.syn_utils import get_am_inference
from paddlespeech.t2s.exps.syn_utils import get_frontend
from paddlespeech.t2s.exps.syn_utils import get_sentences
from paddlespeech.t2s.exps.syn_utils import get_voc_inference
from paddlespeech.t2s.exps.syn_utils import voc_to_static
def evaluate(args):
@ -81,151 +42,24 @@ def evaluate(args):
print(am_config)
print(voc_config)
# construct dataset for evaluation
sentences = []
with open(args.text, 'rt') as f:
for line in f:
items = line.strip().split()
utt_id = items[0]
if args.lang == 'zh':
sentence = "".join(items[1:])
elif args.lang == 'en':
sentence = " ".join(items[1:])
sentences.append((utt_id, sentence))
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
print("vocab_size:", vocab_size)
tone_size = None
if args.tones_dict:
with open(args.tones_dict, "r") as f:
tone_id = [line.strip().split() for line in f.readlines()]
tone_size = len(tone_id)
print("tone_size:", tone_size)
spk_num = None
if args.speaker_dict:
with open(args.speaker_dict, 'rt') as f:
spk_id = [line.strip().split() for line in f.readlines()]
spk_num = len(spk_id)
print("spk_num:", spk_num)
sentences = get_sentences(args)
# frontend
if args.lang == 'zh':
frontend = Frontend(
phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
elif args.lang == 'en':
frontend = English(phone_vocab_path=args.phones_dict)
print("frontend done!")
frontend = get_frontend(args)
# acoustic model
odim = am_config.n_mels
# model: {model_name}_{dataset}
am_name = args.am[:args.am.rindex('_')]
am_dataset = args.am[args.am.rindex('_') + 1:]
am_class = dynamic_import(am_name, model_alias)
am_inference_class = dynamic_import(am_name + '_inference', model_alias)
if am_name == 'fastspeech2':
am = am_class(
idim=vocab_size, odim=odim, spk_num=spk_num, **am_config["model"])
elif am_name == 'speedyspeech':
am = am_class(
vocab_size=vocab_size,
tone_size=tone_size,
spk_num=spk_num,
**am_config["model"])
elif am_name == 'tacotron2':
am = am_class(idim=vocab_size, odim=odim, **am_config["model"])
am.set_state_dict(paddle.load(args.am_ckpt)["main_params"])
am.eval()
am_mu, am_std = np.load(args.am_stat)
am_mu = paddle.to_tensor(am_mu)
am_std = paddle.to_tensor(am_std)
am_normalizer = ZScore(am_mu, am_std)
am_inference = am_inference_class(am_normalizer, am)
am_inference.eval()
print("acoustic model done!")
am_inference, am_name, am_dataset = get_am_inference(args, am_config)
# vocoder
# model: {model_name}_{dataset}
voc_name = args.voc[:args.voc.rindex('_')]
voc_class = dynamic_import(voc_name, model_alias)
voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
if voc_name != 'wavernn':
voc = voc_class(**voc_config["generator_params"])
voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
voc.remove_weight_norm()
voc.eval()
else:
voc = voc_class(**voc_config["model"])
voc.set_state_dict(paddle.load(args.voc_ckpt)["main_params"])
voc.eval()
voc_mu, voc_std = np.load(args.voc_stat)
voc_mu = paddle.to_tensor(voc_mu)
voc_std = paddle.to_tensor(voc_std)
voc_normalizer = ZScore(voc_mu, voc_std)
voc_inference = voc_inference_class(voc_normalizer, voc)
voc_inference.eval()
print("voc done!")
voc_inference = get_voc_inference(args, voc_config)
# whether dygraph to static
if args.inference_dir:
# acoustic model
if am_name == 'fastspeech2':
if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
am_inference = jit.to_static(
am_inference,
input_spec=[
InputSpec([-1], dtype=paddle.int64),
InputSpec([1], dtype=paddle.int64)
])
else:
am_inference = jit.to_static(
am_inference,
input_spec=[InputSpec([-1], dtype=paddle.int64)])
elif am_name == 'speedyspeech':
if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
am_inference = jit.to_static(
am_inference,
input_spec=[
InputSpec([-1], dtype=paddle.int64), # text
InputSpec([-1], dtype=paddle.int64), # tone
InputSpec([1], dtype=paddle.int64), # spk_id
None # duration
])
else:
am_inference = jit.to_static(
am_inference,
input_spec=[
InputSpec([-1], dtype=paddle.int64),
InputSpec([-1], dtype=paddle.int64)
])
elif am_name == 'tacotron2':
am_inference = jit.to_static(
am_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)])
paddle.jit.save(am_inference, os.path.join(args.inference_dir, args.am))
am_inference = paddle.jit.load(
os.path.join(args.inference_dir, args.am))
am_inference = am_to_static(args, am_inference, am_name, am_dataset)
# vocoder
voc_inference = jit.to_static(
voc_inference,
input_spec=[
InputSpec([-1, 80], dtype=paddle.float32),
])
paddle.jit.save(voc_inference,
os.path.join(args.inference_dir, args.voc))
voc_inference = paddle.jit.load(
os.path.join(args.inference_dir, args.voc))
voc_inference = voc_to_static(args, voc_inference)
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
@ -298,7 +132,7 @@ def evaluate(args):
print(f"generation speed: {N / T}Hz, RTF: {am_config.fs / (N / T) }")
def main():
def parse_args():
# parse args and config and redirect to train_sp
parser = argparse.ArgumentParser(
description="Synthesize with acoustic model & vocoder")
@ -346,12 +180,19 @@ def main():
type=str,
default='pwgan_csmsc',
choices=[
'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk',
'mb_melgan_csmsc', 'style_melgan_csmsc', 'hifigan_csmsc',
'wavernn_csmsc'
'pwgan_csmsc',
'pwgan_ljspeech',
'pwgan_aishell3',
'pwgan_vctk',
'mb_melgan_csmsc',
'style_melgan_csmsc',
'hifigan_csmsc',
'hifigan_ljspeech',
'hifigan_aishell3',
'hifigan_vctk',
'wavernn_csmsc',
],
help='Choose vocoder type of tts task.')
parser.add_argument(
'--voc_config',
type=str,
@ -386,6 +227,11 @@ def main():
parser.add_argument("--output_dir", type=str, help="output dir.")
args = parser.parse_args()
return args
def main():
args = parse_args()
if args.ngpu == 0:
paddle.set_device("cpu")

@ -21,29 +21,12 @@ import soundfile as sf
import yaml
from yacs.config import CfgNode
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.t2s.exps.syn_utils import get_am_inference
from paddlespeech.t2s.exps.syn_utils import get_voc_inference
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.vector.exps.ge2e.audio_processor import SpeakerVerificationPreprocessor
from paddlespeech.vector.models.lstm_speaker_encoder import LSTMSpeakerEncoder
model_alias = {
# acoustic model
"fastspeech2":
"paddlespeech.t2s.models.fastspeech2:FastSpeech2",
"fastspeech2_inference":
"paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
"tacotron2":
"paddlespeech.t2s.models.tacotron2:Tacotron2",
"tacotron2_inference":
"paddlespeech.t2s.models.tacotron2:Tacotron2Inference",
# voc
"pwgan":
"paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
"pwgan_inference":
"paddlespeech.t2s.models.parallel_wavegan:PWGInference",
}
def voice_cloning(args):
# Init body.
@ -79,55 +62,14 @@ def voice_cloning(args):
speaker_encoder.eval()
print("GE2E Done!")
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
print("vocab_size:", vocab_size)
frontend = Frontend(phone_vocab_path=args.phones_dict)
print("frontend done!")
# acoustic model
odim = am_config.n_mels
# model: {model_name}_{dataset}
am_name = args.am[:args.am.rindex('_')]
am_dataset = args.am[args.am.rindex('_') + 1:]
am_class = dynamic_import(am_name, model_alias)
am_inference_class = dynamic_import(am_name + '_inference', model_alias)
if am_name == 'fastspeech2':
am = am_class(
idim=vocab_size, odim=odim, spk_num=None, **am_config["model"])
elif am_name == 'tacotron2':
am = am_class(idim=vocab_size, odim=odim, **am_config["model"])
am.set_state_dict(paddle.load(args.am_ckpt)["main_params"])
am.eval()
am_mu, am_std = np.load(args.am_stat)
am_mu = paddle.to_tensor(am_mu)
am_std = paddle.to_tensor(am_std)
am_normalizer = ZScore(am_mu, am_std)
am_inference = am_inference_class(am_normalizer, am)
am_inference.eval()
print("acoustic model done!")
am_inference, *_ = get_am_inference(args, am_config)
# vocoder
# model: {model_name}_{dataset}
voc_name = args.voc[:args.voc.rindex('_')]
voc_class = dynamic_import(voc_name, model_alias)
voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
voc = voc_class(**voc_config["generator_params"])
voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
voc.remove_weight_norm()
voc.eval()
voc_mu, voc_std = np.load(args.voc_stat)
voc_mu = paddle.to_tensor(voc_mu)
voc_std = paddle.to_tensor(voc_std)
voc_normalizer = ZScore(voc_mu, voc_std)
voc_inference = voc_inference_class(voc_normalizer, voc)
voc_inference.eval()
print("voc done!")
frontend = Frontend(phone_vocab_path=args.phones_dict)
print("frontend done!")
voc_inference = get_voc_inference(args, voc_config)
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
@ -170,7 +112,7 @@ def voice_cloning(args):
print(f"{utt_id} done!")
def main():
def parse_args():
# parse args and config and redirect to train_sp
parser = argparse.ArgumentParser(description="")
parser.add_argument(
@ -240,6 +182,11 @@ def main():
parser.add_argument("--output-dir", type=str, help="output dir.")
args = parser.parse_args()
return args
def main():
args = parse_args()
if args.ngpu == 0:
paddle.set_device("cpu")

@ -101,6 +101,16 @@ class LengthRegulator(nn.Layer):
assert alpha > 0
ds = paddle.round(ds.cast(dtype=paddle.float32) * alpha)
ds = ds.cast(dtype=paddle.int64)
'''
from distutils.version import LooseVersion
from paddlespeech.t2s.modules.nets_utils import pad_list
# dygraph-to-static conversion does not work here with paddle 2.2.2
# if LooseVersion(paddle.__version__) >= "2.3.0" or hasattr(paddle, 'repeat_interleave'):
# if LooseVersion(paddle.__version__) >= "2.3.0":
if hasattr(paddle, 'repeat_interleave'):
repeat = [paddle.repeat_interleave(x, d, axis=0) for x, d in zip(xs, ds)]
return pad_list(repeat, self.pad_value)
'''
if is_inference:
return self.expand(xs, ds)
else:
