[vec][score] add plda model, test=doc fix #1667

pull/1681/head
qingen 4 years ago
commit 6446f72cab

.gitignore

@ -1,7 +1,7 @@
.DS_Store
*.pyc
.vscode
*log
*.log
*.wav
*.pdmodel
*.pdiparams*
@ -34,4 +34,6 @@ tools/activate_python.sh
tools/miniconda.sh
tools/CRF++-0.58/
speechx/fc_patch/
third_party/ctc_decoders/paddlespeech_ctcdecoders.py

@ -52,7 +52,7 @@ pull_request_rules:
add: ["T2S"]
- name: "auto add label=Audio"
conditions:
- files~=^audio/
- files~=^paddleaudio/
actions:
label:
add: ["Audio"]

@ -50,13 +50,13 @@ repos:
entry: bash .pre-commit-hooks/clang-format.hook -i
language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|cuh|proto)$
exclude: (?=speechx/speechx/kaldi).*(\.cpp|\.cc|\.h|\.py)$
exclude: (?=speechx/speechx/kaldi|speechx/patch).*(\.cpp|\.cc|\.h|\.py)$
- id: copyright_checker
name: copyright_checker
entry: python .pre-commit-hooks/copyright-check.hook
language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|proto|py)$
exclude: (?=third_party|pypinyin|speechx/speechx/kaldi).*(\.cpp|\.cc|\.h|\.py)$
exclude: (?=third_party|pypinyin|speechx/speechx/kaldi|speechx/patch).*(\.cpp|\.cc|\.h|\.py)$
- repo: https://github.com/asottile/reorder_python_imports
rev: v2.4.0
hooks:

@ -1,5 +1,25 @@
# Changelog
Date: 2022-3-22, Author: yt605155624.
Add features to: CLI:
- Support aishell3_hifigan, vctk_hifigan
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1587
Date: 2022-3-09, Author: yt605155624.
Add features to: T2S:
- Add ljspeech hifigan egs.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1549
Date: 2022-3-08, Author: yt605155624.
Add features to: T2S:
- Add aishell3 hifigan egs.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1545
Date: 2022-3-08, Author: yt605155624.
Add features to: T2S:
- Add vctk hifigan egs.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1544
Date: 2022-1-29, Author: yt605155624.
Add features to: T2S:
- Update aishell3 vc0 with new Tacotron2.

@ -7,6 +7,7 @@
<h3>
<a href="#quick-start"> Quick Start </a>
| <a href="#quick-start-server"> Quick Start Server </a>
| <a href="#documents"> Documents </a>
| <a href="#model-list"> Models List </a>
</div>
@ -178,7 +179,9 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
<!---
2021.12.14: We would like to have online courses to introduce the basics and research of speech, as well as code practice with `paddlespeech`. Please pay attention to our [Calendar](https://www.paddlepaddle.org.cn/live).
--->
- 🤗 2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/akhaliq/paddlespeech) Demos on Hugging Face Spaces are available!
- 👏🏻 2022.03.28: PaddleSpeech Server is available for Audio Classification, Automatic Speech Recognition and Text-to-Speech.
- 👏🏻 2022.03.28: PaddleSpeech CLI is available for Speaker Verification.
- 🤗 2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) Demos on Hugging Face Spaces are available!
- 👏🏻 2021.12.10: PaddleSpeech CLI is available for Audio Classification, Automatic Speech Recognition, Speech Translation (English to Chinese) and Text-to-Speech.
### Community
@ -203,10 +206,16 @@ Developers can have a try of our models with [PaddleSpeech Command Line](./paddl
paddlespeech cls --input input.wav
```
**Speaker Verification**
```shell
paddlespeech vector --task spk --input input_16k.wav
```
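The same model can also be called from Python; below is a minimal sketch, assuming the `VectorExecutor` API that `demos/audio_searching/src/encode.py` (added in this commit) uses:
```python
# Sketch: extract a speaker embedding in Python (mirrors the CLI call above).
from paddlespeech.cli import VectorExecutor

vector_executor = VectorExecutor()
# input_16k.wav is an example 16 kHz wav file
embedding = vector_executor(
    audio_file="input_16k.wav", model="ecapatdnn_voxceleb12")
print(embedding.shape)  # 192-dimensional speaker embedding
```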
**Automatic Speech Recognition**
```shell
paddlespeech asr --lang zh --input input_16k.wav
```
- web demo for Automatic Speech Recognition is integrated into [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See Demo: [ASR Demo](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR)
**Speech Translation** (English to Chinese)
(not supported on Mac and Windows yet)
@ -218,7 +227,7 @@ paddlespeech st --input input_16k.wav
```shell
paddlespeech tts --input "你好,欢迎使用飞桨深度学习框架!" --output output.wav
```
- web demo for Text to Speech is integrated into [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See Demo: [TTS Demo](https://huggingface.co/spaces/akhaliq/paddlespeech)
- web demo for Text to Speech is integrated into [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See Demo: [TTS Demo](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS)
**Text Postprocessing**
- Punctuation Restoration
@ -241,6 +250,36 @@ For more command lines, please see: [demos](https://github.com/PaddlePaddle/Padd
If you want to try more functions like training and tuning, please have a look at [Speech-to-Text Quick Start](./docs/source/asr/quick_start.md) and [Text-to-Speech Quick Start](./docs/source/tts/quick_start.md).
<a name="quickstartserver"></a>
## Quick Start Server
Developers can try our speech server with [PaddleSpeech Server Command Line](./paddlespeech/server/README.md).
**Start the server**
```shell
paddlespeech_server start --config_file ./paddlespeech/server/conf/application.yaml
```
**Access Speech Recognition Services**
```shell
paddlespeech_client asr --server_ip 127.0.0.1 --port 8090 --input input_16k.wav
```
**Access Text to Speech Services**
```shell
paddlespeech_client tts --server_ip 127.0.0.1 --port 8090 --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
```
**Access Audio Classification Services**
```shell
paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
```
For more information about server command lines, please see: [speech server demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server)
## Model List
PaddleSpeech supports a series of the most popular models, summarized in [released models](./docs/source/released_model.md) together with the available pretrained models.
@ -397,9 +436,9 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tr>
<tr>
<td >HiFiGAN</td>
<td >CSMSC</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3</td>
<td>
<a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a>
<a href = "./examples/ljspeech/voc5">HiFiGAN-ljspeech</a> / <a href = "./examples/vctk/voc5">HiFiGAN-vctk</a> / <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> / <a href = "./examples/aishell3/voc5">HiFiGAN-aishell3</a>
</td>
</tr>
<tr>
@ -457,6 +496,29 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tbody>
</table>
**Speaker Verification**
<table style="width:100%">
<thead>
<tr>
<th> Task </th>
<th> Dataset </th>
<th> Model Type </th>
<th> Link </th>
</tr>
</thead>
<tbody>
<tr>
<td>Speaker Verification</td>
<td>VoxCeleb12</td>
<td>ECAPA-TDNN</td>
<td>
<a href = "./examples/voxceleb/sv0">ecapa-tdnn-voxceleb12</a>
</td>
</tr>
</tbody>
</table>
**Punctuation Restoration**
<table style="width:100%">
@ -498,6 +560,7 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht
- [Chinese Rule Based Text Frontend](./docs/source/tts/zh_text_frontend.md)
- [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
- [Audio Classification](./demos/audio_tagging/README.md)
- [Speaker Verification](./demos/speaker_verification/README.md)
- [Speech Translation](./demos/speech_translation/README.md)
- [Released Models](./docs/source/released_model.md)
- [Community](#Community)
@ -573,7 +636,6 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P
- Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) for years of attention, constructive advice and great help.
- Many thanks to [AK391](https://github.com/AK391) for TTS web demo on Huggingface Spaces using Gradio.
- Many thanks to [mymagicpower](https://github.com/mymagicpower) for the Java implementation of ASR upon [short](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk) and [long](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk) audio files.
- Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for developing Virtual Uploader(VUP)/Virtual YouTuber(VTuber) with PaddleSpeech TTS function.
- Many thanks to [745165806](https://github.com/745165806)/[PaddleSpeechTask](https://github.com/745165806/PaddleSpeechTask) for contributing Punctuation Restoration model.

@ -6,6 +6,7 @@
<h3>
<a href="#quick-start"> Quick Start </a>
| <a href="#quick-start-server"> Quick Start Server </a>
| <a href="#documents"> Documents </a>
| <a href="#model-list"> Models List </a>
</div>
@ -179,7 +180,9 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme
<!---
2021.12.14: We would like to have online courses to introduce the basics and research of speech, as well as code practice with `paddlespeech`. Please pay attention to our [Calendar](https://www.paddlepaddle.org.cn/live).
--->
- 🤗 2021.12.14: Our [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/akhaliq/paddlespeech) demos on Hugging Face Spaces are live!
- 👏🏻 2022.03.28: PaddleSpeech Server is available, covering Audio Classification, Automatic Speech Recognition, and Text-to-Speech.
- 👏🏻 2022.03.28: PaddleSpeech CLI now supports Speaker Verification.
- 🤗 2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) Demos on Hugging Face Spaces are available!
- 👏🏻 2021.12.10: PaddleSpeech CLI is available, covering Audio Classification, Automatic Speech Recognition, Speech Translation (English to Chinese), and Text-to-Speech.
### Community
@ -202,6 +205,10 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme
```shell
paddlespeech cls --input input.wav
```
**Speaker Verification**
```shell
paddlespeech vector --task spk --input input_16k.wav
```
**Speech Recognition**
```shell
paddlespeech asr --lang zh --input input_16k.wav
@ -236,6 +243,33 @@ paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
For more command line usage, please refer to [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos)
> Note: For training or fine-tuning, please see the [Speech Recognition](./docs/source/asr/quick_start.md) and [Speech Synthesis](./docs/source/tts/quick_start.md) quick-start guides.
## Quick Start Server
After installation, developers can quickly use the services from the command line.
**Start the server**
```shell
paddlespeech_server start --config_file ./paddlespeech/server/conf/application.yaml
```
**Access the Speech Recognition Service**
```shell
paddlespeech_client asr --server_ip 127.0.0.1 --port 8090 --input input_16k.wav
```
**Access the Text-to-Speech Service**
```shell
paddlespeech_client tts --server_ip 127.0.0.1 --port 8090 --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
```
**Access the Audio Classification Service**
```shell
paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
```
For more information about server command line usage, please refer to [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server)
## Model List
PaddleSpeech supports many mainstream models and provides pretrained models; see the [model list](./docs/source/released_model.md) for details.
@ -392,9 +426,9 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</tr>
<tr>
<td >HiFiGAN</td>
<td >CSMSC</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3</td>
<td>
<a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a>
<a href = "./examples/ljspeech/voc5">HiFiGAN-ljspeech</a> / <a href = "./examples/vctk/voc5">HiFiGAN-vctk</a> / <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> / <a href = "./examples/aishell3/voc5">HiFiGAN-aishell3</a>
</td>
</tr>
<tr>
@ -453,6 +487,30 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</tbody>
</table>
**Speaker Verification**
<table style="width:100%">
<thead>
<tr>
<th> Task </th>
<th> Dataset </th>
<th> Model Type </th>
<th> Link </th>
</tr>
</thead>
<tbody>
<tr>
<td>Speaker Verification</td>
<td>VoxCeleb12</td>
<td>ECAPA-TDNN</td>
<td>
<a href = "./examples/voxceleb/sv0">ecapa-tdnn-voxceleb12</a>
</td>
</tr>
</tbody>
</table>
**Punctuation Restoration**
<table style="width:100%">
@ -499,6 +557,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
- [Chinese Text Frontend](./docs/source/tts/zh_text_frontend.md)
- [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
- [Audio Classification](./demos/audio_tagging/README_cn.md)
- [Speaker Verification](./demos/speaker_verification/README_cn.md)
- [Speech Translation](./demos/speech_translation/README_cn.md)
- [Model List](#模型列表)
- [Speech Recognition](#语音识别模型)
@ -521,6 +580,15 @@ author={PaddlePaddle Authors},
howpublished = {\url{https://github.com/PaddlePaddle/PaddleSpeech}},
year={2021}
}
@inproceedings{zheng2021fused,
title={Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation},
author={Zheng, Renjie and Chen, Junkun and Ma, Mingbo and Huang, Liang},
booktitle={International Conference on Machine Learning},
pages={12736--12746},
year={2021},
organization={PMLR}
}
```
<a name="欢迎贡献"></a>
@ -568,7 +636,6 @@ year={2021}
## Acknowledgements
- Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) for years of attention, constructive advice, and help with many issues.
- Many thanks to [AK391](https://github.com/AK391) for the web demo of our text-to-speech on Huggingface Spaces using Gradio.
- Many thanks to [mymagicpower](https://github.com/mymagicpower) for the Java implementation of ASR on [short](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk) and [long](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk) audio files with PaddleSpeech.
- Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for building a Virtual Uploader (VUP)/Virtual YouTuber (VTuber) with PaddleSpeech TTS.
- Many thanks to [745165806](https://github.com/745165806)/[PaddleSpeechTask](https://github.com/745165806/PaddleSpeechTask) for contributing punctuation restoration models.

@ -20,12 +20,12 @@ of each audio file in the data set.
"""
import argparse
import codecs
import distutils.util
import io
import json
import os
from multiprocessing.pool import Pool
import distutils.util
import soundfile
from utils.utility import download

@ -59,12 +59,19 @@ DEV_TARGET_DATA = "vox1_dev_wav_parta* vox1_dev_wav.zip ae63e55b951748cc486645f5
TEST_LIST = {"vox1_test_wav.zip": "185fdc63c3c739954633d50379a3d102"}
TEST_TARGET_DATA = "vox1_test_wav.zip vox1_test_wav.zip 185fdc63c3c739954633d50379a3d102"
# kaldi trial
# this trial file is organized by kaldi according to the official file,
# which is a little different from the official trial file veri_test2.txt
KALDI_BASE_URL = "http://www.openslr.org/resources/49/"
TRIAL_LIST = {"voxceleb1_test_v2.txt": "29fc7cc1c5d59f0816dc15d6e8be60f7"}
TRIAL_TARGET_DATA = "voxceleb1_test_v2.txt voxceleb1_test_v2.txt 29fc7cc1c5d59f0816dc15d6e8be60f7"
# voxceleb trial
TRIAL_BASE_URL = "https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/"
TRIAL_LIST = {
"veri_test.txt": "29fc7cc1c5d59f0816dc15d6e8be60f7", # voxceleb1
"veri_test2.txt": "b73110731c9223c1461fe49cb48dddfc", # voxceleb1(cleaned)
"list_test_hard.txt": "21c341b6b2168eea2634df0fb4b8fff1", # voxceleb1-H
"list_test_hard2.txt":
"857790e09d579a68eb2e339a090343c8", # voxceleb1-H(cleaned)
"list_test_all.txt": "b9ecf7aa49d4b656aa927a8092844e4a", # voxceleb1-E
"list_test_all2.txt":
"a53e059deb562ffcfc092bf5d90d9f3a" # voxceleb1-E(cleaned)
}
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
@ -82,7 +89,7 @@ args = parser.parse_args()
def create_manifest(data_dir, manifest_path_prefix):
print("Creating manifest %s ..." % manifest_path_prefix)
print(f"Creating manifest {manifest_path_prefix} from {data_dir}")
json_lines = []
data_path = os.path.join(data_dir, "wav", "**", "*.wav")
total_sec = 0.0
@ -114,6 +121,9 @@ def create_manifest(data_dir, manifest_path_prefix):
# voxceleb1 is given explicitly in the path
data_dir_name = Path(data_dir).name
manifest_path_prefix = manifest_path_prefix + "." + data_dir_name
if not os.path.exists(os.path.dirname(manifest_path_prefix)):
os.makedirs(os.path.dirname(manifest_path_prefix))
with codecs.open(manifest_path_prefix, 'w', encoding='utf-8') as f:
for line in json_lines:
f.write(line + "\n")
@ -133,11 +143,13 @@ def create_manifest(data_dir, manifest_path_prefix):
def prepare_dataset(base_url, data_list, target_dir, manifest_path,
target_data):
if not os.path.exists(target_dir):
os.mkdir(target_dir)
os.makedirs(target_dir)
# if the wav directory already exists, nothing needs to be done
# we will download the voxceleb1 data to the ${target_dir}/vox1/dev/ or ${target_dir}/vox1/test directory
if not os.path.exists(os.path.join(target_dir, "wav")):
# download all dataset parts
print("start to download the vox1 dev zip package")
for zip_part in data_list.keys():
download_url = " --no-check-certificate " + base_url + "/" + zip_part
download(
@ -167,10 +179,22 @@ def prepare_dataset(base_url, data_list, target_dir, manifest_path,
create_manifest(data_dir=target_dir, manifest_path_prefix=manifest_path)
def prepare_trial(base_url, data_list, target_dir):
if not os.path.exists(target_dir):
os.makedirs(target_dir)
for trial, md5sum in data_list.items():
target_trial = os.path.join(target_dir, trial)
if not os.path.exists(target_trial):
download_url = " --no-check-certificate " + base_url + "/" + trial
download(url=download_url, md5sum=md5sum, target_dir=target_dir)
def main():
if args.target_dir.startswith('~'):
args.target_dir = os.path.expanduser(args.target_dir)
# prepare the vox1 dev data
prepare_dataset(
base_url=BASE_URL,
data_list=DEV_LIST,
@ -178,6 +202,7 @@ def main():
manifest_path=args.manifest_prefix,
target_data=DEV_TARGET_DATA)
# prepare the vox1 test data
prepare_dataset(
base_url=BASE_URL,
data_list=TEST_LIST,
@ -185,6 +210,12 @@ def main():
manifest_path=args.manifest_prefix,
target_data=TEST_TARGET_DATA)
# prepare the vox1 trial
prepare_trial(
base_url=TRIAL_BASE_URL,
data_list=TRIAL_LIST,
target_dir=os.path.dirname(args.manifest_prefix))
print("Manifest prepare done!")

@ -0,0 +1,164 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Prepare VoxCeleb2 dataset
Download and unpack the voxceleb2 data files.
Voxceleb2 data is stored in the m4a format,
so we need to convert the m4a files to wav with the convert.sh script
"""
import argparse
import codecs
import glob
import json
import os
from pathlib import Path
import soundfile
from utils.utility import download
from utils.utility import unzip
# all the data will be downloaded to the current data/voxceleb directory by default
DATA_HOME = os.path.expanduser('.')
BASE_URL = "--no-check-certificate https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data/"
# dev data
DEV_DATA_URL = BASE_URL + '/vox2_aac.zip'
DEV_MD5SUM = "bbc063c46078a602ca71605645c2a402"
# test data
TEST_DATA_URL = BASE_URL + '/vox2_test_aac.zip'
TEST_MD5SUM = "0d2b3ea430a821c33263b5ea37ede312"
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--target_dir",
default=DATA_HOME + "/voxceleb2/",
type=str,
help="Directory to save the voxceleb1 dataset. (default: %(default)s)")
parser.add_argument(
"--manifest_prefix",
default="manifest",
type=str,
help="Filepath prefix for output manifests. (default: %(default)s)")
parser.add_argument(
"--download",
default=False,
action="store_true",
help="Download the voxceleb2 dataset. (default: %(default)s)")
parser.add_argument(
"--generate",
default=False,
action="store_true",
help="Generate the manifest files. (default: %(default)s)")
args = parser.parse_args()
def create_manifest(data_dir, manifest_path_prefix):
print("Creating manifest %s ..." % manifest_path_prefix)
json_lines = []
data_path = os.path.join(data_dir, "**", "*.wav")
total_sec = 0.0
total_text = 0.0
total_num = 0
speakers = set()
for audio_path in glob.glob(data_path, recursive=True):
audio_id = "-".join(audio_path.split("/")[-3:])
utt2spk = audio_path.split("/")[-3]
duration = soundfile.info(audio_path).duration
text = ""
json_lines.append(
json.dumps(
{
"utt": audio_id,
"utt2spk": str(utt2spk),
"feat": audio_path,
"feat_shape": (duration, ),
"text": text # compatible with asr data format
},
ensure_ascii=False))
total_sec += duration
total_text += len(text)
total_num += 1
speakers.add(utt2spk)
# data_dir_name refers to dev or test
# voxceleb2 is given explicitly in the path
data_dir_name = Path(data_dir).name
manifest_path_prefix = manifest_path_prefix + "." + data_dir_name
if not os.path.exists(os.path.dirname(manifest_path_prefix)):
os.makedirs(os.path.dirname(manifest_path_prefix))
with codecs.open(manifest_path_prefix, 'w', encoding='utf-8') as f:
for line in json_lines:
f.write(line + "\n")
manifest_dir = os.path.dirname(manifest_path_prefix)
meta_path = os.path.join(manifest_dir, "voxceleb2." +
data_dir_name) + ".meta"
with codecs.open(meta_path, 'w', encoding='utf-8') as f:
print(f"{total_num} utts", file=f)
print(f"{len(speakers)} speakers", file=f)
print(f"{total_sec / (60 * 60)} h", file=f)
print(f"{total_text} text", file=f)
print(f"{total_text / total_sec} text/sec", file=f)
print(f"{total_sec / total_num} sec/utt", file=f)
def download_dataset(url, md5sum, target_dir, dataset):
if not os.path.exists(target_dir):
os.makedirs(target_dir)
# if the wav directory already exists, nothing needs to be done
print("target dir {}".format(os.path.join(target_dir, dataset)))
# unzipping the dev dataset creates the dev dir and unpacks the m4a files there,
# but the test dataset unzips into aac,
# so we create ${target_dir}/test and unzip the m4a files into the test dir
if not os.path.exists(os.path.join(target_dir, dataset)):
filepath = download(url, md5sum, target_dir)
if dataset == "test":
unzip(filepath, os.path.join(target_dir, "test"))
def main():
if args.target_dir.startswith('~'):
args.target_dir = os.path.expanduser(args.target_dir)
# download and unpack the vox2-dev data
print("download: {}".format(args.download))
if args.download:
download_dataset(
url=DEV_DATA_URL,
md5sum=DEV_MD5SUM,
target_dir=args.target_dir,
dataset="dev")
download_dataset(
url=TEST_DATA_URL,
md5sum=TEST_MD5SUM,
target_dir=args.target_dir,
dataset="test")
print("VoxCeleb2 download is done!")
if args.generate:
create_manifest(
args.target_dir, manifest_path_prefix=args.manifest_prefix)
if __name__ == '__main__':
main()
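# Typical invocation (example paths; --download fetches the zips and --generate
# writes the manifest files, per the argparse flags defined above):
#   python voxceleb2.py --download --generate \
#       --target_dir=./data/voxceleb2 --manifest_prefix=./data/manifest.voxceleb2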

@ -4,6 +4,7 @@
This directory contains many speech applications for various scenarios.
* audio searching - mass audio similarity retrieval
* audio tagging - multi-label tagging of an audio file
* automatic_video_subtitiles - generate subtitles from a video
* metaverse - 2D AR with TTS

@ -4,6 +4,7 @@
This directory contains speech application demos for different scenarios developed with PaddleSpeech.
* audio searching - mass audio similarity retrieval.
* audio tagging - multi-label audio classification based on the 527 AudioSet labels.
* automatic video subtitles - recognize the speech in a video as text and post-process it.
* metaverse - 2D AR based on TTS.

@ -0,0 +1,235 @@
([简体中文](./README_cn.md)|English)
# Audio Searching
## Introduction
As the Internet continues to evolve, unstructured data such as emails, social media photos, live videos, and customer service voice calls has become increasingly common. To process such data on a computer, we need embedding technology to transform it into vectors and then store, index, and query them.
However, when there is a large amount of data, such as hundreds of millions of audio tracks, it is more difficult to do a similarity search. The exhaustive method is feasible, but very time consuming. For this scenario, this demo will introduce how to build an audio similarity retrieval system using the open source vector database Milvus.
Audio retrieval (speech, music, speaker, etc.) enables querying and finding similar sounds (or the same speaker) in large amounts of audio data. An audio similarity retrieval system can be used to identify similar sound effects, minimize intellectual property infringement, quickly search voiceprint libraries, and help enterprises control fraud and identity theft. Audio retrieval also plays an important role in the classification and statistical analysis of audio data.
In this demo, you will learn how to build an audio retrieval system to retrieve similar sound snippets. Uploaded audio clips are converted into vectors using PaddleSpeech pretrained models (audio classification model, speaker recognition model, etc.) and stored in Milvus. Milvus automatically generates a unique ID for each vector, and the ID together with the corresponding audio information (audio ID, speaker ID, etc.) is stored in MySQL to complete the library construction. During retrieval, the user uploads a test audio to obtain its vector and runs a vector similarity search in Milvus; Milvus returns vector IDs, and the corresponding audio information is then queried from MySQL by ID.
![Workflow of an audio searching system](./img/audio_searching.png)
Note: this demo uses the [CN-Celeb](http://openslr.org/82/) dataset (at least 650,000 audio entries from 3,000 speakers) to build the audio vector library, which is then searched using a preset distance metric. Other datasets can be used instead and adjusted as needed, e.g. Librispeech, VoxCeleb, UrbanSound, GloVe, MNIST, etc.
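In code, the workflow above boils down to two steps: embed, then search. Below is a minimal sketch (not the demo's actual `src/` modules), assuming pymilvus 2.x, a running Milvus instance, and an existing collection `audio_table` whose vector field is named `embedding` (the field name here is illustrative):
```python
# Minimal sketch of the embed-then-search flow. Assumptions: pymilvus 2.x,
# Milvus running on localhost, and an existing collection "audio_table"
# with a vector field named "embedding" (field name is illustrative).
import numpy as np
from paddlespeech.cli import VectorExecutor
from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")
collection = Collection("audio_table")

# 1. Turn the query audio into an L2-normalized embedding vector.
vector_executor = VectorExecutor()
emb = vector_executor(audio_file="test.wav", model="ecapatdnn_voxceleb12")
emb = emb / np.linalg.norm(emb)

# 2. Ask Milvus for the nearest vectors; the returned IDs would then be
#    looked up in MySQL to recover the audio metadata.
results = collection.search(
    data=[emb.tolist()],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=10)
print(results[0].ids, results[0].distances)
```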
## Usage
### 1. Prepare PaddleSpeech
Audio vector extraction requires a model trained with PaddleSpeech, so please make sure PaddleSpeech is installed before running. For the installation steps, see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
You can choose one of the easy, medium, and hard ways to install paddlespeech.
### 2. Prepare MySQL and Milvus services by docker-compose
The audio similarity search system requires the Milvus and MySQL services. We can start these containers with one command via [docker-compose.yaml](./docker-compose.yaml), so please make sure you have [installed Docker Engine](https://docs.docker.com/engine/install/) and [Docker Compose](https://docs.docker.com/compose/install/) before running. Then:
```bash
## Enter the audio_searching directory for the following example
cd ~/PaddleSpeech/demos/audio_searching/
## Then start the related services within the container
docker-compose -f docker-compose.yaml up -d
```
You will see that all the containers are created:
```bash
Creating network "quick_deploy_app_net" with driver "bridge"
Creating milvus-minio ... done
Creating milvus-etcd ... done
Creating audio-mysql ... done
Creating milvus-standalone ... done
Creating audio-webclient ... done
```
Show all containers with `docker ps`, and use `docker logs audio-mysql` to get the logs of the server container:
```bash
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b2bcf279e599 milvusdb/milvus:v2.0.1 "/tini -- milvus run…" 22 hours ago Up 22 hours 0.0.0.0:19530->19530/tcp milvus-standalone
d8ef4c84e25c mysql:5.7 "docker-entrypoint.s…" 22 hours ago Up 22 hours 0.0.0.0:3306->3306/tcp, 33060/tcp audio-mysql
8fb501edb4f3 quay.io/coreos/etcd:v3.5.0 "etcd -advertise-cli…" 22 hours ago Up 22 hours 2379-2380/tcp milvus-etcd
ffce340b3790 minio/minio:RELEASE.2020-12-03T00-03-10Z "/usr/bin/docker-ent…" 22 hours ago Up 22 hours (healthy) 9000/tcp milvus-minio
15c84a506754 paddlepaddle/paddlespeech-audio-search-client:2.3 "/bin/bash -c '/usr/…" 22 hours ago Up 22 hours (healthy) 0.0.0.0:8068->80/tcp audio-webclient
```
### 3. Start API Server
Then start the system server, which provides the HTTP backend service.
- Install the Python packages
```bash
pip install -r requirements.txt
```
- Set configuration (if running locally, you can skip this step)
```bash
## Method 1: Modify the source file
vim src/config.py
## Method 2: Modify the environment variables, as shown in
export MILVUS_HOST=127.0.0.1
export MYSQL_HOST=127.0.0.1
```
Here are some parameters that need to be set; for more information, please refer to [config.py](./src/config.py).

| **Parameter**    | **Description**          | **Default setting** |
| ---------------- | ------------------------ | ------------------- |
| MILVUS_HOST      | The IP address of Milvus; you can get it with ifconfig. If running everything on one machine, most likely 127.0.0.1 | 127.0.0.1 |
| MILVUS_PORT      | Port of Milvus. | 19530 |
| VECTOR_DIMENSION | Dimension of the vectors. | 192 |
| MYSQL_HOST       | The IP address of MySQL. | 127.0.0.1 |
| MYSQL_PORT       | Port of MySQL. | 3306 |
| DEFAULT_TABLE    | The Milvus and MySQL default collection/table name. | audio_table |
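These settings are plain environment variables read once at import time; a short excerpt of the pattern used in [config.py](./src/config.py) (included in this commit):
```python
# Pattern used in src/config.py: every setting is an env var with a default.
import os

MILVUS_HOST = os.getenv("MILVUS_HOST", "127.0.0.1")
MILVUS_PORT = int(os.getenv("MILVUS_PORT", "19530"))
VECTOR_DIMENSION = int(os.getenv("VECTOR_DIMENSION", "192"))
DEFAULT_TABLE = os.getenv("DEFAULT_TABLE", "audio_table")
```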
- Run the code
Then start the server built with FastAPI.
```bash
export PYTHONPATH=$PYTHONPATH:./src:../../paddleaudio
python src/main.py
```
Then you will see that the application has started:
```bash
INFO: Started server process [13352]
2022-03-26 22:45:30,838 INFO server.py serve 75 Started server process [13352]
INFO: Waiting for application startup.
2022-03-26 22:45:30,839 INFO on.py startup 45 Waiting for application startup.
INFO: Application startup complete.
2022-03-26 22:45:30,839 INFO on.py startup 59 Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit)
2022-03-26 22:45:30,840 INFO server.py _log_started_message 206 Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit)
```
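Once the server is up, the endpoints defined in `src/main.py` (added in this commit) can also be exercised directly over HTTP; for example, with the `requests` package:
```python
# Exercise the running API server directly. The endpoints and parameters
# below come from src/main.py in this commit; the paths are examples.
import requests

BASE = "http://127.0.0.1:8002"

# Index every audio file under a server-side directory.
r = requests.post(f"{BASE}/audio/load", json={"File": "./example_audio"})
print(r.json())

# Search with a server-side query file; returns (path, (name, score)) pairs.
r = requests.post(f"{BASE}/audio/search/local",
                  params={"query_audio_path": "./example_audio/test.wav"})
print(r.json())

# Count the vectors currently stored.
print(requests.get(f"{BASE}/audio/count").json())
```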
### 4. Usage
- Prepare data
```bash
wget -c https://www.openslr.org/resources/82/cn-celeb_v2.tar.gz && tar -xvf cn-celeb_v2.tar.gz
```
**Note**: If you want to build a quick demo, you can use the `download_audio_data` function in ./src/test_main.py, which downloads 20 audio files; the following results use this collection as an example.
- Prepare model(Skip this step if you use the default model.)
```bash
## Modify model configuration parameters. Currently, only ecapatdnn_voxceleb12 is supported, and multiple types will be supported in the future
vim ./src/encode.py
```
- Script test (Recommended)
The script downloads the data, loads the paddlespeech model, extracts embeddings, builds the library, searches it, and finally drops it:
```bash
python ./src/test_main.py
```
Output:
```bash
Downloading https://paddlespeech.bj.bcebos.com/vector/audio/example_audio.tar.gz ...
...
Unpacking ./example_audio.tar.gz ...
[2022-03-26 22:50:54,987] [ INFO] - checking the aduio file format......
[2022-03-26 22:50:54,987] [ INFO] - The sample rate is 16000
[2022-03-26 22:50:54,987] [ INFO] - The audio file format is right
[2022-03-26 22:50:54,988] [ INFO] - device type: cpu
[2022-03-26 22:50:54,988] [ INFO] - load the pretrained model: ecapatdnn_voxceleb12-16k
[2022-03-26 22:50:54,990] [ INFO] - Downloading sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_0.tar.gz from https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_0.tar.gz
...
[2022-03-26 22:51:17,285] [ INFO] - start to dynamic import the model class
[2022-03-26 22:51:17,285] [ INFO] - model name ecapatdnn
[2022-03-26 22:51:23,864] [ INFO] - start to set the model parameters to model
[2022-03-26 22:54:08,115] [ INFO] - create the model instance success
[2022-03-26 22:54:08,116] [ INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_
searching/example_audio/knife_hit_iron3.wav
[2022-03-26 22:54:08,116] [ INFO] - load the audio sample points, shape is: (11012,)
[2022-03-26 22:54:08,150] [ INFO] - extract the audio feat, shape is: (80, 69)
[2022-03-26 22:54:08,152] [ INFO] - feats shape: [1, 80, 69]
[2022-03-26 22:54:08,154] [ INFO] - audio extract the feat success
[2022-03-26 22:54:08,155] [ INFO] - start to do backbone network model forward
[2022-03-26 22:54:08,155] [ INFO] - feats shape:[1, 80, 69], lengths shape: [1]
[2022-03-26 22:54:08,433] [ INFO] - embedding size: (192,)
Extracting feature from audio No. 1 , 20 audios in total
[2022-03-26 22:54:08,435] [ INFO] - checking the aduio file format......
[2022-03-26 22:54:08,435] [ INFO] - The sample rate is 16000
[2022-03-26 22:54:08,436] [ INFO] - The audio file format is right
[2022-03-26 22:54:08,436] [ INFO] - device type: cpu
[2022-03-26 22:54:08,436] [ INFO] - Model has been initialized
[2022-03-26 22:54:08,436] [ INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_searching/example_audio/sword_wielding.wav
[2022-03-26 22:54:08,436] [ INFO] - load the audio sample points, shape is: (6391,)
[2022-03-26 22:54:08,452] [ INFO] - extract the audio feat, shape is: (80, 40)
[2022-03-26 22:54:08,454] [ INFO] - feats shape: [1, 80, 40]
[2022-03-26 22:54:08,454] [ INFO] - audio extract the feat success
[2022-03-26 22:54:08,454] [ INFO] - start to do backbone network model forward
[2022-03-26 22:54:08,455] [ INFO] - feats shape:[1, 80, 40], lengths shape: [1]
[2022-03-26 22:54:08,633] [ INFO] - embedding size: (192,)
Extracting feature from audio No. 2 , 20 audios in total
...
2022-03-26 22:54:15,892 INFO main.py load_audios 85 Successfully loaded data, total count: 20
2022-03-26 22:54:15,908 INFO main.py count_audio 148 Successfully count the number of data!
[2022-03-26 22:54:15,916] [ INFO] - checking the aduio file format......
[2022-03-26 22:54:15,916] [ INFO] - The sample rate is 16000
[2022-03-26 22:54:15,916] [ INFO] - The audio file format is right
[2022-03-26 22:54:15,916] [ INFO] - device type: cpu
[2022-03-26 22:54:15,916] [ INFO] - Model has been initialized
[2022-03-26 22:54:15,916] [ INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_searching/example_audio/test.wav
[2022-03-26 22:54:15,917] [ INFO] - load the audio sample points, shape is: (8456,)
[2022-03-26 22:54:15,923] [ INFO] - extract the audio feat, shape is: (80, 53)
[2022-03-26 22:54:15,924] [ INFO] - feats shape: [1, 80, 53]
[2022-03-26 22:54:15,924] [ INFO] - audio extract the feat success
[2022-03-26 22:54:15,924] [ INFO] - start to do backbone network model forward
[2022-03-26 22:54:15,924] [ INFO] - feats shape:[1, 80, 53], lengths shape: [1]
[2022-03-26 22:54:16,051] [ INFO] - embedding size: (192,)
...
2022-03-26 22:54:16,086 INFO main.py search_local_audio 132 search result http://testserver/data?audio_path=./example_audio/test.wav, score 100.0
2022-03-26 22:54:16,087 INFO main.py search_local_audio 132 search result http://testserver/data?audio_path=./example_audio/knife_chopping.wav, score 29.182177782058716
2022-03-26 22:54:16,087 INFO main.py search_local_audio 132 search result http://testserver/data?audio_path=./example_audio/knife_cut_into_body.wav, score 22.73637056350708
...
2022-03-26 22:54:16,088 INFO main.py search_local_audio 136 Successfully searched similar audio!
2022-03-26 22:54:17,164 INFO main.py drop_tables 160 Successfully drop tables in Milvus and MySQL!
```
- GUI test (Optional)
Navigate to 127.0.0.1:8068 in your browser to access the front-end interface.
**Note**: If the browser and the service are not on the same machine, change the IP to the IP of the machine where the service is located, update the corresponding API_URL in docker-compose.yaml, and re-run docker-compose for the change to take effect.
- Insert data
Download the data on the server and decompress it into a directory, for example /home/speech/data/. Then enter /home/speech/data/ in the address bar of the upload page to upload the data.
![](./img/insert.png)
- Search for similar audio
Select the magnifying glass icon on the left side of the interface. Then, press the "Default Target Audio File" button and upload a .wav sound file from the client you'd like to search. Results will be displayed.
![](./img/search.png)
### 5. Results
Machine configuration:
- OS: CentOS release 7.6
- kernel: 4.17.11-1.el7.elrepo.x86_64
- CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
- memory: 132 GB
Dataset:
- CN-Celeb, train size 650,000, test size 10,000, dimension 192, distance L2
Recall and elapsed-time statistics are shown in the following figure:
![](./img/result.png)
At a 90% recall rate, the Milvus-based retrieval framework takes about 2.9 ms per search, and feature extraction takes about 500 ms (for a test audio of about 5 seconds), i.e. a single audio query takes about 503 ms in total, which meets most application scenarios.
### 6. Pretrained Models
Here is a list of pretrained models released by PaddleSpeech:
| Model | Sample Rate
| :--- | :---:
| ecapa_tdnn | 16000

@ -0,0 +1,237 @@
(简体中文|[English](./README.md))
# Audio Searching
## Introduction
As the Internet continues to evolve, unstructured data such as emails, social media photos, live videos, and customer service voice calls has become increasingly common. To process such data on a computer, we need embedding technology to transform it into vectors and then store, index, and query them.
However, when the data volume is huge, for example hundreds of millions of audio tracks, similarity search becomes difficult. Exhaustive search is feasible but very time consuming. For this scenario, this demo introduces how to build an audio similarity retrieval system with the open-source vector database Milvus.
Audio retrieval (speech, music, speaker, etc.) enables querying and finding similar sounds (or the same speaker) in massive audio data. An audio similarity retrieval system can be used to identify similar sound effects, minimize intellectual property infringement, quickly search voiceprint libraries, and help enterprises control fraud and identity theft. Audio retrieval also plays an important role in the classification and statistical analysis of audio data.
In this demo, you will learn how to build an audio retrieval system to retrieve similar sound snippets. Uploaded audio clips are converted into vectors using PaddleSpeech pretrained models (audio classification model, speaker recognition model, etc.) and stored in Milvus. Milvus automatically generates a unique ID for each vector, and the ID together with the corresponding audio information (audio ID, speaker ID, etc.) is stored in MySQL to complete the library construction. During retrieval, the user uploads a test audio to obtain its vector and runs a vector similarity search in Milvus; Milvus returns vector IDs, and the corresponding audio information is then queried from MySQL by ID.
![Workflow of an audio searching system](./img/audio_searching.png)
Note: this demo uses the [CN-Celeb](http://openslr.org/82/) dataset (at least 650,000 audio entries from 3,000 speakers) to build the audio vector library (audio features or speaker features), which is then searched using a preset distance metric. Other datasets can be used instead and adjusted as needed, e.g. Librispeech, VoxCeleb, UrbanSound, GloVe, MNIST, etc.
## Usage
### 1. Prepare PaddleSpeech
Audio vector extraction requires a model trained with PaddleSpeech, so please make sure PaddleSpeech is installed before running. For the installation steps, see the [installation documentation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install_cn.md).
You can choose one of the easy, medium, and hard ways to install paddlespeech.
### 2. Prepare MySQL and Milvus services
The audio similarity search requires the Milvus and MySQL services. We can start these containers with one command via [docker-compose.yaml](./docker-compose.yaml), so please make sure you have installed [Docker Engine](https://docs.docker.com/engine/install/) and [Docker Compose](https://docs.docker.com/compose/install/) before running. Then:
```bash
## Enter the audio_searching directory first, as in the following example
cd ~/PaddleSpeech/demos/audio_searching/
## Then start the related services in the containers
docker-compose -f docker-compose.yaml up -d
```
You will see that all the containers are created:
```bash
Creating network "quick_deploy_app_net" with driver "bridge"
Creating milvus-minio ... done
Creating milvus-etcd ... done
Creating audio-mysql ... done
Creating milvus-standalone ... done
Creating audio-webclient ... done
```
Show all containers with `docker ps`, and use `docker logs audio-mysql` to get the logs of the server container:
```bash
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b2bcf279e599 milvusdb/milvus:v2.0.1 "/tini -- milvus run…" 22 hours ago Up 22 hours 0.0.0.0:19530->19530/tcp milvus-standalone
d8ef4c84e25c mysql:5.7 "docker-entrypoint.s…" 22 hours ago Up 22 hours 0.0.0.0:3306->3306/tcp, 33060/tcp audio-mysql
8fb501edb4f3 quay.io/coreos/etcd:v3.5.0 "etcd -advertise-cli…" 22 hours ago Up 22 hours 2379-2380/tcp milvus-etcd
ffce340b3790 minio/minio:RELEASE.2020-12-03T00-03-10Z "/usr/bin/docker-ent…" 22 hours ago Up 22 hours (healthy) 9000/tcp milvus-minio
15c84a506754 paddlepaddle/paddlespeech-audio-search-client:2.3 "/bin/bash -c '/usr/…" 22 hours ago Up 22 hours (healthy) 0.0.0.0:8068->80/tcp audio-webclient
```
### 3. Configure and start the API server
Start the system server, which provides the HTTP backend service.
- Install the Python packages the service depends on
```bash
pip install -r requirements.txt
```
- Modify the configuration (for local runs this is usually unnecessary and can be skipped)
```bash
## Method 1: modify the source file
vim src/config.py
## Method 2: modify the environment variables, as shown below
export MILVUS_HOST=127.0.0.1
export MYSQL_HOST=127.0.0.1
```
Here are some parameters that need to be set; for more information, please refer to [config.py](./src/config.py).

| **Parameter**    | **Description**                  | **Default setting** |
| ---------------- | -------------------------------- | ------------------- |
| MILVUS_HOST      | IP address of the Milvus service | 127.0.0.1           |
| MILVUS_PORT      | Port of the Milvus service       | 19530               |
| VECTOR_DIMENSION | Dimension of the feature vectors | 192                 |
| MYSQL_HOST       | IP address of the MySQL service  | 127.0.0.1           |
| MYSQL_PORT       | Port of the MySQL service        | 3306                |
| DEFAULT_TABLE    | Default table name               | audio_table         |
- Run the program
Start the service built with FastAPI:
```bash
export PYTHONPATH=$PYTHONPATH:./src:../../paddleaudio
python src/main.py
```
Then you will see that the application has started:
```bash
INFO: Started server process [13352]
2022-03-26 22:45:30,838 INFO server.py serve 75 Started server process [13352]
INFO: Waiting for application startup.
2022-03-26 22:45:30,839 INFO on.py startup 45 Waiting for application startup.
INFO: Application startup complete.
2022-03-26 22:45:30,839 INFO on.py startup 59 Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit)
2022-03-26 22:45:30,840 INFO server.py _log_started_message 206 Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit)
```
### 4. Usage
- Prepare data
```bash
wget -c https://www.openslr.org/resources/82/cn-celeb_v2.tar.gz && tar -xvf cn-celeb_v2.tar.gz
```
**Note**: If you want to build a quick demo, you can use the 20 audio files downloaded by the `download_audio_data` function in ./src/test_main.py; the following results use this collection as an example.
- Prepare the model (skip this step if you use the default model)
```bash
## Modify the model configuration parameters. Currently only ecapatdnn_voxceleb12 is supported; more types will be supported in the future
vim ./src/encode.py
```
- Script test (Recommended)
```bash
python ./src/test_main.py
```
Note: the script downloads the data, loads the paddlespeech model, extracts embeddings, builds the library, searches it, and finally drops it.
Output:
```bash
Downloading https://paddlespeech.bj.bcebos.com/vector/audio/example_audio.tar.gz ...
...
Unpacking ./example_audio.tar.gz ...
[2022-03-26 22:50:54,987] [ INFO] - checking the aduio file format......
[2022-03-26 22:50:54,987] [ INFO] - The sample rate is 16000
[2022-03-26 22:50:54,987] [ INFO] - The audio file format is right
[2022-03-26 22:50:54,988] [ INFO] - device type: cpu
[2022-03-26 22:50:54,988] [ INFO] - load the pretrained model: ecapatdnn_voxceleb12-16k
[2022-03-26 22:50:54,990] [ INFO] - Downloading sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_0.tar.gz from https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_0.tar.gz
...
[2022-03-26 22:51:17,285] [ INFO] - start to dynamic import the model class
[2022-03-26 22:51:17,285] [ INFO] - model name ecapatdnn
[2022-03-26 22:51:23,864] [ INFO] - start to set the model parameters to model
[2022-03-26 22:54:08,115] [ INFO] - create the model instance success
[2022-03-26 22:54:08,116] [ INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_
searching/example_audio/knife_hit_iron3.wav
[2022-03-26 22:54:08,116] [ INFO] - load the audio sample points, shape is: (11012,)
[2022-03-26 22:54:08,150] [ INFO] - extract the audio feat, shape is: (80, 69)
[2022-03-26 22:54:08,152] [ INFO] - feats shape: [1, 80, 69]
[2022-03-26 22:54:08,154] [ INFO] - audio extract the feat success
[2022-03-26 22:54:08,155] [ INFO] - start to do backbone network model forward
[2022-03-26 22:54:08,155] [ INFO] - feats shape:[1, 80, 69], lengths shape: [1]
[2022-03-26 22:54:08,433] [ INFO] - embedding size: (192,)
Extracting feature from audio No. 1 , 20 audios in total
[2022-03-26 22:54:08,435] [ INFO] - checking the aduio file format......
[2022-03-26 22:54:08,435] [ INFO] - The sample rate is 16000
[2022-03-26 22:54:08,436] [ INFO] - The audio file format is right
[2022-03-26 22:54:08,436] [ INFO] - device type: cpu
[2022-03-26 22:54:08,436] [ INFO] - Model has been initialized
[2022-03-26 22:54:08,436] [ INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_searching/example_audio/sword_wielding.wav
[2022-03-26 22:54:08,436] [ INFO] - load the audio sample points, shape is: (6391,)
[2022-03-26 22:54:08,452] [ INFO] - extract the audio feat, shape is: (80, 40)
[2022-03-26 22:54:08,454] [ INFO] - feats shape: [1, 80, 40]
[2022-03-26 22:54:08,454] [ INFO] - audio extract the feat success
[2022-03-26 22:54:08,454] [ INFO] - start to do backbone network model forward
[2022-03-26 22:54:08,455] [ INFO] - feats shape:[1, 80, 40], lengths shape: [1]
[2022-03-26 22:54:08,633] [ INFO] - embedding size: (192,)
Extracting feature from audio No. 2 , 20 audios in total
...
2022-03-26 22:54:15,892 INFO main.py load_audios 85 Successfully loaded data, total count: 20
2022-03-26 22:54:15,908 INFO main.py count_audio 148 Successfully count the number of data!
[2022-03-26 22:54:15,916] [ INFO] - checking the aduio file format......
[2022-03-26 22:54:15,916] [ INFO] - The sample rate is 16000
[2022-03-26 22:54:15,916] [ INFO] - The audio file format is right
[2022-03-26 22:54:15,916] [ INFO] - device type: cpu
[2022-03-26 22:54:15,916] [ INFO] - Model has been initialized
[2022-03-26 22:54:15,916] [ INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_searching/example_audio/test.wav
[2022-03-26 22:54:15,917] [ INFO] - load the audio sample points, shape is: (8456,)
[2022-03-26 22:54:15,923] [ INFO] - extract the audio feat, shape is: (80, 53)
[2022-03-26 22:54:15,924] [ INFO] - feats shape: [1, 80, 53]
[2022-03-26 22:54:15,924] [ INFO] - audio extract the feat success
[2022-03-26 22:54:15,924] [ INFO] - start to do backbone network model forward
[2022-03-26 22:54:15,924] [ INFO] - feats shape:[1, 80, 53], lengths shape: [1]
[2022-03-26 22:54:16,051] [ INFO] - embedding size: (192,)
...
2022-03-26 22:54:16,086 INFO main.py search_local_audio 132 search result http://testserver/data?audio_path=./example_audio/test.wav, score 100.0
2022-03-26 22:54:16,087 INFO main.py search_local_audio 132 search result http://testserver/data?audio_path=./example_audio/knife_chopping.wav, score 29.182177782058716
2022-03-26 22:54:16,087 INFO main.py search_local_audio 132 search result http://testserver/data?audio_path=./example_audio/knife_cut_into_body.wav, score 22.73637056350708
...
2022-03-26 22:54:16,088 INFO main.py search_local_audio 136 Successfully searched similar audio!
2022-03-26 22:54:17,164 INFO main.py drop_tables 160 Successfully drop tables in Milvus and MySQL!
```
- GUI test (Optional)
Navigate to 127.0.0.1:8068 in your browser to access the front-end page.
**Note**: If the browser and the service are not on the same machine, change the IP to the IP of the machine where the service is located, update the corresponding API_URL in docker-compose.yaml, and re-run docker-compose for the change to take effect.
- Insert data
Download the data on the server and decompress it into a directory, for example /home/speech/data/. Then enter /home/speech/data/ in the address bar of the upload page to upload the data.
![](./img/insert.png)
- Search for similar audio
Select the magnifying glass in the upper left corner, press the "Default Target Audio File" button, and upload a test audio from the client; you will then see the search results.
![](./img/search.png)
### 5. Results
Machine configuration:
- OS: CentOS release 7.6
- kernel: 4.17.11-1.el7.elrepo.x86_64
- CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
- memory: 132 GB
Dataset:
- CN-Celeb, train size 650,000, test size 10,000, vector dimension 192, distance metric L2
Recall and elapsed-time statistics are shown in the figure below:
![](./img/result.png)
At a 90% recall rate, the Milvus-based retrieval framework takes about 2.9 ms per search, and feature extraction (embedding) takes about 500 ms (for a test audio of about 5 seconds), i.e. a single audio query takes about 503 ms in total, which meets most application scenarios.
### 6. Pretrained Models
Here is the list of pretrained models provided by PaddleSpeech:
| Model | Sample Rate
| :--- | :---:
| ecapa_tdnn | 16000

@ -0,0 +1,88 @@
version: '3.5'
services:
etcd:
container_name: milvus-etcd
image: quay.io/coreos/etcd:v3.5.0
networks:
app_net:
environment:
- ETCD_AUTO_COMPACTION_MODE=revision
- ETCD_AUTO_COMPACTION_RETENTION=1000
- ETCD_QUOTA_BACKEND_BYTES=4294967296
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
minio:
container_name: milvus-minio
image: minio/minio:RELEASE.2020-12-03T00-03-10Z
networks:
app_net:
environment:
MINIO_ACCESS_KEY: minioadmin
MINIO_SECRET_KEY: minioadmin
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
command: minio server /minio_data
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
interval: 30s
timeout: 20s
retries: 3
standalone:
container_name: milvus-standalone
image: milvusdb/milvus:v2.0.1
networks:
app_net:
ipv4_address: 172.16.23.10
command: ["milvus", "run", "standalone"]
environment:
ETCD_ENDPOINTS: etcd:2379
MINIO_ADDRESS: minio:9000
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
ports:
- "19530:19530"
depends_on:
- "etcd"
- "minio"
mysql:
container_name: audio-mysql
image: mysql:5.7
networks:
app_net:
ipv4_address: 172.16.23.11
environment:
- MYSQL_ROOT_PASSWORD=123456
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/mysql:/var/lib/mysql
ports:
- "3306:3306"
webclient:
container_name: audio-webclient
image: paddlepaddle/paddlespeech-audio-search-client:2.3
networks:
app_net:
ipv4_address: 172.16.23.13
environment:
API_URL: 'http://127.0.0.1:8002'
ports:
- "8068:80"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost/"]
interval: 30s
timeout: 20s
retries: 3
networks:
app_net:
driver: bridge
ipam:
driver: default
config:
- subnet: 172.16.23.0/24
gateway: 172.16.23.1

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

@ -0,0 +1,13 @@
diskcache==5.2.1
dtaidistance==2.3.1
fastapi
librosa==0.8.0
numpy==1.21.0
pydantic
pymilvus==2.0.1
pymysql
python-multipart
soundfile==0.10.3.post1
starlette
typing
uvicorn

@ -0,0 +1,36 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
############### Milvus Configuration ###############
MILVUS_HOST = os.getenv("MILVUS_HOST", "127.0.0.1")
MILVUS_PORT = int(os.getenv("MILVUS_PORT", "19530"))
VECTOR_DIMENSION = int(os.getenv("VECTOR_DIMENSION", "192"))
INDEX_FILE_SIZE = int(os.getenv("INDEX_FILE_SIZE", "1024"))
METRIC_TYPE = os.getenv("METRIC_TYPE", "L2")
DEFAULT_TABLE = os.getenv("DEFAULT_TABLE", "audio_table")
TOP_K = int(os.getenv("TOP_K", "10"))
############### MySQL Configuration ###############
MYSQL_HOST = os.getenv("MYSQL_HOST", "127.0.0.1")
MYSQL_PORT = int(os.getenv("MYSQL_PORT", "3306"))
MYSQL_USER = os.getenv("MYSQL_USER", "root")
MYSQL_PWD = os.getenv("MYSQL_PWD", "123456")
MYSQL_DB = os.getenv("MYSQL_DB", "mysql")
############### Data Path ###############
UPLOAD_PATH = os.getenv("UPLOAD_PATH", "tmp/audio-data")
############### Number of Log Files ###############
LOGS_NUM = int(os.getenv("logs_num", "0"))

@ -0,0 +1,34 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from logs import LOGGER
from paddlespeech.cli import VectorExecutor
vector_executor = VectorExecutor()
def get_audio_embedding(path):
"""
Use vpr_inference to generate embedding of audio
"""
try:
embedding = vector_executor(
audio_file=path, model='ecapatdnn_voxceleb12')
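# normalize to unit length so that L2 distances in Milvus behave like cosine distances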
embedding = embedding / np.linalg.norm(embedding)
embedding = embedding.tolist()
return embedding
except Exception as e:
LOGGER.error(f"Error with embedding:{e}")
return None
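# Example usage (hypothetical path):
#   emb = get_audio_embedding("./example_audio/test.wav")
#   # emb is a 192-dim, L2-normalized Python list, or None on failure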

@ -0,0 +1,163 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import datetime
import logging
import os
import re
import sys
from config import LOGS_NUM
class MultiprocessHandler(logging.FileHandler):
"""
A handler class which writes formatted logging records to disk files
"""
def __init__(self,
filename,
when='D',
backupCount=0,
encoding=None,
delay=False):
"""
Open the specified file and use it as the stream for logging
"""
self.prefix = filename
self.backupCount = backupCount
self.when = when.upper()
self.extMath = r"^\d{4}-\d{2}-\d{2}"
self.when_dict = {
'S': "%Y-%m-%d-%H-%M-%S",
'M': "%Y-%m-%d-%H-%M",
'H': "%Y-%m-%d-%H",
'D': "%Y-%m-%d"
}
self.suffix = self.when_dict.get(when)
if not self.suffix:
print('The specified date interval unit is invalid: ', self.when)
sys.exit(1)
self.filefmt = os.path.join('.', "logs",
f"{self.prefix}-{self.suffix}.log")
self.filePath = datetime.datetime.now().strftime(self.filefmt)
_dir = os.path.dirname(self.filefmt)
try:
if not os.path.exists(_dir):
os.makedirs(_dir)
except Exception as e:
print('Failed to create log file: ', e)
print("log_path" + self.filePath)
sys.exit(1)
logging.FileHandler.__init__(self, self.filePath, 'a+', encoding, delay)
def should_change_file_to_write(self):
"""
Check whether the dated log file name has changed and a new file should be opened
"""
_filePath = datetime.datetime.now().strftime(self.filefmt)
if _filePath != self.filePath:
self.filePath = _filePath
return True
return False
def do_change_file(self):
"""
Switch the stream to the current dated log file and remove outdated backups
"""
self.baseFilename = os.path.abspath(self.filePath)
if self.stream:
self.stream.close()
self.stream = None
if not self.delay:
self.stream = self._open()
if self.backupCount > 0:
for s in self.get_files_to_delete():
os.remove(s)
def get_files_to_delete(self):
"""
Collect the outdated backup log files to delete
"""
dir_name, _ = os.path.split(self.baseFilename)
file_names = os.listdir(dir_name)
result = []
prefix = self.prefix + '-'
for file_name in file_names:
if file_name[:len(prefix)] == prefix:
suffix = file_name[len(prefix):-4]
if re.compile(self.extMath).match(suffix):
result.append(os.path.join(dir_name, file_name))
result.sort()
if len(result) < self.backupCount:
result = []
else:
result = result[:len(result) - self.backupCount]
return result
def emit(self, record):
"""
Emit a record
"""
try:
if self.should_change_file_to_write():
self.do_change_file()
logging.FileHandler.emit(self, record)
except (KeyboardInterrupt, SystemExit):
raise
except Exception:
self.handleError(record)
def write_log():
"""
Init a logger
"""
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
# formatter = '%(asctime)s %(levelname)s %(filename)s %(funcName)s %(module)s %(lineno)s %(message)s'
fmt = logging.Formatter(
'%(asctime)s %(levelname)s %(filename)s %(funcName)s %(lineno)s %(message)s'
)
stream_handler = logging.StreamHandler(sys.stdout)
stream_handler.setLevel(logging.INFO)
stream_handler.setFormatter(fmt)
log_name = "audio-searching"
file_handler = MultiprocessHandler(log_name, when='D', backupCount=LOGS_NUM)
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(fmt)
file_handler.do_change_file()
logger.addHandler(stream_handler)
logger.addHandler(file_handler)
return logger
LOGGER = write_log()
if __name__ == "__main__":
message = 'test writing logs'
LOGGER.info(message)
LOGGER.debug(message)
LOGGER.error(message)

@ -0,0 +1,168 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from typing import Optional
import uvicorn
from config import UPLOAD_PATH
from diskcache import Cache
from fastapi import FastAPI
from fastapi import File
from fastapi import UploadFile
from logs import LOGGER
from milvus_helpers import MilvusHelper
from mysql_helpers import MySQLHelper
from operations.count import do_count
from operations.drop import do_drop
from operations.load import do_load
from operations.search import do_search
from pydantic import BaseModel
from starlette.middleware.cors import CORSMiddleware
from starlette.requests import Request
from starlette.responses import FileResponse
app = FastAPI()
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"])
MODEL = None
MILVUS_CLI = MilvusHelper()
MYSQL_CLI = MySQLHelper()
# Mkdir 'tmp/audio-data'
if not os.path.exists(UPLOAD_PATH):
os.makedirs(UPLOAD_PATH)
LOGGER.info(f"Mkdir the path: {UPLOAD_PATH}")
@app.get('/data')
def audio_path(audio_path):
# Get the audio file
try:
LOGGER.info(f"Successfully load audio: {audio_path}")
return FileResponse(audio_path)
except Exception as e:
LOGGER.error(f"upload audio error: {e}")
return {'status': False, 'msg': e}, 400
@app.get('/progress')
def get_progress():
# Get the progress of dealing with data
try:
cache = Cache('./tmp')
return f"current: {cache['current']}, total: {cache['total']}"
except Exception as e:
LOGGER.error(f"Upload data error: {e}")
return {'status': False, 'msg': e}, 400
class Item(BaseModel):
Table: Optional[str] = None
File: str
@app.post('/audio/load')
async def load_audios(item: Item):
# Insert all the audio files under the file path to Milvus/MySQL
try:
total_num = do_load(item.Table, item.File, MILVUS_CLI, MYSQL_CLI)
LOGGER.info(f"Successfully loaded data, total count: {total_num}")
return {'status': True, 'msg': "Successfully loaded data!"}
except Exception as e:
LOGGER.error(e)
return {'status': False, 'msg': e}, 400
@app.post('/audio/search')
async def search_audio(request: Request,
table_name: str=None,
audio: UploadFile=File(...)):
# Search the uploaded audio in Milvus/MySQL
try:
# Save the upload data to server.
content = await audio.read()
query_audio_path = os.path.join(UPLOAD_PATH, audio.filename)
with open(query_audio_path, "wb+") as f:
f.write(content)
host = request.headers['host']
_, paths, distances = do_search(host, table_name, query_audio_path,
MILVUS_CLI, MYSQL_CLI)
names = []
for path, score in zip(paths, distances):
names.append(os.path.basename(path))
LOGGER.info(f"search result {path}, score {score}")
res = dict(zip(paths, zip(names, distances)))
# Sort results by distance metric, closest distances first
res = sorted(res.items(), key=lambda item: item[1][1], reverse=True)
LOGGER.info("Successfully searched similar audio!")
return res
except Exception as e:
LOGGER.error(e)
return {'status': False, 'msg': e}, 400
@app.post('/audio/search/local')
async def search_local_audio(request: Request,
query_audio_path: str,
table_name: str=None):
# Search the uploaded audio in Milvus/MySQL
try:
host = request.headers['host']
_, paths, distances = do_search(host, table_name, query_audio_path,
MILVUS_CLI, MYSQL_CLI)
names = []
for path, score in zip(paths, distances):
names.append(os.path.basename(path))
LOGGER.info(f"search result {path}, score {score}")
res = dict(zip(paths, zip(names, distances)))
# Sort results by distance metric, closest distances first
res = sorted(res.items(), key=lambda item: item[1][1], reverse=True)
LOGGER.info("Successfully searched similar audio!")
return res
except Exception as e:
LOGGER.error(e)
return {'status': False, 'msg': e}, 400
@app.get('/audio/count')
async def count_audio(table_name: str=None):
# Returns the total number of vectors in the system
try:
num = do_count(table_name, MILVUS_CLI)
LOGGER.info("Successfully count the number of data!")
return num
except Exception as e:
LOGGER.error(e)
return {'status': False, 'msg': e}, 400
@app.post('/audio/drop')
async def drop_tables(table_name: str=None):
# Delete the collection of Milvus and MySQL
try:
status = do_drop(table_name, MILVUS_CLI, MYSQL_CLI)
LOGGER.info("Successfully drop tables in Milvus and MySQL!")
return status
except Exception as e:
LOGGER.error(e)
return {'status': False, 'msg': e}, 400
if __name__ == '__main__':
uvicorn.run(app=app, host='0.0.0.0', port=8002)
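A client-side sketch (not part of this commit) that drives the endpoints above with the `requests` package; it assumes the server is running locally on port 8002 and that `./example_audio` holds the sample dataset used by the tests.
```python
import requests

BASE = "http://127.0.0.1:8002"

# Index every supported audio file under ./example_audio.
r = requests.post(f"{BASE}/audio/load", json={"File": "./example_audio"})
print(r.json())  # {'status': True, 'msg': 'Successfully loaded data!'}

# Search with a server-local path; FastAPI reads these as query parameters.
r = requests.post(f"{BASE}/audio/search/local",
                  params={"query_audio_path": "./example_audio/test.wav"})
for path, (name, score) in r.json():
    print(name, score)

# Total number of stored vectors.
print(requests.get(f"{BASE}/audio/count").json())
```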

@ -0,0 +1,185 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
from config import METRIC_TYPE
from config import MILVUS_HOST
from config import MILVUS_PORT
from config import VECTOR_DIMENSION
from logs import LOGGER
from pymilvus import Collection
from pymilvus import CollectionSchema
from pymilvus import connections
from pymilvus import DataType
from pymilvus import FieldSchema
from pymilvus import utility
class MilvusHelper:
"""
the basic operations of PyMilvus
# This example shows how to:
# 1. connect to Milvus server
# 2. create a collection
# 3. insert entities
# 4. create index
# 5. search
# 6. delete a collection
"""
def __init__(self):
try:
self.collection = None
connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)
            LOGGER.debug(
                f"Successfully connected to Milvus with IP: {MILVUS_HOST} and PORT: {MILVUS_PORT}"
            )
        except Exception as e:
            LOGGER.error(f"Failed to connect to Milvus: {e}")
sys.exit(1)
def set_collection(self, collection_name):
try:
if self.has_collection(collection_name):
self.collection = Collection(name=collection_name)
else:
raise Exception(
f"There is no collection named:{collection_name}")
except Exception as e:
LOGGER.error(f"Failed to set collection in Milvus: {e}")
sys.exit(1)
def has_collection(self, collection_name):
# Return if Milvus has the collection
try:
return utility.has_collection(collection_name)
except Exception as e:
LOGGER.error(f"Failed to check state of collection in Milvus: {e}")
sys.exit(1)
    def create_collection(self, collection_name):
        # Create the milvus collection if it does not exist
        try:
            if not self.has_collection(collection_name):
                field1 = FieldSchema(
                    name="id",
                    dtype=DataType.INT64,
                    description="int64",
                    is_primary=True,
                    auto_id=True)
                field2 = FieldSchema(
                    name="embedding",
                    dtype=DataType.FLOAT_VECTOR,
                    description="speaker embeddings",
                    dim=VECTOR_DIMENSION,
                    is_primary=False)
schema = CollectionSchema(
fields=[field1, field2], description="embeddings info")
self.collection = Collection(
name=collection_name, schema=schema)
LOGGER.debug(f"Create Milvus collection: {collection_name}")
else:
self.set_collection(collection_name)
return "OK"
except Exception as e:
LOGGER.error(f"Failed to create collection in Milvus: {e}")
sys.exit(1)
def insert(self, collection_name, vectors):
# Batch insert vectors to milvus collection
try:
self.create_collection(collection_name)
data = [vectors]
self.set_collection(collection_name)
mr = self.collection.insert(data)
ids = mr.primary_keys
self.collection.load()
LOGGER.debug(
f"Insert vectors to Milvus in collection: {collection_name} with {len(vectors)} rows"
)
return ids
except Exception as e:
LOGGER.error(f"Failed to insert data to Milvus: {e}")
sys.exit(1)
def create_index(self, collection_name):
        # Create an IVF_SQ8 index on the milvus collection
try:
self.set_collection(collection_name)
default_index = {
"index_type": "IVF_SQ8",
"metric_type": METRIC_TYPE,
"params": {
"nlist": 16384
}
}
status = self.collection.create_index(
field_name="embedding", index_params=default_index)
if not status.code:
LOGGER.debug(
f"Successfully create index in collection:{collection_name} with param:{default_index}"
)
return status
else:
raise Exception(status.message)
except Exception as e:
LOGGER.error(f"Failed to create index: {e}")
sys.exit(1)
def delete_collection(self, collection_name):
# Delete Milvus collection
try:
self.set_collection(collection_name)
self.collection.drop()
LOGGER.debug("Successfully drop collection!")
return "ok"
except Exception as e:
LOGGER.error(f"Failed to drop collection: {e}")
sys.exit(1)
def search_vectors(self, collection_name, vectors, top_k):
# Search vector in milvus collection
try:
self.set_collection(collection_name)
search_params = {
"metric_type": METRIC_TYPE,
"params": {
"nprobe": 16
}
}
res = self.collection.search(
vectors,
anns_field="embedding",
param=search_params,
limit=top_k)
LOGGER.debug(f"Successfully search in collection: {res}")
return res
except Exception as e:
LOGGER.error(f"Failed to search vectors in Milvus: {e}")
sys.exit(1)
def count(self, collection_name):
        # Get the number of entities in the milvus collection
try:
self.set_collection(collection_name)
num = self.collection.num_entities
            LOGGER.debug(
                f"Successfully got num: {num} of the collection: {collection_name}"
            )
return num
except Exception as e:
LOGGER.error(f"Failed to count vectors in Milvus: {e}")
sys.exit(1)
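A minimal driver sketch for the helper above (not part of this commit); the collection name and the random vectors are illustrative, and `VECTOR_DIMENSION` comes from `config` as in this module.
```python
import numpy as np
from config import VECTOR_DIMENSION
from milvus_helpers import MilvusHelper

client = MilvusHelper()
name = "audio_demo"  # illustrative collection name
vectors = np.random.rand(3, VECTOR_DIMENSION).astype("float32").tolist()

ids = client.insert(name, vectors)  # creates the collection on first use
client.create_index(name)           # IVF_SQ8 index on the "embedding" field
print(client.count(name))           # -> 3
hits = client.search_vectors(name, vectors[:1], 2)
print([(hit.id, hit.distance) for hit in hits[0]])
client.delete_collection(name)
```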

@ -0,0 +1,133 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import pymysql
from config import MYSQL_DB
from config import MYSQL_HOST
from config import MYSQL_PORT
from config import MYSQL_PWD
from config import MYSQL_USER
from logs import LOGGER
class MySQLHelper():
"""
the basic operations of PyMySQL
# This example shows how to:
# 1. connect to MySQL server
# 2. create a table
# 3. insert data to table
# 4. search by milvus ids
# 5. delete table
"""
def __init__(self):
self.conn = pymysql.connect(
host=MYSQL_HOST,
user=MYSQL_USER,
port=MYSQL_PORT,
password=MYSQL_PWD,
database=MYSQL_DB,
local_infile=True)
self.cursor = self.conn.cursor()
def test_connection(self):
try:
self.conn.ping()
except Exception:
self.conn = pymysql.connect(
host=MYSQL_HOST,
user=MYSQL_USER,
port=MYSQL_PORT,
password=MYSQL_PWD,
database=MYSQL_DB,
local_infile=True)
self.cursor = self.conn.cursor()
def create_mysql_table(self, table_name):
# Create mysql table if not exists
self.test_connection()
sql = "create table if not exists " + table_name + "(milvus_id TEXT, audio_path TEXT);"
try:
self.cursor.execute(sql)
LOGGER.debug(f"MYSQL create table: {table_name} with sql: {sql}")
except Exception as e:
LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
sys.exit(1)
def load_data_to_mysql(self, table_name, data):
        # Batch insert (milvus_id, audio_path) pairs into mysql
self.test_connection()
sql = "insert into " + table_name + " (milvus_id,audio_path) values (%s,%s);"
try:
self.cursor.executemany(sql, data)
self.conn.commit()
LOGGER.debug(
f"MYSQL loads data to table: {table_name} successfully")
except Exception as e:
LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
sys.exit(1)
def search_by_milvus_ids(self, ids, table_name):
        # Get the audio_path according to the milvus ids
self.test_connection()
str_ids = str(ids).replace('[', '').replace(']', '')
sql = "select audio_path from " + table_name + " where milvus_id in (" + str_ids + ") order by field (milvus_id," + str_ids + ");"
try:
self.cursor.execute(sql)
results = self.cursor.fetchall()
results = [res[0] for res in results]
LOGGER.debug("MYSQL search by milvus id.")
return results
except Exception as e:
LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
sys.exit(1)
def delete_table(self, table_name):
# Delete mysql table if exists
self.test_connection()
sql = "drop table if exists " + table_name + ";"
try:
self.cursor.execute(sql)
LOGGER.debug(f"MYSQL delete table:{table_name}")
except Exception as e:
LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
sys.exit(1)
def delete_all_data(self, table_name):
# Delete all the data in mysql table
self.test_connection()
sql = 'delete from ' + table_name + ';'
try:
self.cursor.execute(sql)
self.conn.commit()
LOGGER.debug(f"MYSQL delete all data in table:{table_name}")
except Exception as e:
LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
sys.exit(1)
def count_table(self, table_name):
        # Get the row count of the mysql table
self.test_connection()
sql = "select count(milvus_id) from " + table_name + ";"
try:
self.cursor.execute(sql)
results = self.cursor.fetchall()
LOGGER.debug(f"MYSQL count table:{table_name}")
return results[0][0]
except Exception as e:
LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
sys.exit(1)
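A matching sketch for the MySQL helper (not part of this commit); the table name and rows are illustrative. Note that `search_by_milvus_ids` splices ids into the SQL string, so the ids should only ever come from Milvus, never from user input.
```python
from mysql_helpers import MySQLHelper

db = MySQLHelper()
table = "audio_demo"  # illustrative table name
db.create_mysql_table(table)
db.load_data_to_mysql(table, [
    ("425430969600126000", b"./audio/a.wav"),  # (milvus_id, audio_path)
    ("425430969600126001", b"./audio/b.wav"),
])
print(db.count_table(table))  # -> 2
print(db.search_by_milvus_ids(["425430969600126000"], table))
db.delete_table(table)
```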

@ -0,0 +1,13 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@ -0,0 +1,33 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
from config import DEFAULT_TABLE
from logs import LOGGER
def do_count(table_name, milvus_cli):
"""
Returns the total number of vectors in the system
"""
if not table_name:
table_name = DEFAULT_TABLE
try:
if not milvus_cli.has_collection(table_name):
return None
num = milvus_cli.count(table_name)
return num
except Exception as e:
LOGGER.error(f"Error attempting to count table {e}")
sys.exit(1)

@ -0,0 +1,34 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
from config import DEFAULT_TABLE
from logs import LOGGER
def do_drop(table_name, milvus_cli, mysql_cli):
"""
Delete the collection of Milvus and MySQL
"""
if not table_name:
table_name = DEFAULT_TABLE
try:
if not milvus_cli.has_collection(table_name):
return "Collection is not exist"
status = milvus_cli.delete_collection(table_name)
mysql_cli.delete_table(table_name)
return status
except Exception as e:
LOGGER.error(f"Error attempting to drop table: {e}")
sys.exit(1)

@ -0,0 +1,84 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
from config import DEFAULT_TABLE
from diskcache import Cache
from encode import get_audio_embedding
from logs import LOGGER
def get_audios(path):
"""
    List all supported audio files (wav, mp3, ogg, flac and m4a) recursively under the path folder.
"""
supported_formats = [".wav", ".mp3", ".ogg", ".flac", ".m4a"]
return [
item for sublist in [[os.path.join(dir, file) for file in files]
for dir, _, files in list(os.walk(path))]
for item in sublist if os.path.splitext(item)[1] in supported_formats
]
def extract_features(audio_dir):
"""
    Extract an embedding from each audio file under the directory
"""
try:
cache = Cache('./tmp')
feats = []
names = []
audio_list = get_audios(audio_dir)
total = len(audio_list)
cache['total'] = total
for i, audio_path in enumerate(audio_list):
norm_feat = get_audio_embedding(audio_path)
if norm_feat is None:
continue
feats.append(norm_feat)
names.append(audio_path.encode())
cache['current'] = i + 1
            print(
                f"Extracting feature from audio No. {i + 1}, {total} audios in total"
            )
return feats, names
except Exception as e:
LOGGER.error(f"Error with extracting feature from audio {e}")
sys.exit(1)
def format_data(ids, names):
"""
Combine the id of the vector and the name of the audio into a list
"""
data = []
for i in range(len(ids)):
value = (str(ids[i]), names[i])
data.append(value)
return data
def do_load(table_name, audio_dir, milvus_cli, mysql_cli):
"""
    Import vectors to Milvus and metadata to MySQL respectively
"""
if not table_name:
table_name = DEFAULT_TABLE
vectors, names = extract_features(audio_dir)
ids = milvus_cli.insert(table_name, vectors)
milvus_cli.create_index(table_name)
mysql_cli.create_mysql_table(table_name)
mysql_cli.load_data_to_mysql(table_name, format_data(ids, names))
return len(ids)
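A hedged end-to-end driver for `do_load` (not part of this commit), wiring it to the helpers above; `./example_audio` is the sample dataset the tests download, and `audio_demo` is an illustrative table name.
```python
from milvus_helpers import MilvusHelper
from mysql_helpers import MySQLHelper

if __name__ == "__main__":
    total = do_load("audio_demo", "./example_audio", MilvusHelper(), MySQLHelper())
    print(f"inserted {total} embeddings")
```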

@ -0,0 +1,41 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
from config import DEFAULT_TABLE
from config import TOP_K
from encode import get_audio_embedding
from logs import LOGGER
def do_search(host, table_name, audio_path, milvus_cli, mysql_cli):
"""
Search the uploaded audio in Milvus/MySQL
"""
try:
if not table_name:
table_name = DEFAULT_TABLE
feat = get_audio_embedding(audio_path)
vectors = milvus_cli.search_vectors(table_name, [feat], TOP_K)
vids = [str(x.id) for x in vectors[0]]
paths = mysql_cli.search_by_milvus_ids(vids, table_name)
distances = [x.distance for x in vectors[0]]
for i in range(len(paths)):
tmp = "http://" + str(host) + "/data?audio_path=" + str(paths[i])
paths[i] = tmp
distances[i] = (1 - distances[i]) * 100
return vids, paths, distances
except Exception as e:
LOGGER.error(f"Error with search: {e}")
sys.exit(1)
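Illustrative only: `do_search` maps each raw distance `d` to the percentage score `(1 - d) * 100`, so a distance of 0.0 becomes 100 and anything above 1.0 goes negative; this implicitly assumes the metric keeps distances in [0, 1], e.g. cosine-style distances on normalized embeddings.
```python
# Toy distances showing the (1 - d) * 100 transform used in do_search().
distances = [0.0, 0.25, 0.5]
scores = [(1 - d) * 100 for d in distances]
print(scores)  # [100.0, 75.0, 50.0]
```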

@ -0,0 +1,95 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from fastapi.testclient import TestClient
from main import app
from utils.utility import download
from utils.utility import unpack
client = TestClient(app)
def download_audio_data():
"""
download audio data
"""
url = "https://paddlespeech.bj.bcebos.com/vector/audio/example_audio.tar.gz"
md5sum = "52ac69316c1aa1fdef84da7dd2c67b39"
target_dir = "./"
filepath = download(url, md5sum, target_dir)
unpack(filepath, target_dir, True)
def test_drop():
"""
Delete the collection of Milvus and MySQL
"""
response = client.post("/audio/drop")
assert response.status_code == 200
def test_load():
"""
Insert all the audio files under the file path to Milvus/MySQL
"""
response = client.post("/audio/load", json={"File": "./example_audio"})
assert response.status_code == 200
assert response.json() == {
'status': True,
'msg': "Successfully loaded data!"
}
def test_progress():
"""
Get the progress of dealing with data
"""
response = client.get("/progress")
assert response.status_code == 200
assert response.json() == "current: 20, total: 20"
def test_count():
"""
Returns the total number of vectors in the system
"""
response = client.get("audio/count")
assert response.status_code == 200
assert response.json() == 20
def test_search():
"""
Search the uploaded audio in Milvus/MySQL
"""
response = client.post(
"/audio/search/local?query_audio_path=.%2Fexample_audio%2Ftest.wav")
assert response.status_code == 200
assert len(response.json()) == 10
def test_data():
"""
Get the audio file
"""
response = client.get("/data?audio_path=.%2Fexample_audio%2Ftest.wav")
assert response.status_code == 200
if __name__ == "__main__":
download_audio_data()
test_load()
test_count()
test_search()
test_drop()

@ -0,0 +1,158 @@
([简体中文](./README_cn.md)|English)
# Speaker Verification
## Introduction
Speaker verification refers to getting a speaker embedding from an audio clip and using it to verify the speaker's identity.
This demo extracts a speaker embedding from a given audio file. It can be done with a single command or a few lines of Python using `PaddleSpeech`.
## Usage
### 1. Installation
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
You can choose one way from easy, medium and hard to install paddlespeech.
### 2. Prepare Input File
The input of this demo should be a WAV file (`.wav`), and its sample rate must match the model's.
Here is a sample file for this demo that can be downloaded:
```bash
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
```
### 3. Usage
- Command Line(Recommended)
```bash
paddlespeech vector --task spk --input 85236145389.wav
echo -e "demo1 85236145389.wav" > vec.job
paddlespeech vector --task spk --input vec.job
echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk
```
Usage:
```bash
paddlespeech vector --help
```
Arguments:
- `input` (required): Audio file to extract the speaker embedding from.
- `model`: Model type of vector task. Default: `ecapatdnn_voxceleb12`.
- `sample_rate`: Sample rate of the model. Default: `16000`.
- `config`: Config of vector task. Use pretrained model when it is None. Default: `None`.
- `ckpt_path`: Model checkpoint. Use pretrained model when it is None. Default: `None`.
- `device`: Choose device to execute model inference. Default: default device of paddlepaddle in current environment.
Output:
```bash
demo [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268
-3.04878 1.611095 10.127234 -10.534177 -15.821609
1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228
-11.343508 2.3385992 -8.719341 14.213509 15.404744
-0.39327756 6.338786 2.688887 8.7104025 17.469526
-8.77959 7.0576906 4.648855 -1.3089896 -23.294737
8.013747 13.891729 -9.926753 5.655307 -5.9422326
-22.842539 0.6293588 -18.46266 -10.811862 9.8192625
3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942
1.7594414 -0.6485091 4.485623 2.0207152 7.264915
-6.40137 23.63524 2.9711294 -22.708025 9.93719
20.354511 -10.324688 -0.700492 -8.783211 -5.27593
15.999649 3.3004563 12.747926 15.429879 4.7849145
5.6699696 -2.3826702 10.605882 3.9112158 3.1500628
15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124
-9.224193 14.568347 -10.568833 4.982321 -4.342062
0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362
-6.680575 0.4757669 -5.035051 -6.7964664 16.865469
-11.54324 7.681869 0.44475392 9.708182 -8.932846
0.4123232 -4.361452 1.3948607 9.511665 0.11667654
2.9079323 6.049952 9.275183 -18.078873 6.2983274
-0.7500531 -2.725033 -7.6027865 3.3404543 2.990815
4.010979 11.000591 -2.8873312 7.1352735 -16.79663
18.495346 -14.293832 7.89578 2.2714825 22.976387
-4.875734 -3.0836344 -2.9999814 13.751918 6.448228
-11.924197 2.171869 2.0423572 -6.173772 10.778437
25.77281 -4.9495463 14.57806 0.3044315 2.6132357
-7.591999 -2.076944 9.025118 1.7834753 -3.1799617
-4.9401326 23.465864 5.1685796 -9.018578 9.037825
-4.4150195 6.859591 -12.274467 -0.88911164 5.186309
-3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652
-12.397416 -12.719869 -1.395601 2.1150916 5.7381287
-4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127
8.731719 -20.778936 -11.495662 5.8033476 -4.752041
10.833007 -6.717991 4.504732 13.4244375 1.1306485
7.3435574 1.400918 14.704036 -9.501399 7.2315617
-6.417456 1.3333273 11.872697 -0.30664724 8.8845
6.5569253 4.7948146 0.03662816 -8.704245 6.224871
-3.2701402 -11.508579 ]
```
- Python API
```python
import paddle
from paddlespeech.cli import VectorExecutor
vector_executor = VectorExecutor()
audio_emb = vector_executor(
model='ecapatdnn_voxceleb12',
sample_rate=16000,
config=None,
ckpt_path=None,
audio_file='./85236145389.wav',
force_yes=False,
device=paddle.get_device())
print('Audio embedding Result: \n{}'.format(audio_emb))
```
Output:
```bash
# Vector Result:
[ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268
-3.04878 1.611095 10.127234 -10.534177 -15.821609
1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228
-11.343508 2.3385992 -8.719341 14.213509 15.404744
-0.39327756 6.338786 2.688887 8.7104025 17.469526
-8.77959 7.0576906 4.648855 -1.3089896 -23.294737
8.013747 13.891729 -9.926753 5.655307 -5.9422326
-22.842539 0.6293588 -18.46266 -10.811862 9.8192625
3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942
1.7594414 -0.6485091 4.485623 2.0207152 7.264915
-6.40137 23.63524 2.9711294 -22.708025 9.93719
20.354511 -10.324688 -0.700492 -8.783211 -5.27593
15.999649 3.3004563 12.747926 15.429879 4.7849145
5.6699696 -2.3826702 10.605882 3.9112158 3.1500628
15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124
-9.224193 14.568347 -10.568833 4.982321 -4.342062
0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362
-6.680575 0.4757669 -5.035051 -6.7964664 16.865469
-11.54324 7.681869 0.44475392 9.708182 -8.932846
0.4123232 -4.361452 1.3948607 9.511665 0.11667654
2.9079323 6.049952 9.275183 -18.078873 6.2983274
-0.7500531 -2.725033 -7.6027865 3.3404543 2.990815
4.010979 11.000591 -2.8873312 7.1352735 -16.79663
18.495346 -14.293832 7.89578 2.2714825 22.976387
-4.875734 -3.0836344 -2.9999814 13.751918 6.448228
-11.924197 2.171869 2.0423572 -6.173772 10.778437
25.77281 -4.9495463 14.57806 0.3044315 2.6132357
-7.591999 -2.076944 9.025118 1.7834753 -3.1799617
-4.9401326 23.465864 5.1685796 -9.018578 9.037825
-4.4150195 6.859591 -12.274467 -0.88911164 5.186309
-3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652
-12.397416 -12.719869 -1.395601 2.1150916 5.7381287
-4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127
8.731719 -20.778936 -11.495662 5.8033476 -4.752041
10.833007 -6.717991 4.504732 13.4244375 1.1306485
7.3435574 1.400918 14.704036 -9.501399 7.2315617
-6.417456 1.3333273 11.872697 -0.30664724 8.8845
6.5569253 4.7948146 0.03662816 -8.704245 6.224871
-3.2701402 -11.508579 ]
```
### 4. Pretrained Models
Here is a list of pretrained models released by PaddleSpeech that can be used by the command line and Python API:
| Model | Sample Rate |
| :--- | :---: |
| ecapatdnn_voxceleb12 | 16k |
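The CLI stops at printing the raw embedding; turning two embeddings into a verification decision is left to the caller. A minimal cosine-similarity sketch (assuming `numpy`; the 0.7 threshold is illustrative, not an official operating point):
```python
import numpy as np
import paddle
from paddlespeech.cli import VectorExecutor

vector_executor = VectorExecutor()
emb1 = vector_executor(audio_file='./85236145389.wav', device=paddle.get_device())
emb2 = vector_executor(audio_file='./85236145389.wav', device=paddle.get_device())

# Cosine similarity in [-1, 1]; the same utterance scores ~1.0 against itself.
score = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
print(f"similarity: {score:.4f}")
print("same speaker" if score > 0.7 else "different speaker")  # illustrative threshold
```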

@ -0,0 +1,155 @@
(Simplified Chinese|[English](./README.md))
# Speaker Verification
## Introduction
Speaker verification is a technique that automatically extracts speaker characteristics from audio with a computer program.
This demo extracts a speaker embedding from a given audio file. It can be done with a single command or a few lines of Python using `PaddleSpeech`.
## Usage
### 1. Installation
See the [installation documentation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install_cn.md).
You can choose one way from easy, medium and hard to install paddlespeech.
### 2. Prepare Input File
The input of this demo should be a WAV file (`.wav`), and its sample rate must match the model's.
A sample audio file for this demo can be downloaded:
```bash
# The content of the audio is the digit string 85236145389
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
```
### 3. Usage
- Command Line (Recommended)
```bash
paddlespeech vector --task spk --input 85236145389.wav
echo -e "demo1 85236145389.wav" > vec.job
paddlespeech vector --task spk --input vec.job
echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk
```
Usage:
```bash
paddlespeech vector --help
```
Arguments:
- `input` (required): Audio file to extract the speaker embedding from.
- `model`: Model of the vector task. Default: `ecapatdnn_voxceleb12`.
- `sample_rate`: Sample rate of the audio. Default: `16000`.
- `config`: Config file of the vector task. Uses the default config of the pretrained model when not set. Default: `None`.
- `ckpt_path`: Model checkpoint file. Downloads and uses the pretrained model when not set. Default: `None`.
- `device`: Device to run the prediction on. Default: the default device of paddlepaddle in the current environment.
Output:
```bash
demo [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268
-3.04878 1.611095 10.127234 -10.534177 -15.821609
1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228
-11.343508 2.3385992 -8.719341 14.213509 15.404744
-0.39327756 6.338786 2.688887 8.7104025 17.469526
-8.77959 7.0576906 4.648855 -1.3089896 -23.294737
8.013747 13.891729 -9.926753 5.655307 -5.9422326
-22.842539 0.6293588 -18.46266 -10.811862 9.8192625
3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942
1.7594414 -0.6485091 4.485623 2.0207152 7.264915
-6.40137 23.63524 2.9711294 -22.708025 9.93719
20.354511 -10.324688 -0.700492 -8.783211 -5.27593
15.999649 3.3004563 12.747926 15.429879 4.7849145
5.6699696 -2.3826702 10.605882 3.9112158 3.1500628
15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124
-9.224193 14.568347 -10.568833 4.982321 -4.342062
0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362
-6.680575 0.4757669 -5.035051 -6.7964664 16.865469
-11.54324 7.681869 0.44475392 9.708182 -8.932846
0.4123232 -4.361452 1.3948607 9.511665 0.11667654
2.9079323 6.049952 9.275183 -18.078873 6.2983274
-0.7500531 -2.725033 -7.6027865 3.3404543 2.990815
4.010979 11.000591 -2.8873312 7.1352735 -16.79663
18.495346 -14.293832 7.89578 2.2714825 22.976387
-4.875734 -3.0836344 -2.9999814 13.751918 6.448228
-11.924197 2.171869 2.0423572 -6.173772 10.778437
25.77281 -4.9495463 14.57806 0.3044315 2.6132357
-7.591999 -2.076944 9.025118 1.7834753 -3.1799617
-4.9401326 23.465864 5.1685796 -9.018578 9.037825
-4.4150195 6.859591 -12.274467 -0.88911164 5.186309
-3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652
-12.397416 -12.719869 -1.395601 2.1150916 5.7381287
-4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127
8.731719 -20.778936 -11.495662 5.8033476 -4.752041
10.833007 -6.717991 4.504732 13.4244375 1.1306485
7.3435574 1.400918 14.704036 -9.501399 7.2315617
-6.417456 1.3333273 11.872697 -0.30664724 8.8845
6.5569253 4.7948146 0.03662816 -8.704245 6.224871
-3.2701402 -11.508579 ]
```
- Python API
```python
import paddle
from paddlespeech.cli import VectorExecutor
vector_executor = VectorExecutor()
audio_emb = vector_executor(
model='ecapatdnn_voxceleb12',
sample_rate=16000,
config=None, # Set `config` and `ckpt_path` to None to use pretrained model.
ckpt_path=None,
audio_file='./85236145389.wav',
force_yes=False,
device=paddle.get_device())
print('Audio embedding Result: \n{}'.format(audio_emb))
```
Output:
```bash
# Vector Result:
[ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268
-3.04878 1.611095 10.127234 -10.534177 -15.821609
1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228
-11.343508 2.3385992 -8.719341 14.213509 15.404744
-0.39327756 6.338786 2.688887 8.7104025 17.469526
-8.77959 7.0576906 4.648855 -1.3089896 -23.294737
8.013747 13.891729 -9.926753 5.655307 -5.9422326
-22.842539 0.6293588 -18.46266 -10.811862 9.8192625
3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942
1.7594414 -0.6485091 4.485623 2.0207152 7.264915
-6.40137 23.63524 2.9711294 -22.708025 9.93719
20.354511 -10.324688 -0.700492 -8.783211 -5.27593
15.999649 3.3004563 12.747926 15.429879 4.7849145
5.6699696 -2.3826702 10.605882 3.9112158 3.1500628
15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124
-9.224193 14.568347 -10.568833 4.982321 -4.342062
0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362
-6.680575 0.4757669 -5.035051 -6.7964664 16.865469
-11.54324 7.681869 0.44475392 9.708182 -8.932846
0.4123232 -4.361452 1.3948607 9.511665 0.11667654
2.9079323 6.049952 9.275183 -18.078873 6.2983274
-0.7500531 -2.725033 -7.6027865 3.3404543 2.990815
4.010979 11.000591 -2.8873312 7.1352735 -16.79663
18.495346 -14.293832 7.89578 2.2714825 22.976387
-4.875734 -3.0836344 -2.9999814 13.751918 6.448228
-11.924197 2.171869 2.0423572 -6.173772 10.778437
25.77281 -4.9495463 14.57806 0.3044315 2.6132357
-7.591999 -2.076944 9.025118 1.7834753 -3.1799617
-4.9401326 23.465864 5.1685796 -9.018578 9.037825
-4.4150195 6.859591 -12.274467 -0.88911164 5.186309
-3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652
-12.397416 -12.719869 -1.395601 2.1150916 5.7381287
-4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127
8.731719 -20.778936 -11.495662 5.8033476 -4.752041
10.833007 -6.717991 4.504732 13.4244375 1.1306485
7.3435574 1.400918 14.704036 -9.501399 7.2315617
-6.417456 1.3333273 11.872697 -0.30664724 8.8845
6.5569253 4.7948146 0.03662816 -8.704245 6.224871
-3.2701402 -11.508579 ]
```
### 4. Pretrained Models
Here is a list of pretrained models released by PaddleSpeech that can be used by the command line and Python API:
| Model | Sample Rate |
| :--- | :---: |
| ecapatdnn_voxceleb12 | 16k |

@ -0,0 +1,6 @@
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
# vector (speaker verification)
paddlespeech vector --task spk --input ./85236145389.wav

@ -84,5 +84,8 @@ Here is a list of pretrained models released by PaddleSpeech that can be used by
| Model | Language | Sample Rate |
| :--- | :---: | :---: |
| conformer_wenetspeech| zh| 16000
| transformer_librispeech| en| 16000
| conformer_wenetspeech| zh| 16k
| transformer_librispeech| en| 16k
| deepspeech2offline_aishell| zh| 16k
| deepspeech2online_aishell | zh | 16k
|deepspeech2offline_librispeech|en| 16k

@ -81,5 +81,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
| Model | Language | Sample Rate |
| :--- | :---: | :---: |
| conformer_wenetspeech| zh| 16000
| transformer_librispeech| en| 16000
| conformer_wenetspeech | zh | 16k
| transformer_librispeech | en | 16k
| deepspeech2offline_aishell| zh| 16k
| deepspeech2online_aishell | zh | 16k
| deepspeech2offline_librispeech | en | 16k

@ -15,8 +15,8 @@ You can choose one way from meduim and hard to install paddlespeech.
### 2. Prepare config File
The configuration file can be found in `conf/application.yaml`.
Among them, `engine_list` indicates the speech engine that will be included in the service to be started, in the format of <speech task>_<engine type>.
At present, the speech tasks integrated by the service include: asr (speech recognition) and tts (speech synthesis).
Among them, `engine_list` indicates the speech engine that will be included in the service to be started, in the format of `<speech task>_<engine type>`.
At present, the speech tasks integrated by the service include: asr (speech recognition), tts (text to speech) and cls (audio classification).
Currently the engine type supports two forms: python and inference (Paddle Inference).
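Illustrative only: splitting an `engine_list` entry of the form `<speech task>_<engine type>` on the first underscore recovers the two parts.
```python
for entry in ['asr_python', 'tts_inference', 'cls_python']:
    task, engine_type = entry.split('_', 1)  # split on the first underscore only
    print(task, engine_type)
# asr python
# tts inference
# cls python
```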
@ -110,21 +110,22 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import ASRClientExecutor
import json
asrclient_executor = ASRClientExecutor()
asrclient_executor(
res = asrclient_executor(
input="./zh.wav",
server_ip="127.0.0.1",
port=8090,
sample_rate=16000,
lang="zh_cn",
audio_format="wav")
print(res.json())
```
Output:
```bash
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}}
time cost 0.604353 s.
```
### 5. TTS Client Usage
@ -146,7 +147,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `speed`: Audio speed, the value should be set between 0 and 3. Default: 1.0
- `volume`: Audio volume, the value should be set between 0 and 3. Default: 1.0
- `sample_rate`: Sampling rate, choice: [0, 8000, 16000], the default is the same as the model. Default: 0
- `output`: Output wave filepath. Default: `output.wav`.
- `output`: Output wave filepath. Default: None, which means not to save the audio to the local.
Output:
```bash
@ -160,9 +161,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import TTSClientExecutor
import json
ttsclient_executor = TTSClientExecutor()
ttsclient_executor(
res = ttsclient_executor(
input="您好,欢迎使用百度飞桨语音合成服务。",
server_ip="127.0.0.1",
port=8090,
@ -171,6 +173,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
volume=1.0,
sample_rate=0,
output="./output.wav")
response_dict = res.json()
print(response_dict["message"])
print("Save synthesized audio successfully on %s." % (response_dict['result']['save_path']))
print("Audio duration: %f s." %(response_dict['result']['duration']))
```
Output:
@ -178,7 +185,52 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
{'description': 'success.'}
Save synthesized audio successfully on ./output.wav.
Audio duration: 3.612500 s.
Response time: 0.388317 s.
```
### 6. CLS Client Usage
**Note:** The response time will be slightly longer when using the client for the first time
- Command Line (Recommended)
```
paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
```
Usage:
```bash
paddlespeech_client cls --help
```
Arguments:
- `server_ip`: server ip. Default: 127.0.0.1
- `port`: server port. Default: 8090
- `input`(required): Audio file to be classified.
- `topk`: topk scores of classification result.
Output:
```bash
[2022-03-09 20:44:39,974] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
[2022-03-09 20:44:39,975] [ INFO] - Response time 0.104360 s.
```
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import CLSClientExecutor
import json
clsclient_executor = CLSClientExecutor()
res = clsclient_executor(
input="./zh.wav",
server_ip="127.0.0.1",
port=8090,
topk=1)
print(res.json())
```
Output:
```bash
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
```
@ -189,3 +241,6 @@ Get all models supported by the ASR service via `paddlespeech_server stats --tas
### TTS model
Get all models supported by the TTS service via `paddlespeech_server stats --task tts`, where static models can be used for inference with Paddle Inference.
### CLS model
Get all models supported by the CLS service via `paddlespeech_server stats --task cls`, where static models can be used for inference with Paddle Inference.

@ -17,7 +17,7 @@
### 2. Prepare the Config File
The configuration file can be found in `conf/application.yaml`.
Among them, `engine_list` indicates the speech engines that will be included in the service to be started, in the format of <speech task>_<engine type>.
At present, the speech tasks integrated by the service include: asr (speech recognition) and tts (speech synthesis).
At present, the speech tasks integrated by the service include: asr (speech recognition), tts (speech synthesis) and cls (audio classification).
Currently the engine type supports two forms: python and inference (Paddle Inference).
@ -80,7 +80,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
### 4. ASR Client Usage
**Note:** The response time will be slightly longer when using the client for the first time
- Command Line (Recommended)
```
@ -111,25 +111,26 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import ASRClientExecutor
import json
asrclient_executor = ASRClientExecutor()
asrclient_executor(
res = asrclient_executor(
input="./zh.wav",
server_ip="127.0.0.1",
port=8090,
sample_rate=16000,
lang="zh_cn",
audio_format="wav")
print(res.json())
```
Output:
```bash
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}}
time cost 0.604353 s.
```
### 5. TTS Client Usage
**Note:** The response time will be slightly longer when using the client for the first time
- Command Line (Recommended)
@ -150,7 +151,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `speed`: Audio speed, the value should be set between 0 and 3. Default: 1.0
- `volume`: Audio volume, the value should be set between 0 and 3. Default: 1.0
- `sample_rate`: Sampling rate, choice: [0, 8000, 16000], the default is the same as the model. Default: 0
- `output`: Output wave filepath. Default: `output.wav`.
- `output`: Output wave filepath. Default: None, which means not to save the audio to the local.
Output:
```bash
@ -163,9 +164,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import TTSClientExecutor
import json
ttsclient_executor = TTSClientExecutor()
ttsclient_executor(
res = ttsclient_executor(
input="您好,欢迎使用百度飞桨语音合成服务。",
server_ip="127.0.0.1",
port=8090,
@ -174,6 +176,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
volume=1.0,
sample_rate=0,
output="./output.wav")
response_dict = res.json()
print(response_dict["message"])
print("Save synthesized audio successfully on %s." % (response_dict['result']['save_path']))
print("Audio duration: %f s." %(response_dict['result']['duration']))
```
Output:
@ -181,13 +188,63 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
{'description': 'success.'}
Save synthesized audio successfully on ./output.wav.
Audio duration: 3.612500 s.
Response time: 0.388317 s.
```
### 6. CLS Client Usage
**Note:** The response time will be slightly longer when using the client for the first time
- Command Line (Recommended)
```
paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
```
Usage:
```bash
paddlespeech_client cls --help
```
Arguments:
- `server_ip`: server ip. Default: 127.0.0.1
- `port`: server port. Default: 8090
- `input`(required): Audio file to be classified.
- `topk`: topk scores of classification result.
Output:
```bash
[2022-03-09 20:44:39,974] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
[2022-03-09 20:44:39,975] [ INFO] - Response time 0.104360 s.
```
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import CLSClientExecutor
import json
clsclient_executor = CLSClientExecutor()
res = clsclient_executor(
input="./zh.wav",
server_ip="127.0.0.1",
port=8090,
topk=1)
print(res.json())
```
Output:
```bash
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
```
## Models supported by the service
### ASR model
Get all models supported by the ASR service via `paddlespeech_server stats --task asr`, where static models can be used for inference with Paddle Inference.
### TTS model
Get all models supported by the TTS service via `paddlespeech_server stats --task tts`, where static models can be used for inference with Paddle Inference.
### CLS model
Get all models supported by the CLS service via `paddlespeech_server stats --task cls`, where static models can be used for inference with Paddle Inference.

@ -0,0 +1,4 @@
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav --topk 1

@ -9,12 +9,14 @@ port: 8090
# The task format in the engine_list is: <speech task>_<engine type>
# task choices = ['asr_python', 'asr_inference', 'tts_python', 'tts_inference']
engine_list: ['asr_python', 'tts_python']
engine_list: ['asr_python', 'tts_python', 'cls_python']
#################################################################################
# ENGINE CONFIG #
#################################################################################
################################### ASR #########################################
################### speech task: asr; engine_type: python #######################
asr_python:
model: 'conformer_wenetspeech'
@ -46,6 +48,7 @@ asr_inference:
summary: True # False -> do not show predictor config
################################### TTS #########################################
################### speech task: tts; engine_type: python #######################
tts_python:
# am (acoustic model) choices=['speedyspeech_csmsc', 'fastspeech2_csmsc',
@ -105,3 +108,30 @@ tts_inference:
# others
lang: 'zh'
################################### CLS #########################################
################### speech task: cls; engine_type: python #######################
cls_python:
# model choices=['panns_cnn14', 'panns_cnn10', 'panns_cnn6']
model: 'panns_cnn14'
cfg_path: # [optional] Config of cls task.
ckpt_path: # [optional] Checkpoint file of model.
label_file: # [optional] Label file of cls task.
device: # set 'gpu:id' or 'cpu'
################### speech task: cls; engine_type: inference #######################
cls_inference:
# model_type choices=['panns_cnn14', 'panns_cnn10', 'panns_cnn6']
model_type: 'panns_cnn14'
cfg_path:
model_path: # the pdmodel file of am static model [optional]
params_path: # the pdiparams file of am static model [optional]
label_file: # [optional] Label file of cls task.
predictor_conf:
device: # set 'gpu:id' or 'cpu'
switch_ir_optim: True
glog_info: False # True -> print glog
summary: True # False -> do not show predictor config

@ -35,3 +35,7 @@ We borrowed a lot of code from these repos to build `model` and `engine`, thanks
* [librosa](https://github.com/librosa/librosa/blob/main/LICENSE.md)
- ISC License
- Audio feature
* [ThreadPool](https://github.com/progschj/ThreadPool/blob/master/COPYING)
- zlib License
- ThreadPool

@ -8,7 +8,8 @@ Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER |
:-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----:
[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0)
[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0)
[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 284 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.056 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)
[Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0565 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1)
[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0483 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)
[Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 |-| 151 h | [Transformer Aishell ASR1](../../examples/aishell/asr1)
[Ds2 Offline Librispeech ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr0/asr0_deepspeech2_librispeech_ckpt_0.1.1.model.tar.gz)| Librispeech Dataset | Char-based | 518 MB | 2 Conv + 3 bidirectional LSTM layers| - |0.0725| 960 h | [Ds2 Offline Librispeech ASR0](../../examples/librispeech/asr0)
[Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0337 | 960 h | [Conformer Librispeech ASR1](../../examples/librispeech/asr1)
@ -49,17 +50,20 @@ Model Type | Dataset| Example Link | Pretrained Models| Static Models|Size (stat
WaveFlow| LJSpeech |[waveflow-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0)|[waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip)|||
Parallel WaveGAN| CSMSC |[PWGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1)|[pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip)|[pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip)|5.1MB|
Parallel WaveGAN| LJSpeech |[PWGAN-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc1)|[pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)|||
Parallel WaveGAN|AISHELL-3 |[PWGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1)|[pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip)|||
Parallel WaveGAN| AISHELL-3 |[PWGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1)|[pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip)|||
Parallel WaveGAN| VCTK |[PWGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc1)|[pwg_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.5.zip)|||
|Multi Band MelGAN | CSMSC |[MB MelGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc3) | [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip) <br>[mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip)|[mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip) |8.2MB|
Style MelGAN | CSMSC |[Style MelGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc4)|[style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip)| | |
HiFiGAN | CSMSC |[HiFiGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc5)|[hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)|[hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip)|50MB|
HiFiGAN | LJSpeech |[HiFiGAN-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc5)|[hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)|||
HiFiGAN | AISHELL-3 |[HiFiGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5)|[hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip)|||
HiFiGAN | VCTK |[HiFiGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc5)|[hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip)|||
WaveRNN | CSMSC |[WaveRNN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc6)|[wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip)|[wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip)|18MB|
### Voice Cloning
Model Type | Dataset| Example Link | Pretrained Models
:-------------:| :------------:| :-----: | :-----:
:-------------:| :------------:| :-----: | :-----: |
GE2E| AISHELL-3, etc. |[ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e)|[ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ge2e/ge2e_ckpt_0.3.zip)
GE2E + Tactron2| AISHELL-3 |[ge2e-tactron2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vc0)|[tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip)
GE2E + FastSpeech2 | AISHELL-3 |[ge2e-fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vc1)|[fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)
@ -67,11 +71,17 @@ GE2E + FastSpeech2 | AISHELL-3 |[ge2e-fastspeech2-aishell3](https://github.com/
## Audio Classification Models
Model Type | Dataset| Example Link | Pretrained Models
:-------------:| :------------:| :-----: | :-----:
PANN | Audioset| [audioset_tagging_cnn](https://github.com/qiuqiangkong/audioset_tagging_cnn) | [panns_cnn6.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn6.pdparams), [panns_cnn10.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn10.pdparams), [panns_cnn14.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn14.pdparams)
Model Type | Dataset| Example Link | Pretrained Models | Static Models
:-------------:| :------------:| :-----: | :-----: | :-----:
PANN | Audioset| [audioset_tagging_cnn](https://github.com/qiuqiangkong/audioset_tagging_cnn) | [panns_cnn6.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn6.pdparams), [panns_cnn10.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn10.pdparams), [panns_cnn14.pdparams](https://bj.bcebos.com/paddleaudio/models/panns_cnn14.pdparams) | [panns_cnn6_static.tar.gz](https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn6_static.tar.gz)(18M), [panns_cnn10_static.tar.gz](https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn10_static.tar.gz)(19M), [panns_cnn14_static.tar.gz](https://paddlespeech.bj.bcebos.com/cls/inference_model/panns_cnn14_static.tar.gz)(289M)
PANN | ESC-50 |[pann-esc50](../../examples/esc50/cls0)|[esc50_cnn6.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn6.tar.gz), [esc50_cnn10.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn10.tar.gz), [esc50_cnn14.tar.gz](https://paddlespeech.bj.bcebos.com/cls/esc50/esc50_cnn14.tar.gz)
## Speaker Verification Models
Model Type | Dataset| Example Link | Pretrained Models | Static Models
:-------------:| :------------:| :-----: | :-----: | :-----:
ECAPA-TDNN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz) | -
## Punctuation Restoration Models
Model Type | Dataset| Example Link | Pretrained Models
:-------------:| :------------:| :-----: | :-----:

@ -168,30 +168,7 @@ bash local/data.sh --stage -1 --stop_stage -1
bash local/data.sh --stage 2 --stop_stage 2
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/transformer.yaml exp/transformer/checkpoints/avg_20
```
The performance of the released models are shown below:
### Conformer
| Model | Params | Config | Augmentation | Test set | Decode method | Loss | CER |
| --------- | ------ | ------------------- | ---------------- | -------- | ---------------------- | ---- | -------- |
| conformer | 47.07M | conf/conformer.yaml | spec_aug + shift | test | attention | - | 0.059858 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug + shift | test | ctc_greedy_search | - | 0.062311 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug + shift | test | ctc_prefix_beam_search | - | 0.062196 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug + shift | test | attention_rescoring | - | 0.054694 |
### Chunk Conformer
Need set `decoding.decoding_chunk_size=16` when decoding.
| Model | Params | Config | Augmentation | Test set | Decode method | Chunk Size & Left Chunks | Loss | CER |
| --------- | ------ | ------------------------- | ---------------- | -------- | ---------------------- | ------------------------ | ---- | -------- |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug + shift | test | attention | 16, -1 | - | 0.061939 |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug + shift | test | ctc_greedy_search | 16, -1 | - | 0.070806 |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug + shift | test | ctc_prefix_beam_search | 16, -1 | - | 0.070739 |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug + shift | test | attention_rescoring | 16, -1 | - | 0.059400 |
### Transformer
| Model | Params | Config | Augmentation | Test set | Decode method | Loss | CER |
| ----------- | ------ | --------------------- | ------------ | -------- | ---------------------- | ----------------- | -------- |
| transformer | 31.95M | conf/transformer.yaml | spec_aug | test | attention | 3.858648955821991 | 0.057293 |
| transformer | 31.95M | conf/transformer.yaml | spec_aug | test | ctc_greedy_search | 3.858648955821991 | 0.061837 |
| transformer | 31.95M | conf/transformer.yaml | spec_aug | test | ctc_prefix_beam_search | 3.858648955821991 | 0.061685 |
| transformer | 31.95M | conf/transformer.yaml | spec_aug | test | attention_rescoring | 3.858648955821991 | 0.053844 |
[The performance of the released models](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/aishell/asr1/RESULTS.md)
## Stage 4: CTC Alignment
If you want to get the alignment between the audio and the text, you can use the ctc alignment. The code of this stage is shown below:
```bash

@ -1,24 +1,27 @@
# Aishell
## Conformer
| Model | Params | Config | Augmentation| Test set | Decode method | Loss | CER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| conformer | 47.07M | conf/conformer.yaml | spec_aug + shift | test | attention | - | 0.059858 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug + shift | test | ctc_greedy_search | - | 0.062311 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug + shift | test | ctc_prefix_beam_search | - | 0.062196 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug + shift | test | attention_rescoring | - | 0.054694 |
paddle version: 2.2.2
paddlespeech version: 0.1.2
| Model | Params | Config | Augmentation| Test set | Decode method | Loss | CER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| conformer | 47.07M | conf/conformer.yaml | spec_aug | test | attention | - | 0.0548 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug | test | ctc_greedy_search | - | 0.05127 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug| test | ctc_prefix_beam_search | - | 0.05131 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug | test | attention_rescoring | - | 0.04829 |
## Chunk Conformer
paddle version: 2.2.2
paddlespeech version: 0.1.2
You need to set `decoding.decoding_chunk_size=16` when decoding; see the sketch after the table below.
| Model | Params | Config | Augmentation| Test set | Decode method | Chunk Size & Left Chunks | Loss | CER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug + shift | test | attention | 16, -1 | - | 0.061939 |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug + shift | test | ctc_greedy_search | 16, -1 | - | 0.070806 |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug + shift | test | ctc_prefix_beam_search | 16, -1 | - | 0.070739 |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug + shift | test | attention_rescoring | 16, -1 | - | 0.059400 |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug | test | attention | 16, -1 | - | 0.0573884 |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug | test | ctc_greedy_search | 16, -1 | - | 0.06599091 |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug | test | ctc_prefix_beam_search | 16, -1 | - | 0.065991 |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug | test | attention_rescoring | 16, -1 | - | 0.056502 |
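Exactly where this option lives depends on your config layout; as a minimal sketch, assuming the decode options sit in a YAML file with a `decoding_chunk_size` key (the path below is illustrative, not a fixed part of this repo's layout):
```bash
# Illustrative only: point this at whichever YAML file defines your decode options.
sed -i 's/decoding_chunk_size: .*/decoding_chunk_size: 16/' conf/tuning/decode.yaml
```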
## Transformer

@ -39,6 +39,7 @@ model_conf:
ctc_weight: 0.3
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: false
init_type: 'kaiming_uniform'
###########################################
# Data #
@ -61,28 +62,29 @@ feat_dim: 80
stride_ms: 10.0
window_ms: 25.0
sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 64
batch_size: 32
maxlen_in: 512 # if input length > maxlen-in, batchsize is automatically reduced
maxlen_out: 150 # if output length > maxlen-out, batchsize is automatically reduced
minibatches: 0 # for debug
batch_count: auto
batch_bins: 0
batch_bins: 0
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
num_workers: 0
num_workers: 2
subsampling_factor: 1
num_encs: 1
###########################################
# Training #
###########################################
n_epoch: 240
accum_grad: 2
n_epoch: 180
accum_grad: 1
global_grad_clip: 5.0
dist_sampler: True
optim: adam
optim_conf:
lr: 0.002
lr: 0.001
weight_decay: 1.0e-6
scheduler: warmuplr
scheduler_conf:
@ -92,4 +94,3 @@ log_interval: 100
checkpoint:
kbest_n: 50
latest_n: 5

@ -37,6 +37,7 @@ model_conf:
ctc_weight: 0.3
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: false
init_type: 'kaiming_uniform'
###########################################
# Data #
@ -75,6 +76,7 @@ num_encs: 1
n_epoch: 240
accum_grad: 2
global_grad_clip: 5.0
dist_sampler: True
optim: adam
optim_conf:
lr: 0.002

@ -23,7 +23,3 @@ process:
n_mask: 2
inplace: true
replace_with_zero: false

@ -61,16 +61,17 @@ batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
preprocess_config: conf/preprocess.yaml
num_workers: 0
num_workers: 2
subsampling_factor: 1
num_encs: 1
###########################################
# Training #
###########################################
n_epoch: 240
n_epoch: 30
accum_grad: 2
global_grad_clip: 5.0
dist_sampler: False
optim: adam
optim_conf:
lr: 0.002

@ -4,18 +4,44 @@ config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_aishell3 \
--voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
fi

@ -4,21 +4,50 @@ config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "in hifigan syn_e2e"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=fastspeech2_nosil_aishell3_ckpt_0.4/speech_stats.npy \
--voc=hifigan_aishell3 \
--voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=fastspeech2_nosil_aishell3_ckpt_0.4/phone_id_map.txt \
--speaker_dict=fastspeech2_nosil_aishell3_ckpt_0.4/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
fi

@ -1,6 +1,6 @@
#!/bin/bash
stage=3
stage=0
stop_stage=100
config_path=$1

@ -3,7 +3,7 @@
set -e
source path.sh
gpus=0
gpus=0,1
stage=0
stop_stage=100

@ -0,0 +1,156 @@
# HiFiGAN with AISHELL-3
This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with [AISHELL-3](http://www.aishelltech.com/aishell_3).
AISHELL-3 is a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems.
## Dataset
### Download and Extract
Download AISHELL-3.
```bash
wget https://www.openslr.org/resources/93/data_aishell3.tgz
```
Extract AISHELL-3.
```bash
mkdir data_aishell3
tar zxvf data_aishell3.tgz -C data_aishell3
```
### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download it from here: [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which still uses MFA1.x) in our repo.
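A minimal sketch of fetching and unpacking the alignment archive from the link above (assuming it extracts to `./aishell3_alignment_tone`, which is the path the steps below expect):
```bash
# Download and extract the precomputed MFA alignments for AISHELL-3.
wget https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz
tar zxvf aishell3_alignment_tone.tar.gz
```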
## Get Started
Assume the path to the dataset is `~/datasets/data_aishell3`.
Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance.
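If you want a quick look at what preprocessing produced, the sketch below prints one record; the exact field names depend on the preprocessing script:
```bash
# Show the first record of the normalized training metadata.
head -n 1 dump/train/norm/metadata.jsonl
```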
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file to overwrite default config.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
benchmark:
arguments related to benchmark.
--batch-size BATCH_SIZE
batch size.
--max-iter MAX_ITER train max steps.
--run-benchmark RUN_BENCHMARK
running benchmark or not, if True, use the --batch-size
and --max-iter.
--profiler_options PROFILER_OPTIONS
The option of profiler, which should be in format
"key1=value1;key2=value2;key3=value3".
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
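Putting the options above together, a direct invocation roughly equivalent to what `./local/train.sh` wraps might look like the sketch below (the output path is illustrative):
```bash
FLAGS_cudnn_exhaustive_search=true \
python3 ${BIN_DIR}/train.py \
    --config=conf/default.yaml \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --output-dir=exp/default \
    --ngpu=1
```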
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--output-dir OUTPUT_DIR] [--ngpu NGPU]
Synthesize with GANVocoder.
optional arguments:
-h, --help show this help message and exit
--generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT
snapshot to load.
--test-metadata TEST_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
```
1. `--config` is the config file. You should use the same config with which the model is trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
The pretrained model can be downloaded here [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip).
Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
:-------------:| :------------:| :-----: | :-----: | :--------:
default| 1(gpu) x 2500000|24.060|0.1068|7.499
The HiFiGAN checkpoint contains the files listed below.
```text
hifigan_aishell3_ckpt_0.2.0
├── default.yaml # default config used to train hifigan
├── feats_stats.npy # statistics used to normalize spectrogram when training hifigan
└── snapshot_iter_2500000.pdz # generator parameters of hifigan
```
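A sketch of fetching and unpacking this checkpoint with the link above:
```bash
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip
unzip hifigan_aishell3_ckpt_0.2.0.zip
```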
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.

@ -0,0 +1,168 @@
# This is the configuration file for AISHELL-3 dataset.
# This configuration is based on HiFiGAN V1, which is
# an official configuration. But I found that the optimizer
# setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales
# is also modified from the original 256 shift setting.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [5, 5, 4, 3] # Upsampling scales.
upsample_kernel_sizes: [10, 10, 8, 6] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: True # Whether to use additional conv layer in residual blocks.
bias: True # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: True
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: True # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: True # Whether to use bias parameter in conv layer.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
use_spectral_norm: False # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: False # Whether to use multi-resolution STFT loss.
use_mel_loss: True # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 24000
fft_size: 2048
hop_size: 300
win_length: 1200
window: "hann"
num_mels: 80
fmin: 0
fmax: 12000
log_base: null
generator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
use_feat_match_loss: True
feat_match_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
average_by_layers: False # Whether to average loss by #layers in each discriminator.
include_final_outputs: False # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure divisible by hop_size.
num_workers: 2 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random

@ -0,0 +1,55 @@
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./aishell3_alignment_tone \
--output=durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/../preprocess.py \
--rootdir=~/datasets/data_aishell3/ \
--dataset=aishell3 \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--cut-sil=True \
--num-cpu=20
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
fi

@ -0,0 +1,14 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \
--generator-type=hifigan

@ -0,0 +1,13 @@
#!/bin/bash
config_path=$1
train_output_path=$2
FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1

@ -0,0 +1,13 @@
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=hifigan
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/gan_vocoder/${MODEL}

@ -0,0 +1,32 @@
#!/bin/bash
set -e
source path.sh
gpus=0
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_5000.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

@ -18,18 +18,17 @@ Download: http://groups.inf.ed.ac.uk/ami/download/
Prepares metadata files (JSON) from manual annotations "segments/" using RTTM format (Oracle VAD).
"""
import argparse
import glob
import json
import logging
import os
import xml.etree.ElementTree as et
from distutils.util import strtobool
from ami_splits import get_AMI_split
from dataio import load_pkl
from dataio import save_pkl
from distutils.util import strtobool
logger = logging.getLogger(__name__)
SAMPLERATE = 16000

@ -7,7 +7,7 @@ ckpt_name=$3
stage=0
stop_stage=0
# TODO: the dygraph-to-static (static graph) output of tacotron2 is not as loud as the dynamic-graph one; some function is probably not aligned between dynamic and static modes during decode
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \

@ -14,7 +14,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--am_stat=dump/train/feats_stats.npy \
--voc=pwgan_csmsc \
--voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
@ -34,7 +34,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--am_stat=dump/train/feats_stats.npy \
--voc=mb_melgan_csmsc \
--voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz\
@ -53,7 +53,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--am_stat=dump/train/feats_stats.npy \
--voc=style_melgan_csmsc \
--voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \
@ -73,7 +73,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--am_stat=dump/train/feats_stats.npy \
--voc=hifigan_csmsc \
--voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \
@ -93,7 +93,7 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--am_stat=dump/train/feats_stats.npy \
--voc=wavernn_csmsc \
--voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
--voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \

@ -226,8 +226,11 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path}
Pretrained FastSpeech2 models with no silence at the edges of the audio:
- [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)
- [fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)
- [fastspeech2_cnndecoder_csmsc_ckpt_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_ckpt_1.0.0.zip)
The static model can be downloaded here [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip).
The static models can be downloaded here:
- [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip)
- [fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:

@ -0,0 +1,107 @@
# use CNN decoder
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
# Only used for feats_type != raw
fmin: 80 # Minimum frequency of Mel basis.
fmax: 7600 # Maximum frequency of Mel basis.
n_mels: 80 # The number of mel basis.
# Only used for the model using pitch features (e.g. FastSpeech2)
f0min: 80 # Minimum f0 for pitch extraction.
f0max: 400 # Maximum f0 for pitch extraction.
###########################################################
# DATA SETTING #
###########################################################
batch_size: 64
num_workers: 4
###########################################################
# MODEL SETTING #
###########################################################
model:
adim: 384 # attention dimension
aheads: 2 # number of attention heads
elayers: 4 # number of encoder layers
eunits: 1536 # number of encoder ff units
dlayers: 4 # number of decoder layers
dunits: 1536 # number of decoder ff units
positionwise_layer_type: conv1d # type of position-wise layer
positionwise_conv_kernel_size: 3 # kernel size of position wise conv layer
duration_predictor_layers: 2 # number of layers of duration predictor
duration_predictor_chans: 256 # number of channels of duration predictor
duration_predictor_kernel_size: 3 # filter size of duration predictor
postnet_layers: 5 # number of layers of postnet
postnet_filts: 5 # filter size of conv layers in postnet
postnet_chans: 256 # number of channels of conv layers in postnet
use_scaled_pos_enc: True # whether to use scaled positional encoding
encoder_normalize_before: True # whether to perform layer normalization before the input
decoder_normalize_before: True # whether to perform layer normalization before the input
reduction_factor: 1 # reduction factor
encoder_type: transformer # encoder type
decoder_type: cnndecoder # decoder type
init_type: xavier_uniform # initialization type
init_enc_alpha: 1.0 # initial value of alpha of encoder scaled position encoding
init_dec_alpha: 1.0 # initial value of alpha of decoder scaled position encoding
transformer_enc_dropout_rate: 0.2 # dropout rate for transformer encoder layer
transformer_enc_positional_dropout_rate: 0.2 # dropout rate for transformer encoder positional encoding
transformer_enc_attn_dropout_rate: 0.2 # dropout rate for transformer encoder attention layer
cnn_dec_dropout_rate: 0.2 # dropout rate for cnn decoder layer
cnn_postnet_dropout_rate: 0.2
cnn_postnet_resblock_kernel_sizes: [256, 256] # kernel sizes for residual block of cnn_postnet
cnn_postnet_kernel_size: 5 # kernel size of cnn_postnet
cnn_decoder_embedding_dim: 256
pitch_predictor_layers: 5 # number of conv layers in pitch predictor
pitch_predictor_chans: 256 # number of channels of conv layers in pitch predictor
pitch_predictor_kernel_size: 5 # kernel size of conv layers in pitch predictor
pitch_predictor_dropout: 0.5 # dropout rate in pitch predictor
pitch_embed_kernel_size: 1 # kernel size of conv embedding layer for pitch
pitch_embed_dropout: 0.0 # dropout rate after conv embedding layer for pitch
stop_gradient_from_pitch_predictor: True # whether to stop the gradient from pitch predictor to encoder
energy_predictor_layers: 2 # number of conv layers in energy predictor
energy_predictor_chans: 256 # number of channels of conv layers in energy predictor
energy_predictor_kernel_size: 3 # kernel size of conv layers in energy predictor
energy_predictor_dropout: 0.5 # dropout rate in energy predictor
energy_embed_kernel_size: 1 # kernel size of conv embedding layer for energy
energy_embed_dropout: 0.0 # dropout rate after conv embedding layer for energy
stop_gradient_from_energy_predictor: False # whether to stop the gradient from energy predictor to encoder
###########################################################
# UPDATER SETTING #
###########################################################
updater:
use_masking: True # whether to apply masking for padded part in loss calculation
###########################################################
# OPTIMIZER SETTING #
###########################################################
optimizer:
optim: adam # optimizer type
learning_rate: 0.001 # learning rate
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 1000
num_snapshots: 5
###########################################################
# OTHER SETTING #
###########################################################
seed: 10086

@ -0,0 +1,92 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_streaming.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_csmsc \
--voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e_streaming \
--phones_dict=dump/phone_id_map.txt \
--am_streaming=True
fi
# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_streaming.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=mb_melgan_csmsc \
--voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz\
--voc_stat=mb_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e_streaming \
--phones_dict=dump/phone_id_map.txt \
--am_streaming=True
fi
# the pretrained models have not been released yet
# style melgan
# style melgan's dygraph-to-static conversion is not ready yet
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_streaming.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=style_melgan_csmsc \
--voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=style_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e_streaming \
--phones_dict=dump/phone_id_map.txt \
--am_streaming=True
fi
# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
echo "in hifigan syn_e2e"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_streaming.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_csmsc \
--voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_csmsc_ckpt_0.1.1/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e_streaming \
--phones_dict=dump/phone_id_map.txt \
--am_streaming=True
fi

@ -0,0 +1,48 @@
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
conf_path=conf/cnndecoder.yaml
train_output_path=exp/cnndecoder
ckpt_name=snapshot_iter_153.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize, vocoder is pwgan
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# synthesize_e2e, vocoder is pwgan
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# inference with static model
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
fi
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# synthesize_streaming, vocoder is pwgan
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_streaming.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

@ -4,7 +4,7 @@
For sound classification tasks, a common traditional machine-learning approach is to first hand-craft a variety of time-domain and frequency-domain audio features, apply feature selection, combination, and transformation, and then classify with an SVM or a decision tree. End-to-end deep learning, in contrast, typically uses deep networks such as RNNs and CNNs to perform representation learning and classification directly on the waveform or on time-frequency features.
At the IEEE ICASSP 2017 conference, Google released a large-scale audio dataset, [Audioset](https://research.google.com/audioset/). It contains 632 audio classes and 2,084,320 human-labeled sound clips, each 10 seconds long, sourced from YouTube videos. The dataset now includes 2.1 million annotated videos and 5,800 hours of audio, with 527 labeled sound classes.
At the IEEE ICASSP 2017 conference, Google released a large-scale audio dataset, [Audioset](https://research.google.com/audioset/). It contains 632 audio classes and 2,084,320 human-labeled sound clips, each **10 seconds** long, sourced from YouTube videos. The dataset now includes 2.1 million annotated videos and 5,800 hours of audio, with 527 labeled sound classes.
`PANNs` ([PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition](https://arxiv.org/pdf/1912.10211.pdf)) are sound classification/recognition models trained on Audioset. After pretraining, the models can be used to extract audio embeddings. This example fine-tunes a pretrained `PANNs` model to complete a sound classification task.
@ -12,14 +12,14 @@
## Model Introduction
PaddleAudio provides CNN14, CNN10, and CNN6 pretrained PANNs models for users to choose from:
- CNN14: mainly 12 convolutional layers and 2 fully connected layers; 79.6M parameters; embedding dimension 2048.
- CNN10: mainly 8 convolutional layers and 2 fully connected layers; 4.9M parameters; embedding dimension 512.
- CNN6: mainly 4 convolutional layers and 2 fully connected layers; 4.5M parameters; embedding dimension 512.
## Dataset
[ESC-50: Dataset for Environmental Sound Classification](https://github.com/karolpiczak/ESC-50) is a dataset of 2000 labeled environmental sound samples: single-channel audio files sampled at 44,100 Hz, divided by label into 50 classes with 40 samples per class.
[ESC-50: Dataset for Environmental Sound Classification](https://github.com/karolpiczak/ESC-50) is a dataset of 2000 labeled environmental sound samples, each **5 seconds** long: single-channel audio files sampled at 44,100 Hz, divided by label into 50 classes with 40 samples per class.
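One way to fetch the dataset is to clone the repository linked above (a sketch; the example's data pipeline may also download it for you automatically):
```bash
git clone https://github.com/karolpiczak/ESC-50.git
```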
## Model Metrics
@ -43,13 +43,13 @@ $ CUDA_VISIBLE_DEVICES=0 ./run.sh 1 conf/panns.yaml
```
Training parameters can be configured under `training` in `conf/panns.yaml`, where:
- `epochs`: number of training epochs; default 50.
- `learning_rate`: learning rate for fine-tuning; default 5e-5.
- `batch_size`: batch size; adjust it to your GPU memory and lower it if you run out of memory; default 16.
- `num_workers`: number of subprocesses the DataLoader uses to fetch data; default 0, i.e. data is loaded in the main process.
- `checkpoint_dir`: directory where model parameters and optimizer state files are saved; default `./checkpoint`.
- `save_freq`: how often the model is saved during training; default 10.
- `log_freq`: how often information is logged during training; default 10.
The pretrained model used in the example code is `CNN14`. To switch to another pretrained model, modify the `model` entry in `conf/panns.yaml`:
```yaml
@ -76,7 +76,7 @@ $ CUDA_VISIBLE_DEVICES=0 ./run.sh 2 conf/panns.yaml
Prediction parameters can be configured under `predicting` in `conf/panns.yaml`, where:
- `audio_file`: the audio file to run prediction on.
- `top_k`: show the scores of the top-k predicted labels; default 1.
- `checkpoint`: the model parameter checkpoint file.
The prediction output looks like this:

@ -4,17 +4,42 @@ config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_ljspeech \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_ljspeech \
--voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \
--voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_ljspeech \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_ljspeech \
--voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \
--voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_ljspeech \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_ljspeech \
--voc_config=hifigan_ljspeech_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_ljspeech_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_ljspeech_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi

@ -4,19 +4,45 @@ config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_ljspeech \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_ljspeech \
--voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \
--voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \
--lang=en \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/test_e2e \
--inference_dir=${train_output_path}/inference \
--phones_dict=dump/phone_id_map.txt
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_ljspeech \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_ljspeech \
--voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \
--voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \
--lang=en \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/test_e2e \
--inference_dir=${train_output_path}/inference \
--phones_dict=dump/phone_id_map.txt
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_ljspeech \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_ljspeech \
--voc_config=hifigan_ljspeech_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_ljspeech_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_ljspeech_ckpt_0.2.0/feats_stats.npy \
--lang=en \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/test_e2e \
--inference_dir=${train_output_path}/inference \
--phones_dict=dump/phone_id_map.txt
fi

@ -0,0 +1,148 @@
# HiFiGAN with LJSpeech-1.1
This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/).
## Dataset
### Download and Extract
Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/).
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from here: [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
## Get Started
Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.
Assume the path to the MFA result of LJSpeech-1.1 is `./ljspeech_alignment`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance.
### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file to overwrite default config.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
benchmark:
arguments related to benchmark.
--batch-size BATCH_SIZE
batch size.
--max-iter MAX_ITER train max steps.
--run-benchmark RUN_BENCHMARK
running benchmark or not, if True, use the --batch-size
and --max-iter.
--profiler_options PROFILER_OPTIONS
The option of profiler, which should be in format
"key1=value1;key2=value2;key3=value3".
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--output-dir OUTPUT_DIR] [--ngpu NGPU]
Synthesize with GANVocoder.
optional arguments:
-h, --help show this help message and exit
--generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT
snapshot to load.
--test-metadata TEST_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
```
1. `--config` is the config file. You should use the same config with which the model is trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
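Putting these options together, a direct invocation roughly equivalent to what `./local/synthesize.sh` wraps might look like the sketch below (the checkpoint name and output path are illustrative):
```bash
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
    --config=conf/default.yaml \
    --checkpoint=exp/default/checkpoints/snapshot_iter_2500000.pdz \
    --test-metadata=dump/test/norm/metadata.jsonl \
    --output-dir=exp/default/test \
    --generator-type=hifigan
```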
## Pretrained Model
The pretrained model can be downloaded here [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip).
Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
:-------------:| :------------:| :-----: | :-----: | :--------:
default| 1(gpu) x 2500000|24.492|0.115|7.227
The HiFiGAN checkpoint contains the files listed below.
```text
hifigan_ljspeech_ckpt_0.2.0
├── default.yaml # default config used to train hifigan
├── feats_stats.npy # statistics used to normalize spectrogram when training hifigan
└── snapshot_iter_2500000.pdz # generator parameters of hifigan
```
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.

@ -0,0 +1,167 @@
# This is the configuration file for LJSpeech dataset.
# This configuration is based on HiFiGAN V1, which is an official configuration.
# But I found that the optimizer setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales is also modified from the original 256 shift setting.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 22050 # Sampling rate.
n_fft: 1024 # FFT size (samples).
n_shift: 256 # Hop size (samples). 11.6ms
win_length: null # Window length (samples).
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [8, 8, 2, 2] # Upsampling scales.
upsample_kernel_sizes: [16, 16, 4, 4] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: True # Whether to use additional conv layer in residual blocks.
bias: True # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: True
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: True # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: True # Whether to use bias parameter in conv layer.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
use_spectral_norm: False # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: False # Whether to use multi-resolution STFT loss.
use_mel_loss: True # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 22050
fft_size: 1024
hop_size: 256
win_length: null
window: "hann"
num_mels: 80
fmin: 0
fmax: 11025
log_base: null
generator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
use_feat_match_loss: True
feat_match_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
average_by_layers: False # Whether to average loss by #layers in each discriminator.
include_final_outputs: False # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_max_steps: 8192 # Length of each audio in batch. Make sure it is divisible by hop_size.
num_workers: 2 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random

@ -0,0 +1,55 @@
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./ljspeech_alignment \
--output=durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/../preprocess.py \
--rootdir=~/datasets/LJSpeech-1.1/ \
--dataset=ljspeech \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--cut-sil=True \
--num-cpu=20
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats (mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
fi

@ -0,0 +1,14 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \
--generator-type=hifigan

@ -0,0 +1,13 @@
#!/bin/bash
config_path=$1
train_output_path=$2
FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1

@ -0,0 +1,13 @@
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=hifigan
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/gan_vocoder/${MODEL}

@ -0,0 +1,32 @@
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_5000.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` files are saved under the `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

@ -4,18 +4,43 @@ config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_vctk \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_vctk \
--voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_vctk \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_vctk \
--voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_aishell3 \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_vctk \
--voc_config=hifigan_vctk_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_vctk_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_vctk_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
fi

@ -4,21 +4,49 @@ config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_vctk \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_vctk \
--voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--lang=en \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_vctk \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_vctk \
--voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--lang=en \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_vctk \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_vctk \
--voc_config=hifigan_vctk_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_vctk_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_vctk_ckpt_0.2.0/feats_stats.npy \
--lang=en \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--inference_dir=${train_output_path}/inference
fi

@ -0,0 +1,153 @@
# HiFiGAN with VCTK
This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with [VCTK](https://datashare.ed.ac.uk/handle/10283/3443).
## Dataset
### Download and Extract
Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/VCTK-Corpus-0.92`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it here: [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
PS: we removed three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)):
1. `p315`, because there is no text for it.
2. `p280` and `p362`, because there are no `*_mic2.flac` files (which are better than `*_mic1.flac`) for them.
## Get Started
Assume the path to the dataset is `~/datasets/VCTK-Corpus-0.92`.
Assume the path to the MFA result of VCTK is `./vctk_alignment`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the `dump` folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrograms. The statistics used to normalize the spectrograms are computed from the training set and stored in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the id and the path to the spectrogram of each utterance.
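If you want to inspect the processed data yourself, a minimal sketch like the one below works. The `feats` field name comes from the preprocessing commands above; the layout of `feats_stats.npy` (per-bin mean stacked on standard deviation) is an assumption here.
```python
import json

import numpy as np

# each line of metadata.jsonl is one JSON record describing an utterance
with open("dump/train/raw/metadata.jsonl") as f:
    records = [json.loads(line) for line in f]
print(records[0])  # inspect the available fields, e.g. the path under "feats"

# assumption: feats_stats.npy stacks the per-bin mean and standard deviation
mean, std = np.load("dump/train/feats_stats.npy")
mel = np.load(records[0]["feats"])  # log-mel spectrogram, shape (frames, n_mels)
mel_norm = (mel - mean) / std       # what normalize.py writes into dump/*/norm
```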
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file to overwrite default config.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
benchmark:
arguments related to benchmark.
--batch-size BATCH_SIZE
batch size.
--max-iter MAX_ITER train max steps.
--run-benchmark RUN_BENCHMARK
runing benchmark or not, if True, use the --batch-size
and --max-iter.
--profiler_options PROFILER_OPTIONS
The option of profiler, which should be in format
"key1=value1;key2=value2;key3=value3".
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveforms from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--output-dir OUTPUT_DIR] [--ngpu NGPU]
Synthesize with GANVocoder.
optional arguments:
-h, --help show this help message and exit
--generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT
snapshot to load.
--test-metadata TEST_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
```
1. `--config` is the config file. You should use the same config with which the model was trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `test/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
The pretrained model can be downloaded here [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip).
Model | Step | eval/generator_loss | eval/mel_loss | eval/feature_matching_loss
:-------------:| :------------:| :-----: | :-----: | :--------:
default | 1(gpu) x 2500000 | 58.092 | 0.1234 | 24.384
The HiFiGAN checkpoint contains the files listed below.
```text
hifigan_vctk_ckpt_0.2.0
├── default.yaml # default config used to train hifigan
├── feats_stats.npy # statistics used to normalize spectrogram when training hifigan
└── snapshot_iter_2500000.pdz # generator parameters of hifigan
```
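As a rough sketch of how these files fit together outside the provided scripts: the `generator_params` section of `default.yaml` builds the generator, and the `.pdz` snapshot holds its weights. The import path and the `generator_params` key inside the snapshot are assumptions about this repo's GAN-vocoder layout, not a documented API.
```python
import paddle
import yaml

# assumed import path for the generator class
from paddlespeech.t2s.models.hifigan import HiFiGANGenerator

with open("hifigan_vctk_ckpt_0.2.0/default.yaml") as f:
    config = yaml.safe_load(f)

generator = HiFiGANGenerator(**config["generator_params"])
# assumption: the snapshot stores generator weights under "generator_params"
state_dict = paddle.load("hifigan_vctk_ckpt_0.2.0/snapshot_iter_2500000.pdz")
generator.set_state_dict(state_dict["generator_params"])
generator.remove_weight_norm()  # usual step for GAN vocoders at inference time
generator.eval()

# a (frames, n_mels) normalized mel spectrogram as stand-in input
mel = paddle.randn([100, 80])
with paddle.no_grad():
    wav = generator.inference(mel)  # assumed inference helper returning the waveform
```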
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.

@ -0,0 +1,168 @@
# This is the configuration file for the VCTK dataset.
# This configuration is based on HiFiGAN V1, which is
# an official configuration. But I found that the optimizer
# setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales
# are also modified from the original 256 shift setting.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [5, 5, 4, 3] # Upsampling scales.
upsample_kernel_sizes: [10, 10, 8, 6] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: True # Whether to use additional conv layer in residual blocks.
bias: True # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: True
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: True # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: True # Whether to use bias parameter in conv layer.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
use_spectral_norm: False # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: False # Whether to use multi-resolution STFT loss.
use_mel_loss: True # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 24000
fft_size: 2048
hop_size: 300
win_length: 1200
window: "hann"
num_mels: 80
fmin: 0
fmax: 12000
log_base: null
generator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
use_feat_match_loss: True
feat_match_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
average_by_layers: False # Whether to average loss by #layers in each discriminator.
include_final_outputs: False # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure it is divisible by hop_size.
num_workers: 2 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random

@ -0,0 +1,55 @@
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./vctk_alignment \
--output=durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/../preprocess.py \
--rootdir=~/datasets/VCTK-Corpus-0.92/ \
--dataset=vctk \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--cut-sil=True \
--num-cpu=20
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats (mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
fi

@ -0,0 +1,14 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \
--generator-type=hifigan

@ -0,0 +1,13 @@
#!/bin/bash
config_path=$1
train_output_path=$2
FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1

@ -0,0 +1,13 @@
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=hifigan
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/gan_vocoder/${MODEL}

@ -0,0 +1,32 @@
#!/bin/bash
set -e
source path.sh
gpus=0
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_5000.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` files are saved under the `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

@ -6,3 +6,45 @@ sv0 - speaker verification with softmax backend etc, all python code
sv1 - dependence on kaldi, speaker verification with plda/sc backend,
for more info refer to sv1/readme.txt
## VoxCeleb2 preparation
VoxCeleb2 audio files are released in m4a format. All the VoxCeleb2 m4a audio files must be converted to wav files before feeding them into PaddleSpeech.
Please follow these steps to prepare the dataset correctly:
1. Download VoxCeleb2.
You can find download instructions here: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
2. Convert .m4a to wav
VoxCeleb2 stores files with the m4a audio format. To use them in PaddleSpeech, you have to convert all the m4a audio files into wav files.
``` shell
ffmpeg -y -i %s -ac 1 -vn -acodec pcm_s16le -ar 16000 %s
```
You can do the conversion using ffmpeg (see https://gist.github.com/seungwonpark/4f273739beef2691cd53b5c39629d830). This operation might take several hours and needs to be run only once.
3. Put all the wav files in a folder called `wav`. You should have something like `voxceleb2/wav/id*/*.wav` (e.g., `voxceleb2/wav/id00012/21Uxsk56VDQ/00001.wav`)
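If you prefer to script the conversion, here is a minimal sketch using the same ffmpeg flags as above; the source directory name (`voxceleb2/aac`) is an assumption about your download layout, so adjust the paths to match it.
```python
import subprocess
from pathlib import Path

src_root = Path("voxceleb2/aac")  # assumed location of the downloaded .m4a files
dst_root = Path("voxceleb2/wav")

for m4a in sorted(src_root.rglob("*.m4a")):
    wav = dst_root / m4a.relative_to(src_root).with_suffix(".wav")
    wav.parent.mkdir(parents=True, exist_ok=True)
    # mono, no video, 16-bit PCM, 16 kHz -- the same flags as the command above
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(m4a), "-ac", "1", "-vn",
         "-acodec", "pcm_s16le", "-ar", "16000", str(wav)],
        check=True,
    )
```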
## VoxCeleb dataset summary
|dataset | vox1 - dev | vox1 - test |vox2 - dev| vox2 - test|
|---------|-----------|------------|-----------|----------|
| spks | 1211 | 40 | 5994 | 118 |
| utts | 148642 | 4874 | 1092009 | 36273 |
| time(h) | 340.4 | 11.2 | 2360.2 | 79.9 |
## trial summary
| trial | filename | nums | positive | negative |
|--------|-----------|--------|-------|------|
| VoxCeleb1 | veri_test.txt | 37720 | 18860 | 18860 |
| VoxCeleb1(cleaned) | veri_test2.txt | 37611 | 18802 | 18809 |
| VoxCeleb1-H | list_test_hard.txt | 552536 | 276270 | 276266 |
| VoxCeleb1-H(cleaned) | list_test_hard2.txt | 550894 | 275488 | 275406 |
| VoxCeleb1-E | list_test_all.txt | 581480 | 290743 | 290737 |
| VoxCeleb1-E(cleaned) | list_test_all2.txt | 579818 | 289921 | 289897 |
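Each trial file lists pairs labeled as same-speaker (positive) or different-speaker (negative); a system is scored by thresholding the similarity of the two embeddings. Below is a minimal sketch of computing the equal error rate (EER) from cosine scores, with stand-in `scores` and `labels` rather than the repo's `test.py`:
```python
import numpy as np
from sklearn.metrics import roc_curve

# stand-in data: one cosine score and one 0/1 label per trial pair
scores = np.array([0.82, 0.11, 0.65, 0.23, 0.91, 0.40])
labels = np.array([1, 0, 1, 0, 1, 0])

fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1 - tpr
# the EER is the operating point where false accepts equal false rejects
idx = np.nanargmin(np.abs(fnr - fpr))
eer = (fpr[idx] + fnr[idx]) / 2
print(f"EER = {eer:.2%} at threshold {thresholds[idx]:.3f}")
```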

@ -0,0 +1,7 @@
# VoxCeleb
## ECAPA-TDNN
| Model | Number of Params | Release | Config | dim | Test set | Cosine | Cosine + S-Norm |
| --- | --- | --- | --- | --- | --- | --- | ---- |
| ECAPA-TDNN | 85M | 0.1.1 | conf/ecapa_tdnn.yaml |192 | test | 1.15 | 1.06 |

@ -0,0 +1,52 @@
###########################################
# Data #
###########################################
# we must explicitly specify the path of the vox2 wav data converted from m4a
vox2_base_path:
augment: True
batch_size: 16
num_workers: 2
num_speakers: 7205 # 1211 vox1, 5994 vox2, 7205 vox1+2, test speakers: 41
shuffle: True
random_chunk: True
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
# currently, we only support fbank
sr: 16000 # sample rate
n_mels: 80
window_size: 400 #25ms, sample rate 16000, 25 * 16000 / 1000 = 400
hop_size: 160 #10ms, sample rate 16000, 10 * 16000 / 1000 = 160
###########################################################
# MODEL SETTING #
###########################################################
# currently, we only support ecapa-tdnn in the ecapa_tdnn.yaml
# if you want to use another model, please choose another configuration yaml file
model:
input_size: 80
# "channels": [512, 512, 512, 512, 1536],
channels: [1024, 1024, 1024, 1024, 3072]
kernel_sizes: [5, 3, 3, 3, 1]
dilations: [1, 2, 3, 4, 1]
attention_channels: 128
lin_neurons: 192
###########################################
# Training #
###########################################
seed: 1986 # following the speechbrain configuration
epochs: 10
save_interval: 1
log_interval: 1
learning_rate: 1e-8
###########################################
# Testing #
###########################################
global_embedding_norm: True
embedding_mean_norm: True
embedding_std_norm: False

@ -0,0 +1,58 @@
#!/bin/bash
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
stage=1
stop_stage=100
. ${MAIN_ROOT}/utils/parse_options.sh || exit -1;
if [ $# -ne 2 ] ; then
echo "Usage: $0 [options] <data-dir> <conf-path>";
echo "e.g.: $0 ./data/ conf/ecapa_tdnn.yaml"
echo "Options: "
echo " --stage <stage|-1> # Used to run a partially-completed data process from somewhere in the middle."
echo " --stop-stage <stop-stage|100> # Used to run a partially-completed data process stop stage in the middle"
exit 1;
fi
dir=$1
conf_path=$2
mkdir -p ${dir}
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# data prepare for vox1 and vox2, vox2 must be converted from m4a to wav
# we should use local/convert.sh to convert m4a to wav
python3 local/data_prepare.py \
--data-dir ${dir} \
--config ${conf_path}
fi
TARGET_DIR=${MAIN_ROOT}/dataset
mkdir -p ${TARGET_DIR}
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# download data, generate manifests
python3 ${TARGET_DIR}/voxceleb/voxceleb1.py \
--manifest_prefix="data/vox1/manifest" \
--target_dir="${TARGET_DIR}/voxceleb/vox1/"
if [ $? -ne 0 ]; then
echo "Prepare voxceleb failed. Terminated."
exit 1
fi
# for dataset in train dev test; do
# mv data/manifest.${dataset} data/manifest.${dataset}.raw
# done
fi

@ -0,0 +1,70 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import paddle
from yacs.config import CfgNode
from paddleaudio.datasets.voxceleb import VoxCeleb
from paddlespeech.s2t.utils.log import Log
from paddlespeech.vector.io.augment import build_augment_pipeline
from paddlespeech.vector.training.seeding import seed_everything
logger = Log(__name__).getlog()
def main(args, config):
# stage 0: set the cpu device; all data preparation will be done in cpu mode
paddle.set_device("cpu")
# set the random seed, it is a must for multiprocess training
seed_everything(config.seed)
# stage 1: generate the voxceleb csv file
# Note: this may raise a C++ exception, but the program will execute fine,
# so we ignore the exception
# we explicitly pass the vox2 base path to data prepare and generate the audio info
logger.info("start to generate the voxceleb dataset info")
train_dataset = VoxCeleb(
'train', target_dir=args.data_dir, vox2_base_path=config.vox2_base_path)
# stage 2: generate the augment noise csv file
if config.augment:
logger.info("start to generate the augment dataset info")
augment_pipeline = build_augment_pipeline(target_dir=args.data_dir)
if __name__ == "__main__":
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--data-dir",
default="./data/",
type=str,
help="data directory")
parser.add_argument("--config",
default=None,
type=str,
help="configuration file")
args = parser.parse_args()
# yapf: enable
# https://yaml.org/type/float.html
config = CfgNode(new_allowed=True)
if args.config:
config.merge_from_file(args.config)
config.freeze()
print(config)
main(args, config)

@ -0,0 +1,51 @@
#!/bin/bash
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
. ./path.sh
stage=0
stop_stage=100
exp_dir=exp/ecapa-tdnn-vox12-big/ # experiment directory
conf_path=conf/ecapa_tdnn.yaml
audio_path="demo/voxceleb/00001.wav"
use_gpu=true
. ${MAIN_ROOT}/utils/parse_options.sh || exit -1;
if [ $# -ne 0 ] ; then
echo "Usage: $0 [options]";
echo "e.g.: $0 ./data/ exp/voxceleb12/ conf/ecapa_tdnn.yaml"
echo "Options: "
echo " --use-gpu <true,false|true> # specify is gpu is to be used for training"
echo " --stage <stage|-1> # Used to run a partially-completed data process from somewhere in the middle."
echo " --stop-stage <stop-stage|100> # Used to run a partially-completed data process stop stage in the middle"
echo " --exp-dir # experiment directorh, where is has the model.pdparams"
echo " --conf-path # configuration file for extracting the embedding"
echo " --audio-path # audio-path, which will be processed to extract the embedding"
exit 1;
fi
# set the test device
device="cpu"
if ${use_gpu}; then
device="gpu"
fi
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# extract the audio embedding
python3 ${BIN_DIR}/extract_emb.py --device ${device} \
--config ${conf_path} \
--audio-path ${audio_path} --load-checkpoint ${exp_dir}
fi

@ -0,0 +1,42 @@
#!/bin/bash
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
stage=1
stop_stage=100
use_gpu=true # if true, we run on GPU.
. ${MAIN_ROOT}/utils/parse_options.sh || exit -1;
if [ $# -ne 3 ] ; then
echo "Usage: $0 [options] <data-dir> <exp-dir> <conf-path>";
echo "e.g.: $0 ./data/ exp/voxceleb12/ conf/ecapa_tdnn.yaml"
echo "Options: "
echo " --use-gpu <true,false|true> # specify is gpu is to be used for training"
echo " --stage <stage|-1> # Used to run a partially-completed data process from somewhere in the middle."
echo " --stop-stage <stop-stage|100> # Used to run a partially-completed data process stop stage in the middle"
exit 1;
fi
dir=$1
exp_dir=$2
conf_path=$3
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# test the model and compute the eer metrics
python3 ${BIN_DIR}/test.py \
--data-dir ${dir} \
--load-checkpoint ${exp_dir} \
--config ${conf_path}
fi

@ -0,0 +1,61 @@
#!/bin/bash
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
stage=0
stop_stage=100
use_gpu=true # if true, we run on GPU.
. ${MAIN_ROOT}/utils/parse_options.sh || exit -1;
if [ $# -ne 3 ] ; then
echo "Usage: $0 [options] <data-dir> <exp-dir> <conf-path>";
echo "e.g.: $0 ./data/ exp/voxceleb12/ conf/ecapa_tdnn.yaml"
echo "Options: "
echo " --use-gpu <true,false|true> # specify is gpu is to be used for training"
echo " --stage <stage|-1> # Used to run a partially-completed data process from somewhere in the middle."
echo " --stop-stage <stop-stage|100> # Used to run a partially-completed data process stop stage in the middle"
exit 1;
fi
dir=$1
exp_dir=$2
conf_path=$3
# get the number of gpus for training
ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
echo "using $ngpu gpus..."
# setting training device
device="cpu"
if ${use_gpu}; then
device="gpu"
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train the speaker identification task with voxceleb data
# and a soft link to the trained model parameters will be created at ${exp_dir}/model.pdparams
# Note: the log files are stored in the exp/log directory
python3 -m paddle.distributed.launch --gpus=$CUDA_VISIBLE_DEVICES \
${BIN_DIR}/train.py --device ${device} --checkpoint-dir ${exp_dir} \
--data-dir ${dir} --config ${conf_path}
fi
if [ $? -ne 0 ]; then
echo "Failed in training!"
exit 1
fi
exit 0

@ -0,0 +1,28 @@
#!/bin/bash
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib/
MODEL=ecapa_tdnn
export BIN_DIR=${MAIN_ROOT}/paddlespeech/vector/exps/${MODEL}

@ -0,0 +1,69 @@
#!/bin/bash
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
. ./path.sh
set -e
#######################################################################
# stage 0: data preparation, including voxceleb1 download and generation of {train,dev,enroll,test}.csv
# voxceleb2 data is in m4a format, so users need to convert m4a to wav themselves, as described in README.md, with the script local/convert.sh
# stage 1: train the speaker identification model
# stage 2: test speaker identification
# stage 3: extract the training embeddings to train the LDA and PLDA
######################################################################
# we can set the variable PPAUDIO_HOME to specify the root directory of the downloaded vox1 and vox2 datasets
# by default the datasets will be stored in ~/.paddleaudio/
# the vox2 dataset is stored in m4a format; we need to convert the audio from m4a to wav ourselves
# and put all of it into ${PPAUDIO_HOME}/datasets/vox2
# we will find the wavs in ${PPAUDIO_HOME}/datasets/vox1/wav and ${PPAUDIO_HOME}/datasets/vox2/wav
# export PPAUDIO_HOME=
stage=0
stop_stage=50
# data directory
# if we set the variable ${dir}, the wav info will be stored in this directory
# otherwise, the wav info will be stored in the vox1 and vox2 directories respectively
# vox2 wav path: we must convert the m4a format to wav format
dir=data/ # data info directory
exp_dir=exp/ecapa-tdnn-vox12-big/ # experiment directory
conf_path=conf/ecapa_tdnn.yaml
gpus=0,1,2,3
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
mkdir -p ${exp_dir}
if [ $stage -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# stage 0: data prepare for vox1 and vox2, vox2 must be converted from m4a to wav
bash ./local/data.sh ${dir} ${conf_path}|| exit -1;
fi
if [ $stage -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# stage 1: train the speaker identification model
CUDA_VISIBLE_DEVICES=${gpus} bash ./local/train.sh ${dir} ${exp_dir} ${conf_path}
fi
if [ $stage -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# stage 2: get the speaker verification scores with cosine function
# now we only support use cosine to get the scores
CUDA_VISIBLE_DEVICES=0 bash ./local/test.sh ${dir} ${exp_dir} ${conf_path}
fi
# if [ $stage -le 3 ]; then
#     # stage 3: extract the training embeddings to train the LDA and PLDA
#     # todo: extract the training embedding
# fi

@ -0,0 +1 @@
../../../utils/

@ -0,0 +1,2 @@
.eggs
*.wav

@ -1,5 +1,9 @@
# Changelog
Date: 2022-3-15, Author: Xiaojie Chen.
- kaldi and librosa mfcc, fbank, spectrogram.
- unit test and benchmark.
Date: 2022-2-25, Author: Hui Zhang.
- Refactor architecture.
- dtw distance and mcd style dtw
- dtw distance and mcd style dtw.

@ -0,0 +1,7 @@
# PaddleAudio
PaddleAudio is an audio library for PaddlePaddle.
## Install
`pip install .`
