Merge remote-tracking branch 'upstream/develop' into develop

commit e91bff79f5 (pull/3006/head) by longrookie

@ -19,7 +19,7 @@ import subprocess
import platform
COPYRIGHT = '''
Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

@ -178,7 +178,10 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
- 🧩 *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural Language Processing (NLP) and Computer Vision (CV).
### Recent Update
- 🎉 2023.03.07: Add [TTS ARM Linux C++ Demo](./demos/TTSArmLinux).
- 🔥 2023.03.14: Add SVS (Singing Voice Synthesis) examples with the Opencpop dataset, including [DiffSinger](./examples/opencpop/svs1), [PWGAN](./examples/opencpop/voc1) and [HiFiGAN](./examples/opencpop/voc5); synthesis quality is being continuously improved.
- 👑 2023.03.09: Add [Wav2vec2ASR-zh](./examples/aishell/asr3).
- 🎉 2023.03.07: Add [TTS ARM Linux C++ Demo (with C++ Chinese Text Frontend)](./demos/TTSArmLinux).
- 🔥 2023.03.03: Add Voice Conversion [StarGANv2-VC synthesis pipeline](./examples/vctk/vc3).
- 🎉 2023.02.16: Add [Cantonese TTS](./examples/canton/tts3).
- 🔥 2023.01.10: Add [code-switching ASR CLI and demos](./demos/speech_recognition).
- 👑 2023.01.06: Add [code-switching ASR tal_cs recipe](./examples/tal_cs/asr1/).
@ -575,14 +578,14 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</thead>
<tbody>
<tr>
<td> Text Frontend </td>
<td colspan="2"> &emsp; </td>
<td>
<a href = "./examples/other/tn">tn</a> / <a href = "./examples/other/g2p">g2p</a>
</td>
<td> Text Frontend </td>
<td colspan="2"> &emsp; </td>
<td>
<a href = "./examples/other/tn">tn</a> / <a href = "./examples/other/g2p">g2p</a>
</td>
</tr>
<tr>
<td rowspan="5">Acoustic Model</td>
<td rowspan="6">Acoustic Model</td>
<td>Tacotron2</td>
<td>LJSpeech / CSMSC</td>
<td>
@ -617,6 +620,13 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
<a href = "./examples/vctk/ernie_sat">ERNIE-SAT-vctk</a> / <a href = "./examples/aishell3/ernie_sat">ERNIE-SAT-aishell3</a> / <a href = "./examples/aishell3_vctk/ernie_sat">ERNIE-SAT-zh_en</a>
</td>
</tr>
<tr>
<td>DiffSinger</td>
<td>Opencpop</td>
<td>
<a href = "./examples/opencpop/svs1">DiffSinger-opencpop</a>
</td>
</tr>
<tr>
<td rowspan="6">Vocoder</td>
<td >WaveFlow</td>
@ -627,9 +637,9 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tr>
<tr>
<td >Parallel WaveGAN</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3 / Opencpop</td>
<td>
<a href = "./examples/ljspeech/voc1">PWGAN-ljspeech</a> / <a href = "./examples/vctk/voc1">PWGAN-vctk</a> / <a href = "./examples/csmsc/voc1">PWGAN-csmsc</a> / <a href = "./examples/aishell3/voc1">PWGAN-aishell3</a>
<a href = "./examples/ljspeech/voc1">PWGAN-ljspeech</a> / <a href = "./examples/vctk/voc1">PWGAN-vctk</a> / <a href = "./examples/csmsc/voc1">PWGAN-csmsc</a> / <a href = "./examples/aishell3/voc1">PWGAN-aishell3</a> / <a href = "./examples/opencpop/voc1">PWGAN-opencpop</a>
</td>
</tr>
<tr>
@ -648,9 +658,9 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tr>
<tr>
<td>HiFiGAN</td>
<td>LJSpeech / VCTK / CSMSC / AISHELL-3</td>
<td>LJSpeech / VCTK / CSMSC / AISHELL-3 / Opencpop</td>
<td>
<a href = "./examples/ljspeech/voc5">HiFiGAN-ljspeech</a> / <a href = "./examples/vctk/voc5">HiFiGAN-vctk</a> / <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> / <a href = "./examples/aishell3/voc5">HiFiGAN-aishell3</a>
<a href = "./examples/ljspeech/voc5">HiFiGAN-ljspeech</a> / <a href = "./examples/vctk/voc5">HiFiGAN-vctk</a> / <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> / <a href = "./examples/aishell3/voc5">HiFiGAN-aishell3</a> / <a href = "./examples/opencpop/voc5">HiFiGAN-opencpop</a>
</td>
</tr>
<tr>

@ -183,7 +183,10 @@
- 🧩 Cascaded models application: as an extension of traditional speech tasks, we combine them with Natural Language Processing, Computer Vision, and other fields to deliver industrial-grade applications closer to real-world needs.
### Recent Update
- 🎉 2023.03.07: Add [TTS ARM Linux C++ deployment demo](./demos/TTSArmLinux).
- 🔥 2023.03.14: Add SVS (Singing Voice Synthesis) examples based on the Opencpop dataset, including [DiffSinger](./examples/opencpop/svs1), [PWGAN](./examples/opencpop/voc1) and [HiFiGAN](./examples/opencpop/voc5); synthesis quality is being continuously improved.
- 👑 2023.03.09: Add [Wav2vec2ASR-zh](./examples/aishell/asr3).
- 🎉 2023.03.07: Add [TTS ARM Linux C++ deployment demo (with C++ Chinese text frontend module)](./demos/TTSArmLinux).
- 🔥 2023.03.03: Add Voice Conversion [StarGANv2-VC synthesis pipeline](./examples/vctk/vc3).
- 🎉 2023.02.16: Add [Cantonese TTS](./examples/canton/tts3).
- 🔥 2023.01.10: Add [code-switching ASR CLI and demos](./demos/speech_recognition).
- 👑 2023.01.06: Add [code-switching ASR tal_cs training/inference recipe](./examples/tal_cs/asr1/).
@ -574,43 +577,50 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
<td>
<a href = "./examples/other/tn">tn</a> / <a href = "./examples/other/g2p">g2p</a>
</td>
</tr>
<tr>
<td rowspan="5">声学模型</td>
</tr>
<tr>
<td rowspan="6">声学模型</td>
<td>Tacotron2</td>
<td>LJSpeech / CSMSC</td>
<td>
<a href = "./examples/ljspeech/tts0">tacotron2-ljspeech</a> / <a href = "./examples/csmsc/tts0">tacotron2-csmsc</a>
</td>
</tr>
<tr>
</tr>
<tr>
<td>Transformer TTS</td>
<td>LJSpeech</td>
<td>
<a href = "./examples/ljspeech/tts1">transformer-ljspeech</a>
</td>
</tr>
<tr>
</tr>
<tr>
<td>SpeedySpeech</td>
<td>CSMSC</td>
<td >
<a href = "./examples/csmsc/tts2">speedyspeech-csmsc</a>
</td>
</tr>
<tr>
</tr>
<tr>
<td>FastSpeech2</td>
<td>LJSpeech / VCTK / CSMSC / AISHELL-3 / ZH_EN / finetune</td>
<td>
<a href = "./examples/ljspeech/tts3">fastspeech2-ljspeech</a> / <a href = "./examples/vctk/tts3">fastspeech2-vctk</a> / <a href = "./examples/csmsc/tts3">fastspeech2-csmsc</a> / <a href = "./examples/aishell3/tts3">fastspeech2-aishell3</a> / <a href = "./examples/zh_en_tts/tts3">fastspeech2-zh_en</a> / <a href = "./examples/other/tts_finetune/tts3">fastspeech2-finetune</a>
</td>
</tr>
<tr>
</tr>
<tr>
<td><a href = "https://arxiv.org/abs/2211.03545">ERNIE-SAT</a></td>
<td>VCTK / AISHELL-3 / ZH_EN</td>
<td>
<a href = "./examples/vctk/ernie_sat">ERNIE-SAT-vctk</a> / <a href = "./examples/aishell3/ernie_sat">ERNIE-SAT-aishell3</a> / <a href = "./examples/aishell3_vctk/ernie_sat">ERNIE-SAT-zh_en</a>
</td>
</tr>
</tr>
<tr>
<td>DiffSinger</td>
<td>Opencpop</td>
<td>
<a href = "./examples/opencpop/svs1">DiffSinger-opencpop</a>
</td>
</tr>
<tr>
<td rowspan="6">声码器</td>
<td >WaveFlow</td>
@ -621,9 +631,9 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</tr>
<tr>
<td >Parallel WaveGAN</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3 / Opencpop</td>
<td>
<a href = "./examples/ljspeech/voc1">PWGAN-ljspeech</a> / <a href = "./examples/vctk/voc1">PWGAN-vctk</a> / <a href = "./examples/csmsc/voc1">PWGAN-csmsc</a> / <a href = "./examples/aishell3/voc1">PWGAN-aishell3</a>
<a href = "./examples/ljspeech/voc1">PWGAN-ljspeech</a> / <a href = "./examples/vctk/voc1">PWGAN-vctk</a> / <a href = "./examples/csmsc/voc1">PWGAN-csmsc</a> / <a href = "./examples/aishell3/voc1">PWGAN-aishell3</a> / <a href = "./examples/opencpop/voc1">PWGAN-opencpop</a>
</td>
</tr>
<tr>
@ -642,9 +652,9 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</tr>
<tr>
<td >HiFiGAN</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3</td>
<td >LJSpeech / VCTK / CSMSC / AISHELL-3 / Opencpop</td>
<td>
<a href = "./examples/ljspeech/voc5">HiFiGAN-ljspeech</a> / <a href = "./examples/vctk/voc5">HiFiGAN-vctk</a> / <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> / <a href = "./examples/aishell3/voc5">HiFiGAN-aishell3</a>
<a href = "./examples/ljspeech/voc5">HiFiGAN-ljspeech</a> / <a href = "./examples/vctk/voc5">HiFiGAN-vctk</a> / <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> / <a href = "./examples/aishell3/voc5">HiFiGAN-aishell3</a> / <a href = "./examples/opencpop/voc5">HiFiGAN-opencpop</a>
</td>
</tr>
<tr>
@ -701,6 +711,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</tbody>
</table>
<a name="声音分类模型"></a>
**Sound Classification**

@ -1,4 +1,8 @@
# directories
build/
output/
libs/
models/
# symlink
dict

@ -10,9 +10,9 @@
### Install dependencies
```
```bash
# Ubuntu
sudo apt install build-essential cmake wget tar unzip
sudo apt install build-essential cmake pkg-config wget tar unzip
# CentOS
sudo yum groupinstall "Development Tools"
@ -25,15 +25,13 @@ sudo yum install cmake wget tar unzip
Download it with the following commands:
```
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech/demos/TTSArmLinux
```bash
./download.sh
```
### Build the Demo
```
```bash
./build.sh
```
@ -43,12 +41,18 @@ cd PaddleSpeech/demos/TTSArmLinux
### Run
```
You can change the `--phone2id_path` parameter in `./front.conf` to point to the `phone_id_map.txt` of your own acoustic model.
```bash
./run.sh
./run.sh --sentence "语音合成测试"
./run.sh --sentence "输出到指定的音频文件" --output_wav ./output/test.wav
./run.sh --help
```
This converts the ten sentences defined in the `sentencesToChoose` array in [src/main.cpp](src/main.cpp) into `wav` files saved in the `output` folder.
Currently only Chinese synthesis is supported; any English input will crash the program.
If `--output_wav` is not specified, the output goes to `./output/tts.wav` by default.
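For example, a minimal sketch of pointing the frontend at your own model's phoneme table (the checkpoint directory `./dict/my_am_ckpt/` is hypothetical):
```bash
# Rewrite the --phone2id_path entry of front.conf in place (GNU sed)
sed -i 's|^--phone2id_path=.*|--phone2id_path=./dict/my_am_ckpt/phone_id_map.txt|' ./front.conf
```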
## Manually build the Paddle Lite library

@ -0,0 +1 @@
src/TTSCppFrontend/build-depends.sh

@ -1,8 +1,11 @@
#!/bin/bash
set -e
set -x
cd "$(dirname "$(realpath "$0")")"
BASE_DIR="$PWD"
# load config
. ./config.sh
@ -10,11 +13,17 @@ cd "$(dirname "$(realpath "$0")")"
echo "ARM_ABI is ${ARM_ABI}"
echo "PADDLE_LITE_DIR is ${PADDLE_LITE_DIR}"
rm -rf build
mkdir -p build
cd build
echo "Build depends..."
./build-depends.sh "$@"
mkdir -p "$BASE_DIR/build"
cd "$BASE_DIR/build"
cmake -DPADDLE_LITE_DIR="${PADDLE_LITE_DIR}" -DARM_ABI="${ARM_ABI}" ../src
make
if [ "$*" = "" ]; then
make -j$(nproc)
else
make "$@"
fi
echo "make successful!"

@ -1,8 +1,11 @@
#!/bin/bash
set -e
set -x
cd "$(dirname "$(realpath "$0")")"
BASE_DIR="$PWD"
# load config
. ./config.sh
@ -12,3 +15,9 @@ set -x
rm -rf "$OUTPUT_DIR"
rm -rf "$LIBS_DIR"
rm -rf "$MODELS_DIR"
rm -rf "$BASE_DIR/build"
"$BASE_DIR/src/TTSCppFrontend/clean.sh"
# remove the symlink
rm "$BASE_DIR/dict"

@ -10,5 +10,6 @@ OUTPUT_DIR="${PWD}/output"
PADDLE_LITE_DIR="${LIBS_DIR}/inference_lite_lib.armlinux.${ARM_ABI}.gcc.with_extra.with_cv/cxx"
#PADDLE_LITE_DIR="/path/to/Paddle-Lite/build.lite.linux.${ARM_ABI}.gcc/inference_lite_lib.armlinux.${ARM_ABI}/cxx"
AM_MODEL_PATH="${MODELS_DIR}/cpu/fastspeech2_csmsc_arm.nb"
VOC_MODEL_PATH="${MODELS_DIR}/cpu/mb_melgan_csmsc_arm.nb"
ACOUSTIC_MODEL_PATH="${MODELS_DIR}/cpu/fastspeech2_csmsc_arm.nb"
VOCODER_PATH="${MODELS_DIR}/cpu/mb_melgan_csmsc_arm.nb"
FRONT_CONF="${PWD}/front.conf"

@ -3,6 +3,8 @@ set -e
cd "$(dirname "$(realpath "$0")")"
BASE_DIR="$PWD"
# load config
. ./config.sh
@ -38,6 +40,10 @@ download() {
echo '======================='
}
########################################
echo "Download models..."
download 'inference_lite_lib.armlinux.armv8.gcc.with_extra.with_cv.tar.gz' \
'https://paddlespeech.bj.bcebos.com/demos/TTSArmLinux/inference_lite_lib.armlinux.armv8.gcc.with_extra.with_cv.tar.gz' \
'39e0c6604f97c70f5d13c573d7e709b9' \
@ -54,3 +60,11 @@ download 'fs2cnn_mbmelgan_cpu_v1.3.0.tar.gz' \
"$MODELS_DIR"
echo "Done."
########################################
echo "Download dictionary files..."
ln -s src/TTSCppFrontend/front_demo/dict "$BASE_DIR/"
"$BASE_DIR/src/TTSCppFrontend/download.sh"

@ -0,0 +1,21 @@
# jieba conf
--jieba_dict_path=./dict/jieba/jieba.dict.utf8
--jieba_hmm_path=./dict/jieba/hmm_model.utf8
--jieba_user_dict_path=./dict/jieba/user.dict.utf8
--jieba_idf_path=./dict/jieba/idf.utf8
--jieba_stop_word_path=./dict/jieba/stop_words.utf8
# dict conf fastspeech2_0.4
--seperate_tone=false
--word2phone_path=./dict/fastspeech2_nosil_baker_ckpt_0.4/word2phone_fs2.dict
--phone2id_path=./dict/fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
--tone2id_path=./dict/fastspeech2_nosil_baker_ckpt_0.4/word2phone_fs2.dict
# dict conf speedyspeech_0.5
#--seperate_tone=true
#--word2phone_path=./dict/speedyspeech_nosil_baker_ckpt_0.5/word2phone.dict
#--phone2id_path=./dict/speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt
#--tone2id_path=./dict/speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt
# dict of tranditional_to_simplified
--trand2simpd_path=./dict/tranditional_to_simplified/trand2simp.txt

@ -7,12 +7,13 @@ cd "$(dirname "$(realpath "$0")")"
. ./config.sh
# create dir
rm -rf "$OUTPUT_DIR"
mkdir -p "$OUTPUT_DIR"
# run
for i in {1..10}; do
(set -x; ./build/paddlespeech_tts_demo "$AM_MODEL_PATH" "$VOC_MODEL_PATH" $i "$OUTPUT_DIR/$i.wav")
done
ls -lh "$OUTPUT_DIR"/*.wav
set -x
./build/paddlespeech_tts_demo \
--front_conf "$FRONT_CONF" \
--acoustic_model "$ACOUSTIC_MODEL_PATH" \
--vocoder "$VOCODER_PATH" \
"$@"
# end

@ -1,4 +1,18 @@
cmake_minimum_required(VERSION 3.10)
project(paddlespeech_tts_demo)
########## Global Options ##########
option(WITH_FRONT_DEMO "Build front demo" OFF)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(ABSL_PROPAGATE_CXX_STD ON)
########## ARM Options ##########
set(CMAKE_SYSTEM_NAME Linux)
if(ARM_ABI STREQUAL "armv8")
set(CMAKE_SYSTEM_PROCESSOR aarch64)
@ -13,14 +27,16 @@ else()
return()
endif()
project(paddlespeech_tts_demo)
########## Paddle Lite Options ##########
message(STATUS "TARGET ARCH ABI: ${ARM_ABI}")
message(STATUS "PADDLE LITE DIR: ${PADDLE_LITE_DIR}")
include_directories(${PADDLE_LITE_DIR}/include)
link_directories(${PADDLE_LITE_DIR}/libs/${ARM_ABI})
link_directories(${PADDLE_LITE_DIR}/lib)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11")
if(ARM_ABI STREQUAL "armv8")
set(CMAKE_CXX_FLAGS "-march=armv8-a ${CMAKE_CXX_FLAGS}")
set(CMAKE_C_FLAGS "-march=armv8-a ${CMAKE_C_FLAGS}")
@ -29,6 +45,9 @@ elseif(ARM_ABI STREQUAL "armv7hf")
set(CMAKE_C_FLAGS "-march=armv7-a -mfloat-abi=hard -mfpu=neon-vfpv4 ${CMAKE_C_FLAGS}" )
endif()
########## Dependencies ##########
find_package(OpenMP REQUIRED)
if(OpenMP_FOUND OR OpenMP_CXX_FOUND)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
@ -43,5 +62,19 @@ else()
return()
endif()
############### tts cpp frontend ###############
add_subdirectory(TTSCppFrontend)
include_directories(
TTSCppFrontend/src
third-party/build/src/cppjieba/include
third-party/build/src/limonp/include
)
############### paddlespeech_tts_demo ###############
add_executable(paddlespeech_tts_demo main.cc)
target_link_libraries(paddlespeech_tts_demo paddle_light_api_shared)
target_link_libraries(paddlespeech_tts_demo paddle_light_api_shared paddlespeech_tts_front)

@ -1,7 +1,20 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <fstream>
#include <iostream>
#include <memory>
#include <string>
#include <vector>
@ -9,32 +22,84 @@
using namespace paddle::lite_api;
typedef int16_t WavDataType;
class PredictorInterface {
public:
virtual ~PredictorInterface() = 0;
virtual bool Init(const std::string &AcousticModelPath,
const std::string &VocoderPath,
PowerMode cpuPowerMode,
int cpuThreadNum,
// The WAV sample rate must match the model output.
// If playback speed or pitch sounds wrong, adjust the sample rate.
// Common sample rates: 16000, 24000, 32000, 44100, 48000, 96000
uint32_t wavSampleRate) = 0;
virtual std::shared_ptr<PaddlePredictor> LoadModel(
const std::string &modelPath,
int cpuThreadNum,
PowerMode cpuPowerMode) = 0;
virtual void ReleaseModel() = 0;
virtual bool RunModel(const std::vector<int64_t> &phones) = 0;
virtual std::unique_ptr<const Tensor> GetAcousticModelOutput(
const std::vector<int64_t> &phones) = 0;
virtual std::unique_ptr<const Tensor> GetVocoderOutput(
std::unique_ptr<const Tensor> &&amOutput) = 0;
virtual void VocoderOutputToWav(
std::unique_ptr<const Tensor> &&vocOutput) = 0;
virtual void SaveFloatWav(float *floatWav, int64_t size) = 0;
virtual bool IsLoaded() = 0;
virtual float GetInferenceTime() = 0;
virtual int GetWavSize() = 0;
// Get the WAV duration in milliseconds
virtual float GetWavDuration() = 0;
// Get the RTF (synthesis time / audio duration)
virtual float GetRTF() = 0;
virtual void ReleaseWav() = 0;
virtual bool WriteWavToFile(const std::string &wavPath) = 0;
};
class Predictor {
public:
bool Init(const std::string &AMModelPath, const std::string &VOCModelPath, int cpuThreadNum, const std::string &cpuPowerMode) {
PredictorInterface::~PredictorInterface() {}
// WavDataType: the WAV sample data type.
// Switch between int16_t and float to produce
// WAVs in 16-bit PCM or 32-bit IEEE float format.
template <typename WavDataType>
class Predictor : public PredictorInterface {
public:
bool Init(const std::string &AcousticModelPath,
const std::string &VocoderPath,
PowerMode cpuPowerMode,
int cpuThreadNum,
// The WAV sample rate must match the model output.
// If playback speed or pitch sounds wrong, adjust the sample rate.
// Common sample rates: 16000, 24000, 32000, 44100, 48000, 96000
uint32_t wavSampleRate) override {
// Release model if exists
ReleaseModel();
AM_predictor_ = LoadModel(AMModelPath, cpuThreadNum, cpuPowerMode);
if (AM_predictor_ == nullptr) {
acoustic_model_predictor_ =
LoadModel(AcousticModelPath, cpuThreadNum, cpuPowerMode);
if (acoustic_model_predictor_ == nullptr) {
return false;
}
VOC_predictor_ = LoadModel(VOCModelPath, cpuThreadNum, cpuPowerMode);
if (VOC_predictor_ == nullptr) {
vocoder_predictor_ = LoadModel(VocoderPath, cpuThreadNum, cpuPowerMode);
if (vocoder_predictor_ == nullptr) {
return false;
}
wav_sample_rate_ = wavSampleRate;
return true;
}
~Predictor() {
virtual ~Predictor() {
ReleaseModel();
ReleaseWav();
}
std::shared_ptr<PaddlePredictor> LoadModel(const std::string &modelPath, int cpuThreadNum, const std::string &cpuPowerMode) {
std::shared_ptr<PaddlePredictor> LoadModel(
const std::string &modelPath,
int cpuThreadNum,
PowerMode cpuPowerMode) override {
if (modelPath.empty()) {
return nullptr;
}
@ -43,33 +108,17 @@ public:
MobileConfig config;
config.set_model_from_file(modelPath);
config.set_threads(cpuThreadNum);
if (cpuPowerMode == "LITE_POWER_HIGH") {
config.set_power_mode(PowerMode::LITE_POWER_HIGH);
} else if (cpuPowerMode == "LITE_POWER_LOW") {
config.set_power_mode(PowerMode::LITE_POWER_LOW);
} else if (cpuPowerMode == "LITE_POWER_FULL") {
config.set_power_mode(PowerMode::LITE_POWER_FULL);
} else if (cpuPowerMode == "LITE_POWER_NO_BIND") {
config.set_power_mode(PowerMode::LITE_POWER_NO_BIND);
} else if (cpuPowerMode == "LITE_POWER_RAND_HIGH") {
config.set_power_mode(PowerMode::LITE_POWER_RAND_HIGH);
} else if (cpuPowerMode == "LITE_POWER_RAND_LOW") {
config.set_power_mode(PowerMode::LITE_POWER_RAND_LOW);
} else {
std::cerr << "Unknown cpu power mode!" << std::endl;
return nullptr;
}
config.set_power_mode(cpuPowerMode);
return CreatePaddlePredictor<MobileConfig>(config);
}
void ReleaseModel() {
AM_predictor_ = nullptr;
VOC_predictor_ = nullptr;
void ReleaseModel() override {
acoustic_model_predictor_ = nullptr;
vocoder_predictor_ = nullptr;
}
bool RunModel(const std::vector<int64_t> &phones) {
bool RunModel(const std::vector<int64_t> &phones) override {
if (!IsLoaded()) {
return false;
}
@ -78,28 +127,29 @@ public:
auto start = std::chrono::system_clock::now();
// Run inference
VOCOutputToWav(GetAMOutput(phones));
VocoderOutputToWav(GetVocoderOutput(GetAcousticModelOutput(phones)));
// Stop timing
auto end = std::chrono::system_clock::now();
// Compute the elapsed time
std::chrono::duration<float> duration = end - start;
inference_time_ = duration.count() * 1000;  // in milliseconds
inference_time_ = duration.count() * 1000;  // in milliseconds
return true;
}
std::unique_ptr<const Tensor> GetAMOutput(const std::vector<int64_t> &phones) {
auto phones_handle = AM_predictor_->GetInput(0);
std::unique_ptr<const Tensor> GetAcousticModelOutput(
const std::vector<int64_t> &phones) override {
auto phones_handle = acoustic_model_predictor_->GetInput(0);
phones_handle->Resize({static_cast<int64_t>(phones.size())});
phones_handle->CopyFromCpu(phones.data());
AM_predictor_->Run();
acoustic_model_predictor_->Run();
// Get the output tensor
auto am_output_handle = AM_predictor_->GetOutput(0);
auto am_output_handle = acoustic_model_predictor_->GetOutput(0);
// Print the shape of the output tensor
std::cout << "AM Output shape: ";
std::cout << "Acoustic Model Output shape: ";
auto shape = am_output_handle->shape();
for (auto s : shape) {
std::cout << s << ", ";
@ -109,75 +159,91 @@ public:
return am_output_handle;
}
void VOCOutputToWav(std::unique_ptr<const Tensor> &&input) {
auto mel_handle = VOC_predictor_->GetInput(0);
std::unique_ptr<const Tensor> GetVocoderOutput(
std::unique_ptr<const Tensor> &&amOutput) override {
auto mel_handle = vocoder_predictor_->GetInput(0);
// [?, 80]
auto dims = input->shape();
auto dims = amOutput->shape();
mel_handle->Resize(dims);
auto am_output_data = input->mutable_data<float>();
auto am_output_data = amOutput->mutable_data<float>();
mel_handle->CopyFromCpu(am_output_data);
VOC_predictor_->Run();
vocoder_predictor_->Run();
// Get the output tensor
auto voc_output_handle = VOC_predictor_->GetOutput(0);
auto voc_output_handle = vocoder_predictor_->GetOutput(0);
// Print the shape of the output tensor
std::cout << "VOC Output shape: ";
std::cout << "Vocoder Output shape: ";
auto shape = voc_output_handle->shape();
for (auto s : shape) {
std::cout << s << ", ";
}
std::cout << std::endl;
return voc_output_handle;
}
void VocoderOutputToWav(
std::unique_ptr<const Tensor> &&vocOutput) override {
// Get the data of the output tensor
int64_t output_size = 1;
for (auto dim : voc_output_handle->shape()) {
for (auto dim : vocOutput->shape()) {
output_size *= dim;
}
auto output_data = voc_output_handle->mutable_data<float>();
auto output_data = vocOutput->mutable_data<float>();
SaveFloatWav(output_data, output_size);
}
inline float Abs(float number) {
return (number < 0) ? -number : number;
}
void SaveFloatWav(float *floatWav, int64_t size) override;
void SaveFloatWav(float *floatWav, int64_t size) {
wav_.resize(size);
float maxSample = 0.01;
// Find the maximum sample value
for (int64_t i=0; i<size; i++) {
float sample = Abs(floatWav[i]);
if (sample > maxSample) {
maxSample = sample;
}
}
// Scale samples into the int16_t range
for (int64_t i=0; i<size; i++) {
wav_[i] = floatWav[i] * 32767.0f / maxSample;
}
bool IsLoaded() override {
return acoustic_model_predictor_ != nullptr &&
vocoder_predictor_ != nullptr;
}
bool IsLoaded() {
return AM_predictor_ != nullptr && VOC_predictor_ != nullptr;
}
float GetInferenceTime() override { return inference_time_; }
float GetInferenceTime() {
return inference_time_;
}
const std::vector<WavDataType> &GetWav() { return wav_; }
const std::vector<WavDataType> & GetWav() {
return wav_;
}
int GetWavSize() override { return wav_.size() * sizeof(WavDataType); }
int GetWavSize() {
return wav_.size() * sizeof(WavDataType);
// Get the WAV duration in milliseconds
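// e.g. 96000 bytes of int16_t samples at 24 kHz: 96000 / 2 / 24000 * 1000 = 2000 ms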
float GetWavDuration() override {
return static_cast<float>(GetWavSize()) / sizeof(WavDataType) /
static_cast<float>(wav_sample_rate_) * 1000;
}
void ReleaseWav() {
wav_.clear();
// Get the RTF (synthesis time / audio duration)
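// RTF < 1 means synthesis runs faster than real time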
float GetRTF() override { return GetInferenceTime() / GetWavDuration(); }
void ReleaseWav() override { wav_.clear(); }
bool WriteWavToFile(const std::string &wavPath) override {
std::ofstream fout(wavPath, std::ios::binary);
if (!fout.is_open()) {
return false;
}
// Write the header
WavHeader header;
header.audio_format = GetWavAudioFormat();
header.data_size = GetWavSize();
header.size = sizeof(header) - 8 + header.data_size;
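// The RIFF chunk size covers everything after the 8-byte 'RIFF' tag and size field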
header.sample_rate = wav_sample_rate_;
header.byte_rate = header.sample_rate * header.num_channels *
header.bits_per_sample / 8;
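// e.g. 24000 Hz * 1 channel * 16 bits / 8 = 48000 bytes per second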
header.block_align = header.num_channels * header.bits_per_sample / 8;
fout.write(reinterpret_cast<const char *>(&header), sizeof(header));
// Write the WAV data
fout.write(reinterpret_cast<const char *>(wav_.data()),
header.data_size);
fout.close();
return true;
}
protected:
struct WavHeader {
// RIFF header
char riff[4] = {'R', 'I', 'F', 'F'};
@ -187,15 +253,11 @@ public:
// FMT header
char fmt[4] = {'f', 'm', 't', ' '};
uint32_t fmt_size = 16;
uint16_t audio_format = 1;  // 1 = integer PCM, 3 = IEEE float
uint16_t audio_format = 0;
uint16_t num_channels = 1;
// If playback speed or pitch sounds wrong, adjust the sample rate.
// Common sample rates: 16000, 24000, 32000, 44100, 48000, 96000
uint32_t sample_rate = 24000;
uint32_t byte_rate = 64000;
uint16_t block_align = 2;
uint32_t sample_rate = 0;
uint32_t byte_rate = 0;
uint16_t block_align = 0;
uint16_t bits_per_sample = sizeof(WavDataType) * 8;
// DATA header
@ -203,30 +265,56 @@ public:
uint32_t data_size = 0;
};
bool WriteWavToFile(const std::string &wavPath) {
std::ofstream fout(wavPath, std::ios::binary);
if (!fout.is_open()) {
return false;
}
// Write the header
WavHeader header;
header.data_size = GetWavSize();
header.size = sizeof(header) - 8 + header.data_size;
header.byte_rate = header.sample_rate * header.num_channels * header.bits_per_sample / 8;
header.block_align = header.num_channels * header.bits_per_sample / 8;
fout.write(reinterpret_cast<const char*>(&header), sizeof(header));
enum WavAudioFormat {
WAV_FORMAT_16BIT_PCM = 1,   // 16-bit PCM format
WAV_FORMAT_32BIT_FLOAT = 3  // 32-bit IEEE float format
};
// Write the WAV data
fout.write(reinterpret_cast<const char*>(wav_.data()), header.data_size);
protected:
// The return value is selected by template specialization on WavDataType
inline uint16_t GetWavAudioFormat();
fout.close();
return true;
}
inline float Abs(float number) { return (number < 0) ? -number : number; }
private:
protected:
float inference_time_ = 0;
std::shared_ptr<PaddlePredictor> AM_predictor_ = nullptr;
std::shared_ptr<PaddlePredictor> VOC_predictor_ = nullptr;
uint32_t wav_sample_rate_ = 0;
std::vector<WavDataType> wav_;
std::shared_ptr<PaddlePredictor> acoustic_model_predictor_ = nullptr;
std::shared_ptr<PaddlePredictor> vocoder_predictor_ = nullptr;
};
template <>
uint16_t Predictor<int16_t>::GetWavAudioFormat() {
return Predictor::WAV_FORMAT_16BIT_PCM;
}
template <>
uint16_t Predictor<float>::GetWavAudioFormat() {
return Predictor::WAV_FORMAT_32BIT_FLOAT;
}
// Save the WAV as 16-bit PCM
template <>
void Predictor<int16_t>::SaveFloatWav(float *floatWav, int64_t size) {
wav_.resize(size);
float maxSample = 0.01;
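// The 0.01 floor avoids division by zero and over-amplifying near-silent audio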
// Find the maximum sample value
for (int64_t i = 0; i < size; i++) {
float sample = Abs(floatWav[i]);
if (sample > maxSample) {
maxSample = sample;
}
}
// Scale samples into the int16_t range
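// (peak normalization: the loudest sample is mapped to full scale, ±32767)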
for (int64_t i = 0; i < size; i++) {
wav_[i] = floatWav[i] * 32767.0f / maxSample;
}
}
// Save the WAV as 32-bit IEEE float
template <>
void Predictor<float>::SaveFloatWav(float *floatWav, int64_t size) {
wav_.resize(size);
std::copy_n(floatWav, size, wav_.data());
}

@ -0,0 +1 @@
../../TTSCppFrontend/

@ -1,72 +1,162 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <front/front_interface.h>
#include <gflags/gflags.h>
#include <glog/logging.h>
#include <paddle_api.h>
#include <cstdlib>
#include <iostream>
#include <map>
#include <memory>
#include "paddle_api.h"
#include <string>
#include "Predictor.hpp"
using namespace paddle::lite_api;
std::vector<std::vector<int64_t>> sentencesToChoose = {
// 009901 昨日,这名“伤者”与医生全部被警方依法刑事拘留。
{261, 231, 175, 116, 179, 262, 44, 154, 126, 177, 19, 262, 42, 241, 72, 177, 56, 174, 245, 37, 186, 37, 49, 151, 127, 69, 19, 179, 72, 69, 4, 260, 126, 177, 116, 151, 239, 153, 141},
// 009902 钱伟长想到上海来办学校是经过深思熟虑的。
{174, 83, 213, 39, 20, 260, 89, 40, 30, 177, 22, 71, 9, 153, 8, 37, 17, 260, 251, 260, 99, 179, 177, 116, 151, 125, 70, 233, 177, 51, 176, 108, 177, 184, 153, 242, 40, 45},
// 009903 她见我一进门就骂,吃饭时也骂,骂得我抬不起头。
{182, 2, 151, 85, 232, 73, 151, 123, 154, 52, 151, 143, 154, 5, 179, 39, 113, 69, 17, 177, 114, 105, 154, 5, 179, 154, 5, 40, 45, 232, 182, 8, 37, 186, 174, 74, 182, 168},
// 009904 李述德在离开之前,只说了一句“柱驼杀父亲了”。
{153, 74, 177, 186, 40, 42, 261, 10, 153, 73, 152, 7, 262, 113, 174, 83, 179, 262, 115, 177, 230, 153, 45, 73, 151, 242, 180, 262, 186, 182, 231, 177, 2, 69, 186, 174, 124, 153, 45},
// 009905 这种车票和保险单捆绑出售属于重复性购买。
{262, 44, 262, 163, 39, 41, 173, 99, 71, 42, 37, 28, 260, 84, 40, 14, 179, 152, 220, 37, 21, 39, 183, 177, 170, 179, 177, 185, 240, 39, 162, 69, 186, 260, 128, 70, 170, 154, 9},
// 009906 戴佩妮的男友西米露接唱情歌,让她非常开心。
{40, 10, 173, 49, 155, 72, 40, 45, 155, 15, 142, 260, 72, 154, 74, 153, 186, 179, 151, 103, 39, 22, 174, 126, 70, 41, 179, 175, 22, 182, 2, 69, 46, 39, 20, 152, 7, 260, 120},
// 009907 观大势、谋大局、出大策始终是该院的办院方针。
{70, 199, 40, 5, 177, 116, 154, 168, 40, 5, 151, 240, 179, 39, 183, 40, 5, 38, 44, 179, 177, 115, 262, 161, 177, 116, 70, 7, 247, 40, 45, 37, 17, 247, 69, 19, 262, 51},
// 009908 他们骑着摩托回家,正好为农忙时的父母帮忙。
{182, 2, 154, 55, 174, 73, 262, 45, 154, 157, 182, 230, 71, 212, 151, 77, 180, 262, 59, 71, 29, 214, 155, 162, 154, 20, 177, 114, 40, 45, 69, 186, 154, 185, 37, 19, 154, 20},
// 009909 但是因为还没到退休年龄,只能掰着指头捱日子。
{40, 17, 177, 116, 120, 214, 71, 8, 154, 47, 40, 30, 182, 214, 260, 140, 155, 83, 153, 126, 180, 262, 115, 155, 57, 37, 7, 262, 45, 262, 115, 182, 171, 8, 175, 116, 261, 112},
// 009910 这几天雨水不断,人们恨不得待在家里不出门。
{262, 44, 151, 74, 182, 82, 240, 177, 213, 37, 184, 40, 202, 180, 175, 52, 154, 55, 71, 54, 37, 186, 40, 42, 40, 7, 261, 10, 151, 77, 153, 74, 37, 186, 39, 183, 154, 52},
};
void usage(const char *binName) {
std::cerr << "Usage:" << std::endl
<< "\t" << binName << " <AM-model-path> <VOC-model-path> <sentences-index:1-10> <output-wav-path>" << std::endl;
}
DEFINE_string(
sentence,
"你好,欢迎使用语音合成服务",
"Text to be synthesized (Chinese only. English will crash the program.)");
DEFINE_string(front_conf, "./front.conf", "Front configuration file");
DEFINE_string(acoustic_model,
"./models/cpu/fastspeech2_csmsc_arm.nb",
"Acoustic model .nb file");
DEFINE_string(vocoder,
              "./models/cpu/mb_melgan_csmsc_arm.nb",
              "Vocoder .nb file");
DEFINE_string(output_wav, "./output/tts.wav", "Output WAV file");
DEFINE_string(wav_bit_depth,
"16",
"WAV bit depth, 16 (16-bit PCM) or 32 (32-bit IEEE float)");
DEFINE_string(wav_sample_rate,
"24000",
"WAV sample rate, should match the output of the vocoder");
DEFINE_string(cpu_thread, "1", "Number of CPU threads");
int main(int argc, char *argv[]) {
if (argc < 5) {
usage(argv[0]);
gflags::ParseCommandLineFlags(&argc, &argv, true);
PredictorInterface *predictor;
if (FLAGS_wav_bit_depth == "16") {
predictor = new Predictor<int16_t>();
} else if (FLAGS_wav_bit_depth == "32") {
predictor = new Predictor<float>();
} else {
LOG(ERROR) << "Unsupported WAV bit depth: " << FLAGS_wav_bit_depth;
return -1;
}
const char *AMModelPath = argv[1];
const char *VOCModelPath = argv[2];
int sentencesIndex = atoi(argv[3]) - 1;
const char *outputWavPath = argv[4];
if (sentencesIndex < 0 || sentencesIndex >= sentencesToChoose.size()) {
std::cerr << "sentences-index out of range" << std::endl;
/////////////////////////// Frontend: text to phonemes ///////////////////////////
// Instantiate the text frontend engine
ppspeech::FrontEngineInterface *front_inst = nullptr;
front_inst = new ppspeech::FrontEngineInterface(FLAGS_front_conf);
if ((!front_inst) || (front_inst->init())) {
LOG(ERROR) << "Creater tts engine failed!";
if (front_inst != nullptr) {
delete front_inst;
}
front_inst = nullptr;
return -1;
}
Predictor predictor;
if (!predictor.Init(AMModelPath, VOCModelPath, 1, "LITE_POWER_HIGH")) {
std::cerr << "predictor init failed" << std::endl;
std::wstring ws_sentence = ppspeech::utf8string2wstring(FLAGS_sentence);
// Convert traditional to simplified Chinese
std::wstring sentence_simp;
front_inst->Trand2Simp(ws_sentence, &sentence_simp);
ws_sentence = sentence_simp;
std::string s_sentence;
std::vector<std::wstring> sentence_part;
std::vector<int> phoneids = {};
std::vector<int> toneids = {};
// Split the sentence by punctuation
LOG(INFO) << "Start to segment sentences by punctuation";
front_inst->SplitByPunc(ws_sentence, &sentence_part);
LOG(INFO) << "Segment sentences through punctuation successfully";
// Get the phoneme IDs of each split sentence
LOG(INFO)
<< "Start to get the phoneme and tone id sequence of each sentence";
for (int i = 0; i < sentence_part.size(); i++) {
LOG(INFO) << "Raw sentence is: "
<< ppspeech::wstring2utf8string(sentence_part[i]);
front_inst->SentenceNormalize(&sentence_part[i]);
s_sentence = ppspeech::wstring2utf8string(sentence_part[i]);
LOG(INFO) << "After normalization sentence is: " << s_sentence;
if (0 != front_inst->GetSentenceIds(s_sentence, &phoneids, &toneids)) {
LOG(ERROR) << "TTS inst get sentence phoneids and toneids failed";
return -1;
}
}
LOG(INFO) << "The phoneids of the sentence is: "
<< limonp::Join(phoneids.begin(), phoneids.end(), " ");
LOG(INFO) << "The toneids of the sentence is: "
<< limonp::Join(toneids.begin(), toneids.end(), " ");
LOG(INFO) << "Get the phoneme id sequence of each sentence successfully";
/////////////////////////// Backend: phonemes to audio ///////////////////////////
// The WAV sample rate must match the model output.
// If playback speed or pitch sounds wrong, adjust the sample rate.
// Common sample rates: 16000, 24000, 32000, 44100, 48000, 96000
const uint32_t wavSampleRate = std::stoul(FLAGS_wav_sample_rate);
// Number of CPU threads
const int cpuThreadNum = std::stol(FLAGS_cpu_thread);
// CPU power mode
const PowerMode cpuPowerMode = PowerMode::LITE_POWER_HIGH;
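// LITE_POWER_HIGH prefers binding worker threads to the big (high-frequency) cores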
if (!predictor->Init(FLAGS_acoustic_model,
FLAGS_vocoder,
cpuPowerMode,
cpuThreadNum,
wavSampleRate)) {
LOG(ERROR) << "predictor init failed" << std::endl;
return -1;
}
if (!predictor.RunModel(sentencesToChoose[sentencesIndex])) {
std::cerr << "predictor run model failed" << std::endl;
std::vector<int64_t> phones(phoneids.size());
std::transform(phoneids.begin(), phoneids.end(), phones.begin(), [](int x) {
return static_cast<int64_t>(x);
});
if (!predictor->RunModel(phones)) {
LOG(ERROR) << "predictor run model failed" << std::endl;
return -1;
}
std::cout << "Inference time: " << predictor.GetInferenceTime() << " ms, "
<< "WAV size (without header): " << predictor.GetWavSize() << " bytes" << std::endl;
LOG(INFO) << "Inference time: " << predictor->GetInferenceTime() << " ms, "
<< "WAV size (without header): " << predictor->GetWavSize()
<< " bytes, "
<< "WAV duration: " << predictor->GetWavDuration() << " ms, "
<< "RTF: " << predictor->GetRTF() << std::endl;
if (!predictor.WriteWavToFile(outputWavPath)) {
std::cerr << "write wav file failed" << std::endl;
if (!predictor->WriteWavToFile(FLAGS_output_wav)) {
LOG(ERROR) << "write wav file failed" << std::endl;
return -1;
}
delete predictor;
return 0;
}

@ -0,0 +1 @@
TTSCppFrontend/third-party

@ -0,0 +1,2 @@
build/
dict/

@ -0,0 +1,63 @@
cmake_minimum_required(VERSION 3.10)
project(paddlespeech_tts_cpp)
########## Global Options ##########
option(WITH_FRONT_DEMO "Build front demo" ON)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(ABSL_PROPAGATE_CXX_STD ON)
########## Dependencies ##########
set(ENV{PKG_CONFIG_PATH} "${CMAKE_SOURCE_DIR}/third-party/build/lib/pkgconfig:${CMAKE_SOURCE_DIR}/third-party/build/lib64/pkgconfig")
find_package(PkgConfig REQUIRED)
# It is hard to load xxx-config.cmake in a custom location, so use pkgconfig instead.
pkg_check_modules(ABSL REQUIRED absl_strings IMPORTED_TARGET)
pkg_check_modules(GFLAGS REQUIRED gflags IMPORTED_TARGET)
pkg_check_modules(GLOG REQUIRED libglog IMPORTED_TARGET)
# load header-only libraries
include_directories(
${CMAKE_SOURCE_DIR}/third-party/build/src/cppjieba/include
${CMAKE_SOURCE_DIR}/third-party/build/src/limonp/include
)
find_package(Threads REQUIRED)
########## paddlespeech_tts_front ##########
include_directories(src)
file(GLOB FRONT_SOURCES
./src/base/*.cpp
./src/front/*.cpp
)
add_library(paddlespeech_tts_front STATIC ${FRONT_SOURCES})
target_link_libraries(
paddlespeech_tts_front
PUBLIC
PkgConfig::GFLAGS
PkgConfig::GLOG
PkgConfig::ABSL
Threads::Threads
)
########## tts_front_demo ##########
if (WITH_FRONT_DEMO)
file(GLOB FRONT_DEMO_SOURCES front_demo/*.cpp)
add_executable(tts_front_demo ${FRONT_DEMO_SOURCES})
target_include_directories(tts_front_demo PRIVATE ./front_demo)
target_link_libraries(tts_front_demo PRIVATE paddlespeech_tts_front)
endif (WITH_FRONT_DEMO)

@ -0,0 +1,56 @@
# PaddleSpeech TTS CPP Frontend
A TTS frontend that implements text-to-phoneme conversion.
Currently it only supports Chinese; any English word will crash the demo.
## Install Build Tools
```bash
# Ubuntu
sudo apt install build-essential cmake pkg-config
# CentOS
sudo yum groupinstall "Development Tools"
sudo yum install cmake
```
If your CMake version is too old, you can download a precompiled newer version from https://cmake.org/download/
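A quick check, plus a sketch of installing a prebuilt release into your home directory (the version below is only an example; pick the current one from the download page):
```bash
cmake --version
# Example: install a prebuilt CMake (adjust version and architecture to your platform)
wget https://github.com/Kitware/CMake/releases/download/v3.26.4/cmake-3.26.4-linux-x86_64.tar.gz
tar -xzf cmake-3.26.4-linux-x86_64.tar.gz -C "$HOME"
export PATH="$HOME/cmake-3.26.4-linux-x86_64/bin:$PATH"
```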
## Build
```bash
# Build with all CPU cores
./build.sh
# Build with 1 core
./build.sh -j1
```
Dependent libraries will be automatically downloaded to the `third-party/build` folder.
If the download speed is too slow, you can open [third-party/CMakeLists.txt](third-party/CMakeLists.txt) and modify `GIT_REPOSITORY` URLs.
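As an alternative to editing the CMakeLists, Git itself can rewrite clone URLs; a sketch assuming a hypothetical mirror host:
```bash
# Redirect all GitHub clones (including those made during the CMake build) to a mirror
git config --global url."https://github.example-mirror.com/".insteadOf "https://github.com/"
```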
## Download dictionary files
```bash
./download.sh
```
## Run
You can change `--phone2id_path` in `./front_demo/front.conf` to the `phone_id_map.txt` of your own acoustic model.
```bash
./run_front_demo.sh
./run_front_demo.sh --help
./run_front_demo.sh --sentence "这是语音合成服务的文本前端,用于将文本转换为音素序号数组。"
./run_front_demo.sh --front_conf ./front_demo/front.conf --sentence "你还需要一个语音合成后端才能将其转换为实际的声音。"
```
## Clean
```bash
./clean.sh
```
The folders `front_demo/dict`, `build` and `third-party/build` will be deleted.

@ -0,0 +1,20 @@
#!/bin/bash
set -e
set -x
cd "$(dirname "$(realpath "$0")")"
cd ./third-party
mkdir -p build
cd build
cmake ..
if [ "$*" = "" ]; then
make -j$(nproc)
else
make "$@"
fi
echo "Done."

@ -0,0 +1,21 @@
#!/bin/bash
set -e
set -x
cd "$(dirname "$(realpath "$0")")"
echo "************* Download & Build Dependencies *************"
./build-depends.sh "$@"
echo "************* Build Front Lib and Demo *************"
mkdir -p ./build
cd ./build
cmake ..
if [ "$*" = "" ]; then
make -j$(nproc)
else
make "$@"
fi
echo "Done."

@ -0,0 +1,10 @@
#!/bin/bash
set -e
set -x
cd "$(dirname "$(realpath "$0")")"
rm -rf "./front_demo/dict"
rm -rf "./build"
rm -rf "./third-party/build"
echo "Done."

@ -0,0 +1,62 @@
#!/bin/bash
set -e
cd "$(dirname "$(realpath "$0")")"
download() {
file="$1"
url="$2"
md5="$3"
dir="$4"
cd "$dir"
if [ -f "$file" ] && [ "$(md5sum "$file" | awk '{ print $1 }')" = "$md5" ]; then
echo "File $file (MD5: $md5) has been downloaded."
else
echo "Downloading $file..."
wget -O "$file" "$url"
# verify MD5
fileMd5="$(md5sum "$file" | awk '{ print $1 }')"
if [ "$fileMd5" == "$md5" ]; then
echo "File $file (MD5: $md5) has been downloaded."
else
echo "MD5 mismatch, file may be corrupt"
echo "$file MD5: $fileMd5, it should be $md5"
fi
fi
echo "Extracting $file..."
echo '-----------------------'
tar -vxf "$file"
echo '======================='
}
########################################
DIST_DIR="$PWD/front_demo/dict"
mkdir -p "$DIST_DIR"
download 'fastspeech2_nosil_baker_ckpt_0.4.tar.gz' \
'https://paddlespeech.bj.bcebos.com/t2s/text_frontend/fastspeech2_nosil_baker_ckpt_0.4.tar.gz' \
'7bf1bab1737375fa123c413eb429c573' \
"$DIST_DIR"
download 'speedyspeech_nosil_baker_ckpt_0.5.tar.gz' \
'https://paddlespeech.bj.bcebos.com/t2s/text_frontend/speedyspeech_nosil_baker_ckpt_0.5.tar.gz' \
'0b7754b21f324789aef469c61f4d5b8f' \
"$DIST_DIR"
download 'jieba.tar.gz' \
'https://paddlespeech.bj.bcebos.com/t2s/text_frontend/jieba.tar.gz' \
'6d30f426bd8c0025110a483f051315ca' \
"$DIST_DIR"
download 'tranditional_to_simplified.tar.gz' \
'https://paddlespeech.bj.bcebos.com/t2s/text_frontend/tranditional_to_simplified.tar.gz' \
'258f5b59d5ebfe96d02007ca1d274a7f' \
"$DIST_DIR"
echo "Done."

@ -0,0 +1,21 @@
# jieba conf
--jieba_dict_path=./front_demo/dict/jieba/jieba.dict.utf8
--jieba_hmm_path=./front_demo/dict/jieba/hmm_model.utf8
--jieba_user_dict_path=./front_demo/dict/jieba/user.dict.utf8
--jieba_idf_path=./front_demo/dict/jieba/idf.utf8
--jieba_stop_word_path=./front_demo/dict/jieba/stop_words.utf8
# dict conf fastspeech2_0.4
--seperate_tone=false
--word2phone_path=./front_demo/dict/fastspeech2_nosil_baker_ckpt_0.4/word2phone_fs2.dict
--phone2id_path=./front_demo/dict/fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
--tone2id_path=./front_demo/dict/fastspeech2_nosil_baker_ckpt_0.4/word2phone_fs2.dict
# dict conf speedyspeech_0.5
#--seperate_tone=true
#--word2phone_path=./front_demo/dict/speedyspeech_nosil_baker_ckpt_0.5/word2phone.dict
#--phone2id_path=./front_demo/dict/speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt
#--tone2id_path=./front_demo/dict/speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt
# dict of tranditional_to_simplified
--trand2simpd_path=./front_demo/dict/tranditional_to_simplified/trand2simp.txt

@ -0,0 +1,79 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <gflags/gflags.h>
#include <glog/logging.h>
#include <map>
#include <string>
#include "front/front_interface.h"
DEFINE_string(sentence, "你好,欢迎使用语音合成服务", "Text to be synthesized");
DEFINE_string(front_conf, "./front_demo/front.conf", "Front conf file");
// DEFINE_string(seperate_tone, "true", "If true, get phoneids and tonesid");
int main(int argc, char** argv) {
gflags::ParseCommandLineFlags(&argc, &argv, true);
// Instantiate the text frontend engine
ppspeech::FrontEngineInterface* front_inst = nullptr;
front_inst = new ppspeech::FrontEngineInterface(FLAGS_front_conf);
if ((!front_inst) || (front_inst->init())) {
LOG(ERROR) << "Creater tts engine failed!";
if (front_inst != nullptr) {
delete front_inst;
}
front_inst = nullptr;
return -1;
}
std::wstring ws_sentence = ppspeech::utf8string2wstring(FLAGS_sentence);
// Convert traditional to simplified Chinese
std::wstring sentence_simp;
front_inst->Trand2Simp(ws_sentence, &sentence_simp);
ws_sentence = sentence_simp;
std::string s_sentence;
std::vector<std::wstring> sentence_part;
std::vector<int> phoneids = {};
std::vector<int> toneids = {};
// Split the sentence by punctuation
LOG(INFO) << "Start to segment sentences by punctuation";
front_inst->SplitByPunc(ws_sentence, &sentence_part);
LOG(INFO) << "Segment sentences through punctuation successfully";
// Get the phoneme IDs of each split sentence
LOG(INFO)
<< "Start to get the phoneme and tone id sequence of each sentence";
for (int i = 0; i < sentence_part.size(); i++) {
LOG(INFO) << "Raw sentence is: "
<< ppspeech::wstring2utf8string(sentence_part[i]);
front_inst->SentenceNormalize(&sentence_part[i]);
s_sentence = ppspeech::wstring2utf8string(sentence_part[i]);
LOG(INFO) << "After normalization sentence is: " << s_sentence;
if (0 != front_inst->GetSentenceIds(s_sentence, &phoneids, &toneids)) {
LOG(ERROR) << "TTS inst get sentence phoneids and toneids failed";
return -1;
}
}
LOG(INFO) << "The phoneids of the sentence is: "
<< limonp::Join(phoneids.begin(), phoneids.end(), " ");
LOG(INFO) << "The toneids of the sentence is: "
<< limonp::Join(toneids.begin(), toneids.end(), " ");
LOG(INFO) << "Get the phoneme id sequence of each sentence successfully";
return EXIT_SUCCESS;
}

@ -0,0 +1,111 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import configparser
from paddlespeech.t2s.frontend.zh_frontend import Frontend
def get_phone(frontend,
word,
merge_sentences=True,
print_info=False,
robot=False,
get_tone_ids=False):
phonemes = frontend.get_phonemes(word, merge_sentences, print_info, robot)
# Some optimizations
phones, tones = frontend._get_phone_tone(phonemes[0], get_tone_ids)
#print(type(phones), phones)
#print(type(tones), tones)
return phones, tones
def gen_word2phone_dict(frontend,
jieba_words_dict,
word2phone_dict,
get_tone=False):
with open(jieba_words_dict, "r") as f1, open(word2phone_dict, "w+") as f2:
for line in f1.readlines():
word = line.split(" ")[0]
phone, tone = get_phone(frontend, word, get_tone_ids=get_tone)
phone_str = ""
if tone:
assert (len(phone) == len(tone))
for i in range(len(tone)):
phone_tone = phone[i] + tone[i]
phone_str += (" " + phone_tone)
phone_str = phone_str.strip("sp0").strip(" ")
else:
for x in phone:
phone_str += (" " + x)
phone_str = phone_str.strip("sp").strip(" ")
print(phone_str)
f2.write(word + " " + phone_str + "\n")
print("Generate word2phone dict successfully.")
def main():
parser = argparse.ArgumentParser(description="Generate dictionary")
parser.add_argument(
"--config", type=str, default="./config.ini", help="config file.")
parser.add_argument(
"--am_type",
type=str,
default="fastspeech2",
help="fastspeech2 or speedyspeech")
args = parser.parse_args()
# Read config
cf = configparser.ConfigParser()
cf.read(args.config)
jieba_words_dict_file = cf.get("jieba",
"jieba_words_dict") # get words dict
am_type = args.am_type
if (am_type == "fastspeech2"):
phone2id_dict_file = cf.get(am_type, "phone2id_dict")
word2phone_dict_file = cf.get(am_type, "word2phone_dict")
frontend = Frontend(phone_vocab_path=phone2id_dict_file)
print("frontend done!")
gen_word2phone_dict(
frontend,
jieba_words_dict_file,
word2phone_dict_file,
get_tone=False)
elif (am_type == "speedyspeech"):
phone2id_dict_file = cf.get(am_type, "phone2id_dict")
tone2id_dict_file = cf.get(am_type, "tone2id_dict")
word2phone_dict_file = cf.get(am_type, "word2phone_dict")
frontend = Frontend(
phone_vocab_path=phone2id_dict_file,
tone_vocab_path=tone2id_dict_file)
print("frontend done!")
gen_word2phone_dict(
frontend,
jieba_words_dict_file,
word2phone_dict_file,
get_tone=True)
else:
print("Please set correct am type, fastspeech2 or speedyspeech.")
if __name__ == "__main__":
main()

@ -0,0 +1,35 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
PHONESFILE = "./dict/phones.txt"
PHONES_ID_FILE = "./dict/phonesid.dict"
TONESFILE = "./dict/tones.txt"
TONES_ID_FILE = "./dict/tonesid.dict"
def GenIdFile(file, idfile):
id = 2
with open(file, 'r') as f1, open(idfile, "w+") as f2:
f2.write("<pad> 0\n")
f2.write("<unk> 1\n")
for line in f1.readlines():
phone = line.strip()
print(phone + " " + str(id) + "\n")
f2.write(phone + " " + str(id) + "\n")
id += 1
if __name__ == "__main__":
GenIdFile(PHONESFILE, PHONES_ID_FILE)
GenIdFile(TONESFILE, TONES_ID_FILE)

@ -0,0 +1,55 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
from pypinyin import lazy_pinyin
from pypinyin import Style
worddict = "./dict/jieba_part.dict.utf8"
newdict = "./dict/word_phones.dict"
def GenPhones(initials, finals, seperate=True):
phones = []
for c, v in zip(initials, finals):
if re.match(r'i\d', v):
if c in ['z', 'c', 's']:
v = re.sub('i', 'ii', v)
elif c in ['zh', 'ch', 'sh', 'r']:
v = re.sub('i', 'iii', v)
if c:
if seperate is True:
phones.append(c + '0')
elif seperate is False:
phones.append(c)
else:
print("Not sure whether phone and tone need to be separated")
if v:
phones.append(v)
return phones
with open(worddict, "r") as f1, open(newdict, "w+") as f2:
for line in f1.readlines():
word = line.split(" ")[0]
initials = lazy_pinyin(
word, neutral_tone_with_five=True, style=Style.INITIALS)
finals = lazy_pinyin(
word, neutral_tone_with_five=True, style=Style.FINALS_TONE3)
phones = GenPhones(initials, finals, True)
temp = " ".join(phones)
f2.write(word + " " + temp + "\n")

@ -0,0 +1,7 @@
#!/bin/bash
set -e
set -x
cd "$(dirname "$(realpath "$0")")"
./build/tts_front_demo "$@"

@ -0,0 +1,28 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "base/type_conv.h"
namespace ppspeech {
// wstring to string
std::string wstring2utf8string(const std::wstring& str) {
static std::wstring_convert<std::codecvt_utf8<wchar_t>> strCnv;
return strCnv.to_bytes(str);
}
// string to wstring
std::wstring utf8string2wstring(const std::string& str) {
static std::wstring_convert<std::codecvt_utf8<wchar_t>> strCnv;
return strCnv.from_bytes(str);
}
} // namespace ppspeech

@ -0,0 +1,31 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef BASE_TYPE_CONVC_H
#define BASE_TYPE_CONVC_H
#include <codecvt>
#include <locale>
#include <string>
namespace ppspeech {
// wstring to string
std::string wstring2utf8string(const std::wstring& str);
// string to wstring
std::wstring utf8string2wstring(const std::string& str);
}  // namespace ppspeech
#endif // BASE_TYPE_CONVC_H

File diff suppressed because it is too large

@ -0,0 +1,198 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef PADDLE_TTS_SERVING_FRONT_FRONT_INTERFACE_H
#define PADDLE_TTS_SERVING_FRONT_FRONT_INTERFACE_H
#include <glog/logging.h>
#include <fstream>
#include <map>
#include <memory>
#include <string>
//#include "utils/dir_utils.h"
#include <cppjieba/Jieba.hpp>
#include "absl/strings/str_split.h"
#include "front/text_normalize.h"
namespace ppspeech {
class FrontEngineInterface : public TextNormalizer {
public:
explicit FrontEngineInterface(std::string conf) : _conf_file(conf) {
TextNormalizer();
_jieba = nullptr;
_initialed = false;
init();
}
int init();
~FrontEngineInterface() {}
// Read the configuration file
int ReadConfFile();
// Convert traditional to simplified Chinese
int Trand2Simp(const std::wstring &sentence, std::wstring *sentence_simp);
// Build a dictionary map from a file
int GenDict(const std::string &file,
std::map<std::string, std::string> *map);
// Reduce word + POS segmentation results to words only
int GetSegResult(std::vector<std::pair<std::string, std::string>> *seg,
std::vector<std::string> *seg_words);
// Generate the phoneme and tone IDs of a sentence. If phonemes and tones are not
// separated, toneids is empty (fastspeech2); otherwise it is non-empty (speedyspeech)
int GetSentenceIds(const std::string &sentence,
std::vector<int> *phoneids,
std::vector<int> *toneids);
// Get the phoneme and tone IDs of each word from the segmentation result and
// modify pronunciations where appropriate (ModifyTone). If phonemes and tones are
// not separated, toneids is empty (fastspeech2); otherwise it is non-empty (speedyspeech)
int GetWordsIds(
const std::vector<std::pair<std::string, std::string>> &cut_result,
std::vector<int> *phoneids,
std::vector<int> *toneids);
// Segment with jieba into word + POS pairs, then post-process the result
// as appropriate (MergeforModify)
int Cut(const std::string &sentence,
std::vector<std::pair<std::string, std::string>> *cut_result);
// Map a word to its phonemes via dictionary lookup
int GetPhone(const std::string &word, std::string *phone);
// Map phonemes to phoneme IDs
int Phone2Phoneid(const std::string &phone,
std::vector<int> *phoneid,
std::vector<int> *toneids);
// Judge from the finals whether every character in the word is third tone; true means all are
bool AllToneThree(const std::vector<std::string> &finals);
// Whether the word is a reduplication
bool IsReduplication(const std::string &word);
// Get the initials and finals of each character in the word
int GetInitialsFinals(const std::string &word,
std::vector<std::string> *word_initials,
std::vector<std::string> *word_finals);
// Get the finals of each character in the word
int GetFinals(const std::string &word,
std::vector<std::string> *word_finals);
// Convert the whole word into a vector, one element per character
int Word2WordVec(const std::string &word,
std::vector<std::wstring> *wordvec);
// Re-segment the whole word with a full cut; the resulting sub-words will all be in the dictionary
int SplitWord(const std::string &word,
std::vector<std::string> *fullcut_word);
// Post-process segmentation: tidy up results containing "不"
std::vector<std::pair<std::string, std::string>> MergeBu(
std::vector<std::pair<std::string, std::string>> *seg_result);
// Post-process segmentation: tidy up results containing "一"
std::vector<std::pair<std::string, std::string>> Mergeyi(
std::vector<std::pair<std::string, std::string>> *seg_result);
// Post-process segmentation: merge two adjacent identical characters
std::vector<std::pair<std::string, std::string>> MergeReduplication(
std::vector<std::pair<std::string, std::string>> *seg_result);
// Merge two consecutive words whose characters are all third tone
std::vector<std::pair<std::string, std::string>> MergeThreeTones(
std::vector<std::pair<std::string, std::string>> *seg_result);
// Merge two words where the last tone of the first and the first tone of the second are both third tone
std::vector<std::pair<std::string, std::string>> MergeThreeTones2(
std::vector<std::pair<std::string, std::string>> *seg_result);
// Post-process segmentation: tidy up results containing "儿"
std::vector<std::pair<std::string, std::string>> MergeEr(
std::vector<std::pair<std::string, std::string>> *seg_result);
// Post-process and modify the segmentation result
int MergeforModify(
std::vector<std::pair<std::string, std::string>> *seg_result,
std::vector<std::pair<std::string, std::string>> *merge_seg_result);
// Modify the tones of words containing "不"
int BuSandi(const std::string &word, std::vector<std::string> *finals);
// Modify the tones of words containing "一"
int YiSandhi(const std::string &word, std::vector<std::string> *finals);
// Modify the tones of special words (measure words, particles, etc.)
int NeuralSandhi(const std::string &word,
const std::string &pos,
std::vector<std::string> *finals);
// Modify the tones of words containing the third tone
int ThreeSandhi(const std::string &word, std::vector<std::string> *finals);
// Process and modify the tones of a word
int ModifyTone(const std::string &word,
const std::string &pos,
std::vector<std::string> *finals);
// Handle erhua (rhotacization)
std::vector<std::vector<std::string>> MergeErhua(
const std::vector<std::string> &initials,
const std::vector<std::string> &finals,
const std::string &word,
const std::string &pos);
private:
bool _initialed;
cppjieba::Jieba *_jieba;
std::vector<std::string> _punc;
std::vector<std::string> _punc_omit;
std::string _conf_file;
std::map<std::string, std::string> conf_map;
std::map<std::string, std::string> word_phone_map;
std::map<std::string, std::string> phone_id_map;
std::map<std::string, std::string> tone_id_map;
std::map<std::string, std::string> trand_simp_map;
std::string _jieba_dict_path;
std::string _jieba_hmm_path;
std::string _jieba_user_dict_path;
std::string _jieba_idf_path;
std::string _jieba_stop_word_path;
std::string _seperate_tone;
std::string _word2phone_path;
std::string _phone2id_path;
std::string _tone2id_path;
std::string _trand2simp_path;
std::vector<std::string> must_erhua;
std::vector<std::string> not_erhua;
std::vector<std::string> must_not_neural_tone_words;
std::vector<std::string> must_neural_tone_words;
};
} // namespace ppspeech
#endif
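// ---------------------------------------------------------------------------
// Illustration (not part of the header): the Merge* passes above all share one
// shape: scan the (word, POS) pairs from jieba and fuse adjacent entries that
// satisfy a predicate. A minimal, hypothetical sketch of that pattern:
//
//     using SegResult = std::vector<std::pair<std::string, std::string>>;
//
//     template <typename Pred>
//     SegResult MergeAdjacent(const SegResult &seg, Pred should_merge) {
//         SegResult out;
//         for (const auto &item : seg) {
//             if (!out.empty() && should_merge(out.back(), item)) {
//                 out.back().first += item.first;  // fuse the surface strings
//             } else {
//                 out.push_back(item);
//             }
//         }
//         return out;
//     }
//
// MergeReduplication, for example, would fire the predicate when the previous
// and current words are identical (e.g. 看 + 看).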

@ -0,0 +1,542 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "front/text_normalize.h"
namespace ppspeech {
// Initialize digits_map and units_map
int TextNormalizer::InitMap() {
digits_map["0"] = "零";
digits_map["1"] = "一";
digits_map["2"] = "二";
digits_map["3"] = "三";
digits_map["4"] = "四";
digits_map["5"] = "五";
digits_map["6"] = "六";
digits_map["7"] = "七";
digits_map["8"] = "八";
digits_map["9"] = "九";
units_map[1] = "十";
units_map[2] = "百";
units_map[3] = "千";
units_map[4] = "万";
units_map[8] = "亿";
return 0;
}
// Replace a span of the sentence with a new string
int TextNormalizer::Replace(std::wstring *sentence,
const int &pos,
const int &len,
const std::wstring &repstr) {
// erase the original span
sentence->erase(pos, len);
// insert the replacement
sentence->insert(pos, repstr);
return 0;
}
// Split the sentence at punctuation marks
int TextNormalizer::SplitByPunc(const std::wstring &sentence,
std::vector<std::wstring> *sentence_part) {
std::wstring temp = sentence;
std::wregex reg(L"[:,;。?!,;?!]");
std::wsmatch match;
while (std::regex_search(temp, match, reg)) {
sentence_part->push_back(
temp.substr(0, match.position(0) + match.length(0)));
Replace(&temp, 0, match.position(0) + match.length(0), L"");
}
// trailing text with no closing punctuation
if (temp != L"") {
sentence_part->push_back(temp);
}
return 0;
}
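// Example (illustrative): L"今天很热,我们去游泳吧!" is split into
// L"今天很热," and L"我们去游泳吧!"; each piece keeps its trailing punctuation.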
// Number to text, e.g. 10200 -> 一万零二百
std::string TextNormalizer::CreateTextValue(const std::string &num_str,
bool use_zero) {
std::string num_lstrip =
std::string(absl::StripPrefix(num_str, "0")).data();
int len = num_lstrip.length();
if (len == 0) {
return "";
} else if (len == 1) {
if (use_zero && (len < num_str.length())) {
return digits_map["0"] + digits_map[num_lstrip];
} else {
return digits_map[num_lstrip];
}
} else {
int largest_unit = 0;  // largest place-value unit
std::string first_part;
std::string second_part;
if (len > 1 && len <= 2) {
largest_unit = 1;
} else if (len > 2 && len <= 3) {
largest_unit = 2;
} else if (len > 3 && len <= 4) {
largest_unit = 3;
} else if (len > 4 && len <= 8) {
largest_unit = 4;
} else if (len > 8) {
largest_unit = 8;
}
first_part = num_str.substr(0, num_str.length() - largest_unit);
second_part = num_str.substr(num_str.length() - largest_unit);
return CreateTextValue(first_part, use_zero) + units_map[largest_unit] +
CreateTextValue(second_part, use_zero);
}
}
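// Worked example of the recursion above (illustrative comment only):
//   CreateTextValue("10200") -> largest_unit = 4 (万),
//   first_part = "1", second_part = "0200"
//   => CreateTextValue("1") + 万 + CreateTextValue("0200")
//   => 一 + 万 + 零二百   (the leading zero in "0200" prepends 零)
//   => 一万零二百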
// Read the digits one by one; usable directly for years and phone numbers
std::string TextNormalizer::SingleDigit2Text(const std::string &num_str,
bool alt_one) {
std::string text = "";
if (alt_one) {
digits_map["1"] = "幺";
} else {
digits_map["1"] = "一";
}
for (size_t i = 0; i < num_str.size(); i++) {
std::string num_int(1, num_str[i]);
if (digits_map.find(num_int) == digits_map.end()) {
LOG(ERROR) << "digits_map doesn't have key: " << num_int;
}
text += digits_map[num_int];
}
return text;
}
std::string TextNormalizer::SingleDigit2Text(const std::wstring &num,
bool alt_one) {
std::string num_str = wstring2utf8string(num);
return SingleDigit2Text(num_str, alt_one);
}
// Read the number as a whole; usable for months, dates, and the integer part of a value
std::string TextNormalizer::MultiDigit2Text(const std::string &num_str,
bool alt_one,
bool use_zero) {
LOG(INFO) << "aaaaaaaaaaaaaaaa: " << alt_one << use_zero;
if (alt_one) {
digits_map["1"] = "";
} else {
digits_map["1"] = "";
}
std::wstring result =
utf8string2wstring(CreateTextValue(num_str, use_zero));
std::wstring result_0(1, result[0]);
std::wstring result_1(1, result[1]);
// drop the leading 一 before 十: 一十八 --> 十八
if ((result_0 == utf8string2wstring(digits_map["1"])) &&
(result_1 == utf8string2wstring(units_map[1]))) {
return wstring2utf8string(result.substr(1, result.length()));
} else {
return wstring2utf8string(result);
}
}
std::string TextNormalizer::MultiDigit2Text(const std::wstring &num,
bool alt_one,
bool use_zero) {
std::string num_str = wstring2utf8string(num);
return MultiDigit2Text(num_str, alt_one, use_zero);
}
// Number to text, covering both integers and decimals
std::string TextNormalizer::Digits2Text(const std::string &num_str) {
std::string text;
std::vector<std::string> integer_decimal;
integer_decimal = absl::StrSplit(num_str, ".");
if (integer_decimal.size() == 1) {  // integer
text = MultiDigit2Text(integer_decimal[0]);
} else if (integer_decimal.size() == 2) {  // decimal
if (integer_decimal[0] == "") {  // no integer part, e.g. .22
text = "点" +
SingleDigit2Text(
std::string(absl::StripSuffix(integer_decimal[1], "0"))
.data());
} else {  // ordinary decimal, e.g. 12.34
text = MultiDigit2Text(integer_decimal[0]) + "点" +
SingleDigit2Text(
std::string(absl::StripSuffix(integer_decimal[1], "0"))
.data());
}
} else {
return "The value does not conform to the numeric format";
}
return text;
}
std::string TextNormalizer::Digits2Text(const std::wstring &num) {
std::string num_str = wstring2utf8string(num);
return Digits2Text(num_str);
}
// Date, e.g. 2021年8月18日 --> 二零二一年八月十八日
int TextNormalizer::ReData(std::wstring *sentence) {
std::wregex reg(
L"(\\d{4}|\\d{2})年((0?[1-9]|1[0-2])月)?(((0?[1-9])|((1|2)[0-9])|30|31)"
L"([日号]))?");
std::wsmatch match;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
rep = "";
rep += SingleDigit2Text(match[1]) + "年";
if (match[3] != L"") {
rep += MultiDigit2Text(match[3], false, false) + "月";
}
if (match[5] != L"") {
rep += MultiDigit2Text(match[5], false, false) +
wstring2utf8string(match[9]);
}
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// XX-XX-XX or XX/XX/XX, e.g. 2021/08/18 --> 二零二一年八月十八日
int TextNormalizer::ReData2(std::wstring *sentence) {
std::wregex reg(
L"(\\d{4})([- /.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])");
std::wsmatch match;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
rep = "";
rep += (SingleDigit2Text(match[1]) + "年");
rep += (MultiDigit2Text(match[3], false, false) + "月");
rep += (MultiDigit2Text(match[4], false, false) + "日");
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// XX:XX:XX, e.g. 09:09:02 --> 九点零九分零二秒
int TextNormalizer::ReTime(std::wstring *sentence) {
std::wregex reg(L"([0-1]?[0-9]|2[0-3]):([0-5][0-9])(:([0-5][0-9]))?");
std::wsmatch match;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
rep = "";
rep += (MultiDigit2Text(match[1], false, false) + "点");
if (absl::StartsWith(wstring2utf8string(match[2]), "0")) {
rep += "零";
}
rep += (MultiDigit2Text(match[2]) + "分");
if (match[4] != L"") {  // the seconds group is optional
if (absl::StartsWith(wstring2utf8string(match[4]), "0")) {
rep += "零";
}
rep += (MultiDigit2Text(match[4]) + "秒");
}
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Temperature, e.g. -24.3℃ --> 零下二十四点三度
int TextNormalizer::ReTemperature(std::wstring *sentence) {
std::wregex reg(L"(-?)(\\d+(\\.\\d+)?)(°C|℃|度|摄氏度)");
std::wsmatch match;
std::string rep;
std::string sign;
std::string unit;
while (std::regex_search(*sentence, match, reg)) {
match[1] == L"-" ? sign = "零下" : sign = "";
match[4] == L"摄氏度" ? unit = "摄氏度" : unit = "度";
rep = sign + Digits2Text(match[2]) + unit;
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Fraction, e.g. 1/3 --> 三分之一
int TextNormalizer::ReFrac(std::wstring *sentence) {
std::wregex reg(L"(-?)(\\d+)/(\\d+)");
std::wsmatch match;
std::string sign;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
match[1] == L"-" ? sign = "负" : sign = "";
rep = sign + MultiDigit2Text(match[3]) + "分之" +
MultiDigit2Text(match[2]);
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Percentage, e.g. 45.5% --> 百分之四十五点五
int TextNormalizer::RePercentage(std::wstring *sentence) {
std::wregex reg(L"(-?)(\\d+(\\.\\d+)?)%");
std::wsmatch match;
std::string sign;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
match[1] == L"-" ? sign = "负" : sign = "";
rep = sign + "百分之" + Digits2Text(match[2]);
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Mobile phone number, e.g. +86 18883862235 --> 八六幺八八八三八六二二三五
int TextNormalizer::ReMobilePhone(std::wstring *sentence) {
std::wregex reg(
L"(\\d)?((\\+?86 ?)?1([38]\\d|5[0-35-9]|7[678]|9[89])\\d{8})(\\d)?");
std::wsmatch match;
std::string rep;
std::vector<std::string> country_phonenum;
while (std::regex_search(*sentence, match, reg)) {
country_phonenum = absl::StrSplit(wstring2utf8string(match[0]), "+");
rep = "";
for (int i = 0; i < country_phonenum.size(); i++) {
LOG(INFO) << country_phonenum[i];
rep += SingleDigit2Text(country_phonenum[i], true);
}
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Landline number, e.g. 010-51093154 --> 零幺零五幺零九三幺五四
int TextNormalizer::RePhone(std::wstring *sentence) {
std::wregex reg(
L"(\\d)?((0(10|2[1-3]|[3-9]\\d{2})-?)?[1-9]\\d{6,7})(\\d)?");
std::wsmatch match;
std::vector<std::string> zone_phonenum;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
rep = "";
zone_phonenum = absl::StrSplit(wstring2utf8string(match[0]), "-");
for (int i = 0; i < zone_phonenum.size(); i++) {
rep += SingleDigit2Text(zone_phonenum[i], true);
}
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Range, e.g. 60~90 --> 六十到九十
int TextNormalizer::ReRange(std::wstring *sentence) {
std::wregex reg(
L"((-?)((\\d+)(\\.\\d+)?)|(\\.(\\d+)))[-~]((-?)((\\d+)(\\.\\d+)?)|(\\.("
L"\\d+)))");
std::wsmatch match;
std::string rep;
std::string sign1;
std::string sign2;
while (std::regex_search(*sentence, match, reg)) {
rep = "";
match[2] == L"-" ? sign1 = "负" : sign1 = "";
if (match[6] != L"") {
rep += sign1 + Digits2Text(match[6]) + "到";
} else {
rep += sign1 + Digits2Text(match[3]) + "到";
}
match[9] == L"-" ? sign2 = "负" : sign2 = "";
if (match[13] != L"") {
rep += sign2 + Digits2Text(match[13]);
} else {
rep += sign2 + Digits2Text(match[10]);
}
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Integer with a negative sign, e.g. -10 --> 负十
int TextNormalizer::ReInterger(std::wstring *sentence) {
std::wregex reg(L"(-)(\\d+)");
std::wsmatch match;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
rep = "负" + MultiDigit2Text(match[2]);
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Pure decimal numbers
int TextNormalizer::ReDecimalNum(std::wstring *sentence) {
std::wregex reg(L"(-?)((\\d+)(\\.\\d+))|(\\.(\\d+))");
std::wsmatch match;
std::string sign;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
match[1] == L"-" ? sign = "负" : sign = "";
if (match[5] != L"") {
rep = sign + Digits2Text(match[5]);
} else {
rep = sign + Digits2Text(match[2]);
}
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Positive integer + measure word
int TextNormalizer::RePositiveQuantifiers(std::wstring *sentence) {
std::wstring common_quantifiers =
L"(朵|匹|张|座|回|场|尾|条|个|首|阙|阵|网|炮|顶|丘|棵|只|支|袭|辆|挑|"
L"担|颗|壳|窠|曲|墙|群|腔|砣|座|客|贯|扎|捆|刀|令|打|手|罗|坡|山|岭|江|"
L"溪|钟|队|单|双|对|出|口|头|脚|板|跳|枝|件|贴|针|线|管|名|位|身|堂|课|"
L"本|页|家|户|层|丝|毫|厘|分|钱|两|斤|担|铢|石|钧|锱|忽|(千|毫|微)克|"
L"毫|厘|(公)分|分|寸|尺|丈|里|寻|常|铺|程|(千|分|厘|毫|微)米|米|撮|勺|"
L"合|升|斗|石|盘|碗|碟|叠|桶|笼|盆|盒|杯|钟|斛|锅|簋|篮|盘|桶|罐|瓶|壶|"
L"卮|盏|箩|箱|煲|啖|袋|钵|年|月|日|季|刻|时|周|天|秒|分|旬|纪|岁|世|更|"
L"夜|春|夏|秋|冬|代|伏|辈|丸|泡|粒|颗|幢|堆|条|根|支|道|面|片|张|颗|块|"
L"元|(亿|千万|百万|万|千|百)|(亿|千万|百万|万|千|百|美|)元|(亿|千万|"
L"百万|万|千|百|)块|角|毛|分)";
std::wregex reg(L"(\\d+)([多余几])?" + common_quantifiers);
std::wsmatch match;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
rep = MultiDigit2Text(match[1]);
Replace(sentence,
match.position(1),
match.length(1),
utf8string2wstring(rep));
}
return 0;
}
// ID-style digit strings, e.g. 89757 --> 八九七五七
int TextNormalizer::ReDefalutNum(std::wstring *sentence) {
std::wregex reg(L"\\d{3}\\d*");
std::wsmatch match;
while (std::regex_search(*sentence, match, reg)) {
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(SingleDigit2Text(match[0])));
}
return 0;
}
int TextNormalizer::ReNumber(std::wstring *sentence) {
std::wregex reg(L"(-?)((\\d+)(\\.\\d+)?)|(\\.(\\d+))");
std::wsmatch match;
std::string sign;
std::string rep;
while (std::regex_search(*sentence, match, reg)) {
match[1] == L"-" ? sign = "" : sign = "";
if (match[5] != L"") {
rep = sign + Digits2Text(match[5]);
} else {
rep = sign + Digits2Text(match[2]);
}
Replace(sentence,
match.position(0),
match.length(0),
utf8string2wstring(rep));
}
return 0;
}
// Apply all the regex rules above, in order
int TextNormalizer::SentenceNormalize(std::wstring *sentence) {
ReData(sentence);
ReData2(sentence);
ReTime(sentence);
ReTemperature(sentence);
ReFrac(sentence);
RePercentage(sentence);
ReMobilePhone(sentence);
RePhone(sentence);
ReRange(sentence);
ReInterger(sentence);
ReDecimalNum(sentence);
RePositiveQuantifiers(sentence);
ReDefalutNum(sentence);
ReNumber(sentence);
return 0;
}
} // namespace ppspeech
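// Usage sketch (illustrative, not part of this file), assuming a UTF-8 locale
// and the declarations from front/text_normalize.h:
//
//     ppspeech::TextNormalizer tn;
//     std::wstring s = utf8string2wstring("会议时间2021/08/18 09:30,气温-24.3℃");
//     tn.SentenceNormalize(&s);   // dates, times, numbers become Chinese words
//     std::vector<std::wstring> parts;
//     tn.SplitByPunc(s, &parts);  // then split on punctuation for the frontend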

@ -0,0 +1,77 @@
// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef PADDLE_TTS_SERVING_FRONT_TEXT_NORMALIZE_H
#define PADDLE_TTS_SERVING_FRONT_TEXT_NORMALIZE_H
#include <glog/logging.h>
#include <codecvt>
#include <map>
#include <regex>
#include <string>
#include "absl/strings/str_split.h"
#include "absl/strings/strip.h"
#include "base/type_conv.h"
namespace ppspeech {
class TextNormalizer {
public:
TextNormalizer() { InitMap(); }
~TextNormalizer() {}
int InitMap();
int Replace(std::wstring *sentence,
const int &pos,
const int &len,
const std::wstring &repstr);
int SplitByPunc(const std::wstring &sentence,
std::vector<std::wstring> *sentence_part);
std::string CreateTextValue(const std::string &num, bool use_zero = true);
std::string SingleDigit2Text(const std::string &num_str,
bool alt_one = false);
std::string SingleDigit2Text(const std::wstring &num, bool alt_one = false);
std::string MultiDigit2Text(const std::string &num_str,
bool alt_one = false,
bool use_zero = true);
std::string MultiDigit2Text(const std::wstring &num,
bool alt_one = false,
bool use_zero = true);
std::string Digits2Text(const std::string &num_str);
std::string Digits2Text(const std::wstring &num);
int ReData(std::wstring *sentence);
int ReData2(std::wstring *sentence);
int ReTime(std::wstring *sentence);
int ReTemperature(std::wstring *sentence);
int ReFrac(std::wstring *sentence);
int RePercentage(std::wstring *sentence);
int ReMobilePhone(std::wstring *sentence);
int RePhone(std::wstring *sentence);
int ReRange(std::wstring *sentence);
int ReInterger(std::wstring *sentence);
int ReDecimalNum(std::wstring *sentence);
int RePositiveQuantifiers(std::wstring *sentence);
int ReDefalutNum(std::wstring *sentence);
int ReNumber(std::wstring *sentence);
int SentenceNormalize(std::wstring *sentence);
private:
std::map<std::string, std::string> digits_map;
std::map<int, std::string> units_map;
};
} // namespace ppspeech
#endif

@ -0,0 +1,64 @@
cmake_minimum_required(VERSION 3.10)
project(tts_third_party_libs)
include(ExternalProject)
# gflags
ExternalProject_Add(gflags
GIT_REPOSITORY https://github.com/gflags/gflags.git
GIT_TAG v2.2.2
PREFIX ${CMAKE_CURRENT_BINARY_DIR}
INSTALL_DIR ${CMAKE_CURRENT_BINARY_DIR}
CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>
-DCMAKE_POSITION_INDEPENDENT_CODE=ON
-DBUILD_STATIC_LIBS=OFF
-DBUILD_SHARED_LIBS=ON
)
# glog
ExternalProject_Add(
glog
GIT_REPOSITORY https://github.com/google/glog.git
GIT_TAG v0.6.0
PREFIX ${CMAKE_CURRENT_BINARY_DIR}
INSTALL_DIR ${CMAKE_CURRENT_BINARY_DIR}
CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>
-DCMAKE_POSITION_INDEPENDENT_CODE=ON
DEPENDS gflags
)
# abseil
ExternalProject_Add(
abseil
GIT_REPOSITORY https://github.com/abseil/abseil-cpp.git
GIT_TAG 20230125.1
PREFIX ${CMAKE_CURRENT_BINARY_DIR}
INSTALL_DIR ${CMAKE_CURRENT_BINARY_DIR}
CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>
-DCMAKE_POSITION_INDEPENDENT_CODE=ON
-DABSL_PROPAGATE_CXX_STD=ON
)
# cppjieba (header-only)
ExternalProject_Add(
cppjieba
GIT_REPOSITORY https://github.com/yanyiwu/cppjieba.git
GIT_TAG v5.0.3
PREFIX ${CMAKE_CURRENT_BINARY_DIR}
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
INSTALL_COMMAND ""
TEST_COMMAND ""
)
# limonp (header-only)
ExternalProject_Add(
limonp
GIT_REPOSITORY https://github.com/yanyiwu/limonp.git
GIT_TAG v0.6.6
PREFIX ${CMAKE_CURRENT_BINARY_DIR}
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
INSTALL_COMMAND ""
TEST_COMMAND ""
)

@ -9,7 +9,7 @@ This demo is an implementation of starting the streaming speech service and acce
The streaming ASR server only supports the `websocket` protocol; the `http` protocol is not supported.
服务接口定义请参考:
For service interface definitions, please refer to:
- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)
## Usage
@ -23,7 +23,7 @@ You can choose one way from easy, medium and hard to install paddlespeech.
**If you install in easy mode, you need to prepare the yaml file by yourself, you can refer to
### 2. Prepare config File
The configuration file can be found in `conf/ws_application.yaml` `conf/ws_conformer_wenetspeech_application.yaml`.
The configuration file can be found in `conf/ws_application.yaml` or `conf/ws_conformer_wenetspeech_application.yaml`.
At present, the speech tasks integrated by the model include: DeepSpeech2 and conformer.
@ -87,7 +87,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
server_executor = ServerExecutor()
server_executor(
config_file="./conf/ws_conformer_wenetspeech_application.yaml",
config_file="./conf/ws_conformer_wenetspeech_application_faster.yaml",
log_file="./log/paddlespeech.log")
```

@ -90,7 +90,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
server_executor = ServerExecutor()
server_executor(
config_file="./conf/ws_conformer_wenetspeech_application",
config_file="./conf/ws_conformer_wenetspeech_application_faster.yaml",
log_file="./log/paddlespeech.log")
```

Binary file not shown.

@ -38,8 +38,8 @@ sphinx-markdown-tables
sphinx_rtd_theme
textgrid
timer
ToJyutping
typeguard
ToJyutping==0.2.1
typeguard==2.13.3
webrtcvad
websockets
yacs~=0.1.8

@ -25,7 +25,7 @@ Model | Pre-Train Method | Pre-Train Data | Finetune Data | Size | Descriptions
[Wav2vec2-large-960h-lv60-self Model](https://paddlespeech.bj.bcebos.com/wav2vec/wav2vec2-large-960h-lv60-self.pdparams) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | - | 1.18 GB |Pre-trained Wav2vec2.0 Model | - | - | - |
[Wav2vec2ASR-large-960h-librispeech Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.1.model.tar.gz) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | Librispeech (960 h) | 718 MB |Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) |
[Wav2vec2-large-wenetspeech-self Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2-large-wenetspeech-self_ckpt_1.3.0.model.tar.gz) | wav2vec2 | Wenetspeech Dataset (1w h) | - | 714 MB |Pre-trained Wav2vec2.0 Model | - | - | - |
[Wav2vec2ASR-large-aishell1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz) | wav2vec2 | Wenetspeech Dataset (1w h) | aishell1 (train set) | 1.17 GB |Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | 0.0453 | - | - |
[Wav2vec2ASR-large-aishell1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.4.0.model.tar.gz) | wav2vec2 | Wenetspeech Dataset (1w h) | aishell1 (train set) | 1.18 GB |Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | 0.0510 | - | - |
### Whisper Model
Demo Link | Training Data | Size | Descriptions | CER | Model

@ -0,0 +1,183 @@
The author is not a music professional; corrections to any errors in this document are welcome.
# 1. Basic Concepts
## 1.1 Numbered Musical Notation and Note Names
<p align="left">
<img src="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/seven.png" width="300"/>
</p>
In the figure above, the black keys from left to right are: C#/Db, D#/Eb, F#/Gb, G#/Ab, A#/Bb.
The 88 piano keys, shown below, are grouped into the contra octave, great octave, small octave, and the one-line through four-line octaves, corresponding to the note-name suffixes 1 2 3 4 5 6 7. For example, the one-line octave (in C major) contains the keys C4, C#4/Db4, D4, D#4/Eb4, E4, F4, F#4/Gb4, G4, G#4/Ab4, A4, A#4/Bb4, B4.
A piano octave is the eight notes 1 2 3 4 5 6 7 (1), the last being 1 an octave higher. **Following the interval pattern whole-whole-half-whole-whole-whole-half** yields the notes 1 2 3 4 5 6 7 (high) 1.
<p align="left">
<img src="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/piano_88.png" />
</p>
## 1.2 The Twelve Major Keys
"#" denotes a sharp (a key raised by a semitone):
<p align="left">
<img src="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/up.png" />
</p>
"b" denotes a flat (a key lowered by a semitone):
<p align="left">
<img src="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/down.png" />
</p>
The name of the major key tells you which piano key the note Do (1 in numbered notation) starts on; in D major, for example, the key D plays Do.
The table below maps numbered notation to note names in each of the twelve major keys.
<p align="left">
<img src="../../../docs/images/note_map.png" />
</p>
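As a concrete reading of that table, here is a small sketch (illustrative only; it hard-codes E-flat major starting in the small octave, the key used by the example score in Section 2.1):
```cpp
#include <array>
#include <iostream>
#include <string>

// Scale degrees 1-7 of E-flat major starting in the small octave (octave 3).
int main() {
    const std::array<std::string, 7> eb_major = {
        "D#3/Eb3", "F3", "G3", "G#3/Ab3", "A#3/Bb3", "C4", "D4"};
    for (int degree = 1; degree <= 7; ++degree) {
        std::cout << degree << " -> " << eb_major[degree - 1] << "\n";
    }
    return 0;
}
```
A trailing dot in the numbered notation (e.g. `1.`) moves the note up one octave, so `1.` in this key is D#4/Eb4.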
## 1.3 Tempo
Tempo gives the speed of the beat/pulse, measured in beats per minute (BPM).
<p align="left">
<img src="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/note_beat.png" width="450"/>
</p>
whole note --> 4 beats</br>
half note --> 2 beats</br>
quarter note --> 1 beat</br>
eighth note --> 1/2 beat</br>
sixteenth note --> 1/4 beat</br>
# 2. Putting It into Practice
## 2.1 Extracting the music score from sheet music
A music score consists of: note, note_dur, is_slur.
<p align="left">
<img src="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/pu.png" width="600"/>
</p>
The key signature *bE* in the upper-left corner tells us the piece is in **E-flat major**; with the twelve-major-keys table from Section 1.2, each numbered-notation digit can be mapped to its note.
From the *quarter note* tempo information in the upper-left corner, the speed is **95 beats per minute**, so one beat lasts **60/95 = 0.631578 s**.
The *4/4* time signature means a quarter note gets one beat (the denominator 4) and each measure holds 4 beats (the numerator 4).
The music score extracted from this sheet is as follows:
|text |phone |numbered notation (aux.; a trailing dot marks the next octave up) |note (counting from the small octave) |beats (aux.) |note_dur |is_slur|
|:-------------:| :------------:| :-----: | :-----: | :-----: |:-----:| :-----: |
|小 |x |5 |A#3/Bb3 |half |0.315789 |0 |
| |iao |5 |A#3/Bb3 |half |0.315789 |0 |
|酒 |j |1. |D#4/Eb4 |half |0.315789 |0 |
| |iu |1. |D#4/Eb4 |half |0.315789 |0 |
|窝 |w |2. |F4 |half |0.315789 |0 |
| |o |2. |F4 |half |0.315789 |0 |
|长 |ch |3. |G4 |half |0.315789 |0 |
| |ang |3. |G4 |half |0.315789 |0 |
| |ang |1. |D#4/Eb4 |half |0.315789 |1 |
|睫 |j |1. |D#4/Eb4 |half |0.315789 |0 |
| |ie |1. |D#4/Eb4 |half |0.315789 |0 |
| |ie |5 |A#3/Bb3 |half |0.315789 |1 |
|毛 |m |5 |A#3/Bb3 |one |0.631578 |0 |
| |ao |5 |A#3/Bb3 |one |0.631578 |0 |
|是 |sh |5 |A#3/Bb3 |half |0.315789 |0 |
| |i |5 |A#3/Bb3 |half |0.315789 |0 |
|你 |n |3. |G4 |half |0.315789 |0 |
| |i |3. |G4 |half |0.315789 |0 |
|最 |z |2. |F4 |half |0.315789 |0 |
| |ui |2. |F4 |half |0.315789 |0 |
|美 |m |3. |G4 |half |0.315789 |0 |
| |ei |3. |G4 |half |0.315789 |0 |
|的 |d |2. |F4 |half |0.315789 |0 |
| |e |2. |F4 |half |0.315789 |0 |
|记 |j |7 |D4 |half |0.315789 |0 |
| |i |7 |D4 |half |0.315789 |0 |
|号 |h |5 |A#3/Bb3 |half |0.315789 |0 |
| |ao |5 |A#3/Bb3 |half |0.315789 |0 |
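The `note_dur` column follows directly from the tempo; restating the computation above as a formula:

$$
\text{note\_dur} = \text{beats} \times \frac{60}{\mathrm{BPM}}, \qquad
\tfrac{1}{2} \times \tfrac{60}{95} = 0.315789\ \mathrm{s}, \qquad
1 \times \tfrac{60}{95} = 0.631578\ \mathrm{s}
$$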
## 2.2 Experiments
<div align = "center">
<table style="width:100%">
<thead>
<tr>
<th> No. </th>
<th width="500"> Description </th>
<th> Synthesized audio (diffsinger_opencpop + pwgan_opencpop) </th>
</tr>
</thead>
<tbody>
<tr>
<td > 1 </td>
<td > 原始 opencpop 标注的 notesnote_dursis_slurs升F大调起始在小字组第3组 </td>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/test1.wav" rel="nofollow">
<img align="center" src="../../../docs/images/audio_icon.png" width="200 style="max-width: 100%;"></a><br>
</td>
</tr>
<tr>
<td > 2 </td>
<td > Original opencpop-annotated notes and is_slurs, with note_durs changed (taken from the sheet music) </td>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/test2.wav" rel="nofollow">
<img align="center" src="../../../docs/images/audio_icon.png" width="200 style="max-width: 100%;"></a><br>
</td>
</tr>
<tr>
<td > 3 </td>
<td > Original opencpop-annotated notes with the rest removed (毛 takes a full beat); is_slurs and note_durs changed (taken from the sheet music) </td>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/test3.wav" rel="nofollow">
<img align="center" src="../../../docs/images/audio_icon.png" width="200 style="max-width: 100%;"></a><br>
</td>
</tr>
<tr>
<td > 4 </td>
<td > 从谱子获取 notesnote dursis_slurs不含 rest毛字一拍起始在小字一组第3组 </td>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/test4.wav" rel="nofollow">
<img align="center" src="../../../docs/images/audio_icon.png" width="200 style="max-width: 100%;"></a><br>
</td>
</tr>
<tr>
<td > 5 </td>
<td > 从谱子获取 notesnote dursis_slurs加上 rest 毛字半拍rest半拍起始在小字一组第3组</td>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/test5.wav" rel="nofollow">
<img align="center" src="../../../docs/images/audio_icon.png" width="200 style="max-width: 100%;"></a><br>
</td>
</tr>
<tr>
<td > 6 </td>
<td > 从谱子获取 notes is_slurs包含 restnote_durs 从原始标注获取起始在小字一组第3组 </td>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/test6.wav" rel="nofollow">
<img align="center" src="../../../docs/images/audio_icon.png" width="200 style="max-width: 100%;"></a><br>
</td>
</tr>
<tr>
<td > 7 </td>
<td > 从谱子获取 notesnote dursis_slurs不含 rest毛字一拍起始在小字一组第4组 </td>
<td align = "center">
<a href="https://paddlespeech.bj.bcebos.com/t2s/svs/svs_music_scores/test7.wav" rel="nofollow">
<img align="center" src="../../../docs/images/audio_icon.png" width="200 style="max-width: 100%;"></a><br>
</td>
</tr>
</tbody>
</table>
</div>
The experiments above show that extracting the music score this way is feasible. In practice, though, you can **flexibly insert "AP" (a breath) and "SP" (a pause) into the lyrics**, with a corresponding **rest added to the notes**, which makes the synthesized singing sound more natural overall.
Beyond that, choose the key and the starting octave so that **the resulting notes all occur in the training set**; if inference is fed notes never seen in training, the synthesized audio may not have the expected pitch.
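One way to enforce this (an illustration only, not part of the example scripts; `FitToRange`, `train_min`, and `train_max` are hypothetical names): represent the melody as MIDI note numbers and transpose by whole octaves until it fits the pitch range seen in training.
```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: shift a non-empty melody by whole octaves
// (12 semitones in MIDI numbering) until it fits [train_min, train_max].
std::vector<int> FitToRange(std::vector<int> midi_notes,
                            int train_min, int train_max) {
    int lo = *std::min_element(midi_notes.begin(), midi_notes.end());
    int hi = *std::max_element(midi_notes.begin(), midi_notes.end());
    while (hi > train_max && lo - 12 >= train_min) {  // transpose down
        for (int &n : midi_notes) n -= 12;
        lo -= 12; hi -= 12;
    }
    while (lo < train_min && hi + 12 <= train_max) {  // transpose up
        for (int &n : midi_notes) n += 12;
        lo += 12; hi += 12;
    }
    return midi_notes;
}
```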
# 3. Miscellaneous
## 3.1 Reading MIDI
```python
import mido

# Load a MIDI file; ticks_per_beat is the pulses-per-quarter-note resolution.
mid = mido.MidiFile('2093.midi')
print(mid.ticks_per_beat)
```

@ -0,0 +1,98 @@
############################################
# Network Architecture #
############################################
cmvn_file:
cmvn_file_type: "json"
# encoder related
encoder: squeezeformer
encoder_conf:
encoder_dim: 256 # dimension of attention
output_size: 256 # dimension of output
attention_heads: 4
num_blocks: 12 # the number of encoder blocks
reduce_idx: 5
recover_idx: 11
feed_forward_expansion_factor: 8
input_dropout_rate: 0.1
feed_forward_dropout_rate: 0.1
attention_dropout_rate: 0.1
adaptive_scale: true
cnn_module_kernel: 31
normalize_before: false
activation_type: 'swish'
pos_enc_layer_type: 'rel_pos'
time_reduction_layer_type: 'stream'
causal: true
use_dynamic_chunk: true
use_dynamic_left_chunk: false
# decoder related
decoder: transformer
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1 # sublayer output dropout
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.0
src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: false
init_type: 'kaiming_uniform' # !Warning: needed for convergence
###########################################
# Data #
###########################################
train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test
###########################################
# Dataloader #
###########################################
vocab_filepath: data/lang_char/vocab.txt
spm_model_prefix: ''
unit_type: 'char'
preprocess_config: conf/preprocess.yaml
feat_dim: 80
stride_ms: 10.0
window_ms: 25.0
sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 32
maxlen_in: 512 # if input length > maxlen-in, batchsize is automatically reduced
maxlen_out: 150 # if output length > maxlen-out, batchsize is automatically reduced
minibatches: 0 # for debug
batch_count: auto
batch_bins: 0
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
num_workers: 2
subsampling_factor: 1
num_encs: 1
###########################################
# Training #
###########################################
n_epoch: 240
accum_grad: 1
global_grad_clip: 5.0
dist_sampler: True
optim: adam
optim_conf:
lr: 0.001
weight_decay: 1.0e-6
scheduler: warmuplr
scheduler_conf:
warmup_steps: 25000
lr_decay: 1.0
log_interval: 100
checkpoint:
kbest_n: 50
latest_n: 5

@ -0,0 +1,93 @@
############################################
# Network Architecture #
############################################
cmvn_file:
cmvn_file_type: "json"
# encoder related
encoder: squeezeformer
encoder_conf:
encoder_dim: 256 # dimension of attention
output_size: 256 # dimension of output
attention_heads: 4
num_blocks: 12 # the number of encoder blocks
reduce_idx: 5
recover_idx: 11
feed_forward_expansion_factor: 8
input_dropout_rate: 0.1
feed_forward_dropout_rate: 0.1
attention_dropout_rate: 0.1
adaptive_scale: true
cnn_module_kernel: 31
normalize_before: false
activation_type: 'swish'
pos_enc_layer_type: 'rel_pos'
time_reduction_layer_type: 'conv1d'
# decoder related
decoder: transformer
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.0
src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: false
init_type: 'kaiming_uniform' # !Warning: needed for convergence
###########################################
# Data #
###########################################
train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test
###########################################
# Dataloader #
###########################################
vocab_filepath: data/lang_char/vocab.txt
spm_model_prefix: ''
unit_type: 'char'
preprocess_config: conf/preprocess.yaml
feat_dim: 80
stride_ms: 10.0
window_ms: 25.0
sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 32
maxlen_in: 512 # if input length > maxlen-in, batchsize is automatically reduced
maxlen_out: 150 # if output length > maxlen-out, batchsize is automatically reduced
minibatches: 0 # for debug
batch_count: auto
batch_bins: 0
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
num_workers: 2
subsampling_factor: 1
num_encs: 1
###########################################
# Training #
###########################################
n_epoch: 150
accum_grad: 8
global_grad_clip: 5.0
dist_sampler: False
optim: adam
optim_conf:
lr: 0.002
weight_decay: 1.0e-6
scheduler: warmuplr
scheduler_conf:
warmup_steps: 25000
lr_decay: 1.0
log_interval: 100
checkpoint:
kbest_n: 50
latest_n: 5

@ -164,8 +164,8 @@ using the `tar` scripts to unpack the model and then you can use the script to t
For example:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz
tar xzvf wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.4.0.model.tar.gz
tar xzvf wav2vec2ASR-large-aishell1_ckpt_1.4.0.model.tar.gz
source path.sh
# If you have process the data and get the manifest file you can skip the following 2 steps
bash local/data.sh --stage -1 --stop_stage -1
@ -185,14 +185,14 @@ In some situations, you want to use the trained model to do the inference for th
```
you can train the model by yourself using ```bash run.sh --stage 0 --stop_stage 3```, or you can download the pretrained model through the script below:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz
tar xzvf wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.4.0.model.tar.gz
tar xzvf wav2vec2ASR-large-aishell1_ckpt_1.4.0.model.tar.gz
```
You can download the audio demo:
```bash
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/en/demo_002_en.wav -P data/
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/demo_01_03.wav -P data/
```
You need to prepare an audio file or use the audio demo above; please confirm the sample rate of the audio is 16 kHz. You can get the result of the audio demo by running the script below.
```bash
CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/wav2vec2ASR.yaml conf/tuning/decode.yaml exp/wav2vec2ASR/checkpoints/avg_1 data/demo_002_en.wav
CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/wav2vec2ASR.yaml conf/tuning/decode.yaml exp/wav2vec2ASR/checkpoints/avg_1 data/demo_01_03.wav
```

@ -0,0 +1,18 @@
# AISHELL
## Version
* paddle version: develop (commit id: daea892c67e85da91906864de40ce9f6f1b893ae)
* paddlespeech version: develop (commit id: c14b4238b256693281e59605abff7c9435b3e2b2)
* paddlenlp version: 2.5.2
## Device
* python: 3.7
* cuda: 10.2
* cudnn: 7.6
## Result
train: Epoch 80, 2*V100-32G, batch size: 5
| Model | Params | Config | Augmentation| Test set | Decode method | WER |
| --- | --- | --- | --- | --- | --- | --- |
| wav2vec2ASR | 324.49 M | conf/wav2vec2ASR.yaml | spec_aug | test-set | greedy search | 5.1009 |

@ -83,7 +83,7 @@ dnn_neurons: 1024
freeze_wav2vec: False
dropout: 0.15
tokenizer: !apply:transformers.BertTokenizer.from_pretrained
tokenizer: !apply:paddlenlp.transformers.AutoTokenizer.from_pretrained
pretrained_model_name_or_path: bert-base-chinese
# bert-base-chinese tokens length
output_neurons: 21128

@ -107,6 +107,7 @@ vocab_filepath: data/lang_char/vocab.txt
###########################################
unit_type: 'char'
tokenizer: bert-base-chinese
mean_std_filepath:
preprocess_config: conf/preprocess.yaml
sortagrad: -1 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
@ -139,12 +140,10 @@ n_epoch: 80
accum_grad: 1
global_grad_clip: 5.0
model_optim: adadelta
model_optim: sgd
model_optim_conf:
lr: 1.0
weight_decay: 0.0
rho: 0.95
epsilon: 1.0e-8
wav2vec2_optim: adam
wav2vec2_optim_conf:
@ -165,3 +164,4 @@ log_interval: 1
checkpoint:
kbest_n: 50
latest_n: 5

@ -0,0 +1,168 @@
############################################
# Network Architecture #
############################################
freeze_wav2vec2: False
normalize_wav: True
output_norm: True
init_type: 'kaiming_uniform' # !Warning: needed for convergence
enc:
input_shape: 1024
dnn_blocks: 3
dnn_neurons: 1024
activation: True
normalization: True
dropout_rate: [0.15, 0.15, 0.0]
ctc:
enc_n_units: 1024
blank_id: 0
dropout_rate: 0.0
audio_augment:
speeds: [90, 100, 110]
spec_augment:
time_warp: True
time_warp_window: 5
time_warp_mode: bicubic
freq_mask: True
n_freq_mask: 2
time_mask: True
n_time_mask: 2
replace_with_zero: False
freq_mask_width: 30
time_mask_width: 40
wav2vec2_params_path: exp/wav2vec2/chinese-wav2vec2-large.pdparams
############################################
# Wav2Vec2.0 #
############################################
# vocab_size: 1000000
hidden_size: 1024
num_hidden_layers: 24
num_attention_heads: 16
intermediate_size: 4096
hidden_act: gelu
hidden_dropout: 0.1
activation_dropout: 0.0
attention_dropout: 0.1
feat_proj_dropout: 0.1
feat_quantizer_dropout: 0.0
final_dropout: 0.0
layerdrop: 0.1
initializer_range: 0.02
layer_norm_eps: 1e-5
feat_extract_norm: layer
feat_extract_activation: gelu
conv_dim: [512, 512, 512, 512, 512, 512, 512]
conv_stride: [5, 2, 2, 2, 2, 2, 2]
conv_kernel: [10, 3, 3, 3, 3, 2, 2]
conv_bias: True
num_conv_pos_embeddings: 128
num_conv_pos_embedding_groups: 16
do_stable_layer_norm: True
apply_spec_augment: False
mask_channel_length: 10
mask_channel_min_space: 1
mask_channel_other: 0.0
mask_channel_prob: 0.0
mask_channel_selection: static
mask_feature_length: 10
mask_feature_min_masks: 0
mask_feature_prob: 0.0
mask_time_length: 10
mask_time_min_masks: 2
mask_time_min_space: 1
mask_time_other: 0.0
mask_time_prob: 0.075
mask_time_selection: static
num_codevectors_per_group: 320
num_codevector_groups: 2
contrastive_logits_temperature: 0.1
num_negatives: 100
codevector_dim: 256
proj_codevector_dim: 256
diversity_loss_weight: 0.1
use_weighted_layer_sum: False
# pad_token_id: 0
# bos_token_id: 1
# eos_token_id: 2
add_adapter: False
adapter_kernel_size: 3
adapter_stride: 2
num_adapter_layers: 3
output_hidden_size: None
###########################################
# Data #
###########################################
train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test
vocab_filepath: data/lang_char/vocab.txt
###########################################
# Dataloader #
###########################################
unit_type: 'char'
tokenizer: bert-base-chinese
mean_std_filepath:
preprocess_config: conf/preprocess.yaml
sortagrad: -1 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 5 # Different batch_size may cause large differences in results
maxlen_in: 51200000000 # if input length > maxlen-in batchsize is automatically reduced
maxlen_out: 1500000 # if output length > maxlen-out batchsize is automatically reduced
minibatches: 0 # for debug
batch_count: auto
batch_bins: 0
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
num_workers: 6
subsampling_factor: 1
num_encs: 1
dist_sampler: True
shortest_first: True
return_lens_rate: True
###########################################
# use speechbrain dataloader #
###########################################
use_sb_pipeline: True # whether use speechbrain pipeline. Default is True.
sb_pipeline_conf: conf/train_with_wav2vec.yaml
###########################################
# Training #
###########################################
n_epoch: 80
accum_grad: 1
global_grad_clip: 5.0
model_optim: adadelta
model_optim_conf:
lr: 1.0
weight_decay: 0.0
rho: 0.95
epsilon: 1.0e-8
wav2vec2_optim: adam
wav2vec2_optim_conf:
lr: 0.0001
weight_decay: 0.0
model_scheduler: newbobscheduler
model_scheduler_conf:
improvement_threshold: 0.0025
annealing_factor: 0.8
patient: 0
wav2vec2_scheduler: newbobscheduler
wav2vec2_scheduler_conf:
improvement_threshold: 0.0025
annealing_factor: 0.9
patient: 0
log_interval: 1
checkpoint:
kbest_n: 50
latest_n: 5

@ -21,7 +21,7 @@ import glob
import logging
import os
from paddlespeech.s2t.models.wav2vec2.io.dataio import read_audio
from paddlespeech.s2t.io.speechbrain.dataio import read_audio
logger = logging.getLogger(__name__)

@ -1,7 +1,7 @@
#!/bin/bash
stage=-1
stop_stage=-1
stop_stage=3
dict_dir=data/lang_char
. ${MAIN_ROOT}/utils/parse_options.sh || exit -1;

@ -8,9 +8,7 @@ echo "using $ngpu gpus..."
expdir=exp
datadir=data
train_set=train_960
recog_set="test-clean test-other dev-clean dev-other"
recog_set="test-clean"
train_set=train
config_path=$1
decode_config_path=$2
@ -75,7 +73,7 @@ for type in ctc_prefix_beam_search; do
--trans_hyp ${ckpt_prefix}.${type}.rsl.text
python3 utils/compute-wer.py --char=1 --v=1 \
data/manifest.test-clean.text ${ckpt_prefix}.${type}.rsl.text > ${ckpt_prefix}.${type}.error
data/manifest.test.text ${ckpt_prefix}.${type}.rsl.text > ${ckpt_prefix}.${type}.error
echo "decoding ${type} done."
done

@ -14,7 +14,7 @@ ckpt_prefix=$3
audio_file=$4
mkdir -p data
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/en/demo_002_en.wav -P data/
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/demo_01_03.wav -P data/
if [ $? -ne 0 ]; then
exit 1
fi

@ -15,11 +15,11 @@ resume= # xx e.g. 30
export FLAGS_cudnn_deterministic=1
. ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
audio_file=data/demo_002_en.wav
audio_file=data/demo_01_03.wav
avg_ckpt=avg_${avg_num}
ckpt=$(basename ${conf_path} | awk -F'.' '{print $1}')
echo "checkpoint name ${ckpt}"git revert -v
echo "checkpoint name ${ckpt}"
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data

@ -43,10 +43,7 @@ fi
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.0.0' ]]; then
pip install paddle2onnx==1.0.0
fi
pip install paddle2onnx --upgrade
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_aishell3
# considering the balance between speed and quality, we recommend that you use hifigan as vocoder
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_aishell3

@ -46,10 +46,7 @@ fi
# we have only tested the following models so far
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.0.0' ]]; then
pip install paddle2onnx==1.0.0
fi
pip install paddle2onnx --upgrade
../../csmsc/tts3/local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_canton
# considering the balance between speed and quality, we recommend that you use hifigan as vocoder
# ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_csmsc

@ -45,10 +45,7 @@ fi
# we have only tested the following models so far
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.0.0' ]]; then
pip install paddle2onnx==1.0.0
fi
pip install paddle2onnx --upgrade
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx speedyspeech_csmsc
# considering the balance between speed and quality, we recommend that you use hifigan as vocoder
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_csmsc

@ -45,10 +45,7 @@ fi
# we have only tested the following models so far
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.0.0' ]]; then
pip install paddle2onnx==1.0.0
fi
pip install paddle2onnx --upgrade
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc
# considering the balance between speed and quality, we recommend that you use hifigan as vocoder
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_csmsc

@ -58,10 +58,7 @@ fi
# paddle2onnx non streaming
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.0.0' ]]; then
pip install paddle2onnx==1.0.0
fi
pip install paddle2onnx --upgrade
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc
# considering the balance between speed and quality, we recommend that you use hifigan as vocoder
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_csmsc
@ -77,10 +74,7 @@ fi
# paddle2onnx streaming
if [ ${stage} -le 9 ] && [ ${stop_stage} -ge 9 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.0.0' ]]; then
pip install paddle2onnx==1.0.0
fi
pip install paddle2onnx --upgrade
# streaming acoustic model
./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming fastspeech2_csmsc_am_encoder_infer
./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming fastspeech2_csmsc_am_decoder

@ -0,0 +1 @@
../../tts3/local/paddle2onnx.sh

@ -39,3 +39,31 @@ fi
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} ${add_blank}|| exit -1
fi
# # not ready yet for operator missing in Paddle2ONNX
# # paddle2onnx, please make sure the static models are in ${train_output_path}/inference first
# # we have only tested the following models so far
# if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# # install paddle2onnx
# pip install paddle2onnx --upgrade
# ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx vits_csmsc
# fi
# # inference with onnxruntime
# if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
# ./local/ort_predict.sh ${train_output_path}
# fi
# # not ready yet for operator missing in Paddle-Lite
# # must run after stage 3 (which stage generated static models)
# if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# # NOTE by yuantian 2022.11.21: please compile develop version of Paddle-Lite to export and run TTS models,
# # cause TTS models are supported by https://github.com/PaddlePaddle/Paddle-Lite/pull/9587
# # and https://github.com/PaddlePaddle/Paddle-Lite/pull/9706
# ./local/export2lite.sh ${train_output_path} inference pdlite vits_csmsc x86
# fi
# if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
# CUDA_VISIBLE_DEVICES=${gpus} ./local/lite_predict.sh ${train_output_path} || exit -1
# fi

@ -6,7 +6,7 @@ set -e
gpus=0
stage=0
stop_stage=0
stop_stage=4
conf_path=conf/wav2vec2ASR.yaml
ips= #xx.xx.xx.xx,xx.xx.xx.xx
decode_conf_path=conf/tuning/decode.yaml

@ -45,10 +45,7 @@ fi
# we have only tested the following models so far
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.0.0' ]]; then
pip install paddle2onnx==1.0.0
fi
pip install paddle2onnx --upgrade
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_ljspeech
# considering the balance between speed and quality, we recommend that you use hifigan as vocoder
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_ljspeech

@ -0,0 +1,6 @@
# Opencpop
* svs1 - DiffSinger
* voc1 - Parallel WaveGAN
* voc5 - HiFiGAN

@ -0,0 +1,276 @@
([简体中文](./README_cn.md)|English)
# DiffSinger with Opencpop
This example contains code used to train a [DiffSinger](https://arxiv.org/abs/2105.02446) model with [Mandarin singing corpus](https://wenet.org.cn/opencpop/).
## Dataset
### Download and Extract
Download Opencpop from its [Official Website](https://wenet.org.cn/opencpop/download/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/Opencpop`.
## Get Started
Assume the path to the dataset is `~/datasets/Opencpop`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
- (in progress) synthesize waveform from a text file.
5. (in progress) inference using the static model.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│ ├── norm
│ └── raw
└── train
├── energy_stats.npy
├── norm
├── pitch_stats.npy
├── raw
├── speech_stats.npy
└── speech_stretchs.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and` test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech, pitch and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. `speech_stretchs.npy` contains the minimum and maximum values of each dimension of the mel spectrum, which is used for linear stretching before training/inference of the diffusion module.
Note: since training on un-normalized features works better than on normalized ones, the features saved under `norm` are in fact the un-normalized features.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains utterance id, speaker id, phones, text lengths, speech lengths, phone durations, the paths of the speech, pitch, and energy features, notes, note durations, and slur flags.
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--phones-dict PHONES_DICT]
[--speaker-dict SPEAKER_DICT] [--speech-stretchs SPEECH_STRETCHS]
Train a DiffSinger model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG diffsinger config file.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu=0, use cpu.
--phones-dict PHONES_DICT
phone vocabulary file.
--speaker-dict SPEAKER_DICT
speaker id map file for multiple speaker model.
--speech-stretchs SPEECH_STRETCHS
min and max mel for stretching.
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
5. `--phones-dict` is the path of the phone vocabulary file.
6. `--speech-stretchs` is the path of the file containing the min and max values of the mel spectrum.
### Synthesizing
We use parallel wavegan as the neural vocoder.
Download pretrained parallel wavegan model from [pwgan_opencpop_ckpt_1.4.0.zip](https://paddlespeech.bj.bcebos.com/t2s/svs/opencpop/pwgan_opencpop_ckpt_1.4.0.zip) and unzip it.
```bash
unzip pwgan_opencpop_ckpt_1.4.0.zip
```
Parallel WaveGAN checkpoint contains files listed below.
```text
pwgan_opencpop_ckpt_1.4.0.zip
├── default.yaml # default config used to train parallel wavegan
├── snapshot_iter_100000.pdz # model parameters of parallel wavegan
└── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h]
[--am {diffsinger_opencpop}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--voc {pwgan_opencpop}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU]
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
[--speech_stretchs SPEECH_STRETCHS]
Synthesize with acoustic model & vocoder
optional arguments:
-h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}
Choose acoustic model type of tts task.
{diffsinger_opencpop} Choose acoustic model type of svs task.
--am_config AM_CONFIG
Config of acoustic model.
--am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model.
--phones_dict PHONES_DICT
phone vocabulary file.
--tones_dict TONES_DICT
tone vocabulary file.
--speaker_dict SPEAKER_DICT
speaker id map file.
--voice-cloning VOICE_CLONING
whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}
Choose vocoder type of tts task.
{pwgan_opencpop, hifigan_opencpop} Choose vocoder type of svs task.
--voc_config VOC_CONFIG
Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--ngpu NGPU if ngpu == 0, use cpu.
--test_metadata TEST_METADATA
test metadata.
--output_dir OUTPUT_DIR
output dir.
--speech-stretchs SPEECH_STRETCHS
The min and max values of the mel spectrum, using on diffusion of diffsinger.
```
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from a text file.
`local/pinyin_to_phone.txt` comes from the readme of the opencpop dataset, indicating the mapping from pinyin to phonemes in opencpop.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize_e2e.py [-h]
[--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU]
[--text TEXT] [--output_dir OUTPUT_DIR]
[--pinyin_phone PINYIN_PHONE]
[--speech_stretchs SPEECH_STRETCHS]
Synthesize with acoustic model & vocoder
optional arguments:
-h, --help show this help message and exit
--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}
Choose acoustic model type of tts task.
{diffsinger_opencpop} Choose acoustic model type of svs task.
--am_config AM_CONFIG
Config of acoustic model.
--am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model.
--phones_dict PHONES_DICT
phone vocabulary file.
--speaker_dict SPEAKER_DICT
speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}
Choose vocoder type of tts task.
{pwgan_opencpop, hifigan_opencpop} Choose vocoder type of svs task.
--voc_config VOC_CONFIG
Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--lang LANG {zh, en, mix, canton} Choose language type of tts task.
{sing} Choose language type of svs task.
--inference_dir INFERENCE_DIR
dir to save inference models
--ngpu NGPU if ngpu == 0, use cpu.
--text TEXT text to synthesize file, a 'utt_id sentence' pair per line for tts task.
A '{ utt_id input_type (is word) text notes note_durs}' or '{utt_id input_type (is phoneme) phones notes note_durs is_slurs}' pair per line for svs task.
--output_dir OUTPUT_DIR
output dir.
--pinyin_phone PINYIN_PHONE
pinyin to phone map file, using on sing_frontend.
--speech_stretchs SPEECH_STRETCHS
The min and max values of the mel spectrum, using on diffusion of diffsinger.
```
1. `--am` is the acoustic model type with the format {model_name}_{dataset}
2. `--am_config`, `--am_ckpt`, `--am_stat` and `--phones_dict` are arguments for the acoustic model, which correspond to the 4 files in the diffsinger pretrained model.
3. `--voc` is the vocoder type with the format {model_name}_{dataset}
4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the language. `zh`, `en`, `mix` and `canton` are for the tts task; `sing` is for the svs task.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize.
8. `--output_dir` is the directory to save synthesized audio files.
9. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
10. `--inference_dir` is the directory to save static models. If this option is not added, static models will not be generated or saved.
11. `--pinyin_phone` is the pinyin-to-phone map file used by sing_frontend.
12. `--speech_stretchs` is the file with the min and max values of the mel spectrum, used by the diffusion module of diffsinger.
Note: at present, the diffsinger model does not support dynamic-to-static conversion, so do not add `--inference_dir`.
## Pretrained Model
Pretrained DiffSinger model:
- [diffsinger_opencpop_ckpt_1.4.0.zip](https://paddlespeech.bj.bcebos.com/t2s/svs/opencpop/diffsinger_opencpop_ckpt_1.4.0.zip)
DiffSinger checkpoint contains files listed below.
```text
diffsinger_opencpop_ckpt_1.4.0.zip
├── default.yaml # default config used to train diffsinger
├── energy_stats.npy # statistics used to normalize energy when training diffsinger if norm is needed
├── phone_id_map.txt # phone vocabulary file when training diffsinger
├── pinyin_to_phone.txt # pinyin-to-phoneme mapping file when training diffsinger
├── pitch_stats.npy # statistics used to normalize pitch when training diffsinger if norm is needed
├── snapshot_iter_160000.pdz # model parameters of diffsinger
├── speech_stats.npy # statistics used to normalize mel when training diffsinger if norm is needed
└── speech_stretchs.npy # min and max values to use for mel spectral stretching before training diffusion
```
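For intuition, here is a minimal sketch of how these stretch statistics can be applied. It assumes `speech_stretchs.npy` stacks the per-dimension mel minimum (row 0) and maximum (row 1), and that the diffusion module works on mels linearly stretched into [-1, 1]; the target range and file layout are assumptions, not guaranteed by this README.
```python
import numpy as np

# Assumption: row 0 holds the per-dimension mel minimum, row 1 the maximum.
stretchs = np.load("diffsinger_opencpop_ckpt_1.4.0/speech_stretchs.npy")
mel_min, mel_max = stretchs[0], stretchs[1]  # each of shape (n_mels,)

def stretch(mel: np.ndarray) -> np.ndarray:
    """Linearly map a (T, n_mels) mel spectrogram into [-1, 1]."""
    return (mel - mel_min) / (mel_max - mel_min) * 2.0 - 1.0

def unstretch(mel: np.ndarray) -> np.ndarray:
    """Invert the stretch on the denoised mel before vocoding."""
    return (mel + 1.0) / 2.0 * (mel_max - mel_min) + mel_min
```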
You can use the following scripts to synthesize for `${BIN_DIR}/../sentences_sing.txt` using pretrained diffsinger and parallel wavegan models.
```bash
source path.sh
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=diffsinger_opencpop \
--am_config=diffsinger_opencpop_ckpt_1.4.0/default.yaml \
--am_ckpt=diffsinger_opencpop_ckpt_1.4.0/snapshot_iter_160000.pdz \
--am_stat=diffsinger_opencpop_ckpt_1.4.0/speech_stats.npy \
--voc=pwgan_opencpop \
--voc_config=pwgan_opencpop_ckpt_1.4.0/default.yaml \
--voc_ckpt=pwgan_opencpop_ckpt_1.4.0/snapshot_iter_100000.pdz \
--voc_stat=pwgan_opencpop_ckpt_1.4.0/feats_stats.npy \
--lang=sing \
--text=${BIN_DIR}/../sentences_sing.txt \
--output_dir=exp/default/test_e2e \
--phones_dict=diffsinger_opencpop_ckpt_1.4.0/phone_id_map.txt \
--pinyin_phone=diffsinger_opencpop_ckpt_1.4.0/pinyin_to_phone.txt \
--speech_stretchs=diffsinger_opencpop_ckpt_1.4.0/speech_stretchs.npy
```

@ -0,0 +1,280 @@
(Simplified Chinese|[English](./README.md))
# Train DiffSinger with the Opencpop Dataset
This example contains code used to train a [DiffSinger](https://arxiv.org/abs/2105.02446) model with the [Mandarin singing corpus](https://wenet.org.cn/opencpop/).
## Dataset
### Download and Extract
Download the dataset from the [official website](https://wenet.org.cn/opencpop/download/).
## Get Started
Assume the path to the dataset is `~/datasets/Opencpop`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
    - (in progress) synthesize waveform from a text file.
5. (in progress) inference using a static model.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage; for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│ ├── norm
│ └── raw
└── train
├── energy_stats.npy
├── norm
├── pitch_stats.npy
├── raw
├── speech_stats.npy
└── speech_stretchs.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the speech, pitch, and energy features of each utterance, while the `norm` folder contains the normalized ones. The statistics used to normalize the features are computed from the training set and stored in `dump/train/*_stats.npy`. `speech_stretchs.npy` contains the per-dimension minimum and maximum values of the mel spectrogram, used for linear stretching before training/inference of the diffusion module.
Note: because training on un-normalized features turns out to work better than on normalized ones, the features saved under `norm` are actually the un-normalized features.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains utterance id, speaker id, phones, text length, speech length, phone durations, the path of the speech feature, the path of the pitch feature, the path of the energy feature, notes, note durations, and whether each phone is a slur.
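For illustration, here is a hedged sketch of how the `*_stats.npy` files can be consumed, assuming each file stacks the training-set mean (row 0) and standard deviation (row 1) of the feature, one value per dimension, as produced by the preprocessing stage:
```python
import numpy as np

# Assumption: *_stats.npy stacks the training-set mean (row 0) and
# standard deviation (row 1) of the feature, one value per dimension.
stats = np.load("dump/train/pitch_stats.npy")
mean, std = stats[0], stats[1]

def normalize(feat: np.ndarray) -> np.ndarray:
    """Z-score a (T, dim) feature with the training-set statistics."""
    return (feat - mean) / std

def denormalize(feat: np.ndarray) -> np.ndarray:
    """Restore the original scale, e.g. after model prediction."""
    return feat * std + mean
```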
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--phones-dict PHONES_DICT]
[--speaker-dict SPEAKER_DICT] [--speech-stretchs SPEECH_STRETCHS]
Train a DiffSinger model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG diffsinger config file.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu=0, use cpu.
--phones-dict PHONES_DICT
phone vocabulary file.
--speaker-dict SPEAKER_DICT
speaker id map file for multiple speaker model.
--speech-stretchs SPEECH_STRETCHS
min and max mel for stretching.
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata files in the normalized subfolders of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
5. `--phones-dict` is the path of the phone vocabulary file.
6. `--speech-stretchs` is the path of the file containing the minimum and maximum values of the mel spectrogram.
### Synthesizing
We use parallel wavegan as the neural vocoder.
Download the pretrained parallel wavegan model from [pwgan_opencpop_ckpt_1.4.0.zip](https://paddlespeech.bj.bcebos.com/t2s/svs/opencpop/pwgan_opencpop_ckpt_1.4.0.zip) and unzip it.
```bash
unzip pwgan_opencpop_ckpt_1.4.0.zip
```
Parallel WaveGAN checkpoint contains files listed below.
```text
pwgan_opencpop_ckpt_1.4.0.zip
├── default.yaml # default config used to train parallel wavegan
├── snapshot_iter_100000.pdz # model parameters of parallel wavegan
└── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h]
[--am {diffsinger_opencpop}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--voc {pwgan_opencpop}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU]
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
[--speech_stretchs SPEECH_STRETCHS]
Synthesize with acoustic model & vocoder
optional arguments:
-h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}
Choose acoustic model type of tts task.
{diffsinger_opencpop} Choose acoustic model type of svs task.
--am_config AM_CONFIG
Config of acoustic model.
--am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model.
--phones_dict PHONES_DICT
phone vocabulary file.
--tones_dict TONES_DICT
tone vocabulary file.
--speaker_dict SPEAKER_DICT
speaker id map file.
--voice-cloning VOICE_CLONING
whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}
Choose vocoder type of tts task.
{pwgan_opencpop, hifigan_opencpop} Choose vocoder type of svs task.
--voc_config VOC_CONFIG
Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--ngpu NGPU if ngpu == 0, use cpu.
--test_metadata TEST_METADATA
test metadata.
--output_dir OUTPUT_DIR
output dir.
--speech-stretchs SPEECH_STRETCHS
The min and max values of the mel spectrum, using on diffusion of diffsinger.
```
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from a text file.
`local/pinyin_to_phone.txt` comes from the README of the Opencpop dataset and records the pinyin-to-phoneme mapping used by Opencpop.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize_e2e.py [-h]
[--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU]
[--text TEXT] [--output_dir OUTPUT_DIR]
[--pinyin_phone PINYIN_PHONE]
[--speech_stretchs SPEECH_STRETCHS]
Synthesize with acoustic model & vocoder
optional arguments:
-h, --help show this help message and exit
--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}
Choose acoustic model type of tts task.
{diffsinger_opencpop} Choose acoustic model type of svs task.
--am_config AM_CONFIG
Config of acoustic model.
--am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model.
--phones_dict PHONES_DICT
phone vocabulary file.
--speaker_dict SPEAKER_DICT
speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}
Choose vocoder type of tts task.
{pwgan_opencpop, hifigan_opencpop} Choose vocoder type of svs task.
--voc_config VOC_CONFIG
Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--lang LANG {zh, en, mix, canton} Choose language type of tts task.
{sing} Choose language type of svs task.
--inference_dir INFERENCE_DIR
dir to save inference models
--ngpu NGPU if ngpu == 0, use cpu.
--text TEXT text to synthesize file, a 'utt_id sentence' pair per line for tts task.
A '{ utt_id input_type (is word) text notes note_durs}' or '{utt_id input_type (is phoneme) phones notes note_durs is_slurs}' pair per line for svs task.
--output_dir OUTPUT_DIR
output dir.
--pinyin_phone PINYIN_PHONE
pinyin to phone map file, using on sing_frontend.
--speech_stretchs SPEECH_STRETCHS
The min and max values of the mel spectrum, using on diffusion of diffsinger.
```
1. `--am` is the acoustic model type with the format {model_name}_{dataset}.
2. `--am_config`, `--am_ckpt`, `--am_stat` and `--phones_dict` are arguments for the acoustic model, which correspond to the 4 files in the diffsinger pretrained model.
3. `--voc` is the vocoder type with the format {model_name}_{dataset}.
4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the language of the model: `zh`, `en`, `mix` or `canton` for the tts task, and `sing` for the svs task.
6. `--test_metadata` should be the normalized metadata file under `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize.
8. `--output_dir` is the directory to save synthesized audio files.
9. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
10. `--inference_dir` is the directory to save static models. If this argument is omitted, no static model will be exported or saved.
11. `--pinyin_phone` is the pinyin-to-phone mapping file.
12. `--speech_stretchs` gives the min and max values of the mel spectrogram, used for linear stretching before diffusion in diffsinger.
Note: at present, the diffsinger model does not support dynamic-to-static conversion, so do not add `--inference_dir`.
## Pretrained Model
Pretrained DiffSinger model:
- [diffsinger_opencpop_ckpt_1.4.0.zip](https://paddlespeech.bj.bcebos.com/t2s/svs/opencpop/diffsinger_opencpop_ckpt_1.4.0.zip)
DiffSinger checkpoint contains files listed below.
```text
diffsinger_opencpop_ckpt_1.4.0.zip
├── default.yaml # default config used to train diffsinger
├── energy_stats.npy # statistics used to normalize energy when training diffsinger if norm is needed
├── phone_id_map.txt # phone vocabulary file when training diffsinger
├── pinyin_to_phone.txt # pinyin-to-phoneme mapping file when training diffsinger
├── pitch_stats.npy # statistics used to normalize pitch when training diffsinger if norm is needed
├── snapshot_iter_160000.pdz # model parameters and optimizer states of diffsinger
├── speech_stats.npy # statistics used to normalize mel when training diffsinger if norm is needed
└── speech_stretchs.npy # min and max values used for mel spectral stretching before training diffusion
```
You can use the following scripts to synthesize for `${BIN_DIR}/../sentences_sing.txt` using pretrained diffsinger and parallel wavegan models.
```bash
source path.sh
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=diffsinger_opencpop \
--am_config=diffsinger_opencpop_ckpt_1.4.0/default.yaml \
--am_ckpt=diffsinger_opencpop_ckpt_1.4.0/snapshot_iter_160000.pdz \
--am_stat=diffsinger_opencpop_ckpt_1.4.0/speech_stats.npy \
--voc=pwgan_opencpop \
--voc_config=pwgan_opencpop_ckpt_1.4.0/default.yaml \
--voc_ckpt=pwgan_opencpop_ckpt_1.4.0/snapshot_iter_100000.pdz \
--voc_stat=pwgan_opencpop_ckpt_1.4.0/feats_stats.npy \
--lang=sing \
--text=${BIN_DIR}/../sentences_sing.txt \
--output_dir=exp/default/test_e2e \
--phones_dict=diffsinger_opencpop_ckpt_1.4.0/phone_id_map.txt \
--pinyin_phone=diffsinger_opencpop_ckpt_1.4.0/pinyin_to_phone.txt \
--speech_stretchs=diffsinger_opencpop_ckpt_1.4.0/speech_stretchs.npy
```

@ -0,0 +1,159 @@
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # sr
n_fft: 512 # FFT size (samples).
n_shift: 128 # Hop size (samples). 12.5ms
win_length: 512 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
# Only used for feats_type != raw
fmin: 30 # Minimum frequency of Mel basis.
fmax: 12000 # Maximum frequency of Mel basis.
n_mels: 80 # The number of mel basis.
# Only used for the model using pitch features (e.g. FastSpeech2)
f0min: 80 # Minimum f0 for pitch extraction.
f0max: 750 # Maximum f0 for pitch extraction.
###########################################################
# DATA SETTING #
###########################################################
batch_size: 48 # batch size
num_workers: 1 # number of workers in DataLoader
###########################################################
# MODEL SETTING #
###########################################################
model:
# music score related
note_num: 300 # number of note
is_slur_num: 2 # number of slur
# fastspeech2 module options
use_energy_pred: False # whether use energy predictor
use_postnet: False # whether use postnet
# fastspeech2 module
fastspeech2_params:
adim: 256 # attention dimension
aheads: 2 # number of attention heads
elayers: 4 # number of encoder layers
eunits: 1024 # number of encoder ff units
dlayers: 4 # number of decoder layers
dunits: 1024 # number of decoder ff units
positionwise_layer_type: conv1d-linear # type of position-wise layer
positionwise_conv_kernel_size: 9 # kernel size of position wise conv layer
transformer_enc_dropout_rate: 0.1 # dropout rate for transformer encoder layer
transformer_enc_positional_dropout_rate: 0.1 # dropout rate for transformer encoder positional encoding
transformer_enc_attn_dropout_rate: 0.0 # dropout rate for transformer encoder attention layer
transformer_activation_type: "gelu" # Activation function type in transformer.
encoder_normalize_before: True # whether to perform layer normalization before the input
decoder_normalize_before: True # whether to perform layer normalization before the input
reduction_factor: 1 # reduction factor
init_type: xavier_uniform # initialization type
init_enc_alpha: 1.0 # initial value of alpha of encoder scaled position encoding
init_dec_alpha: 1.0 # initial value of alpha of decoder scaled position encoding
use_scaled_pos_enc: True # whether to use scaled positional encoding
transformer_dec_dropout_rate: 0.1 # dropout rate for transformer decoder layer
transformer_dec_positional_dropout_rate: 0.1 # dropout rate for transformer decoder positional encoding
transformer_dec_attn_dropout_rate: 0.0 # dropout rate for transformer decoder attention layer
duration_predictor_layers: 5 # number of layers of duration predictor
duration_predictor_chans: 256 # number of channels of duration predictor
duration_predictor_kernel_size: 3 # filter size of duration predictor
duration_predictor_dropout_rate: 0.5 # dropout rate in duration predictor
pitch_predictor_layers: 5 # number of conv layers in pitch predictor
pitch_predictor_chans: 256 # number of channels of conv layers in pitch predictor
pitch_predictor_kernel_size: 5 # kernel size of conv layers in pitch predictor
pitch_predictor_dropout: 0.5 # dropout rate in pitch predictor
pitch_embed_kernel_size: 1 # kernel size of conv embedding layer for pitch
pitch_embed_dropout: 0.0 # dropout rate after conv embedding layer for pitch
stop_gradient_from_pitch_predictor: True # whether to stop the gradient from pitch predictor to encoder
energy_predictor_layers: 2 # number of conv layers in energy predictor
energy_predictor_chans: 256 # number of channels of conv layers in energy predictor
energy_predictor_kernel_size: 3 # kernel size of conv layers in energy predictor
energy_predictor_dropout: 0.5 # dropout rate in energy predictor
energy_embed_kernel_size: 1 # kernel size of conv embedding layer for energy
energy_embed_dropout: 0.0 # dropout rate after conv embedding layer for energy
stop_gradient_from_energy_predictor: False # whether to stop the gradient from energy predictor to encoder
postnet_layers: 5 # number of layers of postnet
postnet_filts: 5 # filter size of conv layers in postnet
postnet_chans: 256 # number of channels of conv layers in postnet
postnet_dropout_rate: 0.5 # dropout rate for postnet
# denoiser module
denoiser_params:
in_channels: 80 # Number of channels of the input mel-spectrogram
out_channels: 80 # Number of channels of the output mel-spectrogram
kernel_size: 3 # Kernel size of the residual blocks inside
layers: 20 # Number of residual blocks inside
stacks: 5 # The number of groups to split the residual blocks into
residual_channels: 256 # Residual channel of the residual blocks
gate_channels: 512 # Gate channel of the residual blocks
skip_channels: 256 # Skip channel of the residual blocks
aux_channels: 256 # Auxiliary channel of the residual blocks
dropout: 0.1 # Dropout of the residual blocks
bias: True # Whether to use bias in residual blocks
use_weight_norm: False # Whether to use weight norm in all convolutions
init_type: "kaiming_normal" # Type of initialize weights of a neural network module
diffusion_params:
num_train_timesteps: 100 # The number of timesteps between the real sample and pure noise during training
beta_start: 0.0001 # beta start parameter for the scheduler
beta_end: 0.06 # beta end parameter for the scheduler
beta_schedule: "linear" # beta schedule parameter for the scheduler
num_max_timesteps: 100 # The max timestep transition from real to noise
stretch: True # whether to stretch before diffusion
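# Note: with a "linear" beta_schedule, the noise variances ramp linearly
# from beta_start to beta_end over num_train_timesteps diffusion steps.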
###########################################################
# UPDATER SETTING #
###########################################################
fs2_updater:
use_masking: True # whether to apply masking for padded part in loss calculation
ds_updater:
use_masking: True # whether to apply masking for padded part in loss calculation
###########################################################
# OPTIMIZER SETTING #
###########################################################
# fastspeech2 optimizer
fs2_optimizer:
optim: adam # optimizer type
learning_rate: 0.001 # learning rate
# diffusion optimizer
ds_optimizer_params:
beta1: 0.9
beta2: 0.98
weight_decay: 0.0
ds_scheduler_params:
learning_rate: 0.001
gamma: 0.5
step_size: 50000
ds_grad_norm: 1
###########################################################
# INTERVAL SETTING #
###########################################################
only_train_diffusion: True # Whether to freeze fastspeech2 parameters when training diffusion
ds_train_start_steps: 160000 # Number of steps to start to train diffusion module.
train_max_steps: 320000 # Number of training steps.
save_interval_steps: 2000 # Interval steps to save checkpoint.
eval_interval_steps: 2000 # Interval steps to evaluate the network.
num_snapshots: 5
###########################################################
# OTHER SETTING #
###########################################################
seed: 10086

@ -0,0 +1,418 @@
a|a
ai|ai
an|an
ang|ang
ao|ao
ba|b a
bai|b ai
ban|b an
bang|b ang
bao|b ao
bei|b ei
ben|b en
beng|b eng
bi|b i
bian|b ian
biao|b iao
bie|b ie
bin|b in
bing|b ing
bo|b o
bu|b u
ca|c a
cai|c ai
can|c an
cang|c ang
cao|c ao
ce|c e
cei|c ei
cen|c en
ceng|c eng
cha|ch a
chai|ch ai
chan|ch an
chang|ch ang
chao|ch ao
che|ch e
chen|ch en
cheng|ch eng
chi|ch i
chong|ch ong
chou|ch ou
chu|ch u
chua|ch ua
chuai|ch uai
chuan|ch uan
chuang|ch uang
chui|ch ui
chun|ch un
chuo|ch uo
ci|c i
cong|c ong
cou|c ou
cu|c u
cuan|c uan
cui|c ui
cun|c un
cuo|c uo
da|d a
dai|d ai
dan|d an
dang|d ang
dao|d ao
de|d e
dei|d ei
den|d en
deng|d eng
di|d i
dia|d ia
dian|d ian
diao|d iao
die|d ie
ding|d ing
diu|d iu
dong|d ong
dou|d ou
du|d u
duan|d uan
dui|d ui
dun|d un
duo|d uo
e|e
ei|ei
en|en
eng|eng
er|er
fa|f a
fan|f an
fang|f ang
fei|f ei
fen|f en
feng|f eng
fo|f o
fou|f ou
fu|f u
ga|g a
gai|g ai
gan|g an
gang|g ang
gao|g ao
ge|g e
gei|g ei
gen|g en
geng|g eng
gong|g ong
gou|g ou
gu|g u
gua|g ua
guai|g uai
guan|g uan
guang|g uang
gui|g ui
gun|g un
guo|g uo
ha|h a
hai|h ai
han|h an
hang|h ang
hao|h ao
he|h e
hei|h ei
hen|h en
heng|h eng
hm|h m
hng|h ng
hong|h ong
hou|h ou
hu|h u
hua|h ua
huai|h uai
huan|h uan
huang|h uang
hui|h ui
hun|h un
huo|h uo
ji|j i
jia|j ia
jian|j ian
jiang|j iang
jiao|j iao
jie|j ie
jin|j in
jing|j ing
jiong|j iong
jiu|j iu
ju|j v
juan|j van
jue|j ve
jun|j vn
ka|k a
kai|k ai
kan|k an
kang|k ang
kao|k ao
ke|k e
kei|k ei
ken|k en
keng|k eng
kong|k ong
kou|k ou
ku|k u
kua|k ua
kuai|k uai
kuan|k uan
kuang|k uang
kui|k ui
kun|k un
kuo|k uo
la|l a
lai|l ai
lan|l an
lang|l ang
lao|l ao
le|l e
lei|l ei
leng|l eng
li|l i
lia|l ia
lian|l ian
liang|l iang
liao|l iao
lie|l ie
lin|l in
ling|l ing
liu|l iu
lo|l o
long|l ong
lou|l ou
lu|l u
luan|l uan
lun|l un
luo|l uo
lv|l v
lve|l ve
m|m
ma|m a
mai|m ai
man|m an
mang|m ang
mao|m ao
me|m e
mei|m ei
men|m en
meng|m eng
mi|m i
mian|m ian
miao|m iao
mie|m ie
min|m in
ming|m ing
miu|m iu
mo|m o
mou|m ou
mu|m u
n|n
na|n a
nai|n ai
nan|n an
nang|n ang
nao|n ao
ne|n e
nei|n ei
nen|n en
neng|n eng
ng|n g
ni|n i
nian|n ian
niang|n iang
niao|n iao
nie|n ie
nin|n in
ning|n ing
niu|n iu
nong|n ong
nou|n ou
nu|n u
nuan|n uan
nun|n un
nuo|n uo
nv|n v
nve|n ve
o|o
ou|ou
pa|p a
pai|p ai
pan|p an
pang|p ang
pao|p ao
pei|p ei
pen|p en
peng|p eng
pi|p i
pian|p ian
piao|p iao
pie|p ie
pin|p in
ping|p ing
po|p o
pou|p ou
pu|p u
qi|q i
qia|q ia
qian|q ian
qiang|q iang
qiao|q iao
qie|q ie
qin|q in
qing|q ing
qiong|q iong
qiu|q iu
qu|q v
quan|q van
que|q ve
qun|q vn
ran|r an
rang|r ang
rao|r ao
re|r e
ren|r en
reng|r eng
ri|r i
rong|r ong
rou|r ou
ru|r u
rua|r ua
ruan|r uan
rui|r ui
run|r un
ruo|r uo
sa|s a
sai|s ai
san|s an
sang|s ang
sao|s ao
se|s e
sen|s en
seng|s eng
sha|sh a
shai|sh ai
shan|sh an
shang|sh ang
shao|sh ao
she|sh e
shei|sh ei
shen|sh en
sheng|sh eng
shi|sh i
shou|sh ou
shu|sh u
shua|sh ua
shuai|sh uai
shuan|sh uan
shuang|sh uang
shui|sh ui
shun|sh un
shuo|sh uo
si|s i
song|s ong
sou|s ou
su|s u
suan|s uan
sui|s ui
sun|s un
suo|s uo
ta|t a
tai|t ai
tan|t an
tang|t ang
tao|t ao
te|t e
tei|t ei
teng|t eng
ti|t i
tian|t ian
tiao|t iao
tie|t ie
ting|t ing
tong|t ong
tou|t ou
tu|t u
tuan|t uan
tui|t ui
tun|t un
tuo|t uo
wa|w a
wai|w ai
wan|w an
wang|w ang
wei|w ei
wen|w en
weng|w eng
wo|w o
wu|w u
xi|x i
xia|x ia
xian|x ian
xiang|x iang
xiao|x iao
xie|x ie
xin|x in
xing|x ing
xiong|x iong
xiu|x iu
xu|x v
xuan|x van
xue|x ve
xun|x vn
ya|y a
yan|y an
yang|y ang
yao|y ao
ye|y e
yi|y i
yin|y in
ying|y ing
yo|y o
yong|y ong
you|y ou
yu|y v
yuan|y van
yue|y ve
yun|y vn
za|z a
zai|z ai
zan|z an
zang|z ang
zao|z ao
ze|z e
zei|z ei
zen|z en
zeng|z eng
zha|zh a
zhai|zh ai
zhan|zh an
zhang|zh ang
zhao|zh ao
zhe|zh e
zhei|zh ei
zhen|zh en
zheng|zh eng
zhi|zh i
zhong|zh ong
zhou|zh ou
zhu|zh u
zhua|zh ua
zhuai|zh uai
zhuan|zh uan
zhuang|zh uang
zhui|zh ui
zhun|zh un
zhuo|zh uo
zi|z i
zong|z ong
zou|z ou
zu|z u
zuan|z uan
zui|z ui
zun|z un
zuo|z uo

@ -0,0 +1,74 @@
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/preprocess.py \
--dataset=opencpop \
--rootdir=~/datasets/Opencpop/segments \
--dumpdir=dump \
--label-file=~/datasets/Opencpop/segments/transcriptions.txt \
--config=${config_path} \
--num-cpu=20 \
--cut-sil=True
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="speech"
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="pitch"
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="energy"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize and convert phone/speaker to id; dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--speech-stats=dump/train/speech_stats.npy \
--pitch-stats=dump/train/pitch_stats.npy \
--energy-stats=dump/train/energy_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--speech-stats=dump/train/speech_stats.npy \
--pitch-stats=dump/train/pitch_stats.npy \
--energy-stats=dump/train/energy_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--speech-stats=dump/train/speech_stats.npy \
--pitch-stats=dump/train/pitch_stats.npy \
--energy-stats=dump/train/energy_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
fi
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# Get feature(mel) extremum for diffusion stretch
echo "Get feature(mel) extremum ..."
python3 ${BIN_DIR}/get_minmax.py \
--metadata=dump/train/norm/metadata.jsonl \
--speech-stretchs=dump/train/speech_stretchs.npy
fi

@ -0,0 +1,27 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=diffsinger_opencpop \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_opencpop \
--voc_config=pwgan_opencpop_ckpt_1.4.0/default.yaml \
--voc_ckpt=pwgan_opencpop_ckpt_1.4.0/snapshot_iter_100000.pdz \
--voc_stat=pwgan_opencpop_ckpt_1.4.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speech_stretchs=dump/train/speech_stretchs.npy
fi

@ -0,0 +1,53 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=diffsinger_opencpop \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_opencpop \
--voc_config=pwgan_opencpop_ckpt_1.4.0/default.yaml \
--voc_ckpt=pwgan_opencpop_ckpt_1.4.0/snapshot_iter_100000.pdz \
--voc_stat=pwgan_opencpop_ckpt_1.4.0/feats_stats.npy \
--lang=sing \
--text=${BIN_DIR}/../sentences_sing.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speech_stretchs=dump/train/speech_stretchs.npy \
--pinyin_phone=local/pinyin_to_phone.txt
fi
# for more GAN Vocoders
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "in hifigan syn_e2e"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=diffsinger_opencpop \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_opencpop \
--voc_config=hifigan_opencpop_ckpt_1.4.0/default.yaml \
--voc_ckpt=hifigan_opencpop_ckpt_1.4.0/snapshot_iter_625000.pdz \
--voc_stat=hifigan_opencpop_ckpt_1.4.0/feats_stats.npy \
--lang=sing \
--text=${BIN_DIR}/../sentences_sing.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speech_stretchs=dump/train/speech_stretchs.npy \
--pinyin_phone=local/pinyin_to_phone.txt
fi

@ -0,0 +1,13 @@
#!/bin/bash
config_path=$1
train_output_path=$2
python3 ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1 \
--phones-dict=dump/phone_id_map.txt \
--speech-stretchs=dump/train/speech_stretchs.npy

@ -0,0 +1,13 @@
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=diffsinger
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}

@ -0,0 +1,37 @@
#!/bin/bash
set -e
source path.sh
gpus=0
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_320000.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional arguments `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize, vocoder is pwgan by default
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# synthesize_e2e, vocoder is pwgan by default
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

@ -0,0 +1,139 @@
# Parallel WaveGAN with Opencpop
This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [Mandarin singing corpus](https://wenet.org.cn/opencpop/).
## Dataset
### Download and Extract
Download Opencpop from its [Official Website](https://wenet.org.cn/opencpop/download/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/Opencpop`.
## Get Started
Assume the path to the dataset is `~/datasets/Opencpop`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set and stored in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the id and the path to the spectrogram of each utterance.
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG ParallelWaveGAN config file.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
benchmark:
arguments related to benchmark.
--batch-size BATCH_SIZE
batch size.
--max-iter MAX_ITER train max steps.
--run-benchmark RUN_BENCHMARK
running benchmark or not, if True, use the --batch-size
and --max-iter.
--profiler_options PROFILER_OPTIONS
The option of profiler, which should be in format
"key1=value1;key2=value2;key3=value3".
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--output-dir OUTPUT_DIR] [--ngpu NGPU]
Synthesize with GANVocoder.
optional arguments:
-h, --help show this help message and exit
--generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT
snapshot to load.
--test-metadata TEST_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
```
1. `--config` is the parallel wavegan config file. You should use the same config with which the model is trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
The pretrained model can be downloaded here:
- [pwgan_opencpop_ckpt_1.4.0](https://paddlespeech.bj.bcebos.com/t2s/svs/opencpop/pwgan_opencpop_ckpt_1.4.0.zip)
Parallel WaveGAN checkpoint contains files listed below.
```text
pwgan_opencpop_ckpt_1.4.0
├── default.yaml # default config used to train parallel wavegan
├── snapshot_iter_100000.pdz # generator parameters of parallel wavegan
└── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
```
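For reference, a direct invocation with this checkpoint might look like the sketch below; the flag names come from the help message above, while the concrete paths (the unzipped checkpoint in the current directory, `dump` produced by the preprocessing stage) are assumptions.
```bash
source path.sh
python3 ${BIN_DIR}/../synthesize.py \
    --generator-type=pwgan \
    --config=pwgan_opencpop_ckpt_1.4.0/default.yaml \
    --checkpoint=pwgan_opencpop_ckpt_1.4.0/snapshot_iter_100000.pdz \
    --test-metadata=dump/dev/norm/metadata.jsonl \
    --output-dir=exp/default/test \
    --ngpu=1
```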
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.

@ -0,0 +1,119 @@
# This is the hyperparameter configuration file for Parallel WaveGAN.
# Please make sure this is adjusted for the Opencpop dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration requires 12 GB GPU memory and takes ~3 days on RTX TITAN.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 512 # FFT size (samples).
n_shift: 128 # Hop size (samples). 12.5ms
win_length: 512 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 30 # Minimum freq in mel basis calculation. (Hz)
fmax: 12000 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_size: 3 # Kernel size of dilated convolution.
layers: 30 # Number of residual block layers.
stacks: 3 # Number of stacks i.e., dilation cycles.
residual_channels: 64 # Number of channels in residual conv.
gate_channels: 128 # Number of channels in gated conv.
skip_channels: 64 # Number of channels in skip conv.
aux_channels: 80 # Number of channels for auxiliary feature conv.
# Must be the same as num_mels.
aux_context_window: 2 # Context window size for auxiliary feature.
# If set to 2, previous 2 and future 2 frames will be considered.
dropout: 0.0 # Dropout rate. 0.0 means no dropout applied.
bias: True # use bias in residual blocks
use_weight_norm: True # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers.
use_causal_conv: False # use causal conv in residual blocks and upsample layers
upsample_scales: [8, 4, 2, 2] # Upsampling scales. Product of these must be the same as hop size.
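# Here 8 * 4 * 2 * 2 = 128, matching the n_shift (hop size) above.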
interpolate_mode: "nearest" # upsample net interpolate mode
freq_axis_kernel_size: 1 # upsampling net: convolution kernel size in frequency axis
nonlinear_activation: null
nonlinear_activation_params: {}
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_size: 3 # Kernel size of conv layers.
layers: 10 # Number of conv layers.
conv_channels: 64 # Number of channels of conv layers.
bias: True # Whether to use bias parameter in conv.
use_weight_norm: True # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers.
nonlinear_activation: "leakyrelu" # Nonlinear function after each conv.
nonlinear_activation_params: # Nonlinear function parameters
negative_slope: 0.2 # Alpha in leakyrelu.
###########################################################
# STFT LOSS SETTING #
###########################################################
stft_loss_params:
fft_sizes: [1024, 2048, 512] # List of FFT size for STFT-based loss.
hop_sizes: [120, 240, 50] # List of hop size for STFT-based loss
win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
window: "hann" # Window function for STFT-based loss
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_adv: 4.0 # Loss balancing coefficient.
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 8 # Batch size.
batch_max_steps: 25500 # Length of each audio in batch. Make sure divisible by n_shift.
num_workers: 1 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
epsilon: 1.0e-6 # Generator's epsilon.
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 0.0001 # Generator's learning rate.
step_size: 200000 # Generator's scheduler step size.
gamma: 0.5 # Generator's scheduler gamma.
# At each step size, lr will be multiplied by this parameter.
generator_grad_norm: 10 # Generator's gradient norm.
discriminator_optimizer_params:
epsilon: 1.0e-6 # Discriminator's epsilon.
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 0.00005 # Discriminator's learning rate.
step_size: 200000 # Discriminator's scheduler step size.
gamma: 0.5 # Discriminator's scheduler gamma.
# At each step size, lr will be multiplied by this parameter.
discriminator_grad_norm: 1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
discriminator_train_start_steps: 100000 # Number of steps to start to train discriminator.
train_max_steps: 400000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_save_intermediate_results: 4 # Number of results to be saved as intermediate results.
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random

@ -0,0 +1 @@
../../../csmsc/voc1/local/PTQ_static.sh

@ -0,0 +1,15 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../../dygraph_to_static.py \
--type=voc \
--voc=pwgan_opencpop \
--voc_config=${config_path} \
--voc_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--voc_stat=dump/train/feats_stats.npy \
--inference_dir=exp/default/inference/

@ -0,0 +1,47 @@
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/../preprocess.py \
--rootdir=~/datasets/Opencpop/segments/ \
--dataset=opencpop \
--dumpdir=dump \
--dur-file=~/datasets/Opencpop/segments/transcriptions.txt \
--config=${config_path} \
--cut-sil=False \
--num-cpu=20
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
fi

@ -0,0 +1 @@
../../../csmsc/voc1/local/synthesize.sh

@ -0,0 +1 @@
../../../csmsc/voc1/local/train.sh

@ -0,0 +1 @@
../../csmsc/voc1/path.sh

@ -0,0 +1,42 @@
#!/bin/bash
set -e
source path.sh
gpus=0
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_100000.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional arguments `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
# dygraph to static
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/dygraph_to_static.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
# PTQ_static
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/PTQ_static.sh ${train_output_path} pwgan_opencpop || exit -1
fi

@ -0,0 +1,167 @@
# This is the configuration file for the Opencpop dataset.
# This configuration is based on HiFiGAN V1, which is an official configuration.
# But I found that the optimizer setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales are also modified from the original 256 shift setting.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 512 # FFT size (samples).
n_shift: 128 # Hop size (samples). 12.5ms
win_length: 512 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 12000 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [8, 4, 2, 2] # Upsampling scales.
upsample_kernel_sizes: [16, 8, 4, 4] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: True # Whether to use additional conv layer in residual blocks.
bias: True # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation paramters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: True
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: True # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: True # Whether to use bias parameter in conv layer."
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation paramters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
use_spectral_norm: False # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: False # Whether to use multi-resolution STFT loss.
use_mel_loss: True # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 24000
fft_size: 512
hop_size: 128
win_length: 512
window: "hann"
num_mels: 80
fmin: 30
fmax: 12000
log_base: null
generator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
use_feat_match_loss: True
feat_match_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
average_by_layers: False # Whether to average loss by #layers in each discriminator.
include_final_outputs: False # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss..
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure divisible by hop_size.
num_workers: 1 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 4 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random

@ -0,0 +1,168 @@
# This is the configuration file for the Opencpop dataset.
# This configuration is based on HiFiGAN V1, which is an official configuration.
# But I found that the optimizer setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales are also modified from the original 256 shift setting.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 512 # FFT size (samples).
n_shift: 128 # Hop size (samples). 12.5ms
win_length: 512 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 12000 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [8, 4, 2, 2] # Upsampling scales.
upsample_kernel_sizes: [16, 8, 4, 4] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: True # Whether to use additional conv layer in residual blocks.
bias: True # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation paramters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: True
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: True # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: True # Whether to use bias parameter in conv layer."
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation paramters.
negative_slope: 0.1
use_weight_norm: True # Whether to apply weight normalization.
use_spectral_norm: False # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: False # Whether to use multi-resolution STFT loss.
use_mel_loss: True # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 24000
fft_size: 512
hop_size: 128
win_length: 512
window: "hann"
num_mels: 80
fmin: 30
fmax: 12000
log_base: null
generator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
use_feat_match_loss: True
feat_match_loss_params:
average_by_discriminators: False # Whether to average loss by #discriminators.
average_by_layers: False # Whether to average loss by #layers in each discriminator.
include_final_outputs: False # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss..
###########################################################
# DATA LOADER SETTING #
###########################################################
#batch_size: 16 # Batch size.
batch_size: 1 # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure divisible by hop_size.
num_workers: 1 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
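Both schedules are piecewise constant: the learning rate is multiplied by gamma each time training passes a milestone. A pure-Python sketch of the effective rate (the exact at-milestone boundary behavior is an assumption; schedulers differ on >= versus >):

def effective_lr(step, base_lr=2.0e-4, gamma=0.5,
                 milestones=(200_000, 400_000, 600_000, 800_000)):
    # Count milestones already passed and halve the base rate once per milestone.
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** passed

# effective_lr(0) == 2.0e-4; effective_lr(250_000) == 1.0e-4;
# after all four milestones, effective_lr(2_600_000) == 1.25e-5.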
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2600000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 4 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random

@ -0,0 +1,74 @@
#!/bin/bash
source path.sh
gpus=0
stage=0
stop_stage=100
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${MAIN_ROOT}/paddlespeech/t2s/exps/diffsinger/gen_gta_mel.py \
--diffsinger-config=diffsinger_opencpop_ckpt_1.4.0/default.yaml \
--diffsinger-checkpoint=diffsinger_opencpop_ckpt_1.4.0/snapshot_iter_160000.pdz \
--diffsinger-stat=diffsinger_opencpop_ckpt_1.4.0/speech_stats.npy \
--diffsinger-stretch=diffsinger_opencpop_ckpt_1.4.0/speech_stretchs.npy \
--dur-file=~/datasets/Opencpop/segments/transcriptions.txt \
--output-dir=dump_finetune \
--phones-dict=diffsinger_opencpop_ckpt_1.4.0/phone_id_map.txt \
--dataset=opencpop \
--rootdir=~/datasets/Opencpop/segments/
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 ${MAIN_ROOT}/utils/link_wav.py \
--old-dump-dir=dump \
--dump-dir=dump_finetune
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats (mean and std)
echo "Get features' stats ..."
cp dump/train/feats_stats.npy dump_finetune/train/
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize; dev and test should use the train split's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump_finetune/train/raw/metadata.jsonl \
--dumpdir=dump_finetune/train/norm \
--stats=dump_finetune/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump_finetune/dev/raw/metadata.jsonl \
--dumpdir=dump_finetune/dev/norm \
--stats=dump_finetune/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump_finetune/test/raw/metadata.jsonl \
--dumpdir=dump_finetune/test/norm \
--stats=dump_finetune/train/feats_stats.npy
fi
# create finetune env
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
echo "create finetune env"
python3 local/prepare_env.py \
--pretrained_model_dir=exp/default/checkpoints/ \
--output_dir=exp/finetune/
fi
# finetune
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
CUDA_VISIBLE_DEVICES=${gpus} \
FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
--train-metadata=dump_finetune/train/norm/metadata.jsonl \
--dev-metadata=dump_finetune/dev/norm/metadata.jsonl \
--config=conf/finetune.yaml \
--output-dir=exp/finetune \
--ngpu=1
fi
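Stage 3 above deliberately reuses the train split's feats_stats.npy for dev and test, so all three splits are scaled by the same statistics. Conceptually it is a z-score normalization; the (2, n_mels) layout with the mean in row 0 and the scale in row 1 is an assumption about the stats file, not a documented format:

import numpy as np

stats = np.load("dump_finetune/train/feats_stats.npy")  # assumed shape: (2, n_mels)
mean, scale = stats[0], stats[1]

def normalize(feats):
    # Same train-split statistics for train, dev, and test, as in stage 3.
    return (feats - mean) / scale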

@ -0,0 +1 @@
../../../csmsc/voc1/local/PTQ_static.sh

@ -0,0 +1,15 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../../dygraph_to_static.py \
--type=voc \
--voc=hifigan_opencpop \
--voc_config=${config_path} \
--voc_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--voc_stat=dump/train/feats_stats.npy \
--inference_dir=exp/default/inference/
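After export, the static-graph vocoder under exp/default/inference/ can be served with Paddle's inference API. A minimal sketch; the hifigan_opencpop.pdmodel / .pdiparams file names follow the --voc flag above, and the (T, 80) mel input layout is an assumption:

import numpy as np
from paddle.inference import Config, create_predictor

config = Config("exp/default/inference/hifigan_opencpop.pdmodel",
                "exp/default/inference/hifigan_opencpop.pdiparams")
predictor = create_predictor(config)

mel = np.random.randn(100, 80).astype("float32")  # dummy mel input
handle = predictor.get_input_handle(predictor.get_input_names()[0])
handle.copy_from_cpu(mel)
predictor.run()
wav = predictor.get_output_handle(predictor.get_output_names()[0]).copy_to_cpu()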

@ -0,0 +1 @@
../../../other/tts_finetune/tts3/local/prepare_env.py

@ -0,0 +1 @@
../../voc1/local/preprocess.sh

@ -0,0 +1 @@
../../../csmsc/voc5/local/synthesize.sh

@ -0,0 +1 @@
../../../csmsc/voc1/local/train.sh

@ -0,0 +1 @@
../../csmsc/voc5/path.sh

@ -0,0 +1,42 @@
#!/bin/bash
set -e
source path.sh
gpus=0
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_2500000.pdz
# With the following options, you can choose the stage range you want to run,
# such as `./run.sh --stage 0 --stop-stage 0`.
# These options cannot be mixed with positional arguments `$1`, `$2`, ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model; all `ckpt` files are saved under the `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
# dygraph to static
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/dygraph_to_static.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
# PTQ_static
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/PTQ_static.sh ${train_output_path} hifigan_opencpop || exit -1
fi
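Each block above runs only when its stage number falls inside [stage, stop_stage]. The gating condition, restated in Python for clarity:

def should_run(n, stage=0, stop_stage=100):
    # Mirrors `[ ${stage} -le n ] && [ ${stop_stage} -ge n ]` in the script.
    return stage <= n <= stop_stage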

@ -32,7 +32,7 @@ iPad Pro的秒控键盘这次也推出白色版本。|iPad Pro的秒控键盘这
明天有62%的概率降雨|明天有百分之六十二的概率降雨
这是固话0421-33441122|这是固话零四二一三三四四一一二二
这是手机+86 18544139121|这是手机八六一八五四四一三九一二一
小王的身高是153.5cm,梦想是打篮球!我觉得有0.1%的可能性。|小王的身高是一百五十三点五cm,梦想是打篮球!我觉得有百分之零点一的可能性。
小王的身高是153.5cm,梦想是打篮球!我觉得有0.1%的可能性。|小王的身高是一百五十三点五厘米,梦想是打篮球!我觉得有百分之零点一的可能性。
不管三七二十一|不管三七二十一
九九八十一难|九九八十一难
2018年5月23号上午10点10分|二零一八年五月二十三号上午十点十分
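Each line pairs raw text with its expected normalization (`raw|normalized`). A sketch of checking one pair against the Chinese text frontend; the module path (including the repo's own spelling "text_normlization") and the list-of-sentences return value are assumptions about the current API:

from paddlespeech.t2s.frontend.zh_normalization.text_normlization import TextNormalizer

tn = TextNormalizer()
raw, expected = "明天有62%的概率降雨", "明天有百分之六十二的概率降雨"
got = "".join(tn.normalize(raw))  # normalize() is assumed to return a list of sentences
assert got == expected, got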
